Extracting CML from a Chem4Word authored document (Java)

I have been meaning to write this post for ages and thanks to a recent tweet from Egon I’ve finally got round to it. Basically what I am going to do over a series of posts is explain how CML can be extracted from a DOCX (OOXML) file authored using Chem4Word. I’ll post methods in both Java and C# but I am starting off in Java.

A very quick intro to DOCX and OOXML

There are plenty of blogs, papers, videos and the like out there which explain OOXML in various levels of detail. I don’t want to replicate that here but I think it will be useful to have a quick overview for reference. Microsoft developed Office Open XML (OOXML) as a successor to its binary Microsoft Office file formats, and OOXML files are packaged according to the Open Packaging Conventions (OPC) specification. The file extension DOCX indicates an OPC document intended for editing in Microsoft Office Word 2007 (as opposed to, for example, the XLSX extension, which marks OPC documents editable in Excel). A DOCX document is effectively a zip file (the package) that contains the main text as marked-up XML (document.xml), with images and other embedded objects stored as separate files.
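Because the package is an ordinary zip archive, the JDK’s java.util.zip classes are enough to see what is inside one. The following is only a minimal sketch and not part of Chem4Word. The class name PackageLister and the method listParts are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not from Chem4Word): lists the parts of an OPC package.
class PackageLister {

  // Reads the stream as a zip archive and returns the name of every
  // non-directory entry, in the order the entries appear in the archive.
  static List<String> listParts(InputStream in) throws IOException {
    List<String> names = new ArrayList<String>();
    try (ZipInputStream zip = new ZipInputStream(in)) {
      for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
        if (!e.isDirectory()) {
          names.add(e.getName());
        }
      }
    }
    return names;
  }

  public static void main(String[] args) throws IOException {
    // args[0] is the path to a DOCX file
    try (InputStream in = new FileInputStream(args[0])) {
      for (String name : listParts(in)) {
        System.out.println(name);
      }
    }
  }
}
```

Run against a DOCX file, this should print part names such as [Content_Types].xml and word/document.xml, plus the customXml entries if the document contains any.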

Figure: the simplified structure of an OPC document

The package part word/document.xml contains the main text and body of the document. Chem4Word stores CML files in the customXml folder within the package. This folder contains pairs of files named item[n].xml and itemProps[n].xml (n = 1, 2, 3, …); itemProps[n].xml contains a list of all the namespaces and schemas used in item[n].xml.
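To make the item/itemProps pairing a little more concrete, here is a rough sketch that reads the schema URIs out of an itemProps part using the JDK’s DOM parser. It assumes the part follows the usual OOXML custom XML properties layout (a datastoreItem element holding schemaRef elements with a uri attribute, all in the namespace shown). The class ItemPropsReader, the method schemaUris and the sample itemID are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the schema URIs declared in an itemProps[n].xml part.
class ItemPropsReader {

  // Assumed namespace of the custom XML data storage properties.
  static final String DS_NS =
    "http://schemas.openxmlformats.org/officeDocument/2006/customXml";

  // Returns the uri attribute of every schemaRef element, in document order.
  static List<String> schemaUris(InputStream itemProps) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder().parse(itemProps);

    List<String> uris = new ArrayList<String>();
    NodeList refs = doc.getElementsByTagNameNS(DS_NS, "schemaRef");
    for (int i = 0; i < refs.getLength(); i++) {
      Element ref = (Element) refs.item(i);
      uris.add(ref.getAttributeNS(DS_NS, "uri"));
    }
    return uris;
  }

  public static void main(String[] args) throws Exception {
    // Made-up sample content; a real part carries a GUID as its itemID.
    String sample =
      "<ds:datastoreItem xmlns:ds=\"" + DS_NS + "\" ds:itemID=\"{0}\">"
      + "<ds:schemaRefs>"
      + "<ds:schemaRef ds:uri=\"http://www.xml-cml.org/schema\"/>"
      + "</ds:schemaRefs></ds:datastoreItem>";
    InputStream in =
      new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
    System.out.println(schemaUris(in)); // prints [http://www.xml-cml.org/schema]
  }
}
```

If the CML namespace turns up in itemProps[n].xml, that is a hint that item[n].xml holds chemistry, although the extraction method described in this post does not rely on it.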

Getting the CML out – the brute force extraction method

The first method of extracting the CML files is the simplest. It tells us nothing about the data beyond the fact that the user has included it somewhere in the document: we don’t know where or how it is being used (or how many times). So, for the algorithm: iterate through all those files where the CML may be found, attempt to build each file as a XOM document and, if it builds, search within it for a cml element from the CML namespace (see code below).

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

import nu.xom.Builder;
import nu.xom.Document;

public class OOXMLTools {

  // XPath matching any cml element in the CML namespace
  private static final String CML_QUERY =
    "//*[local-name()='cml' and "
      + "namespace-uri()='http://www.xml-cml.org/schema']";

  public static List<Document> GetCML(File file)
    throws ZipException, IOException  {

    List<Document> list = new ArrayList<Document>();
    Builder builder = new Builder();
    // only customXml/item[n].xml can hold CML (not itemProps[n].xml)
    Matcher m =
      Pattern.compile("customXml/item([1-9][0-9]*)\\.xml")
        .matcher("");

    // try-with-resources closes the package even if reading fails
    try (ZipFile zipFile = new ZipFile(file)) {
      for (Enumeration<? extends ZipEntry>
        entries = zipFile.entries(); entries.hasMoreElements();) {

        ZipEntry entry = entries.nextElement();
        if (m.reset(entry.getName()).matches()) {
          try {
            Document doc =
              builder.build(zipFile.getInputStream(entry));

            if (doc.query(CML_QUERY).size() > 0) {
              list.add(doc);
            }

          } catch (Exception e) {
            // could not be built as XML so can't be CML
          }
        }
      }
    }
    return list;
  }
}

If you would like to try it out here is a DOCX file with some chemistry in it. There are three chemistry zones in the document (containing testosterone [item1] and acetic acid [item3]) but only two CML files in the customXml directory because both testosterone instances point to the same backing CML file.
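To check which parts actually hold the CML, a variant can report the part names rather than the parsed documents. The sketch below is not the code from this post: it swaps XOM for the JDK’s built-in DOM parser so that it has no dependencies, and it looks the cml element up by namespace rather than by XPath. The class CmlPartFinder and the method findCmlParts are illustrative names.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical JDK-only variant: names the customXml parts that contain CML.
class CmlPartFinder {

  static final String CML_NS = "http://www.xml-cml.org/schema";
  static final Pattern ITEM =
    Pattern.compile("customXml/item([1-9][0-9]*)\\.xml");

  static List<String> findCmlParts(File docx)
      throws IOException, ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed for the namespace lookup below
    DocumentBuilder parser = factory.newDocumentBuilder();

    List<String> names = new ArrayList<String>();
    try (ZipFile zip = new ZipFile(docx)) {
      for (Enumeration<? extends ZipEntry> entries = zip.entries();
           entries.hasMoreElements();) {
        ZipEntry entry = entries.nextElement();
        if (!ITEM.matcher(entry.getName()).matches()) {
          continue; // skip itemProps[n].xml and everything outside customXml
        }
        try (InputStream in = zip.getInputStream(entry)) {
          Document doc = parser.parse(in);
          if (doc.getElementsByTagNameNS(CML_NS, "cml").getLength() > 0) {
            names.add(entry.getName());
          }
        } catch (SAXException e) {
          // not well-formed XML, so it cannot be CML
        }
      }
    }
    return names;
  }
}
```

Going by the description above, the sample document should yield customXml/item1.xml and customXml/item3.xml.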

In the following posts I will go further into how you can discover which representation is being used in the document, how many times a particular CML file is referenced and how the data is converted into the on-screen representation.


3 Responses to Extracting CML from a Chem4Word authored document (Java)

  1. […] CML from a Chem4Word authored document (C#) My previous post was the first in a series demonstrating how CML embedded in DOCX files could be extracted (in that […]

  2. egonw says:

    Great, thanx! What’s the license for the file?

    docProps/core.xml

    did not say anything about that… CC0? So, that everyone can start using it as test file?

  3. jat45 says:

    @egon

    Thanks – there is now a public domain license (http://creativecommons.org/licenses/publicdomain/) in the document.
