Extracting CML from a Chem4Word authored document (C#)

January 21, 2010

My previous post was the first in a series demonstrating how CML embedded in DOCX files could be extracted (in that case using Java). For completeness I thought I ought to post some code to accomplish the same thing in C#. This should also allow people to get used to the packaging tools before we build up functionality in later posts.

If you would like a file containing CML to test this out with, one is available here.

Now for the code:


using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

namespace Chem4Word.Tools 
{
  public class OOXMLTools 
  {
    public ICollection<XDocument> GetCML(string path) 
    {
      ICollection<XDocument> list = new List<XDocument>();
      // open read-only: we only inspect the package, never modify it
      using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read)) 
      {
        foreach (PackagePart packagePart in package.GetParts()) 
        {
          if (packagePart.ContentType == "application/xml") 
          {
            using (StreamReader streamReader = 
               new StreamReader(packagePart.GetStream())) 
            {
              try 
              {
                XDocument xDocument = 
                  XDocument.Parse(streamReader.ReadToEnd());
                if (xDocument.XPathSelectElements(
                  "//*[local-name()='cml' and namespace-uri()=" +
                  "'http://www.xml-cml.org/schema']").Any()) 
                {
                  list.Add(xDocument);
                }
              }
              catch 
              {
                  // not valid XML so therefore can't be CML
              }
            }
          }
        }
      }
      return list;
    }
  }
}

So there we go. Pretty similar to the Java version really. Just in case you were wondering, I know I haven’t done a load of exception checking.


Extracting CML from a Chem4Word authored document (Java)

January 20, 2010

I have been meaning to write this post for ages and thanks to a recent tweet from Egon I’ve finally got round to it. Basically what I am going to do over a series of posts is explain how CML can be extracted from a DOCX (OOXML) file authored using Chem4Word. I’ll post methods in both Java and C# but I am starting off in Java.

A very quick intro to DOCX and OOXML

There are plenty of blogs, papers, videos and the like out there which explain OOXML in various levels of detail. I don’t want to replicate that here but I think it will be useful to have a quick overview for reference. Microsoft developed Office Open XML (OOXML), which is packaged according to the Open Packaging Conventions (OPC) specification, as a successor to its binary Microsoft Office file formats. The file extension DOCX indicates an OPC document intended to be edited using Microsoft Office Word 2007 (as opposed to, for example, the XLSX file extension, which indicates an OPC document editable using Excel). A DOCX document is effectively a zip file (the package) which contains the original text as a marked-up XML component (document.xml), with images and other embedded objects stored as separate files.

[Figure: the simplified structure of an OPC document]

The package-part word\document.xml contains the main text and body of the document. Chem4Word stores CML files in the customXml folder within the package. This directory contains pairs of files with names item[\d].xml and itemProps[\d].xml – itemProps[n] contains a list of all the namespaces and schemas used in item[n].
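
To make the layout concrete, here is a minimal sketch (JDK only, no XOM) that lists the entries of a package and reports which item[n].xml files have a matching itemProps[n].xml. The class and method names (CustomXmlParts, listEntryNames, pairItems) are my own invention for illustration and are not part of Chem4Word; I am also assuming the parts sit directly under customXml/ as described above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Chem4Word.
class CustomXmlParts {

  // Matches customXml/item<n>.xml and customXml/itemProps<n>.xml
  private static final Pattern ITEM =
      Pattern.compile("customXml/item(Props)?([0-9]{1,9})\\.xml");

  // Returns the name of every entry in the DOCX package (a zip file).
  static List<String> listEntryNames(File docx) throws IOException {
    List<String> names = new ArrayList<String>();
    ZipFile zipFile = new ZipFile(docx);
    try {
      for (Enumeration<? extends ZipEntry> e = zipFile.entries();
           e.hasMoreElements();) {
        names.add(e.nextElement().getName());
      }
    } finally {
      zipFile.close();
    }
    return names;
  }

  // Maps each item number n to true if itemProps<n>.xml is also present.
  static Map<Integer, Boolean> pairItems(List<String> entryNames) {
    Map<Integer, Boolean> pairs = new TreeMap<Integer, Boolean>();
    Set<Integer> props = new HashSet<Integer>();
    for (String name : entryNames) {
      Matcher m = ITEM.matcher(name);
      if (!m.matches()) {
        continue; // e.g. word/document.xml or a .rels file
      }
      int n = Integer.parseInt(m.group(2));
      if (m.group(1) != null) {
        props.add(n);                 // an itemProps<n>.xml file
      } else {
        pairs.put(n, Boolean.FALSE);  // an item<n>.xml file
      }
    }
    for (Map.Entry<Integer, Boolean> e : pairs.entrySet()) {
      e.setValue(props.contains(e.getKey()));
    }
    return pairs;
  }
}
```

Calling CustomXmlParts.pairItems(CustomXmlParts.listEntryNames(new File("some.docx"))) on a Chem4Word document should give one map entry per item file; the file name here is just a placeholder.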

Getting the CML out – the brute force extraction method

The first method of extracting the CML files is the simplest. It tells us nothing about the data other than that the user has included it somewhere in the document. For example, we don’t know where or how it is being used in the document (or how many times). The algorithm: iterate through all the files where CML may be found and attempt to build each one as a XOM document; if it builds, search within it for a cml element in the CML namespace (see code below).

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

import nu.xom.Builder;
import nu.xom.Document;

public class OOXMLTools {

  public static List<Document> GetCML(File file)
    throws ZipException, IOException  {

    ZipFile zipFile = new ZipFile(file);
    List<Document> list = new ArrayList<Document>();
    Builder builder = new Builder();
    Matcher m =
      Pattern.compile("customXml/item([1-9][0-9]*)\\.xml")
        .matcher("");

    for (Enumeration <? extends ZipEntry>
      entries = zipFile.entries(); entries.hasMoreElements();) {

      ZipEntry entry = entries.nextElement();
      if (m.reset(entry.getName()).matches()) {
        try {
          Document doc =
            builder.build(zipFile.getInputStream(entry));

          if (doc.query("//*[local-name()='cml' and "
            + "namespace-uri()='http://www.xml-cml.org/schema']")
              .size() > 0) {

            list.add(doc);
          }

        } catch (Exception e) {
          // not an XML file so can't be CML
        }
      }
    }
    zipFile.close(); // release the file handle; the documents are already in memory
    return list;
  }
}

If you would like to try it out here is a DOCX file with some chemistry in it. There are three chemistry zones in the document (containing testosterone [item1] and acetic acid [item3]) but only two CML files in the customXml directory because both testosterone instances point to the same backing CML file.

In following posts I will go further into how you can discover which representation is being used in the document, how many times a particular CML file is referenced and how the data is converted into the on screen representation.


Getting a license just got easier

July 28, 2008

I don’t normally like to repost but I am quite happy to do so for this as I think that it is a wonderful idea. Now I just hope people use it.

From savas’s blog:

When I joined Technical Computing, now part of External Research, we wanted to create an ecosystem of tools and services to support researchers worldwide. Today we announced the results of some of our efforts; there is still more going on.

A tool that was discussed was the Creative Commons addin for Microsoft Office XP/2003. We got feedback from researchers that they really liked the functionality but were very surprised that Microsoft didn’t release an update version for Microsoft Office 2007. Well, we contacted the team responsible for it and found out that they had no plans to update it so we requested and got ownership of its future.

I started prototyping some new ideas around a ribbon-based interface, allowing you to create Creative Common licenses that can be shared between Word, Powerpoint, and Excel. The plugin uses the Creative Commons web service when generating new licenses. Finally, we wanted to make the license machine readable so we are including the RDF representation of the license in the OOXML package.*

Download the Creative Common plugin for Microsoft Office 2007. The updated version for XP/2003 (fixing some reported bugs) will be released very soon.

* Unfortunately, due to timing constraints we didn’t get around to avoiding a feature of Office where document properties are URL-encoded. This is mentioned in the documentation that comes with the plugin so you can build crawlers/indexers.

Cool huh?


A challenge for Chemists and OOXML

July 27, 2008

Not all that long ago there was a series of competitions (of the new BBC version in that you could win kudos and little else) on various blogs (PMR, chemspiderman) to identify the number of chemicals in a paragraph of text. These focused largely on the difficulty of deciding what is and what is not a chemical – and consequently there was not necessarily a right answer.

Now I would like to propose a new challenge… and there is a right answer this time. I have randomly selected a preparation from an organic chemistry article, a version of which is shown below. There is also a DOCX version available for download, which is fully and correctly formatted – to get the right answer I strongly suggest that you use the DOCX (although the latest version of Microsoft Office Word is not required).

So now for the challenge: how many chemicals do I think there are in the preparation – and for a further bonus point, which chemical did I get wrong?

((4S,5S)-5-Ethynyl-2,2-dimethyl-1,3-dioxolan-4-yl)methanol (14)

To a stirring mixture of 13 (10.0 g, 62.5 mmol) and anhydrous K2CO3 (11.37 g, 81.25 mmol) in dry MeOH (240 mL) at 65 °C was added a solution of Bestmann–Ohira reagent (15.6 g, 81.25 mmol) in dry MeOH (80 mL) dropwise over a period of 6 h under an argon atmosphere. After neutralization with acetic acid, the solvent was removed in vacuo, water was added and the mixture extracted with ethyl acetate (2 × 100 mL). The combined organic extracts were dried over anhydrous Na2SO4, concentrated under reduced pressure and purified by column chromatography (pet. ether–ethyl acetate, 4 : 1) to obtain 14 (6.82 g, 70%) as a colorless liquid. [α]27D -8.6 (c 1.0, MeOH); anal. calcd for C8H12O3: C, 61.52; H, 7.74; found: C, 61.79; H, 7.84; IR (neat) ν max/cm-1, 3452, 3284, 2121, 848, 665; 1H NMR (200 MHz, CDCl3, D2O exchange), δ 1.42 (s, 3H), 1.48 (s, 3H), 2.53 (d, 1H, J = 2.15 Hz), 3.64 (dd, 1H, J = 12.25, 3.67 Hz), 3.87 (dd, 1H, J = 12.25, 3.03 Hz), 4.16 (ddd, 1H, J = 7.58, 3.67, 3.03 Hz), 4.56 (dd, 1H, J = 7.57, 2.15 Hz).


A quick round of MS bashing

May 22, 2008

I read yesterday on Doug Mahugh’s blog about the new support for ODF in Word 2007. I was excited and pleased about this and eager to see what it would mean for programs such as Peter Sefton’s ICE. Then I saw the view of Georg Greve, president of the Free Software Foundation Europe, who said:

“Support for ODF indicates there are problems with OpenXML that Microsoft cannot resolve easily and quickly.”

Similarly, Peter M-R received a fair amount of criticism when he supported OOXML. I can’t tell you how dispirited this made me. I was thinking of all the positives – people could use all the functionality of ICE and other technology developed against the ODT specification without having to use OpenOffice or similar. Because that is the problem. I have spent far more hours than I would like under the bonnet of both ODT and DOCX documents as part of the SPECTRa-T project. Both are horrible. It is not their fault (or not entirely): some of the horridness is because they chose to do things one way (and I would have chosen the opposite), and some comes from support for legacy items, Unicode characters and so on …

The thing is, I was able to get DOCX documents; all I had to do was ask people to send me a copy of their Word file. I had to hunt long and hard before I could lay my hands on an ODT file because nobody was using anything that created one. And I work in an office full of people who spend most of their day using computers and writing programs. Do the users care that ODT is better (allegedly) than OOXML? Do they know? Are the users simply using the document creation software that is easiest to use and fundamentally works – I would say yes. There is nothing stopping me downloading OpenOffice now and using it. But there is also nothing making me do so. Why in God’s name would I? What can I do with it that I can’t do with MS Word and can I do any of those things more easily?

I suspect that the reason that Office 2007 has not been welcomed with open arms is because people can no longer use it as easily as they used to be able to. Or at least not at first – that ribbon does hide things pretty effectively as well as taking up all that expensive screen real estate – but eventually you learn your way around.

I have been working in the text/data extraction realm for a while now and Word files used to be a dead end, then along came OOXML and suddenly I had a whole new area to work in. All this time ODT has been hanging around being open and accessible and I could data mine it – except that there wasn’t any of it. So now Microsoft are going to add ODT support to Word – this means that users can now use a decent authoring tool and people can get the results in ODT. Maybe this means that we should start caring about ODT but until I see evidence of people using it (and an appreciable fraction of the published documents being in the format) I will continue to concentrate on OOXML and other products that people actually use.