Extracting CML from a Chem4Word authored document (C#)

January 21, 2010

My previous post was the first in a series demonstrating how CML embedded in DOCX files could be extracted (in that case using Java). For completeness I thought I ought to post some code to accomplish the same thing in C#. This should also allow people to get used to the packaging tools before we build up functionality in later posts.

If you would like a file containing CML to test this out with, one is available here.

Now for the code:


using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

namespace Chem4Word.Tools
{
  public class OOXMLTools
  {
    public ICollection<XDocument> GetCML(string path)
    {
      ICollection<XDocument> list = new List<XDocument>();
      // open read-only so the document does not need to be writable
      using (Package package =
        Package.Open(path, FileMode.Open, FileAccess.Read))
      {
        foreach (PackagePart packagePart in package.GetParts())
        {
          if (packagePart.ContentType == "application/xml")
          {
            using (StreamReader streamReader =
               new StreamReader(packagePart.GetStream()))
            {
              try
              {
                XDocument xDocument =
                  XDocument.Parse(streamReader.ReadToEnd());
                if (xDocument.XPathSelectElements(
                  "//*[local-name()='cml' and " +
                  "namespace-uri()='http://www.xml-cml.org/schema']").Any())
                {
                  list.Add(xDocument);
                }
              }
              catch
              {
                  // not valid XML so therefore can't be CML
              }
            }
          }
        }
      }
      return list;
    }
  }
}

So there we go. Pretty similar to the Java version really. Just in case you were wondering, I know I haven’t done a load of exception checking.


Extracting CML from a Chem4Word authored document (Java)

January 20, 2010

I have been meaning to write this post for ages and thanks to a recent tweet from Egon I’ve finally got round to it. Basically what I am going to do over a series of posts is explain how CML can be extracted from a DOCX (OOXML) file authored using Chem4Word. I’ll post methods in both Java and C# but I am starting off in Java.

A very quick intro to DOCX and OOXML

There are plenty of blogs, papers, videos and the like out there which explain OOXML in various levels of detail. I don’t want to replicate that here but I think it will be useful to have a quick overview for reference. Microsoft developed the Open Packaging Conventions (OPC) specification as the container for the successors to its binary Microsoft Office file formats. The DOCX file extension indicates an OPC document intended for editing in Microsoft Office Word 2007 (just as the XLSX extension, for example, indicates an OPC document editable in Excel). A DOCX document is effectively a zip file (the package) which contains the original text as a marked-up XML component (document.xml), with images and other embedded objects stored as separate files.

the simplified structure of an OPC document

The package-part word\document.xml contains the main text and body of the document. Chem4Word stores CML files in the customXml folder within the package. This directory contains pairs of files with names item[\d].xml and itemProps[\d].xml – itemProps[n] contains a list of all the namespaces and schemas used in item[n].
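To make the naming convention concrete, here is a minimal sketch that maps an item part name to the name of its itemProps partner. The class and method names are invented for this example, and it assumes the part names inside the zip use forward slashes and a plain decimal suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Chem4Word: it only looks at part names.
final class CustomXmlNames {

  // Assumes the item[n].xml naming described above, with n a decimal number.
  private static final Pattern ITEM =
      Pattern.compile("customXml/item(\\d+)\\.xml");

  private CustomXmlNames() { }

  /** True if the zip entry name looks like a customXml data part. */
  static boolean isItem(String entryName) {
    return ITEM.matcher(entryName).matches();
  }

  /**
   * Returns the name of the itemProps part paired with the given item part,
   * or null if the name does not follow the item[n].xml pattern.
   */
  static String propsNameFor(String entryName) {
    Matcher m = ITEM.matcher(entryName);
    return m.matches() ? "customXml/itemProps" + m.group(1) + ".xml" : null;
  }
}
```

Calling CustomXmlNames.propsNameFor("customXml/item1.xml") gives the name of the part that lists the namespaces and schemas used in item1.xml.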

Getting the CML out – the brute force extraction method

The first method of extracting the CML files is the simplest. This method does not allow us to know anything more about the data other than it has been included somewhere in the document by the user. For example, we don’t know where and how it is being used in the document (or how many times). So, for the algorithm: iterate through all those files where the CML may be found, attempt to build each file as a XOM document, if it builds then search within it for a cml element from the cml namespace (see code below).

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

import nu.xom.Builder;
import nu.xom.Document;

public class OOXMLTools {

  public static List<Document> GetCML(File file)
    throws ZipException, IOException  {

    ZipFile zipFile = new ZipFile(file);
    List<Document> list = new ArrayList<Document>();
    Builder builder = new Builder();
    Matcher m =
      Pattern.compile("customXml/item([1-9][0-9]*)\\.xml")
        .matcher("");

    for (Enumeration <? extends ZipEntry>
      entries = zipFile.entries(); entries.hasMoreElements();) {

      ZipEntry entry = entries.nextElement();
      if (m.reset(entry.getName()).matches()) {
        try {
          Document doc =
            builder.build(zipFile.getInputStream(entry));

          if (doc.query("//*[local-name()='cml' and "
            + "namespace-uri()='http://www.xml-cml.org/schema']")
              .size() > 0) {

            list.add(doc);
          }

        } catch (Exception e) {
          // not an XML file so can't be CML
        }
      }
    }
    return list;
  }
}

If you would like to try it out here is a DOCX file with some chemistry in it. There are three chemistry zones in the document (containing testosterone [item1] and acetic acid [item3]) but only two CML files in the customXml directory because both testosterone instances point to the same backing CML file.
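If you don’t have XOM on the classpath, the same brute force search can be sketched using only the JDK. This is a variant of my own rather than anything Chem4Word ships: the class and method names are made up, it swaps the XPath query for a namespace-aware DOM lookup, and it returns the names of the matching parts rather than the parsed documents.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical JDK-only variant of the XOM code above; names are illustrative.
final class JdkCmlFinder {

  static final String CML_NS = "http://www.xml-cml.org/schema";

  private static final Pattern ITEM =
      Pattern.compile("customXml/item[1-9][0-9]*\\.xml");

  private JdkCmlFinder() { }

  /** Returns the names of the customXml item parts that contain a cml element. */
  static List<String> findCmlParts(File docx)
      throws IOException, ParserConfigurationException {

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // required for getElementsByTagNameNS
    DocumentBuilder builder = factory.newDocumentBuilder();
    List<String> names = new ArrayList<String>();

    // try-with-resources closes the zip file even if something goes wrong
    try (ZipFile zipFile = new ZipFile(docx)) {
      for (Enumeration<? extends ZipEntry> entries = zipFile.entries();
           entries.hasMoreElements();) {

        ZipEntry entry = entries.nextElement();
        if (!ITEM.matcher(entry.getName()).matches()) {
          continue; // not a customXml item part
        }
        try (InputStream in = zipFile.getInputStream(entry)) {
          Document doc = builder.parse(in);
          if (doc.getElementsByTagNameNS(CML_NS, "cml").getLength() > 0) {
            names.add(entry.getName());
          }
        } catch (SAXException e) {
          // not well-formed XML so it can't be CML
        }
      }
    }
    return names;
  }
}
```

Run against a package laid out like the test document described above, this should report only the item parts that hold CML.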

In following posts I will go further into how you can discover which representation is being used in the document, how many times a particular CML file is referenced and how the data is converted into the on screen representation.


Post-Christmas lunch doodlings

December 17, 2009

The UCC had Christmas lunch on Monday and then members of the PMR group retired to the Pickerel for further enjoyment. Nick Day and I spent a little while doodling later in the evening and I have captured these for posterity. I should say that the major muse is someone who would be extremely gratified to be sketched on beer mats as their favourite medium is currently the backs of paper plates.

face on woodforde's beer mat

A collaborative effort started by Nick ... in fact I only added the beard.


The muse - on beer mat.

The muse from the side - on beer mat.


Using Mercurial, VisualHG and Visual Studio 2008 together

December 16, 2009

There is a strong desire in our group to move towards distributed version control systems, specifically Mercurial. As I spend most of my time working on the Chemistry Add-in for Word 2007 (Chem4Word) I have been using Visual Studio 2008. This doesn’t come with source control support out of the box. You can add Team Foundation Server support (and we have), but it isn’t distributed and, quite frankly, it sucked.

I have spent the last couple of days investigating other options and having got a good solution up and running thought I should share this with the world. This definitely isn’t meant to be an introduction to mercurial in general and I won’t be explaining all the terms (good intros are available from the mercurial website and elsewhere). I will be giving both command line and GUI instructions.

Installing VisualHG

The instructions on the VisualHG site are pretty good – but you only need to go through the first three. Basically you need to download and install TortoiseHg and then VisualHG.

Next make VisualHG the current active Source Control Plugin in Visual Studio 2008. <Tools> -> <Options> -> <Source Control>. Set the value of the drop-down list (Current Source Control plug-in) to <VisualHG>.

How to create a new repository and add it to source control

Open Visual Studio 2008 and create a new solution (in my case I went for a Console Application called cs-walkthrough under C:\projects\c# and I didn’t want a new directory for the solution).

In a command window change directory so you are in cs-walkthrough.

>hg init

This will create the .hg directory.

Create a file called .hgignore as a sibling of the .hg directory. The contents of the file should be:

syntax: glob

*.csproj.user
obj/
bin/
*.ncb
*.suo
_ReSharper.*
*.resharper.user

Create a project on bitbucket with the same name as the solution (doesn’t have to be but will make things easier). So in this case I now have a publicly viewable repository on bitbucket http://bitbucket.org/jat45/cs-walkthrough.

Now create a file called hgrc under the .hg directory. This file should contain the following text:

[paths]
default = http://bitbucket.org/jat45/cs-walkthrough

[ui]
username = Joe Townsend

where jat45 should be replaced with your bitbucket id and cs-walkthrough should be replaced by the name of the bitbucket repository you just created.

From here on in you can do everything either from the command line or from within Visual Studio. I will give both commands (to access the VisualHG commands right-click on any of the files in the solution). All the command line commands should be run from the [cs-walkthrough] directory.

>hg status

hg status from the command line

or <HG status> from VisualHG

hg status from VisualHG

We want all these files under source control so either use

>hg add
or select all the files from the GUI and click Add then close the window

To commit these additions either use
>hg commit -m "initial commit"
or commit using <HG Commit>

Now push these changes to bitbucket

>hg push
or <HG Synchronize> (this will allow you to push, pull and see the incoming and outgoing changes).

Warning

Before committing or pushing changes you must make sure you build the solution first. This ensures that the [project].csproj file is correctly updated.

Getting an existing project

This is much easier than creating a new repository from scratch. Once again though we are back to the command line.

Change directory to the parent of where you would like the new project repository to be checked out to. In my case this is c:\projects\c#.

I am now going to get a clone of the cs-walkthrough code from bitbucket but put it in a new location.

c:\projects\c#>hg clone http://bitbucket.org/jat45/cs-walkthrough my-new-source-dir

You can use the same command because the project is public, so you can access the source code but not push changes back to it.

So I think that is it. You can do all the usual push, pull, diff and so on either from the command line or from the VisualHG add-in. The only thing that must be done from the command line is the initial clone or init. I am also investigating if everything works in VS2010. So far there have been no problems, but if I run into any I will update you.


Using “using”

November 5, 2009

I presented my own code for review in the last PMR group meeting and thankfully came off very lightly. However Nick Day who is currently doing a lot of Clojure work picked me up on something. I had many lines of code similar to:

XPathDocument xPathDocument = new XPathDocument(new StringReader(source));

and he was worried that I was not closing my string reader after use. After a bit of looking around I see that the code should be refactored into:

XPathDocument xPathDocument = null;
using (StringReader sr = new StringReader(source)) {
  xPathDocument = new XPathDocument(sr);
}


This ensures that the StringReader can be disposed of as soon as it has been used to create the XPathDocument. I am not at all surprised that Nick would spot this as it reminds me of the first Clojure macro that I finally got – though that was for a HTTP request.

Of course I could always have done the more conventional (and Java like):

StringReader sr = new StringReader(source);
XPathDocument xPathDocument = new XPathDocument(sr);
sr.Close();
sr.Dispose();

but I find the first way far more elegant and clear and it ensures that the resources are cleaned up no matter what – even if the creation of the XPathDocument throws exceptions.
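For comparison, later versions of Java (7 onwards) have a direct counterpart to using in the try-with-resources statement. The sketch below is illustrative only: the class and method names are made up, and the second method needs Java 8 or later for the lambda. It shows a resource being closed on the normal path and after an exception.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Illustrative only: the class and method names are not from the post.
final class UsingInJava {

  private UsingInJava() { }

  /** Reads the first line of source; the reader is closed on every path. */
  static String firstLine(String source) throws IOException {
    try (BufferedReader reader = new BufferedReader(new StringReader(source))) {
      return reader.readLine();
    } // reader.close() has run by this point, even if readLine() threw
  }

  /** Returns true if close() was called although the body threw. */
  static boolean closedAfterFailure() {
    final boolean[] closed = {false};
    try (AutoCloseable resource = () -> closed[0] = true) {
      throw new IllegalStateException("simulated failure");
    } catch (Exception e) {
      // swallowed on purpose: we only care whether close() ran
    }
    return closed[0];
  }
}
```

As with the C# version, the point is that the clean-up is tied to the block rather than to remembering to call Close on every path.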


My Brain

November 2, 2009

Unchecked Conversion Warnings

November 2, 2009

This post is taken from an email sent round by Jim Downing following a PMR group code review meeting. The topic which caused the most problem was how to remove unchecked conversion warnings from eclipse which were greatly upsetting a few members of the group (not me).

Primarily I just wanted to capture this for reference rather than keeping it in my inbox.

Had a bit of a dig around the unchecked conversion issue.
The solutions to the problem, in my opinion and in Miss World order: -

  1. I think the _worst_ thing to do is the default eclipse behaviour; adding @SuppressWarnings("unchecked") to the method declaration, since this will mask other warnings too, some of which can be much more severe than this one.
  2. The next worst thing to do is to suppress warnings by annotating the individual line. I dislike this one because the annotation is just distracting noisy cruft.
  3. Live with it. Stop your IDE whinging about it so much – the code will still compile.
  4. The best solution is given in the answer at

http://stackoverflow.com/questions/367626/how-do-i-fix-the-expression-of-type-list-needs-unchecked-conversion/367673#367673

The main point in the answer given above is that doing an unchecked conversion results in the ClassCastException coming from the guts of the compiled code somewhere, rather than from your code, which is a Bad Thing. So the best thing to do is: -

Rather than : -
List<String> whatYoudLike = foo.getUntypedList(); // Exception gets thrown from the guts of whatever the compiler generates to do this.

for(String s: whatYoudLike) {
//
}

the best way to do it would be: -

for(Object o : foo.getUntypedList()) {
String s = (String) o; // Exception gets thrown here if at all.
//
}

This approach can get a bit more verbose, but that is, I’m afraid, tough luck.
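To see the second form in action, here is a small self-contained sketch. Everything in it is illustrative: getUntypedList() stands in for whatever legacy API hands back a raw List, and wrapping the loop in a helper method is just one way of packaging the idea, not something from Jim’s email.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative names only; not taken from any real code base.
final class UncheckedDemo {

  private UncheckedDemo() { }

  /** Stand-in for a legacy API that returns a raw List. */
  static List getUntypedList() {
    List<Object> items = new ArrayList<Object>();
    items.add("fine");
    items.add(Integer.valueOf(42)); // not a String
    return items;
  }

  /**
   * Copies an untyped list into a List<String>, casting each element
   * explicitly so that any failure happens in our own code.
   */
  static List<String> toStrings(List<?> untyped) {
    List<String> result = new ArrayList<String>();
    for (Object o : untyped) {
      result.add((String) o); // ClassCastException is thrown here, if at all
    }
    return result;
  }
}
```

Because the parameter is a List<?>, passing the raw list in does not need an unchecked conversion, and a bad element fails at the explicit cast rather than somewhere in compiler-generated code.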


First tentative steps in ClojureCLR

August 27, 2009

Jim and Nick (Day) have been using Clojure for a while on various projects and all too frequently I have heard exclamations of delight when they find what would have been 100s of lines of Java can be done in two or three of Clojure. I have spent most of the past year writing in C# and just haven’t been able to join in, although it has been good to have a break from Java.

One of the things to come out of the Chem4Word project has been the idea of performing chemical changes via a stateless interface (CID – the Chemistry Interface Definition). This approach was strongly pushed by Savas and Jim (possibly as an excuse to learn yet another language). A stateless system lends itself beautifully to a functional language and as a bonus Clojure has both a CLR and JVM implementation so we can have a single definition which we can use on both platforms.

As a CompSci at Cambridge the first programming language you see is ML – probably because it makes it fairly easy to work out the big O. So I have some fond memories of functional programming (though quite what aroma it would require to really take me back does not bear thinking about) and (conveniently) a need for it.

The CLR implementation is still fairly bleeding edge and the installation process reminded me of the bad old days of Open Source software. Still 3 hours into the process and I had my REPL up and running and a function which would say Hello! not just to the world, but whoever happened to be passed to it. Tomorrow I shall be dusting off the ML part 1a handout by Larry Paulson (who broke off in lectures to teach us how to make bread the traditional way) and attempting any of the exercises I can find.

I’m looking forward to the Cambridge Clojure user group meeting and all this ML and functional talk means that a sneaky peek at F# over the long weekend is probably coming up too.


Chem4Word logo and my own unit

July 30, 2009

I realise that I have been quiet for a long time on here which is annoying because I have so much that I should be telling people, especially about Chem4Word. Luckily, PMR has been keeping the interest ticking over on his blog and there is always twitter for those quick messages. Speaking of PMR, he and I have had many “robust discussion” sessions over the last few weeks (months) as we have been trying to define what semantic chemistry really is, how much is possible before release and what is absolutely necessary. It has sometimes felt like smacking a puppy.

As we near a release date (we have function freeze and are now bug fixing) I was thinking about the branding etc and realised that we don’t have a logo, favicon or anything so was wondering if anyone has any suggestions.

I have to get back to writing a paper but first of all I was pleased to discover (thanks to Helen) that there is a Townsend unit (Td) but slightly less pleased that it is most important in gas discharge physics.


Dictionaries in CML

September 18, 2008

I am now allowed to be a bit more open about what I am up to following the public announcement of the chem4word project so I hope to be publishing more regularly about day-to-day (probably more like week-to-week) progress and thoughts.

I am currently preparing a set of exemplars and use cases for the first phase of the project. These provide a good source of example molecules and chemical concepts so that we (those with chemical background) can explain to them (everyone else) what on earth we are talking about. It is all too easy to forget that when we say something we know the implicit semantics but others may not. The preparation of this corpus has involved creating high-quality CML documents which conform to CMLLite (a subset of CML – effectively that required to represent chemistry in print).

CML uses dictionaries (via the dictRef attribute) liberally. This means that the schema can specify a single element which can be processed the same way each time but can hold different information. For example, the property element can hold both a melting point and a molecular weight.

<cml version="3" convention="CMLLite"
xmlns="http://www.xml-cml.org/schema"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/"
xmlns:unitsDict="http://www.xml-cml.org/dictionary/units/">
<property dictRef="cmlDict:mw">
<scalar dataType="xs:double" units="unitsDict:dalton">247.3</scalar>
</property>
<property dictRef="cmlDict:mpt">
<scalar dataType="xs:double" units="unitsDict:c" min="202" max="205" />
</property>
</cml>

The document above should be familiar to anyone who has seen any CML before. However, there may be a difference. Each of the dictionary items (URIs in the dictRef) actually has a definition. I promised myself at the start of the project that I would never hand over any CML document which contained an undefined dictionary reference.

We will be making these dictionaries available, together with examples, during the project. I am also pushing for the dictionary items to be URLs for ease of use.
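As a rough illustration of what treating dictionary items as URLs could look like, the sketch below expands each dictRef on a property element into a full URI by looking up its prefix. The class and method names are invented, and simply appending the local name to the namespace URI is my assumption about how such URLs would be formed, not a published rule.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; the URI construction rule is an assumption.
final class DictRefResolver {

  static final String CML_NS = "http://www.xml-cml.org/schema";

  private DictRefResolver() { }

  /** Maps each property's dictRef (e.g. cmlDict:mw) to an expanded URI. */
  static Map<String, String> resolve(String cml)
      throws IOException, SAXException, ParserConfigurationException {

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed for namespace lookups
    Document doc = factory.newDocumentBuilder().parse(
        new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));

    Map<String, String> resolved = new LinkedHashMap<String, String>();
    NodeList properties = doc.getElementsByTagNameNS(CML_NS, "property");
    for (int i = 0; i < properties.getLength(); i++) {
      Element property = (Element) properties.item(i);
      String dictRef = property.getAttribute("dictRef");
      int colon = dictRef.indexOf(':');
      if (colon < 0) {
        continue; // no prefix, nothing to expand
      }
      // find the namespace bound to the prefix at this point in the document
      String uri = property.lookupNamespaceURI(dictRef.substring(0, colon));
      if (uri != null) {
        resolved.put(dictRef, uri + dictRef.substring(colon + 1));
      }
    }
    return resolved;
  }
}
```

Under that assumption, cmlDict:mw in the example document would expand to http://www.xml-cml.org/dictionary/cml/mw.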

Oh! and I have also been learning C# and loving it…