Gliding down from graphs to trees: An attempt to bottle geometry and chemical content

K. Shanthi
shanthi@tnq.co.in
S. K. Venkatesan
skvenkat@tnq.co.in

Abstract

Fundamental particles, atoms, and ions combine in interesting ways to build the universe. SGML/XML has long been recognized as a means of representing hierarchical tree structures, but how can its methods be adapted to represent some of the other interesting structures, such as cyclic graphs, needed to represent the physical universe? Most prior efforts at DTD design have concentrated on either publication of formulas (MathML) or graphics (SVG). Here we demonstrate a tree structure that is quite close in spirit to actual chemical structures, making it easier to construct complex content-oriented markup that can expand and contract the trees, obtaining different modes of representation. In this method, the chemical reactions can be considered as cutting/welding of subtree structures, and the transformed products can be identified directly by means of labels. Further on in this study, we consider abstract higher-dimensional topologies using similar cutting and pasting techniques.

Keywords: Trees/Graphs; Modeling

K. Shanthi

K. Shanthi was born in India in 1969. She was awarded a Ph.D. in Chemistry by the Indian Institute of Technology, Madras, in 1996 for her research work in Environmental Chemistry involving submicro-level determination of hazardous inorganic gaseous pollutants. From 1996 to 2000, she was a Quality Control Manager at a hard disk manufacturing company. She has been working in the typesetting industry for the past three years. She is at present in the Software Department at TnQ.

S. K. Venkatesan

S. K. Venkatesan was born in India in 1963. He was awarded B.Tech and Ph.D. degrees from the Indian Institute of Technology, Madras (1985) and the Indian Statistical Institute, Calcutta (1995), respectively. From 1989 to 1996, he studied turbulence in fluids using computer simulations. He has been working in the typesetting industry for the past six years. He is at present a Software Manager for TnQ.

K. Shanthi [TnQ Books and Journals]
S. K. Venkatesan [TnQ Books and Journals]

Extreme Markup Languages 2003® (Montréal, Québec)

Copyright © 2003 K. Shanthi & S. K. Venkatesan. Reproduced with permission.

Introduction

Fundamental particles, atoms, ions, and molecules are the building blocks of the physical universe. However, as you study them in further detail, it is the interaction between them that makes them interesting and complex. At a superficial level, these bonds/interactions in chemical compounds can be studied using geometrical models, such as Graphs. However, quoting Joe English [JE], SGML/XML is by itself only a Tree:

If you consider only the subnode properties, you have a Tree; if you consider the reference node properties as well, you have a directed (possibly cyclic) graph.

As we shall find in our endeavour here, a Graph can be represented by a Tree!!

The basic building block, the atom, is described in terms of an element with attributes, such as the atomic number. Starting at the level of an atom or an ion, it is possible in theory to construct a kind of Graph that is equivalent to a chemical compound, and once a Graph is sliced open (by breaking the cycle), it becomes a Tree, which can then be marked up as SGML/XML. The sliced Tree can then be joined together again using labels. In fact, if one continues in this vein, it is possible to flatten a Tree completely into a one-dimensional string as in SMILES [Simplified Molecular Input Line Entry Specification] [SMILES], a general-purpose chemical nomenclature and data exchange format.
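The slicing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `slice_graph` and the way bonds are passed in (as pairs of atom labels) are hypothetical. A depth-first walk keeps every bond that reaches a new atom as a tree edge and sets aside every bond that would close a cycle; those set-aside bonds are the ones that must later be restored through labels.

```python
def slice_graph(bonds, root):
    """Split an undirected, connected graph into a spanning tree plus cut edges."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    children = {root: []}   # tree: node -> list of child nodes
    cuts = []               # bonds removed to break cycles
    seen = set()            # undirected bonds already handled

    def visit(node):
        for neighbour in adjacency[node]:
            edge = frozenset((node, neighbour))
            if edge in seen:
                continue
            seen.add(edge)
            if neighbour in children:
                # Reaching an atom that is already in the tree closes a cycle.
                cuts.append((node, neighbour))
            else:
                children[neighbour] = []
                children[node].append(neighbour)
                visit(neighbour)

    visit(root)
    return children, cuts


# Benzene: a six-membered carbon ring with one hydrogen on each carbon.
ring = [(f"C{i}", f"C{i % 6 + 1}") for i in range(1, 7)]
hydrogens = [(f"C{i}", f"H{i}") for i in range(1, 7)]
tree, cuts = slice_graph(ring + hydrogens, "C1")
print(cuts)
```

For a connected graph the number of cut bonds is always E - V + 1, so benzene (12 bonds, 12 atoms) needs exactly one label-based rejoin.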

In this paper, we also introduce a concept of multilevel resolution analysis for studying chemical structures at different scales. To illustrate this, we consider some interesting 2D topologies associated with cyclic compounds, like benzene, and some 3D topologies. We then see how chemical reactions lead to complex breaking and remolding of the original tree structure. Finally, as an extension of this work, we see how higher-dimensional topologies can also be constructed with similar cutting and pasting techniques.

The current status of SGML/DTD design in the STM [Science, Technology and Medicine] publishing industry has been surveyed adequately by Rosenblum and Golfman [INERA]. However, as can be discerned from that article, not enough effort has been made to represent Chemical Formulae in a way that is useful for scholarly publications. One extreme way to represent Graphs in XML is the technique employed in SVG [Scalable Vector Graphics], whereby all the points to be joined are represented using (x,y) coordinates, which occur in SVG as pairs inside multi-valued attributes. In this paper, we will consider more subtle methods, whereby the SGML/XML Tree structure and the actual geometrical structure are commensurate with each other, so that we obtain a content model and a presentation model as well.

In addition, very elaborate markup has been used to construct CML [Chemical Markup Language]. Although CML is a rich source of information for marking up Chemical Formulae, it is too elaborate for authoring content markup. Our presentation here will be more in keeping with the work by Hans Hagen [Hagen], which considers simpler techniques. However, we let the underlying atomic structure of the chemical compound dictate the way we do markup, so our path diverges considerably from that of [Hagen] as we proceed to mark up complex molecules.

It may look like an oxymoron, but Graphs can indeed be represented by SGML Trees. To illustrate this point, we proceed as in [TEI] to mark up benzene (C6H6) as:

<graph name="Benzene">
  <atom label="H" id="H1"/> <atom label="H" id="H2"/> <atom label="H" id="H3"/>
  <atom label="H" id="H4"/> <atom label="H" id="H5"/> <atom label="H" id="H6"/>
  <atom label="C" id="C1"/> <atom label="C" id="C2"/> <atom label="C" id="C3"/>
  <atom label="C" id="C4"/> <atom label="C" id="C5"/> <atom label="C" id="C6"/>

  <singlebond from="C1" to="C2"/> <doublebond from="C2" to="C3"/>
  <singlebond from="C3" to="C4"/> <doublebond from="C4" to="C5"/>
  <singlebond from="C5" to="C6"/> <doublebond from="C6" to="C1"/>
  <singlebond from="C1" to="H1"/> <singlebond from="C2" to="H2"/>
  <singlebond from="C3" to="H3"/> <singlebond from="C4" to="H4"/>
  <singlebond from="C5" to="H5"/> <singlebond from="C6" to="H6"/>
</graph>
The trouble with this type of markup is that it doesn’t have a hierarchical structure and leaves little scope for expansion and contraction. In this paper, we consider a hierarchical markup of chemical formulae, developed intrinsically from the atomic bonding structure, which allows much scope for expansion and contraction.

In the first section, we shall discuss various ways of representing chemical formulae in terms of SGML/XML Trees, and then proceed to construct some elementary Chemical Structures. In the second section, we will consider more complex chemical structures and demonstrate the power of this technique, i.e., how it leads to the possibility, using XLink, of constructing a multilevel-resolution analysis of Chemical Structures at different scales. In the third section, we provide a brief description of the technique used to mark up chemical reactions. In the final section, we model some higher-dimensional topologies in terms of these techniques.

Simple chemical formulae

Hans Hagen in [Hagen] marks up a hydrogen atom as <chem><atom>H</atom></chem>. This markup allows him to typeset simple chemical formulae, much as MathML markup does for mathematics. However, the markup tree does not reflect the underlying chemical structure, and there is no clear indication of how to proceed for more complex structures. We take a slightly different approach; for a simple Hydrogen atom we write <chem><atom name="H"></atom></chem>. As in [Hagen], we then add some attributes for atoms: <chem><atom mass="200.59" number="80" name="Hg" CAS="7439-97-6"/></chem>. The only attribute we have added here is the CAS registry number, as listed in the NIST Chemistry WebBook (http://webbook.nist.gov/chemistry/). However, our path towards content markup takes a different route from that of [Hagen] because of the following golden rules:

  1. The nodes in an SGML/XML tree can only be either a Molecule, an Atom, or a Hadron.
  2. Bonds are described using attributes which refer to other nodes.
  3. Ions are considered as either charged Molecules or Atoms described using the charge attribute.

Although an electron is also a fundamental particle, like the Hadron, we prefer to treat it through attributes of an element rather than giving it special status as an element, because of the quantum problems associated with its non-localizability. Even with Hadrons there is the possibility that Bosonic states might become non-localizable, as with SuperFluids such as Helium II, in which case our classical schemes may fail altogether. We will also treat the presence of a lone pair/bond pair of electrons as an additional attribute of the element.

For methane (CH4), we write:

<chem>
  <molecule>
    <atom name="C">
      <atom n="4" name="H" bond="s"/>
    </atom>
  </molecule>
</chem>
Here each of the four Hydrogen atoms has a single bond (denoted by the attribute bond="s") with the parent Carbon atom. We assume here that by default the child-atom bonds with the parent, which can also be made more explicit:
<chem>
  <molecule>
    <atom name="C" label="C1">
      <atom n="4" name="H" bond="s" bondto="C1"/>
    </atom>
  </molecule>
</chem>

For the compound propane (CH3–CH2–CH3), the markup is:

<chem>
  <molecule>
    <atom name="C" label="C1" bond="s" bondto="C2">
      <atom name="H" n="3" bond="s"/>
      </atom>
    <atom name="C" label="C2" bond="s" bondto="C3">
      <atom name="H" n="2" bond="s"/>
      </atom>
    <atom name="C" label="C3">
      <atom name="H" n="3" bond="s"/>
      </atom>
  </molecule>
</chem>

The compounds we have considered until now were simple straight-chain compounds. We now consider the aromatic cyclic compound benzene (C6H6) and its saturated counterpart, cyclohexane (C6H12). The structure of benzene is unique, with π-electrons forming a cloud over the molecule. However, for all practical purposes, it is treated as an aromatic molecule with alternating single and double bonds.

<chem>
  <molecule name="benzene">
    <atom name="C" label="C1" bond="s" bondto="C2">
      <atom name="H" id="H1" bond="s"/>
      </atom>
    <atom name="C" label="C2" bond="d" bondto="C3">
      <atom name="H" id="H2" bond="s"/>
      </atom>
    <atom name="C" label="C3" bond="s" bondto="C4">
      <atom name="H" id="H3" bond="s"/>
      </atom>
    <atom name="C" label="C4" bond="d" bondto="C5">
      <atom name="H" id="H4" bond="s"/>
      </atom>
    <atom name="C" label="C5" bond="s" bondto="C6">
      <atom name="H" id="H5" bond="s"/>
      </atom>
    <atom name="C" label="C6" bond="d" bondto="C1">
      <atom name="H" id="H6" bond="s"/>
      </atom>
  </molecule>
</chem>
<chem>
  <molecule name="cyclobenzene">
    <atom name="C" label="C1" bond="s" bondto="C2">
      <atom name="H" n="2" id="H1" bond="s"/>
      </atom>
    <atom name="C" label="C2" bond="s" bondto="C3">
      <atom name="H" n="2" id="H2" bond="s"/>
      </atom>
    <atom name="C" label="C3" bond="s" bondto="C4">
      <atom name="H" n="2" id="H3" bond="s"/>
      </atom>
    <atom name="C" label="C4" bond="s" bondto="C5">
      <atom name="H" n="2" id="H4" bond="s"/>
      </atom>
    <atom name="C" label="C5" bond="s" bondto="C6">
      <atom name="H" n="2" id="H5" bond="s"/>
      </atom>
    <atom name="C" label="C6" bond="s" bondto="C1">
      <atom name="H" n="2" id="H6" bond="s"/>
      </atom>
  </molecule>
</chem>

The interesting thing to note here is how the benzene ring has been dissected into a straight chain and then bonded back to the first carbon atom to complete the cycle. This flexibility in tagging really gives us room to handle with ease bond breaking/fusion reactions, which are the basic pathways for producing more complex chemical compounds.
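The same "cut the ring, then bond back by label" idea is what a SMILES ring-closure digit expresses, so the chain markup can be flattened one step further into a string. The sketch below is a hypothetical helper (`to_smiles` is not part of the paper's toolset) and it makes strong simplifying assumptions: only the top-level atoms are written, hydrogens are left implicit, each atom bonds either to the next atom in document order or back to the first one, and only the bond codes "s" and "d" are handled.

```python
import xml.etree.ElementTree as ET

BOND_SYMBOL = {"s": "", "d": "="}   # single bonds are implicit in SMILES

BENZENE = """
<molecule name="benzene">
  <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C2" bond="d" bondto="C3"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C3" bond="s" bondto="C4"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C4" bond="d" bondto="C5"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C5" bond="s" bondto="C6"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C6" bond="d" bondto="C1"><atom name="H" bond="s"/></atom>
</molecule>
"""


def to_smiles(molecule):
    """Flatten a chain or single ring of top-level atoms into a SMILES string."""
    atoms = molecule.findall("atom")
    labels = [atom.get("label") for atom in atoms]
    parts = []
    pending = ""                     # bond symbol to write before the next atom
    for index, atom in enumerate(atoms):
        parts.append(pending + atom.get("name"))
        pending = ""
        target = atom.get("bondto")
        if target is None:
            continue                 # end of an open chain
        symbol = BOND_SYMBOL[atom.get("bond", "s")]
        if index + 1 < len(atoms) and target == labels[index + 1]:
            pending = symbol         # ordinary bond to the next atom in the chain
        elif target == labels[0]:
            parts[0] += "1"          # open ring closure 1 on the first atom
            parts.append(symbol + "1")   # and close it here
        else:
            raise ValueError(f"unsupported bondto target: {target}")
    return "".join(parts)


print(to_smiles(ET.fromstring(BENZENE)))
```

Branches (for example the oxygens of an acid group) would need SMILES parentheses and are deliberately left out of this sketch.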

Of course, this whole process of representing molecules in XML would be useless if we didn’t have a mechanism to check the validity of these structures. One simple way is to check the valency of each individual atom, which in this case is the weighted sum (the weight being the bond order) of the number of its children and other bonded cross-siblings.
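A minimal sketch of such a valency check is given below, assuming the attribute conventions used in the listings of this section (`name`, `label`, `n`, `bond`, `bondto`). The function name `valency_errors` and the small `VALENCE` table are illustrative assumptions, not part of the paper; only single ("s") and double ("d") bonds are weighted.

```python
import xml.etree.ElementTree as ET

VALENCE = {"H": 1, "C": 4, "O": 2}   # assumed standard valences
BOND_ORDER = {"s": 1, "d": 2}        # weights for single and double bonds

PROPANE = """
<molecule>
  <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" n="3" bond="s"/></atom>
  <atom name="C" label="C2" bond="s" bondto="C3"><atom name="H" n="2" bond="s"/></atom>
  <atom name="C" label="C3"><atom name="H" n="3" bond="s"/></atom>
</molecule>
"""


def valency_errors(molecule):
    """Return (atom, bonds found, bonds expected) for every atom that fails."""
    atoms = list(molecule.iter("atom"))
    parent_of = {child: parent for parent in atoms for child in parent.findall("atom")}
    by_label = {atom.get("label"): atom for atom in atoms if atom.get("label")}
    used = {atom: 0 for atom in atoms}
    for atom in atoms:
        target = atom.get("bondto")
        # An explicit bondto wins; otherwise a child bonds to its parent atom.
        partner = by_label[target] if target else parent_of.get(atom)
        if partner is None:
            continue                 # top-level atom without an outgoing bond
        order = BOND_ORDER[atom.get("bond", "s")]
        used[atom] += order                                  # per single copy
        used[partner] += order * int(atom.get("n", "1"))     # all n copies
    return [
        (atom.get("label") or atom.get("name"), used[atom], VALENCE[atom.get("name")])
        for atom in atoms
        if used[atom] != VALENCE[atom.get("name")]
    ]


print(valency_errors(ET.fromstring(PROPANE)))
```

An empty list means every atom's weighted bond count matches its expected valence; a carbon carrying five hydrogens, for instance, would be reported with a count of 5 against an expected 4.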

Complex chemical structures

To show that our technique leads to user-friendly markup of complex chemical structures, we will consider some concrete examples:

Figure 1: Structure of hexaphenylbenzene

<chem>
  <molecule name="hpbenzene">
    <molecule xlink:href="benzene" n="6" positions="1 2 3 4 5 6" bond="s"/>
  </molecule>
</chem>

Figure 2: Structure of complex benzene derivative

<chem>
  <molecule name="complexbenzene">
    <molecule xlink:href="benzene" n="6" positions="1 2 3 4 5 6" bond="s">
      <molecule xlink:href="benzene" positions="3" bond="s"/>
      </molecule>
    </molecule>
</chem>

Figure 3: Structure of metal benzene complex

<chem>
  <molecule name="benzenemetalcomplex">
    <atom name="M" label="M1">
      <molecule xlink:href="benzene" label="L1" bondto="M1" bond="s" positions="2 3 4 5">
        <molecule xlink:href="benzene" label="L1" bondto="M1" bond="exo" positions="1 6"/>
      </molecule>
    </atom>
  </molecule>
</chem>

In the above three examples, the benzene molecule has been called out as an external XML file (using the explicit attribute xlink:href, with two additional implicit attributes xlink:type="simple" and xlink:show="embed"). The file “benzene.xml” holds the benzene markup defined earlier, in the section on simple chemical formulae. The third example illustrates how the difference in bonding and the chemical nature of the metal play a central role in deciding the shape of the molecule. Here we have allowed a two-level expansion using the XLink standard [XLink]. For more complex molecules, further levels of expansion can be allowed through XLink attributes, thus leading to a multilevel resolution analysis.
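How such an expansion might be carried out can be sketched with Python's standard ElementTree module. Everything here is an assumption made for illustration: the `expand` function, the in-memory `library` dictionary standing in for external files such as “benzene.xml”, and the explicit declaration of the XLink namespace (which the listings above leave out but an XML parser requires). The `levels` argument controls how many nested references are followed, so level 0 is the fully contracted view.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

BENZENE = ET.fromstring("""
<molecule name="benzene">
  <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C2" bond="d" bondto="C3"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C3" bond="s" bondto="C4"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C4" bond="d" bondto="C5"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C5" bond="s" bondto="C6"><atom name="H" bond="s"/></atom>
  <atom name="C" label="C6" bond="d" bondto="C1"><atom name="H" bond="s"/></atom>
</molecule>
""")

HPBENZENE = ET.fromstring("""
<chem xmlns:xlink="http://www.w3.org/1999/xlink">
  <molecule name="hpbenzene">
    <molecule xlink:href="benzene" n="6" positions="1 2 3 4 5 6" bond="s"/>
  </molecule>
</chem>
""")


def expand(node, library, levels):
    """Return a copy of node with xlink:href targets embedded `levels` deep."""
    result = ET.Element(node.tag, dict(node.attrib))
    result.text, result.tail = node.text, node.tail
    for child in node:
        result.append(expand(child, library, levels))
    href = node.get(XLINK_HREF)
    if href is not None and levels > 0:
        # Embed the children of the referenced molecule, one level deeper.
        for child in library[href]:
            result.append(expand(child, library, levels - 1))
    return result


library = {"benzene": BENZENE}
contracted = expand(HPBENZENE, library, 0)
expanded = expand(HPBENZENE, library, 1)
print(len(list(contracted.iter("atom"))), len(list(expanded.iter("atom"))))
```

The original tree is never modified, so the same source document can be rendered at several resolutions side by side.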

Apart from XLink features, it is interesting to note that the simplest straight chain molecules which we discussed in the earlier section can even be represented in the various structural modes using DOM [DOM] features. To illustrate this, we consider the propane molecule:

<chem>
  <molecule>
    <atom name="C" label="C1" bond="s" bondto="C2">
      <atom name="H" n="3" bond="s"/>
      </atom>
    <atom name="C" label="C2" bond="s" bondto="C3">
      <atom name="H" n="2" bond="s"/>
      </atom>
    <atom name="C" label="C3">
      <atom name="H" n="3" bond="s"/>
      </atom>
  </molecule>
</chem>
which can be modeled at three levels of expansion using XML DOM features:
  1. The full structural formula, with every C–C and C–H bond drawn
  2. CH3–CH2–CH3
  3. C3H8
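The two textual levels can be derived from the propane markup by a short tree walk. The sketch below uses Python's ElementTree rather than a W3C DOM binding, and the function names `condensed` and `molecular` are hypothetical. It assumes that each top-level atom carries its hydrogens as children with an optional `n` count, and it orders the molecular formula with carbon first, hydrogen second, and the remaining elements alphabetically.

```python
from collections import Counter
import xml.etree.ElementTree as ET

PROPANE = """
<molecule>
  <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" n="3" bond="s"/></atom>
  <atom name="C" label="C2" bond="s" bondto="C3"><atom name="H" n="2" bond="s"/></atom>
  <atom name="C" label="C3"><atom name="H" n="3" bond="s"/></atom>
</molecule>
"""


def condensed(molecule):
    """Second level: one group per top-level atom, joined by en dashes."""
    groups = []
    for heavy in molecule.findall("atom"):
        text = heavy.get("name")
        for child in heavy.findall("atom"):
            count = int(child.get("n", "1"))
            text += child.get("name") + (str(count) if count > 1 else "")
        groups.append(text)
    return "\u2013".join(groups)


def molecular(molecule):
    """Third level: total count of each element in the whole tree."""
    counts = Counter()

    def visit(atom, multiplier):
        copies = multiplier * int(atom.get("n", "1"))
        counts[atom.get("name")] += copies
        for child in atom.findall("atom"):
            visit(child, copies)

    for atom in molecule.findall("atom"):
        visit(atom, 1)
    first = [symbol for symbol in ("C", "H") if symbol in counts]
    rest = sorted(symbol for symbol in counts if symbol not in ("C", "H"))
    return "".join(
        symbol + (str(counts[symbol]) if counts[symbol] > 1 else "")
        for symbol in first + rest
    )


propane = ET.fromstring(PROPANE)
print(condensed(propane), molecular(propane))
```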

Chemical reactions

We have extended the above-discussed markup to apply to reaction equations. Reaction equations can be represented symbolically by MathML equations; however, this doesn’t reveal the exact structural details of the transformation taking place. Our markup here allows for labeling the individual atoms, whereby one can reveal the bond breaking/formation taking place during the reaction.

Example I. Formation of esters: the classical preparation of an ester from an alcohol and an acid

CH3COOH + CH3OH ⇌ CH3COOCH3 + H2O (in H2SO4)

will be marked up as:

<chem>
  <reaction type="equilibrium">
    <reactants>
      <reactant>
        <molecule name="aceticacid">
          <atom name="C" label="C1" bond="s" bondto="C2">
            <atom name="H" n="3" bond="s"/>
          </atom>
          <atom name="C" label="C2">
            <atom name="O" label="O1" bond="d"/>
            <atom name="O" label="O2" bond="s">
              <atom name="H" bond="s"/>
            </atom>
          </atom>
        </molecule>
      </reactant>
      <reactant>
        <molecule name="methanol">
          <atom name="C" label="C3" bond="s" bondto="O3">
            <atom name="H" n="3" bond="s"/>
          </atom>
          <atom name="O" label="O3">
            <atom name="H" bond="s"/>
          </atom>
        </molecule>
      </reactant>
    </reactants>
    <conditions>
      <medium>
        <molecule name="sulfuricacid">
          <atom name="H" n="2"/>
          <atom name="S"/>
          <atom name="O" n="4"/>
        </molecule>
      </medium>
    </conditions>
    <products>
      <product>
        <molecule name="methylacetate">
          <atom name="C" label="C1" bond="s" bondto="C2">
            <atom name="H" n="3" bond="s"/>
          </atom>
          <atom name="C" label="C2" bond="s" bondto="O3">
            <atom name="O" label="O1" bond="d"/>
          </atom>
          <!-- the ester oxygen O3 comes from methanol -->
          <atom name="O" label="O3" bond="s" bondto="C3"/>
          <atom name="C" label="C3">
            <atom name="H" n="3" bond="s"/>
          </atom>
        </molecule>
      </product>
      <product>
        <molecule name="water">
          <!-- the acid's hydroxyl oxygen O2 leaves as water -->
          <atom name="O" label="O2">
            <atom name="H" n="2" bond="s"/>
          </atom>
        </molecule>
      </product>
    </products>
  </reaction>
</chem>

The reactants and products of a reaction are decided by the conditions under which the particular reaction was performed. So we laid emphasis on this aspect and defined various reaction conditions, such as medium, temperature, pressure, light source, cryogenic systems, catalysts, etc.
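Two simple checks become possible once atoms carry labels on both sides of a reaction: the element counts must balance, and each labelled atom can be traced from the molecule it starts in to the molecule it ends in. The Python sketch below illustrates both. It is not the authors' tooling: the helper names `atom_counts` and `label_homes` are hypothetical, the embedded sample is a compact rendering of the esterification example with the conditions omitted, and the assignment of labels O2 (to water) and O3 (to the ester) is an assumption that follows the usual acid-catalysed mechanism.

```python
from collections import Counter
import xml.etree.ElementTree as ET

REACTION = """
<reaction type="equilibrium">
  <reactants>
    <reactant><molecule name="aceticacid">
      <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" n="3" bond="s"/></atom>
      <atom name="C" label="C2">
        <atom name="O" label="O1" bond="d"/>
        <atom name="O" label="O2" bond="s"><atom name="H" bond="s"/></atom>
      </atom>
    </molecule></reactant>
    <reactant><molecule name="methanol">
      <atom name="C" label="C3" bond="s" bondto="O3"><atom name="H" n="3" bond="s"/></atom>
      <atom name="O" label="O3"><atom name="H" bond="s"/></atom>
    </molecule></reactant>
  </reactants>
  <products>
    <product><molecule name="methylacetate">
      <atom name="C" label="C1" bond="s" bondto="C2"><atom name="H" n="3" bond="s"/></atom>
      <atom name="C" label="C2" bond="s" bondto="O3"><atom name="O" label="O1" bond="d"/></atom>
      <atom name="O" label="O3" bond="s" bondto="C3"/>
      <atom name="C" label="C3"><atom name="H" n="3" bond="s"/></atom>
    </molecule></product>
    <product><molecule name="water">
      <atom name="O" label="O2"><atom name="H" n="2" bond="s"/></atom>
    </molecule></product>
  </products>
</reaction>
"""


def atom_counts(side):
    """Count atoms of each element on one side of the reaction."""
    counts = Counter()

    def visit(atom, multiplier):
        copies = multiplier * int(atom.get("n", "1"))
        counts[atom.get("name")] += copies
        for child in atom.findall("atom"):
            visit(child, copies)

    for molecule in side.iter("molecule"):
        for atom in molecule.findall("atom"):
            visit(atom, 1)
    return counts


def label_homes(side):
    """Map each atom label to the name of the molecule that contains it."""
    return {
        atom.get("label"): molecule.get("name")
        for molecule in side.iter("molecule")
        for atom in molecule.iter("atom")
        if atom.get("label")
    }


reaction = ET.fromstring(REACTION)
before, after = reaction.find("reactants"), reaction.find("products")
balanced = atom_counts(before) == atom_counts(after)
moves = {label: (home, label_homes(after)[label])
         for label, home in label_homes(before).items()}
print(balanced, moves["O2"], moves["O3"])
```

Reading `moves` shows directly which bonds were cut: O2 travels from the acid to water, while O3 travels from methanol into the ester.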

Example II (for cyclic systems). The hydrogenation of benzene to yield cyclohexane

C6H6 + 3 H2 → C6H12 (Ni catalyst)

will be marked up as

<chem>
  <reaction type="forward">
    <reactants>
      <reactant>
        <molecule name="benzene" xlink:href="benzene"/>
      </reactant>
      <reactant>
        <atom name="hydrogen" n="2"/>
      </reactant>
    </reactants>
    <conditions>
      <medium>
        <atom name="Ni catalyst"/>
      </medium>
    </conditions>
    <products>
      <product>
        <molecule name="cyclohexane" xlink:href="cyclobenzene"/>
      </product>
    </products>
  </reaction>
</chem>

As mentioned earlier, the structure of benzene is not as trivial as “alternating single and double bonds”, but it’s not rocket science, either. We are working towards bringing in Hückel’s rule to identify aromatic molecules and ions. This requires introducing new concepts, such as orbitals and their overlapping mechanisms in bond formation, into the present markup scheme.

Algebraic topology

All the chemical structures we constructed here were based on Graph-like structures that are embedded in two and three dimensions. However, the object obtained from the markup, which we will consider in this section, has no such restriction, i.e., it need not be embeddable in two or three dimensions. To illustrate this, we consider some application to a particular area of mathematics known as Algebraic Topology [Massey]. We start with a very simple structure, the one and only compact one-dimensional manifold — the circle, which can be marked up as follows:

<node label="N1"/>
<node label="N2"/>
<path from="N1" to="N2"/>
<path from="N2" to="N1"/>

Next we consider more complex structures, compact two-dimensional manifolds:

  1. The 2-sphere, which can be marked up as follows:
    <manifold name="2-sphere">
      <surface generator="P1 P2" paste="P1 -P2">
        <path label="P1" from="N1" to="N2"/>
        <path label="P2" from="N2" to="N1"/>
        <node label="N1"/>
        <node label="N2"/>
      </surface>
    </manifold>

    We note that by pasting the line segment P1 on -P2 we obtain the two-dimensional sphere. P1 and P2 are directed paths, so the orientation of each path matters, as can be seen from the next example.
  2. The projective 2-sphere (P2), i.e., the real projective plane, marked up as follows:
    <manifold name="P2-sphere">
      <surface generator="P1 P2" paste="P1 P2">
        <path label="P1" from="N1" to="N2"/>
        <path label="P2" from="N2" to="N1"/>
        <node label="N1"/>
        <node label="N2"/>
      </surface>
    </manifold>

    We note that by pasting the line segment P1 on P2 we obtain the projective 2-sphere. This surface cannot be embedded in three dimensions, but it can be embedded in four.
  3. The Torus can be marked up as follows:
    <manifold name="torus">
      <surface label="S1" paste="P2 -P4">
        <surface label="S1" nodes="N1 N2 N3 N4" paste="P1 -P3">
          <path label="P1" from="N1" to="N2"/>
          <path label="P2" from="N2" to="N3"/>
          <path label="P3" from="N3" to="N4"/>
          <path label="P4" from="N4" to="N1"/>
          <node label="N1"/>
          <node label="N2"/>
          <node label="N3"/>
          <node label="N4"/>
        </surface>
      </surface>
    </manifold>

    By taking a square piece of surface (with sides P1, P2, P3, P4) and pasting the paths P2 on -P4 and P1 on -P3, we obtain the Torus.
  4. The Klein bottle can be marked up as follows:
    <manifold name="Klein bottle">
      <surface label="S1" paste="P2 P4">
        <surface label="S1" nodes="N1 N2 N3 N4" paste="P1 -P3">
          <path label="P1" from="N1" to="N2"/>
          <path label="P2" from="N2" to="N3"/>
          <path label="P3" from="N3" to="N4"/>
          <path label="P4" from="N4" to="N1"/>
          <node label="N1"/>
          <node label="N2"/>
          <node label="N3"/>
          <node label="N4"/>
        </surface>
      </surface>
    </manifold>

    By taking a square piece of surface (with sides P1, P2, P3, P4) and pasting the paths P2 on P4 and P1 on -P3, we obtain the Klein bottle, which cannot be embedded in three dimensions but can be embedded in four.
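The four pastings above can be told apart mechanically by two invariants: the Euler characteristic V - E + F and orientability. The Python sketch below computes both from the same information the markup carries (directed paths and paste pairs). It is an illustrative aid, not part of the markup scheme: the function name `classify` and the dictionary-based input are hypothetical, and it assumes a single polygonal face whose paths run head to tail around the boundary, with every path appearing in exactly one paste pair.

```python
def classify(paths, pastes):
    """Return (Euler characteristic, orientable) for a pasted polygon.

    paths:  {label: (from_node, to_node)}, directed around the boundary
    pastes: strings such as "P1 -P3", where a leading minus sign reverses
            the second path before it is glued to the first
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for start, end in paths.values():
        find(start)
        find(end)

    orientable = True
    for paste in pastes:
        first, second = paste.split()
        start1, end1 = paths[first]
        if second.startswith("-"):
            end2, start2 = paths[second[1:]]   # reversed path
        else:
            start2, end2 = paths[second]
            orientable = False                 # same-direction gluing twists the surface
        union(start1, start2)                  # glued paths share their end points
        union(end1, end2)

    vertices = len({find(node) for node in parent})
    edges = len(paths) - len(pastes)           # each paste merges two paths into one
    faces = 1
    return vertices - edges + faces, orientable


two_gon = {"P1": ("N1", "N2"), "P2": ("N2", "N1")}
square = {"P1": ("N1", "N2"), "P2": ("N2", "N3"),
          "P3": ("N3", "N4"), "P4": ("N4", "N1")}

sphere = classify(two_gon, ["P1 -P2"])
projective = classify(two_gon, ["P1 P2"])
torus = classify(square, ["P2 -P4", "P1 -P3"])
klein = classify(square, ["P2 P4", "P1 -P3"])
print(sphere, projective, torus, klein)
```

The torus and the Klein bottle share the same Euler characteristic, so it is the sign in the paste attribute alone that separates them.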

Conclusion

The underlying tree structure we have used here is quite close in spirit to the actual chemical structure, making it easier to construct complex content-oriented markup. These graph-based chemical models can be expressed directly in SGML/XML markup, which offers the flexibility to expand and contract the XML tree structure, thus obtaining different modes of representation. We have also seen how chemical reactions can be studied better using this markup. We hope to extend this approach to additional areas, such as higher-dimensional geometry and topology.


Bibliography

[Chemical Markup Language] Murray-Rust, P., and H. S. Rzepa, “Chemical Markup Language and XML Part I. Basic principles”, J. Chem. Inf. Comput. Sci., 1999, 39, 928–942.

[DOM] W3C, Document Object Model (DOM) Level 2 Core Specification, Version 1.0, W3C Recommendation, 13 November 2000, http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/ .

[Hagen] http://www.pragma-ade.com/general/manuals/xchemml-p.pdf .

[INERA] Rosenblum, Bruce, and Irina Golfman, A Decade of DTDs and SGML in Scholarly Publishing: What Have We Learned? In Extreme Markup Languages 2002: Proceedings, 2002, http://www.idealliance.org/papers/extreme02/html/2002/Rosenblum01/EML2002Rosenblum01.html .

[JE] http://xml.coverpages.org/grovesXML0.html .

[Massey] Massey, William S., Algebraic Topology: an introduction, Springer-Verlag, New York, 1977.

[SMILES] http://www.daylight.com/smiles/smiles-intro.html .

[TEI] TEI, The XML Version of the TEI Guidelines – 21 Graphs, Networks, and Trees, http://www.tei-c.org/P4X/GD.html .

[XLink] W3C, XML Linking Language (XLink) Version 1.0, W3C Candidate Recommendation, 3 July 2000.


