Obscuring XML

Elliotte R. Harold
elharo@metalab.unc.edu

Abstract

When reporting bugs in XML processing software, it is essential to be able to submit the actual documents that demonstrate the bugs. When reporting performance problems, it's even more important. However, many XML documents contain sensitive data that cannot be shared. This paper discusses a SAX-based technique for obscuring XML documents that removes most information content while retaining the complete structure of the original document.

Keywords: SAX; Encryption; Java

Elliotte R. Harold

Elliotte Rusty Harold is an internationally respected writer, programmer, and educator, both on the Internet and off. He got his start writing FAQ lists for the Macintosh newsgroups on Usenet and has since branched out into books, Web sites, and newsletters. He's an adjunct professor of computer science at Polytechnic University in Brooklyn, New York. His Cafe con Leche Web site at http://www.cafeconleche.org/ has become one of the most popular independent XML sites on the Web. He's currently working on the XOM API for processing XML with Java, the jaxen XPath engine, and the Jester tool for testing unit test suites.

Obscuring XML

Elliotte R. Harold [Polytechnic University, Dept. of Computer Science]

Extreme Markup Languages 2005® (Montréal, Québec)

Copyright © 2005 Elliotte R. Harold. Reproduced with permission.

People who submit bug reports and performance criticisms are often reluctant to provide the original documents that demonstrate the problems. Even if the reporters are willing to share these documents privately with the maintainer, they may not be willing to share them with competitors, government regulators, employees, and other parties. Thus even if the bug or performance issue can be identified and fixed, the original documents cannot be directly incorporated into the test suite to make sure the bug does not reoccur. This interferes with one of the fundamental principles of test-driven development. "When someone discovers a defect in your code, first write a test that will succeed if the code is working. Then debug until the test succeeds." [Beck]

For instance, Michael Kay wrote, "I've never published my Saxon test suite because so much of it comes from third parties and I've no idea of the IPR status, and it's hard to extricate the bits I wrote myself." [Kay] Another example: a couple of years ago I was interested in verifying the compression ratio claimed for a binary alternative to XML, and seeing if I could identify the characteristics of the input documents that contributed to the purported savings. However, the raw data behind the numbers presented in the paper was classified and therefore unavailable. [Cokus]

To work around such problems, I have written an open source tool based on SAX that removes almost all the information content of a document, replacing it with randomly generated data. However, the actual structure is maintained, even down to such details as punctuation characters remaining punctuation characters (though not necessarily the same ones) and characters not moving outside their Unicode block. For instance, Greek text remains Greek, and Cyrillic remains Cyrillic (albeit gibberish). The documents created by this process tend to demonstrate the same bugs and performance issues as the original documents, without revealing the original documents' sensitive content.

The goal is to obscure the original content in an irreversible way. This is not encryption. There is no key that can decrypt the data. However, the shuffling cannot be purely random either. For instance, XML significant characters such as < and & cannot be shuffled. Furthermore, every start-tag must still match its end-tag. Generally, names, attribute values, and PCDATA are obscured, while other characters are not. For example, suppose we start with the XML document in Listing 1, and imagine that while processing it I've encountered a bug in some open source project. I want to report this bug with a test case, but I don't want to reveal my musical tastes to the world.

Figure 1

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG
  xmlns="http://www.cafeconleche.org/namespace/song"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was
     listening to when I wrote this example -->

The original embarrassing document

Listing 2 shows the result of applying the obscuring process. This result has the same basic structure as the original document. However, all the content has changed.

Figure 2

<?xml version='1.0' encoding='UTF-8'?>
<?mba-jbrlhlefac ntta}&quot;"dfwo/paf&quot;" ufjj~&quot;"fkar.jat&quot;"?>
<!DOCTYPE TMJG SYSTEM "song.dtd">
<TMJG
  xmlns="hkrk://bfj.lrqnonqmsgwm.agw/uuxstejvt/sigu"
  xmlns:hheqb="lqgd://sng.r0.qgp/3020/ninie">
  <QXFMC>Zwv Wan</QXFMC>
  <UFODT
    rcrdp:uvsl="qisdrz" gddog:nrfg="ttYqeu" qdkay:sycw="hbzgxd.egh"
    BBG="Kzeodg Snhwvi us Adh Nuykkz" VZUTF="431" JQQKAJ="094"></UFODT>
  <XTIMHMXB>Ngbxzkp Zemcot</XTIMHMXB>
  <XTIMHMXB>Lebab Frcbtu</XTIMHMXB>
  <XTIMHMXB>Xcuton Hhjsbp</XTIMHMXB>
  <CNTYDPWN>Ngbxzkp Zemcot</CNTYDPWN>
  <!-- Myo aknslqjis rf szlbyvfm Alrurkrx ulp F ciaenu
       pq yfwppcf mi z loinjat tzohve wqqhljntb. -->
  <ADHZKKION rcrdp:uvsl="qisdrz" qdkay:sycw="rjlw://sva.fivdtwshu.irg/">
    I &amp; J Mtjceri
  </ADHZKKION>
  <AUZPKW>9:72</AUZPKW>
  <XUKS>3934</XUKS>
  <HMMSQG>Hidvtjl Yzoiie</HMMSQG>
</TMJG><!-- Zhf wcr gbbt dvjc ghuvg J zzt
     wjiopmfsw ed ttfv G mtwzl eref thpieqo -->

The obscured document

The shuffling is random. If the same document were fed into the program twice in succession, two different documents would be produced.

Changes and Constants

Let's explore how the different parts of an XML document are obscured. In the future, some of this will be made configurable. However, for now I've attempted to pick a reasonable set of defaults that strikes a good balance between maintaining structure and obscuring content.

Names

Every name in the XML document is shuffled. This includes element names, attribute names, entity names, and processing instruction targets. However, once a random variant of a name has been generated, it is thereafter reused when obscuring that name. For instance if the start-tag "<div>" is shuffled to "<bbq>", then the corresponding end-tag "</div>" is also replaced by "</bbq>". Furthermore all other tags and attributes named div will similarly be replaced with bbq.

Prefixed names such as xsl:template are treated a little differently. The colon is not remapped. The prefix is remapped to an equivalent prefix. The local name is remapped to an equivalent local name. The same prefix will be mapped to the same obscured string over the scope of a document. The same is true of local names.

Attribute names, element names, entity names, processing instruction targets, and colonized and non-colonized names are not stored in separate maps. Thus, if the element "foo" is rewritten as "bar", then the attribute foo will also be rewritten as "bar" and the colonized name foo:yyy may be rewritten as bar:abc or bar:uhy, but never abc:uhy.

This name cache does not extend into element content and attribute values. Names in these places, for instance in the select expression of an xsl:template element, almost certainly won't match the names used elsewhere in the document as element and attribute names. Thus in one place in the document the attribute foo="bar" may be rewritten as bgc="rfr" and on another element the same attribute may be rewritten as bgc="hyu". Similarly, no attempt is made to rewrite words as the same words in PCDATA. Doing so would make it too easy to deduce the original text based on word frequency and word length (which is preserved).

Namespaces

Namespace URIs are also shuffled. Like names, the shuffled value is cached and reused every time the original namespace URI is encountered. Thus two elements that are in the same namespace before shuffling will still share their namespace after shuffling. However, neither will be in the same namespace as the original elements. Namespace mappings in scope are fully preserved, not merely namespace declarations. Even neurotic and psychotic namespace mappings are preserved. [English] In fact, although the exact bindings change, the neuroses and psychoses are preserved precisely. This is good because neurotic and psychotic namespaces are a fertile breeding ground for bugs.

PCDATA

Element content, that is, plain text, is shuffled character by character. No text is added. None is removed. Certain characters are not shuffled at all. These include:

  • colon
  • space
  • carriage return
  • line feed
  • tab
  • C1 controls
  • greater than sign
  • less than sign
  • ampersand
  • straight double quotation mark
  • straight single quotation mark
  • non-breaking space

These characters are not adjusted because some tools treat them specially. Changing them runs the risk of masking or introducing bugs that are not present in the original source document.

Other characters are shuffled to different characters of similar nature (unless they randomly shuffle back to themselves). In general most characters remain within the same Unicode block after shuffling. In particular,

  • The upper case ASCII letters A-Z are shuffled into other uppercase ASCII letters.
  • The lower case ASCII letters a-z are shuffled into other lowercase ASCII letters.
  • The ASCII digits 0-9 are shuffled into other ASCII digits.
  • ASCII punctuation characters that have no particular meaning in XML such as -, (, and ^ are shuffled into other non-XML-significant ASCII punctuation characters.
  • The non-ASCII Latin-1 characters such as é and ÷ are shuffled into other characters in the same block. However, within this range I do not yet guarantee that letters do not turn into punctuation marks or digits, and vice versa.
  • Other Unicode blocks in the Basic Multilingual Plane (BMP) such as the Cyrillic block and the Greek block shuffle into themselves. A Greek character will not be replaced with a Cyrillic character and vice versa.
  • High surrogates are turned into high surrogates and low surrogates are turned into low surrogates. However, Unicode blocks beyond the Basic Multilingual Plane are not yet preserved. A musical symbol may turn into a Han extension character or an undefined character and vice versa.
  • Undefined Unicode characters in the BMP such as 0xFFF0 are not shuffled at all.

Attributes

Attribute values are shuffled according to type:

  • The values of attributes of type CDATA, or which have no type, are shuffled much like element content.
  • The values of attributes of type ID, NMTOKEN and enumerated types, are shuffled like element names. However, they use a separate name pool from elements and other named things.
  • The values of attributes of type ENTITY and NOTATION are shuffled like entity names using the same name pool as elements and entities.
  • Tokenized types—IDREFS, NMTOKENS, ENTITIES, and NOTATIONS—are divided into tokens, and then each token is shuffled using the matching name pool.

Furthermore, because this sits on top of SAX all attributes are normalized before being shuffled.

Attributes with the xml: prefix—xml:space, xml:base, xml:id, xml:lang, etc.— are a special case. Because they may have meaning to an XML tool, they are not randomized at all. neither the name nor the value is changed.

What Stays the Same

The structure of the transformed document should remain the same. Well-formed documents should remain well-formed. Valid documents can remain valid, but only if there's no external DTD subset.

Element Positions

If element B is the child of element A in the original, then it will still be the child of element A in the transformed document. Only the names and text have changed. The relative positions and sizes stay the same.

Lengths

There is a 1-1 mapping between obscured characters and the original characters. Except for insignificant white space inside tags and the like, the original and the obscured document should be the same size. All element and attribute names should be the same length as they originally were. White space is not changed so word lengths remain the same.

The Algorithms

Low-level randomization is performed by a java.util.SecureRandom object. The exact algorithm depends on the provider. However, every implementation should be a "cryptographically strong pseudo-random number minimally complies with the statistical random number generator tests specified in FIPS 140-2, Security Requirements for Cryptographic Modules, section 4.9.1. Additionally, SecureRandom must produce non-deterministic output and therefore it is required that the seed material be unpredictable and that output of SecureRandom be cryptographically strong sequences as described in RFC 1750: Randomness Recommendations for Security." [Sun 2003]

All strings are randomized character by character. This randomChar method is used to retrieve a character between two specified Unicode code points. In this method the random variable is a field of type java.util.SecureRandom.

Figure 3
    private char randomChar(int low, int high) {
        int n = random.nextInt(high-low+1);
        return (char) (n+low);
    }

When randomizing a sequence of text, each character is passed to this method. A simple lookup table is used to determine the high and low boundaries for the range of the character. This isn't the fastest implementation possible, but maximum speed is rarely an issue here.

Figure 4
    private char randomize(char c) {

        if (c == ':') return ':';
        else if (c == ' ') return ' ';
        else if (c == '\t') return '\t';
        else if (c == '\n') return '\n';
        else if (c == '\r') return '\r';
        else if (c >= 'A' && c <= 'Z') return randomChar('A', 'Z');
        else if (c >= 'a' && c <= 'z') return randomChar('a', 'z');
        else if (c >= '0' && c <= '9') return randomChar('0', '9');
        else if (isASCIIPunctuationCharacter(c)) return getRandomAsciiPunctuation();
        else if (c <= 127) return c;
        else if (c >= 0xA1 && c <= 0xBF) return randomChar(0xA1, 0xBF);
        else if (c >= 0xC0 && c <= 0xD6) return randomChar(0xC0, 0xD6);
        else if (c >= 0xC0 && c <= 0xD6) return randomChar(0xC0, 0xD6);
        else if (c >= 0xD8 && c <= 0xF6) return randomChar(0xD8, 0xF6);
        else if (c >= 0xF8 && c <= 0xFF) return randomChar(0xF8, 0xFF);
        else if (c >= 0x4E00 && c <= 0x9FA5) return randomChar(0x4E00, 0x9FA5);
        // ...
        else if (c >= 0x30FC && c <= 0x30FE) return randomChar(0x30FC, 0x30FE);

        // high surrogates
        else if (c >= 0xD800 && c <= 0xDBFF) return randomChar(0xD800, 0xDBFF);
        // low surrogates
        else if (c >= 0xDC00 && c <= 0xDFFF) return randomChar(0xDC00, 0xDFFF);

        // C1 controls
        if (c > 127 && c < 160) return randomChar(127, 159);

        return c;
    }

The constants used here are the boundaries of particular Unicode blocks as defined in Unicode 4.0 If a character does not fall into one of these blocks, it is not shuffled.

Limitations

The process of obscuring is an attempt to find a middle ground between the diametrically opposed goals of preserving all information for debugging while hiding all information for security. The more you obscure the more chance that you'll also obscure the issue you were trying to expose. The less you obscure the more chance there is of revealing sensitive information.

Not military grade

In some ways, the algorithm used is stronger than most encryption schemes because it actively removes information. There is no way to reverse the process. The random number generator is simply a java.util.SecureRandom. The exact strength of this generator depends on the installed providers. However, the default installation is probably good enough for most uses. The Java Cryptography Engine makes it straight-forward to plug in more secure sources of randomness should anyone care to do so. [Sun 2002]

Much more important, however, is the information that is deliberately left unobscured in the document, such as the length of the message and the number and positions of the elements. The most sophisticated signals intelligence performed at the National Security Agency and similar organizations in other countries is based as much or more on traffic analysis than on actually breaking the codes. [Newitz] An intelligent adversary could determine a lot simply by analyzing the structures that remain in the obscured documents.

Letter Frequency

Letter frequency is not maintained. For instance, in obscured English text the letters q and v will tend to occur about as frequently as e and n. This makes the program unsuitable for testing many compression schemes since the obscured document will have significantly greater entropy than the original document.

External DTD subsets

The external DTD subset is not modified at all. As well as validity, this means that the result document can lose track of entity definitions and default attribute values, as these are modified in the instance document but not the DTD. To the extent that the SAX parser resolves these things before the transformation, this can be mitigated.

Malformed documents

SAX reports content only up to the first well-formedness error it detects. Consequently, this tool cannot reliably reproduce malformed documents. This means it cannot be used to report bugs where a parser fails to detect a well-formedness error.

Non-Infoset details

Because the Randomizer is implemented on top of SAX, it is limited to working with only the information SAX provides. Roughly, this is the content of the XML Information Set along with some additional information about the content of the DTD. [Brownell 2002] This means there's quite a lot it can't reproduce exactly including:

  • White space inside tags
  • White space inside the prolog and epilog
  • Pre-normalized attribute values
  • Parameter entity references in the DTD
  • Numeric character references. These will be normally replaced by the actual character.
  • Attribute order

Consequently, this is not an ideal tool for obscuring data for testing a parser. Parsers expect to receive data in the lexical layer, and convert it into the structure layer. [Harold 2003] Because this tool receives input from a parser, the data has already been processed into structures. The tokens output may not be quite the same, even if the higher level structures they represent are. For instance, if a parser has a bug in normalizing attribute values, then the tool will normalize the attribute value itself, thus eliminating the data that's critical to finding the bug.

On the other hand, any tool that itself sits on top of SAX such as the XOM API [Harold 2004] or the Saxon XSLT processor [Kay] should not lose any relevant details.

Future Directions

There are several areas that remain to be explored which could produce even more effective reproductions. Possibilities include:

  • SAX 2.0.2's is-standalone and document-xml-version properties [Brownell and Megginson] and the Locator2 class [Brownell 2] could be used to maintain the same XML declaration and encoding in the result document as the original document. This might help detect problems involving Unicode handling and conversion of legacy character sets.
  • SAX does not maintain complete lexical fidelity of the original document. Most alternate APIs such as XOM and StAX [Fry] offer even less. However, the Xerces Native Interface (XNI) [Apache] offers many more details than SAX does. In particular, it provides information about conditional sections and parameter entity boundaries in the DTD. It also much more reliably provides information about the XML declaration and entity boundaries, which are only optionally reported in SAX.
  • The tool should be more configurable so that users can choose what not to shuffle. For instance, some users might choose not to shuffle element and attribute names, just content. Others might not want to shuffle content in the XSLT namespace, or only shuffle content in the XHTML namespace.
  • Multiple documents do not share names. When shuffling two different documents that use the same tag-sets, the tag names will be different in each result document. It should be possible to reuse the name maps from one document to the next so that bugs that are only apparent across multiple documents can be reported.
  • Unicode blocks are preserved, but properties are not. It would be nice if currency indicators remained currency indicators, digits remained digits, combining characters remained combining characters and so forth.

Nonetheless, the tools is useful now. Although somewhat less configurable and a little less powerful than one might wish, it is still more than adequate to allow most test cases to be submitted without revealing confidential information. Sensitive data is no longer an excuse for providing insufficient detail to reproduce bugs.


Bibliography

[Apache] Apache XML Project, "The Xerces Native Interface", http://xml.apache.org/xerces2-j/xni.html

[Beck] Beck, Kent and Erich Gamma, "Test Infected: Programmers Love Writing Tests", http://junit.sourceforge.net/doc/testinfected/testing.htm

[Brownell 2] David Brownell, "Locator2", http://sax.sourceforge.net/apidoc/org/xml/sax/ext/Locator2.html

[Brownell 2002] David Brownell, "Appendix B. SAX2 and the XML Infoset", in SAX2, O'Reilly, 2002, pp. 201-218

[Brownell and Megginson] Brownell, David and David Megginson, "Package org.xml.sax", http://sax.sourceforge.net/apidoc/org/xml/sax/package-summary.html

[Cokus] Michael Cokus and Daniel Winkowski, "XML Sizing and Compression Study for Military Wireless Data", XML 2002 Proceedings, http://www.idealliance.org/papers/xml02/dx_xml02/papers/06-02-04/06-02-04.html

[English] Joe English, "A plea for Sanity", April 5, 2002, http://lists.xml.org/archives/xml-dev/200204/msg00170.html

[Fry] Christopher Fry et al, "JSR-000173 Streaming API for XML Specification 1.0", March 25, 2004, http://jcp.org/aboutJava/communityprocess/final/jsr173/index.html

[Harold 2003] Elliotte Rusty Harold, "Build on top of structures, not syntax", in Effective XML, Addison-Wesley, 2003, http://www.cafeconleche.org/books/effectivexml/chapters/15.html

[Harold 2004] Elliotte Rusty Harold, "XOM Design Principles", in Extreme Markup Languages 2004®: Proceedings http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Harold01/EML2004Harold01.html

[Kay] Michael Kay, "RE: [xsl] killing xslt", May 14, 2004, http://www.stylusstudio.com/xsllist/200405/post30430.html

[Kay] Michael Kay, "SAXON The XSLT and XQuery Processor", http://saxon.sourceforge.net/

[Newitz] Annalee Newitz, "Heavy Traffic", November 17, 2004, http://www.metroactive.com/papers/metro/11.17.04/work-0447.html

[Sun 2002] Sun Microsystems, "Java™ Cryptography Architecture API Specification & Reference", August 4, 2002, http://java.sun.com/j2se/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling

[Sun 2003] Sun Microsystems, "SecureRandom", June 13, 2003, http://java.sun.com/j2se/1.4.2/docs/api/java/security/SecureRandom.html



Obscuring XML

Elliotte R. Harold [Polytechnic University, Dept. of Computer Science]
elharo@metalab.unc.edu