Balanced Context-Free Grammars, Hedge Grammars and Pushdown Caterpillar Automata

Anne Brüggemann-Klein
brueggem@in.tum.de
Derick Wood
dwood@cs.ust.hk

Abstract

The XML community generally takes trees and hedges as the model for XML document instances and element content. In contrast, Berstel and Boasson have discussed XML documents in the framework of extended context-free grammars, modeling XML documents as Dyck strings and schemas as balanced grammars. How can these two models be brought closer together? We examine the close relationship between Dyck strings and hedges, observing that trees and hedges are higher-level abstractions than are Dyck primes and Dyck strings. We then argue that hedge grammars are effectively identical to balanced grammars and that balanced languages are identical to regular hedge languages, modulo encoding. From the close relationship between Dyck strings and hedges, we obtain a two-phase architecture for the parsing of balanced languages. We propose caterpillar automata with an additional pushdown stack as a new computational model for the second phase; that is, for the validation of XML documents.

Keywords: Modeling; Trees/Graphs

Anne Brüggemann-Klein

Professor Brüggemann-Klein received her PhD degree in 1985 from the University of Münster and her Habilitation in 1993 from the University of Freiburg. In 1994, she joined the Fakultät für Informatik at the Technische Universität München.

Her research interests are hypertext and document engineering, with an emphasis on modelling and formal-language techniques. Her work, together with Derick Wood, on unambiguous content models is cited in the XML Recommendation.

Professor Brüggemann-Klein teaches electronic publishing and conducts a lab course on XML technology at the TU München.

Derick Wood

Professor Wood received his BSc and PhD degrees from the University of Leeds, England, in 1963 and 1968, respectively. He was a Postdoctoral Fellow at the Courant Institute, New York University, from 1968 to 1970 before joining the Unit of Computer Science at McMaster University in 1970. He was Chair of Computer Science from 1979 to 1982. From 1982 to 1992 he was a Professor in the Department of Computer Science, University of Waterloo.

For three years he served as Director of the Data Structuring Group. Before joining HKUST in 1995, he was a Professor in the Department of Computer Science, University of Western Ontario. He has published widely in a number of research areas and has written two textbooks, "Theory of Computation," published by John Wiley, and "Data Structures, Algorithms, and Performance," published by Addison-Wesley. In addition, he has recently written, with Eugene Fink, a research monograph "Restricted-Orientation Convexity," published by Springer.

His current research interests are: Document engineering; XML, SGML and XHTML; symbolic manipulation of language-theory objects; algorithms; data structures; and formal language theory.

Balanced Context-Free Grammars, Hedge Grammars and Pushdown Caterpillar Automata

Anne Brüggemann-Klein [Technische Universität München, Institut für Informatik]
Derick Wood [Hong Kong University of Science & Technology, Department of Computer Science]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 Anne Brüggemann-Klein and Derick Wood. Reproduced with permission.


Introduction

Since as early as 1991, Murata has been building on the theory of regular tree and hedge languages as the foundation of research into schema, query and transformation languages for structured documents and XML [Mur00]. Most noticeably, tree and hedge grammars as representations of regular tree and hedge languages form the basis of the XML schema language Relax NG ([CM01]). In the framework of tree and hedge automata, operational issues have been addressed ([LMM00]) and a taxonomy of XML schema languages has been established ([MLM00]). Other areas of application are document transformation ([Mur97]), query languages ([Mur98], [Mur01]) and the definition and processing of access policies for XML documents ([MTKH03]).

More recently, Berstel and Boasson [BB00], [BB02b], [BB02a] have investigated formal properties and grammatical characterizations of XML documents within the general framework of extended context-free grammars, modeling XML documents as Dyck strings and schemas as balanced grammars.

In this work, we establish the equivalence between hedge grammars for regular hedge languages and balanced grammars for Dyck languages. This approach unifies the two competing frameworks and makes results that have been achieved in one framework applicable in a different setting.

As an alternative to hedge automata, we propose a different operational model for validation of hedges against grammars, namely pushdown caterpillar automata (PCAs). PCAs are sequential rather than parallel machines, hence, their behavior is easier to understand and to analyse than the behavior of hedge automata. Furthermore, PCAs deal more homogeneously with the vertical relationship between ancestors and descendants and the horizontal relationship between siblings in a hedge. Hedge automata, however, model the vertical relationship in detail but hide computational aspects that arise from the horizontal relationship.

The main intent of this paper is to clarify and expand the conceptual foundations of XML research in a way that is practically relevant and mathematically sound. This paper is organized into four further sections: In Section “Dyck strings and balanced grammars”, we introduce Berstel and Boasson's conceptual framework of Dyck strings and balanced grammars. Section “Hedges and balanced grammars” builds on the one-to-one correspondence between Dyck strings and hedges and presents our view on parsing that is specifically tailored to Dyck strings and balanced grammars.

After the preparatory Section “Hedges and balanced grammars”, the equivalence between balanced grammars and hedge grammars is immediately obvious from the definition of hedge grammars in Section “Hedges and hedge grammars”.

In Section “Pushdown caterpillar automata”, we introduce pushdown caterpillar automata (PCAs). We establish the essential properties of PCAs that make them an appropriate model for schema validation, drawing on previous results both for extended context-free grammars and for hedge automata.

This paper is in draft form, providing only proof sketches. In the full version we demonstrate how our results apply to the validation problem for a number of XML schema languages and how they relate to the schema-languages taxonomy of Murata, Lee and Mani ([MLM00]).

Dyck strings and balanced grammars

Berstel and Boasson [BB00], [BB02b], [BB02a] investigate formal properties and grammatical characterizations of XML documents within the general framework of extended context-free grammars, modeling XML documents as Dyck strings and schemas as balanced grammars.

Symbols in Dyck strings are called brackets and come in pairs: Each opening bracket a from a finite alphabet Σ has a corresponding closing bracket ā from Σ̄, a disjoint copy of Σ. A Dyck string must have a well-formed bracketed structure.

Definition We define the set of Dyck strings over Σ and Σ̄ inductively as follows:

  1. The empty string is a Dyck string and is represented by λ.
  2. If w is a Dyck string and a ∈ Σ, then a w ā is a Dyck string.
  3. If w and w′ are Dyck strings, then w w′ is a Dyck string.

A Dyck string that is formed using the second inductive rule is called a Dyck prime.

A Dyck language over Σ and Σ̄ is a language that consists solely of Dyck strings over Σ and Σ̄.

We state without proof that each Dyck string can be unambiguously factored into a sequence of Dyck primes.
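
To make the factorization concrete, here is a minimal Python sketch; it is our own illustration, and the encoding of an opening bracket as ("open", a) and a closing bracket as ("close", a) is an assumption of the sketch, not notation from the paper. The string is split exactly where the nesting depth returns to zero, which is where one prime ends and the next begins.

    def factor_into_primes(dyck):
        """Split a Dyck string (a list of (kind, name) pairs) into its
        unique sequence of Dyck primes; raise ValueError otherwise."""
        primes, start, depth = [], 0, 0
        stack = []  # open bracket names awaiting their matching close
        for i, (kind, name) in enumerate(dyck):
            if kind == "open":
                stack.append(name)
                depth += 1
            elif kind == "close":
                if not stack or stack.pop() != name:
                    raise ValueError("not a Dyck string")
                depth -= 1
                if depth == 0:        # a prime a w' a-bar just closed
                    primes.append(dyck[start:i + 1])
                    start = i + 1
            else:
                raise ValueError("unknown bracket kind")
        if depth != 0:
            raise ValueError("unbalanced brackets")
        return primes

    # Example: a b b-bar a-bar c c-bar factors into two primes.
    w = [("open", "a"), ("open", "b"), ("close", "b"), ("close", "a"),
         ("open", "c"), ("close", "c")]
    assert len(factor_into_primes(w)) == 2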

Dyck primes form an abstraction of XML document instances that disregards textual content and attributes, views element names as symbols and maps element start tags to symbols in Σ and element end tags to symbols in Σ̄, thus abstracting from the tag syntax. The brackets in a Dyck prime correspond to the sequence of start-element and end-element events that a SAX-compliant XML processor generates when reading an XML document instance. In the same vein, general Dyck strings correspond to sequences of elements that occur as content of an element in a document instance.
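
This correspondence with parsing events can be illustrated in a few lines of Python. The sketch below is our own; it uses ElementTree's iterparse in place of a SAX parser, its start and end events playing the role of the SAX callbacks. Element names become bracket names; attributes and text content are simply never recorded, exactly as the abstraction prescribes.

    import io
    import xml.etree.ElementTree as ET

    def dyck_events(xml_text):
        """Map element start tags to ("open", name) and end tags to
        ("close", name), abstracting from attributes and text."""
        events = []
        for event, elem in ET.iterparse(io.StringIO(xml_text),
                                        events=("start", "end")):
            events.append(("open" if event == "start" else "close",
                           elem.tag))
        return events

    print(dyck_events("<a><b/><c>text</c></a>"))
    # [('open', 'a'), ('open', 'b'), ('close', 'b'),
    #  ('open', 'c'), ('close', 'c'), ('close', 'a')]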

Definition A balanced grammar G over Σ and Σ̄ is specified by a tuple of the form (N, Σ, Σ̄, P, I), where

  1. N is a nonterminal alphabet.
  2. Σ ∪ Σ̄ is a terminal alphabet disjoint from N, with Σ̄ being a disjoint copy of Σ.
  3. P is a system of production schemas X → L_X (X ∈ N) such that L_X is the union of the languages a L_{X,a} ā (a ∈ Σ), each L_{X,a} in turn being a regular language over the alphabet N.
  4. I is the set of start strings and forms a regular language over the alphabet N.

Given a production schema X → L_X and a string x in L_X, we say that X → x is a production of G. We call L_X the rhs-language of X.

Balanced grammars are extended context-free grammars such that each production derives a string of nonterminals, surrounded by a pair of matching terminal brackets from Σ and Σ̄. Hence, the language L(G) of strings over Σ ∪ Σ̄ that a balanced grammar derives from any of its start strings is a Dyck language.

The language of all Dyck strings over Σ and Σ̄ is a balanced language; that is, it is generated by some balanced grammar over Σ and Σ̄. More precisely, if Σ = {a_1, …, a_n}, we introduce distinct symbols X_{a_1}, …, X_{a_n} as nonterminals for the balanced grammar G_D(Σ, Σ̄) and define the production schemas X_{a_i} → a_i (X_{a_1} | ⋯ | X_{a_n})* ā_i (1 ≤ i ≤ n). Finally, we define the language (X_{a_1} | ⋯ | X_{a_n})* as the set of start strings for G_D(Σ, Σ̄). Obviously, G_D(Σ, Σ̄) derives from its start strings exactly the Dyck strings over Σ and Σ̄.
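
The construction of G_D(Σ, Σ̄) is mechanical enough to write down directly. The following sketch uses our own ad-hoc representation, in which each rhs-language is a regular expression over nonterminal names; nothing about this representation is prescribed by the paper.

    def dyck_grammar(sigma):
        """Balanced grammar G_D generating all Dyck strings over sigma
        and its barred copy; rhs-languages as regexes over nonterminal
        names."""
        nonterminals = {a: "X_" + a for a in sigma}
        any_seq = "(" + "|".join(nonterminals[a] for a in sigma) + ")*"
        # one schema X_a -> a (X_a1 | ... | X_an)* a-bar per color a
        schemas = {nonterminals[a]: {a: any_seq} for a in sigma}
        start = any_seq   # the start strings form (X_a1 | ... | X_an)*
        return nonterminals, schemas, start

    print(dyck_grammar(["a", "b"]))
    # ({'a': 'X_a', 'b': 'X_b'},
    #  {'X_a': {'a': '(X_a|X_b)*'}, 'X_b': {'b': '(X_a|X_b)*'}},
    #  '(X_a|X_b)*')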

Following Berstel and Boasson [BB02a], we view bracket names as colors. Hence, a production X → a x ā of a balanced grammar G is colored a. We can rewrite a balanced grammar such that all the productions of each nonterminal X are uniformly colored; that is, each production schema of G has the form X → a_X L_X ā_X with an opening bracket a_X, the corresponding closing bracket ā_X and a regular language L_X over the alphabet of nonterminals. Then we call a_X the color of the nonterminal X.

The uniform coloring of a nonterminal X with production schema X → L_X, where L_X is the union of the languages a L_{X,a} ā (a ∈ Σ), can be achieved by replacing X with alternative nonterminals X_a (a ∈ Σ) and by using the new, uniformly colored production schemas X_a → a L_{X,a} ā.
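
As a concrete illustration of this rewriting, here is a sketch under the simplifying assumption (ours, for illustration only) that every rhs-language L_{X,a} is a finite set of nonterminal strings; for genuinely regular rhs-languages, the same substitution would be applied to their regular expressions instead.

    from itertools import product

    def uniform_coloring(schemas):
        """schemas: X -> {color a: set of tuples of nonterminals}.
        Returns schemas over colored nonterminals (X, a), each of
        which has a single color."""

        def alternatives(Y):
            # Y is replaced by the alternation of its colored variants
            return [(Y, b) for b in schemas[Y]]

        colored = {}
        for X, by_color in schemas.items():
            for a, lang in by_color.items():
                new_lang = set()
                for word in lang:
                    # substitute every nonterminal occurrence independently
                    new_lang.update(
                        product(*(alternatives(Y) for Y in word)))
                colored[(X, a)] = {a: new_lang}
        return colored

    # X has two colors a and b; afterwards X_a and X_b are uniform.
    schemas = {"X": {"a": {("X",)}, "b": {()}}}
    print(uniform_coloring(schemas))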

In this paper, we do not require nonterminals of balanced grammars to be uniformly colored. However, by applying normal-form algorithms for extended context-free grammars ([AGW01]), which preserve the balanced property of grammars, we ensure that grammars are always reduced; that is, each nonterminal can be reached from some start string of the grammar and can itself derive some terminal string.

Hedges and balanced grammars

The XML community usually considers trees and hedges (that is, sequences of trees) as a model for XML document instances and for the contents of elements. Trees and hedges are higher-level abstractions than are Dyck primes and Dyck strings; they are exposed by DOM-compliant XML processors in object form.

In this section, we examine the close relationship between Dyck strings and hedges. We leverage the relationship to devise a two-phase architecture for the parsing of balanced languages.

Definition A hedge over Σ is an ordered, directed and acyclic graph, whose nodes have in-degree of at most 1 and hold labels from Σ.

We call the nodes of in-degree zero the roots of the hedge graph and we call the nodes of out-degree zero its leaves. If there is an edge from node ν to node ν′ in the graph, then ν is the parent of ν′ and ν′ is a child of ν. Any pair of distinct roots and any pair of distinct nodes that share a parent are in the sibling relationship.

Any sequence ν_1, …, ν_n (n ≥ 1) of nodes such that ν_i is the parent of ν_{i+1}, for each i with 1 ≤ i < n, is called a path from ν_1 to ν_n.

A tree is a hedge that has exactly one root.

There is an obvious one-to-one correspondence between hedges over Σ and Dyck strings over Σ and Σ̄:

Starting with a hedge h over Σ, for each node of h we remove its label a, leaving the node unlabeled, and add two new child nodes, one with label a at the front and one with label ā at the back of its original child sequence. The result is a hedge whose internal nodes have no labels but whose leaf labels yield the Dyck string that corresponds to h. The Dyck string is a sequential representation of the original hedge that uses named brackets to represent the label and the span of a node in the hedge.

Conversely, starting with a Dyck string w, we factor it into a sequence w_1 ⋯ w_n of Dyck primes. Each w_i has the form a_i w_i′ ā_i, with w_i′ being another Dyck string. We build a hedge with root nodes ν_1, …, ν_n, label each ν_i with a_i and add the roots of the hedge that corresponds to w_i′ as children of ν_i. The result is the hedge graph that corresponds to w.

The two conversion processes from hedge to Dyck string and Dyck string to hedge reverse each other and convert trees into Dyck primes and Dyck primes into trees.

Definition We call the hedge over Σ that corresponds to a Dyck string over Σ and Σ̄ the Dyck string's generic-derivation hedge. We call the Dyck string over Σ and Σ̄ that corresponds to a hedge over Σ the hedge's sequential form.

Proposition We can compute a Dyck string's generic-derivation hedge in linear time, doing a one-pass sweep of the Dyck string. We can compute a hedge's sequential form in linear time, doing a depth-first traversal of the hedge.

Proof Firstly, we do not construct the generic-derivation hedge for the given Dyck string itself, but rather we construct a tree that adds a single, unlabeled node as a joint parent for all the hedge's roots. We construct this tree incrementally, during a single left-to-right sweep of the Dyck string, always pointing to some node of the tree as the current node.

We start with a single, unlabeled node, the root of the tree we wish to construct, and point to it as the current node. For each opening bracket a in Σ that we read from the Dyck string, we add a new node that we label with a as the rightmost child of the current node and make this new node the current node. For each closing bracket ā in Σ̄ that we encounter in the string, we move the current-node pointer one level up. The Dyck property ensures that we never move the current-node pointer off the tree and that we end up at the node where we started from, having constructed the generic-derivation hedge hanging from our start node.
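
The sweep just described translates almost line by line into code. In the following sketch, the Node class and the bracket encoding are our own illustrative choices; the auxiliary unlabeled node plays the role of the joint parent from the proof and is detached again at the end.

    class Node:
        """A labeled hedge node; roots have parent None."""
        def __init__(self, label=None, parent=None):
            self.label, self.parent, self.children = label, parent, []

    def generic_derivation_hedge(dyck):
        anchor = Node()              # auxiliary unlabeled joint parent
        current = anchor
        for kind, name in dyck:
            if kind == "open":       # new rightmost child becomes current
                child = Node(name, current)
                current.children.append(child)
                current = child
            else:                    # closing bracket: one level up
                if current is anchor or current.label != name:
                    raise ValueError("not a Dyck string")
                current = current.parent
        if current is not anchor:
            raise ValueError("unbalanced brackets")
        roots = anchor.children
        for r in roots:              # detach the auxiliary parent again
            r.parent = None
        return roots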

Secondly, given a hedge over Σ, we do a depth-first traversal of the hedge, visiting each node twice, once on the way down, before visiting any of its children, and once more on the way up, after having visited all its children. During the first visit of a node with label a we output the opening bracket a; during the second visit of the same node we output the corresponding closing bracket ā.
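
The converse direction is the following traversal, reusing the Node class and generic_derivation_hedge from the previous sketch; the round trip at the end checks that the two conversions reverse each other.

    def sequential_form(roots):
        out = []
        def visit(node):
            out.append(("open", node.label))    # first visit: way down
            for child in node.children:
                visit(child)
            out.append(("close", node.label))   # second visit: way up
        for root in roots:
            visit(root)
        return out

    # Round trip: Dyck string -> hedge -> Dyck string is the identity.
    w = [("open", "a"), ("open", "b"), ("close", "b"), ("close", "a")]
    assert sequential_form(generic_derivation_hedge(w)) == w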

Any sequence of derivation steps that a balanced grammar takes to derive some Dyck string can be mapped to the generic-derivation hedge of that Dyck string. Hence, any derivation structure of a Dyck string, independently of the balanced grammar that derives the string, is structurally equivalent to the generic-derivation hedge. This is the reason that we have called it "generic".

If we are given a balanced grammar and a Dyck string and wish to test whether the grammar derives the string, we need to find nonterminal annotations for the nodes of the string's generic-derivation hedge that witness a derivation of the string:

Definition Given a Dyck string w, its generic-derivation hedge h and a balanced grammar G, we wish to associate a nonterminal of G as a type annotation with each node of h in a way that is conformant to G's productions. More precisely, if a node carries a label a in Σ and a nonterminal type annotation X, then its children may carry any sequence of type annotations X_1 ⋯ X_n such that X → a X_1 ⋯ X_n ā is a production of G. Furthermore, the roots of h must carry, in sequence, some string of type annotations X_1 ⋯ X_n that is a start string of G. We call such an annotation of h a grammar-conformant type annotation.

We consider once more the balanced grammar G_D(Σ, Σ̄) that generates the set of all Dyck strings over Σ and Σ̄. If we annotate, for any hedge h over Σ, each node that carries a label a from Σ with the nonterminal type X_a from G_D(Σ, Σ̄), the result is a grammar-conformant type annotation of h.

Proposition A balanced grammar generates exactly those Dyck strings whose generic-derivation hedges can be annotated with type information that conforms to the grammar.

This proposition implies that the parsing of a Dyck string with respect to a balanced grammar can be split into two phases: namely, construct the generic-derivation hedge (which requires linear time, as we have seen) and then find a grammar-conformant type annotation for it.
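
To make phase two concrete, here is a deliberately naive sketch of the annotation search. It assumes single-character nonterminal names and rhs-languages given as Python regular expressions over those names (our own simplifications), and it enumerates annotation choices for the children instead of running automata, so it is exponential in general; a real validator would use automata-based techniques instead. It reuses the Node helpers from the earlier sketches.

    import re
    from itertools import product

    def possible_types(node, schemas):
        """All nonterminals that can annotate this node conformantly;
        schemas: X -> {label a: regex over type names}."""
        child_choices = [possible_types(c, schemas) for c in node.children]
        result = set()
        for X, by_label in schemas.items():
            regex = by_label.get(node.label)
            if regex is None:
                continue
            # brute force: try every combination of child annotations
            for choice in product(*child_choices):
                if re.fullmatch(regex, "".join(choice)):
                    result.add(X)
                    break
        return result

    def conforms(roots, schemas, start_regex):
        """Does the hedge admit a grammar-conformant type annotation?"""
        return any(re.fullmatch(start_regex, "".join(choice))
                   for choice in product(*(possible_types(r, schemas)
                                           for r in roots)))

    # X -> a Y* a-bar and Y -> b b-bar; the start strings are just X.
    schemas = {"X": {"a": "Y*"}, "Y": {"b": ""}}
    h = generic_derivation_hedge(
        [("open", "a"), ("open", "b"), ("close", "b"),
         ("open", "b"), ("close", "b"), ("close", "a")])
    print(conforms(h, schemas, "X"))   # True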

The classical theory of formal languages abounds with results about which classes of context-free grammars lead to efficient parsing algorithms. For the area of extended context-free grammars, the authors have recently described the grammars that are amenable to top-down parsing with 1-symbol look-ahead ([BKW03a], [BKW03b]). In Section “Pushdown caterpillar automata” we discuss how these results apply to balanced grammars.

We summarize our view of parsing with respect to balanced languages with the diagram in Figure 1.

Hedges and hedge grammars

Since as early as 1991, Murata has been building on the theory of regular tree and hedge languages as the foundation of research into schema, query and transformation languages for structured documents and XML [Mur00].

There are a number of equivalent mechanisms that can be employed to define regularity of hedge languages; that is, sets of hedges. As with regular string languages, several types of automata turn out to be equivalent with respect to the languages they recognize ([GS84], [CDG+98], [BKMW01]). Lately, following Murata's lead ([MLM00], [LMM00]), the XML community has favored tree grammars as the mechanism of choice for the definition of regular hedge languages, since they can be most easily turned into a schema language for XML, as exemplified by Relax NG ([CM01]).

In this section, building on the insights of Section “Hedges and balanced grammars”, we argue that hedge grammars are effectively identical to balanced grammars and that balanced languages are identical to regular hedge languages, modulo encoding. As an application, as we will demonstrate in the full version of this paper, Berstel and Boasson's results on codeterministic and minimal balanced grammars ([BB02a]) turn out to be corollaries of well-known theorems on hedge automata.

Definition A hedge grammar G over the alphabet Σ is specified by a tuple (N, Σ, P, I) where N is a finite set that is disjoint from Σ, P is a subset of N × Σ × N* such that, for each X in N and each a in Σ, the set {x | (X, a, x) ∈ P} is a regular string language, and where I is a regular subset of N*. We call N the set of nonterminals of G, P the set of productions and I the set of start strings. A tree grammar is a hedge grammar where each start string has length 1.

We can rewrite the set P of productions of a hedge grammar G = (N, Σ, P, I) as a set of production schemas X → L_X (X ∈ N) such that L_X is the union of the languages a L_{X,a} ā (a ∈ Σ), where L_{X,a} = {x | (X, a, x) ∈ P}, and vice versa.
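
The rewriting in both directions amounts to regrouping triples, as the following sketch shows for the special case of finite production sets (our own simplification; the general case regroups the representations of the regular sets instead).

    from collections import defaultdict

    def triples_to_schemas(P):
        """P: set of triples (X, a, x) with x a tuple of nonterminals.
        Returns schemas: X -> a -> the finite rhs-language L_{X,a}."""
        schemas = defaultdict(lambda: defaultdict(set))
        for X, a, x in P:
            schemas[X][a].add(x)
        return schemas

    def schemas_to_triples(schemas):
        """The inverse regrouping, back to hedge-grammar triples."""
        return {(X, a, x)
                for X, by_color in schemas.items()
                for a, lang in by_color.items()
                for x in lang}

    P = {("X", "a", ("Y",)), ("X", "b", ()), ("Y", "b", ())}
    assert schemas_to_triples(triples_to_schemas(P)) == P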

Definition A hedge over Σ is valid with respect to a hedge grammar over Σ if and only if the hedge admits a grammar-conformant type annotation. The language L(G) of a hedge grammar G is the set of all hedges that are valid with respect to G.

Let G = (N, Σ, P, I) be a hedge grammar over Σ. If we call the nonterminals of G states, then the grammar G becomes a nondeterministic top-down hedge automaton M ([BKMW01]) and each grammar-conformant type annotation of a hedge h becomes a computation of the automaton that recognizes the hedge. Hence, the hedges that are valid with respect to G are precisely the hedges that the hedge automaton M recognizes. Thus, the languages of hedge grammars are precisely the regular hedge languages.

Theorem The two mappings between Dyck strings over Σ and Σ̄ and hedges over Σ that map a Dyck string to its generic-derivation hedge and a hedge to its sequential form induce a one-to-one correspondence between balanced languages of Dyck strings and regular languages of hedges.

Proof Regular hedge languages are exactly the languages of hedge grammars. By definition, the language of a hedge grammar consists precisely of the hedges that are valid with respect to the grammar. By the proposition above, these hedges correspond, via the sequential-form mapping, one-to-one to the Dyck strings that the hedge grammar generates when it is viewed as a balanced grammar.

We illustrate this theorem with the diagram in Figure 2:

Pushdown caterpillar automata

The central problem for any schema-language approach is how to handle the validation of a document with respect to a schema. In the context of hedges and hedge grammars, validation means finding grammar-conformant type annotations for hedges.

We have already pointed out how to view a hedge grammar as a nondeterministic top-down hedge automaton. "Reversing" a top-down automaton results in an equivalent bottom-up hedge automaton that can be made deterministic either offline, in a preprocessing step that extends the subset construction from string automata to hedge automata ([CDG+98], [BKMW01]), or dynamically, during validation. Hence, hedge automata serve as a framework within which validation against hedge grammars and other schema mechanisms can be explored, as has been proposed by Murata, Lee and Mani ([MLM00], [LMM00]).

We recognize two problems with the approach of using hedge automata as a general framework for discussing schema validation: First, hedge automata are parallel machines; hence, it is not straightforward to discuss implementation and performance issues in this framework. Second, hedge automata exhibit their typical table-driven behavior only in their vertical movements up and down a hedge; they do not explore the regular horizontal relationship between sibling nodes in an automata-like fashion.

We propose to use caterpillar automata ([BKW99], [BKW00]) as an automata model instead. Caterpillar automata have the sequential control of finite string automata; that is, being in one of a finite number of states, they react to an input symbol from a finite alphabet with a change of state, on the basis of some transition table. Caterpillar automata operate on hedges, moving back and forth among sibling nodes and between parent and child nodes. This is facilitated by the special input symbols up, first, last, left and right. The potential moves of the caterpillar automaton on the hedge are driven by transitions on such movement symbols. A caterpillar automaton can get a sense of its position on the hedge with the help of special test input symbols such as isFirst, isLast, isRoot and isLeaf. It may perform a transition on such a test symbol only if it is in a corresponding position on the hedge. Finally, caterpillar automata may read the label of the node they are sitting on by performing a transition on that symbol.

We can specify a caterpillar automaton over the label alphabet Σ as a regular expression or, equivalently, as a finite-state automaton over the alphabet Σ ∪ Δ, where we set Δ = {up, left, right, first, last, isRoot, isLeaf, isFirst, isLast}.

The hedge-traversing caterpillar automaton M_H performs, when started on the leftmost root of a hedge, a complete depth-first traversal of the hedge and then stops.
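
The paper does not spell out a transition table for M_H, so the following sketch is one possible encoding of it, together with a small interpreter that tries transitions in order; the state names down, next, climb and stop are our own. It reuses the Node class from the earlier sketches.

    # transition table for M_H; tried in order, first applicable wins
    M_H = [
        ("down",  "isLeaf", "next"),   # nothing below: subtree is done
        ("down",  "first",  "down"),   # otherwise descend to first child
        ("next",  "isLast", "climb"),  # last sibling: leave this level
        ("next",  "right",  "down"),   # otherwise go on with the sibling
        ("climb", "isRoot", "stop"),   # back on a root: traversal done
        ("climb", "up",     "next"),   # otherwise parent's subtree is done
    ]

    def run_caterpillar(transitions, roots):
        """Run a caterpillar program on a hedge given by its roots;
        returns the node on which the automaton stops."""
        node, state = roots[0], "down"      # start on the leftmost root

        def sibs(n):
            return roots if n.parent is None else n.parent.children

        while state != "stop":
            for p, symbol, q in transitions:
                if p != state:
                    continue
                ok = False
                if symbol == "isLeaf":
                    ok = not node.children
                elif symbol == "isLast":
                    ok = sibs(node)[-1] is node
                elif symbol == "isRoot":
                    ok = node.parent is None
                elif symbol == "first" and node.children:
                    ok, node = True, node.children[0]
                elif symbol == "right":
                    s = sibs(node)
                    i = s.index(node)
                    if i + 1 < len(s):
                        ok, node = True, s[i + 1]
                elif symbol == "up" and node.parent is not None:
                    ok, node = True, node.parent
                if ok:
                    state = q
                    break
        return node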

We know that caterpillar automata recognize only regular hedge languages ([BKW99], [BKW00]). It is an open research question whether they are able to recognize the whole family of regular hedge languages. Yet there is some evidence that they are strictly less powerful than hedge automata [OSD02]. Since we intend to set up a sufficiently rich automata model, we equip caterpillar automata with a pushdown stack that holds symbols from a finite pushdown alphabet Π. A pushdown caterpillar automaton or PCA can read the topmost symbol of the stack and react to it with a transition on that symbol, leaving the stack intact. When doing a down move, the PCA must push a new symbol onto the stack; analogously, when doing an up move, the topmost stack symbol is popped. Hence, the stack is synchronized with the level of the node that the automaton is sitting on, assuming that we start the PCA on a root of the hedge with an empty stack.

Definition For a finite alphabet of pushdown-stack symbols Π, let Δ_Π be the alphabet {down_z | z ∈ Π} ∪ {update_z | z ∈ Π} ∪ {up, right, isRoot, isLeaf, isLast} of hedge actions.

We replace the original two actions first and last, of going to the first and to the last child of the current node, respectively, with a single action down_z that goes to the first child of the current node and simultaneously pushes the symbol z onto the stack. Furthermore, we drop the test action isFirst since we do not seem to need it.

We assume that the alphabets Σ, Π and Δ_Π are pairwise disjoint.

Definition A pushdown caterpillar automaton (PCA) over Σ is a finite-state automaton over the alphabet Σ ∪ Π ∪ Δ_Π, for some finite alphabet of pushdown-stack symbols Π.

We use, as is the usual practice, the notion of a configuration to explain how a PCA operates on a hedge.

Definition A configuration of a PCA M over Σ on a hedge h over Σ is a tuple C = (ν, p, w) such that:

  1. ν is a node of the hedge h (or the null pointer if h is the empty hedge); it is called the current node or the node that M is sitting on.
  2. p is a state of M; it is called the current state.
  3. w is a string over the pushdown-stack alphabet Π of M; it is called the current stack. The length of the stack must be equal to the depth of node ν (that is, the distance from the top of the hedge), so w is empty if ν is a root node. The last symbol in the sequence is considered to be the top of the stack; it is called the current stack symbol.

Definition A single-step operation of a PCA M on a nonempty hedge h leads from a configuration C = (ν, p, w) to a configuration C′ = (ν′, p′, w′), denoted by C ⊢_M C′, if and only if one of the following conditions holds:

  1. M reads the node label: M has a transition (p, a, p′), a is the label of ν, ν′ = ν and w′ = w.
  2. M reads the stack symbol: w has the form w = w_1 z, M has a transition (p, z, p′), ν′ = ν and w′ = w.
  3. M updates the stack symbol: w has the form w = w_1 z, M has a transition (p, update_{z′}, p′), ν′ = ν and w′ = w_1 z′.
  4. M moves down in the hedge: M has a transition (p, down_z, p′), ν′ is the first child of ν and w′ = w z.
  5. M moves up in the hedge: M has a transition (p, up, p′), ν′ is the parent of ν and w has the form w = w′ z.
  6. M moves to the right in the hedge: M has a transition (p, right, p′), ν′ is the right sibling of ν and w′ = w.
  7. M tests its position on the hedge: M has a transition (p, isRoot, p′), (p, isLeaf, p′) or (p, isLast, p′); ν is a root, a leaf or the rightmost node among its siblings, respectively; ν′ = ν and w′ = w.

A starting configuration of M on h, which may be the empty hedge, is a configuration C = (ν, p, λ) such that ν is the leftmost root of h, p is a start state of M and, furthermore, the stack is empty.

An accepting configuration of M on h, which may be the empty hedge, is a configuration C = (ν, p, λ) such that ν is the rightmost root of h, p is a final state of M and, once more, the stack is empty.

An accepting computation of M on h is a sequence of configurations C_1, …, C_n (n ≥ 1) such that C_1 is a starting configuration, C_n is an accepting configuration and a single-step operation leads from C_i to C_{i+1} (1 ≤ i < n).

Definition A PCA over Σ recognizes a hedge over Σ if and only if there is an accepting computation of the PCA on the hedge. We call the set of hedges that a PCA M recognizes its language and denote it by L(M).
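
The single-step relation above can be interpreted directly. The following sketch is our own encoding (symbols as tagged tuples, stacks as Python tuples) and explores the configuration graph breadth-first; since the stack length always equals the depth of the current node, there are only finitely many configurations on a given hedge and the search terminates. It reuses the Node class from the earlier sketches.

    from collections import deque

    def pca_recognizes(transitions, starts, finals, roots):
        """transitions: set of (p, symbol, q); symbols are ("label", a),
        ("read", z), ("update", z), ("down", z), "up", "right",
        "isRoot", "isLeaf" or "isLast"."""
        if not roots:                  # empty hedge: runs of length one
            return bool(set(starts) & set(finals))

        def sibs(n):
            return roots if n.parent is None else n.parent.children

        def steps(node, p, w):         # yield all successor configurations
            for q, symbol, r in transitions:
                if q != p:
                    continue
                tag = symbol[0] if isinstance(symbol, tuple) else symbol
                if tag == "label" and node.label == symbol[1]:
                    yield node, r, w
                elif tag == "read" and w and w[-1] == symbol[1]:
                    yield node, r, w
                elif tag == "update" and w:
                    yield node, r, w[:-1] + (symbol[1],)
                elif tag == "down" and node.children:
                    yield node.children[0], r, w + (symbol[1],)
                elif tag == "up" and node.parent is not None:
                    yield node.parent, r, w[:-1]
                elif tag == "right":
                    s = sibs(node)
                    i = s.index(node)
                    if i + 1 < len(s):
                        yield s[i + 1], r, w
                elif tag == "isRoot" and node.parent is None:
                    yield node, r, w
                elif tag == "isLeaf" and not node.children:
                    yield node, r, w
                elif tag == "isLast" and sibs(node)[-1] is node:
                    yield node, r, w

        seen = {(id(roots[0]), p, ()) for p in starts}
        queue = deque((roots[0], p, ()) for p in starts)
        while queue:
            node, p, w = queue.popleft()
            if node is roots[-1] and p in finals and w == ():
                return True            # accepting configuration reached
            for node2, p2, w2 in steps(node, p, w):
                if (id(node2), p2, w2) not in seen:
                    seen.add((id(node2), p2, w2))
                    queue.append((node2, p2, w2))
        return False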

Theorem A hedge language over Σ is regular if and only if it is recognizable by a PCA.

Proof Firstly, let M be a PCA. We simulate M with a two-way hedge automaton. Our result ([BKW02]) that two-way tree automata recognize the same tree languages as one-way tree automata or tree grammars is easily extended from trees to hedges. Hence, L(M) is regular.

Secondly, we convert a hedge grammar G into a PCA M_G that recognizes precisely the hedges that are valid with respect to the hedge grammar. The construction combines techniques from two earlier papers: In our work on caterpillar automata ([BKW99], [BKW00]), we converted local tree grammars into caterpillar automata by extending the hedge-traversing caterpillar automaton that we mentioned earlier in this paper. Local tree or hedge grammars have no memory beyond the label of the "local" node that they are expanding. By applying the technique of partial parse trees that we developed in our work on predictive parsing of extended context-free languages ([BKW03a], [BKW03b]), we demonstrate that the stack memory of a PCA is sufficient to verify validity of a hedge with respect to a hedge grammar.

In the sketch of the proof of the theorem above, we have mentioned a PCA M_G that we construct from a hedge grammar G. The PCA M_G recognizes precisely the hedges that are valid with respect to the grammar G.

If we equip PCAs with the write-only capability to annotate nodes with symbols from a finite set, we can extend M_G in such a way that it not only recognizes a hedge that conforms to G but also annotates it with grammar-conformant type information.

The definition of the PCA M_G depends on the way the regular sets L_{X,a} (which consist of all the strings x of nonterminals such that X → a x ā is a production of G) are represented. We assume that each L_{X,a} is represented by a finite-state automaton M_{X,a}.

Theorem The PCA M_G can be constructed from a hedge grammar G in linear time.

In general, M_G will operate nondeterministically on a hedge h. From a performance point of view, it is relevant to know for which grammars G the behavior of M_G is deterministic. Hence, let us call a color a live for X if and only if the language L_{X,a} is nonempty. Let us then replace, in each automaton M_{X,a}, each transition on a nonterminal Y with transitions on each color b that is live for Y. We call the result of these replacements M^c_{X,a} (a finite-state automaton over the set of colors Σ).
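
The replacement that produces M^c_{X,a}, and the determinism test that the theorem below refers to, are a few lines each; the encoding of automata as sets of transition triples is our own assumption.

    def color_automaton(delta, live):
        """delta: set of transitions (q, Y, q') over nonterminals;
        live: Y -> set of colors live for Y. Returns the transitions
        of M^c, over colors instead of nonterminals."""
        return {(q, b, q2) for (q, Y, q2) in delta for b in live[Y]}

    def is_deterministic(delta_c):
        seen = {}
        for q, b, q2 in delta_c:
            if seen.setdefault((q, b), q2) != q2:
                return False
        return True

    # Two nonterminals sharing a live color cause nondeterminism.
    live = {"Y": {"a"}, "Z": {"a", "b"}}
    delta = {(0, "Y", 1), (0, "Z", 2)}
    print(is_deterministic(color_automaton(delta, live)))  # False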

We can now state the following theorem:

Theorem The PCA M_G operates deterministically on all hedges if and only if each automaton M^c_{X,a} is deterministic.

In our previous work on predictive parsing for extended context-free grammars ([BKW03a], [BKW03b]), we have identified a class of grammars whose languages can be parsed deterministically with a top-down parsing strategy that uses a look-ahead of one symbol. This is the class eSLL(1), the extended analog of the classical SLL(1) context-free grammars, to which the top-down 1-symbol-look-ahead parsing strategy applies.

Which of Berstel and Boasson's balanced grammars belong to eSLL(1)? Viewing balanced grammars once more as hedge grammars, we state the following result:

Theorem A balanced grammar G is eSLL(1) if and only if the PCA M_G operates deterministically on all hedges.


Acknowledgments

The work of the second author was supported under the CERG grant HKUST 6197/01E from the Research Grants Council of Hong Kong.


Bibliography

[AGW01] J. Albert, D. Giammarresi, and D. Wood. Normal form algorithms for extended context-free grammars. Theoretical Computer Science, 267(1--2):35--47, 2001.

[BB00] J. Berstel and L. Boasson. XML grammars. In Mogens Nielsen and Branislav Rovan, editors, Mathematical Foundations of Computer Science 2000, 25th International Symposium, MFCS 2000, Bratislava, Slovakia, August 28 - September 1, 2000, Proceedings, volume 1893 of Lecture Notes in Computer Science, pages 182--191. Springer-Verlag, 2000.

[BB02a] J. Berstel and L. Boasson. Balanced grammars and their languages. In Wilfried Brauer, Hartmut Ehrig, Juhani Karhumäki, and Arto Salomaa, editors, Formal and Natural Computing: Essays Dedicated to Grzegorz Rozenberg [on occasion of his 60th birthday, March 14, 2002], volume 2300 of Lecture Notes in Computer Science, pages 3--25. Springer-Verlag, 2002.

[BB02b] J. Berstel and L. Boasson. Formal properties of XML grammars and languages. Acta Informatica, 38(9):649--671, 2002.

[BKMW01] A. Brüggemann-Klein, M. Murata, and D. Wood. Regular tree and regular hedge languages over unranked alphabets. Technical Report HKUST-TCSC-2001-05, Hong Kong University of Science and Technology, Theoretical Computer Science Center, Computer Science Department, Hong Kong SAR, 2001.

[BKW00] A. Brüggemann-Klein and D. Wood. Caterpillars: A context specification technique. Markup Languages, 2(1):81--106, 2000.

[BKW02] A. Brüggemann-Klein and D. Wood. Regularly-extended two-way nondeterministic tree automata, 2002. Accepted for publication in a journal.

[BKW03a] A. Brüggemann-Klein and D. Wood. On predictive parsing and extended context-free grammars. In Jean-Marc Champarnaud and Denis Maurel, editors, Implementation and Application of Automata, 7th International Conference, CIAA 2002, Tours, France, volume 2608 of Lecture Notes in Computer Science, pages 239--247. Springer-Verlag, 2003.

[BKW03b] A. Brüggemann-Klein and D. Wood. On predictive parsing and extended context-free grammars. In Rolf Klein, Hans-Werner Six, and Lutz Michael Wegner, editors, Computer Science in Perspective, volume 2598 of Lecture Notes in Computer Science, pages 69--87. Springer-Verlag, 2003.

[BKW99] A. Brüggemann-Klein and D. Wood. Caterpillars, context, tree automata and tree pattern matching, 1999. Proceedings of the Fourth International Conference on Developments in Formal Language Theory (DLT '99).

[CDG+98] H. Comon, M. Dauchet, R. Gilleron, S. Tison, and M. Tommasi. Tree automata techniques and applications, 1998. Available on the Web from l3ux02.univ-lille3.fr in directory tata.

[CM01] J. Clark and M. Murata. Relax NG specification. The Organization for the Advancement of Structured Information Standards (OASIS), December 2001.

[GS84] F. Gécseg and M. Steinby. Tree Automata. Akadémiai Kiadó, Budapest, 1984.

[LMM00] D. Lee, M. Mani, and M. Murata. Reasoning about XML schema languages using formal language theory. Technical Report RJ# 10197, Log# 95071, IBM Almaden Research Center, 2000.

[MLM00] M. Murata, D. Lee, and M. Mani. Taxonomy of XML schema languages using formal language theory. Extreme Markup Languages, 2000.

[MTKH03] M. Murata, A. Tozawa, M. Kudo, and S. Hada. XML access control using static analysis. In Proceedings of the 10th ACM conference on Computer and communication security, pages 73--84. ACM Press, 2003.

[Mur00] M. Murata. Hedge automata: A formal model for XML schemata. Web-published manuscript, 2000.

[Mur01] M. Murata. Extended path expressions for XML. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, California, USA. ACM, 2001.

[Mur97] M. Murata. Transformation of documents and schemas by patterns and contextual conditions. In C. Nicholas and D. Wood, editors, Proceedings of the Third International Workshop on Principles of Document Processing (PODP 96), pages 153--169, Heidelberg, 1997. Springer-Verlag. Lecture Notes in Computer Science 1293.

[Mur98] M. Murata. Data model for document transformation and assembly. In E.V. Munson, C. Nicholas, and D. Wood, editors, Proceedings of the Fourth International Workshop on Principles of Digital Document Processing (PODDP 98), pages 140--152, Heidelberg, Germany, 1998. Springer-Verlag. Lecture Notes in Computer Science 1481.

[OSD02] A. Okhotin, K. Salomaa, and M. Domaratzki. One-visit caterpillar tree automata. Fundamenta Informaticae, 52(4):361--375, 2002.


