Invisible XML

Steven Pemberton, CWI, Amsterdam

Abstract

What if you could see everything as XML? XML has many strengths for data exchange, strengths both inherent in the nature of XML markup and strengths that derive from the ubiquity of tools that can process XML. For authoring, however, other forms are preferred: no one writes CSS or Javascript in XML. It does not follow, however, that there is no value in representing such information in XML. Invisible XML is a method for treating non-XML documents as if they were XML, enabling authors to write in a format they prefer while providing XML for processes that are more effective with XML content. There is really no reason why XML cannot be more ubiquitous than it is.

(Original paper: Pemberton, Steven. “Invisible XML.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:10.4242/BalisageVol10.Pemberton01. http://www.balisage.net/Proceedings/vol10/html/Pemberton01/BalisageVol10-Pemberton01.html. This version: 2014-01-14.)

Contents

XML and Authoring

XML is a popular format. It is widely and successfully used for document and data storage, exchange and presentation. A major advantage of using XML is the toolchain and pipeline available for generic XML processing. You can easily use new formats within the generic framework.

However, for authoring purposes XML is seldom preferred over a notation more directly suited to the purpose. Few would prefer to write their CSS rules as

<rule><simple-selector name="body"/><block><property name="color" value="blue"/></block></rule>

to the more direct

body {color: blue}

and even less would prefer to write

<statement><if><condition><comparison name="&lt;"><var name="max"><var name="a"></comparison></condition><then><statement><assign><var name="max"/><expression><var name="a"/></expression></assign></statement></then></if></statement>

to the much more direct

if (max<a) then max=a;

Similarly, authoring languages designed to make HTML input easier, such as for Wikipedia [Wikipedia], or Markdown [Markdown] do not use a nested SGML/HTML style, but a flatter style that is intended to be easier to read and write.

And, of course it should be noted that even the XML schema language RELAX NG has both an XML syntax and a 'compact' syntax [RELAX NG] [RELAX NG COMPACT].

In fact if we were to be brutally honest, even XML formats take short cuts for authoring ease. Take for instance an <a> element in XHTML:

<a href="http://www.w3.org/TR/1999/xhtml">XHTML</a>

This does not surface the real structure of the underlying data. If we were to be completely faithful to the principle of making all relevant structure explicit, we should really write something along the lines of

<a><href><method type="href"/><domain name="org"/><site name="w3"/><sub name="www"/><path><root><sub name="TR"><sub name="1999"><sub name="xhtml"></sub></sub></sub></root></path></href><text>XHTML</text></a>

You might argue about the details here, but this example is only to show that there are parts of XML documents that could be further structured, but that we choose not to, possibly for authoring ease, possibly for fear of being laughed out of town.

The reasons for this are obvious: despite the disadvantages of not being able to use the generic toolchain any more, or only to a lesser degree, the increased readability of the source, and its closer relation to the problem domain makes authoring so much easier.

Parsing and Parse trees

Part of the advantage of XML is that there is a single parser needed to be able to deal with any kind of document. This can be contrasted with for instance the situation for HTML, where you need a parser for the HTML, with separate parsers for CSS and Javascript at least, (and URLs), creating extra complexity and brittleness.

But looked at through a suitable pair of glasses, what is XML apart from a description of a parse tree for some format (with some special treatment for text nodes)? And frankly, what is so difficult about general-purpose parsing? It is a widely understood and easily solved problem. Is it not possible to combine the best of both worlds, and have authorable formats, that can still use the XML tool chain? Couldn't XML become the underlying format for everything?

The Approach

The approach presented here is to add one more step to the XML processing chain, an initial one. This step takes any textual document, and a (reference to) a suitable syntax description, parses the document using the syntax description, and produces as output a parse tree that can be treated as an XML document with no further parsing necessary (or alternatively, the document can be serialised out to XML).

In other words, the input document might be

body {color: blue}

but the result of the parse will be the same as if an XML parser had been presented with the XML document

<css>
   <rule>
      <simple-selector name="body"/>
      <block><property name="color" value="blue"/></block>
   </rule>
</css>

We call this method Invisible XML, since the document is treated as XML, but it is not visibly an XML document.

Syntax Description

The requirement is to find a suitable way to describe the syntax of the input document so that the resultant parse-tree is of the form suitable for use in our XML chain. If we were to use BNF [BNF], arguably the most well-known syntax-description format, it might look like this (in what follows "..." is used for parts of the definition that have been elided and will be defined later):

<css> ::= <rules>
<rules> ::= <rule> | <rules> <rule>
<rule> ::= <selector> <block>
<block> ::= "{" <properties> "}"
<properties> ::= <property> | <property> ";" <properties>
<property> ::= <name> ":" <value> | <empty>
<selector> ::= <name>

etc, etc. But it is quickly apparent that this has some shortcomings. Firstly a surface problem that since we are using this for XML, we could quickly go crazy with the use of angle brackets for two different purposes. Although there is a certain charm to defining the <css> element with a syntax rule whose name is <css>, let us rather use a different format. Therefore we shall use a variant of VWG format [VWG]. This looks like:

css: rules.
rules: rule; rules, rule.
rule: selector, block.
block: "{", properties, "}".
properties:  property; property, ";", properties.
property:  name, ":", value; empty.
selector: name.
name: ...
value: ...
empty: .

(We shall restrict ourselves to a simplified CSS grammar for the sake of this article).

Note that ";" signifies alternatives, and as is normal in syntax definitions, if one alternative is empty (or reduces to empty), the rule is optional.

If we parse the snippet of CSS above with this, and then represent the resulting parse tree in an XML style (so that each nonterminal is represented as an XML element), a second problem becomes apparent:

<css>
   <rules>
      <rule>
         <selector><name>body</name></selector>
         <block>
            <properties>
               <property>
                  <name>color</name>
                  <value>blue</value>
               </property>
            </properties>
         </block>
      </rule>
   </rules>
</css>

namely that there are certain elements in the tree (rules, properties) that we really aren't interested in. (You'll notice that some terminal symbols such as the brackets, colons and semicolons don't appear in the parse tree. This will be discussed later).

The problem becomes even more apparent with a CSS snippet like

body {color: blue; font-weight: bold}

since the content of the <block> element then becomes even more unwieldly:

<properties>
   <property>
      <name>color</name>
      <value>blue</value>
   </property>
   <properties>
      <property>
         <name>font-weight</name>
         <value>bold</value>
      </property>
   </properties>
</properties>

where we would prefer to see the much more direct

<property>
   <name>color</name>
   <value>blue</value>
</property>
<property>
   <name>font-weight</name>
   <value>bold</value>
</property>

The problem arises in this case because the syntax description method relies on recursion to deal with repetition. To that end, we shall introduce a specific notation for repetition. Zero or more repetitions:

(rule)*

and one or more repetitions:

(rule)+

In fact we shall extend these two postfix operators to also act as infix operators, to handle a commonly occurring case:

(property)*";"
(property)+";"

which respectively mean "zero or more, separated by semicolon" and "one or more, separated by semicolon" (there is no reason to restrict the separator to a terminal as here; it may also be a nonterminal).

Now we can specify our syntax as:

css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property:  name, ":", value; .
selector: ...
name: ...
value: ...

and the parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

However, there is another reason why we might not want a syntax rule name to appear in the parse tree, and that is when we use a syntax rule as a refinement, that is to say, when the syntax rule doesn't represent anything of semantic importance, but has been defined so that we can use it in several places without having to repeat it. For instance, suppose we wanted to define a series of properties in a separate rule:

properties: (property)*";".

and use it:

block: "{", properties, "}".

but not want <properties> to appear in the final parse tree. What we define is that the use of any rule name preceded by a minus sign is only being used for refinement. So that would give us:

properties: (property)*";".
block: "{", -properties, "}".

and this would result in the same parse-tree as above. Note that this still allows a rule to be used in other places and appear in the parse tree if needed.

Also note that for simplicity we have ignored treating spaces in the syntax description, but that is also an example of something you would not want to have in the parse tree:

colon: -spaces, ":", -spaces.
spaces: " "*.

Similarly, we can use it to make empty alternatives more explicit:

property:  name, ":", value; -empty.
empty: .

Terminals

As alluded to above, in general, terminal symbols do not appear in the parse-tree, since most of them are only there to delimit structural elements in the source file. If you want them to show up, you can add an explicit rule for them:

property:  name, colon, value; .
colon: ":".

which will cause them to show up in the tree like this:

 <property>
     <name>color</name>
     <colon/>
     <value>blue</value>
 </property>

However, there are places where terminals have semantic meaning, and you do want them to appear in the parse-tree, for instance in our example the names and values of the properties. To achieve this we mark terminals that are to be copied to the parse tree specially:

name: (+"a"; +"b"; ...etc...; +"9"; +"-")+.

In other words, normally terminals are discarded, but if they are preceded with a + they are copied to the parse-tree.

Extensions

Strictly speaking, this would be enough to allow you to parse a document, and output it as an equivalent XML document. However, there are possible extensions that give you a little more control over the result. The most obvious is allowing the specification of attributes. This is simply done by marking the use of rules with at signs:

css: (rule)*.
rule: selector, block.
selector: -name.
block: "{", (property)*";", "}".
property:  @name, ":", value.

A rule used like this may clearly not contain any structural elements (though it may contain terminals and refinements), since attributes are not structured, but this is an easy condition to check for. The parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color">
            <value>blue</value>
         </property>
         <property name="font-weight">
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

If we changed the rule for property to look like this:

property:  @name, ":", @value.

then the resultant parse-tree would look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

Note that by marking the use of a syntax rule in this way, and not the definition, it allows the syntax rule to be used for structural elements (<name>color</name>) as well as for attributes (name="color").

Parsing Algorithms

Although it would be possible to require the syntax to be restricted to some class of language, such as LL(1) or LR(1) [LL1] in order to make the parser faster, in practice it is easier for the author of the syntax if we make no such restriction, since it would require the author to understand the principles, and it would require the system to check that the syntax adhered to the requirement. In practice, parsing algorithms that can parse any context-free grammar [CFG], such as Earley's [Earley], CYK [CYK], or GLR [GLR], are fast enough, and will treat all context-free languages. The only remaining problem is if the syntax author describes an ambiguous language. To that end we just define that the parser outputs one of the parses, and leave it at that. For instance, if expression were defined as:

expr: i; expr, div, expr.
i: "i".
div: "÷".

then a string such as

i÷i÷i

could be parsed as both

<expr><i/></expr>
<div/>
<expr>
   <expr><i/></expr>
   <div/>
   <expr><i/></expr>
</expr>

and as

<expr>
   <expr><i/></expr>
   <div/>
   <expr><i/></expr>
</expr>
<div/>
<expr><i/></expr>

(i.e as both i÷(i÷i) and (i÷i)÷i ).

Delivery

To deliver a source document to be parsed by our system, we can use a media type [Mediatype] that supplies a reference to the required syntax description. For instance:

application/xml-invisible; syntax=http://example.com/syntax/css

Clearly a system can cache well-known syntax descriptions.

Using Invisible XML to define itself

It should go without saying that the syntax descriptions themselves are in Invisible XML (though in their case the syntax description must be cached to prevent an infinite loop of processing.)

The definition might look like this:

ixml: (rule)+.
rule: @name, -colon, -definition, -stop.
definition: (alternative)*-semicolon.
alternative: (-term)*-comma.
term: -symbol; -repetition.
repetition: one-or-more; zero-or-more.
one-or-more: -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star, separator.
separator: -symbol; -empty.
empty: .
symbol: -terminal; nonterminal; refinement.
terminal: explicit-terminal; implicit-terminal.
explicit-terminal: -plus, @string.
implicit-terminal: @string.
nonterminal: @name.
refinement: -minus, @name.
attribute: -at, @name.

string: -openquote, (-character)*, -closequote.
name: (-letter)+.
letter: +"a"; +"b"; ...
character: ...

colon: -S, ":", -S.
stop: -S, ".", -S.
semicolon: -S, ";", -S.
comma:  -S, ",", -S.
plus:  -S, "+", -S.
minus:  -S, "-", -S.
star:  -S, "*", -S.
open:  -S, "(", -S.
close:  -S, ")", -S.
at:  -S, "@", -S.
openquote: -S, """".
closequote: """", -S.
S: " "*.

This would then parse to the XML form:

<ixml>
   <rule name="ixml">
      <alternative>
         <one-or-more>
             <alternative>
                <nonterminal name="rule"/>
             </alternative><separator/>
         </one-or-more>
      </alternative>
   </rule>
   <rule name="rule">
      <alternative>
         <attribute name="name"/>
         <refinement name="definition"/>
      </alternative
   </rule>
   <rule name="definition">
      <alternative>
         <zero-or-more>
            <alternative>
               <nonterminal name="alternative"/>
            </alternative>
            <separator><refinement name="semicolon"/></separator>
         </zero-or-more>
      </alternative
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative><refinement name="symbol"/></alternative>
      <alternative><refinement name="empty"/></alternative>
   </rule>
   ... etc ...
</ixml>

Thanks to the context-free parsing algorithm, we can remove the <alternative> elements when there is only one alternative in a rule, by redefining definition:

definition: -alternative; alternative, -semicolon, (alternative)+-semicolon.

Note how we have used the "-" character to prevent it being copied in the first case (when there is only one). You wouldn't be able to use such a rule as this if there were a requirement on the syntax to be LL(1) or LR(1), since the two parts of the rule start with the same symbols.

Similarly, we can get rid of empty <separators/> thusly:

one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, separator.
separator: -symbol.

We can move the value of the separator into an attribute with:

separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

This would then generate:

<ixml>
   <rule name="ixml">
      <one-or-more>
         <nonterminal name="rule"/>
      </one-or-more>
   </rule>
   <rule name="rule">
      <attribute name="name"/>
      <refinement name="definition"/>
   </rule>
   <rule name="definition">
      <alternative>
         <refinement name="alternative"/>
      </alternative>
      <alternative>
         <nonterminal name="alternative"/>
         <one-or-more>
            <nonterminal name="alternative"/>
            <separator refinement="semicolon"/>
         </one-or-more>
      </alternative>
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative><refinement name="symbol"/></alternative>
      <alternative><refinement name="empty"/></alternative>
   </rule>
   ... etc ...
</ixml>

(An observant reader will have spotted that we have allowed attributes to be defined by attributes here -- for instance with @refinement -- that is we treat an attribute within an attribute definition as if it were a refinement).

As yet another possibility, we can move the separator into an attribute of the one-or-more or zero-or-more elements:

one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, -separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, -separator.
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

Alternative Representation

Although the syntax description so defined was developed iteratively based on the needs of the user, and is sufficient for its purpose, it is clear in the above example, that refinements occur far more frequently than true semantic rules. An alternative worth exploring would be to say that nothing is copied to the syntax tree unless specifically marked. Let us use the "^" character to mark items that are copied to the tree, both nonterminals as terminals. The result is clearly much more restful to the eye:

ixml: (^rule)+.
rule: @name, colon, definition, stop.
definition: (^alternative)+semicolon.
alternative: (term)*comma.
term: symbol; repetition.
repetition: ^one-or-more; ^zero-or-more.
one-or-more: open, definition, close, plus; open, definition, close, plus, ^separator.
zero-or-more: open, definition, close, star; open, definition, close, star, ^separator.
separator: terminal; @nonterminal; @refinement.
symbol: terminal; ^nonterminal; ^refinement.
terminal: ^explicit-terminal; ^implicit-terminal.
explicit-terminal: up, @string.
implicit-terminal: @string.
nonterminal: up, @name.
refinement: @name.
attribute: at, @name.

string: openquote, (character)*, closequote.
name: (letter)+.
letter: ^"a"; ^"b"; ...
character: ...

colon: S, ":", S.
stop: S, ".", S.
semicolon: S, ";", S.
comma:  S, ",", S.
plus:  S, "+", S.
up:  S, "^", S.
star:  S, "*", S.
open:  S, "(", S.
close:  S, ")", S.
at:  S, "@", S.
openquote: S, """".
closequote: """", S.
S: " "*.

Ambiguities

The grammar above is actually ambiguous in a trivial way, since with any two adjacent terminal symbols with spaces between them, the spaces can be associated with either of the terminals. This ambiguity is trivial in the sense that the resulting parse trees would be identical, and so has no effect on the output. However, the ambiguity would increase parse time. This can be fixed by removing all trailing uses of S in the rules, and adding an S to the rule for name:

name: S, (letter)+.

or preferably, since we are parsing left to right, by removing all leading S's, and add a trailing S for name.

Another example of an ambiguity is in the earlier grammar for Invisible XML which (effectively) had the two rules:

definition: (^alternative)*semicolon.
alternative: (term)*comma.

The reason that this is ambiguous can be seen from a simple rule like:

empty: .

Does this have no alternatives, or one alternative with no terms? Both match the grammar. The two parse trees would be

<rule name="empty"/>

and

<rule name="empty"><alternative/></rule>

To make such a grammar unambiguous, you could write

definition: (^alternative)*semicolon.
alternative: (term)+comma.

but this wouldn't allow a rule such as

separator: symbol; .

This could be seen as an advantage, since you would be forced to be explicit about empty rules, and have to write:

separator: symbol; empty.
empty: .

but if you wanted to be able to write empty alternatives, you have to define:

definition: (^alternative)+semicolon.
alternative: (term)*comma.

which says that there has to be at least one alternative, but it may be empty.

Operators and Identifiers

A remark that should be made is that the design reason for the introduction of the repetition operators "+" and "*" is no longer there. That is, for instance,

ixml: (^rule)+.

could be written as

ixml: ^rule; ixml, ^rule.

and

alternative: (term)*comma.

could be written

alternative: empty; terms.
terms: term; term, comma, terms.

and would give the same output. Nevertheless, the constructs are worth keeping for supporting these often-occurring idioms.

Also in passing it can be noted that although modern programmers are used to the idea that identifiers have to be a single item with no contained spaces, so that there are many programming idioms for representing multi-word identifiers, such as one-or-more, one_or_more, oneOrMore, earlier programming languages such as Fortran and the Algols had no such restriction and allowed spaces in identifiers. Similarly, there is no reason per se to restrict rule names in Invisible XML to be spaceless. There is no ambiguity in rules like:

repetition: ^one or more; ^zero or more.
one or more: open, definition, close, plus; open, definition, close, plus, ^separator.

we just have to be careful to remove the spaces in the generated XML:

name: word+spaces.
spaces: " "+.
word: letter+.
letter: ^"a"; ^"b"; ^"c"; ...

Extras

There are obvious extra odds and ends that need adding, such as sets of characters, to make terminal specification easier.

While you might be tempted to define a set notation, such as:

letter: ^{"a"-"z", "A"-"Z", "-"}.
S: {" ", "\t", "\n", ...}*.

in fact it is only the range notation that adds any power over the use of existing notation. In other words, this is already sufficient:

letter: ^"a"-"z"; ^"A"-"Z"; ^"-".
S: (" "; "\t"; "\n")*.

Restriction on the XML Produced

It should be noted in passing that in the form presented here, Invisible XML only works in one direction: you can turn any textual document into an equivalent XML document. However, it is not in general possible to turn a textual document into a particular XML form without more work. For instance, you could turn Wikipedia markup into an XML document, but not into XHTML in particular, and for instance

{"a": 1, "b": 2}

cannot be transformed into

<j><a>1</a><b>2</b></j>.

Roundtripping

Since in general the input form and the generated XML are isomorphic, returning the generated XML to its original format is just a process of presentation, nothing that a suitable bit of XSLT couldn't do, or even CSS in some simple cases. In fact it should be apparent that from the Invisible XML syntax, it would be straightforward to automatically generate the required piece of XSLT directly.

For instance (rejoicing in the possibility of formatting CSS with CSS), with the CSS example:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

the following simple bit of CSS would return the XML back into regular CSS format:

block:before {content: "{"}
block:after {content:"}"}
property:before {content: attr(name) ":" attr(value) ";"}

and with the earlier CSS-in-XML format:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

the CSS needed to return this to its original format would be:

block:before {content: "{"}
block:after {content: "}"}
name:after {content: ":"}
property:after {content: ";"}

Another option is to regard the grammar of a format as a specification of a presentation language for the parse-tree of that format, and write a suitable program that walks the tree hand-in-hand with the grammar.

Conclusion

There is really no reason why XML can't be more ubiquitous than it is, and similarly there is no reason why XML documents have to be written in an explicit XML format per se. Anything that can be parsed can be perceived as XML, since parsing is very easy, and parse-trees are really just XML documents in different clothing. Invisible XML allows a multitude of document formats to be authored in their traditional form, but be processed as XML, with the concomitant advantages of the XML toolchain.

References

[Wikipedia] Wikipedia Cheatsheet http://en.wikipedia.org/wiki/Help:Cheatsheet

[Markdown] John Gruber, 2004, Markdown http://daringfireball.net/projects/markdown/

[RELAX NG] James Clark, Makoto MURATA (eds.), 2001, RELAX NG Specification, https://www.oasis-open.org/committees/relax-ng/spec.html

[RELAX NG COMPACT] James Clark (ed.). 2002, RELAX NG Compact Syntax, https://www.oasis-open.org/committees/relax-ng/compact-20021121.html

[BNF] Backus-Naur Form, http://en.wikipedia.org/wiki/Backus-Naur_Form

[VWG] S. Pemberton, 1982, Executable Semantic Definition of Programming Languages Using Two-level Grammars, http://www.cwi.nl/~steven/vw.html

[LL1] Alfred Aho and Jeffrey D. Ullman, 1977, Principles of Compiler Design, Addison-Wesley, ISBN 0-201-00022-9.

[CFG] John E. Hopcroft, Jeffrey D. Ullman, 1979, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley. Chapter 4: Context-Free Grammars.

[Earley] Jay Earley, 1970, An efficient context-free parsing algorithm, Communications of the ACM 13 (2): 94-102, doi:10.1145/362007.362035

[CYK] Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman, 1974, The Design and Analysis of Computer Algorithms, Addison-Wesley.

[GLR] Masaru Tomita, 1985, An efficient context-free parsing algorithm for natural languages. IJCAI. International Joint Conference on Artificial Intelligence. pp. 756–764.

[Mediatype] N. Freed et al., 1996, Multipurpose Internet Mail Extensions, (MIME) Part Two: Media Types, http://www.ietf.org/rfc/rfc2046.txt