Serialized DOM Format (SDF)

Work in Progress — Last Update 11 September 2007

Editor
Simon Pieters, Opera Software, simonp@opera.com

Abstract

Serialized DOM Format (SDF) is a format that can represent arbitrary DOM trees, including illegal DOM trees, in text form. It is primarily intended for test suites, but might be useful for other applications as well.

The SDF Syntax

A node is represented by an identifier character followed by one or more strings which carry information about the node. The first string must always be specified, but further strings may be omitted. Unless otherwise stated, omitted strings default to the empty string.

The identifier characters and their meanings are as follows:

e
An Element node. It has three strings representing the localName, prefix, and namespaceURI, respectively. The namespaceURI defaults to "http://www.w3.org/1999/xhtml".
a
An Attr node. It has four strings representing the localName, value, prefix, and namespaceURI, respectively.
t
A Text node. It has one string representing the data.
c
A Comment node. It has one string representing the data.
s
A CDATASection node. It has one string representing the data.
p
A ProcessingInstruction node. It has two strings representing the target and data, respectively.
d
A DocumentType node. It has three strings representing the name, publicId, and systemId, respectively.
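
The strings and defaults listed above can be captured in a small lookup table. The following Python sketch is illustrative only: the names FIELDS, XHTML_NS and fill_defaults are hypothetical and not defined by SDF.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Identifier character -> names of its strings, in the order they are written.
FIELDS = {
    "e": ("localName", "prefix", "namespaceURI"),
    "a": ("localName", "value", "prefix", "namespaceURI"),
    "t": ("data",),
    "c": ("data",),
    "s": ("data",),
    "p": ("target", "data"),
    "d": ("name", "publicId", "systemId"),
}


def fill_defaults(identifier, strings):
    """Return a dict of field name -> value with the SDF defaults applied."""
    names = FIELDS[identifier]
    if not 1 <= len(strings) <= len(names):
        raise ValueError("wrong number of strings for %r" % identifier)
    # Omitted strings default to the empty string...
    padded = list(strings) + [""] * (len(names) - len(strings))
    node = dict(zip(names, padded))
    # ...except an omitted Element namespaceURI, which defaults to XHTML.
    if identifier == "e" and len(strings) < 3:
        node["namespaceURI"] = XHTML_NS
    return node
```

With this sketch, fill_defaults("e", ["p"]) yields an element in the XHTML namespace, while fill_defaults("e", ["foo", "", ""]) yields an element in no namespace because the empty namespaceURI is given explicitly.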

To express that a node is a child node of another node, or that an Attr node is part of another node, its line is indented by 2 more spaces than the line of that other node. An Attr node must not be a top-level node. The nodes must be indented appropriately so that they form a tree. There may be zero or more top-level nodes.
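
As an illustration of the indentation rule, the following Python sketch nests lines into a tree using a stack. It assumes the input has already been split on U+000A LINE FEED and treats everything after the indentation as opaque content; build_tree and its (content, children) tuples are hypothetical names chosen for this example.

```python
def build_tree(lines):
    """Nest SDF lines by indentation; each node is a (content, children) pair."""
    roots = []
    stack = []  # stack[d] is the children list of the latest node at depth d
    for line in lines:
        content = line.lstrip(" ")
        indent = len(line) - len(content)
        if indent % 2:
            raise ValueError("indentation must be a multiple of 2 spaces")
        depth = indent // 2
        if depth > len(stack):
            raise ValueError("line is indented deeper than any possible parent")
        if depth == 0 and content.startswith("a "):
            raise ValueError("an Attr node must not be a top-level node")
        children = []
        siblings = roots if depth == 0 else stack[depth - 1]
        siblings.append((content, children))
        # Forget deeper nodes; this node is now the candidate parent at its depth.
        del stack[depth:]
        stack.append(children)
    return roots
```

Note that the sketch deliberately does not reject trees that are illegal per the DOM specification, such as a Text node with children, since SDF is meant to be able to express them.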

Should it be required that attributes come before the actual child nodes? Should attributes be required to be sorted?

A node is written as follows:

  1. Zero or more pairs of U+0020 SPACE characters.
  2. An identifier character.
  3. A U+0020 SPACE character.
  4. One or more strings, as defined for each node type, separated by a U+0020 SPACE character.
  5. A U+000A LINE FEED character.

A string is a JSON string. [JSON]
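
The steps above can be sketched as a line parser. This is a minimal illustration, not a reference implementation: parse_line is a hypothetical name, Python's standard json module is used to decode each string, and the number of strings is not checked against the node type.

```python
import json

_decoder = json.JSONDecoder()


def parse_line(line):
    """Split one SDF line into (depth, identifier, strings)."""
    if not line.endswith("\n"):
        raise ValueError("line must end with U+000A LINE FEED")
    body = line[:-1]
    rest = body.lstrip(" ")
    indent = len(body) - len(rest)
    if indent % 2:
        raise ValueError("indentation must consist of pairs of spaces")
    if len(rest) < 2 or rest[0] not in "eatcspd" or rest[1] != " ":
        raise ValueError("expected an identifier character and a space")
    strings, pos = [], 2
    while True:
        if rest[pos:pos + 1] != '"':
            raise ValueError("expected a JSON string")
        # raw_decode returns the decoded value and the index just past it.
        value, pos = _decoder.raw_decode(rest, pos)
        strings.append(value)
        if pos == len(rest):
            return indent // 2, rest[0], strings
        if rest[pos] != " ":
            raise ValueError("strings must be separated by a single space")
        pos += 1
```

For instance, parse_line('  s "bar"\n') returns (1, "s", ["bar"]).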

Examples

In the following example, a DOM tree is represented in XML and SDF, respectively:

<foo><![CDATA[bar]]>baz<!--quux--></foo>
e "foo" "" ""
  s "bar"
  t "baz"
  c "quux"

Since HTML doesn't support CDATA sections, the above DOM can't be represented in HTML.

In the following example, the DOM tree cannot be represented in XML, but can be in HTML and SDF:

<!-- -- -->
c " -- "

In the following example, the DOM tree cannot be represented in either XML or HTML, but can be with SDF:

c " --> "

In the following example, the DOM tree is not legal per the DOM specification, but can still be represented with SDF:

t "foo"
  e "bar"

Trying to build a DOM like this with the standard DOM methods will raise a HIERARCHY_REQUEST_ERR exception.
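
As one concrete illustration, Python's xml.dom.minidom (chosen here only as a convenient DOM implementation; the variable names are arbitrary) refuses to append an Element to a Text node:

```python
import xml.dom
from xml.dom import minidom

doc = minidom.Document()
text = doc.createTextNode("foo")
element = doc.createElement("bar")

try:
    text.appendChild(element)  # a Text node may not have children
    raised = False
except xml.dom.HierarchyRequestErr as exc:
    # The exception carries the DOM error code HIERARCHY_REQUEST_ERR.
    raised = exc.code == xml.dom.HIERARCHY_REQUEST_ERR
```

A test harness that consumes SDF therefore cannot rely solely on such methods to construct illegal trees.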

In the following example, the U+000C FORM FEED and U+1047E SHAVIAN LETTER IAN characters are escaped. Note that the latter is represented as a UTF-16 surrogate pair.

t "form feed: \u000C, ian: \uD801\uDC7E"
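
The escapes can be checked with any JSON decoder. The sketch below uses Python's standard json module; surrogate_pair is a hypothetical helper that shows the arithmetic behind the pair.

```python
import json


def surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates for a code point above U+FFFF."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The string from the example above, exactly as written in SDF.
sdf_string = r'"form feed: \u000C, ian: \uD801\uDC7E"'
decoded = json.loads(sdf_string)

# The surrogate pair decodes to the single code point U+1047E.
last_code_point = ord(decoded[-1])
high, low = surrogate_pair(0x1047E)
```

Here last_code_point is 0x1047E, and (high, low) is (0xD801, 0xDC7E), matching the escapes in the example.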

References

...

Acknowledgements

Thanks to Henri Sivonen, Lachlan Hunt and Philip Taylor for their contributions.