The manakai project

Web::HTML::Dumper

Dump a DOM tree in the parser test format

SYNOPSIS

  use Web::HTML::Dumper qw(dumptree);
  
  warn dumptree $doc;

DESCRIPTION

The Web::HTML::Dumper module exports a function, dumptree, which serializes the given document into the format used in HTML parser tests.

FUNCTION

The module exports a function:

$dumped = dumptree $doc

Dump the DOM tree. The argument must be a DOM document object (i.e. an instance of the Web::DOM::Document class). The function returns the dump of the document and its subtree.

DUMP FORMAT

The function serializes the DOM tree into the format used in HTML parser tests, as described in <http://wiki.whatwg.org/wiki/Parser_tests#Tree_Construction_Tests> and <https://github.com/html5lib/html5lib-tests/tree/master/tree-construction>, with the following exceptions:

Only the "#document" part of the tree construction test is returned.
No "| " prefix is prepended to lines.
XML-only node types are also supported.

Element type definition, entity, and notation nodes attached to a document type node are serialized as if they were children of the document type node. They are inserted before any children of the document type node, sorted by node type in the aforementioned order, then by code point order of their node names.
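The ordering rule above can be sketched as follows. This is a minimal illustration only, not the module's implementation: the hashes and the type, name, and %rank names are hypothetical stand-ins for DOM nodes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for nodes attached to a document type node.
my @attached = (
  { type => 'notation', name => 'png' },
  { type => 'entity',   name => 'b' },
  { type => 'element',  name => 'p' },
  { type => 'entity',   name => 'a' },
  { type => 'element',  name => 'P' },
);

# Rank of each node type, in the order the text lists them.
my %rank = (element => 0, entity => 1, notation => 2);

# Sort by node type first, then by code point order of the node names.
# Without "use locale", cmp compares strings code point by code point.
sub sort_attached {
  return sort {
    $rank{$a->{type}} <=> $rank{$b->{type}}
        or $a->{name} cmp $b->{name}
  } @_;
}

my @sorted = map { "$_->{type}:$_->{name}" } sort_attached(@attached);
print join(' ', @sorted), "\n";
# prints: element:P element:p entity:a entity:b notation:png
```

Note that "P" (U+0050) sorts before "p" (U+0070) in code point order.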

Element type definition nodes are represented as <!ELEMENT, followed by a U+0020 SPACE character, followed by the node name, followed by a U+0020 SPACE character, followed by the contentModelText of the node, followed by >.

Entity nodes are represented as <!ENTITY, followed by a U+0020 SPACE character, followed by the node name, followed by a U+0020 SPACE character, followed by the list of the textContent, publicId, and systemId of the node (the empty string is used when a value is undef), where each item is enclosed in " characters and the items are separated by a U+0020 SPACE character, followed by a U+0020 SPACE character, followed by the notationName of the node, if it is not undef, followed by >.

Notation nodes are represented as <!NOTATION, followed by a U+0020 SPACE character, followed by the node name, followed by a U+0020 SPACE character, followed by the list of the publicId and systemId of the node (the empty string is used when a value is undef), where each item is enclosed in " characters and the items are separated by a U+0020 SPACE character, followed by >.
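The three representations above can be sketched with plain hashes standing in for DOM nodes. The hash keys and function names are hypothetical and only mirror the attribute names used in the text. The sketch also assumes that the space before the notation name is always emitted and that only the name itself is optional, which is one reading of the entity rule.

```perl
use strict;
use warnings;

# Wrap each value in " characters (undef becomes the empty string) and
# join the items with a single U+0020 SPACE.
sub quoted_list {
  return join ' ', map { my $v = defined $_ ? $_ : ''; qq{"$v"} } @_;
}

sub element_decl {
  my ($n) = @_;
  return "<!ELEMENT $n->{name} $n->{content_model_text}>";
}

sub entity_decl {
  my ($n) = @_;
  my $r = "<!ENTITY $n->{name} "
      . quoted_list(@$n{qw(text_content public_id system_id)}) . ' ';
  # Assumption: the trailing space is unconditional; only the notation
  # name depends on whether it is defined.
  $r .= $n->{notation_name} if defined $n->{notation_name};
  return $r . '>';
}

sub notation_decl {
  my ($n) = @_;
  return "<!NOTATION $n->{name} "
      . quoted_list(@$n{qw(public_id system_id)}) . '>';
}

print element_decl({ name => 'p', content_model_text => '(#PCDATA)' }), "\n";
print entity_decl({ name => 'logo', system_id => 'logo.png',
                    notation_name => 'png' }), "\n";
print notation_decl({ name => 'png', public_id => 'image/png' }), "\n";
```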

Attribute definition nodes attached to an element type definition node are serialized as if they were children of the element type definition node, sorted by code point order of their node names.

Attribute definition nodes are represented as the node name, followed by a U+0020 SPACE character, followed by the keyword represented by the declaredType of the node (or ENUMERATION if it represents the enumerated type), followed by a U+0020 SPACE character, followed by (, followed by the list of allowedTokens of the node separated by |, followed by ), followed by a U+0020 SPACE character, followed by the keyword represented by the defaultType of the node (or EXPLICIT if it represents the explicit default value), followed by a U+0020 SPACE character, followed by ", followed by the textContent of the node, followed by ".
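As a rough illustration of this line format, the sketch below builds the representation from a plain hash. The keys and the attr_def_line function are hypothetical. The type keywords are passed in as ready-made strings, the parentheses are assumed to be emitted even when there are no allowed tokens, and a missing text content is treated as the empty string. None of these choices is confirmed by the text.

```perl
use strict;
use warnings;

# Build the one-line representation of an attribute definition.
# $n is a hypothetical hash standing in for the DOM node.
sub attr_def_line {
  my ($n) = @_;
  my $tokens = join '|', @{ $n->{allowed_tokens} || [] };
  my $text = defined $n->{text_content} ? $n->{text_content} : '';
  return join ' ',
      $n->{name},
      $n->{declared_type},   # keyword such as ID, or ENUMERATION
      "($tokens)",
      $n->{default_type},    # keyword such as IMPLIED, or EXPLICIT
      qq{"$text"};
}

print attr_def_line({
  name => 'align', declared_type => 'ENUMERATION',
  allowed_tokens => ['left', 'right'],
  default_type => 'EXPLICIT', text_content => 'left',
}), "\n";
print attr_def_line({
  name => 'id', declared_type => 'ID', default_type => 'IMPLIED',
}), "\n";
```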

Namespace designators are extended.

The namespace designator for the HTML namespace (http://www.w3.org/1999/xhtml) is html. While elements in the HTML namespace are serialized without the namespace designator, as in the original format, attributes in the HTML namespace are serialized with this namespace designator.

An application can define a custom namespace designator by adding a key-value pair to the %$Web::HTML::Dumper::NamespaceMapping hash:

  $Web::HTML::Dumper::NamespaceMapping->{$url} = $prefix;

For example, if the application does:

  $Web::HTML::Dumper::NamespaceMapping
      ->{q<urn:x-suika-fam-cx:markup:suikawiki:0:9:>}
      = 'sw';

... then a document element in the SuikaWiki/0.9 namespace is serialized as sw document.

When no namespace designator is explicitly defined for a namespace, the namespace designator for the namespace is { followed by the namespace URL followed by }. If an element has no namespace, the namespace designator for the element is {}.
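The lookup and fallback rules can be sketched as below. The $mapping hash is a local stand-in for $Web::HTML::Dumper::NamespaceMapping, and namespace_designator is a hypothetical helper, not a function of the module. The special case of HTML elements, which are written without any designator, is outside this sketch.

```perl
use strict;
use warnings;

# Local stand-in for the module's namespace mapping hash.
my $mapping = {
  'http://www.w3.org/1999/xhtml' => 'html',
  'urn:x-suika-fam-cx:markup:suikawiki:0:9:' => 'sw',
};

# Return the designator for a namespace URL (undef means no namespace).
sub namespace_designator {
  my ($ns) = @_;
  return '{}' unless defined $ns;                     # no namespace
  return $mapping->{$ns} if defined $mapping->{$ns};  # explicit designator
  return '{' . $ns . '}';                             # fallback form
}

print namespace_designator('urn:x-suika-fam-cx:markup:suikawiki:0:9:'), "\n";
print namespace_designator('http://example.com/ns'), "\n";
print namespace_designator(undef), "\n";
```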

SEE ALSO

Parser tests - WHATWG Wiki <http://wiki.whatwg.org/wiki/Parser_tests>.

html5lib-tests <https://github.com/html5lib/html5lib-tests/tree/master/tree-construction>.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2013 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.