The manakai project

Web::XML::Parser

An XML parser

SYNOPSIS

  use Web::XML::Parser;
  use Web::DOM::Document;
  $parser = Web::XML::Parser->new;
  $doc = Web::DOM::Document->new;
  
  $parser->parse_char_string ($chars => $doc);
  $parser->parse_byte_string ($encoding, $bytes => $doc);

  ## Or, just use the DOM attribute:
  $doc->inner_html ($chars);

DESCRIPTION

The Web::XML::Parser module is an XML parser, supporting XML 1.0 and XML Namespaces 1.0, written in pure Perl.

This parser is inspired by the HTML Standard and the draft XML5 specification. It is not a draconian parser: a well-formedness error does not abort parsing; instead, the parser tries to recover from the error.

This module provides a low-level API to the parser, which accepts a byte or character sequence as input and constructs a DOM tree as output, optionally reporting errors and warnings detected during parsing. Applications such as browsers, data mining tools, and validators can use this module directly. However, applications are encouraged to use higher-level APIs such as the DOM inner_html method (see Web::DOM::ParentNode in the perl-web-dom package, for example).

METHODS

The Web::XML::Parser module provides the following methods:

$parser = Web::XML::Parser->new

Create a new parser.

$parser->parse_char_string ($chars => $doc)

Parse a character string as XML. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be a DOM Document object. The Document is to be mutated during the parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
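As a minimal sketch of the call above, assuming the perl-web-dom package is installed so that Web::DOM::Document is available, a character string could be parsed and inspected as follows. The input string and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

# Illustrative input; any Perl character string can be given.
my $chars = qq{<greeting xml:lang="en">Hello, \x{4E16}\x{754C}</greeting>};

my $parser = Web::XML::Parser->new;
my $doc = Web::DOM::Document->new;

# The Document is filled in place.
$parser->parse_char_string ($chars => $doc);

my $root = $doc->document_element;
binmode STDOUT, ':encoding(UTF-8)';
print $root->local_name, "\n";
print $root->text_content, "\n";
```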

$parser->parse_byte_string ($encoding, $bytes => $doc)

Parse a byte string as XML. The first argument must be a character encoding label of the byte string, if any, or undef (see "SPECIFYING ENCODING"). The second argument must be a byte string. The third argument must be a DOM Document object. The Document is to be mutated during the parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
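The following sketch shows one way the byte-oriented method might be called. It assumes the perl-web-dom package is installed; the byte string is built with the core Encode module only to imitate data received from the network, and the label 'utf-8' stands in for transport layer metadata.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Web::XML::Parser;
use Web::DOM::Document;

# Build a byte string the way it might arrive over the network.
my $bytes = encode ('UTF-8', "<p>caf\x{E9}</p>");

my $parser = Web::XML::Parser->new;
my $doc = Web::DOM::Document->new;

# 'utf-8' plays the role of the charset parameter of a Content-Type
# header; pass undef instead when no such metadata is available.
$parser->parse_byte_string ('utf-8', $bytes => $doc);

# The DOM holds characters, not bytes.
my $text = $doc->document_element->text_content;
printf "%d characters\n", length $text;
```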

$node_list = $parser->parse_char_string_with_context ($chars, $context, $empty_doc)

Parse a character string as XML in the specified context. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be an Element object used as the context, or undef if there is no context. The third argument must be an empty Document object used in the parsing. Note that the Document's children list is not to be affected by the parsing. The method returns an HTMLCollection object containing the result of the parsing (zero or more Node objects).

This method can be used to implement the inner_html method of an Element.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document and Element objects.
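A rough sketch of fragment parsing, assuming the perl-web-dom package is installed. The element name "list" and the input are illustrative, and the returned collection is assumed to support the usual DOM length and item methods.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

# The context element belongs to its own document.
my $owner = Web::DOM::Document->new;
my $context = $owner->create_element_ns (undef, 'list');

my $parser = Web::XML::Parser->new;
my $empty_doc = Web::DOM::Document->new;

my $nodes = $parser->parse_char_string_with_context
    ('<item>one</item><item>two</item>', $context, $empty_doc);

# Walk the result by index.
for my $i (0 .. $nodes->length - 1) {
  my $node = $nodes->item ($i);
  print $node->node_name, ": ", $node->text_content, "\n";
}
```

An inner_html implementation would then replace the children of the context element with the returned nodes.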

$string = $parser->known_definite_encoding
$parser->known_definite_encoding ($string)

Get or set a known character encoding used to parse the document. See also "SPECIFYING ENCODING".

The value should not be set while the parser is running. If it is changed during parsing, the result is undefined.

$code = $parser->onerror
$parser->onerror ($new_code)

Get or set the error handler for the parser. Any parse error, as well as warning and additional processing information, is reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.

The code is not expected to throw any exception. To abort the parsing from the handler, use the throw method.

The value should not be set while the parser is running. If it is changed during parsing, the result is undefined.
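A sketch of collecting reports rather than printing them, assuming the perl-web-dom package is installed. It also assumes that the handler receives name/value pairs with keys such as "type" and "level", as described in the onerror document linked above; the exact set of keys should be checked there.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

my @errors;
my $parser = Web::XML::Parser->new;

# Assumption: arguments are name/value pairs (see onerror.txt).
$parser->onerror (sub {
  my %error = @_;
  push @errors, \%error;
});

my $doc = Web::DOM::Document->new;

# The <b> element is never closed, which is a well-formedness error.
$parser->parse_char_string ('<a><b></a>' => $doc);

for my $error (@errors) {
  printf "%s (level %s)\n", $error->{type} // '?', $error->{level} // '?';
}

# Parsing continued despite the error, so a tree is still available.
print "root element: ", $doc->document_element->local_name, "\n";
```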

$parser->throw ($code)

Terminate the parser and run the specified code reference. The code reference must throw an exception.

When the error handler specified by the onerror method is intended to abort the parsing, it must invoke this method and return. Otherwise resources used by the parser might not be destroyed due to the unexpected termination.
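The pattern described above might look like the following sketch, which aborts on the first report of any kind. It assumes the perl-web-dom package is installed and that the handler receives name/value pairs including "type"; the message format is illustrative. The parser reference is weakened so that the handler stored in the parser does not keep the parser alive.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);
use Web::XML::Parser;
use Web::DOM::Document;

my $parser = Web::XML::Parser->new;

# Avoid a reference cycle between the parser and its handler.
my $weak_parser = $parser;
weaken $weak_parser;

$parser->onerror (sub {
  my %error = @_;
  my $type = $error{type} // 'unknown';
  # Let the parser clean up, then raise the exception.
  $weak_parser->throw (sub { die "XML error: $type\n" });
  return;
});

my $doc = Web::DOM::Document->new;
my $ok = eval {
  $parser->parse_char_string ('<a><b></a>' => $doc);
  1;
};
my $message = $ok ? '' : $@;
print $ok ? "parsed\n" : "aborted: $message";
```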

$parser->max_entity_depth ($integer)
$integer = $parser->max_entity_depth

Get or set the maximum depth of nested entities to be expanded. The value must be a non-negative integer.

$parser->max_entity_expansions ($integer)
$integer = $parser->max_entity_expansions

Get or set the maximum number of entity references to be expanded. The value must be a non-negative integer. Note that predefined entities and HTML character entities are always expanded and are not counted toward this limit.
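For example, an application that processes untrusted input might tighten both limits before parsing. The values below are illustrative only; suitable limits depend on the application.

```perl
use strict;
use warnings;
use Web::XML::Parser;

my $parser = Web::XML::Parser->new;

# Illustrative limits, set before any parsing starts.
$parser->max_entity_depth (5);
$parser->max_entity_expansions (100);

printf "depth limit: %d, expansion limit: %d\n",
    $parser->max_entity_depth, $parser->max_entity_expansions;
```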

$parser->ignore_doctype_pis ($boolean)
$boolean = $parser->ignore_doctype_pis

Get or set whether processing instructions in the DTD should be exposed to the DOM or not. If true, the DocumentType object, if any, contains no child node even when there are processing instructions in the DTD.
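A small sketch of the effect, assuming the perl-web-dom package is installed and that the DocumentType object exposes its children through the usual child_nodes method. The processing instruction is illustrative.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

my $parser = Web::XML::Parser->new;
$parser->ignore_doctype_pis (1);

my $doc = Web::DOM::Document->new;
$parser->parse_char_string
    ('<!DOCTYPE a [<?note internal-subset?>]><a/>' => $doc);

# With the flag set, the processing instruction is not exposed.
my $doctype = $doc->doctype;
printf "doctype children: %d\n", $doctype->child_nodes->length;
```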

The module also has the following methods for API compatibility with Web::HTML::Parser, but they have no effect: locale_tag, scripting, is_xhr.

SPECIFYING ENCODING

The input to the parse_char_* methods is a string of characters. It is always interpreted as a Perl character string (utf8 or latin1).

The input to the parse_byte* methods is a string of bytes, in which characters are encoded in some Web-compatible character encoding. It is decoded as specified by the Encoding Standard.
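The difference between the two kinds of input can be illustrated with the core Encode module alone. This sketch does not use the parser; it only shows that the same text has different lengths as a character string and as a UTF-8 byte string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "caf\x{E9}";               # character string
my $bytes = encode ('UTF-8', $chars);  # byte string; U+00E9 becomes C3 A9

printf "characters: %d, bytes: %d\n", length $chars, length $bytes;

# Decoding the bytes yields the original character string again.
my $round_trip = decode ('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

The character string is what the parse_char_* methods expect; the byte string is what the parse_byte* methods expect.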

The parse_byte* methods accept a character encoding label as one of their arguments. It is interpreted as the transport layer character encoding metadata. In HTTP, it is the value of the charset parameter in the Content-Type header. If it is unknown, the argument must be set to undef.

The known_definite_encoding method can be used to set a known definite encoding. If its value is not undef, it is used to decode the document. This takes precedence over the transport layer character encoding metadata and is always respected.
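The precedence described above can be sketched as follows, assuming the perl-web-dom package is installed. The transport layer label is deliberately wrong here to show that the known definite encoding is the one used.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Web::XML::Parser;
use Web::DOM::Document;

my $bytes = encode ('UTF-8', "<p>caf\x{E9}</p>");

my $parser = Web::XML::Parser->new;
$parser->known_definite_encoding ('utf-8');

my $doc = Web::DOM::Document->new;

# The transport label claims windows-1252, but the known definite
# encoding takes precedence, so the bytes are decoded as UTF-8.
$parser->parse_byte_string ('windows-1252', $bytes => $doc);

my $text = $doc->document_element->text_content;
print $text eq "caf\x{E9}" ? "decoded as UTF-8\n" : "decoded otherwise\n";
```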

The character encoding, if specified, must be represented by one of its labels, defined by the Encoding Standard. Unknown labels are ignored. Examples of labels include (but are not limited to): utf-8, windows-1252, shift_jis, euc-jp, iso-2022-jp, and gb18030. Encoding labels are ASCII case-insensitive.

SEE ALSO

Web::DOM::Document, Web::DOM::Element in the perl-web-dom package.

Web::XML::Serializer.

Web::HTML::Validator, Web::XML::DTDValidator.

Web::HTML::Parser.

See <http://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai++Predefined%20User%20Data%20Names> for details of source location annotations using DOM3 user data.

SPECIFICATIONS

XML

XML 1.0 <https://www.w3.org/TR/xml/>.

XMLNS

Namespaces in XML 1.0 <https://www.w3.org/TR/xml-names/>.

INFOSET

XML Information Set <https://www.w3.org/TR/xml-infoset/>.

DOM Level 3 Core - Infoset Mapping <https://www.w3.org/TR/DOM-Level-3-Core/infoset-mapping.html>.

XML5

XML5 Standard <https://ygg01.github.io/xml5_draft/>.

HTML

HTML Standard <https://html.spec.whatwg.org/>.

The XML fragment parsing algorithm must return the children of the template content of the root element of the resulting Document, in tree order, if the /context/ element is an HTML template element.

DOMDTDEF

XML processing and DOM Document Type Definitions <https://suika.suikawiki.org/www/markup/xml/domdtdef/domdtdef-work>.

MANAKAI

manakai DOM Extensions <https://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai%20DOM%20Extensions>.

Note that there is no single specification that completely defines XML parsing.

XML 1.1 is no longer supported.

See also Web::XML::DTDValidator.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2017 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This library is derived from a JSON file, which contains data extracted from the HTML Standard. "Written by Ian Hickson (Google, ian@hixie.ch) - Parts © Copyright 2004-2014 Apple Inc., Mozilla Foundation, and Opera Software ASA; You are granted a license to use, reproduce and create derivative works of this document."