Web::XML::Parser
An XML parser
SYNOPSIS
use Web::XML::Parser;
use Web::DOM::Document;
$parser = Web::XML::Parser->new;
$doc = Web::DOM::Document->new;
$parser->parse_char_string ($chars => $doc);
$parser->parse_byte_string ($encoding, $bytes => $doc);
## Or, just use DOM attribute:
$doc->inner_html ($chars);
DESCRIPTION
The Web::XML::Parser module is an XML parser, supporting XML 1.0 and XML Namespaces 1.0, written in pure Perl.
This parser is inspired by the HTML Standard and the draft XML5 specification. It is not a Draconian parser: it does not abort parsing because of a well-formedness error but tries to recover from the error.
This module provides a low-level API to the parser, which accepts a byte or character sequence as input and constructs a DOM tree as output, optionally reporting errors and warnings detected during parsing. Applications such as browsers, data mining tools, and validators can use this module directly. However, using a higher-level API such as the DOM inner_html method is encouraged (see, for example, Web::DOM::ParentNode in the perl-web-dom package).
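The error recovery described above can be illustrated with a minimal sketch. It assumes that the error handler is invoked with name/value pairs (as described in the onerror document linked from the METHODS section) and that Web::DOM::Document provides the document_element and local_name methods; the input string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

my $parser = Web::XML::Parser->new;
my $doc = Web::DOM::Document->new;

# Collect reports instead of printing them immediately.
# Assumption: the handler receives name/value pairs.
my @reports;
$parser->onerror (sub {
  my %report = @_;
  push @reports, \%report;
});

# The end tag of <item> is missing, so this input is not well-formed.
# The parser is expected to report the problem and keep going.
$parser->parse_char_string ('<list><item>one</list>' => $doc);

printf "%d report(s)\n", scalar @reports;
my $root = $doc->document_element;
print "root element: ", $root->local_name, "\n" if defined $root;
```

Collecting the reports in an array keeps the handler free of side effects that could throw, which matches the expectation that the handler does not raise exceptions.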
METHODS
The Web::XML::Parser module provides the following methods:
$parser = Web::XML::Parser->new
-
Create a new parser.
$parser->parse_char_string ($chars => $doc)
-
Parse a character string as XML. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be a DOM Document object. The Document is to be mutated during the parsing.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
$parser->parse_byte_string ($encoding, $bytes => $doc)
-
Parse a byte string as XML. The first argument must be a character encoding label of the byte string, if any, or undef (see "SPECIFYING ENCODING"). The second argument must be a byte string. The third argument must be a DOM Document object. The Document is to be mutated during the parsing.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
$node_list = $parser->parse_char_string_with_context ($chars, $context, $empty_doc)
-
Parse a character string as XML in the specified context. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be an Element object used as the context, or undef if there is no context. The third argument must be an empty Document object used in the parsing. Note that the Document's children list is not to be affected by the parsing. The method returns an HTMLCollection object containing the result of the parsing (zero or more Node objects).
This method can be used to implement the inner_html method of an Element.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document and Element objects.
$string = $parser->known_definite_encoding
$parser->known_definite_encoding ($string)
-
Get or set a known character encoding used to parse the document. See also "SPECIFYING ENCODING".
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$code = $parser->onerror
$parser->onerror ($new_code)
-
Get or set the error handler for the parser. Any parse error, as well as warning and additional processing information, is reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.
The code is not expected to throw any exception. See also throw.
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$parser->throw ($code)
-
Terminate the parser and run the specified code reference. The code reference must throw an exception.
When the error handler specified by the onerror method is intended to abort the parsing, it must invoke this method and return. Otherwise resources used by the parser might not be destroyed due to the unexpected termination.
$parser->max_entity_depth ($integer)
$integer = $parser->max_entity_depth
-
Get or set the maximum depth of nested entities to be expanded. The value must be a non-negative integer.
$parser->max_entity_expansions ($integer)
$integer = $parser->max_entity_expansions
-
Get or set the maximum number of entity references to be expanded. The value must be a non-negative integer. Note that predefined entities and HTML character entities are always expanded and not taken into account for the number of entity expansions.
$parser->ignore_doctype_pis ($boolean)
$boolean = $parser->ignore_doctype_pis
-
Get or set whether processing instructions in the DTD should be exposed to the DOM or not. If true, the DocumentType object, if any, contains no child node even when there are processing instructions in the DTD.
The module also has the following methods for API compatibility with Web::HTML::Parser, but they have no effect: locale_tag, scripting, is_xhr.
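The configuration and abort-handling methods above can be combined roughly as follows. This is a sketch under assumptions, not a reference implementation: it assumes the error handler receives name/value pairs including type and level, and that the level value "m" marks a parse error, as described in the onerror document linked above. The limit values and the input string are arbitrary.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);
use Web::XML::Parser;
use Web::DOM::Document;

my $parser = Web::XML::Parser->new;
my $doc = Web::DOM::Document->new;

# Limit entity expansion; the numbers are arbitrary illustrative values.
$parser->max_entity_depth (5);
$parser->max_entity_expansions (100);

# Do not expose processing instructions in the DTD to the DOM.
$parser->ignore_doctype_pis (1);

# The handler refers to the parser, so use a weak copy to avoid a
# reference cycle between the parser and its own handler.
my $weak_parser = $parser;
weaken $weak_parser;

# Abort on the first report whose level is "m" (assumed level name).
$parser->onerror (sub {
  my %report = @_;
  return unless ($report{level} // '') eq 'm';
  my $type = $report{type} // 'unknown';
  # Let the parser terminate itself, then return from the handler.
  $weak_parser->throw (sub { die "XML parse error: $type\n" });
  return;
});

# The exception raised by the code given to throw surfaces here.
my $ok = eval {
  $parser->parse_char_string ('<a><b></a>' => $doc);
  1;
};
print $ok ? "parsed without aborting\n" : "aborted: $@";
```

Calling throw from the handler, rather than calling die directly, lets the parser release its resources before the exception propagates to the caller.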
SPECIFYING ENCODING
The input to the parse_char_* methods is a string of characters. It is always interpreted as a Perl character string (utf8 or latin1).
The input to the parse_byte* methods is a string of bytes, where characters are encoded in some Web-compatible character encoding. It is decoded as specified by the Encoding Standard.
The parse_byte* methods accept a character encoding label as one of their arguments. It is interpreted as the transport layer character encoding metadata. In HTTP, it is the value of the charset parameter in the Content-Type header. If it is unknown, the argument must be set to undef.
The known_definite_encoding method can be used to set a known definite encoding. If its value is not undef, it is used to decode the document. This takes precedence over the transport layer character encoding metadata and is always respected.
The character encoding, if specified, must be represented by one of its labels, defined by the Encoding Standard. Unknown labels are ignored. Examples of labels include (but are not limited to): utf-8, windows-1252, shift_jis, euc-jp, iso-2022-jp, and gb18030. Encoding labels are ASCII case-insensitive.
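The three ways of supplying encoding information described in this section can be sketched as follows. The byte string and variable names are illustrative, and the sketch assumes that Web::DOM::Document provides the document_element and text_content methods.

```perl
use strict;
use warnings;
use Web::XML::Parser;
use Web::DOM::Document;

# "caf" followed by U+00E9 encoded as two UTF-8 bytes.
my $bytes = "<p>caf\xC3\xA9</p>";

# 1. Transport layer label, e.g. taken from the charset parameter of
#    an HTTP Content-Type header. Labels are ASCII case-insensitive.
my $doc1 = Web::DOM::Document->new;
Web::XML::Parser->new->parse_byte_string ('UTF-8', $bytes => $doc1);

# 2. No transport layer metadata: pass undef as the label.
my $doc2 = Web::DOM::Document->new;
Web::XML::Parser->new->parse_byte_string (undef, $bytes => $doc2);

# 3. A known definite encoding takes precedence over the label
#    given to parse_byte_string.
my $doc3 = Web::DOM::Document->new;
my $parser = Web::XML::Parser->new;
$parser->known_definite_encoding ('utf-8');
$parser->parse_byte_string ('windows-1252', $bytes => $doc3);

for my $doc ($doc1, $doc2, $doc3) {
  my $text = $doc->document_element->text_content;
  printf "%d character(s) in text content\n", length $text;
}
```

A separate parser object is used for the third case so that the known definite encoding does not leak into unrelated parses.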
SEE ALSO
Web::DOM::Document, Web::DOM::Element in the perl-web-dom package.
Web::HTML::Validator, Web::XML::DTDValidator.
See <http://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai++Predefined%20User%20Data%20Names> for details of source location annotations using DOM3 user data.
SPECIFICATIONS
- XML
-
XML 1.0 <https://www.w3.org/TR/xml/>.
- XMLNS
-
Namespaces in XML 1.0 <https://www.w3.org/TR/xml-names/>.
- INFOSET
-
XML Information Set <https://www.w3.org/TR/xml-infoset/>.
DOM Level 3 Core - Infoset Mapping <https://www.w3.org/TR/DOM-Level-3-Core/infoset-mapping.html>.
- XML5
-
XML5 Standard <https://ygg01.github.io/xml5_draft/>.
- HTML
-
HTML Standard <https://html.spec.whatwg.org/>.
The XML fragment parsing algorithm must return the children of the template content of the root element of the resulting Document, in tree order, if the /context/ element is an HTML template element.
- DOMDTDEF
-
XML processing and DOM Document Type Definitions <https://suika.suikawiki.org/www/markup/xml/domdtdef/domdtdef-work>.
- MANAKAI
-
manakai DOM Extensions <https://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai%20DOM%20Extensions>.
Note that there is no single specification that completely defines XML parsing.
XML 1.1 is no longer supported.
See also Web::XML::DTDValidator.
AUTHOR
Wakaba <wakaba@suikawiki.org>.
LICENSE
Copyright 2007-2017 Wakaba <wakaba@suikawiki.org>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This library is derived from a JSON file, which contains data extracted from HTML Standard. "Written by Ian Hickson (Google, ian@hixie.ch) - Parts © Copyright 2004-2014 Apple Inc., Mozilla Foundation, and Opera Software ASA; You are granted a license to use, reproduce and create derivative works of this document."