The manakai project

Web::HTML::Parser

An HTML parser

SYNOPSIS

  use Web::HTML::Parser;
  use Web::DOM::Document;
  my $parser = Web::HTML::Parser->new;
  my $doc = Web::DOM::Document->new;
  
  $parser->parse_char_string ($chars => $doc);
  $parser->parse_byte_string ($encoding, $bytes => $doc);

  ## Or, just use DOM attribute:
  $doc->manakai_is_html (1);
  $doc->inner_html ($chars);

DESCRIPTION

The Web::HTML::Parser module is an HTML parser, as specified by the HTML Standard (i.e. an "HTML5" parser), written in pure Perl.

This module provides a low-level API to the parser, which accepts a byte or character sequence as input and constructs a DOM tree as output, optionally reporting errors and warnings detected during parsing. Applications such as browsers, data mining tools, and validators can use this module directly. However, applications are encouraged to use higher-level APIs such as the DOM inner_html method (see Web::DOM::ParentNode in the perl-web-dom package, for example).

METHODS

The Web::HTML::Parser module has the following methods:

$parser = Web::HTML::Parser->new

Create a new parser.

$parser->parse_char_string ($chars => $doc)

Parse a character string as HTML. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be a DOM Document object. The Document is mutated during parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.

$parser->parse_byte_string ($encoding, $bytes => $doc)

Parse a byte string as HTML. The first argument must be a character encoding label of the byte string, if any, or undef (see "SPECIFYING ENCODING"). The second argument must be a byte string. The third argument must be a DOM Document object. The Document is mutated during parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.

$node_list = $parser->parse_char_string_with_context ($chars, $context, $empty_doc)

Parse a character string as HTML in the specified context. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be an Element object used as the context, or undef if there is no context. The third argument must be an empty Document object used in the parsing. Note that the Document's list of children is not affected by the parsing. The method returns an HTMLCollection object containing the result of the parsing (zero or more Node objects).

This method can be used to implement the inner_html method of an Element.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document and Element objects.

$parser->locale_tag ($string)
$string = $parser->locale_tag

Get or set the BCP 47 language tag for the locale used to parse the document, e.g. en, ja, zh-tw, and zh-cn. It is used to determine the default character encoding (which is only used when character encoding cannot be determined by other means).

If undef is specified (or the locale_tag method is not explicitly invoked at all), the default is "none", which makes windows-1252 the default character encoding.

Except for zh-tw and zh-cn, only the primary language tag (i.e. a language code with no - and subtags) should be specified. Tags are compared ASCII case-insensitively.

The value should not be set while the parser is running. If the value is changed, the result is undefined.
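The tag rules above can be applied before the value is handed to locale_tag. The following is only a minimal sketch: normalize_locale_tag is a hypothetical helper, not part of Web::HTML::Parser, and it assumes that the application is happy to discard every subtag other than the two regional forms named above.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of Web::HTML::Parser): reduce a BCP 47
## language tag to the form recommended for |locale_tag|.
sub normalize_locale_tag {
  my $tag = shift;
  return undef unless defined $tag and length $tag;
  $tag =~ tr/A-Z/a-z/;           # ASCII-only lowercasing
  return $tag if $tag eq 'zh-tw' or $tag eq 'zh-cn';
  $tag =~ s/-.*\z//s;            # keep the primary language subtag only
  return $tag;
}

## Possible use, before parsing starts:
##   $parser->locale_tag (normalize_locale_tag ($user_language));
```

Lowercasing is not strictly required, since tags are compared ASCII case-insensitively, but it makes the special cases easy to test for.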

$string = $parser->known_definite_encoding
$parser->known_definite_encoding ($string)

Get or set a known character encoding used to parse the document. See also "SPECIFYING ENCODING".

The value should not be set while the parser is running. If the value is changed, the result is undefined.

$boolean = $parser->is_xhr
$parser->is_xhr ($boolean)

Get or set whether the document is parsed to create XHR's responseXML document or not. See also "SPECIFYING ENCODING".

The value should not be set while the parser is running. If the value is changed, the result is undefined.

$boolean = $parser->scripting
$parser->scripting ($boolean)

Get or set whether the scripting flag of the parser is "enabled" or not. By default the value is "disabled" (false). If the value is "enabled", the noscript element's content is not parsed (this is how browsers parse the document by default). Otherwise the content is parsed as normal.

The value should not be set while the parser is running. If the value is changed, the result is undefined.

$code = $parser->onerror
$parser->onerror ($new_code)

Get or set the error handler for the parser. Any parse error, as well as warning and additional processing information, is reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.

The code is not expected to throw any exception. See also throw.

The value should not be set while the parser is running. If the value is changed, the result is undefined.

$parser->throw ($code)

Terminate the parser and run the specified code reference. The code reference must throw an exception.

When the error handler specified by the onerror method is intended to abort the parsing, it must invoke this method and return. Otherwise resources used by the parser might not be destroyed due to the unexpected termination.
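As an illustration of how onerror and throw fit together, the sketch below collects reported errors and aborts once a limit is exceeded. It rests on assumptions: make_error_collector is a hypothetical helper, and the handler is assumed to receive name/value pairs (such as type, line, and column) as described in the onerror document linked above. The wiring at the end runs only if the modules are installed.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

## Hypothetical helper: returns an error handler and the array it fills.
## Assumption: the handler is invoked with name/value pairs.
sub make_error_collector {
  my ($max, $abort) = @_;
  my @errors;
  my $handler = sub {
    my %args = @_;
    push @errors, {%args};
    ## Never |die| directly here; delegate aborting to the supplied code.
    $abort->() if @errors > $max;
    return;
  };
  return ($handler, \@errors);
}

## Wiring, skipped when the modules are not available.
if (eval { require Web::HTML::Parser; require Web::DOM::Document; 1 }) {
  my $parser = Web::HTML::Parser->new;
  my $weak_parser = $parser;
  weaken $weak_parser;  # avoid a reference cycle parser -> handler -> parser
  my ($handler, $errors) = make_error_collector (100, sub {
    ## |throw| lets the parser terminate before the exception is raised.
    $weak_parser->throw (sub { die "Too many parse errors\n" });
  });
  $parser->onerror ($handler);

  my $doc = Web::DOM::Document->new;
  eval { $parser->parse_char_string ("<p>Hello</p>" => $doc); 1 }
      or warn "Parsing aborted: $@";
  printf "%d error(s) reported\n", scalar @$errors;
}
```

Passing the abort action in as a code reference keeps the collector independent of the parser, so it can be exercised without parsing anything.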

The module also has the following methods for API compatibility with Web::XML::Parser, but they have no effect: max_entity_depth, max_entity_expansions, ignore_doctype_pis.

SPECIFYING ENCODING

The input to the parse_char_* methods is a string of characters. It is always interpreted as a Perl character string (utf8 or latin1).

The input to the parse_byte* methods is a string of bytes, where characters are encoded in some Web-compatible character encoding. It is decoded as specified by the HTML and Encoding Standards.
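The difference between the two kinds of input can be seen with the core Encode module. This is only an illustration of Perl byte strings versus character strings; the sample markup is made up, and the parser calls are shown as comments because they need a Document object.

```perl
use strict;
use warnings;
use Encode qw(decode);

## U+00E9 encoded in UTF-8 occupies two bytes (0xC3 0xA9).
my $bytes = "<p>caf\xC3\xA9</p>";
my $chars = decode ('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length ($bytes), length ($chars);
## prints "bytes: 12, characters: 11"

## The byte string is input for parse_byte_string, which decodes it:
##   $parser->parse_byte_string ('utf-8', $bytes => $doc);
## The decoded string is input for parse_char_string:
##   $parser->parse_char_string ($chars => $doc);
```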

The parse_byte* methods accept a character encoding label as one of their arguments. It is interpreted as the transport layer character encoding metadata. In HTTP, it is the value of the charset parameter in the Content-Type header. If it is unknown, the argument must be set to undef. Note that in some cases this encoding metadata is ignored, as specified in the HTML Standard.
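For HTTP, the label has to be taken from the Content-Type header before it can be passed in. The following is a simplified sketch: charset_from_content_type is a hypothetical helper, not part of this module, and its regular expression is a loose approximation rather than a full implementation of the MIME type parsing rules.

```perl
use strict;
use warnings;

## Hypothetical helper: extract the charset parameter from a Content-Type
## header value.  It returns an explicit undef (not an empty list) when
## there is none, so that it is safe to call inside an argument list.
sub charset_from_content_type {
  my $value = shift;
  return undef unless defined $value;
  if ($value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
    return defined $1 ? $1 : $2;
  }
  return undef;
}

## Possible use:
##   my $label = charset_from_content_type ($content_type);
##   $parser->parse_byte_string ($label, $bytes => $doc);
```

The label is passed through unchanged; unknown labels are ignored by the parser and labels are matched ASCII case-insensitively, so no further normalization is attempted here.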

The known_definite_encoding method can be used to set a known definite encoding. If its value is not undef, it is used to decode the document. This takes precedence over the transport layer character encoding metadata and is always respected.

The character encoding, if specified, must be represented by one of its labels, defined by the Encoding Standard. Unknown labels are ignored. Examples of labels include (but are not limited to): utf-8, windows-1252, shift_jis, euc-jp, iso-2022-jp, and gb18030. Encoding labels are ASCII case-insensitive.

If none of this character encoding metadata is provided, the parse_byte* methods try to detect the character encoding in use by the steps specified in the HTML Standard. They also take the locale information of the locale_tag method into account.

The is_xhr method's value also affects this encoding detection process, as specified by the XMLHttpRequest Standard.

SEE ALSO

Web::DOM::Document, Web::DOM::Element in the perl-web-dom package.

Web::HTML::Serializer.

Web::HTML::Validator.

Web::XML::Parser.

SPECIFICATIONS

HTML

HTML Standard <https://html.spec.whatwg.org/>.

DOCUMENTINNERHTML

Document.prototype.innerHTML <https://html5.org/tools/web-apps-tracker?from=6531&to=6532>.

DOMPARSING

DOM Parsing and Serialization <https://domparsing.spec.whatwg.org/>.

XHR

XMLHttpRequest Standard <https://xhr.spec.whatwg.org/>.

ENCODING

Encoding Standard <https://encoding.spec.whatwg.org/>.

MANAKAI

manakai DOM Extensions <https://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai++DOM%20Extensions>.

MANAKAIINDEX

manakai index data structure <https://wiki.suikawiki.org/n/manakai%20index%20data%20structures>.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2017 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This library is derived from a JSON file, which contains data extracted from the HTML Standard. "Written by Ian Hickson (Google, ian@hixie.ch) - Parts © Copyright 2004-2014 Apple Inc., Mozilla Foundation, and Opera Software ASA; You are granted a license to use, reproduce and create derivative works of this document."
