Web::HTML::Parser
An HTML parser
SYNOPSIS
use Web::HTML::Parser;
use Web::DOM::Document;
$parser = Web::HTML::Parser->new;
$doc = Web::DOM::Document->new;
$parser->parse_char_string ($chars => $doc);
$parser->parse_byte_string ($encoding, $bytes => $doc);
## Or, just use the DOM attributes:
$doc->manakai_is_html (1);
$doc->inner_html ($chars);
DESCRIPTION
The Web::HTML::Parser module is an HTML parser, as specified by the HTML Standard (i.e. an "HTML5" parser), written in pure Perl.
This module provides a low-level API to the parser, which accepts a byte or character sequence as input and constructs a DOM tree as output, optionally reporting errors and warnings detected during parsing. Applications such as browsers, data mining tools, validators, and so on can use this module directly. However, using higher-level APIs such as the DOM inner_html method is encouraged (see Web::DOM::ParentNode in the perl-web-dom package, for example).
METHODS
The Web::HTML::Parser module has the following methods:
$parser = Web::HTML::Parser->new
-
Create a new parser.
$parser->parse_char_string ($chars => $doc)
-
Parse a character string as HTML. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be a DOM Document object. The Document is to be mutated during the parsing.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
$parser->parse_byte_string ($encoding, $bytes => $doc)
-
Parse a byte string as HTML. The first argument must be a character encoding label for the byte string, if any, or undef (see "SPECIFYING ENCODING"). The second argument must be a byte string. The third argument must be a DOM Document object. The Document is to be mutated during the parsing.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
$node_list = $parser->parse_char_string_with_context ($chars, $context, $empty_doc)
-
Parse a character string as HTML in the specified context. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be an Element object used as the context, or undef if there is no context. The third argument must be an empty Document object used in the parsing. Note that the Document's children list is not to be affected by the parsing. The method returns an HTMLCollection object containing the result of the parsing (zero or more Node objects).
This method can be used to implement the inner_html method of an Element.
See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document and Element objects.
$parser->locale_tag ($string)
$string = $parser->locale_tag
-
Get or set the BCP 47 language tag for the locale used to parse the document, e.g. en, ja, zh-tw, and zh-cn. It is used to determine the default character encoding (which is only used when the character encoding cannot be determined by other means).
If undef is specified (or the locale_tag method is not explicitly invoked at all), the default is "none", which results in the windows-1252 character encoding default.
Except for zh-tw and zh-cn, only the primary language tag (i.e. a language code with no - and subtags) should be specified. Tags are compared ASCII case-insensitively.
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$string = $parser->known_definite_encoding
$parser->known_definite_encoding ($string)
-
Get or set a known character encoding used to parse the document. See also "SPECIFYING ENCODING".
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$boolean = $parser->is_xhr
$parser->is_xhr ($boolean)
-
Get or set whether the document is parsed to create XHR's responseXML document or not. See also "SPECIFYING ENCODING".
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$boolean = $parser->scripting
$parser->scripting ($boolean)
-
Get or set whether the scripting flag of the parser is "enabled" or not. By default the value is "disabled" (false). If the value is "enabled", the noscript element's content is not parsed (this is how browsers parse the document by default). Otherwise the content is parsed as normal.
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$code = $parser->onerror
$parser->onerror ($new_code)
-
Get or set the error handler for the parser. Any parse error, as well as warning and additional processing information, is reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.
The code is not expected to throw any exception. See also throw.
The value should not be set while the parser is running. If the value is changed, the result is undefined.
$parser->throw ($code)
-
Terminate the parser and run the specified code reference. The code reference must throw an exception.
When the error handler specified by the onerror method is intended to abort the parsing, it must invoke this method and return. Otherwise resources used by the parser might not be destroyed due to the unexpected termination.
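As a rough sketch of how onerror and throw might be combined: the first parser below only collects reports, while the second aborts on the first report. The assumption that the handler receives name/value pairs with keys such as type and level follows the onerror.txt document linked above and should be checked against it; the sample markup and variable names are illustrative only.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);
use Web::HTML::Parser;
use Web::DOM::Document;

# 1. Collect every report instead of letting it go unnoticed.
my @errors;
my $parser = Web::HTML::Parser->new;
$parser->onerror (sub {
  my %error = @_;  # assumption: name/value pairs (type, level, ...)
  push @errors, \%error;
});

# No DOCTYPE and a misnested <b>, so at least one report is expected.
$parser->parse_char_string ('<p>Unclosed <b>bold</p>' => Web::DOM::Document->new);

for my $error (@errors) {
  printf "%s (level %s)\n",
      $error->{type} // '(unknown)', $error->{level} // '(unknown)';
}

# 2. Abort on the first report.  The handler calls throw, as the
#    documentation requires, rather than dying directly.
my $strict = Web::HTML::Parser->new;
weaken (my $weak = $strict);  # avoid a parser <-> handler reference cycle
$strict->onerror (sub {
  my %error = @_;
  my $type = $error{type} // '(unknown)';
  $weak->throw (sub { die "Parsing aborted: $type\n" });
  return;
});

my $completed = eval {
  $strict->parse_char_string ('<p>Unclosed <b>bold</p>' => Web::DOM::Document->new);
  1;
};
print $@ unless $completed;
```

The weakened reference is a design choice for this sketch, not something the module requires: it keeps the handler from holding the parser alive after the caller is done with it.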
The module also has the following methods for API compatibility with Web::XML::Parser, but they have no effect: max_entity_depth, max_entity_expansions, ignore_doctype_pis.
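To illustrate the parse_char_string_with_context method described above, here is a minimal sketch that parses a fragment as if it were the content of a div element. The create_element, length, item, and node_name calls are ordinary DOM methods assumed to be available from the perl-web-dom package; the markup is made up for the example.

```perl
use strict;
use warnings;
use Web::HTML::Parser;
use Web::DOM::Document;

# The context element decides how the fragment is tokenized and
# which tree construction rules apply.
my $owner = Web::DOM::Document->new;
$owner->manakai_is_html (1);
my $context = $owner->create_element ('div');

# A fresh, empty Document is used as scratch space by the parser.
my $empty_doc = Web::DOM::Document->new;

my $parser = Web::HTML::Parser->new;
my $nodes = $parser->parse_char_string_with_context
    ('<p>one</p><p>two</p>', $context, $empty_doc);

printf "%d node(s) parsed\n", $nodes->length;
for my $i (0 .. $nodes->length - 1) {
  print $nodes->item ($i)->node_name, "\n";
}
```

An inner_html setter would typically replace the context element's children with the returned nodes; that step is left out here to keep the sketch small.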
SPECIFYING ENCODING
The input to the parse_char_* methods is a string of characters. It is always interpreted as a Perl character string (utf8 or latin1).
The input to the parse_byte* methods is a string of bytes, where characters are encoded in some Web-compatible character encoding. It is decoded as specified by the HTML and Encoding standards.
The parse_byte* methods accept a character encoding label as one of their arguments. It is interpreted as the transport layer character encoding metadata. In HTTP, it is the value of the charset parameter in the Content-Type header. If it is unknown, the argument must be set to undef. Note that in some cases this encoding metadata is ignored, as specified in the HTML Standard.
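For instance, a caller that has an HTTP response in hand might pass the charset parameter along as follows. This is only a sketch: the header value and body are invented, and the regular expression is a simplified stand-in for a real Content-Type parser.

```perl
use strict;
use warnings;
use Web::HTML::Parser;
use Web::DOM::Document;

# Hypothetical response data.
my $content_type = 'text/html; charset=UTF-8';
my $bytes = "<!DOCTYPE html><title>caf\xC3\xA9</title>";  # "café" in UTF-8

# Simplified extraction of the charset parameter; $label stays undef
# when the header carries no charset, which is what the parser expects
# for "unknown".
my ($label) = $content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)/i;

my $doc = Web::DOM::Document->new;
Web::HTML::Parser->new->parse_byte_string ($label, $bytes => $doc);

# The serialized tree now contains decoded characters, not bytes.
my $html = $doc->inner_html;
```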
The known_definite_encoding method can be used to set a known definite encoding. If its value is not undef, it is used to decode the document. This takes precedence over the transport layer character encoding metadata and is always respected.
The character encoding, if specified, must be represented by one of its labels, as defined by the Encoding Standard. Unknown labels are ignored. Examples of labels include (but are not limited to): utf-8, windows-1252, shift_jis, euc-jp, iso-2022-jp, and gb18030. Encoding labels are ASCII case-insensitive.
If none of this character encoding metadata is provided, the parse_byte* methods try to detect the character encoding in use by the steps specified in the HTML Standard. The detection also takes the locale information of the locale_tag method into account.
The is_xhr method's value also affects this encoding detection process, as specified by the XMLHttpRequest Standard.
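The precedence rules above can be sketched as follows. The byte strings are made up for the example, and the comments on the detected encodings restate what this section says rather than describing verified output.

```perl
use strict;
use warnings;
use Web::HTML::Parser;
use Web::DOM::Document;

# windows-1252 bytes with no <meta charset> in the markup.
my $bytes = "<!DOCTYPE html><p>caf\xE9</p>";

# A known definite encoding wins over the (here, wrong) transport label.
my $definite = Web::HTML::Parser->new;
$definite->known_definite_encoding ('windows-1252');
my $doc1 = Web::DOM::Document->new;
$definite->parse_byte_string ('utf-8', $bytes => $doc1);
my $html1 = $doc1->inner_html;

# No metadata at all: the parser has to detect the encoding, and the
# locale tag selects the fallback default used when detection is
# otherwise inconclusive.
my $by_locale = Web::HTML::Parser->new;
$by_locale->locale_tag ('ja');
my $doc2 = Web::DOM::Document->new;
$by_locale->parse_byte_string (undef, $bytes => $doc2);
```

Setting known_definite_encoding is the safer choice when the caller genuinely knows the encoding, since the transport layer label can be overridden by the document itself in some cases.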
SEE ALSO
Web::DOM::Document, Web::DOM::Element in the perl-web-dom package.
SPECIFICATIONS
- HTML
-
HTML Standard <https://html.spec.whatwg.org/>.
- DOCUMENTINNERHTML
-
Document.prototype.innerHTML <https://html5.org/tools/web-apps-tracker?from=6531&to=6532>.
- DOMPARSING
-
DOM Parsing and Serialization <https://domparsing.spec.whatwg.org/>.
- XHR
-
XMLHttpRequest Standard <https://xhr.spec.whatwg.org/>.
- ENCODING
-
Encoding Standard <https://encoding.spec.whatwg.org/>.
- MANAKAI
-
manakai DOM Extensions <https://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai++DOM%20Extensions>.
- MANAKAIINDEX
-
manakai index data structure <https://wiki.suikawiki.org/n/manakai%20index%20data%20structures>.
AUTHOR
Wakaba <wakaba@suikawiki.org>.
LICENSE
Copyright 2007-2017 Wakaba <wakaba@suikawiki.org>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This library is derived from a JSON file, which contains data extracted from the HTML Standard. "Written by Ian Hickson (Google, ian@hixie.ch) - Parts © Copyright 2004-2014 Apple Inc., Mozilla Foundation, and Opera Software ASA; You are granted a license to use, reproduce and create derivative works of this document."