`Web::HTML::Parser`

An HTML parser

SYNOPSIS

  use Web::HTML::Parser;
  use Web::DOM::Document;
  $parser = Web::HTML::Parser->new;
  $doc = Web::DOM::Document->new;
  
  $parser->parse_char_string ($chars => $doc);
  $parser->parse_byte_string ($encoding, $bytes => $doc);

  ## Or, just use the DOM attribute:
  $doc->manakai_is_html (1);
  $doc->inner_html ($chars);

DESCRIPTION

The Web::HTML::Parser module is an HTML parser, as specified by the HTML Standard (i.e. an "HTML5" parser), written in pure Perl.

This module provides a low-level API to the parser, which accepts a byte or character sequence as input and constructs a DOM tree as output, optionally reporting errors and warnings detected during parsing. Applications such as browsers, data mining tools, and validators can use this module directly. However, higher-level APIs such as the DOM inner_html method are encouraged (see Web::DOM::ParentNode in the perl-web-dom package, for example).

METHODS

The Web::HTML::Parser module has the following methods:

$parser = Web::HTML::Parser->new

Create a new parser.

$parser->parse_char_string ($chars => $doc)

Parse a character string as HTML. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be a DOM Document object. The Document is mutated during parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.

$parser->parse_byte_string ($encoding, $bytes => $doc)

Parse a byte string as HTML. The first argument must be a character encoding label for the byte string, if any, or undef (see "SPECIFYING ENCODING"). The second argument must be a byte string. The third argument must be a DOM Document object. The Document is mutated during parsing.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document object.
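The difference between the two input types can be sketched as follows. This is a minimal illustration, not a definitive usage pattern: it assumes the core Encode module for producing the byte string, and it skips the parser calls when Web::HTML::Parser and Web::DOM::Document are not installed. The sample markup is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing one non-ASCII character (U+3042).
my $chars = "<p>\x{3042}</p>";

# The same markup as a UTF-8 encoded byte string.
my $bytes = encode ('UTF-8', $chars);

# Assumption: Web::HTML::Parser and Web::DOM::Document are installed;
# the parser calls are skipped otherwise.
if (eval { require Web::HTML::Parser; require Web::DOM::Document; 1 }) {
  # Characters go to parse_char_string.
  my $doc1 = Web::DOM::Document->new;
  Web::HTML::Parser->new->parse_char_string ($chars => $doc1);

  # Bytes go to parse_byte_string, together with an encoding label
  # (or undef when no label is known).
  my $doc2 = Web::DOM::Document->new;
  Web::HTML::Parser->new->parse_byte_string ('utf-8', $bytes => $doc2);
}
```

A separate parser object is created for each parse here only to keep the sketch conservative.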

$node_list = $parser->parse_char_string_with_context ($chars, $context, $empty_doc)

Parse a character string as HTML in the specified context. The first argument must be a character string (i.e. a latin1 or utf8 string). The second argument must be an Element object used as the context, or undef if there is no context. The third argument must be an empty Document object used in the parsing. Note that the Document's list of children is not affected by the parsing. The method returns an HTMLCollection object containing the result of the parsing (zero or more Node objects).

This method can be used to implement the inner_html method of an Element.

See <https://github.com/manakai/perl-web-markup#dependency> for the requirements on the Document and Element objects.
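A call with a context element might look like the sketch below. Two things here are assumptions rather than statements from this document: that Web::DOM::Document exposes the DOM createElement method as create_element, and that the returned HTMLCollection has the DOM length attribute. The sample markup is made up, and the whole example is skipped when the modules are not installed.

```perl
use strict;
use warnings;

my $chars = '<b>bold</b> text';
my $count;  # number of parsed nodes, set only if the parser ran

# Assumption: Web::HTML::Parser and Web::DOM::Document (perl-web-dom)
# are installed; the example is skipped otherwise.
if (eval { require Web::HTML::Parser; require Web::DOM::Document; 1 }) {
  # Document that owns the context element.
  my $doc = Web::DOM::Document->new;
  $doc->manakai_is_html (1);

  # Assumption: create_element is the DOM createElement method as
  # exposed by Web::DOM::Document.
  my $context = $doc->create_element ('div');

  # A separate, empty Document used only during the parsing.
  my $empty_doc = Web::DOM::Document->new;

  my $parser = Web::HTML::Parser->new;
  my $node_list = $parser->parse_char_string_with_context
      ($chars, $context, $empty_doc);

  # Assumption: the returned HTMLCollection has the DOM length attribute.
  $count = $node_list->length;
}
```

The parsed nodes are only returned to the caller, so an inner_html implementation would still need to move them into the target element itself.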

$parser->locale_tag ($string)

$string = $parser->locale_tag

Get or set the BCP 47 language tag for the locale used to parse the document, e.g. en, ja, zh-tw, and zh-cn. It is used to determine the default character encoding (which is only used when character encoding cannot be determined by other means).

If undef is specified (or the locale_tag method is not explicitly invoked at all), the default is "none", which results in windows-1252 as the default character encoding.

Except for zh-tw and zh-cn, only the primary language tag (i.e. a language code with no - or subtags) should be specified. Tags are compared ASCII case-insensitively.
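When the locale comes from user settings or HTTP headers, it may carry region subtags (e.g. ja-JP). The rule above could be applied with a small helper like the following. The function normalize_locale_tag is hypothetical and not part of Web::HTML::Parser; it is only a sketch of one way to reduce a tag before passing it to locale_tag.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Web::HTML::Parser): reduce a BCP 47
# language tag to the form described above.
sub normalize_locale_tag {
  my $tag = shift;
  return undef unless defined $tag and length $tag;

  # ASCII-only lowercasing; not strictly required, since tags are
  # compared ASCII case-insensitively, but it simplifies the checks.
  $tag =~ tr/A-Z/a-z/;

  # zh-tw and zh-cn are the only tags kept with a subtag.
  return $tag if $tag eq 'zh-tw' or $tag eq 'zh-cn';

  # Otherwise keep only the primary language subtag.
  $tag =~ s/-.*\z//s;
  return $tag;
}

# Assumption: Web::HTML::Parser is installed; skipped otherwise.
if (eval { require Web::HTML::Parser; 1 }) {
  my $parser = Web::HTML::Parser->new;
  $parser->locale_tag (normalize_locale_tag ('ja-JP'));
}
```

Note that this sketch maps tags with a script subtag, such as zh-Hant-TW, to plain zh; whether that is the desired fallback depends on the application.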