The manakai project

Web::HTML::Validator

DOM Conformance Checker

SYNOPSIS

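  use Web::HTML::Validator;
  my $val = Web::HTML::Validator->new;
  $val->onerror (sub {
    my %arg = @_;
    # get_node_path is assumed to be a helper supplied by the caller;
    # it is not described in this document.
    warn get_node_path ($arg{node}), ": ",
        ($arg{level} || "Error"), ": ",
        $arg{type}, "\n";
  });
  # $doc is a DOM document node obtained elsewhere (e.g. from a parser).
  $val->check_node ($doc);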

DESCRIPTION

The Perl module Web::HTML::Validator contains methods for conformance checking (or validation) of a DOM tree with regard to relevant Web standards such as HTML and CSS. Although the module name contains "HTML", it can also be used to check the conformance of non-HTML XML documents. See also "SPECIFICATIONS".

METHODS

This module has the following methods:

$val = Web::HTML::Validator->new

Create a new instance of the validator.

$val->check_node ($node)

Validate the specified node. If the node is not a document node, the node is validated as if it were an orphaned node, i.e. a node with no parent or owner. The node can be an attribute, but element- or attribute-specific validation is not performed in that case.

Errors and warnings are reported through the onerror handler.

$code = $val->onerror
$val->onerror ($code)

Get or set the error handler for the validator. Conformance errors, as well as warnings and additional processing information, are reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.

The value should not be set during the validation. If the value is changed, the result is undefined.
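As a minimal sketch (not part of the module), a handler can collect reports in memory instead of printing them. It assumes only what the synopsis shows: the handler receives key/value pairs that include type and, optionally, level. The names make_collector and $reports are hypothetical, and the argument values in the direct calls are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a handler suitable for $val->onerror,
# plus the hash reference that the handler fills in.
sub make_collector {
  my $reports = {};
  my $handler = sub {
    my %arg = @_;
    # As in the synopsis, a missing level is treated as "Error".
    my $level = $arg{level} || 'Error';
    push @{$reports->{$level} ||= []}, $arg{type};
  };
  return ($handler, $reports);
}

my ($handler, $reports) = make_collector ();
# With a validator: $val->onerror ($handler); $val->check_node ($doc);
# Here the handler is called directly with made-up arguments.
$handler->(type => 'example:error');
$handler->(type => 'example:warning', level => 'w');
```

After validation, $reports maps each level to the list of reported error types, which is easier to inspect in tests than output written by warn.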

$dids = $val->di_data_set
$val->di_data_set ($dids)

Get or set the "di" data set for the validator. It is used for reporting errors in nested subdocuments contained in the validated document (e.g. iframe srcdoc documents).

See also SuikaWiki:manakai index data structure <http://wiki.suikawiki.org/n/manakai%20index%20data%20structures>.

$boolean = $val->scripting
$val->scripting ($boolean)

Get or set whether scripting is enabled (true) or disabled (false) for the purpose of validation. By default, scripting is disabled. It affects validation of the HTML noscript element.

The value should not be set during the validation. If the value is changed, the result is undefined.

$boolean = $val->image_viewable
$val->image_viewable ($boolean)

Get or set whether the intended user is known to be able to view images or not. Its default is false (not known). This affects whether omitting the alt attribute of the img element is conforming or not.

The value should not be set during the validation. If the value is changed, the result is undefined.

Since the input to the validator is a DOM, not a string, syntax-level conformance errors cannot be checked by the validator itself. To detect all conformance errors, you have to parse the string using an appropriate parser (Web::HTML::Parser for HTML, or Web::XML::Parser for XML), and then invoke the validator with the resulting DOM as the input.

DEPENDENCY

In addition to the dependencies described in the README file <https://github.com/manakai/perl-web-markup/blob/master/README.pod#dependency>, the following modules (and modules required by them) are required by this module:

perl-web-css <https://github.com/manakai/perl-web-css>
perl-web-datetime <https://github.com/manakai/perl-web-datetime>
perl-web-langtag <https://github.com/manakai/perl-web-langtag>
perl-web-resource <https://github.com/manakai/perl-web-resource>
perl-web-url <https://github.com/manakai/perl-web-url>
perl-regexp-utils <https://github.com/wakaba/perl-regexp-utils>
perl-web-js <https://github.com/manakai/perl-web-js>

In addition, JE is required.

SPECIFICATIONS

XML

Extensible Markup Language (XML) 1.0 <https://www.w3.org/TR/xml/>.

XML 1.0 Fifth Edition Specification Errata <https://www.w3.org/XML/xml-V10-5e-errata>.

XMLNS

Namespaces in XML 1.0 <https://www.w3.org/TR/xml-names/>.

Namespaces in XML 1.0 (Third Edition) Errata <https://www.w3.org/XML/2009/xml-names-errata>.

HTML

HTML Standard <http://c.whatwg.org/>.

The html element in the HTML namespace MAY be used as the root element.

A DocumentFragment MAY contain any child element and text node.

The children of a template element in the HTML namespace (which is different from the template content of the element) MUST be empty.

The contents of the iframe element, and of the noscript element when scripting is enabled, MUST be validated as follows:

  Let /context/ be the element in question.

  Let /container/ be a new HTML element whose node document is the
  same as the node document of /context/.  The local name of the
  element is the return value of the following substeps:

    If /context/ is an HTML |iframe| element, return |span|.

    Otherwise, if /context/ is a descendant of a |head| element or a
    descendant of a template content whose content model is metadata
    content, return |head|.

    Otherwise, if /context/ has a parent element and the content model
    of the parent element would require the content model of /context/
    to be phrasing content if /context/ were transparent, return
    |span|.

    Otherwise, return |div|.

  Invoke the HTML fragment parsing algorithm with /container/ as the
  /context/ element and the |textContent| attribute value of /context/
  as the /input/.  Append the returned list of nodes to /container/ in
  order.  If this step results in one or more parse errors, /context/
  is not conforming.

  Let /disallowed/ be an empty list.

  Add elements disallowed by the content models of the inclusive
  ancestors of /context/ to /disallowed/, if any.

  If /context/ is an HTML |iframe| element, add HTML |script| element
  to /disallowed/.

  If /context/ is an HTML |noscript| element, add HTML |noscript| and
  |script| elements to /disallowed/.

  Check the conformance of /container/ and its descendants, with the
  following exceptions:

    Elements in /disallowed/ MUST NOT be used.

    If /container/ is an HTML |head| element, it MUST contain only
    HTML |link|, |style|, and |meta| elements.  The |head| element
    does not require any |title| element.

Note that this is a willful violation of the HTML Standard, made to simplify the validation process: the spec's requirements are too complex to implement, and that complexity would not help authors much. The set of validation errors detected by these steps is not exactly the same as that of the HTML Standard.
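The substeps that choose the local name of /container/, and the element-specific additions to /disallowed/, can be sketched as two small functions. This is an illustration of the steps above under stated assumptions, not the module's implementation: the function names and the hash keys (local_name, in_head, in_metadata_template, has_parent, parent_requires_phrasing) are hypothetical, and the caller is assumed to have already computed those facts from the DOM.

```perl
use strict;
use warnings;

# Hypothetical sketch: choose the local name of /container/.
# The caller supplies precomputed facts about /context/.
sub container_local_name {
  my (%ctx) = @_;
  # An iframe always gets a span container.
  return 'span' if $ctx{local_name} eq 'iframe';
  # Descendant of head, or of a template content whose content model
  # is metadata content.
  return 'head' if $ctx{in_head} or $ctx{in_metadata_template};
  # The parent would require phrasing content if /context/ were
  # transparent.
  return 'span' if $ctx{has_parent} and $ctx{parent_requires_phrasing};
  return 'div';
}

# Hypothetical sketch: element-specific additions to /disallowed/.
# Elements disallowed by inclusive ancestors are not covered here.
sub extra_disallowed {
  my ($local_name) = @_;
  return ['script'] if $local_name eq 'iframe';
  return ['noscript', 'script'] if $local_name eq 'noscript';
  return [];
}
```

The order of the checks matters: an iframe inside a head element still yields span, because the iframe test comes first.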

Unless otherwise specified, when validating HTML documents or fragments that are serialized in the HTML syntax and embedded within another DOM attribute or node (such as the srcdoc attribute of the HTML iframe element, or Atom or RSS elements), whether scripting is enabled or disabled, both for the document associated with the HTML parser used to parse the document or fragment and for the nodes returned by the HTML parser, is the same as whether scripting is enabled or disabled for the node document of the node.

If the |http-equiv| attribute of the |meta| element is in the Default style state, the |content| attribute value MUST NOT be the empty string.

OBSVOCAB

manakai's Conformance Checking Guideline for Obsolete HTML Elements and Attributes <http://suika.suikawiki.org/www/markup/html/exts/manakai-obsvocab>.

XSLT

XSL Transformations (XSLT) Version 1.0 <http://www.w3.org/TR/xslt>.

XSL Transformations (XSLT) Version 1.0 Specification Errata <http://www.w3.org/1999/11/REC-xslt-19991116-errata/>.

Key words "must" and "should" are to be interpreted as described in RFC 2119.

ATOM

The Atom Syndication Format <http://tools.ietf.org/html/rfc4287>, <http://www.rfc-editor.org/errata_search.php?rfc=4287>.

The rel attribute value MUST be a link type or link relation whose semantics in Atom documents are specified. It MUST NOT be a non-conforming link type.

ATOM03

The Atom Syndication Format 0.3 (PRE-DRAFT) <https://github.com/mnot/I-D/blob/master/Published/atom-format/draft-nottingham-atom-format-02.xml>.

The rel attribute value MUST be a link type or link relation whose semantics in Atom 0.3 documents are specified. It MUST NOT be a non-conforming link type.

ATOMTHREADS

Atom Threading Extension <http://tools.ietf.org/html/rfc4685>.

ATOMHISTORY

Feed Paging and Archiving <https://tools.ietf.org/html/rfc5005>.

ATOMPUB

The Atom Publishing Protocol <https://tools.ietf.org/html/rfc5023>.

ATOMDELETED

The Atom "deleted-entry" Element <http://tools.ietf.org/html/rfc6721>.

RSS1

RDF Site Summary (RSS) 1.0 <http://web.resource.org/rss/1.0/spec>.

RSSDC

RDF Site Summary 1.0 Modules: Dublin Core <http://web.resource.org/rss/1.0/modules/dc/>.

DCES

DCMI: Dublin Core Metadata Element Set, Version 1.1: Reference Description <http://dublincore.org/documents/dces/>.

RSSCONTENT

RDF Site Summary 1.0 Modules: Content <http://web.resource.org/rss/1.0/modules/content/>.

HATENA

はてなXML名前空間 - Hatena Developer Center <http://developer.hatena.ne.jp/ja/documents/other/misc/xmlns>.

RSS2

RSS 2.0 <http://www.rssboard.org/rss-specification>.

RSSBP

RSS Best Practices Profile <http://www.rssboard.org/rss-profile>.

MEDIARSS

Media RSS Specification <http://www.rssboard.org/media-rss>.

ITUNES

RSS tags for Podcasts Connect - Podcasts Connect Help <https://help.apple.com/itc/podcasts_connect/#/itcb54353390>.

CSSSTYLEATTR

CSS Style Attributes <http://dev.w3.org/csswg/css-style-attr/>.

CSS Syntax <http://dev.w3.org/csswg/css-syntax/#parse-a-list-of-declarations>.

SCHEMAORG

Schema.org <http://schema.org/>.

An item value whose data type is <http://schema.org/Integer> MUST be a valid integer. An item value whose data type is <http://schema.org/URL> MUST be an absolute URL.
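As a rough illustration of the first requirement, the check below follows the HTML Standard's definition of a valid integer: one or more ASCII digits, optionally prefixed with a hyphen-minus. The function name is_valid_integer is hypothetical and is not the module's API; the absolute URL requirement is not covered by this sketch.

```perl
use strict;
use warnings;

# Hypothetical sketch: true if $value is a valid integer in the HTML
# Standard's sense (optional "-" followed by one or more ASCII digits).
sub is_valid_integer {
  my ($value) = @_;
  return (defined $value and $value =~ /\A-?[0-9]+\z/) ? 1 : 0;
}
```

Note that a leading "+", a decimal point, or surrounding whitespace makes the value invalid under this definition.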

DATAVOCAB

data-vocabulary.org <http://www.data-vocabulary.org/>.

Structured data <https://support.google.com/webmasters/topic/2643152?ref_topic=30163>.

If the value is defined as a URL, image, or link, it MUST be an absolute URL.

ARIA

Accessible Rich Internet Applications (WAI-ARIA) 1.1 <http://w3c.github.io/aria/aria/aria.html>.

When an attribute value is defined as "token list", the value MUST be a valid unordered set of unique space-separated tokens.

When an attribute value is defined as "ID reference list", the value MUST be a valid ordered set of unique space-separated tokens.
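The uniqueness part of both requirements can be sketched as follows. This is a minimal illustration, not the module's code: the function name has_unique_tokens is hypothetical, tokens are split on ASCII whitespace as defined by the HTML Standard, and they are compared as exact strings.

```perl
use strict;
use warnings;

# Hypothetical sketch: true if no token appears more than once in a
# set of space-separated tokens.
sub has_unique_tokens {
  my ($value) = @_;
  my %seen;
  # ASCII whitespace: TAB, LF, FF, CR, SPACE.  Leading whitespace
  # yields an empty first field, which grep discards.
  for my $token (grep { length } split /[\x09\x0A\x0C\x0D\x20]+/, $value) {
    return 0 if $seen{$token}++;
  }
  return 1;
}
```

For example, has_unique_tokens ("a b c") is true, while has_unique_tokens (" a  b a") is false because the token "a" is repeated.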

OGP

The Open Graph protocol <http://ogp.me/>.

The RDF schema <http://ogp.me/ns/ogp.me.ttl>.

Open Graph Reference Documentation <https://developers.facebook.com/docs/reference/opengraph>.

Creating Custom Stories <https://developers.facebook.com/docs/opengraph/creating-custom-stories/>.

Achievements API <https://developers.facebook.com/docs/games/achievements>.

Open Graph protocol <http://web.archive.org/web/20111006152122/http://developers.facebook.com/docs/opengraph/>.

OGPMIXI

技術仕様 << mixi Developer Center (ミクシィ デベロッパーセンター) <http://developer.mixi.co.jp/connect/mixi_plugin/mixi_check/spec_mixi_check/>.

OGPGREE

Social Feedback - GREE Developer Center <https://docs.developer.gree.net/ja/platform/connect/socialfeedback>.

HTMLPRE5924

HTML Standard Tracker <https://html5.org/r/5924>.

HTMLPRE5925

HTML Standard Tracker <https://html5.org/r/5925>.

WHATWGWIKI

WHATWG Wiki MetaExtensions <https://wiki.whatwg.org/wiki/MetaExtensions>.

WHATWG Wiki RelExtensions <https://wiki.whatwg.org/wiki/RelExtensions>.

Unless otherwise specified, link types marked as "accepted" in the RelExtensions table MUST be treated as if they were part of the Microformats Wiki's relevant table.

UFWIKI

Microformats Wiki - existing rel values <http://microformats.org/wiki/existing-rel-values>.

CSP

Content Security Policy <https://w3c.github.io/webappsec/specs/content-security-policy/>.

MIMESNIFF

MIME Sniffing <https://mimesniff.spec.whatwg.org/>.

URL

URL Standard <https://url.spec.whatwg.org/>.

MANAKAI

manakai DOM Extensions <https://suika.suikawiki.org/~wakaba/wiki/sw/n/manakai%20DOM%20Extensions>.

Any node MAY be used as an orphan node.

VALLANGS

DOM Tree Validation <https://manakai.github.io/spec-dom/validation-langs>.

The validator also supports many more Web standards (indirectly via required modules), including but not limited to CSS, IETF BCP 47 language tags, Encoding Standard, and XML 1.0 DTD.

Note that HTML2, HTML3, HTML4, HTML 5.0, HTML 5.1, HTML 5.2, HTML 5.3, XML 1.1, Namespaces in XML 1.1, XML Base, xml:id, XLink, XInclude, XHTML1, XHTML Modularization, Ruby Annotations, RSS 0.9, RDFa, XForms, XHTML2, HLink, XML Events, XFrames, and RDF/XML are not considered applicable specifications. The module does not support ARIA attributes in their own namespace. Also, the module does not support historical HTML features that are no longer part of the language, except for those explicitly listed in the OBSVOCAB specification. See the OBSVOCAB specification for details.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2018 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.