The manakai project

Web::Encoding

Web Encodings APIs

SYNOPSIS

  use Web::Encoding;
  $bytes = encode_web_utf8 $chars;
  $chars = decode_web_utf8 $bytes;

DESCRIPTION

The Web::Encoding module provides a set of functions to handle Web-compatible character encodings.

The perl-web-encodings repository also contains the following modules:

Web::Encoding::UnivCharDet

A Perl implementation of universalchardet (the universal detector), which can be used to implement HTML parsers.

Web::Encoding::Normalization

An implementation of the Unicode string normalization algorithms, i.e. NFC, NFD, NFKC, and NFKD.

Web::Encoding::Preload

Preloads encoding modules and data files.

FUNCTIONS

The functions described in the following subsections are exported by default.

Encoding labels and properties of encodings

The following functions handle encoding labels and return properties of encodings:

$key = encoding_label_to_name $label

Find the encoding identified by the specified label. Like the "get an encoding" steps [ENCODING], this function ignores leading and trailing spaces and compares labels ASCII case-insensitively. The function returns the encoding key (not a name) if an encoding is found, or undef otherwise.

$key = fixup_html_meta_encoding_name $key

Replace an encoding key for the purpose of HTML character encoding declarations, as in the "prescan a byte stream to determine its encoding" and "change the encoding" algorithms [HTML]. The argument must be an encoding key (not a name or label). The function returns an encoding key.

$key = get_output_encoding_key $key

Return the result of applying the steps to "get an output encoding" [ENCODING]. The argument must be an encoding key (not a name or label). The function returns an encoding key.
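As an illustration, the following minimal sketch applies both functions to the same key. The `<meta>` declaration mentioned in the comment is hypothetical, and the sketch assumes that the module is installed and that `utf-16le` is a recognized label, as in the Encoding Standard.

```perl
use strict;
use warnings;
use Web::Encoding;

# Key for a hypothetical <meta charset="utf-16le"> declaration.
my $declared = encoding_label_to_name ('utf-16le');

# Key to use for parsing, after the HTML-specific replacement.
my $for_parsing = fixup_html_meta_encoding_name ($declared);

# Key to use when producing output in the document's encoding.
my $for_output = get_output_encoding_key ($declared);

print "declared:    $declared\n";
print "for parsing: $for_parsing\n";
print "for output:  $for_output\n";
```

All three values are encoding keys, so they can be passed on to the other functions in this module that take a key.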

$name = encoding_name_to_compat_name $key

Convert an encoding key to its official name, as used in e.g. the characterSet or inputEncoding attributes of the Document interface [ENCODING] [DOM]. The argument must be an encoding key (not a name or label). The function returns an encoding name.

$boolean = is_ascii_compat_encoding_name $key

Return whether the specified encoding is an ASCII-compatible character encoding [ENCODING] or not. The argument must be an encoding key (not a name or label).

$boolean = is_encoding_label $label

Return whether the specified label identifies an encoding [ENCODING] or not. It compares labels ASCII case-insensitively. Unlike the encoding_label_to_name function, however, this function does not ignore spaces.

$key = locale_default_encoding_name $tag

Return the encoding key (not a name or label) of the default character encoding for a locale [HTML]. If no default is known for the specified locale, undef is returned.

The argument, which identifies the locale, must be either a BCP 47 language tag or the string *. Data is available only for tags that consist of a primary language subtag alone and for zh-TW and zh-CN; for any other tag, no data is available. The tags are ASCII case-insensitive. If * is specified, the function returns the global default encoding, which can be used when the locale is not known or has no default.
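The lookup with a fallback to * can be sketched as follows. The helper name `fallback_encoding_key` and the sample tags are hypothetical, chosen only for illustration; the sketch assumes that * always yields a defined key.

```perl
use strict;
use warnings;
use Web::Encoding;

# Hypothetical helper: return the default encoding key for a language
# tag, falling back to the global default identified by "*".
sub fallback_encoding_key ($) {
  my ($tag) = @_;
  my $key = locale_default_encoding_name ($tag);
  $key = locale_default_encoding_name ('*') unless defined $key;
  return $key;
}

for my $tag (qw(ja zh-TW x-unknown)) {
  printf "%-10s %s\n", $tag, fallback_encoding_key ($tag);
}
```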

For the purpose of this module, the key of an encoding is a short string uniquely identifying the encoding. It is a lowercased variant of the encoding name [ENCODING].

Note that the encoding names in the Encoding Standard are not compatible with the encoding names used by Perl's Encode module.
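To make the distinction between labels, keys, and names concrete, here is a minimal sketch that resolves a few labels. The sample labels are chosen for illustration (`utf8` and `latin1` are labels defined by the Encoding Standard), and `no-such-label` is assumed not to identify any encoding.

```perl
use strict;
use warnings;
use Web::Encoding;

for my $label ('utf8', ' Latin1 ', 'no-such-label') {
  # Label to key: surrounding spaces and ASCII case are ignored.
  my $key = encoding_label_to_name ($label);
  unless (defined $key) {
    print "'$label': not a known label\n";
    next;
  }

  # Key to official name, plus one property of the encoding.
  my $name = encoding_name_to_compat_name ($key);
  my $ascii = is_ascii_compat_encoding_name ($key) ? 'yes' : 'no';
  print "'$label': key=$key name=$name ascii-compatible=$ascii\n";
}

# is_encoding_label does not trim spaces, so the result may differ
# from whether encoding_label_to_name returns a key.
my $is_label = is_encoding_label (' Latin1 ') ? 'yes' : 'no';
print "is ' Latin1 ' a label? $is_label\n";
```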

Encoders and decoders

The following functions encode and decode strings:

$bytes = encode_web_utf8 $chars

Encode the character string in UTF-8 and return the encoded bytes.

This function can be used to implement the "UTF-8 encode" operation [ENCODING].

$chars = decode_web_utf8 $bytes

Decode the bytes as UTF-8 and return the decoded character string. Any bad byte is replaced by a U+FFFD character; the function never fails.

This function can be used to implement the "UTF-8 decode" operation [ENCODING].

$chars = decode_web_utf8_no_bom $bytes

Decode the bytes as UTF-8, without recognizing a BOM, and return the decoded character string. Any bad byte is replaced by a U+FFFD character; the function never fails.

This function can be used to implement the "UTF-8 decode without BOM" operation [ENCODING].
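The three UTF-8 functions can be exercised with a short sketch like the one below. The sample strings are arbitrary. The last part assumes that "not recognizing BOM" means the BOM bytes are decoded as an ordinary U+FEFF character rather than removed, as in the corresponding Encoding Standard operation; the script only prints the two lengths so that this can be checked.

```perl
use strict;
use warnings;
use Web::Encoding;

# Round trip: character string -> UTF-8 bytes -> character string.
my $chars = "caf\x{E9} \x{3042}";
my $bytes = encode_web_utf8 ($chars);
my $again = decode_web_utf8 ($bytes);
print "round trip: ", ($again eq $chars ? "ok" : "mismatch"), "\n";

# A byte that can never appear in UTF-8 does not cause a failure.
my $repaired = decode_web_utf8 ("a\xFFb");
print join (' ', map { sprintf 'U+%04X', ord $_ } split //, $repaired), "\n";

# Compare the two decoders on input that starts with a UTF-8 BOM.
my $with_bom = "\xEF\xBB\xBFabc";
my $len_default = length (decode_web_utf8 ($with_bom));
my $len_no_bom = length (decode_web_utf8_no_bom ($with_bom));
print "decode_web_utf8: $len_default characters\n";
print "decode_web_utf8_no_bom: $len_no_bom characters\n";
```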

$bytes = encode_web_charset $key, $chars

Encode the character string and return the encoded bytes.

The first argument must be the key of the encoding used to encode the string.

Any character not representable in the encoding is converted to an HTML decimal character reference for the character.

This function can be used to implement the "encode" operation with error mode html [ENCODING] [ENCODING16].
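The character reference fallback can be seen in a small sketch such as the following. It assumes the key `windows-1252`, in which U+00E9 is representable as a single byte and U+3042 is not; the sample string is arbitrary.

```perl
use strict;
use warnings;
use Web::Encoding;

my $key = encoding_label_to_name ('windows-1252');

# U+00E9 maps to one byte; U+3042 has no mapping in this encoding and
# is written as a decimal character reference instead.
my $bytes = encode_web_charset ($key, "\x{E9} \x{3042}");

# Show the result as hexadecimal so that the output is plain ASCII.
print unpack ('H*', $bytes), "\n";
print "contains a character reference\n" if $bytes =~ /&#[0-9]+;/;
```

Because the fallback produces ordinary ASCII text, the encoded bytes do not record which characters were originally unrepresentable; decoding them yields the character reference, not the original character.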

$chars = decode_web_charset $key, $bytes

Decode the bytes and return the decoded character string.

The first argument must be the key of the encoding used to decode the bytes.

Any bad byte is replaced by a U+FFFD character; the function never fails.

This function is equivalent to the following code using Web::Encoding::Decoder:

  $decoder = Web::Encoding::Decoder->new_from_encoding_key ($key);
  $decoder->ignore_bom (1);
  return $decoder->bytes ($bytes) . $decoder->eof;

[$name, $name, ...] = encoding_names

Return the list of the encoding keys (i.e. the lowercase variants of the encoding names), as an array reference.

In addition to UTF-8, the following legacy encodings are supported: IBM866, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-8-I, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, KOI8-R, KOI8-U, macintosh, windows-874, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, x-mac-cyrillic, gb18030, GBK, Big5, EUC-JP, ISO-2022-JP, Shift_JIS, EUC-KR, x-user-defined, UTF-16BE, UTF-16LE, and replacement.
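For example, the supported encodings can be listed together with their official names by a sketch like this one, which assumes only the functions documented in this file.

```perl
use strict;
use warnings;
use Web::Encoding;

# encoding_names returns an array reference of encoding keys.
my $keys = encoding_names ();

for my $key (sort @$keys) {
  # Print each key next to its official encoding name.
  printf "%-16s %s\n", $key, encoding_name_to_compat_name ($key);
}
```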

SPECIFICATIONS

ENCODING

Encoding Standard <https://encoding.spec.whatwg.org/>.

ENCODING16

UTF-16 encoder <https://github.com/whatwg/encoding/commit/8360f775c8df145f649047c7d59c5ff733ade112>.

HTML

HTML Standard <https://html.spec.whatwg.org/>.

DOM

DOM Standard <https://dom.spec.whatwg.org/>.

ENCVALID

Encoding Validation <https://wiki.suikawiki.org/n/Encoding%20Validation>.

DEPENDENCY

The module requires Perl 5.8 or later.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2011-2018 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.