Web::Encoding
Web Encodings APIs
SYNOPSIS
use Web::Encoding;
$bytes = encode_web_utf8 $chars;
$chars = decode_web_utf8 $bytes;
DESCRIPTION
The Web::Encoding module provides a set of functions to handle Web-compatible character encodings.
The perl-web-encodings repository also contains the following modules:
- Web::Encoding::UnivCharDet
-
The universalchardet (or universal detector) implementation in Perl, which can be used to implement HTML parsers.
- Web::Encoding::Normalization
-
Implementation of Unicode's string normalization algorithms, i.e. NFC, NFD, NFKC, and NFKD.
- Web::Encoding::Preload
-
Preloading encoding modules and data files.
FUNCTIONS
Functions described in these subsections are exported by default.
Encoding labels and properties of encodings
The following functions handle encoding labels and return properties of encodings:
$key = encoding_label_to_name $label
-
Find the encoding identified by the specified label. Like the "get an encoding" steps [ENCODING], this function ignores leading and trailing spaces and compares labels ASCII case-insensitively. The function returns the encoding key (not a name) if found, or undef otherwise.
$key = fixup_html_meta_encoding_name $key
-
Replace an encoding key for the purpose of an HTML character encoding declaration, as in the "prescan a byte stream to determine its encoding" and "change the encoding" algorithms [HTML]. The argument must be an encoding key (not a name or label). The function returns an encoding key.
$key = get_output_encoding_key $key
-
Return the result of applying the steps to "get an output encoding" [ENCODING]. The argument must be an encoding key (not a name or label). The function returns an encoding key.
$name = encoding_name_to_compat_name $key
-
Convert an encoding key to its official name, as used in e.g. the characterSet or inputEncoding attributes of the Document interface [ENCODING] [DOM]. The argument must be an encoding key (not a name or label). The function returns an encoding name.
$boolean = is_ascii_compat_encoding_name $key
-
Return whether the specified encoding is an ASCII-compatible character encoding [ENCODING] or not. The argument must be an encoding key (not a name or label).
$boolean = is_encoding_label $label
-
Return whether the specified label identifies an encoding [ENCODING] or not. It compares labels ASCII case-insensitively. Unlike the encoding_label_to_name function, however, this function does not ignore spaces.
$key = locale_default_encoding_name $tag
-
Return the encoding key (not a name or label) of the default character encoding for a locale [HTML]. If no default is known for the specified locale, undef is returned.
The argument, which identifies the locale, must be either a BCP 47 language tag or the string *. The language tag must be a primary language tag only, zh-TW, or zh-CN; otherwise no data is available. The tags are ASCII case-insensitive. If * is specified, the function returns the global default encoding, which can be used when the locale is not known or has no default.
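The functions above can be combined to pick an encoding for a document. The following is a minimal sketch, not part of the module: choose_encoding_key is a hypothetical helper, and the label ' Latin1 ' and the locale tag 'ja' are illustrative inputs only.

```perl
use strict;
use warnings;
use Web::Encoding;

# Hypothetical helper (not provided by Web::Encoding): choose an
# encoding key from a declared label, falling back to locale defaults.
sub choose_encoding_key {
  my ($label, $locale_tag) = @_;

  # A declared label wins if it identifies an encoding.
  # Leading and trailing spaces in the label are ignored.
  if (defined $label) {
    my $key = encoding_label_to_name($label);
    return $key if defined $key;
  }

  # Otherwise try the locale default, then the global default ("*").
  my $key = locale_default_encoding_name($locale_tag);
  $key = locale_default_encoding_name('*') unless defined $key;
  return $key;
}

my $key = choose_encoding_key(' Latin1 ', 'ja');
printf "key=%s name=%s ascii-compatible=%s\n",
    $key,
    encoding_name_to_compat_name($key),
    (is_ascii_compat_encoding_name($key) ? 'yes' : 'no');
```

The helper returns a key rather than a name, so that the result can be passed directly to the other functions that require a key.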
For the purpose of this module, the key of the encoding is a short string uniquely identifying the encoding. It is a lowercased variant of the encoding name [ENCODING].
Note that the encoding names in the Encoding Standard are not compatible with the Perl Encode module's encoding names.
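As a further sketch of the key-based functions in this subsection, the code below derives the keys to use for an HTML character encoding declaration and for output. The declared label 'utf-16le' is an illustrative input, and the printed values depend on the module's data, so none are asserted here.

```perl
use strict;
use warnings;
use Web::Encoding;

# Label as it might appear in a character encoding declaration
# (illustrative value).
my $declared = 'utf-16le';

# is_encoding_label does not trim spaces, so it is applied to the
# label exactly as given.
if (is_encoding_label($declared)) {
  my $key = encoding_label_to_name($declared);

  # Key to use when honoring an HTML character encoding declaration.
  my $meta_key = fixup_html_meta_encoding_name($key);

  # Result of the "get an output encoding" steps for the same key.
  my $output_key = get_output_encoding_key($key);

  print "declared:             $key\n";
  print "for meta declaration: $meta_key\n";
  print "for output:           $output_key\n";
} else {
  print "not an encoding label: $declared\n";
}
```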
Encoders and decoders
The following functions encode and decode strings:
$bytes = encode_web_utf8 $chars
-
Encode the character string in UTF-8 and return the encoded bytes.
This function can be used to implement the "UTF-8 encode" operation [ENCODING].
$chars = decode_web_utf8 $bytes
-
Decode the bytes as UTF-8 and return the decoded character string. Any bad byte is replaced by U+FFFD characters without failure.
This function can be used to implement the "UTF-8 decode" operation [ENCODING].
$chars = decode_web_utf8_no_bom $bytes
-
Decode the bytes as UTF-8 without recognizing a BOM, and return the decoded character string. Any bad byte is replaced by U+FFFD characters without failure.
This function can be used to implement the "UTF-8 decode without BOM" operation [ENCODING].
$bytes = encode_web_charset $key, $chars
-
Encode the character string and return the encoded bytes.
The first argument must be the key of the encoding used to encode the string.
Any character not representable in the encoding is converted to an HTML decimal character reference for the character.
This function can be used to implement the "encode" operation with error mode html [ENCODING] [ENCODING16].
$chars = decode_web_charset $key, $bytes
-
Decode the bytes and return the decoded character string.
The first argument must be the key of the encoding used to decode the bytes.
Any bad byte is replaced by U+FFFD characters without failure.
This function is equivalent to the following code using Web::Encoding::Decoder:
$decoder = Web::Encoding::Decoder->new_from_encoding_key ($key);
$decoder->ignore_bom (1);
return $decoder->bytes ($bytes) . $decoder->eof;
[$name, $name, ...] = encoding_names
-
Return the list of the encoding keys (i.e. the lowercase variants of the encoding names), as an array reference.
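The UTF-8 functions above can be exercised as follows. This is a minimal sketch with illustrative input strings. The lengths are printed rather than asserted: the two decoders are expected to differ only in whether a leading BOM is dropped.

```perl
use strict;
use warnings;
use Web::Encoding;

my $chars = "caf\x{E9} \x{2603}";       # character string
my $bytes = encode_web_utf8($chars);    # UTF-8 encoded bytes

# Round trip back to a character string.
my $decoded = decode_web_utf8($bytes);
print "round trip ok\n" if $decoded eq $chars;

# Prepend a UTF-8 BOM (EF BB BF) and compare the two decoders.
my $with_bom = "\xEF\xBB\xBF" . $bytes;
my $via_decode = decode_web_utf8($with_bom);
my $via_no_bom = decode_web_utf8_no_bom($with_bom);
printf "decode_web_utf8: %d chars, decode_web_utf8_no_bom: %d chars\n",
    length($via_decode), length($via_no_bom);

# Invalid input does not cause a failure; bad bytes become U+FFFD.
my $repaired = decode_web_utf8("abc\xFFdef");
print "contains U+FFFD\n" if $repaired =~ /\x{FFFD}/;
```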
In addition to UTF-8, the following legacy encodings are supported:
IBM866 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 ISO-8859-7
ISO-8859-8 ISO-8859-8-I ISO-8859-10 ISO-8859-13 ISO-8859-14 ISO-8859-15
ISO-8859-16 KOI8-R KOI8-U macintosh windows-874 windows-1250 windows-1251
windows-1252 windows-1253 windows-1254 windows-1255 windows-1256
windows-1257 windows-1258 x-mac-cyrillic gb18030 GBK Big5 EUC-JP
ISO-2022-JP Shift_JIS EUC-KR x-user-defined UTF-16BE UTF-16LE replacement
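A legacy encoding can be used through encode_web_charset and decode_web_charset. The sketch below assumes the label 'shift_jis' and an illustrative string. The last character (U+1F600) is assumed not to be representable in Shift_JIS, so it is expected to be written as a decimal character reference.

```perl
use strict;
use warnings;
use Web::Encoding;

# Resolve a label to the key required by the encoder and decoder.
my $key = encoding_label_to_name('shift_jis');
die "unknown encoding label\n" unless defined $key;

# Two kanji followed by a character assumed to be outside Shift_JIS.
my $chars = "\x{65E5}\x{672C} \x{1F600}";

my $bytes = encode_web_charset($key, $chars);
my $back  = decode_web_charset($key, $bytes);

# Encode as UTF-8 before printing to avoid "Wide character" warnings.
print encode_web_utf8($back), "\n";

# encoding_names returns an array reference of encoding keys.
my $keys = encoding_names();
printf "%d encoding keys are available\n", scalar @$keys;
```

Because unrepresentable characters are replaced by character references, the decoded string is not guaranteed to equal the original input.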
SPECIFICATIONS
- ENCODING
-
Encoding Standard <https://encoding.spec.whatwg.org/>.
- ENCODING16
-
UTF-16 encoder <https://github.com/whatwg/encoding/commit/8360f775c8df145f649047c7d59c5ff733ade112>.
- HTML
-
HTML Standard <https://html.spec.whatwg.org/>.
- DOM
-
DOM Standard <https://dom.spec.whatwg.org/>.
- ENCVALID
-
Encoding Validation <https://wiki.suikawiki.org/n/Encoding%20Validation>.
DEPENDENCY
The module requires Perl 5.8 or later.
AUTHOR
Wakaba <wakaba@suikawiki.org>.
LICENSE
Copyright 2011-2018 Wakaba <wakaba@suikawiki.org>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.