The manakai project

Web::LangTag

Language Tag Parsing, Conformance Checking, and Normalization

SYNOPSIS

  use Web::LangTag;

  my $lt = Web::LangTag->new;
  $lt->onerror ($code);   # $code: code reference called for each report
  my $parsed = $lt->parse_tag ($tag);   # $tag: a language tag string
  my $result = $lt->check_parsed_tag ($parsed);
  $tag = $lt->normalize_tag ($tag);

DESCRIPTION

The Web::LangTag module contains methods to handle language tags as defined by BCP 47. It can be used to parse, validate, or normalize language tags according to the relevant standards.

METHODS

For the following methods, any input or output that is a language tag or a language range is treated as a character string (possibly a utf8-flagged string of characters), not a byte string. Note that although language tags and ranges are specified as strings of ASCII characters, illegal tags and ranges can contain arbitrary non-ASCII characters.

Since the relevant standards have changed incompatibly over time, a language tag conforming to an older standard can be non-conforming according to the latest standard. For this reason, the module provides parsing, validating, and normalizing methods for each version of the standards. However, in general, you should simply use the non-versioned methods.

$lt = Web::LangTag->new

Create a new language tag processor.

$lt->onerror ($code)
$code = $lt->onerror

Get or set the error handler for the processor. Any parse error, as well as any warning or additional processing information, is reported to the handler. See <https://github.com/manakai/data-errors/blob/master/doc/onerror.txt> for details of error handling.

The value should not be changed while the processor is running. If it is changed during processing, the result is undefined.
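
As a minimal sketch, the handler below simply records whatever arguments it is called with, so that they can be inspected afterwards. The argument format is defined by the onerror.txt document linked above and is deliberately not assumed here; the @reports array and the sample input are illustrative only.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;

# Record the raw arguments of every report. Their structure is
# defined by onerror.txt, so nothing about it is assumed here.
my @reports;
$lt->onerror (sub { push @reports, [@_] });

# An input the Parsing section lists as syntactically illegal, used
# to give the handler something to receive.
my $parsed = $lt->parse_tag ('1234-xyz-abc');

printf "%d report(s) received\n", scalar @reports;
```

Setting the handler once, before any parsing or checking call, keeps to the rule that it must not change while the processor is running.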

Parsing

$parsed = $lt->parse_tag ($tag)

Parses a language tag into subtags. This method interprets the language tag using the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$parsed = $lt->parse_rfc5646_tag ($tag)

Parses a language tag into subtags. This method interprets the language tag using the definition in RFC 5646.

$parsed = $lt->parse_rfc4646_tag ($tag)

Parses a language tag into subtags. This method interprets the language tag using the definition in RFC 4646.

These methods return a hash reference, which contains one or more key-value pairs from the following list:

language (string)

The language subtag. There is always a language subtag, even if the input is illegal, unless the input is a grandfathered tag. E.g. 'ja' for input ja-JP.

extlang (arrayref of strings)

The extlang subtags. E.g. ['yue'] for input zh-yue.

script (string or undef)

The script subtag. E.g. 'Latn' for input ja-Latn-JP.

region (string or undef)

The region subtag. E.g. 'JP' for input en-JP.

variant (arrayref of strings)

The variant subtags. E.g. ['fonipa'] for input en-JP-fonipa.

extension (arrayref of arrayrefs of strings)

The extension subtags. E.g. [['u', 'islamCal']] for input en-US-u-islamCal.

privateuse (arrayref of strings)

The privateuse subtags. E.g. ['x', 'pig', 'latin'] for input x-pig-latin.

illegal (arrayref of strings)

Illegal (syntactically non-conforming) string fragments. E.g. ['1234', 'xyz', 'abc'] for input 1234-xyz-abc.

grandfathered (string or undef)

"Grandfathered" language tag. E.g. 'i-default' for input i-default.

u

If the tag contains a u extension, the parse result of the extension is stored here. The value is an array reference of array references of strings. The first inner array reference contains the attributes in the extension. The remaining inner array references, if any, represent the keywords (i.e. the key-type pairs) in the extension, in their original order. E.g. [[], ['ca', 'japanese'], ['va', '0061', '0061']] for input ja-u-ca-japanese-va-0061-0061.

t

If the tag contains a t extension, the parse result of the extension is stored here. The value is an array reference containing a parsed language tag followed by array references of strings. The first (zeroth) item in the outer array reference is the embedded language tag in parsed form, if any, or undef otherwise. The remaining items, if any, represent the fields in the extension as array references of subtags, in their original order. E.g. [{language => 'de', region => 'JP'}, ['m0', 'und'], ['x0', 'medical']] for input ja-Latn-t-de-JP-m0-und-x0-medical.

Note that the original case (lower or upper) of the input is preserved in the output.
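
The following sketch shows one way to walk over the hash reference described above. It relies only on the keys and value types listed in this section; the sample tags are taken from the examples above, and the empty error handler is an assumption made to keep the output quiet.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;
$lt->onerror (sub { });   # ignore reports in this sketch

my $parsed = $lt->parse_tag ('ja-Latn-JP');

# Scalar-valued keys may be undef, so guard before printing.
for my $key (qw(language script region grandfathered)) {
  my $value = $parsed->{$key};
  printf "%-13s %s\n", $key, defined $value ? $value : '(none)';
}

# List-valued keys are array references when present.
for my $key (qw(extlang variant privateuse illegal)) {
  my @items = @{$parsed->{$key} || []};
  printf "%-13s %s\n", $key, @items ? join (' ', @items) : '(none)';
}

# The u extension, if any: attributes first, then the keywords.
my $with_u = $lt->parse_tag ('ja-u-ca-japanese-va-0061-0061');
if (my $u = $with_u->{u}) {
  my ($attributes, @keywords) = @$u;
  print "attributes: @$attributes\n";
  print "keyword: @$_\n" for @keywords;
}
```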

Serialization

$tag = $lt->serialize_parsed_tag ($parsed_tag)

Convert a parsed language tag into a language tag string. The argument must be a parsed tag as defined in the previous section; a malformed value will not be processed properly.

If the given parsed tag does not represent a well-formed language tag, the resulting string will not be a well-formed language tag either.
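
A minimal round-trip sketch: parse a tag, then turn the parsed structure back into a string. The input is one of the examples from the Parsing section; no particular output is claimed here beyond it being a string.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;
$lt->onerror (sub { });   # ignore reports in this sketch

# Parse, then serialize the unmodified parse result.
my $parsed = $lt->parse_tag ('en-JP-fonipa');
my $tag = $lt->serialize_parsed_tag ($parsed);

print "serialized: $tag\n";
```

Passing only structures obtained from the parsing methods, as here, is the simplest way to meet the requirement that the argument be a proper parsed tag.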

Conformance checking (validation)

$result = $lt->check_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$result = $lt->check_rfc5646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC 5646.

This method does not report any parse errors, as this method receives a parsed language tag.

The method returns a hash reference with two keys: well-formed and valid. They indicate whether the given language tag is well-formed and whether it is valid, respectively, as per RFC 5646.

$result = $lt->check_rfc4646_parsed_tag ($parsed)

Checks for conformance errors in the parsed language tag, against RFC 4646.

This method does not report any parse errors, as this method receives a parsed language tag.

The method returns a hash reference with two keys: well-formed and valid. They indicate whether the given language tag is well-formed and whether it is valid, respectively, as per RFC 4646.

$result = $lt->check_rfc3066_tag ($tag)

Parses the language tag and checks it for conformance errors against RFC 3066.

The method returns an empty hash reference.

$result = $lt->check_rfc1766_tag ($tag)

Parses the language tag and checks it for conformance errors against RFC 1766.

The method returns an empty hash reference.

Note that specs sometimes contain semantic or contextual conformance rules, such as: "strongly RECOMMENDED that users not define their own rules for language tag choice" (RFC 4646 4.1.), "Subtags SHOULD only be used where they add useful distinguishing information" (RFC 4646 4.1.), and "Use as precise a tag as possible, but no more specific than is justified" (RFC 4646 4.1. 1.). These kinds of requirements cannot be tested without human interpretation, and therefore the methods in this module do not (and in fact cannot) try to detect violations of these rules.
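
The parsing and checking steps can be combined as in the sketch below. It prints every flag found in the returned hash reference rather than hard-coding the key names, and it collects any handler reports in a hypothetical @reports array; the sample tag is illustrative.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;

# Reports, if any, arrive through the error handler; keep the raw
# arguments alongside the returned summary.
my @reports;
$lt->onerror (sub { push @reports, [@_] });

my $parsed = $lt->parse_tag ('ja-Latn-JP');
my $result = $lt->check_parsed_tag ($parsed);

# Print each flag in the result without assuming exact key spelling.
for my $key (sort keys %$result) {
  printf "%s: %s\n", $key, $result->{$key} ? 'yes' : 'no';
}
printf "%d report(s) from the error handler\n", scalar @reports;
```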

Normalization

$tag = $lt->normalize_tag ($tag_orig)

Normalize the language tag by folding cases, following the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = $lt->normalize_rfc5646_tag ($tag_orig)

Normalize the language tag by folding cases, following RFC 5646 2.1. and 2.2.6. Note that this method does not replace any subtag with its preferred alternative, nor does it rearrange the order of subtags.

Although this method does not completely convert language tags into their canonical form, its result will be good enough for comparison in most usual situations.
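
For instance, two spellings of a tag that differ only in letter case can be compared after normalization. This is a small sketch; the variable names and inputs are illustrative.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;
$lt->onerror (sub { });   # ignore reports in this sketch

# Two spellings that differ only in letter case.
my $left  = $lt->normalize_tag ('EN-us');
my $right = $lt->normalize_tag ('en-US');

print $left eq $right ? "same after case folding\n" : "different\n";
```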

$tag = $lt->canonicalize_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = $lt->canonicalize_rfc5646_tag ($tag_orig)

Normalize the language tag into its canonicalized form, as per RFC 5646 4.5. That is, replace each subtag with its Preferred-Value where possible and sort the extension subtags. Note that this method does NOT do any case folding. In addition, the "canonicalized form" of a language tag is not necessarily fully canonical; for example, variant subtags might not be in the recommended order. Also, it does not canonicalize extension subtags.

Note that if the input is not a well-formed language tag according to RFC 5646, the result string might not be a well-formed language tag either. In some cases canonicalization can turn a valid language tag into an invalid one.

$tag = $lt->to_extlang_form_tag ($tag_orig)

Normalize the language tag into its extlang form, as per the latest version of the language tag specification. At the time of writing, the latest version is RFC 5646.

$tag = $lt->to_extlang_form_rfc5646_tag ($tag_orig)

Normalize the language tag into its extlang form, as per RFC 5646 4.5. The extlang form is the same as the canonicalized form, except that the use of extlang subtags is preferred to the language-only (or extlang-free) representation.

Note that if the input is not a well-formed language tag according to RFC 5646, the result string might not be a well-formed language tag either. In some cases canonicalization can turn a valid language tag into an invalid one.
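
The sketch below runs a few sample tags through both conversions so that the two forms can be compared side by side. The inputs and the %forms hash are illustrative, and no particular output is claimed; since neither conversion is documented as folding case, normalize_tag can be applied separately when case matters.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;
$lt->onerror (sub { });   # ignore reports in this sketch

# Collect both forms for each sample input.
my %forms;
for my $tag ('zh-yue', 'yue', 'ja-Latn-JP') {
  $forms{$tag} = {
    canonical => $lt->canonicalize_tag ($tag),
    extlang   => $lt->to_extlang_form_tag ($tag),
  };
}

for my $tag (sort keys %forms) {
  printf "%-12s canonical=%s extlang-form=%s\n",
      $tag, $forms{$tag}->{canonical}, $forms{$tag}->{extlang};
}
```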

Comparison

$boolean = $lt->basic_filtering_range ($range, $tag)

Compares a basic language range to a language tag, according to the latest version of the language range specification. At the time of writing, the latest version is RFC 4647.

$boolean = $lt->basic_filtering_rfc4647_range ($range, $tag)

Compares a basic language range to a language tag, according to RFC 4647 Section 3.3.1. This method returns whether the range matches the tag or not.

A basic language range is either a language tag or *. (For more information, see RFC 4647 Section 2.1.)
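
To illustrate what basic filtering means, here is a standalone sketch of the rule in RFC 4647 Section 3.3.1: a range matches a tag if it is *, if the two are equal ignoring case, or if the range is a prefix of the tag that is immediately followed by "-". The function sketch_basic_filter is hypothetical and is not part of this module, nor necessarily how the module implements the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Web::LangTag: returns true if the
# basic language range $range matches $tag per RFC 4647 3.3.1.
sub sketch_basic_filter {
  my ($range, $tag) = @_;
  return 1 if $range eq '*';            # wildcard matches every tag
  my ($r, $t) = (lc $range, lc $tag);   # comparison ignores case
  return 1 if $r eq $t;                 # exact match
  # Prefix match: the range must be followed by "-" in the tag.
  return substr ($t, 0, length ($r) + 1) eq "$r-" ? 1 : 0;
}

for my $case (['de', 'de-CH'], ['de', 'den'], ['*', 'ja'], ['de-CH', 'de']) {
  my ($range, $tag) = @$case;
  printf "%-6s vs %-6s => %s\n", $range, $tag,
      sketch_basic_filter ($range, $tag) ? 'match' : 'no match';
}
```

The "de" versus "den" case shows why the prefix must end at a subtag boundary: a plain string-prefix test would wrongly report a match.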

$boolean = $lt->match_rfc3066_range ($range, $tag)

Compares a language-range to a language tag according to RFC 3066 Section 2.5. This method returns whether the range matches the tag or not. Note that RFC 3066 is obsoleted by RFC 4647.

A language range is either a language tag or *. (For more information, see RFC 3066 2.5).

Note that this method is equivalent to basic_filtering_rfc4647_range by definition.

$boolean = $lt->extended_filtering_range ($range, $tag)

Compares an extended language range to a language tag, according to the latest version of the language range specification. At the time of writing, the latest version is RFC 4647.

$boolean = $lt->extended_filtering_rfc4647_range ($range, $tag)

Compares an extended language range to a language tag, according to RFC 4647 Section 3.3.2. This method returns whether the range matches the tag or not.

An extended language range is a language tag in which any subtag can be *. (For more information, see RFC 4647 Section 2.2.)
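
As a usage sketch, the filtering methods can be used with grep to select the tags that a range matches. The tag list and ranges below are illustrative; the argument order (range first, then tag) follows the signatures given above.

```perl
use strict;
use warnings;
use Web::LangTag;

my $lt = Web::LangTag->new;
$lt->onerror (sub { });   # ignore reports in this sketch

my @tags = qw(de de-CH de-Latn-DE en-US);

# Keep only the tags matched by each kind of range.
my @basic    = grep { $lt->basic_filtering_range ('de', $_) } @tags;
my @extended = grep { $lt->extended_filtering_range ('de-*-DE', $_) } @tags;

print "basic 'de': @basic\n";
print "extended 'de-*-DE': @extended\n";
```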

SPECIFICATIONS

RFC1766

RFC 1766: Tags for the Identification of Languages <http://tools.ietf.org/html/rfc1766>. (Obsolete)

RFC3066

RFC 3066: Tags for the Identification of Languages <http://tools.ietf.org/html/rfc3066>. (Obsolete)

RFC4646

RFC 4646: Tags for Identifying Languages <http://tools.ietf.org/html/rfc4646>. (Obsolete)

RFC4647

RFC 4647: Matching of Language Tags <http://tools.ietf.org/html/rfc4647>.

RFC5646

RFC 5646: Tags for Identifying Languages <http://tools.ietf.org/html/rfc5646>.

RFC6067

RFC 6067: BCP 47 Extension U <http://tools.ietf.org/html/rfc6067>.

RFC6497

RFC 6497: BCP 47 Extension T - Transformed Content <http://tools.ietf.org/html/rfc6497>.

LANGSUBTAGREG

IANA Language Subtag Registry <http://www.iana.org/assignments/language-subtag-registry>.

LANGEXTREG

Language Tag Extensions Registry <http://www.iana.org/assignments/language-tag-extensions-registry>.

LDML

UTS #35: Unicode Locale Data Markup Language <http://unicode.org/reports/tr35/>.

UNICODELOCALEREG

Unicode Locale Extensions for BCP 47 <http://cldr.unicode.org/index/bcp47-extension>, <http://unicode.org/repos/cldr/trunk/common/bcp47/>.

WEBLANGTAG

Comments in lib/Web/LangTag.pm.

DEPENDENCY

The module requires Perl 5.8 or later.

DEVELOPMENT

The latest version of the module is available on GitHub <https://github.com/manakai/perl-web-langtag>.

Tests are run at Travis CI: <https://travis-ci.org/manakai/perl-web-langtag>.

SEE ALSO

SuikaWiki:Language Tags <http://suika.suikawiki.org/~wakaba/wiki/sw/n/language%20tags>

Language tags <https://github.com/manakai/data-web-defs/blob/master/data/langtags.json>.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2014 Wakaba <wakaba@suikawiki.org>.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.