The manakai project

Char::Normalize::FullwidthHalfwidth

Fullwidth/halfwidth character normalization

SYNOPSIS

  use Char::Normalize::FullwidthHalfwidth qw/normalize_width/;
  
  $s = <>;
  normalize_width (\$s);
  print $s;

DESCRIPTION

The Char::Normalize::FullwidthHalfwidth module provides functions that normalize fullwidth/halfwidth compatibility characters into their canonical representations.

FUNCTIONS

This module provides the functions normalize_width and combine_voiced_sound_marks. They can be imported into a package by specifying their names as arguments to the use statement:

  use Char::Normalize::FullwidthHalfwidth qw/normalize_width/;

Note that the use statement does not export anything unless the function names are explicitly specified.

Alternatively, you can invoke the functions in their fully-qualified forms:

  require Char::Normalize::FullwidthHalfwidth;
  Char::Normalize::FullwidthHalfwidth::normalize_width (\$s);

normalize_width ($scalarref)

Normalize the fullwidth/halfwidth characters in the scalar referenced by the argument into their preferred forms. The argument must be a scalar reference. The referenced scalar is modified in place and is treated as a character string (possibly with the utf8 flag set), not a byte string. The function returns the scalar reference.

The function performs the following conversions:

A character U+3000 IDEOGRAPHIC SPACE (so-called fullwidth space)

Replaced by a U+0020 SPACE (so-called halfwidth space) character.

Characters in the range U+FF01..U+FF5E (so-called fullwidth ASCII characters)

Each is replaced by the corresponding character in the range U+0021..U+007E (so-called halfwidth ASCII characters).

Characters in the range U+FF61..U+FF9F (halfwidth Katakana)

Each is replaced by the corresponding so-called fullwidth Katakana character (or ideographic punctuation mark). Note that U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK and U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK are replaced by U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK and U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK respectively, not by their spacing variants.

Characters in the range U+FFE0..U+FFE6 (fullwidth symbols)

Each is replaced by the corresponding canonical character.
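The first two conversions listed above can be sketched with Perl's tr operator. This is only an illustration of the mapping, not the module's actual implementation: the helper name sketch_normalize_ascii_width is hypothetical, and the Katakana and symbol conversions are left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module. It covers only
# U+3000 and U+FF01..U+FF5E and, like normalize_width, takes a
# scalar reference and edits the referenced string in place.
sub sketch_normalize_ascii_width {
  my $ref = shift;
  # U+3000 IDEOGRAPHIC SPACE -> U+0020 SPACE
  $$ref =~ tr/\x{3000}/\x{0020}/;
  # U+FF01..U+FF5E -> U+0021..U+007E (a constant offset of 0xFEE0)
  $$ref =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
  return $ref;
}

# Fullwidth "ABC", an ideographic space, then fullwidth "123"
my $s = "\x{FF21}\x{FF22}\x{FF23}\x{3000}\x{FF11}\x{FF12}\x{FF13}";
sketch_normalize_ascii_width (\$s);
print "$s\n"; # prints "ABC 123"
```

Because both ranges have the same length and order, a single tr range pair is enough for the fullwidth ASCII block.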

combine_voiced_sound_marks ($scalarref)

Replace any (fullwidth) hiragana or katakana character followed by a U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK or U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK with its precomposed form, if one exists.

In many cases you would like to apply this function just after the normalize_width function.
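To see what "precomposed form" means in practice, the following sketch uses the NFC function of the core Unicode::Normalize module as a stand-in. This is an assumption made for illustration: NFC also composes sequences outside hiragana and katakana, so it is broader than combine_voiced_sound_marks and should not be read as how this module is implemented.

```perl
use strict;
use warnings;
use Unicode::Normalize qw/NFC/;

# KATAKANA LETTER KA + combining voiced sound mark, followed by
# KATAKANA LETTER HA + combining semi-voiced sound mark. This is
# the shape of string that width normalization of halfwidth
# Katakana with sound marks produces.
my $s = "\x{30AB}\x{3099}\x{30CF}\x{309A}";

# Stand-in for combine_voiced_sound_marks: canonical composition.
my $composed = NFC ($s);

# U+30AC KATAKANA LETTER GA and U+30D1 KATAKANA LETTER PA
printf "%d -> %d characters\n", length $s, length $composed;
# prints "4 -> 2 characters"
```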

$t = get_fwhw_normalized $s

Return a normalized copy of the argument string (not reference).

It performs the normalization done by normalize_width and combine_voiced_sound_marks, as well as some additional conversions.
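Unlike the two functions above, this one takes the string itself and hands back a new one. A minimal usage sketch follows, assuming the function can be called in its fully-qualified form like the others. The require is wrapped in eval so the script still runs where the module is not installed, and no expected output is shown because the additional conversions are not specified here.

```perl
use strict;
use warnings;

my $s = "\x{FF28}\x{FF45}\x{FF4C}\x{FF4C}\x{FF4F}"; # fullwidth "Hello"
my $t;

# Guarded load: skip the call when the module is unavailable.
if (eval { require Char::Normalize::FullwidthHalfwidth; 1 }) {
  # Pass the string itself, not a reference to it.
  $t = Char::Normalize::FullwidthHalfwidth::get_fwhw_normalized ($s);
}

# $t holds the normalized copy, if the module was loaded.
binmode STDOUT, ':encoding(UTF-8)';
print defined $t ? "$t\n" : "module not available\n";
```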

BUGS

Not all compatibility characters in the Halfwidth and Fullwidth Forms block of the Unicode Standard are currently supported. In particular, halfwidth Hangul letters are not converted to their fullwidth equivalents. A future version of this module is expected to address this issue by extending the conversion table.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

HISTORY

This module was originally developed as part of SuikaWiki <https://suika.suikawiki.org/~wakaba/wiki/sw/n/SuikaWiki>.

LICENSE

Copyright 2008-2016 Wakaba <wakaba@suikawiki.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.