Numbers

The manakai project, 14 December 2015

Latest version
https://manakai.github.io/spec-numbers/
Version history
https://github.com/manakai/spec-numbers/commits/gh-pages

Abstract

This document defines a CJK number (漢数字) parsing algorithm.

Table of contents

  1 Terminology
  2 CJK numbers
    2.1 Digits
    2.2 Parsing CJK numbers
  Data files
  Author

1 Terminology

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words “MUST” and “MAY” in the normative parts of this document are to be interpreted as described in RFC 2119.

Requirements phrased in the imperative as part of algorithms (such as “strip any leading space characters” or “return false and abort these steps”) are to be interpreted with the meaning of the key word (e.g. “MUST”) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps MAY be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

2 CJK numbers

This section defines an algorithm to parse CJK numbers (漢数字).

2.1 Digits

Digits used in CJK numbers are defined by the following table. Each character in the table has its value shown in the value column of the same row and belongs to the category shown in the category column of the same row.

The value of a character is a non-negative integer.

There are the following categories: CJK digit, CJK ten, CJK multiple tens, CJK hundred, CJK thousand, CJK ten thousand, CJK hundred million, CJK trillion, and CJK ten quadrillion.

Code point Name Character Value Category
U+0030 DIGIT ZERO 0 0 CJK digit
U+0031 DIGIT ONE 1 1 CJK digit
U+0032 DIGIT TWO 2 2 CJK digit
U+0033 DIGIT THREE 3 3 CJK digit
U+0034 DIGIT FOUR 4 4 CJK digit
U+0035 DIGIT FIVE 5 5 CJK digit
U+0036 DIGIT SIX 6 6 CJK digit
U+0037 DIGIT SEVEN 7 7 CJK digit
U+0038 DIGIT EIGHT 8 8 CJK digit
U+0039 DIGIT NINE 9 9 CJK digit
U+3007 IDEOGRAPHIC NUMBER ZERO 〇 0 CJK digit
U+4E00 CJK UNIFIED IDEOGRAPH-4E00 一 1 CJK digit
U+4E03 CJK UNIFIED IDEOGRAPH-4E03 七 7 CJK digit
U+4E07 CJK UNIFIED IDEOGRAPH-4E07 万 10000 CJK ten thousand
U+4E09 CJK UNIFIED IDEOGRAPH-4E09 三 3 CJK digit
U+4E17 CJK UNIFIED IDEOGRAPH-4E17 丗 30 CJK multiple tens
U+4E5D CJK UNIFIED IDEOGRAPH-4E5D 九 9 CJK digit
U+4E8C CJK UNIFIED IDEOGRAPH-4E8C 二 2 CJK digit
U+4E94 CJK UNIFIED IDEOGRAPH-4E94 五 5 CJK digit
U+4E96 CJK UNIFIED IDEOGRAPH-4E96 亖 4 CJK digit
U+4EAC CJK UNIFIED IDEOGRAPH-4EAC 京 10000000000000000 CJK ten quadrillion
U+4EDF CJK UNIFIED IDEOGRAPH-4EDF 仟 1000 CJK thousand
U+4F0D CJK UNIFIED IDEOGRAPH-4F0D 伍 5 CJK digit
U+4F70 CJK UNIFIED IDEOGRAPH-4F70 佰 100 CJK hundred
U+5104 CJK UNIFIED IDEOGRAPH-5104 億 100000000 CJK hundred million
U+5146 CJK UNIFIED IDEOGRAPH-5146 兆 1000000000000 CJK trillion
U+516B CJK UNIFIED IDEOGRAPH-516B 八 8 CJK digit
U+516D CJK UNIFIED IDEOGRAPH-516D 六 6 CJK digit
U+5341 CJK UNIFIED IDEOGRAPH-5341 十 10 CJK ten
U+5343 CJK UNIFIED IDEOGRAPH-5343 千 1000 CJK thousand
U+5344 CJK UNIFIED IDEOGRAPH-5344 卄 20 CJK multiple tens
U+5345 CJK UNIFIED IDEOGRAPH-5345 卅 30 CJK multiple tens
U+534C CJK UNIFIED IDEOGRAPH-534C 卌 40 CJK multiple tens
U+53C1 CJK UNIFIED IDEOGRAPH-53C1 叁 3 CJK digit
U+53C2 CJK UNIFIED IDEOGRAPH-53C2 参 3 CJK digit
U+53C3 CJK UNIFIED IDEOGRAPH-53C3 參 3 CJK digit
U+53C4 CJK UNIFIED IDEOGRAPH-53C4 叄 3 CJK digit
U+56DB CJK UNIFIED IDEOGRAPH-56DB 四 4 CJK digit
U+58F1 CJK UNIFIED IDEOGRAPH-58F1 壱 1 CJK digit
U+58F9 CJK UNIFIED IDEOGRAPH-58F9 壹 1 CJK digit
U+5EFF CJK UNIFIED IDEOGRAPH-5EFF 廿 20 CJK multiple tens
U+5F0C CJK UNIFIED IDEOGRAPH-5F0C 弌 1 CJK digit
U+5F0D CJK UNIFIED IDEOGRAPH-5F0D 弍 2 CJK digit
U+5F0E CJK UNIFIED IDEOGRAPH-5F0E 弎 3 CJK digit
U+5F10 CJK UNIFIED IDEOGRAPH-5F10 弐 2 CJK digit
U+62FE CJK UNIFIED IDEOGRAPH-62FE 拾 10 CJK ten
U+634C CJK UNIFIED IDEOGRAPH-634C 捌 8 CJK digit
U+67D2 CJK UNIFIED IDEOGRAPH-67D2 柒 7 CJK digit
U+6F06 CJK UNIFIED IDEOGRAPH-6F06 漆 7 CJK digit
U+7396 CJK UNIFIED IDEOGRAPH-7396 玖 9 CJK digit
U+767E CJK UNIFIED IDEOGRAPH-767E 百 100 CJK hundred
U+8086 CJK UNIFIED IDEOGRAPH-8086 肆 4 CJK digit
U+842C CJK UNIFIED IDEOGRAPH-842C 萬 10000 CJK ten thousand
U+8CAE CJK UNIFIED IDEOGRAPH-8CAE 貮 2 CJK digit
U+8CB3 CJK UNIFIED IDEOGRAPH-8CB3 貳 2 CJK digit
U+8CEA CJK UNIFIED IDEOGRAPH-8CEA 質 7 CJK digit
U+8D30 CJK UNIFIED IDEOGRAPH-8D30 贰 2 CJK digit
U+9621 CJK UNIFIED IDEOGRAPH-9621 阡 1000 CJK thousand
U+9646 CJK UNIFIED IDEOGRAPH-9646 陆 6 CJK digit
U+964C CJK UNIFIED IDEOGRAPH-964C 陌 100 CJK hundred
U+9678 CJK UNIFIED IDEOGRAPH-9678 陸 6 CJK digit
U+96F6 CJK UNIFIED IDEOGRAPH-96F6 零 0 CJK digit
U+FF10 FULLWIDTH DIGIT ZERO ０ 0 CJK digit
U+FF11 FULLWIDTH DIGIT ONE １ 1 CJK digit
U+FF12 FULLWIDTH DIGIT TWO ２ 2 CJK digit
U+FF13 FULLWIDTH DIGIT THREE ３ 3 CJK digit
U+FF14 FULLWIDTH DIGIT FOUR ４ 4 CJK digit
U+FF15 FULLWIDTH DIGIT FIVE ５ 5 CJK digit
U+FF16 FULLWIDTH DIGIT SIX ６ 6 CJK digit
U+FF17 FULLWIDTH DIGIT SEVEN ７ 7 CJK digit
U+FF18 FULLWIDTH DIGIT EIGHT ８ 8 CJK digit
U+FF19 FULLWIDTH DIGIT NINE ９ 9 CJK digit
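
This example is non-normative. One possible way to hold the table in a program is a mapping from code point to value and category. The sketch below contains only a small excerpt of the table, and the names Entry, TABLE, and lookup are assumptions made for this illustration; they are not defined by this document or by its data files.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical record for one table row: its value and its category."""
    value: int
    category: str


# Excerpt only: a handful of rows from the table, keyed by code point.
TABLE = {
    0x0037: Entry(7, "CJK digit"),
    0x3007: Entry(0, "CJK digit"),
    0x4E07: Entry(10000, "CJK ten thousand"),
    0x4EAC: Entry(10**16, "CJK ten quadrillion"),
    0x5104: Entry(10**8, "CJK hundred million"),
    0x5146: Entry(10**12, "CJK trillion"),
    0x5341: Entry(10, "CJK ten"),
    0x5343: Entry(1000, "CJK thousand"),
    0x5EFF: Entry(20, "CJK multiple tens"),
    0x767E: Entry(100, "CJK hundred"),
    0xFF19: Entry(9, "CJK digit"),
}


def lookup(char: str) -> Optional[Entry]:
    """Return the row for a single character, or None if it is not listed."""
    return TABLE.get(ord(char))
```

With this excerpt, lookup("廿") gives a value of 20 in the category CJK multiple tens, and a character outside the table gives None.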

2.2 Parsing CJK numbers

To parse a CJK number string, the implementation MUST run the following steps:

  1. Let input be a copy of string.
  2. If input is the empty string, return null and abort these steps.
  3. Set value to the result of applying the rules for parsing a large CJK number.
  4. If input is not the empty string, return null and abort these steps.
  5. Return value.

These steps return either a number or null. The null value represents an error.

The rules for parsing a large CJK number are as given in the following steps, which share the same input with the steps that invoke these steps:

  1. Let value be zero.
  2. If input starts with five or more CJK digits, remove them from input and run these substeps:
    1. Let digits be those characters.
    2. Replace each character in digits by the ASCII digit representing its value.
    3. Set v to the value obtained by interpreting digits as a decimal number.
  3. Otherwise, let v be the result of applying the rules for parsing a small CJK number.
  4. If v is null, return null and abort these steps.
  5. If input starts with a CJK ten quadrillion, remove it from input and run these substeps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  6. If input starts with a CJK trillion, remove it from input and run these substeps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  7. If input starts with a CJK hundred million, remove it from input and run these substeps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  8. If input starts with a CJK ten thousand, remove it from input and run these substeps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  9. Add v to value.
  10. Return value.
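
This example is non-normative. The arithmetic in steps 5 to 9 amounts to summing v × m for each large unit found and then adding the final small number. The sketch below shows only that accumulation, assuming the input has already been split into (v, m) pairs; the function name combine and its parameters are hypothetical.

```python
from typing import List, Tuple


def combine(groups: List[Tuple[int, int]], last: int = 0) -> int:
    """Sum v * m over the (v, m) pairs, then add the trailing small number."""
    value = 0  # step 1
    for v, m in groups:
        value += v * m  # substep 2 of steps 5 to 8
    # Step 9; last stays 0 when nothing follows the final large unit.
    return value + last


# 一億二千三百四十五万六千七百八十九 splits into 1 × 億, 2345 × 万, then 6789.
print(combine([(1, 10**8), (2345, 10**4)], 6789))  # 123456789
```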

The rules for parsing a small CJK number are as given in the following steps, which share the same input with the steps that invoke these steps:

  1. If input starts with two, three, or four CJK digits, remove them from input and run these substeps:
    1. Let digits be those characters.
    2. Replace each character in digits by the ASCII digit representing its value.
    3. Return the value obtained by interpreting digits as a decimal number and abort these steps.
  2. Let value be zero.
  3. If input starts with a CJK digit followed by a CJK thousand, remove them from input and run these substeps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK thousand.
    3. Add n × m to value.
  4. Otherwise, if input starts with a CJK thousand, remove it from input and add its value to value.
  5. If input starts with a CJK digit followed by a CJK hundred, remove them from input and run these substeps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK hundred.
    3. Add n × m to value.
  6. Otherwise, if input starts with a CJK hundred, remove it from input and add its value to value.
  7. If input starts with a CJK digit followed by a CJK ten, remove them from input and run these substeps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK ten.
    3. Add n × m to value.
  8. Otherwise, if input starts with a CJK ten, remove it from input and add its value to value.
  9. Otherwise, if input starts with a CJK multiple tens, remove it from input and add its value to value.
  10. If input starts with a CJK digit, remove it from input and add its value to value.
  11. If no character is removed by these steps, return null and abort these steps.
  12. Return value.
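
This example is non-normative. The three sets of steps in this section can be sketched as follows. This is a minimal illustration and not the reference implementation: the tables are a reduced excerpt of section 2.1, the names Cursor, leading_digits, positional, parse_small, parse_large, and parse_cjk_number are assumptions, and the sketch reads “two, three, or four CJK digits” as the length of the whole leading run of digits.

```python
from typing import Optional

# Assumption: a reduced excerpt of the section 2.1 table, enough to illustrate.
DIGITS = {}
for run in ("0123456789", "〇一二三四五六七八九", "０１２３４５６７８９"):
    DIGITS.update({ch: i for i, ch in enumerate(run)})
TENS = {"十": 10, "拾": 10}
MULTIPLE_TENS = {"廿": 20, "卅": 30, "卌": 40}
HUNDREDS = {"百": 100, "佰": 100}
THOUSANDS = {"千": 1000, "仟": 1000}
# Large units in the order the rules test them: 京, 兆, 億, then 万/萬.
LARGE_UNITS = [{"京": 10**16}, {"兆": 10**12}, {"億": 10**8},
               {"万": 10**4, "萬": 10**4}]


class Cursor:
    """Hypothetical holder for the input shared by all three sets of steps."""
    def __init__(self, text: str) -> None:
        self.rest = text


def leading_digits(text: str) -> int:
    """Length of the run of CJK digits at the start of text."""
    n = 0
    while n < len(text) and text[n] in DIGITS:
        n += 1
    return n


def positional(chars: str) -> int:
    """Interpret a run of CJK digits as a decimal number."""
    return int("".join(str(DIGITS[c]) for c in chars))


def parse_small(cur: Cursor) -> Optional[int]:
    n = leading_digits(cur.rest)
    if 2 <= n <= 4:  # step 1 (see the assumption stated in the lead-in)
        digits, cur.rest = cur.rest[:n], cur.rest[n:]
        return positional(digits)
    value, removed = 0, False
    for unit in (THOUSANDS, HUNDREDS, TENS):  # steps 3 to 9
        s = cur.rest
        if len(s) >= 2 and s[0] in DIGITS and s[1] in unit:
            value += DIGITS[s[0]] * unit[s[1]]
            cur.rest, removed = s[2:], True
        elif s[:1] in unit:
            value += unit[s[0]]
            cur.rest, removed = s[1:], True
        elif unit is TENS and s[:1] in MULTIPLE_TENS:  # step 9
            value += MULTIPLE_TENS[s[0]]
            cur.rest, removed = s[1:], True
    if cur.rest[:1] in DIGITS:  # step 10
        value += DIGITS[cur.rest[0]]
        cur.rest, removed = cur.rest[1:], True
    return value if removed else None  # steps 11 and 12


def parse_large(cur: Cursor) -> Optional[int]:
    value = 0
    n = leading_digits(cur.rest)
    if n >= 5:  # step 2
        digits, cur.rest = cur.rest[:n], cur.rest[n:]
        v = positional(digits)
    else:  # step 3
        v = parse_small(cur)
    if v is None:
        return None
    for unit in LARGE_UNITS:  # steps 5 to 8
        if cur.rest[:1] in unit:
            m = unit[cur.rest[0]]
            cur.rest = cur.rest[1:]
            value += v * m
            v = parse_small(cur)
            if v is None:
                return value
    return value + v  # steps 9 and 10


def parse_cjk_number(string: str) -> Optional[int]:
    if string == "":
        return None
    cur = Cursor(string)
    value = parse_large(cur)
    if cur.rest != "":  # leftover input is an error
        return None
    return value
```

With these tables, parse_cjk_number("廿三") returns 23, parse_cjk_number("12万") returns 120000, and parse_cjk_number("万") returns None because no small number precedes the unit.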

Data files

This section is non-normative.

Two JSON files are available: a data file containing the values of the characters defined in this document, and a test data file for the algorithm to parse a CJK number string (documentation).

There is an implementation: perl-number-cjk.

Author

This document is written by Wakaba <wakaba@suikawiki.org>.

This document is developed as part of the manakai project.

Per CC0, to the extent possible under law, the author has waived all copyright and related or neighboring rights to this work.