Numbers

The manakai project, 18 August 2023

Latest version
https://manakai.github.io/spec-numbers/
Version history
https://github.com/manakai/spec-numbers/commits/gh-pages

Abstract

This document defines a CJK number (漢数字) parsing algorithm.

Table of contents

  1 Terminology
  2 CJK numbers
    2.1 Digits
    2.2 Parsing CJK numbers
    2.3 Serializing CJK numbers
  Data files
  Author

1 Terminology

This specification depends on the Infra Standard.

The terms string, length, and concatenate are defined by the Infra Standard.

The operators ×, /, and % are defined by the Encoding Standard.

The term serialize an integer is defined by the Fetch Standard.

The empty string is a string whose length is zero (0).

To append a string s to a string variable v, set v to the concatenation of « v, s ».

2 CJK numbers

This section defines an algorithm to parse CJK numbers (漢数字).

2.1 Digits

Digits and other characters used in CJK numbers are defined by the following table. Each character in the table has its value shown in the value column of the same row and belongs to the categories shown in the categories column of the same row. If the value column's content is "-", the character has no value.

The value of a character is a non-negative integer.

The categories are: CJK digit, CJK zero, CJK non-zero digit, CJK ten, CJK multiple tens, CJK hundred, CJK multiple hundreds, CJK thousand, CJK ten thousand, CJK hundred million, CJK trillion, CJK ten quadrillion, CJK and separator, CJK digit group separator, and CJK decimal separator.

Code point Name Character Value Categories
U+0020 SPACE - CJK digit group separator
U+002C COMMA , - CJK digit group separator
U+002E FULL STOP . - CJK decimal separator
U+0030 DIGIT ZERO 0 0 CJK digit, CJK zero
U+0031 DIGIT ONE 1 1 CJK digit, CJK non-zero digit
U+0032 DIGIT TWO 2 2 CJK digit, CJK non-zero digit
U+0033 DIGIT THREE 3 3 CJK digit, CJK non-zero digit
U+0034 DIGIT FOUR 4 4 CJK digit, CJK non-zero digit
U+0035 DIGIT FIVE 5 5 CJK digit, CJK non-zero digit
U+0036 DIGIT SIX 6 6 CJK digit, CJK non-zero digit
U+0037 DIGIT SEVEN 7 7 CJK digit, CJK non-zero digit
U+0038 DIGIT EIGHT 8 8 CJK digit, CJK non-zero digit
U+0039 DIGIT NINE 9 9 CJK digit, CJK non-zero digit
U+00A0 NO-BREAK SPACE   - CJK digit group separator
U+00B7 MIDDLE DOT · - CJK digit group separator
U+2009 THIN SPACE - CJK digit group separator
U+202F NARROW NO-BREAK SPACE - CJK digit group separator
U+3007 IDEOGRAPHIC NUMBER ZERO 0 CJK digit, CJK zero
U+30FB KATAKANA MIDDLE DOT - CJK decimal separator
U+4E00 CJK UNIFIED IDEOGRAPH-4E00 1 CJK digit, CJK non-zero digit
U+4E03 CJK UNIFIED IDEOGRAPH-4E03 7 CJK digit, CJK non-zero digit
U+4E07 CJK UNIFIED IDEOGRAPH-4E07 10000 CJK ten thousand
U+4E09 CJK UNIFIED IDEOGRAPH-4E09 3 CJK digit, CJK non-zero digit
U+4E17 CJK UNIFIED IDEOGRAPH-4E17 30 CJK multiple tens
U+4E5D CJK UNIFIED IDEOGRAPH-4E5D 9 CJK digit, CJK non-zero digit
U+4E8C CJK UNIFIED IDEOGRAPH-4E8C 2 CJK digit, CJK non-zero digit
U+4E94 CJK UNIFIED IDEOGRAPH-4E94 5 CJK digit, CJK non-zero digit
U+4E96 CJK UNIFIED IDEOGRAPH-4E96 4 CJK digit, CJK non-zero digit
U+4EAC CJK UNIFIED IDEOGRAPH-4EAC 10000000000000000 CJK ten quadrillion
U+4EBF CJK UNIFIED IDEOGRAPH-4EBF 亿 100000000 CJK hundred million
U+4EDF CJK UNIFIED IDEOGRAPH-4EDF 1000 CJK thousand
U+4F0D CJK UNIFIED IDEOGRAPH-4F0D 5 CJK digit, CJK non-zero digit
U+4F70 CJK UNIFIED IDEOGRAPH-4F70 100 CJK hundred
U+5104 CJK UNIFIED IDEOGRAPH-5104 100000000 CJK hundred million
U+5146 CJK UNIFIED IDEOGRAPH-5146 1000000000000 CJK trillion
U+516B CJK UNIFIED IDEOGRAPH-516B 8 CJK digit, CJK non-zero digit
U+516D CJK UNIFIED IDEOGRAPH-516D 6 CJK digit, CJK non-zero digit
U+5341 CJK UNIFIED IDEOGRAPH-5341 10 CJK ten
U+5343 CJK UNIFIED IDEOGRAPH-5343 1000 CJK thousand
U+5344 CJK UNIFIED IDEOGRAPH-5344 20 CJK multiple tens
U+5345 CJK UNIFIED IDEOGRAPH-5345 30 CJK multiple tens
U+534C CJK UNIFIED IDEOGRAPH-534C 40 CJK multiple tens
U+53C1 CJK UNIFIED IDEOGRAPH-53C1 3 CJK digit, CJK non-zero digit
U+53C2 CJK UNIFIED IDEOGRAPH-53C2 3 CJK digit, CJK non-zero digit
U+53C3 CJK UNIFIED IDEOGRAPH-53C3 3 CJK digit, CJK non-zero digit
U+53C4 CJK UNIFIED IDEOGRAPH-53C4 3 CJK digit, CJK non-zero digit
U+56DB CJK UNIFIED IDEOGRAPH-56DB 4 CJK digit, CJK non-zero digit
U+58F1 CJK UNIFIED IDEOGRAPH-58F1 1 CJK digit, CJK non-zero digit
U+58F9 CJK UNIFIED IDEOGRAPH-58F9 1 CJK digit, CJK non-zero digit
U+5EFE CJK UNIFIED IDEOGRAPH-5EFE 20 CJK multiple tens
U+5EFF CJK UNIFIED IDEOGRAPH-5EFF 廿 20 CJK multiple tens
U+5F0C CJK UNIFIED IDEOGRAPH-5F0C 1 CJK digit, CJK non-zero digit
U+5F0D CJK UNIFIED IDEOGRAPH-5F0D 2 CJK digit, CJK non-zero digit
U+5F0E CJK UNIFIED IDEOGRAPH-5F0E 3 CJK digit, CJK non-zero digit
U+5F10 CJK UNIFIED IDEOGRAPH-5F10 2 CJK digit, CJK non-zero digit
U+62FE CJK UNIFIED IDEOGRAPH-62FE 10 CJK ten
U+634C CJK UNIFIED IDEOGRAPH-634C 8 CJK digit, CJK non-zero digit
U+6709 CJK UNIFIED IDEOGRAPH-6709 - CJK and separator
U+67D2 CJK UNIFIED IDEOGRAPH-67D2 7 CJK digit, CJK non-zero digit
U+6F06 CJK UNIFIED IDEOGRAPH-6F06 7 CJK digit, CJK non-zero digit
U+7396 CJK UNIFIED IDEOGRAPH-7396 9 CJK digit, CJK non-zero digit
U+767E CJK UNIFIED IDEOGRAPH-767E 100 CJK hundred
U+7695 CJK UNIFIED IDEOGRAPH-7695 200 CJK multiple hundreds
U+8086 CJK UNIFIED IDEOGRAPH-8086 4 CJK digit, CJK non-zero digit
U+842C CJK UNIFIED IDEOGRAPH-842C 10000 CJK ten thousand
U+8CAE CJK UNIFIED IDEOGRAPH-8CAE 2 CJK digit, CJK non-zero digit
U+8CB3 CJK UNIFIED IDEOGRAPH-8CB3 2 CJK digit, CJK non-zero digit
U+8CEA CJK UNIFIED IDEOGRAPH-8CEA 7 CJK digit, CJK non-zero digit
U+8D30 CJK UNIFIED IDEOGRAPH-8D30 2 CJK digit, CJK non-zero digit
U+9621 CJK UNIFIED IDEOGRAPH-9621 1000 CJK thousand
U+9646 CJK UNIFIED IDEOGRAPH-9646 6 CJK digit, CJK non-zero digit
U+964C CJK UNIFIED IDEOGRAPH-964C 100 CJK hundred
U+9678 CJK UNIFIED IDEOGRAPH-9678 6 CJK digit, CJK non-zero digit
U+96F6 CJK UNIFIED IDEOGRAPH-96F6 0 CJK digit, CJK zero
U+FF0C FULLWIDTH COMMA - CJK digit group separator
U+FF0E FULLWIDTH FULL STOP - CJK decimal separator
U+FF10 FULLWIDTH DIGIT ZERO 0 CJK digit, CJK zero
U+FF11 FULLWIDTH DIGIT ONE 1 CJK digit, CJK non-zero digit
U+FF12 FULLWIDTH DIGIT TWO 2 CJK digit, CJK non-zero digit
U+FF13 FULLWIDTH DIGIT THREE 3 CJK digit, CJK non-zero digit
U+FF14 FULLWIDTH DIGIT FOUR 4 CJK digit, CJK non-zero digit
U+FF15 FULLWIDTH DIGIT FIVE 5 CJK digit, CJK non-zero digit
U+FF16 FULLWIDTH DIGIT SIX 6 CJK digit, CJK non-zero digit
U+FF17 FULLWIDTH DIGIT SEVEN 7 CJK digit, CJK non-zero digit
U+FF18 FULLWIDTH DIGIT EIGHT 8 CJK digit, CJK non-zero digit
U+FF19 FULLWIDTH DIGIT NINE 9 CJK digit, CJK non-zero digit
U+2099C CJK UNIFIED IDEOGRAPH-2099C 𠦜 40 CJK multiple tens
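
This example is non-normative. The table above can be held as a lookup from character to value and categories. The following is a minimal sketch covering only a small subset of the rows; the names Entry, TABLE, value_of, and in_category are hypothetical and do not appear in this specification.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional


class Entry(NamedTuple):
    value: Optional[int]          # None stands for "-" in the value column
    categories: FrozenSet[str]


def _e(value: Optional[int], *categories: str) -> Entry:
    return Entry(value, frozenset(categories))


# Subset of the table, for illustration only.
TABLE: Dict[str, Entry] = {
    "\u002C": _e(None, "CJK digit group separator"),
    "\u002E": _e(None, "CJK decimal separator"),
    "\u3007": _e(0, "CJK digit", "CJK zero"),
    "\u4E09": _e(3, "CJK digit", "CJK non-zero digit"),
    "\u5341": _e(10, "CJK ten"),
    "\u5EFF": _e(20, "CJK multiple tens"),
    "\u767E": _e(100, "CJK hundred"),
    "\u5343": _e(1000, "CJK thousand"),
    "\u4E07": _e(10000, "CJK ten thousand"),
    "\u5104": _e(100000000, "CJK hundred million"),
    "\u6709": _e(None, "CJK and separator"),
}

# ASCII digits (U+0030..U+0039) and fullwidth digits (U+FF10..U+FF19).
for i in range(10):
    cats = ("CJK digit", "CJK zero") if i == 0 else ("CJK digit", "CJK non-zero digit")
    TABLE[chr(0x30 + i)] = _e(i, *cats)
    TABLE[chr(0xFF10 + i)] = _e(i, *cats)


def value_of(ch: str) -> Optional[int]:
    """Value of a character, or None if it has no value or is not in the table."""
    entry = TABLE.get(ch)
    return entry.value if entry else None


def in_category(ch: str, category: str) -> bool:
    """Whether the character belongs to the named category."""
    entry = TABLE.get(ch)
    return entry is not None and category in entry.categories
```

A complete implementation would load every row, for example from the JSON data file mentioned in the Data files section.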

2.2 Parsing CJK numbers

To parse a CJK number string, the implementation MUST run these steps:

  1. Let input be a copy of string.
  2. If input is the empty string, return null and abort these steps.
  3. Let value be the result of applying the rules for parsing a large CJK number.
  4. If input is not the empty string, return null and abort these steps.
  5. Return value.

These steps return either a number or null. The null value represents an error.

Running the steps to parse a CJK number with 三十五 returns 35 while running with 四万五万 returns null.
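
This example is non-normative. The outer steps can be sketched as a thin wrapper that rejects empty input and any input left unconsumed. In the sketch below, the rules for parsing a large CJK number are passed in as a function returning the value and the unconsumed input; that signature, and the stand-in ascii_only_large, are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: remaining input in, (value or None, unconsumed input) out.
LargeParser = Callable[[str], Tuple[Optional[int], str]]


def parse_cjk_number(string: str, parse_large: LargeParser) -> Optional[int]:
    text = string                      # step 1: Python strings are immutable
    if text == "":                     # step 2
        return None
    value, text = parse_large(text)    # step 3
    if text != "":                     # step 4: trailing input is an error
        return None
    return value                       # step 5 (None if the inner rules failed)


def ascii_only_large(text: str) -> Tuple[Optional[int], str]:
    """Stand-in for the large-number rules: consumes leading ASCII digits only."""
    n = 0
    while n < len(text) and text[n] in "0123456789":
        n += 1
    return (int(text[:n]) if n else None), text[n:]
```

With the stand-in, "123" yields 123, while "" and "123x" both yield None, mirroring the two error cases in the steps.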

The rules for parsing a large CJK number are these steps, which share the same input with the steps that invoke these steps:

  1. Let value be zero.
  2. Let large digits flag be false.
  3. If input starts with four or more CJK digits, or one or more CJK digits followed by one or more groups of a CJK digit group separator followed by three CJK digits:
    1. Let digits be the found substring.
    2. Remove them from input.
    3. Remove any CJK digit group separator from digits.
    4. Replace each character in digits by the ASCII digit representing its value.
    5. Let v be the value obtained by interpreting digits as a decimal number.
    6. Set large digits flag to true.
  4. Otherwise, let v be the result of applying the rules for parsing a small CJK number.
  5. If v is null, return null and abort these steps.
  6. If input starts with a CJK decimal separator followed by one or more CJK digits:
    1. Let digits be the found substring.
    2. Remove them from input.
    3. Replace the first character of digits with "0." (a U+0030 DIGIT ZERO character followed by a U+002E FULL STOP character).
    4. Replace each other character in digits by the ASCII digit representing its value.
    5. Add the value obtained by interpreting digits as a decimal number to v.
  7. If input starts with a CJK ten quadrillion, remove it from input and run these steps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  8. If input starts with a CJK trillion, remove it from input and run these steps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  9. If input starts with a CJK hundred million, remove it from input and run these steps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  10. If input starts with a CJK hundred followed by a CJK ten thousand:
    1. Let m be the value of the first character of input.
    2. Let n be the value of the second character of input.
    3. Remove the first two characters from input.
    4. Add v × m × n to value.
    5. Set v to the result of applying the rules for parsing a small CJK number.
    6. If v is null, return value.
  11. Otherwise, if input starts with a CJK ten thousand, remove it from input and run these steps:
    1. Let m be the value of the removed character.
    2. Add v × m to value.
    3. Set v to the result of applying the rules for parsing a small CJK number.
    4. If v is null, return value.
  12. If large digits flag is true and input starts with a CJK thousand followed by zero, one, two, or three CJK digits:
    1. Let m be the value of the first character of input.
    2. Let digits be the found CJK digits.
    3. Remove them from input.
    4. Add v × m to value.
    5. Replace each character in digits by the ASCII digit representing its value.
    6. If digits is the empty string, set v to 0. Otherwise, set v to the value obtained by interpreting digits as a decimal number.
  13. Add v to value.
  14. Return value.
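
This example is non-normative. Steps 3 and 6 above (the grouped digit run and the decimal fraction) can be sketched with regular expressions. The sketch uses only a subset of the table in section 2.1, tries the separator-grouped form before the plain run of four or more digits (an assumption, since the steps do not state a preference), and uses hypothetical names throughout. Decimal is used so that the fraction is exact.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Hypothetical subset of the section 2.1 table.
DIGIT_VALUES = {"〇": 0, "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
DIGIT_VALUES.update({str(i): i for i in range(10)})               # ASCII digits
DIGIT_VALUES.update({chr(0xFF10 + i): i for i in range(10)})      # fullwidth digits
GROUP_SEPARATORS = " ,\u00A0\u00B7\u2009\u202F\uFF0C"
DECIMAL_SEPARATORS = ".\u30FB\uFF0E"

D = "[" + re.escape("".join(DIGIT_VALUES)) + "]"
S = "[" + re.escape(GROUP_SEPARATORS) + "]"
P = "[" + re.escape(DECIMAL_SEPARATORS) + "]"

# Step 3 condition: digits with groups of separator + three digits, or 4+ digits.
RUN = re.compile(f"{D}+(?:{S}{D}{{3}})+|{D}{{4,}}")
# Step 6 condition: a decimal separator followed by one or more digits.
FRACTION = re.compile(f"{P}{D}+")


def to_ascii(chars: str) -> str:
    """Map each digit to its ASCII form, dropping separators."""
    return "".join(str(DIGIT_VALUES[c]) for c in chars if c in DIGIT_VALUES)


def take_digit_run(text: str) -> Tuple[Optional[int], str]:
    """Step 3: return (value, rest), or (None, text) if the condition fails."""
    m = RUN.match(text)
    if not m:
        return None, text
    return int(to_ascii(m.group())), text[m.end():]


def take_fraction(text: str) -> Tuple[Decimal, str]:
    """Step 6: return (fraction to add to v, rest); the fraction is 0 if absent."""
    m = FRACTION.match(text)
    if not m:
        return Decimal(0), text
    return Decimal("0." + to_ascii(m.group()[1:])), text[m.end():]
```

For instance, take_digit_run applied to "1,234,567万" consumes the digits and leaves "万" for step 11, which then multiplies the parsed value by 10000.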

The rules for parsing a small CJK number are these steps, which share the same input with the steps that invoke these steps:

  1. If input starts with: ... then:
    1. Let digits be the found substring.
    2. Remove them from input.
    3. Remove any CJK digit group separator from digits.
    4. Replace each character in digits by the ASCII digit representing its value.
    5. Return the value obtained by interpreting digits as a decimal number and abort these steps.
  2. Let value be zero.
  3. Let thousand flag be false.
  4. Let removed flag be false.
  5. If input starts with a CJK digit followed by a CJK thousand, remove them from input and run these steps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK thousand.
    3. Add n × m to value.
    4. Set thousand flag to true.
    5. Set removed flag to true.
  6. Otherwise, if input starts with a CJK thousand:
    1. Remove it from input and add the value of it to value.
    2. Set thousand flag to true.
    3. Set removed flag to true.
  7. Otherwise, if input starts with a CJK zero followed by a CJK digit:
    1. Remove the CJK zero from input.
  8. If input starts with a CJK digit followed by a CJK hundred, remove them from input and run these steps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK hundred.
    3. Add n × m to value.
    4. Set thousand flag to false.
    5. Set removed flag to true.
  9. Otherwise, if input starts with a CJK hundred or CJK multiple hundreds:
    1. Remove it from input and add the value of it to value.
    2. Set thousand flag to false.
    3. Set removed flag to true.
  10. Otherwise, if input starts with a CJK zero followed by a CJK digit:
    1. Remove the CJK zero from input.
    2. Set thousand flag to false.
  11. If input starts with a CJK digit followed by a CJK ten, remove them from input and run these steps:
    1. Let n be the value of the CJK digit.
    2. Let m be the value of the CJK ten.
    3. Add n × m to value.
    4. Set thousand flag to false.
    5. Set removed flag to true.
  12. Otherwise, if input starts with a CJK ten or CJK multiple tens:
    1. Remove it from input and add the value of it to value.
    2. Set thousand flag to false.
    3. Set removed flag to true.
  13. Otherwise, if input starts with a CJK zero followed by a CJK digit:
    1. Remove the CJK zero from input.
    2. Set thousand flag to false.
  14. If thousand flag is true and input starts with two or three CJK digits:
    1. Let digits be the found CJK digits.
    2. Remove them from input.
    3. Replace each character in digits by the ASCII digit representing its value.
    4. Add the value obtained by interpreting digits as a decimal number to value.
    5. Set removed flag to true.
  15. Otherwise, if input starts with a CJK digit, or if removed flag is true and input starts with a CJK and separator followed by a CJK digit:
    1. Let digit be the CJK digit.
    2. Remove the CJK and separator, if any, and digit from input.
    3. Add digit's value to value.
  16. If removed flag is false, return null and abort these steps.
  17. Return value.
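
This example is non-normative. Steps 2 to 17 above can be sketched as three similar "digit and unit, bare unit, or skipped zero" chains followed by the trailing-digit steps. Step 1 is omitted because its condition is not reproduced here. The sketch uses a subset of the table in section 2.1, takes the longest run of up to three digits in step 14, adds the value of the digit (not of input) in step 15, and removes the CJK and separator together with the digit; those readings, and all names, are assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical subset of the section 2.1 table.
DIGITS: Dict[str, int] = {"〇": 0, "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
DIGITS.update({str(i): i for i in range(10)})
ZEROS = {c for c, v in DIGITS.items() if v == 0}
THOUSAND = {"千": 1000, "仟": 1000}
HUNDRED = {"百": 100, "佰": 100}
MULTIPLE_HUNDREDS = {"皕": 200}
TEN = {"十": 10, "拾": 10}
MULTIPLE_TENS = {"廿": 20, "卅": 30, "卌": 40}
AND_SEPARATORS = {"有"}


def _chain(s: str, value: int, unit: Dict[str, int],
           bare: Dict[str, int]) -> Tuple[str, int, bool, bool]:
    """One chain; returns (rest, value, unit_matched, zero_skipped)."""
    if len(s) >= 2 and s[0] in DIGITS and s[1] in unit:
        return s[2:], value + DIGITS[s[0]] * unit[s[1]], True, False
    if s and s[0] in bare:
        return s[1:], value + bare[s[0]], True, False
    if len(s) >= 2 and s[0] in ZEROS and s[1] in DIGITS:
        return s[1:], value, False, True
    return s, value, False, False


def parse_small(text: str) -> Tuple[Optional[int], str]:
    """Steps 2 to 17 only; returns (value or None, unconsumed input)."""
    s, value = text, 0
    # Steps 5 to 7: thousands.
    s, value, hit, _ = _chain(s, value, THOUSAND, THOUSAND)
    thousand, removed = hit, hit
    # Steps 8 to 10: hundreds.
    s, value, hit, skipped = _chain(s, value, HUNDRED, {**HUNDRED, **MULTIPLE_HUNDREDS})
    if hit or skipped:
        thousand = False
    removed = removed or hit
    # Steps 11 to 13: tens.
    s, value, hit, skipped = _chain(s, value, TEN, {**TEN, **MULTIPLE_TENS})
    if hit or skipped:
        thousand = False
    removed = removed or hit
    # Step 14: two or three plain digits directly after a thousand.
    run = 0
    while run < 3 and run < len(s) and s[run] in DIGITS:
        run += 1
    if thousand and run >= 2:
        value += int("".join(str(DIGITS[c]) for c in s[:run]))
        s = s[run:]
        removed = True
    # Step 15: a final digit, optionally introduced by an "and" separator.
    elif s and s[0] in DIGITS:
        value += DIGITS[s[0]]
        s = s[1:]
    elif removed and len(s) >= 2 and s[0] in AND_SEPARATORS and s[1] in DIGITS:
        value += DIGITS[s[1]]
        s = s[2:]
    # Steps 16 and 17.
    if not removed:
        return None, s
    return value, s
```

With this sketch, 三十五 gives 35 and 百有五 gives 105. A lone digit such as 五 gives null, because removed flag is never set in steps 2 to 17; such input is presumably handled by step 1, which the sketch leaves out.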

2.3 Serializing CJK numbers

To serialize a number in CJK-10000-grouped number string, with an integer number, the implementation MUST run these steps:

  1. Let string be the empty string.
  2. If number is less than zero (0):
    1. Append a U+2212 MINUS SIGN character (−) to string.
    2. Set number to number × −1.
  3. Let kei be ⌊ number / 10000000000000000 ⌋.
  4. If kei is greater than zero (0):
    1. Append the result of serializing kei to string.
    2. Append a U+4EAC CJK UNIFIED IDEOGRAPH-4EAC character (京) to string.
  5. Let chou be ⌊ (number % 10000000000000000) / 1000000000000 ⌋.
  6. If chou is greater than zero (0):
    1. Append the result of serializing chou to string.
    2. Append a U+5146 CJK UNIFIED IDEOGRAPH-5146 character (兆) to string.
  7. Let oku be ⌊ (number % 1000000000000) / 100000000 ⌋.
  8. If oku is greater than zero (0):
    1. Append the result of serializing oku to string.
    2. Append a U+5104 CJK UNIFIED IDEOGRAPH-5104 character (億) to string.
  9. Let man be ⌊ (number % 100000000) / 10000 ⌋.
  10. If man is greater than zero (0):
    1. Append the result of serializing man to string.
    2. Append a U+4E07 CJK UNIFIED IDEOGRAPH-4E07 character (万) to string.
  11. Let one be number % 10000.
  12. If one is greater than zero (0) or string is either the empty string or a U+2212 MINUS SIGN character (−):
    1. Append the result of serializing one to string.
  13. Return string.

Running the steps to serialize a number in CJK-10000-grouped number string with 1230567 returns 123万567.
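
This example is non-normative. The serialization steps can be sketched as repeated division by the four unit values. The unit characters 京, 兆, 億, and 万 are assumed from the variable names kei, chou, oku, and man and from the example above; the final step is taken to serialize one (the remainder below 10000), which is what the example requires. The function name is hypothetical.

```python
def serialize_cjk_10000_grouped(number: int) -> str:
    out = ""
    if number < 0:
        out += "\u2212"                     # U+2212 MINUS SIGN
        number = -number
    remaining = number
    # kei, chou, oku, man: quotient at each unit, remainder carried down.
    for unit, ch in ((10**16, "京"), (10**12, "兆"), (10**8, "億"), (10**4, "万")):
        quotient, remaining = divmod(remaining, unit)
        if quotient > 0:
            out += str(quotient) + ch
    one = remaining                         # number % 10000
    # Emit the last group if it is non-zero or nothing but a sign was written.
    if one > 0 or out in ("", "\u2212"):
        out += str(one)
    return out
```

Groups are not zero-padded, so 1230567 serializes as 123万567 rather than 123万0567, in line with the example above.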

Data files

This section is non-normative.

There is a JSON data file on values of characters defined in this document.

There are test data:

There is an implementation: perl-number-cjk.

Author

This document is written by Wakaba <wakaba@suikawiki.org>.

This document is developed as part of the manakai project.

Per CC0, to the extent possible under law, the author has waived all copyright and related or neighboring rights to this work.