- Latest version
- https://manakai.github.io/spec-numbers/
- Version history
- https://github.com/manakai/spec-numbers/commits/gh-pages

This document defines a CJK number (漢数字) parsing algorithm.

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words “*MUST*” and “*MAY*” in the normative parts of this document are to be interpreted as
described in RFC 2119.

Requirements phrased in the imperative as part of algorithms (such
as “strip any leading space characters” or “return false and abort
these steps”) are to be interpreted with the meaning of the key word
(e.g. “*MUST*”) used in introducing the
algorithm.

Conformance requirements phrased as algorithms or specific steps
*MAY* be implemented in any manner, so long as
the end result is equivalent. (In particular, the algorithms defined
in this specification are intended to be easy to follow, and not
intended to be performant.)

This section defines an algorithm to parse CJK numbers (漢数字).

Digits used in CJK numbers are defined by the following table.
Each character in the table has its value shown in
the *value* column of the same row and belongs to the category
shown in the *category* column of the same row.

The value of a character is a non-negative integer.

There are the following categories: CJK digit, CJK ten, CJK multiple tens, CJK hundred, CJK thousand, CJK ten thousand, CJK hundred million, CJK trillion, and CJK ten quadrillion.

| Code point | Name | Character | Value | Category |
|---|---|---|---|---|
| `U+0030` | `DIGIT ZERO` | 0 | 0 | CJK digit |
| `U+0031` | `DIGIT ONE` | 1 | 1 | CJK digit |
| `U+0032` | `DIGIT TWO` | 2 | 2 | CJK digit |
| `U+0033` | `DIGIT THREE` | 3 | 3 | CJK digit |
| `U+0034` | `DIGIT FOUR` | 4 | 4 | CJK digit |
| `U+0035` | `DIGIT FIVE` | 5 | 5 | CJK digit |
| `U+0036` | `DIGIT SIX` | 6 | 6 | CJK digit |
| `U+0037` | `DIGIT SEVEN` | 7 | 7 | CJK digit |
| `U+0038` | `DIGIT EIGHT` | 8 | 8 | CJK digit |
| `U+0039` | `DIGIT NINE` | 9 | 9 | CJK digit |
| `U+3007` | `IDEOGRAPHIC NUMBER ZERO` | 〇 | 0 | CJK digit |
| `U+4E00` | `CJK UNIFIED IDEOGRAPH-4E00` | 一 | 1 | CJK digit |
| `U+4E03` | `CJK UNIFIED IDEOGRAPH-4E03` | 七 | 7 | CJK digit |
| `U+4E07` | `CJK UNIFIED IDEOGRAPH-4E07` | 万 | 10000 | CJK ten thousand |
| `U+4E09` | `CJK UNIFIED IDEOGRAPH-4E09` | 三 | 3 | CJK digit |
| `U+4E17` | `CJK UNIFIED IDEOGRAPH-4E17` | 丗 | 30 | CJK multiple tens |
| `U+4E5D` | `CJK UNIFIED IDEOGRAPH-4E5D` | 九 | 9 | CJK digit |
| `U+4E8C` | `CJK UNIFIED IDEOGRAPH-4E8C` | 二 | 2 | CJK digit |
| `U+4E94` | `CJK UNIFIED IDEOGRAPH-4E94` | 五 | 5 | CJK digit |
| `U+4E96` | `CJK UNIFIED IDEOGRAPH-4E96` | 亖 | 4 | CJK digit |
| `U+4EAC` | `CJK UNIFIED IDEOGRAPH-4EAC` | 京 | 10000000000000000 | CJK ten quadrillion |
| `U+4EDF` | `CJK UNIFIED IDEOGRAPH-4EDF` | 仟 | 1000 | CJK thousand |
| `U+4F0D` | `CJK UNIFIED IDEOGRAPH-4F0D` | 伍 | 5 | CJK digit |
| `U+4F70` | `CJK UNIFIED IDEOGRAPH-4F70` | 佰 | 100 | CJK hundred |
| `U+5104` | `CJK UNIFIED IDEOGRAPH-5104` | 億 | 100000000 | CJK hundred million |
| `U+5146` | `CJK UNIFIED IDEOGRAPH-5146` | 兆 | 1000000000000 | CJK trillion |
| `U+516B` | `CJK UNIFIED IDEOGRAPH-516B` | 八 | 8 | CJK digit |
| `U+516D` | `CJK UNIFIED IDEOGRAPH-516D` | 六 | 6 | CJK digit |
| `U+5341` | `CJK UNIFIED IDEOGRAPH-5341` | 十 | 10 | CJK ten |
| `U+5343` | `CJK UNIFIED IDEOGRAPH-5343` | 千 | 1000 | CJK thousand |
| `U+5344` | `CJK UNIFIED IDEOGRAPH-5344` | 卄 | 20 | CJK multiple tens |
| `U+5345` | `CJK UNIFIED IDEOGRAPH-5345` | 卅 | 30 | CJK multiple tens |
| `U+534C` | `CJK UNIFIED IDEOGRAPH-534C` | 卌 | 40 | CJK multiple tens |
| `U+53C1` | `CJK UNIFIED IDEOGRAPH-53C1` | 叁 | 3 | CJK digit |
| `U+53C2` | `CJK UNIFIED IDEOGRAPH-53C2` | 参 | 3 | CJK digit |
| `U+53C3` | `CJK UNIFIED IDEOGRAPH-53C3` | 參 | 3 | CJK digit |
| `U+53C4` | `CJK UNIFIED IDEOGRAPH-53C4` | 叄 | 3 | CJK digit |
| `U+56DB` | `CJK UNIFIED IDEOGRAPH-56DB` | 四 | 4 | CJK digit |
| `U+58F1` | `CJK UNIFIED IDEOGRAPH-58F1` | 壱 | 1 | CJK digit |
| `U+58F9` | `CJK UNIFIED IDEOGRAPH-58F9` | 壹 | 1 | CJK digit |
| `U+5EFF` | `CJK UNIFIED IDEOGRAPH-5EFF` | 廿 | 20 | CJK multiple tens |
| `U+5F0C` | `CJK UNIFIED IDEOGRAPH-5F0C` | 弌 | 1 | CJK digit |
| `U+5F0D` | `CJK UNIFIED IDEOGRAPH-5F0D` | 弍 | 2 | CJK digit |
| `U+5F0E` | `CJK UNIFIED IDEOGRAPH-5F0E` | 弎 | 3 | CJK digit |
| `U+5F10` | `CJK UNIFIED IDEOGRAPH-5F10` | 弐 | 2 | CJK digit |
| `U+62FE` | `CJK UNIFIED IDEOGRAPH-62FE` | 拾 | 10 | CJK ten |
| `U+634C` | `CJK UNIFIED IDEOGRAPH-634C` | 捌 | 8 | CJK digit |
| `U+67D2` | `CJK UNIFIED IDEOGRAPH-67D2` | 柒 | 7 | CJK digit |
| `U+6F06` | `CJK UNIFIED IDEOGRAPH-6F06` | 漆 | 7 | CJK digit |
| `U+7396` | `CJK UNIFIED IDEOGRAPH-7396` | 玖 | 9 | CJK digit |
| `U+767E` | `CJK UNIFIED IDEOGRAPH-767E` | 百 | 100 | CJK hundred |
| `U+8086` | `CJK UNIFIED IDEOGRAPH-8086` | 肆 | 4 | CJK digit |
| `U+842C` | `CJK UNIFIED IDEOGRAPH-842C` | 萬 | 10000 | CJK ten thousand |
| `U+8CAE` | `CJK UNIFIED IDEOGRAPH-8CAE` | 貮 | 2 | CJK digit |
| `U+8CB3` | `CJK UNIFIED IDEOGRAPH-8CB3` | 貳 | 2 | CJK digit |
| `U+8CEA` | `CJK UNIFIED IDEOGRAPH-8CEA` | 質 | 7 | CJK digit |
| `U+8D30` | `CJK UNIFIED IDEOGRAPH-8D30` | 贰 | 2 | CJK digit |
| `U+9621` | `CJK UNIFIED IDEOGRAPH-9621` | 阡 | 1000 | CJK thousand |
| `U+9646` | `CJK UNIFIED IDEOGRAPH-9646` | 陆 | 6 | CJK digit |
| `U+964C` | `CJK UNIFIED IDEOGRAPH-964C` | 陌 | 100 | CJK hundred |
| `U+9678` | `CJK UNIFIED IDEOGRAPH-9678` | 陸 | 6 | CJK digit |
| `U+96F6` | `CJK UNIFIED IDEOGRAPH-96F6` | 零 | 0 | CJK digit |
| `U+FF10` | `FULLWIDTH DIGIT ZERO` | ０ | 0 | CJK digit |
| `U+FF11` | `FULLWIDTH DIGIT ONE` | １ | 1 | CJK digit |
| `U+FF12` | `FULLWIDTH DIGIT TWO` | ２ | 2 | CJK digit |
| `U+FF13` | `FULLWIDTH DIGIT THREE` | ３ | 3 | CJK digit |
| `U+FF14` | `FULLWIDTH DIGIT FOUR` | ４ | 4 | CJK digit |
| `U+FF15` | `FULLWIDTH DIGIT FIVE` | ５ | 5 | CJK digit |
| `U+FF16` | `FULLWIDTH DIGIT SIX` | ６ | 6 | CJK digit |
| `U+FF17` | `FULLWIDTH DIGIT SEVEN` | ７ | 7 | CJK digit |
| `U+FF18` | `FULLWIDTH DIGIT EIGHT` | ８ | 8 | CJK digit |
| `U+FF19` | `FULLWIDTH DIGIT NINE` | ９ | 9 | CJK digit |
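
As a non-normative illustration, the table above could be held in memory as a mapping from character to value and category. The sketch below covers only a handful of rows, and the names `Entry`, `CJK_NUMBER_TABLE`, and `lookup` are hypothetical, not part of this specification.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One table row: the character's value and its category."""
    value: int
    category: str


# Assumption: only a sample of the rows; a real table would list all of them.
CJK_NUMBER_TABLE = {
    "0": Entry(0, "CJK digit"),
    "〇": Entry(0, "CJK digit"),
    "七": Entry(7, "CJK digit"),
    "質": Entry(7, "CJK digit"),
    "十": Entry(10, "CJK ten"),
    "廿": Entry(20, "CJK multiple tens"),
    "百": Entry(100, "CJK hundred"),
    "千": Entry(1000, "CJK thousand"),
    "万": Entry(10000, "CJK ten thousand"),
    "億": Entry(100000000, "CJK hundred million"),
    "兆": Entry(1000000000000, "CJK trillion"),
    "京": Entry(10000000000000000, "CJK ten quadrillion"),
}


def lookup(char: str) -> Optional[Entry]:
    """Return the table entry for `char`, or None if it is not in the table."""
    return CJK_NUMBER_TABLE.get(char)
```

Several characters can share a value and category (for example 七 and 質), which is why the mapping is keyed by character rather than by value.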

To parse a CJK number `string`, the
implementation *MUST* run the following steps:

- Let `input` be a copy of `string`.
- If `input` is the empty string, return null and abort these steps.
- Set `value` to the result of applying the rules for parsing a large CJK number.
- If `input` is not the empty string, return null and abort these steps.
- Return `value`.

These steps return either a number or null. The null value represents an error.
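
The top-level steps can be sketched as follows. This is a minimal, non-normative sketch: the `Input` holder is a hypothetical way to model the `input` that nested rule sets share, and `toy_parse_large` is a stand-in that only consumes ASCII digits, not the rules for parsing a large CJK number.

```python
from typing import Callable, Optional


class Input:
    """Hypothetical mutable holder so nested rule sets share one `input`."""

    def __init__(self, text: str) -> None:
        self.text = text


def parse_cjk_number(
    string: str,
    parse_large: Callable[[Input], Optional[int]],
) -> Optional[int]:
    inp = Input(string)        # Let input be a copy of string.
    if inp.text == "":
        return None            # Empty input is an error.
    value = parse_large(inp)   # The rule set consumes characters from inp.
    if inp.text != "":
        return None            # Leftover characters are an error.
    return value


def toy_parse_large(inp: Input) -> Optional[int]:
    """Stand-in rule set (assumption): consumes leading ASCII digits only."""
    n = 0
    while n < len(inp.text) and inp.text[n] in "0123456789":
        n += 1
    if n == 0:
        return None
    value = int(inp.text[:n])
    inp.text = inp.text[n:]
    return value
```

With the stand-in, `parse_cjk_number("123", toy_parse_large)` returns 123, while `"12x"` and the empty string both yield `None`, mirroring the two error checks in the steps.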

The rules for parsing a large CJK number are as given in
the following steps, which share the same `input` with the
steps that invoke these steps:

- Let `value` be zero.
- If `input` starts with five or more CJK digits, remove them from `input` and run these substeps:
  - Let `digits` be those characters.
  - Replace each character in `digits` by the ASCII digit representing its value.
  - Set `v` to the value obtained by interpreting `digits` as a decimal number.
- Otherwise, let `v` be the result of applying the rules for parsing a small CJK number.
- If `v` is null, return null and abort these steps.
- If `input` starts with a CJK ten quadrillion, remove it from `input` and run these substeps:
  - Let `m` be the value of the removed character.
  - Add `v` × `m` to `value`.
  - Set `v` to the result of applying the rules for parsing a small CJK number.
  - If `v` is null, return `value`.
- If `input` starts with a CJK trillion, remove it from `input` and run these substeps:
  - Let `m` be the value of the removed character.
  - Add `v` × `m` to `value`.
  - Set `v` to the result of applying the rules for parsing a small CJK number.
  - If `v` is null, return `value`.
- If `input` starts with a CJK hundred million, remove it from `input` and run these substeps:
  - Let `m` be the value of the removed character.
  - Add `v` × `m` to `value`.
  - Set `v` to the result of applying the rules for parsing a small CJK number.
  - If `v` is null, return `value`.
- If `input` starts with a CJK ten thousand, remove it from `input` and run these substeps:
  - Let `m` be the value of the removed character.
  - Add `v` × `m` to `value`.
  - Set `v` to the result of applying the rules for parsing a small CJK number.
  - If `v` is null, return `value`.
- Add `v` to `value`.
- Return `value`.
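
The digit-run substeps above (replace each CJK digit by its ASCII digit, then read the result as a decimal number) can be illustrated with a short, non-normative sketch. The `DIGIT_VALUES` mapping holds only a sample of the CJK digit rows, and `digit_run_value` is a hypothetical helper name.

```python
# Assumption: a small sample of the CJK digit rows from the table.
DIGIT_VALUES = {
    "〇": 0, "零": 0, "一": 1, "壱": 1, "二": 2, "弐": 2, "三": 3, "参": 3,
    "四": 4, "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
}
# ASCII and fullwidth digits carry their usual values.
DIGIT_VALUES.update({chr(0x30 + i): i for i in range(10)})
DIGIT_VALUES.update({chr(0xFF10 + i): i for i in range(10)})


def digit_run_value(digits: str) -> int:
    """Interpret a run of CJK digits positionally, as a decimal number."""
    # Replace each character by the ASCII digit representing its value.
    ascii_digits = "".join(str(DIGIT_VALUES[c]) for c in digits)
    # Interpret the ASCII digits as a decimal number.
    return int(ascii_digits)
```

For example, `digit_run_value("二〇二四五")` returns 20245. Leading zeros do not change the result, so `"零零七七七"` yields 777.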

The rules for parsing a small CJK number are as given in
the following steps, which share the same `input` with the
steps that invoke these steps:

- If `input` starts with two, three, or four CJK digits, remove them from `input` and run these substeps:
  - Let `digits` be those characters.
  - Replace each character in `digits` by the ASCII digit representing its value.
  - Return the value obtained by interpreting `digits` as a decimal number and abort the entire set of steps.
- Let `value` be zero.
- If `input` starts with a CJK digit followed by a CJK thousand, remove them from `input` and run these substeps:
  - Let `n` be the value of the CJK digit.
  - Let `m` be the value of the CJK thousand.
  - Add `n` × `m` to `value`.
- Otherwise, if `input` starts with a CJK thousand, remove it from `input` and add its value to `value`.
- If `input` starts with a CJK digit followed by a CJK hundred, remove them from `input` and run these substeps:
  - Let `n` be the value of the CJK digit.
  - Let `m` be the value of the CJK hundred.
  - Add `n` × `m` to `value`.
- Otherwise, if `input` starts with a CJK hundred, remove it from `input` and add its value to `value`.
- If `input` starts with a CJK digit followed by a CJK ten, remove them from `input` and run these substeps:
  - Let `n` be the value of the CJK digit.
  - Let `m` be the value of the CJK ten.
  - Add `n` × `m` to `value`.
- Otherwise, if `input` starts with a CJK ten, remove it from `input` and add its value to `value`.
- Otherwise, if `input` starts with a CJK multiple tens, remove it from `input` and add its value to `value`.
- If `input` starts with a CJK digit, remove it from `input` and add its value to `value`.
- If no character is removed by these steps, return null and abort these steps.
- Return `value`.
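
Putting the three rule sets together, a compact parser might look like the following. This is a non-normative sketch under stated assumptions: it uses a reduced character table, the `Input` class and all function names are hypothetical, and when a small number starts with more than four digits it takes the first four, which is one reading of "starts with two, three, or four CJK digits".

```python
from typing import Optional

# Assumption: a reduced table; the full specification table has more rows.
DIGITS = {c: i for i, c in enumerate("〇一二三四五六七八九")}
DIGITS.update({str(i): i for i in range(10)})
THOUSANDS = {"千": 1000, "仟": 1000}
HUNDREDS = {"百": 100, "佰": 100}
TENS = {"十": 10, "拾": 10}
MULTIPLE_TENS = {"廿": 20, "卅": 30, "卌": 40}
# Large units in the order the steps test them: 京, 兆, 億, 万.
LARGE_UNITS = [{"京": 10**16}, {"兆": 10**12}, {"億": 10**8},
               {"万": 10**4, "萬": 10**4}]


class Input:
    """Hypothetical holder for the `input` shared by all rule sets."""

    def __init__(self, text: str) -> None:
        self.text = text

    def peek(self, offset: int = 0) -> str:
        return self.text[offset:offset + 1]  # "" past the end

    def remove(self, count: int) -> str:
        removed, self.text = self.text[:count], self.text[count:]
        return removed

    def digit_run(self) -> int:
        """Length of the run of CJK digits at the start of the input."""
        n = 0
        while self.peek(n) in DIGITS:
            n += 1
        return n

    def remove_digits(self, count: int) -> int:
        """Remove `count` digits and read them as a decimal number."""
        return int("".join(str(DIGITS[c]) for c in self.remove(count)))


def parse_small(inp: Input) -> Optional[int]:
    run = inp.digit_run()
    if run >= 2:
        # Assumption: take the longest prefix of at most four digits.
        return inp.remove_digits(min(run, 4))
    before = len(inp.text)
    value = 0
    for units in (THOUSANDS, HUNDREDS, TENS):
        if inp.peek() in DIGITS and inp.peek(1) in units:
            n, m = inp.remove(2)
            value += DIGITS[n] * units[m]
        elif inp.peek() in units:
            value += units[inp.remove(1)]
        elif units is TENS and inp.peek() in MULTIPLE_TENS:
            value += MULTIPLE_TENS[inp.remove(1)]
    if inp.peek() in DIGITS:
        value += DIGITS[inp.remove(1)]
    if len(inp.text) == before:
        return None  # no character was removed
    return value


def parse_large(inp: Input) -> Optional[int]:
    value = 0
    run = inp.digit_run()
    v = inp.remove_digits(run) if run >= 5 else parse_small(inp)
    if v is None:
        return None
    for units in LARGE_UNITS:
        if inp.peek() in units:
            value += v * units[inp.remove(1)]
            v = parse_small(inp)
            if v is None:
                return value
    return value + v


def parse_cjk_number(string: str) -> Optional[int]:
    inp = Input(string)
    if inp.text == "":
        return None
    value = parse_large(inp)
    if inp.text != "":
        return None
    return value
```

Under these assumptions, `parse_cjk_number("一億二千三百四十五万六千七百八十九")` returns 123456789 and `parse_cjk_number("廿三")` returns 23, while an input such as `"万"`, which begins with a unit and no preceding number, returns `None`.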

*This section is non-normative.*

A JSON data file of the character values defined in this document and a JSON test data file for the algorithm to parse a CJK number (documentation) are available.

There is an implementation: perl-number-cjk.

This document is written by Wakaba <wakaba@suikawiki.org>.

This document is developed as part of the manakai project.

Per CC0, to the extent possible under law, the author has waived all copyright and related or neighboring rights to this work.