This document defines a CJK number (漢数字) parsing algorithm.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words “MUST” and “MAY” in the normative parts of this document are to be interpreted as described in RFC 2119.
Requirements phrased in the imperative as part of algorithms (such as “strip any leading space characters” or “return false and abort these steps”) are to be interpreted with the meaning of the key word (e.g. “MUST”) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps MAY be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
This section defines an algorithm to parse CJK numbers (漢数字).
Digits and other characters used in CJK numbers are defined by the following table. Each character in the table has its value shown in the value column of the same row and belongs to the categories shown in the categories column of the same row. If the value column's content is "-", the character has no value.
The value of a character is a non-negative integer.
The following categories are defined: CJK digit, CJK zero, CJK non-zero digit, CJK ten, CJK multiple tens, CJK hundred, CJK multiple hundreds, CJK thousand, CJK ten thousand, CJK hundred million, CJK trillion, CJK ten quadrillion, CJK and separator, CJK digit group separator, and CJK decimal separator.
Code point | Name | Character | Value | Categories
---|---|---|---|---
U+0020 | SPACE | | - | CJK digit group separator
U+002C | COMMA | , | - | CJK digit group separator
U+002E | FULL STOP | . | - | CJK decimal separator
U+0030 | DIGIT ZERO | 0 | 0 | CJK digit, CJK zero
U+0031 | DIGIT ONE | 1 | 1 | CJK digit, CJK non-zero digit
U+0032 | DIGIT TWO | 2 | 2 | CJK digit, CJK non-zero digit
U+0033 | DIGIT THREE | 3 | 3 | CJK digit, CJK non-zero digit
U+0034 | DIGIT FOUR | 4 | 4 | CJK digit, CJK non-zero digit
U+0035 | DIGIT FIVE | 5 | 5 | CJK digit, CJK non-zero digit
U+0036 | DIGIT SIX | 6 | 6 | CJK digit, CJK non-zero digit
U+0037 | DIGIT SEVEN | 7 | 7 | CJK digit, CJK non-zero digit
U+0038 | DIGIT EIGHT | 8 | 8 | CJK digit, CJK non-zero digit
U+0039 | DIGIT NINE | 9 | 9 | CJK digit, CJK non-zero digit
U+00A0 | NO-BREAK SPACE | | - | CJK digit group separator
U+00B7 | MIDDLE DOT | · | - | CJK digit group separator
U+2009 | THIN SPACE | | - | CJK digit group separator
U+202F | NARROW NO-BREAK SPACE | | - | CJK digit group separator
U+3007 | IDEOGRAPHIC NUMBER ZERO | 〇 | 0 | CJK digit, CJK zero
U+30FB | KATAKANA MIDDLE DOT | ・ | - | CJK decimal separator
U+4E00 | CJK UNIFIED IDEOGRAPH-4E00 | 一 | 1 | CJK digit, CJK non-zero digit
U+4E03 | CJK UNIFIED IDEOGRAPH-4E03 | 七 | 7 | CJK digit, CJK non-zero digit
U+4E07 | CJK UNIFIED IDEOGRAPH-4E07 | 万 | 10000 | CJK ten thousand
U+4E09 | CJK UNIFIED IDEOGRAPH-4E09 | 三 | 3 | CJK digit, CJK non-zero digit
U+4E17 | CJK UNIFIED IDEOGRAPH-4E17 | 丗 | 30 | CJK multiple tens
U+4E5D | CJK UNIFIED IDEOGRAPH-4E5D | 九 | 9 | CJK digit, CJK non-zero digit
U+4E8C | CJK UNIFIED IDEOGRAPH-4E8C | 二 | 2 | CJK digit, CJK non-zero digit
U+4E94 | CJK UNIFIED IDEOGRAPH-4E94 | 五 | 5 | CJK digit, CJK non-zero digit
U+4E96 | CJK UNIFIED IDEOGRAPH-4E96 | 亖 | 4 | CJK digit, CJK non-zero digit
U+4EAC | CJK UNIFIED IDEOGRAPH-4EAC | 京 | 10000000000000000 | CJK ten quadrillion
U+4EBF | CJK UNIFIED IDEOGRAPH-4EBF | 亿 | 100000000 | CJK hundred million
U+4EDF | CJK UNIFIED IDEOGRAPH-4EDF | 仟 | 1000 | CJK thousand
U+4F0D | CJK UNIFIED IDEOGRAPH-4F0D | 伍 | 5 | CJK digit, CJK non-zero digit
U+4F70 | CJK UNIFIED IDEOGRAPH-4F70 | 佰 | 100 | CJK hundred
U+5104 | CJK UNIFIED IDEOGRAPH-5104 | 億 | 100000000 | CJK hundred million
U+5146 | CJK UNIFIED IDEOGRAPH-5146 | 兆 | 1000000000000 | CJK trillion
U+516B | CJK UNIFIED IDEOGRAPH-516B | 八 | 8 | CJK digit, CJK non-zero digit
U+516D | CJK UNIFIED IDEOGRAPH-516D | 六 | 6 | CJK digit, CJK non-zero digit
U+5341 | CJK UNIFIED IDEOGRAPH-5341 | 十 | 10 | CJK ten
U+5343 | CJK UNIFIED IDEOGRAPH-5343 | 千 | 1000 | CJK thousand
U+5344 | CJK UNIFIED IDEOGRAPH-5344 | 卄 | 20 | CJK multiple tens
U+5345 | CJK UNIFIED IDEOGRAPH-5345 | 卅 | 30 | CJK multiple tens
U+534C | CJK UNIFIED IDEOGRAPH-534C | 卌 | 40 | CJK multiple tens
U+53C1 | CJK UNIFIED IDEOGRAPH-53C1 | 叁 | 3 | CJK digit, CJK non-zero digit
U+53C2 | CJK UNIFIED IDEOGRAPH-53C2 | 参 | 3 | CJK digit, CJK non-zero digit
U+53C3 | CJK UNIFIED IDEOGRAPH-53C3 | 參 | 3 | CJK digit, CJK non-zero digit
U+53C4 | CJK UNIFIED IDEOGRAPH-53C4 | 叄 | 3 | CJK digit, CJK non-zero digit
U+56DB | CJK UNIFIED IDEOGRAPH-56DB | 四 | 4 | CJK digit, CJK non-zero digit
U+58F1 | CJK UNIFIED IDEOGRAPH-58F1 | 壱 | 1 | CJK digit, CJK non-zero digit
U+58F9 | CJK UNIFIED IDEOGRAPH-58F9 | 壹 | 1 | CJK digit, CJK non-zero digit
U+5EFE | CJK UNIFIED IDEOGRAPH-5EFE | 廾 | 20 | CJK multiple tens
U+5EFF | CJK UNIFIED IDEOGRAPH-5EFF | 廿 | 20 | CJK multiple tens
U+5F0C | CJK UNIFIED IDEOGRAPH-5F0C | 弌 | 1 | CJK digit, CJK non-zero digit
U+5F0D | CJK UNIFIED IDEOGRAPH-5F0D | 弍 | 2 | CJK digit, CJK non-zero digit
U+5F0E | CJK UNIFIED IDEOGRAPH-5F0E | 弎 | 3 | CJK digit, CJK non-zero digit
U+5F10 | CJK UNIFIED IDEOGRAPH-5F10 | 弐 | 2 | CJK digit, CJK non-zero digit
U+62FE | CJK UNIFIED IDEOGRAPH-62FE | 拾 | 10 | CJK ten
U+634C | CJK UNIFIED IDEOGRAPH-634C | 捌 | 8 | CJK digit, CJK non-zero digit
U+6709 | CJK UNIFIED IDEOGRAPH-6709 | 有 | - | CJK and separator
U+67D2 | CJK UNIFIED IDEOGRAPH-67D2 | 柒 | 7 | CJK digit, CJK non-zero digit
U+6F06 | CJK UNIFIED IDEOGRAPH-6F06 | 漆 | 7 | CJK digit, CJK non-zero digit
U+7396 | CJK UNIFIED IDEOGRAPH-7396 | 玖 | 9 | CJK digit, CJK non-zero digit
U+767E | CJK UNIFIED IDEOGRAPH-767E | 百 | 100 | CJK hundred
U+7695 | CJK UNIFIED IDEOGRAPH-7695 | 皕 | 200 | CJK multiple hundreds
U+8086 | CJK UNIFIED IDEOGRAPH-8086 | 肆 | 4 | CJK digit, CJK non-zero digit
U+842C | CJK UNIFIED IDEOGRAPH-842C | 萬 | 10000 | CJK ten thousand
U+8CAE | CJK UNIFIED IDEOGRAPH-8CAE | 貮 | 2 | CJK digit, CJK non-zero digit
U+8CB3 | CJK UNIFIED IDEOGRAPH-8CB3 | 貳 | 2 | CJK digit, CJK non-zero digit
U+8CEA | CJK UNIFIED IDEOGRAPH-8CEA | 質 | 7 | CJK digit, CJK non-zero digit
U+8D30 | CJK UNIFIED IDEOGRAPH-8D30 | 贰 | 2 | CJK digit, CJK non-zero digit
U+9621 | CJK UNIFIED IDEOGRAPH-9621 | 阡 | 1000 | CJK thousand
U+9646 | CJK UNIFIED IDEOGRAPH-9646 | 陆 | 6 | CJK digit, CJK non-zero digit
U+964C | CJK UNIFIED IDEOGRAPH-964C | 陌 | 100 | CJK hundred
U+9678 | CJK UNIFIED IDEOGRAPH-9678 | 陸 | 6 | CJK digit, CJK non-zero digit
U+96F6 | CJK UNIFIED IDEOGRAPH-96F6 | 零 | 0 | CJK digit, CJK zero
U+FF0C | FULLWIDTH COMMA | ， | - | CJK digit group separator
U+FF0E | FULLWIDTH FULL STOP | ． | - | CJK decimal separator
U+FF10 | FULLWIDTH DIGIT ZERO | ０ | 0 | CJK digit, CJK zero
U+FF11 | FULLWIDTH DIGIT ONE | １ | 1 | CJK digit, CJK non-zero digit
U+FF12 | FULLWIDTH DIGIT TWO | ２ | 2 | CJK digit, CJK non-zero digit
U+FF13 | FULLWIDTH DIGIT THREE | ３ | 3 | CJK digit, CJK non-zero digit
U+FF14 | FULLWIDTH DIGIT FOUR | ４ | 4 | CJK digit, CJK non-zero digit
U+FF15 | FULLWIDTH DIGIT FIVE | ５ | 5 | CJK digit, CJK non-zero digit
U+FF16 | FULLWIDTH DIGIT SIX | ６ | 6 | CJK digit, CJK non-zero digit
U+FF17 | FULLWIDTH DIGIT SEVEN | ７ | 7 | CJK digit, CJK non-zero digit
U+FF18 | FULLWIDTH DIGIT EIGHT | ８ | 8 | CJK digit, CJK non-zero digit
U+FF19 | FULLWIDTH DIGIT NINE | ９ | 9 | CJK digit, CJK non-zero digit
U+2099C | CJK UNIFIED IDEOGRAPH-2099C | 𠦜 | 40 | CJK multiple tens
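As a non-normative illustration, the table can be held in memory as a mapping from character to a (value, categories) pair. The sketch below is one possible representation, not something this document prescribes: the names `CJKChar`, `CJK_TABLE`, `value_of`, and `has_category` are hypothetical, and only a handful of rows are copied from the table to keep the example short.

```python
from typing import NamedTuple, Optional


class CJKChar(NamedTuple):
    """One table row (hypothetical type, for illustration only)."""
    value: Optional[int]      # None stands for "-" in the value column
    categories: frozenset     # category names exactly as listed in the table


def _entry(value, *categories):
    return CJKChar(value, frozenset(categories))


# Excerpt only: a few rows copied from the table, keyed by character.
CJK_TABLE = {
    "\u0020": _entry(None, "CJK digit group separator"),          # SPACE
    "\u002E": _entry(None, "CJK decimal separator"),              # FULL STOP
    "\u0030": _entry(0, "CJK digit", "CJK zero"),                 # 0
    "\u3007": _entry(0, "CJK digit", "CJK zero"),                 # 〇
    "\u4E00": _entry(1, "CJK digit", "CJK non-zero digit"),       # 一
    "\u4E07": _entry(10000, "CJK ten thousand"),                  # 万
    "\u4EAC": _entry(10**16, "CJK ten quadrillion"),              # 京
    "\u5341": _entry(10, "CJK ten"),                              # 十
    "\u5345": _entry(30, "CJK multiple tens"),                    # 卅
    "\u6709": _entry(None, "CJK and separator"),                  # 有
    "\u7695": _entry(200, "CJK multiple hundreds"),               # 皕
    "\U0002099C": _entry(40, "CJK multiple tens"),                # 𠦜
}


def value_of(ch: str) -> Optional[int]:
    """Return the value of ch, or None if it has no value or is not in the excerpt."""
    entry = CJK_TABLE.get(ch)
    return entry.value if entry is not None else None


def has_category(ch: str, category: str) -> bool:
    """Return True if ch is in the excerpt and belongs to the given category."""
    entry = CJK_TABLE.get(ch)
    return entry is not None and category in entry.categories
```

Values are stored as Python integers, which are arbitrary precision, so the value of 京 (10000000000000000) is represented exactly. In a language with fixed-width or floating-point numbers, that row would need a type wide enough to hold it.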
To parse a CJK number string, the implementation MUST run the following steps:
These steps return either a number or null. The null value represents an error.
Running the steps to parse a CJK number with 三十五 returns 35, while running them with 四万五万 returns null.
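The sketch below is non-normative and is not the algorithm defined by this specification. It is a deliberately simplified illustration, assuming a small subset of the table (the digits 一 to 九, the multipliers 十, 百, 千, and the large multipliers 万, 億, 兆, 京) and assuming that multipliers have to appear in strictly decreasing order. It ignores zeros, separators, variant characters, and decimal parts. The function names `parse_small` and `parse_cjk_number` are hypothetical. Under these assumptions it agrees with the two examples above.

```python
from typing import Optional

# Subset of the character table, chosen for this sketch only.
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12, "京": 10**16}


def parse_small(text: str) -> Optional[int]:
    """Parse a value below 10000 such as 三十五; return None on error."""
    total = 0
    pending = None        # digit seen but not yet combined with a unit
    last_unit = 10**4     # assumption: units must strictly decrease
    for ch in text:
        if ch in DIGITS:
            if pending is not None:
                return None          # two digits in a row are not handled here
            pending = DIGITS[ch]
        elif ch in SMALL_UNITS:
            unit = SMALL_UNITS[ch]
            if unit >= last_unit:
                return None          # e.g. 十百 or 十十
            total += (pending if pending is not None else 1) * unit
            pending = None
            last_unit = unit
        else:
            return None              # character outside the subset
    if pending is not None:
        total += pending             # trailing ones digit
    return total


def parse_cjk_number(text: str) -> Optional[int]:
    """Parse text as a number, or return None (standing in for null)."""
    if not text:
        return None
    total = 0
    group = ""            # characters collected since the last large unit
    last_unit = None
    for ch in text:
        if ch in LARGE_UNITS:
            unit = LARGE_UNITS[ch]
            if last_unit is not None and unit >= last_unit:
                return None          # e.g. 四万五万: 万 appears twice
            small = parse_small(group) if group else None
            if small is None:
                return None
            total += small * unit
            group = ""
            last_unit = unit
        else:
            group += ch
    if group:
        small = parse_small(group)
        if small is None:
            return None
        total += small
    return total


print(parse_cjk_number("三十五"))    # 35
print(parse_cjk_number("四万五万"))  # None
```

Splitting the work into a routine for values below 10000 and a routine for the large multipliers mirrors the way this document separates small and large CJK numbers, but the details of each routine here are assumptions made for the illustration.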
The rules for parsing a large CJK number are as given in the following steps, which share the same input with the steps that invoke these steps:
The rules for parsing a small CJK number are as given in the following steps, which share the same input with the steps that invoke these steps:
This section is non-normative.
Two JSON files accompany this document: a data file containing the values of the characters defined here, and a test data file for the parse a CJK number algorithm (documentation).
There is an implementation: perl-number-cjk.
This document is written by Wakaba <wakaba@suikawiki.org>.
This document is developed as part of the manakai project.
Per CC0, to the extent possible under law, the author has waived all copyright and related or neighboring rights to this work.