The following table describes the Unicode script ranges that IDOL Server identifies.
Script | Begin | End |
---|---|---|
Arabic | U+0600 | U+06FF |
BasicLatin | U+0000 | U+007F |
Bengali | U+0981 | U+09FB |
Burmese | U+1000 | U+109F |
CJKComp | U+3300 | U+33FF |
CJKComp | U+2F00 | U+2FDF |
CJKComp | U+FE30 | U+FE4F |
CJKCompIdeo | U+F900 | U+FAFF |
CJKCompIdeo | U+2F800 | U+2FA1F |
CJKRadicalsSup | U+2E80 | U+2EFF |
CJKRadicalsSup | U+3000 | U+303F |
CJKRadicalsSup | U+31C0 | U+31EF |
CJKUnifIdeo | U+4E00 | U+9FFF |
CJKUnifIdeo | U+20000 | U+2A6D6 |
CJKUnifIdeo | U+2A700 | U+2B73F |
CJKUnifIdeo | U+2B740 | U+2B81F |
CJKUnifIdeo | U+3200 | U+32FF |
CJKUnifIdeo | U+2FF0 | U+2FFF |
CJKUnifIdeoExtA | U+3400 | U+4DBF |
Cyrillic | U+0400 | U+04FF |
Cyrillic | U+0500 | U+052F |
Cyrillic | U+2DE0 | U+2DFF |
Cyrillic | U+A640 | U+A69F |
Devanagari | U+0901 | U+097F |
Ethiopic | U+1200 | U+1399 |
Georgian | U+10A0 | U+10FF |
GreekAndCoptic | U+0370 | U+03FF |
GreekAndCoptic | U+1F00 | U+1FFF |
Gujarati | U+0A81 | U+0AF1 |
Hangul | U+AC00 | U+D7A3 |
Hangul | U+1100 | U+11FF |
Hangul | U+3130 | U+318F |
Hangul | U+A960 | U+A97F |
Hangul | U+D7B0 | U+D7FF |
Hebrew | U+0590 | U+05FF |
Hiragana | U+3040 | U+309F |
Kannada | U+0C82 | U+0CF2 |
Katakana | U+30A0 | U+30FF |
Katakana | U+31F0 | U+31FF |
Lao | U+0E81 | U+0EDF |
Latin1Sup | U+0080 | U+00FF |
LatinExtA | U+0100 | U+017F |
LatinExtB | U+0180 | U+024F |
Malayalam | U+0D02 | U+0D7F |
Mongolian | U+1800 | U+18AA |
OrientalMisc | U+3105 | U+312C |
OrientalMisc | U+31A0 | U+31BF |
OrientalMisc | U+3190 | U+319F |
OrientalMisc | U+4DC0 | U+4DFF |
Oriya | U+0B01 | U+0B77 |
Sinhala | U+0D82 | U+0DF4 |
Tamil | U+0B82 | U+0BFA |
Telugu | U+0C01 | U+0C7F |
Thai | U+0E01 | U+0E5B |
Tibetan | U+0F00 | U+0FDA |
Vietnamese | U+1EA0 | U+1EF9 |
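
As an illustration of how a table like this can be used, the following is a minimal sketch of a code point lookup. It is not IDOL Server's implementation: the names `SCRIPT_RANGES` and `script_of` are hypothetical, and only a subset of the rows above is included to keep the example short.

```python
from bisect import bisect_right

# A subset of the rows in the table, as inclusive (begin, end, script) tuples.
# Sorted by begin so that a binary search can find the candidate range.
SCRIPT_RANGES = sorted([
    (0x0000, 0x007F, "BasicLatin"),
    (0x0080, 0x00FF, "Latin1Sup"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJKUnifIdeo"),
    (0xAC00, 0xD7A3, "Hangul"),
    (0x20000, 0x2A6D6, "CJKUnifIdeo"),
])
_STARTS = [begin for begin, _, _ in SCRIPT_RANGES]


def script_of(char):
    """Return the script name for one character, or None if no range matches."""
    code_point = ord(char)
    # Find the last range whose begin is <= code_point.
    index = bisect_right(_STARTS, code_point) - 1
    if index < 0:
        return None
    begin, end, name = SCRIPT_RANGES[index]
    return name if begin <= code_point <= end else None
```

Because the ranges do not overlap, checking the single candidate found by the binary search is enough. A script such as CJKUnifIdeo that spans several ranges simply appears in more than one tuple.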
When processing text, IDOL Server identifies the script range that a character belongs to. In some cases, the script range can determine how that part of the text is processed. For example, when a language has NGramOrientalOnly set to True in the configuration, IDOL Server only produces NGrams from words that consist entirely of characters that belong to one of the following Chinese, Japanese, and Korean script ranges:
- CJKUnifIdeo
- CJKUnifIdeoExtA
- CJKCompIdeo
- CJKComp
- CJKRadicalsSup
- Hiragana
- Katakana
- Hangul
- OrientalMisc
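
The rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not IDOL Server's actual logic: the helper names (`is_oriental_word`, `tokens_for`), the N-gram length of 2, and the choice to keep non-qualifying words whole are all hypothetical. The ranges are copied from the table for the listed scripts.

```python
# Inclusive code point ranges for the Chinese, Japanese, and Korean scripts
# listed above, taken from the script range table.
ORIENTAL_RANGES = [
    (0x4E00, 0x9FFF), (0x20000, 0x2A6D6), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x3200, 0x32FF), (0x2FF0, 0x2FFF),   # CJKUnifIdeo
    (0x3400, 0x4DBF),                                         # CJKUnifIdeoExtA
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),                     # CJKCompIdeo
    (0x3300, 0x33FF), (0x2F00, 0x2FDF), (0xFE30, 0xFE4F),     # CJKComp
    (0x2E80, 0x2EFF), (0x3000, 0x303F), (0x31C0, 0x31EF),     # CJKRadicalsSup
    (0x3040, 0x309F),                                         # Hiragana
    (0x30A0, 0x30FF), (0x31F0, 0x31FF),                       # Katakana
    (0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F),
    (0xA960, 0xA97F), (0xD7B0, 0xD7FF),                       # Hangul
    (0x3105, 0x312C), (0x31A0, 0x31BF), (0x3190, 0x319F),
    (0x4DC0, 0x4DFF),                                         # OrientalMisc
]


def is_oriental_char(char):
    """True if the character falls inside any of the listed ranges."""
    code_point = ord(char)
    return any(begin <= code_point <= end for begin, end in ORIENTAL_RANGES)


def is_oriental_word(word):
    """True only if the word is non-empty and every character qualifies."""
    return bool(word) and all(is_oriental_char(c) for c in word)


def tokens_for(word, n=2):
    """Assumed behavior: split qualifying words into n-grams, keep others whole."""
    if not is_oriental_word(word):
        return [word]
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    # A qualifying word shorter than n yields no n-grams, so keep it as-is.
    return grams or [word]
```

With these assumptions, a word made up entirely of CJK ideographs is split into overlapping pairs, while a word that mixes Latin and CJK characters is left whole because not every character belongs to one of the listed ranges.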