Chinese Input Processors Chain
Introduction
An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language.
IP Chain Setup
The following graph displays the Input Processors Chain for Chinese; each Input Processor is described further in the following sections.
```mermaid
graph TD
  subgraph ips [ ]
    tokenizer[Chinese Tokenizer] --> annotator
    annotator[Chinese Annotator] --> number
    number[Chinese Numbers] --> annotation
    annotation[System Annotation] --> languagedetect
    languagedetect[Language Detector] --> predict
  end
  input([User Input]) --User Gives Input--> tokenizer
  predict[Predict] --Parsed Input--> parsed([To Dialog Processing])
  classDef ip_optional stroke-dasharray:5,5;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;
```
Chinese Simplifier
The Chinese Simplifier is a special kind of processor that is used to normalize the user input by:
- converting full width Latin letters and Arabic digits into their half width version, and
- lowercasing uppercase Latin letters.
This Simplifier is special because it is not run as part of the Input Processors chain, but rather by the Tokenizer when it puts the tokens into a Teneo data structure. Additionally, the Simplifier is also run by the condition parser inside the Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
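As an illustration of this normalization, below is a minimal Python sketch; the function name `simplify` is an assumption for illustration, not the actual Teneo API. The sketch maps the whole full-width ASCII block, which subsumes the full-width Latin letters and Arabic digits the Simplifier targets.

```python
# Minimal sketch of the Chinese Simplifier's normalization (illustrative only).

def simplify(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:      # full-width ASCII block
            ch = chr(code - 0xFEE0)       # shift to the half-width equivalent
        elif ch == "\u3000":              # ideographic (full-width) space
            ch = " "
        out.append(ch.lower())            # lowercase Latin letters
    return "".join(out)

print(simplify("ＴＥＮＥＯ１２３"))  # -> "teneo123"
```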
Input Processors
Chinese Tokenizer
The Chinese Tokenizer is the first of the input processors to run on Chinese user inputs; it essentially does two things: first, it converts traditional Chinese characters into simplified ones, and second, it tokenizes the converted user input and generates sentences based on the tokens.
The conversion of traditional characters into simplified characters is done by a one-to-one character mapping; after the conversion to simplified Chinese, the user input is segmented into words and sentences.
The Chinese Tokenizer segments the user input into words using a statistical model and a user dictionary; the words specified in the user dictionary are guaranteed to be segmented as such by the Tokenizer.
The user dictionary has a static component, which is specified in the configuration of the Tokenizer, and a dynamic component, which is collected from the language objects defined in a user solution that have syntaxes of type DICTEXT_word_POStag.
The Chinese Tokenizer exposes the Part-of-Speech (POS) tags generated during segmentation as annotations, mapping them according to a configuration file that maps Panda-generated POS tags to annotation names, for example, NN=NN.POS. A list of characters which should not be output as tokens, e.g., punctuation, brackets, etc., is also defined in the Chinese Tokenizer.
The last step in the tokenization process is the splitting of the user input tokens into sentences based on a defined set of characters which mark sentence boundaries, for example 。!?… ! ? . The Chinese Tokenizer does not split decimal numbers around the decimal markers, but rather concatenates the split tokens into one; this makes it easier to identify and annotate decimal numbers later in the processing chain.
Note that expressions with a factor other than 万 or 亿 after the decimal point are not valid numbers and are therefore split instead of being concatenated together.
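The following Python sketch illustrates how such a re-concatenation step could look; the token representation, the decimal-marker set, and the function name are assumptions for illustration, not the actual Teneo implementation.

```python
# Illustrative sketch: re-join decimal numbers that segmentation has split apart.

def rejoin_decimals(tokens):
    """Concatenate sequences like ['5', '.', '5', '万'] into one number token."""
    result = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i].isdigit()
                and tokens[i + 1] in {".", "点"}   # assumed decimal markers
                and tokens[i + 2].isdigit()):
            joined = tokens[i] + tokens[i + 1] + tokens[i + 2]
            i += 3
            # keep a following factor 万 or 亿 with the number; other factors
            # after the decimal point stay split, as noted above
            if i < len(tokens) and tokens[i] in {"万", "亿"}:
                joined += tokens[i]
                i += 1
            result.append(joined)
        else:
            result.append(tokens[i])
            i += 1
    return result

print(rejoin_decimals(["5", ".", "5", "万"]))  # -> ['5.5万']
```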
Chinese Annotator
The Chinese Annotator includes a range of analyzers which treat specific morphological phenomena of Chinese. In general, three operations can be performed by the morphological analyzers:
- Annotation (addition of one or more morphological annotations)
- Change of the base form property
- Concatenation of multiple tokens.
The morphological analyzers are applied in a fixed order; the table below shows the current sequence of analyzers, along with the operations that are performed by them. In the following sections more details are provided for each individual analyzer.
 | Analyzer | Example | Annotation | Base form change | Concatenation |
---|---|---|---|---|---|
1. | VNotV Analyzer | 是-不-是 | Yes | Yes | Yes |
2. | Verb Analyzer | 吃-完,跑-上 | Yes | Yes | Yes |
3. | Reduplication Analyzer | 红-红 | Yes | Yes | Yes |
4. | Loc Analyzer | 桌子-上 | Yes | Yes | Yes |
5. | Aspect Analyzer | 吃 了, 坐 着 | Yes | No | No |
6. | Negation Analyzer | 不 吃 | Yes | No | No |
7. | SC Analyzer | 洗了一个澡 | Yes | Yes | No |
8. | Affix Analyzer | 我们, 标准化 | Yes | Yes | No |
VNotV Analyzer
The VNotV Analyzer concatenates and analyzes V-Not-V sequences.
In V-Not-V structures, the same verb occurs twice with a negation word (不, 没, 否) between the two occurrences:
- 你 去-不-去 买 东西?
Nǐ qù-bù-qù mǎi dōngxī?
You go-VNOTV.BU-go buy things
‘Do you go shopping?’
The V-Not-V structure has two uses: direct questions, as in the example above, and embedded questions:
- 我 不 知道 他 去-不-去 买 东西。
Wǒ bù zhīdào tā qù-bù-qù mǎi dōngxī.
I NEG know he go-VNOTV.BU-go buy things
‘I don’t know whether he goes shopping.’
In the case of bi-syllabic words, the second syllable of the first verb might be deleted:
- 你 喜(欢)-不-喜欢 买 东西 ?
Nǐ xǐ(huān)-bù-xǐhuān mǎi dōngxī?
You like-VNOTV.BU-like buy things
‘Do you like shopping?’
The VNotV Analyzer concatenates the three tokens. It assigns one structural annotation (prefixed with VNOTV) signaling the negation form. The base form of the resulting token is set to the full form of the verb. Thus, the third example above (with second-syllable deletion) is analyzed as follows:
- The three tokens are concatenated into one word 喜不喜欢
- This word gets the base form 喜欢
- Additionally, it gets the annotation VNOTV.BU.
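To make the merge concrete, here is a minimal Python sketch of the V-Not-V concatenation; the token representation, the function name and the negator map are illustrative assumptions, not the actual Teneo implementation.

```python
# Illustrative sketch of the V-Not-V merge, including second-syllable deletion.

NEGATORS = {"不": "BU", "没": "MEI", "否": "FOU"}

def analyze_vnotv(tokens):
    """Merge sequences like ['喜', '不', '喜欢'] into one annotated token."""
    out = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i + 1] in NEGATORS
                and (tokens[i + 2] == tokens[i]              # 去-不-去
                     or tokens[i + 2].startswith(tokens[i]))):  # 喜-不-喜欢
            word = tokens[i] + tokens[i + 1] + tokens[i + 2]
            out.append({
                "text": word,
                "base": tokens[i + 2],  # base form = full form of the verb
                "annotations": ["VNOTV." + NEGATORS[tokens[i + 1]]],
            })
            i += 3
        else:
            out.append({"text": tokens[i], "base": tokens[i], "annotations": []})
            i += 1
    return out

print(analyze_vnotv(["喜", "不", "喜欢"]))
# -> [{'text': '喜不喜欢', 'base': '喜欢', 'annotations': ['VNOTV.BU']}]
```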
Verb Analyzer
The Verb Analyzer performs analysis of resultative and directional compounds. Resultative compounds consist of one main verb and one resultative suffix:
- 小王 吃-完 了。
Xiǎowáng chī-wán le.
Xiaowang eat-RESULT ASPECT
‘Xiaowang finished eating.’
Directional compounds consist of one main verb and one or two directional suffixes:
- 阿明 跑-上 楼梯 了。
  Āmíng pǎo-shàng lóutī le.
  Aming run-DIR.NONDEICTIC.SHANG stairs ASPECT
  ‘Aming ran up the stairs.’
- 阿明 跑-上-去 了。
  Āmíng pǎo-shàng-qù le.
  Aming run-DIR.NONDEICTIC.SHANG-DIR.DEICTIC.QU ASPECT
  ‘Aming ran up.’
The combination of the main verb with the resultative/directional complements is concatenated. The base form of the resulting token is changed to the base form of the main verb. The token is assigned the annotations associated with the resultative/directional suffixes.
Annotations for resultative suffixes carry the prefix RESULT. Annotations for directional suffixes carry the prefix DIR. Additionally, we distinguish between deictic (DIR_DEICTIC…) and non-deictic (DIR_NONDEICTIC…) directional complements. Cases with two directional suffixes are limited to a non-deictic complement followed by a deictic complement.
Reduplication Analyzer
The Reduplication Analyzer analyzes reduplications of verbs, adjectives and adverbs:
- a. 红-红
  hóng-hóng
  red-red
  ‘very red’
- b. 讨论-讨论
  tǎolùn-tǎolùn
  discuss-discuss
  ‘to discuss a little’
Reduplication of adjectives manifests some variability in the distribution of the syllables. Specifically, some adjectives expose the following asymmetric reduplication patterns:
- a. AABB:
  干净 → 干-干-净-净
  gānjìng → gān-gān-jìng-jìng
  clean → clean-clean
  ‘clean’ → ‘very clean’
- b. ABB:
  雪白 → 雪-白-白
  xuěbái → xuě-bái-bái
  white → white-white
  ‘white’ → ‘very white’
- c. AAB:
  逛街 → 逛-逛-街
  guàngjiē → guàng-guàng-jiē
  walk street → walk-walk-street
  ‘walk street’ → ‘go window shopping’
In verbal reduplication, the particles 一 and 了 can occur between the two copies:
- 看-一-看
kàn-yī-kàn
look-one-look
‘to take a look’
If the two words are segmented in the original tokenization, they are concatenated by the Reduplication Analyzer. The reduplicated word gets the annotation REDUP.
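The sketch below illustrates the basic A-A and A-particle-A merges only (the asymmetric AABB/ABB/AAB patterns are omitted); the data shapes and names are illustrative assumptions.

```python
# Illustrative sketch of the reduplication merge for the basic patterns.

PARTICLES = {"一", "了"}  # particles that may occur between the two copies

def analyze_redup(tokens):
    """Concatenate A-A and A-particle-A sequences and tag them REDUP."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == tokens[i]:
            out.append({"text": tokens[i] * 2, "base": tokens[i],
                        "annotations": ["REDUP"]})
            i += 2
        elif (i + 2 < len(tokens) and tokens[i + 1] in PARTICLES
              and tokens[i + 2] == tokens[i]):
            out.append({"text": tokens[i] + tokens[i + 1] + tokens[i],
                        "base": tokens[i], "annotations": ["REDUP"]})
            i += 3
        else:
            out.append({"text": tokens[i], "base": tokens[i], "annotations": []})
            i += 1
    return out

print(analyze_redup(["看", "一", "看"]))
# -> [{'text': '看一看', 'base': '看', 'annotations': ['REDUP']}]
```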
Loc Analyzer
The Loc Analyzer performs concatenation and analysis of noun + localizer combinations; localizers follow nouns and ‘transform’ them into locative nouns:
- 桌子-上
table-LOC.ON.SHANG
‘on the table’
Localizers form a closed set; the table below shows the mapping from localizers to their annotations.
Form of localizer | Annotation |
---|---|
上 | LOC_ON_SHANG |
下 | LOC_UNDER_XIA |
里 | LOC_INSIDE_LI |
内 | LOC_INSIDE_NEI |
外 | LOC_OUTSIDE_WAI |
前 | LOC_BEFORE_QIAN |
后 | LOC_BEHIND_HOU |
旁 | LOC_NEXTTO_PANG |
中 | LOC_IN_ZHONG |
The Loc Analyzer concatenates the noun + localizer combination into one word and assigns it an annotation with the label of the localizer. The base form of the resulting token is set to the base form of the noun. Thus, the example above is analyzed as follows:
- The two tokens are concatenated into one word 桌子上.
- This word gets the base form 桌子.
- Additionally, it is assigned the annotation LOC_ON_SHANG.
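Below is a minimal Python sketch of this merge; the LOCALIZERS mapping mirrors the table above, while the data shapes and the function name are illustrative assumptions (in particular, the real analyzer only applies to tokens tagged as nouns).

```python
# Illustrative sketch of the noun + localizer concatenation.

LOCALIZERS = {
    "上": "LOC_ON_SHANG", "下": "LOC_UNDER_XIA", "里": "LOC_INSIDE_LI",
    "内": "LOC_INSIDE_NEI", "外": "LOC_OUTSIDE_WAI", "前": "LOC_BEFORE_QIAN",
    "后": "LOC_BEHIND_HOU", "旁": "LOC_NEXTTO_PANG", "中": "LOC_IN_ZHONG",
}

def analyze_loc(noun, localizer):
    """Merge a noun token and a localizer token into one locative noun."""
    if localizer not in LOCALIZERS:
        return None
    return {"text": noun + localizer, "base": noun,
            "annotations": [LOCALIZERS[localizer]]}

print(analyze_loc("桌子", "上"))
# -> {'text': '桌子上', 'base': '桌子', 'annotations': ['LOC_ON_SHANG']}
```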
Aspect Analyzer
The Aspect Analyzer analyzes aspect markers. Chinese has both pre-verbal and post-verbal aspect markers:
- a. 她 正在 吃。
  Tā zhèngzài chī.
  she ASPECT eat
  ‘She is eating.’
- b. 她 吃 了。
  Tā chī le.
  she eat ASPECT
  ‘She ate.’
Marker | Aspect | Annotation | Position |
---|---|---|---|
了 | Perfective | ASPECT_PERFECTIVE_LE | Postverbal |
着 | Progressive | ASPECT_PROGRESSIVE_ZHE | Postverbal |
过 | Experiential | ASPECT_EXPERIENTIAL_GUO | Postverbal |
在 | Progressive | ASPECT_PREVERBAL_PROGRESSIVE_ZAI | Preverbal |
正在 | Progressive | ASPECT_PREVERBAL_PROGRESSIVE_ZHENGZAI | Preverbal |
The set of aspect markers analyzed by the Aspect Analyzer is displayed in the above table.
The Aspect Analyzer attaches the respective annotation of the aspect marker to the main verb.
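A minimal sketch of this attachment follows; it assumes 正在 arrives as a single token and that postverbal markers annotate the immediately preceding token, while preverbal markers annotate the following one. All names and data shapes are illustrative assumptions.

```python
# Illustrative sketch: attach aspect annotations to the neighboring verb.

ASPECT = {
    "了": ("ASPECT_PERFECTIVE_LE", "post"),
    "着": ("ASPECT_PROGRESSIVE_ZHE", "post"),
    "过": ("ASPECT_EXPERIENTIAL_GUO", "post"),
    "在": ("ASPECT_PREVERBAL_PROGRESSIVE_ZAI", "pre"),
    "正在": ("ASPECT_PREVERBAL_PROGRESSIVE_ZHENGZAI", "pre"),
}

def annotate_aspect(tokens):
    """Return a dict mapping token index -> list of aspect annotation names."""
    annotations = {}
    for i, tok in enumerate(tokens):
        if tok in ASPECT:
            name, position = ASPECT[tok]
            target = i - 1 if position == "post" else i + 1
            if 0 <= target < len(tokens):
                annotations.setdefault(target, []).append(name)
    return annotations

print(annotate_aspect(["她", "吃", "了"]))  # -> {1: ['ASPECT_PERFECTIVE_LE']}
```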
Negation Analyzer
The Negation Analyzer analyzes negations of adverbs, verbs and adjectives. In these cases, the negation particle can immediately precede the negated word:
- a. 我 没 去。
  I NEG.MEI go
  ‘I didn’t go.’
- b. 不 容易
  NEG.BU easy
  ‘not easy’
- c. 不 太
  NEG.BU too
  ‘not too’
The negation particle can also be separated from the verb by additional material:
- 别 这么 做。
Bié zhème zuò.
NEG.BIE this do
‘Don’t do this.’
The set of currently analyzed negation words is shown in the below table.
Form of negator | Annotation |
---|---|
不 | NEG_BU |
否 | NEG_FOU |
没 | NEG_MEI, ASPECT_PERFECTIVE |
没有 | NEG_MEIYOU, ASPECT_PERFECTIVE |
别 | NEG_BIE, MODE_IMPERATIVE |
不太 | NEG_BUTAI |
并不 | NEG_BINGBU |
不怎么 | NEG_BUZENME |
Three of the negation particles (没, 没有, 别) have two annotations. Their second annotation contains aspectual or mode information that is implied by the particle. The Negation Analyzer attaches an annotation to the negated word. It contains the corresponding annotation of the negation particle as well as its index in the sentence. An additional annotation is attached to the negated word if the negation particle carries aspect or mode information.
For example, in example a above (我 没 去), the verb 去 is annotated with two annotations, {‘NEG_MEI’, 1} and ASPECT_PERFECTIVE.
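The following sketch mirrors this behavior: the two-element annotation value reflects the {‘NEG_MEI’, 1} example above. The NEGATION table mirrors the table above; everything else is an illustrative assumption.

```python
# Illustrative sketch: annotate a verb with the nearest preceding negation.

NEGATION = {
    "不": ["NEG_BU"], "否": ["NEG_FOU"],
    "没": ["NEG_MEI", "ASPECT_PERFECTIVE"],
    "没有": ["NEG_MEIYOU", "ASPECT_PERFECTIVE"],
    "别": ["NEG_BIE", "MODE_IMPERATIVE"],
    "不太": ["NEG_BUTAI"], "并不": ["NEG_BINGBU"], "不怎么": ["NEG_BUZENME"],
}

def annotate_negation(tokens, verb_index):
    """Scan backwards from the verb; the particle may be separated from it."""
    annotations = []
    for i in range(verb_index - 1, -1, -1):
        tags = NEGATION.get(tokens[i])
        if tags:
            annotations.append((tags[0], i))  # negation tag + particle index
            annotations.extend(tags[1:])      # implied aspect/mode, if any
            break
    return annotations

print(annotate_negation(["我", "没", "去"], 2))
# -> [('NEG_MEI', 1), 'ASPECT_PERFECTIVE']
```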
SC Analyzer
The SC Analyzer analyzes splitable compounds; splitable verb-object compounds (SCs) are verb-object combinations with an idiomatic meaning, e.g., 担-心 (worry+heart = ‘to worry’), 生-气 (create+air = ‘to get angry’), 见-面 (see+face = ‘to meet someone’). They allow for various kinds of syntactic activity between verb and object, e.g., insertion of aspect markers, additional objects, demonstratives, etc.:
- a. Aspect marker:
  我们 见- 了 -面
  we see- ASPECT -face
  ‘We met.’
- b. Additional object:
  帮- 她 一个 -忙
  help- she one -affair
  ‘to help her’
- c. Nominal modifier:
  见 他 的 面
  see- he DEG -face
  ‘to meet him’
The set of SCs is large and diverse. Although it is difficult to exhaustively enumerate all SCs, the most common instances are captured in a list of currently 163 compounds. Once the SC Analyzer identifies a verb of a splitable compound, it moves forward through the sentence looking for a valid SC object for this verb. At each subsequent word, it checks whether the sequence following the verb is still a valid splitting sequence. If it arrives at a suitable object before the sequence becomes invalid, it attaches an annotation to the verb. This annotation carries two pieces of information: the tag of the splitable compound (SPLIT_pinyin of compound) and the index of the dependent object. Further, the base form of the verb is set to the base form of the splitable compound.
Thus, in example a above (我们 见- 了 -面), the verb 见 is annotated with the annotation {SPLIT_JIANMIAN, 3}. Its base form is set to 见面.
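The forward search can be sketched as follows; the compound list and the validity check are tiny stand-ins for the real, much richer resources, and all names are illustrative assumptions.

```python
# Illustrative sketch of the SC Analyzer's forward search.

SPLITABLE = {"见": ("面", "SPLIT_JIANMIAN"), "担": ("心", "SPLIT_DANXIN")}
VALID_FILLERS = {"了", "一个", "他", "她", "的"}  # toy approximation

def analyze_sc(tokens):
    annotations = {}
    for i, tok in enumerate(tokens):
        if tok in SPLITABLE:
            obj, tag = SPLITABLE[tok]
            j = i + 1
            # advance while the sequence after the verb is still valid
            while j < len(tokens) and tokens[j] in VALID_FILLERS:
                j += 1
            if j < len(tokens) and tokens[j] == obj:
                annotations[i] = {"annotation": (tag, j),  # tag + object index
                                  "base": tok + obj}       # base form 见面
    return annotations

print(analyze_sc(["我们", "见", "了", "面"]))
# -> {1: {'annotation': ('SPLIT_JIANMIAN', 3), 'base': '见面'}}
```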
Affix Analyzer
The Affix Analyzer analyzes inflectional and derivational suffixes. Chinese has only one inflectional suffix, the plural suffix -们, which can be attached to human nouns and pronouns:
- a. 老师-们
  teacher-PLURAL
  ‘the teachers’
- b. 我-们
  me-PLURAL
  ‘we’
Additionally, Chinese has a set of derivational suffixes which change the part of speech of the word to which they are attached. For example, the suffix -者 is attached to verbs, and the resulting combination is a noun and denotes the actor of the base form verb:
- 使用-者
shǐyòng-zhě
use-ACTOR.ZHE
‘the user’
A suffixed word gets the corresponding annotation of its suffix, and the base form of the word is changed to the base form without the suffix. Thus, 使用者 in the example above is analyzed as follows:
- 使用者 gets the annotation ACTOR_ZHE
- 使用者 gets the base form 使用.
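A minimal sketch of suffix stripping follows; SUFFIXES is a small excerpt of the table below (the prefix 可- is omitted), and the function name and data shapes are illustrative assumptions.

```python
# Illustrative sketch of suffix stripping and annotation.

SUFFIXES = {"们": "PLURAL_MEN", "者": "ACTOR_ZHE", "化": "TRANSFORM_HUA"}

def analyze_affix(token):
    for suffix, tag in SUFFIXES.items():
        if token.endswith(suffix) and len(token) > len(suffix):
            return {"text": token,
                    "base": token[: -len(suffix)],  # base form without suffix
                    "annotations": [tag]}
    return {"text": token, "base": token, "annotations": []}

print(analyze_affix("使用者"))
# -> {'text': '使用者', 'base': '使用', 'annotations': ['ACTOR_ZHE']}
```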
The below table displays the set of tags used by the Affix Analyzer.
Form of affix | Annotation | Example |
---|---|---|
-于 | COMPARATIVE_YU | 高于 (两米) |
-度 | PROPERTY_DU | 精确度 |
-性 | PROPERTY_XING | 流线性 |
-化 | TRANSFORM_HUA | 现代化 |
-者 | ACTOR_ZHE | 使用者 |
-师 | ACTOR_SHI | 设计师 |
-员 | ACTOR_YUAN | 操作员 |
可- | ABILITY_KE | 可上升 |
-们 | PLURAL_MEN | 老师们 |
-城 | CITY_CHENG | 北京城 |
-市 | CITY_SHI | 上海市 |
-省 | PROVINCE_SHENG | 河北省 |
-儿 | RCOLORING_ERHUA | 好玩儿 |
-于 (word contains the suffix and has a base form of at least 2 characters) | PREP_YU | 致力于 |
Chinese Numbers
The Chinese Numbers Recognizer Input Processor simplifies writing syntaxes against numbers and numeric expressions in Teneo Studio solutions and provides the following functionalities:
- Normalization of tokens containing numeric values into Hindu-Arabic numerals
- Creation of a NUMBER annotation with a numericValue variable which has the type BigDecimal and contains a representation of the normalized number
- Creation of an annotation with the name of the normalized number value
- Annotation of inexact numbers (i.e., numbers containing the characters 几, 数, 余 or 多) with the annotation INEXACT.
The Chinese Numbers Recognizer Input Processor leaves the tokenization unmodified and does not try to concatenate neighboring numeric expressions, nor does it split numeric parts of a token from its non-numeric parts. It does, however, identify and annotate tokens which contain numeric subparts, e.g., for the token “三点”, the normalized numeric value would be 3. Furthermore, it works with decimal factored numbers like 5.5万 or 1.2亿 and supports fractions and formal Hanzi numbers.
Numeric Normalization
Numeric string normalization is applied to substrings of the input string. The normalized values are used in the creation of annotations; the input string itself remains unmodified. The following normalization steps are applied by the Chinese Numbers IP:
- Hindu-Arabic numerals remain unchanged
- Hanzi numerals are normalized to their Hindu-Arabic numeric value
- Mixed Hanzi/Hindu-Arabic numerals are normalized to Hindu-Arabic numerals.
Input token | Normalized Numeric Value |
---|---|
10 | 10 |
3.14 | 3.14 |
一 | 1 |
一点 | 1 |
两百 | 200 |
三百万五千 | 3005000 |
3百万5千 | 3005000 |
三百五 | 350 |
一万零一 | 10001 |
一万〇一 | 10001 |
The above table shows examples of normalization; in the last three examples it is possible to see that even more colloquial numeric expressions such as “三百五” are handled correctly.
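A simplified Python converter covering the integer patterns in the table above is sketched below (decimals, fractions and formal numerals are omitted); this is an illustration of the normalization idea, not the IP's actual implementation.

```python
# Simplified Hanzi/mixed numeral normalization (integers only, illustrative).

DIGITS = {"一": 1, "两": 2, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}

def normalize_number(s: str) -> int:
    total = section = digit = 0
    last_unit, zero = None, False
    for ch in s:
        if ch in ("零", "〇"):
            zero = True                          # blocks colloquial scaling
        elif ch.isdigit():                       # mixed Hanzi/Arabic input
            digit = digit * 10 + int(ch)
        elif ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            section += (digit or 1) * UNITS[ch]
            last_unit, digit = UNITS[ch], 0
        elif ch in BIG_UNITS:
            total = ((total + section + digit) or 1) * BIG_UNITS[ch]
            section = digit = 0
            last_unit, zero = None, False
    if digit:
        # colloquial trailing digit: 三百五 -> 350, but 一万零一 -> 10001
        if last_unit and not zero and digit < 10:
            section += digit * (last_unit // 10)
        else:
            section += digit
    return total + section

for s in ["两百", "三百五", "三百万五千", "3百万5千", "一万零一"]:
    print(s, "->", normalize_number(s))
```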
Normalized Number Annotation
The normalized number annotation is just the numeric value of the NUMBER annotation as an annotation itself. This allows the Teneo Studio user to write syntaxes against specific numbers, without the need to specify all the different surface variants. Thanks to the traditional-to-simplified Chinese character conversion done in the Chinese Tokenizer IP, even traditional numeric Hanzi characters match.
The below table shows examples of normalized number annotations.
Syntax | Matching inputs |
---|---|
%$2 | '2', '两', '二', '２', ... |
%$10000 | '10000', '万', '萬', '一万', '一〇〇〇〇', '１００００', ... |
%$3.14 | '3.14', '三.一四', ... |
%$350 | '350', '三百五', '三五〇', ... |
Date and Time Annotation
The TIME.DATETIME and DATE.DATETIME annotations are created in the Teneo Platform for numbers which could be either time or date expressions; for example, 五点零零 creates the annotation TIME.DATETIME with the values hour: 5 and minute: 0, and 1/2 creates the DATE.DATETIME annotation with the values month: 1, day: 2.
To read more about the native understanding and interpretation of date and time expressions in the Teneo Platform, please see here.
System Annotation
Teneo bundles two default collections of annotations in all language configurations: standard annotations, added by the System Annotation Input Processor, and special system annotations, added by the Engine. The System Annotation Input Processor performs a simple analysis of the sentence texts and may generate the standard annotations listed below.
Annotation | Description |
---|---|
_BINARY | The input consists of only 0s and 1s |
_BRACKETPAIR | At least one matching pair of brackets appears in the input; possible bracket types: ( ), [ ], { } |
_EXCLAMATION | At least one exclamation mark (!) appears in the input |
_EM3 | Three (or more) exclamation marks (!!!) appear in a row in the input |
_EMPTY | The input contains no text / the sentence text is empty |
_NONSENSE | The input contains nonsense text, e.g., 'asdf', 'wgwwgwg', 'xxxxxx' |
_QUESTION | At least one question mark (?) appears in the input |
_QT3 | Three (or more) question marks (???) appear in a row in the input |
_QUOTE | At least one single quotation mark (') appears in the input |
_DBLQUOTE | At least one quotation mark (") appears in the input |
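The kind of surface checks behind these annotations can be sketched as follows; the patterns are illustrative approximations, not the IP's exact definitions (in particular, the nonsense detection behind _NONSENSE is omitted).

```python
# Illustrative sketch of the standard annotation checks.
import re

CHECKS = {
    "_BINARY": lambda t: bool(t) and set(t) <= {"0", "1"},
    "_BRACKETPAIR": lambda t: bool(re.search(r"\(.*\)|\[.*\]|\{.*\}", t)),
    "_EXCLAMATION": lambda t: "!" in t,
    "_EM3": lambda t: "!!!" in t,
    "_EMPTY": lambda t: t.strip() == "",
    "_QUESTION": lambda t: "?" in t,
    "_QT3": lambda t: "???" in t,
    "_QUOTE": lambda t: "'" in t,
    "_DBLQUOTE": lambda t: '"' in t,
}

def system_annotations(text):
    return [name for name, check in CHECKS.items() if check(text)]

print(system_annotations("010101"))   # -> ['_BINARY']
print(system_annotations("what???"))  # -> ['_QUESTION', '_QT3']
```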
Special System Annotations
The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and depend on the session state.
Annotation | Description |
---|---|
_INIT | Indicates session start, i.e., the first input in a dialogue |
_TIMEOUT | Indicates the continuation of a previously timed-out session/dialogue |
Language Detector
The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with a confidence score for the prediction.
Annotation | Variable | Description |
---|---|---|
<language label>.LANG, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language |
The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.
A number of regexes are also used by the Input Processor to prevent the model from predicting a language for purely numerical inputs, URLs or other kinds of nonsense input.
The Language Detector only provides an annotation when the prediction confidence is above a threshold of 0.2; however, for the following languages, annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
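The gating rule can be summarized in a few lines; the ALWAYS_ANNOTATE set mirrors the languages listed above, while the function name and return shape are illustrative assumptions.

```python
# Illustrative sketch of the Language Detector's annotation gating.

THRESHOLD = 0.2
ALWAYS_ANNOTATE = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                   "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}

def language_annotation(label: str, confidence: float):
    """Return a <label>.LANG annotation, or None if below the threshold."""
    if confidence > THRESHOLD or label in ALWAYS_ANNOTATE:
        return {"name": f"{label}.LANG", "confidence": confidence}
    return None

print(language_annotation("ZH", 0.15))  # -> {'name': 'ZH.LANG', 'confidence': 0.15}
print(language_annotation("DA", 0.15))  # -> None
```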
Predict
The Predict Input Processor uses an intent model, generated when classes are available in a Teneo Studio solution, to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.
When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
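This selection rule, including the TOP_INTENT annotation described below, can be sketched as follows; the names and data shapes are illustrative assumptions, not the actual Teneo implementation.

```python
# Illustrative sketch of Predict's class selection and annotation creation.

MIN_CONFIDENCE = 0.01
MAX_CLASSES = 5

def select_classes(scores):
    """scores: class name -> model confidence. Returns annotation dicts."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:MAX_CLASSES]
    if not ranked:
        return []
    top_name, top_conf = ranked[0]
    annotations = [
        {"name": f"{top_name}.TOP_INTENT", "confidence": top_conf},
        {"name": f"{top_name}.INTENT", "confidence": top_conf, "order": 0},
    ]
    for order, (name, conf) in enumerate(ranked[1:], start=1):
        # other classes must clear both the minimum and the relative threshold
        if conf > MIN_CONFIDENCE and conf > 0.5 * top_conf:
            annotations.append({"name": f"{name}.INTENT",
                                "confidence": conf, "order": order})
    return annotations

print(select_classes({"GREETING": 0.9, "GOODBYE": 0.5, "HELP": 0.05}))
# GREETING gets TOP_INTENT and INTENT (order 0); GOODBYE gets INTENT (order 1);
# HELP is filtered out by the relative threshold.
```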
For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, carrying the model's confidence in the class, an annotation variable specifying the classifier used (i.e., Learn, CLU or LearnFallback), and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score up to 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
Annotation | Variable | Variable | Variable | Description |
---|---|---|---|---|
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.