Chinese Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language.

IP Chain Setup

The following graph displays the Input Processors Chain for Chinese; each Input Processor is described further in the following sections.

graph TD
  subgraph ips [ ]
    tokenizer[Chinese Tokenizer] --> annotator
    annotator[Chinese Annotator] --> number
    number[Chinese Numbers] --> annotation
    annotation[System Annotation] --> languagedetect
    languagedetect[Language Detector] --> predict
  end
    input([User Input]) --User Gives Input--> tokenizer
    predict[Predict] --Parsed Input--> parsed([To Dialog Processing])

  classDef ip_optional stroke-dasharray:5,5;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;

Chinese Simplifier

The Chinese Simplifier is a special kind of processor that is used to normalize the user input by:

converting full width Latin letters and Arabic digits into their half width version, and
lowercasing the uppercased Latin letters.

This Simplifier is special because it is not run as part of the Input Processors chain, but rather by the Tokenizer when it puts the tokens into a Teneo data structure. Additionally, the Simplifier is also run by the condition parser inside the Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
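The two normalization steps can be illustrated with a minimal sketch. It assumes that "full width" refers to the Unicode Halfwidth and Fullwidth Forms block, where each full-width letter or digit sits at a fixed offset from its ASCII counterpart; the function name simplify is hypothetical and not part of the Teneo API.

```python
# Offset between a full-width character (U+FF01..U+FF5E) and its ASCII form.
FULLWIDTH_OFFSET = 0xFEE0


def simplify(text: str) -> str:
    """Fold full-width Latin letters and digits to half width, then lowercase."""
    out = []
    for ch in text:
        # Full-width digits, uppercase letters and lowercase letters.
        if "０" <= ch <= "９" or "Ａ" <= ch <= "Ｚ" or "ａ" <= ch <= "ｚ":
            ch = chr(ord(ch) - FULLWIDTH_OFFSET)
        # Lowercase only ASCII Latin letters; everything else is left untouched.
        if "A" <= ch <= "Z":
            ch = ch.lower()
        out.append(ch)
    return "".join(out)
```

Chinese characters and punctuation pass through unchanged, so the same function could be applied both to user input tokens and to Language Object syntax words.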

Input Processors

Chinese Tokenizer

The Chinese Tokenizer is the first of the input processors to be run on Chinese user inputs; it essentially does two things: first, it converts traditional Mandarin Chinese characters into simplified, and secondly, it tokenizes the converted user input and generates sentences based on the tokens.

The conversion of traditional characters into simplified characters is done by a one-to-one character mapping and after the conversion to simplified Mandarin Chinese, the user input is segmented into words and sentences.
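A one-to-one character mapping of this kind can be sketched with a translation table. The handful of entries below is only an illustrative sample chosen for this example; the actual mapping table used by the Tokenizer is not shown in this document.

```python
# Tiny illustrative sample; a real table would cover thousands of characters.
TRAD_TO_SIMP = str.maketrans({
    "買": "买",
    "東": "东",
    "歡": "欢",
    "萬": "万",
    "嗎": "吗",
})


def to_simplified(text: str) -> str:
    """Replace each traditional character by its simplified counterpart."""
    # Characters without an entry in the table are returned unchanged.
    return text.translate(TRAD_TO_SIMP)
```

Because the mapping is applied character by character, it needs no knowledge of word boundaries and can run before segmentation.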

The Chinese Tokenizer splits the user input words via a statistical model and a user dictionary; the words specified in the user dictionary are guaranteed to be segmented as such by the Tokenizer.
The user dictionary has a static component, which is specified in the configuration of the Tokenizer, and a dynamic component, which is collected from the language objects defined in a user solution that have syntaxes of type DICTEXT_word_POStag.

The Chinese Tokenizer passes the Part-of-Speech (POS) tags it generates to the user as annotations, using a configuration file that maps Panda-generated POS tags to annotations, for example, NN=NN.POS. The Chinese Tokenizer also defines a list of characters which should not be output as tokens, e.g., punctuation and brackets.

The last step in the tokenization process is the splitting of user input tokens into sentences based on a defined set of characters which mark sentence boundaries, for example: 。 ！ ？ … ! ? . ．

The Chinese Tokenizer does not split decimal numbers around the decimal markers, but rather concatenates the split tokens into one; this makes it easier to identify and annotate decimal numbers later in the processing chain.

Note that numbers with a factor other than 万 or 亿 after the decimal point are not treated as numbers and are therefore split instead of being concatenated.
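The decimal concatenation and the sentence splitting can be sketched as two passes over a token list. This is an illustration under stated assumptions, not the Tokenizer's implementation: the function names are hypothetical, the decimal markers are assumed to be "." and "．", a factor is assumed to arrive in the same token as the digits after the marker, and boundary characters are assumed to be dropped from the output.

```python
import re

SENTENCE_END = set("。！？…!?.．")        # boundary characters listed above
DECIMAL_MARKERS = {".", "．"}              # assumed decimal markers
INTEGER = re.compile(r"[0-9]+")
FRACTION = re.compile(r"[0-9]+[万亿]?")    # only 万 or 亿 may follow the decimals


def merge_decimals(tokens):
    """Join (digits, marker, digits) token triples back into one decimal token."""
    merged, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and INTEGER.fullmatch(tokens[i])
                and tokens[i + 1] in DECIMAL_MARKERS
                and FRACTION.fullmatch(tokens[i + 2])):
            merged.append("".join(tokens[i:i + 3]))
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at each boundary character."""
    sentences, current = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences
```

Merging decimals before splitting matters here: the "." inside 5.5万 would otherwise be taken for a sentence boundary.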

Chinese Annotator

The Chinese Annotator includes a range of analyzers which treat specific morphological phenomena of Chinese. In general, three operations can be performed by the morphological analyzers:

Annotation (addition of one or more morphological annotations)
Change of the base form property
Concatenation of multiple tokens.

The morphological analyzers are applied in a fixed order; the table below shows the current sequence of analyzers, along with the operations that are performed by them. In the following sections more details are provided for each individual analyzer.

	Analyzer	Example	Annotation	Base form change	Concatenation
1.	VNotV Analyzer	是-不-是	Yes	Yes	Yes
2.	Verb Analyzer	吃-完，跑-上	Yes	Yes	Yes
3.	Reduplication Analyzer	红-红	Yes	Yes	Yes
4.	Loc Analyzer	桌子-上	Yes	Yes	Yes
5.	Aspect Analyzer	吃了, 坐着	Yes	No	No
6.	Negation Analyzer	不吃	Yes	No	No
7.	SC Analyzer	洗了一个澡	Yes	Yes	No
8.	Affix Analyzer	我们, 标准化	Yes	Yes	No

VNotV Analyzer

The VNotV Analyzer concatenates and analyzes V-Not-V sequences.
In V-Not-V structures, the same verb occurs twice with a negation word (不, 没, 否) between the two occurrences:

你去-不-去买东西？
Nǐ qù-bù-qù mǎi dōngxī?
You go-VNOTV.BU-go buy things
‘Do you go shopping?’

The V-Not-V structure has two uses:

我不知道他去-不-去买东西。
Wǒ bù zhīdào tā qù-bù-qù mǎi dōngxī.
I NEG know he go-VNOTV.BU-go buy things
‘I don’t know whether he goes shopping.’

In the case of bi-syllabic words, the second syllable of the first verb might be deleted:

你喜(欢)-不-喜欢买东西？
Nǐ xǐ(huān)-bù-xǐhuān mǎi dōngxī?
You like-VNOTV.BU-like buy things
‘Do you like shopping?’

The VNotV Analyzer concatenates the three tokens. It assigns one structural annotation (prefixed with VNOTV) signaling the negation form. The base form of the resulting token is set to the full form of the verb. Thus, the example above with second-syllable deletion (喜-不-喜欢) is analyzed as follows:

The three tokens are concatenated into one word 喜不喜欢
This word gets the base form 喜欢
Additionally, it gets the annotation VNOTV.BU.
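The concatenation step can be sketched as follows. The Token class and the function analyze_vnotv are hypothetical helpers for this illustration; only the label VNOTV.BU appears in the text above, so the labels for 没 and 否 are assumed by analogy. The sketch also skips the check that the repeated word is actually a verb.

```python
from dataclasses import dataclass, field

# Only VNOTV.BU is documented; the other two labels are assumptions.
NEGATORS = {"不": "VNOTV.BU", "没": "VNOTV.MEI", "否": "VNOTV.FOU"}


@dataclass
class Token:
    text: str
    base: str
    annotations: list = field(default_factory=list)


def analyze_vnotv(tokens):
    """Merge verb + negator + verb triples into a single annotated token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i + 1].text in NEGATORS:
            first, neg, second = tokens[i:i + 3]
            # The first copy is the full verb or only its first syllable.
            same_verb = second.text == first.text
            shortened = len(first.text) == 1 and second.text.startswith(first.text)
            if same_verb or shortened:
                merged = first.text + neg.text + second.text
                # The base form is the full form of the verb (the second copy).
                out.append(Token(merged, second.base, [NEGATORS[neg.text]]))
                i += 3
                continue
        out.append(tokens[i])
        i += 1
    return out
```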

Verb Analyzer

The Verb Analyzer performs analysis of resultative and directional compounds. Resultative compounds consist of one main verb and one resultative suffix:

小王吃-完了。
Xiǎowáng chī-wán le.
Xiaowang eat-RESULT ASPECT
‘Xiaowang finished eating.’

Directional compounds consist of one main verb and one or two directional suffixes:

阿明跑-上楼梯了。
Āmíng pǎo-shàng lóutī le.
Aming run-DIR.NONDEICTIC.SHANG stairs ASPECT
‘Aming ran up the stairs.’
阿明跑-上-去了。
Āmíng pǎo-shàng-qù le.
Aming run-DIR.NONDEICTIC.SHANG-DIR.DEICTIC.QU ASPECT
‘Aming ran up.’

The combination of the main verb with the resultative/directional complements is concatenated. The base form of the resulting token is changed to the base form of the main verb. The token is assigned the annotations associated with the resultative/directional suffixes.

Annotations for resultative suffixes carry the prefix RESULT. Annotations for directional suffixes carry the prefix DIR. Additionally, we distinguish between deictic (DIR_DEICTIC…) and non-deictic (DIR_NONDEICTIC…) directional complements. Cases with two directional suffixes are limited to a non-deictic complement followed by a deictic complement.

Reduplication Analyzer

The Reduplication Analyzer analyzes reduplications of verbs, adjectives and adverbs:

a. 红-红
hóng-hóng
red-red
‘very red’

b. 讨论-讨论
tǎolùn-tǎolùn
discuss-discuss
‘to discuss a little’

Reduplication of adjectives shows some variability in the distribution of the syllables. Specifically, some adjectives exhibit the following asymmetric reduplication patterns:

a. AABB:
干净 → 干-干-净-净
gānjìng → gān-gān-jìng-jìng
clean → clean-clean
‘clean’ → ‘very clean’

b. ABB:
雪白 → 雪-白-白
xuěbái → xuě-bái-bái
white → white-white
‘white’ → ‘very white’

c. AAB:
逛街 → 逛-逛-街
guàngjiē → guàng-guàng-jiē
walk street → walk-walk-street
‘walk street’ → ‘go window shopping’

In verbal reduplication, the particles 一 and 了 can occur between the two copies:

看-一-看
kàn-yī-kàn
look-one-look
‘to take a look’

If the two words are segmented in the original tokenization, they are concatenated by the Reduplication Analyzer. The reduplicated word gets the annotation REDUP.
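The reduplication patterns described in this section can be recognized on the concatenated form with a few string comparisons. The sketch below is an illustration only: the function name redup_base is hypothetical, and a real analyzer would presumably also consult POS tags or a lexicon so that ordinary words with repeated characters are not misread as reduplications.

```python
def redup_base(word: str):
    """Return the base form of a reduplicated word, or None if no pattern fits."""
    n = len(word)
    # A-一-A / A-了-A: verbal reduplication with an intervening particle.
    if n == 3 and word[1] in "一了" and word[0] == word[2]:
        return word[0]
    # Full copy: AA (红红) or ABAB (讨论讨论).
    if n >= 2 and n % 2 == 0 and word[: n // 2] == word[n // 2:]:
        return word[: n // 2]
    # AABB: 干干净净 -> 干净
    if n == 4 and word[0] == word[1] and word[2] == word[3]:
        return word[0] + word[2]
    # ABB: 雪白白 -> 雪白
    if n == 3 and word[1] == word[2]:
        return word[:2]
    # AAB: 逛逛街 -> 逛街
    if n == 3 and word[0] == word[1]:
        return word[1:]
    return None
```

A token for which redup_base returns a value would receive the REDUP annotation.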

Loc Analyzer

The Loc Analyzer performs concatenation and analysis of noun + localizer combinations; localizers follow nouns and ‘transform’ them into locative nouns:

桌子-上
table-LOC.ON.SHANG
‘on the table’

Localizers form a closed set; the table below shows the mapping from localizers to their annotations.

Form of localizer	Annotation
上	LOC_ON_SHANG
下	LOC_UNDER_XIA
里	LOC_INSIDE_LI
内	LOC_INSIDE_NEI
外	LOC_OUTSIDE_WAI
前	LOC_BEFORE_QIAN
后	LOC_BEHIND_HOU
旁	LOC_NEXTTO_PANG
中	LOC_IN_ZHONG

The Loc Analyzer concatenates the noun + localizer combination into one word and assigns it an annotation with the label of the localizer. The base form of the resulting token is set to the base form of the noun. Thus, the example 桌子-上 above is analyzed as follows:

The two tokens are concatenated into one word 桌子上.
This word gets the base form 桌子.
Additionally, it is assigned the annotation LOC_ON_SHANG.
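The noun + localizer step can be sketched as a lookup in the localizer table followed by a merge. The token representation (plain dictionaries) and the function name analyze_loc are assumptions made for this illustration, as is the use of the POS tag NN to identify nouns.

```python
LOCALIZERS = {
    "上": "LOC_ON_SHANG", "下": "LOC_UNDER_XIA", "里": "LOC_INSIDE_LI",
    "内": "LOC_INSIDE_NEI", "外": "LOC_OUTSIDE_WAI", "前": "LOC_BEFORE_QIAN",
    "后": "LOC_BEHIND_HOU", "旁": "LOC_NEXTTO_PANG", "中": "LOC_IN_ZHONG",
}


def analyze_loc(tokens):
    """Merge each noun with a directly following localizer into one token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok["pos"] == "NN" and nxt is not None and nxt["text"] in LOCALIZERS:
            out.append({
                "text": tok["text"] + nxt["text"],
                "base": tok["base"],          # base form stays that of the noun
                "pos": tok["pos"],
                "annotations": tok["annotations"] + [LOCALIZERS[nxt["text"]]],
            })
            i += 2
        else:
            out.append(tok)
            i += 1
    return out
```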

Aspect Analyzer

The Aspect Analyzer analyzes aspect markers. Chinese has both pre-verbal and post-verbal aspect markers:

a. 她正在吃。
Tā zhèngzài chī.
she ASPECT eat.
‘She is eating.’

b. 她吃了。
Tā chī le.
she eat ASPECT
‘She ate.’

Marker	Aspect	Annotation	Position
了	Perfective	ASPECT_PERFECTIVE_LE	Postverbal
着	Progressive	ASPECT_PROGRESSIVE_ZHE	Postverbal
过	Experiential	ASPECT_EXPERIENTIAL_GUO	Postverbal
在	Progressive	ASPECT_PREVERBAL_PROGRESSIVE_ZAI	Preverbal
正在	Progressive	ASPECT_PREVERBAL_PROGRESSIVE_ZHENGZAI	Preverbal

The set of aspect markers analyzed by the Aspect Analyzer is displayed in the above table.
The Aspect Analyzer attaches the respective annotation of the aspect marker to the main verb.

Negation Analyzer

The Negation Analyzer analyzes negations of adverbs, verbs and adjectives. In these cases, the negation particle can immediately precede the negated word:

a. 我没去。
I NEG.MEI go
‘I didn’t go’

b. 不容易
NEG.BU easy
‘not easy’

c. 不太
NEG.BU too
‘not too’

The negation particle can also be separated from the verb by additional material:

别这么做。
Bié zhème zuò.
NEG.BIE this do
‘Don’t do this.’

The set of currently analyzed negation words is shown in the below table.

Form of negator	Annotation
不	NEG_BU
否	NEG_FOU
没	NEG_MEI, ASPECT_PERFECTIVE
没有	NEG_MEIYOU, ASPECT_PERFECTIVE
别	NEG_BIE, MODE_IMPERATIVE
不太	NEG_BUTAI
并不	NEG_BINGBU
不怎么	NEG_BUZENME

Three of the negation particles (没, 没有, 别) have two annotations. Their second annotation contains aspectual or mode information that is implied by the particle. The Negation Analyzer attaches an annotation to the negated word. It contains the corresponding annotation of the negation particle as well as its index in the sentence. An additional annotation is attached to the negated word if the negation particle carries aspect or mode information.

For example, in 我没去 (example a. further above), the verb 去 is annotated with two annotations, {NEG_MEI, 1} and ASPECT_PERFECTIVE.
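The annotation logic for the adjacent case can be sketched as follows. This is a simplified illustration: the dictionary-based token format, the function name analyze_negation and the POS tags VV, VA and AD are assumptions, token indices are taken to be zero-based, and the case where additional material separates the particle from the verb is not covered.

```python
NEGATORS = {
    "不": ["NEG_BU"],
    "否": ["NEG_FOU"],
    "没": ["NEG_MEI", "ASPECT_PERFECTIVE"],
    "没有": ["NEG_MEIYOU", "ASPECT_PERFECTIVE"],
    "别": ["NEG_BIE", "MODE_IMPERATIVE"],
    "不太": ["NEG_BUTAI"],
    "并不": ["NEG_BINGBU"],
    "不怎么": ["NEG_BUZENME"],
}
NEGATABLE_POS = {"VV", "VA", "AD"}   # assumed tags for verbs, adjectives, adverbs


def analyze_negation(tokens):
    """Annotate the word that directly follows a negation particle."""
    for index, tok in enumerate(tokens):
        labels = NEGATORS.get(tok["text"])
        if labels is None or index + 1 >= len(tokens):
            continue
        target = tokens[index + 1]
        if target["pos"] in NEGATABLE_POS:
            # Negation label plus the position of the particle in the sentence.
            target["annotations"].append((labels[0], index))
            # Implied aspect or mode information, if the particle carries any.
            target["annotations"].extend(labels[1:])
    return tokens
```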

SC Analyzer

The SC Analyzer analyzes splitable compounds; the splitable verb-object compounds (SCs) are verb-object combinations with an idiomatic meaning, e.g. 担-心 (worry+heart = ‘to worry’), 生-气 (create+air = ‘to get angry’), 见-面 (see+face = ‘to meet someone’). They allow for various kinds of syntactic activity between verb and object, e.g. insertion of aspect markers, additional objects, demonstratives, etc.:

a. Aspect marker:
我们见- 了 -面
we see- ASPECT -face
‘We met.’

b. Additional object:
帮- 她一个 -忙
help- she one -affair
‘to help her’

c. Nominal modifier:
见- 他的 -面
see- he DEG -face

The set of SCs is large and diverse. Although it is difficult to enumerate all SCs exhaustively, the most common instances are captured in a list that currently contains 163 compounds. Once the SC Analyzer identifies a verb that belongs to a splitable compound, it moves forward in the sentence and looks for a valid SC object for this verb. At each subsequent word, it checks whether the sequence following the verb is still a valid splitting sequence. If it arrives at a suitable object before the sequence becomes invalid, it attaches an annotation to the verb. This annotation carries two pieces of information: the tag of the splitable compound (SPLIT_ followed by the pinyin of the compound) and the index of the dependent object. Further, the base form of the verb is set to the base form of the splitable compound.

Thus, in 我们见-了-面 (example a. above), the verb 见 is annotated with the annotation {SPLIT_JIANMIAN, 3}. Its base form is set to 见面.
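The forward search can be sketched as a bounded scan from the verb to a matching object. The sketch rests on several assumptions: the compound table holds two sample entries only (SPLIT_BANGMANG is formed by analogy with SPLIT_JIANMIAN), the "valid splitting sequence" check is reduced to "no punctuation and at most MAX_GAP intervening tokens", and the token format and function name are hypothetical.

```python
# Two sample entries; the text mentions a list of 163 compounds.
SPLIT_COMPOUNDS = {
    "见": {"面": "SPLIT_JIANMIAN"},
    "帮": {"忙": "SPLIT_BANGMANG"},   # label assumed by analogy
}
BLOCKING = set("，。！？,.!?")         # assumed to invalidate a splitting sequence
MAX_GAP = 4                            # assumed limit on intervening tokens


def analyze_sc(tokens):
    """Annotate verbs of splitable compounds with the tag and the object index."""
    for i, tok in enumerate(tokens):
        objects = SPLIT_COMPOUNDS.get(tok["text"])
        if not objects:
            continue
        # Scan forward until an object is found or the sequence becomes invalid.
        for j in range(i + 1, min(len(tokens), i + 2 + MAX_GAP)):
            word = tokens[j]["text"]
            if word in BLOCKING:
                break
            if word in objects:
                tok["annotations"].append((objects[word], j))
                tok["base"] = tok["text"] + word   # base form of the compound
                break
    return tokens
```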

Affix Analyzer

The Affix Analyzer analyzes inflectional and derivational suffixes. Chinese only has one inflectional suffix, that is the plural suffix -们, which can be attached to human nouns/pronouns:

a. 老师-们
teacher-PLURAL
‘the teachers’

b. 我-们
me-PLURAL
‘we’

Additionally, Chinese has a set of derivational suffixes which change the part of speech of the word to which they are attached. For example, the suffix -者 is attached to verbs, and the resulting combination is a noun and denotes the actor of the base form verb:

使用-者
shǐ-yòng(-)zhě
use-ACTOR.ZHE
‘the user’

A suffixed word gets the corresponding annotation of its suffix, and the base form of the word is changed to the base form without the suffix. Thus, 使用者 in the example above is analyzed as follows:

使用者 gets the annotation ACTOR_ZHE
使用者 gets the base form 使用.

The below table displays the set of tags used by the Affix Analyzer.

Form of affix	Annotation	Example
-于	COMPARATIVE_YU	高于 (两米)
-度	PROPERTY_DU	精确度
-性	PROPERTY_XING	流线性
-化	TRANSFORM_HUA	现代化
-者	ACTOR_ZHE	使用者
-师	ACTOR_SHI	设计师
-员	ACTOR_YUAN	操作员
可-	ABILITY_KE	可上升
-们	PLURAL_MEN	老师们
-城	CITY_CHENG	北京城
-市	CITY_SHI	上海市
-省	PROVINCE_SHENG	河北省
-儿	RCOLORING_ERHUA	好玩儿
-于 (word contains the suffix and has a base form of at least 2 characters)	PREP_YU	致力于
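The table can be read as a lookup from affix to annotation. The sketch below covers the single-character suffixes and the prefix 可-, and leaves out -于 because telling COMPARATIVE_YU from PREP_YU needs more context than the word form. The function name analyze_affix is hypothetical, and the sketch overgenerates: a real analyzer would presumably restrict matches with POS or lexicon information.

```python
SUFFIXES = {
    "度": "PROPERTY_DU", "性": "PROPERTY_XING", "化": "TRANSFORM_HUA",
    "者": "ACTOR_ZHE", "师": "ACTOR_SHI", "员": "ACTOR_YUAN",
    "们": "PLURAL_MEN", "城": "CITY_CHENG", "市": "CITY_SHI",
    "省": "PROVINCE_SHENG", "儿": "RCOLORING_ERHUA",
}
PREFIXES = {"可": "ABILITY_KE"}


def analyze_affix(word: str):
    """Return (base form, annotation); annotation is None if no affix matches."""
    if len(word) >= 2 and word[-1] in SUFFIXES:
        return word[:-1], SUFFIXES[word[-1]]
    if len(word) >= 2 and word[0] in PREFIXES:
        return word[1:], PREFIXES[word[0]]
    return word, None
```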

Chinese Numbers

The Chinese Numbers Recognizer Input Processor simplifies writing syntaxes against numbers and numeric expressions in Teneo Studio solutions and provides the following functionalities:

Normalization of tokens containing numeric values into Hindu-Arabic numerals
Creation of a NUMBER annotation with a numericValue variable which has type BigDecimal and contains a representation of the normalized number
Creation of an annotation with the name of the normalized number value
Annotation of inexact numbers with the annotation INEXACT (i.e. numbers containing the characters 几, 数, 余 or 多).

The Chinese Numbers Recognizer Input Processor leaves the tokenization unmodified and does not try to concatenate neighboring numeric expressions, nor does it split numeric parts of a token from its non-numeric parts. It will, however, identify and annotate tokens which contain numeric subparts; e.g., for the token “三点”, the normalized numeric value would be 3. Furthermore, it works with decimal factored numbers like 5.5万 or 1.2亿 and supports fractions and formal Hanzi numbers.

Numeric Normalization

Numeric string normalization is applied to substrings of the input string. The normalized values are used when creating annotations; the input string itself remains unmodified. The following normalization steps are applied by the Chinese Numbers IP:

Hindu-Arabic numerals remain unchanged
Hanzi numerals are normalized to their Hindu-Arabic numeric value
Mixed Hanzi/Hindu-Arabic numerals are normalized to Hindu-Arabic numerals.

Input token	Normalized Numeric Value
10	10
3.14	3.14
一	1
一点	1
两百	200
三百万五千	3005000
3百万5千	3005000
三百五	350
一万零一	10001
一万〇一	10001

The above table shows examples of normalization; in the last three examples it is possible to see that even more colloquial numeric expressions such as “三百五” are handled correctly.
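The normalization of the table's examples can be reproduced with a small parser. This is a sketch under assumptions, not the Input Processor's algorithm: the function name normalize_number is hypothetical, Python's Decimal stands in for BigDecimal, only the first numeric run of a token is read, and fractions, formal numerals and Hanzi decimals are not handled.

```python
import re
from decimal import Decimal

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}
PIECE = r"[0-9]+(?:\.[0-9]+)?|[零〇一二两三四五六七八九十百千万亿]"
NUMERIC_RUN = re.compile(f"(?:{PIECE})+")


def normalize_number(token):
    """Return the value of the first numeric run in token as a Decimal, or None."""
    run = NUMERIC_RUN.search(token)
    if run is None:
        return None
    total = section = Decimal(0)   # finished 万/亿 groups, current group below 万
    current = None                 # digits still waiting for a unit
    last_unit, biggest, zero_seen = 0, 0, False
    for piece in re.findall(PIECE, run.group()):
        if piece in DIGITS:
            zero_seen = zero_seen or DIGITS[piece] == 0
            # Consecutive digits are read positionally, e.g. 三五〇 -> 350.
            current = (current or 0) * 10 + DIGITS[piece]
        elif piece in SMALL_UNITS:
            # A bare unit such as 十 implies a leading 1.
            section += (1 if current is None else current) * SMALL_UNITS[piece]
            current, last_unit, zero_seen = None, SMALL_UNITS[piece], False
        elif piece in BIG_UNITS:
            unit = BIG_UNITS[piece]
            section += current or 0
            if unit > biggest:
                # A new largest unit scales everything read so far (bare 万 -> 10000).
                total = ((total + section) or Decimal(1)) * unit
                biggest = unit
            else:
                total += section * unit
            section, current, last_unit, zero_seen = Decimal(0), None, unit, False
        else:
            current = Decimal(piece)   # Hindu-Arabic numeral, possibly with decimals
    if current is not None:
        # Colloquial short form: in 三百五 the trailing 五 counts tens, giving 350.
        if last_unit >= 100 and not zero_seen and 1 <= current <= 9:
            current *= last_unit // 10
        section += current
    return total + section
```

The zero_seen flag is what separates 一万零一 (10001) from the colloquial short form: once a 零 or 〇 has been read after the last unit, a trailing digit is taken at face value.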

Normalized Number Annotation

The normalized number annotation is just the numeric value of the NUMBER annotation as an annotation itself. This allows the Teneo Studio user to write syntaxes against specific numbers, without the need to specify all the different surface variants. Thanks to the traditional-to-simplified Chinese character conversion done in the Chinese Tokenizer IP, even traditional numeric Hanzi characters match.

In the below table please find examples of normalized number annotations.

Syntax	Matching inputs
%$2	'2', '两', '二', '２', ...
%$10000	'10000', '万', '萬', '一万', '一〇〇〇〇', '１００００', ...
%$3.14	'3.14', '三.一四', ...
%$350	'350', '三百五', '三五〇', ...

Date and Time Annotation

The TIME.DATETIME and DATE.DATETIME annotations are created in the Teneo Platform for numbers which could be either time or date expressions; for example, 五点零零 creates the TIME.DATETIME annotation with the values hour: 5 and minute: 0, and 1/2 creates the DATE.DATETIME annotation with the values month: 1 and day: 2.

To read more about the native understanding and interpretation of date and time expressions in the Teneo Platform, please see the corresponding Teneo Platform documentation.

System Annotation

Teneo bundles two default collections of annotations in all language configurations: standard annotations added by the System Annotation Input Processor and special system annotations added by the Engine; the System Annotation Input Processor performs simple analysis of the sentence texts and may generate the standard annotations listed below.

Annotation	Description
_BINARY	The input consists of only 0s and 1s
_BRACKETPAIR	At least one matching pair of brackets appears in the input; possible bracket types: ( ), [ ], { }
_EXCLAMATION	At least one exclamation mark (!) appears in the input
_EM3	Three (or more) exclamation marks (!!!) appear in a row in the input
_EMPTY	The input contains no text / the sentence text is empty
_NONSENSE	The input contains nonsense text, e.g., 'asdf', 'wgwwgwg', 'xxxxxx'
_QUESTION	At least one question mark (?) appears in the input
_QT3	Three (or more) question marks (???) appear in a row in the input
_QUOTE	At least one single quotation mark (') appears in the input
_DBLQUOTE	At least one quotation mark (") appears in the input
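Most of the standard annotations in the table reduce to simple character checks. The sketch below illustrates them under stated assumptions: the function name system_annotations is hypothetical, only the ASCII punctuation shown in the table is tested, a "matching pair" is read as an opening bracket followed later by its closing counterpart, and _NONSENSE is omitted because its detection logic is not described here.

```python
import re

BRACKET_PAIR = re.compile(r"\(.*\)|\[.*\]|\{.*\}")


def system_annotations(text: str) -> set:
    """Return the standard annotations that apply to one sentence text."""
    if text.strip() == "":
        return {"_EMPTY"}
    found = set()
    if re.fullmatch(r"[01]+", text):
        found.add("_BINARY")
    if BRACKET_PAIR.search(text):
        found.add("_BRACKETPAIR")
    if "!" in text:
        found.add("_EXCLAMATION")
    if "!!!" in text:
        found.add("_EM3")
    if "?" in text:
        found.add("_QUESTION")
    if "???" in text:
        found.add("_QT3")
    if "'" in text:
        found.add("_QUOTE")
    if '"' in text:
        found.add("_DBLQUOTE")
    return found
```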

Special System Annotations

The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.

Annotation	Description
_INIT	Indicates session start, i.e., the first input in a dialogue
_TIMEOUT	Indicates the continuation of a previously timed-out session/dialogue

Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with a confidence score of the prediction.

Annotation	Variable	Description
<language label>.LANG, e.g., %$DA.LANG	Confidence	Annotation created for the predicted language

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

A number of regexes are also used by the Input Processor, preventing the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
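The threshold rule can be written as a single condition. In this minimal sketch the function name language_annotation and the dictionary used to represent an annotation are hypothetical; the threshold and the list of always-annotated languages come from the paragraph above.

```python
THRESHOLD = 0.2
# Labels of the languages for which an annotation is always created.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}


def language_annotation(label: str, confidence: float):
    """Return the <label>.LANG annotation, or None if the prediction is dropped."""
    if confidence > THRESHOLD or label in ALWAYS_ANNOTATED:
        return {"name": f"{label}.LANG", "confidence": confidence}
    return None
```

For example, a Chinese prediction with confidence 0.1 would still yield ZH.LANG, whereas a Danish prediction with the same confidence would yield no annotation.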

Predict

The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the rank of the selected class (i.e., 0 for the class with the highest confidence score, up to 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.

Annotation	Variable	Variable	Variable	Description
<CLASS_NAME>.TOP_INTENT	classifier	confidence		Annotation created for the class with the highest confidence score
<CLASS_NAME>.INTENT	classifier	confidence	Order	Annotation given to each selected class with a maximum of five top classes

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
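The selection criteria can be sketched as a filter over the ranked confidence scores. The function name select_intents and the dictionary-based annotation format are hypothetical, the classifier variable is left out, and the limit of five is read here as applying to the <CLASS_NAME>.INTENT annotations, which is an assumption.

```python
def select_intents(scores, min_confidence=0.01, ratio=0.5, max_classes=5):
    """Build TOP_INTENT and INTENT annotations from a {class name: confidence} map."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_conf = ranked[0]
    # The top class is always selected; the others must pass both criteria.
    selected = [ranked[0]] + [
        (name, conf) for name, conf in ranked[1:]
        if conf > min_confidence and conf > ratio * top_conf
    ]
    selected = selected[:max_classes]
    annotations = [{"name": f"{top_name}.TOP_INTENT", "confidence": top_conf}]
    for order, (name, conf) in enumerate(selected):
        annotations.append(
            {"name": f"{name}.INTENT", "confidence": conf, "Order": order}
        )
    return annotations
```

With scores of 0.6, 0.35, 0.2 and 0.005, only the first two classes are selected: 0.2 is not higher than 0.5 times the top score, and 0.005 is below the minimum confidence.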