Chinese Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform further processing on them; typical steps are normalization and tokenization. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language.

Input Processors Chain setup

The following graph displays the Input Processors chain for Chinese:

graph TD
  subgraph ips [ ]
    tokenizer[Chinese Tokenizer] --> annotator
    annotator[Chinese Annotator] --> number
    number[Chinese Numbers] --> annotation
    annotation[System Annotation] --> languagedetect
    languagedetect[Language Detector] --> predict
  end
  input([User Input]) --User Gives Input--> tokenizer
  predict[Predict] --Parsed Input--> parsed([To Dialog Processing])
  classDef ip_optional stroke-dasharray:5,5;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;

The Input Processors are listed below with a short description of each one's functionality; the following sections go into further detail.

The Chinese Tokenizer IP first converts the user input to Simplified characters and then splits it into words and sentences.
The Chinese Annotator IP performs a morphological analysis on the user input sentences and words and annotates them to provide morphological information in addition to what the Tokenizer provides as words and their Part-of-Speech (POS) tags.
The Chinese Numbers IP identifies and annotates the numbers present in the user input to make it easier for the final user to write syntaxes that depend on numbers.
The System Annotation IP sets a number of annotations, based on properties of the user input text.
The Language Detector IP identifies the language of the input sentence and annotates it with the predicted language and the confidence score of the prediction.
The Predict IP classifies user input based on a machine learning model trained in Teneo Learn and annotates the user input with the predicted top intent classes and a confidence score.

Chinese Simplifier

The Chinese Simplifier is a special kind of processor that is used to normalize the user input by:

converting full width Latin letters and Arabic digits into their half width version, and
lowercasing the uppercased Latin letters.

This Simplifier is special because it is not run as part of the Input Processor chain, but rather by the Tokenizer when it puts the tokens into a Teneo data structure.

Additionally, the Simplifier is also run by the condition parser inside Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
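The two normalization steps can be illustrated with a minimal Python sketch. It is not the Engine's implementation; the function name `simplify` is hypothetical, and the sketch assumes that only full-width Latin letters and digits are converted, so other full-width characters such as punctuation are left untouched.

```python
def simplify(text: str) -> str:
    """Convert full-width Latin letters and digits to half width, then lowercase."""
    result = []
    for ch in text:
        code = ord(ch)
        # Full-width digits U+FF10-U+FF19, capitals U+FF21-U+FF3A,
        # small letters U+FF41-U+FF5A.
        if (0xFF10 <= code <= 0xFF19
                or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            # The full-width forms sit exactly 0xFEE0 above their ASCII twins.
            ch = chr(code - 0xFEE0)
        if "A" <= ch <= "Z":
            ch = ch.lower()
        result.append(ch)
    return "".join(result)
```

Because the same normalization is applied to user input and to Language Object syntax words, a syntax written as `teneo` would match inputs such as `ＴＥＮＥＯ` under this sketch.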

Chinese Tokenizer IP

The Chinese Tokenizer Input Processor is the first of the input processors to be run on Chinese user inputs; it essentially does two things: first, it converts traditional Mandarin Chinese characters into simplified, and secondly, it tokenizes the converted user input and generates sentences based on the tokens.

Traditional-to-simplified conversion

The conversion of traditional characters into simplified characters is done via a one-to-one character mapping. This mapping is configured via two properties of the Chinese Tokenizer IP:

A list of characters: traditionalCharacters.file.name
The mappings of traditional characters to simplified characters: traditionalSimplifiedMappings.file.name.

After the conversion to simplified Mandarin Chinese, the user input is segmented into words and sentences.
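A one-to-one character mapping of this kind can be sketched in Python with a translation table. The four mappings below are a hand-picked sample for illustration only; the real mapping is read from the files named by the two properties above, and the names `TRADITIONAL_TO_SIMPLIFIED` and `to_simplified` are hypothetical.

```python
# Illustrative sample only; the actual mapping comes from the configured files.
TRADITIONAL_TO_SIMPLIFIED = {"買": "买", "東": "东", "萬": "万", "們": "们"}

# str.maketrans turns the dict into a table usable by str.translate.
_TABLE = str.maketrans(TRADITIONAL_TO_SIMPLIFIED)


def to_simplified(text: str) -> str:
    """Replace each traditional character with its simplified counterpart."""
    return text.translate(_TABLE)
```

Characters without an entry in the table, including characters that are already simplified, pass through unchanged.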

Chinese tokenization

The Chinese Tokenizer splits the user input words via a statistical model and a user dictionary. The words specified in the user dictionary are guaranteed to be segmented as such by the Tokenizer.

The user dictionary has a static component, which is specified as a configuration file via the property dictionary, and a dynamic component, which is collected from the language objects defined in a user solution that have syntaxes of type DICTEXT_word_POStag.

The Chinese Tokenizer passes the Part-of-Speech (POS) tags it generates to the user as annotations. The tags are mapped according to a configuration file that maps Panda-generated POS tags to annotations, e.g. NN=NN.POS.

The Tokenizer also uses a configuration property called nonWordTokens to specify which characters should not be output as tokens, e.g. punctuation, brackets, etc.

Name	Type	Required	Default
`nonWordTokens`	string	no	`＂"“” 『』'「」()[]{}（）〔〕［］｛｝〈〉《》!！?？…,，、。.；;：:．`

The last step in the tokenization process is the splitting of the user input tokens into sentences. For this, the Tokenizer uses another configuration property called sentenceDelimiters to know which characters mark sentence boundaries.

Name	Type	Required	Default
`sentenceDelimiters`	string	no	`。！？…!?.．`
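The interplay of the two properties can be sketched as follows. This is a simplified illustration under the assumption that the Tokenizer has already produced a flat list of token strings; the function name `split_sentences` is hypothetical, and the two character sets are the defaults listed in the tables above.

```python
# Defaults of the nonWordTokens and sentenceDelimiters properties.
NON_WORD_TOKENS = set("＂\"“” 『』'「」()[]{}（）〔〕［］｛｝〈〉《》!！?？…,，、。.；;：:．")
SENTENCE_DELIMITERS = set("。！？…!?.．")


def split_sentences(tokens):
    """Group tokens into sentences, dropping non-word tokens from the output."""
    sentences, current = [], []
    for token in tokens:
        if token not in NON_WORD_TOKENS:
            current.append(token)
        # A delimiter closes the sentence collected so far.
        if token in SENTENCE_DELIMITERS and current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

For the tokens `你好`, `！`, `我`, `想`, `买`, `书`, `。` the sketch yields two sentences, neither of which contains the punctuation tokens.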

The Chinese Tokenizer does not leave decimal numbers split around the decimal marker; it concatenates the split tokens into one, which makes it easier to identify and annotate decimal numbers later in the processing chain.

Note that a decimal number followed by a factor other than 万 or 亿 is not treated as a number, so its tokens are split rather than concatenated. This is a change from the behavior in versions prior to Teneo 6.

Chinese Annotator IP

The Chinese Annotator Input Processor includes a range of analyzers which treat specific morphological phenomena of Chinese. In general, three operations can be performed by the morphological analyzers:

Annotation (addition of one or more morphological annotations)
Change of the base form property
Concatenation of multiple tokens.

The morphological analyzers are applied in a fixed order; the table below shows the current sequence of analyzers, along with the operations that are performed by them. In the following sections more details are provided for each individual analyzer.

	Analyzer	Example	Annotation	Base form change	Concatenation
1.	VNotV Analyzer	是-不-是	Yes	Yes	Yes
2.	Verb Analyzer	吃-完，跑-上	Yes	Yes	Yes
3.	Reduplication Analyzer	红-红	Yes	Yes	Yes
4.	Loc Analyzer	桌子-上	Yes	Yes	Yes
5.	Aspect Analyzer	吃了, 坐着	Yes	No	No
6.	Negation Analyzer	不吃	Yes	No	No
7.	SC Analyzer	洗了一个澡	Yes	Yes	No
8.	Affix Analyzer	我们, 标准化	Yes	Yes	No

VNotV Analyzer

The VNotV Analyzer concatenates and analyzes V-Not-V sequences.

In V-Not-V structures, the same verb occurs twice with a negation word (不, 没, 否) between the two occurrences:

你去-不-去买东西？
Nǐ qù-bù-qù mǎi dōngxī?
You go-VNOTV.BU-go buy things
‘Do you go shopping?’

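The V-Not-V structure has two uses: it forms direct yes/no questions, as in the example above, and embedded questions, as in the following example: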

我不知道他去-不-去买东西。
Wǒ bù zhīdào tā qù-bù-qù mǎi dōngxī.
I NEG know he go-VNOTV.BU-go buy things
‘I don’t know whether he goes shopping.’

In the case of bi-syllabic words, the second syllable of the first verb might be deleted:

你喜(欢)-不-喜欢买东西？
Nǐ xǐ(huān)-bù-xǐhuān mǎi dōngxī?
You like-VNOTV.BU-like buy things
‘Do you like shopping?’

The VNotV Analyzer concatenates the three tokens and assigns one structural annotation (prefixed with VNOTV) signaling the negation form. The base form of the resulting token is set to the full form of the verb. Thus, example 3 above, with second-syllable deletion, is analyzed as follows:

The three tokens are concatenated into one word 喜不喜欢
This word gets the base form 喜欢
Additionally, it gets the annotation VNOTV.BU.
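The three steps can be sketched in Python as below. The sketch is an illustration, not the analyzer's implementation: tokens are modeled as hypothetical `(text, base form)` pairs, the function name `merge_vnotv` is invented, and only the label for 不 is included because it is the only VNOTV label given in this section.

```python
# Only the label documented above; labels for 没 and 否 are not listed here.
VNOTV_LABELS = {"不": "VNOTV.BU"}


def merge_vnotv(tokens):
    """tokens: list of (text, base) pairs.
    Returns a list of (text, base, annotations) triples."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 2 < len(tokens):
            (verb1, _), (neg, _), (verb2, base2) = tokens[i:i + 3]
            # The first copy may be shortened to the first syllable of the verb,
            # so a prefix check covers both 去-不-去 and 喜-不-喜欢.
            if neg in VNOTV_LABELS and verb2.startswith(verb1):
                merged.append((verb1 + neg + verb2, base2, [VNOTV_LABELS[neg]]))
                i += 3
                continue
        text, base = tokens[i]
        merged.append((text, base, []))
        i += 1
    return merged
```

A real analyzer would presumably also check that the outer tokens are verbs; the prefix check alone is a simplification.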

Verb Analyzer

The Verb Analyzer performs analysis of resultative and directional compounds. Resultative compounds consist of one main verb and one resultative suffix:

小王吃-完了。
Xiǎowáng chī-wán le.
Xiaowang eat-RESULT ASPECT
‘Xiaowang finished eating.’

Directional compounds consist of one main verb and one or two directional suffixes:

阿明跑-上楼梯了。
Āmíng pǎo-shàng lóutī le.
Aming run-DIR.NONDEICTIC.SHANG stairs ASPECT
‘Aming ran up the stairs.’
阿明跑-上-去了。
Āmíng pǎo-shàng-qù le.
Aming run-DIR.NONDEICTIC.SHANG-DIR.DEICTIC.QU ASPECT
‘Aming ran up.’

The combination of the main verb with the resultative/directional complements is concatenated. The base form of the resulting token is changed to the base form of the main verb. The token is assigned the annotations associated with the resultative/directional suffixes.

Annotations for resultative suffixes carry the prefix RESULT. Annotations for directional suffixes carry the prefix DIR. Additionally, we distinguish between deictic (DIR_DEICTIC…) and non-deictic (DIR_NONDEICTIC…) directional complements. Cases with two directional suffixes are limited to a non-deictic complement followed by a deictic complement.

Reduplication Analyzer

The Reduplication Analyzer analyzes reduplications of verbs, adjectives and adverbs:

a. 红-红
hóng-hóng
red-red
‘very red’

b. 讨论-讨论
tǎolùn-tǎolùn
discuss-discuss
‘to discuss a little’

Reduplication of adjectives shows some variability in the distribution of the syllables. Specifically, some adjectives exhibit the following asymmetric reduplication patterns:

a. AABB:
干净 → 干-干-净-净
gānjìng → gān-gān-jìng-jìng
clean → clean-clean
‘clean’ → ‘very clean’

b. ABB:
雪白 → 雪-白-白
xuěbái → xuě-bái-bái
white → white-white
‘white’ → ‘very white’

c. AAB:
逛街 → 逛-逛-街
guàngjiē → guàng-guàng-jiē
walk street → walk-walk-street
‘walk street’ → ‘go window shopping’

In verbal reduplication, the particles 一 and 了 can occur between the two copies:

看-一-看
kàn-yī-kàn
look-one-look
‘to take a look’

If the two words are segmented in the original tokenization, they are concatenated by the Reduplication Analyzer. The reduplicated word gets the annotation REDUP.
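The reduplication patterns described in this section can be recognized with simple string comparisons, as in the following sketch. It is purely pattern-based and therefore over-generates (any word made of a doubled character matches), whereas the actual analyzer is restricted to verbs, adjectives and adverbs. The function name `reduplication_base` is hypothetical, and returning the unreduplicated form as the base is an assumption; this section only states that the word gets the annotation REDUP.

```python
def reduplication_base(word):
    """Return the assumed unreduplicated form of word, or None if no pattern matches."""
    n = len(word)
    if n == 2 and word[0] == word[1]:
        return word[0]                      # AA: 红红
    if n == 3:
        a, b, c = word
        if a == c and b in "一了":
            return a                        # A一A / A了A: 看一看
        if b == c and a != b:
            return a + b                    # ABB: 雪白白
        if a == b and b != c:
            return a + c                    # AAB: 逛逛街
    if n == 4:
        if word[:2] == word[2:] and word[0] != word[1]:
            return word[:2]                 # ABAB: 讨论讨论
        if word[0] == word[1] and word[2] == word[3] and word[0] != word[2]:
            return word[0] + word[2]        # AABB: 干干净净
    return None
```

A word for which the function returns a value would be the candidate for the REDUP annotation.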

Loc Analyzer

The Loc Analyzer performs concatenation and analysis of noun + localizer combinations; localizers follow nouns and ‘transform’ them into locative nouns:

桌子-上
table-LOC.ON.SHANG
‘on the table’

Localizers form a closed set; the table below shows the mapping from localizers to their annotations.

Form of localizer	Annotation
上	`LOC_ON_SHANG`
下	`LOC_UNDER_XIA`
里	`LOC_INSIDE_LI`
内	`LOC_INSIDE_NEI`
外	`LOC_OUTSIDE_WAI`
前	`LOC_BEFORE_QIAN`
后	`LOC_BEHIND_HOU`
旁	`LOC_NEXTTO_PANG`
中	`LOC_IN_ZHONG`

The Loc Analyzer concatenates the noun + localizer combination into one word and assigns it an annotation with the label of the localizer. The base form of the resulting token is set to the base form of the noun. Thus, the example 桌子-上 above is analyzed as follows:

The two tokens are concatenated into one word 桌子上.
This word gets the base form 桌子.
Additionally, it is assigned the annotation LOC_ON_SHANG.
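The localizer table and the three analysis steps can be combined into a short Python sketch. Tokens are modeled as hypothetical `(text, base form, POS tag)` triples, the function name `merge_localizers` is invented, and the check for the POS tag `NN` is an assumption about how nouns are identified.

```python
# Mapping taken from the localizer table above.
LOCALIZERS = {
    "上": "LOC_ON_SHANG", "下": "LOC_UNDER_XIA", "里": "LOC_INSIDE_LI",
    "内": "LOC_INSIDE_NEI", "外": "LOC_OUTSIDE_WAI", "前": "LOC_BEFORE_QIAN",
    "后": "LOC_BEHIND_HOU", "旁": "LOC_NEXTTO_PANG", "中": "LOC_IN_ZHONG",
}


def merge_localizers(tokens):
    """tokens: list of (text, base, pos) triples.
    Returns a list of (text, base, annotations) triples."""
    out, i = [], 0
    while i < len(tokens):
        text, base, pos = tokens[i]
        following = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if pos == "NN" and following in LOCALIZERS:
            # Concatenate noun and localizer; keep the noun's base form.
            out.append((text + following, base, [LOCALIZERS[following]]))
            i += 2
        else:
            out.append((text, base, []))
            i += 1
    return out
```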

Aspect Analyzer

The Aspect Analyzer analyzes aspect markers. Chinese has both pre-verbal and post-verbal aspect markers:

a. 她正在吃。
Tā zhèngzài chī.
she ASPECT eat.
‘She is eating.’

b. 她吃了。
Tā chī le.
she eat ASPECT
‘She ate.’

Marker	Aspect	Annotation	Position
了	Perfective	`ASPECT_PERFECTIVE_LE`	Postverbal
着	Progressive	`ASPECT_PROGRESSIVE_ZHE`	Postverbal
过	Experiential	`ASPECT_EXPERIENTIAL_GUO`	Postverbal
在	Progressive	`ASPECT_PREVERBAL_PROGRESSIVE_ZAI`	Preverbal
正在	Progressive	`ASPECT_PREVERBAL_PROGRESSIVE_ZHENGZAI`	Preverbal

The table above shows the set of aspect markers that the Aspect Analyzer analyzes.

The Aspect Analyzer attaches the respective annotation of the aspect marker to the main verb.

Negation Analyzer

The Negation Analyzer analyzes negations of adverbs, verbs and adjectives. In these cases, the negation particle can immediately precede the negated word:

a. 我没去。
I NEG.MEI go
‘I didn’t go’

b. 不容易
NEG.BU easy
‘not easy’

c. 不太
NEG.BU too
‘not too’

The negation particle can also be separated from the verb by additional material:

别这么做。
Bié zhème zuò.
NEG.BIE this do
‘Don’t do this.’

The set of currently analyzed negation words is shown in the below table.

Form of negator	Annotation
不	`NEG_BU`
否	`NEG_FOU`
没	`NEG_MEI, ASPECT_PERFECTIVE`
没有	`NEG_MEIYOU, ASPECT_PERFECTIVE`
别	`NEG_BIE, MODE_IMPERATIVE`
不太	`NEG_BUTAI`
并不	`NEG_BINGBU`
不怎么	`NEG_BUZENME`

Three of the negation particles (没, 没有, 别) have two annotations. Their second annotation contains aspectual or mode information that is implied by the particle. The Negation Analyzer attaches an annotation to the negated word; it contains the corresponding annotation of the negation particle as well as the particle's index in the sentence. An additional annotation is attached to the negated word if the negation particle carries aspect or mode information.

For example, in example 12a above, the verb 去 receives two annotations: {‘NEG_MEI’, 1} and ASPECT_PERFECTIVE.
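The annotation scheme can be sketched as follows. The sketch only handles the simple case in which the negator immediately precedes the negated word; negators separated from the verb by additional material, as in 别这么做, are out of scope. The function name `annotate_negation` and the output shape (a dictionary keyed by word index) are hypothetical, and word indices are assumed to be zero-based.

```python
# First entry: negation annotation; remaining entries: implied aspect or mode.
NEGATORS = {
    "不": ["NEG_BU"],
    "否": ["NEG_FOU"],
    "没": ["NEG_MEI", "ASPECT_PERFECTIVE"],
    "没有": ["NEG_MEIYOU", "ASPECT_PERFECTIVE"],
    "别": ["NEG_BIE", "MODE_IMPERATIVE"],
    "不太": ["NEG_BUTAI"],
    "并不": ["NEG_BINGBU"],
    "不怎么": ["NEG_BUZENME"],
}


def annotate_negation(words):
    """Map the index of each negated word to the annotations it receives."""
    annotations = {}
    for index, word in enumerate(words[:-1]):
        if word in NEGATORS:
            neg_label, *implied = NEGATORS[word]
            # The negation annotation carries the index of the particle itself.
            annotations.setdefault(index + 1, []).extend([(neg_label, index), *implied])
    return annotations
```

For 我 没 去 the sketch attaches `("NEG_MEI", 1)` and `ASPECT_PERFECTIVE` to the word at index 2.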

SC Analyzer

The SC Analyzer analyzes splittable compounds. Splittable verb-object compounds (SCs) are verb-object combinations with an idiomatic meaning, e.g. 担-心 (worry+heart = ‘to worry’), 生-气 (create+air = ‘to get angry’), 见-面 (see+face = ‘to meet someone’). They allow for various kinds of syntactic activity between verb and object, e.g. insertion of aspect markers, additional objects, demonstratives, etc.:

a. Aspect marker:
我们见- 了 -面
we see- ASPECT -face
‘We met.’

b. Additional object:
帮- 她一个 -忙
help- she one -affair
‘to help her’

c. Nominal modifier:
见- 他的 -面
see- he DEG -face
‘to meet him’

The set of SCs is large and diverse. Although it is difficult to enumerate all SCs exhaustively, the most common instances are captured in a list that currently contains 163 compounds. Once the SC Analyzer identifies a verb that belongs to a splittable compound, it moves forward in the sentence and looks for a valid SC object for this verb. At each subsequent word, it checks whether the sequence following the verb is still a valid splitting sequence. If it arrives at a suitable object before the sequence becomes invalid, it attaches an annotation to the verb. This annotation carries two pieces of information: the tag of the splittable compound (SPLIT_ followed by the pinyin of the compound) and the index of the dependent object. Further, the base form of the verb is set to the base form of the splittable compound.

Thus, in example 14a above, the verb 见 receives the annotation {SPLIT_JIANMIAN, 3}, and its base form is set to 见面.

Affix Analyzer

The Affix Analyzer analyzes inflectional and derivational suffixes. Chinese only has one inflectional suffix, that is the plural suffix -们, which can be attached to human nouns/pronouns:

a. 老师-们
teacher-PLURAL
‘the teachers’

b. 我-们
me-PLURAL
‘we’

Additionally, Chinese has a set of derivational suffixes which change the part of speech of the word to which they are attached. For example, the suffix -者 is attached to verbs, and the resulting combination is a noun and denotes the actor of the base form verb:

使用-者
shǐ-yòng(-)zhě
use-ACTOR.ZHE
‘the user’

A suffixed word gets the corresponding annotation of its suffix, and the base form of the word is changed to the base form without the suffix. Thus, 使用者 in example 16 is analyzed as follows:

使用者 gets the annotation ACTOR_ZHE
使用者 gets the base form 使用.

The below table displays the set of tags used by the Affix Analyzer.

Form of affix	Annotation	Example
-于	`COMPARATIVE_YU`	高于 (两米)
-度	`PROPERTY_DU`	精确度
-性	`PROPERTY_XING`	流线性
-化	`TRANSFORM_HUA`	现代化
-者	`ACTOR_ZHE`	使用者
-师	`ACTOR_SHI`	设计师
-员	`ACTOR_YUAN`	操作员
可-	`ABILITY_KE`	可上升
-们	`PLURAL_MEN`	老师们
-城	`CITY_CHENG`	北京城
-市	`CITY_SHI`	上海市
-省	`PROVINCE_SHENG`	河北省
-儿	`RCOLORING_ERHUA`	好玩儿
-于 (word contains the suffix and has a base form of at least 2 characters)	`PREP_YU`	致力于
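The table lends itself to a lookup-based sketch. The version below is deliberately naive: it strips any listed affix without checking the part of speech, so it over-generates (for instance, it would also split 老师 at 师), and it leaves out -于 because its two annotations depend on conditions not modeled here. The function name `analyze_affix` is hypothetical.

```python
# Suffix and prefix annotations from the table above (-于 omitted on purpose).
SUFFIXES = {
    "度": "PROPERTY_DU", "性": "PROPERTY_XING", "化": "TRANSFORM_HUA",
    "者": "ACTOR_ZHE", "师": "ACTOR_SHI", "员": "ACTOR_YUAN",
    "们": "PLURAL_MEN", "城": "CITY_CHENG", "市": "CITY_SHI",
    "省": "PROVINCE_SHENG", "儿": "RCOLORING_ERHUA",
}
PREFIXES = {"可": "ABILITY_KE"}


def analyze_affix(word):
    """Return (base form, annotation); annotation is None when no affix is found."""
    if len(word) > 1 and word[-1] in SUFFIXES:
        return word[:-1], SUFFIXES[word[-1]]
    if len(word) > 1 and word[0] in PREFIXES:
        return word[1:], PREFIXES[word[0]]
    return word, None
```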

Chinese Numbers IP

The Chinese Numbers Recognizer Input Processor simplifies writing syntaxes against numbers and numeric expressions in Teneo Studio solutions and provides the following functionalities:

Normalization of tokens containing numeric values into Hindu-Arabic numerals
Creation of a NUMBER annotation with a numericValue variable, which has type BigDecimal and contains a representation of the normalized number
Creation of an annotation with the name of the normalized number value
Annotation of inexact numbers (i.e. numbers containing the characters 几, 数, 余 or 多) with the annotation INEXACT.

The Chinese Numbers Recognizer Input Processor leaves the tokenization unmodified and does not try to concatenate neighboring numeric expressions, nor does it split numeric parts of a token from its non-numeric parts. It will, however, identify and annotate tokens which contain numeric subparts; for example, for the token “三点” the normalized numeric value is 3. Furthermore, it works with decimal factored numbers like 5.5万 or 1.2亿 and supports fractions and formal Hanzi numerals.

Numeric normalization

Numeric string normalization is applied to substrings of the input string. The normalized values are used to create annotations; the input string itself remains unmodified. The following normalization steps are applied by the Chinese Numbers IP:

Hindu-Arabic numerals remain unchanged
Hanzi numerals are normalized to their Hindu-Arabic numeric value
Mixed Hanzi/Hindu-Arabic numerals are normalized to Hindu-Arabic numerals.

Input token	Normalized Numeric Value
10	10
3.14	3.14
一	1
一点	1
两百	200
三百万五千	3005000
3百万5千	3005000
三百五	350
一万零一	10001
一万〇一	10001

The above table shows examples of normalization; in the last three examples it is possible to see that even more colloquial numeric expressions such as “三百五” are handled correctly.
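To make the normalization rules concrete, here is a minimal Python sketch that reproduces the rows of the table above. It is an illustration under simplifying assumptions, not the Input Processor's algorithm: it covers integers written with Hanzi or mixed numerals and plain Hindu-Arabic decimals, but not decimal factored numbers (5.5万), fractions or formal numerals. The names `hanzi_to_int` and `normalize` are hypothetical, and Python's `Decimal` stands in for BigDecimal.

```python
import re
from decimal import Decimal

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}
NUMERAL_RUN = re.compile(
    "[0-9" + "".join(DIGITS) + "".join(SMALL_UNITS) + "".join(BIG_UNITS) + "]+")
ASCII_DECIMAL = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def hanzi_to_int(run):
    """Evaluate a run of Hanzi and/or Hindu-Arabic numeral characters."""
    total = section = number = 0
    last_unit, zero_seen = 1, False
    for ch in run:
        if ch in SMALL_UNITS:
            # A bare unit such as 十 counts as one times the unit.
            section += (number or 1) * SMALL_UNITS[ch]
            number, last_unit, zero_seen = 0, SMALL_UNITS[ch], False
        elif ch in BIG_UNITS:
            total += ((section + number) or 1) * BIG_UNITS[ch]
            section, number, last_unit, zero_seen = 0, 0, BIG_UNITS[ch], False
        else:
            digit = DIGITS[ch] if ch in DIGITS else int(ch)
            zero_seen = zero_seen or digit == 0
            number = number * 10 + digit      # also handles positional 一二三四
    # Colloquial forms: a trailing digit without a unit takes the next lower
    # unit (三百五 = 350), unless a zero was written before it (一万零一).
    if number and last_unit >= 10 and not zero_seen:
        number *= last_unit // 10
    return total + section + number


def normalize(token):
    """Return the numeric value found in token as a Decimal, or None."""
    if ASCII_DECIMAL.fullmatch(token):
        return Decimal(token)
    match = NUMERAL_RUN.search(token)
    return Decimal(hanzi_to_int(match.group())) if match else None
```

Note that `normalize` only reads the numeric substring, so a token such as 一点 yields 1 while the token itself stays untouched, in line with the description above.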

The NUMBER annotation

The NUMBER annotation allows for writing syntaxes on the existence of numbers in user inputs, without the need to specify any number explicitly. All the Teneo Studio user needs to do is use the NUMBER annotation in the syntax. For example:

tlml

%I_WANT.PHR + %$NUMBER + %PRODUCT.LIST

The numeric value can also be retrieved using a listener and used later in the flow. The listings below show how numeric value retrieval is done.

tlml

%$NUMBER + %PRODUCT.LIST

groovy

int numberAnnotIndex = (_.usedWordIndices as List)[0]

def numberAnnot = _.inputAnnotations.getByName('NUMBER').find {
    // be sure that the annotation points to the correct word
    numberAnnotIndex in it.getWordIndices()
}

// store the value in the flow variable numProducts
numProducts = numberAnnot.getVariables()['numericValue'] as int

The numeric value can also be retrieved using an NLU variable:

tlml

%I_WANT.PHR + %$NUMBER^{someVariable=lob.numericValue} + %PRODUCT.LIST

The normalized number annotation

The normalized number annotation is just the numeric value of the NUMBER annotation as an annotation itself. This allows the Teneo Studio user to write syntaxes against specific numbers, without the need to specify all the different surface variants. Thanks to the traditional-to-simplified Chinese character conversion done in the Chinese Tokenizer IP, even traditional numeric Hanzi characters match.

The table below shows examples of normalized number annotations.

Syntax	Matching inputs
%$2	'2', '两', '二', '２', ...
%$10000	'10000', '万', '萬', '一万', '一〇〇〇〇', '１００００', ...
%$3.14	'3.14', '三.一四', ...
%$350	'350', '三百五', '三五〇', ...
%$1234	'1234', '１２３４', '一二三四', '一千两百三十四', ...

Date and Time annotations

The TIME.DATETIME and DATE.DATETIME annotations are created in the Teneo Platform for numbers which could be either time or date expressions. For example, 五点零零 creates the annotation TIME.DATETIME with values hour: 5 and minute: 0, and 1/2 creates the annotation DATE.DATETIME with values month: 1, day: 2.

To read more about the native understanding and interpretation of date and time expressions in the Teneo Platform, please see here.

System Annotation IP

The System Annotation Input Processor, shared among the different languages of the Teneo Platform, performs simple analysis of the sentence text to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This IP works on the sentences passed in, but does not modify them.

Other considerations

Extra request parameters read by this input processor: (none)

Processing options read by this input processor: (none)

Annotations this input processor may generate:

_EMPTY: the sentence text is empty
_EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
_EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
_QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
_QT3: the sentence text contains three or more characters in a row of the characters specified with questionMarkCharacters
_QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
_DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
_BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
_NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
_BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).
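The character-based rules in the list above can be illustrated with a short Python sketch covering `_EMPTY` and the exclamation and question mark annotations. The two character sets are assumed values chosen for illustration; the actual sets come from the properties exclamationMarkCharacters and questionMarkCharacters. The function name `system_annotations` is hypothetical.

```python
import re

# Assumed values for illustration; the real sets are configured via
# exclamationMarkCharacters and questionMarkCharacters.
EXCLAMATION_CHARS = "!！"
QUESTION_CHARS = "?？"


def system_annotations(sentence):
    """Return the subset of system annotations that this sketch can decide."""
    found = set()
    if not sentence:
        found.add("_EMPTY")
    for chars, single, triple in (
        (EXCLAMATION_CHARS, "_EXCLAMATION", "_EM3"),
        (QUESTION_CHARS, "_QUESTION", "_QT3"),
    ):
        char_class = "[" + re.escape(chars) + "]"
        if re.search(char_class, sentence):
            found.add(single)
        # Three or more characters of the set in a row.
        if re.search(char_class + "{3,}", sentence):
            found.add(triple)
    return found
```

Under these assumptions, `真的吗???` receives `_QUESTION` and `_QT3`, while `?!?` receives `_QUESTION` and `_EXCLAMATION` only, because no three characters of the same set occur in a row.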

Special System annotations

Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:

_INIT: indicates session start, i.e. the first input in a dialogue
_TIMEOUT: indicates the continuation of a previously timed-out session/dialogue.

Several configuration properties are available for the System Annotation Input Processor; please see the details here.

Language Detector IP

The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input as well as a confidence score of the prediction.

The Language Detector IP can predict the following 45 languages (language label in brackets):

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language, under the label SR_HR and Indonesian and Malay are treated as one language, under the label ID_MS.

The Input Processor also uses a number of regular expressions that keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the prediction confidence is above 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions below 0.2, since the Language Detector is mostly accurate when predicting them.
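The threshold rule can be written down as a small decision function. This is a sketch of the rule as described here, assuming the model has already produced a language label and a confidence; the function name `language_annotation` is hypothetical, and the returned string follows the `{language label}.LANG` naming given above.

```python
CONFIDENCE_THRESHOLD = 0.2

# Languages that are annotated regardless of the confidence.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}


def language_annotation(label, confidence):
    """Return the annotation name for a prediction, or None if it is dropped."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return f"{label}.LANG"
    return None
```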

Predict IP

The Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict IP uses the model to annotate each user input with the machine learning classes defined.

Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical thresholds can be configured in the properties file of the Input Processor.

For each selected class, an annotation with the name <CLASS_NAME>.INTENT will be created, with the value of the model confidence in the class. A special annotation <CLASS_NAME>.TOP_INTENT is also created for the class with the highest score.
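The selection rules can be summarized in a short Python sketch. It assumes the model's output is available as a mapping from class name to confidence; the function name `select_intents` and its parameter names are hypothetical, and the default values are the ones stated in this section. Whether the TOP_INTENT annotation counts toward the maximum of 5 is not specified here; the sketch assumes it does not.

```python
def select_intents(scores, min_confidence=0.01, similarity=0.5, max_annotations=5):
    """scores: dict mapping class name to confidence.
    Returns a dict mapping annotation name to confidence."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return {}
    top_name, top_score = ranked[0]
    # The top class is always kept; the others must pass both criteria.
    selected = [ranked[0]] + [
        (name, score) for name, score in ranked[1:]
        if score > min_confidence and score > similarity * top_score
    ]
    annotations = {f"{name}.INTENT": score
                   for name, score in selected[:max_annotations]}
    annotations[f"{top_name}.TOP_INTENT"] = top_score
    return annotations
```

With a top confidence of 0.7, a class at 0.4 is kept because it exceeds 0.5 x 0.7 = 0.35, while a class at 0.3 is discarded.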

Configuration properties

Name	Type	Required	Default
`minConfidenceSimilarityDistance`	float	no	0.5

Fraction of the top class's confidence that a class must exceed in order to be considered, e.g. if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 will be discarded.

Name	Type	Required	Default
`maxNumberOfAnnotations`	int	no	5

Maximum number of class annotations to create for each user input.

Name	Type	Required	Default
`minConfidenceThreshold`	float	no	0.01

Minimum confidence the model must assign to a class for it to be added as one of the candidate annotations.

Name	Type	Required	Default
`intent.model.file.name`	string (filename)	no	inexistent

Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.
