Turkish Input Processors Chain
Introduction
An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization. Each language supported by the Teneo Platform has a chain of input processors that know how to process that particular language.
IP Chain Setup
The following graph displays the setup of the Turkish Input Processors chain; each Input Processor is described further in the following sections.
Standard Simplifier
The Standard Simplifier is a separate processing unit which is not an input processor and which provides a method to normalize some text, usually - but not necessarily - a word. Here, "normalization" means removal of text properties that are semantically insignificant, like conversion to lower case (considering the configured language locale), removal of some accents and normalization of Unicode combining characters. By default, the Input Processors call the Simplifier when they generate a new word item. Furthermore, the Simplifier is called by the language condition parser of the Teneo Engine when it stores a language condition word (i.e., TLML syntax word) in the solution dictionary.
- Simplification
- The Simplifier decomposes and normalizes characters, for example by lowercasing them and normalizing Unicode combining characters.
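As a rough illustration of the kind of normalization described above, the following Python sketch lowercases text using Turkish casing rules and composes Unicode combining characters. The function name `simplify_tr` is hypothetical and the sketch is not the Teneo Simplifier itself; in particular it does not remove any accents, since which accents the Simplifier removes is not specified here.

```python
import unicodedata


def simplify_tr(text: str) -> str:
    """Hypothetical sketch: Turkish-aware lowercasing plus Unicode composition."""
    # Compose combining sequences first, e.g. "I" + U+0307 becomes "İ" (U+0130).
    composed = unicodedata.normalize("NFC", text)
    # Python's str.lower() is not locale-aware, so the two Turkish-specific
    # mappings are applied by hand: dotted "İ" -> "i" and dotless "I" -> "ı".
    mapped = composed.replace("\u0130", "i").replace("I", "\u0131")
    # Lowercase everything else and make sure the result is composed.
    return unicodedata.normalize("NFC", mapped.lower())


print(simplify_tr("ISPARTA"))
print(simplify_tr("I\u0307ZMI\u0307R"))
```

The explicit replacements matter because the default, locale-independent lowercasing would turn "I" into "i", which is a different letter in Turkish.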
Input Processors
Turkish Analyzer
The Turkish Analyzer is based on Zemberek and performs the following tasks:
- User input normalization,
- Sentence splitting,
- Tokenization, and
- Part-of-Speech (POS) and morphological annotations.
Normalization, Sentence Splitting and Tokenization
The Turkish Analyzer performs normalization on user inputs and, furthermore, will segment the input into sentences, tokenize and analyze the morphological structure of each token in the context of the sentence.
This means that each sentence will be normalized by the Turkish Analyzer, i.e., the sentence will be lowercased and, in some cases, typos will be fixed. Unlike other Teneo input processors, the API method getOriginal() on a word object will return the normalized form (which might be different from the simplified form) as the normalization happens before the tokenization.
This has direct implications for the exact option: for other languages it works on the ORIGINAL form, but for Turkish, users need to be aware that it operates on the normalized strings.
The original user input is not modified and can be retrieved with getUserInputText().
A sentence in Turkish is an instance of the TurkishSentence class, which implements the SentenceI interface from the engine-input-processor-api. The method getText() of the class TurkishSentence returns the normalized sentence text. A direct caller of the input processor chain can retrieve the original sentence text with the method getRawSentence() by casting a Sentence to a TurkishSentence. The original sentence text cannot be accessed via the engine scripting API.
The sentence indices point to the characters in the original user input string. The word indices point to the characters in the sentence, i.e. the normalized sentence string.
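The difference between the two index spaces can be sketched with plain string slicing. All values below are hypothetical: the normalized sentence is invented for illustration and is not actual Turkish Analyzer output.

```python
# Hypothetical original user input containing two sentences.
user_input = "Merhaba! YARIN gelcem"

# Sentence indices refer to characters in the original user input string.
sentence_begin, sentence_end = 9, 21
raw_sentence = user_input[sentence_begin:sentence_end]

# Assumed result of lowercasing and typo fixing for that sentence.
normalized_sentence = "yarın geleceğim"

# Word indices refer to characters in the normalized sentence string.
word_begin, word_end = 6, 15
word = normalized_sentence[word_begin:word_end]

print(raw_sentence)                        # slice of the original input
print(word)                                # slice of the normalized sentence
print(raw_sentence[word_begin:word_end])   # word indices applied to the raw text
```

Because normalization may change the length of the text, applying word indices to the original input can return a different or truncated string, as the last line shows.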
POS and Morphological Annotations
The Turkish Analyzer also annotates user inputs with POS and morphological information. Each word will be annotated with its lemma, if available. A lemma annotation contains the POS tag as an annotation variable pos:<string>.
The morphological information will be returned as annotations for the three different types that Zemberek returns with the following suffixes:
- .POS: primary part-of-speech tag of the entire token
- .POS/.NER: secondary part-of-speech tag (mix of entities/POS tags) based on the stem of the token
- .MST: morphosyntactic information based on the morphemes of the token
The MST annotations all have the annotation variable surface=<string> that contains the substring of the surface form of that morpheme in the word, if available.
The table below lists how the tags from Zemberek are mapped to annotations in Teneo; please see here for information related to available ANNOT Language Objects in the Turkish Lexical Resource.
Zemberek Type | Zemberek Tag | Map to annotations |
---|---|---|
POS | Noun | NN.POS |
POS | Adj | ADJ.POS |
POS | Adv | ADV.POS |
POS | Conj | CC.POS |
POS | Interj | INTERJ.POS |
POS | Verb | VB.POS |
POS | Pron | PRON.POS |
POS | Num | NUMERAL.POS |
POS | Det | DET.POS |
POS | Postp | POST_POSITIVE.POS |
POS | Ques | INTERROG.POS |
POS | Dup | DUPLICATOR.POS |
POS | Punc | PUNCT.POS |
POS2 | Demons | DEMOS.POS |
POS3 | Time | TIME.NER |
POS4 | Quant | QUANTITATIVE.POS |
POS5 | Ques | INTERROG.POS |
POS6 | Prop | PROPER.POS |
POS7 | Pers | PERS.POS |
POS8 | Reflex | REFLEXIVE.POS |
POS9 | Ord | ORDINAL.POS |
POS10 | Card | CARDINAL.POS |
POS11 | Percent | PERCENT.NER |
POS12 | Ratio | RATIO.NER |
POS13 | Range | RANGE.NER |
POS14 | Dist | DIST.NER |
POS15 | Clock | CLOCK.NER |
POS16 | Date | DATE.NER |
POS17 | Email | EMAIL.NER |
POS18 | Url | URL.NER |
POS19 | Mention | MENTION.NER |
POS20 | HashTag | HASHTAG.NER |
POS21 | Emoticon | EMOTICON.NER |
POS22 | RegAbbrv | ABBREVIATION.NER |
POS23 | Abbrv | ABBREVIATION.NER |
MST | Noun | NN.MST |
MST | Adj | ADJ.MST |
MST | Adv | ADV.MST |
MST | Conj | CC.MST |
MST | Interj | INTERJ.MST |
MST | Verb | VB.MST |
MST | Pron | PRON.MST |
MST | Num | NUMERAL.MST |
MST | Det | DET.MST |
MST | Postp | POST_POSITIVE.MST |
MST | Ques | INTERROG.MST |
MST | Dup | DUPLICATOR.MST |
MST | Punc | PUNCT.MST |
MST | A1sg | 1STPERSON.MST, SG.MST |
MST | A2sg | 2NDPERSON.MST, SG.MST |
MST | A3sg | 3RDPERSON.MST, SG.MST |
MST | A1pl | 1STPERSON.MST, PL.MST |
MST | A2pl | 2NDPERSON.MST, PL.MST |
MST | A3pl | 3RDPERSON.MST, PL.MST |
MST | Pnon | NO_POSESSION.MST |
MST | P1sg | POSS_1STPERSON.MST, POSS_SG.MST |
MST | P2sg | POSS_2NDPERSON.MST, POSS_SG.MST |
MST | P3sg | POSS_3RDPERSON.MST, POSS_SG.MST |
MST | P1pl | POSS_1STPERSON.MST, POSS_PL.MST |
MST | P2pl | POSS_2NDPERSON.MST, POSS_PL.MST |
MST | P3pl | POSS_3RDPERSON.MST, POSS_PL.MST |
MST | Nom | NOMINATIVE.MST |
MST | Dat | DATIVE.MST |
MST | Acc | ACCUSATIVE.MST |
MST | Abl | ABLATIVE.MST |
MST | Loc | LOCATIVE.MST |
MST | Ins | INSTRUMENTAL.MST |
MST | Gen | GENITIVE.MST |
MST | Equ | EQUATIVE.MST |
MST | Dim | DIMINUTIVE.MST |
MST | Ness | NESS.MST |
MST | With | WITH.MST |
MST | Without | WITHOUT.MST |
MST | Related | RELATED.MST |
MST | JustLike | JUST_LIKE.MST |
MST | Rel | RELATION.MST |
MST | Agt | AGENTIVE.MST |
MST | Become | BECOME.MST |
MST | Acquire | ACQUIRE.MST |
MST | Ly | LY.MST |
MST | Caus | CAUSATIVE.MST |
MST | Recip | RECIPROCAL.MST |
MST | Reflex | REFLEXIVE.MST |
MST | Able | ABILITY.MST |
MST | Pass | PASSIVE.MST |
MST | Inf1 | INFINITIVE1.MST |
MST | Inf2 | INFINITIVE2.MST |
MST | Inf3 | INFINITIVE3.MST |
MST | ActOf | ACT_OF.MST |
MST | PastPart | PART_PAST.MST |
MST | NarrPart | PART_NARRATIVE.MST |
MST | FutPart | PART_FUTURE.MST |
MST | PresPart | PART_PRESENT.MST |
MST | AorPart | PART_AORIST.MST |
MST | NotState | NOT_STATE.MST |
MST | FeelLike | FEEL_LIKE.MST |
MST | EverSince | EVER_SINCE.MST |
MST | Repeat | REPEAT.MST |
MST | Almost | ALMOST.MST |
MST | Hastily | HASTILY.MST |
MST | Stay | STAY.MST |
MST | Start | START.MST |
MST | AsIf | AS_IF.MST |
MST | While | WHILE.MST |
MST | When | WHEN.MST |
MST | SinceDoingSo | SINCE_DOING_SO.MST |
MST | AsLongAs | AS_LONG_AS.MST |
MST | ByDoingSo | BY_DOING_SO.MST |
MST | Adamantly | ADAMANTLY.MST |
MST | AfterDoingSo | AFTER_DOING_SO.MST |
MST | WithoutHavingDoneSo | WITHOUT_HAVING_DONE_SO.MST |
MST | WithoutBeingAbleToHaveDoneSo | WITHOUT_BEING_ABLE_TO_DO_SO.MST |
MST | Zero | ZERO.MST |
MST | Cop | COP.MST |
MST | Neg | NEGATIVE.MST |
MST | Unable | UNABLE.MST |
MST | Pres | PRESENT.MST |
MST | Past | PAST.MST |
MST | Narr | NARRATIVE.MST |
MST | Cond | CONDITION.MST |
MST | Prog1 | PROGRESSIVE1.MST |
MST | Prog2 | PROGRESSIVE2.MST |
MST | Aor | AORIST.MST |
MST | Fut | FUTURE.MST |
MST | Imp | IMPERATIVE.MST |
MST | Opt | OPTATIVE.MST |
MST | Desr | DESIRE.MST |
MST | Neces | NECESSITY.MST |
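The mapping in the table can be pictured as a simple lookup from a (type, tag) pair to one or more annotation names. The sketch below uses only a small excerpt of the table; the names `TAG_TO_ANNOTATIONS` and `to_annotations` are hypothetical, the example tag sequence is illustrative, and silently skipping unknown tags is an assumption rather than documented behavior.

```python
# Excerpt of the mapping table; the full table contains many more entries.
TAG_TO_ANNOTATIONS = {
    ("POS", "Noun"): ["NN.POS"],
    ("POS", "Verb"): ["VB.POS"],
    ("MST", "Noun"): ["NN.MST"],
    ("MST", "Verb"): ["VB.MST"],
    ("MST", "A1sg"): ["1STPERSON.MST", "SG.MST"],
    ("MST", "A3sg"): ["3RDPERSON.MST", "SG.MST"],
    ("MST", "P1sg"): ["POSS_1STPERSON.MST", "POSS_SG.MST"],
    ("MST", "Dat"): ["DATIVE.MST"],
    ("MST", "Fut"): ["FUTURE.MST"],
    ("MST", "Neg"): ["NEGATIVE.MST"],
}


def to_annotations(tags):
    """Map (type, tag) pairs to annotation names, keeping first-seen order."""
    names = []
    for key in tags:
        # Assumption: tags without a mapping are skipped.
        for name in TAG_TO_ANNOTATIONS.get(key, []):
            if name not in names:
                names.append(name)
    return names


# Illustrative tags for a dative noun with first person singular possession.
example = [("POS", "Noun"), ("MST", "Noun"), ("MST", "A3sg"),
           ("MST", "P1sg"), ("MST", "Dat")]
print(to_annotations(example))
```

Note that a single tag such as A3sg yields two annotations, one for person and one for number.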
System Annotation
Teneo bundles two default collections of annotations in all language configurations: standard annotations added by the System Annotation Input Processor and special system annotations added by the Engine. The System Annotation Input Processor performs simple analysis of the sentence texts and may generate the standard annotations listed below.
Annotation | Description |
---|---|
_BINARY | The input consists of only 0s and 1s |
_BRACKETPAIR | At least one matching pair of brackets appears in the input; possible bracket types: ( ), [ ], { } |
_EXCLAMATION | At least one exclamation mark (!) appears in the input |
_EM3 | Three (or more) exclamation marks (!!!) appear in a row in the input |
_EMPTY | The input contains no text / the sentence text is empty |
_NONSENSE | The input contains nonsense text, e.g., 'asdf', 'wgwwgwg', 'xxxxxx' |
_QUESTION | At least one question mark (?) appears in the input |
_QT3 | Three (or more) question marks (???) appear in a row in the input |
_QUOTE | At least one single quotation mark (') appears in the input |
_DBLQUOTE | At least one quotation mark (") appears in the input |
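Most of the checks in the table are simple enough to approximate with string tests and regular expressions. The following sketch is an assumption-based approximation, not the System Annotation Input Processor: the function name `system_annotations` is hypothetical, whitespace handling is guessed, nested brackets are not handled specially, and _NONSENSE is omitted because its detection rules are not described here.

```python
import re

# Matches a pair of round, square or curly brackets with no brackets in between.
BRACKET_PAIR = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")


def system_annotations(text: str) -> set[str]:
    """Return the standard annotations that would apply under the assumptions above."""
    stripped = text.strip()
    if not stripped:
        return {"_EMPTY"}
    found = set()
    if re.fullmatch(r"[01]+", stripped):
        found.add("_BINARY")
    if BRACKET_PAIR.search(text):
        found.add("_BRACKETPAIR")
    if "!" in text:
        found.add("_EXCLAMATION")
    if "!!!" in text:
        found.add("_EM3")
    if "?" in text:
        found.add("_QUESTION")
    if "???" in text:
        found.add("_QT3")
    if "'" in text:
        found.add("_QUOTE")
    if '"' in text:
        found.add("_DBLQUOTE")
    return found


print(sorted(system_annotations("Really???")))
```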
Special System Annotations
The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.
Annotation | Description |
---|---|
_INIT | Indicates session start, i.e., the first input in a dialogue |
_TIMEOUT | Indicates the continuation of a previously timed-out session/dialogue |
Basic Number Recognizer
The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.
Annotation | Variable | Description |
---|---|---|
NUMBER | numericValue | Annotation created for identified Arabic numbers in user inputs |
For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:
- It must match the regular expression:
[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+
- It must be parseable by Java's BigDecimal to ensure it is a number
The above syntax provides the following guarantees:
- The sign is not included in the annotated token
- The numericValue variable contains a BigDecimal representation of the number.
In the above example regex, the dot is used as the decimal marker and the comma as the thousands separator; as described earlier, this configuration is language dependent and therefore varies depending on the selected solution language.
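The two conditions can be tried out with a short Python sketch that uses the example regular expression as given. This is an approximation under stated assumptions: Python's `decimal.Decimal` stands in for Java's BigDecimal, the function name `recognize_numbers` is hypothetical, and scanning the raw text with `finditer` is a simplification of however the processor actually walks the input.

```python
import re
from decimal import Decimal, InvalidOperation

# The example regex from the text: comma as thousands separator, dot as decimal marker.
NUMBER_PATTERN = re.compile(r"[,]?[0-9]+(?:[,][0-9]+)*(?:[.][0-9]+)?|[.][0-9]+")


def recognize_numbers(text):
    """Return (matched token, numeric value) pairs for numbers found in text."""
    results = []
    for match in NUMBER_PATTERN.finditer(text):
        token = match.group()
        try:
            # The thousands separator is ignored before parsing.
            value = Decimal(token.replace(",", ""))
        except InvalidOperation:
            continue  # not parseable as a number, so no annotation
        results.append((token, value))
    return results


for token, value in recognize_numbers("Total: 1,234.5 and -42 or .5"):
    print(token, value)
```

In this example the minus sign in front of 42 is not part of the matched token, which is consistent with the guarantee that the sign is not included in the annotated token.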
Language Detector
The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with a confidence score of the prediction.
Annotation | Variable | Description |
---|---|---|
<language label>.LANG, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language |
The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.
The Input Processor also uses a number of regexes that prevent the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.
The Language Detector provides an annotation when the confidence of the prediction is above 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
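The threshold rule can be summarized in a few lines of Python. The sketch below only models the decision described in the text; the function name `language_annotation` and the dictionary shape of the returned annotation are hypothetical, and the prediction itself is assumed to come from elsewhere.

```python
# Labels of the languages for which an annotation is always created.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}
CONFIDENCE_THRESHOLD = 0.2


def language_annotation(label: str, confidence: float):
    """Return a <label>.LANG annotation, or None if the prediction is dropped."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return {"name": f"{label}.LANG", "Confidence": confidence}
    return None


print(language_annotation("TR", 0.9))   # above the threshold
print(language_annotation("TR", 0.15))  # below the threshold, dropped
print(language_annotation("JA", 0.05))  # always annotated
```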
Predict
The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.
When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
Annotation | Variable | Variable | Variable | Description |
---|---|---|---|---|
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
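The selection rules described in this section can be sketched as follows. This is an illustration rather than the Predict implementation: the function name `select_intents` and the dictionary shape are hypothetical, the default classifier value is only a placeholder, and the sketch assumes that the limit of five applies to the .INTENT annotations, with the .TOP_INTENT annotation created in addition.

```python
def select_intents(scores, classifier="Learn", min_confidence=0.01,
                   ratio=0.5, max_annotations=5):
    """Build annotation dicts from a mapping of class name to confidence."""
    # Rank classes from highest to lowest confidence.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_confidence = ranked[0]
    # The top class is always selected; other classes must pass both criteria.
    selected = [ranked[0]] + [
        (name, confidence) for name, confidence in ranked[1:]
        if confidence > min_confidence and confidence > ratio * top_confidence
    ]
    selected = selected[:max_annotations]

    annotations = [{"name": f"{top_name}.TOP_INTENT",
                    "classifier": classifier,
                    "confidence": top_confidence}]
    for order, (name, confidence) in enumerate(selected):
        annotations.append({"name": f"{name}.INTENT",
                            "classifier": classifier,
                            "confidence": confidence,
                            "Order": order})
    return annotations


scores = {"BOOK_FLIGHT": 0.6, "CANCEL": 0.35, "GREETING": 0.2, "OTHER": 0.005}
for annotation in select_intents(scores):
    print(annotation)
```

With these example scores, CANCEL is selected because 0.35 exceeds half of the top confidence (0.3), while GREETING and OTHER are not.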