Turkish Input Processors Chain
Introduction
An Input Processor (IP) pre-processes inputs so that Teneo Engine can perform further processing on them, such as normalizing and tokenizing the inputs or Part-of-Speech (POS) tagging. Each language supported by the Teneo Platform has a chain of input processors that know how to process that language.
Input Processors chain setup
The following graph displays the default setup of the Turkish Input Processors chain:
The default Input Processors are:
- The TurkishAnalyzer IP performs user input normalization, sentence splitting, tokenization and Part-of-Speech (POS) and morphological annotations.
- The SystemAnnotation IP sets a number of annotations based on properties of the user input text.
- The BasicNumberRecognizer IP identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with the `NUMBER` annotation and attaches to this annotation a variable called `numericValue`.
- The LanguageDetector IP identifies the language of the input sentence and annotates it with the predicted language and the confidence score of the prediction.
- The Predict IP classifies user inputs based on a machine learning model trained in Teneo Learn and annotates it with the predicted top intent classes and confidence score.
Standard Simplifier
The simplification in Turkish is done using the Standard Simplifier from the European Input Processors' chain. The StandardSimplifier is a simplifier implementation with support for configurable character decomposition and normalization, as well as character mapping.
It executes the following processing steps:
- Conversion to lower case, considering the configured language locale.
- Optional compatibility simplification: this is Unicode compatibility decomposition (like mapping ² to 2, etc.), with optional exceptions defined by property `excludeFromCompatibilitySimplify`. This step is disabled by default, see `compatibilitySimplify`.
- Optional canonical simplification: Unicode canonical decomposition is applied, then by default all combining characters are deleted (exceptions can be given with the property `excludeFromCanonicalSimplify`; these letter and combining-character combinations will be left untouched).
- Conversion to Unicode composed form.
- Optional simplification mapping: character/substring replacement as specified by properties `simplificationMapping.*` is applied. No mappings are set by default.
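The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the StandardSimplifier implementation: the function name `simplify` and its parameters are hypothetical stand-ins for the configuration properties, and Python's `str.lower()` is not locale-aware, so the Turkish dotted/dotless i rules are not reproduced here.

```python
import unicodedata

def simplify(text, compatibility=False, canonical=True,
             exclude_canonical="", exclude_compatibility="", mappings=None):
    """Hypothetical sketch of the listed steps; not the Teneo implementation."""
    # Step 1: lower-casing (locale-independent here, unlike the real step).
    text = text.lower()
    # Step 2 (optional): compatibility decomposition, e.g. superscript two -> 2.
    if compatibility:
        text = "".join(ch if ch in exclude_compatibility
                       else unicodedata.normalize("NFKD", ch) for ch in text)
    # Step 3 (optional): canonical decomposition, then delete non-spacing marks.
    if canonical:
        pieces = []
        for ch in unicodedata.normalize("NFC", text):
            if ch in exclude_canonical:
                pieces.append(ch)  # excluded characters are left untouched
                continue
            decomposed = unicodedata.normalize("NFD", ch)
            pieces.append("".join(c for c in decomposed
                                  if unicodedata.category(c) != "Mn"))
        text = "".join(pieces)
    # Step 4: conversion to Unicode composed form.
    text = unicodedata.normalize("NFC", text)
    # Step 5 (optional): custom character/substring replacements.
    for source, replacement in (mappings or {}).items():
        text = text.replace(source, replacement)
    return text
```

With the defaults described in this section (canonical simplification on, compatibility simplification off), `simplify("Caf\u00e9")` yields `cafe`, while a superscript two is only mapped to `2` when `compatibility=True` is passed.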
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
canonicalSimplify | true/false | no | true |
`canonicalSimplify` enables/disables simplification based on canonical decomposition of Unicode characters (see Unicode normalization forms for more information). An exception list can be defined in `excludeFromCanonicalSimplify`.
If enabled:
- Canonical decomposition will be applied first; this means accented characters will be decomposed into the base letter and combining marks (non-spacing mark) for the accent(s).
- In a second step, all non-spacing marks are deleted, i.e. á will become a, etc.
- Finally, canonical composition is applied.
Name | Type | Required | Default |
---|---|---|---|
excludeFromCanonicalSimplify | string | no | empty |
All characters in the string given here will be excluded from the canonical simplification defined above. More precisely, step two (deletion of the non-spacing marks) is skipped for the letter and combining-character combinations that result from decomposing these characters in step one.
Name | Type | Required | Default |
---|---|---|---|
compatibilitySimplify | true/false | no | false |
`compatibilitySimplify` enables/disables simplification based on compatibility decomposition of Unicode characters (see Unicode normalization forms for more information). For example, ⁵ will become 5.
Name | Type | Required | Default |
---|---|---|---|
excludeFromCompatibilitySimplify | string | no | empty |
All characters in the string given here will be excluded from the compatibility simplification as defined above.
Name | Type | Required | Default |
---|---|---|---|
simplificationMapping.* | string | no | empty |
Format: `simplificationMapping.<n> = <letter(s)>=<replacement>`
- `<n>`: number, which must be unique within the simplification mappings of one file
- `<letter(s)>`: string, letter(s) to be replaced
- `<replacement>`: string, replacement
Custom simplification mapping is applied AFTER canonical and compatibility simplification. This means, for example, that an accented character for which a custom simplification mapping is defined must also be listed under `excludeFromCanonicalSimplify` unless canonical simplification is disabled; otherwise the accent is removed before the mapping can match.
Example

```properties
simplificationMapping.1 = ä=ae
```

also requires

```properties
excludeFromCanonicalSimplify = ...ä...
```
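The ordering rule above can be checked with a small Python sketch. It is a hedged illustration only: `canonical_simplify` and `apply_mappings` are hypothetical helpers that mimic the described behaviour, not Teneo code.

```python
import unicodedata

def canonical_simplify(text, exclude=""):
    """Strip non-spacing marks, except on characters listed in `exclude`."""
    pieces = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exclude:
            pieces.append(ch)  # excluded characters keep their accent
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        pieces.append("".join(c for c in decomposed
                              if unicodedata.category(c) != "Mn"))
    return unicodedata.normalize("NFC", "".join(pieces))

def apply_mappings(text, mappings):
    """Apply the custom replacements after canonical simplification."""
    for source, replacement in mappings.items():
        text = text.replace(source, replacement)
    return text

MAPPINGS = {"\u00e4": "ae"}  # corresponds to: simplificationMapping.1 = ä=ae

# Without the exclusion the accent is already gone when the mapping runs.
without_exclusion = apply_mappings(canonical_simplify("b\u00e4r"), MAPPINGS)
# With the exclusion the mapping still sees the accented character.
with_exclusion = apply_mappings(
    canonical_simplify("b\u00e4r", exclude="\u00e4"), MAPPINGS)
```

In this sketch `without_exclusion` ends up as `bar` and `with_exclusion` as `baer`, which is why the mapped character has to be excluded from canonical simplification.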
Turkish Analyzer IP
The TurkishAnalyzer Input Processor is based on Zemberek and performs the following tasks:
- User input normalization
- Sentence splitting
- Tokenization
- Part-of-Speech (POS) and morphological annotation
Normalization, sentence splitting and tokenization
The TurkishAnalyzer performs normalization on user inputs, and furthermore it will segment the input into sentences, tokenize and analyze the morphological structure of each token in the context of the sentence.
This means that each sentence will be normalized by the TurkishAnalyzer, i.e. the sentence will be lowercased and, in some cases, typos will be fixed. Unlike in other Teneo input processors, the API method `getOriginal()` on a `Word` object returns the normalized form (which might differ from the simplified form), because normalization happens before tokenization.
Please note that this has direct implications for the exact match operator: for other languages it works on the original form, but for Turkish it operates on the normalized strings.
The original user input is not modified and can be retrieved with `getUserInputText()`.
A sentence in Turkish is an instance of the `TurkishSentence` class, which implements the `SentenceI` interface from the `engine-input-processor-api`. The method `getText()` of the class `TurkishSentence` returns the normalized sentence text. The original sentence text can be retrieved with the method `getRawSentence()` by a direct caller of the input processor chain, by casting a `Sentence` to a `TurkishSentence`. It cannot be accessed via the engine scripting API.
The sentence indices point to the characters in the original user input string. The word indices point to the characters in the sentence, i.e. the normalized sentence string.
POS and morphological annotation
The TurkishAnalyzer also annotates user inputs with POS and morphological information. Each word will be annotated with its lemma, if available. A lemma annotation contains the POS tag as an annotation variable `pos:<string>`.
The morphological information is returned as annotations for the three different types that Zemberek returns, with the following suffixes:
- `.POS`: primary part-of-speech tag of the entire token
- `.POS`/`.NER`: secondary part-of-speech tag (mix of entities/POS tags) based on the stem of the token
- `.MST`: morphosyntactic information based on the morphemes of the token
The MST annotations all have the annotation variable `surface=<string>`, which contains the substring of the surface form of that morpheme in the word, if available.
The table below lists how the tags from Zemberek are mapped to annotations in Teneo; please see here for information related to available ANNOT Language Objects in the Turkish Lexical Resource.
Zemberek Type | Zemberek Tag | Map to annotations |
---|---|---|
POS | Noun | NN.POS |
POS | Adj | ADJ.POS |
POS | Adv | ADV.POS |
POS | Conj | CC.POS |
POS | Interj | INTERJ.POS |
POS | Verb | VB.POS |
POS | Pron | PRON.POS |
POS | Num | NUMERAL.POS |
POS | Det | DET.POS |
POS | Postp | POST_POSITIVE.POS |
POS | Ques | INTERROG.POS |
POS | Dup | DUPLICATOR.POS |
POS | Punc | PUNCT.POS |
POS2 | Demons | DEMOS.POS |
POS2 | Time | TIME.NER |
POS2 | Quant | QUANTITATIVE.POS |
POS2 | Ques | INTERROG.POS |
POS2 | Prop | PROPER.POS |
POS2 | Pers | PERS.POS |
POS2 | Reflex | REFLEXIVE.POS |
POS2 | Ord | ORDINAL.POS |
POS2 | Card | CARDINAL.POS |
POS2 | Percent | PERCENT.NER |
POS2 | Ratio | RATIO.NER |
POS2 | Range | RANGE.NER |
POS2 | Dist | DIST.NER |
POS2 | Clock | CLOCK.NER |
POS2 | Date | DATE.NER |
POS2 | Email | EMAIL.NER |
POS2 | Url | URL.NER |
POS2 | Mention | MENTION.NER |
POS2 | HashTag | HASHTAG.NER |
POS2 | Emoticon | EMOTICON.NER |
POS2 | RegAbbrv | ABBREVIATION.NER |
POS2 | Abbrv | ABBREVIATION.NER |
MST | Noun | NN.MST |
MST | Adj | ADJ.MST |
MST | Adv | ADV.MST |
MST | Conj | CC.MST |
MST | Interj | INTERJ.MST |
MST | Verb | VB.MST |
MST | Pron | PRON.MST |
MST | Num | NUMERAL.MST |
MST | Det | DET.MST |
MST | Postp | POST_POSITIVE.MST |
MST | Ques | INTERROG.MST |
MST | Dup | DUPLICATOR.MST |
MST | Punc | PUNCT.MST |
MST | A1sg | 1STPERSON.MST,SG.MST |
MST | A2sg | 2NDPERSON.MST,SG.MST |
MST | A3sg | 3RDPERSON.MST,SG.MST |
MST | A1pl | 1STPERSON.MST,PL.MST |
MST | A2pl | 2NDPERSON.MST,PL.MST |
MST | A3pl | 3RDPERSON.MST,PL.MST |
MST | Pnon | NO_POSESSION.MST |
MST | P1sg | POSS_1STPERSON.MST,POSS_SG.MST |
MST | P2sg | POSS_2NDPERSON.MST,POSS_SG.MST |
MST | P3sg | POSS_3RDPERSON.MST,POSS_SG.MST |
MST | P1pl | POSS_1STPERSON.MST,POSS_PL.MST |
MST | P2pl | POSS_2NDPERSON.MST,POSS_PL.MST |
MST | P3pl | POSS_3RDPERSON.MST,POSS_PL.MST |
MST | Nom | NOMINATIVE.MST |
MST | Dat | DATIVE.MST |
MST | Acc | ACCUSATIVE.MST |
MST | Abl | ABLATIVE.MST |
MST | Loc | LOCATIVE.MST |
MST | Ins | INSTRUMENTAL.MST |
MST | Gen | GENITIVE.MST |
MST | Equ | EQUATIVE.MST |
MST | Dim | DIMINUTIVE.MST |
MST | Ness | NESS.MST |
MST | With | WITH.MST |
MST | Without | WITHOUT.MST |
MST | Related | RELATED.MST |
MST | JustLike | JUST_LIKE.MST |
MST | Rel | RELATION.MST |
MST | Agt | AGENTIVE.MST |
MST | Become | BECOME.MST |
MST | Acquire | ACQUIRE.MST |
MST | Ly | LY.MST |
MST | Caus | CAUSATIVE.MST |
MST | Recip | RECIPROCAL.MST |
MST | Reflex | REFLEXIVE.MST |
MST | Able | ABILITY.MST |
MST | Pass | PASSIVE.MST |
MST | Inf1 | INFINITIVE1.MST |
MST | Inf2 | INFINITIVE2.MST |
MST | Inf3 | INFINITIVE3.MST |
MST | ActOf | ACT_OF.MST |
MST | PastPart | PART_PAST.MST |
MST | NarrPart | PART_NARRATIVE.MST |
MST | FutPart | PART_FUTURE.MST |
MST | PresPart | PART_PRESENT.MST |
MST | AorPart | PART_AORIST.MST |
MST | NotState | NOT_STATE.MST |
MST | FeelLike | FEEL_LIKE.MST |
MST | EverSince | EVER_SINCE.MST |
MST | Repeat | REPEAT.MST |
MST | Almost | ALMOST.MST |
MST | Hastily | HASTILY.MST |
MST | Stay | STAY.MST |
MST | Start | START.MST |
MST | AsIf | AS_IF.MST |
MST | While | WHILE.MST |
MST | When | WHEN.MST |
MST | SinceDoingSo | SINCE_DOING_SO.MST |
MST | AsLongAs | AS_LONG_AS.MST |
MST | ByDoingSo | BY_DOING_SO.MST |
MST | Adamantly | ADAMANTLY.MST |
MST | AfterDoingSo | AFTER_DOING_SO.MST |
MST | WithoutHavingDoneSo | WITHOUT_HAVING_DONE_SO.MST |
MST | WithoutBeingAbleToHaveDoneSo | WITHOUT_BEING_ABLE_TO_DO_SO.MST |
MST | Zero | ZERO.MST |
MST | Cop | COP.MST |
MST | Neg | NEGATIVE.MST |
MST | Unable | UNABLE.MST |
MST | Pres | PRESENT.MST |
MST | Past | PAST.MST |
MST | Narr | NARRATIVE.MST |
MST | Cond | CONDITION.MST |
MST | Prog1 | PROGRESSIVE1.MST |
MST | Prog2 | PROGRESSIVE2.MST |
MST | Aor | AORIST.MST |
MST | Fut | FUTURE.MST |
MST | Imp | IMPERATIVE.MST |
MST | Opt | OPTATIVE.MST |
MST | Desr | DESIRE.MST |
MST | Neces | NECESSITY.MST |
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
nonWordChars | string | no | "“”.¡!¿?…,;‘’'´` |
A list of characters that will be removed if they are single tokens.
There are three types of annotation mapping properties. Their value is of the form:

```properties
P1sg=POSSESIVE.MST,1STPERSON.MST,SG.MST
```

Note that the properties are numbered in the configuration file. For example:

```properties
annotationsForMST.1 = A1sg=1STPERSON.MST,SG.MST
annotationsForMST.2 = A2sg=2NDPERSON.MST,SG.MST
```
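To make the numbered property format concrete, the following Python sketch reads such lines into a lookup from Zemberek tag to the list of Teneo annotations. It is an assumption-laden illustration: `parse_annotation_mappings` is a hypothetical helper, and the actual parsing done by the TurkishAnalyzer is not documented here.

```python
def parse_annotation_mappings(lines, prefix="annotationsForMST"):
    """Parse numbered mapping properties into {zemberek_tag: [annotations]}."""
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split at the first '=' only: the value itself contains another '='.
        key, _, value = line.partition("=")
        name, _, number = key.strip().rpartition(".")
        if name != prefix or not number.isdigit():
            continue  # not a numbered property of the requested type
        tag, _, annotations = value.strip().partition("=")
        mappings[tag.strip()] = [a.strip() for a in annotations.split(",")
                                 if a.strip()]
    return mappings

example = [
    "annotationsForMST.1 = A1sg=1STPERSON.MST,SG.MST",
    "annotationsForMST.2 = A2sg=2NDPERSON.MST,SG.MST",
]
mst_mappings = parse_annotation_mappings(example)
```

Here `mst_mappings["A1sg"]` holds the two annotations `1STPERSON.MST` and `SG.MST`, mirroring how one Zemberek tag can map to several annotations.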
Name | Type | Required | Default |
---|---|---|---|
annotationsForPos | string | no | empty |
Mapping for the POS tags (primary POS returned from Zemberek API)
Name | Type | Required | Default |
---|---|---|---|
annotationsForPos2 | string | no | empty |
Mapping for the POS/NER tags (secondary POS returned from Zemberek API)
Name | Type | Required | Default |
---|---|---|---|
annotationsForMST | string | no | empty |
Mapping for the morphological information tags (morpheme returned from Zemberek API)
Further, the following files configure the Zemberek normalizer directly:
- `asci-map`: list of auto-correct mappings
- `lm.2gram.slm`: language model
- `look-up-from-graph`: list of auto-correct mappings
- `split`: list of words to be split
For more information please visit the project site of Zemberek.
System Annotation IP
The SystemAnnotation Input Processor performs simple analysis of the sentence texts to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods `decideBinary`, `decideBrackets`, `decideEmpty`, `decideExclamation`, `decideNonsense`, `decideQuestion`, `decideQuote`.
This IP works on the sentences passed in but does not modify them.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:
- _EMPTY: the sentence text is empty
- _EXCLAMATION: the sentence text contains at least one of the characters specified with property `exclamationMarkCharacters`
- _EM3: the sentence text contains three or more characters in a row of the characters specified with property `exclamationMarkCharacters`
- _QUESTION: the sentence text contains at least one of the characters specified with property `questionMarkCharacters`
- _QT3: the sentence text contains three or more characters in a row of the characters specified with property `questionMarkCharacters`
- _QUOTE: the sentence text contains at least one of the characters specified with property `quoteCharacters`
- _DBLQUOTE: the sentence text contains at least one of the characters specified with property `doubleQuoteCharacters`
- _BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property `bracketPairCharacters`
- _NONSENSE: the sentence probably contains nonsense text as configured with properties `consonants`, `nonsenseThreshold.absolute` and `nonsenseThreshold.relative`
- _BINARY: the sentence text only contains characters specified by properties `binaryCharacters` (at least one of them) and `binaryIgnoredCharacters` (zero or more of them).
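Several of the annotation rules listed above can be approximated with a short Python sketch. This is not the SystemAnnotation code: the function `system_annotations` is hypothetical, the parameter defaults are copied from the configuration properties of this Input Processor, and the bracket check (an opening bracket followed later by its closing counterpart) is an assumed reading of "matching pair".

```python
import re

def system_annotations(sentence, exclamation="!", question="?",
                       brackets="()[]{}", binary="01",
                       binary_ignored="!?,.-;:# \r\n\t\"'"):
    """Return the set of annotations this sketch would assign to a sentence."""
    annotations = set()
    if sentence == "":
        annotations.add("_EMPTY")
    # One occurrence sets the basic annotation, a run of three the *3 variant.
    for chars, single, triple in ((exclamation, "_EXCLAMATION", "_EM3"),
                                  (question, "_QUESTION", "_QT3")):
        if not chars:
            continue
        char_class = "[" + re.escape(chars) + "]"
        if re.search(char_class, sentence):
            annotations.add(single)
        if re.search(char_class + "{3,}", sentence):
            annotations.add(triple)
    # Assumed rule: an opening bracket followed later by its closing partner.
    for opening, closing in zip(brackets[0::2], brackets[1::2]):
        start = sentence.find(opening)
        if start != -1 and sentence.find(closing, start + 1) != -1:
            annotations.add("_BRACKETPAIR")
    # At least one binary character, and nothing outside binary + ignored.
    if (any(ch in binary for ch in sentence)
            and all(ch in binary + binary_ignored for ch in sentence)):
        annotations.add("_BINARY")
    return annotations
```

For instance, `system_annotations("Really???")` gives both `_QUESTION` and `_QT3`, while `system_annotations("0101 1100!")` gives `_BINARY` and `_EXCLAMATION` because the space and the exclamation mark are among the ignored characters.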
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
consonants | string | no | BCÇDFGĞHJKLMNPQRSŞTVWXZ bcçdfgğhjklmnpqrsştvwxz |
Contains all letters (upper and lower case) that are considered consonants in the language. Together with the properties `nonsenseThreshold.absolute` and `nonsenseThreshold.relative`, these will be used for detecting probable nonsense inputs like “kljljljljjlj”.
Name | Type | Required | Default |
---|---|---|---|
nonsenseThreshold.absolute | Positive integer number | no | 6 |
For nonsense detection, an input consisting exclusively of this many consonants, without any non-consonants, is considered nonsense.
Name | Type | Required | Default |
---|---|---|---|
nonsenseThreshold.relative | Positive integer number | no | 10 |
For nonsense detection, an input containing this many consonants in a row is considered nonsense.
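The interplay of `consonants` and the two thresholds can be illustrated with a Python sketch. Several points are assumptions rather than documented behaviour: the helper name `looks_like_nonsense`, the treatment of each threshold as inclusive ("at least this many"), and the omission of the space that appears in the default consonant list.

```python
# Default Turkish consonants from the property table (upper and lower case).
CONSONANTS = set("BCÇDFGĞHJKLMNPQRSŞTVWXZ" "bcçdfgğhjklmnpqrsştvwxz")

def looks_like_nonsense(text, absolute=6, relative=10):
    """Hypothetical reading of the two nonsense thresholds."""
    # nonsenseThreshold.absolute: the whole input consists of consonants only
    # and is at least `absolute` characters long.
    if len(text) >= absolute and all(ch in CONSONANTS for ch in text):
        return True
    # nonsenseThreshold.relative: some run of consecutive consonants
    # reaches `relative` characters.
    run = 0
    for ch in text:
        run = run + 1 if ch in CONSONANTS else 0
        if run >= relative:
            return True
    return False
```

Under these assumptions the documented example “kljljljljjlj” is flagged, an ordinary word such as "merhaba" is not, and a longer input is flagged as soon as it contains ten consonants in a row.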
Name | Type | Required | Default |
---|---|---|---|
exclamationMarkCharacters | string | no | ! |
List of characters of which at least one must occur in the sentence text to set annotations _EXCLAMATION and _EM3 (in case of a sequence of at least three of the specified characters).
Name | Type | Required | Default |
---|---|---|---|
questionMarkCharacters | string | no | ? |
List of characters of which at least one must occur in the sentence text to set annotations _QUESTION and _QT3 (in case of a sequence of at least three of the specified characters).
Name | Type | Required | Default |
---|---|---|---|
doubleQuoteCharacters | string | no | “ |
List of characters of which at least one must occur in the sentence text to set annotation _DBLQUOTE.
Name | Type | Required | Default |
---|---|---|---|
quoteCharacters | string | no | ‘ |
List of characters of which at least one must occur in the sentence text to set annotation _QUOTE.
Name | Type | Required | Default |
---|---|---|---|
binaryCharacters | string | no | 01 |
List of characters recognized in the sentence text to set annotation _BINARY.
Name | Type | Required | Default |
---|---|---|---|
binaryIgnoredCharacters | string | no | !?,.-;:# \r\n\t\"' |
List of characters additionally allowed in binary text.
Name | Type | Required | Default |
---|---|---|---|
bracketPairCharacters | string | no | ()[]{} |
List of pairs of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the sentence text to set annotation _BRACKETPAIR.
Special System annotations
Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:
- _INIT: indicating session start, i.e. the first input in a dialogue
- _TIMEOUT: indicating the continuation of a previously timed-out session/dialogue.
Basic Number Recognizer IP
The BasicNumberRecognizer Input Processor identifies all Arabic numbers of the type 123 and 3,14 in the user input, annotates each of them with the `NUMBER` annotation and attaches to this annotation a variable called `numericValue`, which holds the numeric value of the number found.
This Input Processor is language independent, but every language has its own configuration file for this IP defining decimal point characters and the thousands separator character to be ignored.
For the `NUMBER` annotation and the variable to be added, a “number” in the user input must meet the following requirements:
It must match the regular expression:

```
[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+
```

It must be parseable by Java's BigDecimal to ensure it is a number.
The above syntax provides the following guarantees:
- The sign is not included in the annotated token
- The `numericValue` variable contains a BigDecimal representation of the number.
The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as the decimal marker and the comma as the thousands separator.
Configuration properties
Name | Default |
---|---|
decimalMarkers | , |
The default decimal marker in Turkish is the comma (`,`).
Name | Default |
---|---|
charactersToIgnore | . |
The default character to ignore is the dot (`.`).
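The effect of `decimalMarkers` and `charactersToIgnore` on the number syntax can be sketched in Python. This is an approximation under stated assumptions: `parse_number` is a hypothetical helper, the way the configured characters are substituted into the regular expression is assumed, each setting is treated as a single character, and Python's `Decimal` stands in for Java's BigDecimal.

```python
import re
from decimal import Decimal, InvalidOperation

def parse_number(token, decimal_marker=".", ignored=","):
    """Return the Decimal value of a number token, or None if it is not one."""
    d, i = re.escape(decimal_marker), re.escape(ignored)
    # Same shape as the documented regex, with the two characters swapped in.
    pattern = f"[{i}]?[0-9]+([{i}][0-9]+)*([{d}][0-9]+)?|[{d}][0-9]+"
    if not re.fullmatch(pattern, token):
        return None
    # Drop the ignored separator, then normalise the decimal marker to a dot.
    normalized = token.replace(ignored, "").replace(decimal_marker, ".")
    try:
        return Decimal(normalized)
    except InvalidOperation:
        return None

# Turkish defaults from the tables above: comma as decimal marker, dot ignored.
turkish_value = parse_number("1.234,5", decimal_marker=",", ignored=".")
```

With the Turkish defaults, `1.234,5` and `3,14` are read as 1234.5 and 3.14. A token with a leading sign such as `-5` is rejected by the pattern, in line with the guarantee that the sign is not part of the annotated token.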
Language Detector IP
The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format `%${language label}.LANG` to the input, as well as a confidence score of the prediction.
The Language Detector IP can predict the following 45 languages (language label in brackets):
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language, under the label SR_HR and Indonesian and Malay are treated as one language, under the label ID_MS.
A number of regexes are also in use by the Input Processor, helping the model to not predict language for fully numerical inputs, URLs or other type of nonsense inputs.
The Language Detector provides an annotation when the prediction confidence is above the threshold of 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions below 0.2, since the Language Detector is mostly accurate when predicting them.
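The threshold rule can be expressed as a small Python sketch. It only mirrors the decision described above: the function `language_annotation` is hypothetical, the annotation name is simplified to `<label>.LANG`, and the model that produces the label and confidence is outside the scope of this sketch.

```python
# Languages for which an annotation is always created, per the text above.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}

def language_annotation(label, confidence, threshold=0.2):
    """Return (annotation name, confidence) or None if nothing is annotated."""
    if label in ALWAYS_ANNOTATED or confidence > threshold:
        return f"{label}.LANG", confidence
    return None
```

A Turkish prediction with confidence 0.1 yields no annotation in this sketch, whereas a Japanese prediction with confidence 0.05 is still annotated.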
Predict IP
The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.
When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
For each selected class, an annotation with the scheme `<CLASS_NAME>.INTENT` is created, with the value of the model's confidence in the class, an annotation variable specifying the classifier used (i.e., Learn, CLU or Le) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).
A special annotation `<CLASS_NAME>.TOP_INTENT` is created for the class with the highest confidence score.
Annotation | Variable | Variable | Variable | Description |
---|---|---|---|---|
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical threshold can be configured in the properties file of the Input Processor.
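The selection criteria can be sketched in Python as follows. This is an illustration, not the Predict implementation: `select_intents` and its parameter names are hypothetical, the defaults are the documented ones (minimum confidence 0.01, factor 0.5, at most 5 annotations), and the sketch assumes the top class is always selected while the other classes must pass both criteria.

```python
def select_intents(scores, min_confidence=0.01, similarity=0.5,
                   max_annotations=5):
    """Pick the classes that would receive a <CLASS_NAME>.INTENT annotation."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = []
    for name, confidence in ranked:
        if len(selected) == max_annotations:
            break
        if selected:  # every class after the top one must pass both criteria
            top_confidence = selected[0]["confidence"]
            if (confidence <= min_confidence
                    or confidence <= similarity * top_confidence):
                break  # ranked is sorted, so no later class can pass either
        selected.append({"annotation": f"{name}.INTENT",
                         "confidence": confidence,
                         "Order": len(selected)})
    return selected

example = select_intents({"BOOK_FLIGHT": 0.7, "CANCEL": 0.4,
                          "GREETING": 0.3, "OTHER": 0.005})
```

In `example`, only `BOOK_FLIGHT` (Order 0) and `CANCEL` (Order 1) are kept, because 0.3 is not higher than 0.5 x 0.7 = 0.35. The first entry is also the class that would receive the `TOP_INTENT` annotation.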
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
minConfidenceSimilarityDistance | float | no | 0.5 |
Fraction of the top class's confidence that a class must exceed in order to be considered (e.g., if the top class has a confidence of 0.7, classes with confidences lower than 0.5 x 0.7 = 0.35 will be discarded).
Name | Type | Required | Default |
---|---|---|---|
maxNumberOfAnnotations | int | no | 5 |
Maximum number of class annotations to create for each user input.
Name | Type | Required | Default |
---|---|---|---|
minConfidenceThreshold | float | no | 0.01 |
Minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.
Name | Type | Required | Default |
---|---|---|---|
intent.model.file.name | string (filename) | no | inexistent |
Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.