Turkish Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that Teneo Engine can perform different processes on them, such as normalizing and tokenizing the inputs or performing Part-of-Speech (POS) tagging. Each language supported by the Teneo Platform has a chain of input processors that know how to process that language.

Input Processors chain setup

The following graph displays the default setup of the Turkish Input Processors chain:

graph TD
  subgraph ips [ ]
    subgraph turkishanalyzer [Turkish Analyzer]
      normalization[Normalization]
      splitting[Sentence splitting]
      tokenization[Tokenization]
      POS[Part-of-Speech and Morphological annotation]
    end
    annotation[SystemAnnotation] --> number[BasicNumberRecognizer]
    number[BasicNumberRecognizer] --> languagedetect[LanguageDetector]
    languagedetect[LanguageDetector] --> predict[Predict]
  end
  input([User Input]) --User Gives Input--> normalization --> splitting --> tokenization --> POS --> annotation
  predict[Predict] --Parsed Input--> parsed([To Dialog Processing])
  classDef contained stroke-dasharray:5,2;
  class normalization,splitting,tokenization,POS contained;
  classDef analyzer stroke:#2f286e,fill:#ffffff;
  class turkishanalyzer analyzer;

The default Input Processors are:

The TurkishAnalyzer IP performs user input normalization, sentence splitting, tokenization and Part-of-Speech (POS) and morphological annotations.
The SystemAnnotation IP sets a number of annotations based on properties of the user input text.
The BasicNumberRecognizer IP identifies all Arabic numbers of the type 123 and 3,14 in the user input, annotates each of them with the NUMBER annotation and associates a variable called numericValue with this annotation.
The LanguageDetector IP identifies the language of the input sentence provided and annotates it with the predicted language and associates a confidence score of the prediction.
The Predict IP classifies user inputs based on a machine learning model trained in Teneo Learn and annotates it with the predicted top intent classes and confidence score.

Standard Simplifier

The simplification in Turkish is done using the Standard Simplifier from the European Input Processors' chain. The StandardSimplifier is a simplifier implementation with support for configurable character decomposition and normalization, as well as character mapping.

It executes the following processing steps:

Conversion to lower case, considering the configured language locale.
Optional compatibility simplification: this is Unicode compatibility decomposition (like mapping ² to 2, etc.), with optional exceptions defined by property excludeFromCompatibilitySimplify. This step is disabled by default, see compatibilitySimplify.
Optional canonical simplification: Unicode canonical decomposition is applied, then by default all combining characters are deleted (exceptions can be given with the property excludeFromCanonicalSimplify; these letter-combining character combinations will be left untouched).
Conversion to Unicode composed form.
Optional simplification mapping: character/substring replacement as specified by properties simplificationMapping.* are applied. No mappings are set by default.
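The locale matters for the first step because Turkish casing differs from the default Unicode rules: upper-case I lowercases to dotless ı, and dotted İ lowercases to i. The following is a minimal Python sketch of that behaviour, not the StandardSimplifier implementation; the helper name `turkish_lower` is hypothetical, and it is needed because Python's `str.lower()` is not locale-aware.

```python
def turkish_lower(text: str) -> str:
    """Lower-case text using Turkish casing rules (illustrative helper)."""
    # Handle the two Turkish-specific pairs first, then fall back to the
    # default Unicode lower-casing for every other character.
    return text.replace("İ", "i").replace("I", "ı").lower()


default_result = "ISPARTA".lower()          # default rules: I -> i
turkish_result = turkish_lower("ISPARTA")   # Turkish rules: I -> ı
```

With the default rules the word comes out as "isparta", whereas the Turkish rules give "ısparta".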

Configuration properties

Name	Type	Required	Default
`canonicalSimplify`	true/false	no	true

canonicalSimplify enables/disables simplification based on canonical decomposition of Unicode characters (see Unicode normalization forms for more information). An exception list can be defined in excludeFromCanonicalSimplify.

If enabled:

Canonical decomposition will be applied first; this means accented characters will be decomposed into the base letter and combining marks (non-spacing mark) for the accent(s).
In a second step, all non-spacing marks are deleted, i.e. á will become a, etc.
Finally, canonical composition is applied.

Name	Type	Required	Default
`excludeFromCanonicalSimplify`	string	no	empty

All characters in the string given here will be excluded from the canonical simplification defined above. More precisely, for the character combinations that result from step one for these characters, step two (deletion of the non-spacing marks) is skipped.
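The three canonical simplification steps and the exclusion list can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the described behaviour under the assumption that exclusions are checked per composed character; the function name `canonical_simplify` and its `exclude` parameter are hypothetical and merely mirror the properties above.

```python
import unicodedata


def canonical_simplify(text: str, exclude: str = "") -> str:
    """Strip accents via canonical decomposition, keeping excluded characters."""
    keep = set(unicodedata.normalize("NFC", exclude))
    pieces = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            # Excluded characters keep their combining marks.
            pieces.append(ch)
            continue
        # Step 1: canonical decomposition into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        # Step 2: delete all non-spacing marks (Unicode category Mn).
        pieces.append("".join(c for c in decomposed
                              if unicodedata.category(c) != "Mn"))
    # Step 3: canonical composition of whatever remains.
    return unicodedata.normalize("NFC", "".join(pieces))
```

For Turkish this shows why an exclusion list can matter: without exclusions "çay" is reduced to "cay", while passing `exclude="ç"` leaves the word unchanged.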

Name	Type	Required	Default
`compatibilitySimplify`	true/false	no	false

compatibilitySimplify enables/disables simplification based on compatibility decomposition of Unicode characters (see Unicode normalization forms for more information). For example, ⁵ will become 5.

Name	Type	Required	Default
`excludeFromCompatibilitySimplify`	string	no	empty

All characters in the string given here will be excluded from the compatibility simplification as defined above.
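Compatibility decomposition corresponds to the Unicode NFKD form, which Python exposes through `unicodedata.normalize`. The sketch below is an assumption-laden illustration rather than the actual implementation: the function name and the per-character exclusion check are hypothetical, and the result is recomposed so that accented letters are left as they were.

```python
import unicodedata


def compatibility_simplify(text: str, exclude: str = "") -> str:
    """Apply compatibility decomposition to every non-excluded character."""
    pieces = [
        ch if ch in exclude else unicodedata.normalize("NFKD", ch)
        for ch in text
    ]
    # Recompose so that canonical (accent) decompositions are undone again.
    return unicodedata.normalize("NFC", "".join(pieces))
```

Calling `compatibility_simplify("x²")` returns "x2", while `compatibility_simplify("x²", exclude="²")` leaves the superscript in place.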

Name	Type	Required	Default
`simplificationMapping.*`	Format: `simplificationMapping.<n> = <letter(s)>=<replacement>`, where `<n>` is a number that must be unique within the simplification mappings of one file, `<letter(s)>` is the string of letter(s) to be replaced and `<replacement>` is the replacement string	no	empty

Custom simplification mapping is applied AFTER canonical and compatibility simplification. This means, for example, that an accented character for which a custom simplification mapping is defined must be listed under excludeFromCanonicalSimplify, unless canonical simplification is disabled.

Example

simplificationMapping.1 = ä=ae

also requires

excludeFromCanonicalSimplify = ...ä...
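The ordering issue can be reproduced with a small Python sketch. It assumes that a mapping is a plain substring replacement applied after accent stripping; `strip_accents` and `apply_mappings` are hypothetical helper names, not part of the Teneo API.

```python
import unicodedata


def strip_accents(text: str, exclude: str = "") -> str:
    """Canonical simplification sketch: drop combining marks unless excluded."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exclude:
            out.append(ch)
        else:
            out.append("".join(c for c in unicodedata.normalize("NFD", ch)
                               if unicodedata.category(c) != "Mn"))
    return unicodedata.normalize("NFC", "".join(out))


def apply_mappings(text: str, mappings: dict) -> str:
    """Apply each <letter(s)> -> <replacement> mapping as a substring replace."""
    for source, replacement in mappings.items():
        text = text.replace(source, replacement)
    return text


mappings = {"ä": "ae"}  # corresponds to simplificationMapping.1 = ä=ae

# Without the exclusion the accent is already gone when the mapping runs.
without_exclusion = apply_mappings(strip_accents("bär"), mappings)
# With the exclusion the ä survives long enough to be mapped.
with_exclusion = apply_mappings(strip_accents("bär", exclude="ä"), mappings)
```

Here `without_exclusion` is "bar" because the mapping never sees an ä, and `with_exclusion` is "baer".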

Turkish Analyzer IP

The TurkishAnalyzer Input Processor is based on Zemberek and performs the following tasks:

User input normalization
Sentence splitting
Tokenization
Part-of-Speech (POS) and morphological annotation

Normalization, sentence splitting and tokenization

The TurkishAnalyzer performs normalization on user inputs, and furthermore it will segment the input into sentences, tokenize and analyze the morphological structure of each token in the context of the sentence.

This means that each sentence will be normalized by the TurkishAnalyzer, i.e. the sentence will be lowercased, and, in some cases, typos will be fixed. Unlike other Teneo input processors, the API method getOriginal() on a Word object will return the normalized form (which might be different from the simplified form) as the normalization happens before the tokenization.

Please note that this has direct implications on the exact match operator, which for other languages works on the ORIGINAL form, but for Turkish, users need to be aware that the exact match operator operates on the normalized strings.

The original user input is not modified and can be retrieved with getUserInputText().

A sentence in Turkish is an instance of the TurkishSentence class, which implements the SentenceI interface from the engine-input-processor-api. The method getText() of the class TurkishSentence returns the normalized sentence text. A direct caller of the input processor chain can retrieve the original sentence text with the method getRawSentence() by casting a Sentence to a TurkishSentence. It cannot be accessed via the engine scripting API.

The sentence indices point to the characters in the original user input string. The word indices point to the characters in the sentence, i.e. the normalized sentence string.

POS and morphological annotation

The TurkishAnalyzer also annotates user inputs with POS and morphological information. Each word will be annotated with its lemma, if available. A lemma annotation contains the POS tag as an annotation variable pos:<string>.

The morphological information will be returned as annotations for the three different types that Zemberek returns with the following suffixes:

.POS: primary part-of-speech tag of the entire token
.POS/.NER: secondary part-of-speech tag (mix of entities/POS tags) based on the stem of the token
.MST: morphosyntactic information based on the morphemes of the token

The MST annotations all have the annotation variable surface=<string> that contains the substring of the surface form of that morpheme in the word, if available.

The table below lists how the tags from Zemberek are mapped to annotations in Teneo; please see here for information related to available ANNOT Language Objects in the Turkish Lexical Resource.

Zemberek Type	Zemberek Tag	Map to annotations
POS	Noun	`NN.POS`
POS	Adj	`ADJ.POS`
POS	Adv	`ADV.POS`
POS	Conj	`CC.POS`
POS	Interj	`INTERJ.POS`
POS	Verb	`VB.POS`
POS	Pron	`PRON.POS`
POS	Num	`NUMERAL.POS`
POS	Det	`DET.POS`
POS	Postp	`POST_POSITIVE.POS`
POS	Ques	`INTERROG.POS`
POS	Dup	`DUPLICATOR.POS`
POS	Punc	`PUNCT.POS`
POS2	Demons	`DEMOS.POS`
POS2	Time	`TIME.NER`
POS2	Quant	`QUANTITATIVE.POS`
POS2	Ques	`INTERROG.POS`
POS2	Prop	`PROPER.POS`
POS2	Pers	`PERS.POS`
POS2	Reflex	`REFLEXIVE.POS`
POS2	Ord	`ORDINAL.POS`
POS2	Card	`CARDINAL.POS`
POS2	Percent	`PERCENT.NER`
POS2	Ratio	`RATIO.NER`
POS2	Range	`RANGE.NER`
POS2	Dist	`DIST.NER`
POS2	Clock	`CLOCK.NER`
POS2	Date	`DATE.NER`
POS2	Email	`EMAIL.NER`
POS2	Url	`URL.NER`
POS2	Mention	`MENTION.NER`
POS2	HashTag	`HASHTAG.NER`
POS2	Emoticon	`EMOTICON.NER`
POS2	RegAbbrv	`ABBREVIATION.NER`
POS2	Abbrv	`ABBREVIATION.NER`
MST	Noun	`NN.MST`
MST	Adj	`ADJ.MST`
MST	Adv	`ADV.MST`
MST	Conj	`CC.MST`
MST	Interj	`INTERJ.MST`
MST	Verb	`VB.MST`
MST	Pron	`PRON.MST`
MST	Num	`NUMERAL.MST`
MST	Det	`DET.MST`
MST	Postp	`POST_POSITIVE.MST`
MST	Ques	`INTERROG.MST`
MST	Dup	`DUPLICATOR.MST`
MST	Punc	`PUNCT.MST`
MST	A1sg	`1STPERSON.MST,SG.MST`
MST	A2sg	`2NDPERSON.MST,SG.MST`
MST	A3sg	`3RDPERSON.MST,SG.MST`
MST	A1pl	`1STPERSON.MST,PL.MST`
MST	A2pl	`2NDPERSON.MST,PL.MST`
MST	A3pl	`3RDPERSON.MST,PL.MST`
MST	Pnon	`NO_POSESSION.MST`
MST	P1sg	`POSS_1STPERSON.MST,POSS_SG.MST`
MST	P2sg	`POSS_2NDPERSON.MST,POSS_SG.MST`
MST	P3sg	`POSS_3RDPERSON.MST,POSS_SG.MST`
MST	P1pl	`POSS_1STPERSON.MST,POSS_PL.MST`
MST	P2pl	`POSS_2NDPERSON.MST,POSS_PL.MST`
MST	P3pl	`POSS_3RDPERSON.MST,POSS_PL.MST`
MST	Nom	`NOMINATIVE.MST`
MST	Dat	`DATIVE.MST`
MST	Acc	`ACCUSATIVE.MST`
MST	Abl	`ABLATIVE.MST`
MST	Loc	`LOCATIVE.MST`
MST	Ins	`INSTRUMENTAL.MST`
MST	Gen	`GENITIVE.MST`
MST	Equ	`EQUATIVE.MST`
MST	Dim	`DIMINUTIVE.MST`
MST	Ness	`NESS.MST`
MST	With	`WITH.MST`
MST	Without	`WITHOUT.MST`
MST	Related	`RELATED.MST`
MST	JustLike	`JUST_LIKE.MST`
MST	Rel	`RELATION.MST`
MST	Agt	`AGENTIVE.MST`
MST	Become	`BECOME.MST`
MST	Acquire	`ACQUIRE.MST`
MST	Ly	`LY.MST`
MST	Caus	`CAUSATIVE.MST`
MST	Recip	`RECIPROCAL.MST`
MST	Reflex	`REFLEXIVE.MST`
MST	Able	`ABILITY.MST`
MST	Pass	`PASSIVE.MST`
MST	Inf1	`INFINITIVE1.MST`
MST	Inf2	`INFINITIVE2.MST`
MST	Inf3	`INFINITIVE3.MST`
MST	ActOf	`ACT_OF.MST`
MST	PastPart	`PART_PAST.MST`
MST	NarrPart	`PART_NARRATIVE.MST`
MST	FutPart	`PART_FUTURE.MST`
MST	PresPart	`PART_PRESENT.MST`
MST	AorPart	`PART_AORIST.MST`
MST	NotState	`NOT_STATE.MST`
MST	FeelLike	`FEEL_LIKE.MST`
MST	EverSince	`EVER_SINCE.MST`
MST	Repeat	`REPEAT.MST`
MST	Almost	`ALMOST.MST`
MST	Hastily	`HASTILY.MST`
MST	Stay	`STAY.MST`
MST	Start	`START.MST`
MST	AsIf	`AS_IF.MST`
MST	While	`WHILE.MST`
MST	When	`WHEN.MST`
MST	SinceDoingSo	`SINCE_DOING_SO.MST`
MST	AsLongAs	`AS_LONG_AS.MST`
MST	ByDoingSo	`BY_DOING_SO.MST`
MST	Adamantly	`ADAMANTLY.MST`
MST	AfterDoingSo	`AFTER_DOING_SO.MST`
MST	WithoutHavingDoneSo	`WITHOUT_HAVING_DONE_SO.MST`
MST	WithoutBeingAbleToHaveDoneSo	`WITHOUT_BEING_ABLE_TO_DO_SO.MST`
MST	Zero	`ZERO.MST`
MST	Cop	`COP.MST`
MST	Neg	`NEGATIVE.MST`
MST	Unable	`UNABLE.MST`
MST	Pres	`PRESENT.MST`
MST	Past	`PAST.MST`
MST	Narr	`NARRATIVE.MST`
MST	Cond	`CONDITION.MST`
MST	Prog1	`PROGRESSIVE1.MST`
MST	Prog2	`PROGRESSIVE2.MST`
MST	Aor	`AORIST.MST`
MST	Fut	`FUTURE.MST`
MST	Imp	`IMPERATIVE.MST`
MST	Opt	`OPTATIVE.MST`
MST	Desr	`DESIRE.MST`
MST	Neces	`NECESSITY.MST`

Configuration properties

Name	Type	Required	Default
`nonWordChars`	string	no	"“”.¡!¿?…,;‘’'´`

A list of characters that will be removed if they are single tokens.

There are three types of annotation mapping properties. Their value is of the form:

P1sg=POSSESIVE.MST,1STPERSON.MST,SG.MST

Note that the properties are numbered in the configuration file. For example:

annotationsForMST.1 = A1sg=1STPERSON.MST,SG.MST
annotationsForMST.2 = A2sg=2NDPERSON.MST,SG.MST

Name	Type	Required	Default
`annotationsForPos`	string	no	empty

Mapping for the POS tags (primary POS returned from Zemberek API)

Name	Type	Required	Default
`annotationsForPos2`	string	no	empty

Mapping for the POS/NER tags (secondary POS returned from Zemberek API)

Name	Type	Required	Default
`annotationsForMST`	string	no	empty

Mapping for the morphological information tags (morpheme returned from Zemberek API)
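To make the numbered property format concrete, the following Python sketch reads lines of the form `<property>.<n> = <tag>=<annotation>[,<annotation>...]` into a lookup table. The parser is a hypothetical illustration of the format described above and does not reflect how the TurkishAnalyzer loads its configuration.

```python
def parse_annotation_mappings(lines, prefix):
    """Collect '<prefix>.<n> = <tag>=<ann1>,<ann2>' lines into a dict."""
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first '=' only: the value itself contains another '='.
        key, _, value = line.partition("=")
        if not key.strip().startswith(prefix + "."):
            continue  # belongs to a different property
        tag, _, annotations = value.strip().partition("=")
        mappings[tag.strip()] = [a.strip() for a in annotations.split(",")
                                 if a.strip()]
    return mappings


config = [
    "annotationsForMST.1 = A1sg=1STPERSON.MST,SG.MST",
    "annotationsForMST.2 = A2sg=2NDPERSON.MST,SG.MST",
    "annotationsForPos2.1 = Time=TIME.NER",
]
mst = parse_annotation_mappings(config, "annotationsForMST")
```

After parsing, `mst["A1sg"]` holds the two annotations `1STPERSON.MST` and `SG.MST`, and the `annotationsForPos2` line is ignored because it carries a different prefix.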

Further, there are the following files that configure the Zemberek normalizer directly:

asci-map: list of auto-correct mappings
lm.2gram.slm: language model
look-up-from-graph: list of auto-correct mappings
split: list of words to be split

For more information please visit the project site of Zemberek.

System Annotation IP

The SystemAnnotation Input Processor performs simple analysis of the sentence texts to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods: decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This IP works on the sentences passed in but does not modify them.

Other considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:

_EMPTY: the sentence text is empty
_EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
_EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
_QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
_QT3: the sentence text contains three or more characters in a row of the characters specified with questionMarkCharacters
_QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
_DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
_BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
_NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
_BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).
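A subset of these rules can be sketched in Python as follows. This is a rough illustration of the descriptions above, not the SystemAnnotation implementation: the function name and parameter names are hypothetical, the defaults are taken from the configuration properties of this Input Processor, "empty" is assumed to mean a zero-length string, and a bracket pair is assumed to match when the closing bracket appears anywhere after the opening one. The quote and nonsense annotations are left out.

```python
import re


def _char_class(chars: str) -> str:
    """Build a regex character class from a string of literal characters."""
    return "[" + re.escape(chars) + "]"


def system_annotations(sentence, exclamation_chars="!", question_chars="?",
                       bracket_pairs="()[]{}", binary_chars="01",
                       binary_ignored="!?,.-;:# \r\n\t\"'"):
    found = set()
    if sentence == "":
        found.add("_EMPTY")
    if re.search(_char_class(exclamation_chars), sentence):
        found.add("_EXCLAMATION")
    if re.search(_char_class(exclamation_chars) + "{3,}", sentence):
        found.add("_EM3")
    if re.search(_char_class(question_chars), sentence):
        found.add("_QUESTION")
    if re.search(_char_class(question_chars) + "{3,}", sentence):
        found.add("_QT3")
    # Bracket characters are listed as consecutive opening/closing pairs.
    for i in range(0, len(bracket_pairs) - 1, 2):
        opening, closing = bracket_pairs[i], bracket_pairs[i + 1]
        start = sentence.find(opening)
        if start != -1 and sentence.find(closing, start + 1) != -1:
            found.add("_BRACKETPAIR")
            break
    # Binary: at least one binary character, nothing outside both lists.
    if any(c in binary_chars for c in sentence) and all(
            c in binary_chars or c in binary_ignored for c in sentence):
        found.add("_BINARY")
    return found
```

For instance, the input "Ne?!!!" yields `_QUESTION`, `_EXCLAMATION` and `_EM3`, but not `_QT3`, since it contains only one question mark.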

Configuration properties

Name	Type	Required	Default
`consonants`	string	no	BCÇDFGĞHJKLMNPQRSŞTVWXZ bcçdfgğhjklmnpqrsştvwxz

Contains all letters (upper and lower case) that are considered consonants in the language. Together with the properties nonsenseThreshold.absolute and nonsenseThreshold.relative these will be used for detecting probable nonsense inputs like “kljljljljjlj”.

Name	Type	Required	Default
`nonsenseThreshold.absolute`	Positive integer number	no	6

For nonsense detection, an input that consists exclusively of at least this many consonants, without any non-consonants, is considered nonsense.

Name	Type	Required	Default
`nonsenseThreshold.relative`	Positive integer number	no	10

For nonsense detection, an input containing at least this many consonants in a row is considered nonsense.
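One possible reading of the two thresholds is sketched below in Python. The exact decision algorithm is not spelled out here, so the function `looks_like_nonsense` is an assumption based solely on the property descriptions; the consonant list is the default value of `consonants` without the space that separates the upper-case and lower-case halves.

```python
CONSONANTS = set("BCÇDFGĞHJKLMNPQRSŞTVWXZbcçdfgğhjklmnpqrsştvwxz")


def looks_like_nonsense(text: str, absolute: int = 6, relative: int = 10) -> bool:
    """Guess whether text is nonsense using the two consonant thresholds."""
    # Absolute rule: the whole input is consonants and long enough.
    if len(text) >= absolute and all(c in CONSONANTS for c in text):
        return True
    # Relative rule: a long enough run of consonants somewhere in the input.
    run = 0
    for c in text:
        run = run + 1 if c in CONSONANTS else 0
        if run >= relative:
            return True
    return False
```

Under this reading "kljljljljjlj" is flagged by the absolute rule, while an ordinary word such as "merhaba" is not flagged by either rule.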

Name	Type	Required	Default
`exclamationMarkCharacters`	string	no	!

List of characters of which at least one must occur in the sentence text to set annotations _EXCLAMATION and _EM3 (in case of a sequence of at least three of the specified characters).

Name	Type	Required	Default
`questionMarkCharacters`	string	no	?

List of characters of which at least one must occur in the sentence text to set annotations _QUESTION and _QT3 (in case of a sequence of at least three of the specified characters).

Name	Type	Required	Default
`doubleQuoteCharacters`	string	no	“

List of characters of which at least one must occur in the sentence text to set annotation _DBLQUOTE.

Name	Type	Required	Default
`quoteCharacters`	string	no	‘

List of characters of which at least one must occur in the sentence text to set annotation _QUOTE.

Name	Type	Required	Default
`binaryCharacters`	string	no	01

List of characters recognized in the sentence text to set annotation _BINARY.

Name	Type	Required	Default
`binaryIgnoredCharacters`	string	no	`!?,.-;:# \r\n\t\"'`

List of characters additionally allowed in binary text.

Name	Type	Required	Default
`bracketPairCharacters`	string	no	`()[]{}`

List of pairs of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the sentence text to set annotation _BRACKETPAIR.

Special System annotations

Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:

_INIT: indicating session start, i.e. the first input in a dialogue
_TIMEOUT: indicating the continuation of a previously timed-out session/dialogue.

Basic Number Recognizer IP

The BasicNumberRecognizer Input Processor identifies all Arabic numbers of the type 123 and 3,14 in the user input and annotates each of them with the NUMBER annotation and associates a variable to this annotation called numericValue which holds the numeric value of the number found.

This Input Processor is language independent, but every language has its own configuration file for this IP defining decimal point characters and the thousands separator character to be ignored.

For the NUMBER annotation and the variable to be added, a “number” in the user input must meet the following requirements:

It must match the regular expression:

[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

It must be parseable by Java's BigDecimal to ensure it is a number

The above syntax provides the following guarantees:

The sign is not included in the annotated token
The numericValue variable contains a BigDecimal representation of the number.

The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as the decimal marker and the comma as the thousands separator.

Configuration properties

Name	Default
`decimalMarkers`	`,`

The default decimal marker in Turkish is the comma (,).

Name	Default
`charactersToIgnore`	`.`

The default character to ignore is the dot (.)
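With the Turkish defaults, a number such as 1.234,50 should therefore be read as 1234.50. The Python sketch below illustrates this; it is loosely modelled on the regular expression shown earlier but omits its leading optional separator, uses `decimal.Decimal` as a stand-in for Java's BigDecimal, and the function names are hypothetical.

```python
import re
from decimal import Decimal


def build_number_pattern(decimal_markers=",", characters_to_ignore="."):
    """Compile a number regex from the configured marker characters."""
    dec = "[" + re.escape(decimal_markers) + "]"
    ign = "[" + re.escape(characters_to_ignore) + "]"
    return re.compile(
        rf"[0-9]+(?:{ign}[0-9]+)*(?:{dec}[0-9]+)?|{dec}[0-9]+")


def find_numbers(text, decimal_markers=",", characters_to_ignore="."):
    """Return (token, numeric value) pairs for every number in text."""
    pattern = build_number_pattern(decimal_markers, characters_to_ignore)
    results = []
    for match in pattern.finditer(text):
        token = match.group()
        # Drop ignored separators and turn the decimal marker into a dot.
        digits = "".join("." if c in decimal_markers else c
                         for c in token if c not in characters_to_ignore)
        results.append((token, Decimal(digits)))
    return results


numbers = find_numbers("Fiyat 1.234,50 lira ve 3,14")
```

The annotated tokens keep their surface form ("1.234,50" and "3,14"), while the associated values are the decimals 1234.50 and 3.14. A leading sign is not part of the token, so "-5" yields the token "5".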

Language Detector IP

The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input as well as a confidence score of the prediction.

The Language Detector IP can predict the following 45 languages (language label in brackets):

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language, under the label SR_HR and Indonesian and Malay are treated as one language, under the label ID_MS.

A number of regexes are also in use by the Input Processor, helping the model to not predict language for fully numerical inputs, URLs or other type of nonsense inputs.

The Language Detector will provide an annotation when the prediction confidence is above 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations will always be created, even for predictions below 0.2, since the Language Detector is mostly accurate when predicting them.
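The threshold rule can be summarised in a few lines of Python. This is a sketch of the rule as described, assuming a single 0.2 threshold and an annotation name of the form `<label>.LANG`; the function name and return shape are hypothetical.

```python
# Languages that are always annotated, regardless of confidence.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}


def language_annotation(label: str, confidence: float, threshold: float = 0.2):
    """Return (annotation name, confidence) or None if nothing is annotated."""
    if label in ALWAYS_ANNOTATED or confidence > threshold:
        return (f"{label}.LANG", confidence)
    return None
```

A Turkish prediction with confidence 0.15 would produce no annotation, whereas a Japanese prediction with confidence 0.05 would still be annotated.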

Predict IP

The Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict IP uses the model to annotate each user input with the machine learning classes defined.

Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical thresholds can be configured in the properties file of the Input Processor.

For each selected class, an annotation with the name <CLASS_NAME>.INTENT will be created, with the value of the model confidence in the class. A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
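The selection logic can be sketched in Python as follows. It is an illustration under stated assumptions, not the Predict implementation: the parameter names are hypothetical, the cap of 5 is assumed to apply to the `.INTENT` annotations, and the top class is assumed to receive both a `.TOP_INTENT` and an `.INTENT` annotation.

```python
def select_intents(scores, min_confidence=0.01, similarity=0.5,
                   max_annotations=5):
    """Pick the annotations to create from a {class name: confidence} dict."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_confidence = ranked[0]
    # The most confident class is always selected; the others must pass
    # both the absolute and the relative confidence criterion.
    selected = [ranked[0]] + [
        (name, confidence) for name, confidence in ranked[1:]
        if confidence > min_confidence
        and confidence > similarity * top_confidence
    ]
    annotations = [(f"{top_name}.TOP_INTENT", top_confidence)]
    for name, confidence in selected[:max_annotations]:
        annotations.append((f"{name}.INTENT", confidence))
    return annotations


result = select_intents(
    {"BOOK_FLIGHT": 0.7, "CANCEL": 0.4, "GREET": 0.3, "OTHER": 0.005})
```

With a top confidence of 0.7, only classes above 0.35 are kept, so `CANCEL` (0.4) is annotated while `GREET` (0.3) and `OTHER` (0.005) are discarded.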

Configuration properties

Name	Type	Required	Default
`minConfidenceSimilarityDistance`	float	no	0.5

Fraction of the top class's confidence that a class must reach in order to be considered (e.g. if the top confidence class has a confidence of 0.7, classes with confidences lower than 0.5 x 0.7 = 0.35 will be discarded).

Name	Type	Required	Default
`maxNumberOfAnnotations`	int	no	5

Maximum number of class annotations to create for each user input.

Name	Type	Required	Default
`minConfidenceThreshold`	float	no	0.01

Minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.

Name	Type	Required	Default
`intent.model.file.name`	string (filename)	no	inexistent

Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.
