Korean Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalizing and tokenizing inputs or Part-of-Speech (POS) tagging. Each language that the Teneo Platform supports has a chain of Input Processors that know how to process that language. This page details the Input Processors chain for the Korean language.

Input Processors chain setup

The following graph displays the default setup of the Korean Input Processor Chain:

graph TD
  subgraph ips [ ]
    splitting[Standard Splitting] --> morphological[Korean Morphological Analyzer]
    morphological[Korean Morphological Analyzer] --> annotation[System Annotation]
    annotation[System Annotation] --> number[Basic Number Recognizer]
    number[Basic Number Recognizer] --> languagedetect[Language Detector]
    languagedetect[Language Detector] --> predict[Predict]
  end
  input([User Input]) --User gives input--> splitting
  predict[Predict] --Parsed input--> parsed([To Dialog Processing])

The standard Input Processors are listed below with a short description of the IP's functionality:

The Standard Splitting IP divides the user input text into sentences and words, considering abbreviations that should not be split.
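The Korean Morphological Analyzer IP runs Komoran on every sentence of the user input and annotates each word with its root, Part-of-Speech and morphological information.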
The System Annotation IP sets a number of annotations based on properties of the user input text.
The Basic Number Recognizer IP identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with the NUMBER annotation and associates a variable to this annotation called numericValue.
The Language Detector IP identifies the language of the input sentence provided and annotates it with the predicted language and associates the confidence score of the prediction.
The Predict IP classifies user input based on a machine learning model trained in Teneo Learn and annotates the user input with the predicted top intent classes and confidence score.

Korean Simplifier

The Korean Simplifier is a special kind of processor that is used to normalize the user input by:

converting full width Latin letters and Arabic digits into their half width version, and
lowercasing the uppercase Latin letters.

This Simplifier is special because it is not run as a part of the input processor chain, but rather by the Tokenizer when it puts the tokens into a Teneo data structure. Additionally, the Simplifier is also run by the condition parser inside Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
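
The two normalization steps can be illustrated with a minimal sketch. It is not the Engine's implementation: the function name `simplify` is hypothetical, and the sketch assumes that "full width Latin letters and Arabic digits" means the Unicode ranges U+FF10 to U+FF19, U+FF21 to U+FF3A and U+FF41 to U+FF5A, and that only the ASCII letters A to Z are lowercased.

```python
# Illustrative sketch only; not the Teneo Engine implementation.
FULLWIDTH_OFFSET = 0xFEE0  # distance between a full-width character and its ASCII twin


def simplify(text: str) -> str:
    """Convert full-width Latin letters/digits to half width, then lowercase A-Z."""
    result = []
    for ch in text:
        cp = ord(ch)
        # Assumed ranges: full-width digits, uppercase and lowercase Latin letters.
        if 0xFF10 <= cp <= 0xFF19 or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
            ch = chr(cp - FULLWIDTH_OFFSET)
        if "A" <= ch <= "Z":
            ch = ch.lower()
        result.append(ch)
    return "".join(result)


print(simplify("ＴＥＮＥＯ １２３ Studio 한국어"))
```

Hangul characters are left untouched, which is why the sketch maps explicit ranges instead of applying a general Unicode compatibility normalization.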

Standard Splitting IP

The Standard Splitting Input Processor splits the user input text into sentences and words. Splitting is performed at configurable sentence and word delimiters. Splitting exceptions can be defined as a configurable list of abbreviations and a configurable regular expression.

This Input Processor generates one or more sentences, with zero or more words. The generated WordData objects contain the original and simplified form of the word. The final word-form is initialized with the simplified word form.

Other considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)

Configuration properties

Defining abbreviations

Name	Type	Required	Default
`abbreviations.item.*`	Format: `abbreviations.item.<n> = <abbreviation>` `<n>`: number, which must be unique within the abbreviation definitions of one file `<abbreviation>`: an abbreviation	no	none

List of abbreviations. Abbreviations are considered in the sentence separation process. Sentence delimiters within abbreviations will not lead to separated sentences.

Name	Type	Required	Default
`abbreviations.file.name`	string (filename)	no	empty

Filename (including path) of an extra file containing abbreviations. A relative filename relates to the location of the properties file.

Name	Type	Required	Default
`abbreviations.file.encoding`	string (encoding name)	no	UTF-8

Encoding of the extra file containing abbreviations.

Controlling user input separation into sentences and words

Name	Type	Required	Default
`inputSeparation.sentenceDelimiters`	string	no	`. ¡ ! ¿ ? …`

List of characters that are used to separate sentences (unless part of an abbreviation).
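
As a rough illustration of how abbreviations can protect sentence delimiters, the sketch below splits on the default delimiters unless the delimiter lies inside a listed abbreviation. The function name, the sample abbreviations, and the choice to keep the delimiter at the end of each sentence are assumptions made for the example, not documented behavior of the Input Processor.

```python
# Illustrative sketch only; names and sample abbreviations are hypothetical.
SENTENCE_DELIMITERS = set(".¡!¿?…")   # default of inputSeparation.sentenceDelimiters
ABBREVIATIONS = ["Dr.", "e.g."]       # stand-ins for abbreviations.item.<n> entries


def split_sentences(text, abbreviations=ABBREVIATIONS):
    # Collect the character positions covered by any abbreviation occurrence.
    protected = set()
    for abbreviation in abbreviations:
        start = text.find(abbreviation)
        while start != -1:
            protected.update(range(start, start + len(abbreviation)))
            start = text.find(abbreviation, start + 1)

    sentences, current = [], []
    for index, ch in enumerate(text):
        current.append(ch)
        # Split only at delimiters that are not part of an abbreviation.
        if ch in SENTENCE_DELIMITERS and index not in protected:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Kim is here. Really?"))
```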

Name	Type	Required	Default
`inputSeparation.wordDelimiters`	string	no	`$ € £ % & ^ " “ ” # \| ~ § ° [] () < > {} = + - ÷ \\ * / , : ; \r \n \t • . ¡ ! ¿ ? …`

List of characters that are used to separate words. Delimiting characters will be kept as separate words, except for those that are listed under inputSeparation.nonWordCharacters (see below).

inputSeparation.additionalWordDelimiterRegEx may be used to specify additional or alternative word delimiting.

Name	Type	Required	Default
`inputSeparation.additionalWordDelimiterRegEx`	string	no	empty

Additional word delimiting regular expression. This is an optional regular expression for delimiting words or defining (optionally zero-width) word boundaries. It may be specified in addition to, or as an alternative to, inputSeparation.wordDelimiters.

NOTE: in Java 6 and 7, a positive look-behind construct in the regex does not work with Unicode blocks outside the BMP if the block is specified with the \p{In...} construct, probably due to a bug in the Java regex implementation. Instead, the characters must be specified directly as a range.

Name	Type	Required	Default
`inputSeparation.nonWordCharacters`	string	no	`" “ ” . ¡ ! ¿ ? … , ; \t \r \n`

Word separators that shall not be kept as words. The set of characters specified here should be a subset of inputSeparation.wordDelimiters and the characters matched by inputSeparation.additionalWordDelimiterRegEx.

Example (assuming defaults):

Argh$%, separate this!

will be separated into:

Argh
$
%
separate
this
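
The example can be reproduced with a minimal sketch of the splitting rule: delimiters end the current word, and a delimiter is kept as its own word unless it is a non-word character. The function name is hypothetical, and treating the space character as a delimiter that is never kept is an assumption inferred from the example, not a documented default.

```python
# Illustrative sketch only; not the Input Processor's implementation.
WORD_DELIMITERS = set('$€£%&^"“”#|~§°[]()<>{}=+-÷\\*/,:;\r\n\t•.¡!¿?…')
NON_WORD_CHARACTERS = set('"“”.¡!¿?…,;\t\r\n')


def split_words(sentence):
    words, current = [], []
    for ch in sentence:
        if ch == " " or ch in WORD_DELIMITERS:
            # A delimiter closes the word collected so far.
            if current:
                words.append("".join(current))
                current = []
            # Keep the delimiter as a word unless it is a non-word character.
            if ch != " " and ch not in NON_WORD_CHARACTERS:
                words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("Argh$%, separate this!"))
```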

Name	Type	Required	Default
`inputSeparation.excludeWordDelimitersRegEx`	string	no	`(?<=([<SP><HT><CR><LF> "“”,;.¡!¿?…\d]\|^))[.,](?=\d)`

Regular expression that specifies exceptions to the splitting of a sentence into words.

The default regular expression prevents the characters , (comma) and . (dot) from acting as word delimiters when they appear in the context of a number.

Note: the text matched by the regular expression is excluded from splitting, so any word-splitting characters used only as a context condition should be given as zero-width look-behind/look-ahead constructs.
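
To see which delimiters the default expression protects, the sketch below applies an adapted form of it in Python. The adaptation is an assumption of this example: the placeholders `<SP><HT><CR><LF>` are written as a space, `\t`, `\r` and `\n`, and the start-of-input alternative is moved outside the look-behind because Python's `re` module only accepts fixed-width look-behinds. The function name is hypothetical.

```python
import re

# Adapted from the default inputSeparation.excludeWordDelimitersRegEx (see lead-in).
EXCLUDE_RE = re.compile(r'(?:(?<=[ \t\r\n"“”,;.¡!¿?…\d])|^)[.,](?=\d)')


def protected_delimiters(text):
    """Return the positions of dots/commas that would not act as word delimiters."""
    return [match.start() for match in EXCLUDE_RE.finditer(text)]


sample = "Pay 3.14 or 1,000 now. ok"
for position in protected_delimiters(sample):
    print(position, repr(sample[position]))
```

Only the dot in 3.14 and the comma in 1,000 are matched; the dot after "now" is followed by a space rather than a digit, so it still delimits.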

Korean Morphological Analyzer IP

The Korean Morphological Analyzer Input Processor runs Komoran on every sentence of the user input, as provided by the Standard Splitting Input Processor before it. Komoran returns the root of every word in the sentence as well as a tag that contains both Part-of-Speech and morphological information. The Korean Morphological Analyzer Input Processor then converts the root, the Part-of-Speech and the morphological information of every word into Teneo annotations.

The table below lists how the tags from Komoran are mapped to annotations in Teneo:

Komoran tag	Description	Map to Teneo annotation(s)
VA	Adjective	ADJ.POS
JKG	Adnominal case marker	JKG.MST
MAG	Adverb	ADV.POS
JKB	Adverbial case marker	JKB.MST
VCP	Affirmation/positive	VCP.MST
NA	Analytical Category	NA.MST
JX	Auxiliary postpositional particle	JX.MST
VX	Auxiliary predicate element	VX.MST
NNB	Bound noun	NN.POS, NNB.MST
SN	Cardinal number	CARDINAL.POS
JKC	Complement case marker	JKC.MST
JC	Conjunctive postpositional particle	JC.MST
MAJ	Connective adverb	ADV.POS, CONNECTIVE.MST
EC	Connective ending	EC.MST
VCN	Denial/negative	VCN.MST
MM	Determiner	DET.POS
ET	Ending of a word	ET.MST
ETM	Ending of a word	ETM.MST
SL	Foreign word	FOREIGN.POS
SH	Foreign word	FOREIGN.POS
IC	Interjection	INTERJ.POS
NNG	Noun	NN.POS
NF	Noun estimation category	NF.MST
NR	Numeral	NR.MST
JKO	Object case marker	JKO.MST
JKQ	Postposition, postpositional particle	JKQ.MST
EP	Pre-final ending	EP.MST
XPN	Prefix	XPN.MST
NP	Pronoun	PRON.POS
NNP	Proper Noun	NN.POS, PROPER.POS
SF	Punctuation	PUNCT.POS
SP	Punctuation	PUNCT.POS
SS	Punctuation	PUNCT.POS
SE	Punctuation	PUNCT.POS
SO	Punctuation	PUNCT.POS
XR	Root	XR.MST
EF	Sentence-closing ending	EF.MST
JKS	Subject case marker	JKS.MST
XSN	Suffix	XSN.MST
XSV	Suffix	XSV.MST
XSA	Suffix	XSA.MST
SW	Symbol	SYM.POS
VV	Verb	VB.POS
NV	Verb estimation category	NV.MST
JKV	Vocative case marker	JKV.MST
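
The mapping amounts to a lookup from Komoran tag to one or more Teneo annotation names. The sketch below covers a subset of the table; the function name and the hand-written (morpheme, tag) pairs are hypothetical and are not real Komoran output.

```python
# Subset of the tag table above; illustrative sketch only.
KOMORAN_TO_TENEO = {
    "VA": ["ADJ.POS"],
    "MAG": ["ADV.POS"],
    "MAJ": ["ADV.POS", "CONNECTIVE.MST"],
    "NNG": ["NN.POS"],
    "NNB": ["NN.POS", "NNB.MST"],
    "NNP": ["NN.POS", "PROPER.POS"],
    "NP": ["PRON.POS"],
    "VV": ["VB.POS"],
    "SN": ["CARDINAL.POS"],
    "SF": ["PUNCT.POS"],
    "JKS": ["JKS.MST"],
    "JKO": ["JKO.MST"],
    "EF": ["EF.MST"],
}


def annotations_for(tag):
    """Return the Teneo annotation names for a Komoran tag (empty if unmapped here)."""
    return KOMORAN_TO_TENEO.get(tag, [])


# Hand-written (morpheme, tag) pairs used only to exercise the lookup.
tagged = [("서울", "NNP"), ("3", "SN"), (".", "SF")]
for morpheme, tag in tagged:
    print(morpheme, tag, annotations_for(tag))
```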

System Annotation IP

The System Annotation Input Processor performs simple analysis of the sentence texts to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods: decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This IP works on the sentences passed in but does not modify them.

Other considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:

_EMPTY: the sentence text is empty
_EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
_EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
_QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
_QT3: the sentence text contains three or more characters in a row of the characters specified with questionMarkCharacters
_QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
_DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
_BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
_NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
_BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).

Configuration properties

Name	Type	Required	Default
`consonants`	string	no	BCDFGHJKLMNPQRSTVWXZ bcdfghjklmnpqrsßtvwxz ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ

Contains all letters (upper and lower case) that are considered consonants in the language. Together with the properties nonsenseThreshold.absolute and nonsenseThreshold.relative these will be used for detecting probable nonsense inputs like kljljljljjlj.

Name	Type	Required	Default
`nonsenseThreshold.absolute`	Positive integer number	no	6

For nonsense detection, an input consisting exclusively of this many consonants, without any non-consonants, is considered nonsense.

Name	Type	Required	Default
`nonsenseThreshold.relative`	Positive integer number	no	10

For nonsense detection, an input containing this many consonants in a row is considered nonsense.
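
One possible reading of the two thresholds is sketched below. The function name is hypothetical, and interpreting both thresholds as "this many or more" is an assumption; the actual decision logic of the Input Processor may differ.

```python
# Illustrative sketch of one interpretation; not the Input Processor's algorithm.
CONSONANTS = set(
    "BCDFGHJKLMNPQRSTVWXZ"
    "bcdfghjklmnpqrsßtvwxz"
    "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
)


def looks_like_nonsense(text, absolute=6, relative=10):
    # Absolute threshold: the whole input is consonants and is long enough (assumed >=).
    if len(text) >= absolute and all(ch in CONSONANTS for ch in text):
        return True
    # Relative threshold: a long enough run of consecutive consonants (assumed >=).
    run = 0
    for ch in text:
        run = run + 1 if ch in CONSONANTS else 0
        if run >= relative:
            return True
    return False


print(looks_like_nonsense("kljljljljjlj"))
print(looks_like_nonsense("hello"))
```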

Name	Type	Required	Default
`exclamationMarkCharacters`	string	no	`!！`

List of characters of which at least one must occur in the sentence text to set annotations _EXCLAMATION and _EM3 (in case of a sequence of at least three of the specified characters).

Name	Type	Required	Default
`questionMarkCharacters`	string	no	`?？`

List of characters of which at least one must occur in the sentence text to set annotations _QUESTION and _QT3 (in case of a sequence of at least three of the specified characters).

Name	Type	Required	Default
`doubleQuoteCharacters`	string	no	`"“”＂『』`

List of characters of which at least one must occur in the sentence text to set annotation _DBLQUOTE.

Name	Type	Required	Default
`quoteCharacters`	string	no	`'＇「」`

List of characters of which at least one must occur in the sentence text to set annotation _QUOTE.

Name	Type	Required	Default
`binaryCharacters`	string	no	01

List of characters recognized in the sentence text to set annotation _BINARY.

Name	Type	Required	Default
`binaryIgnoredCharacters`	string	no	`!?,.-;:# \r\n\t\"'`

List of characters additionally allowed in binary text.

Name	Type	Required	Default
`bracketPairCharacters`	string	no	`()[]{}（）〔〕［］｛｝〈〉《》〚〛〘〙〖〗【】｟｠`

List of pairs of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the sentence text to set annotation _BRACKETPAIR.
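
The character-based annotations can be approximated with a few simple checks over the default property values. This is a sketch under assumptions: the function names are hypothetical, a bracket pair is assumed to require the opening character before the closing one, and _NONSENSE is left out.

```python
import re

# Default property values as listed above.
EXCLAMATION = "!！"
QUESTION = "?？"
DOUBLE_QUOTES = "\"“”＂『』"
QUOTES = "'＇「」"
BINARY = "01"
BINARY_IGNORED = "!?,.-;:# \r\n\t\"'"
BRACKET_PAIRS = "()[]{}（）〔〕［］｛｝〈〉《》〚〛〘〙〖〗【】｟｠"


def _has_any(text, chars):
    return any(ch in chars for ch in text)


def _has_run(text, chars, length=3):
    # Three or more of the given characters in a row.
    return re.search("[" + re.escape(chars) + "]{%d,}" % length, text) is not None


def _has_bracket_pair(text, pairs):
    # Pairs are stored as consecutive opening/closing characters.
    for i in range(0, len(pairs), 2):
        start = text.find(pairs[i])
        if start != -1 and text.find(pairs[i + 1], start + 1) != -1:
            return True
    return False


def system_annotations(text):
    found = set()
    if text == "":
        found.add("_EMPTY")
    if _has_any(text, EXCLAMATION):
        found.add("_EXCLAMATION")
    if _has_run(text, EXCLAMATION):
        found.add("_EM3")
    if _has_any(text, QUESTION):
        found.add("_QUESTION")
    if _has_run(text, QUESTION):
        found.add("_QT3")
    if _has_any(text, QUOTES):
        found.add("_QUOTE")
    if _has_any(text, DOUBLE_QUOTES):
        found.add("_DBLQUOTE")
    if _has_bracket_pair(text, BRACKET_PAIRS):
        found.add("_BRACKETPAIR")
    # Binary: at least one binary character, and nothing but binary/ignored characters.
    if _has_any(text, BINARY) and all(ch in BINARY or ch in BINARY_IGNORED for ch in text):
        found.add("_BINARY")
    return found


print(sorted(system_annotations("Really???")))
```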

Special System annotations

Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:

_INIT: indicates session start, i.e. the first input in a dialogue
_TIMEOUT: indicates the continuation of a previously timed-out session/dialogue.

Basic Number Recognizer IP

The Basic Number Recognizer Input Processor identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with the NUMBER annotation and associates a variable called numericValue with this annotation, which holds the numeric value of the number found.

This Input Processor is language independent, but every language has its own configuration file for this IP defining decimal point characters and the thousands separator character to be ignored.

For the NUMBER annotation and the variable to be added, a "number" in the user input must meet both of the following conditions:

It must match the regular expression:

[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

It must be parseable by Java's BigDecimal to ensure it is a number

The above syntax provides the following guarantees:

The sign is not included in the annotated token
The numericValue variable contains a BigDecimal representation of the number.

The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as the decimal marker and the comma as the thousands separator.
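
The two conditions can be sketched as follows, using Python's `Decimal` as a stand-in for Java's BigDecimal. The function name is hypothetical, and the sketch only handles the dot and comma from the regex above, not the full-width characters configured for Korean.

```python
import re
from decimal import Decimal, InvalidOperation

# The regular expression from the text, with the dot as decimal marker
# and the comma as the ignored thousands separator.
NUMBER_RE = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def numeric_value(token):
    """Return the Decimal value of a token, or None if it is not a number."""
    if NUMBER_RE.fullmatch(token) is None:
        return None
    try:
        # Drop the characters to ignore before parsing.
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None


for token in ["123", "3.14", "1,234.5", ".5", "-3", "abc"]:
    print(token, numeric_value(token))
```

The token "-3" is rejected because the sign is not part of the number syntax, which matches the guarantee that the sign is not included in the annotated token.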

Configuration properties

Name	Default
`decimalMarkers`	`．.`

The default decimal markers in Korean.

Name	Default
`charactersToIgnore`	`,，`

The default characters to ignore.

Language Detector IP

The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input as well as a confidence score of the prediction.

The Language Detector IP can predict the following 45 languages (language label in brackets):

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

The Input Processor also uses a number of regexes that keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the prediction confidence is above 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions below 0.2, since the Language Detector is mostly accurate when predicting them.
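
The threshold rule can be summarized in a small sketch. The function name is hypothetical, and the returned string is only the annotation name without the `%$` prefix used when referring to annotations.

```python
# Illustrative sketch of the threshold rule described above.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}
CONFIDENCE_THRESHOLD = 0.2


def language_annotation(label, confidence):
    """Return the annotation name for a prediction, or None if none is created."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return f"{label}.LANG"
    return None


print(language_annotation("KO", 0.05))
print(language_annotation("EN", 0.1))
print(language_annotation("EN", 0.9))
```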

The Predict IP

The Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict IP uses the model to annotate each user input with the machine learning classes defined.

Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:

the confidence is above the minimum confidence (defaults to 0.01)
the confidence is higher than 0.5 times the confidence value of the top class.

The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical thresholds can be configured in the properties file of the Input Processor.

For each selected class, an annotation with the name <CLASS_NAME>.INTENT will be created, with the value of the model confidence in the class. A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
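
The selection rules can be sketched as below. The function names and the sample class names are hypothetical, and two details are assumptions of the sketch: the cap of 5 is applied to the number of selected classes, and the top class receives both the .TOP_INTENT and the .INTENT annotation.

```python
# Illustrative sketch of the selection criteria; not the Predict IP implementation.
def select_intents(scores, min_confidence=0.01, similarity=0.5, max_annotations=5):
    """Pick the top class plus every other class that passes both criteria."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_confidence = ranked[0]
    selected = [(top_name, top_confidence)]
    for name, confidence in ranked[1:]:
        if len(selected) >= max_annotations:
            break
        if confidence > min_confidence and confidence > similarity * top_confidence:
            selected.append((name, confidence))
    return selected


def annotation_names(selected):
    """Build annotation names: one TOP_INTENT plus one INTENT per selected class."""
    if not selected:
        return []
    names = [f"{selected[0][0]}.TOP_INTENT"]
    names += [f"{name}.INTENT" for name, _ in selected]
    return names


scores = {"BOOK_FLIGHT": 0.7, "CANCEL": 0.4, "GREETING": 0.3, "OTHER": 0.005}
print(annotation_names(select_intents(scores)))
```

With a top confidence of 0.7, the cut-off is 0.5 x 0.7 = 0.35, so CANCEL (0.4) is kept while GREETING (0.3) and OTHER (0.005) are discarded.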

Configuration properties

Name	Type	Required	Default
`minConfidenceSimilarityDistance`	float	no	0.5

Fraction of the top class's confidence that a class must reach in order to be considered, e.g. if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 will be discarded.

Name	Type	Required	Default
`maxNumberOfAnnotations`	int	no	5

Maximum number of class annotations to create for each user input.

Name	Type	Required	Default
`minConfidenceThreshold`	float	no	0.01

Minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.

Name	Type	Required	Default
`intent.model.file.name`	string (filename)	no	inexistent

Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.
