Japanese Input Processors Chain
Introduction
An Input Processor (IP) pre-processes inputs for Teneo Engine to be able to perform different processes on them. Each language supported by the Teneo Platform has a chain of input processors that know how to process that language.
Japanese was the first language of the Teneo Platform to take full advantage of the ability to customize user input processing offered by the Teneo Engine architecture. While more and more languages take advantage of this approach, the Japanese input processors keep evolving.
Input Processors Chain setup
The following graph displays the default setup of the Japanese Input Processor Chain:
The Input Processors are listed below, each with a short description of its functionality; the following sections go into further detail.
- The Japanese Tokenizer IP performs sentence segmentation and tokenization.
- The Japanese Annotator IP creates annotations related to lemma, Part-of-Speech (POS), morphosyntactic information and named entities.
- The Japanese Number Recognizer IP is capable of recognizing the following types of number expressions: Arabic numbers, formal and colloquial Kanji numbers, Hiragana numbers, numbers with counters not split from the actual numeric expression, numbers with factors both larger and smaller than zero, decimal numbers and fractions.
- The System Annotation IP sets a number of annotations based on properties of the user input text.
- The Language Detector IP identifies the language of the input sentence and annotates the input with the predicted language and the confidence score of the prediction.
- The Predict IP classifies user inputs based on a machine learning model trained in Teneo Learn and annotates them with the predicted top intent classes and their confidence scores.
- The DateTime Recognizer IP recognizes and annotates various date and/or time expressions which are used by Language Objects to support the date and time interpretation.
Japanese Simplifier
The Japanese Simplifier is a special kind of processor that is used to normalize the user input by:
- converting full width Latin letters and Arabic digits into their half width version, and
- lowercasing the uppercase Latin letters.
This Simplifier is special because it is not run as a part of the input processor chain, but rather by the Tokenizer when it puts the tokens generated by Kuromoji into a Teneo data structure. Additionally, the Simplifier is also run by the condition parser inside Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
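The two normalization steps can be sketched as follows. This is a minimal illustration, not the Simplifier's actual implementation: the function name `simplify` is hypothetical, and the sketch assumes that "full width" refers to the Unicode Fullwidth Forms for digits (U+FF10 to U+FF19) and Latin letters (U+FF21 to U+FF3A and U+FF41 to U+FF5A).

```python
# Hypothetical sketch of the normalization described above; not Teneo code.
FULLWIDTH_OFFSET = 0xFEE0  # distance between a full width form and its ASCII counterpart


def _fullwidth_code_points():
    # Full width digits, uppercase Latin letters and lowercase Latin letters.
    yield from range(0xFF10, 0xFF1A)
    yield from range(0xFF21, 0xFF3B)
    yield from range(0xFF41, 0xFF5B)


_TO_HALFWIDTH = {cp: cp - FULLWIDTH_OFFSET for cp in _fullwidth_code_points()}
# Lowercase only the ASCII uppercase letters so that other scripts stay untouched.
_TO_LOWERCASE = {cp: cp + 32 for cp in range(ord("A"), ord("Z") + 1)}


def simplify(text: str) -> str:
    """Convert full width Latin letters and digits to half width, then lowercase."""
    return text.translate(_TO_HALFWIDTH).translate(_TO_LOWERCASE)
```

Applying the same function to both the user input tokens and the Language Object syntax words is what would make the two sides comparable, which matches the double use of the Simplifier described above.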
Japanese Tokenizer IP
The Japanese Tokenizer Input Processor runs Kuromoji (a Japanese tokenizer) on raw input strings and then processes the tokens returned by Kuromoji into words and sentences for Teneo.
Since the tokenization of Kuromoji is too aggressive for the purposes of Teneo, the processing of the tokens involves a set of hard-coded rules that concatenate some of the tokens into bigger units.
In the exceptional case of the Japanese interpunct symbol “·”, the Japanese Tokenizer also splits tokens from Kuromoji.
The concatenation is done by a separate helper class, JapaneseConcatenator, which is instantiated for each input. Concatenation and sentence segmentation are implemented exclusively in that class. The concatenation rules are hard-coded and process the token sequence from left to right, deciding whether the observed tokens should be concatenated and whether a new sentence should be started.
When a word object is created, the features from the Kuromoji tokens that form part of that word are passed into the property map. Those features are then retrieved from the property map of each word by the Japanese Annotator.
The concatenation of the Japanese Tokenizer can be overridden by introducing entries into the solution dictionary (i.e. in a language object) that follow the pattern below:
DICTEXT_tok_lemma
In other words, entries that have the prefix DICTEXT will be considered by the concatenator, and the token tok will never be concatenated, i.e. it will be a standalone word with the lemma annotation lemma.
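As an illustration of the entry pattern, the sketch below splits a dictionary entry into its token and lemma parts. The function name `parse_dictext` is hypothetical, and the sketch assumes that the token itself contains no underscore; how the concatenator actually parses these entries is not described here.

```python
from typing import Optional, Tuple

PREFIX = "DICTEXT"


def parse_dictext(entry: str) -> Optional[Tuple[str, str]]:
    """Return (token, lemma) for an entry shaped like DICTEXT_tok_lemma, else None."""
    # Split on the first two underscores only, so the lemma may contain underscores.
    parts = entry.split("_", 2)
    if len(parts) != 3 or parts[0] != PREFIX or not parts[1] or not parts[2]:
        return None
    return parts[1], parts[2]
```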
Splitting of tokens with numbers
As of Teneo 6, in order to cater for date and time recognition in Japanese, the Japanese Tokenizer splits tokens that contain slashes, dashes, tildes, colons, commas, dots and interpuncts in certain contexts, as detailed below:
- Splits numbers from slashes and keeps each slash as a separate token if there are at least two slashes between numbers, e.g. 25/04/2020; a single slash between numbers is not split, as that is a fraction that the Number Recognizer should recognize, e.g. 2/3.
- Splits numbers from dashes, tildes and interpuncts and keeps them as separate tokens, e.g. 25 - 04 - 2020, 25 ~ 04 ~ 2020, 25 ・ 04 ・ 2020.
- Splits numbers from dots and keeps the dots as separate tokens, e.g. 25 . 04 . 2020; a single dot between numbers is not split, as the dot could be a decimal marker, e.g. 1.5.
- Splits numbers from commas and keeps the comma as a separate token, e.g. 2,3; a comma followed by three digits is not split, as the comma could be a thousands separator, e.g. 1,000.
- Splits numbers from colons and keeps the colon as a separate token, e.g. 10 : 30.
- Splits special number tokens that Kuromoji doesn't split, e.g. 2人 or 2,3.
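The separator rules above can be sketched as a single function. This is a simplified illustration under stated assumptions, not the Tokenizer's implementation: the function name `split_numeric_token` is hypothetical, only tokens made of digits joined by one kind of separator are handled, and counters such as 人 are left out.

```python
import re
from typing import List

_SEPARATORS = r"[/\-~・:.,]"
_NUMERIC_TOKEN = re.compile(r"\d+(?:" + _SEPARATORS + r"\d+)+")


def split_numeric_token(token: str) -> List[str]:
    """Split digits from separators, following the rules listed above."""
    if not _NUMERIC_TOKEN.fullmatch(token):
        return [token]
    parts = re.split("(" + _SEPARATORS + ")", token)  # keeps the separators
    numbers, separators = parts[0::2], parts[1::2]
    if len(set(separators)) != 1:
        return [token]  # mixed separators are outside the scope of this sketch
    separator = separators[0]
    if separator == "/" and len(separators) < 2:
        return [token]  # single slash: a fraction such as 2/3
    if separator == "." and len(separators) == 1:
        return [token]  # single dot: a decimal number such as 1.5
    if separator == "," and all(len(n) == 3 for n in numbers[1:]):
        return [token]  # three digits after each comma: thousands separator
    return parts
```

For example, `split_numeric_token("25/04/2020")` yields the five tokens `25`, `/`, `04`, `/`, `2020`, while `2/3`, `1.5` and `1,000` are returned unchanged so that a later number recognition step can treat them as single numbers.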
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
nonWordToken | String | no | "“”\r『』'「」()[]{}()〔〕[]{}〈〉《》!!??…,,、..。。・〚〛〘〙〖〗【】⦅⦆'" |
List of characters that will not be output as a word
Name | Type | Required | Default |
---|---|---|---|
sentenceDelimiters | String | no | !!??..。。… |
List of characters that delimit a sentence
Name | Type | Required | Default |
---|---|---|---|
decimalPoints | String | no | ...点・ |
List of characters that are used as a decimal point in numbers
Name | Type | Required | Default |
---|---|---|---|
noSeparationRegEx | String | no | (?:\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)| (?iuU:(?:(?:(?:https?|ftp|file)://)|www\.)[-\p{IsLatin}0-9+&@#/%?=~_|!:,.;]*[-\p{IsLatin}0-9+&@#/%=~_|])* |
Regex for sequences that will not be split if the regex covers the full surface forms of multiple Kuromoji tokens. The default covers most URLs and e-mails.
Name | Type | Required | Default |
---|---|---|---|
Emojis | String | no | [\x{1f000}-\x{1f7ff}\x{2600}-\x{27ff}\x{1f900}-\x{1f9e6}] |
Regex to detect emojis that will be tokenized into individual words.
Japanese Annotator IP
The Japanese Annotator Input Processor processes the words one by one. For each Kuromoji token that forms part of a word, i.e. that is part of the concatenation, it goes through the list of annotation rules defined in JapaneseAnnotator.properties and produces the annotations of the rules that match the context of the current token.
The context of the current token can be the features themselves returned by Kuromoji for that token, such as the POS tag or the lemma, or the annotations that were assigned to the previous token in the same word, if applicable. Note that the context of the first token contains NULL as a previous annotation.
The Japanese Annotator creates annotations of the following type:
- Lemma (the lemma of a word is provided as an annotation if available)
- Part-of-speech information
- Morphosyntactic information
- Named-entity information.
Configuration properties
The Japanese Annotator can be configured by two types of properties: annotation and annotationRegex properties.
Annotation properties
The values of the annotation properties for the Japanese Annotator need to follow this syntax:
<feature 1>,<feature 2>,<... feature N>=<annotation 1>-<annotation 2>-<... annotation M>|<remove 1>-<remove 2>-<... remove M>
In other words, a property key is defined by a list of features separated by a comma. The order of the features is irrelevant. A feature can be any of the features provided by Kuromoji, such as POS tags or lemmas, or any of the annotations of the previous token of the same word.
The right-hand side of the property follows the = (equals) character and is split into two parts, separated by a | (vertical line) character. The left part indicates which annotations should be produced if the property key applies to the current token. The right part is a list of annotations that will be removed from the current word if they were previously produced for the same word by a previous rule or for a previous token. Note that the ordering of the properties therefore matters.
The following is an example of two annotation properties:
annotation.33 = 助詞,接続助詞=FW.POS-PARTICLE.POS
annotation.34 = 助動詞,助動詞-ダ,連体形-一般=FW.POS-PARTICLE.POS|COPULA.POS-VB.POS
The first property (33) in the example produces the annotations FW.POS and PARTICLE.POS. The second property (34) produces the same annotations and at the same time removes the annotations COPULA.POS and VB.POS from the current word if they exist.
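To make the rule syntax concrete, the sketch below parses a property value and applies rules in order. It is an illustration only: the names `Rule`, `parse_rule` and `apply_rules` are hypothetical, and the assumption that a rule applies when all of its features are present in the token's context is a reading of the description above, not a confirmed detail of the Japanese Annotator.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Rule:
    features: FrozenSet[str]  # order is irrelevant, so a set is enough
    add: Tuple[str, ...]      # annotations to produce
    remove: Tuple[str, ...]   # annotations to remove from the current word


def parse_rule(value: str) -> Rule:
    """Parse '<features>=<annotations>|<removals>' (the removals part is optional)."""
    lhs, _, rhs = value.partition("=")
    add, _, remove = rhs.partition("|")
    return Rule(
        # Features are split on commas only, since they may contain hyphens.
        features=frozenset(f.strip() for f in lhs.split(",")),
        add=tuple(a for a in add.split("-") if a),
        remove=tuple(r for r in remove.split("-") if r),
    )


def apply_rules(rules: Iterable[Rule], context: Set[str], word_annotations: Set[str]) -> None:
    """Apply the rules in order (assumption: all features must be in the context)."""
    for rule in rules:
        if rule.features <= context:
            word_annotations.update(rule.add)
            word_annotations.difference_update(rule.remove)
```

Because later rules can remove what earlier rules produced, iterating over the rules in their defined order mirrors the point made above that the ordering of the properties matters.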
Annotation Regex properties
The Annotation Regex properties map entire words to annotations, and their values need to follow the syntax below:
<regex>=<annotation 1>-<annotation 2>-<... annotation M>
Note that the regex needs to match the entire word. The Annotation Regex rules are executed before the annotation properties. If a regex matches, its corresponding annotations are assigned to the word, and no other annotations are assigned after that. Therefore, the order of the Annotation Regex properties is relevant.
The following is an example of an Annotation Regex property that matches e-mails and produces two annotations, EMAIL.NER and NN.POS:
annotationRegex.2 = \\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*=EMAIL.NER-NN.POS
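The whole-word matching and first-match behavior can be sketched as follows. The names `REGEX_RULES` and `annotate_by_regex` are hypothetical; the pattern is the e-mail regex from the example with the doubled backslashes of the properties file reduced to single ones, and Python's `re` module stands in for the regex engine used by Teneo, which may differ in details.

```python
import re
from typing import List, Optional, Tuple

# (pattern, annotations) pairs, kept in rule order.
REGEX_RULES: List[Tuple[re.Pattern, List[str]]] = [
    (
        re.compile(r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"),
        ["EMAIL.NER", "NN.POS"],
    ),
]


def annotate_by_regex(word: str) -> Optional[List[str]]:
    """Return the annotations of the first rule matching the entire word, if any."""
    for pattern, annotations in REGEX_RULES:
        if pattern.fullmatch(word):  # the regex must cover the whole word
            return annotations
    return None  # no regex matched; the annotation properties would apply instead
```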
Annotations available in Teneo Studio
The following Part-of-Speech, morphosyntactic and named-entity annotations may be generated according to the default properties by the Japanese Annotator. Note that a token can have multiple annotations. The annotations carry one of the suffixes POS, MST or NER to distinguish their type.
For information related to ANNOT Language Objects available in the Japanese Lexical Resource, please see the lists of POS/MST ANNOT language objects and/or NER ANNOT language objects for Japanese.
Annotations | Type |
---|---|
ADJ.POS | POS |
ADV.POS | POS |
CARDINAL.POS | POS |
CONJ.POS | POS |
COPULA.POS | POS |
COUNTER.POS | POS |
DET.POS | POS |
FW.POS | POS |
INTERJ.POS | POS |
NN.POS | POS |
PARTICLE.POS | POS |
PREFIX.POS | POS |
PREP.POS | POS |
PRON.POS | POS |
PROPER.POS | POS |
SUFFIX.POS | POS |
SYM.POS | POS |
VB.POS | POS |
ALMOST.MST | MST |
ASSUMPTION.MST | MST |
CAUSATIVE.MST | MST |
DASU.MST | MST |
DESIRE.MST | MST |
EXCESS.MST | MST |
FORMAL.MST | MST |
GARU.MST | MST |
GERUND.MST | MST |
HAJIMERU.MST | MST |
IMPERATIVE.MST | MST |
ITERATIVE.MST | MST |
KANERU.MST | MST |
KIRU.MST | MST |
NEGATION.MST | MST |
OERU.MST | MST |
OWARU.MST | MST |
PASSIVE.MST | MST |
PAST.MST | MST |
PROGRESSIVE.MST | MST |
RENYOKEI.MST | MST |
TEAGERU.MST | MST |
TEIKU.MST | MST |
TEITADAKU.MST | MST |
TEKUDASARU.MST | MST |
TEKURERU.MST | MST |
TEMIRU.MST | MST |
TEMORAU.MST | MST |
TEOKU.MST | MST |
TESHIMAU.MST | MST |
TEYARU.MST | MST |
VOLITION.MST | MST |
YAGARU.MST | MST |
EMAIL.NER | NER |
LOCATION.NER | NER |
PERSON.NER | NER |
URL.NER | NER |
Japanese Number Recognizer IP
The Japanese Number Recognizer Input Processor is capable of recognizing the following types of number expressions:
- Arabic numbers
- Formal and colloquial Kanji numbers
- Hiragana numbers
- Numbers with counters not split from the actual numeric expression
- Numbers with factors both larger and smaller than zero
- Decimal numbers
- Fractions.
Numbers identified in the user input are annotated with the number annotation, which carries a variable called numericValue that holds the numeric value of the number found.
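As an illustration of how a Kanji number maps to a numeric value, the sketch below converts unit-based Kanji numerals such as 三千五百二十 to an integer. It is not the Number Recognizer's algorithm: the function name `kanji_to_int` is hypothetical, and the other listed number types (Hiragana numbers, counters, decimals, fractions) are not covered.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def kanji_to_int(text: str) -> int:
    """Convert a unit-based Kanji numeral to its integer value."""
    total = 0  # completed groups, e.g. the 一億 part of 一億二千万
    group = 0  # value accumulated below the next large unit
    digit = 0  # pending digit that still waits for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A unit without a preceding digit counts once, e.g. 十 = 10.
            group += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += (group + digit) * LARGE_UNITS[ch]
            group = digit = 0
        else:
            raise ValueError(f"not a Kanji numeral: {ch!r}")
    return total + group + digit
```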
System Annotation IP
The SystemAnnotation Input Processor performs a simple analysis of the sentence texts to set some annotations. The decision algorithms are configurable through various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion and decideQuote.
This Input Processor works on the sentences passed in but does not modify them.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:
- _EMPTY: the sentence text is empty
- _EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
- _EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
- _QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
- _QT3: the sentence text contains three or more characters in a row of the characters specified with property questionMarkCharacters
- _QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
- _DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
- _BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
- _NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
- _BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).
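A few of these decisions can be sketched as follows. The character sets are assumed example values, not the documented defaults of exclamationMarkCharacters and questionMarkCharacters, the function names are hypothetical, and only the empty, exclamation and question checks are shown.

```python
import re
from typing import Set

# Assumed example values; the real values come from the configuration properties.
EXCLAMATION_MARK_CHARACTERS = "!！"
QUESTION_MARK_CHARACTERS = "?？"


def _has_any(text: str, chars: str) -> bool:
    return any(c in chars for c in text)


def _has_run_of_three(text: str, chars: str) -> bool:
    # Three or more of the configured characters in a row, in any mix.
    return re.search("[" + re.escape(chars) + "]{3,}", text) is not None


def system_annotations(sentence: str) -> Set[str]:
    """Return the subset of system annotations covered by this sketch."""
    annotations: Set[str] = set()
    if sentence == "":
        annotations.add("_EMPTY")
    if _has_any(sentence, EXCLAMATION_MARK_CHARACTERS):
        annotations.add("_EXCLAMATION")
    if _has_run_of_three(sentence, EXCLAMATION_MARK_CHARACTERS):
        annotations.add("_EM3")
    if _has_any(sentence, QUESTION_MARK_CHARACTERS):
        annotations.add("_QUESTION")
    if _has_run_of_three(sentence, QUESTION_MARK_CHARACTERS):
        annotations.add("_QT3")
    return annotations
```

The function only reads the sentence text and returns annotation names, in line with the statement that this Input Processor does not modify the sentences passed in.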
Special System annotations
Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:
- _INIT: indicates session start, i.e. the first input in a dialogue
- _TIMEOUT: indicates the continuation of a previously timed-out session/dialogue.
Several configuration properties are available for the System Annotation Input Processor; please see the details here.
Language Detector IP
The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input, together with a confidence score of the prediction.
The Language Detector IP can predict the following 45 languages (language label in brackets):
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.
A number of regexes are also used by the Input Processor to keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.
The Language Detector provides an annotation when the confidence of the prediction is above the threshold of 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions with a confidence below 0.2, since the Language Detector is mostly accurate when predicting them.
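The thresholding described above can be sketched as a small decision function. The function name `language_annotation` is hypothetical, the annotation name is assumed to be the language label followed by .LANG, and the sketch takes 0.2 as the confidence threshold for the languages that are not always annotated.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.2
# Languages for which an annotation is created regardless of the confidence.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}


def language_annotation(label: str, confidence: float) -> Optional[str]:
    """Return the annotation name for a prediction, or None if it is suppressed."""
    if label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD:
        return f"{label}.LANG"
    return None
```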
Predict IP
The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.
When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the classifier used (i.e., Learn, CLU or LearnFal) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and up to 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
Annotation | Variable | Variable | Variable | Description |
---|---|---|---|---|
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical threshold can be configured in the properties file of the Input Processor.
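The selection logic can be sketched as follows, using the documented defaults (minimum confidence 0.01, factor 0.5 of the top confidence, at most 5 annotations). The function name `select_classes` and the class names in the usage note are hypothetical, and whether the comparisons are strict at the exact boundary values is an assumption.

```python
from typing import Dict, List, Tuple


def select_classes(
    scores: Dict[str, float],
    min_confidence: float = 0.01,
    similarity: float = 0.5,
    max_annotations: int = 5,
) -> List[Tuple[str, float, int]]:
    """Return (class name, confidence, order) for the classes that get annotated."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_confidence = ranked[0][1]
    # The most confident class is always selected; the others must pass both criteria.
    selected = [ranked[0]] + [
        (name, confidence)
        for name, confidence in ranked[1:]
        if confidence > min_confidence and confidence > similarity * top_confidence
    ]
    return [
        (name, confidence, order)
        for order, (name, confidence) in enumerate(selected[:max_annotations])
    ]
```

With scores of 0.7 for BOOK_FLIGHT, 0.4 for CANCEL and 0.3 for GREETING, only the first two are selected, since 0.3 is below 0.5 x 0.7 = 0.35; BOOK_FLIGHT gets order 0 and would also receive the TOP_INTENT annotation.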
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
minConfidenceSimilarityDistance | float | no | 0.5 |
Fraction of the top class's confidence that a class must reach in order to be considered, e.g. if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 will be discarded.
Name | Type | Required | Default |
---|---|---|---|
maxNumberOfAnnotations | int | no | 5 |
Maximum number of class annotations to create for each user input.
Name | Type | Required | Default |
---|---|---|---|
minConfidenceThreshold | float | no | 0.01 |
Minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.
Name | Type | Required | Default |
---|---|---|---|
intent.model.file.name | string (filename) | no | inexistent |
Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.
DateTime Recognizer IP
The DateTime Recognizer Input Processor available in the Japanese input processing chain recognizes and annotates various date and/or time expressions which are then used by Language Objects to support the date and time interpretation.
The two annotations in the below table are created by the DateTime Recognizer Input Processor.
Annotation | Description | Variables | Examples |
---|---|---|---|
DATE.DATETIME | Annotation for collated date expressions | dmy (map), mdy (map), ymd (map), all with keys (int) day_of_month, month and year | "140219", "14.02.19", "3.4.2018", "03.04.2018", "20180403" |
TIME.DATETIME | Annotation for collated time expressions | hour (int), minute (int), second (int), meridiem (string) | "15h30", "11pm", "30 sec" |
To read more about how to use the native understanding and interpretation of date and time expressions in the Teneo Platform, please see here.