# Japanese Input Processors Chain

## Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language.

> The Japanese Input Processors chain implemented in Teneo 5.1.1 is backwards incompatible with the Japanese Teneo Lexical Resource and solutions of former Teneo versions due to different tokenization.

## IP Chain Setup

The following graph displays the setup of the Japanese Input Processors chain; each Input Processor is described further in the following sections.

```mermaid
graph TD
  subgraph ips [ ]
    tokenizer[Japanese Tokenizer] --> annotator[Japanese Annotator]
    annotator[Japanese Annotator] --> number[Japanese Number Recognizer]
    number[Japanese Number Recognizer] --> annotation[System Annotation]
    annotation[System Annotation] --> languagedetect[Language Detector]
    languagedetect[Language Detector] --> predict[Predict]
    predict[Predict] --> datetime[DateTime Recognizer]
  end
    input([User Input]) --User gives input--> tokenizer
    datetime[DateTime Recognizer] --Parsed input--> parsed([To Dialog Processing])
```

## Japanese Simplifier

The Japanese Simplifier is a special kind of processor that is used to normalize the user input by:

+ converting full width Latin letters and Arabic digits into their half width version, and
+ lowercasing the uppercase Latin letters.

This Simplifier is special because it is _not_ run as part of the Input Processors chain, but rather by the Tokenizer when it puts the tokens generated by Kuromoji into a Teneo data structure. Additionally, the Simplifier is also run by the condition parser inside the Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary. 
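The two normalization steps listed above can be illustrated with a minimal Python sketch. This is not the Teneo implementation; the function name `simplify` and the decision to leave every other character (for example the full width space) untouched are assumptions made for illustration.

```python
# Hypothetical sketch of the two Simplifier steps; not the Teneo implementation.

# A full width Latin letter or digit sits exactly 0xFEE0 code points above
# its half width (ASCII) counterpart.
FULLWIDTH_OFFSET = 0xFEE0

# Code point ranges: full width digits, uppercase letters, lowercase letters.
_FULLWIDTH_RANGES = [(0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)]

_TO_HALFWIDTH = {
    cp: cp - FULLWIDTH_OFFSET
    for start, end in _FULLWIDTH_RANGES
    for cp in range(start, end + 1)
}


def simplify(text: str) -> str:
    """Convert full width Latin letters/digits to half width, then lowercase A-Z."""
    half = text.translate(_TO_HALFWIDTH)
    # Only ASCII uppercase letters are lowercased; all other characters are kept.
    return "".join(ch.lower() if "A" <= ch <= "Z" else ch for ch in half)


print(simplify("ＴＥＮＥＯ５"))
```

An explicit translation table is used here instead of a general Unicode normalization form, since the text above only mentions Latin letters and Arabic digits.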

## Input Processors

### Japanese Tokenizer

The Japanese Tokenizer, which performs sentence segmentation and tokenization, runs [Kuromoji](https://www.atilika.org/) (Japanese tokenizer) on raw input strings and then processes the tokens returned by Kuromoji into words and sentences for Teneo. Since the tokenization of Kuromoji is too aggressive for the purpose of Teneo, the processing of the tokens involves a set of hard-coded rules that concatenate some of the tokens into bigger units.

?> Please note the distinction in the terminology:  
A _token_ is an object returned by Kuromoji and an element of the original presentation.  
A _word_ is the unit used in the Teneo Platform and may be the concatenation of several Kuromoji tokens.

In the exceptional case of the Japanese interpunct symbol “·”, the Japanese Tokenizer also splits tokens from Kuromoji.

The concatenation is done by a separate helper class __JapaneseConcatenator__ which is instantiated for each input. The functionality of concatenation and sentence segmentation is exclusively implemented in that class. The concatenation rules are hard-coded and process a sequence from left to right, deciding whether the observed tokens should be concatenated and if a new sentence should be started.

When a **word** object is created, the features from the Kuromoji tokens that form part of that word are passed into the property map. Those features are then retrieved from the property map of each word by the [Japanese Annotator](#japanese-annotator).

The concatenation of the Japanese Tokenizer can be overridden by introducing entries into the solution dictionary (i.e. in a language object) that follow the below pattern:

````properties
DICTEXT_tok_lemma
````

In other words, entries that have the prefix __DICTEXT__ will be considered by the concatenator and the token __tok__ will never be concatenated, i.e. will be a standalone word that will have the lemma annotation __lemma__.
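As a rough illustration of the pattern, the following Python sketch splits such an entry into its token and lemma parts. The helper name `parse_dictext` is hypothetical, and the sketch assumes that the token itself contains no underscore; how the Engine actually parses these entries is not described here.

```python
from typing import Optional, Tuple


def parse_dictext(entry: str) -> Optional[Tuple[str, str]]:
    """Return (tok, lemma) for a DICTEXT_tok_lemma entry, or None otherwise.

    Assumption: tok contains no underscore, so the first underscore after
    the prefix separates tok from lemma.
    """
    prefix, sep, rest = entry.partition("_")
    if prefix != "DICTEXT" or not sep:
        return None
    tok, sep, lemma = rest.partition("_")
    if not sep or not tok or not lemma:
        return None
    return tok, lemma


print(parse_dictext("DICTEXT_東京都_東京都"))
```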

#### Splitting of Tokens with Numbers

As of Teneo 6, in order to cater for date and time recognition in Japanese, the Japanese Tokenizer splits tokens that contain **slashes, dashes, tildes, colons, commas, dots** and **interpuncts** in certain contexts, as detailed below:

+ Splits numbers from slashes and keeps the slash as a separate token if there are at least two slashes in between numbers, e.g. 25/04/2020; a single slash between numbers is not split, as that is a fraction that the Number Recognizer should recognize, e.g. 2/3.
+ Splits numbers from dashes, tildes and interpuncts and keeps them as separate tokens, e.g. 25 - 04 - 2020, 25 \~ 04 \~ 2020, 25 ・ 04 ・ 2020.
+ Splits numbers from dots and keeps the dots as separate tokens, e.g. 25 . 04 . 2020; a single dot between numbers is not split, as in that case the dot could be a decimal marker, e.g. 1.5.
+ Splits numbers from commas and keeps the comma as a separate token, e.g. 2,3; the token is not split when three digits follow the comma, as in that case the comma could be a thousands separator, e.g. 1,000.
+ Splits numbers from colons and keeps the colon as a separate token, e.g. 10 : 30.
+ Splits special number tokens that Kuromoji doesn't split, e.g. ２人 or ２，３.
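The separator rules in the list above can be approximated with the following Python sketch. It is an illustration under assumptions rather than the Tokenizer's code: the function name `split_numeric_token` is hypothetical, only tokens made up entirely of digit groups joined by single separators are handled, and the last rule (special number tokens such as ２人) is left out.

```python
import re
from typing import List

# Separators covered by the rules above (assumption: the interpunct is U+30FB).
SEPARATOR = r"[/\-~・:.,]"
NUMERIC_TOKEN = re.compile(rf"\d+(?:{SEPARATOR}\d+)+")


def split_numeric_token(token: str) -> List[str]:
    """Split digit groups from separators following the listed rules."""
    if not NUMERIC_TOKEN.fullmatch(token):
        return [token]  # not a pure "digits separator digits ..." token

    pieces = re.split(f"({SEPARATOR})", token)
    groups, seps = pieces[0::2], pieces[1::2]

    def should_split(sep: str, next_group: str) -> bool:
        if sep in "-~・:":
            return True                    # always kept as separate tokens
        if sep == "/":
            return seps.count("/") >= 2    # a single slash is a fraction
        if sep == ".":
            return seps.count(".") >= 2    # a single dot is a decimal marker
        # Comma: three following digits suggest a thousands separator.
        return len(next_group) != 3

    result = [groups[0]]
    for sep, group in zip(seps, groups[1:]):
        if should_split(sep, group):
            result.extend([sep, group])
        else:
            result[-1] += sep + group      # keep the number in one piece
    return result


print(split_numeric_token("25/04/2020"))
print(split_numeric_token("2/3"))
```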

### Japanese Annotator

The Japanese Annotator creates annotations related to lemma, Part-of-Speech (POS), morphosyntactic information and named entities. It processes the words one by one; for each Kuromoji token that forms part of a word (i.e., part of the concatenation), it goes through the list of annotation rules and produces the corresponding annotations when a rule matches the context of the current token.

The context of the current token can be the features returned by Kuromoji for that token, such as the POS tag or the lemma, or the annotations that were assigned to the previous token in the same word, if applicable. Note that the context of the first token contains NULL as the previous annotation.
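The rule matching described above can be pictured with a small Python sketch. Everything in it is hypothetical: the `Token` fields, the two example rules and their mapping to `VB.POS` and `PAST.MST` are invented for illustration and do not reproduce the Annotator's actual rule set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Token:
    """Simplified stand-in for the features Kuromoji returns for a token."""
    surface: str
    pos: str
    lemma: str


# A rule is a predicate over (current token, previous annotations) plus
# the annotation name it produces. Both rules below are made-up examples.
Rule = Tuple[Callable[[Token, Optional[Set[str]]], bool], str]

RULES: List[Rule] = [
    (lambda tok, prev: tok.pos == "動詞", "VB.POS"),
    (
        lambda tok, prev: tok.lemma == "た"
        and prev is not None
        and "VB.POS" in prev,
        "PAST.MST",
    ),
]


def annotate_word(tokens: List[Token]) -> Set[str]:
    """Collect the annotations of one word, token by token."""
    annotations: Set[str] = set()
    previous: Optional[Set[str]] = None  # first token has no previous annotation
    for tok in tokens:
        current = {name for matches, name in RULES if matches(tok, previous)}
        annotations |= current
        previous = current
    return annotations


word = [Token("食べ", "動詞", "食べる"), Token("た", "助動詞", "た")]
print(sorted(annotate_word(word)))
```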

The Japanese Annotator creates annotations of the following types:

+ Lemma (the lemma of a word is provided as an annotation if available)
+ Part-of-Speech (POS) information
+ Morphosyntactic information
+ Named-entity information.

#### Annotations in Teneo

The following Part-of-Speech, morphosyntactic and named-entity annotations may be generated by the Japanese Annotator. Note that a token can have multiple annotations. The annotations carry one of the following suffixes to distinguish their type: POS, MST, or NER.

For information related to the ANNOT Language Objects available in the Japanese Lexical Resource, please see the lists of [POS/MST ANNOT language objects](reference/pre-built-knowledge/part-of-speech-and-morphology?id=japanese) and/or [NER ANNOT language objects](reference/pre-built-knowledge/ners?id=japanese) for Japanese.

| Annotations         | Type |
| ------------------- | ---- |
| __ADJ.POS__         | POS  |
| __ADV.POS__         | POS  |
| __CARDINAL.POS__    | POS  |
| __CONJ.POS__        | POS  |
| __COPULA.POS__      | POS  |
| __COUNTER.POS__     | POS  |
| __DET.POS__         | POS  |
| __FW.POS__          | POS  |
| __INTERJ.POS__      | POS  |
| __NN.POS__          | POS  |
| __PARTICLE.POS__    | POS  |
| __PREFIX.POS__      | POS  |
| __PREP.POS__        | POS  |
| __PRON.POS__        | POS  |
| __PROPER.POS__      | POS  |
| __SUFFIX.POS__      | POS  |
| __SYM.POS__         | POS  |
| __VB.POS__          | POS  |
| __ALMOST.MST__      | MST  |
| __ASSUMPTION.MST__  | MST  |
| __CAUSATIVE.MST__   | MST  |
| __DASU.MST__        | MST  |
| __DESIRE.MST__      | MST  |
| __EXCESS.MST__      | MST  |
| __FORMAL.MST__      | MST  |
| __GARU.MST__        | MST  |
| __GERUND.MST__      | MST  |
| __HAJIMERU.MST__    | MST  |
| __IMPERATIVE.MST__  | MST  |
| __ITERATIVE.MST__   | MST  |
| __KANERU.MST__      | MST  |
| __KIRU.MST__        | MST  |
| __NEGATION.MST__    | MST  |
| __OERU.MST__        | MST  |
| __OWARU.MST__       | MST  |
| __PASSIVE.MST__     | MST  |
| __PAST.MST__        | MST  |
| __PROGRESSIVE.MST__ | MST  |
| __RENYOKEI.MST__    | MST  |
| __TEAGERU.MST__     | MST  |
| __TEIKU.MST__       | MST  |
| __TEITADAKU.MST__   | MST  |
| __TEKUDASARU.MST__  | MST  |
| __TEKURERU.MST__    | MST  |
| __TEMIRU.MST__      | MST  |
| __TEMORAU.MST__     | MST  |
| __TEOKU.MST__       | MST  |
| __TESHIMAU.MST__    | MST  |
| __TEYARU.MST__      | MST  |
| __VOLITION.MST__    | MST  |
| __YAGARU.MST__      | MST  |
| __EMAIL.NER__       | NER  |
| __LOCATION.NER__    | NER  |
| __PERSON.NER__      | NER  |
| __URL.NER__         | NER  |

### Japanese Number Recognizer

The Japanese Number Recognizer is capable of recognizing the following types of number expressions:

+ Arabic numbers
+ Formal and colloquial Kanji numbers
+ Hiragana numbers
+ Numbers with counters not split from the actual numeric expression
+ Numbers with factors both larger and smaller than zero
+ Decimal numbers
+ Fractions.

When a number expression is detected in a user input, the following annotation is created with a variable which holds the found number.

| Annotation | Variable     | Description                               |
| ---------- | ------------ | ----------------------------------------- |
| __NUMBER__ | numericValue | Annotation created for identified numbers |
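To give an idea of what recognizing a Kanji number involves, the following Python sketch converts a formal Kanji numeral into an integer. It is a simplified illustration, not the Number Recognizer itself: the function name `kanji_to_int` is hypothetical, and colloquial, Hiragana, decimal, fractional and positional forms (such as 二〇二〇) are not covered.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}


def kanji_to_int(text: str) -> int:
    """Convert a formal Kanji numeral such as 二千二十 into an int."""
    total = 0    # sum of completed 万/億 groups
    section = 0  # value accumulated below the next large unit
    digit = 0    # digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A unit without a leading digit counts once, e.g. 十 = 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += ((section + digit) or 1) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(kanji_to_int("二千二十"))
```

In Teneo, the resulting value would be the kind of number stored in the `numericValue` variable of the __NUMBER__ annotation.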

### System Annotation

Teneo bundles two default collections of annotations in all language configurations: standard annotations added by the System Annotation Input Processor and [special system annotations](#special-system-annotations) added by the Engine. The System Annotation Input Processor performs a simple analysis of the sentence text and may generate the standard annotations listed below.

| Annotation       | Description                                                  |
| ---------------- | ------------------------------------------------------------ |
| **\_BINARY**      | The input consists of only 0s and 1s                         |
| **\_BRACKETPAIR** | At least one matching pair of brackets appears in the input; for example, **( )**, **[ ]**, **{ }** |
| **\_EXCLAMATION** | At least one exclamation mark (**!**) appears in the input   |
| **\_EM3**         | Three (or more) exclamation marks (**!!!**) appear in a row in the input |
| **\_EMPTY**       | The input contains no text / the sentence text is empty      |
| **\_NONSENSE**    | The input contains nonsense text, e.g., '*asdf*', '*wgwwgwg*', '*xxxxxx*' |
| **\_QUESTION**    | At least one question mark (**?**) appears in the input      |
| **\_QT3**         | Three (or more) question marks (**???**) appear in a row in the input |
| **\_QUOTE**       | At least one single quotation mark (**'**) appears in the input |
| **\_DBLQUOTE**    | At least one quotation mark (**"**) appears in the input     |
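Most of the checks in the table are simple string tests, as the following Python sketch suggests. It is only an approximation under assumptions: the function name `system_annotations` is hypothetical, whitespace-only input is treated as empty, and **\_NONSENSE** is omitted because its detection logic is not described here.

```python
import re
from typing import Set


def system_annotations(text: str) -> Set[str]:
    """Return the standard annotations that simple text checks would produce."""
    found: Set[str] = set()
    stripped = text.strip()

    if not stripped:
        found.add("_EMPTY")          # assumption: whitespace-only counts as empty
    if re.fullmatch(r"[01]+", stripped):
        found.add("_BINARY")
    if re.search(r"\(.*\)|\[.*\]|\{.*\}", text):
        found.add("_BRACKETPAIR")    # an opening bracket followed by its closer
    if "!" in text:
        found.add("_EXCLAMATION")
    if "!!!" in text:
        found.add("_EM3")            # three or more in a row contain "!!!"
    if "?" in text:
        found.add("_QUESTION")
    if "???" in text:
        found.add("_QT3")
    if "'" in text:
        found.add("_QUOTE")
    if '"' in text:
        found.add("_DBLQUOTE")
    return found


print(sorted(system_annotations("what???")))
```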

#### Special System Annotations

The following two special annotations are set by the Teneo Engine. They are not related to individual inputs but rather to whole dialogues, and depend on the session state.

| Annotation   | Description                                                  |
| ------------ | ------------------------------------------------------------ |
| **\_INIT**    | Indicates session start, i.e., the first input in a dialogue |
| **\_TIMEOUT** | Indicates the continuation of a previously timed-out session/dialogue |

### Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation to the input together with the confidence score of the prediction.

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

> Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID\_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR\_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

_Serbian, Bosnian and Croatian are treated as one language under the label SR\_HR, and Indonesian and Malay are treated as one language under the label ID\_MS._

A number of regexes are also used by the Input Processor to keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2) since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
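The threshold rule can be summarized in a few lines of Python. This is a sketch of the decision only; the function name `should_annotate` is hypothetical, and the labels are those given in brackets in the language list above.

```python
# Labels of the languages for which an annotation is always created.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}

CONFIDENCE_THRESHOLD = 0.2


def should_annotate(label: str, confidence: float) -> bool:
    """Decide whether a language annotation is created for a prediction."""
    if label in ALWAYS_ANNOTATED:
        return True
    return confidence > CONFIDENCE_THRESHOLD


print(should_annotate("JA", 0.05), should_annotate("EN", 0.05))
```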

### Predict

The __Predict__ Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, [deferred intent classification](reference/conceptual-overviews/from-request-to-response?id=deferred-intent-classification) is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:

+ the confidence is above the minimum confidence (defaults to 0.01)
+ the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme __\<CLASS\_NAME>.INTENT__ is created. It carries the model's confidence in the class, an annotation variable specifying the classifier used (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and up to 4 for the selected class with the lowest confidence score).  
A special annotation __\<CLASS\_NAME>.TOP\_INTENT__ is created for the class with the highest confidence score.

| Annotation                   | Variables                     | Description                                                  |
| ---------------------------- | ----------------------------- | ------------------------------------------------------------ |
| **\<CLASS_NAME>.TOP_INTENT** | classifier, confidence        | Annotation created for the class with the highest confidence score |
| **\<CLASS_NAME>.INTENT**     | classifier, confidence, Order | Annotation given to each selected class with a maximum of five top classes |

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
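The selection criteria can be sketched in Python as follows. This is an illustration of the described rules, not the Predict implementation: the function name `select_intents` and the dictionary layout of the result are assumptions, and neither the classifier variable nor deferred intent classification is modeled.

```python
from typing import Dict, List


def select_intents(
    scores: Dict[str, float],
    min_confidence: float = 0.01,
    ratio: float = 0.5,
    max_annotations: int = 5,
) -> List[dict]:
    """Pick the classes that would receive a <CLASS_NAME>.INTENT annotation."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []

    top_name, top_confidence = ranked[0]
    # The most confident class is always selected; the others must pass
    # both the minimum confidence and the relative-to-top criteria.
    selected = [(top_name, top_confidence)] + [
        (name, conf)
        for name, conf in ranked[1:]
        if conf > min_confidence and conf > ratio * top_confidence
    ]

    return [
        {"annotation": f"{name}.INTENT", "confidence": conf, "Order": order}
        for order, (name, conf) in enumerate(selected[:max_annotations])
    ]


scores = {"BOOK": 0.6, "CANCEL": 0.35, "GREET": 0.2, "OTHER": 0.005}
for annotation in select_intents(scores):
    print(annotation)
```

In this example, only `BOOK` and `CANCEL` are selected: `GREET` is below half of the top confidence and `OTHER` is below the minimum confidence.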

### Date Time Recognizer

The [Date Time Recognizer](reference/nlp-capabilities/nl-analyzer-datetime) available in the Japanese input processing chain recognizes and annotates various date and/or time expressions which are then used by language objects to support the date and time interpretation; the following annotations \- with the listed variables \- are created by the Date Time Recognizer.

| Annotation        | Variables                                                    | Description                              | Examples                                                   |
| ----------------- | ------------------------------------------------------------ | ---------------------------------------- | ---------------------------------------------------------- |
| __DATE.DATETIME__ | __dmy__ (map), __mdy__ (map), __ymd__ (map), all with keys (int) __day_of_month__, __month__ and __year__ | Annotation for collated date expressions | "140219", "14.02.19", "3.4.2018", "03.04.2018", "20180403" |
| __TIME.DATETIME__ | __hour__ (int), __minute__ (int), __second__ (int), __meridiem__ (string) | Annotation for collated time expressions | "15h30", "11pm", "30 sec"                                  |

To read more about how to use the native understanding and interpretation of date and time expressions in the Teneo Platform, please see [here](reference/pre-built-knowledge/date-and-time).

