# Standard Input Processors Chain

## Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization of inputs or spelling correction. Each language supported by the Teneo Platform has a chain of Input Processors which know how to process that particular language. The Standard Input Processor chain offers support for a large number of the supported Teneo languages.

### Supported Languages

Currently, the languages listed below are supported by the Standard Input Processor chain.

| Supported languages |           |            |                 |                            |                     |                   |
| ------------------- | --------- | ---------- | --------------- | -------------------------- | ------------------- | ----------------- |
| Afrikaans           | Czech     | Georgian   | Kinyarwanda     | Nepali                     | Sango               | Tigrinya          |
| Albanian            | Danish    | German     | Kirundi (Rundi) | Norwegian (Nynorsk/Bokmål) | Scottish Gaelic     | Tsonga            |
| Amharic             | Dutch     | Greek      | Kyrgyz          | Odia                       | Serbian             | Tswana (Setswana) |
| Armenian            | English   | Gujarati   | Latvian         | Oromo                      | Shona               | Turkmen           |
| Azerbaijani         | Esperanto | Hindi      | Lithuanian      | Papiamento                 | Sinhala             | Ukrainian         |
| Basque              | Estonian  | Hungarian  | Luxembourgish   | Polish                     | Slovak              | Uzbek             |
| Belarusian          | Ewe       | Icelandic  | Macedonian      | Portuguese                 | Slovene             | Vietnamese        |
| Bengali/Bangla      | Faroese   | Igbo       | Malagasy        | Quechuan (Quechua)         | Somali              | Welsh             |
| Bosnian             | Finnish*  | Indonesian | Malay           | Romanian                   | Spanish             | Yoruba            |
| Bulgarian           | French    | Irish      | Maltese         | Romansh                    | Swahili (Kiswahili) | Zulu (isiZulu)    |
| Catalan             | Frisian   | Italian    | Marathi         | Russian                    | Swazi               |                   |
| Croatian            | Galician  | Kazakh     | Mongolian       | Sámi                       | Swedish             |                   |

__\*__ The Input Processor chain for the Finnish language also contains the [Finnish Splitting Input Processor](#finnish-input-processors-chain) in addition to the IPs in the Standard Input Processor chain.

## IP Chain Setup

The following graph displays the setup of the Standard Input Processors chain; each Input Processor is described further in the following sections.

`````mermaid
graph TD
  subgraph ips [ ]
    split[Standard Splitting]
    autocorrect[Standard Auto Correction]

    predict[Predict]
    similarity[Standard Similarity Match Correction]
    annotation[System Annotation]
    number[Basic Number Recognizer]
    datetime[DateTime Recognizer *]
    languagedetect[Language Detector]
    pos[POS Tagger / Morphological Analyzer *]
    ner[Named Entity Recognizer *]

  end
  input([User Input])
  subgraph settings [Input Processor Configuration]
    abbr[/Abbreviations/]
    correct[/Autocorrections/]
  end
  subgraph solution [Solution]
    soln[/Solution Dictionary/]
  end 
  parsed([To Dialog Processing])

  split --> autocorrect
  autocorrect --> predict

  predict --> similarity
  similarity --> annotation
  annotation --> number
  number --> datetime
  datetime --> languagedetect
  languagedetect --> pos
  pos --> ner
  input -->|User Gives Input| split
  abbr --> split
  correct --> autocorrect
  soln --> similarity
  ner --Parsed Input--> parsed

  classDef ip_optional stroke-dasharray:5,5;
  class datetime,pos,ner ip_optional;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;
`````

__\*__  _The Input Processors marked with a star (\*) in the above graph are currently only available as NL Analyzers for a selection of the languages; for more information on available languages, please see the specific sections._

## Standard Simplifier

The Standard Simplifier is a separate processing unit which is __not__ an input processor and which provides a method to normalize some text, usually \- but not necessarily \- a word. Here, "_normalization_" means removal of text properties that are semantically insignificant, like conversion to lower case (considering the configured language locale), removal of some accents and normalization of Unicode combining characters. By default, the Input Processors call the Simplifier when they generate a new word item. Furthermore, the Simplifier is called by the language condition parser of the Teneo Engine when it stores a language condition word (i.e., TLML syntax word) in the solution dictionary. 

+ [Simplification](reference/conceptual-overviews/from-request-to-response?id=simplification)
+ The Simplifier decomposes and normalizes characters, for example by lower-casing them and normalizing Unicode combining characters.
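
The kind of normalization described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Engine's Simplifier: the function name `simplify` and the `strip_accents` flag are hypothetical, all combining marks are removed (the real Simplifier removes only some accents), and `str.lower()` is not locale-aware, whereas the Simplifier considers the configured language locale.

```python
import unicodedata


def simplify(text: str, strip_accents: bool = True) -> str:
    """Hypothetical stand-in for the Simplifier; not the Engine's actual rules."""
    # Lower-case, then decompose so base letters and combining marks
    # become separate code points.
    decomposed = unicodedata.normalize("NFD", text.lower())
    if strip_accents:
        # Assumption: drop every combining mark (Unicode category Mn).
        decomposed = "".join(
            ch for ch in decomposed if unicodedata.category(ch) != "Mn"
        )
    # Recompose what remains into a canonical composed form.
    return unicodedata.normalize("NFC", decomposed)
```

With this sketch, a precomposed "é" and an "e" followed by a combining acute accent end up as the same simplified text, which is the point of normalizing before words are compared.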

## Input Processors

### Standard Splitting

The Standard Splitting Input Processor splits the user input text into sentences and words; this Input Processor generates one or more sentences with zero or more words. The generated WordData objects contain the original and the simplified form of the word; the final word form is initialized with the simplified word form.

+ [Division into sentences](reference/conceptual-overviews/from-request-to-response?id=division-into-sentences)
+ [Division into words](reference/conceptual-overviews/from-request-to-response?id=division-into-words)

### Standard Auto Correction

The Standard Auto Correction Input Processor applies spelling correction based on a list of auto-correction mappings. The corrections are applied to the finalized form of the sentence words, and the Input Processor works on the existing sentences and words passed in. It may modify the final form of the word, but the count of sentences and words is not modified.

+ [Automatic spelling correction](reference/conceptual-overviews/from-request-to-response?id=automatic-spelling-correction)
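
A mapping-based correction of this kind might look roughly like the sketch below. The `Word` class is a hypothetical stand-in for the WordData objects mentioned earlier, and `auto_correct` and the `mappings` dictionary are illustrative names, not part of the Teneo API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a WordData object."""
    original: str
    simplified: str
    final: str


def auto_correct(sentences, mappings):
    """Apply auto-correction mappings to the final form of each word.

    Only the final form is touched; no sentences or words are added or removed.
    """
    for sentence in sentences:
        for word in sentence:
            word.final = mappings.get(word.final, word.final)
    return sentences
```

The original and simplified forms are left alone in this sketch, so the uncorrected text stays available to later processing steps.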

### Predict

The __Predict__ Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, [deferred intent classification](reference/conceptual-overviews/from-request-to-response?id=deferred-intent-classification) is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, a confidence score is calculated for each class based on the model. Annotations are then created for the most confident class and for each other class that meets the following criteria:

+ the confidence is above the minimum confidence (defaults to 0.01)
+ the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme __\<CLASS\_NAME>.INTENT__ is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).  
A special annotation __\<CLASS\_NAME>.TOP\_INTENT__ is created for the class with the highest confidence score.

| Annotation                    | Variable   | Variable   | Variable | Description                                                  |
| ----------------------------- | ---------- | ---------- | -------- | ------------------------------------------------------------ |
| **<CLASS\_NAME>.TOP\_INTENT** | classifier | confidence |          | Annotation created for the class with the highest confidence score |
| **<CLASS\_NAME>.INTENT**      | classifier | confidence | Order    | Annotation given to each selected class with a maximum of five top classes |

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
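
The selection rules can be sketched as follows. This is a minimal illustration, not the Engine's implementation: the function name `select_classes`, its parameters and the dictionary-based annotation shape are assumptions, as is the reading that the limit of five applies to the `.INTENT` annotations (Order 0 to 4), with `.TOP_INTENT` created in addition.

```python
def select_classes(scores, classifier="Learn", min_confidence=0.01,
                   ratio=0.5, max_classes=5):
    """Return annotation dicts for the classes that pass the criteria."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_conf = ranked[0]
    # The most confident class is always selected.
    selected = [(top_name, top_conf)]
    for name, conf in ranked[1:]:
        # Both criteria must hold for any other class.
        if conf > min_confidence and conf > ratio * top_conf:
            selected.append((name, conf))
    selected = selected[:max_classes]

    annotations = [{"name": f"{top_name}.TOP_INTENT",
                    "classifier": classifier, "confidence": top_conf}]
    for order, (name, conf) in enumerate(selected):
        annotations.append({"name": f"{name}.INTENT", "classifier": classifier,
                            "confidence": conf, "Order": order})
    return annotations
```

For example, with scores of 0.6, 0.35 and 0.05, the third class is dropped because 0.05 is not higher than half of the top confidence (0.3).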

### Standard Similarity Match Correction

The Standard Similarity Match Correction Input Processor applies spelling correction based on a similarity matching of sentence words against words provided by a dictionary. The corrections are applied to the final form of the sentence words.  
This Input Processor works on the existing sentences and words passed in and may modify the final form of a word; the count of sentences and words is not modified.

The dictionary used by this Input Processor is the _solution dictionary_ which is solution specific and generated on (re)load of the Engine. It is formed of the bare words in TLML syntaxes in the solution and in referenced libraries. Adding a word to the dictionary is done by adding the word to the syntax of a new or existing Language Object, Entity, Trigger or Transition (in case of the latter two, as TLML Syntax).



+ [Spelling tolerance](reference/conceptual-overviews/from-request-to-response?id=spelling-tolerance)
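
The idea of similarity-based correction against a dictionary can be illustrated with Python's standard `difflib` module. This is a sketch only: the similarity measure and threshold used by the Teneo Engine are not described here, so `difflib.get_close_matches` and the `cutoff` value of 0.8 are swapped-in assumptions, and `similarity_correct` is a hypothetical name.

```python
import difflib


def similarity_correct(final_words, solution_dictionary, cutoff=0.8):
    """Replace each word by its closest dictionary word, if one is close enough."""
    corrected = []
    for word in final_words:
        if word in solution_dictionary:
            # Already a known word: nothing to correct.
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(
            word, solution_dictionary, n=1, cutoff=cutoff
        )
        # Keep the word unchanged when nothing is similar enough.
        corrected.append(matches[0] if matches else word)
    return corrected
```

As in the description above, the number of words never changes; a word is either kept or replaced one-for-one.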



### System Annotation

Teneo bundles two default collections of annotations in all language configurations: standard annotations added by the System Annotation Input Processor and [special system annotations](#special-system-annotations) added by the Engine; the System Annotation Input Processor performs simple analysis of the sentence texts and may generate the standard annotations listed below.

| Annotation        | Description                                                  |
| ----------------- | ------------------------------------------------------------ |
| **\_BINARY**      | The input consists of only 0s and 1s                         |
| **\_BRACKETPAIR** | At least one matching pair of brackets appears in the input; possible bracket types: **\( )**, **\[ ]**, **\{ }** |
| **\_EXCLAMATION** | At least one exclamation mark (**!**) appears in the input   |
| **\_EM3**         | Three (or more) exclamation marks (**!!!**) appear in a row in the input |
| **\_EMPTY**       | The input contains no text / the sentence text is empty      |
| **\_NONSENSE**    | The input contains nonsense text, e.g., '_asdf_', '_wgwwgwg_', '_xxxxxx_' |
| **\_QUESTION**    | At least one question mark (**?**) appears in the input      |
| **\_QT3**         | Three (or more) question marks (**???**) appear in a row in the input |
| **\_QUOTE**       | At least one single quotation mark (**\'**) appears in the input |
| **\_DBLQUOTE**    | At least one quotation mark (**\"**) appears in the input    |
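
Most of the checks in the table are simple enough to sketch directly. The function below is an illustration, not the Engine's code: the name `system_annotations` is hypothetical, `_BINARY` is assumed to exclude whitespace, and `_NONSENSE` is left out because its detection criteria are not documented here.

```python
import re


def system_annotations(sentence: str) -> set:
    """Return the standard annotation names that apply to a sentence text."""
    text = sentence.strip()
    if not text:
        return {"_EMPTY"}
    found = set()
    if re.fullmatch(r"[01]+", text):
        found.add("_BINARY")
    # An opening bracket followed later by the matching closing bracket.
    if re.search(r"\(.*\)|\[.*\]|\{.*\}", text):
        found.add("_BRACKETPAIR")
    if "!" in text:
        found.add("_EXCLAMATION")
    if "!!!" in text:
        found.add("_EM3")
    if "?" in text:
        found.add("_QUESTION")
    if "???" in text:
        found.add("_QT3")
    if "'" in text:
        found.add("_QUOTE")
    if '"' in text:
        found.add("_DBLQUOTE")
    return found
```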

#### Special System Annotations

The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.

| Annotation    | Description                                                  |
| ------------- | ------------------------------------------------------------ |
| **\_INIT**    | Indicates session start, i.e., the first input in a dialogue |
| **\_TIMEOUT** | Indicates the continuation of a previously timed-out session/dialogue |

### Basic Number Recognizer

The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.  
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored. 

| Annotation | Variable     | Description                                                  |
| ---------- | ------------ | ------------------------------------------------------------ |
| __NUMBER__ | numericValue | Annotation created for identified Arabic numbers in user inputs |

For the annotation and its numeric value variable to be added, a _number_ in the user input must meet the following syntax:

​	_It must match the regular expression:_
````
[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+
````
​	_It must be parseable by Java's BigDecimal to ensure it is a number_

The above syntax provides the following guarantees:

+ The sign is not included in the annotated token
+ The __numericValue__ variable contains a BigDecimal representation of the number.

_In the above example regex, the dot is used as the decimal marker and the comma as the thousands separator; as described earlier, this configuration is language dependent and therefore varies depending on the selected solution language._
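
The two checks can be tried out with a short sketch. It is an approximation under stated assumptions: Python's `decimal.Decimal` stands in for Java's BigDecimal, the function name `numeric_value` is hypothetical, and the thousands separator is stripped before parsing because it is described above as a character to be ignored.

```python
import re
from decimal import Decimal, InvalidOperation

# The regular expression from the text: dot as decimal marker,
# comma as thousands separator.
NUMBER_PATTERN = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def numeric_value(token: str):
    """Return the Decimal value of a token, or None if it is not a number."""
    if not NUMBER_PATTERN.fullmatch(token):
        return None
    try:
        # Assumption: separators are dropped before the value is parsed.
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None
```

A token such as `-3` is rejected as a whole, which mirrors the guarantee that the sign is not part of the annotated token.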

### Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with a confidence score of the prediction.

| Annotation                                  | Variable   | Description                                   |
| ------------------------------------------- | ---------- | --------------------------------------------- |
| **\<language label>.LANG**, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language |

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

> Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID\_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR\_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

_Serbian, Bosnian and Croatian are treated as one language under the label SR\_HR, and Indonesian and Malay are treated as one language under the label ID\_MS_

The Input Processor also uses a number of regexes, which prevent the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector creates an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
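
The threshold rule can be summarized in a small sketch. The language labels are taken from the list above; the function name `language_annotation` and the dictionary shape of the result are assumptions made for illustration.

```python
# Labels of the languages for which an annotation is always created.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}
THRESHOLD = 0.2


def language_annotation(label: str, confidence: float):
    """Return the annotation for a predicted language, or None if suppressed."""
    if label in ALWAYS_ANNOTATED or confidence > THRESHOLD:
        return {"name": f"{label}.LANG", "Confidence": confidence}
    return None
```

Under this reading, a Danish prediction with a confidence of 0.15 yields no annotation, while a Japanese prediction with the same confidence does.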

## Finnish Input Processors Chain

The input processing chain for the Finnish language shares its Input Processors with the Standard Input Processor chain, but additionally includes the Finnish Splitting Input Processor, which sits between the Standard Splitting and Standard Auto Correction Input Processors, as displayed in the graph below.

```mermaid
graph TD
  subgraph ips [ ]
    split[Standard Splitting] --> finnishsplit
    finnishsplit[Finnish Splitting] --> autocorrect
    autocorrect[Standard AutoCorrection] --> predict

    predict[Predict] --> similarity
    similarity[Standard Similarity Match Correction] --> annotation
    annotation[System Annotation] --> number
    number[Basic Number Recognizer] --> datetime
    datetime[DateTime Recognizer*] --> languagedetect
  end
  input([User Input]) -->|User Gives Input| split
  subgraph settings [Input Processor Configuration]
    abbr[/Abbreviations/] --> split
    correct[/Autocorrections/] --> autocorrect
  end
  subgraph solution [Solution]
    soln[/Solution Dictionary/] --> similarity
  end
  languagedetect[Language Detector] -->|Parsed Input| parsed([To Dialog Processing])

  classDef ip_optional stroke-dasharray:5,5;
  class datetime ip_optional;
  classDef external fill:#00000000,stroke-dasharray:5,5;
  class solution,settings external;
```

__\*__ _The DateTime Recognizer (in the above graph) is also available in the Finnish Input Processors chain, but it is currently **not** supported by the [Approach in the Teneo Platform](reference/pre-built-knowledge/date-and-time) for understanding and interpretation of date and time expressions._

### Finnish Splitting

The Finnish Splitting Input Processor splits off suffixes from the existing sentence words passed in, using configurable word lists in its algorithm. It may modify an existing word (it is set to the word stem) and add one or more words after it (the suffixes split off). These added words all have the same original word form and begin index as the modified word. Words shorter than five characters or contained in the no-cut list will not be split. The count of sentences is not modified.

The suffixes are grouped into five lists and are searched for and split off in the order listed below:

+ clitic
+ participe
+ poss
+ cases
+ comparison.
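
The ordered suffix search can be sketched as below. This is a rough illustration with several assumptions: the suffix entries are placeholders rather than the configured word lists, at most one suffix per list is split off, the length check is applied to the incoming word only, and the split-off suffixes are emitted in their surface order. The names `SUFFIX_LISTS` and `split_suffixes` are hypothetical.

```python
# Placeholder entries for illustration; the real lists are configurable.
SUFFIX_LISTS = [
    ("clitic", ["kin", "ko", "han"]),
    ("participe", []),
    ("poss", ["ni", "si"]),
    ("cases", ["ssa", "lla"]),
    ("comparison", []),
]


def split_suffixes(word, suffix_lists=SUFFIX_LISTS, no_cut=frozenset(),
                   min_length=5):
    """Return [stem, suffix, ...]; the word is returned unsplit if exempt."""
    if len(word) < min_length or word in no_cut:
        return [word]
    stem, split_off = word, []
    # Search the lists in the documented order, stripping from the end.
    for _name, suffixes in suffix_lists:
        for suffix in sorted(suffixes, key=len, reverse=True):
            if stem.endswith(suffix) and len(stem) > len(suffix):
                stem = stem[:-len(suffix)]
                split_off.insert(0, suffix)
                break
    return [stem] + split_off
```

With these placeholder lists, `talossanikin` is reduced to the stem `talo` followed by `ssa`, `ni` and `kin`, while the four-character word `talo` is left untouched.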

