Standard Input Processors Chain
Introduction
An Input Processor (IP) pre-processes user inputs so that the Teneo Engine can work on them, for example by normalizing and tokenizing the input or by correcting spelling. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language. The Standard Input Processor chain supports a large number of these languages.
Supported languages
The Standard Input Processor chain currently supports the languages listed below.
Supported languages | ||||||
---|---|---|---|---|---|---|
Afrikaans | Czech | Georgian | Kinyarwanda | Nepali | Sango | Tigrinya |
Albanian | Danish | German | Kirundi (Rundi) | Norwegian (Nynorsk/Bokmål) | Scottish Gaelic | Tsonga |
Amharic | Dutch | Greek | Kyrgyz | Odia | Serbian | Tswana (Setswana) |
Armenian | English | Gujarati | Latvian | Oromo | Shona | Turkmen |
Azerbaijani | Esperanto | Hindi | Lithuanian | Papiamento | Sinhala | Ukrainian |
Basque | Estonian | Hungarian | Luxembourgish | Polish | Slovak | Uzbek |
Belarusian | Ewe | Icelandic | Macedonian | Portuguese | Slovene | Vietnamese |
Bengali/Bangla | Faroese | Igbo | Malagasy | Quechuan (Quechua) | Somali | Welsh |
Bosnian | Finnish* | Indonesian | Malay | Romanian | Spanish | Yoruba |
Bulgarian | French | Irish | Maltese | Romansh | Swahili (Kiswahili) | Zulu (isiZulu) |
Catalan | Frisian | Italian | Marathi | Russian | Swazi | |
Croatian | Galician | Kazakh | Mongolian | Sámi | Swedish |
* The Input Processor chain for the Finnish language contains the Finnish Splitting Input Processor in addition to the IPs of the Standard Input Processor chain.
Input Processors Chain setup
The following graph displays the default setup of the Standard Input Processors chain:
* The Input Processors marked with a star (*) in the above graph are currently only available as NL Analyzers for a selection of the languages; for more information on available languages, please see the specific sections.
The common, default Input Processors are listed below with a short description of each IP's functionality; the following sections go into further detail.
- The Standard Splitting IP divides the user input text into sentences and words, considering abbreviations that should not be split.
- The Standard Auto Correction IP applies spelling correction to the existing words, based on a fixed list of auto-correction mappings.
- The Predict IP classifies user inputs based on a machine learning model trained in Teneo Learn and annotates the user input with the predicted top intent classes and a confidence score.
- The Standard Similarity Match Correction IP applies spelling correction to the existing words, based on similarity match to the words in the solution dictionary.
- The System Annotation IP sets a number of annotations based on properties of the user input text.
- The Basic Number Recognizer IP identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with the NUMBER annotation and associates a variable called numericValue with this annotation.
- The Language Detector IP identifies the language of the input sentence provided, annotates it with the predicted language and associates a confidence score with the prediction.
Property reference
Properties can be referenced by other properties using the schema:
```properties
${<property name>}
```
The expression is replaced by the value of the property, and the characters ${ and } are removed. This can be applied to property values.
The Java system properties can be referenced by the expression:
```properties
${systemProperties.<system property name>}
```
If the web app Controller module is used, the servlet context init parameters (defined in the <context-param> element of the web.xml deployment descriptor file) and the servlet configuration init parameters (defined in the <servlet> element of web.xml) can be referenced by the expressions:
```properties
${servletContextParameters.<parameter name>}
```
```properties
${servletConfigParameters.<parameter name>}
```
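The substitution described above can be illustrated with a small Python sketch. It is not the Teneo implementation: the helper `resolve`, the property `base.dir`, the path `/opt/teneo` and the repeated (nested) resolution are all assumptions made for illustration, and servlet parameters are left out.

```python
import re

# Matches ${name}; the name may contain dots, e.g. systemProperties.user.home
_REFERENCE = re.compile(r"\$\{([^${}]+)\}")

def resolve(value, properties, system_properties=None, max_depth=10):
    """Replace ${...} references in value (hypothetical helper)."""
    system_properties = system_properties or {}
    prefix = "systemProperties."

    def lookup(match):
        name = match.group(1)
        if name.startswith(prefix):
            return system_properties.get(name[len(prefix):], match.group(0))
        # Assumption: unknown references are left untouched.
        return properties.get(name, match.group(0))

    # Bounded loop so that circular references cannot run forever.
    for _ in range(max_depth):
        resolved = _REFERENCE.sub(lookup, value)
        if resolved == value:
            break
        value = resolved
    return value

# Hypothetical configuration: one property referencing another.
props = {
    "base.dir": "/opt/teneo",
    "properties.file.path": "${base.dir}/ip-config",
}
print(resolve(props["properties.file.path"], props))
```

In this sketch an unresolvable reference is kept verbatim rather than raising an error; how the real property loader treats that case is not stated in this document.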
General properties
The following properties are generally available:
Name | Value |
---|---|
properties.file.path | The absolute path of the folder containing additional configuration files for the input processors. |
Standard Simplifier
The Standard Simplifier is a simplifier implementation with support for configurable character decomposition and normalization, as well as character mapping.
It executes the following processing steps:
- Conversion to lower case, considering the configured language locale.
- Optional compatibility simplification: this is Unicode compatibility decomposition (like mapping ² to 2, etc.), with optional exceptions defined by property excludeFromCompatibilitySimplify. This step is disabled by default, see compatibilitySimplify.
- Optional canonical simplification: Unicode canonical decomposition is applied, then by default all combining characters are deleted (exceptions can be given with the property excludeFromCanonicalSimplify; these letter/combining-character combinations will be left untouched).
- Conversion to Unicode composed form.
- Optional simplification mapping: the character/substring replacements specified by the properties simplificationMapping.* are applied. No mappings are set by default.
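The processing steps above can be approximated in Python with the standard `unicodedata` module. This is a rough sketch under assumptions, not the Standard Simplifier itself: the function name `simplify` and its parameters are hypothetical stand-ins for the configuration properties, and Python's `str.lower()` does not apply locale-specific casing rules.

```python
import unicodedata

def simplify(text, canonical=True, compatibility=False,
             exclude_canonical="", exclude_compat="", mappings=None):
    """Approximate the simplifier steps (hypothetical helper)."""
    # Step 1: lower-casing (locale-specific rules are not modelled here).
    text = text.lower()
    # Step 2: optional compatibility decomposition, skipping excluded characters.
    if compatibility:
        text = "".join(
            c if c in exclude_compat else unicodedata.normalize("NFKD", c)
            for c in text
        )
    # Step 3: optional canonical simplification: decompose each character
    # and drop its non-spacing marks, unless the character is excluded.
    if canonical:
        pieces = []
        for c in unicodedata.normalize("NFC", text):
            if c in exclude_canonical:
                pieces.append(c)
                continue
            decomposed = unicodedata.normalize("NFD", c)
            pieces.append("".join(
                ch for ch in decomposed if unicodedata.category(ch) != "Mn"
            ))
        text = "".join(pieces)
    # Step 4: conversion to Unicode composed form.
    text = unicodedata.normalize("NFC", text)
    # Step 5: optional custom mappings, applied last.
    for source, replacement in (mappings or {}).items():
        text = text.replace(source, replacement)
    return text

print(simplify("Ärger café"))
print(simplify("Ärger", exclude_canonical="ä", mappings={"ä": "ae"}))
```

Because the custom mappings run last, a mapping for ä only has an effect when ä is excluded from canonical simplification; otherwise the character has already been reduced to a by the time the mapping is applied.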
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
canonicalSimplify | true/false | no | true |
canonicalSimplify enables/disables simplification based on canonical decomposition of Unicode characters (see Unicode normalization forms for more information). An exception list can be defined in excludeFromCanonicalSimplify.
If enabled:
- Canonical decomposition will be applied first; this means accented characters will be decomposed into the base letter and combining marks (non-spacing mark) for the accent(s).
- In a second step, all non-spacing marks are deleted, i.e. á will become a, etc.
- Finally, canonical composition is applied.
Name | Type | Required | Default |
---|---|---|---|
excludeFromCanonicalSimplify | string | no | empty |
All characters in the string given here will be excluded from the canonical simplification defined above. More precisely, for the character combinations that result from step one for these characters, step two is skipped, so their combining marks are kept.
Name | Type | Required | Default |
---|---|---|---|
compatibilitySimplify | true/false | no | false |
compatibilitySimplify enables/disables simplification based on compatibility decomposition of Unicode characters (see Unicode normalization forms for more information). For example, ⁵ will become 5.
Name | Type | Required | Default |
---|---|---|---|
excludeFromCompatibilitySimplify | string | no | empty |
All characters in the string given here will be excluded from the compatibility simplification as defined above.
Name | Type | Required | Default |
---|---|---|---|
simplificationMapping.* | Format: simplificationMapping.<n> = <letter(s)>=<replacement>, where <n> is a number that must be unique within the simplification mappings of one file, <letter(s)> is the string of letter(s) to be replaced and <replacement> is the replacement string | no | empty |
Custom simplification mapping is applied AFTER canonical and compatibility simplification. This means, for example, that an accented character for which a custom simplification mapping is defined must also be listed under excludeFromCanonicalSimplify, unless canonical simplification is disabled.
Example
```properties
simplificationMapping.1 = ä=ae
```
also requires
```properties
excludeFromCanonicalSimplify = ...ä...
```
Standard Splitting IP
The Standard Splitting Input Processor splits the user input text into sentences and words. Splitting is performed at configurable sentence and word delimiters. Splitting exceptions can be defined as a configurable list of abbreviations and a configurable regular expression.
This Input Processor generates one or more sentences, with zero or more words. The generated WordData objects contain the original and simplified form of the word. The final word-form is initialized with the simplified word form.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration properties
Properties for defining abbreviations
Name | Type | Required | Default |
---|---|---|---|
abbreviations.item.* | Format: abbreviations.item.<n> = <abbreviation>, where <n> is a number that must be unique within the abbreviation definitions of one file and <abbreviation> is an abbreviation | no | none |
List of abbreviations. Abbreviations are considered in the sentence separation process. Sentence delimiters within abbreviations will not lead to separated sentences.
Name | Type | Required | Default |
---|---|---|---|
abbreviations.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing abbreviations. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
abbreviations.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing abbreviations.
Properties for controlling user input separation into sentences and words
Name | Type | Required | Default |
---|---|---|---|
inputSeparation.sentenceDelimiters | string | no | . ¡ ! ¿ ? … |
List of characters that are used to separate sentences (unless part of an abbreviation).
Name | Type | Required | Default |
---|---|---|---|
inputSeparation.wordDelimiters | string | no | ```^"“”'‘’`´#$€£%&§ |
List of characters that are used to separate words. Delimiting characters will be kept as separate words, except for those that are listed under inputSeparation.nonWordCharacters (see below).
inputSeparation.additionalWordDelimiterRegEx may be used to specify additional or alternative word delimiting.
Name | Type | Required | Default |
---|---|---|---|
inputSeparation.additionalWordDelimiterRegEx | string | no | empty |
Additional word delimiting regular expression. This is an optional regular expression for delimiting words or defining (optionally zero-width) word boundaries. It may be specified in addition to, or as an alternative to, inputSeparation.wordDelimiters.
NOTE: in Java 6 and 7, a positive look-behind construct in the regex does not work with Unicode blocks outside the BMP if the block is specified with the \p{In...} construct, probably due to a bug in the Java regex implementation. Instead, the characters must be specified directly as a range.
Name | Type | Required | Default |
---|---|---|---|
inputSeparation.nonWordCharacters | string | no | "“”'‘’`´,;.¡!¿?…<SP><CR><LF><HT> |
Word separators that shall not be kept as words. The set of characters specified here should be a subset of inputSeparation.wordDelimiters and the characters matched by inputSeparation.additionalWordDelimiterRegEx.
Example (assuming defaults):
Argh$%, separate this!
will be separated into:
Argh
$
%
separate
this
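The example above can be reproduced with a short Python sketch. It is a simplification under assumptions: `WORD_DELIMITERS` is built from the part of the default that is visible in this document plus the non-word characters, and sentence splitting, abbreviations and the regular-expression properties are ignored. The function name `split_words` is hypothetical.

```python
# Characters that delimit words but are not kept as words (documented default,
# with <SP>, <CR>, <LF> and <HT> written as escape sequences).
NON_WORD = set("\"“”'‘’`´,;.¡!¿?… \r\n\t")
# Assumption: the visible part of the default delimiters plus NON_WORD.
WORD_DELIMITERS = set("^\"“”'‘’`´#$€£%&§") | NON_WORD

def split_words(sentence):
    """Split at delimiters; keep delimiters that are not non-word characters."""
    words, current = [], []
    for ch in sentence:
        if ch in WORD_DELIMITERS:
            if current:
                words.append("".join(current))
                current = []
            if ch not in NON_WORD:
                words.append(ch)  # delimiter survives as its own word
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

print(split_words("Argh$%, separate this!"))
```

The characters $ and % survive as separate words because they are word delimiters that are not listed as non-word characters, while the comma, the spaces and the exclamation mark are dropped.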
Name | Type | Required | Default |
---|---|---|---|
inputSeparation.excludeWordDelimitersRegEx | string | no | ```(?<=([ "“”,;.¡!¿?…\d] |
Regular expression that specifies exceptions to the splitting of a sentence into words.
The default regular expression prevents the characters , (comma) and . (dot) from acting as word delimiters when they appear in the context of a number.
Note: the text matched by the regular expression will be excluded from splitting, thus any word-splitting characters used only as a context condition should be given as zero-width look-behind/look-ahead constructs.
Standard Auto Correction IP
The Standard Auto Correction Input Processor applies spelling correction based on a configurable list of auto-correction mappings. The corrections are applied to the finalized form of the sentence word.
This IP works on the existing sentences and words passed in. It may modify the final form of words. The count of sentences and words is not modified.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration properties
Properties for defining autocorrection mappings
Name | Type | Required | Default |
---|---|---|---|
autoCorrections.item.* | Format: autoCorrections.item.<n> = <incorrect word>=<correct word>, where <n> is a number that must be unique within the autocorrection definitions of one file, <incorrect word> is the misspelled word that shall be mapped to a corrected version and <correct word> is the corrected version (it must be a single word; word splitting is not supported) | no | none |
List of word mappings for direct replacement of typical misspellings. The replacement takes place after the simplification.
Properties pointing to an external autocorrection list
Name | Type | Required | Default |
---|---|---|---|
autoCorrections.file.name | string | no | empty |
Filename (including path) of an extra file containing autocorrection mappings of the form:
```properties
<incorrect word>=<correct word>
```
where <incorrect word> is the misspelled word that shall be mapped to a corrected version and <correct word> is the corrected spelling of the word (it must be a single word; word splitting is not supported).
A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
autoCorrections.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing autocorrection mappings.
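As an illustration of the mapping format, the following Python sketch parses lines of the form `<incorrect word>=<correct word>` and applies them to a list of words. The function names and the sample misspellings are hypothetical; the actual IP reads its mappings from the properties described above.

```python
def parse_mappings(lines):
    """Parse '<incorrect word>=<correct word>' lines into a dictionary."""
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: blank or malformed lines are skipped
        incorrect, correct = line.split("=", 1)
        mappings[incorrect.strip()] = correct.strip()
    return mappings

def auto_correct(words, mappings):
    """Replace each word that has a mapping; the word count never changes."""
    return [mappings.get(word, word) for word in words]

mappings = parse_mappings(["teh=the", "recieve=receive"])
print(auto_correct(["i", "recieve", "teh", "mail"], mappings))
```

Each word is replaced by at most one word, which mirrors the restriction that the corrected version must be a single word.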
Predict IP
The Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict IP uses the model to annotate each user input with the machine learning classes defined.
Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. The numerical thresholds can be configured in the properties file of the Input Processor.
For each selected class, an annotation with the name <CLASS_NAME>.INTENT is created, holding the model's confidence in the class as its value. A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
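The selection rules can be sketched in Python as follows. This is an illustration under assumptions rather than the Predict IP's code: the function and class names are hypothetical, the limit of five is assumed to apply to the .INTENT annotations, and classes exactly on a threshold are kept.

```python
def select_annotations(scores, min_confidence=0.01, relative=0.5,
                       max_annotations=5):
    """Return (annotation name, confidence) pairs for a dict of class scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_confidence = ranked[0]
    annotations = [(top_name + ".TOP_INTENT", top_confidence)]
    # Both criteria are thresholds, so the matching classes form a prefix
    # of the ranking and slicing first is equivalent to filtering first.
    for name, confidence in ranked[:max_annotations]:
        keep = (confidence >= min_confidence
                and confidence >= relative * top_confidence)
        if name == top_name or keep:
            annotations.append((name + ".INTENT", confidence))
    return annotations

# Hypothetical class scores for one user input.
scores = {"BOOK_FLIGHT": 0.7, "CANCEL_BOOKING": 0.4, "GREETING": 0.2}
for annotation in select_annotations(scores):
    print(annotation)
```

With a top confidence of 0.7, the relative cut-off is 0.5 x 0.7 = 0.35, so CANCEL_BOOKING (0.4) is annotated and GREETING (0.2) is discarded.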
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
minConfidenceSimilarityDistance | float | no | 0.5 |
Fraction of the top class's confidence that a class must reach in order to be considered, e.g. if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 will be discarded.
Name | Type | Required | Default |
---|---|---|---|
maxNumberOfAnnotations | int | no | 5 |
Maximum number of class annotations to create for each user input.
Name | Type | Required | Default |
---|---|---|---|
minConfidenceThreshold | float | no | 0.01 |
Minimum confidence the model must have in a class for the class to be added as one of the candidate annotations.
Name | Type | Required | Default |
---|---|---|---|
intent.model.file.name | string (filename) | no | inexistent |
Name of the file containing the machine learning model. It is usually set automatically by Teneo Studio, so no configuration is required.
Standard Similarity Match Correction IP
The Standard Similarity Match Correction Input Processor applies spelling correction based on a configurable similarity matching of sentence words against words provided by a dictionary. The corrections are applied to the finalized form of the sentence words.
This IP works on the existing sentences and words passed in and it may modify the final form of a word. The count of sentences and words is not modified.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
The various spelling distance constants define the spelling tolerance behavior. They can be changed from their default values for fine-tuning, although this should generally not be required: the defaults are sensible and tested. Changing settings because of an isolated problem is not recommended, as it may compromise the IP.
All values are given as percentages. The spelling tolerance process adds up all distance values and divides the sum by the length of the word in the syntax. The result is compared to the spelling tolerance threshold.
Distances from similarities (defined below) take precedence over the standard distances defined here.
Example
```properties
similarities.1 = ah=a:5
```
Now, blah will have a distance of 5 to bla, no matter what value is given under spellingDistance.missingEndLetter.
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
spellingTolerance | Integer number 0-100 (0=off) | no | 15 |
The spelling tolerance limit. The accumulated distance value of comparing a user input word with a syntax word, divided by the length of the syntax word, must not be greater than this limit to consider a user input word similar to a syntax word.
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.extraEndLetter | Integer number >=0 | no | 100 |
Spelling distance for an extra letter at the end of the word.
Example
syntax: abcd
user input word: abcdx
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.doubleInsteadSingleLetter | Integer number >=0 | no | 62 |
Spelling distance for double letter where a single letter should be.
Example
syntax: abcd
user input word: abbcd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.singleInsteadDoubleLetter | Integer number >=0 | no | 62 |
Spelling distance for a single letter where a double letter should be.
Example
syntax: abbcd
user input word: abcd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.swappedLetter | Integer number >=0 | no | 100 |
Spelling distance for swapped letters.
Example
syntax: abcd
user input word: acbd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.extraLetter | Integer number >=0 | no | 75 |
Spelling distance for an extra letter that should not be there.
Example
syntax: abcd
user input word: abxcd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.missingLetter | Integer number >=0 | no | 75 |
Spelling distance for a missing letter.
Example
syntax: abcd
user input word: abd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.wrongLetter | Integer number >=0 | no | 100 |
Spelling distance for a completely wrong letter.
Example
syntax: abcd
user input word: abxd
Name | Type | Required | Default |
---|---|---|---|
spellingDistance.keyAdjacentLetter | Integer number >=0 | no | 75 |
Spelling distance for a wrong letter, which is adjacent to the correct one on the keyboard.
Example
(on qwerty or qwertz keyboards)
syntax: hello
user input word: hrllo
Name | Type | Required | Default |
---|---|---|---|
similarities.* | Format: similarities.<n> = <letter(s)>=<letter(s)>:<d> or similarities.<n> = <letter(s)>><letter(s)>:<d> (in the second case, note the > symbol between the two <letter(s)> strings), where <n> is a number that must be unique within the similarity definitions of one file, <letter(s)> is a string of letter(s) on which a similarity is defined and <d> is a positive number | no | none |
Similarity definitions.
With = (equals sign), the defined similarity works in both directions.
With > (greater-than sign), the first letter (combination) in the user input is regarded as similar to the second in the syntax, but not vice versa.
The word matching process that takes similarities into account usually runs after simplification, so defining similarities between letters that will be replaced by simplification makes no sense.
The number <d> is the spelling distance, given as a percentage value.
Example
similarities.4 = f>ph:25
Name | Type | Required | Default |
---|---|---|---|
keyboard.row1 | string | no | qwertyuiop |
Upper keyboard row.
Name | Type | Required | Default |
---|---|---|---|
keyboard.row2 | string | no | asdfghjkl |
Middle keyboard row.
Name | Type | Required | Default |
---|---|---|---|
keyboard.row3 | string | no | zxcvbnm |
Lower keyboard row.
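The arithmetic behind the tolerance check can be illustrated with the default values listed in this section. The Python sketch below assumes that the spelling errors have already been identified; it only adds up their distances, divides by the length of the syntax word and compares the result with spellingTolerance. The function name and the treatment of 0 (off) as "exact matches only" are assumptions.

```python
# Default distances documented in this section (percentage values).
DEFAULT_DISTANCES = {
    "extraEndLetter": 100,
    "doubleInsteadSingleLetter": 62,
    "singleInsteadDoubleLetter": 62,
    "swappedLetter": 100,
    "extraLetter": 75,
    "missingLetter": 75,
    "wrongLetter": 100,
    "keyAdjacentLetter": 75,
}

def within_tolerance(errors, syntax_word, spelling_tolerance=15):
    """errors: names of the spelling errors found between input and syntax word."""
    if spelling_tolerance == 0:
        return not errors  # assumption: 0 (off) accepts exact matches only
    total = sum(DEFAULT_DISTANCES[name] for name in errors)
    return total / len(syntax_word) <= spelling_tolerance

# hrllo vs. hello: one key-adjacent letter, 75 / 5 = 15, within the limit
print(within_tolerance(["keyAdjacentLetter"], "hello"))
# abxd vs. abcd: one wrong letter, 100 / 4 = 25, above the limit
print(within_tolerance(["wrongLetter"], "abcd"))
```

The division by word length means that longer syntax words tolerate more errors: a single missing letter (75) is rejected for a four-letter word (18.75) but accepted for an eight-letter word (9.375).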
System Annotation IP
The System Annotation Input Processor performs simple analysis of the sentence texts to set some annotations. The decision algorithms are configurable by various properties. Further customization is possible by sub-classing this Input Processor and overriding one or more of the methods decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion and decideQuote.
This IP works on the sentences passed in but does not modify them.
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:
- _EMPTY: the sentence text is empty
- _EXCLAMATION: the sentence text contains at least one of the characters specified with property exclamationMarkCharacters
- _EM3: the sentence text contains three or more characters in a row of the characters specified with property exclamationMarkCharacters
- _QUESTION: the sentence text contains at least one of the characters specified with property questionMarkCharacters
- _QT3: the sentence text contains three or more characters in a row of the characters specified with property questionMarkCharacters
- _QUOTE: the sentence text contains at least one of the characters specified with property quoteCharacters
- _DBLQUOTE: the sentence text contains at least one of the characters specified with property doubleQuoteCharacters
- _BRACKETPAIR: the sentence text contains at least one matching pair of the bracket characters specified with property bracketPairCharacters
- _NONSENSE: the sentence probably contains nonsense text as configured with properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
- _BINARY: the sentence text only contains characters specified by properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them).
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
consonants | string | no | BCDFGHJKLMNPQRSTVWXYZ bcdfghjklmnpqrstvwxyz |
Contains all letters (upper and lower case) that are considered consonants in the language. Together with the properties nonsenseThreshold.absolute and nonsenseThreshold.relative, these will be used for detecting probable nonsense inputs like kljljljljjlj.
Name | Type | Required | Default |
---|---|---|---|
nonsenseThreshold.absolute | Positive integer number | no | 6 |
For nonsense detection, an input consisting exclusively of at least this many consonants, without any non-consonants, is considered nonsense.
Name | Type | Required | Default |
---|---|---|---|
nonsenseThreshold.relative | Positive integer number | no | 10 |
For nonsense detection, an input containing at least this many consonants in a row is considered nonsense.
Name | Type | Required | Default |
---|---|---|---|
exclamationMarkCharacters | string | no | ! |
List of characters of which at least one must occur in the sentence text to set annotations _EXCLAMATION and _EM3 (in case of a sequence of at least three of the specified characters).
Name | Type | Required | Default |
---|---|---|---|
questionMarkCharacters | string | no | ? |
List of characters of which at least one must occur in the sentence text to set annotations _QUESTION and _QT3 (in case of a sequence of at least three of the specified characters).
Name | Type | Required | Default |
---|---|---|---|
doubleQuoteCharacters | string | no | “ |
List of characters of which at least one must occur in the sentence text to set annotation _DBLQUOTE.
Name | Type | Required | Default |
---|---|---|---|
quoteCharacters | string | no | ‘ |
List of characters of which at least one must occur in the sentence text to set annotation _QUOTE.
Name | Type | Required | Default |
---|---|---|---|
binaryCharacters | string | no | 01 |
List of characters recognized in the sentence text to set annotation _BINARY.
Name | Type | Required | Default |
---|---|---|---|
binaryIgnoredCharacters | string | no | !?,.-;:# \r\n\t\"' |
List of characters additionally allowed in binary text.
Name | Type | Required | Default |
---|---|---|---|
bracketPairCharacters | string | no | ()[]{} |
List of pairs of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the sentence text to set annotation _BRACKETPAIR.
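The decision rules can be sketched in Python using the default property values listed above. This is one possible reading of the descriptions, not the IP's actual algorithm: the function name is hypothetical, the nonsense rule follows the wording of the two threshold properties literally, a bracket pair is taken to be an opening bracket followed later by its closing bracket, and _QUOTE and _DBLQUOTE are left out.

```python
import re

CONSONANTS = set("BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz")
BINARY_IGNORED = set("!?,.-;:# \r\n\t\"'")

def _has_run(text, chars, length=3):
    """True if `length` characters from `chars` occur in a row."""
    pattern = "[" + re.escape(chars) + "]{%d}" % length
    return re.search(pattern, text) is not None

def system_annotations(text, absolute=6, relative=10):
    if text == "":
        return ["_EMPTY"]
    found = []
    for chars, single, triple in (("!", "_EXCLAMATION", "_EM3"),
                                  ("?", "_QUESTION", "_QT3")):
        if any(c in chars for c in text):
            found.append(single)
        if _has_run(text, chars):
            found.append(triple)
    for opening, closing in ("()", "[]", "{}"):
        start = text.find(opening)
        if start != -1 and text.find(closing, start + 1) != -1:
            found.append("_BRACKETPAIR")
            break
    if (any(c in "01" for c in text)
            and all(c in "01" or c in BINARY_IGNORED for c in text)):
        found.append("_BINARY")
    # Longest run of consecutive consonants in the text.
    longest = run = 0
    for c in text:
        run = run + 1 if c in CONSONANTS else 0
        longest = max(longest, run)
    only_consonants = all(c in CONSONANTS for c in text)
    if (only_consonants and len(text) >= absolute) or longest >= relative:
        found.append("_NONSENSE")
    return found

print(system_annotations("kljljljljjlj"))
print(system_annotations("Really???"))
```

The order of the returned annotations only reflects the order of the checks in this sketch.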
Special System annotations
Two special annotations related not to individual inputs, but to whole dialogues, are added by the Teneo Engine itself:
- _INIT: indicates session start, i.e. the first input in a dialogue
- _TIMEOUT: indicates the continuation of a previously timed-out session/dialogue.
Basic Number Recognizer IP
The Basic Number Recognizer Input Processor identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with the NUMBER annotation and associates a variable called numericValue with this annotation; the variable holds the numeric value of the number found.
This Input Processor is language independent, but every language has its own configuration file for this IP defining decimal point characters and the thousands separator character to be ignored.
For the NUMBER annotation and the variable to be added, a "number" in the user input must meet the following requirements:
- It must match the regular expression [,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+
- It must be parseable by Java's BigDecimal, to ensure it is a number.
These requirements provide the following guarantees:
- The sign is not included in the annotated token.
- The numericValue variable contains a BigDecimal representation of the number.
The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as the decimal marker and the comma as the thousands separator.
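The two requirements can be tried out with the following Python sketch, which uses the regular expression from above and Python's `decimal.Decimal` as a stand-in for Java's BigDecimal. The function name is hypothetical, and removing the thousands separator before parsing is an assumption about how "ignored" is implemented.

```python
import re
from decimal import Decimal, InvalidOperation

# Regular expression quoted above: dot = decimal marker, comma = thousands separator.
NUMBER = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")

def recognize_numbers(text):
    """Return (token, numeric value) pairs for every number found in text."""
    results = []
    for match in NUMBER.finditer(text):
        token = match.group(0)
        try:
            # Assumption: thousands separators are dropped before parsing.
            value = Decimal(token.replace(",", ""))
        except InvalidOperation:
            continue  # matched the regex but is not parseable as a number
        results.append((token, value))
    return results

for token, value in recognize_numbers("I paid -1,234.50 for 3 items"):
    print(token, value)
```

In the sample input the minus sign is not part of the token 1,234.50, in line with the first guarantee above.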
Language Detector IP
The Language Detector Input Processor uses a machine learning model that predicts the language of a given input and adds an annotation of the format %${language label}.LANG to the input, as well as a confidence score for the prediction.
The Language Detector IP can predict the following 45 languages (language label in brackets):
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language, under the label SR_HR and Indonesian and Malay are treated as one language, under the label ID_MS.
A number of regexes are also used by the Input Processor to keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.
The Language Detector provides an annotation when the confidence of the prediction is above 0.2. For Arabic (AR), Bengali (BN), Greek (EL), Hebrew (HE), Hindi (HI), Japanese (JA), Korean (KO), Tamil (TA), Telugu (TE), Thai (TH), Chinese (ZH), Vietnamese (VI), Persian (FA) and Urdu (UR), however, language annotations are always created, even for predictions with a confidence below 0.2, since the Language Detector is mostly accurate when predicting these languages.
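The threshold rule can be summarized in a few lines of Python. The sketch assumes that the model has already produced a language label and a confidence; the function name is hypothetical and the annotation is represented as a simple (name, confidence) pair.

```python
# Languages that are annotated regardless of the confidence value.
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}

def language_annotation(label, confidence, threshold=0.2):
    """Return (annotation name, confidence), or None if no annotation is set."""
    if label in ALWAYS_ANNOTATED or confidence > threshold:
        return (label + ".LANG", confidence)
    return None

print(language_annotation("EN", 0.85))  # above the threshold
print(language_annotation("EN", 0.10))  # below the threshold: no annotation
print(language_annotation("JA", 0.10))  # always annotated
```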
Finnish Input Processor Chain
The input processing chain for the Finnish language shares its Input Processors with the Standard Input Processors chain, but additionally includes the Finnish Splitting Input Processor, which sits between the Standard Splitting and Standard Auto Correction Input Processors as displayed in the graph below.
The Input Processors shared with the Standard Input Processors chain are:
- the Standard Splitting IP,
- the Standard Auto Correction IP,
- the Predict IP,
- the Standard Similarity Match Correction IP,
- the System Annotation IP,
- the Basic Number Recognizer IP, and
- the Language Detector IP.
The DateTime Recognizer Input Processor is also available in the Finnish Input Processors chain, but is currently not supported by the Approach in the Teneo Platform for understanding and interpretation of date and time expressions.
Finnish Splitting Input Processor
The Finnish Splitting Input Processor splits off suffixes from the existing sentence words passed in, using configurable word lists in its algorithm. It may modify an existing word (it is set to the word stem) and add one or more words after it (the suffixes split off). These added words all have the same original word form and begin index as the modified word. Words shorter than five characters or contained in the no-cut list will not be split. The count of sentences is not modified.
The suffixes are grouped into five lists:
- clitic
- participe
- poss
- cases
- comparison
Suffixes are searched for and split off in the order of the groups listed above. Within each group the suffixes are searched in the order given in the configuration file containing the suffixes of the group.
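The search order can be illustrated with a Python sketch. Everything concrete in it is an assumption: the suffix lists are tiny hypothetical samples rather than the configured files, at most one suffix is split off per group, the split-off suffixes are emitted in their surface order, and the function name is made up.

```python
# Hypothetical sample suffixes, in the documented group order.
GROUPS = [
    ("clitic", ["kin", "ko"]),
    ("participe", []),
    ("poss", ["ni", "si"]),
    ("cases", ["ssa", "lla"]),
    ("comparison", []),
]

def split_suffixes(word, no_cut=frozenset(), groups=GROUPS, min_length=5):
    """Return the word stem followed by the suffixes that were split off."""
    # Words shorter than five characters or in the no-cut list are not split.
    if len(word) < min_length or word in no_cut:
        return [word]
    stem, suffixes = word, []
    for _group_name, group in groups:       # groups in the documented order
        for suffix in group:                # suffixes in file order
            if stem.endswith(suffix) and len(stem) > len(suffix):
                stem = stem[:-len(suffix)]
                suffixes.insert(0, suffix)  # keep surface order
                break                       # assumption: one suffix per group
    return [stem] + suffixes

print(split_suffixes("talossanikin"))
```

For talossanikin the clitic kin is removed first, then the possessive ni and finally the case ending ssa, leaving the stem talo.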
Other considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration properties
Name | Type | Required | Default |
---|---|---|---|
nocut.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the words not to split. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
nocut.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the words not to split.
Name | Type | Required | Default |
---|---|---|---|
clitic.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the clitic suffixes. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
clitic.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the clitic suffixes.
Name | Type | Required | Default |
---|---|---|---|
participe.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the participe suffixes. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
participe.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the participe suffixes.
Name | Type | Required | Default |
---|---|---|---|
poss.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the possessive suffixes. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
poss.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the possessive suffixes.
Name | Type | Required | Default |
---|---|---|---|
cases.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the cases suffixes. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
cases.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the cases suffixes.
Name | Type | Required | Default |
---|---|---|---|
comparison.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing the comparison suffixes. A relative filename relates to the location of the properties file.
Name | Type | Required | Default |
---|---|---|---|
comparison.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing the comparison suffixes.