Annotations

Data layers

Annotations are a feature that generates an additional layer of information or data which can be attached to a word, a sentence or an entire user input; for example, data generated by NER, POS taggers, or the Predict Input Processor. The annotation layer enriches the user input with further details which can then be used to enhance the standard engine matching process.

An annotation is assigned to an input sentence, and then to one or more words in that sentence; a word, sentence or input may also carry multiple annotations. Optionally, variables of the type string (property key and value pairs) may be attached to annotations. These variables pass data about the match to scripting, where it can be used to provide better natural language understanding.

One can think of annotations as dynamic language objects that are generated at runtime. Compared to normal, static Language Objects that always exist and require a TLML syntax to be fulfilled in order to match, annotations are their own syntax and they only match when they exist.

Types of annotations

By default, the Teneo Engine provides basic annotations with information regarding the user input such as the start of a dialogue or the end of a session.

Furthermore, the Teneo Input Processors (IPs) generate annotations if, for example, the user input is empty, or if it contains certain characters (like an exclamation point or a question mark) or combinations of characters. Depending on the active language configuration, the Input Processors may also produce annotations for numbers, Part-of-Speech (POS) and morphological information.

In addition to this, specific annotations can be created as required for a particular project. These custom annotations can be generated from within a custom Input Processor, a Pre-matching script or a Global Pre-listener script in the solution.

Why annotations?

The power of annotations comes from how they are created. Since they are generated via Input Processors or solution scripting, the data they can provide goes beyond what is achievable with standard, rule-based NLU engine matching against words, Entities and Language Objects.

Annotations can be generated based on context, sentence structure, user-specific configurations or more complex machine learning models, and can later be used in the matching process. As a result, annotations contain brand-new data which makes it possible to create syntaxes that more precisely identify the sentence elements necessary for interpreting the input correctly.

Annotations as Java classes

Annotations are represented as Java classes that store information about the word they have labelled, and they have the following fields:

  • Name.
  • Position of the sentence where the annotation is added. The first sentence in the input is 0.
  • Position(s) of the word (or words) in the sentence where the annotation has been added. The first word in each sentence is at position 0. An annotation can be applied to multiple words.
  • Annotation variables as a key value pair; the key is a string, while the value can be either a string, a number, or an object (these are optional and depend on the information the annotation needs to store).

The annotation key is made from the first three fields.
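The field list above can be sketched as a small Java class. This is an illustrative stand-in, not the Teneo Engine class: the class name `AnnotationSketch`, the `key()` method and the `|`-separated key format are all assumptions made for the example. It only shows that the key depends on the name, the sentence position and the word positions, and not on the annotation variables.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative stand-in only; this is NOT the Teneo Engine annotation class.
final class AnnotationSketch {
    final String name;                   // annotation name, e.g. "START"
    final int sentenceIndex;             // 0 = first sentence in the input
    final Set<Integer> wordIndices;      // 0 = first word in the sentence
    final Map<String, Object> variables; // optional key/value data

    AnnotationSketch(String name, int sentenceIndex,
                     Set<Integer> wordIndices, Map<String, Object> variables) {
        this.name = name;
        this.sentenceIndex = sentenceIndex;
        // Sorted copy, so the key does not depend on insertion order.
        this.wordIndices = Collections.unmodifiableSet(new TreeSet<>(wordIndices));
        this.variables = Collections.unmodifiableMap(new HashMap<>(variables));
    }

    // Hypothetical key format, built from the first three fields only.
    String key() {
        return name + "|" + sentenceIndex + "|" + wordIndices;
    }

    public static void main(String[] args) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("numericValue", 3.14);
        AnnotationSketch withVars = new AnnotationSketch("number", 0, Set.of(2), vars);
        AnnotationSketch noVars = new AnnotationSketch("number", 0, Set.of(2), new HashMap<>());
        System.out.println(withVars.key());                      // number|0|[2]
        System.out.println(withVars.key().equals(noVars.key())); // true
    }
}
```

Under this assumed key format, two annotations with the same name on the same words of the same sentence share a key even when their variables differ.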

See the article Creating an annotation for a longer description of the class.

System and Standard annotations

The Teneo Platform bundles two default collections of annotations in all language configurations: System annotations, added by the Teneo Engine, and Standard annotations, added by the Teneo Input Processors.

System annotations

Two special annotations are set by the Teneo Engine itself. These are related not to individual inputs but to whole dialogues and are dependent on the session state.

Annotation name and description:

  • _INIT: Indicating that a session has started, i.e. the first input in any new dialogue.
  • _TIMEOUT: Indicating that a previously timed out session/dialogue has restarted.

Standard annotations

The Standard annotations are set by the Teneo Input Processors. Regardless of the language configuration, the following annotations are set:

Annotation name and description:

  • _QUESTION: A question mark (?) appears in the input
  • _EXCLAMATION: An exclamation point (!) appears in the input
  • _DBLQUOTE: A quotation mark (“) appears in the input
  • _QUOTE: Single quotation marks (‘) appear in the input
  • _BRACKETPAIR: A pair of brackets ( ), [ ] or { } appear in the input
  • _NONSENSE: The input contains nonsense text (such as ‘asdf’, ‘wgwwgwg’, ‘xxxxxx’)
  • _EMPTY: The input contained no text
  • _BINARY: The input consists of only 0s and 1s
  • _QT3: Triple question marks (???) appear in the input
  • _EM3: Triple exclamation marks (!!!) appear in the input
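To make the descriptions above more concrete, the following Java sketch approximates a few of the simpler Standard annotations. It is only a reading of the descriptions, not the Input Processors' actual logic: the class and method names are hypothetical, treating whitespace-only input as empty is an assumption, and annotations such as _NONSENSE are left out because the descriptions do not say how they are detected.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Rough approximation of some Standard annotations; not the real Input Processor.
final class StandardAnnotationSketch {

    static Set<String> annotate(String input) {
        Set<String> names = new LinkedHashSet<>();
        // Assumption: null and whitespace-only inputs count as "no text".
        String text = input == null ? "" : input.trim();
        if (text.isEmpty()) {
            names.add("_EMPTY");
            return names;
        }
        if (text.contains("?")) names.add("_QUESTION");
        if (text.contains("???")) names.add("_QT3");
        if (text.contains("!")) names.add("_EXCLAMATION");
        if (text.contains("!!!")) names.add("_EM3");
        if (text.matches("[01]+")) names.add("_BINARY"); // only 0s and 1s
        return names;
    }

    public static void main(String[] args) {
        System.out.println(annotate("Really???")); // [_QUESTION, _QT3]
        System.out.println(annotate("0101"));      // [_BINARY]
        System.out.println(annotate(""));          // [_EMPTY]
    }
}
```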

Annotations from Input Processors

In addition to the previously mentioned annotations, the Teneo Input Processors generate further annotations depending on the solution language configuration.

Number annotations

Standard, Korean and Turkish Input Processor chains

The Standard Input Processor chain, as well as the chains for Korean and Turkish, includes the Basic Number Recognizer Input Processor. This Input Processor identifies all Arabic numbers of the type 123 and 3.14 in the user input, gives each of them an annotation named number, and attaches to that annotation a variable named numericValue which holds the number found.

Note that although this Input Processor is language independent, each language has its own configuration file that defines the decimal point character and the thousands separator character to be ignored.
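The behaviour described above can be sketched in Java as follows. This is a minimal illustration and not the Basic Number Recognizer itself: the class name, the method signature and the regular expression are assumptions, and the two characters are passed as parameters here rather than read from a language configuration file. Each returned value corresponds to what would be stored in numericValue on a number annotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of number recognition; not the actual Input Processor.
final class NumberRecognizerSketch {

    // Returns the numeric values found in the input, in order of appearance.
    static List<Double> findNumbers(String input, char decimalPoint, char thousandsSeparator) {
        String dp = Pattern.quote(String.valueOf(decimalPoint));
        String ts = Pattern.quote(String.valueOf(thousandsSeparator));
        // Digits, optionally joined by the separator, with an optional decimal part.
        Pattern number = Pattern.compile("\\d+(?:" + ts + "\\d+)*(?:" + dp + "\\d+)?");
        List<Double> values = new ArrayList<>();
        Matcher m = number.matcher(input);
        while (m.find()) {
            String normalized = m.group()
                    .replace(String.valueOf(thousandsSeparator), "") // separator is ignored
                    .replace(decimalPoint, '.');                     // normalize decimal point
            values.add(Double.parseDouble(normalized));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("pi is 3.14 and 1,000 more", '.', ',')); // [3.14, 1000.0]
        System.out.println(findNumbers("pi ist 3,14", ',', '.'));               // [3.14]
    }
}
```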

To read more, please see the specific information for the Basic Number Recognizer Input processor in the relevant Input Processor (IP) chain description: Korean IP chain, Turkish IP chain or the Standard IP chain.

Chinese Input Processor chain

The Chinese Numbers Input Processor is more advanced when generating number annotations: it annotates all numbers as well as numerical expressions. It first normalizes tokens containing numeric values into Hindu-Arabic numerals, then creates an annotation named NUMBER with a numericValue variable containing the normalized number, and also generates a second annotation with the name of the normalized number value (e.g. 3.14 is annotated as %$3.14 whereas 3,14 is annotated as %$314). The Chinese Numbers Input Processor also annotates inexact numbers with the annotation INEXACT (i.e. numbers containing the characters 几, 数, 余 or 多).

To read more, please see the Chinese Numbers Input Processor.

Japanese Input Processor chain

The Japanese Number Recognizer Input Processor is capable of recognizing various types of number expressions, which are then annotated with a number annotation and associated with a variable called numericValue which holds the numeric value of the number found.

To read more about the Input Processor, visit the section Japanese Input Processors chain.

Predict annotations

The Teneo Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict Input Processor uses the model to annotate each user input with the machine learning classes defined.

Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:

  • the confidence is above the minimum confidence (defaults to 0.01)
  • the confidence is higher than 0.5 times the confidence value of the top class.

The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. For each selected class, an annotation with the name <CLASS_NAME>.INTENT is created with the value of the model confidence; a special annotation, <CLASS_NAME>.TOP_INTENT, is created for the class with the highest confidence score.
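The selection rules above can be sketched in Java as follows. This is an illustration under stated assumptions, not the Predict Input Processor's implementation: the class and method names are hypothetical, the limit of 5 is assumed to apply to the number of selected classes (with the TOP_INTENT annotation created in addition), and the TOP_INTENT annotation is assumed to carry the same confidence value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the class selection rules; not the actual Predict IP.
final class PredictSelectionSketch {
    static final double MIN_CONFIDENCE = 0.01; // default minimum confidence
    static final double TOP_RATIO = 0.5;       // fraction of the top confidence
    static final int MAX_CLASSES = 5;          // assumed to cap the selected classes

    // Maps annotation names to confidence values, highest confidence first.
    static Map<String, Double> annotate(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, Double> annotations = new LinkedHashMap<>();
        if (ranked.isEmpty()) {
            return annotations;
        }
        double top = ranked.get(0).getValue();
        int selected = 0;
        for (Map.Entry<String, Double> entry : ranked) {
            if (selected == MAX_CLASSES) {
                break;
            }
            double confidence = entry.getValue();
            boolean isTop = selected == 0; // the first ranked class is always selected
            if (isTop || (confidence > MIN_CONFIDENCE && confidence > TOP_RATIO * top)) {
                annotations.put(entry.getKey() + ".INTENT", confidence);
                if (isTop) {
                    annotations.put(entry.getKey() + ".TOP_INTENT", confidence);
                }
                selected++;
            }
        }
        return annotations;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("BOOK_FLIGHT", 0.6);    // hypothetical class names
        scores.put("CANCEL_FLIGHT", 0.35);
        scores.put("CHECK_STATUS", 0.2);   // not above 0.5 * 0.6, so not selected
        System.out.println(annotate(scores).keySet());
    }
}
```

With these example scores, only the two highest classes are selected, because 0.2 does not exceed half of the top confidence of 0.6.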

Read more about Machine Learning in Teneo or visit the Teneo Predict section to read more about the input processor.

Part-of-Speech and Morphological annotations

Depending on the language configuration, the Teneo Input Processors may also set annotations carrying Part-Of-Speech and morphological information.

The POS-tagger / Morphological Analyzer creates annotations for each word in the user input with names like NN.POS, VB.POS, PAST.POS, PRESENT.POS, etc. that help to distinguish whether, for example, the word is a noun or a verb. Furthermore, the Analyzer also provides annotations which indicate whether a noun is singular or plural, or whether a verb is in the present, in the past, in the 3rd person, an imperative, etc.

These sets of annotations are language-specific and differ from one language to the next. For more information, please see the Teneo Input Processors sections.

Language annotations

The Teneo Platform Input Processor chain includes a Language Detector Input Processor, which uses a machine learned model to generate an annotation for the predicted language of the user input in the format %${language label}.LANG, as well as a confidence score of the prediction.

Language Detector annotation

The value of the confidence score reflects the probability of a tag being correct and ranges from 0 (lowest probability) to 1 (highest probability).

For more information, please visit the NLP Capabilities section describing the available Input Processor chains.

Named Entity annotations

For several languages, the Teneo Input Processors also create annotations for entities detected in user inputs, such as locations, organizations, products, etc. Annotations are also created for entities that might carry Personally Identifiable Information (PII), such as names, addresses, unique identifiers, e-mail addresses, etc.

For more information on the Named Entity Recognizer please visit the Input Processors section.

Custom annotations

Users can create their own custom annotations. This opens up the possibility to enrich user inputs with an extra layer of information, such as labelling postal codes or phone numbers based on given patterns; using named-entity recognizers to find products, places, or names; or tagging up inputs using a machine learned classifier, and so on.

Custom annotations can be created in custom input processors; in this case the annotations are implemented outside Teneo Studio, in an input processor developed specifically for the solution.

Custom annotations can also be created in Pre-matching scripts within the solution itself by calling the annotation-related Teneo Engine API methods in the global pre-matching script. This method allows for an easy and comprehensive use of the annotator capabilities, allowing annotations to be added from within the solution. Here annotation data is based on user inputs and thereby it is possible to make use of these annotations in TLML syntaxes such as language objects, variable values, or other annotations, etc. This is not possible using input processors.

In the example below, the Global Pre-matching script generates the annotation START and attaches it to the first word in the input sentence (if the input is not empty):

Global Pre-matching script

groovy

_.getInputAnnotations().add(_.createInputAnnotation("START", 0, [0] as HashSet, [:]))

It is furthermore possible to script annotations from Pre-matching Global Listeners in the same way they are created in Pre-matching scripts.

In the image below, the annotation FAVORITE_COLOR is generated and attached to a word if that word matches the TLML syntax of the Language Object COLORS.LIST and appears as the first word in the input, and if the Boolean variable bPositiveSentiment is set to true at the same time:

Listener Favorite Color

Once created, this annotation can be used in TLML syntaxes elsewhere in the solution, for example:

Flow and Syntax Match

The generated annotation in Try Out

Because Pre-matching Global Listeners are executed sequentially in a defined order, an action performed by one listener (such as removing an annotation) might affect the syntax matching of the listeners that follow.