Annotations
Data layers
Annotations are a feature that generates an additional layer of information or data which can be attached to a word, a sentence or an entire user input, for example data generated by NER, POS taggers, or the Predict Input Processor. The annotation layer enriches the user input with further details which can then be used to enhance the standard engine matching process.
An annotation is assigned to an input sentence, and then to one or more words in that sentence. A word, sentence or input may carry multiple annotations. Optionally, variables of the type string (property key and value pairs) may be attached to annotations; these can pass data about the match to scripts in order to provide better natural language understanding.
One can think of annotations as dynamic language objects that are generated at runtime. Compared to normal, static Language Objects that always exist and require a TLML syntax to be fulfilled in order to match, annotations are their own syntax and they only match when they exist.
Types of annotations
By default, the Teneo Engine provides basic annotations with information regarding the user input such as the start of a dialogue or the end of a session.
Furthermore, the Teneo Input Processors (IPs) generate annotations if, for example, the user input is empty, or if it contains certain characters (like an exclamation point or a question mark) or combinations of characters. Depending on the active language configuration, the Input Processors may also produce annotations for numbers, Part-of-Speech (POS) and morphological information.
In addition to this, specific annotations can be created as required for a particular project. These custom annotations can be generated from within a custom Input Processor, a Pre-matching script or a Global Pre-listener script in the solution.
Why annotations?
The power of annotations comes from how they are created. Since they are generated via Input Processors or solution scripting, the data they provide goes beyond what is achievable with standard, rule-based NLU engine matching against words, Entities and Language Objects.
Annotations can be generated based on context, sentence structure, user specific configurations or more complex machine learning models, etc. and can later be used in the matching process. As a result, annotations contain brand-new data which makes it possible to create syntaxes that more precisely identify sentence elements that are necessary for interpreting the input correctly.
Annotations as Java classes
Annotations are represented as Java classes that store information about the word or words they label. They have the following fields:
- Name.
- Position of the sentence where the annotation is added. The first sentence in the input is 0.
- Position(s) of the word (or words) in the sentence where the annotation has been added. The first word in each sentence is at position 0. An annotation can be applied to multiple words.
- Annotation variables as a key value pair; the key is a string, while the value can be either a string, a number, or an object (these are optional and depend on the information the annotation needs to store).
The annotation key is made from the first three fields.
See the article Creating an annotation for a longer description of the class.
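The fields listed above can be pictured with a small stand-in class. This is only a sketch for illustration: the class name `AnnotationSketch`, its constructor and the format of the string returned by `key()` are assumptions, not the Teneo Engine class or its API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not the Teneo Engine class.
final class AnnotationSketch {
    final String name;
    final int sentenceIndex;             // the first sentence in the input is 0
    final Set<Integer> wordIndices;      // the first word in each sentence is 0
    final Map<String, Object> variables; // optional key/value pairs

    AnnotationSketch(String name, int sentenceIndex,
                     Set<Integer> wordIndices, Map<String, Object> variables) {
        this.name = name;
        this.sentenceIndex = sentenceIndex;
        this.wordIndices = new TreeSet<>(wordIndices); // sorted copy
        this.variables = new HashMap<>(variables);     // defensive copy
    }

    // The key combines the first three fields; this string format is assumed.
    String key() {
        return name + "|" + sentenceIndex + "|" + wordIndices;
    }
}
```

Two annotations with the same name on the same words of the same sentence would share a key in this sketch, regardless of their variables.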
System and Standard annotations
The Teneo Platform bundles two default collections of annotations in all language configurations: System annotations, added by the Teneo Engine, and Standard annotations, added by the Teneo Input Processors.
System annotations
Two special annotations are set by the Teneo Engine itself. These are related not to individual inputs but to whole dialogues and are dependent on the session state.
Annotation name | Description |
---|---|
_INIT | Indicating that a session has started, i.e. the first input in any new dialogue. |
_TIMEOUT | Indicating that a previously timed out session/dialogue has restarted. |
Standard annotations
The Standard annotations are set by the Teneo Input Processors. Regardless of the language configuration, the following annotations are set:
Annotation name | Description |
---|---|
_QUESTION | A question mark (?) appears in the input |
_EXCLAMATION | An exclamation point (!) appears in the input |
_DBLQUOTE | A quotation mark (“) appears in the input |
_QUOTE | Single quotation marks (‘) appear in the input |
_BRACKETPAIR | A pair of brackets ( ), [ ] or { } appears in the input |
_NONSENSE | The input contains nonsense text (such as ‘asdf’, ‘wgwwgwg’, ‘xxxxxx’) |
_EMPTY | The input contains no text |
_BINARY | The input consists of only 0s and 1s |
_QT3 | Triple question marks (???) appear in the input |
_EM3 | Triple exclamation marks (!!!) appear in the input |
Annotations from Input Processors
In addition to the previously mentioned annotations, the Teneo Input Processors generate further annotations depending on the solution's language configuration.
Number annotations
Standard, Korean and Turkish Input Processor chains
The Standard Input Processors chain, as well as the chains for Korean and Turkish, includes the Basic Number Recognizer Input Processor. This Input Processor identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with an annotation named `number` and associates a variable named `numericValue` with this annotation, which holds the number found.
Note that although this Input Processor is language independent, each language has its own configuration file that defines the decimal point character and the thousands separator character; the thousands separator is ignored when the number is read.
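The effect of the two configured characters can be sketched as follows. This is a minimal illustration and not the Basic Number Recognizer itself: the class `NumberSketch`, the method `findNumbers` and the regular expression are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real Input Processor is configured per language.
final class NumberSketch {
    // Returns the numeric value of every number found in the input.
    static List<Double> findNumbers(String input, char decimalPoint, char thousandsSeparator) {
        String dp = Pattern.quote(String.valueOf(decimalPoint));
        String ts = Pattern.quote(String.valueOf(thousandsSeparator));
        // Digits, optional separator-joined digit groups, optional decimal part.
        Pattern number = Pattern.compile("\\d+(?:" + ts + "\\d+)*(?:" + dp + "\\d+)?");
        List<Double> values = new ArrayList<>();
        Matcher m = number.matcher(input);
        while (m.find()) {
            String token = m.group()
                    .replace(String.valueOf(thousandsSeparator), "") // ignore the separator
                    .replace(decimalPoint, '.');                     // normalize the decimal point
            values.add(Double.parseDouble(token));
        }
        return values;
    }
}
```

With `.` as decimal point and `,` as thousands separator, the token `1,234.50` is read as 1234.5; with the two characters swapped, `3,14` is read as 3.14.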
To read more, please see the specific information for the Basic Number Recognizer Input Processor in the relevant Input Processor (IP) chain description: Korean IP chain, Turkish IP chain or the Standard IP chain.
Chinese Input Processor chain
The Chinese Numbers Input Processor is more advanced when generating number annotations, annotating all numbers as well as numerical expressions. It first normalizes tokens containing numeric values into Hindu-Arabic numerals, then creates an annotation named `NUMBER` with a `numericValue` variable containing the normalized number. It also generates a second annotation named after the normalized number value (e.g. 3.14 is annotated as `%$3.14` whereas 3,14 is annotated as `%$314`). The Chinese Numbers Input Processor also annotates inexact numbers, i.e. numbers containing the characters 几, 数, 余 or 多, with the annotation `INEXACT`.
To read more, please see the Chinese Numbers Input Processor.
Japanese Input Processor chain
The Japanese Number Recognizer Input Processor is capable of recognizing various types of number expressions, which are then annotated with a `number` annotation that carries a variable called `numericValue` holding the numeric value of the number found.
To read more about the Input Processor, visit the section Japanese Input Processors chain.
Predict annotations
The Teneo Predict Input Processor makes use of a machine learning model generated in the Teneo Learn component when machine learning classes are available in a Teneo Studio solution. The Predict Input Processor uses the model to annotate each user input with the machine learning classes defined.
Whenever the Predict IP receives a user input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
The Predict Input Processor will create a maximum of 5 annotations, regardless of how many classes match the criteria. For each selected class, an annotation with the name `<CLASS_NAME>.INTENT` is created with the value of the model confidence; a special annotation, `<CLASS_NAME>.TOP_INTENT`, is created for the class with the highest confidence score.
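The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Predict Input Processor: the class `PredictSketch` and its method are hypothetical, and the limit of 5 is interpreted here as a limit on the number of selected classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the selection criteria; not the Teneo implementation.
final class PredictSketch {
    static final double MIN_CONFIDENCE = 0.01; // default minimum confidence
    static final double TOP_RATIO = 0.5;       // fraction of the top confidence
    static final int MAX_CLASSES = 5;          // assumed to count selected classes

    // Returns the annotation names that would be created for the given scores.
    static List<String> annotationNames(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // highest first
        List<String> names = new ArrayList<>();
        if (ranked.isEmpty()) {
            return names;
        }
        double top = ranked.get(0).getValue();
        int selected = 0;
        for (Map.Entry<String, Double> e : ranked) {
            boolean isTop = selected == 0; // the most confident class is always annotated
            boolean qualifies = e.getValue() > MIN_CONFIDENCE
                    && e.getValue() > TOP_RATIO * top;
            if (!isTop && !qualifies) {
                break; // scores are sorted, so the remaining classes fail as well
            }
            names.add(e.getKey() + ".INTENT");
            if (isTop) {
                names.add(e.getKey() + ".TOP_INTENT");
            }
            if (++selected == MAX_CLASSES) {
                break;
            }
        }
        return names;
    }
}
```

For example, with the made-up scores 0.6, 0.35 and 0.05, the first two classes are annotated because 0.35 is above both 0.01 and half of 0.6, while 0.05 is not.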
Read more about Machine Learning in Teneo or visit the Teneo Predict section to read more about the Input Processor.
Part-of-Speech and Morphological annotations
Depending on the language configuration, the Teneo Input Processors may also set annotations carrying Part-Of-Speech and morphological information.
The POS-tagger / Morphological Analyzer creates annotations for each word in the user input with names like `NN.POS`, `VB.POS`, `PAST.POS`, `PRESENT.POS`, etc. that help to distinguish whether, for example, the word is a noun or a verb. Furthermore, the Analyzer also provides annotations which indicate whether a noun is singular or plural, or whether a verb is in the present, in the past, in the 3rd person, an imperative, etc.
These sets of annotations are language-specific and differ from one language to the next. For more information, please see the Teneo Input Processors sections.
Language annotations
The Teneo Platform Input Processor chain includes a Language Detector Input Processor, which uses a machine learned model to generate an annotation for the predicted language of the user input in the format `%${language label}.LANG`, together with a confidence score for the prediction.
The value of the confidence score reflects the probability of a tag being correct and ranges from 0 (lowest probability) to 1 (highest probability).
For more information, please visit the NLP Capabilities section describing the available Input Processor chains.
Named Entity annotations
For several languages, the Teneo Input Processors also create annotations for entities detected in user inputs, such as locations, organizations, products, etc. Annotations are also created for entities that might carry Personally Identifiable Information (PII), such as names, addresses, unique identifiers, e-mail addresses, etc.
For more information on the Named Entity Recognizer please visit the Input Processors section.
Custom annotations
Users can create their own custom annotations. This opens up the possibility to enrich user inputs with an extra layer of information, such as labelling postal codes or phone numbers based on given patterns; using named-entity recognizers to find products, places, or names; or tagging up inputs using a machine learned classifier, and so on.
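As an illustration of pattern-based labelling, the sketch below finds the word positions that a custom annotation could be attached to. Everything in it is an assumption made for the example: the class `PostalCodeSketch`, the five-digit pattern and the whitespace tokenization are hypothetical and do not reflect how Teneo splits an input into words.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch; tokenization and pattern are assumptions.
final class PostalCodeSketch {
    // Assumed pattern: a postal code written as exactly five digits.
    private static final Pattern POSTAL_CODE = Pattern.compile("\\d{5}");

    // Returns the 0-based positions of the words that look like postal codes.
    static Set<Integer> matchingWordPositions(String sentence) {
        Set<Integer> positions = new TreeSet<>();
        String[] words = sentence.trim().split("\\s+"); // naive whitespace split
        for (int i = 0; i < words.length; i++) {
            if (POSTAL_CODE.matcher(words[i]).matches()) {
                positions.add(i);
            }
        }
        return positions;
    }
}
```

The resulting set corresponds to the word positions field of an annotation; a custom Input Processor or script would then create the annotation for those positions.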
Custom annotations can be created in custom input processors; in this case the annotations are implemented outside Teneo Studio, in an input processor developed specifically for the solution.
Custom annotations can also be created in Pre-matching scripts within the solution itself by calling the annotation-related Teneo Engine API methods in the global pre-matching script. This method allows for an easy and comprehensive use of the annotator capabilities, allowing annotations to be added from within the solution. Here annotation data is based on user inputs and thereby it is possible to make use of these annotations in TLML syntaxes such as language objects, variable values, or other annotations, etc. This is not possible using input processors.
In the example below, the Global Pre-matching script generates the annotation `START` and attaches it to the first word in the input sentence (if the input is not empty):
```groovy
// Add the annotation "START" to sentence 0, word 0, with no annotation variables
_.getInputAnnotations().add(_.createInputAnnotation("START", 0, [0] as HashSet, [:]))
```
It is furthermore possible to script annotations from Pre-matching Global Listeners the same way they are created in Pre-matching scripts.
As an example, the annotation `FAVORITE_COLOR` can be generated and attached to a word if that word matches the TLML syntax of the Language Object `COLORS.LIST`, appears as the first word in the input, and the Boolean variable `bPositiveSentiment` is set to true at the same time.
Once created, this annotation can be used in TLML syntaxes elsewhere in the solution, for example: