Annotating Inputs
Annotations are a feature that generates additional information or an extra data layer which can be attached to a word, a sentence or an entire user input text. The annotation layer enriches the user input with further details which can then be used to enhance the standard engine matching process.
Data layer
An annotation represents an additional data layer associated with words or sentences in the user input, e.g., data generated by the Named Entity Recognizer or a Part-of-Speech tagger. An annotation has a name, is associated with a sentence via that sentence's index in the list of sentences, and holds a set of indices of the word or words in the sentence to which it applies. That is, an annotation is assigned to a sentence of a user input and then to one or more words in that sentence; a word or sentence may therefore carry one or multiple annotations.
Optionally, annotations also store a map of variables (property key and value pairs), which enrich their attached objects with further details; these variables can then be used to pass data about the match to be used in scripting to provide better natural language understanding, enhancing the standard engine matching process.
One can think of annotations as dynamic language objects that are generated at runtime. Compared to normal, static language objects, which always exist and require a TLML syntax to be fulfilled in order to match, annotations are their own syntax and only match when they exist.
Types of annotations
The Teneo Engine provides basic annotations, by default, with information regarding the user input, such as the start of a dialogue or the end of a session. The Teneo Input Processors generate annotations if, for example, the user input is empty or if it contains certain characters (like an exclamation point or a question mark) or combinations of characters. Depending on the active language configuration, the Input Processors may also produce annotations for numbers, Part-of-Speech (POS) and morphological information, etc.
In addition to this, custom annotations can be created as required for a particular project; these custom annotations can be generated from within a custom input processor, a Pre-matching script or a Global Pre-listener script in the Teneo solution.
Why annotations?
Annotations are generated via Input Processors and/or solution scripting, and their power comes from exactly how they are created: the data they provide goes beyond what is achievable with standard TLML rule-based engine matching against words, Entities and Language Objects.
The annotations can be generated based on context, sentence structure, user specific configurations or more complex machine learning models, etc., and can then be used in the matching process. As a result, annotations contain brand-new data which makes it possible to create syntaxes which more precisely identify sentence elements that are necessary for interpreting the input correctly.
Annotations as Java classes
The annotations are represented as Java classes that store information about the word or words they have labelled, and they have the following fields:
- Name
- Position of the sentence where the annotation is added; the first sentence in the input is 0
- Position(s) of the word (or words) in the sentence where the annotation has been added; the first word in each sentence is at position 0; an annotation can be applied to multiple words
- Annotation variables as key-value pairs; the key is a string while the value can be either a string, a number, or an object (these are optional and depend on the information the annotation needs to store).
The annotation key is made from the first three fields; see Annotation methods further below for more details.
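As a rough illustration of the fields listed above, the following Groovy sketch models an annotation as a plain class. The class name AnnotationSketch and its key() method are hypothetical stand-ins, not the Engine's actual Annotation class; the sketch only shows how the name, the sentence index and the word indices could together act as the identifying key while the variables remain optional.

```groovy
// Hypothetical stand-in for illustration only; not the Teneo Engine's Annotation class.
class AnnotationSketch {
    String name                      // annotation name, e.g. "NUMBER"
    int sentenceIndex                // the first sentence in the input is 0
    Set<Integer> wordIndices = []    // the first word in each sentence is 0
    Map<String, Object> variables    // optional key-value pairs; may be null

    // The key is built from the first three fields only; variables are not part of it.
    List key() {
        [name, sentenceIndex, wordIndices.toList().sort()]
    }
}

def first  = new AnnotationSketch(name: "NUMBER", sentenceIndex: 0, wordIndices: [2] as Set,
                                  variables: [numericValue: 42])
def second = new AnnotationSketch(name: "NUMBER", sentenceIndex: 0, wordIndices: [2] as Set)

// Same name, sentence and words give the same key even though the variables differ.
assert first.key() == second.key()
```

Under this assumption, two annotations that differ only in their variables share a key, which is in line with the later note that adding an annotation that already exists overwrites it.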
Collections of annotations
The Teneo Engine, by default and independent of solution language, provides basic annotations related to whole dialogues (and not individual inputs) while the System Annotation Input Processor (IP) provides annotations related to, for example, empty inputs, if question marks or quotes were present in the user input or if nonsense text or binary characters were detected.
In addition to the above, and in this case dependent on the solution language, more annotations are generated by other Teneo Input Processors, for example, annotations related to intent classification, basic number recognition, language recognition or even annotations related to Part-of-Speech tagging and morphological analysis.
Then, last but not least, project-specific custom annotations can be created for a particular project by generating the annotation directly in the solution in a Pre-matching script or a Global Pre-listener.
The following sub-sections introduce the various collections of annotations available with Teneo as well as custom annotations.
System and Standard annotations
The Teneo Platform bundles two default collections of annotations in all language configurations: System annotations added by the Teneo Engine and Standard annotations added by the System Annotation Input Processor.
System annotations
Two special annotations are set by the Teneo Engine itself; they relate not to individual inputs but to whole dialogues and depend on the session state.
Annotation | Description |
---|---|
_INIT | Indicates that a session has started, i.e., the first input in any new dialogue |
_TIMEOUT | Indicates that a previously timed out session/dialogue has restarted |
Standard annotations
The Standard annotations are set by the System Annotation Input Processor and, regardless of the language configuration, the following annotations are set:
Annotation | Description |
---|---|
_QUESTION | A question mark (?) appears in the input |
_EXCLAMATION | An exclamation point (!) appears in the input |
_DBLQUOTE | A quotation mark (") appears in the input |
_QUOTE | Single quotation marks (‘) appear in the input |
_BRACKETPAIR | A pair of brackets ( ), [ ] or { } appears in the input |
_NONSENSE | The input contains nonsense text (such as ‘asdf’, ‘wgwwgwg’, ‘xxxxxx’) |
_EMPTY | The input contains no text |
_BINARY | The input consists of only 0s and 1s |
_QT3 | Triple question marks (???) appear in the input |
_EM3 | Triple exclamation marks (!!!) appear in the input |
Annotations from other Input Processors
In addition to the above mentioned annotations, depending on the solution language configuration, more Teneo Input Processors (IPs) are available generating additional annotations.
Number annotations
Standard, Korean and Turkish Input Processor chains
The Standard Input Processor chain, as well as the chains for Korean and Turkish, include the Basic Number Recognizer Input Processor. This Input Processor identifies all Arabic numbers of the type 123 and 3.14 in the user input, annotates each of them with an annotation and associates a variable to this annotation which holds the number found. Although this Input Processor is language independent, each language has its own configuration file that defines which is the decimal point character and the thousands separator character to be ignored.
Annotation | Variable | Description |
---|---|---|
NUMBER | numericValue | Arabic numbers of the type 123 and 3.14, the associated variable numericValue stores the detected number |
Read more: Korean IP chain, Turkish IP chain and Standard IP chain.
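To illustrate the role of the two configured characters, the sketch below parses a token using a given decimal point and an ignored thousands separator. The function parseNumber and its parameters are hypothetical and only mirror the behavior described above; the actual Input Processor and the format of its configuration file are not shown here.

```groovy
// Illustrative only: parseNumber and its parameters are hypothetical, not part of Teneo.
// Returns the numeric value of a token, or null if the token is not an Arabic number.
BigDecimal parseNumber(String token, String decimalPoint, String thousandsSeparator) {
    // Drop the thousands separator, then map the configured decimal point to "."
    String normalized = token.replace(thousandsSeparator, "").replace(decimalPoint, ".")
    // Accept digits with an optional fractional part
    (normalized ==~ /\d+(\.\d+)?/) ? new BigDecimal(normalized) : null
}

// Configuration with "." as decimal point and "," ignored
assert parseNumber("1,234.5", ".", ",") == 1234.5
// Configuration with "," as decimal point and "." ignored
assert parseNumber("3,14", ",", ".") == 3.14
// Tokens that are not numbers yield no value
assert parseNumber("abc", ".", ",") == null
```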
Chinese Input Processor chain
The Chinese Numbers Input Processor is more advanced when generating number annotations, annotating all numbers as well as numerical expressions. It first normalizes tokens containing numeric values into Hindu-Arabic numerals, then creates an annotation with a variable containing the normalized number, and also generates a second annotation with the name of the normalized number value (e.g., 3.14 is annotated as %$3.14 whereas 3,14 is annotated as %$314). The Chinese Number IP also annotates inexact numbers, i.e., numbers containing the characters 几, 数, 余 or 多.
Annotation | Variable | Description |
---|---|---|
NUMBER | numericValue | All numbers as well as numerical expressions; the associated variable stores the normalized number |
INEXACT | | Inexact numbers are annotated with INEXACT, i.e., numbers containing the characters 几, 数, 余 or 多 |
To read more, please see the Chinese Numbers Input Processor.
Japanese Input Processor chain
The Japanese Number Recognizer Input Processor is capable of recognizing various types of number expressions, which are then annotated with a number annotation and associated with a variable which holds the numeric value of the found number.
Annotation | Variable | Description |
---|---|---|
NUMBER | numericValue | Covers various types of number expressions, the associated variable stores the normalized number |
To read more about the Input Processor, visit the section Japanese Input Processors chain.
Predict annotations
The Predict Input Processor makes use of a machine learning model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes. Models can be generated either with Teneo Learn or CLU; note that as of Teneo 7.3, deferred intent classification is applied.
Whenever the Predict Input Processor receives an input, the Input Processor calculates a confidence score for each of the classes based on the model, creating annotations for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
Teneo Predict will create a maximum of 5 annotations, regardless of how many classes match the criteria.
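The selection criteria described above can be sketched as follows. The helper selectClasses is hypothetical and is not the Predict Input Processor's implementation; it assumes that the most confident class is always kept and that the limit of five applies to the selected classes as a whole.

```groovy
// Illustrative sketch of the selection criteria; selectClasses is a hypothetical helper.
List<Map> selectClasses(Map<String, Double> confidences,
                        double minConfidence = 0.01, int maxAnnotations = 5) {
    // Rank the classes from most to least confident
    def ranked = confidences.sort { -it.value }.collect { [name: it.key, confidence: it.value] }
    if (!ranked) return []
    def top = ranked.first()
    // The top class is always kept; every other class must pass both thresholds
    def others = ranked.drop(1).findAll {
        it.confidence > minConfidence && it.confidence > 0.5 * top.confidence
    }
    ([top] + others).take(maxAnnotations)
}

def selected = selectClasses([BOOK_FLIGHT: 0.60d, CANCEL_FLIGHT: 0.35d, GREETING: 0.05d])
// CANCEL_FLIGHT passes (0.35 > 0.5 * 0.60); GREETING does not (0.05 < 0.30)
assert selected*.name == ["BOOK_FLIGHT", "CANCEL_FLIGHT"]
```

In this example the class names are invented; in a solution they would be the classes defined in Teneo Studio.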
Annotation | Variable | Variable | Variable | Description |
---|---|---|---|---|
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
Read more about Intent Classification in Teneo or visit the Teneo Predict section to read more about the Input Processor.
Part-of-Speech and Morphological annotations
Depending on the language configuration, the Teneo Input Processors may also set annotations carrying Part-Of-Speech and morphological information.
The POS-tagger / Morphological Analyzer creates annotations for each word in the user input with names like NN.POS, VB.POS, PAST.POS, PRESENT.POS, etc. that help to distinguish whether, for example, the word is a noun or verb. Furthermore, the Analyzer also provides annotations which indicate whether a noun is in singular or plural, or whether a verb was in the present, in the past, in the 3rd person, an imperative, etc.
The sets of annotations related to Part-of-Speech and Morphology are language specific; for more information and availability in a specific language, please refer to the POS Tagger and Morphological Analyzer.
Language annotations
The Teneo Input Processor chains include a Language Detector Input Processor which uses a machine learned model to generate an annotation for the predicted language of the user input alongside a confidence score of the prediction. The value of the confidence score reflects the probability of a tag being correct and ranges from 0 (lowest probability) to 1 (highest probability).
Annotation | Variable | Description |
---|---|---|
<language label>.LANG | confidence | The Language Detector generates an annotation for the predicted language of the user input, and the associated variable contains the confidence score (reflecting the probability of the tag being correct) |
Read more about the Language Detector in the Standard Input Processor chain, or visit the NLP Capabilities section to select the wanted Input Processor chain in the menu.
Named Entity annotations
For several languages, the Teneo Input Processors also create annotations for entities detected in user inputs, such as locations, organizations, products, etc. Annotations are also created for entities that might carry Personally Identifiable Information (PII), such as names, addresses, unique identifiers, e-mail addresses, etc.
Read more about the Named Entity Recognizer.
Custom annotations
Users can create their own custom annotations, which makes it possible to enrich user inputs with an extra layer of information tailored to a specific project. Custom annotations can, for example, label postal codes or phone numbers based on given patterns, use named-entity recognizers to find products, places or names, or tag user inputs based on a machine learning classifier, all depending on the specific needs of the given project.
The custom annotations can be created in the following places:
- Pre-matching scripts
- Global Pre(-matching) Listeners
Both Pre-matching scripts and Global Pre Listeners allow custom annotations to be created directly in the solution in Teneo Studio. This is done by calling the annotation-related Teneo Engine API methods, which allow for an easy and comprehensive use of the annotator capabilities. See an introduction to this in the following section.
The image above displays an example of a custom annotation created in the Global Pre-matching script of a solution; the script generates the annotation START and attaches it to the first word in the input sentence (if the input is not empty).
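In script form, the START example could look roughly like the sketch below. It relies on createInputAnnotation and getInputAnnotations().add as introduced in the Annotation methods section; the calls getSentences() and getWords() used for the emptiness check are assumptions that should be verified against the Teneo Engine Scripting API. The engine access object, available as _ in Teneo Studio scripts, is passed in as a parameter here so that the logic can be read in isolation.

```groovy
// Sketch of a Global Pre-matching script body (closure name is hypothetical).
// Assumption: getSentences() returns the input's List<SentenceI> and each
// sentence exposes getWords(); check both against the Scripting API.
def annotateFirstWord = { engine ->
    def sentences = engine.getSentences()
    // Only annotate when the input contains at least one word
    if (sentences && sentences[0].getWords()) {
        // START on sentence 0, word 0, without variables
        def start = engine.createInputAnnotation("START", 0, [0] as Set, null)
        engine.getInputAnnotations().add(start)
    }
}
```

Inside an actual script the closure would be called as annotateFirstWord(_), or its body used directly with _ in place of engine.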
As mentioned above, it is also possible to script annotations from Global Pre(-matching) Listeners. The example below shows the implementation of the FAVORITE_COLOR annotation, which is attached to the mentioned word if, first, the word matches the TLML syntax of the Language Object COLORS.LIST, second, it appears as the first word in the input and, third, the Language Object SENTIMENT_POSITIVE.INDICATOR is also matched (indicating that the user input talks positively about the mentioned color).
Once created, this annotation can be used for further syntax matching elsewhere in the solution, for example in a Flow as visualized below, where the TLML Syntax Match of the Flow trigger uses the annotation for the trigger matching.
The annotation can, of course, also be tested in Tryout, where the user is able to see the answer of the Flow (to the left) and the created annotation (to the right) in the Annotations view under the Input section of the Tryout window (read more about annotations in Tryout further below).
Because Global Pre Listeners are executed sequentially in a defined order, an action performed by one Listener (such as removing an annotation) may affect the syntax matching of the next one.
Annotation methods
This section provides an introduction to the method for creating a new annotation as well as a brief overview of how to update or remove an annotation; for more details, please see the Teneo Engine Scripting API:
- engineAccess: provides access to the state and functionalities of the Teneo Engine, including input annotations, for example, createInputAnnotation
- Annotation class methods: find here the available methods for the class Annotation
- AnnotationsI interface methods: a collection which contains Annotation objects; accessible through the syntax _.getInputAnnotations(), e.g., _.getInputAnnotations().add(annotation)
Create input annotation
The following method creates a new annotation instance for the given data:
```groovy
Annotation createInputAnnotation(String _sName,
                                 int _iSentenceIndex,
                                 Set<Integer> _zWordIndices,
                                 Map<String,Object> _mVariables)
```
The annotation parameters are:
- _sName: the name of the annotation which must follow the same naming conventions as Language Objects (i.e., names must be uppercased, no whitespace or other reserved characters are allowed)
- _iSentenceIndex: the index in the user input's List<SentenceI> to which this annotation belongs (the first sentence has index 0)
- _zWordIndices: the indices in the SentenceI's List<WordData> to which this annotation belongs (the first word has index 0)
- _mVariables: an arbitrary collection of key/value pairs; pass null if no variables are required.
Passing a value for the parameter _mVariables is optional; the annotation can also be created by _.createInputAnnotation(_sName, _iSentenceIndex, _zWordIndices, null).
The method may throw the following exceptions:
- NullPointerException: if the name is null or the set of word indices is null
- IllegalArgumentException: if the name is empty, the sentence index is negative or not less than the number of sentences, the set of word indices contains a negative index or an index not less than the word count of the selected sentence, or the variables map contains a null key.
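A script that builds annotation arguments dynamically may want to check them before calling the Engine. The helper below is a hypothetical pre-check that simply mirrors the documented exception rules; it does not call the Engine, and the parameter wordCounts (the number of words in each sentence of the input) is an assumption made for the sake of a self-contained example.

```groovy
// Hypothetical pre-check mirroring the documented rules; returns null if the arguments look valid.
// wordCounts holds the number of words of each sentence in the input, e.g. [3, 5].
String checkAnnotationArgs(String name, int sentenceIndex, Set<Integer> wordIndices,
                           Map<String, Object> variables, List<Integer> wordCounts) {
    if (name == null || wordIndices == null) {
        return "NullPointerException: name or word indices are null"
    }
    if (name.isEmpty()) {
        return "IllegalArgumentException: empty name"
    }
    if (sentenceIndex < 0 || sentenceIndex >= wordCounts.size()) {
        return "IllegalArgumentException: sentence index out of range"
    }
    if (wordIndices.any { it < 0 || it >= wordCounts[sentenceIndex] }) {
        return "IllegalArgumentException: word index out of range"
    }
    if (variables?.containsKey(null)) {
        return "IllegalArgumentException: null key in variables"
    }
    return null
}

// Two words in the only sentence: indices 0 and 1 are valid, sentence 1 does not exist
assert checkAnnotationArgs("TEST", 0, [0, 1] as Set, null, [2]) == null
assert checkAnnotationArgs("TEST", 1, [] as Set, null, [2]).contains("sentence index")
```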
Manage annotations
Annotations can be added, updated or removed in Pre-matching scripts and by scripting Global Pre Listeners.
For example, given the object testAnnotation, created as:
```groovy
def testAnnotation = _.createInputAnnotation("TEST", 0, [] as Set, null)
```
testAnnotation can be added:
```groovy
_.getInputAnnotations().add(testAnnotation)
```
Or updated the same way (note that "add" overwrites the annotation if it already exists):
```groovy
_.getInputAnnotations().add(_.createInputAnnotation("TEST", 0, [] as Set, ["new": "yes"]))
```
Or removed:
```groovy
_.getInputAnnotations().remove(testAnnotation)
```
Other methods make it possible to, for example, delete all annotations:
```groovy
_.getInputAnnotations().clear()
```
Annotations in TLML syntax
Annotations can be used anywhere within a solution where it is possible to use a Language Object or Entity applying the Teneo Linguistic Modeling Language syntax; the annotations are, in addition to the % (percentage) sign, also prefixed by a $ (dollar) sign. For an annotation to be fulfilled, an annotation with the same name given in the syntax must exist on the sentence itself or in one or more sentence words.
As an example, the syntax in the below image matches if the input sentence contains any word annotated as a noun, directly followed by any word annotated as a verb, directly followed by any word annotated as a pronoun.
However, when working with annotations, it often makes sense to use them together with the Extended And operators and their negated equivalents, as all of these are used-word based. With the help of these operators, it is possible to write TLML syntax that combines traditional Language Objects, Entities and attributes from the annotation layer on the same used word.
Extended And operators | Negated And operators |
---|---|
&= Same Match operator | !&= Not Same Match operator |
&> Bigger Match operator | !&> Not Bigger Match operator |
&< Smaller Match operator | !&< Not Smaller Match operator |
&~ Overlap Match operator | !&~ Not Overlap Match operator |
&^ Different Match operator | !&^ Not Different Match operator |
Read more about these operators in the Teneo Linguistic Modeling Language Manual.
Annotation variables
Annotation variables are accessible within a syntax in the same way as either NLU variables or Language Object variables.
Since annotation variables can be of any type, their values need to be converted accordingly if saved to variables of other types, as exemplified in the image below.
Annotation variables can also be accessed via scripting; in the below example script, any annotation having a name ending with .POS and where the variable confidence has a value of 0.5 or lower is removed.
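A script of the kind described above might look like the following sketch. The accessor names getAll(), getName() and getVariables() are assumptions that should be checked against the Teneo Engine Scripting API; only remove is taken from the examples earlier in this section. The annotations collection is passed in as a parameter so that the logic can be read in isolation.

```groovy
// Sketch of a script body (closure name is hypothetical).
// Assumptions to verify against the Scripting API: the collection offers getAll(),
// and each annotation offers getName() and getVariables().
def removeLowConfidencePos = { annotations ->
    // Collect the matches first so the collection is not modified while iterating over it
    def lowConfidence = annotations.getAll().findAll { annotation ->
        def confidence = annotation.getVariables()?.get("confidence")
        annotation.getName().endsWith(".POS") && confidence != null && confidence <= 0.5
    }
    lowConfidence.each { annotations.remove(it) }
}
```

In a Pre-matching script or a Global Pre Listener the closure would be called as removeLowConfidencePos(_.getInputAnnotations()). Annotations without a confidence variable are left untouched in this sketch.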
Tryout
In Teneo Studio Desktop, the visualization for tracking annotations is available in the Input section of the advanced Tryout window, where the Annotations view summarizes information concerning the annotations managed during the input processing. Hovering over the different annotations provides more information about them, such as a more detailed description or information related to variables and values. The view also highlights if an annotation is updated or deleted.
The Input Processor Results view (in the Input section of the Tryout window) displays more information regarding the annotations created by Input Processors, including whether they were added, deleted or modified, their values, etc.
The information in the Tryout related to annotations is also included in the text / CSV exports available by right-clicking and selecting Open As Text / CSV or Copy as Text / CSV.