
Intent Classification

Intent Classification uses machine learning and natural language processing to automatically associate words and expressions with a particular intent. In essence, an intent classifier automatically analyzes user inputs and categorizes them into intents in order to better understand the intentions behind end-user queries, automate processes, and gain valuable insights. Teneo provides the possibility to create an intent classifier in which machine learning algorithms build a model from the provided training data; this model is then used for making predictions.

The model is trained in the Teneo Learn component, and the Teneo Predict Input Processor uses this model at runtime to determine to which intent class or classes a user input most probably relates. Teneo Predict creates annotations for the most probable intent(s) and denotes each with a confidence score. Intent classification in Teneo is designed to associate the input with one class - the most probable one - but in cases where the prediction isn't clear and several classes are potential candidates, multiple annotations are created.

To improve the performance of a solution's intent classification, Teneo provides two options:

  • Class Performance, where the user can run Cross Validation (CV) on the solution's model in order to estimate the performance of the machine learning model, and
  • the Classifier, a workbench where the user can review how the model performed on real inputs coming from logs and, based on this information, improve the balance of the model, assign specific inputs to an intent class, or fix detected misclassifications.

Machine Learning model

The Intent Classification machine learning model is trained in the Teneo Learn component directly in Teneo Studio. The machine learning pipeline for training the model is selected automatically by Teneo Learn, using an algorithm that assigns a pipeline optimized for the amount of training data provided.
Every change to any of the classes in the Class Manager, including the addition or removal of one or more classes, triggers a new training of the solution's machine learning model in Teneo Learn as soon as the user saves the changes.

Input Processing

Tokenization

For the intent classifier to generalize well on unseen user inputs, the user inputs are pre-processed so that irrelevant features such as punctuation characters or casing do not have an impact on the machine learning prediction.
In general, this means that user inputs are lowercased and certain punctuation characters are removed. The following table contains several user inputs in English that are normalized to the same form; as a result, the intent classifier makes the same prediction for all of them.

User inputs
Hello, how are you?
Hello. How are you?
HELLO HOW ARE YOU??
hello how are you

It is vital that exactly the same pre-processing is applied both to the training data used to train the intent classifier and to the user inputs at runtime for the intent prediction. This means that pre-processing happens in two places: once for the training data in Teneo Studio, and once at prediction time when the model is called in the Teneo Predict Input Processor.
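
To make the idea concrete, the snippet below is a minimal sketch of this kind of normalization in Python, assuming only lowercasing, punctuation removal and whitespace collapsing; the actual pre-processing is performed by the language-specific Teneo Input Processors, not by this function.

    import re

    def normalize(user_input: str) -> str:
        # Simplified illustration of the normalization idea; the real pre-processing
        # is done by the language-specific Teneo Input Processors.
        text = user_input.lower()                 # casing should not affect the prediction
        text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation characters
        return " ".join(text.split())             # collapse runs of whitespace

    # All four example inputs above map to the same normalized form:
    for s in ["Hello, how are you?", "Hello. How are you?",
              "HELLO HOW ARE YOU??", "hello how are you"]:
        assert normalize(s) == "hello how are you"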

Pre-processing

For the pre-processing, the language-specific Teneo Input Processors (IPs) are used, since they are tailored to the needs of each language. The final pre-processed string is the concatenation, using a single whitespace character " ", of all the FINAL word forms of all sentences of a user input.

Note that the order of the input processors in the chain matters: all input processors that modify the ORIGINAL, SIMPLIFIED or FINAL word form before the Predict Input Processor impact the actual prediction. Input Processors placed after the Predict IP in the chain do not affect the intent classification.

Configuration

For each language, the chain of input processors used to normalize the data is defined explicitly in the language-specific Input Processor configuration file. That configuration file (config.properties) can be found and modified when exporting and re-importing a custom input processor setup (custominputprocessorsetup.zip). See the Custom Input Processor configuration section for more information on how to do this.

The configuration file contains a mandatory property languageproperty.normalizationIPs that takes as argument an ordered, comma-separated list of Input Processors to be applied when normalizing the data. Note that at least one Input Processor must be provided.
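
As an illustration, such an entry could look like the excerpt below; the property name is the one documented here, while the listed Input Processors are simply the standard ones from the table further down, and the exact file layout may differ.

    # Illustrative excerpt from a language-specific config.properties
    # (ordered, comma-separated list of Input Processors used for normalization)
    languageproperty.normalizationIPs=StandardSplitting,StandardAutoCorrection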

Be aware that the Input Processors defined in the property languageproperty.normalizationIPs must occur, in the same order, in the definition of the input processor chain (inputProcessorHandler.inputProcessor.class.) before the Predict Input Processor. Also, Input Processors that modify the FINAL word form must not occur before the Predict Input Processor in the chain definition unless they are explicitly listed, in the right order, in the property languageproperty.normalizationIPs.

Note that the Simplifier defined in the configuration file is always run when normalizing the data; it usually affects the FINAL word form and thus the normalized output, unless otherwise defined in the input processors.

Standard configuration

The following table shows the Input Processors which are used by default to normalize the data. Again, note that by default the Simplifiers are also applied.

Language                                   Input Processors
Standard (all languages except below)      StandardSplitting, StandardAutoCorrection
Chinese                                    ChineseTokenizerIP
Finnish                                    StandardSplitting, FinnishSplitting, StandardAutoCorrection
Japanese                                   JapaneseTokenizer, JapaneseConcatenator
Turkish                                    TurkishAnalyzer

Please see the language-specific information in the Input Processors section.

Generation of annotations

Whenever a user input is received, a confidence score for each of the classes is calculated based on the machine learning model, and annotations are created for the most probable intent classes for that input (following the scheme <CLASS_NAME>.INTENT). For the class with the highest confidence (i.e. the most probable one), Teneo Predict additionally generates a top-rated annotation tagged with the TOP_INTENT suffix.

By default, Teneo generates annotations for up to five intent classes, and an annotation is only created for an intent whose confidence is at least half of the top intent's confidence. As an example, imagine the machine learning model predicts these top five intent classes for a particular user input:

A 0.14
B 0.11
C 0.06
D 0.05
E 0.03

Teneo will only generate the following annotations:

A.TOP_INTENT confidence 0.14
A.INTENT confidence 0.14
B.INTENT confidence 0.11

In this example, there are no annotations for C, D, or E. The reason is that the confidence values of C (0.06), D (0.05) and E (0.03) are lower than the top intent's confidence (0.14) divided by two (0.07).
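
The rule can be summarized with the following sketch in Python; it is an illustration of the behavior described above rather than the Teneo Predict implementation, and it assumes the boundary is inclusive (a confidence equal to half of the top confidence still produces an annotation).

    def generate_annotations(predictions, max_intents=5):
        # predictions: class name -> confidence, e.g. {"A": 0.14, "B": 0.11, ...}
        ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:max_intents]
        top_class, top_conf = ranked[0]
        annotations = [(f"{top_class}.TOP_INTENT", top_conf)]          # most probable class
        annotations += [(f"{name}.INTENT", conf)
                        for name, conf in ranked if conf >= top_conf / 2]
        return annotations

    print(generate_annotations({"A": 0.14, "B": 0.11, "C": 0.06, "D": 0.05, "E": 0.03}))
    # [('A.TOP_INTENT', 0.14), ('A.INTENT', 0.14), ('B.INTENT', 0.11)]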

Global Confidence threshold

The Confidence threshold of the machine learning classifier determines the minimum confidence value the model must assign to a class for a Class Match to be considered for triggering. The confidence threshold is a numeric value between 0 and 1; by default it is 0.45, and it can be modified in the Solution Properties.
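
In other words, the threshold acts as a simple gate on the model's confidence; a minimal sketch, assuming the default value of 0.45:

    DEFAULT_CONFIDENCE_THRESHOLD = 0.45   # default; configurable in the Solution Properties

    def is_class_match_candidate(confidence, threshold=DEFAULT_CONFIDENCE_THRESHOLD):
        # A Class Match is only considered for triggering when the model's
        # confidence for that class reaches the solution-wide threshold.
        return confidence >= threshold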

Cross Validation

Cross validation is a statistical method for estimating the performance of a machine learning model. In Teneo Studio, this process is available through the Class Performance functionality, which allows users to check the performance of the solution's machine learning model and analyze which classes conflict with one another. The process is described in more detail in the following sections; in essence, a machine learning model is trained using a subset of the training data and then evaluated using the complementary subset in order to get an estimation of the performance. The results are displayed to the user in the Confidence Threshold graph and the Class Performance table, where the current cross validation can be compared with previously run ones.

Process

The evaluation is performed using a k-fold cross validation process: all the training data of the solution's classes is split into K folds and, for each fold, a machine learning model is trained on the remaining K-1 folds. Once the training is completed, the performance of the ML model is evaluated against the held-out fold; the results of all these evaluations are averaged to get an estimation of the performance of the machine learning model.
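
The following sketch outlines the k-fold procedure in Python; it is a generic illustration of the method, not the Teneo Learn implementation, and train_and_score stands for a hypothetical train-and-evaluate step.

    import random

    def k_fold_cross_validation(examples, train_and_score, k=5):
        # Generic sketch of k-fold cross validation (not Teneo Learn itself).
        shuffled = examples[:]
        random.shuffle(shuffled)                    # the random split makes the estimate stochastic
        folds = [shuffled[i::k] for i in range(k)]  # split the training data into K folds
        scores = []
        for i, held_out in enumerate(folds):
            training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(train_and_score(training, held_out))   # train on K-1 folds, evaluate on the held-out one
        return sum(scores) / k                      # average the K evaluations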

It is important to remember that cross validation is an estimation of the performance of the model, which can only be directly measured with a test dataset, and that it is stochastic in nature. This means that different executions of the cross-validation process may give different results, because the training data is randomly split into folds. Results are likely to be stable for homogeneous classes with a large amount of training data, while a high variance could be a symptom of excessive heterogeneity among the training data of some classes.

Metrics

The following metrics are used to evaluate the performance; these are standard metrics for performance measurement of machine learning models and an in-depth description can be found on Wikipedia.

  • Precision measures the percentage of the detected positives which were really positive matches, i.e., to which degree one can rely on the classifier having marked as positive only training data that is positive.
  • Recall measures how many of all the positive matches were successfully retrieved, i.e., how confident one can be that the classifier retrieved all the existing positives from the dataset.
  • F1 is the harmonic mean of Precision and Recall. It is usually used as a general measure of classifier performance.
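
Expressed as a small calculation, the three metrics are computed from the counts of true positives, false positives and false negatives:

    def precision_recall_f1(true_positives, false_positives, false_negatives):
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
        return precision, recall, f1

    # Example: 8 correct detections, 2 wrong detections, 4 missed examples
    # gives precision 0.8, recall ~0.67 and F1 ~0.73.
    print(precision_recall_f1(8, 2, 4))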

Data limitations

Old executions of cross validation are kept in Teneo for comparison purposes, but historic data has some size and time limitations:

  • Failed cross validation executions are kept for one week, after which they are removed from the server. If the Studio backend service is restarted while a cross validation process is running, the process is stopped and marked as failed; in this case, the user has to start the process again.
  • There is no time limitation for successful cross validation executions, but the number of executions stored on the server is limited; by default, only the last 20 are kept, but this configuration may be changed on the server.

Views in Teneo Studio

Confidence Threshold Graph

Whenever a new input arrives at the conversational AI application, the machine learning model analyzes that input and generates a set of class predictions, i.e., for each class in the model it assigns a probability that the input belongs to that class. A top intent annotation is always created for the most probable class, and if the probability value exceeds the solution-wide confidence threshold, that class is considered for triggering (the trigger actually selected also depends on other factors in Teneo, e.g., Ordering, other defined Matches, etc.).

The solution threshold is set up under the assumption that predictions with a very low degree of confidence will most likely be wrong. The thresholding process can therefore be thought of as a binary classifier that determines whether the predictions of the machine learning model are reliable for a given input or not (based only on the prediction confidence).

The purpose of the Confidence Threshold graph is to provide a tool to analyze the estimated performance of the classes in the solution with regard to this threshold setting.

Confidence Threshold Graph

The view shows the values of the classification metrics for the thresholding process at each value of the confidence threshold in the [0, 1] range, i.e., the threshold is treated as a binary classifier whose positive examples are accepted predictions of the model (predictions with a confidence above the threshold) and whose negative examples are rejected predictions (predictions with a confidence below the threshold).

In this context, the performance metrics can be interpreted in the following way:

  • Precision measures the percentage of the accepted inputs (classifications with a confidence over the threshold) that were rightfully accepted, i.e., they correspond to training data that was correctly classified by the model.
  • Recall measures the percentage of the correct inputs (training data that was correctly classified by the model) that were accepted by the threshold.
  • F1 has the usual meaning as the harmonic mean between the other two metrics.
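
As an illustration of how the points of such a graph could be derived, the sketch below sweeps the threshold over the [0, 1] range and computes the three metrics for the accept/reject decision; it is a simplified reconstruction from the definitions above, not Teneo's implementation, and results stands for hypothetical cross-validation output pairs of (confidence, prediction was correct).

    def threshold_curve(results, thresholds):
        # results: list of (confidence, was_correct) pairs from cross validation
        curve = []
        for t in thresholds:
            tp = sum(1 for conf, ok in results if conf >= t and ok)      # rightfully accepted
            fp = sum(1 for conf, ok in results if conf >= t and not ok)  # wrongly accepted
            fn = sum(1 for conf, ok in results if conf < t and ok)       # correct but rejected
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            curve.append((t, precision, recall, f1))
        return curve

    # e.g. threshold_curve(results, [i / 100 for i in range(101)])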

The values on this graph can be used to decide where to set the solution confidence threshold. There is no golden rule for setting this value, but consider the following pieces of advice when taking the decision:

  • A high threshold value will reject dubious predictions, so if the solution contains triggers which depend on the classes, it will be highly improbable to mistakenly trigger a Flow based on the machine learning predictions. On the other hand, this will cause many correct predictions to be discarded due to a low confidence score. A high threshold is probably wanted when the consequences of marking a wrong input as positive are worse than those of marking a correct one as negative (a typical example is spam classifiers).
  • A low threshold value will make the solution accept more predictions, so one can be confident that most of the time a Flow that should be triggered by a machine learning prediction actually will be. Conversely, this will increase the probability of triggering a Flow with an incorrect prediction. A low threshold is probably wanted when the consequences of losing a message are worse than those of processing an incorrect one (emergency services would be a typical example).

Setting the threshold implies a trade-off between these two situations; the appropriate value will depend on the particular use case and project.

Class Performance Table

The Class Performance table shows the performance metrics for each of the classes in the solution, including how many errors correspond to false positives (FP), i.e., predictions where the classifier assigned that class when it should have assigned another, and how many to false negatives (FN), i.e., predictions where the classifier assigned another class when it should have assigned the analyzed one.
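
For a given class, the two error counts can be sketched as follows; pairs is a hypothetical list of (ground truth class, predicted class) tuples produced during cross validation.

    def class_errors(pairs, target_class):
        false_positives = sum(1 for true, pred in pairs
                              if pred == target_class and true != target_class)  # wrongly assigned this class
        false_negatives = sum(1 for true, pred in pairs
                              if true == target_class and pred != target_class)  # this class was missed
        return false_positives, false_negatives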

Class Performance table

The table displays one row for each class and a single row with the average values for all classes. For each row, the following columns are displayed:

  • Class name, the name of the class.
  • Precision, Recall, F1, the binary classification metrics for the row's class, i.e., for all the training data examples whose ground truth class is the row's class, predictions assigned to that class are considered positives and any other predictions negatives.
  • Examples, the number of training data examples of that class at the moment the cross validation was executed.
  • Conflicting classes, the number of mistaken predictions of the model. Those predictions can be either false positives (FP) or false negatives (FN); the arrow at the end of the column unfolds a list of rows inside the cell, each one specifying one of the classes that was confused with the class of the row, the kind of error, and the percentage of classified training data that suffered from that kind of error for the particular class.

All the numeric columns and the class name are sortable, and classes which appear in the current execution but didn't exist in the historic ones are marked with a star.

If the user selected an old cross validation to compare with, differences from the current run are displayed as deltas on all the numeric values, with a green background if the metric has improved since the older execution and a red background otherwise.