Input Processor API

Introduction

In the Teneo Platform, splitting the user input text into sentences and words, spelling correction and other processing that depends on the solution language are handled by separate modules: the Input Processors (IPs) and the Simplifier. These plug into Teneo Engine via the well-defined Input Processor API. This makes it quick to add support for new languages and to create input processors for specific language handling, such as Part-of-Speech (POS) tagging, Named Entity Recognition and morphological analysis.

The Input Processors and Simplifiers can be developed and tested independently from Teneo Engine, even by customers.

With the Input Processor API, input annotations are a central feature of input pre-processing in Teneo Engine. The API allows custom or third-party Natural Language Processing (NLP) tools, like POS taggers and entity extractors, to be integrated into Teneo.

The Engine Scripting API reference includes the Input Processor package.

Architecture

The Input Processor API defines these main types:

Input processor chain
Input processor
SentenceData (sentence)
WordData (word)
Annotation
Simplifier
Dictionary

Input pre-processing is typically handled by multiple input processors organized as an input processor chain, in which each processor is called in sequence. The output of one input processor is forwarded as input to the next input processor, which can add to this data or modify it.

The input processors generate, at least, sentences and words by tokenizing the user input text in a suitable way. These words are the units later matched against match requirements in the Teneo Engine.

Input processors may also add annotations to the output data. An annotation is assigned to a sentence and optionally to one or more words in the sentence. Annotations are named entities that, for example, represent data generated by third party NLP tools, like POS taggers and Entity extractors, but may also represent certain properties of the entire user input text or an entire sentence.

In this process, the Simplifier serves to normalize the word data generated by the Input Processors. It is also utilized by the language condition parser of Teneo Engine.

The Dictionary provides a list of words that may be used by input processors. It is provided to the input processor API on initialization of an input processor chain.

Input Processor chain

The Input Processor chain provides the main entry point of the Input Processor API. It handles processing of user inputs by sequential calls to the Input Processors, as well as Input Processor initialization and shutdown. The Input Processor chain furthermore holds a reference to the Simplifier.

Input Processor

Instances of Input Processors are called by the Input Processor chain; they perform the processing of user inputs. The user input text, additional request parameters, a list of sentences (containing words), a set of annotations and option properties are passed to the processing method of an Input Processor.

The Input Processor may modify or add to the sentences, words and annotations, based on their current state or on the other parameters passed.

SentenceData

A SentenceData object represents a sentence in the user input. It contains the sentence text (a substring of the user input text), the character position in the user input text where the sentence begins and zero or more words.

WordData

A WordData object represents a word in a sentence. It contains the word in different forms: the original form (a substring of the sentence text), the simplified form (usually the result of the original word form after normalization by the Simplifier) and the final word form (e.g. the spelling-corrected version of the simplified word form). It also contains the character position in the sentence text where the original form of the word begins and an arbitrary collection of key/value pairs (properties).
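
The relationship between the sentence and word offsets described above can be illustrated with a small, self-contained sketch. The classes WordSketch and SentenceSketch are hypothetical stand-ins written for this illustration; they are not the Teneo types, and the real API may expose the same data differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WordData; NOT the Teneo class.
final class WordSketch {
    final String originalForm;   // substring of the sentence text
    final String simplifiedForm; // original form after normalization
    String finalForm;            // e.g. the spelling-corrected simplified form
    final int beginIndex;        // offset of the original form in the sentence text
    final Map<String, String> properties = new HashMap<>();

    WordSketch(String originalForm, String simplifiedForm, int beginIndex) {
        this.originalForm = originalForm;
        this.simplifiedForm = simplifiedForm;
        this.finalForm = simplifiedForm; // same as the simplified form until corrected
        this.beginIndex = beginIndex;
    }
}

// Hypothetical stand-in for SentenceData; NOT the Teneo class.
final class SentenceSketch {
    final String text;    // substring of the user input text
    final int beginIndex; // offset of the sentence in the user input text
    final List<WordSketch> words = new ArrayList<>();

    SentenceSketch(String text, int beginIndex) {
        this.text = text;
        this.beginIndex = beginIndex;
    }
}

final class DataModelDemo {
    static final String INPUT = "Hi there. Helo World";

    // Builds the second sentence of INPUT by hand to show how the offsets relate.
    static SentenceSketch secondSentence() {
        int begin = INPUT.indexOf("Helo");
        SentenceSketch sentence = new SentenceSketch(INPUT.substring(begin), begin);

        WordSketch first = new WordSketch("Helo", "helo", 0);
        first.finalForm = "hello"; // a spelling corrector might set this
        first.properties.put("corrected", "true");
        sentence.words.add(first);

        sentence.words.add(new WordSketch("World", "world", 5));
        return sentence;
    }

    public static void main(String[] args) {
        SentenceSketch s = secondSentence();
        for (WordSketch w : s.words) {
            // word offsets are relative to the sentence, sentence offsets to the input
            int absolute = s.beginIndex + w.beginIndex;
            System.out.println(w.originalForm + " -> " + w.finalForm + " @ " + absolute);
        }
    }
}
```

Because word positions are relative to the sentence text, the position of a word in the full user input is the sum of the sentence begin index and the word begin index.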

Annotation

An annotation is a named entity that represents a property of a word (e.g. data generated by third party NLP tools), a sentence or even the entire user input text. An annotation is assigned to a sentence and may be assigned to one or more words in the sentence. Optionally, variables (key/value pairs) may be attached to annotations. Annotations are managed in a set that allows adding or removing an annotation or searching annotations by certain criteria.
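
As a rough illustration of such a managed set, the sketch below models annotations that belong to a sentence, optionally cover some of its words, and carry variables. AnnotationSketch and AnnotationSetSketch, together with their method names and the annotation names used in the example, are assumptions made for this illustration and do not reflect the actual Teneo interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical annotation type; the real API may differ.
final class AnnotationSketch {
    final String name;
    final int sentenceIndex;
    // empty set: the annotation applies to the sentence as a whole
    final Set<Integer> wordIndices = new TreeSet<>();
    final Map<String, Object> variables = new HashMap<>();

    AnnotationSketch(String name, int sentenceIndex, int... wordIndices) {
        this.name = name;
        this.sentenceIndex = sentenceIndex;
        for (int index : wordIndices) {
            this.wordIndices.add(index);
        }
    }
}

// Hypothetical annotation set supporting add, remove and two simple searches.
final class AnnotationSetSketch {
    private final List<AnnotationSketch> items = new ArrayList<>();

    void add(AnnotationSketch annotation) { items.add(annotation); }

    boolean remove(AnnotationSketch annotation) { return items.remove(annotation); }

    int size() { return items.size(); }

    List<AnnotationSketch> findByName(String name) {
        List<AnnotationSketch> found = new ArrayList<>();
        for (AnnotationSketch a : items) {
            if (a.name.equals(name)) found.add(a);
        }
        return found;
    }

    // Returns only annotations explicitly assigned to the given word.
    List<AnnotationSketch> findByWord(int sentenceIndex, int wordIndex) {
        List<AnnotationSketch> found = new ArrayList<>();
        for (AnnotationSketch a : items) {
            if (a.sentenceIndex == sentenceIndex && a.wordIndices.contains(wordIndex)) {
                found.add(a);
            }
        }
        return found;
    }

    // Example for the sentence "book a flight to Paris" (sentence 0).
    static AnnotationSetSketch example() {
        AnnotationSetSketch set = new AnnotationSetSketch();
        set.add(new AnnotationSketch("POS_NOUN", 0, 2));      // "flight"
        AnnotationSketch location = new AnnotationSketch("LOCATION", 0, 4); // "Paris"
        location.variables.put("city", "Paris");
        set.add(location);
        set.add(new AnnotationSketch("REQUEST", 0));          // whole sentence
        return set;
    }
}
```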

Simplifier

The Simplifier is a separate processing unit (it is not an input processor) which provides a method to normalize some text (usually, but not necessarily, a word). Here "normalization" means removing text properties that are semantically insignificant, for example by converting to lower case, removing some accents and normalizing Unicode combining characters. By default, Input Processors call the Simplifier when they generate a new word item.

Furthermore, the Simplifier is called by the language condition parser of Teneo Engine when it stores a language condition word in the solution dictionary.

The Input Processor API provides a standard implementation of a Simplifier suitable for European languages.
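
To make the idea of normalization concrete, here is a minimal sketch built on the standard java.text.Normalizer class. It is not the standard Simplifier shipped with the API: the class name SimplifierSketch is hypothetical, and the sketch strips every combining mark, which is broader than the "some accents" the standard implementation is described as removing.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical normalizer; NOT the Simplifier implementation of the API.
final class SimplifierSketch {
    static String simplify(String text) {
        // decompose precomposed characters into base character + combining marks
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // drop all combining marks (assumption: every accent is insignificant)
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // recompose whatever is left and fold case independently of the default locale
        return Normalizer.normalize(stripped, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // precomposed E-acute and "e" followed by a combining acute normalize alike
        System.out.println(simplify("\u00C9cole"));
        System.out.println(simplify("Cafe\u0301"));
    }
}
```

Applying the same function to both the dictionary words and the words found in the user input is what lets differently cased or differently encoded spellings of a word compare as equal.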

Dictionary

The Dictionary provides a list of words that may be used by input processors, e.g. to apply spelling-correction to words found in the user input text. Typically, the Teneo Engine provides the dictionary and it contains the words used in the language conditions of a solution. The Dictionary is provided to the Input Processor API on initialization of an Input Processor chain.
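
The text does not say how a spelling-correcting processor uses the dictionary. As one possible approach, the sketch below replaces a word by the closest dictionary word within a maximum Levenshtein edit distance. The class SpellingSketch, its methods and the plain word list standing in for the Dictionary are all assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical spelling corrector; the word list stands in for the Dictionary.
final class SpellingSketch {
    // Levenshtein distance computed with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary word within maxDistance, else the word itself.
    static String correct(String word, List<String> dictionary, int maxDistance) {
        String best = word;
        int bestDistance = maxDistance + 1;
        for (String candidate : dictionary) {
            int d = distance(word, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("hello", "world");
        System.out.println(correct("helo", dictionary, 1));
    }
}
```

In the terms used earlier, the corrected word would become the final word form, while the original and simplified forms stay unchanged.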

Functionality

This section describes the functionality of the Input Processor API from a life-cycle perspective.

Initialization phase

The Input Processor chain is the main entry point for the application to the Input Processor API. An Input Processor chain needs to be created and initialized before it can be used to process a user input.

After an Input Processor chain has been created, it must be given a Simplifier instance and an ordered list of Input Processor instances. Next, the initialization method needs to be called on the Input Processor chain; it initializes the Input Processors in the given order and handles initialization failures.

Each Input Processor may require an arbitrary set of configuration properties to be passed to its initialization method. The combined set of configuration properties for all input processors in the list is passed into the Input Processor chain. From this set, the Input Processor chain selects the properties for a particular Input Processor based on its class name.
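
The text above does not specify how property keys are related to class names. Purely as an illustration, the sketch below assumes a hypothetical convention in which each key in the combined set is prefixed with the simple class name of the processor and a dot; the real key format may well be different. SplitterStub and SpellingStub are empty placeholder classes, not real Input Processors.

```java
import java.util.Properties;

// Empty placeholders standing in for two Input Processor classes.
final class SplitterStub { }
final class SpellingStub { }

final class PropertySelectionSketch {
    // ASSUMPTION: keys look like "<SimpleClassName>.<property>".
    static Properties selectFor(Class<?> processorClass, Properties combined) {
        String prefix = processorClass.getSimpleName() + ".";
        Properties selected = new Properties();
        for (String key : combined.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                // hand the processor its own properties without the prefix
                selected.setProperty(key.substring(prefix.length()), combined.getProperty(key));
            }
        }
        return selected;
    }

    // A combined property set for both placeholder processors.
    static Properties combinedExample() {
        Properties combined = new Properties();
        combined.setProperty("SplitterStub.sentenceDelimiter", "[.!?]");
        combined.setProperty("SplitterStub.wordDelimiter", "\\s+");
        combined.setProperty("SpellingStub.maxDistance", "1");
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(selectFor(SplitterStub.class, combinedExample()));
    }
}
```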

The Simplifier instance needs to have been initialized beforehand, as it will not be initialized by the Input Processor chain. This is required because the Simplifier may be needed to set up the Dictionary that is passed into the initialization method of the Input Processor chain.

User input processing

The processing of a user input is started by calling the processing method on the Input Processor chain, passing in the user input text, any additional request parameters (key/value pairs) and optional processing control options. The Input Processors may use certain request parameters to modify the processing functionality depending on the request data. In contrast, the control options are used by the caller of the Input Processor API to modify the Input Processor functionality depending on the call context, but independent from the request data, e.g. to disable an Input Processor that applies spelling-correction.

The Input Processor chain calls the processing method of each Input Processor in sequence, starting at the first Input Processor in the list. An empty list of sentences and an empty set of annotations are passed into the first Input Processor.

After the processing method of an Input Processor returns, the Input Processor chain calls the processing method of the next Input Processor in the list. It passes in the same objects that were passed into the previous Input Processor. The Input Processors cannot modify the user input text, the request parameters or the control options. However, they can modify the list of sentences and the set of annotations, so the result of one Input Processor becomes the input to the next.

The processing method of an Input Processor may modify the list of sentences, e.g. by adding a sentence item for each sentence found in the user input text, with a word item for each word found in the sentence text. An Input Processor may also modify the sentences and words it receives as the result of the previous Input Processor. In the same way, each Input Processor may modify the set of annotations by adding, modifying or deleting annotation items.

The state of the list of sentences and the set of annotations after the last Input Processor in the list has returned becomes the result of the processing method of the Input Processor chain.
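
The processing flow described in this section can be sketched in a few lines. The types below (ProcessorSketch, ChainSketch, ChainResultSketch) are hypothetical simplifications written for this illustration: sentences and annotations are reduced to plain strings, control options are left out, and none of the names come from the actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified processor interface.
interface ProcessorSketch {
    void process(String inputText, Map<String, String> parameters,
                 List<String> sentences, List<String> annotations);
}

// Holds the shared, mutable data that is handed from processor to processor.
final class ChainResultSketch {
    final List<String> sentences = new ArrayList<>();
    final List<String> annotations = new ArrayList<>();
}

final class ChainSketch {
    private final List<ProcessorSketch> processors;

    ChainSketch(List<ProcessorSketch> processors) {
        this.processors = new ArrayList<>(processors);
    }

    ChainResultSketch process(String inputText, Map<String, String> parameters) {
        // the first processor sees an empty sentence list and an empty annotation set
        ChainResultSketch result = new ChainResultSketch();
        // processors may read the request parameters but not change them
        Map<String, String> readOnly = Collections.unmodifiableMap(parameters);
        for (ProcessorSketch processor : processors) {
            // every processor receives the same objects, as left by its predecessor
            processor.process(inputText, readOnly, result.sentences, result.annotations);
        }
        return result;
    }

    // A two-step chain: split into sentences, then annotate questions.
    static ChainSketch example() {
        ProcessorSketch splitter = (text, params, sentences, annotations) -> {
            for (String s : text.split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) sentences.add(s);
            }
        };
        ProcessorSketch questionAnnotator = (text, params, sentences, annotations) -> {
            for (int i = 0; i < sentences.size(); i++) {
                if (sentences.get(i).endsWith("?")) annotations.add("QUESTION@" + i);
            }
        };
        return new ChainSketch(Arrays.asList(splitter, questionAnnotator));
    }
}
```

The second processor in the example only works because the first one has already filled the sentence list, which is why the order of processors in the chain matters.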

Role of the Simplifier

The Simplifier is a separate unit that may be called by the Input Processors, but is not an Input Processor itself. Thus, the Input Processor chain will never call it directly during processing of a user input.

An application of the Input Processor API may call the Simplifier directly, usually during the setting up of the Dictionary that will be passed into the Input Processor chain in the initialization phase.

Typical functionality of input processors

The overall input processing functionality managed by an Input Processor chain is typically distributed over multiple Input Processors. This makes it possible to reconfigure an input processing chain according to the needs of the application and to reuse Input Processors in different setups of an Input Processor chain, usually depending on the language of the text to process.

Typically, a separate Input Processor instance exists for each class of functionality, such as

splitting (tokenization) of user input text into sentences and words,
spelling correction of words, and
creation of annotations.

Implementation examples

The section below gives simple code examples that show the basics of an Input Processor implementation, how to set up the Input Processor chain and how to process a user input.

Generating sentences and words

The SplittingProcessor class implements an Input Processor that splits the user input text into sentences and words at matches of configurable regular expressions. The sentence and word delimiters are passed as properties to the initialization method of the SplittingProcessor.

groovy

/*
 * SplittingProcessor.java
 */

package com.artisol.ipapiexample;

import com.artisol.teneo.engine.core.inputprocessor.AnnotationsI;
import com.artisol.teneo.engine.core.inputprocessor.DictionaryI;
import com.artisol.teneo.engine.core.inputprocessor.InputProcessor;
import com.artisol.teneo.engine.core.inputprocessor.SentenceI;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/**
 * This input processor tokenizes the user input text by splitting it at
 * configurable sentence and word delimiters.
 */
public class SplittingProcessor extends InputProcessor
{
	public static final String PROPERTY_SENTENCE_DELIMITER = "sentenceDelimiter";
	public static final String PROPERTY_WORD_DELIMITER = "wordDelimiter";

	private String sSentenceDelimiter;
	private String sWordDelimiter;

	@Override
	protected void initialize(Properties _properties, DictionaryI _dictionary) throws Exception
	{
		// both delimiters are regular expressions, as expected by String.split()
		sSentenceDelimiter = _properties.getProperty(PROPERTY_SENTENCE_DELIMITER);
		sWordDelimiter = _properties.getProperty(PROPERTY_WORD_DELIMITER);
	}

	@Override
	protected void process(String _sUserInputText,
							Map<String, String> _mInputParameters,
							Properties _options,
							DictionaryI _dictionaryExtension,
							List<SentenceI> _lSentences,
							AnnotationsI _annotations)
	{
		// split the input text at the configured delimiter
		String[] sentenceTexts = _sUserInputText.split(sSentenceDelimiter);
		int iSentenceSearchIndex = 0;

		// create a sentence item for each sentence text
		for (String sSentenceText : sentenceTexts)
		{
			// locate the sentence in the input text, so that the delimiter
			// characters skipped by split() count towards its begin position
			// (assumes the sentence text does not also occur inside a delimiter match)
			int iSentenceBeginIndex = _sUserInputText.indexOf(sSentenceText, iSentenceSearchIndex);
			iSentenceSearchIndex = iSentenceBeginIndex + sSentenceText.length();

			SentenceI sentence = createSentence(sSentenceText, iSentenceBeginIndex);

			// split the sentence text at the configured delimiter
			String[] wordTexts = sSentenceText.split(sWordDelimiter);
			int iWordSearchIndex = 0;

			// create a word item for each non-empty word text
			for (String sWordText : wordTexts)
			{
				String sTrimmedWordText = sWordText.trim();
				if (sTrimmedWordText.isEmpty())
					continue;

				// locate the trimmed word in the sentence text (same assumption as above)
				int iWordBeginIndex = sSentenceText.indexOf(sTrimmedWordText, iWordSearchIndex);
				sentence.getWords().add(createWordData(sTrimmedWordText, iWordBeginIndex));
				iWordSearchIndex = iWordBeginIndex + sTrimmedWordText.length();
			}

			_lSentences.add(sentence);
		}
	}

	@Override
	protected boolean canHandleDictionaryExtension()
	{
		return true;
	}
}

Managing the input processor chain

The following class shows how to create and initialize the input processor chain, the two example Input Processors and the Simplifier. It also shows how to invoke the Input Processor chain to process a user input text. When run, it prints the resulting sentence, words and annotations.