The BJSS AI team carries out many conversational AI projects on behalf of our commercial clients. Because we work across many industries, including government, healthcare and financial services, our clients often face stringent regulatory requirements that we have to address through our technical solutions. One challenge we have been thinking about recently is what it means to test and assure that a machine learning classifier will perform as expected.
Contrary to popular belief, most modern chatbot implementations go beyond simple rule-based logic, which means it’s not always straightforward to predict how they will behave. Many of the latest chatbot, conversational interface and virtual assistant solutions rely on machine learning classifiers to automate human interactions and to supply intent classification and entity extraction, both crucial subroutines in any chatbot runtime.
These algorithms, although often far better than rule-based approaches at discerning the intent behind a particular user input or extracting a parameter, pose an interesting problem for any tester of a conversational product: how does one test and assure that, for a given set of known inputs (the training data), a natural language model will supply the correct classification every time?
The nature of natural language processing in production means that a live implementation can expect any number of unstructured inputs from end users, most variants of which cannot be predicted in advance. As one cannot test for infinite variance before deploying a solution, at BJSS we have focused our efforts on testing what we can account for: at the very least, the training data that the classifier was trained on, when fed back in as input, should return the correct classification 100% of the time.
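As a minimal sketch of that baseline check, the snippet below replays every training phrase through a classifier and reports the share that come back with the intent they were filed under. The names `training_pass_rate` and `classify`, and the dictionary layout of the training data, are assumptions made for illustration; they do not correspond to any particular NLU platform's API.

```python
from typing import Callable, Dict, List


def training_pass_rate(
    classify: Callable[[str], str],
    training_data: Dict[str, List[str]],
) -> float:
    """Return the fraction of training phrases classified as their own intent.

    `classify` is a hypothetical callable mapping a phrase to an intent name.
    `training_data` maps each intent name to its list of training phrases.
    """
    total = 0
    correct = 0
    for intent, phrases in training_data.items():
        for phrase in phrases:
            total += 1
            if classify(phrase) == intent:
                correct += 1
    # With no phrases there is nothing that can fail, so report a full pass.
    return correct / total if total else 1.0
```

Under the goal stated above, a healthy model would return 1.0 here; anything lower means at least one training phrase no longer maps back to its own intent.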
We have given this measurement methodology the title of ‘Bot Health’, with the goal of tracking this score as a function of model mutations, i.e. training cycles, epochs, etc. Our aim is to create a benchmark at T=0 that gives a tester a ‘confusion rate’ whose deviation can be tracked across each subsequent mutation of the natural language classifier model.
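Tracking deviation from the T=0 benchmark could look something like the sketch below. It assumes the confusion rate for each training cycle has already been recorded in order; `flag_regressions` and its `tolerance` parameter are hypothetical names introduced here, not part of the Bot Health tooling itself.

```python
from typing import List


def flag_regressions(rates_by_cycle: List[float], tolerance: float = 0.0) -> List[int]:
    """Return the indices of training cycles whose confusion rate exceeds
    the T=0 baseline by more than `tolerance`.

    `rates_by_cycle[0]` is treated as the benchmark taken at T=0.
    """
    if not rates_by_cycle:
        return []
    baseline = rates_by_cycle[0]
    return [
        cycle
        for cycle, rate in enumerate(rates_by_cycle)
        if rate - baseline > tolerance
    ]
```

A non-empty result tells the tester which model mutations made the classifier more confused than the benchmark, and are therefore worth investigating first.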
The model’s confusion rate can be ascertained at the end of each new natural language training cycle by testing the model against every training phrase it was trained on, along with a negative set of inputs for which we want no classification returned. In this way we test whether the correct intent was or was not triggered, given a defined confidence threshold.
As an output, a four-state confusion matrix can be constructed, with each input classified into one of four states (within a defined confidence threshold):
- True positive: The training phrase returned an intent classification that was correct
- False positive: The training phrase returned an intent classification that was incorrect
- True negative: The training phrase correctly returned no classification
- False negative: The training phrase incorrectly returned no classification
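The four states above can be sketched in code as follows. This is an illustration under stated assumptions rather than the actual Bot Health implementation: `Outcome`, `classify_outcome` and the default threshold of 0.7 are hypothetical, an expected intent of `None` marks an input from the negative set, and any prediction below the threshold is treated as "no classification".

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TRUE_POSITIVE = "TP"
    FALSE_POSITIVE = "FP"
    TRUE_NEGATIVE = "TN"
    FALSE_NEGATIVE = "FN"


def classify_outcome(
    expected_intent: Optional[str],
    predicted_intent: Optional[str],
    confidence: float,
    threshold: float = 0.7,  # assumed value, for illustration only
) -> Outcome:
    """Assign one tested input to one of the four confusion matrix states."""
    # A prediction below the confidence threshold counts as no classification.
    returned = predicted_intent if confidence >= threshold else None

    if returned is None:
        # Nothing was returned: correct only if nothing was expected.
        if expected_intent is None:
            return Outcome.TRUE_NEGATIVE
        return Outcome.FALSE_NEGATIVE

    # Something was returned: correct only if it matches the expected intent.
    if returned == expected_intent:
        return Outcome.TRUE_POSITIVE
    return Outcome.FALSE_POSITIVE
```

Note that in this sketch an input from the negative set that triggers any intent above the threshold is counted as a false positive, since a classification was returned and it was incorrect.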
Once the matrix has been constructed, a trackable confusion rate can be calculated through the following formula:
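As an illustration only, and assuming the confusion rate is simply the share of tested inputs that land in either of the two error states (other weightings are possible), it could be written as:

```latex
\[
\text{confusion rate} = \frac{FP + FN}{TP + FP + TN + FN}
\]
```

Here $TP$, $FP$, $TN$ and $FN$ are the counts of inputs in each of the four states. Under this assumed definition, a rate of 0 corresponds to every tested input being classified correctly.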
Because the confusion matrix state is obtained from an input that is known and stored, a major benefit of this testing methodology is the ability to highlight exactly which training phrase in which intent may have caused the confusion rate to increase for a given training cycle.
This highlighting becomes an invaluable tool for bot testers and conversation designers when debugging large bots with hundreds of intents and thousands of training phrases. It allows them to quickly identify the culprits degrading accuracy, especially in cases where similar training phrases have been used across different intents.
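One way to surface those culprits is to group every misclassified phrase under the intent it was supposed to trigger. The sketch below assumes test results are available as (phrase, expected intent, returned intent) tuples, with `None` meaning no classification; `group_confusions_by_intent` and the `Result` alias are hypothetical names used for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (phrase, expected intent, returned intent); None means "no classification".
Result = Tuple[str, Optional[str], Optional[str]]


def group_confusions_by_intent(
    results: Iterable[Result],
) -> Dict[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Map each expected intent to the phrases that were misclassified,
    paired with whatever the model returned instead."""
    confusions = defaultdict(list)
    for phrase, expected, returned in results:
        if returned != expected:
            confusions[expected].append((phrase, returned))
    return dict(confusions)
```

Because each entry records both the offending phrase and the intent it was pulled towards, pairs of intents that share similar training phrases tend to show up pointing at each other.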
In a world where regulators increasingly demand that AI be explainable, testing methodologies like this one help designers of conversational products to test for and assure the situations that a machine learning model has been directly trained on.