Tim Walpole

Hardly a week goes by without a news story about how our privacy has been violated by the devices we operate in our homes, which listen to, and record, our conversations. And yet we still carry on using those devices to operate our lights, turn on our coffee machines and, in some cases, even carry out banking transactions. We happily, unknowingly or unwittingly give away our private, personal data and conversations, assuming that the organisations that run these services are trustworthy and will properly protect our personal data. We assume, often naively, that our personal data is used and then deleted, and that no trace of our interaction will persist.

“Apple contractors regularly hear confidential medical information, drug deals, and recordings of couples having sex, as part of their job providing quality control, or ‘grading’, the company’s Siri voice assistant.” – The Guardian, July 2019.

However, behind the scenes, we find that our data is far from secure. It is passed to numerous cloud-based services without being anonymised. These services include:

  • Natural language understanding – To try and identify what we said.
  • Sentiment analysis – To understand how we may be feeling.
  • Text to speech – To respond in a human-like voice.

“We only annotate an extremely small sample of Alexa voice recordings in order [to] improve the customer experience. For example, this information helps us train our speech recognition and natural language understanding systems, so Alexa can better understand your requests, and ensure the service works well for everyone.” – CNet, July 2019.

We also find that organisations are sending our personal un-anonymised data to cloud-based analytics services to understand how the service is performing.  Once the data gets to these services, our personal data may be used for model training and in some cases is even being reviewed by humans (often external contractors) in an attempt to improve the underlying service.

“Google stated that [the recordings] contractors listen to amount to about 0.2% of all recordings, and that user account details aren’t associated with any of them.” – CNet, July 2019.

So, as we get used to sharing more information with conversational AI interfaces (voice based or text based), how can we do so in a way that protects our privacy and personal conversations? And how can we use advanced AI services like natural language understanding and sentiment analysis without having our data compromised?

We need to radically rethink how we deliver such services, starting with prioritising our privacy, and start using privacy first AI solutions that put the confidentiality of customers’ data first.

In order to understand what would be involved in putting together a 100% privacy first conversational AI service, let’s look at the components that are needed in order to facilitate a conversation, either using voice, or text-based entry (such as a chatbot).

Hotword (or wake word) detection allows a device to listen for either single or multiple words then, once the hotword has been recognised, start processing any further input.

Hotwords are usually pre-determined words that are set up within the device, such as ‘Hey Google’ or ‘Alexa’; however, some platforms allow you to use your own hotwords.
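
The gating behaviour described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the input has already been split into words, whereas a real detector works on the audio signal itself, and the function name capture_after_hotword and the single-word hotword are hypothetical choices made for the example.

```python
from typing import Iterable, List


def capture_after_hotword(words: Iterable[str], hotword: str = "alexa") -> List[str]:
    """Ignore everything until the hotword is heard, then collect the rest.

    Assumption: `words` is a stream of already recognised words. Real
    hotword detection runs on raw audio, which this sketch does not model.
    """
    captured: List[str] = []
    awake = False
    for word in words:
        if awake:
            # Only input that follows the hotword is passed on for processing.
            captured.append(word)
        elif word.lower().strip(",.!?") == hotword:
            awake = True
        # Anything heard before the hotword is dropped and never stored.
    return captured
```

For example, calling capture_after_hotword on the words of "private chat then Alexa, turn on the lights" keeps only "turn on the lights"; the words spoken before the hotword are discarded.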

Once the hotword has been identified, any further speech is captured and processed using a suitable speech recognition model in order to identify what was said. Generic models allow the speech recognition engine to identify anything that was said; however, accuracy rates are not that high, especially in a noisy environment. To achieve higher accuracy, a domain specific model would need to be used.

Now we need to try and determine what a user is saying in order to process the request and produce a suitable response. To do this, we send the text from either the speech recognition engine (in the case of a voice conversation), or the text that was typed in by the user (in the case of a text based conversation such as a chatbot), to a Natural Language Processing engine. The Natural Language Processing engine will use a trained model (or set of models) to identify one or more ‘intents’, along with a confidence score for each matched intent. Once the intent has been determined, a suitable (and potentially personalised) response is generated, which is then sent back to the user. For a text-based conversation, this is sent back directly; for voice-based conversations, however, the text needs to be converted back to speech using a Text to Speech engine.
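
To make the idea of intents and confidence scores concrete, here is a small Python sketch. It swaps the trained model for a simple keyword-overlap score, purely for illustration; the intent names, the keyword sets, the match_intents function and the 0.5 threshold are all assumptions made for this example and do not come from any particular platform.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical intents and trigger keywords, standing in for a trained model.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "lights_on": {"turn", "on", "lights"},
    "make_coffee": {"make", "coffee"},
    "check_balance": {"check", "balance", "account"},
}


def match_intents(text: str, threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (intent, confidence) pairs, most confident first.

    Confidence here is the fraction of an intent's keywords found in the
    text. A real engine would use a trained classifier instead.
    """
    tokens = set(text.lower().split())
    scored: List[Tuple[str, float]] = []
    for intent, keywords in INTENT_KEYWORDS.items():
        confidence = len(tokens & keywords) / len(keywords)
        if confidence >= threshold:
            scored.append((intent, confidence))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these example keywords, "turn on the lights" matches the lights_on intent with a confidence of 1.0, while a phrase that matches nothing returns an empty list, which a service could treat as "sorry, I did not understand".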

Today, most voice based conversational devices (such as Alexa or Google Assistant) and most text based chatbots use cloud-based services to process conversations by sending the data (un-encrypted) to the cloud for processing as well as to cloud-based services for analytics.


So, how are privacy first solutions different?  We have two options available to us which give us different levels of privacy.

Our first option is to securely process the conversations on a ‘private by design’ server.  In this case, hotword detection happens on the end user device and any further conversations are sent to the backend service for processing. The backend server will process the conversations using its own internally trained models before returning a suitably formatted response to the end user.

The service must not use any cloud-based services and must be designed so that clients’ personally identifiable data is never logged, revealed or used for training. This can be achieved by using well trained machine learning models to remove all personal data before processing with internal natural language understanding, sentiment analysis and text to speech services.

This option allows us to carry out analysis of the end users’ conversations. However, it is only as good as the machine learning models are at removing the personally identifiable data and, as such, there is still a risk that personal data may be inadvertently revealed or used to retrain the internal models.
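
The redaction step, and its limits, can be illustrated with a short Python sketch. Note the substitution: instead of trained machine learning models it uses two regular expressions, one for email addresses and one for long runs of digits such as phone or card numbers. The patterns, the placeholder labels and the redact function are assumptions made for this example.

```python
import re

# Simple stand-ins for a trained redaction model (an assumption for this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Seven or more digits, optionally separated by single spaces or hyphens.
DIGITS = re.compile(r"\d(?:[ -]?\d){6,}")


def redact(text: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return DIGITS.sub("[NUMBER]", text)
```

Passing "Email jo@example.com or call 020 7946 0018" through redact yields "Email [EMAIL] or call [NUMBER]". A sentence such as "My name is Tim" passes through unchanged, because nothing in this sketch recognises names, which is exactly the kind of gap that leaves a residual risk with this option.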

Another option, which is 100% private by design, is to process the conversations completely on the end user’s device. In this solution, a trained machine learning model and text to speech engine are embedded into the application on the end user’s device. The conversations are processed locally, and the only data sent to the backend service is the intent, along with any data required to generate the response.

Any processing of conversations on device must be carried out in memory and no record of the conversation should be stored.
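
As a rough Python sketch of this approach, the code below classifies an utterance locally and builds the only message that would leave the device. The classify_on_device function is a hypothetical stand-in for the embedded model, and the intent name, the "slots" field and the JSON payload format are assumptions made for the example.

```python
import json
from typing import Dict, Optional


def classify_on_device(utterance: str) -> Optional[Dict[str, object]]:
    """Hypothetical stand-in for the model embedded in the application."""
    tokens = utterance.lower().split()
    if "coffee" in tokens:
        size = "large" if "large" in tokens else "regular"
        return {"intent": "order_coffee", "slots": {"size": size}}
    return None


def payload_for_backend(utterance: str) -> Optional[str]:
    """Build the message sent to the backend: the intent and its data only.

    The utterance is held in a local variable, is never written to disk or
    logged, and is not included in the returned payload.
    """
    result = classify_on_device(utterance)
    if result is None:
        return None
    return json.dumps(result, sort_keys=True)
```

For the utterance "order me a large coffee for Tim", the payload contains the order_coffee intent and the size, but neither the original sentence nor the name in it.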

This level of privacy, however, comes with its own challenges. As no record of the conversation is stored, there is no way to use the conversation history either for analytics, to understand how the service is working, or to help re-train the model to make the service perform better.

For both of these options, any processing of the conversations on the end user’s device (including hotword detection) should be carried out in memory and no data should be stored locally at any time to reduce the risk of conversations being recovered.

With either of the above options, there needs to be a level of trust established between the customer and the supplier of the service.  This trust takes time to be established and can be wiped out overnight should it be proved that personal data has been compromised.

It has already been shown that we cannot trust Google, Amazon or Apple with our personal data and, even though they may be telling us that our data is secure, it will take us a long time to re-establish that trust.

“Apple has said that it will temporarily suspend its practice of using human contractors to grade snippets of Siri voice recordings for accuracy while it conducts a thorough review, they are suspending Siri grading globally. Additionally, as part of a future software update, users will have the ability to choose to participate in grading.” – The Verge, 2019

It is predicted that there will be a massive explosion in the number of smart speakers in the next few years, and these devices allow us to do more than just ask for the weather or order a coffee.

– 50% of all searches will be voice searches by 2020.
– About 30% of all searches will be done without a screen by 2020.
– 13% of all households in the United States owned a smart speaker in 2017.
– That number is predicted to rise to 55% by 2022.
Wordstream, October 2018

We need to vote with our feet today and stop using these devices and work with the relevant organisations to ensure that legislation is put in place to guard our privacy and personal data. So next time you use a virtual assistant, consider the value exchange – are you getting enough value from it to warrant sharing your data in this manner?