How I Made a Speech-to-Speech Conversational AI using ChatGPT in No Time at All

    By Daniel Amini, Head of Microsoft Technologies at BJSS

    With the rise of ChatGPT, generative artificial intelligence (AI) has exploded into the public consciousness. You can’t go anywhere these days without it being mentioned; I knew it had properly cracked the mainstream when my mum started asking me about it.

    Anyone who has used GPT knows that it isn’t always accurate or reliable, and that it can generate content that is factually wrong. On the other hand, it has saved me countless hours by doing what the best assistants do: making me more productive.

    What I find most fascinating is how easy it is to consume and interact with. I’m not a data scientist, I’ve not looked at a neural network in decades, and, to be honest, I don’t want to. What I want is to be able to interact effortlessly with an AI model.

    This led me to write a speech-to-speech conversational AI using GPT and Azure services. In this blog post, I'll explain what it does, how it works, and how you can use it yourself. You’ll see how easy it is to embed AI in applications, how simple it is to extend, and how the barrier to entry for working with this technology is now incredibly low.

    What does it do?

    This project is a demonstration of how you can use Azure Cognitive Services and the Azure OpenAI Service to create a voice-based conversational system able to answer your questions and chat with you. It's like having your own personal assistant that can talk to you in natural language and provide you with relevant information.

    You can choose from different personas, such as Marvin, a depressed android, or Alice, a friendly chatbot. The system uses Azure Cognitive Services to recognise your speech, then sends the transcribed text to the Azure OpenAI Service. Azure OpenAI Service hosts deployments of multiple GPT models, including GPT-3 and GPT-4, and one of these models generates a reply to your transcribed input. The resulting AI text response is played back to you using speech synthesis, providing a speech-to-speech interface to OpenAI. Additionally, Azure sentiment analysis is performed on the OpenAI response, so you can see how the system “feels” about your conversation.
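At its core, each conversational turn follows the same loop. Here is a minimal sketch of that flow in Python; the function names (recognise_speech, ask_gpt, speak, analyse_sentiment) are placeholders for the Azure SDK calls shown later in this post, not names taken from the project itself.

```python
# Shape of one conversational turn (placeholder function names).
history = [{"role": "system", "content": "You are Marvin, a depressed android."}]

while True:
    user_text = recognise_speech()              # Azure Speech: speech-to-text
    history.append({"role": "user", "content": user_text})

    reply = ask_gpt(history)                    # Azure OpenAI: chat completion
    history.append({"role": "assistant", "content": reply})

    print(analyse_sentiment(reply))             # Azure Text Analytics: sentiment
    speak(reply)                                # Azure Speech: text-to-speech
```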

    The result is an ongoing spoken conversation with a fictional character, a “real” person, or any persona you can define, be it an “IT support person”, a “top chef”, or “HAL from 2001: A Space Odyssey”. To aid immersion, you can define character traits by providing a short description to the AI model, as well as a suitable synthesised voice with an emotional style.
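As an illustration of the emotional styling, Azure neural text-to-speech supports speaking styles via the mstts:express-as SSML element. The voice name and style below are my own examples rather than the project's configuration, and you'd substitute your own Speech resource key and region:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials for your Azure Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# en-US-DavisNeural supports styles such as "sad", "cheerful", and "angry".
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-DavisNeural">
    <mstts:express-as style="sad">
      Here I am, brain the size of a planet, and they ask me to make small talk.
    </mstts:express-as>
  </voice>
</speak>
"""
synthesizer.speak_ssml_async(ssml).get()  # speaks through the default speaker
```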

    How does it work?

    The project uses three main components:

    [Architecture diagram: the three Azure services used by the project]

    • Speech Recognition: This is the process of converting speech into text. The project uses the Azure Speech service to recognise speech from a microphone and send the transcribed text to the Azure OpenAI Service. The Azure Speech service supports over 100 languages and dialects, and has features like noise cancellation, speaker identification, and conversation transcription.
    • OpenAI: The project uses the Azure OpenAI Service to access a powerful natural language processing model that can understand and generate natural language. Azure OpenAI Service is a cloud-based platform that lets you call OpenAI models without having to host or manage any model infrastructure yourself. You can deploy your own custom models, or use pre-trained models, for tasks like text generation, summarisation, classification, sentiment analysis, and more.
    • Speech Synthesis: This is the process of converting text into speech. The project uses the Azure Speech service to synthesise the text response from OpenAI and play it back to the user. The Azure Speech service can produce natural-sounding speech in over 70 languages and voices, and has features like neural voices, prosody control, and custom voices. (A minimal sketch combining all three services follows this list.)
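Here is that sketch: one recognise-complete-synthesise turn, using the azure-cognitiveservices-speech package and the pre-1.0 openai Python library configured for Azure. The keys, region, endpoint, and deployment name are placeholders, and this is my own minimal reconstruction rather than the project's actual code:

```python
import azure.cognitiveservices.speech as speechsdk
import openai

# --- Azure Speech: shared config for recognition and synthesis ---
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)    # default microphone
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)  # default speaker

# --- Azure OpenAI: pre-1.0 openai SDK pointed at an Azure resource ---
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "<azure-openai-key>"

messages = [{"role": "system", "content": "You are Alice, a friendly chatbot."}]

# 1. Speech recognition: capture one utterance from the microphone.
result = recognizer.recognize_once()
messages.append({"role": "user", "content": result.text})

# 2. Chat completion against a GPT deployment in the Azure OpenAI Service.
response = openai.ChatCompletion.create(
    engine="<gpt-deployment-name>",  # the name you gave your model deployment
    messages=messages,
)
reply = response["choices"][0]["message"]["content"]
messages.append({"role": "assistant", "content": reply})

# 3. Speech synthesis: speak the reply back through the speakers.
synthesizer.speak_text_async(reply).get()
```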

    The project also performs sentiment analysis on the OpenAI response using the Azure Text Analytics service. This service can detect the tone and emotion of a piece of text, and assigns confidence scores for positive, neutral, and negative sentiment.
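That step is only a few more lines with the azure-ai-textanalytics package. A sketch, assuming a Text Analytics (Language) resource with its own key and endpoint, and reusing the reply variable from the sketch above:

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

text_client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<text-analytics-key>"),
)

# Overall sentiment is "positive", "neutral", "negative", or "mixed",
# with per-class confidence scores attached.
doc = text_client.analyze_sentiment(documents=[reply])[0]
print(doc.sentiment, doc.confidence_scores)
```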

    A number of predefined personas are included; however, these can be extended via the application's configuration file. It will be interesting to see how the emerging discipline of ‘prompt engineering’ can be used to define more complex, charismatic, and engaging personas for the AI.
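Since the actual configuration format isn't reproduced here, the structure below is only a guess at what a persona entry might contain: a system prompt to steer the model, plus a voice name and speaking style for the synthesis step.

```python
# Hypothetical persona entries; the project's real configuration keys may differ.
PERSONAS = {
    "marvin": {
        "prompt": "You are Marvin, a chronically depressed android. "
                  "Answer helpfully, but sigh about it.",
        "voice": "en-US-DavisNeural",
        "style": "sad",
    },
    "alice": {
        "prompt": "You are Alice, a warm and upbeat chatbot.",
        "voice": "en-US-JennyNeural",
        "style": "cheerful",
    },
}
```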

    The project is written in Python as a simple command-line app. It requires access to Azure Cognitive Services for text-to-speech, speech-to-text, and sentiment analysis, as well as the Azure OpenAI Service.

    The code can be found on GitHub, along with details on how to set up the app, although I’ve aimed to make it as simple as possible.

    How can you use it?

    You can use this project for fun, for learning, or for experimenting with conversational AI. You can try different personas and see how they react to your questions and comments. You can also modify the code and add new features or functionality: for example, more speech recognition languages, more speech synthesis voices, more sentiment analysis options, or more OpenAI parameters.
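For instance, exposing more OpenAI parameters could be as simple as passing temperature or max_tokens through to the chat completion call; a sketch against the same pre-1.0 openai SDK used above:

```python
# Higher temperature means more adventurous replies; max_tokens caps reply length.
response = openai.ChatCompletion.create(
    engine="<gpt-deployment-name>",
    messages=messages,
    temperature=0.9,   # 0.0 is near-deterministic; higher values add variety
    max_tokens=200,
)
```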

    Conclusion

    Hopefully you can see that this project is a great example of how to leverage Azure Cognitive Services and the Azure OpenAI Service to create an engaging, intelligent voice application with minimal effort. These powerful AI services are easily accessible and ready for you to build your next project with.

    I hope this has inspired you to think of ways you can use these types of technology, and how they can be infused into your own services and offerings.

    If you want to learn more about our AI offerings, click here.