AI fills in the blanks when audio data gets lost
With vast numbers of people around the world now relying on video calls to talk face to face with work colleagues or to fill the social void, video-calling platforms such as Zoom, Google Hangouts and Microsoft Teams have taken centre stage.
Internet Service Providers (ISPs) have also had to adapt to the extra demand for broadband services to accommodate this increase in usage. At peak times internet speeds may be slower for some users, meaning that video-calling could be disrupted and users may see noticeable delays in calls.
Google has rolled out a new technology to improve audio quality in calls when the service can’t maintain a steady connection. It works like auto-complete for speech, covering up glitches in video calls with ‘speech’ generated by Artificial Intelligence (AI).
Duo, one of Google’s video calling services, is a cross-platform app which allows up to 12 participants to communicate over video call. Although the app doesn’t offer the same capacity as rival apps, it is end-to-end encrypted and the AI runs on the device rather than in the cloud.
Packets of data
When users make an online call, their voice is split into tiny pieces that are transmitted across the internet in data blocks known as packets. The packets often arrive at the receiver out of order, so the software has to put them back in sequence. Sometimes they don’t arrive at all, which creates glitches and gaps in the conversation.
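The reordering step described above can be illustrated with a minimal Python sketch. It assumes each packet carries a sequence number alongside its audio payload; the function names and the tiny byte payloads are hypothetical and are not taken from Duo or any real calling app.

```python
from typing import Dict, List, Optional, Tuple


def reorder_packets(
    received: List[Tuple[int, bytes]], total: int
) -> List[Optional[bytes]]:
    """Put packets back in sequence order.

    `received` holds (sequence_number, payload) pairs in arrival order.
    Slots whose packet never arrived are filled with None.
    """
    by_seq: Dict[int, bytes] = {}
    for seq, payload in received:
        if 0 <= seq < total:
            # Keep the first copy and ignore any duplicate packets.
            by_seq.setdefault(seq, payload)
    return [by_seq.get(seq) for seq in range(total)]


def missing_sequence_numbers(ordered: List[Optional[bytes]]) -> List[int]:
    """Return the sequence numbers of packets that were lost."""
    return [seq for seq, payload in enumerate(ordered) if payload is None]


# Hypothetical example: five packets were sent, packet 3 never arrived
# and the rest turned up out of order.
received = [(2, b"lo "), (0, b"he"), (4, b"ld"), (1, b"l")]
ordered = reorder_packets(received, total=5)
lost = missing_sequence_numbers(ordered)
```

Here `ordered` holds the payloads in their original order with `None` in slot 3, and `lost` identifies the gap that the receiver would otherwise play back as a glitch.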
The aim of the technology is to mimic an individual speaker’s manner of talking so that it can smooth over the cracks with snippets of generated speech.
The team at Google built the speech generator on a neural network developed by DeepMind, which generates realistic speech from text. The resulting model, named WaveNetEQ, was trained on a large data set of 100 recorded human voices speaking in 48 different languages. Training continued until the speech generator could auto-complete short sections of speech, based on common patterns in the way that people talk.
During a video call, WaveNetEQ learns the characteristics of the speaker’s voice and generates audio snippets that match the style and content of what the speaker is saying. If a packet containing the original speech sounds is lost, the AI-generated audio is inserted in its place.
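The substitution step can be sketched in a few lines of Python. This is only an illustration of where generated audio slots in: the `fade_last_frame` function below is a hypothetical stand-in that replays the previous frame at half volume, and it is not WaveNetEQ or any part of Google’s implementation. A real system would call a trained speech model at that point.

```python
from typing import Callable, List, Optional

Frame = List[float]  # one packet's worth of audio samples


def fade_last_frame(history: List[Frame], frame_size: int) -> Frame:
    """Stand-in generator (an assumption, not WaveNetEQ).

    Replays the most recent frame at half volume, or returns silence
    if there is no earlier audio to draw on.
    """
    if not history:
        return [0.0] * frame_size
    return [0.5 * sample for sample in history[-1]]


def conceal_gaps(
    frames: List[Optional[Frame]],
    frame_size: int,
    generate: Callable[[List[Frame], int], Frame] = fade_last_frame,
) -> List[Frame]:
    """Replace each lost frame (None) with generated audio."""
    output: List[Frame] = []
    for frame in frames:
        if frame is None:
            # The generator sees everything played so far, including
            # frames it produced earlier for consecutive losses.
            frame = generate(output, frame_size)
        output.append(frame)
    return output


# Hypothetical example: the second and third frames were lost in transit.
frames = [[0.5, -0.5], None, None, [0.25, 0.25]]
repaired = conceal_gaps(frames, frame_size=2)
```

Passing the generator in as a parameter keeps the gap-filling loop separate from the model that produces the audio, so the simple fade used here could be swapped for a learned generator without changing the rest of the code.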