With a latency of just 0.5 seconds, smartphones can now convert text to speech in 21 languages, thanks to the National Institute of Information and Communications Technology (NICT).
This solution synthesizes one second of speech in just 0.1 seconds using a single CPU core, about eight times faster than conventional methods. This implies that a typical mid-range smartphone will be able to handle the required processing on its own, without needing an internet connection or external resources.
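The speed figures above can be read as a real-time factor (processing time divided by the duration of the audio produced). The following is a minimal sketch of that arithmetic; the function names are illustrative, and it assumes "eight times faster" means the conventional method needs roughly eight times as long for the same audio.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time for an utterance of the given length."""
    return audio_seconds * rtf


# Figure from the article: 0.1 s of compute per 1 s of speech on one CPU core.
nict_rtf = real_time_factor(0.1, 1.0)

# Assumption: the conventional method takes about eight times as long.
conventional_rtf = nict_rtf * 8

# Hypothetical 10-second utterance, chosen only for illustration.
fast_seconds = synthesis_time(10.0, nict_rtf)
slow_seconds = synthesis_time(10.0, conventional_rtf)
```

Under these assumptions the conventional method still runs slightly faster than real time on paper, but with far less headroom for a phone that is also running the rest of an app.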
“The synthesized sound quality of text-to-speech has improved dramatically in recent years thanks to the introduction of neural network technology, and it has reached a level comparable to that of natural speech, however, the huge amount of calculation was a major issue; thus, impossible to synthesize on a smartphone without network connection,” according to NICT’s press release.
Text-to-speech systems typically use an acoustic model to convert text into intermediate features, followed by a waveform generation model to create speech.
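The two-stage structure can be illustrated with a toy sketch. Nothing here reflects NICT's actual models: the "acoustic model" just maps each character to a made-up pitch value, the "waveform generator" renders those values as sine waves, and the sample rate and frame sizes are assumptions chosen for the example.

```python
import math
from typing import List

SAMPLE_RATE = 16000       # samples per second (assumed)
SAMPLES_PER_FRAME = 160   # 10 ms of audio per feature frame (assumed)
FRAMES_PER_CHAR = 5       # toy duration model: each character lasts 5 frames


def toy_acoustic_model(text: str) -> List[float]:
    """Stage 1: text -> intermediate features.

    A real acoustic model predicts rich features such as spectrogram
    frames; here each frame is a single invented pitch value in Hz.
    """
    frames: List[float] = []
    for ch in text:
        pitch_hz = 100.0 + (ord(ch) % 50) * 4.0
        frames.extend([pitch_hz] * FRAMES_PER_CHAR)
    return frames


def toy_waveform_generator(frames: List[float]) -> List[float]:
    """Stage 2: intermediate features -> waveform samples in [-1, 1]."""
    samples: List[float] = []
    phase = 0.0
    for pitch_hz in frames:
        for _ in range(SAMPLES_PER_FRAME):
            # Advance the phase continuously so frame boundaries do not click.
            phase += 2.0 * math.pi * pitch_hz / SAMPLE_RATE
            samples.append(math.sin(phase))
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> features -> waveform."""
    return toy_waveform_generator(toy_acoustic_model(text))


waveform = synthesize("hello")
duration_seconds = len(waveform) / SAMPLE_RATE
```

The point of the split is that each stage can be optimized separately, which matches the article's account of one speed-up from the acoustic model and a further one from signal processing.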
While transformer-based neural networks, which are also used in machine translation, automatic speech recognition, and large language models (e.g. ChatGPT), are the mainstream choice for acoustic modelling in neural text-to-speech, NICT has instead brought ConvNeXt, a high-speed, high-performance neural network originally developed for image identification, into the acoustic model. This change made speech synthesis three times faster than previous methods, and further improvements in signal processing boosted the overall speed to eight times faster.
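For readers unfamiliar with ConvNeXt, the sketch below shows the general shape of one ConvNeXt-style block, transposed from 2D images to a 1D sequence of feature frames. It is an illustration under assumptions, not NICT's implementation: the article does not say how the network was adapted for speech, and this version omits biases, the learned LayerNorm parameters, layer scale, and stochastic depth. It uses NumPy and random weights purely to show the data flow.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each time step across its channels (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def convnext_block_1d(x, dw_kernel, w_expand, w_project):
    """One simplified ConvNeXt-style block over a sequence.

    x:         (time, channels) input features
    dw_kernel: (kernel_size, channels) depthwise filter, odd kernel_size
    w_expand:  (channels, 4 * channels) pointwise expansion weights
    w_project: (4 * channels, channels) pointwise projection weights
    """
    k = dw_kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))

    # Depthwise convolution: every channel is filtered with its own kernel.
    y = np.zeros_like(x, dtype=float)
    for i in range(k):
        y += padded[i:i + x.shape[0]] * dw_kernel[i]

    y = layer_norm(y)
    y = gelu(y @ w_expand)   # pointwise expansion to 4x channels
    y = y @ w_project        # pointwise projection back to the input width
    return x + y             # residual connection


rng = np.random.default_rng(0)
time_steps, channels = 50, 8
x = rng.standard_normal((time_steps, channels))
dw = rng.standard_normal((7, channels)) * 0.1
w1 = rng.standard_normal((channels, 4 * channels)) * 0.1
w2 = rng.standard_normal((4 * channels, channels)) * 0.1
out = convnext_block_1d(x, dw, w1, w2)
```

Unlike self-attention in a transformer, whose cost grows with the square of the sequence length, every operation in this block scales linearly with the number of frames, which is one plausible reason a convolutional design suits a single CPU core.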
The 21 languages supported by the solution are Japanese, English, Chinese, Korean, Thai, French, Indonesian, Vietnamese, Spanish, Myanmar, Filipino, Brazilian Portuguese, Khmer, Nepali, Mongolian, Arabic, Italian, Ukrainian, German, Hindi, and Russian.
This technology is already publicly available in VoiceTra, NICT’s multilingual speech translation app for smartphones. NICT also anticipates future applications in car navigation and other speech services through commercial licensing.
Additionally, NICT is working on multilingual simultaneous interpretation technology, where translated speech is generated continuously, without waiting for the speaker to finish. This will require even faster text-to-speech technology to achieve real-time machine interpretation.
The text is inspired by the press release “Developed a 21-language, Fast and High-Fidelity Neural Text-to-Speech Technology That Works on Smartphones” at the NICT website.