Text-To-Speech Synthesizer

This blog was written by Nisarg Mehta, Vivek Mankar, Aadvija Medhekar, Amol Pawar, and Umang Pathrabe.

What is a Text-To-Speech Synthesizer?

It is the technology that allows computers to convert written text into an artificial voice that closely resembles a human voice.
It is also called "READ - ALOUD" technology.
Text-to-speech synthesis consists of 2 main components: text analysis, which is handled by Natural Language Processing, and speech signal generation, which is handled by Digital Signal Processing.

Natural Language Processing - It deals with the interaction between human language and computers. It covers the generation, manipulation, and analysis of human language.

Digital Signal Processing - It deals with signals such as audio, video, voice, pressure, and temperature that have been digitized and are then manipulated mathematically.
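
As a rough illustration of "digitized and then manipulated mathematically", the sketch below samples a tone, quantizes it to 16-bit integers, and smooths it with a moving average. The function names, the 440 Hz tone, the 8000 Hz sample rate, and the choice of filter are all assumptions made for this example, not part of any particular TTS system.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample a sine wave at discrete time steps, as an ADC front end would."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize_16bit(samples):
    """Map floats in [-1.0, 1.0] to signed 16-bit integers (clipping overflow)."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]

def moving_average(samples, window):
    """Smooth the signal with a simple moving-average (low-pass) filter."""
    out = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        chunk = samples[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical input: 16 samples of a 440 Hz tone at an 8000 Hz sample rate.
analog_like = sample_sine(freq_hz=440, sample_rate_hz=8000, n_samples=16)
digital = quantize_16bit(analog_like)          # the "digitized" signal
smoothed = moving_average(digital, window=4)   # the "mathematical manipulation"
```

Real speech systems use far more elaborate filters, but the pattern is the same: numbers in, arithmetic, numbers out.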

Let's Discuss the Involvement of Natural Language Processing -

It consists of a phonetization module, a text analyzer, and a prosody generation module. Prosody refers to the stress and rhythm of the input text.
Phonetization - The letter-to-sound module produces a phonetic transcription of the input text. A well-known technique here is phonetic hashing. This method puts words that sound alike into one bucket and gives the same hash code to all of their possible spelling variations.
Text analyzer - It comprises 4 basic parts: a preprocessing block, a contextual analysis block, a morphological analysis block, and a syntactic-prosodic parser.
Prosody generation - It breaks sentences into chunks of syllables and words and identifies the relationships between the chunks.
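
To make phonetic hashing concrete, here is a minimal sketch using Soundex, one classic phonetic hashing scheme. The blog does not say which scheme is meant, so Soundex is an assumption here. It was designed for English names, and it gives a coarse bucket code, not a full phonetic transcription.

```python
def soundex(word):
    """Return the 4-character Soundex code for an alphabetic word."""
    # Consonants that sound similar share one digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:       # skip vowels and repeated codes
            result += code
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]         # pad or truncate to 4 characters

# Spelling variations that sound alike land in the same bucket.
same_bucket = soundex("Robert") == soundex("Rupert")
```

Here "Robert" and "Rupert" both hash to R163, and "Smith" and "Smyth" both hash to S530, which is the "same hash code for all variations" behaviour described above.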

Now Coming to the Digital Signal Processing part -

This component handles the machine pronunciation of words, phrases, and sentences. It can be implemented in 2 possible ways - concatenative synthesis and rule-based synthesis.
Concatenative Synthesis - These synthesizers possess very limited knowledge of the data they work on. While synthesizing speech, a concatenative synthesizer produces a sequence of concatenated segments retrieved from its speech sample database. A process called equalization is mainly used to reduce the effects of amplitude mismatches between segments.
Rule-based Synthesis - These are formant-based synthesizers. They involve a large number of parameters, which complicates the analysis and tends to produce unnatural-sounding speech.
Concatenative synthesizers are usually larger than rule-based synthesizers because they carry a database of recorded speech segments.
Analog signal processing, by contrast, does not require algorithms, software, a DAC, or an ADC. It relies entirely on electrical and electronic components such as resistors, transistors, and integrated circuits. Digital signal processing needs an ADC to digitize the input, software algorithms to process it, and a DAC to turn the result back into an audible signal.
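
The concatenative approach, including amplitude equalization, can be sketched as follows. This is only an illustration under assumptions: segments are plain lists of float samples, equalization is done by matching RMS level, and joins use a short linear cross-fade. The names `rms`, `equalize`, `concatenate`, and the tiny `database` are hypothetical.

```python
import math

def rms(segment):
    """Root-mean-square amplitude of a list of samples (0.0 if empty)."""
    if not segment:
        return 0.0
    return math.sqrt(sum(s * s for s in segment) / len(segment))

def equalize(segments, target_rms):
    """Scale every segment so that its RMS amplitude matches target_rms."""
    out = []
    for seg in segments:
        level = rms(seg)
        gain = target_rms / level if level > 0 else 1.0
        out.append([s * gain for s in seg])
    return out

def concatenate(segments, overlap):
    """Join segments, linearly cross-fading `overlap` samples at each boundary."""
    result = list(segments[0])
    for seg in segments[1:]:
        n = min(overlap, len(result), len(seg))
        for i in range(n):
            w = (i + 1) / (n + 1)            # fade-in weight of the new segment
            idx = len(result) - n + i
            result[idx] = (1 - w) * result[idx] + w * seg[i]
        result.extend(seg[n:])
    return result

# Hypothetical speech sample database: one quiet and one loud recording.
database = {"he": [0.1, 0.2, 0.1, -0.1], "llo": [0.8, 0.6, -0.7, -0.5]}
units = equalize([database["he"], database["llo"]], target_rms=0.3)
speech = concatenate(units, overlap=2)
```

Without the `equalize` step, the listener would hear a sudden jump in loudness at the boundary between the two segments.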

Unit Selection Speech Synthesis Using Digital Signal Processing -

The recorded speech is segmented phonetically in the discrete-time domain. The segmented units are stored as sample values.
There are two approaches to speech segmentation: hand labeling and automatic speech segmentation.
In this type of system, hand labeling is applied because the database is small, although it is time-consuming. The unit selection algorithm then selects the units that best match the target linguistic features, and these units are joined together to produce the speech.
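
One common way to frame unit selection is as a search that minimizes a target cost (how well a unit matches the wanted features) plus a join cost (how smoothly neighbouring units connect). The sketch below assumes that framing and uses dynamic programming. The feature names `pitch` and `duration`, both cost functions, and the toy database are assumptions for illustration, not the blog's actual system.

```python
def target_cost(unit, target):
    """Mismatch between a candidate unit and the wanted features."""
    return (abs(unit["pitch"] - target["pitch"])
            + abs(unit["duration"] - target["duration"]))

def join_cost(prev_unit, unit):
    """Penalty for a pitch jump at the boundary between adjacent units."""
    return abs(prev_unit["pitch"] - unit["pitch"])

def select_units(targets, database):
    """Pick one unit per target so the total target + join cost is minimal."""
    if not targets:
        return []
    candidates = [database[t["phone"]] for t in targets]
    # best[i][j] = (cheapest total cost ending in candidate j, back-pointer)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Hypothetical hand-labeled database with two recordings per phone.
database = {
    "h": [{"id": "h1", "pitch": 100, "duration": 50},
          {"id": "h2", "pitch": 140, "duration": 60}],
    "i": [{"id": "i1", "pitch": 105, "duration": 80},
          {"id": "i2", "pitch": 150, "duration": 90}],
}
targets = [{"phone": "h", "pitch": 110, "duration": 55},
           {"phone": "i", "pitch": 110, "duration": 85}]
chosen = [u["id"] for u in select_units(targets, database)]
```

For these toy values the search picks h1 followed by i1, since both are close to the targets and their pitches join smoothly.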

Design -

The first module contains the GUI components and handles the main operations of the application, such as the input of parameters for conversion, either via file input or the browser.
The second module is the conversion engine. It is integrated into the main module, accepts the input data, and performs the conversion. It is implemented with the FreeTTS API.

Application of Text-to-speech Synthesizer -

In telecommunications applications, these systems make it possible to listen to text read by a machine, for instance text retrieved from a database. Input to the database can be given by voice or through a telephone keypad.
For people with speech impairment, speech synthesis provides an artificial voice and so simplifies communication with others. It also reduces noise within the system.
In multimedia, Text-To-Speech synthesis has made talking books and interactive games possible.

Conclusion -

In this blog, we have identified the various operations and processes involved in text-to-speech synthesis. We discussed both parts of the process, Natural Language Processing and Digital Signal Processing, each of which includes several methodologies. Along with the design, we discussed one of the most important implementations, Unit Selection Speech Synthesis. Finally, we covered a few applications of the Text-to-speech Synthesizer.

