Linguistics: Speech Synthesis
by Simon King
  • LAST MODIFIED: 25 February 2016
  • DOI: 10.1093/obo/9780199772810-0024

Introduction

Speech synthesis has a long history, going back to early attempts to generate speech- or singing-like sounds from musical instruments. In the modern age, however, the field has been driven by one key application: text-to-speech (TTS), which means generating speech from text input. Almost universally, this complex problem is divided into two parts. The first part is the linguistic processing of the text, which happens in the front end of the system. This is hard because text does not contain all the information necessary for reading out loud. So, just as human talkers use their knowledge and experience when reading out loud, machines must also bring additional information to bear on the problem; examples include rules for expanding abbreviations into standard words and a pronunciation dictionary that converts spelled forms into spoken forms. Many of the techniques currently used for this part of the problem were developed in the 1990s and have advanced only slowly since then. In general, front-end techniques are designed to be applicable to almost any language, although the exact rules or model parameters depend on the language in question. The output of the front end is a linguistic specification that contains information such as the phoneme sequence and the positions of prosodic phrase breaks. In contrast, the second part of the problem, which is to take the linguistic specification and generate a corresponding synthetic speech waveform, has received a great deal of attention and is where almost all of the exciting work has happened since around 2000. There is far more recent material available on the waveform generation part of the text-to-speech problem than on the text processing part. There are two main paradigms currently in use for waveform generation, both of which apply to any language.
In concatenative synthesis, small snippets of prerecorded speech are carefully chosen from an inventory and rearranged to construct novel utterances. In statistical parametric synthesis, the waveform is converted into two sets of speech parameters: one set captures the vocal tract frequency response (or spectral envelope), and the other represents the sound source, such as the fundamental frequency and the amount of aperiodic energy. Statistical models are learned from annotated training data and can then be used to generate the speech parameters for novel utterances, given the linguistic specification from the front end. A vocoder is used to convert those speech parameters back to an audible speech waveform.
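
To make the front-end stage described above more concrete, the following is a minimal sketch in Python, not a description of any particular system. It expands abbreviations by rule, looks words up in a pronunciation dictionary, and returns a linguistic specification holding a phoneme sequence and phrase-break positions. The abbreviation table, the tiny lexicon with its ARPAbet-style symbols, the function names, and the letter-by-letter fallback for unknown words are all assumptions made for illustration; real front ends use far larger resources and trained letter-to-sound models.

```python
import re

# Hypothetical toy resources, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
LEXICON = {
    "doctor": ["d", "aa", "k", "t", "er"],
    "smith": ["s", "m", "ih", "th"],
    "lives": ["l", "ih", "v", "z"],
    "here": ["hh", "iy", "r"],
    "now": ["n", "aw"],
}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def normalize(text):
    """Lowercase the text, expand known abbreviations, keep punctuation as tokens."""
    tokens = re.findall(r"[A-Za-z]+\.?|[.,;:!?]", text)
    words = []
    for token in tokens:
        low = token.lower()
        if low in ABBREVIATIONS:
            # Rule-based expansion of an abbreviation into a standard word.
            words.append(ABBREVIATIONS[low])
        elif low in PUNCTUATION:
            words.append(low)
        elif low.endswith("."):
            # A word followed by a sentence-final period: split the two.
            words.append(low[:-1])
            words.append(".")
        else:
            words.append(low)
    return words


def front_end(text):
    """Return a toy linguistic specification: phonemes and phrase-break positions."""
    phonemes = []
    phrase_breaks = []
    for word in normalize(text):
        if word in PUNCTUATION:
            # Assumption: every punctuation mark signals a prosodic phrase break,
            # recorded as the index in the phoneme sequence where it falls.
            phrase_breaks.append(len(phonemes))
        else:
            # Dictionary lookup; unknown words fall back to one symbol per
            # letter, a crude stand-in for a letter-to-sound model.
            phonemes.extend(LEXICON.get(word, list(word)))
    return {"phonemes": phonemes, "phrase_breaks": phrase_breaks}


if __name__ == "__main__":
    print(front_end("Dr. Smith lives here, now."))
```

In a full system, a specification of this kind would be passed to the waveform generation stage, whether concatenative or statistical parametric.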

Textbooks, Edited Collections, Surveys, and Introductions

The steady progress in synthesis since around 1990, and the especially rapid progress in the early 21st century, poses a challenge for textbooks. Taylor 2009 provides the most up-to-date entry point to the field and is an excellent starting point for students at all levels. For a wider-ranging textbook that also covers natural language processing and automatic speech recognition, Jurafsky and Martin 2009 is likewise excellent. For those without an electrical engineering background, the chapter by Ellis, “An Introduction to Signal Processing for Speech,” in Hardcastle, et al. 2010 is essential background reading, since most other texts are aimed at readers with some previous knowledge of signal processing. Most of the advances in the field since around 2000 have been in the statistical parametric paradigm, and no current textbook covers this subject in sufficient depth. King 2011 gives a short and simple introduction to some of the main concepts, and Taylor 2009 contains one relatively brief chapter. For more technical depth, it is necessary to venture beyond textbooks: the tutorial article Tokuda, et al. 2013 is the best place to start, followed by the more technical article Zen, et al. 2009. Some older books, such as Dutoit 1997, still contain relevant material, especially in their treatment of the text processing part of the problem. Sproat’s comment that “text-analysis has not received anything like half the attention of the synthesis community” (p. 73) in his introduction to text processing in van Santen, et al. 1997 is still true, and Yarowsky’s chapter on homograph disambiguation in the same volume still represents a standard solution to that particular problem. Similarly, the modular system architecture described by Sproat and Olive in that volume is still the standard way of configuring a text-to-speech system.

  • Dutoit, Thierry. 1997. An introduction to text-to-speech synthesis. Norwell, MA: Kluwer Academic.

    DOI: 10.1007/978-94-011-5730-8

    Starting to get dated, but still contains useful material.

  • Hardcastle, W. J., J. Laver, and F. E. Gibbon. 2010. The handbook of phonetic sciences. Blackwell Handbooks in Linguistics. Oxford: Wiley-Blackwell.

    DOI: 10.1002/9781444317251

    A wealth of information, one highlight being the excellent chapter by Ellis introducing speech signal processing to readers with minimal technical background. The chapter on speech synthesis is too dated. Other titles in this series are worth consulting, such as the one on speech perception.

  • Jurafsky, D., and J. H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. 2d ed. Upper Saddle River, NJ: Prentice Hall.

    A complete course in speech and language processing, very widely used for teaching at advanced undergraduate and graduate levels. The authors have a free online video lecture course covering the Natural Language Processing parts. A third edition of the book is expected.

  • King, S. 2011. An introduction to statistical parametric speech synthesis. Sadhana 36.5: 837–852.

    DOI: 10.1007/s12046-011-0048-y

    A gentle and nontechnical introduction to this topic, designed to be accessible to readers from any background. Should be read before attempting the more advanced material.

  • Taylor, P. 2009. Text-to-speech synthesis. Cambridge, UK: Cambridge Univ. Press.

    DOI: 10.1017/CBO9780511816338

    The most comprehensive and authoritative textbook ever written on the subject. The content is still up to date and highly relevant. Of course, developments since 2009, such as advanced techniques for HMM-based synthesis and the resurgence of neural networks, are not covered.

  • Tokuda, K., Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura. 2013. Speech synthesis based on Hidden Markov Models. Proceedings of the IEEE 101.5: 1234–1252.

    DOI: 10.1109/JPROC.2013.2251852

    A tutorial article covering the main concepts of statistical parametric speech synthesis using Hidden Markov Models. Also touches on singing synthesis and controllable models.

  • van Santen, J. P. H., R. W. Sproat, J. P. Olive, and J. Hirschberg, eds. 1997. Progress in speech synthesis. New York: Springer.

    Covers most aspects of text-to-speech, but is now dated. Material that remains relevant: Yarowsky on homograph disambiguation; Sproat’s introduction to the Linguistic Analysis section; and Campbell and Black’s inclusion of prosody in the unit selection target cost, to minimize the need for subsequent signal processing (the implementation details are no longer relevant).

  • Zen, H., K. Tokuda, and A. W. Black. 2009. Statistical parametric speech synthesis. Speech Communication 51.11: 1039–1064.

    DOI: 10.1016/j.specom.2009.04.004

    Written before the resurgence of neural networks, this is an authoritative and technical introduction to HMM-based statistical parametric speech synthesis.
