eSpeak


eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG is a continuation of the original developer's project with more feedback from native speakers.
Because of its small size and many languages, eSpeakNG is included in the NVDA open-source screen reader for Windows, as well as on Android, Ubuntu and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016 and was used by Google Translate for 27 languages in 2010; 17 of these were subsequently replaced by proprietary voices.
The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia. Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English. On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007. Development on Speak continued until version 1.14, when it was renamed to eSpeak.
Development of eSpeak continued from version 1.16 with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first to be version-controlled using Subversion, with separate source and binary downloads made available on SourceForge. From version 1.27, eSpeak was licensed under the GPLv3. The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS. The last development release was 1.48.15 on 16 April 2015.
eSpeak represents phonemes with ASCII characters, using a scheme based on the Usenet ASCII-IPA (Kirshenbaum) notation.

eSpeak NG

On 25 June 2010, Reece Dunn started a fork of eSpeak on GitHub based on the 1.43.46 release, initially as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.
On 4 October 2015, this fork started diverging more significantly from the original eSpeak.
On 8 December 2015, there were discussions on the eSpeak mailing list about Jonathan Duddington's inactivity in the eight months since the last eSpeak development release. These evolved into discussions of continuing development of eSpeak in his absence, and on 11 December 2015 the espeak-ng fork was created, using the GitHub version of eSpeak as the basis for future development. The first release of espeak-ng was 1.49.0 on 10 September 2016, containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.
It supports Speech Synthesis Markup Language (SSML).
Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
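As an illustration, the following command-line sketch selects voices and variants and passes SSML input; the sample texts are arbitrary, and the set of SSML tags honoured can vary between versions:

  # speak with the default English voice
  espeak-ng "Hello world"

  # Afrikaans voice with the "f2" female variant, slower and lower-pitched
  espeak-ng -v af+f2 -s 140 -p 40 "Goeie môre"

  # interpret SSML markup in the input text (-m)
  espeak-ng -m "<speak>Hello <break time='500ms'/> world</speak>"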
eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet ASCII-IPA (Kirshenbaum) system.
Phonetic representations can be included within text input by enclosing them in double square brackets. For example, espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English.
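The reverse direction is also available: the phoneme mnemonics (or IPA) that eSpeakNG generates for a given text can be printed from the command line. A brief sketch using the documented -q (no audio), -x (phoneme mnemonics) and --ipa options:

  # print eSpeakNG's ASCII phoneme mnemonics without producing audio
  espeak-ng -q -x -v en "Hello world"

  # print the phonemes in IPA notation instead
  espeak-ng -q --ipa -v en "Hello world"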

Synthesis method

eSpeakNG performs text-to-speech conversion in two steps, and it can be used for either step independently, depending on which part of the pipeline the user needs.

Step 1 – text-to-phoneme translation

There are many languages which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.
  1. Input text is translated into pronunciation phonemes.
  2. Pronunciation phonemes are synthesized into sound.
To synthesize more human-sounding, non-monotonous speech, prosody data (intonation, stress and rhythm) and other information are also necessary. For example, in eSpeakNG's phoneme notation a stressed syllable is marked by a preceding apostrophe, which produces more natural speech (see the sketch below).
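A rough command-line sketch of this step; the phoneme string in the second command is illustrative rather than an exact transcription from any particular version:

  # step 1 only: translate text to phoneme mnemonics, no audio
  espeak-ng -q -x "text to speech"

  # speak an explicit phoneme string; the apostrophes mark stressed syllables
  espeak-ng "[[t'Ekst tu sp'i:tS]]"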
If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.

Step 2 – sound synthesis from prosody data

eSpeakNG provides two different approaches to formant speech synthesis: its own eSpeakNG synthesizer and a Klatt synthesizer:
  1. The eSpeakNG synthesizer creates voiced speech sounds, such as vowels and sonorant consonants, by additive synthesis, adding together sine waves to make the total sound. Unvoiced consonants such as /s/ are made by playing recorded sounds, because their noise-like, aperiodic spectra make additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it produces sounds by subtractive synthesis: it starts with a generated source signal with a broad frequency spectrum and applies digital filters and enveloping to shape the frequency spectrum and amplitude envelope of a particular consonant or sonorant sound. The Klatt synthesizer can be selected per invocation, as shown below.
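Assuming the klatt voice variants shipped with eSpeakNG, the synthesizer is chosen using the voice-variant mechanism described above:

  # default eSpeakNG (additive) synthesizer
  espeak-ng -v en "Hello world"

  # the same voice rendered with the Klatt synthesizer
  espeak-ng -v en+klatt "Hello world"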
For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program using the PHO file format and captures the audio that MBROLA produces. That audio is then handled by eSpeakNG.
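A sketch of this pipeline, assuming the MBROLA en1 voice is installed; the voice database path below is a common location rather than a fixed one:

  # write MBROLA .pho prosody data to a file instead of playing audio
  espeak-ng -q -v mb-en1 --pho "Hello world" > hello.pho

  # synthesize the audio with the standalone MBROLA program
  mbrola /usr/share/mbrola/en1/en1 hello.pho hello.wav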

Languages

eSpeakNG performs text-to-speech synthesis for the following languages; a name listed more than once indicates multiple regional or dialect variants:
  1. Afrikaans
  2. Albanian
  3. Amharic
  4. Ancient Greek
  5. Arabic
  6. Aragonese
  7. Armenian
  8. Armenian
  9. Assamese
  10. Azerbaijani
  11. Bashkir
  12. Basque
  13. Belarusian
  14. Belter Creole
  15. Bengali
  16. Bishnupriya Manipuri
  17. Bosnian
  18. Bulgarian
  19. Burmese
  20. Cantonese
  21. Catalan
  22. Cherokee
  23. Chinese
  24. Chinese
  25. Chuvash
  26. Croatian
  27. Czech
  28. Danish
  29. Dutch
  30. English
  31. English
  32. English
  33. English
  34. English
  35. English
  36. English
  37. English
  38. Esperanto
  39. Estonian
  40. Finnish
  41. French
  42. French
  43. French
  44. Georgian
  45. German
  46. Greek
  47. Greenlandic
  48. Guarani
  49. Gujarati
  50. Haitian Creole
  51. Hawaiian
  52. Hebrew
  53. Hindi
  54. Hungarian
  55. Icelandic
  56. Ido
  57. Indonesian
  58. Interlingua
  59. Irish
  60. Italian
  61. Japanese
  62. Kannada
  63. Kazakh
  64. Kʼicheʼ
  65. Klingon
  66. Konkani
  67. Korean
  68. Kurdish
  69. Kyrgyz
  70. Latgalian
  71. Latin
  72. Latvian
  73. Lingua Franca Nova
  74. Lithuanian
  75. Lojban
  76. Luxembourgish
  77. Macedonian
  78. Malay
  79. Malayalam
  80. Maltese
  81. Māori
  82. Marathi
  83. Nahuatl
  84. Nepali
  85. Nogai
  86. Norwegian
  87. Oriya
  88. Oromo
  89. Papiamento
  90. Persian
  91. Persian
  92. Polish
  93. Portuguese
  94. Portuguese
  95. Punjabi
  96. Pyash
  97. Quechua
  98. Quenya
  99. Romanian
  100. Russian
  101. Russian
  102. Saami
  103. Scottish Gaelic
  104. Serbian
  105. Setswana
  106. Shan
  107. Sindarin
  108. Sindhi
  109. Sinhala
  110. Slovak
  111. Slovenian
  112. Spanish
  113. Spanish
  114. Swahili
  115. Swedish
  116. Tamil
  117. Tatar
  118. Telugu
  119. Thai
  120. Turkish
  121. Turkmen
  122. Ukrainian
  123. Urdu
  124. Uyghur
  125. Uzbek
  126. Vietnamese
  127. Vietnamese
  128. Vietnamese
  129. Welsh