In previous synthesizers we have labeled spoken prompts by applying DTW (dynamic time warping) techniques to a synthesized version of the prompts generated by an existing synthesizer. This technique, based on [4], works well within a language, but we have also often used it cross-lingually. In the latter case, one takes a close language (or perhaps just English) and maps the phones of the target language to approximations in the labeling language. Synthesizing with that mapping provides acoustic prompts which, although they may sound very English, have approximately the right properties to allow reasonable alignment using DTW.
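The alignment step can be sketched as standard DTW over per-frame acoustic features (e.g., MFCC frames of the synthesized and recorded prompts); the distance function and one-dimensional features below are hypothetical simplifications for illustration, not the implementation used in the system:

```python
# Minimal DTW sketch: align two feature sequences and return the total
# cost plus the warping path. Labels on the synthesized side can then be
# projected onto the recording along the path. Here each "frame" is a
# single number with absolute difference as the local distance; real
# systems use vector features (e.g., MFCCs) and a vector distance.
import math

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack along locally cheapest predecessors to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The path pairs each frame of one sequence with at least one frame of the other, which is what allows segment boundaries known on the synthesized side to be mapped onto the recorded speech.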
However, such techniques require phonetic knowledge to decide which phoneme in the labeling language maps to which in the target language, and here we wish to require no such knowledge of the target language.
In this case, we used the SphinxTrain acoustic modeling tools [5] to build context-dependent semi-continuous HMM models using the letters as phone names. This does require an orthographic transcription of the prompts (which were read by the native speaker when they were recorded). It also implicitly requires sufficient data to provide reasonable acoustic coverage.
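Using letters as phone names amounts to deriving the phone inventory and the pronunciation dictionary directly from the orthography. The sketch below illustrates the idea; the data structures are simplified stand-ins, not the exact file formats SphinxTrain consumes:

```python
# Sketch: letters as phones. From orthographic transcriptions we derive
# (a) a "pronunciation" for each word, namely its letter sequence, and
# (b) the phone inventory, namely the set of letters observed.
# This is an illustrative simplification of the letter-as-phone setup.

def letter_dict(words):
    """Map each word to a pronunciation made of its letters."""
    return {w: list(w.lower()) for w in words}

def phone_set(dictionary):
    """The phone inventory is simply the set of letters observed."""
    return sorted({p for pron in dictionary.values() for p in pron})

d = letter_dict(["hola", "mundo"])
# d["hola"] -> ['h', 'o', 'l', 'a']
```

No target-language phonetic knowledge is needed: both the dictionary and the phone set fall out of the transcriptions themselves, with context-dependent modeling left to absorb the letter-to-sound variation.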
At this point, we have probably taken advantage of some phonetic knowledge in the original choice of sentences to include in the prompt set, in that they were selected to have rich diphone coverage. However, it could be argued that using a selection criterion based on letter rather than phone distribution would produce a similar database.