For speech recognition, we used the CMU Sphinx II system [8], a relatively lightweight recognizer that runs in real time even on machines with small memory and modest-speed processors. For Automatic Speech Recognition (ASR) to work, we need to build two basic types of models: acoustic models, which model the acoustic-phonetic space of the given language, and language models, which model the probability of word sequences. In addition to these models, we also need two lexicons, one for English and one for Croatian, that map words to their pronunciations.
For the English acoustic models, we could have used existing acoustic models trained on similar wide-band speech. However, as no conversational wide-band speech databases in the intended domain were readily available, we felt it better to train on the chaplain dialogs directly rather than to apply some form of adaptation to existing models. Although such adaptation techniques may have been beneficial and feasible for English, we knew that no such data was available for Croatian, and part of this exercise was to develop speech-to-speech translation systems for languages that did not already have speech resources constructed for them. Thus, for English we took only the 4.25 hours of chaplain speech and directly trained semi-continuous HMMs for Sphinx II.
For the English language model we required a larger collection of in-domain text. We used the dialog transcriptions themselves and augmented them with text from chaplain handbooks that were made available to us. Although we knew we could improve recognition accuracy by using more resources, we were interested in limiting the resources necessary for this work; moreover, as discussed below, we found the models trained from this data adequate for the task.
Building Croatian models was harder. As we were aware that our pool of Croatian speakers was limited, and that they had less skill in carrying out full word transcription of conversational speech, we wished to find a simpler, less resource-intensive method to build Croatian acoustic models. From the translated chaplain transcripts, we wished to select example utterances that, when recorded, would give sufficient acoustic coverage to allow reasonable acoustic models to be trained. To do this, we used a technique originally developed for selecting text to record for speech synthesis [2]. Using the initially developed Croatian speech synthesizer, we could find the phonemes that would be used to say each utterance. We then ran a greedy selection algorithm that selects the utterances that would best cover the acoustic space [2]. From a list of several thousand utterances, we selected groups of 250 utterances that were phonetically rich. These sets were then read by a number of Croatian speakers. Using read speech avoided the need to hand-transcribe the recordings, though it does make the training data less like the intended conversational speech. Due to the relative scarcity of native Croatian speakers, we recorded only 15 different speakers, of whom 13 were female and 2 were male. This resulted in a gender imbalance, which, however, was not observed to affect the system's performance greatly. In all, 4.0 hours of Croatian speech were collected. This data alone was then used to train new acoustic models for Croatian.
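As an illustration of the greedy selection step, the following is a minimal Python sketch, not the implementation used in this work. It assumes that each candidate utterance has already been converted to a phoneme sequence (standing in for the output of the synthesizer front end) and that coverage is measured over adjacent phoneme pairs (diphones). The coverage unit, the function names, and the toy phoneme strings are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, str]  # assumed coverage unit: a pair of adjacent phonemes


def diphones(phonemes: List[str]) -> Set[Unit]:
    """Return the set of adjacent phoneme pairs in one utterance."""
    return set(zip(phonemes, phonemes[1:]))


def greedy_select(utterances: Dict[str, List[str]], n_select: int) -> List[str]:
    """Greedily pick up to n_select utterance ids.

    At each step the utterance that adds the most not-yet-covered units
    is chosen. Ties are broken by utterance id so the result is
    deterministic. Selection stops early once nothing new can be covered.
    """
    units = {uid: diphones(ph) for uid, ph in utterances.items()}
    covered: Set[Unit] = set()
    remaining = set(units)
    selected: List[str] = []

    while remaining and len(selected) < n_select:
        # sorted() fixes the iteration order; max() keeps the first best.
        best = max(sorted(remaining), key=lambda u: len(units[u] - covered))
        if not units[best] - covered:
            break  # no remaining utterance adds new coverage
        selected.append(best)
        covered |= units[best]
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    # Toy data; the phoneme strings are illustrative only.
    candidates = {
        "utt1": ["d", "o", "b", "a", "r"],
        "utt2": ["d", "o"],
        "utt3": ["b", "a", "r", "k", "a"],
    }
    print(greedy_select(candidates, n_select=2))
```

One plausible way to obtain several groups of a fixed size would be to rerun the selection on the utterances not yet chosen, starting again from an empty covered set; this too is an assumption rather than a description of the procedure in [2].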
For both English and Croatian recognition systems, semi-continuous 5-state triphone HMMs were trained. The number of tied states used in each case was commensurate with the amount of training data available. Although the English models did have explicit modeling of filled pauses (non-linguistic verbalized sounds such as ``um'', ``uh'', etc.), no such models were trained for Croatian. This was partly because the recorded Croatian speech was read and therefore contained few spontaneous speech phenomena such as filled pauses.
In both cases, the language models were word trigrams built with absolute discounting. The language-model vocabularies consisted of 2900 words for English and 3900 words for Croatian. In pilot experiments with held-out test sets, the word error rates were below 15% for English and below 20% for Croatian.
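For reference, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The Python sketch below shows that standard calculation over a set of utterances. It is not the scoring tool used in the pilot experiments, and the function names and example sentences are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words.

    Each pair is (reference transcript, recognizer hypothesis);
    words are assumed to be separated by whitespace.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_words


if __name__ == "__main__":
    test_set = [("the chaplain is here", "the chaplain was here")]
    print(f"WER = {corpus_wer(test_set):.2%}")
```

Pooling errors over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the figure.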
We note that, as the utterances used in training were not spontaneous, the system was more easily confused by hesitations and filled pauses. However, in the actual user tests of the system this proved to be less of a problem than we expected. As turns in a conversation through a speech-to-speech translation system are slower and less spontaneous than those in single-language conversations, speakers were more careful in their delivery than they might be in ordinary conversation.