Ideally, speech data is collected in an anechoic chamber, with high-quality recording equipment, in a comfortable setting, at CD quality, with simultaneous audio and electroglottograph (EGG) signals. In practice, we collected KAL1 in a soundproof booth in the Electrical and Computer Engineering lab at CMU, and then collected KAL2 through KAL4 at the apartment of one of the authors, largely between 4 and 6 AM, before traffic got started.
We used a Shure SM-2 close-talking headset microphone, a Symmetrics SX202 microphone preamp, a Glottal Enterprises EG2-PC electroglottograph, and a SoundBlaster X-Gamer PCI audio card, with various recording and playback utilities on a decent machine running Linux. The computer was on an uninterruptible power supply (UPS), which reduced electrical noise. We used a wireless keyboard and mouse so that the subject could sit several feet back from the computer monitor, which otherwise introduced considerable noise into the recordings. The radio-frequency signal from the wireless keyboard and mouse appeared to have no detrimental effect.
In general, we record diphone sets at 32 kHz, with simultaneous audio and EGG signals, after making sure the levels are sane for both (sane meaning peaking in the 80% range). After collection, these signals are split into separate files.
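The level check and the split into separate files can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for the KAL recordings: it assumes the audio and EGG signals were captured as the left and right channels of a single 16-bit stereo WAV file, which the text above does not specify. The function names and file paths are hypothetical.

```python
import array
import sys
import wave


def split_stereo(samples):
    """Split interleaved stereo samples into (left, right) channels."""
    return samples[0::2], samples[1::2]


def peak_fraction(samples, full_scale=32768):
    """Return the largest absolute sample as a fraction of 16-bit full scale."""
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / full_scale


def write_mono(path, samples, rate=32000):
    """Write 16-bit mono PCM samples to a WAV file."""
    buf = array.array("h", samples)
    if sys.byteorder == "big":
        buf.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(buf.tobytes())


def split_recording(stereo_path, audio_path, egg_path):
    """Write each channel of a stereo WAV to its own file.

    Assumption: channel 0 holds the audio and channel 1 the EGG signal.
    Returns the peak level of each channel as a fraction of full scale.
    """
    with wave.open(stereo_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = src.getframerate()
        data = array.array("h")
        data.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        data.byteswap()
    audio, egg = split_stereo(data)
    write_mono(audio_path, audio, rate)
    write_mono(egg_path, egg, rate)
    return peak_fraction(audio), peak_fraction(egg)
```

A call such as `split_recording("kal_0001_stereo.wav", "kal_0001.wav", "kal_0001_egg.wav")` (hypothetical file names) returns the two peak fractions, which can be compared against the 80% range mentioned above before accepting a take.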
We have also collected diphone sets directly to a laptop computer in a quiet room (i.e. one without air-conditioning or other computers, which isn't easy to find on the CMU campus). Laptops should be run on battery power to reduce hum. The audio systems on some laptops, however, are not of high enough quality for recording, so it is very important to ensure that the machine's audio device is good enough. For other synthesis techniques, such as the limited domain synthesizers [2] we have built in the FestVox framework, the audio quality is less important, as there are typically multiple examples of each phone type. In our diphone database there is often only one example, and hence every part of the recording must be good.