EYDES
PPT slide
Notes:
Slide 11 of 26

In the archive, signal and transcription are aligned automatically. The precision of the alignment depends on the quality of both the sound and the transcription. The alignment is based on forced speech recognition.

Our system was developed using version 2.0 of Entropic's Hidden Markov Toolkit (HTK) and a grapheme-to-phoneme converter. The input, consisting of orthographic text and sampled speech data, is preprocessed and then passed to the HTK Viterbi decoder, which performs the actual alignment. The grapheme-to-phoneme converter can partly handle morphological peculiarities of German that influence pronunciation, and it can easily be adapted to any other natural language.

It consists of 740 context-sensitive pronunciation rules and about 500 rules, formulated as a right-linear regular grammar, that describe morphology. Further information is given in the next section.
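The idea of context-sensitive pronunciation rules can be illustrated with a minimal sketch. The rule set and phoneme symbols below are purely illustrative stand-ins, not the 740 rules of the actual system; the matching strategy (longest grapheme match first, with optional left/right context checks) is one plausible way such a converter can work.

```python
# Illustrative rule-based grapheme-to-phoneme sketch.
# Each rule: (left context, grapheme(s), right context, phoneme),
# applied longest-match-first at the current position.
RULES = [
    ("", "sch", "", "S"),   # German "sch" -> /S/
    ("", "ei", "", "aI"),   # diphthong
    ("", "s", "p", "S"),    # "s" before "p" -> /S/ (simplified)
    ("", "p", "", "p"),
    ("", "r", "", "r"),
    ("", "a", "", "a"),
    ("", "ch", "", "x"),
    ("", "e", "", "@"),     # schwa, heavily simplified
    ("", "n", "", "n"),
]

def g2p(word):
    """Transcribe a word by longest-match, context-sensitive rules."""
    phonemes, i = [], 0
    while i < len(word):
        # prefer longer grapheme matches, then check both contexts
        for left, graph, right, phone in sorted(
                RULES, key=lambda r: -len(r[1])):
            if (word.startswith(graph, i)
                    and word[:i].endswith(left)
                    and word[i + len(graph):].startswith(right)):
                phonemes.append(phone)
                i += len(graph)
                break
        else:
            i += 1  # unknown grapheme: skip (a real system would warn)
    return phonemes

print(g2p("sprechen"))  # ['S', 'p', 'r', '@', 'x', '@', 'n']
```

A full converter would additionally consult the morphological grammar, since segment boundaries (e.g. in German compounds) change which rule contexts apply.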
HTK requires a regular grammar for the Viterbi decoder to prune the search space of possible word sequences. For alignment, the regular grammar is simply the linear concatenation of phonemes or words, respectively. As mentioned above, the only variability comes from pauses between words, and the grammar is generated so that it represents these phenomena adequately. A language model does not have to be considered.
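Such an alignment grammar can be sketched as the fixed word sequence with an optional pause model allowed between words. The output format below imitates an HTK-style EBNF word network (square brackets marking optional symbols); the pause symbol "sp" is an assumption, not taken from the source.

```python
# Sketch: build the linear alignment "grammar" — the known word
# sequence in fixed order, with an optional pause ("sp", illustrative)
# permitted before, between, and after words.

def alignment_grammar(words, pause="sp"):
    """Linear concatenation of words with optional pauses."""
    parts = [f"[{pause}]"]          # optional leading pause
    for w in words:
        parts.append(w)
        parts.append(f"[{pause}]")  # optional pause after each word
    return "( " + " ".join(parts) + " )"

print(alignment_grammar(["guten", "Tag"]))
# ( [sp] guten [sp] Tag [sp] )
```

Because the word order is fixed, the decoder's only freedom is where the pauses occur, which is exactly what forced alignment requires.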
Furthermore, Viterbi decoding needs vectors that describe the sampled speech data in terms of features. Our system uses spectral information represented by 12 mel-frequency cepstral coefficients plus overall energy, together with their first and second derivatives, giving a total of 39 parameters per vector. The vector frames are spaced 10 ms apart and are calculated using a Hamming-weighted window of 25.6 ms length.
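The arithmetic behind this feature layout is easy to check: 12 cepstral coefficients plus energy give 13 static values, and adding first and second derivatives triples that to 39. A small sketch, with the frame-counting convention (count frames whose full window fits the signal) as an assumption:

```python
# Sketch of the feature-vector layout and framing described above:
# 12 MFCCs + overall energy = 13 static values; with first and second
# derivatives, 13 * 3 = 39 parameters per vector. Frames are spaced
# 10 ms apart under a 25.6 ms Hamming window.

FRAME_SHIFT_MS = 10.0
WINDOW_MS = 25.6
N_MFCC = 12

static = N_MFCC + 1          # 12 cepstral coefficients + energy
vector_size = static * 3     # statics + deltas + delta-deltas
print(vector_size)           # 39

def num_frames(duration_ms):
    """Frames whose full analysis window fits into the signal
    (one possible convention; others pad the signal edges)."""
    if duration_ms < WINDOW_MS:
        return 0
    return int((duration_ms - WINDOW_MS) // FRAME_SHIFT_MS) + 1

print(num_frames(1000.0))    # 98 frames for one second of speech
```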
The HMMs were trained on the ERBA (Erlanger Bahnauskunft) material, consisting of 40 speakers, both male and female, with 100 sentences per speaker.

The training material comprises six hours of speech at a sample rate of 16 kHz and a resolution of 16 bits per sample. The phonemes are modeled by context-independent left-to-right HMMs with 3 emitting states, single-mixture Gaussian output probability density functions, and diagonal covariance matrices. The models have no skip transitions, and the model for speech pauses has the same topology.
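The transition structure of such a model can be written down directly. The sketch below uses HTK's convention of a non-emitting entry state and exit state around the 3 emitting states; the probability values are illustrative initial values, not trained ones.

```python
# Sketch: transition matrix of a 3-emitting-state left-to-right HMM
# with no skip transitions (HTK convention: state 0 = non-emitting
# entry, state 4 = non-emitting exit). Probabilities are illustrative.

N = 5  # entry + 3 emitting states + exit

trans = [[0.0] * N for _ in range(N)]
trans[0][1] = 1.0              # entry -> first emitting state
for i in (1, 2, 3):
    trans[i][i] = 0.6          # self-loop (stay in state)
    trans[i][i + 1] = 0.4      # advance to the next state only
# no trans[i][i + 2] entries: skip transitions are disallowed

for row in trans[:-1]:
    assert abs(sum(row) - 1.0) < 1e-9  # each live row is stochastic
print(trans[1])                # [0.0, 0.6, 0.4, 0.0, 0.0]
```

With this topology every phoneme occupies at least 3 frames (30 ms), since each emitting state must be visited at least once.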

Because of the great variety of dialect forms, there is little chance of making them part of a pronunciation dictionary. Furthermore, there is no convention for transcribing them "correctly", and there is a need to transcribe special kinds of utterances without using the IPA transcription. Thus, a rule-based phonetization tool has to be part of the aligner in order to be flexible enough to handle these requirements. The tool we developed transcribes words according to context-sensitive pronunciation rules. Most of these rules follow the standard pronunciation of German and can either take the morphological segmentation of words into account or ignore it. In general, dialect forms are not morphologically segmented, but so far this has proven to work sufficiently well: the aligner can run on audio files of 20 minutes' length that are uttered and transcribed entirely in dialect, and the results of these runs can be integrated into the database without any further correction.

R. Schmidt / R. Neumann: Automatic Text-Speech Alignment: Aspects of Robustification. In: V. Matousek, P. Mautner, J. Ocelíková, P. Sojka (eds.): Text, Speech and Dialogue, Second International Workshop, TSD '99, Proceedings, Plzen 1999.

Notes:
Slide 11 of 26

It would be desirable to improve the alignment of signal and transcription automatically. This possibility depends on the quality of the sound and of the transcription. The attempt is based on forced speech recognition.
"Our system was developed on the basis of version 2.0 of the Entropic Hidden Markov Toolkit and with the help of a grapheme-to-phoneme converter. Orthographic text and sampled speech data are preprocessed and fed into the HTK Viterbi decoder, which then performs the actual alignment. The converter can partly handle morphological peculiarities of the German language that influence pronunciation, and this solution can easily be applied to any other living language.

It consists of 740 context-sensitive pronunciation rules and about 500 right-linear grammar rules describing the morphology. HTK depends on a regular grammar for the Viterbi decoder in order to prune the search space of possible word sequences. For the alignment, the regular grammar is the linear concatenation of phonemes or words, respectively. The only variability arises from the pauses between words. The grammar was generated in such a way that it represents these phenomena adequately. A language model is not required."

Translated into German from: R. Schmidt / R. Neumann: Automatic Text-Speech Alignment: Aspects of Robustification. In: V. Matousek, P. Mautner, J. Ocelíková, P. Sojka (eds.): Text, Speech and Dialogue, Second International Workshop, TSD '99, Proceedings, Plzen 1999.