Posted on Monday, June 11, 02012 by Andréa Davis
This summer, the Rosetta Project is working on a series of Record-a-thon events. While previous Record-a-thons have collected recordings from many languages at once, this summer's events each focus on a specific language. The aim is to capture, through video recording, spoken language samples from a small group of speakers for each language, with at least two hours of recording per language.
The Record-a-thon events taking place this summer will collect many hours of video and audio linguistic data. While this is certainly valuable information, it would be far more useful for future research in an easily searchable format. Generally, the larger a body of information grows, the harder it becomes for a human to extract anything from it. This is, of course, the problem of the information age.
With the advent of modern recording techniques, this is also a large problem for speech scientists and speech technologists. A common way of formatting audio data is to segment the speech, which is continuous (there are no pauses between sounds or words), into individual speech sounds (phonemes) and to label them (e.g., the second sound in the word Rosetta would be labeled 'o' and marked as occurring from the end of the 'r' sound to the beginning of the 'z' sound). Having segmented and labeled speech sounds myself, I can attest that it is an extremely laborious process. To have data that is already formatted, processed, and ready for analysis is an enormous boon.
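To make this concrete, here is a minimal sketch in Python of what one segmented-and-labeled word might look like as data. The phoneme labels follow the Rosetta example above; the times are invented purely for illustration.

```python
# A minimal sketch of segmented-and-labeled speech: each sound in the
# word "Rosetta" gets a phoneme label plus start and end times.
# The times below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # phoneme label, e.g. 'r', 'o', 'z'
    start: float  # where the sound begins, in seconds
    end: float    # where the sound ends, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

# "Rosetta" broken into speech sounds; the 'o' runs from the end
# of the 'r' to the beginning of the 'z', as described above.
rosetta = [
    Segment('r', 0.00, 0.07),
    Segment('o', 0.07, 0.15),
    Segment('z', 0.15, 0.24),
    Segment('e', 0.24, 0.33),
    Segment('t', 0.33, 0.39),
    Segment('a', 0.39, 0.48),
]

for seg in rosetta:
    print(f"{seg.label}: {seg.start:.2f}-{seg.end:.2f}s ({seg.duration:.2f}s)")
```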
Recordings of spoken language have tremendous value, but they are not always immediately useful to those who would have the greatest interest in the data. It is my hope to make the data collected this summer immediately useful to researchers, both in academia and in industry. In particular, I am interested in creating an audio corpus (body of data) for each language recorded, ideally one divided into individual speech sounds (segmented), with each sound labeled. This type of corpus is called an aligned speech corpus. Recently, there have been attempts to automate this process, letting a computer segment the speech sounds; this greatly reduces the time needed to turn raw speech into formatted, more immediately useful data.
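As a rough sketch of what "aligned" means in practice, an aligned corpus pairs each recording with a time-stamped list of its sounds. The tab-separated format below is a hypothetical minimal one (real corpora often use richer formats, such as Praat TextGrids), but the idea is the same.

```python
# A hypothetical minimal alignment format: one line per sound, with
# label, start, and end separated by tabs. Real corpora often use
# richer formats (such as Praat TextGrids), but the idea is the same.
def parse_alignment(text: str) -> list[tuple[str, float, float]]:
    """Parse one utterance's alignment into (label, start, end) triples."""
    segments = []
    for line in text.strip().splitlines():
        label, start, end = line.split('\t')
        segments.append((label, float(start), float(end)))
    return segments

# Invented alignment for the first three sounds of "Rosetta":
example = "r\t0.00\t0.07\no\t0.07\t0.15\nz\t0.15\t0.24"
print(parse_alignment(example))
# [('r', 0.0, 0.07), ('o', 0.07, 0.15), ('z', 0.15, 0.24)]
```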
Who exactly is interested in spoken language data? I mentioned speech scientists above; phoneticians are a prime example. Unlike Professor Henry Higgins in the movie My Fair Lady, today's phoneticians are interested in how language is actually spoken rather than how it is supposed to be spoken. One question a phonetician might ask is whether vowels are longer at the end of a word than at the beginning. English speakers generally don't think of speech sounds as being longer or shorter, but in fact different sounds differ from one another in length, and even the same speech sound may vary in length depending on its location in a word. This kind of question is answerable with spoken data that has the beginnings and ends of sounds marked (segmented) and the sounds themselves labeled.
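Here is one way that question could be answered once the data is segmented and labeled: a short Python sketch comparing vowel durations by position in the word. The two toy words, their labels, and their times are all invented.

```python
# A sketch of the vowel-length question: given segmented, labeled
# words, compare vowel durations at the start versus the end of a
# word. The words, labels, and times are all invented.
VOWELS = {'a', 'e', 'i', 'o', 'u'}

# Each word is a list of (label, start, end) triples, times in seconds.
words = [
    [('o', 0.00, 0.09), ('p', 0.09, 0.14), ('e', 0.14, 0.20), ('n', 0.20, 0.26)],  # "open"
    [('s', 0.00, 0.08), ('o', 0.08, 0.14), ('f', 0.14, 0.20), ('a', 0.20, 0.33)],  # "sofa"
]

initial, final = [], []
for word in words:
    for i, (label, start, end) in enumerate(word):
        if label in VOWELS:
            duration = end - start
            if i == 0:
                initial.append(duration)
            elif i == len(word) - 1:
                final.append(duration)

print(f"mean word-initial vowel: {sum(initial) / len(initial):.3f}s")
print(f"mean word-final vowel:   {sum(final) / len(final):.3f}s")
```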
Speech and hearing scientists could also find value in segmented and labeled speech data. Speech and hearing science aims to identify and treat language disorders, but doing so requires examples of "normal" language. How is language normally spoken? What is a sign of a disorder, versus normal variation within the language? Having many examples of a language, from different speakers, would help answer these questions.
Finally, the language technology community could also find labeled and segmented speech data extremely useful. Babies are not born knowing a particular language; they need continuous exposure to a language in order to learn it. An automatic speech recognition system is the same: it must be trained on a particular language before it can recognize words, phrases, and sentences in that language. Many speech recognition systems are trained on aligned speech corpora.
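To sketch why alignment matters here: the segment times let each phoneme label be paired with its own slice of the recording, which is exactly the kind of labeled example a recognizer learns from. The audio and times below are placeholders, not real data.

```python
# Why alignment helps training: segment times let each phoneme label
# be paired with its slice of the waveform, producing the labeled
# examples a recognizer learns from. Audio and times are stand-ins.
sample_rate = 16000              # samples per second
audio = [0.0] * sample_rate      # one second of placeholder audio

alignment = [('r', 0.00, 0.07), ('o', 0.07, 0.15), ('z', 0.15, 0.24)]

training_pairs = []
for label, start, end in alignment:
    clip = audio[int(start * sample_rate):int(end * sample_rate)]
    training_pairs.append((label, clip))

for label, clip in training_pairs:
    print(label, len(clip), "samples")
```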
Several aligned speech corpora exist for English. What would make aligned corpora built from this summer's data especially valuable is that the languages we anticipate collecting are not the extremely well-studied, more common ones. Having corpora for these languages provides wider access to them. A researcher in Tucson (where I live) might struggle to find 10 speakers of Latvian, but could reach a substantial spoken Latvian corpus over the internet. Further, a corpus that is already labeled gives the researcher far more usable data than any single person could collect within a reasonable time frame. Having data for less common languages allows for a better understanding of language in general, better understanding and diagnosis of language disorders, and the expansion of speech technologies to new populations.