Title |
The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News |
Authors |
Yannick Estève, Thierry Bazillon, Jean-Yves Antoine, Frédéric Béchet and Jérôme Farinas |
Abstract |
This paper presents the EPAC corpus, which is composed of 100 hours of manually transcribed conversational speech and of the outputs of automatic tools (automatic segmentation, transcription, POS tagging, etc.) applied to the entire French ESTER 1 audio corpus: about 1700 hours of audio recordings from radio shows. This corpus was built during the EPAC project, funded by the French Research Agency (ANR) from 2007 to 2010. It significantly increases the amount of manually transcribed French audio recordings that are easily available, and it is now included as part of the ESTER 1 corpus in the ELRA catalog at no additional cost. By providing a large set of automatic outputs of speech processing tools, the EPAC corpus should be useful to researchers who want to work on such data without having to develop and deal with such tools. These automatic annotations are varied: segmentation and speaker diarization, one-best hypotheses from the LIUM automatic speech recognition system with confidence measures, as well as word lattices and confusion networks, named entities, part-of-speech tags, chunks, etc. The 100 hours of manually transcribed speech were split into three data sets to provide an official training corpus, an official development corpus and an official test corpus. These data sets were used to develop and evaluate some of the automatic tools used to process the 1700 hours of audio recordings. For example, on the EPAC test data set, our ASR system yields a word error rate of 17.25%. |
Language |
Topics |
Corpus (creation, annotation, etc.) |
Full paper  |
The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News |
Bibtex |
author = {Yannick Estève and Thierry Bazillon and Jean-Yves Antoine and Frédéric Béchet and Jérôme Farinas}, title = {The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} } |