Summary of the paper

Title Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Authors Kevin Walker, Christopher Caruso and Denise DiPersio
Abstract The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed description of the design and implementation of LDC’s collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system’s operation. This paper will also discuss the challenges of managing remote collections, in particular, the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC’s collection database and downstream tasking workflow.
Language Other
Topics Speech resource/database, Tools, systems, applications, Other
Full paper Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Bibtex @InProceedings{WALKER10.667,
  author = {Kevin Walker and Christopher Caruso and Denise DiPersio},
  title = {Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }