Title |
Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development |
Authors |
Kevin Walker, Christopher Caruso and Denise DiPersio |
Abstract |
The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed description of the design and implementation of LDC's collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system's operation. This paper will also discuss the challenges of managing remote collections, in particular the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC's collection database and downstream tasking workflow. |
Language |
Other |
Topics |
Speech resource/database, Tools, systems, applications, Other |
Full paper  |
Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development |
Bibtex |
@InProceedings{WALKER10.667,
  author = {Kevin Walker and Christopher Caruso and Denise DiPersio},
  title = {Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |