Summary of the paper

Title Challenges in Building a Multilingual Alpine Heritage Corpus
Authors Martin Volk, Noah Bubenhofer, Adrian Althaus, Maya Bangerter, Lenz Furrer and Beni Ruef
Abstract This paper describes our efforts to build a multilingual heritage corpus of alpine texts. Currently we digitize the yearbooks of the Swiss Alpine Club which contain articles in French, German, Italian and Romansch. Articles comprise mountaineering reports from all corners of the earth, but also scientific topics such as topography, geology or glacierology as well as occasional poetry and lyrics. We have already scanned close to 70,000 pages which has resulted in a corpus of 25 million words, 10% of which is a parallel French-German corpus. We have solved a number of challenges in automatic language identification and text structure recognition. Our next goal is to identify the great variety of toponyms (e.g. names of mountains and valleys, glaciers and rivers, trails and cabins) in this corpus, and we sketch how a large gazetteer of Swiss topographical names can be exploited for this purpose. Despite the size of the resource, exact matching leads to a low recall because of spelling variations, language mixtures and partial repetitions.
Language Named Entity recognition
Topics Corpus (creation, annotation, etc.), Multilinguality, Named Entity recognition
Full paper Challenges in Building a Multilingual Alpine Heritage Corpus
Bibtex @InProceedings{VOLK10.110,
  author = {Martin Volk and Noah Bubenhofer and Adrian Althaus and Maya Bangerter and Lenz Furrer and Beni Ruef},
  title = {Challenges in Building a Multilingual Alpine Heritage Corpus},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }