Title |
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora |
Authors |
Tomaž Erjavec |
Abstract |
The paper presents the fourth, "Mondilex" edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications; morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The fourth release of these resources introduces XML-encoded morphosyntactic specifications and adds six new languages, bringing the total to 16: to Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Serbian, Slovene, and the Resian dialect of Slovene it adds Macedonian, Persian, Polish, Russian, Slovak, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes at http://nl.ijs.si/ME/V4/. |
Language |
Standards for LRs |
Topics |
Part of speech tagging, Morphology, Standards for LRs |
Full paper  |
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora |
Bibtex |
author = {Tomaž Erjavec}, title = {MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} } |