Title |
The JOS Linguistically Tagged Corpus of Slovene |
Authors |
Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek |
Abstract |
The JOS language resources are meant to facilitate developments of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and text annotation tool). The paper introduces these components, and concentrates on jos100k, a 100,000 word sampled balanced monolingual Slovene corpus, manually annotated for three levels of linguistic description. On the morphosyntactic level, each word is annotated with its morphosyntactic description and lemma; on the syntactic level the sentences are annotated with dependency links; on the semantic level, all the occurrences of 100 top nouns in the corpus are annotated with their wordnet synset from the Slovene semantic lexicon sloWNet. The JOS corpora and specifications have a standardised encoding (Text Encoding Initiative Guidelines TEI P5) and are available for research from http://nl.ijs.si/jos/ under the Creative Commons licence. |
Language |
Semantics |
Topics |
Corpus (creation, annotation, etc.), Grammar and Syntax, Semantics |
Full paper  |
The JOS Linguistically Tagged Corpus of Slovene |
Bibtex |
@InProceedings{ERJAVEC10.139,
  author = {Tomaž Erjavec and Darja Fišer and Simon Krek and Nina Ledinek},
  title = {The JOS Linguistically Tagged Corpus of Slovene},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |