LREC 2010 Proceedings

Summary of the paper

Title	The JOS Linguistically Tagged Corpus of Slovene
Authors	Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek
Abstract	The JOS language resources are meant to facilitate developments of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and text annotation tool). The paper introduces these components, and concentrates on jos100k, a 100,000 word sampled balanced monolingual Slovene corpus, manually annotated for three levels of linguistic description. On the morphosyntactic level, each word is annotated with its morphosyntactic description and lemma; on the syntactic level the sentences are annotated with dependency links; on the semantic level, all the occurrences of 100 top nouns in the corpus are annotated with their wordnet synset from the Slovene semantic lexicon sloWNet. The JOS corpora and specifications have a standardised encoding (Text Encoding Initiative Guidelines TEI P5) and are available for research from http://nl.ijs.si/jos/ under the Creative Commons licence.
Language	Semantics
Topics	Corpus (creation, annotation, etc.), Grammar and Syntax, Semantics
Full paper	The JOS Linguistically Tagged Corpus of Slovene
Bibtex	@InProceedings{ERJAVEC10.139, author = {Tomaž Erjavec and Darja Fišer and Simon Krek and Nina Ledinek}, title = {The JOS Linguistically Tagged Corpus of Slovene}, booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }