Summary of the paper

Title Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Authors Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama, Masaya Yamaguchi, Hideki Ogura, Wakako Kashino, Toshinobu Ogiso, Hanae Koiso and Yasuharu Den
Abstract Compilation of a 100 million words balanced corpus called the Balanced Corpus of Contemporary Written Japanese (or BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin board and blogs) and so forth, and when possible, samples are drawn from the rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of ‘word.’ Currently, more than 90 million words have been sampled and XML annotated with respect to text-structure and lexical and character information. A preliminary linear discriminant analysis of text genres using the data of POS frequencies and sentence length revealed it was possible to classify the text genres with a correct identification rate of 88% as far as the samples of books, newspapers, white papers, and internet bulletin boards are concerned. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting the considerable variance of the blog texts in terms of the textual register and style.
Language Document Classification, Text categorisation
Topics Corpus (creation, annotation, etc.), Morphology, Document Classification, Text categorisation
Full paper Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Bibtex @InProceedings{MAEKAWA10.99,
  author = {Kikuo Maekawa and Makoto Yamazaki and Takehiko Maruyama and Masaya Yamaguchi and Hideki Ogura and Wakako Kashino and Toshinobu Ogiso and Hanae Koiso and Yasuharu Den},
  title = {Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }