Summary of the paper

Title The Web Library of Babel: evaluating genre collections
Authors Serge Sharoff, Zhili Wu and Katja Markert
Abstract We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several different genre-annotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS). We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of their ability to generalise both lexical and syntactic phenomena related to genres. However, we also show that these impressive results might not be transferrable to the wider web due to the lack of comparability between different annotation labels (many webpages cannot be described in terms of the genre labels in individual collections), lack of representativeness of existing collections (many genres are represented by webpages coming from a small number of sources) as well as problems in the reliability of genre annotation (many pages from the web are difficult to interpret in terms of the labels available). This suggests that more research is needed to understand genres on the Web.
Language Corpus (creation, annotation, etc.)
Topics Document Classification, Text categorisation, Statistical and machine learning methods, Corpus (creation, annotation, etc.)
Full paper The Web Library of Babel: evaluating genre collections
Bibtex @InProceedings{SHAROFF10.28,
  author = {Serge Sharoff and Zhili Wu and Katja Markert},
  title = {The Web Library of Babel: evaluating genre collections},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }