Title |
The Web Library of Babel: evaluating genre collections |
Authors |
Serge Sharoff, Zhili Wu and Katja Markert |
Abstract |
We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several different genre-annotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS). We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of their ability to generalise both lexical and syntactic phenomena related to genres. However, we also show that these impressive results might not be transferrable to the wider web due to the lack of comparability between different annotation labels (many webpages cannot be described in terms of the genre labels in individual collections), lack of representativeness of existing collections (many genres are represented by webpages coming from a small number of sources) as well as problems in the reliability of genre annotation (many pages from the web are difficult to interpret in terms of the labels available). This suggests that more research is needed to understand genres on the Web. |
Language |
Corpus (creation, annotation, etc.) |
Topics |
Document Classification, Text categorisation, Statistical and machine learning methods, Corpus (creation, annotation, etc.) |
Full paper  |
The Web Library of Babel: evaluating genre collections |
Bibtex |
@InProceedings{SHAROFF10.28,
  author = {Serge Sharoff and Zhili Wu and Katja Markert},
  title = {The Web Library of Babel: evaluating genre collections},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |