Summary of the paper

Title BlogBuster: A Tool for Extracting Corpora from the Blogosphere
Authors Georgios Petasis and Dimitrios Petasis
Abstract This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, suitable for linguistic and language technology research and development, has attracted significant research interest recently. Several general purpose approaches for removing boilerplate have been presented in the literature; however the blogosphere poses additional requirements, such as a finer control over the extracted textual segments in order to accurately identify important elements, i.e. individual blog posts, titles, posting dates or comments. BlogBuster tries to provide such additional details along with boilerplate removal, following a rule-based approach. A small set of rules were manually constructed by observing a limited set of blogs from the Blogger and Wordpress hosting platforms. These rules operate on the DOM tree of an HTML page, as constructed by a popular browser, Mozilla Firefox. Evaluation results suggest that BlogBuster is very accurate when extracting corpora from blogs hosted in the Blogger and Wordpress, while exhibiting a reasonable precision when applied to blogs not hosted in these two popular blogging platforms.
Language Web Services
Topics Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval, Web Services
Full paper BlogBuster: A Tool for Extracting Corpora from the Blogosphere
Bibtex @InProceedings{PETASIS10.808,
  author = {Georgios Petasis and Dimitrios Petasis},
  title = {BlogBuster: A Tool for Extracting Corpora from the Blogosphere},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }