Summary of the paper

Title New Tools for Web-Scale N-grams
Authors Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani and Sushant Narsale
Abstract While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
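As an illustration of the abstract's definition of an N-gram corpus (a table stating how often each word sequence up to length N occurs), here is a minimal sketch in Python. The function name `count_ngrams`, the whitespace tokenization, and the two-sentence toy corpus are assumptions made for this example; they are not the tools or data described in the paper.

```python
from collections import Counter


def count_ngrams(sentences, max_n):
    """Count every word sequence of length 1..max_n in the given sentences.

    Hypothetical helper for illustration only: tokens are split on
    whitespace, and n-grams never cross sentence boundaries.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the token list.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus (assumed data, not drawn from the paper).
corpus = ["the cat sat", "the cat ran"]
counts = count_ngrams(corpus, max_n=2)

print(counts[("the", "cat")])  # frequency of the bigram "the cat"
print(counts[("sat",)])        # frequency of the unigram "sat"
```

The resulting table stands in for the original text: a lookup such as `counts[("the", "cat")]` answers a frequency question without rescanning the sentences, which is the sense in which the abstract calls an N-gram corpus a compression of the text.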
Language
Topics Tools, systems, applications, Text mining
Full paper New Tools for Web-Scale N-grams
Bibtex @InProceedings{LIN10.233,
  author = {Dekang Lin and Kenneth Church and Heng Ji and Satoshi Sekine and David Yarowsky and Shane Bergsma and Kailash Patil and Emily Pitler and Rachel Lathbury and Vikram Rao and Kapil Dalwani and Sushant Narsale},
  title = {New Tools for Web-Scale N-grams},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {{Nicoletta Calzolari (Conference Chair)} and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }