Title |
New Tools for Web-Scale N-grams |
Authors |
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani and Sushant Narsale |
Abstract |
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges. |
Language |
|
Topics |
Tools, systems, applications, Text mining |
Full paper  |
New Tools for Web-Scale N-grams |
Bibtex |
@InProceedings{LIN10.233,
  author    = {Dekang Lin and Kenneth Church and Heng Ji and Satoshi Sekine and David Yarowsky and Shane Bergsma and Kailash Patil and Emily Pitler and Rachel Lathbury and Vikram Rao and Kapil Dalwani and Sushant Narsale},
  title     = {New Tools for Web-Scale N-grams},
  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |