LREC 2010 Proceedings

Summary of the paper

Title	Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Authors	Satoshi Sekine and Kapil Dalwani
Abstract	We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). The previous system (Sekine 08) can only handle tokens and unrestricted wildcards in the query, such as “* was established in *”. However, being able to constrain the wildcards by POS, chunk or NE is quite useful to filter out noise. For example, the new system can search for “NE=COMPANY was established in POS=CD”. This finer specification reduces the number of outputs to less than half and avoids the ngrams which have a comma or a common noun at the first position or location information at the last position. It outputs the matched ngrams with their frequencies as well as all the contexts (i.e. sentences, KWIC lists and document ID information) where the matched ngrams occur in the corpus. It takes a fraction of a second for a search on a single CPU Linux-PC (1GB memory and 500GB disk) environment.
Language	Corpus (creation, annotation, etc.)
Topics	Tools, systems, applications, Knowledge Discovery/Representation, Corpus (creation, annotation, etc.)
Full paper	Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Bibtex	@InProceedings{SEKINE10.158, author = {Satoshi Sekine and Kapil Dalwani}, title = {Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information}, booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }
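
Illustration	The query style described in the abstract, where each position is a literal token, an unrestricted wildcard, or a constraint such as NE=COMPANY or POS=CD, can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' implementation: it scans sentences linearly instead of using an index, it assumes one POS, chunk and NE label per token, and the names Tok, parse_pattern, search and SENTENCES, as well as the sample data, are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Tok(NamedTuple):
    """One annotated token (hypothetical layout: one label per layer)."""
    token: str
    pos: str
    chunk: str
    ne: str


# Toy annotated corpus; labels are made up for illustration.
SENTENCES = [
    [Tok("Sony", "NNP", "NP", "COMPANY"), Tok("was", "VBD", "VP", "O"),
     Tok("established", "VBN", "VP", "O"), Tok("in", "IN", "PP", "O"),
     Tok("1946", "CD", "NP", "DATE")],
    [Tok("school", "NN", "NP", "O"), Tok("was", "VBD", "VP", "O"),
     Tok("established", "VBN", "VP", "O"), Tok("in", "IN", "PP", "O"),
     Tok("Boston", "NNP", "NP", "LOCATION")],
]


def parse_pattern(query):
    """Turn a query string into a list of (field, value) constraints.

    '*' becomes (None, None); 'POS=CD' becomes ('pos', 'CD');
    anything else is treated as a literal token.
    """
    elems = []
    for part in query.split():
        field, sep, value = part.partition("=")
        if part == "*":
            elems.append((None, None))
        elif sep and field.lower() in Tok._fields:
            elems.append((field.lower(), value))
        else:
            elems.append(("token", part))
    return elems


def search(query, sentences):
    """Count every ngram whose positions satisfy the query constraints."""
    pattern = parse_pattern(query)
    n = len(pattern)
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            if all(f is None or getattr(t, f) == v
                   for (f, v), t in zip(pattern, window)):
                counts[" ".join(t.token for t in window)] += 1
    return counts
```

On this toy data, search("* was established in *", SENTENCES) matches both sentences, while search("NE=COMPANY was established in POS=CD", SENTENCES) keeps only the ngram starting with a company name and ending with a number, which mirrors the noise filtering the abstract describes.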