
kino at daxtron
Mar 8, 2007, 7:48 PM
Post #1 of 2
Kino uses KinoSearch with Google n-grams...
Hello,

It seems the program and I are named after the same character from the same book! Well, when something is named after you it just begs to be used... especially when you have a use for it.

My problem is that I want an inverted index of the Google N-gram data:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

The following is an example of the 4-gram data in this corpus:

    serve as the incoming 92
    serve as the incubator 99
    serve as the independent 794
    serve as the index 223

File sizes: approx. 24 GB compressed (gzip'ed) text files

    Number of tokens:     1,024,908,267,229
    Number of sentences:     95,119,665,584
    Number of unigrams:          13,588,391
    Number of bigrams:          314,843,401
    Number of trigrams:         977,069,902
    Number of fourgrams:      1,313,818,354
    Number of fivegrams:      1,176,470,663

That is over 3,793 million documents, the longest of which are 5 words plus a frequency count. Not your normal indexing problem!

I've worked on the unigrams and bigrams, which are inserted just fine, and finish() completes. But when I run a query, Perl gets very upset and seg faults. Not good. I can run the query just fine when the index has only reached the "S"s, but when I add the rest and finish: seg fault. I thought that some of the non-keyboard-entry characters were affecting it, but after filtering them out: seg fault again.

One unusual thing that did happen during the indexing was a power failure. However, the insert and search processes seem to work, and it is only when the bigram indexing process finishes that problems appear. And of course there are still the trigrams, fourgrams and fivegrams.

Do you have any ideas, or do you know that this is impossible to do with a single index? Maybe a job for 0.20? It seems to be doing a good job until it finishes.

One other option I am looking at is building the indexes in parallel and merging them into a unified index. I know it's possible, but will it be happy with the sizes I have to deal with? And in general, are there recommended and absolute limits on the index size?
If it can handle this, then the few extra million semantic records I want to mix in should be easy.

Bests,
Kino Coursey (the other Kino)
Daxtron Labs
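
For readers unfamiliar with the data layout described above, here is a minimal Python sketch of what "an inverted index of n-gram documents" means for the four sample lines quoted in the post. It is only an illustration, not KinoSearch code: the function names `parse_ngram_line` and `build_inverted_index` are hypothetical, the sketch assumes the count is the last whitespace-separated field on each line, and an in-memory dictionary like this would not scale anywhere near the full corpus. The per-order counts are copied from the figures quoted above.

```python
from collections import defaultdict

# Per-order n-gram counts as quoted in the post.
NGRAM_COUNTS = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

# The four sample 4-gram lines quoted in the post.
SAMPLE_LINES = [
    "serve as the incoming 92",
    "serve as the incubator 99",
    "serve as the independent 794",
    "serve as the index 223",
]


def parse_ngram_line(line):
    """Split one line into (tokens, count).

    Assumption: the frequency count is the last whitespace-separated field.
    """
    *tokens, count = line.split()
    if not tokens:
        raise ValueError("expected at least one token before the count")
    return tuple(tokens), int(count)


def build_inverted_index(lines):
    """Map each token to the set of document ids (line numbers) containing it."""
    index = defaultdict(set)
    docs = []
    for doc_id, line in enumerate(lines):
        tokens, count = parse_ngram_line(line)
        docs.append((tokens, count))
        for token in tokens:
            index[token].add(doc_id)
    return index, docs


if __name__ == "__main__":
    index, docs = build_inverted_index(SAMPLE_LINES)
    print("documents containing 'index':", sorted(index["index"]))
    print("total n-gram documents in corpus:", sum(NGRAM_COUNTS.values()))
```

Each n-gram line is treated as one tiny document, which is why the document count, rather than the text volume, dominates the problem.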