
xonixx at gmail
Apr 19, 2012, 4:26 AM
Two questions on RussianAnalyzer
Hi,

After upgrading to Lucene 3.6 I noticed that the new RussianAnalyzer tokenizes text differently than before. Here is an example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));
        tokenStream.reset();
        final CharTermAttribute termAttribute =
                tokenStream.getAttribute(CharTermAttribute.class);
        List<String> tokens = new LinkedList<String>();
        // Collect the text of every token the analyzer emits
        while (tokenStream.incrementToken()) {
            final String term = new String(termAttribute.buffer(), 0, termAttribute.length());
            tokens.add(term);
            // System.out.println(">>" + term);
        }
        // Finish and release the stream once all tokens are consumed
        tokenStream.end();
        tokenStream.close();
        return tokens;
    }

    @Test
    public void testDots() throws IOException {
        final String str = "aaa.bbb.com:8888 "
                + "a,b;c/d'e$f&g*h+i-j%k/l_m#n@o!p?q>r\"s~t(u`v|z}y\\z";
        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));
        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This prints:

    New analyzer:
    [.aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q, r, s, t, u, v, z, y, z]
    Old analyzer:
    [.aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, z, y, z]

Please note the differences. The most inconvenient part of the new behaviour for me is that I used to be able to search by subdomain, e.g. bbb.com:8888, and get results containing www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get 0 results.

My questions are:

1) Is this change by design (not a mistake)?
2) Is using Version.LUCENE_30 when creating the analyzer the only way to get the old behaviour?

The other problem with RussianAnalyzer concerns the letter Yo (ё) http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often replaced by the letter Ye (е) http://en.wikipedia.org/wiki/Ye_(Cyrillic); words spelled either way are considered the same.
What I want is for a search on a word containing yo to also match the same word spelled with ye (and vice versa). What I'm currently doing is roughly the following:

    // NOTE: this class has to be defined in this package, because
    // RussianAnalyzer.createComponents is protected
    package org.apache.lucene.analysis.ru;

    public class RussianAnalyzerImproved extends ReusableAnalyzerBase {
        private final RussianAnalyzer russianAnalyzer = new RussianAnalyzer(LuceneVersion.VERSION);

        // Wrap the input so that yo is replaced before tokenization
        @Override
        protected Reader initReader(Reader reader) {
            return new YoCharFilter(CharReader.get(reader));
        }

        // Reuse the stock RussianAnalyzer chain unchanged
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return russianAnalyzer.createComponents(fieldName, reader);
        }
    }

    public class YoCharFilter extends CharFilter {
        public YoCharFilter(CharStream in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            final int charsRead = super.read(cbuf, off, len);
            if (charsRead > 0) {
                final int end = off + charsRead;
                while (off < end) {
                    // Both cases of yo become lowercase ye
                    if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                        cbuf[off] = 'е';
                    off++;
                }
            }
            return charsRead;
        }
    }

But I'm not sure this is the correct approach. What do you think? Maybe it would make sense to add a configuration option to RussianAnalyzer itself (whether or not to distinguish yo and ye)?

Sincerely yours,
Vladimir
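P.S. In case it helps the discussion, here is the yo-to-ye folding on its own as a minimal sketch that uses only java.io and no Lucene classes, so it can be tried in isolation. The names YoFoldingReader, fold and foldAll are made up for this illustration and are not part of Lucene. Unlike the filter above, this version keeps the case of the letter (Ё becomes Е), which is an assumption about what one might want rather than a requirement. Since each replacement is one char for one char, character offsets stay the same.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: a Reader wrapper that replaces yo with ye on the fly. */
class YoFoldingReader extends FilterReader {

    YoFoldingReader(Reader in) {
        super(in);
    }

    /** Maps yo to ye, preserving case; every other char is returned as is. */
    static char fold(char c) {
        if (c == '\u0451') return '\u0435'; // lowercase yo -> lowercase ye
        if (c == '\u0401') return '\u0415'; // uppercase Yo -> uppercase Ye
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fold((char) c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs
        for (int i = off; i < off + n; i++) {
            cbuf[i] = fold(cbuf[i]);
        }
        return n;
    }

    /** Convenience for experiments: folds a whole string through the reader. */
    static String foldAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new YoFoldingReader(new StringReader(text))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For example, YoFoldingReader.foldAll("ёлка") returns "елка", and text without yo passes through unchanged. The same folding would have to be applied at both index time and query time for the two spellings to match.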