
Mailing List Archive: Lucene: Java-User

need to find locations of query hits in doc: works fine for regular text but not for phone numbers

 

 



izavorin at caci

Jun 13, 2012, 7:52 PM

Post #1 of 8
need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Hello All,

I am using 3.4. I need to find locations of query hits in a document. What I've implemented works fine for textual queries but does not work for phone numbers.

Here's how I index my docs:

String oc = "Joe dialed 800-555-1212 but got a busy signal";
doc.add(new Field("contents",
                  oc,
                  Field.Store.NO,
                  Field.Index.ANALYZED,
                  Field.TermVector.WITH_POSITIONS_OFFSETS));


Now, here is how I find locations. I search for a query. If I get a hit, I split my query (in case it's multi-word) into words and look up each of them using TermFreqVector like this:


//String qstr = "my multiword query"; // for queries like this it works fine...
String qstr = "800-555-1212"; // ...but not for ones like this
Query query = parser.parse(qstr);
TopDocs results = searcher.search(query, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;

String[] subTerms = qstr.split("\\s+"); // phone string stays intact here

for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document doc = searcher.doc(docId);

    TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
    TermPositionVector tpvector = (TermPositionVector) tfvector;

    for (String subTerm : subTerms) {
        String subq = subTerm.toLowerCase();
        int termidx = tfvector.indexOf(subq); // get termidx = -1 here

        TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);
        for (int j = 0; j < tvoffsetinfo.length; j++) {
            int offsetStart = tvoffsetinfo[j].getStartOffset();
            int offsetEnd = tvoffsetinfo[j].getEndOffset();
            // ...

For a query like "800-555-1212", tfvector.indexOf returns -1. What am I doing wrong?

Thanks,

Ilya Zavorin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


jack at basetechnology

Jun 13, 2012, 8:41 PM

Post #2 of 8
Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Try putting the phone number in quotes in the query:

String qstr = "\"800-555-1212\"";

And check query.toString() to see how the query parser analyzed the term, both
with and without quotes.

And make sure you initialized the query parser with "contents" as the
default field.

-- Jack Krupansky


izavorin at caci

Jun 14, 2012, 9:49 AM

Post #3 of 8
RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

OK, so I figured out what the problem was. It wasn't with the digits but rather with the various delimiters like "(" and "-" that I use.

Essentially, the statement

String[] subTerms = qstr.split("\\s+");

does not split a query the same way the query parser does. And thanks, query.toString() helped me see that.
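
The mismatch can be reproduced without Lucene. The sketch below is only an illustration: analyzerLikeSplit is a hypothetical stand-in that lowercases the text and breaks it at every character that is not a letter or digit, which is a rough approximation of what an analyzer might do and not Lucene's actual tokenization rules. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SplitMismatch {
    // What the posted code does: split the raw query string on whitespace only.
    public static List<String> whitespaceSplit(String query) {
        return Arrays.asList(query.trim().split("\\s+"));
    }

    // Hypothetical stand-in for an analyzer (an assumption, not Lucene's rules):
    // lowercase, then break at every run of non-letter, non-digit characters.
    public static List<String> analyzerLikeSplit(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed =
                analyzerLikeSplit("Joe dialed 800-555-1212 but got a busy signal");

        // The whitespace split keeps the phone number whole, so the lookup misses.
        for (String term : whitespaceSplit("800-555-1212")) {
            System.out.println(term + " -> " + indexed.indexOf(term));
        }
        // Splitting the query the same way as the indexed text finds every piece.
        for (String term : analyzerLikeSplit("800-555-1212")) {
            System.out.println(term + " -> " + indexed.indexOf(term));
        }
    }
}
```

The point is only that the query string and the indexed text have to go through the same tokenization before term lookups can match.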

My question now is this: is there a way of easily extracting a sequence of substrings from query to use in place of the subTerms array I get from split?

I see that sometimes query.toString() returns things like

"contents:800 contents:555 contents:1212"

but other times it's something like

"contents:800 (contents:555 contents:1212)"

So instead of trying to guess what other formats query.toString can produce and trying to parse those, can I somehow extract the substrings of the query reliably?

Thanks!



uwe at thetaphi

Jun 14, 2012, 9:57 AM

Post #4 of 8
RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Just take the BooleanQuery returned by the QueryParser and get its clauses
(sub-queries like TermQuery, PhraseQuery, other BooleanQuery...). That way
you get all the query components. In most cases, some recursive instanceof
checking for the various Query subclasses will do this.
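
A recursive instanceof walk of this kind might look roughly like the sketch below. To keep it self-contained, the Query, TermQuery, PhraseQuery and BooleanQuery types here are hypothetical stand-ins defined inside the example, not Lucene's classes. With the real Lucene 3.x classes, the equivalent accessors would be BooleanQuery.getClauses(), BooleanClause.getQuery(), TermQuery.getTerm() and PhraseQuery.getTerms().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryWalk {
    // Hypothetical stand-ins for a parsed query tree, reduced to what the walk needs.
    public interface Query {}

    public static final class TermQuery implements Query {
        final String text;
        public TermQuery(String text) { this.text = text; }
    }

    public static final class PhraseQuery implements Query {
        final List<String> terms;
        public PhraseQuery(String... terms) { this.terms = Arrays.asList(terms); }
    }

    public static final class BooleanQuery implements Query {
        final List<Query> clauses;
        public BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Collect the term strings of a query tree, in clause order.
    public static List<String> collectTerms(Query query) {
        List<String> out = new ArrayList<String>();
        collect(query, out);
        return out;
    }

    private static void collect(Query query, List<String> out) {
        if (query instanceof TermQuery) {
            out.add(((TermQuery) query).text);
        } else if (query instanceof PhraseQuery) {
            out.addAll(((PhraseQuery) query).terms);
        } else if (query instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) query).clauses) {
                collect(clause, out); // recurse into nested boolean queries
            }
        }
        // Any other query type is silently ignored in this sketch.
    }

    public static void main(String[] args) {
        // Mirrors the nested shape "contents:800 (contents:555 contents:1212)".
        Query nested = new BooleanQuery(
                new TermQuery("800"),
                new BooleanQuery(new TermQuery("555"), new TermQuery("1212")));
        System.out.println(collectTerms(nested)); // prints [800, 555, 1212]
    }
}
```

Because the walk recurses, both the flat and the nested forms of the parsed query yield the same list of terms.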

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi



hossman_lucene at fucit

Jun 14, 2012, 10:51 AM

Post #5 of 8
Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

: Subject: need to find locations of query hits in doc: works fine for regular
: text but not for phone numbers
: Message-ID: <A57498EDEC10C64781EA0F7DBA665CEF264DEC53 [at] ex2010mb01-1>
: References: <1339635547170-3989548.post [at] n3>
: In-Reply-To: <1339635547170-3989548.post [at] n3>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.



-Hoss


izavorin at caci

Jun 14, 2012, 11:36 AM

Post #6 of 8
RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Uwe, sorry but I am having trouble understanding this. Can you point me to a place in documentation that explains this in more detail (I've read http://lucene.apache.org/core/old_versioned_docs/versions/3_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html but still am confused) or some example code?

Thanks much,

Ilya



jack at basetechnology

Jun 14, 2012, 12:30 PM

Post #7 of 8
Re: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

Look at this code: QueryTermExtractor.getTerms(Query query)
http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html
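
Once the term strings come from the parsed query (QueryTermExtractor.getTerms returns WeightedTerm objects whose text is available through getTerm()), each one still has to be looked up in the document's term vector, and a term that is absent from a given document needs to be skipped and not passed on as index -1. The sketch below shows that guard only. OffsetLookup is a hypothetical stand-in for a per-document term vector with sorted terms and parallel offsets, not a Lucene class, and the offsets in main were worked out by hand for the example sentence from the first post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetLookup {
    // Hypothetical stand-in for a per-document term vector:
    // terms are sorted, and offsets[i] holds {start, end} pairs for terms[i].
    private final String[] terms;
    private final int[][][] offsets;

    public OffsetLookup(String[] sortedTerms, int[][][] offsets) {
        this.terms = sortedTerms;
        this.offsets = offsets;
    }

    // Returns the position of the term in the sorted array, or -1 if absent.
    public int indexOf(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? i : -1;
    }

    // Collects {start, end} pairs for every query term present in the document.
    public List<int[]> offsetsFor(List<String> queryTerms) {
        List<int[]> hits = new ArrayList<int[]>();
        for (String term : queryTerms) {
            int idx = indexOf(term);
            if (idx < 0) {
                continue; // term does not occur in this document: skip it
            }
            hits.addAll(Arrays.asList(offsets[idx]));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Offsets of the number tokens in "Joe dialed 800-555-1212 but got a busy signal".
        OffsetLookup tv = new OffsetLookup(
                new String[] {"1212", "555", "800"},
                new int[][][] {{{19, 23}}, {{15, 18}}, {{11, 14}}});

        List<String> queryTerms = Arrays.asList("800", "555", "1212", "800-555-1212");
        for (int[] span : tv.offsetsFor(queryTerms)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```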

-- Jack Krupansky


izavorin at caci

Jun 14, 2012, 5:31 PM

Post #8 of 8
RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

worked like a charm!

thx!
