Mailing List Archive: Lucene: Java-User

Indexing single words and marked phrases

 

 


tm-oleary at comcast

Mar 2, 2007, 4:32 PM


I am working on a project with a team that is developing a named entity
recognizer. I need to configure Lucene indexing so that it indexes the
individual words in the text that the named entity recognizer outputs, as
well as the phrases that it marks. For example, given a string like

<BODY>"The population of <CITY>New York City</CITY> is not as large as that
of <CITY>Mexico City</CITY>."</BODY>

I would want to index the words "New", "York", and "City" and "Mexico" and
"City", as well as the phrases "New York City" and "Mexico City", along with
the other non-stopwords in the string, in a field labeled "BODY". Would it be
better to write an Analyzer that can do this, or to adjust the XML parser
that I am using so that it adds the text within tag pairs like <CITY> and
</CITY> to the field that corresponds to the tag that is one level up? If it
is better to write an Analyzer, could someone point me to information on how
to do this? Thanks.

Mike O'Leary


ab at getopt

Mar 3, 2007, 1:24 AM

Re: Indexing single words and marked phrases

Please take a look at the analyzer implementation that Nutch uses
(org.apache.nutch.analysis.NutchDocumentAnalyzer), which uses
CommonGrams to detect which tokens need to be output as phrases. In your
case you already know which phrases you want to output as tokens,
but the idea is still the same: the original token stream is modified
to output both individual terms and phrases.
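
The idea of emitting both individual terms and phrases can be sketched in
plain Java, as below. This is only an illustration under stated assumptions:
it does not use the Lucene or Nutch API, and the class EntityPhraseTokens,
its Token type, the stop list, and the tokenize method are all hypothetical
names. Each marked phrase is emitted at the same position as its first word
(position increment 0), which is the usual convention for stacked tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain Java, not the Lucene or Nutch API.
class EntityPhraseTokens {

    // A term plus its position increment; 0 means "same position as the previous token".
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical stop list, just large enough for the example sentence.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "is", "not", "as", "that"));

    // Matches an opening or closing tag (groups 1 and 2) or a word (group 3).
    static final Pattern PIECE = Pattern.compile("<(/?)([A-Za-z]+)>|([A-Za-z0-9]+)");

    // fieldTag is the enclosing tag (e.g. "BODY"); any other tag marks an entity phrase.
    static List<Token> tokenize(String marked, String fieldTag) {
        List<Token> out = new ArrayList<>();
        List<String> entityWords = null; // non-null while inside an entity tag
        int entityStart = -1;            // index in 'out' of the entity's first word
        int pendingIncrement = 1;        // grows when stop words are skipped
        Matcher m = PIECE.matcher(marked);
        while (m.find()) {
            if (m.group(3) != null) {
                String word = m.group(3).toLowerCase(Locale.ROOT);
                // Stop words are dropped only outside entities, so phrases stay intact.
                if (entityWords == null && STOP_WORDS.contains(word)) {
                    pendingIncrement++;
                    continue;
                }
                if (entityWords != null) {
                    entityWords.add(word);
                }
                out.add(new Token(word, pendingIncrement));
                pendingIncrement = 1;
            } else if (m.group(2).equals(fieldTag)) {
                continue; // the enclosing field tag carries no tokens of its own
            } else if (m.group(1).isEmpty()) {
                entityWords = new ArrayList<>(); // opening entity tag
                entityStart = out.size();
            } else if (entityWords != null) {
                // Closing entity tag: stack the phrase on the entity's first word.
                if (entityWords.size() > 1) {
                    out.add(entityStart + 1, new Token(String.join(" ", entityWords), 0));
                }
                entityWords = null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<BODY>\"The population of <CITY>New York City</CITY> is not as "
                + "large as that of <CITY>Mexico City</CITY>.\"</BODY>";
        System.out.println(tokenize(text, "BODY"));
    }
}
```

In a real Analyzer the same logic would presumably live in a token filter
that sets the position increment on each emitted token; the sketch only
shows the order and increments such a filter would need to produce.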

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


