
Mailing List Archive: Lucene: Java-User

Ignoring XML tags when Indexing

 

 



kalanir at gmail

Jul 23, 2008, 11:17 PM

Post #1 of 6
Ignoring XML tags when Indexing

Hi all,

I am looking for a way to ignore XML tags in the input when indexing. Is
there built-in functionality in Lucene for this?
I am sorry if this was discussed before; I searched but couldn't find a
clear solution.

Thanks in advance
Kalani

--
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa


marcelo.schneider at digitro

Jul 24, 2008, 5:40 AM

Post #2 of 6
Re: Ignoring XML tags when Indexing

Do you just want to ignore them and store everything in one field? If you
know in advance which tags are used, I guess you could set up a stop word
list with them. If not, you could write an "XMLAnalyzer" that simply ignores
everything inside '<>'...
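
To make the "ignore everything inside '<>'" idea concrete, here is a minimal sketch in plain Java. It is not a Lucene Analyzer and not part of Lucene; the class name TagStrippingReader and the helper strip() are made up for illustration. It wraps any Reader and drops the characters between '<' and '>', emitting a space where a tag ends so that words from neighbouring elements do not run together. It assumes simple markup: a '>' inside an attribute value or a comment would confuse it, and entities such as &amp; are passed through unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not a Lucene class: filters out anything between '<' and '>'.
final class TagStrippingReader extends FilterReader {
    private boolean inTag = false;

    TagStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    return ' '; // a space keeps adjacent words apart
                }
                // otherwise: still inside the tag, skip the character
            } else if (c == '<') {
                inTag = true;
            } else {
                return c;
            }
        }
        return -1; // end of input
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Convenience method for trying the reader on a String.
    static String strip(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new TagStrippingReader(new StringReader(xml))) {
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

A reader like this could be handed to whatever analyzer you already use, since the tokenizer then only ever sees the text. The extra spaces it produces are harmless for tokenizers that split on whitespace.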

If you want to split the XML content into separate fields, you have to
parse it before indexing. Take a look at this article:
http://www.ibm.com/developerworks/library/j-lucene/
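
As a rough illustration of "parse it before indexing", the sketch below uses the JDK's StAX parser to collect the text of each element under its element name. The class name XmlFieldExtractor and the method extractFields() are hypothetical, and the choice to join repeated elements with a space is an assumption made only for this example. It deliberately stops short of any Lucene call: each map entry would then be added to a document as one field, using whatever field API your Lucene version provides.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: maps element names to the text found directly inside them.
final class XmlFieldExtractor {

    static Map<String, String> extractFields(String xmlText) {
        Map<String, StringBuilder> fields = new LinkedHashMap<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader xml = factory.createXMLStreamReader(new StringReader(xmlText));
            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        open.push(xml.getLocalName());
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT: {
                        open.pop();
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS: {
                        String text = xml.getText().trim();
                        if (!text.isEmpty() && !open.isEmpty()) {
                            StringBuilder sb =
                                fields.computeIfAbsent(open.peek(), k -> new StringBuilder());
                            if (sb.length() > 0) {
                                sb.append(' '); // repeated elements share one field
                            }
                            sb.append(text);
                        }
                        break;
                    }
                    default:
                        break;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        fields.forEach((name, sb) -> result.put(name, sb.toString()));
        return result;
    }
}
```

For an input with one title element and two author elements, this yields one "title" entry and one "author" entry containing both names.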

I'm a little bit new to Lucene, so I might be missing something here,
but I wouldn't expect it to have an API for this...



--


Marcelo Frantz Schneider
SIC - TCO - Tecnologia em Engenharia do Conhecimento

DÍGITRO TECNOLOGIA
E-mail: marcelo.schneider[at]digitro.com.br
Site: www.digitro.com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


kalanir at gmail

Jul 24, 2008, 10:55 PM

Post #3 of 6
Re: Ignoring XML tags when Indexing

Hi Marcelo,

Thanks for the reply. Yes, I want to ignore all the tags and store the text
in one field. The tags are not known in advance, so it seems the "XMLAnalyzer"
is the solution. However, I think Lucene itself does not provide an
XMLAnalyzer. Do I have to write it manually?

Kalani



daniel at nuix

Jul 24, 2008, 11:11 PM

Post #4 of 6
Re: Ignoring XML tags when Indexing

Kalani Ruwanpathirana wrote:
> Hi Marcelo,
>
> Thanks for the reply. Yes, I want to ignore all the tags and store the text
> in one field. The tags are not known in advance, so it seems the "XMLAnalyzer"
> is the solution. However, I think Lucene itself does not provide an
> XMLAnalyzer. Do I have to write it manually?

What makes more sense (at least the way I see it) is to implement a
Reader which returns the text you need from the XML. This sort of thing
is relatively simple to do with the newer StAX API. You can have your
reader return even small chunks of text, and it should perform okay as
long as you have a BufferedReader wrapped around the entire thing.

Daniel

--
Daniel Noll
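
One way the Reader approach described above might look, as a sketch only: the class below pulls events from the JDK's StAX parser (javax.xml.stream) and hands out just the character data, so whatever consumes it never sees a tag. The names XmlTextReader and extract() are invented for this example, and appending a space after each text chunk is an assumption made so that text from adjacent elements stays separated. The helper wraps the reader in a BufferedReader, as suggested, because the reader may return small chunks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical Reader that returns only the text content of an XML stream.
final class XmlTextReader extends Reader {
    private final XMLStreamReader xml;
    private String pending = ""; // text of the current event not yet handed out
    private int pos = 0;         // read position inside 'pending'

    XmlTextReader(Reader source) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities; merge adjacent text and CDATA.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        this.xml = factory.createXMLStreamReader(source);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        try {
            // Advance through the XML until some text is available.
            while (pos >= pending.length()) {
                if (!xml.hasNext()) {
                    return -1; // end of document
                }
                int event = xml.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    pending = xml.getText() + " "; // space separates neighbouring elements
                    pos = 0;
                }
            }
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            xml.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    // Convenience method for trying the reader on a String.
    static String extract(String xmlText) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new BufferedReader(new XmlTextReader(new StringReader(xmlText)))) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Compared with skipping everything between '<' and '>' by hand, a real parser also resolves entities such as &amp; and copes with attributes, comments and CDATA sections, at the cost of requiring well-formed input.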



marcelo.schneider at digitro

Jul 25, 2008, 4:38 AM

Post #5 of 6
Re: Ignoring XML tags when Indexing

Daniel Noll wrote:
> What makes more sense (at least the way I see it) is to implement a
> Reader which returns the text you need from the XML. This sort of
> thing is relatively simple to do with the newer StAX API. You can
> have your reader return even small chunks of text, and it should
> perform okay as long as you have a BufferedReader wrapped around the
> entire thing.
>
> Daniel

Indeed, for the sort of thing you are trying to do, it definitely makes
more sense to implement a Reader.




kalanir at gmail

Jul 25, 2008, 5:47 AM

Post #6 of 6
Re: Ignoring XML tags when Indexing

Thanks, both of you :) I will try that out.

Kalani

