
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent

 

 



jira at apache

Jul 2, 2009, 7:15 AM

Post #1 of 13 (793 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726492#action_12726492 ]

Mark Miller commented on LUCENE-1730:
-------------------------------------

nice catch

> TrecContentSource should use a fixed encoding, rather than system dependent
> ---------------------------------------------------------------------------
>
> Key: LUCENE-1730
> URL: https://issues.apache.org/jira/browse/LUCENE-1730
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/benchmark
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1730.patch
>
>
> TrecContentSource opens InputStreamReader w/o a fixed encoding. On Windows, this means CP1252 (at least on my machine) which is ok. However, when I opened it on a Linux machine w/ a default of UTF-8, it failed to read the files. The patch changes it to use ISO-8859-1, which seems to be the right one (and http://mg4j.dsi.unimi.it/man/manual/ch01s04.html mentions this encoding in its example of a script which reads the data).
> Patch to follow shortly.
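
A minimal sketch of the problem described in the issue, not code from the patch: the same bytes decode differently depending on the charset handed to InputStreamReader. The class name `FixedEncodingDemo` and the sample bytes are made up for illustration, and the code uses newer Java APIs (`StandardCharsets`, try-with-resources) than were available in 2009.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
class FixedEncodingDemo {

    // Decodes the bytes through an InputStreamReader with an explicit charset,
    // instead of relying on the platform default.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last byte (0xE9) is not valid UTF-8 on its own.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };

        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1)); // decoded as intended
        System.out.println(decode(latin1, StandardCharsets.UTF_8));      // last char becomes U+FFFD
        System.out.println("platform default: " + Charset.defaultCharset());
    }
}
```

With the no-charset constructor, InputStreamReader uses whatever `Charset.defaultCharset()` reports, which is why the same files can read fine on one machine and badly on another.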

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Jul 2, 2009, 7:19 AM

Post #2 of 13 (770 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726493#action_12726493 ]

Shai Erera commented on LUCENE-1730:
------------------------------------

Thanks. Took me a while to think in that direction (I was sure UTF-8 is what's used in the code :) ).



jira at apache

Jul 2, 2009, 7:53 AM

Post #3 of 13 (767 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726507#action_12726507 ]

Mark Miller commented on LUCENE-1730:
-------------------------------------

I think it makes sense to make the default encoding the one that TREC typically/always uses, but we should probably make this configurable from the alg file. We don't want to be locked down to one input encoding. That could be done in another issue, though. We should allow it for the other content sources as well.



jira at apache

Jul 2, 2009, 10:57 AM

Post #4 of 13 (765 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726582#action_12726582 ]

Shai Erera commented on LUCENE-1730:
------------------------------------

I don't understand - the change is in TrecContentSource (only), which reads the TREC collection, which is encoded in ISO-8859-1. Why should it be configurable? It would only make sense to make it configurable if someone reads it and writes it back in, say, UTF-8, right? Or am I missing something?



jira at apache

Jul 2, 2009, 11:15 AM

Post #5 of 13 (763 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726596#action_12726596 ]

Mark Miller commented on LUCENE-1730:
-------------------------------------

The trec data you are using is ISO-8859-1. Wouldn't it be conceivable that they might change the encoding to UTF-8 at some point? Or that someone else has created trec compatible data in another encoding? Or trec has data in different encodings? If something reads a source of files, and the files could technically be in any encoding, I would expect things to be configurable so that I can specify what encoding the files are in. It just seems like a good standard feature with something that reads what are essentially pluggable files. The format is going to be consistent, but why would the encoding necessarily be consistent?



jira at apache

Jul 2, 2009, 11:37 AM

Post #6 of 13 (767 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726612#action_12726612 ]

Shai Erera commented on LUCENE-1730:
------------------------------------

You're right, I didn't think in that direction. I'll make it configurable, shouldn't be a problem. And if it makes sense (and I think it does), I'll put the config parameter on ContentSource.

Will post a second patch soon.
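
One way the configurable-with-a-default idea could look, as a sketch only. `java.util.Properties` stands in for the benchmark's configuration object, and the property name `content.source.encoding` and the class name are assumptions made for illustration; the thread does not say what the patch actually uses.

```java
import java.nio.charset.Charset;
import java.util.Properties;

// Hypothetical helper; not the code from the patch.
class EncodingConfigSketch {

    // Assumed property name, for illustration only.
    static final String ENCODING_PROP = "content.source.encoding";

    // Default suggested in the thread for the TREC collection.
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Returns the configured charset, falling back to the default when the
    // property is absent. Charset.forName throws for unknown or illegal names,
    // so a typo in the configuration fails early.
    static Charset resolveEncoding(Properties config) {
        String name = config.getProperty(ENCODING_PROP);
        if (name == null) {
            name = DEFAULT_ENCODING;
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        System.out.println(resolveEncoding(config)); // falls back to the default

        config.setProperty(ENCODING_PROP, "UTF-8");
        System.out.println(resolveEncoding(config)); // uses the configured value
    }
}
```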



jira at apache

Jul 2, 2009, 2:17 PM

Post #7 of 13 (760 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726685#action_12726685 ]

Robert Muir commented on LUCENE-1730:
-------------------------------------

I'd like this to be configurable. I used this package to test LUCENE-1628.

(For this I actually ran it with -Dfile.encoding=UTF-8 to prevent this problem), so it's "configurable" already... but not obvious.




markrmiller at gmail

Jul 2, 2009, 2:23 PM

Post #8 of 13 (765 views)
Re: [jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

> (For this I actually ran it with -Dfile.encoding=UTF-8 to prevent this
> problem), so it's "configurable" already... but not obvious.

Right, I considered this option, but it changes the default encoding for the
whole JVM - probably going to be fine for running benchmark, but not ideal
in terms of managing and running content sources with different encodings
longer term.
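
To illustrate the point about a JVM-wide default, here is a rough sketch in which two inputs with different encodings are read in the same JVM, each with its own explicit charset. The class and method names are hypothetical, and the in-memory byte arrays stand in for real content files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of Lucene.
class PerSourceEncodingSketch {

    // Reads the first line of the data using the charset given for this source,
    // independent of the JVM default set through -Dfile.encoding.
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9";
        byte[] latin1Source = text.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        byte[] utf8Source = text.getBytes(StandardCharsets.UTF_8);        // 5 bytes

        // Each source is decoded correctly because the charset travels with the source.
        System.out.println(firstLine(latin1Source, StandardCharsets.ISO_8859_1));
        System.out.println(firstLine(utf8Source, StandardCharsets.UTF_8));
    }
}
```

A single -Dfile.encoding value could match only one of the two sources, which is the limitation raised above.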



--
--
- Mark

http://www.lucidimagination.com


rcmuir at gmail

Jul 2, 2009, 2:26 PM

Post #9 of 13 (764 views)
Re: [jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

Mark, I agree. It's not really a good "option". Just saying that TREC
content that isn't ISO-8859-1 does exist :)

I like Shai's idea of having a configurable option, this is more obvious.




--
Robert Muir
rcmuir [at] gmail



jira at apache

Jul 5, 2009, 1:04 PM

Post #10 of 13 (686 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727359#action_12727359 ]

Shai Erera commented on LUCENE-1730:
------------------------------------

Any volunteers to help me get it in? I think it's ready for commit.

> Attachments: LUCENE-1730.patch, LUCENE-1730.patch


jira at apache

Jul 6, 2009, 6:46 AM

Post #11 of 13 (667 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727524#action_12727524 ]

Mark Miller commented on LUCENE-1730:
-------------------------------------

I haven't patched the code in, but looking at the patch, it looks like you are setting the default *after* passing the encoding to the InputStreamReader?

+ GZIPInputStream zis = new GZIPInputStream(new FileInputStream(f), BUFFER_SIZE);
+ reader = new BufferedReader(new InputStreamReader(zis, encoding), BUFFER_SIZE);

...

+ if (encoding == null) {
+   encoding = "ISO-8859-1";
+ }



jira at apache

Jul 6, 2009, 6:52 AM

Post #12 of 13 (660 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727529#action_12727529 ]

Shai Erera commented on LUCENE-1730:
------------------------------------

if (encoding == null) happens in setConfig and the other one in openNextFile() (forgot the exact method name). The patch includes just the changes, w/o the method names, so it may not be obvious just by looking at it.
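
The ordering described here can be sketched roughly as follows. This is a reconstruction from the two patch fragments quoted in the thread, not the actual patch: the class name, the property name `content.source.encoding`, the use of `java.util.Properties` as the config type, and the `roundTrip` helper are all assumptions added so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the setConfig / openNextFile split; not the Lucene source.
class GzipSourceSketch {
    private static final int BUFFER_SIZE = 1 << 16;
    private String encoding;

    // Runs once, before any file is opened: reads the setting and applies the default.
    void setConfig(Properties config) {
        encoding = config.getProperty("content.source.encoding"); // assumed property name
        if (encoding == null) {
            encoding = "ISO-8859-1";
        }
    }

    // Runs per file, after setConfig, so 'encoding' is already resolved here.
    BufferedReader openNextFile(File f) throws IOException {
        GZIPInputStream zis = new GZIPInputStream(new FileInputStream(f), BUFFER_SIZE);
        return new BufferedReader(new InputStreamReader(zis, encoding), BUFFER_SIZE);
    }

    // Illustration helper: writes one line to a temporary gzip file in writeEncoding,
    // then reads it back through setConfig + openNextFile.
    static String roundTrip(String line, String writeEncoding, Properties config) {
        try {
            File f = File.createTempFile("trec-sketch", ".gz");
            f.deleteOnExit();
            try (OutputStream out = new GZIPOutputStream(new FileOutputStream(f))) {
                out.write(line.getBytes(writeEncoding));
            }
            GzipSourceSketch source = new GzipSourceSketch();
            source.setConfig(config);
            try (BufferedReader r = source.openNextFile(f)) {
                return r.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the diff shows the two hunks without their enclosing methods, the null check only looks like it comes after the reader is created; in this arrangement it always runs first.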



jira at apache

Jul 6, 2009, 6:56 AM

Post #13 of 13 (657 views)
[jira] Commented: (LUCENE-1730) TrecContentSource should use a fixed encoding, rather than system dependent [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727533#action_12727533 ]

Mark Miller commented on LUCENE-1730:
-------------------------------------

Okay, cool. I'll patch it in, run the tests, and commit later today.

Thanks Shai.

