
Mailing List Archive: Perl: porters

unicode question

 

 



zefram at fysh

Apr 27, 2012, 2:58 AM

Post #26 of 38 (136 views)
Re: unicode question [In reply to]

Linda W wrote:
>I didn't say unix file descriptors -- argue the straw man will ya?

What do you think stdin and stdout are?

> Well double rubbish...yeah, you are right, but I'm thinking of pipes in
>a different sense -- not unix pipes as they are now, but as they were
>originally created in unix

Unix pipes have always been binary-safe.

> Also -- off hand, I don't
>know of any linux pipes that handle a 9 bit data type...

Yeah, Linux doesn't run on any 9-bit-byte platforms. That's fairly
irrelevant; such platforms are now just an historical curiosity.

> This is the whole point -- expecting your STDIO/STDOUT to be binary
>safe is not logical -- it's not done.

The file descriptors stdin and stdout are binary safe; if they refer to
pipes then the pipes themselves are binary safe; but what's ultimately
on the other side may well not be. However, if your program's purpose
is to process binary data on stdin and stdout then it *is* logical, and
normal, to assume that whatever's on the other side of stdin and stdout
is binary-safe. It's the user's fault if he runs a binary-emitting
program with stdout pointing at a terminal.

> Perl wasn't written as a binary processor.

It *was* written to be capable of handling binary data, among other
things. It adopted the Unix model, where file handles (and in particular
stdin and stdout) can carry textual or binary data as the program wishes.

> We are talking usage of perl to process material on STDIN/STDOUT
>at a terminal in an environment with a standard locale set.

There's that "at a terminal" again. It's possible to test whether stdin
or stdout points at an actual terminal, with Perl's -t operator, but
you're trying to include as "at a terminal" situations where stdin/stdout
refer to pipes to text-processing programs. Those don't satisfy -t,
and can't be distinguished from pipes to binary-processing programs.
It's impossible for Perl to distinguish between the text-processing
environment that you imagine and the rather common situation of some
binary data being involved.

-zefram
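
A minimal sketch of the check described above, assuming a Unix-like system. Perl's -t file test is true only for a real terminal and -p is true for a pipe, but neither reveals what kind of program sits on the other end of the pipe. The helper name describe_handle is hypothetical.

```perl
use strict;
use warnings;

# describe_handle is a hypothetical helper: classify a filehandle using
# Perl's built-in file tests.
sub describe_handle {
    my ($fh) = @_;
    return 'terminal' if -t $fh;   # attached to a tty
    return 'pipe'     if -p $fh;   # a pipe (FIFO); the far end is unknowable
    return 'other';                # regular file, device, socket, ...
}

printf "STDIN:  %s\n", describe_handle(\*STDIN);
printf "STDOUT: %s\n", describe_handle(\*STDOUT);
```

Piping the script's output through another program makes STDOUT report as a pipe whether that program consumes text or binary data, which is the ambiguity the post points out.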


alex.hartmaier at gmail

Apr 27, 2012, 11:48 AM

Post #27 of 38 (136 views)
Re: unicode question [In reply to]

use open ':std' => ':locale';
Interesting! Looks like a good default for cli scripts/apps (that never
take binary data as input).

Doesn't that solve your problem, Linda?


perl-diddler at tlinx

May 3, 2012, 4:30 AM

Post #28 of 38 (132 views)
Re: unicode question [In reply to]

Alexander Hartmaier wrote:
> use open ':std' => ':locale';
> Interesting! Looks like a good default for cli scripts/apps (that
> never take binary data as input).
>
> Doesn't that solve your problem, Linda?
It might... but that perl isn't bright enough to follow standards and
use it automatically on STDIO to a char device is not a
user-friendly/program-friendly default, IMO.

But I realize that user-friendly is one of the last things in mind when
it comes to perl.

;-/


demerphq at gmail

May 3, 2012, 8:38 AM

Post #29 of 38 (131 views)
Re: unicode question [In reply to]

On 26 April 2012 19:32, Johan Vromans <jvromans [at] squirrel> wrote:
> Jesse Luehrs <doy [at] tozt> writes:
>
>> Why are you assuming that text is the only thing that people ever pipe
>> to a program? Interpreting STDIN as UTF-8 would break something along
>> the lines of a perl implementation of gzip, for instance.
>
> I expect a Perl program that processes binary information from STDIN to
> use an explicit binmode.

Based on what documentation?



--
perl -Mre=debug -e "/just|another|perl|hacker/"


demerphq at gmail

May 3, 2012, 8:40 AM

Post #30 of 38 (132 views)
Re: unicode question [In reply to]

On 26 April 2012 20:15, Tom Christiansen <tchrist [at] perl> wrote:
> Jesse Luehrs <doy [at] tozt> wrote
> on Thu, 26 Apr 2012 09:32:59 CDT:
>
>> Why are you assuming that text is the only thing that people ever pipe
>> to a program? Interpreting STDIN as UTF-8 would break something along
>> the lines of a perl implementation of gzip, for instance. This may not
>> be a bad thing for a default assuming it can be overridden, but it would
>> certainly not be backwards compatible.
>
> Playing the devil's advocate for a moment, any program that assumes an
> unmarked stream to be in binary not text is inherently broken on all
> Microsoft-encumbered platforms, as those assume the contrary condition.

And what gives you that idea?

--
perl -Mre=debug -e "/just|another|perl|hacker/"


tchrist at perl

May 3, 2012, 8:42 AM

Post #31 of 38 (133 views)
Re: unicode question [In reply to]

>> Playing the devil's advocate for a moment, any program that assumes an
>> unmarked stream to be in binary not text is inherently broken on all
>> Microsoft-encumbered platforms, as those assume the contrary condition.

>And what gives you that idea?

I assumed that STD{IN,OUT,ERR} were O_TEXT on Microsoft, not O_BINARY.

Is this not so?

--tom


demerphq at gmail

May 3, 2012, 8:48 AM

Post #32 of 38 (134 views)
Re: unicode question [In reply to]

On 3 May 2012 17:42, Tom Christiansen <tchrist [at] perl> wrote:
>>> Playing the devil's advocate for a moment, any program that assumes an
>>> unmarked stream to be in binary not text is inherently broken on all
>>> Microsoft-encumbered platforms, as those assume the contrary condition.
>
>>And what gives you that idea?
>
> I assumed that STD{IN,OUT,ERR} were O_TEXT on Microsoft, not O_BINARY.
>
> Is this not so?

Not that I ever noticed. Maybe someone abstracted that away...

The only related thing that comes to mind is that Windows
traditionally puts a BOM in Unicode files. Which then causes problems
when you assume that *NIX style piping is safe. IOW,

cat x y z > xyz

will end up with three BOMs in it, when the author probably didn't
even know there were BOMs there in the first place.

Yves
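
A small sketch of the concatenation problem described above, assuming the files are UTF-8 and each starts with the three-byte UTF-8 BOM (Windows "Unicode" files may also be UTF-16, which this does not cover). The helper name strip_bom and the in-memory stand-ins for files x, y and z are hypothetical.

```perl
use strict;
use warnings;

# strip_bom is a hypothetical helper: remove one leading UTF-8 BOM
# (bytes EF BB BF), if present, from a byte string.
sub strip_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# In-memory stand-ins for three files that each begin with a BOM.
my @files = map { "\xEF\xBB\xBF$_\n" } qw(x y z);

# Plain concatenation, as `cat x y z` would do, keeps every BOM.
my $naive = join '', @files;

# Dropping the BOM from each piece before joining avoids stray BOMs.
my $cleaned = join '', map { strip_bom($_) } @files;

my $bom_count = () = $naive =~ /\xEF\xBB\xBF/g;
print "BOMs after plain concatenation: $bom_count\n";
print "BOM-free result is ", length($cleaned), " bytes\n";
```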


--
perl -Mre=debug -e "/just|another|perl|hacker/"


demerphq at gmail

May 3, 2012, 8:53 AM

Post #33 of 38 (131 views)
Re: unicode question [In reply to]

On 3 May 2012 17:48, demerphq <demerphq [at] gmail> wrote:
> On 3 May 2012 17:42, Tom Christiansen <tchrist [at] perl> wrote:
>>>> Playing the devil's advocate for a moment, any program that assumes an
>>>> unmarked stream to be in binary not text is inherently broken on all
>>>> Microsoft-encumbered platforms, as those assume the contrary condition.
>>
>>>And what gives you that idea?
>>
>> I assumed that STD{IN,OUT,ERR} were O_TEXT on Microsoft, not O_BINARY.
>>
>> Is this not so?
>
> Not that I ever noticed. Maybe someone abstracted that away...
>
> The only related thing that comes to mind is that Windows
> traditionally puts a BOM in Unicode files. Which then causes problems
> when you assume that *NIX style piping is safe. IOW,
>
> cat x y z > xyz
>
> will end up with three BOMs in it, when the author probably didn't
> even know there were BOMs there in the first place.

After I wrote this I went away and had a coffee and it all came
rushing back in a horrible flashback (I am a recovering Windows user),
and indeed you are correct. Sorry. I had somehow managed to blot it
all out. :-)

cheers,
Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"


dmcbride at cpan

May 3, 2012, 9:40 AM

Post #34 of 38 (132 views)
Re: unicode question [In reply to]

On Thursday May 3 2012 5:38:40 PM demerphq wrote:
> On 26 April 2012 19:32, Johan Vromans <jvromans [at] squirrel> wrote:
> > Jesse Luehrs <doy [at] tozt> writes:
> >> Why are you assuming that text is the only thing that people ever pipe
> >> to a program? Interpreting STDIN as UTF-8 would break something along
> >> the lines of a perl implementation of gzip, for instance.
> >
> > I expect a Perl program that processes binary information from STDIN to
> > use an explicit binmode.
>
> Based on what documentation?

perldoc -f binmode

On some systems (in general, DOS and Windows-based systems)
binmode() is necessary when you're not working with a text
file. For the sake of portability it is a good idea to always
use it when appropriate, and to never use it when it isn't
appropriate. Also, people can set their I/O to be by default
UTF-8 encoded Unicode, not bytes.

binmode - necessary when working with non-text.
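
A minimal sketch of the explicit binmode being discussed, as a byte-for-byte copy between two handles. The helper name copy_binary is hypothetical, and in-memory handles stand in for STDIN and STDOUT so the example does not wait on input.

```perl
use strict;
use warnings;

# copy_binary is a hypothetical helper: copy bytes from $in to $out
# unchanged and return the number of bytes copied.
sub copy_binary {
    my ($in, $out) = @_;
    # Put both handles into binary mode so no newline translation or
    # encoding layer touches the data.
    binmode($in)  or die "binmode input: $!";
    binmode($out) or die "binmode output: $!";
    my $total = 0;
    while (1) {
        my $n = read($in, my $buf, 65536);
        die "read: $!" unless defined $n;
        last if $n == 0;               # end of input
        print {$out} $buf or die "write: $!";
        $total += $n;
    }
    return $total;
}

# A real filter would call: copy_binary(\*STDIN, \*STDOUT);
my $data = join '', map { chr($_) } 0 .. 255;   # every byte value, CR and LF included
open(my $in,  '<', \$data)    or die "open: $!";
open(my $out, '>', \my $copy) or die "open: $!";
my $copied = copy_binary($in, $out);
close $out;
print "copied $copied bytes, ", ($copy eq $data ? "unchanged" : "changed"), "\n";
```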


demerphq at gmail

May 3, 2012, 9:58 AM

Post #35 of 38 (131 views)
Re: unicode question [In reply to]

On 3 May 2012 18:40, Darin McBride <dmcbride [at] cpan> wrote:
> On Thursday May 3 2012 5:38:40 PM demerphq wrote:
>> On 26 April 2012 19:32, Johan Vromans <jvromans [at] squirrel> wrote:
>> > Jesse Luehrs <doy [at] tozt> writes:
>> >> Why are you assuming that text is the only thing that people ever pipe
>> >> to a program? Interpreting STDIN as UTF-8 would break something along
>> >> the lines of a perl implementation of gzip, for instance.
>> >
>> > I expect a Perl program that processes binary information from STDIN to
>> > use an explicit binmode.
>>
>> Based on what documentation?
>
> perldoc -f binmode
>
> On some systems (in general, DOS and Windows-based systems)
> binmode() is necessary when you're not working with a text
> file. For the sake of portability it is a good idea to always
> use it when appropriate, and to never use it when it isn't
> appropriate. Also, people can set their I/O to be by default
> UTF-8 encoded Unicode, not bytes.
>
> binmode - necessary when working with non-text.

Mea culpa. I knew all this stuff, and had managed to forget it when I
migrated to *nix. :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


jvromans at squirrel

May 3, 2012, 10:46 AM

Post #36 of 38 (133 views)
Re: unicode question [In reply to]

[Quoting demerphq, on May 3 2012, 17:38, in "Re: unicode question"]
> > I expect a Perl program that processes binary information from
> > STDIN to use an explicit binmode.
>
> Based on what documentation?

E.g., from Camel IV, p. 906:

If you’re running on a system that distinguishes between text and
binary files, you may need to put your filehandle into binary
mode—or forgo doing so, as the case may be—to avoid mutilating your
files. On such systems, if you use text mode on a binary file, or
binary mode on a text file, you probably won’t like the results.

&c.

-- Johan


perl-diddler at tlinx

May 3, 2012, 3:36 PM

Post #37 of 38 (132 views)
Re: unicode question [In reply to]

demerphq wrote:
> On 26 April 2012 19:32, Johan Vromans <jvromans [at] squirrel> wrote:
>
>> Jesse Luehrs <doy [at] tozt> writes:
>>
>>
>>> Why are you assuming that text is the only thing that people ever pipe
>>> to a program? Interpreting STDIN as UTF-8 would break something along
>>> the lines of a perl implementation of gzip, for instance.
>>>
>> I expect a Perl program that processes binary information from STDIN to
>> use an explicit binmode.
>>
>
> Based on what documentation?
>
---
Based on the standard usage of STDIO as coming from a user terminal.

It can come from a file. But STDIO was often presumed to come from
a user's terminal or text that they had typed in.

More reliably, modern programs at least look to see if STDIO is
connected to a char-device, and use a switch to override defaults
(e.g. ls --color=always when you want to get color through 'less'),
as 'ls' defaults to color off when it sees a pipe.

But in perl -- if it is asked to parse 'newlines' as in
while (<>) {...}

Then I submit that expecting that stream to be in binary is lunacy.

It should be treated as text as it is being processed as textual lines.


fawaka at gmail

May 3, 2012, 4:27 PM

Post #38 of 38 (131 views)
Re: unicode question [In reply to]

On Fri, May 4, 2012 at 12:36 AM, Linda W <perl-diddler [at] tlinx> wrote:
>     Based on the standard usage of STDIO as coming from a user terminal.
>
>     It can come from a file.  But STDIO was often presumed to come from
> a user's terminal or text that they had typed in.
>
>     More reliably, modern programs at least look to see if STDIO is
> connected to a char-device, and use a switch to override defaults
> (e.g. ls --color=always when you want to get color through 'less'),
> as 'ls' defaults to color off when it sees a pipe.
>
>     But in perl -- if it is asked to parse 'newlines' as in
> while (<>) {...}
>
> Then I submit that expecting that stream to be in binary is lunacy.
>
> It should be treated as text as it is being processed as textual lines.

No matter what defaults we choose, it will be the wrong one a
significant amount of the time. I don't think we can solve this using
different defaults. I certainly don't think guessing makes the odds
better. In the end, you should always state what you want: text (and
which encoding) or binary. Don't ask a computer to read minds; that's
asking for trouble.

Leon
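
A short sketch of stating the intent explicitly, assuming UTF-8 as the text encoding. The helper names slurp_text and slurp_binary are hypothetical, and in-memory handles stand in for real input streams.

```perl
use strict;
use warnings;

# slurp_text is a hypothetical helper: declare the handle to be UTF-8 text
# and return its contents as a string of characters.
sub slurp_text {
    my ($fh) = @_;
    binmode($fh, ':encoding(UTF-8)') or die "binmode: $!";
    local $/;                      # read the whole stream at once
    return scalar <$fh>;
}

# slurp_binary is a hypothetical helper: declare the handle to be binary
# and return its contents as a string of raw bytes.
sub slurp_binary {
    my ($fh) = @_;
    binmode($fh) or die "binmode: $!";
    local $/;
    return scalar <$fh>;
}

# The same two bytes (the UTF-8 encoding of U+00E9) read both ways.
my $bytes = "\xC3\xA9";
open(my $as_text, '<', \$bytes) or die "open: $!";
open(my $as_bin,  '<', \$bytes) or die "open: $!";
printf "as text:   %d character(s)\n", length(slurp_text($as_text));
printf "as binary: %d byte(s)\n",      length(slurp_binary($as_bin));
```

The same input yields different lengths depending on which declaration is made, so the program, not a default, has to say which one it means.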
