
Mailing List Archive: Perl: porters

[perl #43248] utf8 encoding and the -i switch

 

 



perlbug-followup at perl

Jun 19, 2007, 12:14 AM

Post #1 of 10 (325 views)
[perl #43248] utf8 encoding and the -i switch

# New Ticket Created by Martin Barth
# Please include the string: [perl #43248]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=43248 >


This is a bug report for perl from martin [at] senfdax,
generated with the help of perlbug 1.35 running under perl v5.8.8.


% cat datei
eine test datei
die "u "a "o
% file datei
datei: ASCII text
% cp datei datei.bk
% perl -wpi -e 'use encoding "utf8"; s/"a/ä/' datei
% file datei
datei: ISO-8859 text
% perl -wp -e 'use encoding "utf8"; s/"a/ä/' datei.bk > datei.neu
% file datei.neu
datei.neu: UTF-8 Unicode text

my xterm is also utf8; I think both files should be utf8, i dont know why "datei" is changed to latin1.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=medium
---
Site configuration information for perl v5.8.8:

Configured by Debian Project at Tue Mar 6 01:52:23 UTC 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
Platform:
osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi
uname='linux rothera 2.6.15.7 #1 smp sat sep 30 10:21:42 utc 2006 i686 gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
ccversion='', gccversion='4.1.2 (Ubuntu 4.1.2-0ubuntu4)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
gnulibc_version='2.5'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:


---
@INC for perl v5.8.8:
/etc/perl
/usr/local/lib/perl/5.8.8
/usr/local/share/perl/5.8.8
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.8
/usr/share/perl/5.8
/usr/local/lib/site_perl
.

---
Environment for perl v5.8.8:
HOME=/home/martin
LANG=de_DE.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/games:/home/martin/bin/
PERL_BADLANG (unset)
SHELL=/bin/zsh


rvtol+news at isolution

Jun 19, 2007, 9:27 PM

Post #2 of 10 (307 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

Martin Barth schreef:
> # New Ticket Created by Martin Barth
> # Please include the string: [perl #43248]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=43248 >
>
>
> This is a bug report for perl from martin [at] senfdax,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
>
>
> % cat datei
> eine test datei
> die "u "a "o
> % file datei
> datei: ASCII text
> % cp datei datei.bk
> % perl -wpi -e 'use encoding "utf8"; s/"a/ä/' datei
> % file datei
> datei: ISO-8859 text
> % perl -wp -e 'use encoding "utf8"; s/"a/ä/' datei.bk > datei.neu
> % file datei.neu
> datei.neu: UTF-8 Unicode text
>
> my xterm is also utf8; I think both files should be utf8, i dont know
> why "datei" is changed to latin1.


I assume you misunderstand "use encoding"; read `perldoc encoding`.


$ file datei
datei: ASCII text


$ hexdump -e '"%07_ad" 16/1 " %02X" "\n"' \
-e '"       " 16/1 " %-2_p" "\n\n"' datei

0000000 65 69 6E 20 74 65 73 74 20 64 61 74 65 69 0A 64
        e  i  n     t  e  s  t     d  a  t  e  i  .  d

0000016 69 65 20 22 75 20 22 6F 0A
        i  e     "  u     "  o  .


$ perl -C31 -i.bk -wpe 's/"o/\x{f6}/g' datei


$ hexdump -e '"%07_ad" 16/1 " %02X" "\n"' \
-e '"       " 16/1 " %-2_p" "\n\n"' datei

0000000 65 69 6E 20 74 65 73 74 20 64 61 74 65 69 0A 64
        e  i  n     t  e  s  t     d  a  t  e  i  .  d

0000016 69 65 20 22 75 20 C3 B6 0A
        i  e     "  u     .  .  .


--
Affijn, Ruud

"Gewoon is een tijger."


ikegami at adaelis

Sep 26, 2010, 10:11 PM

Post #3 of 10 (256 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

On Sun, Sep 26, 2010 at 7:58 PM, Father Chrysostomos via RT <
perlbug-followup [at] perl> wrote:

> On Sun Sep 26 13:46:43 2010, sprout wrote:
> > On Tue Jun 19 00:14:20 2007, martin [at] senfdax wrote:
> > > This is a bug report for perl from martin [at] senfdax,
> > > generated with the help of perlbug 1.35 running under perl v5.8.8.
> > >
> > >
> > > % cat datei
> > > eine test datei
> > > die "u "a "o
> > > % file datei
> > > datei: ASCII text
> > > % cp datei datei.bk
> > > % perl -wpi -e 'use encoding "utf8"; s/"a/ä/' datei
> > > % file datei
> > > datei: ISO-8859 text
> > > % perl -wp -e 'use encoding "utf8"; s/"a/ä/' datei.bk > datei.neu
> > > % file datei.neu
> > > datei.neu: UTF-8 Unicode text
> > >
> > > my xterm is also utf8; I think both files should be utf8, i dont know
> > > why "datei" is changed to latin1.
> >
> > use encoding "utf8" tells perl that the script you are giving it to
> > run is in UTF-8 encoding. It does not affect the input and output
> > streams. For that, try use open ":utf8". See also perldoc -f binmode
> > and the -C option in perlrun.
>
> My mistake. The encoding pragma *is* supposed to affect file handles.
> But it is *so* buggy I recommend staying away from it!
>

Just STDIN and STDOUT, and neither is used here. Everything's working as
documented.
- Eric


tchrist at perl

Sep 29, 2010, 9:11 AM

Post #4 of 10 (246 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

In-Reply-To: Message from "Father Chrysostomos via RT" <perlbug-followup [at] perl>
of "Sun, 26 Sep 2010 16:58:01 PDT." <rt-3.6.HEAD-5116-1285545480-548.43248-15-0 [at] perl>

>>> my xterm is also utf8; I think both files should be utf8, i dont know
>>> why "datei" is changed to latin1.

>> ‘use encoding "utf8"’ tells perl that the script you are giving it to
>> run is in UTF-8 encoding. It does not affect the input and output
>> streams. For that, try ‘use open ":utf8"’. See also ‘perldoc -f binmode’
>> and the -C option in perlrun.

> My mistake. The ‘encoding’ pragma *is* supposed to affect file handles.
> But it is *so* buggy I recommend staying away from it!

Well, that's certainly been my experience with it as well, but I've always
wondered whether I wasn't just being bone-headed. :(

There *are* a few places it appears to come in handy, provided that you
are careful with it. One such is getting proper character class semantics
on ISO 8859-1 text in your source.

% perl -le 'print chr(0xA0) =~ /\s/ ? "yup" : "nope"'
nope
% perl -le 'print chr(0xE6) =~ /\w/ ? "yup" : "nope"'
nope

% perl -Mencoding=Latin1 -le 'print chr(0xA0) =~ /\s/ ? "yup" : "nope"'
yup
% perl -Mencoding=Latin1 -le 'print chr(0xE6) =~ /\w/ ? "yup" : "nope"'
yup

% perl -CS -le ' print "\u\xDF"'
ß
% perl -le 'use encoding "latin1", STDOUT => "utf8"; print "\xDF => \u\xDF"'
ß => Ss

It's also easier than going through Encode::{en,de}code for things like this:

% perl -Mencoding=MacRoman -le 'printf "U+%04X\n", ord chr(0xCA)'
U+00A0
% perl -Mencoding=MacRoman -le 'printf "U+%04X\n", ord chr(0xBE)'
U+00E6

% perl -Mencoding=MacRoman -le 'print chr(0xCA) =~ /\s/ ? "yup" : "nope"'
yup
% perl -Mencoding=MacRoman -le 'print chr(0xBE) =~ /\w/ ? "yup" : "nope"'
yup

--tom


pagaltzis at gmx

Jul 6, 2012, 4:08 AM

Post #5 of 10 (130 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Father Chrysostomos via RT <perlbug-followup [at] perl> [2012-07-05 01:45]:
> Probably only if we deprecate the encoding pragma (or is it already
> deprecated?)

It should be.

> or someone scrutinizes the example more closely and determines for
> certain that it is working correctly.

It is correct insofar as it behaves according to the encoding.pm docs,
which only mention STDIN and STDOUT. The discrepancy in the example is
caused by the -i switch: it uses the diamond operator, which implicitly
uses ARGV, a handle that the pragma code does not touch and that its
docs do not claim it will.

So the example may be surprising but it conforms to the documentation.

> I’m not sure there are currently any individuals who actually know
> exactly what encoding.pm is supposed to be doing.

I would argue that even the individuals who designed it never knew this,
because its design can only result from deep confusion about encodings.

For one thing it declares an encoding for both the source text and the
standard I/O streams, which is unreasonable enough to be useless.

But it also reinterprets \x escape sequences, not as characters, but as
byte values to be decoded according to the given encoding. That is to
say under `use encoding 'utf8'`,

"\x{E2}\x{82}\x{AC}" eq "\N{U+20AC}"

is true because the former gets decoded as UTF-8 bytes. Dump it and you
see this:

SV = PV(0x804b3a8) at 0x80786fc
REFCNT = 1
FLAGS = (POK,READONLY,pPOK,UTF8)
PV = 0x8079098 "\342\202\254"\0 [UTF8 "\x{20ac}"]
CUR = 3
LEN = 12

This is perverse enough, but given such a string at least you expect the
result that `eq` then gives you.

It gets worse though.

Under `use encoding 'utf8'`,

"\x{E2}\x{82}\x{AC}" eq join "", map chr, 0xE2, 0x82, 0xAC

is *also* true, even though the UTF8 flag on the latter string is *off*!

SV = PV(0x804b0f8) at 0x809dee4
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK)
PV = 0x81088f8 "\342\202\254"\0
CUR = 3
LEN = 12

So encoding.pm is worse than even bytes.pm.

And were it not broken, the design would still be irredeemably confused.

It is a rare case that warrants the word in programming, but I will have
to use it here:

Never use encoding.pm.

Never.

--
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>


perl.p5p at rjbs

Jul 23, 2012, 6:50 PM

Post #6 of 10 (115 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-07-06T07:08:59]
> * Father Chrysostomos via RT <perlbug-followup [at] perl> [2012-07-05 01:45]:
> > Probably only if we deprecate the encoding pragma (or is it already
> > deprecated?)
>
> It should be.

I'm not sure I have ever heard anyone defend it for any reason.

Let us begin the process of deprecating and ejecting it, shall we?

--
rjbs


pagaltzis at gmx

Aug 3, 2012, 11:43 AM

Post #7 of 10 (115 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Ricardo Signes <perl.p5p [at] rjbs> [2012-07-24 03:55]:
> * Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-07-06T07:08:59]
> > * Father Chrysostomos via RT <perlbug-followup [at] perl> [2012-07-05 01:45]:
> > > Probably only if we deprecate the encoding pragma (or is it
> > > already deprecated?)
> >
> > It should be.
>
> I'm not sure I have ever heard anyone defend it for any reason.
>
> Let us begin the process of deprecating and ejecting it, shall we?

What does that really entail, anything more than a `use deprecate`?


perl.p5p at rjbs

Aug 3, 2012, 12:22 PM

Post #8 of 10 (115 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-08-03T14:43:38]
> * Ricardo Signes <perl.p5p [at] rjbs> [2012-07-24 03:55]:
> > * Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-07-06T07:08:59]
> > > * Father Chrysostomos via RT <perlbug-followup [at] perl> [2012-07-05 01:45]:
> > > > Probably only if we deprecate the encoding pragma (or is it
> > > > already deprecated?)
> > >
> > > It should be.
> >
> > I'm not sure I have ever heard anyone defend it for any reason.
> >
> > Let us begin the process of deprecating and ejecting it, shall we?
>
> What does that really entail, anything more than a `use deprecate`?

It's dual-life, CPAN upstream, and part of Encode. So we lobby Dan Kogai.

--
rjbs


pagaltzis at gmx

Aug 4, 2012, 2:15 AM

Post #9 of 10 (115 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Ricardo Signes <perl.p5p [at] rjbs> [2012-08-03 21:25]:
> * Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-08-03T14:43:38]
> > * Ricardo Signes <perl.p5p [at] rjbs> [2012-07-24 03:55]:
> > > Let us begin the process of deprecating and ejecting
> > > [encoding.pm], shall we?
> >
> > What does that really entail, anything more than a `use deprecate`?
>
> It's dual-life, CPAN upstream, and part of Encode. So we lobby Dan
> Kogai.

Hmm, so what does it *mean* to deprecate an upstream-CPAN module? Does
deprecate.pm work with already-dual-life modules? I think not – or am
I misinformed?

I guess as far as Encode itself is concerned, deprecating encoding.pm
just means splitting it out into a separate distribution, fiddling the
docs, and you’re done.

But what then does that mean for core perl? Do p5p just go “oh you miss
that? Well upstream split it out so it’s no longer shipped with perl”
and give a shrug? I guess the newly-created distro for encoding.pm has
to be added to the list of things the core includes… hmm, logically then
deprecate.pm must work for already-dual-life modules because how else
can it be used for the transition period of once-core-only modules? So,
then it would be added to encoding.pm inside its new distribution, which
core perl adds just so it can remove it right after the following stable
release.

Yes?

(Is this process written up anywhere? If not, I may be volunteering
myself to write a doc patch for that once it is all clarified… Where? The
deprecate.pm POD seems the most likely choice. Probably there should
then also exist a pointer to it somewhere in the pile of perl*.pod or
maybe inside Porting. Where?)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>


perl.p5p at rjbs

Aug 4, 2012, 4:20 PM

Post #10 of 10 (115 views)
Re: [perl #43248] utf8 encoding and the -i switch [In reply to]

* Aristotle Pagaltzis <pagaltzis [at] gmx> [2012-08-04T05:15:45]
> Hmm, so what does it *mean* to deprecate an upstream-CPAN module? Does
> deprecate.pm work with already-dual-life modules? I think not – or am
> I misinformed?

I think your answer is what I would also have said. Rather than stop there, I
will restate it, in case we've got some subtle difference.

1. encoding.pm would be spun out of the Encode dist
2. encoding.pm would 'use if $] >= ..., "deprecate"'
3. the encoding dist could be put into core
4. ...and then removed later.

> (Is this process written up anywhere? If not, I may be volunteering
> myself to write a doc patch for that once it is all clarified… Where? The
> deprecate.pm POD seems the most likely choice. Probably there should
> then also exist a pointer to it somewhere in the pile of perl*.pod or
> maybe inside Porting. Where?)

As a "process by which perl shipping stuff is done" I believe it belongs in
Porting.

--
rjbs
