Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Perl: porters

Perl 5.18 and Regexp::Grammars

 

 

First page Previous page 1 2 Next page Last page  View All Perl porters RSS feed   Index | Next | Previous | View Threaded


perl.p5p at rjbs

Jun 26, 2013, 8:24 PM

Post #1 of 30 (140 views)
Permalink
Perl 5.18 and Regexp::Grammars

In February, we got a report that Regexp::Grammars had been broken for some
time. It was marked as a blocker for 5.18.0. As 5.18.0 got closer, it was one
of the tickets that I hounded the list about. The problem was diagnosed and
addressed in the abstract.

You can read the details on the ticket, here:

https://rt.perl.org/rt3/Ticket/Display.html?id=116823

The belief was that the late bugfix to perl would enable Regexp::Grammars to be
mended to work under 5.18.0, even though it would not allow it to work
unaltered. At the time, I believe I thought Damian was in the loop on this,
but I now believe that this was not true.

Yesterday, Damian Conway released a new version with a lengthy notice about how
the module cannot be mended to work under 5.18.0. You can read it here:

https://metacpan.org/source/DCONWAY/Regexp-Grammars-1.030/lib/Regexp/Grammars.pm#L16

I hope it turns out that there is a way to repair R::G for 5.18.0. I've sent
Damian an email about this yesterday, and I'm copying him on this one.
Hopefully we can begin a dialog about the specific problems and how they might
be addressed.

--
rjbs
Attachments: signature.asc (0.48 KB)


davem at iabyn

Jun 27, 2013, 9:32 AM

Post #2 of 30 (128 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Wed, Jun 26, 2013 at 11:24:49PM -0400, Ricardo Signes wrote:
> I hope it turns out that there is a way to repair R::G for 5.18.0. I've sent
> Damian an email about this yesterday, and I'm copying him on this one.
> Hopefully we can begin a dialog about the specific problems and how they might
> be addressed.


Well, if Damian can point out specific behavioural deficiencies in
code blocks and qr constant overloading, I'm happy to attempt to address
them.

Note however that R::G, as was implemented, specifically relied on a bug in
the old perl implementation that allowed code blocks contained in strings
(rather than in regexes) to be interpolated into regexes and be executed,
without requiring a 'use re eval' in scope.


--
Atheism is a religion like not collecting stamps is a hobby


nick at ccl4

Jun 27, 2013, 9:48 AM

Post #3 of 30 (128 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Thu, Jun 27, 2013 at 05:32:58PM +0100, Dave Mitchell wrote:
> On Wed, Jun 26, 2013 at 11:24:49PM -0400, Ricardo Signes wrote:
> > I hope it turns out that there is a way to repair R::G for 5.18.0. I've sent
> > Damian an email about this yesterday, and I'm copying him on this one.
> > Hopefully we can begin a dialog about the specific problems and how they might
> > be addressed.
>
>
> Well, if Damian can point out specific behavioural deficiencies in
> code blocks and qr constant overloading, I'm happy to attempt to address
> them.
>
> Note however that R::G, as was implemented, specifically relied on a bug in
> the old perl implementation that allowed code blocks contained in strings
> (rather than in regexes) to be interpolated into regexes and be executed,
> without requiring a 'use re eval' in scope.

But this was the only bug that it was relying on?

ie that the code should still be able to work if 'use re eval' is added to
the scopes in which it it is used?

(I stress the *should* in the above question)

Nicholas Clark


davem at iabyn

Jun 27, 2013, 10:09 AM

Post #4 of 30 (128 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Thu, Jun 27, 2013 at 05:48:55PM +0100, Nicholas Clark wrote:
> On Thu, Jun 27, 2013 at 05:32:58PM +0100, Dave Mitchell wrote:
> > On Wed, Jun 26, 2013 at 11:24:49PM -0400, Ricardo Signes wrote:
> > > I hope it turns out that there is a way to repair R::G for 5.18.0. I've sent
> > > Damian an email about this yesterday, and I'm copying him on this one.
> > > Hopefully we can begin a dialog about the specific problems and how they might
> > > be addressed.
> >
> >
> > Well, if Damian can point out specific behavioural deficiencies in
> > code blocks and qr constant overloading, I'm happy to attempt to address
> > them.
> >
> > Note however that R::G, as was implemented, specifically relied on a bug in
> > the old perl implementation that allowed code blocks contained in strings
> > (rather than in regexes) to be interpolated into regexes and be executed,
> > without requiring a 'use re eval' in scope.
>
> But this was the only bug that it was relying on?
>
> ie that the code should still be able to work if 'use re eval' is added to
> the scopes in which it it is used?
>
> (I stress the *should* in the above question)

IIRC, the issue was that R::G was using qr-constant overloading to get at
the constant strings within user's //'s, and return the constant bits plus
pieces of added in code. If the R::G's overload method were to return a
qr// containing the string plus code, rather than returning a string, then
things should Just Work (tm).

No need even for 'use re eval' except that buggy old perls require it.

So something like this. Uncomment one of the three 'return' lines:

BEGIN {
use overload;
overload::constant 'qr' =>
sub {
# works in 5.16; fails in 5.18
return "(?{ print qq{whee!\n} })$_[0]";

# works in 5.18; fails in 5.16
return qr/(?{ print qq{whee!\n} })$_[0]/;

# works in 5.16, 5.18
use re 'eval'; return qr/(?{ print qq{whee!\n} })$_[0]/;
};
}

"foobar" =~ /bar/ or die;




--
Please note that ash-trays are provided for the use of smokers,
whereas the floor is provided for the use of all patrons.
-- Bill Royston


damian at conway

Jun 27, 2013, 1:46 PM

Post #5 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Nicholas Clark queried:

> the code should still be able to work if 'use re eval' is added to
> the scopes in which it it is used?

Should be able to work. But doesn't.

Here's a more detailed description of the various issues,
excerpted from a private discussion with rjbs:

This is the current state of play, as I currently understand it
(bear in mind that I have not had sufficient time to isolate the
problems cleanly enough to be certain of all issues, and that I have
no deep understanding of Perl's internals, so my conclusions should
probably be treated as speculations only):

* Regexp::Grammars (and other modules) use overload::constant
'qr' to rewrite augemented regexes into standard Perl syntax,
to provide new and useful functionality.

* But overload::constant 'qr' is only applied to the
compile-time constant portions of a regex.

* That's fine for "peep-hole" rewriting on regexes
(e.g. simulating a new \T metacharacter) but no good for
transformations that apply to the entire regex
(e.g. maintaining a parallel data-return stack on the entire
parse). Because, if the regex contains an interpolated variable,
overload::constant 'qr' doesn't see that piece of the final
pattern at all, so global transformations of the pattern can't be
applied to it.

* To work around this problem, Regexp::Grammars (and other
modules, notably Regexp::Debugger) use the technique of having
overload::constant 'qr' *not* return a modified version of the
original pattern "text".

* Instead they have the overloading return a blessed object with
its own '.' overloading. This '.' overloading then also
returns a blessed object, so that (at runtime) the
concatenation of interpolated variables in each regex
eventually produces a single object containing *all* the
"text" of the pattern, including the interpolated text.

* This final object also has an overloaded stringification,
which is automatically called when the regex is JIT-compiled,
just before matching starts. At that point the stringification
sub has access to the entire pattern "text", can apply the
global rewriting transformation to all of it, and having done
so can return a (now standard-syntax, but very much more
complex) pattern string, which is finally JIT-compiled into an
actual regex.

* In recent versions of Perl, the overloaded stringification
could be replaced by an overloaded qr-ification, but that sub
has to return an actual Regexp object. So it has to compile
the "pattern text" inside the qr-ification subroutine itself,
which means the pattern is compiled in a different lexical
scope from where it is declared. Hence any lexical variables
inside (?{...}) or (??{...}) blocks cannot be correctly closed
over. That means that a 'qr' overloading is often unacceptable
if the original regexs--or the rewritten regex--might ever use
either form of code block. As that's a reasonable expectation
of *any* regex, this means that the 'qr' overloading is rarely
useful in practice.

* So far so good. At least up to Perl 5.16.

* The first problem that has arisen in 5.18 is that any
overload::constant 'qr' that uses objects to collect and then
process the entire pattern "text" (i.e. the technique
described above) is now potentially broken.

* Specifically any 'qr' overloading that returns an object that
stringifies to a pattern "text" that contains (?{...}) or
(??{...}) will now *sometimes* trigger the dreaded 'use re
"eval"' warning, even if there is a 'use re "eval"' in the
scope where the pattern was originally defined.

* This appears to be because the pattern "text" supplied by the
object's final stringification is no longer compiled in the original
regex's scope, so it is not protected even by an explicit 'use
re "eval"' in that scope.

* The second problem that has arisen in 5.18 is that variables
that appear in (?{...}) or (??{...}) blocks are now checked
for 'use strict' compliance *before* the 'qr' overloading is
triggered, making it impossible to provide rewritings that
sanitize such variables.

* For example, R::G provides pseudo-variables $MATCH and %MATCH
as an interface to the current parse tree node, rewriting them
into internal (and 'strict safe') alternatives. In Perl 5.16
this produces no problems, as the 'qr' overloading is called
early so the rewritten 'strict safe' alternatives are the ones
that the compiler actually encounters. In Perl 5.18 the
compiler apparently encounters the pseudo-variables before
the 'qr' can write them out of existence. This is not an
insurmountable problem (I could change the interface...to
something far less convenient for users), but it's certainly
backwards incompatible.

* The third problem that has arisen in 5.18 is when the module
injects a code block that accesses an in-scope lexical
variable. Those blocks, when compiled, appear to
*sometimes* be failing to close over the correct variable.

* For example, the R::G <%hash> construct is rewritten into a
block like so:

(??{
exists $hash{$^N} ? q{} : q{(?!)}
})

But, when matching, the lexical variable %hash appears to be
empty inside the code block, even though it is not definitely
empty in the enclosing lexical scope.

* The final problem that has arisen in 5.18 is that several
tests in R::G's suite changed from passing under 5.16 to
segfaulting under 5.18. That's a separate, and arguably far
more serious, problem in itself...and an indication that some
deep issue still lurks in the new mechanism. I have not yet
had the time to track this problem down more specifically.


BTW, I think it is very likely that Regexp::Debugger (which,
necessarily, uses exactly the same 'qr' overloading pattern) may be
susceptible to similar problems. In my view, that's a much more serious
issue: R::G is arguably a niche product, but R::D should be in every
Perl developer's toolbox.


Finally, please note that I am currently mired in final preparations for
my annual speaking tour, which starts next week. As a result, my response-
time will be poor, and I will not be able to properly perlbug this issue
(i.e. boil Regexp::Grammars' 2500 lines of code down to minimal examples)
for at least a month.

I appreciate everyone's concern over this issue, and apologize for the trouble
it's causing.

Damian


nick at ccl4

Jun 27, 2013, 2:03 PM

Post #6 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Fri, Jun 28, 2013 at 06:46:51AM +1000, Damian Conway wrote:
> Nicholas Clark queried:
>
> > the code should still be able to work if 'use re eval' is added to
> > the scopes in which it it is used?
>
> Should be able to work. But doesn't.

> * The final problem that has arisen in 5.18 is that several
> tests in R::G's suite changed from passing under 5.16 to
> segfaulting under 5.18. That's a separate, and arguably far
> more serious, problem in itself...and an indication that some
> deep issue still lurks in the new mechanism. I have not yet
> had the time to track this problem down more specifically.

Agree. It should not be possible to provoke SEGVs.
That suggests that (at least) one ugly bug remains. :-(

> Finally, please note that I am currently mired in final preparations for
> my annual speaking tour, which starts next week. As a result, my response-
> time will be poor, and I will not be able to properly perlbug this issue
> (i.e. boil Regexp::Grammars' 2500 lines of code down to minimal examples)
> for at least a month.
>
> I appreciate everyone's concern over this issue, and apologize for the trouble
> it's causing.

Thanks for interrupting your preparations to provide the detailed summary of
issues.


On Tue, Jun 25, 2013 at 12:41:21AM -0700, Greg Lindahl wrote:
> Inspired by C-Reduce, perl-reduce takes a perl program that causes a
> bug in perl itself, such as a core-dump in the perl interpreter, and
> tries to reduce the program to the minimum program needed to cause the
> bug.
>
> https://github.com/blekko/perl-reduce
>
> I didn't have much luck finding buggy scripts to test this with; you
> guys are too clever at reducing scripts before you file bugs. I threw
> in some useful features like taint and valgrind based on one of the
> few examples that I did find.
>
> Next time <someone> shows up with a 10,000 line crasher script and an
> attitude, send <them> to me. jk

Damian suggests that his test case is only 2500 lines. Is that an acceptable
challenge? Could you also wave the requirement to have an "attitude"? :-)

Nicholas Clark


damian at conway

Jun 27, 2013, 2:19 PM

Post #7 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

PS: I saw that Dave suggested:

> If the R::G's overload method were to return a qr// containing the
> string plus code, rather than returning a string, then things should
> Just Work (tm).

Indeed.

Except, as explained in my previous post, the overload method
can't just build and return a qr// each time its invoked, because:

(a) it needs access to the entire contents of the original regex
including the contents of interpolated variables (so it needs
to use the return-an-object-that-accumulates-concatenations
technique), and

(b) even if it didn't need to accumulate interpolations, any code
blocks in the original regex wouldn't then close over the
correct variables, having been qr'd in the lexical scope of
the overload sub, rather than the lexical scope of the
original regex.

Damian


davem at iabyn

Jun 27, 2013, 3:26 PM

Post #8 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Fri, Jun 28, 2013 at 06:46:51AM +1000, Damian Conway wrote:
> * The final problem that has arisen in 5.18 is that several
> tests in R::G's suite changed from passing under 5.16 to
> segfaulting under 5.18. That's a separate, and arguably far
> more serious, problem in itself...and an indication that some
> deep issue still lurks in the new mechanism. I have not yet
> had the time to track this problem down more specifically.

I'll need more time to digest the rest of your email, but as to the SEGVs:
I've just tried running Regexp-Grammars-1.030 under 5.18.0 (with the >=
5.18 test disabled), and although many tests fail, or exit with 255 due to
'use re eval' not being in scope, I couldn't see any SEGVs.

Do you know which tests did for you?

My gut instinct would be that they are SEGVing due to recursive calls to
overloading (which quickly blows the C stack, and is a problem with all
perls).

--
print+qq&$}$"$/$s$,$a$d$g$s$@$.$q$,$:$.$q$^$,$@$a$~$;$.$q$m&if+map{m,^\d{0\,},,${$::{$'}}=chr($"+=$&||1)}q&10m22,42}6:17a2~2.3@3;^2dg3q/s"&=~m*\d\*.*g


damian at conway

Jun 27, 2013, 3:46 PM

Post #9 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

> Do you know which tests did for you?

None of the tests in the current release segfault, because I turned
off the tests that did. If you run the previous version of the test
t/seplist_countedhash_M_.t (attached) under the current version (1.030)
of Grammars.pm, it segfauts under this version of 5.18:

Summary of my perl5 (revision 5 version 18 subversion 0) configuration:

Platform:
osname=darwin, osvers=10.8.0, archname=darwin-2level
uname='darwin daneel.home.gateway 10.8.0 darwin kernel version
10.8.0: tue jun 7 16:32:41 pdt 2011;
root:xnu-1504.15.3~1release_x86_64 x86_64 '
config_args='-de
-Dprefix=/Users/damian/perl5/perlbrew/perls/perl-5.18.0
-Aeval:scriptdir=/Users/damian/perl5/perlbrew/perls/perl-5.18.0/bin'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing
-pipe -fstack-protector -I/usr/local/include',
optimize='-O3',
cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe
-fstack-protector -I/usr/local/include'
ccversion='', gccversion='4.2.1 (Apple Inc. build 5666) (dot 3)',
gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='
-fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-ldbm -ldl -lm -lutil -lc
perllibs=-ldl -lm -lutil -lc
libc=, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup
-L/usr/local/lib -fstack-protector'


Characteristics of this binary (from libperl):
Compile-time options: HAS_TIMES PERLIO_LAYERS PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
USE_64_BIT_INT USE_LARGE_FILES USE_LOCALE
USE_LOCALE_COLLATE USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
Built under darwin


Damian
Attachments: seplist_countedhash_M_.t (0.88 KB)


davem at iabyn

Jun 27, 2013, 4:05 PM

Post #10 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Fri, Jun 28, 2013 at 08:46:45AM +1000, Damian Conway wrote:
> None of the tests in the current release segfault, because I turned
> off the tests that did. If you run the previous version of the test
> t/seplist_countedhash_M_.t (attached) under the current version (1.030)
> of Grammars.pm, it segfauts under this version of 5.18:

Thanks, I can reproduce it now.

--
That he said that that that that is is is debatable, is debatable.


davem at iabyn

Jun 27, 2013, 5:22 PM

Post #11 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Fri, Jun 28, 2013 at 12:05:13AM +0100, Dave Mitchell wrote:
> On Fri, Jun 28, 2013 at 08:46:45AM +1000, Damian Conway wrote:
> > None of the tests in the current release segfault, because I turned
> > off the tests that did. If you run the previous version of the test
> > t/seplist_countedhash_M_.t (attached) under the current version (1.030)
> > of Grammars.pm, it segfauts under this version of 5.18:
>
> Thanks, I can reproduce it now.

[. cutting Damian out of the CC for this subthread as he's probably got
better things to worry about ]

I can reduce the failure to the following code:

m{(?&foo){0}(?<foo>)}

which SEGVs during compilation. 'git blame' blames Karl.
(which is not to say that there aren't other SEGing issues, which might
involve code blocks, but this first SEGV is innocent of them).

Karl, is this something you can deal with?

v5.17.7.0-60-g3018b82

3018b823898645e44b8c37c70ac5c6302b031381 is the first bad commit
commit 3018b823898645e44b8c37c70ac5c6302b031381
Author: Karl Williamson <public [at] khwilliamson>
Date: Mon Dec 17 21:37:40 2012 -0700

Consolidate some regex OPS

The regular rexpression operation POSIXA works on any of the (currently)
16 posix classes (like \w and [:graph:]) under the regex modifier /a.
This commit creates similar operations for the other modifiers: POSIXL
(for /l), POSIXD (for /d), POSIXU (for /u), plus their complements.

It causes these ops to be generated instead of the ALNUM, DIGIT,
HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net
saving is 22 regnode types.

The reason to do this is for maintenance. As of this commit, there are
now 22 fewer node types for which code has to be maintained. The code
for each variant was essentially the same logic, but on different
operands. It would be easy to make a change to one copy and forget to
make the corresponding change in the others. Indeed, this patch fixes
[perl #114272] in which one copy was out of sync with others.

This patch actually reduces the number of separate code paths to 5:
POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the
last 3 use the same code path as their non-complemented version, except
that a variable is initialized differently. The code then XORs this
variable with its result to do the complementing or not. Further, the
POSIXD branch now just checks if the target string being matched is
UTF-8 or not, and then jumps to either the POSIXU or POSIXA code
respectively. So, there are effectively only 4 cases that are coded:
POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry
about UTF-8, while NPOSIXA does, hence these for efficiency are coded
separately.)

Removing all this code saves memory. The output of the Linux size
command shows that the perl executable was shrunk by 33K bytes on my
platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.

The reason this patch was doable was previous work in numbering the
POSIX classes, so that they could be indexed in arrays and bit
positions. This is a large patch; I didn't see how to break it into
smaller components.

I chose to make this code more efficient as opposed to saving even more
memory. Thus there is a separate loop that is jumped to after we know
we have to load a swash; this just saves having to test if the swash is
loaded each time through the loop. I avoid loading the swash until
absolutely necessary. In places in the previous version of this code,
the swash was loaded when the input was UTF-8, even if it wasn't yet
needed (and might never be if the input didn't contain anything above
Latin1); apparently to avoid the extra test per iteration.

The Perl test suite runs slightly faster on my platform with this patch
under -O0, and the speeds are indistinguishable under -O2. This is in
spite of these new POSIX regops being unknown to the regex optimizer
(this will be addressed in future commits), and extra machine
instructions being required for each character (the xor, and some
shifting and masking). I expect this is a result of better caching, and
not loading swashes unless absolutely necessary.


[davem [at] robi bleed]$ valgrind ./miniperl ~/tmp/p2
==11524== Memcheck, a memory error detector
==11524== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==11524== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11524== Command: ./miniperl /home/davem/tmp/p2
==11524==
==11524== Invalid read of size 4
==11524== at 0x4EC7E1: Perl_re_op_compile (regcomp.c:6461)
==11524== by 0x42933B: Perl_pmruntime (op.c:4657)
==11524== by 0x4BECE1: Perl_yyparse (perly.y:1330)
==11524== by 0x40A56F: S_parse_body (perl.c:2313)
==11524== by 0x408D2C: perl_parse (perl.c:1625)
==11524== by 0x45059C: main (miniperlmain.c:111)
==11524== Address 0x4 is not stack'd, malloc'd or (recently) free'd
==11524==
==11524==
==11524== Process terminating with default action of signal 11 (SIGSEGV)
==11524== Access not within mapped region at address 0x4
==11524== at 0x4EC7E1: Perl_re_op_compile (regcomp.c:6461)
==11524== by 0x42933B: Perl_pmruntime (op.c:4657)
==11524== by 0x4BECE1: Perl_yyparse (perly.y:1330)
==11524== by 0x40A56F: S_parse_body (perl.c:2313)
==11524== by 0x408D2C: perl_parse (perl.c:1625)
==11524== by 0x45059C: main (miniperlmain.c:111)
==11524== If you believe this happened as a result of a stack
==11524== overflow in your program's main thread (unlikely but
==11524== possible), you can try to increase the size of the
==11524== main thread stack using the --main-stacksize= flag.
==11524== The main thread stack size used in this run was 8388608.
==11524==
==11524== HEAP SUMMARY:
==11524== in use at exit: 199,574 bytes in 722 blocks
==11524== total heap usage: 810 allocs, 88 frees, 209,456 bytes allocated
==11524==
==11524== LEAK SUMMARY:
==11524== definitely lost: 0 bytes in 0 blocks
==11524== indirectly lost: 0 bytes in 0 blocks
==11524== possibly lost: 0 bytes in 0 blocks
==11524== still reachable: 199,574 bytes in 722 blocks
==11524== suppressed: 0 bytes in 0 blocks
==11524== Rerun with --leak-check=full to see details of leaked memory
==11524==
==11524== For counts of detected and suppressed errors, rerun with: -v
==11524== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)
Segmentation fault (core dumped)


--
A power surge on the Bridge is rapidly and correctly diagnosed as a faulty
capacitor by the highly-trained and competent engineering staff.
-- Things That Never Happen in "Star Trek" #9


public at khwilliamson

Jun 27, 2013, 9:42 PM

Post #12 of 30 (127 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On 06/27/2013 06:22 PM, Dave Mitchell wrote:
> I can reduce the failure to the following code:
>
> m{(?&foo){0}(?<foo>)}
>
> which SEGVs during compilation. 'git blame' blames Karl.
> (which is not to say that there aren't other SEGing issues, which might
> involve code blocks, but this first SEGV is innocent of them).
>
> Karl, is this something you can deal with?
>
> v5.17.7.0-60-g3018b82
>
> 3018b823898645e44b8c37c70ac5c6302b031381 is the first bad commit
> commit 3018b823898645e44b8c37c70ac5c6302b031381

I believe I have found the problem. I mistakenly combined what should
have been two commits in this one, and the commit message doesn't even
mention the one at fault.

What appears to be going on is that

(?&foo){0}

is getting replaced by a NOTHING node, which is an optimization we
agreed to on p5p. Thus it doesn't set up 'foo', and so the reference to
it is NULL, which segfaults when dereferenced.

Suggestions as to how to proceed are welcome. Obviously, we could
remove the optimization. Or apply it when what is being replaced is one
of certain node types, or has some characteristic, like width. I
haven't looked into this at all.


davem at iabyn

Jul 1, 2013, 3:12 AM

Post #13 of 30 (124 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Thu, Jun 27, 2013 at 10:42:54PM -0600, Karl Williamson wrote:
> On 06/27/2013 06:22 PM, Dave Mitchell wrote:
> >I can reduce the failure to the following code:
> >
> > m{(?&foo){0}(?<foo>)}
> >
> >which SEGVs during compilation. 'git blame' blames Karl.
> >(which is not to say that there aren't other SEGing issues, which might
> >involve code blocks, but this first SEGV is innocent of them).
> >
> >Karl, is this something you can deal with?
> >
> >v5.17.7.0-60-g3018b82
> >
> >3018b823898645e44b8c37c70ac5c6302b031381 is the first bad commit
> >commit 3018b823898645e44b8c37c70ac5c6302b031381
>
> I believe I have found the problem. I mistakenly combined what
> should have been two commits in this one, and the commit message
> doesn't even mention the one at fault.
>
> What appears to be going on is that
>
> (?&foo){0}
>
> is getting replaced by a NOTHING node, which is an optimization we
> agreed to on p5p. Thus it doesn't set up 'foo', and so the
> reference to it is NULL, which segfaults when dereferenced.
>
> Suggestions as to how to proceed are welcome. Obviously, we could
> remove the optimization. Or apply it when what is being replaced is
> one of certain node types, or has some characteristic, like width.
> I haven't looked into this at all.

Not knowing the details of the optimisation, I would hazard the suggestion
what we just remove the optimisation for now (hopefully a simple fix that
can be done now and backported in time for 5.18.1), then worry about doing
a better fix later on.

--
Standards (n). Battle insignia or tribal totems.


davem at iabyn

Jul 1, 2013, 9:48 AM

Post #14 of 30 (124 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Fri, Jun 28, 2013 at 06:46:51AM +1000, Damian Conway wrote:
> Here's a more detailed description of the various issues,
> excerpted from a private discussion with rjbs:

(Please respond to this only when you have the time).

Background
----------

First, a bit of background about how (?{}) handling has changed in 5.18.0.

Pre 5.18, the compilation of code blocks was handled by doing an eval
of individual code blocks from within the regex engine, after the pattern
string had been fully assembled.

Apart from implementation deficiencies, like the fact that you usually got
SEGVs as soon as you tried to access outer lexical vars, it was
conceptually flawed. For example in the following:

for my $x ((qw(a b c)) {
push @r, qr/(??{ $x })/;
}
/-$r[1]-/;

you might expect the pattern to be equivalent to /-b-/, but before 5.18 it
was actually equivalent to /--/.

The general fix was to handle code blocks directly in the parser, like
the way that something like "a$foo{expr}a" is already handled: the
perl toker converts that into something like

('a' . $foo{expr}. 'b')

Note that the index expression is parsed directly; in particular something
like "a$foo{'}'}a" works, and doesn't rely on balanced braces.

Similarly, /a(?{expr}b/

is now toked up as

regcomp('a', {expr}, 'b')

so the code block is directly compiled as part of the same pass as the
surrounding code (and is part of the same sub and shares the same pad).

Similarly, qr/a(?{expr}b/

is now toked up (roughly speaking) as

sub { regcomp('a', {expr}, 'b') }

so each time you execute qr//, you get a new closure.

In the presence of

overload::constant::qr => \&my_qr;

/a(?{expr}b/ gets toked as

regcomp(my_qr('a'), {expr}, my_qr('b'))

(with the calls to my_qr() done during toking).

Note that the code block text is no longer passed to my_qr().

At the beginning of regex compilation, the list of args is concatenated
into a string, using the usual '.', '""' and 'qr' overload methods.
The slight catch to this is that the concatter has access to the original
text of each code block, plus a pointer to the op subtree. It concats the
text of the code block, but rather than asking the regex compiler to
compile the code block, it just uses the already-compiled op-tree. It
remembers the substring of the pattern which corresponds to that op tree;
so in something like /ab(?{})/ it knows that the first optree maps to
chars 2-6 of the pattern.

There are two ways this fails to work. First, in a runtime pattern like

$c = '(?{})'; /ab$c/

the concatter fails to find an optree that corresponds to the code block
text (i.e. chars 2-6); in this case it dies unless 'use re eval'
is in scope; or if it is, it reparses the pattern string to compile any
run-time code blocks.

The second case is in the presence of overloaded concatenation; in this
case perl can no longer reliably assume that the code text maps to
pre-compiled optrees, and in this case it puts its hands up in horror,
throws away the optree and treats it as a run-time code block (and insists
on 'use re eval'.

(This may at least partially explain when you're seen it randomly
requiring 'use re eval'.)


Suggested workaround
--------------------

If I understand requirements correctly, what you need is, given a user
regex within the scope of 'Regex::Grammar', to extract out the full final
text of the regex (which may include run-time components and code blocks),
then to modify that text, including:
* injecting new code blocks into the text;
* modifying some variable references within user code blocks;
then compile the regex, including compiling all the code blocks within the
scope of the caller.

I think the following code demonstrates all the above working. It relies
on the fact that concatenation overload triggers disabling of compile-time
code blocks, and forces everything to run-time. So it relies on
'use re eval' being in scope at the caller.

package Foo;

our $lexical = 'XXX BAD XXX';

use overload
q{""} => sub {
my ($pat) = @_;
$pat = $pat->[0];

# demo: map '<ABC>' to '(??{$lexical})ABC'
$pat =~ s/<(\w+)>/(??{\$lexical})$1/g;

# demo: map certain keywords in code blocks to
# literal strings (XXX this doesn't check whether
# they're actually within a code block)
$pat =~ s/MATCH/"match"/g;
$pat =~ s/ARGS/"args"/g;
$pat;
},
q{.} => sub {
my ($a1, $a2) = @_;
$a1 = $a1->[0] if ref $a1;
$a2 = $a2->[0] if ref $a2;
bless [ "$a1$a2" ], 'Foo';
},
;


package main;

BEGIN {
overload::constant qr => sub { return bless [ $_[0] ], 'Foo' };
}

my $str = 'ghi-<BAR>-jkl-(??{ MATCH })-mno';
my $lexical = "lex";

use re 'eval';
my $r = qr/^abc-<FOO>-def-(??{ ARGS })-$str$/;
"abc-lexFOO-def-args-ghi-lexBAR-jkl-match-mno" =~ $r or die;
print $r, "\n";


This outputs, on both 5.16.3 and 5.18.0:

(?^:^abc-(??{$lexical})FOO-def-(??{ "args" })-ghi-(??{$lexical})BAR-jkl-(??{ "match" })-mno$)


Note that it matches, and uses the 'lex' value of $lexical in both the
user's and Foo's code blocks.

Is this close to what you need?

I haven't considered the case where there is only one component in the
regex, so concatenation isn't triggered. This might throw a spanner in the
works.


Finally, here are some replies to your specific issues.

> * Specifically any 'qr' overloading that returns an object that
> stringifies to a pattern "text" that contains (?{...}) or
> (??{...}) will now *sometimes* trigger the dreaded 'use re
> "eval"' warning, even if there is a 'use re "eval"' in the
> scope where the pattern was originally defined.

I can't reproduce this; I would need sample code.

> * The second problem that has arisen in 5.18 is that variables
> that appear in (?{...}) or (??{...}) blocks are now checked
> for 'use strict' compliance *before* the 'qr' overloading is
> triggered, making it impossible to provide rewritings that
> sanitize such variables.

Yep, you can't rewrite code blocks any more, unless you can force them to
become run-time, then overload-concatenate them, as shown above.

> * The third problem that has arisen in 5.18 is when the module
> injects a code block that accesses an in-scope lexical
> variable. Those blocks, when compiled, appear to
> *sometimes* be failing to close over the correct variable.
>
> * For example, the R::G <%hash> construct is rewritten into a
> block like so:
>
> (??{
> exists $hash{$^N} ? q{} : q{(?!)}
> })
>
> But, when matching, the lexical variable %hash appears to be
> empty inside the code block, even though it is not definitely
> empty in the enclosing lexical scope.

Again, I'd need sample code that reproduces this.

> * The final problem that has arisen in 5.18 is that several
> tests in R::G's suite changed from passing under 5.16 to
> segfaulting under 5.18. That's a separate, and arguably far
> more serious, problem in itself...and an indication that some
> deep issue still lurks in the new mechanism. I have not yet
> had the time to track this problem down more specifically.

This appears to be a separate issue; I reduced it to the following code
which segfaults:

m{(?&foo){0}(?<foo>)}

It's related to '{0}' triggering a new optimisation that deletes the node
to the left of it. This is likely to be fixed for 5.18.1. The workaround
(if possible) is to avoid '{0}'. I don't know whether there will be any
remaining segfault issues after that.

--
In my day, we used to edit the inodes by hand. With magnets.


public at khwilliamson

Jul 1, 2013, 10:05 AM

Post #15 of 30 (124 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On 07/01/2013 04:12 AM, Dave Mitchell wrote:
> On Thu, Jun 27, 2013 at 10:42:54PM -0600, Karl Williamson wrote:
>> On 06/27/2013 06:22 PM, Dave Mitchell wrote:
>>> I can reduce the failure to the following code:
>>>
>>> m{(?&foo){0}(?<foo>)}
>>>
>>> which SEGVs during compilation. 'git blame' blames Karl.
>>> (which is not to say that there aren't other SEGing issues, which might
>>> involve code blocks, but this first SEGV is innocent of them).
>>>
>>> Karl, is this something you can deal with?
>>>
>>> v5.17.7.0-60-g3018b82
>>>
>>> 3018b823898645e44b8c37c70ac5c6302b031381 is the first bad commit
>>> commit 3018b823898645e44b8c37c70ac5c6302b031381
>>
>> I believe I have found the problem. I mistakenly combined what
>> should have been two commits in this one, and the commit message
>> doesn't even mention the one at fault.
>>
>> What appears to be going on is that
>>
>> (?&foo){0}
>>
>> is getting replaced by a NOTHING node, which is an optimization we
>> agreed to on p5p. Thus it doesn't set up 'foo', and so the
>> reference to it is NULL, which segfaults when dereferenced.
>>
>> Suggestions as to how to proceed are welcome. Obviously, we could
>> remove the optimization. Or apply it when what is being replaced is
>> one of certain node types, or has some characteristic, like width.
>> I haven't looked into this at all.
>
> Not knowing the details of the optimisation, I would hazard the suggestion
> what we just remove the optimisation for now (hopefully a simple fix that
> can be done now and backported in time for 5.18.1), then worry about doing
> a better fix later on.
>

Now reverted by

commit 2e3a23da260a7ec5d61b81cb34c38de5e528b41d
Author: Karl Williamson <public [at] khwilliamson>
Date: Mon Jul 1 10:26:14 2013 -0600

Fix regex seqfault 5.18 regression

This segfault is a result of an optimization that can leave the
compilation in an inconsistent state.

/f{0}/

doesn't match anything, and hence should be removable from the
regex for
all f. However,

qr{(?&foo){0}(?<foo>)}

caused a segfault. What was happening prior to this commit is that
(?&foo) refers to a named capture group further along in the regex.
The "{0}" caused the "(?&foo)" to be discarded prior to setting up the
pointers between the two related subexpressions; a segfault follows.

This commit removes the optimization, and should be suitable for a
maintenance release.

One might think that no one would be writing code like this, but this
example was distilled from machine-generated code in Regexp::Grammars.

Perhaps this optimization can be done, but the location I chose for
checking it was during parsing, which turns out to be premature. It
would be better to do it in the optimization phase of regex
compilation.
Another option would be to retain it where it was, but for it to
operate
only on a limited set of nodes, such as EXACTish, which would have no
unintended consequences. But that is for looking at in the
future; the
important thing is to have a simple patch suitable for fixing this
regression in a maintenance release.

For the record, the code being reverted was mistakenly added by me in
commit 3018b823898645e44b8c37c70ac5c6302b031381, and wasn't even
mentioned in that commit message. It should have had its own commit.


davem at iabyn

Jul 14, 2013, 11:51 AM

Post #16 of 30 (103 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Mon, Jul 01, 2013 at 11:05:25AM -0600, Karl Williamson wrote:
> On 07/01/2013 04:12 AM, Dave Mitchell wrote:
> >>(?&foo){0}
> >>
> >>is getting replaced by a NOTHING node, which is an optimization we
> >>agreed to on p5p. Thus it doesn't set up 'foo', and so the
> >>reference to it is NULL, which segfaults when dereferenced.
> >>
> >>Suggestions as to how to proceed are welcome. Obviously, we could
> >>remove the optimization. Or apply it when what is being replaced is
> >>one of certain node types, or has some characteristic, like width.
> >>I haven't looked into this at all.
> >
> >Not knowing the details of the optimisation, I would hazard the suggestion
> >what we just remove the optimisation for now (hopefully a simple fix that
> >can be done now and backported in time for 5.18.1), then worry about doing
> >a better fix later on.
> >
>
> Now reverted by
>
> commit 2e3a23da260a7ec5d61b81cb34c38de5e528b41d

And now cherry-picked into maint-5.18 by

0ec7dc753859025ffc69eb02bf79a3f27f794ab1

--
In my day, we used to edit the inodes by hand. With magnets.


damian at conway

Jul 15, 2013, 5:57 AM

Post #17 of 30 (103 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Two weeks ago Dave Mitchell wrote:

> (Please respond to this only when you have the time).

I now have time. :-)


> Suggested workaround
> --------------------

The workaround Dave demonstrated recreated precisely the technique that
Regexp::Grammars currently uses. All the problems I reported occur when
using this very workaround. See below.


> I think the following code demonstrates all the above working. It relies
> on the fact that concatenation overload triggers disabling of compile-time
> code blocks, and forces everything to run-time. So it relies on
> 'use re eval' being in scope at the caller.

The code Dave offered did indeed work for the examples he included.

But the code does not work for everything that Regexp::Grammars does.
The problems are as follows...


>> * Specifically any 'qr' overloading that returns an object that
>> stringifies to a pattern "text" that contains (?{...}) or
>> (??{...}) will now *sometimes* trigger the dreaded 'use re
>> "eval"' warning, even if there is a 'use re "eval"' in the
>> scope where the pattern was originally defined.
>
> I can't reproduce this; I would need sample code.

This is the simplest example I could find. It works correctly under 5.14
(i.e. if prints "DONE", then prints "matched"). Under 5.18 it generates
a compile-time error:

Eval-group not allowed at runtime, use re 'eval' in regex

m/ [^#]+ (?{ say 'DONE'; }) /

at (eval 1) line 6.

Note that the RegexpProcessor code here is identical to Dave's original
workaround code, except that the commented line has been added to the
stringification operator to simulate the types of (much more complex)
code-block injections that Regexp::Grammars actually carries out:

-----cut----------cut----------cut----------cut----------cut----------cut-----

# Ensure 'eval' is allowed throughout the entire example code...
use re 'eval';

package RegexProcessor;
use overload (
q{""} => sub {
my ($pat) = @_;
my $complete_pattern = $pat->[0];

# This is an extremely simplified version of the type of
# code block injection that Regexp::Grammars performs...
return qq{
$complete_pattern
(?{ say 'DONE'; })
};
},
q{.} => sub {
my ($a1, $a2) = @_;
$a1 = $a1->[0] if ref $a1;
$a2 = $a2->[0] if ref $a2;
return bless [ "$a1$a2" ], 'RegexProcessor';
},
);

package main;
BEGIN {
overload::constant qr => sub { return bless [ $_[0] ],
'RegexProcessor' };
}

my $bracketed = qr{
[^#]+
}xms;

say 'matched' if '#foobar#' =~ $bracketed && $& eq 'foobar';


-----end----------end----------end----------end----------end----------end-----


>> * The second problem that has arisen in 5.18 is that variables
>> that appear in (?{...}) or (??{...}) blocks are now checked
>> for 'use strict' compliance *before* the 'qr' overloading is
>> triggered, making it impossible to provide rewritings that
>> sanitize such variables.
>
> Yep, you can't rewrite code blocks any more, unless you can force them to
> become run-time, then overload-concatenate them, as shown above.

Even when they are forced to become run-time (using your workaround code),
'use strict' compliance seems to be tested too early (i.e. before the
qr-overloading has a chance to "vanish" the variable in question).

For example, the following code works as expected under 5.14
(i.e. the post-processed regex correctly matches), but under 5.18
it generates an odd "double fatality" compile-time error:

Global symbol "$MAGIC_VAR" requires explicit package name at
demo.pl line 32.
Global symbol "$MAGIC_VAR" requires explicit package name at (eval
1) line 1.

Once again, the RegexpProcessor code is identical to Dave's workaround
code, except that this time the commented line has been added to the
qr-overloading in order to replace $MAGIC_VAR in the source with 'foo'
(this is a minimal version of the various kinds of much more complex
manipulations that Regexp::Grammars actually does):

-----cut----------cut----------cut----------cut----------cut----------cut-----

package RegexProcessor;

use overload (
q{""} => sub {
my ($pat) = @_;
return $pat->[0];
},
q{.} => sub {
my ($a1, $a2) = @_;
$a1 = $a1->[0] if ref $a1;
$a2 = $a2->[0] if ref $a2;
return bless [ "$a1$a2" ], 'RegexProcessor';
},
);

package main;
use re 'eval';

BEGIN {
overload::constant qr => sub {
my ($regex_pattern) = @_;

# Replace raw $MAGIC_VAR with 'foo'...
# (A greatly simplified version of what Regexp::Grammars does)
$regex_pattern =~ s/\$MAGIC_VAR/'foo'/g;

return bless [ $regex_pattern ], 'RegexProcessor'
};
}

use strict;
say 'matched' if "foobar" =~ m{ (??{ $MAGIC_VAR }) bar }xms;

-----end----------end----------end----------end----------end----------end-----



>> * The third problem that has arisen in 5.18 is when the module
>> injects a code block that accesses an in-scope lexical
>> variable. Those blocks, when compiled, appear to
>> *sometimes* be failing to close over the correct variable.
>>
>> * For example, the R::G <%hash> construct is rewritten into a
>> block like so:
>>
>> (??{
>> exists $hash{$^N} ? q{} : q{(?!)}
>> })
>>
>> But, when matching, the lexical variable %hash appears to be
>> empty inside the code block, even though it is not definitely
>> empty in the enclosing lexical scope.
>
> Again, I'd need sample code that reproduces this.

The following code demonstrates the problem. It works
as expected under 5.14 (i.e. prints the three hash keys)
but under 5.18 it generates a compile-time error:

"Global symbol "%hash" requires explicit package name at (eval 1) line 1."

Once again the RegexpProcessor code is just Dave's workaround:

-----cut----------cut----------cut----------cut----------cut----------cut-----

package RegexProcessor;
use overload (
q{""} => sub {
my ($pat) = @_;
return $pat->[0];
},
q{.} => sub {
my ($a1, $a2) = @_;
$a1 = $a1->[0] if ref $a1;
$a2 = $a2->[0] if ref $a2;
return bless [ "$a1$a2" ], 'RegexProcessor';
},
);

package main;
use re 'eval';

BEGIN {
overload::constant qr => sub { return bless [ $_[0] ],
'RegexProcessor' };
}

my %hash = ( a => 1, b => 2, c => 3);

"" =~ m{ (?{say join ', ', keys %hash}) }xms;

-----end----------end----------end----------end----------end----------end-----


I believe that these three examples now cover all the remaining
failures that are preventing Regexp::Grammars from working correctly
under Perl 5.18 (apart from the segfaulting issue that has already been
dealt with).

Thanks,

Damian


davem at iabyn

Jul 31, 2013, 3:06 PM

Post #18 of 30 (74 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Mon, Jul 15, 2013 at 04:57:43PM +0400, Damian Conway wrote:
> Two weeks ago Dave Mitchell wrote:
>
> > (Please respond to this only when you have the time).
>
> I now have time. :-)

Sorry, I didn't have time. I have now.

I've started to work my way through your email. I'll reply to specific
points in subsequent emails as a work further along.
> This is the simplest example I could find. It works correctly under 5.14
> (i.e. if prints "DONE", then prints "matched"). Under 5.18 it generates
> a compile-time error:
>
> Eval-group not allowed at runtime, use re 'eval' in regex
>
> m/ [^#]+ (?{ say 'DONE'; }) /
>
> at (eval 1) line 6.

This is actually a more fundamental regression in 5.18.0, that doesn't
require any use of overload etc; just a '#' within a character class in an
extended regex stops code blocks from being parsed:


$ perl5180 -e'/[#](?{})/x'
Eval-group not allowed at runtime, use re 'eval' in regex m/[#](?{})/ at -e line 1.
$ perl5180 -e'use re "eval"; /[#](?{})/x'
Eval-group not allowed at runtime, use re 'eval' in regex m/[#](?{})/ at (eval 1) line 1.

Fixed now in bleed with c30fc27b4df65a43710b25dd1d2a57d78ee2fe33.
This is a strong candidate for back-porting to maint-5.18.

--
Never do today what you can put off till tomorrow.


davem at iabyn

Aug 6, 2013, 11:19 AM

Post #19 of 30 (45 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Mon, Jul 15, 2013 at 04:57:43PM +0400, Damian Conway wrote:
> package RegexProcessor;
> use overload (
> q{""} => sub {
> my ($pat) = @_;
> return $pat->[0];
> },
> q{.} => sub {
> my ($a1, $a2) = @_;
> $a1 = $a1->[0] if ref $a1;
> $a2 = $a2->[0] if ref $a2;
> return bless [ "$a1$a2" ], 'RegexProcessor';
> },
> );
>
> package main;
> use re 'eval';
>
> BEGIN {
> overload::constant qr => sub { return bless [ $_[0] ],
> 'RegexProcessor' };
> }
>
> my %hash = ( a => 1, b => 2, c => 3);
>
> "" =~ m{ (?{say join ', ', keys %hash}) }xms;
>

Fixed in blead with the following commit; suitable for pack-porting to
5.18.2/

commit c3923c33af542d8764d5a1e4eb5d7b311f443b89
Author: David Mitchell <davem [at] iabyn>
AuthorDate: Tue Aug 6 16:34:50 2013 +0100
Commit: David Mitchell <davem [at] iabyn>
CommitDate: Tue Aug 6 16:44:12 2013 +0100

reparse compile-time /(?{})/ in right scope

When a compile-time regex like /...(?{ code-block }) .../
is compiled in the presence of constant and concat overloading,
this can cause (still at compile-time) for the pattern to be evaled and
re-compiled, in order to re-compile any code-blocks that got messed up
during the overloading and thus whose text no longer matches that which
the perl parser previously compiled.

When this happens, eval_sv() happens to be called when the perl parser is
still in compiling state; normally its called from running state.
This tickles an undiscovered bug in Perl_find_runcv_where(), which
finds the current cop sequence by looking at PL_curcop->cop_seq.
At compile time, we need to get it from PL_cop_seqmax instead.

M pp_ctl.c
M t/re/overload.t


--
In economics, the exam questions are the same every year.
They just change the answers.


perl.p5p at rjbs

Aug 6, 2013, 12:52 PM

Post #20 of 30 (45 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

* Dave Mitchell <davem [at] iabyn> [2013-08-06T14:19:11]
> Fixed in blead with the following commit; suitable for pack-porting to
> 5.18.2/
>
> commit c3923c33af542d8764d5a1e4eb5d7b311f443b89

Fantastic, thanks Dave!

I will cherry-pick this once 5.18.1-final is tagged.

--
rjbs
Attachments: signature.asc (0.48 KB)


davem at iabyn

Aug 8, 2013, 8:48 AM

Post #21 of 30 (45 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

On Mon, Jul 15, 2013 at 04:57:43PM +0400, Damian Conway wrote:
> >> * The second problem that has arisen in 5.18 is that variables
> >> that appear in (?{...}) or (??{...}) blocks are now checked
> >> for 'use strict' compliance *before* the 'qr' overloading is
> >> triggered, making it impossible to provide rewritings that
> >> sanitize such variables.
> >
> > Yep, you can't rewrite code blocks any more, unless you can force them to
> > become run-time, then overload-concatenate them, as shown above.
>
> Even when they are forced to become run-time (using your workaround code),
> 'use strict' compliance seems to be tested too early (i.e. before the
> qr-overloading has a chance to "vanish" the variable in question).
>
> For example, the following code works as expected under 5.14
> (i.e. the post-processed regex correctly matches), but under 5.18
> it generates an odd "double fatality" compile-time error:
>
> Global symbol "$MAGIC_VAR" requires explicit package name at
> demo.pl line 32.
> Global symbol "$MAGIC_VAR" requires explicit package name at (eval
> 1) line 1.
>
> Once again, the RegexpProcessor code is identical to Dave's workaround
> code, except that this time the commented line has been added to the
> qr-overloading in order to replace $MAGIC_VAR in the source with 'foo'
> (this is a minimal version of the various kinds of much more complex
> manipulations that Regexp::Grammars actually does):
>
> -----cut----------cut----------cut----------cut----------cut----------cut-----
>
> package RegexProcessor;
>
> use overload (
> q{""} => sub {
> my ($pat) = @_;
> return $pat->[0];
> },
> q{.} => sub {
> my ($a1, $a2) = @_;
> $a1 = $a1->[0] if ref $a1;
> $a2 = $a2->[0] if ref $a2;
> return bless [ "$a1$a2" ], 'RegexProcessor';
> },
> );
>
> package main;
> use re 'eval';
>
> BEGIN {
> overload::constant qr => sub {
> my ($regex_pattern) = @_;
>
> # Replace raw $MAGIC_VAR with 'foo'...
> # (A greatly simplified version of what Regexp::Grammars does)
> $regex_pattern =~ s/\$MAGIC_VAR/'foo'/g;
>
> return bless [ $regex_pattern ], 'RegexProcessor'
> };
> }
>
> use strict;
> say 'matched' if "foobar" =~ m{ (??{ $MAGIC_VAR }) bar }xms;
>
> -----end----------end----------end----------end----------end----------end-----

I think that this is the one that will be impossible to work fully
workaround; i.e. the modification of user-supplied code blocks before the
perl parser gets to see them.

Note first that moving the code-manipulation from the overload q{""}
function (as it was in my sample code) to the overload::constant qr
function (as it is in your sample code) will never work: the
overload::constant function is never called under any circumstances for
the text of literal code blocks in 5.18.x; which is why in my example
code I did the manipulation in the final stringification call (q{""}).

Before I discuss this in more detail, first can I ask whether its
absolutely necessary for R::G to modify user code? Could the effects you
achieve be done by exporting (say) a tied var $MAGIC_VAR into the callers
namespace???

Anyway, let me explain in a bit more detail what's going on.
(if this is tl;dr, then just skip the end where I discuss alternatives)

In the presence of overload::constant qr => \&f, a general regex like
/abc(?{d})e$f/, is toked/parsed at the same time as the surrounding perl
code, into a list op that looks like

regcomp(f('abc'), '(?{d})', {d}, f('e'), $f);

where the calls to f() are done at compile time, so if we have, say,

sub f { uc $_[0] }

then the above actually arrives at the parser as:

regcomp('ABC', '(?{d})', {d}, 'E', $f);

Note that the text of the code block is *not* passed through f(). Also,
note that both the text of the code block and the code block itself are
passed; the text is so that the regex compiler itself can assemble the
full, original text of the regex (so that print qr/(?{})/ will display the
right thing for example), but that also the 'bare' code is exposed and is
parsed and compiled along with everything else - so the {d} above is a bit
like the code block in map or grep.

The regcomp() above will be processed at compile-time if all of its
components are compile-time (so the above without the $f, for example),
and at run-time otherwise.

In either case, the regex compiler is called with
a) a list of strings (or regex objects) like ('ABC', '(?{d})', 'E',
whatever $f contains);
b) a list of optrees, one for each literal codeblock that got parsed
(so {d} in the above).

The regex compiler concats the list of strings into a single string that
represents the final pattern to be compiled.
If there is just a single item in the list, then 'qr' or '""' overloading
will be called if available to convert that single item into a final
pattern (or regex object). If there are multiple items, then we start with
an empty string, then concat each item to it, first applying
qr-overloading if necessary, then calling '.' overloading if it exists,
falling back to plain concatenation (using '""' overloading on the item if
it exists). Finally after the pattern is assembled, '""' overloading is
used to retrieve its final value.

During this assembly, optrees are paired up with the parts of the
final pattern string that correspond to the text of literal code blocks.
So that when the patten string is finally passed through the regex
compiler, when it sees a '(?{', it knows to use optree #3 (say) and
attaches that tree to appropriate regex node. If there's a '(?{' or
'(??{' in the pattern that doesn't correspond to an optree (e.g. it was
introduced by $f above, or by overloading), then the pattern is evalled,
but with any literal code-blocks blanked out. So in the above, if $f
contained '(?{f})' and there was no funny overloading, the pattern string
would be

'abc(?{d})e(?{f})';

where the second (?{f}) doesn't have a corresponding optree. At this point
we check that 'use re eval' is in scope and if not, croak(). Otherwise, we
internally eval the string

qr'abc______e(?{f})'

and from the returned object, extract out the optree for the (?{f}) block,
and continue as before.

A similar thing happens in the presence of concat overloading; since
the final pattern string may contain the text of code-blocks that no
longer match what the parser's already seen and compiled into optrees, we
abandon any existing optrees and treat every (?{}) as a runtime code block
and recompile as above. This is why in your code example you got two
warnings/errors: the same code block was compiled twice; first as literal
code, then after the overloading triggered throwing it away, it was
compiled a second time as a run-time pattern.

Note that this means that even if the '""' overloading gets a chance to
rewrite the code block, the pre-modification code block will get compiled
and then discarded; and this may trigger errors such as the
"$MAGIC_VAR" requires explicit package name
which can't be avoided.

Note also that if the user's regex consists purely of a code block, with
no constant text (such as /(?{a})/ verses /(?{a})b/), then the whole
"make overload::constant qr return an overloaded object" trick fails to
work, since the text of code blocks isn't passed through the const mapper.

TL;DR:

For those two reasons (bare code blocks don't get processed; code blocks
are compiled - and possibly error out - before they can be modified), I
don't think R::G as it stands is viable under 5.18.x.

Which leads me to repeat the question: is it possible for R::G to work
without the facility to modify the text of user code blocks?
If not, then I wonder whether, for 5.20.0, we could add a new facility
that would allow you to do what you need. I'm open to suggestions, but one
possibility might be to add a 'raw' type to overload::constant that is
passed the whole literal string, before interpolation etc (i.e. it sees it
as a single-quoted string), and before the rest of the perl toker and
parser has seen it. For example currently, this

use overload;
BEGIN {
overload::constant qr => sub {
my $s = shift; print "qr($s)\n"; $s;
};
overload::constant 'q' => sub {
my $s = shift; print "q($s)\n"; $s;
};
}

qr/abc(?{})def$x/;
"ABC$a[$b+$c]DEF";

outputs:

qr(abc)
qr(def)
q(ABC)
q(DEF)

I'm suggesting that in addition we allow you to add, say,

overload::constant raw => sub {
my $s = shift; print "raw($s)\n"; $s;
};

that when used along with the two existing overloads, gives

raw(abc(?{})def$x)
qr(abc)
qr(def)
raw(ABC$a[$b+$c]DEF)
q(ABC)
q(DEF)

Would this help??? Or would the fact that it passes you the names of
run-time variables rather than their values, be just as bad?


--
Atheism is a religion like not collecting stamps is a hobby


damian at conway

Aug 10, 2013, 4:16 PM

Post #22 of 30 (39 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Dave Mitchell wrote:

> I think that this is the one that will be impossible to work fully
> workaround; i.e. the modification of user-supplied code blocks before
> the perl parser gets to see them.

Thanks for the detailed analysis and response, Dave.
It was, as usual, both interesting and educational.
And, as usual, much appreciated!

I had already assumed that this particular issue would prove to be the
one that was unfixable, as it proceeds not from a hidden bug, but rather
from a change of behaviour needed to fix hidden bugs.


> Before I discuss this in more detail, first can I ask whether its
> absolutely necessary for R::G to modify user code? Could the effects
> you achieve be done by exporting (say) a tied var $MAGIC_VAR into the
> callers namespace???

Yes, the current release of R::G already uses a similar solution.
Namely, it turns off 'strict vars' in the scope where the module is used.

That is obviously a *bad* solution, but so is exporting a variable.
Imposing 'no strict "vars"' weakens compile-time testing of every var
in that single scope. Exporting a "pseudo-var" weakens compile-time
testing of that single var in every scope. Honestly, I still can't
decide which is worse.

Of course, the right solution would be for R::G to set up a *lexical*
pseudo-var in each scope where it's imported, but I'm not aware of
any reliable way to do that...except by using source filters. :-(


> For those two reasons (bare code blocks don't get processed; code blocks
> are compiled - and possibly error out - before they can be modified), I
> don't think R::G as it stands is viable under 5.18.x.

Agreed.


> Which leads me to repeat the question: is it possible for R::G to work
> without the facility to modify the text of user code blocks?

It is, albeit by injecting either unlimited non-strictness into a
limited scope, or limited non-strictness into an unlimited scope.
I will simply have to settle on one of those two workarounds for 5.18.


> If not, then I wonder whether, for 5.20.0, we could add a new facility
> that would allow you to do what you need.

I would VERY MUCH like to have the facility your described for capturing
the raw source of constants (and other components).

That would not only solve my problem, it would also no longer require me
to use the trick of using 'overload "qr"' to return an object with
overloaded concatenation and stringification. So it would greatly
simplify the implementation of these kinds of modules. And probably
improve their performance too.

So, yes: please sir, may I have this for 5.20???


Though perhaps 'source' would be a better keyword than 'raw'?

And, ideally, the source code fragment being passed into the overloading
sub would not just be the "contents" of each constant construct, but
the entire syntax of the construct, including:

- the delimiters used

- any keyword used ('q', 'qq', 'qr', 'qx', 's', 'tr', or 'y')

- any flags used (i.e. in the case of 's', 'tr', and 'y')


That is, reworking and extending your own example:

use overload;
BEGIN {
overload::constant source => sub {
my $s = shift;
say "source($s)'";
$s;
};
}

$rex = qr/abc(?{})def$x/i;
$str = "ABC$a[$b+$c]DEF";
$str =~ s{foo}{bar}gxms;
$hex = 0xDeadCode;

would output:

source(qr/abc(?{})def$x/i)
source("ABC$a[$b+$c]DEF")
source(s{foo}{bar}gxms)
source(0xDeadCode)


I you could give me that, I might do *great* things with it.
("-- terrible, yes, but great!")


Damian
(with dreadful visions already dancing in his head ;-)


sprout at cpan

Aug 10, 2013, 6:32 PM

Post #23 of 30 (39 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Damian Conway wrote:
> Of course, the right solution would be for R::G to set up a *lexical*
> pseudo-var in each scope where it's imported, but I'm not aware of
> any reliable way to do that...except by using source filters. :-(

If you don’t mind depending on an XS module:

Lexical::Var->import('$VAR_NAME' => \$My::Evil::Remapped::Variable);


damian at conway

Aug 10, 2013, 6:38 PM

Post #24 of 30 (39 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Father Chrysostomos suggested:

> If you don’t mind depending on an XS module:
>
> Lexical::Var->import('$VAR_NAME' => \$My::Evil::Remapped::Variable);

An excellent suggestion, and much appreciated.
Alas, I had the same thought, but L::V fails under 5.18+.

There appears to have been a patch submitted for the problem,
so I will definitely revisit it as a possible solution on its next release.

Damian


ilmari at ilmari

Aug 11, 2013, 4:09 AM

Post #25 of 30 (39 views)
Permalink
Re: Perl 5.18 and Regexp::Grammars [In reply to]

Damian Conway <damian [at] conway> writes:

> Of course, the right solution would be for R::G to set up a *lexical*
> pseudo-var in each scope where it's imported, but I'm not aware of
> any reliable way to do that...except by using source filters. :-(

https://metacpan.org/module/Lexical::Var might be of interest?

--
- Twitter seems more influential [than blogs] in the 'gets reported in
the mainstream press' sense at least. - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
to a mainstream media article. - Calle Dybedahl

First page Previous page 1 2 Next page Last page  View All Perl porters RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.