
davem at iabyn
Aug 8, 2013, 8:48 AM
Post #21 of 30
On Mon, Jul 15, 2013 at 04:57:43PM +0400, Damian Conway wrote:
> >> * The second problem that has arisen in 5.18 is that variables
> >> that appear in (?{...}) or (??{...}) blocks are now checked
> >> for 'use strict' compliance *before* the 'qr' overloading is
> >> triggered, making it impossible to provide rewritings that
> >> sanitize such variables.
> >
> > Yep, you can't rewrite code blocks any more, unless you can force them to
> > become run-time, then overload-concatenate them, as shown above.
>
> Even when they are forced to become run-time (using your workaround code),
> 'use strict' compliance seems to be tested too early (i.e. before the
> qr-overloading has a chance to "vanish" the variable in question).
>
> For example, the following code works as expected under 5.14
> (i.e. the post-processed regex correctly matches), but under 5.18
> it generates an odd "double fatality" compile-time error:
>
> Global symbol "$MAGIC_VAR" requires explicit package name at
> demo.pl line 32.
> Global symbol "$MAGIC_VAR" requires explicit package name at (eval
> 1) line 1.
>
> Once again, the RegexpProcessor code is identical to Dave's workaround
> code, except that this time the commented line has been added to the
> qr-overloading in order to replace $MAGIC_VAR in the source with 'foo'
> (this is a minimal version of the various kinds of much more complex
> manipulations that Regexp::Grammars actually does):
>
> -----cut----------cut----------cut----------cut----------cut----------cut-----
>
> package RegexProcessor;
>
> use overload (
>     q{""} => sub {
>         my ($pat) = @_;
>         return $pat->[0];
>     },
>     q{.} => sub {
>         my ($a1, $a2) = @_;
>         $a1 = $a1->[0] if ref $a1;
>         $a2 = $a2->[0] if ref $a2;
>         return bless [ "$a1$a2" ], 'RegexProcessor';
>     },
> );
>
> package main;
> use re 'eval';
>
> BEGIN {
>     overload::constant qr => sub {
>         my ($regex_pattern) = @_;
>
>         # Replace raw $MAGIC_VAR with 'foo'...
>         # (A greatly simplified version of what Regexp::Grammars does)
>         $regex_pattern =~ s/\$MAGIC_VAR/'foo'/g;
>
>         return bless [ $regex_pattern ], 'RegexProcessor'
>     };
> }
>
> use strict;
> say 'matched' if "foobar" =~ m{ (??{ $MAGIC_VAR }) bar }xms;
>
> -----end----------end----------end----------end----------end----------end-----

I think that this is the one that will be impossible to fully work
around; i.e. the modification of user-supplied code blocks before the
perl parser gets to see them.

Note first that moving the code-manipulation from the overload q{""}
function (as it was in my sample code) to the overload::constant qr
function (as it is in your sample code) will never work: the
overload::constant function is never called under any circumstances for
the text of literal code blocks in 5.18.x; which is why in my example
code I did the manipulation in the final stringification call (q{""}).

Before I discuss this in more detail, first can I ask whether it's
absolutely necessary for R::G to modify user code? Could the effects you
achieve be done by exporting (say) a tied var $MAGIC_VAR into the
caller's namespace?

Anyway, let me explain in a bit more detail what's going on.
(If this is tl;dr, then just skip to the end where I discuss
alternatives.)

In the presence of overload::constant qr => \&f, a general regex like
/abc(?{d})e$f/ is tokenized and parsed at the same time as the
surrounding perl code, into a list op that looks like

    regcomp(f('abc'), '(?{d})', {d}, f('e'), $f);

where the calls to f() are done at compile time, so if we have, say,

    sub f { uc $_[0] }

then the above actually arrives at the parser as:

    regcomp('ABC', '(?{d})', {d}, 'E', $f);

Note that the text of the code block is *not* passed through f().
Also, note that both the text of the code block and the code block
itself are passed; the text is so that the regex compiler itself can
assemble the full, original text of the regex (so that print qr/(?{})/
will display the right thing, for example), but the 'bare' code is also
exposed and is parsed and compiled along with everything else - so the
{d} above is a bit like the code block in map or grep.

The regcomp() above will be processed at compile-time if all of its
components are compile-time (so the above without the $f, for example),
and at run-time otherwise. In either case, the regex compiler is called
with

    a) a list of strings (or regex objects) like
       ('ABC', '(?{d})', 'E', whatever $f contains);
    b) a list of optrees, one for each literal codeblock that got parsed
       (so {d} in the above).

The regex compiler concats the list of strings into a single string that
represents the final pattern to be compiled. If there is just a single
item in the list, then 'qr' or '""' overloading will be called if
available to convert that single item into a final pattern (or regex
object). If there are multiple items, then we start with an empty
string, then concat each item to it, first applying qr-overloading if
necessary, then calling '.' overloading if it exists, falling back to
plain concatenation (using '""' overloading on the item if it exists).
Finally, after the pattern is assembled, '""' overloading is used to
retrieve its final value.

During this assembly, optrees are paired up with the parts of the final
pattern string that correspond to the text of literal code blocks. So
when the pattern string is finally passed through the regex compiler and
it sees a '(?{', it knows to use optree #3 (say) and attaches that tree
to the appropriate regex node.

If there's a '(?{' or '(??{' in the pattern that doesn't correspond to
an optree (e.g. it was introduced by $f above, or by overloading), then
the pattern is evalled, but with any literal code-blocks blanked out.
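
The multi-item assembly described above (start with an empty string, concat each item, finish with '""') can be modelled in plain Perl. This is only a rough sketch of the described behaviour, not the regex compiler's actual code: the Piece class and the assemble_pattern() function are hypothetical names invented for illustration, and the qr-overloading step is left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an object with '.' and '""' overloading.
package Piece;

use overload
    q{""} => sub { $_[0]{text} },
    q{.}  => sub {
        my ($self, $other, $swapped) = @_;
        my $rhs = ref $other ? $other->{text} : $other;
        # $swapped is true when the object was the right-hand operand.
        return Piece->new($swapped ? $rhs . $self->{text}
                                   : $self->{text} . $rhs);
    };

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}

package main;

# Model of the multi-item case only: start from '', concat each item
# (which triggers '.' overloading when an item is an object), then
# stringify the result to obtain the final pattern text.
sub assemble_pattern {
    my @items = @_;
    my $pattern = '';
    $pattern = $pattern . $_ for @items;
    return "$pattern";
}

print assemble_pattern('ABC', Piece->new('(?{d})'), 'E'), "\n";
```

Once any item is an overloaded object, every later concatenation goes through its '.' handler, so that handler sees (and can rewrite) the whole pattern text built so far.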
So in the /abc(?{d})e$f/ example above, if $f contained '(?{f})' and
there was no funny overloading, the pattern string would be
'abc(?{d})e(?{f})', where the second code block, (?{f}), doesn't have a
corresponding optree. At this point we check that "use re 'eval'" is in
scope and if not, croak(). Otherwise, we internally eval the string

    qr'abc______e(?{f})'

and from the returned object, extract out the optree for the (?{f})
block, and continue as before.

A similar thing happens in the presence of concat overloading; since the
final pattern string may contain the text of code-blocks that no longer
match what the parser's already seen and compiled into optrees, we
abandon any existing optrees and treat every (?{}) as a runtime code
block and recompile as above.

This is why in your code example you got two warnings/errors: the same
code block was compiled twice; first as literal code, then, after the
overloading triggered throwing it away, it was compiled a second time as
a run-time pattern. Note that this means that even if the '""'
overloading gets a chance to rewrite the code block, the
pre-modification code block will get compiled and then discarded; and
this may trigger errors such as

    "$MAGIC_VAR" requires explicit package name

which can't be avoided.

Note also that if the user's regex consists purely of a code block, with
no constant text (such as /(?{a})/ versus /(?{a})b/), then the whole
"make overload::constant qr return an overloaded object" trick fails to
work, since the text of code blocks isn't passed through the const
mapper.

TL;DR:

For those two reasons (bare code blocks don't get processed; code blocks
are compiled - and possibly error out - before they can be modified), I
don't think R::G as it stands is viable under 5.18.x.

Which leads me to repeat the question: is it possible for R::G to work
without the facility to modify the text of user code blocks?

If not, then I wonder whether, for 5.20.0, we could add a new facility
that would allow you to do what you need.
I'm open to suggestions, but one possibility might be to add a 'raw'
type to overload::constant that is passed the whole literal string,
before interpolation etc (i.e. it sees it as a single-quoted string),
and before the rest of the perl toker and parser has seen it. For
example, currently this

    use overload;
    BEGIN {
        overload::constant qr  => sub { my $s = shift; print "qr($s)\n"; $s; };
        overload::constant 'q' => sub { my $s = shift; print "q($s)\n"; $s; };
    }
    qr/abc(?{})def$x/;
    "ABC$a[$b+$c]DEF";

outputs:

    qr(abc)
    qr(def)
    q(ABC)
    q(DEF)

I'm suggesting that in addition we allow you to add, say,

    overload::constant raw => sub { my $s = shift; print "raw($s)\n"; $s; };

that, when used along with the two existing overloads, gives

    raw(abc(?{})def$x)
    qr(abc)
    qr(def)
    raw(ABC$a[$b+$c]DEF)
    q(ABC)
    q(DEF)

Would this help? Or would the fact that it passes you the names of
run-time variables, rather than their values, be just as bad?

-- 
Atheism is a religion like not collecting stamps is a hobby
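
For reference, the tied-variable alternative raised earlier in this message might look roughly like the following. It is a minimal sketch only: the MagicVar class, its callback argument and the magic_matches() helper are hypothetical names, and it assumes that a fixed replacement value ('foo') is enough, which may well not hold for what Regexp::Grammars really needs. The point is that the variable genuinely exists when the code block is compiled, so 'use strict' has nothing to complain about and no source rewriting is required.

```perl
use strict;
use warnings;

# Hypothetical tie class: every read of the tied scalar calls a callback.
package MagicVar;

sub TIESCALAR {
    my ($class, $callback) = @_;
    return bless { callback => $callback }, $class;
}

sub FETCH {
    my ($self) = @_;
    return $self->{callback}->();
}

sub STORE {
    die "\$MAGIC_VAR is read-only\n";
}

package main;

# A real module would export this into the caller's namespace; here it
# is simply declared as a package variable (an assumption of the sketch).
our $MAGIC_VAR;
tie $MAGIC_VAR, 'MagicVar', sub { 'foo' };

sub magic_matches {
    my ($string) = @_;
    # The variable is stringified inside the block so that FETCH is
    # certainly invoked; user code would normally write it bare.
    return $string =~ m{ (??{ "$MAGIC_VAR" }) bar }xms ? 1 : 0;
}

print "matched\n" if magic_matches('foobar');
```

Because the callback runs on every read, the value could be computed at match time instead of being fixed when the regex is compiled.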