
Mailing List Archive: Python: Python

regexp question

 

 



python_charmer2000atyahoo.com

Dec 4, 2003, 9:47 PM

Post #1 of 17 (439 views)
regexp question

I want to match several regexps against a large body of text. What I
have so far is similar to this:

re1 = <some regexp>
re2 = <some regexp>
re3 = <some regexp>

big_re = re.compile(re1 + '|' + re2 + '|' + re3)

matches = big_re.finditer(file_list)
for match in matches:
    span = match.span()
    print "matched text =", file_list[span[0]:span[1]]
    print "matched re =", match.re.pattern

Now the "match.re.pattern" is the entire regexp, big_re. But I want
to print out the portion of the big re that was matched -- was it re1?
re2? or re3? Is it possible to determine this, or do I have to make
a second pass through the collection of re's and compare them against
the "matched text" in order to determine which part of the big_re was
matched?

thanks!!


bignose-hates-spamatand-benfinney-does-too.id.au

Dec 4, 2003, 10:35 PM

Post #2 of 17 (424 views)
regexp question [In reply to]

On Fri, 05 Dec 2003 02:26:53 -0000, python_charmer2000 wrote:
> re1 = <some regexp>
> re2 = <some regexp>
> re3 = <some regexp>
>
> big_re = re.compile(re1 + '|' + re2 + '|' + re3)
>
> Now the "match.re.pattern" is the entire regexp, big_re. But I want
> to print out the portion of the big re that was matched -- was it re1?
> re2? or re3? Is it possible to determine this, or do I have to make
> a second pass through the collection of re's and compare them against
> the "matched text" in order to determine which part of the big_re was
> matched?

That will work no matter what your regexes happen to be, and is easily
understood. Implement that, and see if it's fast enough. (Doing
otherwise is known as "premature optimisation" and is a bad practice.)
In fact, it may be better (from a readability standpoint) to simply
compile each of the regexes and match them all each time.
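
A minimal sketch of that "compile each regex separately" approach, written for Python 3. The three patterns are hypothetical stand-ins for re1, re2 and re3, and find_all is an illustrative helper name, not something from the original post:

```python
import re

# Hypothetical stand-ins for re1, re2, re3 from the original question.
patterns = [r"a+", r"b+", r"c+"]
compiled = [re.compile(p) for p in patterns]

def find_all(text):
    """Return (start, pattern, matched_text) tuples, ordered by position."""
    hits = []
    for regex in compiled:
        # Each regex scans the text on its own, so the pattern that
        # produced a hit is always known.
        for m in regex.finditer(text):
            hits.append((m.start(), regex.pattern, m.group()))
    return sorted(hits)

for start, pattern, matched in find_all("abba"):
    print(start, pattern, matched)
```

Unlike a single alternation, this can report overlapping hits from different patterns, which may or may not be what is wanted.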

An alternative, if it's not fast enough: Group the regexes and inspect
them with the re.MatchObject.group() method.

>>> import re
>>> regex1 = 'abc'
>>> regex2 = 'def'
>>> regex3 = 'ghi'
>>> big_regex = re.compile(
... '(' + regex1 + ')'
... + '|(' + regex2 + ')'
... + '|(' + regex3 + ')'
... )
>>> match = re.match( big_regex, 'def' )
>>> match.groups()
(None, 'def', None)
>>> match.group(1)
>>> match.group(2)
'def'
>>> match.group(3)
>>>


--
\ "As the evening sky faded from a salmon color to a sort of |
`\ flint gray, I thought back to the salmon I caught that morning, |
_o__) and how gray he was, and how I named him Flint." -- Jack Handey |
Ben Finney <http://bignose.squidly.org/>


fredrikatpythonware.com

Dec 5, 2003, 7:41 AM

Post #3 of 17 (426 views)
regexp question [In reply to]

"python_charmer2000" wrote:

> I want to match several regexps against a large body of text. What I
> have so far is similar to this:
>
> re1 = <some regexp>
> re2 = <some regexp>
> re3 = <some regexp>
>
> big_re = re.compile(re1 + '|' + re2 + '|' + re3)
>
> matches = big_re.finditer(file_list)
> for match in matches:
>     span = match.span()
>     print "matched text =", file_list[span[0]:span[1]]
>     print "matched re =", match.re.pattern
>
> Now the "match.re.pattern" is the entire regexp, big_re. But I want
> to print out the portion of the big re that was matched -- was it re1?
> re2? or re3? Is it possible to determine this, or do I have to make
> a second pass through the collection of re's and compare them against
> the "matched text" in order to determine which part of the big_re was
> matched?

you could put each expression inside parentheses, and use the lastindex
attribute to find the subexpression:

import re, string

res = [
"(a+)",
"(b+)",
"(c+)"
]

big_re = re.compile(string.join(res, "|"))

matches = big_re.finditer("abba")
for match in matches:
    span = match.span()
    print "matched text =", match.group()
    print "matched re =", res[match.lastindex-1]

prints

matched text = a
matched re = (a+)
matched text = bb
matched re = (b+)
matched text = a
matched re = (a+)
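
A variation on the same idea, sketched for Python 3 with named groups and the match object's lastgroup attribute. The group names (As, Bs, Cs) and the classify helper are made up for illustration. Naming the outer groups should also keep the lookup stable if a sub-pattern contains capturing groups of its own, which would shift the numbering that lastindex-1 relies on:

```python
import re

# Hypothetical names for the three sub-patterns.
named = {"As": r"a+", "Bs": r"b+", "Cs": r"c+"}

# Wrap each sub-pattern in a named group and join with "|".
big_re = re.compile(
    "|".join("(?P<%s>%s)" % (name, pat) for name, pat in named.items())
)

def classify(text):
    """Return (group_name, matched_text) for every match in text."""
    return [(m.lastgroup, m.group()) for m in big_re.finditer(text)]

for name, matched in classify("abba"):
    print("matched text =", matched, "| matched re =", named[name])
```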

</F>


nun at example

Dec 1, 2004, 8:43 AM

Post #4 of 17 (426 views)
Re: Regexp question [In reply to]

On Wed, 1 Dec 2004 07:48:24 -0600, Philippe C. Martin
<philippecmartin [at] sbcglobal> wrote:

> I realize this is more a regexp question than a python question, but
> maybe one of the re objects could help me:
>
> I wish to know how to _not_ match:
>
> This is but an example of the data I handle:
>
> xx xx xx xx xx xx xx [yy yy yy yy yy yy yy] (zz zz zz zz)
>
> I currently can retrieve the three groups of logical data blocks with:
>
> l_str = 'xx xx xx xx xx xx xx [yy yy yy yy yy yy yy] (zz zz zz zz)'
> p = re.compile(r'([a-f-0-9\s]*) (\[[a-f-0-9\s]*\]) (\([a-f-0-9\s]*\))', re.IGNORECASE) #OK
> g = p.search(l_str)
>
>
> What I would rather do is.
>
> get the data block that is _not_ between brackets or parentheses, i.e.
> 'xx xx xx xx xx xx xx', knowing that the initial string could be:
>
> [yy yy yy yy yy yy yy] xx xx xx xx xx xx xx (zz zz zz zz)
>
>
> Any clue ?

regexps seem an overkill for the task at hand.

If data is really as simple as you suggest, you can try the following:
>>> s = 'xx [y y] (z z)'
>>> s = s[:s.index('(')] + s[s.index(')')+1:]
>>> s
'xx [y y] '
>>> s = s[:s.index('[')] + s[s.index(']')+1:]
>>> s
'xx '
>>> s.strip()
'xx'


Relevant lines:
s = s[:s.index('(')] + s[s.index(')')+1:]
s = s[:s.index('[')] + s[s.index(']')+1:]
s = s.strip()
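
The same slicing can be wrapped in a small helper so that it works whichever order the blocks appear in. This is only a sketch under the same "data is really this simple" assumption (exactly one block of each kind, no nesting); remove_block and outside_delimiters are hypothetical names:

```python
def remove_block(s, open_ch, close_ch):
    """Cut out the first open_ch...close_ch block, delimiters included."""
    start = s.index(open_ch)        # raises ValueError if the block is missing
    end = s.index(close_ch, start)  # look for the closer after the opener
    return s[:start] + s[end + 1:]

def outside_delimiters(s):
    """Return the text that is in neither (...) nor [...]."""
    s = remove_block(s, "(", ")")
    s = remove_block(s, "[", "]")
    return s.strip()

print(outside_delimiters("xx xx [yy yy] (zz zz)"))
print(outside_delimiters("[yy yy] xx xx (zz zz)"))
```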
--
Mitja
--
http://mail.python.org/mailman/listinfo/python-list


python.list at tim

Apr 11, 2006, 11:51 AM

Post #5 of 17 (430 views)
Re: RegExp question [In reply to]

> I would like to form a regular expression to find a few
> different tokens (and, or, xor) followed by some variable
> number of whitespace (i.e., tabs and spaces) followed by
> a hash mark (i.e., #). What would be the regular
> expression for this?


(and|or|xor)\s*#

Unless "variable number of whitespace" means "at least *some*
whitespace", in which case you'd want to use

(and|or|xor)\s+#

Both are beautiful and precise.
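
A quick way to see the difference between the two in Python 3 (the sample strings and variable names are invented for illustration):

```python
import re

zero_or_more = re.compile(r"(and|or|xor)\s*#")   # whitespace optional
one_or_more = re.compile(r"(and|or|xor)\s+#")    # at least one whitespace char

samples = ["a and #x", "a and#x", "a or \t #x"]
for text in samples:
    print(repr(text),
          bool(zero_or_more.search(text)),
          bool(one_or_more.search(text)))
```

Only the middle sample, where the hash mark directly follows the token, should be treated differently by the two patterns.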

-tim






michael.mcgarry at gmail

Apr 11, 2006, 12:16 PM

Post #6 of 17 (422 views)
Re: RegExp question [In reply to]

Tim,

for some reason that does not seem to do the trick.

I am testing it with grep. (i.e., grep -e '(and|or|xor)\s*#' myfile)

Michael



ptmcg at austin

Apr 11, 2006, 12:20 PM

Post #7 of 17 (439 views)
Re: RegExp question [In reply to]

"Michael McGarry" <michael.mcgarry [at] gmail> wrote in message
news:1144781090.622493.252460 [at] t31g2000cwb
> Hi,
>
> I would like to form a regular expression to find a few different
> tokens (and, or, xor) followed by some variable number of whitespace
> (i.e., tabs and spaces) followed by a hash mark (i.e., #). What would
> be the regular expression for this?
>
> Thanks for any help,
>
> Michael
>
Using pyparsing, whitespace is implicitly ignored. Your expression would
look like:

oneOf("and or xor") + Literal("#")


Here's a complete example:


from pyparsing import *

pattern = oneOf("and or xor") + Literal("#")

testString = """
z = (a and b) and #XVAL;
q = z xor #YVAL;
"""


# use scanString to locate matches
for tokens,start,end in pattern.scanString(testString):
    print tokens[0], tokens.asList()
    print line(start,testString)
    print (" "*(col(start,testString)-1)) + "^"
    print
print


# use transformString to locate matches and substitute values
subs = {
    'XVAL': 0,
    'YVAL': True,
    }
def replaceSubs(st,loc,toks):
    try:
        return toks[0] + " " + str(subs[toks[2]])
    except KeyError:
        pass

pattern2 = (pattern + Word(alphanums)).setParseAction(replaceSubs)
print pattern2.transformString(testString)

-----------------
Prints:
and ['and', '#']
z = (a and b) and #XVAL;
^

xor ['xor', '#']
q = z xor #YVAL;
^


z = (a and b) and 0;
q = z xor True;


Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul





me+python at modelnine

Apr 11, 2006, 12:31 PM

Post #8 of 17 (430 views)
Re: RegExp question [In reply to]

On Tuesday 11 April 2006 21:16, Michael McGarry wrote:
> I am testing it with grep. (i.e., grep -e '(and|or|xor)\s*#' myfile)

Test it with Python's re-module, then. \s for matching Whitespace is specific
to Python (AFAIK). And as you've asked in a Python Newsgroup, you'll get
Python-answers here.

--- Heiko.


spamspam at spam

Apr 11, 2006, 12:40 PM

Post #9 of 17 (426 views)
Re: RegExp question [In reply to]

On 2006-04-11, Michael McGarry <michael.mcgarry [at] gmail> wrote:
> Hi,
>
> I would like to form a regular expression to find a few different
> tokens (and, or, xor) followed by some variable number of whitespace
> (i.e., tabs and spaces) followed by a hash mark (i.e., #). What would
> be the regular expression for this?

re.compile(r'(?:and|or|xor)\s*#')


Aiwass333 at gmail

Apr 11, 2006, 12:57 PM

Post #10 of 17 (426 views)
Re: RegExp question [In reply to]

In my opinion you would be best to use a tool like Kiki.
http://project5.freezope.org/kiki/index.html/#

This will allow you to paste in the actual text you want to search and
then play with different RE's and set flags with a simple mouse click
so you can find just what you want. Remember what re.DOTALL does: it
makes '.' match newlines too, so a pattern can follow line breaks;
otherwise '.' stops at them. It's a good idea to have a grasp
of regular expressions or when you come back to your code months /
weeks later, you will be just as lost, and always comment them very
well :).

Just my 2¢



spamspam at spam

Apr 11, 2006, 1:12 PM

Post #11 of 17 (426 views)
Re: RegExp question [In reply to]

On 2006-04-11, Michael McGarry <michael.mcgarry [at] gmail> wrote:
> Tim,
>
> for some reason that does not seem to do the trick.
>
> I am testing it with grep. (i.e., grep -e '(and|or|xor)\s*#' myfile)

Try with grep -P, which means use perl-compatible regexes as opposed to
POSIX ones. I only know for sure that -P exists for GNU grep.

I assumed it was a Python question! Unless you're testing your Python
regex with grep, not realizing they're different.

Perl and Python regexes are (mostly?) the same.

I usually grep -P because I know Python regexes better than any other
ones.


python.list at tim

Apr 11, 2006, 1:23 PM

Post #12 of 17 (464 views)
Re: RegExp question [In reply to]

> I am testing it with grep. (i.e., grep -e '(and|or|xor)\s*#' myfile)

Well, you asked for the python regexp...different
environments use different regexp parsing engines. Your
response is akin to saying "the example snippet of python
code you gave me doesn't work in my Pascal program".

For grep:

grep '\(and\|or\|xor\)[[:space:]]*#' myfile

For Vim:

:g/\(and\|or\|xor\)\s*#/

The one I gave originally is a python regexp, and thus
should be tested within python, not grep or vim or emacs or
sed or whatever.

It's always best to test in the real
environment...otherwise, you'll get flakey results.

-tkc








sjmachin at lexicon

Apr 11, 2006, 1:28 PM

Post #13 of 17 (426 views)
Re: RegExp question [In reply to]

(-:
Sorry about Tim. He's not very imaginative. He presumed that because
you asked on comp.lang.python that you would be testing it with Python.
You should have either (a) asked your question on
comp.toolswithfunnynames.grep or (b) not presumed that grep's re syntax
is the same as Python's.
:-)

My grep appears to need something fugly like this:

grep -e "\(and\|or\|xor\)[ \t]*#" grepre.txt

but my grep is a Windows port which identifies itself as "grep (GNU
grep) 2.5.1" so it's definitely not The One True Grep ...

Now that you're here, why don't you try Python? It's not hard, e.g.

#>>> import re
#>>> rs = re.compile(r"(and|or|xor)\s*#").search
#>>> rs("if foo and #continued")
#<_sre.SRE_Match object at 0x00AE66E0>
#>>> rs("if foo and#continued")
#<_sre.SRE_Match object at 0x00AE6620>
#>>> rs("if foo and bar #continued")
#>>> rs("if foo xor # continued")
#<_sre.SRE_Match object at 0x00AE66E0>
#>>>

HTH,
John



sjmachin at lexicon

Apr 11, 2006, 1:46 PM

Post #14 of 17 (428 views)
Re: RegExp question [In reply to]

Precise? The OP asked for "tokens".

#>>> re.search(r"(and|or|xor)\s*#", "a = the_operand # gotcha!")
#<_sre.SRE_Match object at 0x00AE6620>

Try this:

#>>> re.search(r"\b(and|or|xor)\s*#", "a = the_operand # should fail")
#>>> re.search(r"\b(and|or|xor)\s*#", "and # OK")
#<_sre.SRE_Match object at 0x00AE6E60>
#>>> re.search(r"\b(and|or|xor)\s*#", "blah blah and # OK")
#<_sre.SRE_Match object at 0x00AE66E0>
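
The word-boundary version can be wrapped in a small Python 3 helper that returns the token rather than a match object. find_token is a hypothetical name used only for this sketch:

```python
import re

# \b keeps "and" inside a longer word such as "the_operand" from matching.
token_re = re.compile(r"\b(and|or|xor)\s*#")

def find_token(line):
    """Return the first and/or/xor token followed by '#', or None."""
    m = token_re.search(line)
    return m.group(1) if m else None

print(find_token("a = the_operand # should fail"))
print(find_token("blah blah and # OK"))
print(find_token("x xor#y"))
```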



joncle at googlemail

Nov 6, 2009, 2:06 PM

Post #15 of 17 (421 views)
Re: regexp question [In reply to]

On Nov 6, 9:50 pm, Jabba Laci <jabba.l...@gmail.com> wrote:
> Hi,
>
> How to find all occurrences of a substring in a string? I want to
> convert the following Perl code to Python.
>
> Thanks,
>
> Laszlo
>
> ==========
>
> my $text = '<a href="ad1">sdqs</a><a href="ad2">sds</a><a href=ad3>qs</a>';
>
> while ($text =~ m#href="?(.*?)"?>#g)
> {
>    print $1, "\n";
> }
>
> # output:
> #
> # ad1
> # ad2
> # ad3

There are numerous threads on why using regexps to process HTML is not
a great idea. Search Google Groups.

You're better off using beautifulsoup (an HTML parsing library). The
API is simple, and for real-world data is a much better choice.
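
As a rough illustration of the parser-based approach, here is a Python 3 sketch that uses the standard library's html.parser module as a stand-in for Beautiful Soup, so that it runs without third-party packages. HrefCollector and extract_hrefs are hypothetical names, not part of any library:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs

text = '<a href="ad1">sdqs</a><a href="ad2">sds</a><a href=ad3>qs</a>'
print(extract_hrefs(text))
```

The parser copes with both the quoted and the unquoted attribute values in the sample, without a hand-written pattern.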

hth
Jon.


rami.chowdhury at gmail

Nov 6, 2009, 2:10 PM

Post #16 of 17 (421 views)
Re: regexp question [In reply to]

On Fri, 06 Nov 2009 13:50:16 -0800, Jabba Laci <jabba.laci [at] gmail>
wrote:

> Hi,
>
> How to find all occurrences of a substring in a string? I want to
> convert the following Perl code to Python.
>
> Thanks,
>
> Laszlo
>
> ==========
>
> my $text = '<a href="ad1">sdqs</a><a href="ad2">sds</a><a
> href=ad3>qs</a>';
>
> while ($text =~ m#href="?(.*?)"?>#g)
> {
> print $1, "\n";
> }
> # output:
> #
> # ad1
> # ad2
> # ad3

Your regular expression pattern should work unchanged, and you probably
want to use http://docs.python.org/library/re.html#re.findall or similar
to do the actual matching. If all you want to do is iterate over the
matches, I would use re.finditer :-)

--
Rami Chowdhury
"Never attribute to malice that which can be attributed to stupidity" --
Hanlon's Razor
408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)


jabba.laci at gmail

Nov 6, 2009, 2:19 PM

Post #17 of 17 (419 views)
Re: regexp question [In reply to]

> Your regular expression pattern should work unchanged, and you probably want
> to use  http://docs.python.org/library/re.html#re.findall or similar to do
> the actual matching. If all you want to do is iterate over the matches, I
> would use re.finditer :-)

Thank you, I found the solution:

for m in re.finditer(r'href="?(.*?)"?>', text):
    print m.group(1)
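
For reference, a self-contained Python 3 sketch of the same loop, plus the re.findall form suggested earlier in the thread. The sample text is copied from the Perl snippet; href_re and hrefs are names chosen just for this example:

```python
import re

text = '<a href="ad1">sdqs</a><a href="ad2">sds</a><a href=ad3>qs</a>'
href_re = re.compile(r'href="?(.*?)"?>')

# finditer yields match objects one at a time.
for m in href_re.finditer(text):
    print(m.group(1))

# With a single capturing group, findall returns the captured strings directly.
hrefs = href_re.findall(text)
print(hrefs)
```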

Laszlo
