
Mailing List Archive: Python: Python

RegExp Help

 

 



half.italian at gmail

Dec 13, 2007, 5:49 PM

Post #1 of 25 (353 views)
Permalink
RegExp Help

Hi group,

I'm wrapping up a command line util that returns xml in Python. The
util is flaky, and gives me back poorly formed xml with different
problems in different cases. Anyway I'm making progress. I'm not
very good at regular expressions though and was wondering if someone
could help with initially splitting the tags from the stdout returned
from the util.

I have the following example string, and am simply trying to split it
into two xml tags...

simplified = """2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>
\n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>
\n"""

Basically I want the two tags, and to discard anything in between
using a reg exp. Like this:

tags = ["<tag1 attr1="text1" attr2="text2" /tag1>", "<tag2
attr1="text1" attr2="text2" attr3="text3\n" /tag2>"]

I've tried several approaches, some of which got close, but the
newline in the middle of one of the tags screwed it up. The closest
I've been is something like this:

retag = re.compile(r'<.+>*') # tried here with re.DOTALL as well
tags = re.findall(retag)

Can anyone help me?

~Sean

--
http://mail.python.org/mailman/listinfo/python-list


half.italian at gmail

Dec 13, 2007, 6:04 PM

Post #2 of 25 (343 views)
Permalink
Re: RegExp Help [In reply to]

On Dec 13, 5:49 pm, Sean DiZazzo <half.ital...@gmail.com> wrote:
> Hi group,
>
> I'm wrapping up a command line util that returns xml in Python. The
> util is flaky, and gives me back poorly formed xml with different
> problems in different cases. Anyway I'm making progress. I'm not
> very good at regular expressions though and was wondering if someone
> could help with initially splitting the tags from the stdout returned
> from the util.
>
> I have the following example string, and am simply trying to split it
> into two xml tags...
>
> simplified = """2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>
> \n2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>
> \n"""
>
> Basically I want the two tags, and to discard anything in between
> using a reg exp. Like this:
>
> tags = ["<tag1 attr1="text1" attr2="text2" /tag1>", "<tag2
> attr1="text1" attr2="text2" attr3="text3\n" /tag2>"]
>
> I've tried several approaches, some of which got close, but the
> newline in the middle of one of the tags screwed it up. The closest
> I've been is something like this:
>
> retag = re.compile(r'<.+>*') # tried here with re.DOTALL as well
> tags = re.findall(retag)
>
> Can anyone help me?
>
> ~Sean

I found something that works, although I couldn't tell you why it
works. :)

retag = re.compile(r'<.+?>', re.DOTALL)
tags = retag.findall(simplified)

Why does that work?

~Sean
--
http://mail.python.org/mailman/listinfo/python-list


bj_666 at gmx

Dec 14, 2007, 12:04 AM

Post #3 of 25 (338 views)
Permalink
Re: RegExp Help [In reply to]

On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:

> I'm wrapping up a command line util that returns xml in Python. The
> util is flaky, and gives me back poorly formed xml with different
> problems in different cases. Anyway I'm making progress. I'm not
> very good at regular expressions though and was wondering if someone
> could help with initially splitting the tags from the stdout returned
> from the util.
>
> […]
>
> Can anyone help me?

Flaky XML is often produced by programs that treat XML as ordinary text
files. If you are starting to parse XML with regular expressions you are
making the very same mistake. XML may look somewhat simple but
producing correct XML and parsing it isn't. Sooner or later you stumble
across something that breaks producing or parsing the "naive" way.

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


half.italian at gmail

Dec 14, 2007, 1:06 AM

Post #4 of 25 (338 views)
Permalink
Re: RegExp Help [In reply to]

On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
> > I'm wrapping up a command line util that returns xml in Python. The
> > util is flaky, and gives me back poorly formed xml with different
> > problems in different cases. Anyway I'm making progress. I'm not
> > very good at regular expressions though and was wondering if someone
> > could help with initially splitting the tags from the stdout returned
> > from the util.
>
> > [...]
>
> > Can anyone help me?
>
> Flaky XML is often produced by programs that treat XML as ordinary text
> files. If you are starting to parse XML with regular expressions you are
> making the very same mistake. XML may look somewhat simple but
> producing correct XML and parsing it isn't. Sooner or later you stumble
> across something that breaks producing or parsing the "naive" way.
>
> Ciao,
> Marc 'BlackJack' Rintsch

It's not really complicated xml so far, just tags with attributes.
Still, using different queries against the program sometimes offers
differing results...a few examples:

<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>

It's consistent (at least) in that consistent queries always return
consistent tag styles. It's returned to stdout with some extra
useless information, so the original question was to help get to just
the tags. After getting the tags, I'm running them through some
functions to fix them, and then using elementtree to parse them and
get all the rest of the info.
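
A rough sketch of what such a fix-up step might look like, assuming every attribute is written as name="value". The helper name repair_tag is made up for illustration, and a bare value such as the 123456 in <id 123456 /> is simply dropped. Rather than patching the closing part of each tag, it pulls out the tag name and the attributes and rebuilds an ElementTree element from them. Written for Python 3:

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of tag and attribute names; adjust if the util emits others.
NAME = r'[A-Za-z_][\w.-]*'

def repair_tag(raw):
    """Hypothetical helper: rebuild an Element from one malformed tag string."""
    name_match = re.match(r'<\s*(%s)' % NAME, raw)
    if name_match is None:
        raise ValueError("no tag name found in %r" % raw)
    # [^"]* also matches newlines, so attribute values may span lines.
    attrs = dict(re.findall(r'(%s)\s*=\s*"([^"]*)"' % NAME, raw))
    return ET.Element(name_match.group(1), attrs)

elem = repair_tag('<tag2 name="foo" moreattrs="..." /tag2>')
print(elem.tag, elem.attrib)
```

This only covers flat tags with quoted attributes, which is all the examples above show.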

There is no api, so this is what I have to work with. Is there a
better solution?

Thanks for your ideas.

~Sean
--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Dec 14, 2007, 3:06 AM

Post #5 of 25 (337 views)
Permalink
Re: RegExp Help [In reply to]

En Fri, 14 Dec 2007 06:06:21 -0300, Sean DiZazzo <half.italian[at]gmail.com>
escribió:

> On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
>> > I'm wrapping up a command line util that returns xml in Python. The
>> > util is flaky, and gives me back poorly formed xml with different
>> > problems in different cases. Anyway I'm making progress. I'm not
>> > very good at regular expressions though and was wondering if someone
>> > could help with initially splitting the tags from the stdout returned
>> > from the util.
>>
>> Flaky XML is often produced by programs that treat XML as ordinary text
>> files. If you are starting to parse XML with regular expressions you are
>> making the very same mistake. XML may look somewhat simple but
>> producing correct XML and parsing it isn't. Sooner or later you stumble
>> across something that breaks producing or parsing the "naive" way.
>>
> It's not really complicated xml so far, just tags with attributes.
> Still, using different queries against the program sometimes offers
> differing results...a few examples:
>
> <id 123456 />
> <tag name="foo" />
> <tag2 name="foo" moreattrs="..." /tag2>
> <tag3 name="foo" moreattrs="..." tag3/>

Ouch... only the second is valid xml. Most tools require at least a well
formed document. You may try using BeautifulStoneSoup, included with
BeautifulSoup http://crummy.com/software/BeautifulSoup/

> I found something that works, although I couldn't tell you why it
> works. :)
> retag = re.compile(r'<.+?>', re.DOTALL)
> tags = retag.findall(simplified)
> Why does that work?

That means: "look for a less-than sign (<), followed by the shortest
sequence of (?) one or more (+) arbitrary characters (.), followed by a
greater-than sign (>)"

If you never get nested tags, and never have a ">" inside an attribute,
that expression *might* work. But please try BeautifulStoneSoup, it uses a
lot of heuristics trying to guess the right structure. Doesn't work
always, but given your input, there isn't much one can do...
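
To make the difference concrete, here is a small comparison of the greedy and non-greedy forms on the sample string from the first post, plus the ">" inside an attribute case mentioned above. The variable names and the last test string are mine, and the code is written for Python 3:

```python
import re

simplified = ('2007-12-13 <tag1 attr1="text1" attr2="text2" /tag1>\n'
              '2007-12-13 <tag2 attr1="text1" attr2="text2" attr3="text3\n" /tag2>\n')

greedy = re.compile(r'<.+>', re.DOTALL)   # longest possible run up to the LAST ">"
lazy = re.compile(r'<.+?>', re.DOTALL)    # shortest possible run up to the NEXT ">"

greedy_tags = greedy.findall(simplified)  # one big match spanning both tags
lazy_tags = lazy.findall(simplified)      # one match per tag

print(len(greedy_tags), len(lazy_tags))

# The non-greedy form stops at the first ">", even inside an attribute value.
broken = lazy.findall('<tag attr="a > b" />')
print(broken)
```

re.DOTALL is what lets "." cross the newline inside the second tag.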


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


half.italian at gmail

Dec 14, 2007, 9:45 AM

Post #6 of 25 (328 views)
Permalink
Re: RegExp Help [In reply to]

On Dec 14, 3:06 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> En Fri, 14 Dec 2007 06:06:21 -0300, Sean DiZazzo <half.ital...@gmail.com>
> escribió:
>
>
>
> > On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> >> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
> >> > I'm wrapping up a command line util that returns xml in Python. The
> >> > util is flaky, and gives me back poorly formed xml with different
> >> > problems in different cases. Anyway I'm making progress. I'm not
> >> > very good at regular expressions though and was wondering if someone
> >> > could help with initially splitting the tags from the stdout returned
> >> > from the util.
>
> >> Flaky XML is often produced by programs that treat XML as ordinary text
> >> files. If you are starting to parse XML with regular expressions you are
> >> making the very same mistake. XML may look somewhat simple but
> >> producing correct XML and parsing it isn't. Sooner or later you stumble
> >> across something that breaks producing or parsing the "naive" way.
>
> > It's not really complicated xml so far, just tags with attributes.
> > Still, using different queries against the program sometimes offers
> > differing results...a few examples:
>
> > <id 123456 />
> > <tag name="foo" />
> > <tag2 name="foo" moreattrs="..." /tag2>
> > <tag3 name="foo" moreattrs="..." tag3/>
>
> Ouch... only the second is valid xml. Most tools require at least a well
> formed document. You may try using BeautifulStoneSoup, included with
> BeautifulSoup http://crummy.com/software/BeautifulSoup/
>
> > I found something that works, although I couldn't tell you why it
> > works. :)
> > retag = re.compile(r'<.+?>', re.DOTALL)
> > tags = retag.findall(simplified)
> > Why does that work?
>
> That means: "look for a less-than sign (<), followed by the shortest
> sequence of (?) one or more (+) arbitrary characters (.), followed by a
> greater-than sign (>)"
>
> If you never get nested tags, and never have a ">" inside an attribute,
> that expression *might* work. But please try BeautifulStoneSoup, it uses a
> lot of heuristics trying to guess the right structure. Doesn't work
> always, but given your input, there isn't much one can do...
>
> --
> Gabriel Genellina

Thanks! I'll take a look at BeautifulStoneSoup today and see what I
get.

~Sean
--
http://mail.python.org/mailman/listinfo/python-list


ptmcg at austin

May 9, 2008, 3:54 PM

Post #7 of 25 (307 views)
Permalink
Re: regexp help [In reply to]

On May 9, 5:19 pm, globalrev <skanem...@yahoo.se> wrote:
> i want to do a little string manipulation and im looking into regexps. i
> couldnt find out how to do:
> s = 'poprorinoncoce'
> re.sub('$o$', '$', s)
> should result in 'prince'
>
> $ is obv the wrong character to use but what i mean is the pattern is
> "consonant o consonant" and should be replaced by just "consonant".
> both consonants should be the same too.
> so mole would be mole
> mom would be m etc

from re import *
vowels = "aAeEiIoOuU"
cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
encodeRe = re.compile(r"([%s])[%s]\1" % (cons,vowels))
print encodeRe.sub(r"\1",s)

This is actually a little more complex than you asked - it will search
for any consonant-vowel-same_consonant triple, and replace it with the
leading consonant. To meet your original request, change to:

from re import *
cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
encodeRe = re.compile(r"([%s])o\1" % cons)
print encodeRe.sub(r"\1",s)

Both print "prince".

-- Paul

(I have a pyparsing solution too, but I just used it to prototype up
the solution, then converted it to regex.)
--
http://mail.python.org/mailman/listinfo/python-list


mccredie at gmail

May 9, 2008, 4:06 PM

Post #8 of 25 (308 views)
Permalink
Re: regexp help [In reply to]

On May 9, 3:19 pm, globalrev <skanem...@yahoo.se> wrote:
> i want to do a little string manipulation and im looking into regexps. i
> couldnt find out how to do:
> s = 'poprorinoncoce'
> re.sub('$o$', '$', s)
> should result in 'prince'
>
> $ is obv the wrong character to use but what i mean is the pattern is
> "consonant o consonant" and should be replaced by just "consonant".
> both consonants should be the same too.
> so mole would be mole
> mom would be m etc

>>> import re
>>> s = 'poprorinoncoce'
>>> coc = re.compile(r"(.)o\1")
>>> coc.sub(r'\1', s)
'prince'
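
One thing to keep in mind with (.)o\1: the dot matches any character, not only consonants, so digits or vowels around an "o" get collapsed as well. A small check, written for Python 3 (the variable names and the second test string are mine):

```python
import re

cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
any_char = re.compile(r"(.)o\1")              # any character, "o", same character
cons_only = re.compile(r"([%s])o\1" % cons)   # consonant, "o", same consonant

decoded = any_char.sub(r"\1", "poprorinoncoce")
loose = any_char.sub(r"\1", "1o1 aoa")        # collapses non-consonants too
strict = cons_only.sub(r"\1", "1o1 aoa")      # leaves them alone

print(decoded, repr(loose), repr(strict))
```

For input that only ever contains encoded consonants, the two behave the same.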

Matt
--
http://mail.python.org/mailman/listinfo/python-list


sjmachin at lexicon

May 9, 2008, 4:52 PM

Post #9 of 25 (308 views)
Permalink
Re: regexp help [In reply to]

Paul McGuire wrote:
> from re import *

Perhaps you intended "import re".

> vowels = "aAeEiIoOuU"
> cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
> encodeRe = re.compile(r"([%s])[%s]\1" % (cons,vowels))
> print encodeRe.sub(r"\1",s)
>
> This is actually a little more complex than you asked - it will search
> for any consonant-vowel-same_consonant triple, and replace it with the
> leading consonant. To meet your original request, change to:
>
> from re import *

And again.

> cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
> encodeRe = re.compile(r"([%s])o\1" % cons)
> print encodeRe.sub(r"\1",s)
>
> Both print "prince".
>

No they don't. The result is "NameError: name 're' is not defined".
--
http://mail.python.org/mailman/listinfo/python-list


skanemupp at yahoo

May 9, 2008, 4:53 PM

Post #10 of 25 (306 views)
Permalink
Re: regexp help [In reply to]

ty. that was the decrypt function. i am also writing an encrypt
function.

def encrypt(phrase):
    pattern = re.compile(r"([bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ])")
    return pattern.sub(r"1\o\1", phrase)

doesnt work though, h becomes 1\\oh.


def encrypt(phrase):
    pattern = re.compile(r"([bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ])")
    return pattern.sub(r"o\1", phrase)

returns oh.

i want hoh.

i dont quite get it. why cant i delimit pattern with \
--
http://mail.python.org/mailman/listinfo/python-list


sjmachin at lexicon

May 9, 2008, 6:39 PM

Post #11 of 25 (293 views)
Permalink
Re: regexp help [In reply to]

globalrev wrote:
> ty. that was the decrypt function. i am also writing an encrypt
> function.
>
> def encrypt(phrase):
> pattern =
> re.compile(r"([bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ])")

The inner pair of () are not necessary.

> return pattern.sub(r"1\o\1", phrase)
>
> doesnt work though, h becomes 1\\oh.

To be precise, "h" becomes "1\\oh", which is the same as r"1\oh". There
is only one backslash in the result.

It's doing exactly what you told it to do: replace each consonant by
(1) the character '1'
(2) a backslash
(3) the character 'o'
(4) the consonant

>
>
> def encrypt(phrase):
> pattern =
> re.compile(r"([bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ])")
> return pattern.sub(r"o\1", phrase)
>
> returns oh.

It's doing exactly what you told it to do: replace each consonant by
(1) the character 'o'
(2) the consonant

> i want hoh.

So tell it to do that:
return pattern.sub(r"\1o\1", phrase)
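
Put together with the decrypt pattern from earlier in the thread, a complete round trip might look like the sketch below. The module-level names CONSONANTS, _encode and _decode are my own choice, and the code is written for Python 3:

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
_encode = re.compile(r"([%s])" % CONSONANTS)       # group 1 = the consonant
_decode = re.compile(r"([%s])o\1" % CONSONANTS)    # consonant, "o", same consonant

def encrypt(phrase):
    # \1 on both sides of the literal "o" repeats the captured consonant.
    return _encode.sub(r"\1o\1", phrase)

def decrypt(phrase):
    return _decode.sub(r"\1", phrase)

print(encrypt("h"))
print(encrypt("prince"))
print(decrypt(encrypt("prince")))
```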

> i dont quite get it.why cant i delimit pattern with \

Perhaps you could explain what you mean by "delimit pattern with \".

--
http://mail.python.org/mailman/listinfo/python-list


skanemupp at yahoo

May 9, 2008, 7:58 PM

Post #12 of 25 (294 views)
Permalink
Re: regexp help [In reply to]

> The inner pair of () are not necessary.

yes they are?


ty anyway, got it now.
--
http://mail.python.org/mailman/listinfo/python-list


sjmachin at lexicon

May 9, 2008, 8:46 PM

Post #13 of 25 (293 views)
Permalink
Re: regexp help [In reply to]

globalrev wrote:
>> The inner pair of () are not necessary.
>
> yes they are?

You are correct. I was having a flashback to a dimly remembered previous
incarnation during which I used regexp software in which something like
& or \0 denoted the whole match (like MatchObject.group(0)) :-)
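
For what it's worth, Python's re does have a way to refer to the whole match in a replacement string: \g<0>. With that, the parentheses really can be dropped. A minimal sketch, written for Python 3:

```python
import re

# No capturing group here; \g<0> in the replacement stands for the whole match.
pattern = re.compile(r"[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]")

def encrypt(phrase):
    return pattern.sub(r"\g<0>o\g<0>", phrase)

print(encrypt("h"))
```

A plain \0 would not do the same thing in a Python replacement string, which is why the \g<0> spelling is needed.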
--
http://mail.python.org/mailman/listinfo/python-list


ptmcg at austin

May 9, 2008, 9:04 PM

Post #14 of 25 (293 views)
Permalink
Re: regexp help [In reply to]

On May 9, 6:52 pm, John Machin <sjmac...@lexicon.net> wrote:
> Paul McGuire wrote:
> > from re import *
>
> Perhaps you intended "import re".

Indeed I did.

> <snip>
>
> > Both print "prince".
>
> No they don't. The result is "NameError: name 're' is not defined".

Dang, now how did that work in my script? I assure you I did test it
before posting.

Ah! My pyparsing prototype preceded the regex version in the same
script, and importing the pyparsing module imports re using "import
re". That is why I didn't get NameError. Sorry for sloppy posting...

Once you clean up the mistakes, you essentially get the same code as
earlier posted by Matimus.
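
For the record, a cleaned-up version of both variants with the import fixed. This is a sketch for Python 3 (hence print as a function), and the names any_vowel and only_o are new:

```python
import re

s = 'poprorinoncoce'
vowels = "aAeEiIoOuU"
cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"

# consonant, any vowel, same consonant
any_vowel = re.compile(r"([%s])[%s]\1" % (cons, vowels))
# consonant, literal "o", same consonant (the original request)
only_o = re.compile(r"([%s])o\1" % cons)

decoded_any = any_vowel.sub(r"\1", s)
decoded_o = only_o.sub(r"\1", s)
print(decoded_any, decoded_o)
```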

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list


iurisilvio at gmail

Aug 27, 2009, 11:28 AM

Post #15 of 25 (195 views)
Permalink
Re: regexp help [In reply to]

You can use r"[+-]?\d+" to get positive and negative integers.

It matches these strings: "+123", "-123", "123"
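
A quick way to check this, keeping the named group from the question. It is a sketch for Python 3; re.fullmatch is used so that trailing junk is rejected, and the extra test strings are mine:

```python
import re

signed_int = re.compile(r"(?P<data>[+-]?[0-9]+)")

for text in ["+123", "-123", "123", "12a", "--5"]:
    m = signed_int.fullmatch(text)
    # int() accepts a leading "+" or "-" as well.
    print(text, int(m.group("data")) if m else None)
```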



On Thu, Aug 27, 2009 at 3:15 PM, Bakes <bakes[at]ymail.com> wrote:

> If I were using the code:
>
> (?P<data>[0-9]+)
>
> to get an integer between 0 and 9, how would I allow it to register
> negative integers as well?
> --
> http://mail.python.org/mailman/listinfo/python-list
>


ppearson at nowhere

Aug 27, 2009, 1:31 PM

Post #16 of 25 (193 views)
Permalink
Re: regexp help [In reply to]

On Thu, 27 Aug 2009 11:15:59 -0700 (PDT), Bakes <bakes[at]ymail.com> wrote:
> If I were using the code:
>
> (?P<data>[0-9]+)
>
> to get an integer between 0 and 9, how would I allow it to register
> negative integers as well?

(?P<data>-?[0-9]+)

--
To email me, substitute nowhere->spamcop, invalid->net.
--
http://mail.python.org/mailman/listinfo/python-list


mdekauwe at gmail

Aug 27, 2009, 2:51 PM

Post #17 of 25 (189 views)
Permalink
Re: regexp help [In reply to]

On Aug 27, 7:15 pm, Bakes <ba...@ymail.com> wrote:
> If I were using the code:
>
> (?P<data>[0-9]+)
>
> to get an integer between 0 and 9, how would I allow it to register
> negative integers as well?

-?
--
http://mail.python.org/mailman/listinfo/python-list


ptmcg at austin

Aug 27, 2009, 6:48 PM

Post #18 of 25 (184 views)
Permalink
Re: regexp help [In reply to]

On Aug 27, 1:15 pm, Bakes <ba...@ymail.com> wrote:
> If I were using the code:
>
> (?P<data>[0-9]+)
>
> to get an integer between 0 and 9, how would I allow it to register
> negative integers as well?

With that + sign in there, you will actually get an integer from 0 to
99999999999999999...
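
If a single digit (optionally negative) really is what's wanted, one option is to drop the "+". A small sketch for Python 3, with the name one_digit made up for the example:

```python
import re

# Exactly one digit, with an optional leading minus sign.
one_digit = re.compile(r"(?P<data>-?[0-9])")

for text in ["7", "-7", "42"]:
    m = one_digit.fullmatch(text)
    print(text, m.group("data") if m else None)
```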

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list


simon at brunningonline

Nov 4, 2009, 8:43 AM

Post #19 of 25 (90 views)
Permalink
Re: regexp help [In reply to]

2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
> I’m trying to write regexp that find all files that are not with next
> extensions:  exe|dll|ocx|py,  but can’t find any command that make it.

http://code.activestate.com/recipes/499305/ should be a good start.
Use the re module and your regex instead of fnmatch.filter(), and you
should be good to go.

--
Cheers,
Simon B.
--
http://mail.python.org/mailman/listinfo/python-list


Nadav.C at qualisystems

Nov 4, 2009, 9:05 AM

Post #20 of 25 (90 views)
Permalink
RE: regexp help [In reply to]

Thanks, but my question is how to write the regex.

-----Original Message-----
From: simon.brunning[at]gmail.com [mailto:simon.brunning[at]gmail.com] On Behalf Of Simon Brunning
Sent: ד 04 נובמבר 2009 18:44
To: Nadav Chernin; Python List
Subject: Re: regexp help

2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
> I’m trying to write regexp that find all files that are not with next
> extensions:  exe|dll|ocx|py,  but can’t find any command that make it.

http://code.activestate.com/recipes/499305/ should be a good start.
Use the re module and your regex instead of fnmatch.filter(), and you
should be good to go.

--
Cheers,
Simon B.
--
http://mail.python.org/mailman/listinfo/python-list


simon at brunningonline

Nov 4, 2009, 9:12 AM

Post #21 of 25 (90 views)
Permalink
Re: regexp help [In reply to]

2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
> Thanks, but my question is how to write the regex.

re.match(r'.*\.(exe|dll|ocx|py)$', the_file_name) works for me.

--
Cheers,
Simon B.
--
http://mail.python.org/mailman/listinfo/python-list


carsten.haese at gmail

Nov 4, 2009, 9:13 AM

Post #22 of 25 (90 views)
Permalink
Re: regexp help [In reply to]

Nadav Chernin wrote:
> Thanks, but my question is how to write the regex.

See http://www.amk.ca/python/howto/regex/ .

--
Carsten Haese
http://informixdb.sourceforge.net

--
http://mail.python.org/mailman/listinfo/python-list


Nadav.C at qualisystems

Nov 4, 2009, 9:16 AM

Post #23 of 25 (90 views)
Permalink
RE: regexp help [In reply to]

No, I need all files except exe|dll|ocx|py

-----Original Message-----
From: simon.brunning[at]gmail.com [mailto:simon.brunning[at]gmail.com] On Behalf Of Simon Brunning
Sent: ד 04 נובמבר 2009 19:13
To: Nadav Chernin
Cc: Python List
Subject: Re: regexp help

2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
> Thanks, but my question is how to write the regex.

re.match(r'.*\.(exe|dll|ocx|py)$', the_file_name) works for me.

--
Cheers,
Simon B.
--
http://mail.python.org/mailman/listinfo/python-list


simon at brunningonline

Nov 4, 2009, 9:18 AM

Post #24 of 25 (90 views)
Permalink
Re: regexp help [In reply to]

2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
> No, I need all files except exe|dll|ocx|py

not re.match(r'.*\.(exe|dll|ocx|py)$', the_file_name)

Now that wasn't so hard, was it? ;-)
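
Applied to a list of names, that might look like the sketch below. The file names are invented for the example, and the second pattern is an alternative I'm adding for the case where the negation has to live inside the regex itself (a negative lookahead). Written for Python 3:

```python
import re

names = ["setup.exe", "notes.txt", "script.py", "module.pyc", "README"]

# Variant 1: match the unwanted extensions and negate the result in Python.
excluded = re.compile(r'.*\.(exe|dll|ocx|py)$')
kept = [n for n in names if not excluded.match(n)]

# Variant 2: a single pattern that matches only the wanted names.
not_excluded = re.compile(r'(?!.*\.(exe|dll|ocx|py)$).*$')
kept_lookahead = [n for n in names if not_excluded.match(n)]

print(kept)
print(kept_lookahead)
```

Both patterns are case-sensitive as written; passing re.IGNORECASE to re.compile would also catch names like SETUP.EXE.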

--
Cheers,
Simon B.
--
http://mail.python.org/mailman/listinfo/python-list


davea at ieee

Nov 4, 2009, 1:03 PM

Post #25 of 25 (83 views)
Permalink
Re: regexp help [In reply to]

Simon Brunning wrote:
> 2009/11/4 Nadav Chernin <Nadav.C[at]qualisystems.com>:
>
>> Thanks, but my question is how to write the regex.
>>
>
> re.match(r'.*\.(exe|dll|ocx|py)$', the_file_name) works for me.
>
>
How about:
os.path.splitext(x)[1] in (".exe", ".dll", ".ocx", ".py")
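
Turned around to keep everything except those extensions, this could be sketched as follows. The helper name is_wanted, the sample names, and the .lower() call for case-insensitive matching are my additions. Written for Python 3:

```python
import os

EXCLUDED = {".exe", ".dll", ".ocx", ".py"}

def is_wanted(name):
    # splitext returns (root, ext); ext is "" when there is no extension.
    return os.path.splitext(name)[1].lower() not in EXCLUDED

names = ["setup.exe", "notes.txt", "script.py", "module.pyc", "README", "SETUP.EXE"]
wanted = [n for n in names if is_wanted(n)]
print(wanted)
```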

DaveA
--
http://mail.python.org/mailman/listinfo/python-list
