
Mailing List Archive: Python: Python

regex question

 

 



johnmasters at oxtedonline

Oct 4, 2007, 1:50 PM

Post #51 of 75 (800 views)
Permalink
Re: RegEx question [In reply to]

On 15:25 Thu 04 Oct , Robert Dailey wrote:
> I am not a regex expert, I simply assumed regex was standardized to follow
> specific guidelines.

There are as many different regex flavours as there are Linux distros.
Each follows the basic rules but implements them slightly differently
and adds their own 'extensions'.

> I also made the assumption that this was a good place
> to pose the question since regular expressions are a feature of Python.

The best place to pose a regex question is in the sphere of usage, i.e.
Perl regexes differ hugely in implementation from OO langs like Python
or Java, while shells like bash or zsh use regexes slightly differently,
as do shell scripting languages like awk or sed.

> The question concerned regular expressions in general, not really the
> application. However, now that I know that regex can be different, I'll try
> to contact the author directly to find out the dialect and then find the
> appropriate location for my question from there. I do appreciate everyone's
> help. I've tried the various suggestions offered here, however none of them
> work. I can only assume at this point that this regex is drastically
> different or the application reading the regex is just broken.

If you care to PM me with details of the language/context I will try to
help but I am no expert.

Regards, John
--
http://mail.python.org/mailman/listinfo/python-list


wanja.chre.sta at gmail

Feb 13, 2008, 5:23 AM

Post #52 of 75 (795 views)
Permalink
Re: regex question [In reply to]

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get:
>>> print m.groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window
Width ', ' ', 'SL', '1')
But only when I insert a space in the 3rd char group (I'm not sure if
your original pattern has a space there or not). So the third group is:
([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
match the line.

I also can't tell what the format of your line is. If it is like this:
line = "...Siemens: Thorax/Multix FD Lab Settings Auto Window Width..."
where "Auto Window Width" should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a "?"):
http://docs.python.org/lib/re-syntax.html
([A-Za-z0-9./:_ -]+?)
With that I get:
>>> patt.match(line).groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window
Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
in the fourth group to get rid of the trailing spaces.
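A minimal sketch of the greedy versus non-greedy behaviour described above. The exact runs of blanks in `line` are an assumption (the archive collapsed the original whitespace), and the names `head`, `tail`, `greedy` and `lazy` are only for illustration.

```python
import re

# Assumed spacing: at least two blanks between "Settings" and "Auto".
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

head = r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
tail = r"\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"

# Same pattern twice; only the quantifier of group 3 differs.
greedy = re.compile(head + r"([A-Za-z0-9./:_ -]+)" + tail)
lazy = re.compile(head + r"([A-Za-z0-9./:_ -]+?)" + tail)

greedy_groups = greedy.match(line).groups()
lazy_groups = lazy.match(line).groups()

print(repr(greedy_groups[2]))        # group 3 swallows the window text too
print(repr(lazy_groups[2]))          # group 3 stops at the first 2+ blanks
print(repr(lazy_groups[3].strip()))  # group 4, trailing blanks stripped
```

Because the space is part of the third character class, the greedy `+` is free to run across the separator; the `?` makes it stop at the first place where the rest of the pattern can still match.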

HTH
Wanja


mathieu wrote:
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4



bearophileHUGS at lycos

Feb 13, 2008, 5:34 AM

Post #53 of 75 (797 views)
Permalink
Re: regex question [In reply to]

mathieu, stop writing complex REs like obfuscated toys, use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), the indentation
level has to be used to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us a much more
readable thing to fix.
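As a rough illustration of that advice, here is one way the pattern from this thread might be laid out with `re.VERBOSE`. The non-greedy quantifiers on groups 3 and 4 and the spacing of the sample `line` are assumptions, not the original poster's exact data.

```python
import re

patt = re.compile(r"""
    ^\s*
    \(
        ([0-9A-Z]+) ,              # group 1: first code
        ([0-9A-Zx]+)               # group 2: second code
    \)
    \s+
    ([A-Za-z0-9./:_ -]+?)          # group 3: description (non-greedy)
    \s\s+                          # separator: two or more whitespace
    ([A-Za-z0-9 ()._,/#>-]+?)      # group 4: second text field (non-greedy)
    \s+
    ([A-Z][A-Z]_?O?W?)             # group 5: type label
    \s+
    ([0-9n-]+)                     # group 6: count
    \s*$
""", re.VERBOSE)

# Assumed spacing; the archive collapsed the original runs of blanks.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

groups = patt.match(line).groups()
print(groups)
```

In verbose mode, whitespace and `#` comments outside character classes are ignored, while a space or `#` inside `[...]` still counts as a literal, so the classes can stay exactly as they were.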

Bye,
bearophile


grflanagan at gmail

Feb 13, 2008, 5:53 AM

Post #54 of 75 (797 views)
Permalink
Re: regex question [In reply to]

On Feb 13, 1:53 pm, mathieu <mathieu.malate...@gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width SL 1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")
> m = patt.match(line)
> if m:
>     print m.group(3)
>     print m.group(4)


I don't know if it solves your problem, but if you want to match a
dash (-), then it must be escaped or placed first or last in a
character class.
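A short sketch of how Python's `re` treats a dash at different positions in a character class. Note that a dash is also literal when it comes last in the class, which is how both classes in the pattern above use it. The sample string is made up for illustration.

```python
import re

sample = "a-mz"

escaped = re.findall(r"[a\-z]", sample)    # escaped dash: literal
first = re.findall(r"[-az]", sample)       # dash first: literal
last = re.findall(r"[az-]", sample)        # dash last: literal
as_range = re.findall(r"[a-z]", sample)    # dash between two chars: a range

print(escaped, first, last, as_range)
```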

Gerard


ptmcg at austin

Feb 13, 2008, 6:29 AM

Post #55 of 75 (800 views)
Permalink
Re: regex question [In reply to]

On Feb 13, 6:53 am, mathieu <mathieu.malate...@gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width          SL   1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")
<snip>

I love the smell of regex'es in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width          SL   1 "

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")


Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto".
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

(xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)
80 SL 1


Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:

from pyparsing import Word, hexnums, delimitedList, printables, \
    White, Regex, nums

line = \
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
    "Auto Window Width          SL   1 "

# define fields
hexint = Word(hexnums+"x")
text = delimitedList(Word(printables),
                     delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
    text("desc") + text("window") + type_label("type") + \
    int_label("int")

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
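For comparison, a similar access-by-field-name can be sketched with the standard `re` module using named groups. The field names are the same guesses as in the pyparsing version, the non-greedy quantifiers are a choice made here, and the spacing of `line` is an assumption because the archive collapsed the original blanks.

```python
import re

patt = re.compile(
    r"^\s*\((?P<x>[0-9A-Z]+),(?P<y>[0-9A-Zx]+)\)\s+"
    r"(?P<desc>[A-Za-z0-9./:_ -]+?)\s\s+"
    r"(?P<window>[A-Za-z0-9 ()._,/#>-]+?)\s+"
    r"(?P<type>[A-Z][A-Z]_?O?W?)\s+"
    r"(?P<int>[0-9n-]+)\s*$")

# Assumed spacing: fields are separated by two or more blanks.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

fields = patt.match(line).groupdict()   # dict keyed by group name
for name in sorted(fields):
    print("- %s: %s" % (name, fields[name]))
```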

-- Paul


zancudero at gmail

Aug 5, 2008, 5:54 AM

Post #56 of 75 (798 views)
Permalink
Re: regex question [In reply to]

=)
Indeed. But it will replace every dot, including those in ordinary text,
not only those inside numbers.

On Tue, Aug 5, 2008 at 3:23 PM, Jeff <jeffober [at] gmail> wrote:

> On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
> > > In other words I'd like to replace all the instances of a '.' character
> > > with something (say nothing at all) when the '.' is representing a
> > > decimal separator. E.g.
> >
> > > 500.675 ----> 500675
> >
> > > but also
> >
> > > 1.000.456.344 ----> 1000456344
> >
> > > I don't care about the fact the the resulting number is difficult to
> > > read: as long as it remains a series of digits it's ok: the important
> > > thing is to get rid of the period, because I want to keep it only where
> > > it marks the end of a sentence.
> >
> > > I was trying to do like this
> >
> > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
> >
> > > but I don't know much about regular expressions, and don't know how to
> > > get the two groups of numbers and join them in the sub. Moreover doing
> > > like this I only match things like "345.000" and not "1.000.000".
> >
> > > What's the correct approach?
> >
> > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> > Out[13]: '1000456344'
> >
> > Ciao,
> > Marc 'BlackJack' Rintsch
>
> Even faster:
>
> '1.000.456.344'.replace('.', '') => '1000456344'


cwitts at gmail

Aug 5, 2008, 5:58 AM

Post #57 of 75 (796 views)
Permalink
Re: regex question [In reply to]

On Aug 5, 2:23 pm, Jeff <jeffo...@gmail.com> wrote:
> On Aug 5, 7:10 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
>
>
>
> > On Tue, 05 Aug 2008 11:39:36 +0100, Fred Mangusta wrote:
> > > In other words I'd like to replace all the instances of a '.' character
> > > with something (say nothing at all) when the '.' is representing a
> > > decimal separator. E.g.
>
> > > 500.675  ---->       500675
>
> > > but also
>
> > > 1.000.456.344 ----> 1000456344
>
> > > I don't care about the fact the the resulting number is difficult to
> > > read: as long as it remains a series of digits it's ok: the important
> > > thing is to get rid of the period, because I want to keep it only where
> > > it marks the end of a sentence.
>
> > > I was trying to do like this
>
> > > s=re.sub("[(\d+)(\.)(\d+)]","... ",s)
>
> > > but I don't know much about regular expressions, and don't know how to
> > > get the two groups of numbers and join them in the sub. Moreover doing
> > > like this I only match things like "345.000" and not "1.000.000".
>
> > > What's the correct approach?
>
> > In [13]: re.sub(r'(\d)\.(\d)', r'\1\2', '1.000.456.344')
> > Out[13]: '1000456344'
>
> > Ciao,
> >         Marc 'BlackJack' Rintsch
>
> Even faster:
>
> '1.000.456.344'.replace('.', '') => '1000456344'

Doesn't work for his use case as he wants to keep periods marking the
end of a sentence.


aaa at bbb

Aug 5, 2008, 7:55 AM

Post #58 of 75 (791 views)
Permalink
Re: regex question [In reply to]

Chris wrote:

> Doesn't work for his use case as he wants to keep periods marking the
> end of a sentence.

Exactly. Thanks to all of you anyway, now I have a better understanding
of how to proceed :)

F.


toby at tobiah

Aug 6, 2008, 9:20 AM

Post #59 of 75 (779 views)
Permalink
Re: regex question [In reply to]

On Tue, 05 Aug 2008 15:55:46 +0100, Fred Mangusta wrote:

> Chris wrote:
>
>> Doesn't work for his use case as he wants to keep periods marking the
>> end of a sentence.

Doesn't it? The period has to be surrounded by digits in the
example solution, so wouldn't periods followed by a space
(end of sentence) always make it through?
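A small sketch that checks this on a made-up sentence. It also shows a lookaround variant, which is a suggestion added here rather than something proposed in the thread: because the consuming pattern eats the digit after each dot, it can skip a dot when groups are a single digit long.

```python
import re

text = "It cost 1.000.456.344 dollars. Then 500.675 more."

# Pattern from the thread: the dot must sit between two digits.
consuming = re.sub(r"(\d)\.(\d)", r"\1\2", text)

# Lookaround variant: the neighbouring digits are checked but not consumed.
lookaround = re.sub(r"(?<=\d)\.(?=\d)", "", text)

print(consuming)
print(lookaround)

# Single-digit groups expose the difference between the two patterns.
print(re.sub(r"(\d)\.(\d)", r"\1\2", "1.2.3"))
print(re.sub(r"(?<=\d)\.(?=\d)", "", "1.2.3"))
```

In both cases the periods that end the sentences survive, since they are not followed by a digit.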





t at jollybox

Jul 29, 2011, 8:45 AM

Post #60 of 75 (744 views)
Permalink
Re: regex question [In reply to]

On 29/07/11 16:53, rusi wrote:
> Can someone throw some light on this anomalous behavior?
>
>>>> import re
>>>> r = re.search('a(b+)', 'ababbaaabbbbb')
>>>> r.group(1)
> 'b'
>>>> r.group(0)
> 'ab'
>>>> r.group(2)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> IndexError: no such group
>
>>>> re.findall('a(b+)', 'ababbaaabbbbb')
> ['b', 'bb', 'bbbbb']
>
> So evidently group counts by number of '()'s and not by number of
> matches (and this is the case whether one uses match or search). So
> then whats the point of search-ing vs match-ing?
>
> Or equivalently how to move to the groups of the next match in?
>
> [Side note: The docstrings for this really suck:
>
>>>> help(r.group)
> Help on built-in function group:
>
> group(...)
>

Pretty standard regex behaviour: Group 1 is the first pair of brackets.
Group 2 is the second, etc. pp. Group 0 is the whole match.
The difference between matching and searching is that match only succeeds
if the regex matches at the very start of the string (and this
is documented in the library docs IIRC). re.match(exp, s) is roughly
equivalent to re.search('^'+exp, s), unless exp already starts with '^'.

Apparently, findall() returns the content of the first group if there is
one. I didn't check this, but I assume it is documented.
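A brief sketch of the match/search difference, with one caveat to the equivalence above: under `re.MULTILINE`, `^` also matches after each newline, whereas `re.match` still only tries the very start of the string. The sample string is invented for illustration.

```python
import re

s = "xab\nab"

at_start = re.match("ab", s)                 # None: the string starts with "x"
anywhere = re.search("ab", s)                # finds "ab" at index 1

# With re.MULTILINE, '^' can match at the start of the second line,
# but re.match is still tied to the start of the whole string.
match_multiline = re.match("ab", s, re.MULTILINE)
caret_multiline = re.search("^ab", s, re.MULTILINE)

# \A always means "start of the string", regardless of flags.
string_start = re.search(r"\Aab", s, re.MULTILINE)

print(at_start, anywhere.start(), match_multiline,
      caret_multiline.start(), string_start)
```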

- Thomas


python at mrabarnett

Jul 29, 2011, 9:15 AM

Post #61 of 75 (747 views)
Permalink
Re: regex question [In reply to]

On 29/07/2011 16:45, Thomas Jollans wrote:
> On 29/07/11 16:53, rusi wrote:
>> Can someone throw some light on this anomalous behavior?
>>
>>>>> import re
>>>>> r = re.search('a(b+)', 'ababbaaabbbbb')
>>>>> r.group(1)
>> 'b'
>>>>> r.group(0)
>> 'ab'
>>>>> r.group(2)
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in<module>
>> IndexError: no such group
>>
>>>>> re.findall('a(b+)', 'ababbaaabbbbb')
>> ['b', 'bb', 'bbbbb']
>>
>> So evidently group counts by number of '()'s and not by number of
>> matches (and this is the case whether one uses match or search). So
>> then whats the point of search-ing vs match-ing?
>>
>> Or equivalently how to move to the groups of the next match in?
>>
>> [Side note: The docstrings for this really suck:
>>
>>>>> help(r.group)
>> Help on built-in function group:
>>
>> group(...)
>>
>
> Pretty standard regex behaviour: Group 1 is the first pair of brackets.
> Group 2 is the second, etc. pp. Group 0 is the whole match.
> The difference between matching and searching is that match assumes that
> the start of the regex coincides with the start of the string (and this
> is documented in the library docs IIRC). re.match(exp, s) is equivalent
> to re.search('^'+exp, s). (if not exp.startswith('^'))
>
> Apparently, findall() returns the content of the first group if there is
> one. I didn't check this, but I assume it is documented.
>
findall returns a list of tuples (what the groups captured) if there is
more than 1 group, or a list of strings (what the group captured) if
there is 1 group, or a list of strings (what the regex matched) if
there are no groups.
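The three cases can be checked on the string from the original question. The last lines use `re.finditer`, which is one way to walk through the groups of each successive match, as the original poster asked.

```python
import re

s = "ababbaaabbbbb"

no_groups = re.findall("ab+", s)        # whole matches
one_group = re.findall("a(b+)", s)      # contents of the single group
two_groups = re.findall("(a)(b+)", s)   # one tuple per match

# finditer yields a match object per match, so group() and start()
# are available for every occurrence, not just the first.
per_match = [(m.start(), m.group(1)) for m in re.finditer("a(b+)", s)]

print(no_groups)
print(one_group)
print(two_groups)
print(per_match)
```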


rustompmody at gmail

Jul 29, 2011, 10:52 AM

Post #62 of 75 (743 views)
Permalink
Re: regex question [In reply to]

MRAB wrote:
> findall returns a list of tuples (what the groups captured) if there is
> more than 1 group, or a list of strings (what the group captured) if
> there is 1 group, or a list of strings (what the regex matched) if
> there are no groups.

Thanks.
It would be good to put this in the manual, don't you think?

Also, the manual says in the 'match' section

"Note: If you want to locate a match anywhere in *string*, use search() instead."

to guard against users using match when they should be using search.

Likewise, it would be helpful if the manual also said (in the match and
search sections)
"If more than one match/search is required, use findall"


t at jollybox

Jul 29, 2011, 1:36 PM

Post #63 of 75 (742 views)
Permalink
Re: regex question [In reply to]

On 29/07/11 19:52, Rustom Mody wrote:
> MRAB wrote:
> > findall returns a list of tuples (what the groups captured) if there
> > is more than 1 group, or a list of strings (what the group captured)
> > if there is 1 group, or a list of strings (what the regex matched)
> > if there are no groups.
>
> Thanks.
> It would be good to put this in the manual dont you think?

It is in the manual.

> Also, the manual says in the 'match' section
>
> "Note If you want to locate a match anywhere in /string/, use search()
> instead."
>
> to guard against users using match when they should be using search.
>
> Likewise it would be helpful if the manual also said (in the
> match,search sections)
> "If more than one match/search is required use findall"


rosuav at gmail

Aug 17, 2012, 10:42 PM

Post #64 of 75 (637 views)
Permalink
Re: Regex Question [In reply to]

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti <frank.koshti [at] gmail> wrote:
> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
>
>
> <h1>@foo1</h1>
> <p>@foo2()</p>
> <p>@foo3(anything could go here)</p>

You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA


breamoreboy at yahoo

Aug 18, 2012, 3:50 AM

Post #65 of 75 (638 views)
Permalink
Re: Regex Question [In reply to]

On 18/08/2012 06:42, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti <frank.koshti [at] gmail> wrote:
>> Hi,
>>
>> I'm new to regular expressions. I want to be able to match for tokens
>> with all their properties in the following examples. I would
>> appreciate some direction on how to proceed.
>>
>>
>> <h1>@foo1</h1>
>> <p>@foo2()</p>
>> <p>@foo3(anything could go here)</p>
>
> You can find regular expression primers all over the internet - fire
> up your favorite search engine and type those three words in. But it
> may be that what you want here is a more flexible parser; have you
> looked at BeautifulSoup (so rich and green)?
>
> ChrisA
>

Totally agree with the sentiment. There's a comparison of python
parsers here http://nedbatchelder.com/text/python-parsers.html

--
Cheers.

Mark Lawrence.



roy at panix

Aug 18, 2012, 6:08 AM

Post #66 of 75 (634 views)
Permalink
Re: Regex Question [In reply to]

In article
<385e732e-1c02-4dd0-ab12-b92890bbed66 [at] o3g2000yqp>,
Frank Koshti <frank.koshti [at] gmail> wrote:

> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
>
>
> <h1>@foo1</h1>
> <p>@foo2()</p>
> <p>@foo3(anything could go here)</p>

Don't try to parse HTML with regexes. Use a real HTML parser, such as
lxml (http://lxml.de/).


frank.koshti at gmail

Aug 18, 2012, 7:21 AM

Post #67 of 75 (634 views)
Permalink
Re: Regex Question [In reply to]

I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes I've done my share of RTFM.
This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank


steve+comp.lang.python at pearwood

Aug 18, 2012, 7:22 AM

Post #68 of 75 (636 views)
Permalink
Re: Regex Question [In reply to]

On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes,
I'm going to ignore that advice and show you how to abuse regexes to
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by
any alphanumeric characters, followed by "</h1>".


We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)


First we import the re module. Then we define a pattern string. Note that
I use a "raw string" instead of a regular string -- this is not
compulsory, but it is very common.

The difference between a raw string and a regular string is how they
handle backslashes. In Python, some (but not all!) backslashes are
special. For example, the regular string "\n" is not two characters,
backslash-n, but a single character, Newline. The Python string parser
converts backslash combinations into special characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash
etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to
disable the interpretation of backslash escapes when writing regex
patterns. We do that with a "raw string" -- if you prefix the string with
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n
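The difference is easy to verify in a few lines. This is only a sketch; the variable names are arbitrary.

```python
cooked = "abc\n"    # backslash-n is converted to one newline character
raw = r"abc\n"      # backslash and n are kept as two characters

print(len(cooked))  # 4
print(len(raw))     # 5

# A raw string is just another way to spell the same string object:
# doubling the backslash in a cooked string gives an equal value.
same = (raw == "abc\\n")
print(same)         # True
```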


Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash
h 1 greater-than

Most of the characters shown just match themselves. For example, the @
sign will only match another @ sign. But some have special meaning to the
regex:

\w doesn't match "backslash w", but any alphanumeric character or
underscore;

+ doesn't match a plus sign, but tells the regex to match the previous
symbol one or more times. Since it immediately follows \w, this means
"match at least one alphanumeric character (or underscore)".

Now we feed that string into the re.compile, to create a pre-compiled
regex. (This step is optional: any function which takes a compiled regex
will also accept a string pattern. But pre-compiling regexes which you
are going to use repeatedly is a good idea.)
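A quick sketch of both points: the module-level function accepts the plain pattern string, the compiled object offers the same search as a method, and `\w+` needs at least one word character (digits and the underscore count too). The sample text is made up.

```python
import re

pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern)

# The first heading has nothing after "@", so \w+ cannot match there.
text = "<h1>@</h1> <h1>@foo_1</h1>"

via_string = re.search(pattern, text)   # pattern string passed directly
via_compiled = regex.search(text)       # method on the compiled regex

print(via_string.group(0))
print(via_compiled.group(0))
```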

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I which is a special
value that tells the regular expression to ignore case, so "h" will match
both "h" and "H".
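To see the flag at work, compare the same pattern with and without it. The upper-case sample text is invented for the illustration; `(?i)` is the inline spelling of the same flag.

```python
import re

pattern = r"<h1>@\w+</h1>"
shouting = "<H1>@Victory</H1>"

case_sensitive = re.search(pattern, shouting)        # no flag: no match
ignore_case = re.search(pattern, shouting, re.I)     # re.I: matches
inline_flag = re.search(r"(?i)" + pattern, shouting) # same flag, inline

print(case_sensitive)
print(ignore_case.group(0))
print(inline_flag.group(0))
```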

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah <h1>spam</h1>
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""


And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular
expression finds something that matches your pattern. If nothing matches,
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care
about the tags <h1>, we only care about the "victory" part. Here's how to
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory
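If the text holds several such headings, `findall` on the compiled regex is one way to collect every captured name at once. A sketch with a made-up text:

```python
import re

regex = re.compile(r"<h1>@(\w+)</h1>", re.I)

text = "<h1>@victory</h1> some text <H1>@defeat</H1> <h1>spam</h1>"

# With exactly one group in the pattern, findall returns that group's
# text for every match; the last heading has no "@", so it is skipped.
names = regex.findall(text)
print(names)
```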


Hope this helps.


--
Steven


frank.koshti at gmail

Aug 18, 2012, 7:53 AM

Post #69 of 75 (637 views)
Permalink
Re: Regex Question [In reply to]

Hey Steven,

Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.

The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.

To do this, I decided to use:

re.compile('\$\w*\(?.*?\)').findall(mystring)

the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.

Thanks,
Frank


__peter__ at web

Aug 18, 2012, 8:48 AM

Post #70 of 75 (633 views)
Permalink
Re: Regex Question [In reply to]

Frank Koshti wrote:

> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

>>> import re
>>> s = """
... <h1>$foo1</h1>
... <p>$foo2()</p>
... <p>$foo3(anything could go here)</p>
... """
>>> re.compile(r"(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']
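A side-by-side sketch of why the earlier attempt misbehaves. In the original pattern only the opening bracket is optional, so the closing `\)` is always required and `.*?` stretches until it finds one. Wrapping the whole bracketed part in an optional non-capturing group `(?:...)?` fixes that. The one-line sample string is made up.

```python
import re

s1 = "<h1>$foo1</h1><p>$foo2()</p>"

# Original attempt: "\)" is mandatory, so "$foo1" drags in everything
# up to the first closing bracket.
original = re.findall(r"\$\w*\(?.*?\)", s1)

# Fixed: the "(...)" part is optional as a unit.
fixed = re.findall(r"\$\w+(?:\(.*?\))?", s1)

print(original)
print(fixed)
```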




vlastimil.brom at gmail

Aug 18, 2012, 8:50 AM

Post #71 of 75 (634 views)
Permalink
Re: Regex Question [In reply to]

2012/8/18 Frank Koshti <frank.koshti [at] gmail>:
> Hey Steven,
>
> Thank you for the detailed (and well-written) tutorial on this very
> issue. I actually learned a few things! Though, I still have
> unresolved questions.
>
> The reason I don't want to use an XML parser is because the tokens are
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue is
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>
> Thanks,
> Frank

Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g.: r"...", instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.

I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "<h1 $foo(x=3)>Hello</h1>")
'<h1 bar>Hello</h1>'
>>>
is probably not the desired output.

To "process" the matched text, you can also pass a replacement
function to re.sub instead of a replacement string.
See
http://docs.python.org/library/re.html#re.sub
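A minimal sketch of that approach. The token pattern follows the `$name(args)` shape discussed in this thread; the `expand` function and its bracketed output format are hypothetical stand-ins for whatever processing is actually needed.

```python
import re

# Group 1: token name; group 2: optional text between the brackets.
TOKEN = re.compile(r"\$(\w+)(?:\((.*?)\))?")

def expand(match):
    """Hypothetical processing step: render the token as bracketed text."""
    name, args = match.group(1), match.group(2)
    if args is None:            # token appeared without brackets
        return "[%s]" % name
    return "[%s: %s]" % (name, args)

# re.sub calls expand() once per match and inserts its return value.
result = TOKEN.sub(expand, "<h1 $foo(x=3)>Hello $bar</h1>")
print(result)
```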

hth,
vbr


frank.koshti at gmail

Aug 18, 2012, 8:56 AM

Post #72 of 75 (637 views)
Permalink
Re: Regex Question [In reply to]

On Aug 18, 11:48 am, Peter Otten <__pete...@web.de> wrote:
> Frank Koshti wrote:
> > I need to match, process and replace $foo(x=3), knowing that (x=3) is
> > optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
> >>> s = """
> ... <h1>$foo1</h1>
> ... <p>$foo2()</p>
> ... <p>$foo3(anything could go here)</p>
> ... """
> >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
> ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT-


jpiitula at ling

Aug 18, 2012, 9:22 AM

Post #73 of 75 (638 views)
Permalink
Re: Regex Question [In reply to]

Frank Koshti writes:

> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:

>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm
('$foo', '')
('$foo', '(bar=3)')
('$foo', '($)')
('$foo', '')
('$bar', '(v=0)')

Here is the program:

import re

def grab(text):
    p = re.compile(r'([$]\w+)([(][^()]+[)])?')
    return re.findall(p, text)

def test(html):
    print(html)
    for hit in grab(html):
        print(hit)

if __name__ == '__main__':
    test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm')


python at bdurham

Aug 18, 2012, 9:36 AM

Post #74 of 75 (634 views)
Permalink
Re: Regex Question [In reply to]

Steven,

Well done!!!

Regards,
Malcolm


frank.koshti at gmail

Aug 18, 2012, 1:18 PM

Post #75 of 75 (635 views)
Permalink
Re: Regex Question [In reply to]

On Aug 18, 12:22 pm, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
wrote:
> Frank Koshti writes:
> > not always placed in HTML, and even in HTML, they may appear in
> > strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue
> > is I need to match, process and replace $foo(x=3), knowing that
> > (x=3) is optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> Adding a ? after the meant-to-be-optional expression would let the
> regex engine know what you want. You can also separate the mandatory
> and the optional part in the regex to receive pairs as matches. The
> test program below prints this:
>
> >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm
>
> ('$foo', '')
> ('$foo', '(bar=3)')
> ('$foo', '($)')
> ('$foo', '')
> ('$bar', '(v=0)')
>
> Here is the program:
>
> import re
>
> def grab(text):
>     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
>     return re.findall(p, text)
>
> def test(html):
>     print(html)
>     for hit in grab(html):
>         print(hit)
>
> if __name__ == '__main__':
>     test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm')

You read my mind. I didn't even know that's possible. Thank you-
