
Mailing List Archive: Python: Bugs

[issue2834] re.IGNORECASE not Unicode-ready

 

 



report at bugs

Jun 28, 2008, 12:40 PM

Post #1 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

Same here, re.LOCALE doesn't circumvent the problem.

----------
nosy: +pitrou

_______________________________________
Python tracker <report[at]bugs.python.org>
<http://bugs.python.org/issue2834>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Jun 28, 2008, 1:27 PM

Post #2 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

Uh, actually, it works if you specify re.UNICODE. If you don't, the
getlower() function in _sre.c falls back to the plain ASCII algorithm.

>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>

I wonder if re.UNICODE shouldn't be the default in Py3k, at least when
the pattern is a string and not a bytes object. There could also be a
re.ASCII flag for those cases where people want to fall back to the old
behaviour.
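
As a rough illustration of the behaviour being proposed, the sketch below assumes a Python 3 interpreter in which str patterns are Unicode-aware by default and re.ASCII exists (which is how the re module works in released Python 3 versions). The helper name `matches` is made up for this example.

```python
import re


def matches(pattern, text, flags=0):
    """Return True if `pattern` matches at the start of `text` (hypothetical helper)."""
    return re.match(pattern, text, flags) is not None


# '\u00c1' is A with acute accent (upper case), '\u00e1' is the lower-case form.
# A str pattern uses Unicode case rules without an explicit re.UNICODE.
unicode_default = matches('\u00c1', '\u00e1', re.IGNORECASE)

# With re.ASCII, case-insensitive matching only covers ASCII letters,
# so the accented pair no longer matches while plain 'A'/'a' still does.
ascii_accented = matches('\u00c1', '\u00e1', re.IGNORECASE | re.ASCII)
ascii_plain = matches('A', 'a', re.IGNORECASE | re.ASCII)

print(unicode_default, ascii_accented, ascii_plain)
```

The point of the comparison is that re.ASCII restores the old ASCII-only case handling for callers who depend on it, while everyone else gets Unicode-aware matching with no extra flag.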



report at bugs

Jun 28, 2008, 3:20 PM

Post #3 of 11

Guido van Rossum <guido[at]python.org> added the comment:

Sounds like re.UNICODE should be on by default when the pattern is a str
instance.

Also (per mailing list discussion) we should probably only allow
matching bytes when the pattern is bytes, and matching str when the
pattern is str.

Finally, is there a use case of re.LOCALE any more? I'm thinking not.



report at bugs

Jun 28, 2008, 3:35 PM

Post #4 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

On Saturday, 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.

It's used for locale-specific case matching in the non-Unicode (bytes)
case. But it looks like bad practice to me, and we could probably remove it.

'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>



report at bugs

Jun 28, 2008, 6:15 PM

Post #5 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

Here is a preliminary patch which doesn't remove re.LOCALE, but raises
TypeError when the pattern and the string being matched are of different
types (str vs. bytes), raises ValueError when re.UNICODE is specified with
a bytes pattern, and implies re.UNICODE for unicode patterns. The test
suite runs fine after a few fixes.

It also includes the patch for #3231 ("re.compile fails with some bytes
patterns").
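
The three checks described above can be sketched as follows. This is not taken from the patch itself; it assumes a Python 3 interpreter whose re module behaves the way the patch describes, and the variable names are only for illustration.

```python
import re

# A str pattern implies re.UNICODE even when no flag is passed.
str_pattern = re.compile('caf.')
implies_unicode = bool(str_pattern.flags & re.UNICODE)

# Matching a str pattern against bytes is rejected with TypeError.
try:
    str_pattern.match(b'cafe')
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__

# Asking for re.UNICODE on a bytes pattern is rejected with ValueError.
try:
    re.compile(b'cafe', re.UNICODE)
    flag_error = None
except ValueError as exc:
    flag_error = type(exc).__name__

print(implies_unicode, mixed_error, flag_error)
```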

----------
keywords: +patch
Added file: http://bugs.python.org/file10767/reunicode.patch



report at bugs

Jun 28, 2008, 6:19 PM

Post #6 of 11

Changes by Antoine Pitrou <pitrou[at]free.fr>:


Removed file: http://bugs.python.org/file10767/reunicode.patch



report at bugs

Jun 28, 2008, 6:19 PM

Post #7 of 11

Changes by Antoine Pitrou <pitrou[at]free.fr>:


Added file: http://bugs.python.org/file10768/reunicode.patch



report at bugs

Jun 29, 2008, 1:21 PM

Post #8 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

This new patch also introduces re.ASCII, as discussed on the mailing list.

Added file: http://bugs.python.org/file10777/reunicode2.patch



report at bugs

Jun 29, 2008, 1:36 PM

Post #9 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

Improved patch which also detects incompatibilities for "(?u)".

Added file: http://bugs.python.org/file10778/reunicode3.patch



report at bugs

Jul 5, 2008, 2:10 PM

Post #10 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

This new patch adds re.ASCII in all sensitive places I could find in the
stdlib (except lib2to3 which, as far as I understand, is maintained in a
separate branch, and even has its own copy of tokenize.py...).

Also, I didn't get an answer to the following question on the ML: should
an inline flag "(?a)" be introduced to mirror the existing "(?u)", so
as to set the ASCII flag from inside a pattern string?
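
For illustration, here is what such an inline flag looks like in use. The sketch assumes a current Python 3 interpreter, where "(?a)" is accepted at the start of a str pattern; it says nothing about what this particular patch implements.

```python
import re

# '\u00e9' is e with acute accent. For a str pattern, \w matches it by default.
default_word = re.match(r'\w', '\u00e9') is not None

# "(?a)" at the start of the pattern switches on ASCII-only matching,
# with the same effect as passing re.ASCII to the call.
inline_ascii = re.match(r'(?a)\w', '\u00e9') is not None
flag_ascii = re.match(r'\w', '\u00e9', re.ASCII) is not None

# The inline flag shows up in the compiled pattern's flags.
inline_sets_flag = bool(re.compile(r'(?a)\w').flags & re.ASCII)

print(default_word, inline_ascii, flag_ascii, inline_sets_flag)
```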

Added file: http://bugs.python.org/file10819/reunicode4.patch



report at bugs

Jul 5, 2008, 2:30 PM

Post #11 of 11

Antoine Pitrou <pitrou[at]free.fr> added the comment:

http://codereview.appspot.com/2439

