

Re: New internal string format in 3.3

 

 



torriem at gmail

Aug 19, 2012, 10:38 PM

Post #26 of 29

On 08/19/2012 11:51 AM, wxjmfauth [at] gmail wrote:
> Five minutes after I closed my interactive interpreter windows,
> the day I tested this stuff, I thought:
> "Too bad I did not note the extremely bad cases I found, I'm pretty
> sure, this problem will arrive on the table".

Reading through this thread (which is entertaining), I am reminded of
the old saying, "premature optimization is the root of all evil." This
"problem" that you have discovered, if fixed the way you propose
(always using a 4-byte UCS-4 representation internally), would be just
such a premature optimization. It would come at a high cost with very
little real-world impact.
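
For context, CPython 3.3 and later (PEP 393) store each string at the
narrowest width that fits its widest code point. A rough way to observe
this, assuming CPython and using a hypothetical helper named
bytes_per_char, is to measure how much sys.getsizeof grows as a string
gets longer:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Hypothetical helper. It relies on CPython's sys.getsizeof, so the
    numbers are implementation details, not language guarantees.
    """
    short = ch * n
    longer = ch * (2 * n)
    # The fixed object header cancels out; only the per-character cost remains.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n

for label, ch in [("ASCII", "a"), ("Latin-1", "\xe9"),
                  ("BMP", "\u20ac"), ("astral", "\U0001f600")]:
    print(label, bytes_per_char(ch))
```

On a CPython 3.3+ build this is expected to report 1, 1, 2 and 4 bytes
per character, which is the memory an always-UCS-4 design would give up
for most text.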

As others have made abundantly clear, the overhead of changing internal
string representations is a cost that only shows up when the immutable
string object is created. If your code is doing a lot of operations on
immutable strings, each of which by definition creates a new immutable
string object, then the real speed problem is in your algorithm. If you
are working on a string as if it were a buffer, doing many searches,
replaces, etc., then you need to work on an object designed for I/O,
such as io.StringIO. If implemented half correctly, I imagine that
StringIO internally uses the widest possible character representation
in the buffer. I could be wrong here.
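
A minimal sketch of the buffer approach, assuming the goal is to
accumulate many fragments and create the final str only once.
build_report is a hypothetical name, and nothing here depends on how
StringIO stores its contents internally:

```python
import io

def build_report(lines):
    """Accumulate fragments in one buffer, then create the str once."""
    buf = io.StringIO()
    for number, line in enumerate(lines, start=1):
        # Each write appends to the buffer instead of building a new str.
        buf.write(f"{number}: {line}\n")
    return buf.getvalue()

print(build_report(["alpha", "beta", "gamma"]))
```

For simple cases, collecting the pieces in a list and calling
"".join(pieces) at the end achieves the same effect.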

As to your other problem, Python generally tries to follow Unicode
encoding rules to the letter. Thus if a piece of text cannot be
represented in the character set of the terminal, Python will properly
raise an error. The other languages you have tried likely fudge it
somehow: they display what they can, or something similar. In general,
the Windows command window is an outdated thing that no serious
programmer can rely on to display Unicode text. Use a proper GUI API,
or use a better terminal that can handle UTF-8.
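
The strict-versus-fudge difference can be sketched with the codec error
handlers. This assumes a terminal using code page 437, which has no
euro sign; render_for is a hypothetical helper that round-trips text
through the terminal's encoding:

```python
def render_for(encoding, text, errors="strict"):
    """Return text as it would survive a trip through `encoding`."""
    return text.encode(encoding, errors=errors).decode(encoding)

sample = "price: 5\u20ac"  # ends with the euro sign, absent from cp437

try:
    render_for("cp437", sample)
except UnicodeEncodeError as exc:
    # Default behaviour: refuse rather than silently lose data.
    print("strict mode refused:", exc.reason)

# Opting in to "display what you can": unencodable characters become "?".
print(render_for("cp437", sample, errors="replace"))
```

The same errors argument is accepted by open() and io.TextIOWrapper, so
a program can choose the lenient behaviour where it really wants it.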

The TL;DR version: You're right that converting string representations
internally incurs overhead, but if your program is slow because of this,
you're doing it wrong. It's not symptomatic of some Python disease.
--
http://mail.python.org/mailman/listinfo/python-list


roy at panix

Aug 20, 2012, 6:17 AM

Post #27 of 29

In article <mailman.3538.1345442498.4697.python-list [at] python>,
Michael Torrie <torriem [at] gmail> wrote:

> Python generally tries to follow unicode
> encoding rules to the letter. Thus if a piece of text cannot be
> represented in the character set of the terminal, then Python will
> properly err out. Other languages you have tried, likely fudge it
> somehow.

And if you want the "fudge it somehow" behavior (which is often very
useful!), there's always http://pypi.python.org/pypi/Unidecode/


torriem at gmail

Aug 20, 2012, 9:18 PM

Post #28 of 29

On 08/20/2012 07:17 AM, Roy Smith wrote:
> In article <mailman.3538.1345442498.4697.python-list [at] python>,
> Michael Torrie <torriem [at] gmail> wrote:
>
>> Python generally tries to follow unicode
>> encoding rules to the letter. Thus if a piece of text cannot be
>> represented in the character set of the terminal, then Python will
>> properly err out. Other languages you have tried, likely fudge it
>> somehow.
>
> And if you want the "fudge it somehow" behavior (which is often very
> useful!), there's always http://pypi.python.org/pypi/Unidecode/

Sweet tip, thanks! I often want to process text that has smart quotes,
em dashes, etc., and convert them to plain old ASCII quotes, dashes,
ticks, etc. This will do that for me without resorting to a bunch of
regexes. Bravo.
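
For the punctuation case alone, a standard-library stand-in can be
sketched with str.translate. The mapping below is a small hypothetical
table chosen for illustration, not Unidecode's own data, which covers
far more characters:

```python
# Hypothetical mapping from typographic punctuation to ASCII.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
})

def plain_punctuation(text):
    """Replace the mapped characters; everything else passes through."""
    return text.translate(ASCII_PUNCTUATION)

print(plain_punctuation("\u201cIt\u2019s done\u201d \u2014 she said\u2026"))
```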


roy at panix

Aug 21, 2012, 4:48 AM

Post #29 of 29

In article <mailman.3587.1345522727.4697.python-list [at] python>,
Michael Torrie <torriem [at] gmail> wrote:

> > And if you want the "fudge it somehow" behavior (which is often very
> > useful!), there's always http://pypi.python.org/pypi/Unidecode/
>
> Sweet tip, thanks! I often want to process text that has smart quotes,
> emdashes, etc, and convert them to plain old ascii quotes, dashes,
> ticks, etc. This will do that for me without resorting to a bunch of
> regexes. Bravo.

Yup, that's one of the things it's good for. We mostly use it to help
map search terms, e.g., if you search for "beyonce", you're probably
expecting it to match "Beyoncé".

We also special-case some weird stuff like "kesha" matching "ke$ha", but
we have to hand-code those.
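
A rough sketch of that kind of search-term folding, using only the
standard library as a stand-in for Unidecode. search_key and ALIASES
are hypothetical names, and this is an assumption about the general
shape of such a scheme, not the code described above:

```python
import unicodedata

# Hand-coded special cases, looked up after folding (hypothetical table).
ALIASES = {"ke$ha": "kesha"}

def search_key(term):
    """Fold accents and case so that similar-looking terms compare equal."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    folded = stripped.casefold()
    return ALIASES.get(folded, folded)

print(search_key("Beyonc\u00e9") == search_key("beyonce"))
print(search_key("Ke$ha") == search_key("kesha"))
```

Unlike Unidecode, this only removes combining marks; it does not
transliterate letters that have no decomposition.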
