Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Bugs

[issue7224] One obvious way to do interning

 

 

Python bugs RSS feed   Index | Next | Previous | View Threaded


report at bugs

Oct 27, 2009, 6:05 PM

Post #1 of 12 (386 views)
Permalink
[issue7224] One obvious way to do interning

New submission from Alexander Belopolsky <belopolsky [at] users>:

Attached patch implements python-ideas proposal to return added or
existing element from set.add(). See
http://mail.python.org/pipermail/python-ideas/2009-October/006491.html .

In addition this patch contains a reimplementation of issue1507011 using
new behavior of set_add.

The two changes are independent an I would submit them separately if I
was more optimistic that any would be accepted. :-)

This submission is mostly for the benefit of users with applications
that create huge number of interned strings. Since the patch saves 8
bytes for each interned string, such application can see appreciable
memory savings.

I don't have such application and my tests show no difference between
patched and stock python. (For a data point, on start up stock python
creates about 2,000 interned strings.)

My own motivation for writing this patch was the same as for issue1507011: the code that uses a set to store interned strings is more
straightforward than code that uses a dict that stores strings twice,
both as key and value.

----------
components: Interpreter Core
files: set-add.diff
keywords: patch
messages: 94595
nosy: belopolsky
severity: normal
status: open
title: One obvious way to do interning
type: feature request
Added file: http://bugs.python.org/file15218/set-add.diff

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Oct 27, 2009, 6:20 PM

Post #2 of 12 (370 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Changes by Raymond Hettinger <rhettinger [at] users>:


----------
assignee: -> rhettinger
nosy: +rhettinger
versions: +Python 2.7, Python 3.2

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Oct 28, 2009, 2:33 AM

Post #3 of 12 (368 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Antoine Pitrou <pitrou [at] free> added the comment:

A simple way to try and see a difference would be to import lot of modules.

By the way, you shouldn't call it _PySet_Add(), it would cause confusion
with the existing PySet_Add(). _PySet_Intern() would be fine.

----------
nosy: +pitrou

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Oct 28, 2009, 9:06 AM

Post #4 of 12 (365 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Alexander Belopolsky <belopolsky [at] users> added the comment:

I agree, _PySet_Add name can be improved upon, but I don't want to paint
this particular bikeshed until it is clearer what if anything will be done
with this idea. If we add PySet_Intern API, then it would be natural to
expose it as set.intern rather than changing how set.add works. On the
other hand, if set.add grows a return value, then it would make sense to
eventually change PySet_Add to conform.

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 4, 2009, 5:46 AM

Post #5 of 12 (338 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Nick Coghlan <ncoghlan [at] gmail> added the comment:

If the idea is to create the "one obvious way" for interning, calling
the method and corresponding C function "intern" isn't painting a
bikeshed, it's meeting the design spec :)

Adding a new method also avoids a whole host of issues with the Set ABC.

----------
nosy: +ncoghlan

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 4, 2009, 6:47 AM

Post #6 of 12 (346 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Alexander Belopolsky <belopolsky [at] users> added the comment:

On Wed, Nov 4, 2009 at 9:46 AM, Nick Coghlan <report [at] bugs> wrote:
>
> Nick Coghlan <ncoghlan [at] gmail> added the comment:
>
> If the idea is to create the "one obvious way" for interning, calling
> the method and corresponding C function "intern" isn't painting a
> bikeshed, it's meeting the design spec :)
>

Any bikeshed needs to be painted and _PySet_Add was deliberately just
a primer. (Sorry for the cliche abuse.) At the C API level, I was
hoping that old PySet_Add semantics could be eventually deprecated and
_PySet_Add be promoted to a public method. Keeping two functions that
have the same effect but return different values is inelegant
especially given that zero return from one is success and from the
other is an error, so people who don't pay attention to compiler
warnings may get some nasty bugs. Maybe _PySet_Add should be renamed
to _PySet_AddAndReturn to emphasize the similarity and difference with
PySet_Add. I don't mind calling the method _PySet_Intern (in the
original patch I used _PySet_InternString), but that choice will only
become obvious if set grows intern(..) method in addition to add(..).
I also wonder how widespread the knowledge of the term "intern" is. I
would not be surprised if some non-CS users would be more comfortable
with AddAndReturn.

> Adding a new method also avoids a whole host of issues with the Set ABC.

I am neutral on changing add(..) return value vs. adding a new method.
Can you elaborate on the "whole host of issues with the Set ABC".
Interestingly, current implementation of add(..) does not match PEP
3119 [1] which calls for a boolean return value.

[1] http://www.python.org/dev/peps/pep-3119/#sets

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 11:16 AM

Post #7 of 12 (338 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Raymond Hettinger <rhettinger [at] users> added the comment:

The basic problem here is that the "one obvious way" to some people
(including me and Martin v. Löwis) is to use a dictionary. Further,
there is the problem of conflating types in a user's mind -- right now,
dictionaries are all about looking up and returning values, while sets
are not.

The current notion of sets corresponds neatly with what folks learn in
their basic math classes. In contrast, the notion of "canonical
representatives of an equivalence class" is a more advanced mathematical
concept, one that many people are never taught. This is even more
problematic in Python because the operation of finding canonical values
is hidden behind well designed __hash__ and __eq__ functions (for
example, it is not at all obvious how Decimal(3.0), float(3.0), and
int(3.0) all have the same hash value).

An additional problem is the modification of set.add() so that it
violates Python's norm of having mutating methods return None. We
should keep the language consistent in this regard.

Also, I should comment on the "appreciable memory savings". This saves
only a single pointer (4 bytes on a 32-bit build) out of three per
string. But the string object itself takes up memory, so the percent
savings per string is smaller than you would might expect (typically
less than 15%).

Finally, the spirit of the language moratorium is to make it so that
other Python implementations don't have to change for a few years.

FWIW, I am already going to expand the set/frozenset documention to show
some techniques for accessing and using sets. I can add a link to a
get_equivalent() recipe that makes it possible to retrieve canonical
values from any container including sets. That recipe uses all the
containers as-is, no patching or API changes needed.

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 11:50 AM

Post #8 of 12 (343 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Alexander Belopolsky <belopolsky [at] users> added the comment:

On Thu, Nov 5, 2009 at 2:16 PM, Raymond Hettinger
<report [at] bugs> wrote:
>
> Raymond Hettinger <rhettinger [at] users> added the comment:
>
> The basic problem here is that the "one obvious way" to some people
> (including me and Martin v. Löwis) is to use a dictionary.  Further,
> there is the problem of conflating types in a user's mind -- right now,
> dictionaries are all about looking up and returning values, while sets
> are not.

One feature that is missing from both dict approach and get_equivalent
recipe is the ability to do interning of new values with a single
lookup. Note that even at the C level one has to first try to
retrieve the key from the dictionary and then if that fails, do an
insert which repeats the same lookup. Even dict.setdefault which is
designed to remove the double lookup while eliminates it when key is
present, still does the second lookup when the key is missing.

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 12:23 PM

Post #9 of 12 (337 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Raymond Hettinger <rhettinger [at] users> added the comment:

That is a false optimization. Regular python code is full of look-ups
(even set.add has a getattr(s, 'add') lookup just to find the
add-method; every call to a built-in typically goes through multiple
lookups). Also, the cost of a second lookup is trivial because of
processor caching: see Object/dictnotes.txt.

Martin has already rejected a similar proposal for similar reasons.
Please drop this one. IMO, it is harmful to the set API because it
changes set's core concept to include lookup operations instead of just
simple set operations.

Either use sys.intern() or roll your own using get_equivalent() or a
dictionary. A single use-case doesn't warrant building-out the
language, providing more ways to do it, or messing the core concept of
the set/frozenset datatype.

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 12:38 PM

Post #10 of 12 (335 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Eric Smith <eric [at] trueblade> added the comment:

I agree with Raymond here. This use case isn't special enough, or the
performance of the current "one way to do it" bad enough to warrant
changing set. I recommend closing this issue.

----------
nosy: +eric.smith

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 1:03 PM

Post #11 of 12 (334 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Alexander Belopolsky <belopolsky [at] users> added the comment:

On Thu, Nov 5, 2009 at 3:23 PM, Raymond Hettinger
<report [at] bugs> wrote:
..
> Martin has already rejected a similar proposal for similar reasons.
> Please drop this one.

Sure. In fact I've never proposed to apply this patch. As I said in
my original submission, the only purpose of reviving issue1507011 is
this way was to give an easy way for users who claimed that in their
apps saving 4-8 bytes per interned string would make a difference to
test their claim.

Just as with issue1507011 three years ago I personally did not see
noticeable improvements in my applications from the patch.

----------

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 5, 2009, 1:06 PM

Post #12 of 12 (335 views)
Permalink
[issue7224] One obvious way to do interning [In reply to]

Raymond Hettinger <rhettinger [at] users> added the comment:

Thank you.

----------
resolution: -> rejected
status: open -> closed

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue7224>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

Python bugs RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.