Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

How can I create customized classes that have similar properties as 'str'?

 

 

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded


MonkeeSage at gmail

Nov 25, 2007, 5:08 AM

Post #26 of 33 (90 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Nov 24, 6:42 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:

> This has nothing, absolutely NOTHING, to do with memoization. Memoization
> trades off memory for time, allowing slow functions to return results
> faster at the cost of using more memory. The OP wants to save memory, not
> use more of it.

Not to beat a dead horse, but memoization can significantly minimize
memory usage, given a large data set with redundant elements (as the
OP seems to suggest [e.g., calculating the deltas of trigrams in a
natural language]). For example, if the data set has 1/3 redundant
elements, then the un-memoized version requires 1/3 more memory,
because it needs to store 1/3 more unique copies of the element,
whereas the memoized version has only to store unique elements once.
Every instance of an element which is already in the cache requires
only the cache lookup (speed), rather than the creation of a new
object (memory). So the trade-off is actually speed for memory rather
than the other way around. Of course, it all depends on the nature of
the data; but for large, redundant data sets, memoization is
definitely a win when it comes to memory optimization.

Regards,
Jordan
--
http://mail.python.org/mailman/listinfo/python-list


cjw at sympatico

Nov 25, 2007, 5:30 AM

Post #27 of 33 (90 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

Steven D'Aprano wrote:
> On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>
>> samwyse <samwyse [at] gmail> writes:
>>
>>> create a hash that maps your keys to themselves, then use the values of
>>> that hash as your keys.
>> The "atom" function you describe already exists under the name "intern".
>
> Not really. intern() works very differently, because it can tie itself to
> the Python internals. Samwyse's atom() function doesn't, and so has no
> purpose.
>
>
> In any case, I'm not sure that intern() actually will solve the OP's
> problem, even assuming it is a real and not imaginary problem. According
> to the docs, intern()'s purpose is to speed up dictionary lookups, not to
> save memory. I suspect that if it does save memory, it will be by
> accident.
>
>>From the docs:
> http://docs.python.org/lib/non-essential-built-in-funcs.html

Yes, but it seems that buffer remains
with Python 3000.

Colin W.
>
> intern( string)
> Enter string in the table of ``interned'' strings and return the interned
> string - which is string itself or a copy. Interning strings is useful to
> gain a little performance on dictionary lookup - if the keys in a
> dictionary are interned, and the lookup key is interned, the key
> comparisons (after hashing) can be done by a pointer compare instead of a
> string compare. Normally, the names used in Python programs are
> automatically interned, and the dictionaries used to hold module, class
> or instance attributes have interned keys. Changed in version 2.3:
> Interned strings are not immortal (like they used to be in Python 2.2 and
> before); you must keep a reference to the return value of intern() around
> to benefit from it.
>
>
> Note the words "which is string itself or a copy". It would be ironic if
> the OP uses intern to avoid having copies of strings, and ends up with
> even more copies than if he didn't bother.
>
> I guess he'll actually need to measure his memory consumption and see
> whether he actually has a memory problem or not, right?
>
>

--
http://mail.python.org/mailman/listinfo/python-list


cjw at sympatico

Nov 25, 2007, 5:30 AM

Post #28 of 33 (91 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

Steven D'Aprano wrote:
> On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>
>> samwyse <samwyse [at] gmail> writes:
>>
>>> create a hash that maps your keys to themselves, then use the values of
>>> that hash as your keys.
>> The "atom" function you describe already exists under the name "intern".
>
> Not really. intern() works very differently, because it can tie itself to
> the Python internals. Samwyse's atom() function doesn't, and so has no
> purpose.
>
>
> In any case, I'm not sure that intern() actually will solve the OP's
> problem, even assuming it is a real and not imaginary problem. According
> to the docs, intern()'s purpose is to speed up dictionary lookups, not to
> save memory. I suspect that if it does save memory, it will be by
> accident.
>
>>From the docs:
> http://docs.python.org/lib/non-essential-built-in-funcs.html

Yes, but it seems that buffer remains
with Python 3000.

Colin W.
>
> intern( string)
> Enter string in the table of ``interned'' strings and return the interned
> string - which is string itself or a copy. Interning strings is useful to
> gain a little performance on dictionary lookup - if the keys in a
> dictionary are interned, and the lookup key is interned, the key
> comparisons (after hashing) can be done by a pointer compare instead of a
> string compare. Normally, the names used in Python programs are
> automatically interned, and the dictionaries used to hold module, class
> or instance attributes have interned keys. Changed in version 2.3:
> Interned strings are not immortal (like they used to be in Python 2.2 and
> before); you must keep a reference to the return value of intern() around
> to benefit from it.
>
>
> Note the words "which is string itself or a copy". It would be ironic if
> the OP uses intern to avoid having copies of strings, and ends up with
> even more copies than if he didn't bother.
>
> I guess he'll actually need to measure his memory consumption and see
> whether he actually has a memory problem or not, right?
>
>
--
http://mail.python.org/mailman/listinfo/python-list


steven at REMOVE

Nov 26, 2007, 6:45 PM

Post #29 of 33 (88 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:

> I mentioned trigram counting as an illustrative case. In fact, you'll
> often need to define patterns more complex than that, and tens of
> megabytes of text may generate millions of them, and I've observed they
> quickly ate up the 8G memory of a workstation in a few minutes.
> Manipulating these patterns can be tricky, you can easily waste a lot of
> memory without taking extra care. I just thought if I define my pattern
> class with this 'atom' property, coding efforts could be easier later.

I'm just not getting the same results as you when I try this. I'm finding
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of
how you generate the patterns. What's important is the number of patterns
you produce, and "millions" isn't that huge a number, even without
interning the strings.

Obviously I'm not running your code, but I can build a dict with millions
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64939962 patterns found, of which 17031467 are unique. There's
paging, yes, and my PC runs a little slow when I try to switch from one
running application to another, but nothing unusable. Opening a dozen
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely
"millions" of strings, even if you are keeping two, three, ten redundant
copies. Assuming an average length of twenty bytes per pattern (you said
trigrams, but I'd rather over-estimate than under), and even assuming
that only half the 8GB are available to Python, you should be able to
store something of the order of one hundred million patterns easily:

4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional
overhead.)

If you're running into problems with merely millions of patterns, then
you're doing worse by probably two orders of magnitude.

I don't think that the problem lies where you think it does. If you have
a dict with millions of keys, it doesn't matter how many times each
pattern exists in the corpus because the key only exists *once* in the
dict. Duplicate the dict, or generate it from scratch even, and at worse
you double the memory used by the keys. And probably not even that.

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
d[pattern] = d.get(pattern, 0) + 1


Notice that the real killer in the above algorithm is that you need
enough storage, not just for the unique patterns, but for EVERY separate
occurrence of each pattern. Nothing to do with how dicts operate, and
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory.
With 200MB of text, you have 209715198 trigrams in your list. The
pointers alone will take almost a gigabyte, assuming 32-bit pointers.

If this is your approach, interning the strings won't save you. You
almost certainly should change to a lazy approach, and use a generator to
make each pattern as needed, then thrown away:

def make_patterns(s, n=3):
"""Yield n-grams."""
if len(s) <= n:
yield s
else:
for i in range(len(s)-n+1):
yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
for word in line.split():
for pattern in make_patterns(word.strip()):
d[pattern] = d.get(pattern, 0) + 1



Obviously I haven't seen your code and don't actually know what you are
doing. If my 1GB machine can deal with a dict of 17 million unique keys
from a corpus of 240MB with no special memory management, your 8GB
machine should too.



--
Steven.
--
http://mail.python.org/mailman/listinfo/python-list


fanglicheng at gmail

Nov 27, 2007, 5:16 AM

Post #30 of 33 (87 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Nov 27, 10:45 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
> On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > I mentioned trigram counting as an illustrative case. In fact, you'll
> > often need to define patterns more complex than that, and tens of
> > megabytes of text may generate millions of them, and I've observed they
> > quickly ate up the 8G memory of a workstation in a few minutes.
> > Manipulating these patterns can be tricky, you can easily waste a lot of
> > memory without taking extra care. I just thought if I define my pattern
> > class with this 'atom' property, coding efforts could be easier later.
>
> I'm just not getting the same results as you when I try this. I'm finding
> that with no extra effort at all, it just works.
>
> The size of your corpus is not important. Neither is the complexity of
> how you generate the patterns. What's important is the number of patterns
> you produce, and "millions" isn't that huge a number, even without
> interning the strings.
>
> Obviously I'm not running your code, but I can build a dict with millions
> of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
> memory and not run into any serious problems.
>
> I've just processed roughly 240MB of random emails, generating n-grams up
> to length 5. The emails include binary attachments and HTML etc., so I'm
> getting lots of patterns that don't normally exist in natural languages
> (e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
> only 1GB, and that's being shared with about a dozen other apps (including
> such memory hogs as Firefox).
>
> Results? 64939962 patterns found, of which 17031467 are unique. There's
> paging, yes, and my PC runs a little slow when I try to switch from one
> running application to another, but nothing unusable. Opening a dozen
> YouTube videos at once impacts performance worse.
>
> I can't think what you're doing to use up 8GB of RAM for merely
> "millions" of strings, even if you are keeping two, three, ten redundant
> copies. Assuming an average length of twenty bytes per pattern (you said
> trigrams, but I'd rather over-estimate than under), and even assuming
> that only half the 8GB are available to Python, you should be able to
> store something of the order of one hundred million patterns easily:

My task is identifying sequences of tokens (phrases) that are possible
tranlations of each other from a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with n, but approximately with
n^2. (n being the sentence length.) My first version of the program
consumes almost twice as much memory as the current one because I
discovered in doing different statistics I was regenerating the
patterns, and the counting dictionaries ended up with duplicated
pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
can restrict the pattern class to not generate identical instances? So
I can avoid such subtle but significant bugs.

>
> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
>
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)



>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you have
> a dict with millions of keys, it doesn't matter how many times each
> pattern exists in the corpus because the key only exists *once* in the
> dict. Duplicate the dict, or generate it from scratch even, and at worse
> you double the memory used by the keys. And probably not even that.
>
> The only thing I can think of that might explain why you're using so much
> memory is if you are generating *all* the patterns up front, say in a
> list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
> d[pattern] = d.get(pattern, 0) + 1
>

No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice? Is there a way to do that when the dictionary values
are of a primitive type?

> Notice that the real killer in the above algorithm is that you need
> enough storage, not just for the unique patterns, but for EVERY separate
> occurrence of each pattern. Nothing to do with how dicts operate, and
> everything to do with the algorithm you (hypothetically) are using.
>
> If that's what you're doing, then no wonder you're running out of memory.
> With 200MB of text, you have 209715198 trigrams in your list. The
> pointers alone will take almost a gigabyte, assuming 32-bit pointers.
>
> If this is your approach, interning the strings won't save you. You
> almost certainly should change to a lazy approach, and use a generator to
> make each pattern as needed, then thrown away:
>
> def make_patterns(s, n=3):
> """Yield n-grams."""
> if len(s) <= n:
> yield s
> else:
> for i in range(len(s)-n+1):
> yield s[i:i+n]
>
> d = {}
> fp = open('corpus', 'r')
> for line in fp:
> for word in line.split():
> for pattern in make_patterns(word.strip()):
> d[pattern] = d.get(pattern, 0) + 1
>
> Obviously I haven't seen your code and don't actually know what you are
> doing. If my 1GB machine can deal with a dict of 17 million unique keys
> from a corpus of 240MB with no special memory management, your 8GB
> machine should too.
>
> --
> Steven.

--
http://mail.python.org/mailman/listinfo/python-list


arkanes at gmail

Nov 27, 2007, 9:18 AM

Post #31 of 33 (87 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Nov 27, 2007 7:16 AM, Licheng Fang <fanglicheng [at] gmail> wrote:
> On Nov 27, 10:45 am, Steven D'Aprano
>
> <ste...@REMOVE.THIS.cybersource.com.au> wrote:
> > On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > > I mentioned trigram counting as an illustrative case. In fact, you'll
> > > often need to define patterns more complex than that, and tens of
> > > megabytes of text may generate millions of them, and I've observed they
> > > quickly ate up the 8G memory of a workstation in a few minutes.
> > > Manipulating these patterns can be tricky, you can easily waste a lot of
> > > memory without taking extra care. I just thought if I define my pattern
> > > class with this 'atom' property, coding efforts could be easier later.
> >
> > I'm just not getting the same results as you when I try this. I'm finding
> > that with no extra effort at all, it just works.
> >
> > The size of your corpus is not important. Neither is the complexity of
> > how you generate the patterns. What's important is the number of patterns
> > you produce, and "millions" isn't that huge a number, even without
> > interning the strings.
> >
> > Obviously I'm not running your code, but I can build a dict with millions
> > of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
> > memory and not run into any serious problems.
> >
> > I've just processed roughly 240MB of random emails, generating n-grams up
> > to length 5. The emails include binary attachments and HTML etc., so I'm
> > getting lots of patterns that don't normally exist in natural languages
> > (e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
> > only 1GB, and that's being shared with about a dozen other apps (including
> > such memory hogs as Firefox).
> >
> > Results? 64939962 patterns found, of which 17031467 are unique. There's
> > paging, yes, and my PC runs a little slow when I try to switch from one
> > running application to another, but nothing unusable. Opening a dozen
> > YouTube videos at once impacts performance worse.
> >
> > I can't think what you're doing to use up 8GB of RAM for merely
> > "millions" of strings, even if you are keeping two, three, ten redundant
> > copies. Assuming an average length of twenty bytes per pattern (you said
> > trigrams, but I'd rather over-estimate than under), and even assuming
> > that only half the 8GB are available to Python, you should be able to
> > store something of the order of one hundred million patterns easily:
>
> My task is identifying sequences of tokens (phrases) that are possible
> tranlations of each other from a bilingual corpus. I need to check all
> the subsequences of a sentence to get the possible phrase pairs. This
> makes the problem different from n-gram counting in that the number of
> possible phrases doesn't grow linearly with n, but approximately with
> n^2. (n being the sentence length.) My first version of the program
> consumes almost twice as much memory as the current one because I
> discovered in doing different statistics I was regenerating the
> patterns, and the counting dictionaries ended up with duplicated
> pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
> can restrict the pattern class to not generate identical instances? So
> I can avoid such subtle but significant bugs.
>

Implement __hash__ and __eq__ on your pattern class. If the same
pattern compares equal and hashes the same then it will be a "matching
key" as far as the dict is concerned and will only be stored once.
This is probably cheaper than explicit interning anyway (you don't
need to search an intern table).


> > The only thing I can think of that might explain why you're using so much
> > memory is if you are generating *all* the patterns up front, say in a
> > list, before adding them to the dict:
> >
> > # Generate one massive list of patterns containing many duplicates
> > patterns = make_patterns(text)
> > # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> > d = {}
> > for pattern in patterns:
> > d[pattern] = d.get(pattern, 0) + 1
> >
>
> No, I wasn't doing that.
> BTW, do you think the pattern counting code can avoid hashing the
> pattern twice? Is there a way to do that when the dictionary values
> are of a primitive type?
>

Hashing isn't really an expensive operation. On strings it's even
cached on the object. If you implement your own __hash__ method you
can do the same, but I wouldn't bother unless you benchmark it and
discover that hashing is a bottleneck.
--
http://mail.python.org/mailman/listinfo/python-list


zookog at gmail

Dec 3, 2007, 6:51 AM

Post #32 of 33 (83 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Nov 24, 4:44 am, Licheng Fang <fanglich...@gmail.com> wrote:
>
> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. Say, I have a big text file, consisting of
> millions of words, and I want to count the frequency of trigrams:

I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory. Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled. You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id to
the three unigrams. Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects. If you do this multiple times, you get multiple
references to the same integer object for the trigram id.

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure. In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign each to each
unigram an integer, and assuming that you have less than two million
unigrams you can pack three unigrams into a 64-bit integer... Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees. (At the time, PyJudy didn't exist.)

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn
--
http://mail.python.org/mailman/listinfo/python-list


samwyse at gmail

Dec 6, 2007, 12:35 PM

Post #33 of 33 (74 views)
Permalink
Re: How can I create customized classes that have similar properties as 'str'? [In reply to]

On Nov 24, 6:38 pm, Hrvoje Niksic <hnik...@xemacs.org> wrote:
> samwyse <samw...@gmail.com> writes:
> > create a hash that maps your keys to themselves, then use the values
> > of that hash as your keys.
>
> The "atom" function you describe already exists under the name
> "intern".

D'oh! That's what I get for not memorizing "Non-essential Built-in
Functions".

In my defense, however, my function will work with anything that can
be used as a dictionary key (strings, tuples, frozen sets, etc), not
just character strings; thus we return to the original:

>>> a=(1,2,3)
>>> b=(1,1+1,1+1+1)
>>> a == b
True
>>> a is b
False
>>> atom(a) is atom(b)
True
>>> intern(a) is intern(b)
Traceback (most recent call last):
File "<pyshell#33>", line 1, in ?
intern(a) is intern(b)
TypeError: intern() argument 1 must be string, not tuple
--
http://mail.python.org/mailman/listinfo/python-list

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.