
Mailing List Archive: Python

key/value store optimized for disk storage

 

 



showell30 at yahoo

May 2, 2012, 7:14 PM

Post #1 of 34
key/value store optimized for disk storage

This is slightly off topic, but I'm hoping folks can point me in the
right direction.

I'm looking for a fairly lightweight key/value store that works for
this type of problem:

ideally plays nice with the Python ecosystem
the data set is static, and written infrequently enough that I
definitely want *read* performance to trump all
there is too much data to keep it all in memory (so no memcache)
users will access keys with fairly uniform, random probability
the key/value pairs are fairly homogenous in nature:
keys are <= 16 chars
values are between 1k and 4k bytes generally
approx 3 million key/value pairs
total amount of data == 6Gb
needs to work on relatively recent versions of FreeBSD and Linux

My current solution works like this:

keys are file paths
directories are 2 levels deep (30 dirs w/100k files each)
values are file contents
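For reference, a sketch of how such a layout might be computed; the post doesn't give the exact scheme, so key_to_path, the crc32 bucketing, and the directory naming here are invented for illustration:

```python
import os
import zlib

NUM_DIRS = 30  # matches the 30 top-level dirs described above

def key_to_path(root, key):
    # Stable hash so a key always lands in the same directory
    # (hypothetical scheme, not necessarily the poster's actual one).
    bucket = zlib.crc32(key.encode("ascii")) % NUM_DIRS
    return os.path.join(root, "%02d" % bucket, key)

def read_value(root, key):
    # Values are file contents, so a lookup is just an open/read.
    with open(key_to_path(root, key), "rb") as f:
        return f.read()
```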

The current solution isn't horrible, but I'm trying to squeeze a little
performance/robustness out of it. A minor nuisance is that I waste a
fair amount of disk space, since the values are generally less than 4k
in size. A larger concern is that I'm not convinced that file systems
are optimized for dealing with lots of little files in a shallow
directory structure.

To deal with the latter issue, a minor refinement would be to deepen
the directory structure, but I want to do due diligence on other
options first.

I'm looking for something a little lighter than a full-on database
(either SQL or no-SQL), although I'm not completely ruling out any
alternatives yet.

As I mention up top, I'm mostly hoping folks can point me toward
sources they trust, whether it be other mailing lists, good tools,
etc. To the extent that this is on topic and folks don't mind
discussing this here, I'm happy to follow up on any questions.

Thanks,

Steve

P.S. I've already found some good information via Google, but there's
a lot of noise out there.
--
http://mail.python.org/mailman/listinfo/python-list


no.email at nospam

May 2, 2012, 7:46 PM

Post #2 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> keys are file paths
> directories are 2 levels deep (30 dirs w/100k files each)
> values are file contents

> The current solution isn't horrible,

Yes it is ;-)

> As I mention up top, I'm mostly hoping folks can point me toward
> sources they trust, whether it be other mailing lists, good tools,

cdb sounds reasonable for your purposes. I'm sure there are python
bindings for it.

http://cr.yp.to/cdb.html mentions a 4gb limit (2**32) but I
half-remember something about a 64 bit version.


wrw at mac

May 2, 2012, 7:50 PM

Post #3 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 2, 2012, at 10:14 PM, Steve Howell wrote:

> This is slightly off topic, but I'm hoping folks can point me in the
> right direction.
>
> I'm looking for a fairly lightweight key/value store that works for
> this type of problem:
>
> ideally plays nice with the Python ecosystem
> the data set is static, and written infrequently enough that I
> definitely want *read* performance to trump all
> there is too much data to keep it all in memory (so no memcache)
> users will access keys with fairly uniform, random probability
> the key/value pairs are fairly homogenous in nature:
> keys are <= 16 chars
> values are between 1k and 4k bytes generally
> approx 3 million key/value pairs
> total amount of data == 6Gb
> needs to work on relatively recent versions of FreeBSD and Linux
>


I don't understand that statement, and I don't think I agree with it. I'm writing this e-mail on a system with 16 GB of memory. It is just under 3 years old, and if it were purchased today, it would have 32 GB of memory at essentially the same price.

I think you might want to re-examine your thinking about the limits of a memory-based system. It could probably be completely written in python and still have acceptable performance.

-Bill



tjreedy at udel

May 2, 2012, 8:00 PM

Post #4 of 34
Re: key/value store optimized for disk storage [In reply to]

On 5/2/2012 10:14 PM, Steve Howell wrote:
> This is slightly off topic, but I'm hoping folks can point me in the
> right direction.
>
> I'm looking for a fairly lightweight key/value store that works for
> this type of problem:
>
> ideally plays nice with the Python ecosystem
> the data set is static, and written infrequently enough that I
> definitely want *read* performance to trump all
> there is too much data to keep it all in memory (so no memcache)
> users will access keys with fairly uniform, random probability
> the key/value pairs are fairly homogenous in nature:
> keys are<= 16 chars
> values are between 1k and 4k bytes generally
> approx 3 million key/value pairs
> total amount of data == 6Gb
> needs to work on relatively recent versions of FreeBSD and Linux

On my 64bit machine with 64 bit Python, I would consider putting all the
values in one data file and creating a key:file-offset dict. Each value
would start with length(value) so that is not needed in memory. The
dict, once created, could be pickled and unpickled for each run.
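A minimal sketch of that idea; the file names, the 4-byte length prefix, and the function names below are assumptions, not Terry's:

```python
import pickle
import struct

def build_store(records, data_path, index_path):
    """records: iterable of (key, value) pairs, value as bytes."""
    index = {}
    with open(data_path, "wb") as f:
        for key, value in records:
            index[key] = f.tell()
            # Length prefix lives on disk, so only the offset stays in memory.
            f.write(struct.pack(">I", len(value)))
            f.write(value)
    with open(index_path, "wb") as f:
        pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)

def fetch(data_file, index, key):
    # One seek plus two reads per lookup.
    data_file.seek(index[key])
    (n,) = struct.unpack(">I", data_file.read(4))
    return data_file.read(n)
```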

--
Terry Jan Reedy



python.list at tim

May 2, 2012, 8:03 PM

Post #5 of 34
Re: key/value store optimized for disk storage [In reply to]

On 05/02/12 21:14, Steve Howell wrote:
> I'm looking for a fairly lightweight key/value store that works for
> this type of problem:
>
> ideally plays nice with the Python ecosystem
> the data set is static, and written infrequently enough that I
> definitely want *read* performance to trump all
> there is too much data to keep it all in memory (so no memcache)
> users will access keys with fairly uniform, random probability
> the key/value pairs are fairly homogenous in nature:
> keys are <= 16 chars
> values are between 1k and 4k bytes generally
> approx 3 million key/value pairs
> total amount of data == 6Gb
> needs to work on relatively recent versions of FreeBSD and Linux
[snip]
> I'm looking for something a little lighter than a full-on database
> (either SQL or no-SQL), although I'm not completely ruling out any
> alternatives yet.

This sounds suspiciously like the standard library's anydbm
module[1], which should handle everything you list. I can't be
positive about the 6Gb bit, though I expect it should be fine; I
can't find any documentation on the limits.
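A quick sketch of the suggestion; note that in Python 3 the anydbm module was renamed to dbm, which is what the code below uses, and the file path is made up:

```python
import dbm  # Python 3 name for the old anydbm module

def demo(path):
    # "c" creates the database if it doesn't exist yet.
    with dbm.open(path, "c") as db:
        db[b"some-key"] = b"some 1k-4k value"
    # Reopen read-only for the lookup-heavy workload described above.
    with dbm.open(path, "r") as db:
        return db[b"some-key"]
```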

-tkc

[1]
http://docs.python.org/library/anydbm.html





showell30 at yahoo

May 2, 2012, 8:20 PM

Post #6 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 2, 7:46 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> >   keys are file paths
> >   directories are 2 levels deep (30 dirs w/100k files each)
> >   values are file contents
> > The current solution isn't horrible,
>
> Yes it is ;-)
> > As I mention up top, I'm mostly hoping folks can point me toward
> > sources they trust, whether it be other mailing lists, good tools,
>
> cdb sounds reasonable for your purposes.  I'm sure there are python
> bindings for it.
>
> http://cr.yp.to/cdb.html mentions a 4gb limit (2**32) but I
> half-remember something about a 64 bit version.

Thanks. That's definitely in the spirit of what I'm looking for,
although the non-64 bit version is obviously geared toward a slightly
smaller data set. My reading of cdb is that it has essentially 64k
hash buckets, so for 3 million keys, you're still scanning through an
average of 45 records per read, which is about 90k of data for my
record size. That seems actually inferior to a btree-based file
system, unless I'm missing something.

I did find this as follow up to your lead:

http://thomas.mangin.com/data/source/cdb.py

Unfortunately, it looks like you have to first build the whole thing
in memory.






no.email at nospam

May 2, 2012, 8:29 PM

Post #7 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> Thanks. That's definitely in the spirit of what I'm looking for,
> although the non-64 bit version is obviously geared toward a slightly
> smaller data set. My reading of cdb is that it has essentially 64k
> hash buckets, so for 3 million keys, you're still scanning through an
> average of 45 records per read, which is about 90k of data for my
> record size. That seems actually inferior to a btree-based file
> system, unless I'm missing something.

1) presumably you can use more buckets in a 64 bit version; 2) scanning
90k probably still takes far less time than a disk seek, even a "seek"
(several microseconds in practice) with a solid state disk.

> http://thomas.mangin.com/data/source/cdb.py
> Unfortunately, it looks like you have to first build the whole thing
> in memory.

It's probably fixable, but I'd guess you could just use Bernstein's
cdbdump program instead.

Alternatively maybe you could use one of the *dbm libraries,
which burn a little more disk space, but support online update.


showell30 at yahoo

May 2, 2012, 9:08 PM

Post #8 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 2, 8:29 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > Thanks.  That's definitely in the spirit of what I'm looking for,
> > although the non-64 bit version is obviously geared toward a slightly
> > smaller data set.  My reading of cdb is that it has essentially 64k
> > hash buckets, so for 3 million keys, you're still scanning through an
> > average of 45 records per read, which is about 90k of data for my
> > record size.  That seems actually inferior to a btree-based file
> > system, unless I'm missing something.
>
> 1) presumably you can use more buckets in a 64 bit version; 2) scanning
> 90k probably still takes far less time than a disk seek, even a "seek"
> (several microseconds in practice) with a solid state disk.
>

Doesn't cdb do at least one disk seek as well? In the diagram on this
page, it seems you would need to do a seek based on the value of the
initial pointer (from the 256 possible values):

http://cr.yp.to/cdb/cdb.txt

> >http://thomas.mangin.com/data/source/cdb.py
> > Unfortunately, it looks like you have to first build the whole thing
> > in memory.
>
> It's probably fixable, but I'd guess you could just use Bernstein's
> cdbdump program instead.
>
> Alternatively maybe you could use one of the *dbm libraries,
> which burn a little more disk space, but support online update.

Yup, I don't think I want to incur the extra overhead. Do you have
any first hand experience pushing dbm to the scale of 6Gb or so? My
take on dbm is that its niche is more in the 10,000-record range.


no.email at nospam

May 2, 2012, 11:33 PM

Post #9 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> Doesn't cdb do at least one disk seek as well? In the diagram on this
> page, it seems you would need to do a seek based on the value of the
> initial pointer (from the 256 possible values):

Yes, of course it has to seek if there is too much data to fit in
memory. All I'm saying is that if you're spending milliseconds on the
seek, that may dominate the lookup time even if you scan the 90k.

Actually, looking at the spec more closely, there are 256 hash tables in
the file, but each table can be of any size. So there can be far more
than 2**16 hash slots. Uncharacteristically for Bernstein, the document
is pretty unclear, so maybe you have to look at the code to be sure of
the layout. Note also that empty hash buckets have just 2 words of
overhead. So with 3M keys and 75% load factor, you get 4M buckets and
relatively few collisions. The extra 1M buckets in a 64 bit
implementation is just 16MB in the file, which isn't much at all even
considering that you want it to stay resident in memory to avoid some
seeks, assuming you're on a PC and not some smaller device like a phone.
(A phone will have a solid state disk eliminating most seek time, so
you're ok in that situation too).

> Yup, I don't think I want to incur the extra overhead. Do you have
> any first hand experience pushing dbm to the scale of 6Gb or so? My
> take on dbm is that its niche is more in the 10,000-record range.

There are a bunch of different variants. I'm trying to remember what
I've personally done with it and I'm sure I've used much more than 10k
records, but maybe not millions. Unix dbm was originally designed to
handle millions of records back when that was a lot. I'd expect gdbm,
bsddb and so forth can handle it easily. The naive Python dbm module
might not be able to.

The amount of data you're talking about (a few million keys, a few gb of
data) is fairly modest by today's standards, so I would think fancy
methods aren't really needed.


no.email at nospam

May 2, 2012, 11:48 PM

Post #10 of 34
Re: key/value store optimized for disk storage [In reply to]

Paul Rubin <no.email [at] nospam> writes:
>looking at the spec more closely, there are 256 hash tables.. ...

You know, there is a much simpler way to do this, if you can afford to
use a few hundred MB of memory and you don't mind some load time when
the program first starts. Just dump all the data sequentially into a
file. Then scan through the file, building up a Python dictionary
mapping data keys to byte offsets in the file (this is a few hundred MB
if you have 3M keys). Then dump the dictionary as a Python pickle and
read it back in when you start the program.

You may want to turn off the cyclic garbage collector when building or
loading the dictionary, as it can badly slow down the construction of
big lists and maybe dicts (I'm not sure of the latter).
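A hedged sketch of that load step with the collector disabled; load_index and the file name are invented here:

```python
import gc
import pickle

def load_index(path):
    # Disabling the cyclic GC while unpickling a huge dict avoids
    # repeated collector passes over the growing object graph.
    gc.disable()
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        gc.enable()
```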


showell30 at yahoo

May 3, 2012, 1:42 AM

Post #11 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 2, 11:48 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Paul Rubin <no.em...@nospam.invalid> writes:
> >looking at the spec more closely, there are 256 hash tables.. ...
>
> You know, there is a much simpler way to do this, if you can afford to
> use a few hundred MB of memory and you don't mind some load time when
> the program first starts.  Just dump all the data sequentially into a
> file.  Then scan through the file, building up a Python dictionary
> mapping data keys to byte offsets in the file (this is a few hundred MB
> if you have 3M keys).  Then dump the dictionary as a Python pickle and
> read it back in when you start the program.
>
> You may want to turn off the cyclic garbage collector when building or
> loading the dictionary, as it badly can slow down the construction of
> big lists and maybe dicts (I'm not sure of the latter).

I'm starting to lean toward the file-offset/seek approach. I am
writing some benchmarks on it, comparing it to a more file-system
based approach like I mentioned in my original post. I'll report back
when I get results, but it's already way past my bedtime for tonight.

Thanks for all your help and suggestions.


kiuhnm03.4t.yahoo.it at mail

May 3, 2012, 4:46 AM

Post #12 of 34
Re: key/value store optimized for disk storage [In reply to]

On 5/3/2012 10:42, Steve Howell wrote:
> On May 2, 11:48 pm, Paul Rubin<no.em...@nospam.invalid> wrote:
>> Paul Rubin<no.em...@nospam.invalid> writes:
>>> looking at the spec more closely, there are 256 hash tables.. ...
>>
>> You know, there is a much simpler way to do this, if you can afford to
>> use a few hundred MB of memory and you don't mind some load time when
>> the program first starts. Just dump all the data sequentially into a
>> file. Then scan through the file, building up a Python dictionary
>> mapping data keys to byte offsets in the file (this is a few hundred MB
>> if you have 3M keys). Then dump the dictionary as a Python pickle and
>> read it back in when you start the program.
>>
>> You may want to turn off the cyclic garbage collector when building or
>> loading the dictionary, as it badly can slow down the construction of
>> big lists and maybe dicts (I'm not sure of the latter).
>
> I'm starting to lean toward the file-offset/seek approach. I am
> writing some benchmarks on it, comparing it to a more file-system
> based approach like I mentioned in my original post. I'll report back
> when I get results, but it's already way past my bedtime for tonight.
>
> Thanks for all your help and suggestions.

You should really cache the accesses to that file, in the hope that the
accesses are not as random as you think. If that's the case, you should
notice a *huge* improvement.

Kiuhnm


miki.tebeka at gmail

May 3, 2012, 6:10 PM

Post #13 of 34
Re: key/value store optimized for disk storage [In reply to]

> I'm looking for a fairly lightweight key/value store that works for
> this type of problem:
I'd start with a benchmark and try some of the things that are already in the standard library:
- bsddb
- sqlite3 (table of key, value, index key)
- shelve (though I doubt this one)

You might find that for a little effort you get enough out of one of these.

Another module, not in the standard library, is hdf5/PyTables, which in my experience is very fast.

HTH,
--
Miki


showell30 at yahoo

May 3, 2012, 8:12 PM

Post #14 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 3, 1:42 am, Steve Howell <showel...@yahoo.com> wrote:
> On May 2, 11:48 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
>
> > Paul Rubin <no.em...@nospam.invalid> writes:
> > >looking at the spec more closely, there are 256 hash tables.. ...
>
> > You know, there is a much simpler way to do this, if you can afford to
> > use a few hundred MB of memory and you don't mind some load time when
> > the program first starts.  Just dump all the data sequentially into a
> > file.  Then scan through the file, building up a Python dictionary
> > mapping data keys to byte offsets in the file (this is a few hundred MB
> > if you have 3M keys).  Then dump the dictionary as a Python pickle and
> > read it back in when you start the program.
>
> > You may want to turn off the cyclic garbage collector when building or
> > loading the dictionary, as it badly can slow down the construction of
> > big lists and maybe dicts (I'm not sure of the latter).
>
> I'm starting to lean toward the file-offset/seek approach.  I am
> writing some benchmarks on it, comparing it to a more file-system
> based approach like I mentioned in my original post.  I'll report back
> when I get results, but it's already way past my bedtime for tonight.
>
> Thanks for all your help and suggestions.


I ended up going with the approach that Paul suggested (except I used
JSON instead of pickle for persisting the hash). I like it for its
simplicity and ease of troubleshooting.

My test was to write roughly 4GB of data, with 2 million keys of 2k
bytes each.

The nicest thing was how quickly I was able to write the file.
Writing tons of small files bogs down the file system, whereas the
one-big-file approach finishes in under three minutes.

Here's the code I used for testing:

https://github.com/showell/KeyValue/blob/master/test_key_value.py

Here are the results:

~/WORKSPACE/KeyValue > ls -l values.txt hash.txt
-rw-r--r-- 1 steve staff 44334161 May 3 18:53 hash.txt
-rw-r--r-- 1 steve staff 4006000000 May 3 18:53 values.txt

2000000 out of 2000000 records yielded (2k each)
Begin READING test
num trials 100000
time spent 39.8048191071
avg delay 0.000398048191071

real 2m46.887s
user 1m35.232s
sys 0m19.723s


no.email at nospam

May 3, 2012, 9:38 PM

Post #15 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> My test was to write roughly 4GB of data, with 2 million keys of 2k
> bytes each.

If the records are something like english text, you can compress
them with zlib and get some compression gain by pre-initializing
a zlib dictionary from a fixed english corpus, then cloning it.
That is, if your messages are a couple paragraphs, you might
say something like:

iv = (some fixed 20k or so of records concatenated together)
compressor = zlib(iv).clone() # I forget what this
# operation is actually called

# I forget what this is called too, but the idea is you throw
# away the output of compressing the fixed text, and sync
# to a byte boundary
compressor.sync()

zout = compressor.compress(your_record).sync()
...

i.e. the part you save in the file is just the difference
between compress(corpus) and compress(corpus_record). To
decompress, you initialize a compressor the same way, etc.

It's been a while since I used that trick but for json records of a few
hundred bytes, I remember getting around 2:1 compression, while starting
with an unprepared compressor gave almost no compression.


showell30 at yahoo

May 3, 2012, 10:09 PM

Post #16 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 3, 9:38 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > My test was to write roughly 4GB of data, with 2 million keys of 2k
> > bytes each.
>
> If the records are something like english text, you can compress
> them with zlib and get some compression gain by pre-initializing
> a zlib dictionary from a fixed english corpus, then cloning it.
> That is, if your messages are a couple paragraphs, you might
> say something like:
>
>   iv = (some fixed 20k or so of records concatenated together)
>   compressor = zlib(iv).clone()  # I forget what this
>                                  # operation is actually called
>
>   # I forget what this is called too, but the idea is you throw
>   # away the output of compressing the fixed text, and sync
>   # to a byte boundary
>   compressor.sync()
>
>   zout = compressor.compress(your_record).sync()
>   ...
>
> i.e. the part you save in the file is just the difference
> between compress(corpus) and compress(corpus_record).  To
> decompress, you initialize a compressor the same way, etc.
>
> It's been a while since I used that trick but for json records of a few
> hundred bytes, I remember getting around 2:1 compression, while starting
> with an unprepared compressor gave almost no compression.

Sounds like a useful technique. The text snippets that I'm
compressing are indeed mostly English words, and 7-bit ascii, so it
would be practical to use a compression library that just uses the
same good-enough encodings every time, so that you don't have to write
the encoding dictionary as part of every small payload.

Sort of as you suggest, you could build a Huffman encoding for a
representative run of data, save that tree off somewhere, and then use
it for all your future encoding/decoding.

Is there a name to describe this technique?






no.email at nospam

May 3, 2012, 11:03 PM

Post #17 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> Sounds like a useful technique. The text snippets that I'm
> compressing are indeed mostly English words, and 7-bit ascii, so it
> would be practical to use a compression library that just uses the
> same good-enough encodings every time, so that you don't have to write
> the encoding dictionary as part of every small payload.

Zlib stays adaptive, the idea is just to start with some ready-made
compression state that reflects the statistics of your data.

> Sort of as you suggest, you could build a Huffman encoding for a
> representative run of data, save that tree off somewhere, and then use
> it for all your future encoding/decoding.

Zlib is better than Huffman in my experience, and Python's zlib module
already has the right entry points. Looking at the docs,
Compress.flush(Z_SYNC_FLUSH) is the important one. I did something like
this before and it was around 20 lines of code. I don't have it around
any more but maybe I can write something else like it sometime.

> Is there a name to describe this technique?

Incremental compression maybe?
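For what it's worth, later Python versions (3.3+) expose this idea directly as zlib's preset-dictionary (zdict) support; a sketch, with the corpus contents invented for illustration:

```python
import zlib

# Preset dictionary primed with text resembling the records
# (a real application would use a slice of actual records).
ZDICT = b"the quick brown fox jumps over the lazy dog " * 100

def compress_record(record):
    comp = zlib.compressobj(zdict=ZDICT)
    return comp.compress(record) + comp.flush()

def decompress_record(blob):
    decomp = zlib.decompressobj(zdict=ZDICT)
    return decomp.decompress(blob) + decomp.flush()
```

Each record is compressed independently, but as if the dictionary text had already passed through the compressor.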


showell30 at yahoo

May 3, 2012, 11:29 PM

Post #18 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 3, 11:03 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > Sounds like a useful technique.  The text snippets that I'm
> > compressing are indeed mostly English words, and 7-bit ascii, so it
> > would be practical to use a compression library that just uses the
> > same good-enough encodings every time, so that you don't have to write
> > the encoding dictionary as part of every small payload.
>
> Zlib stays adaptive, the idea is just to start with some ready-made
> compression state that reflects the statistics of your data.
>
> > Sort of as you suggest, you could build a Huffman encoding for a
> > representative run of data, save that tree off somewhere, and then use
> > it for all your future encoding/decoding.
>
> Zlib is better than Huffman in my experience, and Python's zlib module
> already has the right entry points.  Looking at the docs,
> Compress.flush(Z_SYNC_FLUSH) is the important one.  I did something like
> this before and it was around 20 lines of code.  I don't have it around
> any more but maybe I can write something else like it sometime.
>
> > Is there a name to describe this technique?
>
> Incremental compression maybe?

Many thanks, this is getting me on the right path:

import zlib

compressor = zlib.compressobj()
s = compressor.compress("foobar")
s += compressor.flush(zlib.Z_SYNC_FLUSH)

s_start = s
compressor2 = compressor.copy()

s += compressor.compress("baz")
s += compressor.flush(zlib.Z_FINISH)
print zlib.decompress(s)

s = s_start
s += compressor2.compress("spam")
s += compressor2.flush(zlib.Z_FINISH)
print zlib.decompress(s)


drsalists at gmail

May 3, 2012, 11:58 PM

Post #19 of 34
Re: key/value store optimized for disk storage [In reply to]

On Thu, May 3, 2012 at 11:03 PM, Paul Rubin <no.email [at] nospam> wrote:

> > Sort of as you suggest, you could build a Huffman encoding for a
> > representative run of data, save that tree off somewhere, and then use
> > it for all your future encoding/decoding.
>
> Zlib is better than Huffman in my experience, and Python's zlib module
> already has the right entry points.
>
Isn't zlib kind of dated? Granted, it's newer than Huffman, but there's
been bzip2 and xz since then, among numerous others.

Here's something for xz:
http://stromberg.dnsalias.org/svn/xz_mod/trunk/
An xz module is in the CPython 3.3 alphas - the above module wraps it if
available; otherwise it uses ctypes or a pipe to an xz binary.

And I believe bzip2 is in the standard library for most versions of CPython.


no.email at nospam

May 3, 2012, 11:59 PM

Post #20 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> compressor = zlib.compressobj()
> s = compressor.compress("foobar")
> s += compressor.flush(zlib.Z_SYNC_FLUSH)
>
> s_start = s
> compressor2 = compressor.copy()

I think you also want to make a decompressor here, and initialize it
with s and then clone it. Then you don't have to reinitialize every
time you want to decompress something.

I also seem to remember that the first few bytes of compressed output
are always some fixed string or checksum, that you can strip out after
compression and put back before decompression, giving further savings in
output size when you have millions of records.


showell30 at yahoo

May 4, 2012, 12:14 AM

Post #21 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 3, 11:59 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> >     compressor = zlib.compressobj()
> >     s = compressor.compress("foobar")
> >     s += compressor.flush(zlib.Z_SYNC_FLUSH)
>
> >     s_start = s
> >     compressor2 = compressor.copy()
>
> I think you also want to make a decompressor here, and initialize it
> with s and then clone it.  Then you don't have to reinitialize every
> time you want to decompress something.

Makes sense. I believe I got that part correct:

https://github.com/showell/KeyValue/blob/master/salted_compressor.py

> I also seem to remember that the first few bytes of compressed output
> are always some fixed string or checksum, that you can strip out after
> compression and put back before decompression, giving further savings in
> output size when you have millions of records.

I'm pretty sure this happens for free as long as the salt is large
enough, but maybe I'm misunderstanding.


no.email at nospam

May 4, 2012, 1:01 AM

Post #22 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
> Makes sense. I believe I got that part correct:
>
> https://github.com/showell/KeyValue/blob/master/salted_compressor.py

The API looks nice, but your compress method makes no sense. Why do you
include s.prefix in s and then strip it off? Why do you save the prefix
and salt in the instance, and have self.salt2 and s[len(self.salt):]
in the decompress? You should be able to just get the incremental bit.

> I'm pretty sure this happens for free as long as the salt is large
> enough, but maybe I'm misunderstanding.

No I mean there is some fixed overhead (a few bytes) in the compressor
output, to identify it as such. That's fine when the input and output
are both large, but when there's a huge number of small compressed
strings, it adds up.


showell30 at yahoo

May 4, 2012, 1:09 AM

Post #23 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 4, 1:01 am, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > Makes sense.  I believe I got that part correct:
>
> >  https://github.com/showell/KeyValue/blob/master/salted_compressor.py
>
> The API looks nice, but your compress method makes no sense.  Why do you
> include s.prefix in s and then strip it off?  Why do you save the prefix
> and salt in the instance, and have self.salt2 and s[len(self.salt):]
> in the decompress?  You should be able to just get the incremental bit.

This is fixed now.

https://github.com/showell/KeyValue/commit/1eb316d6e9e44a37ab4f3ca73fcaf4ec0e7f22b4#salted_compressor.py


> > I'm pretty sure this happens for free as long as the salt is large
> > enough, but maybe I'm misunderstanding.
>
> No I mean there is some fixed overhead (a few bytes) in the compressor
> output, to identify it as such.  That's fine when the input and output
> are both large, but when there's a huge number of small compressed
> strings, it adds up.

If it's in the header, wouldn't it be part of the output that comes
before Z_SYNC_FLUSH?





no.email at nospam

May 4, 2012, 1:58 AM

Post #24 of 34
Re: key/value store optimized for disk storage [In reply to]

Steve Howell <showell30 [at] yahoo> writes:
>> You should be able to just get the incremental bit.
> This is fixed now.

Nice.

> If it's in the header, wouldn't it be part of the output that comes
> before Z_SYNC_FLUSH?

Hmm, maybe you are right. My version was several years ago and I don't
remember it well, but I half-remember spending some time diddling around
with this issue.


showell30 at yahoo

May 4, 2012, 8:27 AM

Post #25 of 34
Re: key/value store optimized for disk storage [In reply to]

On May 3, 6:10 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> > I'm looking for a fairly lightweight key/value store that works for
> > this type of problem:
>
> I'd start with a benchmark and try some of the things that are already in the standard library:
> - bsddb
> - sqlite3 (table of key, value, index key)
> - shelve (though I doubt this one)
>

Thanks. I think I'm ruling out bsddb, since it's recently deprecated:

http://www.gossamer-threads.com/lists/python/python/106494

I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
behind a hash interface already? I know it's simple to do
conceptually, but there are some minor details to work out for large
amounts of data (like creating the index after all the inserts), so if
somebody's already tackled this, it would be useful to see their
code.

> You might find that for a little effort you get enough out of one of these.
>
> Another module which is not in the standard library is hdf5/PyTables and in my experience very fast.

Thanks.
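Not a full answer, but a minimal sketch of such a wrapper; the table and index names are made up, and the index is created after the bulk insert as discussed above:

```python
import sqlite3

class SqliteKV:
    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT, value BLOB)")

    def load(self, pairs):
        # Bulk-insert first, then build the index, which is faster
        # for a static data set than maintaining it per insert.
        self.conn.executemany("INSERT INTO kv VALUES (?, ?)", pairs)
        self.conn.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS kv_key ON kv (key)")
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]
```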
