
Mailing List Archive: Python: Python

Handling large datastore search

 

 



ahmedbarakat at gmail

Nov 3, 2009, 6:21 AM

Post #1 of 4 (192 views)
Handling large datastore search

Suppose I have a huge datastore (10000 entries, each with about 6
properties). What is the best way to handle searching within such a
datastore? And what if I want to make a generic search, for example
you type a word and I use it to search across all properties of all
entries?

Is converting to XML a good solution, or not?

Sorry for being new to web development and Python.

Thanks in advance.

--
--------------------
Regards,
Ahmed Barakat

http://ahmedbarakat83.blogspot.com/
Even a small step counts


motoom at xs4all

Nov 3, 2009, 6:53 AM

Post #2 of 4 (168 views)
Re: Handling large datastore search [In reply to]

Ahmed Barakat wrote:

> In case I have a huge datastore (10000 entries, each entry has like 6
> properties)

Can you show some sample entries? That way we can get an idea of what
your datastore looks like.

By the way, 10000 doesn't sound like that many. At work I write Python
programs that do data processing and often have this many entries in a
dictionary, often more.
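
To illustrate why 10000 entries is manageable in memory, here is a
minimal sketch of a generic search that scans every property of every
entry. The sample data, the field names, and the search_all() helper
are all invented for illustration; they are not from the original
poster's datastore.

```python
# Hypothetical sample data: field names and values are made up.
entries = [
    {"name": "Ahmed", "city": "Cairo", "notes": "new to Python"},
    {"name": "Mona", "city": "Alexandria", "notes": "visits Cairo often"},
    {"name": "Dave", "city": "Boston", "notes": "edits large text files"},
]


def search_all(entries, word):
    """Return the entries in which any property contains `word`.

    The match is a case-insensitive substring test, so this is a plain
    linear scan: every value of every entry is inspected once.
    """
    needle = word.lower()
    return [
        entry
        for entry in entries
        if any(needle in str(value).lower() for value in entry.values())
    ]


matches = search_all(entries, "cairo")
print([entry["name"] for entry in matches])
```

With 10000 entries of 6 properties each, this is about 60000 short
string comparisons per query, which is usually fast enough on a desktop
machine.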

> Is the conversion to XML a good solution, or it is not?

XML is meant to be an exchange format; it is not designed for storage or
fast retrieval.

Greetings,



--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
--
http://mail.python.org/mailman/listinfo/python-list


davea at ieee

Nov 3, 2009, 3:27 PM

Post #3 of 4 (160 views)
Re: Handling large datastore search [In reply to]

Ahmed Barakat wrote:
> In case I have a huge datastore (10000 entries, each entry has like 6
> properties), what is the best way
> to handle the search within such a huge datastore, and what if I want to
> make a generic search, for example
> you write a word and i use it to search within all properties I have for all
> entries?
>
> Is the conversion to XML a good solution, or it is not?
>
> sorry for being new to web development, and python.
>
> Thanks in advance.
>
>
I don't see anything about your query which is specific to web
development, and there's no need to be apologetic for being new anyway.

One person's "huge" is another person's "pretty large." I'd say 10000
items is pretty small if you're working on the desktop, as you can
readily hold all the data in "memory." I edit text files bigger than
that. But I'll assume your data really is huge, or will grow to be
huge, or is an environment which treats it as huge.

When you're parsing large amounts of data, there are always tradeoffs
between performance and other characteristics, usually size and
complexity. If you have lots of data, you're probably best off using
a standard code system -- a real database. The developers of such
things have decades of experience in making certain things fast,
reliable, and self-consistent.
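
As one possible illustration of the "real database" route, here is a
minimal sketch using SQLite through Python's standard sqlite3 module.
The table layout, column names, sample rows, and helper functions are
assumptions made up for this example, and SQLite is just one convenient
choice of database, not one named in the thread.

```python
import sqlite3

# Hypothetical sample rows: (name, city, notes).
SAMPLE_ROWS = [
    ("Mona", "Cairo", "visits Alexandria often"),
    ("Ahmed", "Cairo", "new to Python"),
    ("Dave", "Boston", "edits large text files"),
]


def build_db(rows):
    """Create an in-memory database with one indexed column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "id INTEGER PRIMARY KEY, name TEXT, city TEXT, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries (name, city, notes) VALUES (?, ?, ?)", rows
    )
    # The index lets exact lookups on `city` avoid a full table scan.
    conn.execute("CREATE INDEX idx_entries_city ON entries (city)")
    conn.commit()
    return conn


def find_by_city(conn, city):
    """Exact-match lookup that can use the index on `city`."""
    cur = conn.execute(
        "SELECT name FROM entries WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in cur.fetchall()]


def search_any(conn, word):
    """Generic substring search over all text columns.

    A leading-wildcard LIKE cannot use the index, so this scans the table.
    """
    pattern = "%" + word + "%"
    cur = conn.execute(
        "SELECT name FROM entries "
        "WHERE name LIKE ? OR city LIKE ? OR notes LIKE ? ORDER BY name",
        (pattern, pattern, pattern),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_db(SAMPLE_ROWS)
print(find_by_city(conn, "Cairo"))
print(search_any(conn, "python"))
```

The indexed lookup and the generic search show the tradeoff directly:
the first is fast because the database was told which column matters,
while the second has to look at every row.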

But considering only speed here, I have to point out that you have to
understand databases, and your particular model of database, pretty well
to really benefit from all the performance tricks in there. Keeping it
abstract, you specify what parts of the data you care about fast random
access to. If you want fast search access to "all" of it, your database
will generally be huge, and very slow to update. And the best way to
avoid that is to pick a database mechanism that best fits your search
mechanism. I hate to think how many man-centuries Google has dedicated
to getting fast random word access to its *enormous* database. I'm sure
they did not build on a standard relational model.

If you plan to do it yourself, I'd say the last thing you want to do is
use XML. XML may be a convenient way to store self-describing data, but
it's not quick to parse large amounts of it. Instead, store the raw
data in text form, with separate index files describing what is where.
Anything that's indexed will be found rapidly, while anything that isn't
will require search of the raw data.

There are algorithms for searching raw data that are faster than
scanning every byte, but a relevant index will almost always be faster.
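
A minimal sketch of the index idea, kept in memory rather than in
separate index files: build a map from each word to the positions of
the entries that contain it, so a word lookup does not have to scan the
raw data. The sample entries, field names, and the build_index() and
lookup() helpers are hypothetical, and splitting on \w+ is just one
simple way to define a "word".

```python
import re
from collections import defaultdict

# Hypothetical sample data: field names and values are made up.
entries = [
    {"name": "Ahmed", "city": "Cairo", "notes": "new to Python"},
    {"name": "Mona", "city": "Alexandria", "notes": "visits Cairo often"},
    {"name": "Dave", "city": "Boston", "notes": "edits large text files"},
]


def build_index(entries):
    """Map each lowercased word to the set of entry positions containing it."""
    index = defaultdict(set)
    for position, entry in enumerate(entries):
        for value in entry.values():
            for word in re.findall(r"\w+", str(value).lower()):
                index[word].add(position)
    return index


def lookup(index, entries, word):
    """Return entries containing `word` as a whole word, in original order."""
    positions = index.get(word.lower(), ())
    return [entries[i] for i in sorted(positions)]


index = build_index(entries)
print([entry["name"] for entry in lookup(index, entries, "Cairo")])
```

Note the tradeoff this makes: lookups are a single dictionary access,
but only whole words are indexed, so a partial word such as "cair"
finds nothing, and the index must be rebuilt or updated whenever the
data changes.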

DaveA



davea at ieee

Nov 4, 2009, 4:13 AM

Post #4 of 4 (153 views)
Re: Handling large datastore search [In reply to]

(This reply was offline, but I forwarded parts so that others with
Google App Engine experience might jump in)

Ahmed Barakat wrote:
> <snip...>
> .... but I was trying to make use of everything provided by App engine.
>
> <snip>
>
Clearly, you left a few things out of your original query. Now that you
mention App Engine, I'm guessing you meant Google's Datastore and that
this whole query is about building a Google app. So many of my comments
don't apply, because I was talking about a desktop environment, using
(or not using) a traditional (relational) database.

I've never used Google's AppEngine, and never done a database web-app.
So I'm the wrong person to give more than general advice.

Google's Datastore is apparently both more and less powerful than a
relational database, and web apps have very different tradeoffs. So you
need to respecify the problem, giving the full requirements first, and
maybe somebody with more relevant experience will then respond.


Meanwhile, try the following links, and see if any of them help.

http://code.google.com/appengine/docs/
http://code.google.com/appengine/docs/whatisgoogleappengine.html
http://code.google.com/appengine/docs/datastore/
http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
http://snarfed.org/space/datastore_talk.html
http://video.google.com/videosearch?q=app+engine+data+store&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&um=1&ie=UTF-8&ei=uGjxSrX-HcPp8Qbb2uWACQ&sa=X&oi=video_result_group&ct=title&resnum=11&ved=0CDIQqwQwCg

DaveA
