
nate at verse
Jun 14, 2007, 5:52 PM
Post #7 of 11
On 6/14/07, Hans Dieter Pearcey <hdp [at] pobox> wrote:
> On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
> > >My app primarily differs in that I was planning on having many
> > >invindexes, two
> > >or three per user, so opening them all at program start would
> > >probably be
> > >inefficient (there are several hundred of them).
> >
> > OK. With that architecture, you'll need to factor in the time it
> > takes to begin reading from any one of those invindexes.
>
> It may be a stupid architecture; I'm not really very experienced with
> invindexes. I want to index about 250G of email, which seems like a lot to me,
> so I'm assuming that partitions will be useful (since each user only searches
> their own email). Am I prematurely optimizing?

Hi Hans ---

I've been thinking about some similar architectural issues, and while I
don't have any experience with corpus sizes as large as the one you are
dealing with, I thought I'd jump in.

First, your architecture sounds reasonable to me: if searches are never
going to cross indexes, keeping them separate for each user seems like a
reasonable idea. Yes, initializing each Searcher object will be
expensive, but I think the smaller size of each index is going to offset
this. Starting with this architecture strikes me as good forethought,
and not premature.

Worrying about caching hot Searcher objects to those indexes does strike
me as premature, or possibly misguided. The thing that takes the most
time (I'm guessing) is reading the index from the disk, thus caching the
object to disk isn't going to help you a lot. To get a real advantage,
you are going to need it hanging around in RAM, and given the size of
your corpus this is going to require finesse.

Presuming you are running Linux, most extra RAM on the system will be
used to cache recently read files so that they can be read from
relatively fast memory rather than waiting for the relatively very slow
disk.
The more you cache big objects, the less space is available for the
system to cache files. It's a trade: if you know you are going to reuse
the object, it's a win, but if you don't, you are probably better off
letting the system do its thing. I'd wait and measure.

If disk IO does turn out to be a bottleneck (and it will with heavy
enough usage) the easiest solution may be to partition the search off to
separate machines, each handling only a subset of your users. Rather
than thinking about caching Searcher objects within the FastCGI process,
you could prepare for this eventuality by running your search in an
external server process, either on the same machine or another. This
process could then cache Searchers for the indexes of the most recent
users and use the appropriate one for the search.

Alternatively, you could cache a small number of Searcher objects in
each FastCGI process, and then come up with a way of preferentially
directing users to the same process they used on the previous request.
Historically, there have been some affinity patches for mod_fastcgi that
did this, but I don't know if they have been updated.

But in general, I don't think there is going to be any good way for
multiple processes or threads to share a single Searcher object. I'd
start by sticking with the separate indexes, skipping the caching, and
seeing how it goes.

Hope this helps,

Nathan Kurz
nate [at] verse
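
As a rough illustration of the external-server idea above (one process
that keeps Searchers for the most recent users), here is a minimal
sketch of a least-recently-used cache. It is written in Python purely
for illustration and is not KinoSearch code: SearcherCache,
fake_open_searcher, and the capacity parameter are all hypothetical
names, and the real cost of opening an invindex is replaced by a
stand-in function.

```python
from collections import OrderedDict


class SearcherCache:
    """Keep at most `capacity` open searchers; evict the least recently used."""

    def __init__(self, open_searcher, capacity=3):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._open = open_searcher      # hypothetical: opens one user's index
        self._capacity = capacity
        self._cache = OrderedDict()     # user_id -> searcher, oldest first
        self.opens = 0                  # counts expensive opens, for measuring

    def get(self, user_id):
        if user_id in self._cache:
            # Cache hit: mark this user as the most recently used.
            self._cache.move_to_end(user_id)
            return self._cache[user_id]
        # Cache miss: pay the cost of opening the index.
        searcher = self._open(user_id)
        self.opens += 1
        self._cache[user_id] = searcher
        if len(self._cache) > self._capacity:
            # Drop the least recently used searcher to bound RAM use.
            self._cache.popitem(last=False)
        return searcher


def fake_open_searcher(user_id):
    # Assumption: stand-in for the expensive step of opening an invindex.
    return f"searcher-for-{user_id}"
```

Keeping the capacity small leaves more RAM for the operating system's
file cache, which is the trade described above. The opens counter is
there so the hit rate can be measured before deciding whether the cache
is worth having at all.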