ted.dunning at gmail
Sep 15, 2011, 6:30 AM
Post #2 of 3
Re: Index creation from multiple data sources
This is a common need and there are a number of solutions.
One common method for joining large semi-structured data sets is to use
map-reduce, for example with Apache Hadoop.
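To make the map-reduce idea concrete, here is a minimal sketch of a reduce-side
join in plain Python. It does not use the Hadoop API; it only imitates the map,
group and reduce steps in memory. The field names (id, name, price) and the
sample records are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records, source, key_field):
    """Tag each record with its source and emit (join_key, (source, record))."""
    for record in records:
        yield record[key_field], (source, record)


def reduce_phase(pairs):
    """Group tagged records by key, then pair each accommodation with its prices."""
    groups = defaultdict(lambda: {"accommodation": [], "availability": []})
    for key, (source, record) in pairs:
        groups[key][source].append(record)
    for key in sorted(groups):
        for acc in groups[key]["accommodation"]:
            for avail in groups[key]["availability"]:
                # Inner join: keys present on only one side produce nothing.
                yield {**acc, **avail}


# Hypothetical sample data standing in for the two sources.
accommodations = [
    {"id": "1", "name": "Sea View Hotel"},
    {"id": "2", "name": "Hill Lodge"},
]
availability = [
    {"id": "1", "price": "120"},
    {"id": "1", "price": "95"},
    {"id": "3", "price": "80"},
]

pairs = chain(
    map_phase(accommodations, "accommodation", "id"),
    map_phase(availability, "availability", "id"),
)
joined = list(reduce_phase(pairs))
```

In a real cluster the grouping by key is done by the framework's shuffle step
rather than by a dict held in one process.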
However, since you already have one side of your data in MySQL, you should
test whether simply scanning the CSV data while looking up the matching rows
in MySQL is fast enough for you. This is very likely to be sufficient if
your MySQL data is small enough to fit into memory. If the CSV data is small
enough that large portions of it fit in memory, then sorting those portions
by the join key, so that the MySQL database is accessed in key order, may
improve your performance enormously.
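The two variants described above might look roughly like this in Python. This
is only a sketch under assumptions: the column names (accommodation_id, price),
the sample data, and the fetch_in_key_order callback are all hypothetical. In
practice the callback would run a query against MySQL for the given keys; here
it reads from a dict so the example stays self-contained.

```python
import csv
import io
from itertools import islice


def hash_join(csv_file, accommodations_by_id):
    """Scan the CSV once, looking each row up in an in-memory dict."""
    for row in csv.DictReader(csv_file):
        acc = accommodations_by_id.get(row["accommodation_id"])
        if acc is not None:
            yield {**acc, "price": row["price"]}


def chunked_sorted_join(csv_file, fetch_in_key_order, chunk_size=10000):
    """Read the CSV in chunks, sort each chunk by key, then look keys up in order."""
    reader = csv.DictReader(csv_file)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        chunk.sort(key=lambda row: row["accommodation_id"])
        keys = sorted({row["accommodation_id"] for row in chunk})
        # Hypothetical callback: returns {id: record} for the sorted keys.
        found = fetch_in_key_order(keys)
        for row in chunk:
            acc = found.get(row["accommodation_id"])
            if acc is not None:
                yield {**acc, "price": row["price"]}


# Hypothetical sample data standing in for the CSV file and the MySQL table.
CSV_TEXT = "accommodation_id,price\n2,80\n1,120\n3,60\n1,95\n"
ACCOMMODATIONS = {
    "1": {"id": "1", "name": "Sea View Hotel"},
    "2": {"id": "2", "name": "Hill Lodge"},
}


def fetch_in_key_order(keys):
    # Stand-in for a database query over the given keys.
    return {k: ACCOMMODATIONS[k] for k in keys if k in ACCOMMODATIONS}


hash_joined = list(hash_join(io.StringIO(CSV_TEXT), ACCOMMODATIONS))
chunk_joined = list(
    chunked_sorted_join(io.StringIO(CSV_TEXT), fetch_in_key_order, chunk_size=2)
)
```

Both functions produce the same joined records; only the order of the output
and the access pattern against the lookup side differ.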
There is a considerable amount of tuning that you can do to the join process
to make it work well.
Sometimes simply dumping data to an external sort/merge works just as well.
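As an illustration of the sort/merge approach, the sketch below joins two
inputs that have already been sorted by the join key, reading each one
sequentially. The sorting itself is assumed to have been done beforehand (for
example by an external sort tool); the field names and sample records are
made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left_sorted, right_sorted, key):
    """Inner-join two iterables that are both already sorted by `key`."""
    get_key = itemgetter(key)
    left_groups = groupby(left_sorted, key=get_key)
    right_groups = groupby(right_sorted, key=get_key)
    left = next(left_groups, None)
    right = next(right_groups, None)
    while left is not None and right is not None:
        if left[0] < right[0]:
            left = next(left_groups, None)    # no match on the right side
        elif left[0] > right[0]:
            right = next(right_groups, None)  # no match on the left side
        else:
            # Same key on both sides: emit every left/right combination.
            right_rows = list(right[1])
            for left_row in left[1]:
                for right_row in right_rows:
                    yield {**left_row, **right_row}
            left = next(left_groups, None)
            right = next(right_groups, None)


# Hypothetical sample data, both lists sorted by "id".
accommodations = [
    {"id": 1, "name": "Sea View Hotel"},
    {"id": 2, "name": "Hill Lodge"},
    {"id": 4, "name": "Canal House"},
]
prices = [
    {"id": 1, "price": 120},
    {"id": 1, "price": 95},
    {"id": 3, "price": 60},
    {"id": 4, "price": 70},
]

joined = list(merge_join(accommodations, prices, "id"))
```

Because each input is read once in order, only the rows for the current key
need to be held in memory at a time.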
Regardless of how you do it, the result should be a bunch of joined records.
From there, you just do the normal Lucene thing to index them.
The reason that the Lucene books don't talk about this is that there are
a gazillion different places data can come from and the exact method for
joining the data will vary. Once joined, the references you mention will
On Thu, Sep 15, 2011 at 8:46 AM, ntsrikanth <ntsrikanth [at] gmail> wrote:
> I work for a travel company and I am trying to do a feasibility study with
> We have got two different datasources, one for the accommodation
> details(mysql) and other for the availability(csv). We need to merge them
> together so that we get one record which would contain data from each
> source. For example, name, description and facilities of an accommodation
> from mysql and price details from csv file need to be merged together to
> create a single record in solr.
> I searched the wiki, solr book and forums but couldn't find any answer.
> anyone got similar setup and if so how did you design it?
> Sent from the Lucene - General mailing list archive at Nabble.com.