Gossamer Forum
Home : General : Perl Programming :

Speeding up an import...

Quote Reply
Speeding up an import...
Hi. I have this script working, but it seems quite slow.

Basically, I now have the Amazon XML feed, and I'm "parsing" the data as follows;

Use a while() to go through the main 1Gb+ XML file, and split into smaller .xml files, which are then "cleaned" out to a Tab delimited file (to save space, and theoretically minamise the amount of memory each file takes up in memory when slurping it into memory).

I then use a while() loop to go through each of the smaller files (5,000 records per .db file, which works out about 5-10Mb).

So far, its does about 2-3 items per second, which sounds ok... but on 1,000,000+ products, thats several days for it to complete :(

I've tried cleaning out old variables (to fee up memory), but it still doesn't seem to be going that quick (I was hoping for at least 10 items per second).

Anyone got any suggestions on speeding it up, or do you think thats about the best I'm gonna be able to get? I *really* want to speed it up if possible (as long as it doesn't include a full rewrite Tongue).

TIA.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Quote Reply
Re: [Andy] Speeding up an import... In reply to
It's hard to say without seeing the code.
Quote Reply
Re: [Jarrod] Speeding up an import... In reply to
Its a lot of code :( (I ended up splitting the process up into 2 scripts ... one for cutting up the large 1Gb file, and the other to extract the final details, and write it to the database).

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Quote Reply
Re: [Andy] Speeding up an import... In reply to
You're using a while loop to do what? You don't say.

If you're looking for something, try grep or awk.

- wil
Quote Reply
Re: [Wil] Speeding up an import... In reply to
Well, what I do is;

In amazon-xml.cgi;

1) Open up the main 1Gb XML file... do a loop with it and write to 15-20Mb smaller .xml format files.

2) Then go through each of the smaller files, and put them in a more friendly format (i.e without all the unrequired data)

3) Read through the smaller files, and generate all the required categories.

4) Read through the smaller .db files, going through each line (with a 'while'). Run checks on it, to make sure it doesn't already exist, and if should find all categories, as they are made in step 3.

5) Go through each line, and process it with the below routine;

Code:
my $_cnt = 0;

# open our filehandle...
open(READFILE, "<amazon_$file.db") || die "Cant open amazon_$file.db. Reason: $!";
while (<READFILE>) {

$_cnt++;

my ($prod_id,$brand,$title,$price,$short_desc,$page_url,$image_url,$category,$amazon_review,$format) = split /:\@:/, $_;

# no point in doing any more processing, as it looks like it already exists in the DB :)
if ($DB->table('Links')->count( { ImageURL => $image_url, Title => $title, BuyURL => $page_url } )) {
$DEBUG? print "$_cnt: Skipping $title \n" : '';
next;
}

# passed in as;
# $prod_id = SKU
# $brand = Manufacturer
# $title = Title
# $price = Price
# $short_desc = Description
# $page_url = BuyURL
# $image_url = ImageURL
# $category = CatLinks.CategoryID & AdvertiserCategory
# $amazon_review = PromotionalText
# $format = Format

my $category_id = $DB->table('Category')->select( ['ID'], { Full_Name => $category } )->fetchrow;

if ($category_id !~ /\d/) {
print "$_cnt: No category ID for \"$title\" ($category) \n";
next;
}

my $add;
$add->{SKU} = $prod_id;
$add->{Manufacturer} = $brand;
$add->{Title} = $title;
$add->{Price} = $price;
$add->{Description} = $short_desc;
$add->{BuyURL} = $page_url;
$add->{ImageURL} = $image_url;
$add->{'CatLinks.CategoryID'} = $category_id;
$add->{Format} = $format;

$add->{Add_Date} = GT::Date::date_get();
$add->{Mod_Date} = $add->{Add_Date};



$add->{TimeStmp} ||= "NOW()";
$add->{Hits} ||= "0";
$add->{URL} ||= "http://";
$add->{Rating} ||= "0";
$add->{Votes} ||= "0";
$add->{Status} ||= "200";
$add->{LinkOwner} ||= "amazon";
$add->{Contact_Name} ||= "Me";
$add->{Contact_Email}||= 'Me@foobar.foo';
$add->{isNew} ||= "Yes";
$add->{isPopular} ||= "No";
$add->{isChanged} ||= "No";
$add->{isValidated} ||= "Yes";

# no need to recheck if it exists, as we do this further up :)
$DB->table('Links')->add($add) || die $GT::SQL::error;
$DEBUG ? print "$_cnt: Added \"$title\" to \"$category\"\n" : '';

}
close(READFILE);

The final part is where most of the work happens. It is now in a format similar too;

Quote:
0471169609:@:John Wiley and Sons (WIE):@:Bank Management: Text and Cases:@:36.95:@:Pages: 700, Hardcover, John Wiley and Sons (WIE):@:http://www.amazon.co.uk/exec/obidos/ASIN/0471169609/discountedorfree-mf-21/ref=nosim:@:http://images-eu.amazon.com/images/P/0471169609.02.MZZZZZZZ.jpg:@:Amazon/Book:@::@:

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Quote Reply
Re: [Andy] Speeding up an import... In reply to
Profiling is your friend. =) Try just doing one thousand entries and do:

perl -d:DProf script.pl

and then after that, do:

dprofpp

to see a profile of your code and where the slowest parts are.

Cheers,

Alex
--
Gossamer Threads Inc.

Last edited by:

Alex: May 31, 2004, 10:33 AM
Quote Reply
Re: [Alex] Speeding up an import... In reply to
Thanks. Is that a general thing thats installed on normal servers? (i.e a non-gossamer host one).

You won't hear from me for a week now... I'm off on holiday :)

TIA

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Quote Reply
Re: [Andy] Speeding up an import... In reply to
Hi,

I believe it comes standard on the 5.8 perls, but older perls it was optional. You need to instal Devel::DProf module to use it.

Cheers,

Alex
--
Gossamer Threads Inc.
Quote Reply
Re: [Alex] Speeding up an import... In reply to
Alex,
Do you know what the ASP version of Devel::DProf is?
Couldn't find it with a PPM repository search.

thanks
Bob
http://totallyfreeads.com.au
Quote Reply
Re: [lanerj] Speeding up an import... In reply to
Hi,

Not sure what you mean by 'ASP version', but Devel::DProf does come with the activestate package of perl (at least on my install of 5.6.1 build 633).

Cheers,

Alex
--
Gossamer Threads Inc.
Quote Reply
Re: [Andy] Speeding up an import... In reply to
Hey Andy,
Where did you get the Amazon XML Feed?? Would be interested in a copy:)

</not a clue>
Quote Reply
Re: [Dinky] Speeding up an import... In reply to
In Reply To:
Hey Andy,
Where did you get the Amazon XML Feed?? Would be interested in a copy:)

I havn't got a copy of it myself, although I do think its available through the main Amazon.com website. Have a look on their affiliate section ...I'm on vacation at the moment, so can't really take a look... sorry :(

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!