Gossamer Forum

mass deletion of duplicate links

Hi there,

I've got a LinksSQL installation using DMOZ data from a fairly large category (over 300K links). Since I've done more than one import/update, I'm noticing many cases (i.e. 1000s) where the same link (sometimes with a slightly different title) exists more than once in the same category. I have no problem with duplicate urls existing in different categories, but the same link should never appear more than once in the same category.

I tried using the "check duplicates" tool in the admin panel, but it would take hours and hours to go through it all manually.

To cut to the chase... I've been trying to come up with an SQL query that will simply go through and find every time the same url exists more than once in the same category and delete all but one of those links (preferably leaving only the one with the highest link ID). Easier said than done. Has anyone dealt with this before? Is there a way to do it with an SQL query, or do I need to write a script to do it for me?

Thanks for any ideas.

Fractured Atlas :: Liberate the Artist
Services: Healthcare, Fiscal Sponsorship, Marketing, Education, The Emerging Artists Fund
Re: [hennagaijin] mass deletion of duplicate links
I'm fairly sure you'd need a script.

Code:
my %data;
my $sth = $DB->table('Links','CatLinks')->select;
while (my $rec = $sth->fetchrow_hashref) {
    # collect every category id this title appears in
    # (don't reset the list each time, or earlier categories are lost)
    push @{$data{$rec->{Title}}}, $rec->{CategoryID};
}

...what that does is build a hash keyed on link title, storing each title's category ids.

Then what you can do is remove any entries where there is only one category, as that means the link isn't a duplicate....

Code:
%data = map { $_ => $data{$_} } grep { @{$data{$_}} > 1 } keys %data;

Then you can sort the values by category id and drop the lowest id, which is the one you want to keep....

Code:
foreach my $key (keys %data) {
    # sort numerically first, then drop the lowest id (the one we keep)
    @{$data{$key}} = sort { $a <=> $b } @{$data{$key}};
    shift @{$data{$key}};
}

...then go on to delete everything.

Code:
foreach my $key (keys %data) {
    foreach my $id (@{$data{$key}}) {
        # NB: this removes every CatLinks row with that category id,
        # not just the rows for this particular link
        $DB->table('CatLinks')->delete({ CategoryID => $id });
    }
}

I've done that step by step so you can see. I may have it totally wrong, but it's the best I can do at the moment. Crazy

If you *do* try it - make sure you have a backup.

Last edited by: Paul, Mar 18, 2003, 9:40 AM
Re: [hennagaijin] mass deletion of duplicate links
Actually, hold on, I have that wrong already. That will delete duplicates from different categories, not the same category.

Hmmm for that you'd need to change:

Code:
foreach my $key (keys %data) {
    @{$data{$key}} = sort { $a <=> $b } @{$data{$key}};
    shift @{$data{$key}};
}

...to...

Code:
foreach my $key (keys %data) {
    my %tmp;  # scoped per title, so counts don't leak between keys
    @{$data{$key}} = sort { $a <=> $b } grep { !$tmp{$_}++ } @{$data{$key}};
    shift @{$data{$key}};
}
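Language aside, the grouping logic can be checked in plain Python before touching the database. Everything here is hypothetical: the tuples stand in for one record each of a Links/CatLinks join, and the column names are made up for the illustration. It keeps the highest LinkID per (URL, category) pair, matching what the original poster asked for:

```python
# Sketch only: each tuple stands in for one row of a Links/CatLinks
# join -- (LinkID, URL, CategoryID) are assumed names, not the real schema.
def find_duplicate_link_ids(rows):
    """Group by (URL, CategoryID); keep the highest LinkID in each
    group and return every other LinkID, i.e. the rows to delete."""
    groups = {}
    for link_id, url, cat_id in rows:
        groups.setdefault((url, cat_id), []).append(link_id)
    to_delete = []
    for ids in groups.values():
        ids.sort()                    # lowest first...
        to_delete.extend(ids[:-1])    # ...delete all but the highest
    return sorted(to_delete)

rows = [
    (1, "http://example.com/", 10),  # duplicate of link 2 in category 10
    (2, "http://example.com/", 10),
    (3, "http://example.com/", 11),  # same URL, different category: kept
    (4, "http://other.com/",   10),
]
print(find_duplicate_link_ids(rows))  # [1]
```

Only link 1 is flagged: it shares both URL and category with link 2 but has the lower id. The same-URL link in category 11 is untouched, which is the behaviour the original poster wanted.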

Last edited by: Paul, Mar 18, 2003, 9:44 AM
Re: [Paul] mass deletion of duplicate links
Thanks for the reply/suggestion. Since I've got a lot more experience with PHP than perl, it might be easier for me to just write something myself in PHP. I'll give it a whirl at any rate, and if I can't figure it out, I'll try your perl version.

Thanks again.

Re: [hennagaijin] mass deletion of duplicate links
If you try the perl version, before the final delete it may be worth commenting out that code and checking the data looks ok with GT::Dumper....e.g....

Code:
require GT::Dumper;
print GT::Dumper::Dumper(\%data);

Last edited by: Paul, Mar 18, 2003, 10:02 AM
Re: [Paul] mass deletion of duplicate links
I was just wondering whether dupe checking of over 300K links would cause memory problems.
Even if each link is 100 characters long, 300K links take only about 30 MB of memory.
So IMO this dupe check method should not cause any memory problems.
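The back-of-the-envelope arithmetic works out as stated (ignoring Perl's per-key hash overhead, which would add to this):

```python
# 300K links at roughly 100 bytes of key data each
links = 300_000
bytes_per_link = 100
total_mb = links * bytes_per_link / (1024 * 1024)
print(round(total_mb, 1))  # 28.6 -- i.e. roughly the 30 MB figure above
```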

I was just thinking out loud.

Best regards,
Webmaster33


Paid Support
from Webmaster33. Expert in Perl programming & Gossamer Threads applications. (click here for prices)
Webmaster33's products (upd.2004.09.26) | Private message | Contact me | Was my post helpful? Donate my help...
Re: [hennagaijin] mass deletion of duplicate links
Hi

You can use an SQL query to do this.

Code:
DELETE FROM Links
WHERE LinksID NOT IN
    (SELECT MAX(LinksID) FROM Links GROUP BY LinkURL)

The subquery identifies the highest ID for each URL; that's the row we keep, and all the other rows will be deleted. (Note there is deliberately no HAVING Count(*) > 1 on the subquery: it has to return the IDs of non-duplicated links too, otherwise the NOT IN would delete those as well.)

Be careful with queries like this. Shocked Test them first. You can change the DELETE to a SELECT and also try the subquery by itself.
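That DELETE ... NOT IN (SELECT MAX(...)) pattern is easy to rehearse on a scratch database before touching live data. Here is a sketch using Python's sqlite3 with a made-up three-column schema (not Links SQL's real tables); it also groups by category as well as URL, to match the original requirement of deduping only within the same category:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Links (LinkID INTEGER PRIMARY KEY, URL TEXT, CategoryID INTEGER)")
cur.executemany(
    "INSERT INTO Links VALUES (?, ?, ?)",
    [(1, "http://example.com/", 10),   # duplicate of link 2 in category 10
     (2, "http://example.com/", 10),
     (3, "http://example.com/", 11),   # same URL, different category: kept
     (4, "http://other.com/",   10)],
)
# Keep the highest LinkID per (URL, CategoryID); delete the rest.
# The subquery returns one MAX id per group, including groups of size 1,
# so non-duplicated links survive the NOT IN.
cur.execute("""
    DELETE FROM Links WHERE LinkID NOT IN
        (SELECT MAX(LinkID) FROM Links GROUP BY URL, CategoryID)
""")
conn.commit()
print(sorted(r[0] for r in cur.execute("SELECT LinkID FROM Links")))  # [2, 3, 4]
```

Only the lower-id duplicate within category 10 is removed; the same URL in a different category is left alone.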

Dregs2
Re: [dregs2] mass deletion of duplicate links
You need MySQL 4 for subqueries
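For MySQL 3.x, one common workaround is to split the job in two: run the SELECT MAX(...) GROUP BY first, then issue a DELETE with the surviving IDs fed back in. A sketch (using sqlite3 purely as a stand-in database, with made-up table and column names; on a real 300K-row table you would want to batch the NOT IN list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Links (LinkID INTEGER PRIMARY KEY, URL TEXT, CategoryID INTEGER)")
cur.executemany(
    "INSERT INTO Links VALUES (?, ?, ?)",
    [(1, "http://example.com/", 10),
     (2, "http://example.com/", 10),
     (3, "http://other.com/",   10)])
# Step 1: find the ids to keep -- no subquery needed for this part.
keep = [r[0] for r in cur.execute(
    "SELECT MAX(LinkID) FROM Links GROUP BY URL, CategoryID")]
# Step 2: delete everything else, passing the kept ids back in.
placeholders = ",".join("?" * len(keep))
cur.execute(f"DELETE FROM Links WHERE LinkID NOT IN ({placeholders})", keep)
conn.commit()
print(sorted(r[0] for r in cur.execute("SELECT LinkID FROM Links")))  # [2, 3]
```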
Re: [Paul] mass deletion of duplicate links
Hi Paul

I use SQL Server.

Oh well. Crazy

Dregs2
Re: [Paul] mass deletion of duplicate links
Hi Paul

You're right. I had a quick look at MySQL and it looks like they have support for subqueries in 4.0.

I remember this post http://www.gossamer-threads.com/...tring=dregs2;#170378

They have now posted:

"Support for stored procedures and triggers will be introduced in version 5.0"

Oh well Mad

Dregs2