Mailing List Archive: SpamAssassin: devel

PATCH reduce sa-awl memory usage

 

 


vitalyb at telenet

Apr 22, 2012, 9:58 AM

Post #1 of 4
PATCH reduce sa-awl memory usage

Hello,

The current version of sa-awl loads the full database key list into memory before
showing any stats or performing maintenance. I believe this behavior is clearly
undesirable, and it makes large databases impossible to handle.

The patch below improves sa-awl scaling and responsiveness by scanning the database
row by row instead of loading all keys into memory first.

Tested by cleaning a db with over 8 million rows.

For a cached db with 850K rows, memory usage drops from 1G to 6M; execution time
is around 12% slower, though.

I'm not a Perl expert, so please review.

Thanks.

--- sa-awl.orig 2012-04-22 18:38:55.000000000 +0300
+++ sa-awl 2012-04-22 18:59:10.527228442 +0300
@@ -82,11 +82,10 @@
or die "Cannot open file $db: $!\n";
}

-my @k = grep(!/totscore$/,keys(%h));
-for my $key (@k)
+while (my ($key, $count) = each %h)
{
+ next if $key =~ /totscore$/;
my $totscore = $h{"$key|totscore"};
- my $count = $h{$key};
next unless defined($totscore);

if ($opt_clean) {
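
For readers less familiar with Perl, here is a minimal sketch of the loop shape the patch switches to. It is not the sa-awl code: a plain in-memory hash stands in for the tied database, and the sample keys, the awl_means helper and the mean calculation are made up for illustration. The point is only that each hands back one key/value pair per call, whereas grep(..., keys(%h)) builds the whole key list first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: a plain hash stands in for the tied AWL database.
# Each entry stores a message count under "$key" and a score sum under
# "$key|totscore", mirroring the key layout used in the patch.
my %h = (
    'alice@example.com|ip=10.1'          => 4,
    'alice@example.com|ip=10.1|totscore' => 10,
    'bob@example.com|ip=10.2'            => 1,
    'bob@example.com|ip=10.2|totscore'   => 3.5,
    'orphan@example.com|ip=10.3'         => 2,    # no totscore record
);

# Hypothetical helper: walk the hash one pair at a time and return the mean
# score per entry. No intermediate list of keys is built.
sub awl_means {
    my ($db) = @_;
    my %mean;
    while (my ($key, $count) = each %$db) {
        next if $key =~ /totscore$/;              # skip the score-sum records
        my $totscore = $db->{"$key|totscore"};
        next unless defined $totscore;            # entry without a score sum
        next unless $count;                       # avoid dividing by zero
        $mean{$key} = $totscore / $count;
    }
    return \%mean;
}

my $mean = awl_means(\%h);
printf "%-30s %.2f\n", $_, $mean->{$_} for sort keys %$mean;
```

This sketch only reads from the hash. Whether entries can be safely deleted while each is still iterating depends on the DBM backend, so that part is deliberately left out here.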


KMcGrail at PCCC

Apr 23, 2012, 10:33 AM

Post #2 of 4
Re: PATCH reduce sa-awl memory usage

If you are using an SQL backend for your AWL, Kris Deugau gave me some
ideas a few years ago that led me to add a column to the AWL table like this:

alter table awl add column `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP


Then I run a cron job against the AWL to clear out entries that haven't been
used in a while, on a graduated scale:

DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 15 day) and count < 5;
DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 30 day) and count < 10;
DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 60 day) and count < 20;
DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 120 day);

However, the idea below might be the only option for a DB-based backend.
Why don't you open a ticket, please? I think it's a straightforward
change with good benefits. It might need a switch to control its
behavior for those who care about time more than memory, though.

Regards,
KAM



--
*Kevin A. McGrail*
President

Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422

http://www.pccc.com/



vitalyb at telenet

Apr 23, 2012, 12:28 PM

Post #3 of 4
Re: PATCH reduce sa-awl memory usage

On 23.04.2012 20:33, Kevin A. McGrail wrote:
> If you are using an SQL backend for your AWL, Kris Deugau gave me some ideas a few years ago that led me to add a column to
> the AWL table like this:
>
> alter table awl add column `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP
>
>
> Then I run a cron job against the AWL to clear out entries that haven't been used in a while, on a graduated scale:
>
> DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 15 day) and count < 5;
> DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 30 day) and count < 10;
> DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 60 day) and count < 20;
> DELETE FROM awl WHERE lastupdate <= (now() - INTERVAL 120 day);

No, mine is BerkeleyDB, and it looks like sa-awl can handle DB files only. I have a feeling that SQL would add
a large overall overhead for this task, with no clear benefit in return.

> However, the idea below might be the only option for a DB-based backend. Why don't you open a ticket, please? I think
> it's a straightforward change with good benefits. It might need a switch to control its behavior for those who
> care about time more than memory, though.

I thought the dev list would be an appropriate place to send a patch; I haven't found clear guidelines, sorry.
Submitting a ticket now.

Just for reference, the "12% slower" in the benchmark I ran is a 26 seconds vs. 29 seconds case, and yet
it saves 1G of RAM. With a database two to three times larger, the sa-awl process won't even fit in a 32-bit
virtual address space.
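
As a rough check of these figures, the arithmetic can be sketched in a few lines of Perl. The assumptions here are illustrative, not measurements: "1G" is taken as 2**30 bytes, memory is assumed to grow linearly with the row count, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Figures quoted in the thread.
my ($secs_before, $secs_after) = (26, 29);
my $rows      = 850_000;
my $mem_bytes = 2**30;    # "1G", assumed here to mean 1 GiB

# Relative slowdown of the patched run, in percent.
sub slowdown_pct {
    my ($before, $after) = @_;
    return 100 * ($after - $before) / $before;
}

# Memory of the unpatched run scaled by a growth factor, in GiB.
# Assumes memory grows linearly with the number of rows.
sub projected_gib {
    my ($bytes, $factor) = @_;
    return $bytes * $factor / 2**30;
}

printf "slowdown: %.1f%%\n", slowdown_pct($secs_before, $secs_after);
printf "bytes per row (unpatched): %.0f\n", $mem_bytes / $rows;
printf "%dx rows: about %.0f GiB of a 4 GiB 32-bit address space\n",
    $_, projected_gib($mem_bytes, $_) for 2, 3;
```

A 32-bit process can address at most 4 GiB, and the share available to user space is usually smaller, so a key list of 2 to 3 GiB leaves little or no room for anything else.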



KMcGrail at PCCC

Apr 23, 2012, 12:46 PM

Post #4 of 4
Re: PATCH reduce sa-awl memory usage

> I thought the dev list would be an appropriate place to send a patch; I
> haven't found clear guidelines, sorry. Submitting a ticket now.

No apology needed. I think you discussed things perfectly, and this is
worthy of a bug.

> Just for reference, the "12% slower" in the benchmark I ran is a 26
> seconds vs. 29 seconds case, and yet it saves 1G of RAM. With a database
> two to three times larger, the sa-awl process won't even fit in a
> 32-bit virtual address space.

A good reason why we need to make this an option.

Regards,
KAM
