
Mailing List Archive: SpamAssassin: users

Bayes and MySQL - does it actually work?

 

 



support at junkemailfilter

Dec 21, 2011, 6:39 AM

Post #1 of 24 (1127 views)
Bayes and MySQL - does it actually work?

I've been trying for a long time to get bayes/mysql to actually work.
Running a dedicated server with MySQL. Several servers running SA
configured to talk to it.

I'm running big servers with lots of ram and raid 0 flash drives for
speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
work and if someone is going to fix it?

--
Marc Perkel - Sales/Support
support [at] junkemailfilter
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


axb.lists at gmail

Dec 21, 2011, 6:46 AM

Post #2 of 24 (1109 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 12/21/2011 03:39 PM, Marc Perkel wrote:
> I've been trying for a long time to get bayes/mysql to actually work.
> Running a dedicated server with MySQL. Several servers running SA
> configured to talk to it.
>
> I'm running big servers with lots of ram and raid 0 flash drives for
> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
> work and if someone is going to fix it?
>

It works fine.


robert at schetterer

Dec 21, 2011, 6:52 AM

Post #3 of 24 (1109 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 21.12.2011 15:39, Marc Perkel wrote:
> I've been trying for a long time to get bayes/mysql to actually work.
> Running a dedicated server with MySQL. Several servers running SA
> configured to talk to it.
>
> I'm running big servers with lots of ram and raid 0 flash drives for
> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
> work and if someone is going to fix it?
>
What makes you think it does not work?

--
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


christian.grunfeld at gmail

Dec 21, 2011, 8:54 AM

Post #4 of 24 (1105 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

Bayes in MySQL works great for me with only one user!
My previous setup, with per-user Bayes in MySQL, was a mess!

Cheers
Christian

2011/12/21 Robert Schetterer <robert [at] schetterer>:
> Am 21.12.2011 15:39, schrieb Marc Perkel:
>> I've been trying for a long time to get bayes/mysql to actually work.
>> Running a dedicated server with MySQL. Several servers running SA
>> configured to talk to it.
>>
>> I'm running big servers with lots of ram and raid 0 flash drives for
>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>> work and if someone is going to fix it?
>>
>  what makes you think it does not work ?
>
> --
> Best Regards
>
> MfG Robert Schetterer
>
> Germany/Munich/Bavaria


support at junkemailfilter

Dec 21, 2011, 9:20 AM

Post #5 of 24 (1105 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

I should have mentioned that I'm filtering for about 50K users.

On 12/21/2011 8:54 AM, Christian Grunfeld wrote:
> Bayes in MySQL works great for my with only one user !
> In my previous setup with per user bayes in mysql was a mess !
>
> Cheers
> Christian
>
> 2011/12/21 Robert Schetterer<robert [at] schetterer>:
>> Am 21.12.2011 15:39, schrieb Marc Perkel:
>>> I've been trying for a long time to get bayes/mysql to actually work.
>>> Running a dedicated server with MySQL. Several servers running SA
>>> configured to talk to it.
>>>
>>> I'm running big servers with lots of ram and raid 0 flash drives for
>>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>>> work and if someone is going to fix it?
>>>
>> what makes you think it does not work ?
>>
>> --
>> Best Regards
>>
>> MfG Robert Schetterer
>>
>> Germany/Munich/Bavaria
>

--
Marc Perkel - Sales/Support
support [at] junkemailfilter
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


kdeugau at vianet

Dec 21, 2011, 10:10 AM

Post #6 of 24 (1101 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

Marc Perkel wrote:
> I've been trying for a long time to get bayes/mysql to actually work.
> Running a dedicated server with MySQL. Several servers running SA
> configured to talk to it.
>
> I'm running big servers with lots of ram and raid 0 flash drives for
> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
> work and if someone is going to fix it?

I'm not sure what official testing has been done, but some testing I did
about a year ago when upgrading the SA cluster here showed pretty much
the same IO load for a global Bayes no matter what combination of
MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.

Enabling MySQL replication also bogged things down pretty badly.

Performance with the database on physical disks simply wasn't keeping up
with more than about double the average message rate (if that...), so I
fell back to the "good enough" setup of putting the SA database on a
RAMdisk, and tweaking the MySQL init script to reload the database on
startup. A database dump is done once a day, about a half-hour after a
Bayes expiry run.

This is handling ~250K messages/day, although with some tweaks to
serialize mail delivery a little more to level off the extreme peaks in
messages/second it should probably be able to handle a lot more volume.
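
For scale, the daily figure converts to a fairly modest per-second rate.
A small worked calculation, assuming mail arrives evenly over the day
(which the peaks mentioned above say it does not):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def per_second(messages_per_day: float) -> float:
    """Average messages per second for a given daily volume."""
    return messages_per_day / SECONDS_PER_DAY


average = per_second(250_000)
print(f"average:        {average:.2f} msg/s")      # average:        2.89 msg/s
print(f"double average: {2 * average:.2f} msg/s")  # double average: 5.79 msg/s
```

So "double the average message rate" here is on the order of six
messages per second; real peaks can be far above that.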

We also have several SA instances - on the inbound side, the first pass
has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
off the junk that would usually score 15+ on a full ruleset. Anything
that gets past that is then passed to a full SA instance with a long
list of local rules targeted at the ones reported as missed spam by
customers. That first pass tags more than 80% of the junk for far less
processing cost than feeding it all through the full ruleset.
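
A minimal sketch of that two-pass arrangement, written in Python rather
than as SA configuration. The rule names, scores, thresholds, and message
fields are all invented for illustration; they are not the rules
described above, only the shape of the idea: a cheap first pass skims
off the obvious junk, and only the remainder pays for the full ruleset.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
Rule = Tuple[str, float, Callable[[Message], bool]]

# Hypothetical first-pass rules: few, cheap, and high-scoring.
FIRST_PASS_RULES: List[Rule] = [
    ("LISTED_ON_DNSBL", 10.0, lambda m: m.get("client_ip") in {"192.0.2.10"}),
    ("FORGED_HELO", 6.0, lambda m: m.get("helo") == "localhost"),
]

# Hypothetical stand-in for the full, expensive ruleset.
FULL_RULES: List[Rule] = FIRST_PASS_RULES + [
    ("LOCAL_MISSED_SPAM_PHRASE", 5.0,
     lambda m: "cheap pills" in m.get("body", "").lower()),
    ("BODY_HAS_UNSUBSCRIBE", 0.5,
     lambda m: "unsubscribe" in m.get("body", "").lower()),
]

SKIM_THRESHOLD = 15.0  # first pass only tags mail that is clearly junk
SPAM_THRESHOLD = 5.0   # ordinary threshold for the full scan


def score(msg: Message, rules: List[Rule]) -> float:
    """Sum the points of every rule whose test matches the message."""
    return sum(points for _name, points, test in rules if test(msg))


def classify(msg: Message) -> Tuple[str, str]:
    """Return (verdict, name of the pass that decided)."""
    if score(msg, FIRST_PASS_RULES) >= SKIM_THRESHOLD:
        return "spam", "first-pass"
    verdict = "spam" if score(msg, FULL_RULES) >= SPAM_THRESHOLD else "ham"
    return verdict, "full"
```

For example, `classify({"client_ip": "192.0.2.10", "helo": "localhost"})`
is decided by the first pass and never reaches the full ruleset.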

Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
take a while to chew through mail if you've only got 16 logical CPU
cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
at just over 4G.) Average scan times are just under a second.

-kgd

[1] I'm looking at you, Rocket Science Group - hundreds of messages per
second from netblocks all over the US, all nominally operated by (AKA
"tagged in WHOIS for") the same group - and quite a lot of it spam.
Unfortunately MailChimp seems to buy rack space, hosting, or managed
email servers from them or I'd drop all of their netblocks in the local
reject-at-the-border DNSBL and be done with it.


robert at schetterer

Dec 21, 2011, 10:42 AM

Post #7 of 24 (1103 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 21.12.2011 18:20, Marc Perkel wrote:
> I should have mentions I'm filtering for about 50K users.

OK, that's a lot. Moving away from MySQL may help,
but with that many users things get difficult in many ways,
and that is not directly related to whether Bayes works with MySQL.
I use it for 3K users with no problems so far.

>
> On 12/21/2011 8:54 AM, Christian Grunfeld wrote:
>> Bayes in MySQL works great for my with only one user !
>> In my previous setup with per user bayes in mysql was a mess !
>>
>> Cheers
>> Christian
>>
>> 2011/12/21 Robert Schetterer<robert [at] schetterer>:
>>> Am 21.12.2011 15:39, schrieb Marc Perkel:
>>>> I've been trying for a long time to get bayes/mysql to actually work.
>>>> Running a dedicated server with MySQL. Several servers running SA
>>>> configured to talk to it.
>>>>
>>>> I'm running big servers with lots of ram and raid 0 flash drives for
>>>> speed. Also using InnoDB. I'm beginning to wonder if it is ever
>>>> going to
>>>> work and if someone is going to fix it?
>>>>
>>> what makes you think it does not work ?
>>>
>>> --
>>> Best Regards
>>>
>>> MfG Robert Schetterer
>>>
>>> Germany/Munich/Bavaria
>>
>


--
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


robert at schetterer

Dec 21, 2011, 10:58 AM

Post #8 of 24 (1098 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 21.12.2011 19:10, Kris Deugau wrote:
> Marc Perkel wrote:
>> I've been trying for a long time to get bayes/mysql to actually work.
>> Running a dedicated server with MySQL. Several servers running SA
>> configured to talk to it.
>>
>> I'm running big servers with lots of ram and raid 0 flash drives for
>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>> work and if someone is going to fix it?
>
> I'm not sure what official testing has been done, but some testing I did
> about a year ago when upgrading the SA cluster here showed pretty much
> the same IO load for a global Bayes no matter what combination of
> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>
> Enabling MySQL replication also bogged things down pretty badly.
>
> Performance with the database on physical disks simply wasn't keeping up
> with more than about double the average message rate (if that...), so I
> fell back to the "good enough" setup of putting the SA database on a
> RAMdisk, and tweaking the MySQL init script to reload the database on
> startup. A database dump is done once a day, about a half-hour after a
> Bayes expiry run.
>
> This is handling ~250K messages/day, although with some tweaks to
> serialize mail delivery a little more to level off the extreme peaks in
> messages/second it should probably be able to handle a lot more volume.
>
> We also have several SA instances - on the inbound side, the first pass
> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
> off the junk that would usually score 15+ on a full ruleset. Anything
> that gets past that is then passed to a full SA instance with a long
> list of local rules targeted at the ones reported as missed spam by
> customers. That first pass tags more than 80% of the junk for far less
> processing cost than feeding it all through the full ruleset.
>
> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
> take a while to chew through mail if you've only got 16 logical CPU
> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
> at just over 4G.) Average scan times are just under a second.
>
> -kgd
>
> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
> second from netblocks all over the US, all nominally operated by (AKA
> "tagged in WHOIS for") the same group - and quite a lot of it spam.
> Unfortunately MailChimp seems to buy rack space, hosting, or managed
> email servers from them or I'd drop all of their netblocks in the local
> reject-at-the-border DNSBL and be done with it.

Interesting info. By the way,
does anyone know whether PostgreSQL performs better, e.g. with Bayes
clusters?
Using postscreen has at least helped here with stopping bots, so that
mail never reaches spamd.
In large mail systems a SpamAssassin setup certainly has to be
configured very carefully, and analysed at runtime to find performance
tweaks.
However, 250K messages/day does not seem that much to me.
Scanning outbound mail with spamd was slow here too; for that I only use
clamav-milter with Sanesecurity signatures, and I also run it on inbound
mail before spamass-milter.

No flames intended, but for performance issues you always need to look
at the whole mail setup. In most cases there is no straight right or
wrong; only analysing the bottlenecks will help.


--
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


support at junkemailfilter

Dec 22, 2011, 5:45 PM

Post #9 of 24 (1091 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 12/21/2011 10:58 AM, Robert Schetterer wrote:
> Am 21.12.2011 19:10, schrieb Kris Deugau:
>> Marc Perkel wrote:
>>> I've been trying for a long time to get bayes/mysql to actually work.
>>> Running a dedicated server with MySQL. Several servers running SA
>>> configured to talk to it.
>>>
>>> I'm running big servers with lots of ram and raid 0 flash drives for
>>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>>> work and if someone is going to fix it?
>> I'm not sure what official testing has been done, but some testing I did
>> about a year ago when upgrading the SA cluster here showed pretty much
>> the same IO load for a global Bayes no matter what combination of
>> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>>
>> Enabling MySQL replication also bogged things down pretty badly.
>>
>> Performance with the database on physical disks simply wasn't keeping up
>> with more than about double the average message rate (if that...), so I
>> fell back to the "good enough" setup of putting the SA database on a
>> RAMdisk, and tweaking the MySQL init script to reload the database on
>> startup. A database dump is done once a day, about a half-hour after a
>> Bayes expiry run.
>>
>> This is handling ~250K messages/day, although with some tweaks to
>> serialize mail delivery a little more to level off the extreme peaks in
>> messages/second it should probably be able to handle a lot more volume.
>>
>> We also have several SA instances - on the inbound side, the first pass
>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>> off the junk that would usually score 15+ on a full ruleset. Anything
>> that gets past that is then passed to a full SA instance with a long
>> list of local rules targeted at the ones reported as missed spam by
>> customers. That first pass tags more than 80% of the junk for far less
>> processing cost than feeding it all through the full ruleset.
>>
>> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
>> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
>> take a while to chew through mail if you've only got 16 logical CPU
>> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
>> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
>> at just over 4G.) Average scan times are just under a second.
>>
>> -kgd
>>
>> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
>> second from netblocks all over the US, all nominally operated by (AKA
>> "tagged in WHOIS for") the same group - and quite a lot of it spam.
>> Unfortunately MailChimp seems to buy rack space, hosting, or managed
>> email servers from them or I'd drop all of their netblocks in the local
>> reject-at-the-border DNSBL and be done with it.
> Interesting Infos, by the way
> anyone knows postgresql performs better i.e with Bayes clusters etc ?
> at last using postscreen has helped here stopping bots,so these mails
> never reach spamd,
> but for sure in large mailsystems a spamassassin setup
> has to be configured very carefully ever, and analysed during runtime
> to get performance tweaks
> however 250K messages/day seems not that much to me
> scanning outbound mail with spamd ,was slow here too,i only use
> clamav-milter with sanesecurity for that, also for inbound before
> spamass-milter
>
> but no flames, for performance issues, a look to the total mailsetup
> is needed ever, there is no straight right or wrong most cases
> only analysing the bottlenecks will help
>

Maybe it's time for me to try postgresql. Can you provide a link to how
to optimize SA for it?

--
Marc Perkel - Sales/Support
support [at] junkemailfilter
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400


robert at schetterer

Dec 22, 2011, 11:15 PM

Post #10 of 24 (1088 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 23.12.2011 02:45, Marc Perkel wrote:
>
>
> On 12/21/2011 10:58 AM, Robert Schetterer wrote:
>> Am 21.12.2011 19:10, schrieb Kris Deugau:
>>> Marc Perkel wrote:
>>>> I've been trying for a long time to get bayes/mysql to actually work.
>>>> Running a dedicated server with MySQL. Several servers running SA
>>>> configured to talk to it.
>>>>
>>>> I'm running big servers with lots of ram and raid 0 flash drives for
>>>> speed. Also using InnoDB. I'm beginning to wonder if it is ever
>>>> going to
>>>> work and if someone is going to fix it?
>>> I'm not sure what official testing has been done, but some testing I did
>>> about a year ago when upgrading the SA cluster here showed pretty much
>>> the same IO load for a global Bayes no matter what combination of
>>> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>>>
>>> Enabling MySQL replication also bogged things down pretty badly.
>>>
>>> Performance with the database on physical disks simply wasn't keeping up
>>> with more than about double the average message rate (if that...), so I
>>> fell back to the "good enough" setup of putting the SA database on a
>>> RAMdisk, and tweaking the MySQL init script to reload the database on
>>> startup. A database dump is done once a day, about a half-hour after a
>>> Bayes expiry run.
>>>
>>> This is handling ~250K messages/day, although with some tweaks to
>>> serialize mail delivery a little more to level off the extreme peaks in
>>> messages/second it should probably be able to handle a lot more volume.
>>>
>>> We also have several SA instances - on the inbound side, the first pass
>>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>>> off the junk that would usually score 15+ on a full ruleset. Anything
>>> that gets past that is then passed to a full SA instance with a long
>>> list of local rules targeted at the ones reported as missed spam by
>>> customers. That first pass tags more than 80% of the junk for far less
>>> processing cost than feeding it all through the full ruleset.
>>>
>>> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
>>> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
>>> take a while to chew through mail if you've only got 16 logical CPU
>>> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
>>> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
>>> at just over 4G.) Average scan times are just under a second.
>>>
>>> -kgd
>>>
>>> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
>>> second from netblocks all over the US, all nominally operated by (AKA
>>> "tagged in WHOIS for") the same group - and quite a lot of it spam.
>>> Unfortunately MailChimp seems to buy rack space, hosting, or managed
>>> email servers from them or I'd drop all of their netblocks in the local
>>> reject-at-the-border DNSBL and be done with it.
>> Interesting Infos, by the way
>> anyone knows postgresql performs better i.e with Bayes clusters etc ?
>> at last using postscreen has helped here stopping bots,so these mails
>> never reach spamd,
>> but for sure in large mailsystems a spamassassin setup
>> has to be configured very carefully ever, and analysed during runtime
>> to get performance tweaks
>> however 250K messages/day seems not that much to me
>> scanning outbound mail with spamd ,was slow here too,i only use
>> clamav-milter with sanesecurity for that, also for inbound before
>> spamass-milter
>>
>> but no flames, for performance issues, a look to the total mailsetup
>> is needed ever, there is no straight right or wrong most cases
>> only analysing the bottlenecks will help
>>
>
> Maybe it's time for me to try postgresql. Can you provide a link to how
> to optimize SA for it?
>

Sorry, no. I have no links besides the official ones,
but good DB people have told me that PostgreSQL
is handier in cluster setups.
As I said, though, try to limit the amount of mail
reaching SpamAssassin by putting other filtering techniques in front of
it. That should help anyway, regardless of the database.
--
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


jernej.porenta at arnes

Dec 23, 2011, 1:59 AM

Post #11 of 24 (1087 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Dec 23, 2011, at 8:15 AM, Robert Schetterer wrote:

> Am 23.12.2011 02:45, schrieb Marc Perkel:
>>>> This is handling ~250K messages/day, although with some tweaks to
>>>> serialize mail delivery a little more to level off the extreme peaks in
>>>> messages/second it should probably be able to handle a lot more volume.
>>>>
>>>> We also have several SA instances - on the inbound side, the first pass
>>>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>>>> off the junk that would usually score 15+ on a full ruleset. Anything
>>>> that gets past that is then passed to a full SA instance with a long
>>>> list of local rules targeted at the ones reported as missed spam by
>>>> customers. That first pass tags more than 80% of the junk for far less
>>>> processing cost than feeding it all through the full ruleset.


We are processing 300k+ mails (peaks up to 1M/day) with 3 mail servers + 1 dedicated MySQL server replicated to one old server and so far, we haven't seen any performance degradations by using Bayes in MySQL InnoDB engine. Mail servers are dual socket Xeon servers with 8G RAM, while MySQL server is dual-socket Xeon with 48G RAM, but SA Bayes is not the most used database on that server. We are using amavisd-new instead of spamd.

However, we did see some degradation when we moved to a new MySQL server. The following tweaks helped:
- correctly sizing the InnoDB engine
- optimizing MySQL buffer sizes
- disabling the RAID battery autolearn period
- optimizing the I/O scheduler
- optimizing kernel network settings
- optimizing the kernel swappiness level
- using Mail::SpamAssassin::BayesStore::MySQL instead of Mail::SpamAssassin::BayesStore::SQL
- manually pruning auto-whitelist data and Bayes data

Currently our MySQL bayes data has over 2M tokens in place and we don't see any performance impact on SpamAssassin. Our backup setup runs on replicated database, so there is no performance impact on our primary MySQL server.

I don't have any numbers to compare MySQL and PostgreSQL, but I believe that newer versions of MySQL and its derivatives (Percona Server etc.) have improved quite a lot compared to older ones.

regards, Jernej


hege at hege

Dec 23, 2011, 3:29 AM

Post #12 of 24 (1084 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Wed, Dec 21, 2011 at 01:10:27PM -0500, Kris Deugau wrote:
> Marc Perkel wrote:
> >I've been trying for a long time to get bayes/mysql to actually work.
> >Running a dedicated server with MySQL. Several servers running SA
> >configured to talk to it.
> >
> >I'm running big servers with lots of ram and raid 0 flash drives for
> >speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
> >work and if someone is going to fix it?
>
> I'm not sure what official testing has been done, but some testing I
> did about a year ago when upgrading the SA cluster here showed
> pretty much the same IO load for a global Bayes no matter what
> combination of MyISAM, InnoDB, generic SQL, or MySQL-specific SA
> modules I used.
>
> Enabling MySQL replication also bogged things down pretty badly.
>
> Performance with the database on physical disks simply wasn't
> keeping up with more than about double the average message rate (if
> that...), so I fell back to the "good enough" setup of putting the
> SA database on a RAMdisk, and tweaking the MySQL init script to
> reload the database on startup. A database dump is done once a day,
> about a half-hour after a Bayes expiry run.
>
> This is handling ~250K messages/day, although with some tweaks to
> serialize mail delivery a little more to level off the extreme peaks
> in messages/second it should probably be able to handle a lot more
> volume.

I guess it still boils down to basics. No matter what the database server is
used for, same principles apply. If you have slooow disks, then things are
going to be slow.

Ideally you should compile the newest MySQL by hand. Older versions don't
use the new, faster InnoDB Plugin codebase.

Disk / fsync() is almost always the bottleneck. If you don't have critical
stuff in the same database, look at all the relevant options
(innodb_flush_log_at_trx_commit=0, sync_binlog=0 etc). You could even run
separate instance for SA only with all the fastest options. Probably some
similar options for replication exist (speed vs reliability), no experience
with that.

Also you can tune the default schema. Drop atime index, it's pointless when
using manual expiry. If you have simple global bayes, change "id" column to
tinyint, it will cut your database size in half. I've also changed
spam_count and ham_count to smallint, since I don't have that much traffic.
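
A rough back-of-the-envelope check of those column changes, counting raw
column bytes only. The column layout is an assumption (id, a 5-byte
token, spam_count, ham_count, atime, with every integer 4 bytes wide in
the default schema); check it against your own table. It ignores InnoDB
row overhead and the indexes that also carry id, so it is not expected
to reproduce the "half" figure.

```python
import struct

# "<" means no padding, so calcsize() is simply the sum of the column widths.
# i = 4-byte int, h = 2-byte smallint, b = 1-byte tinyint, 5s = 5-byte token.
DEFAULT_ROW = struct.calcsize("<i5siii")  # id, token, spam_count, ham_count, atime
TUNED_ROW = struct.calcsize("<b5shhi")    # tinyint id, smallint counts


def max_signed(width_bytes: int) -> int:
    """Largest value a signed integer column of this width can hold."""
    return 2 ** (8 * width_bytes - 1) - 1


print(DEFAULT_ROW, TUNED_ROW)        # 21 14
print(max_signed(1), max_signed(2))  # 127 32767
```

The second line is the catch: signed smallint counts only work if no
token is ever counted more than 32767 times, which is why this suits
low-traffic sites.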

Since these issues pop up here every now and then, I guess SA needs its
own tutorial/howto for MySQL tuning.


me at junc

Dec 23, 2011, 3:58 AM

Post #13 of 24 (1085 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, 23 Dec 2011 13:29:00 +0200, Henrik K wrote:

> Since these issues pop up here every now and then, I guess SA needs
> own
> tutorial/howto for MySQL tuning..

Googling mysqltuner was a help for me, even though I don't have much traffic here:

http://www.google.dk/search?aq=f&sourceid=chrome&ie=UTF-8&q=mysqltuner

Can SA use MySQL Cluster, by the way?

That would spread InnoDB across several MySQL Cluster databases, with
the cluster itself handling the syncing.

Merry Xmas, by the way.


spamassassin at lists

Dec 23, 2011, 6:10 AM

Post #14 of 24 (1087 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 23/12/11 11:29, Henrik K wrote:

>> Performance with the database on physical disks simply wasn't
>> keeping up with more than about double the average message rate (if
>> that...), so I fell back to the "good enough" setup of putting the
>> SA database on a RAMdisk,

> I guess it still boils down to basics. No matter what the database server is
> used for, same principles apply. If you have slooow disks, then things are
> going to be slow.

As I understand it, if the MySQL query cache is tuned appropriately,
then most of the queries should not be touching disk anyway?

--
Mike Cardwell https://grepular.com/ https://twitter.com/mickeyc
Professional http://cardwellit.com/ http://linkedin.com/in/mikecardwell
PGP.mit.edu 0018461F/35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F


hege at hege

Dec 23, 2011, 6:20 AM

Post #15 of 24 (1084 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, Dec 23, 2011 at 02:10:16PM +0000, spamassassin [at] lists wrote:
> On 23/12/11 11:29, Henrik K wrote:
>
> >> Performance with the database on physical disks simply wasn't
> >> keeping up with more than about double the average message rate (if
> >> that...), so I fell back to the "good enough" setup of putting the
> >> SA database on a RAMdisk,
>
> > I guess it still boils down to basics. No matter what the database server is
> > used for, same principles apply. If you have slooow disks, then things are
> > going to be slow.
>
> As I understand it, if the MySQL query cache is tuned appropriately,
> then most of the queries should not be touching disk anyway?

Enabling query cache will probably (marginally) slow things down. Bayes
queries are extremely random, so there's nothing to cache. Any write to the
table will invalidate caches anyway. And those writes happen every time a
token is read (atime is updated).


dfs at roaringpenguin

Dec 23, 2011, 6:25 AM

Post #16 of 24 (1089 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

I don't believe any kind of SQL database is the best choice for Bayes
(which involves simple keyed lookups). We use Dan Bernstein's "cdb"
file format with great success. Each user has his or her own CDB file,
and there is also a sitewide file containing 5.7 million tokens.

The CDB software uses mmap() to map the CDB file into memory. As long
as your server has lots of memory, the OS's memory management system
keeps heavily-used CDB files in memory... no arcane tuning required.
[Actually, this is the key for any kind of fast Bayes lookup: Build a
server with huge gobs of memory. :)]

I realize SpamAssassin does not use CDB files for Bayes. But if the
developers are looking for a new back-end, I highly recommend CDB
for its excellent performance.

The only downside to CDB is that incremental updates are not possible.
To train, you need to rebuild the entire CDB file. For us, that's
an acceptable tradeoff, but YMMV.
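
A minimal sketch of the same trade-off: a read-only lookup file that is
mmap()ed by readers and rebuilt from scratch to "train". This is NOT the
cdb format (cdb uses hash tables); it is a simpler stand-in using sorted
fixed-width records and binary search. The record layout, the SHA-1 key,
and the function names are assumptions made for illustration.

```python
import hashlib
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<8sII")  # 8-byte key, spam count, ham count


def _key(token: str) -> bytes:
    """Fixed-width key for a token (first 8 bytes of its SHA-1)."""
    return hashlib.sha1(token.encode("utf-8")).digest()[:8]


def build(path: str, counts: dict) -> None:
    """Write the whole file from scratch, then swap it into place.

    `counts` maps token -> (spam, ham) and must not be empty, because an
    empty file cannot be mmap()ed by the reader.
    """
    rows = sorted((_key(tok), spam, ham) for tok, (spam, ham) in counts.items())
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as out:
        for row in rows:
            out.write(RECORD.pack(*row))
    os.replace(tmp, path)  # readers see either the old file or the new one


class Reader:
    """Read-only view of a built file; the OS page cache keeps it hot."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(self._map) // RECORD.size

    def get(self, token: str):
        """Binary-search the sorted records; return (spam, ham) or None."""
        want = _key(token)
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            key, spam, ham = RECORD.unpack_from(self._map, mid * RECORD.size)
            if key == want:
                return spam, ham
            if key < want:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

Training means calling build() again with the updated counts; there is
no way to change a single record in place, which mirrors the downside
described above.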

Regards,

David.


spamassassin at lists

Dec 23, 2011, 7:03 AM

Post #17 of 24 (1089 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 23/12/11 14:20, Henrik K wrote:

>> As I understand it, if the MySQL query cache is tuned appropriately,
>> then most of the queries should not be touching disk anyway?
>
> Enabling query cache will probably (marginally) slow things down. Bayes
> queries are extremely random, so there's nothing to cache. Any write to the
> table will invalidate caches anyway. And those writes happen every time a
> token is read (atime is updated).

To stop the query cache being invalidated, it would probably be better
if the writes were queued and then done in batches. Can SpamAssassin
handle this sort of queue internally, or would some sort of additional
technology be required?

I don't know what the point of the atime data is, but is there any need
to update the atime on every read? Could that write be skipped if the
atime is already within a certain period of time? Ie, if the atime has
already been updated in the last 5 minutes, is there any point in doing
it again?
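
A minimal sketch of that "skip the write if atime is recent" idea. It
uses an in-memory SQLite table as a stand-in for the MySQL one, and a
cut-down table with only token and atime columns; the function name and
the slack constant are hypothetical. This is not how SpamAssassin
currently behaves, only what the proposal could look like.

```python
import sqlite3

ATIME_SLACK = 300  # seconds; the "last 5 minutes" from the question above


def open_db() -> sqlite3.Connection:
    """Create a cut-down stand-in for the Bayes token table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE bayes_token (token TEXT PRIMARY KEY, atime INTEGER NOT NULL)"
    )
    return db


def touch(db: sqlite3.Connection, tokens: list, now: int) -> int:
    """Bump atime only for tokens whose stored atime is older than the slack.

    Returns the number of rows actually written.
    """
    if not tokens:
        return 0
    placeholders = ",".join("?" for _ in tokens)
    cur = db.execute(
        "UPDATE bayes_token SET atime = ? "
        "WHERE token IN (%s) AND atime < ?" % placeholders,
        [now, *tokens, now - ATIME_SLACK],
    )
    db.commit()
    return cur.rowcount
```

With the extra `atime < ?` condition, a token that is read many times
within five minutes is written once rather than on every read.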

--
Mike Cardwell https://grepular.com/ https://twitter.com/mickeyc
Professional http://cardwellit.com/ http://linkedin.com/in/mikecardwell
PGP.mit.edu 0018461F/35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F


spamassassin at lists

Dec 23, 2011, 7:05 AM

Post #18 of 24 (1085 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On 23/12/11 14:25, David F. Skoll wrote:

> The only downside to CDB is that incremental updates are not possible.
> To train, you need to rebuild the entire CDB file. For us, that's
> an acceptable tradeoff, but YMMV.

Another major downside to this approach compared to using MySQL, is that
it doesn't allow you to access the same bayes db from multiple machines
at the same time. Unless I'm mistaken..?

--
Mike Cardwell https://grepular.com/ https://twitter.com/mickeyc
Professional http://cardwellit.com/ http://linkedin.com/in/mikecardwell
PGP.mit.edu 0018461F/35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F


dfs at roaringpenguin

Dec 23, 2011, 7:09 AM

Post #19 of 24 (1085 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, 23 Dec 2011 15:05:42 +0000
spamassassin [at] lists wrote:

> Another major downside to this approach compared to using MySQL, is
> that it doesn't allow you to access the same bayes db from multiple
> machines at the same time. Unless I'm mistaken..?

You're correct. We rsync the CDB files around to our scanners. In
this way, your available disk bandwidth scales up with the number
of scanners.

For setups with a large number of scanners where the rsyncs get
annoying, we have an experimental Bayes server that takes a token list
and returns the probability. It works pretty well.
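
The server itself is not shown in this thread, so the following is not
its implementation. It is only a minimal sketch of what "token list in,
probability out" can look like, using a textbook naive Bayes combination
with light smoothing. The function names, the smoothing constants, and
the in-memory table are assumptions, and SpamAssassin's own combiner is
different.

```python
import math
from typing import Dict, Iterable, List, Tuple


def token_probability(spam: int, ham: int, nspam: int, nham: int) -> float:
    """Per-token spam probability, pulled toward 0.5 when data is thin."""
    spam_freq = spam / nspam
    ham_freq = ham / nham
    total = spam_freq + ham_freq
    raw = spam_freq / total if total else 0.5
    seen = spam + ham
    # Smoothing with prior 0.5 and strength 1 keeps the result strictly
    # between 0 and 1, so the logarithms below are always finite.
    return (0.5 + seen * raw) / (1 + seen)


def combine(probs: List[float]) -> float:
    """Naive Bayes combination: prod(p) / (prod(p) + prod(1 - p))."""
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    diff = log_ham - log_spam
    if diff > 700:  # exp() would overflow; the answer is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def spam_probability(tokens: Iterable[str],
                     table: Dict[str, Tuple[int, int]],
                     nspam: int, nham: int) -> float:
    """Score a token list against a table of (spam, ham) counts."""
    probs = [token_probability(*table[t], nspam, nham)
             for t in tokens if t in table]
    return combine(probs) if probs else 0.5
```

Unknown tokens are simply skipped, and a message with no known tokens
scores a neutral 0.5.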

Regards,

David.


hege at hege

Dec 23, 2011, 7:15 AM

Post #20 of 24 (1087 views)
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, Dec 23, 2011 at 03:03:09PM +0000, spamassassin [at] lists wrote:
> On 23/12/11 14:20, Henrik K wrote:
>
> >> As I understand it, if the MySQL query cache is tuned appropriately,
> >> then most of the queries should not be touching disk anyway?
> >
> > Enabling query cache will probably (marginally) slow things down. Bayes
> > queries are extremely random, so there's nothing to cache. Any write to the
> > table will invalidate caches anyway. And those writes happen every time a
> > token is read (atime is updated).
>
> To stop the query cache being invalidated, it would probably be better
> if the writes were queued and then done in batches. Can SpamAssassin
> handle this sort of queue internally, or would some sort of additional
> technology be required?

Keep in mind that tokens are looked up in batches of 50 or so (token in
('token1','token2','token3'...)). Since MySQL caches/hashes the query
_exactly_ as written, it's unlikely you'll ever get two identical SQL
statements.
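
A small simulation makes the point concrete. This is only a sketch under
stated assumptions: the query text, the string tokens and the vocabulary size
are made up for illustration (real Bayes tokens are not readable strings),
and the "cache" is just a Python set keyed on the exact query string, which
mimics the exact-match behaviour described above.

```python
import random

BATCH_SIZE = 50  # assumed batch size, per the "50 or so" above


def build_token_query(tokens):
    """Return the literal SQL text for one batch of token lookups."""
    quoted = ",".join("'%s'" % t for t in tokens)
    return (
        "SELECT token, spam_count, ham_count FROM bayes_token "
        "WHERE token IN (%s)" % quoted
    )


def cache_hit_rate(num_messages, vocabulary_size, seed=0):
    """Simulate a cache keyed on the exact query string; return its hit rate."""
    rng = random.Random(seed)
    vocabulary = ["tok%05d" % i for i in range(vocabulary_size)]
    seen = set()
    hits = 0
    for _ in range(num_messages):
        # Each message contributes a different random set of tokens.
        query = build_token_query(rng.sample(vocabulary, BATCH_SIZE))
        if query in seen:
            hits += 1
        seen.add(query)
    return hits / num_messages


if __name__ == "__main__":
    print(cache_hit_rate(num_messages=10000, vocabulary_size=5000))
```

Even the same tokens in a different order produce a different query string,
so an exact-match cache has essentially nothing to reuse.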

> I don't know what the point of the atime data is, but is there any need
> to update the atime on every read? Could that write be skipped if the
> atime is already within a certain period of time? Ie, if the atime has
> already been updated in the last 5 minutes, is there any point in doing
> it again?

That's a question worth entering into bugzilla. I doubt it would even make
a difference if the time frame were 1 day. After all, the only point of
atime is to expire very old unused tokens. It would be fun to benchmark if
I had time.
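
The idea of skipping recent atime updates can be sketched as follows. This is
not how SpamAssassin's SQL backend is written; it is a minimal sketch using
an in-memory SQLite table with a simplified, assumed schema (token and atime
only), and the 300-second window is a hypothetical value taken from the
"5 minutes" suggestion above.

```python
import sqlite3

ATIME_WINDOW = 300  # seconds; hypothetical threshold


def make_db():
    """Create an in-memory table with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token (token TEXT PRIMARY KEY, atime INTEGER NOT NULL)"
    )
    return conn


def touch_tokens(conn, tokens, now, window=ATIME_WINDOW):
    """Update atime only for tokens whose stored atime is older than `window`.

    Returns the number of rows actually written.
    """
    if not tokens:
        return 0
    placeholders = ",".join("?" for _ in tokens)
    cur = conn.execute(
        "UPDATE bayes_token SET atime = ? "
        "WHERE token IN (%s) AND atime < ?" % placeholders,
        [now, *tokens, now - window],
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_db()
    conn.executemany(
        "INSERT INTO bayes_token VALUES (?, ?)", [("a", 1000), ("b", 1000)]
    )
    print(touch_tokens(conn, ["a", "b"], now=1100))  # inside the window
    print(touch_tokens(conn, ["a", "b"], now=1400))  # outside the window
```

With a window of a day rather than five minutes, frequently seen tokens would
be written at most once per day, which is still precise enough for expiry.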


jhardin at impsec

Dec 23, 2011, 7:49 AM

Post #21 of 24 (1084 views)
Permalink
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, 23 Dec 2011, spamassassin [at] lists wrote:

> On 23/12/11 14:25, David F. Skoll wrote:
>
>> The only downside to CDB is that incremental updates are not possible.
>> To train, you need to rebuild the entire CDB file. For us, that's
>> an acceptable tradeoff, but YMMV.
>
> Another major downside to this approach compared to using MySQL, is that
> it doesn't allow you to access the same bayes db from multiple machines
> at the same time. Unless I'm mistaken..?

Each machine would have its own copy of the latest database.

Learning would be to a master that is not being read, and that master
would be periodically distributed to SA hosts.
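
On each SA host, the freshly distributed copy should replace the live file in
one step so that readers never see a half-written database. A minimal sketch
of that swap, assuming the new copy has already arrived on the host; the
function name and file names are hypothetical. (rsync's default behaviour of
writing to a temporary file and renaming it gives a similar effect.)

```python
import os
import shutil
import tempfile


def publish_database(new_db_path, live_db_path):
    """Copy new_db_path into place at live_db_path via an atomic rename."""
    live_dir = os.path.dirname(os.path.abspath(live_db_path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=live_dir, prefix=".bayes-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, open(new_db_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers opening live_db_path see either the old or the new file.
        os.replace(tmp_path, live_db_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    master = os.path.join(workdir, "master.db")
    live = os.path.join(workdir, "live.db")
    with open(master, "wb") as f:
        f.write(b"new tokens")
    publish_database(master, live)
    print(os.listdir(workdir))
```

Processes that already have the old file open keep reading the old contents
until they reopen the path.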

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
"Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
does quite what I want. I wish Christopher Robin was here."
-- Peter da Silva in a.s.r
-----------------------------------------------------------------------
2 days until Christmas


mr88talent at gmail

Dec 24, 2011, 11:13 AM

Post #22 of 24 (1075 views)
Permalink
Re: Bayes and MySQL - does it actually work? [In reply to]

On Fri, Dec 23, 2011 at 4:58 AM, Benny Pedersen wrote:
> On Fri, 23 Dec 2011 13:29:00 +0200, Henrik K wrote:
>
>> Since these issues pop up here every now and then, I guess SA needs own
>> tutorial/howto for MySQL tuning..
>
>
> google mysqltuner was a help for me even i have not much trafic here
>
> http://www.google.dk/search?aq=f&sourceid=chrome&ie=UTF-8&q=mysqltuner
>
> can sa use mysqlcluster btw ?
>
> spread innodb to more mysqlcluster db, where the cluster it self sync
> diggest
>
> marry xmax btw
>

FYI, I clicked the link above and clicked on a howtoforge document
regarding mysqltuner and was nearly infected with a fake AV virus. I
loaded up task manager and killed IE before it had a chance to infect
me. So, beware.

--
Gary V


jernej.porenta at arnes

Dec 25, 2011, 9:11 AM

Post #23 of 24 (1068 views)
Permalink
Re: Bayes and MySQL - does it actually work? [In reply to]

>
> FYI, I clicked the link above and clicked on a howtoforge document
> regarding mysqltuner and was nearly infected with fake AV virus. I
> loaded up task manager and killed IE before it had a chance to infect
> me. So, beware.


Use wget ;)

# wget mysqltuner.pl
# chmod u+x mysqltuner.pl
# ./mysqltuner.pl

cheers, J.

PS: not kidding ;)


maxsec at gmail

Dec 25, 2011, 10:35 AM

Post #24 of 24 (1066 views)
Permalink
Re: Bayes and MySQL - does it actually work? [In reply to]

Using IE??? I recommend Firefox or Chrome: much faster and less prone to
security issues.

Martin


On Saturday, 24 December 2011, Gary V <mr88talent [at] gmail> wrote:
> On Fri, Dec 23, 2011 at 4:58 AM, Benny Pedersen wrote:
>> On Fri, 23 Dec 2011 13:29:00 +0200, Henrik K wrote:
>>
>>> Since these issues pop up here every now and then, I guess SA needs own
>>> tutorial/howto for MySQL tuning..
>>
>>
>> google mysqltuner was a help for me even i have not much trafic here
>>
>> http://www.google.dk/search?aq=f&sourceid=chrome&ie=UTF-8&q=mysqltuner
>>
>> can sa use mysqlcluster btw ?
>>
>> spread innodb to more mysqlcluster db, where the cluster it self sync
>> diggest
>>
>> marry xmax btw
>>
>
> FYI, I clicked the link above and clicked on a howtoforge document
> regarding mysqltuner and was nearly infected with fake AV virus. I
> loaded up task manager and killed IE before it had a chance to infect
> me. So, beware.
>
> --
> Gary V
>

--
--
Martin Hepworth
Oxford, UK
