bugzilla-daemon at bugzilla
Feb 26, 2012, 1:46 PM
[Bug 5185] Bayesian learning uses different message checksums during exiscan_acl and later sa_learn
--- Comment #26 from Richard van der Hoff <bugzilla [at] rvanderhoff> 2012-02-26 21:46:18 UTC ---
A few thoughts on this from me:
(In reply to comment #10)
> By the way, comment 3 and comment 4 both suggest this will only affect
> messages with no Received header. I'm pretty sure that's not the case.
This, on further inspection, is wrong. We use the earliest Received header, so
the MTA's frobbing of the time on the last Received header only matters if
there are no other Received headers. I probably looked at a message with no
other Received headers when I reported this, so I missed the CRLF vs LF issue.
I still think both issues need addressing, however.
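To illustrate the CRLF vs LF issue, here is a minimal sketch in Python. It assumes a generated id that is simply a SHA-1 over a header and the body; this is an illustration only, not SpamAssassin's actual msgid code, and the function names `naive_digest` and `normalized_digest` are hypothetical.

```python
import hashlib


def naive_digest(header: str, body: str) -> str:
    """Hash the text exactly as received (assumed scheme, not SpamAssassin's)."""
    return hashlib.sha1((header + "\0" + body).encode("utf-8")).hexdigest()


def normalized_digest(header: str, body: str) -> str:
    """Hash after converting CRLF line endings to LF."""
    def to_lf(text: str) -> str:
        return text.replace("\r\n", "\n")

    return naive_digest(to_lf(header), to_lf(body))


# The same message as seen at SMTP time (CRLF) and later in a mail folder (LF).
received = "Received: from a.example by b.example; Sun, 26 Feb 2012 21:46:18 +0000"
body_lf = "line one\nline two\n"
body_crlf = body_lf.replace("\n", "\r\n")

# Byte-for-byte hashing gives two different ids for one message.
ids_differ = naive_digest(received, body_lf) != naive_digest(received, body_crlf)

# Normalizing line endings first makes the two ids agree.
ids_match = normalized_digest(received, body_lf) == normalized_digest(received, body_crlf)
```

If the scan at SMTP time sees CRLF and the later learn run sees LF, only the normalized variant would recognise the message as already learnt.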
(In reply to comment #20)
> There needs to be some part of msgid that isn't under the control of spammers,
> otherwise it's trivial for them to prevent their spam ever being learned. They
> can generate as many spams with the same msgid as they like, and they can prime
> the database with an initial dummy high-scoring spam that has no usable tokens
> in common with the rest.
Given that the earliest Received header is most certainly under the control of
the spammers, I don't think we've made anything worse in this regard. What we
have now might not be perfect, but I think calls to put it back as it was are
overstating matters.
Perhaps I'm being dense, but I don't really see how the spammers can use this
to their advantage. Is preventing your spams being learnt really that useful?
(In reply to comment #25)
> I feel we need to aim for a solution that works for everyone as the goal before
> we add yet another configuration option.
Agreed. Flexibility is all well and good, but having millions of configuration
options makes it really hard for people to get a piece of software working as
(In reply to comment #23)
> I think if we can get a msg_id that is more unique to the message sans the
> transport path, it could IMPROVE bayes use.
Whilst that's true, I have another suggestion. At the end of the day, we're
just trying to uniquely identify a particular message on our server, right?
Even if I get two copies of a spam, I can learn them as spam separately; I just
want to prevent re-learning each one on subsequent folder scans etc. So how
about trying to extract the local message id from the most recent Received
header, rather than all this messing about with checksums?
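A rough sketch of that last suggestion, in Python. It assumes the local MTA writes an `id <queue-id>` clause into the Received header it adds, and that this header is the topmost one in the message; neither is guaranteed for every MTA. The function name `local_message_id` and the sample message are hypothetical.

```python
import email
import re
from typing import Optional

# Matches the "id <value>" clause of a Received header (assumed to be present).
_ID_RE = re.compile(r"\bid\s+([^\s;]+)", re.IGNORECASE)


def local_message_id(raw_message: str) -> Optional[str]:
    """Return the id from the most recent (topmost) Received header, if any."""
    msg = email.message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    # Headers are returned in message order, so index 0 is the newest hop.
    match = _ID_RE.search(received[0])
    return match.group(1) if match else None


# Made-up message: the top header is ours, the lower one came from upstream.
RAW = (
    "Received: from mx.example.org ([192.0.2.1])\n"
    "\tby mail.example.net with esmtp (Exim 4.77)\n"
    "\tid 1S1kXy-0001Ab-2C\n"
    "\tfor user@example.net; Sun, 26 Feb 2012 21:46:18 +0000\n"
    "Received: from spammer.example by mx.example.org with SMTP id FAKE123;"
    " Sun, 26 Feb 2012 21:40:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

queue_id = local_message_id(RAW)
```

Only the topmost header is consulted, so an id forged in an earlier Received header is ignored. A message with no Received header, or with no `id` clause, yields `None` and would still need some fallback.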