
feh at fehcom
Aug 17, 2009, 2:17 PM
Post #4 of 5
Hi Johannes,

thanks for your interesting studies and samples.

However, if your goal is to 'identify' or at least 'define a probability
of spam mails based on their size', I very much doubt that this yields
practical results.

Years ago, I did something comparable:

http://www.fehcom.de/qmail.html (see chapter 7.2 of my qmail book)

In any case, check the email-size distribution not only against the Zipf
distribution, but also under the assumption that it is binomially
distributed (or even negative binomial, NBD).

Your double-log plots don't show much; they rather hide things.

Apply the following cases:

* Spam mails.
* Automated mails (by systems).
* Mails generated by humans and the current mail clients.

Differentiate between HTML mails and flat mails.

regards.
--eh.

On Mon, 17 Aug 2009 07:13:41 +0200, Johannes Buchner <e0625457 [at] student> wrote:
> Hi Markus!
>
> On Sun, 16 Aug 2009 21:48:51 +0200
> Markus Stumpf <lists-qmail [at] maexotic> wrote:
>
>> On Sat, Aug 15, 2009 at 10:25:48PM +1200, Johannes Buchner wrote:
>> > For this, I would like to ask for your assistance: Could you, if you
>> > have access to a large number of mail boxes, obtain the file sizes
>> > of each mail, and email me that list?
>>
>> I may be wrong, but I don't think that taking sizes of eMails from the
>> mailboxes is the same as taking sizes of eMails flowing through
>> the system.
>> eMails in the boxes relate to eMails the users find worthy to keep,
>> whereas they may consider e.g. short messages (think sms) or status
>> messages from social networks or even mailing list messages more
>> temporary and delete them after reading.
>> So you wouldn't measure "email sizes", but "sizes of emails users
>> think it is worth keeping", which may (or may not) be a big
>> difference.
> You are absolutely right. In the end, I am trying to model the way
> users use (send) emails.
>
> Obviously, today email is used in various ways, usage classes if you
> will:
> - Mailing lists and newsletters
> - Bugmails and automated emails
> - Spam
> - File attachments
> - Text-only vs. HTML mails
> All these are usually mixed up in the inbox. I actually didn't want
> to look at the differences, but I couldn't get around doing it:
>
> I analysed the sizes of my mails, which I separated into the following
> categories: ham (1108), spam (4350), mailing lists (various lists,
> 1420), bugmail (bugzilla, flyspray, 970), sent (170).
>
> I am always looking at buckets (how many mails are smaller than size
> x, but bigger than the limit of the previous bucket). I used the
> limits: 1024 1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288
> 14336 16384 20480 24576 28672 32768 40960 49152 53248 57344 65536
> 131072 262144 524288 1048576 10264576 and 1073741824 (bytes).
>
> In the attached bytype-mymail.png, I plotted the percentage of mails
> you would find in a bucket for a given size. You can see that the
> bugmail and mailing lists have very definite profiles. I assume this is
> because of the strict rules in mailing lists (no attachments except
> patches, no html). In addition, I added the lkml.
>
> Here, it seems spam and inbox are quite similar, and that I send many
> mails < 2kb. It is interesting to see that the profile of the mails I
> receive is very different from that of the mails I send.
> In the cumulative view (bytype-mymail-cum.png), you can also see the
> definite shape of mailing lists and bugmail.
>
> Let's take a closer look at the difference between spam and ham mails:
> The separation is correct (there are no mails wrongly categorized).
> mymail-ham-spam-ratio.png shows that below 1500 bytes, the share of
> spam is incredibly high. The smallest human-written mail I possess is
> 1100 bytes. Generated mails are usually very short, and very often less
> than 1kb, and a certain class of spam is too (e.g. one-liners and a
> link).
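
The bucketing and the per-bucket spam share described above can be sketched roughly as follows. This is only an illustration, not the script used for the plots: the limits are copied from the mail, but the function names are made up here, and treating each limit as an inclusive upper bound (a mail of exactly 1024 bytes lands in the first bucket) is an assumption, since the mail does not say how ties are handled.

```python
from bisect import bisect_left

# Bucket upper limits in bytes, copied verbatim from the mail.
LIMITS = [1024, 1536, 2048, 2560, 3072, 4096, 5120, 6144, 7168, 8192,
          10240, 12288, 14336, 16384, 20480, 24576, 28672, 32768, 40960,
          49152, 53248, 57344, 65536, 131072, 262144, 524288, 1048576,
          10264576, 1073741824]


def bucket_counts(sizes, limits=LIMITS):
    """Count mails per bucket.

    Bucket i holds sizes with limits[i-1] < size <= limits[i]
    (assumption: upper limit inclusive). Sizes above the last
    limit are ignored.
    """
    counts = [0] * len(limits)
    for size in sizes:
        i = bisect_left(limits, size)
        if i < len(limits):
            counts[i] += 1
    return counts


def percentages(counts):
    """Percentage of all counted mails that falls into each bucket."""
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


def cumulative(values):
    """Running sum, as used for a cumulative view."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out


def spam_share(spam_counts, ham_counts):
    """Share of spam per bucket; None where a bucket holds no mail at all."""
    return [s / (s + h) if (s + h) else None
            for s, h in zip(spam_counts, ham_counts)]
```

Feeding it two lists of sizes, one for spam and one for ham, yields the per-bucket share with `spam_share(bucket_counts(spam), bucket_counts(ham))`. Empty buckets give `None` rather than a division by zero, which corresponds to the missing data points mentioned in the mail.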
>
> Another class of spam can be seen in mymail-spam-percentage.png
> between 12KB and 30KB, and 50KB and 60KB (where the data point is
> missing I did not send any mails, thus the ratio is infinite). This is
> HTML spam.
>
> What can be learned from this is that an effective measure against
> small spam mails (<1500 bytes) might be to let the acceptance of a
> mail take at least as long as the connection would need to be open for
> 1500 bytes.
> Shortening spam messages to increase the number of spam mails would
> become infeasible for spammers. I can imagine keeping the connection
> open similar to the tarpit/Teergrube technique.
> As a side effect, other kinds of generated mail would be affected; they
> shouldn't be at their limit of resources (# of connections), though.
>
> So far I laid out my arguments following the assumption that my mail
> box is representative of the whole world. The sample of mails is
> large enough to tell about my mail usage. I am a hacker, I use PGP, I do
> not use HTML mail; thus, I am not an average email user. My mail box is
> not representative of the whole world.
>
> I received samples from two people (thanks again), with 30991 and 8759
> sizes. Presumably these are not spam-free. In others-mail.png you see
> them plotted, as well as my ham, my spam and the lkml.
>
> My goal was to come up with a realistic email generator for network
> simulations (human-sent email), which I modelled after my ham by laying
> three lines through the graph in others-mail.png (knots at 3KB and
> 70KB, and hard limits >800Bytes, <10MB).
> My conclusion is that the mail distribution is a sum of various
> classes that cannot really be distinguished, but can be approximated by
> three Zipfian distributions.
> It is also important to keep in mind that mobile device email usage will
> have a different profile of mail sizes than desktop usage (due to the
> input devices and the limited number of files to attach).
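
One way to read "three lines through the graph" is a size generator built from three truncated power-law segments, sampled by inverse transform. The sketch below is not the attached generator. Only the knots (3 KB, 70 KB) and the hard limits (800 bytes, 10 MB) come from the mail; the exponents and segment weights are placeholder values that would have to be fitted to real data, and all names are invented for this example.

```python
import random

KB = 1024

# (lower bound, upper bound, exponent, weight)
# Bounds follow the mail; exponents and weights are PLACEHOLDERS, not fitted.
SEGMENTS = [
    (800, 3 * KB, 0.5, 0.55),
    (3 * KB, 70 * KB, 1.5, 0.40),
    (70 * KB, 10 * KB * KB, 1.0, 0.05),
]


def sample_power_law(lo, hi, a, u):
    """Inverse-CDF sample of a density p(x) ~ x**(-a) truncated to [lo, hi].

    u is a uniform random number in [0, 1].
    """
    if abs(a - 1.0) < 1e-12:
        # Special case a == 1: the CDF is logarithmic.
        return lo * (hi / lo) ** u
    p = 1.0 - a
    return (lo ** p + u * (hi ** p - lo ** p)) ** (1.0 / p)


def sample_mail_size(rng=random, segments=SEGMENTS):
    """Pick a segment by weight, then draw a size in bytes from it."""
    weights = [seg[3] for seg in segments]
    lo, hi, a, _ = rng.choices(segments, weights=weights)[0]
    size = int(round(sample_power_law(lo, hi, a, rng.random())))
    # Clamp to guard against floating-point overshoot at the edges.
    return min(max(size, lo), hi)
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible. Whether three independent segments or one continuous piecewise curve is the better model is left open here, as it is in the mail.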
>
> The Zipfian distribution can be applied to many types of traffic (as
> Adamic and Huberman (2002) show here
> http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf
> ),
> but email is a special case due to its size limit (used to be 10MB,
> mostly still is). I suspect this is the reason the distribution gets
> flatter after 70KB: Attachments larger than 10MB are not possible. On
> the other hand, it might just be the usage class of binary attachments
> poking out.
>
> In case anyone wonders why I don't write a paper about it, well, some
> of it I'll use for my course paper; on its own it is too thin and I'd
> need more samples. I will probably make a (more beautiful) writeup on my
> blog though: http://johannes.jakeapp.com/blog/
>
> I also attached the source code of my mail generator (which I intend
> to use in combination with ns-2 or ns-3).
>
> I'm ready for your storm of criticism now :-)
>
> Have a fun day,
> Johannes
>

--
Dr. Erwin Hoffmann | FEHCom | http://www.fehcom.de