
e0625457 at student
Aug 16, 2009, 10:13 PM
Hi Markus!

On Sun, 16 Aug 2009 21:48:51 +0200
Markus Stumpf <lists-qmail [at] maexotic> wrote:

> On Sat, Aug 15, 2009 at 10:25:48PM +1200, Johannes Buchner wrote:
> > For this, I would like to ask for your assistence: Could you, if you
> > have access to a large number of mail boxes, obtain the file sizes
> > of each mail, and email me that list?
>
> I may be wrong, but I don't think that taking sizes of eMails from the
> mailboxes is the same as taking sizes from of eMails flowing through
> the system.
> eMails in the boxes relate to eMails the users find worthy to keep,
> whereas they may consider e.g. short messages (think sms) or status
> messages from social networks or even mailing list messages more
> temporary and delete them after reading.
> So you wouldn't measure "email sizes", but "sizes of emails users
> think it is worth keeping", which may (or may not) be a big
> difference.

You are absolutely right. In the end, I am trying to model the way users send email. Obviously, email is used in various ways today, usage classes if you will:

- Mailing lists and newsletters
- Bugmail and automated emails
- Spam
- File attachments
- Text-only vs. HTML mail

All of these are usually mixed together in the inbox. I did not originally want to look at the differences, but I could not avoid it: I analysed the sizes of my mails, which I separated into the following categories: ham (1108), spam (4350), mailing lists (various lists, 1420), bugmail (bugzilla, flyspray, 970), sent (170).

I always look at buckets (how many mails are smaller than size x, but bigger than the previous bucket's limit). I used the limits: 1024 1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336 16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072 262144 524288 1048576 10264576 and 1073741824 (bytes).

In the attached bytype-mymail.png, I plotted the percentage of mails you would find in a bucket for a given size.
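
The bucket counting described above could be sketched roughly as follows. This is only an illustration, not the script used for the plots: the function names are made up, and it assumes that a mail whose size equals a limit falls into the next bucket up.

```python
from bisect import bisect_right

# Bucket upper limits in bytes, copied verbatim from the list above.
LIMITS = [1024, 1536, 2048, 2560, 3072, 4096, 5120, 6144, 7168, 8192,
          10240, 12288, 14336, 16384, 20480, 24576, 28672, 32768, 40960,
          49152, 53248, 57344, 65536, 131072, 262144, 524288, 1048576,
          10264576, 1073741824]


def bucket_percentages(sizes, limits=LIMITS):
    """Map each limit to the percentage of mails with prev_limit <= size < limit.

    Hypothetical helper; the boundary handling is an assumption.
    """
    counts = [0] * len(limits)
    for size in sizes:
        idx = bisect_right(limits, size)  # index of first limit greater than size
        if idx == len(limits):
            raise ValueError("size %d exceeds the largest limit" % size)
        counts[idx] += 1
    total = len(sizes)
    return {limit: 100.0 * n / total for limit, n in zip(limits, counts)}


def cumulative(percentages):
    """Running sum over the buckets, as in a cumulative plot."""
    running = 0.0
    result = {}
    for limit, pct in percentages.items():
        running += pct
        result[limit] = running
    return result


if __name__ == "__main__":
    sample = [500, 1024, 1100, 2000, 70000]  # made-up sizes in bytes
    for limit, pct in bucket_percentages(sample).items():
        if pct:
            print(limit, pct)
```

Running this once per category (ham, spam, lists, bugmail, sent) would give one curve per category.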

You can see that bugmail and mailing lists have very distinct profiles. I assume this is because of the strict rules on mailing lists (no attachments except patches, no HTML). In addition, I added the lkml. Spam and inbox seem quite similar here, and I send many mails < 2 KB. It is interesting that the profile of the mails I receive is very different from that of the mails I send. In the cumulative view (bytype-mymail-cum.png), you can also see the distinct shape of mailing lists and bugmail.

Let's take a closer look at the difference between spam and ham mails. The separation is correct (no mails are wrongly categorized). mymail-ham-spam-ratio.png shows that below 1500 bytes, the share of spam is incredible. The smallest human-written mail I possess is 1100 bytes. Generated mails are usually very short, very often less than 1 KB, and so is a certain class of spam (e.g. a one-liner and a link). Another class of spam can be seen in mymail-spam-percentage.png between 12 KB and 30 KB, and between 50 KB and 60 KB (where the data point is missing I did not send any mails, thus the ratio is infinite). This is HTML spam.

What can be learned from this: an effective anti-spam measure against small mails (< 1500 bytes) might be to take as long to accept a mail as the connection would need to be open for 1500 bytes. Shortening spam messages in order to send more of them would then become infeasible for spammers. I can imagine keeping the connection open, similar to the tarpit/Teergrube technique. As a side effect, other kinds of generated mail would be affected; they shouldn't be at their resource limits (# of connections), though.

So far I have laid out my arguments under the assumption that my mailbox is representative of the whole world. The sample is large enough to say something about my own mail usage. But I am a hacker, I use PGP, and I do not use HTML mail; thus I am not an average email user. My mailbox is not representative of the whole world.
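
To make the minimum-acceptance-time idea for small mails a bit more concrete, here is a rough sketch of how the extra hold time could be computed. Nothing here comes from a real MTA: the function name is hypothetical, and the reference transfer rate of 100 bytes per second is an arbitrary assumption that a real setup would have to tune.

```python
def tarpit_delay(message_bytes, elapsed_seconds,
                 min_bytes=1500, assumed_bytes_per_second=100.0):
    """Extra seconds to keep the connection open before accepting a mail.

    Mails smaller than min_bytes are held until as much time has passed as
    a min_bytes mail would need at the assumed (hypothetical) rate.
    Larger mails are not delayed.
    """
    if message_bytes >= min_bytes:
        return 0.0
    floor_seconds = min_bytes / assumed_bytes_per_second
    return max(0.0, floor_seconds - elapsed_seconds)


if __name__ == "__main__":
    # A 300-byte mail received in 1 second is held for the remaining time.
    print(tarpit_delay(300, 1.0))
    # A 2000-byte mail is accepted immediately.
    print(tarpit_delay(2000, 0.5))
```

Under these assumptions a single connection could deliver at most one small mail per 15 seconds, so shrinking a message below 1500 bytes would no longer buy the sender any throughput.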

I received samples from two people (thanks again), with 30991 and 8759 sizes. Presumably these are not spam-free. In others-mail.png you can see them plotted, along with my ham, my spam and the lkml.

My goal was to come up with a realistic email generator for network simulations (human-sent email), which I modelled after my ham by laying three lines through the graph in others-mail.png (knots at 3 KB and 70 KB, and hard limits > 800 bytes, < 10 MB).

My conclusion is that the mail size distribution is a sum of various classes that cannot really be distinguished, but can be approximated by three Zipfian distributions. It is also important to keep in mind that email usage on mobile devices will have a different size profile than desktop usage (due to the input devices and the limited number of files to attach).

The Zipfian distribution can be applied to many types of traffic (as Adamic and Huberman (2002) show here: http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf ), but email is a special case due to its size limit (used to be 10 MB, and mostly still is). I suspect this is the reason the distribution gets flatter after 70 KB: attachments larger than 10 MB are not possible. On the other hand, it might just be the usage class of binary attachments poking out.

In case anyone wonders why I don't write a paper about it: I'll use some of it for my course paper, but on its own it is too thin and I'd need more samples. I will probably do a (more beautiful) writeup on my blog, though: http://johannes.jakeapp.com/blog/

I also attached the source code of my mail generator (which I intend to use in combination with ns-2 or ns-3).

I'm ready for your storm of criticism now :-)

Have a fun day,
Johannes

-- 
Johannes Buchner
mail: e0625457 [at] student
xmpp: buchner.johannes [at] amessage
icq: 163390666
skype:johannes_buchner
I welcome PGP/GPG encrypted/signed mails!