
Mailing List Archive: Qmail: users
Re: Email size distribution
 



e0625457 at student

Aug 16, 2009, 10:13 PM



Hi Markus!

On Sun, 16 Aug 2009 21:48:51 +0200
Markus Stumpf <lists-qmail [at] maexotic> wrote:

> On Sat, Aug 15, 2009 at 10:25:48PM +1200, Johannes Buchner wrote:
> > For this, I would like to ask for your assistence: Could you, if you
> > have access to a large number of mail boxes, obtain the file sizes
> > of each mail, and email me that list?
>
> I may be wrong, but I don't think that taking sizes of eMails from the
> mailboxes is the same as taking sizes from of eMails flowing through
> the system.
> eMails in the boxes relate to eMails the users find worthy to keep,
> whereas they may consider e.g. short messages (think sms) or status
> messages from social networks or even mailing list messages more
> temporary and delete them after reading.
> So you wouldn't measure "email sizes", but "sizes of emails users
> think it is worth keeping", which may (or may not) be a big
> difference.
You are absolutely right. In the end, I am trying to model the way
users use (send) emails.

Obviously, email is used in various ways today (usage classes, if you
will):
- Mailing lists and Newsletters
- Bugmails and automated emails
- Spam
- File attachments
- Text-only vs. HTML-Mails
All of these usually end up mixed together in the inbox. I didn't
actually want to look at the differences, but I couldn't avoid it:

I analysed the sizes of my mails, which I separated into the following
categories: ham (1108), spam (4350), mailing lists (various lists,
1420), bugmail (bugzilla, flyspray, 970), sent (170).

I always look at buckets (how many mails are smaller than size x, but
bigger than the previous bucket's limit). I used the limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).
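
The bucketing described above could be sketched roughly as follows. This is not taken from the attached source; the function names are made up for illustration, and it assumes that a mail whose size equals a limit is counted in the bucket ending at that limit.

```python
from bisect import bisect_left

# Bucket upper limits in bytes, copied from the list above.
LIMITS = [1024, 1536, 2048, 2560, 3072, 4096, 5120, 6144, 7168, 8192,
          10240, 12288, 14336, 16384, 20480, 24576, 28672, 32768,
          40960, 49152, 53248, 57344, 65536, 131072, 262144, 524288,
          1048576, 10264576, 1073741824]


def bucket_counts(sizes, limits=LIMITS):
    """Count mails per bucket; bucket i holds sizes in (limits[i-1], limits[i]]."""
    counts = [0] * len(limits)
    for size in sizes:
        i = bisect_left(limits, size)  # first limit that is >= size
        if i < len(limits):            # sizes above the last limit are ignored
            counts[i] += 1
    return counts


def bucket_percentages(sizes, limits=LIMITS):
    """Percentage of all counted mails that fall into each bucket."""
    counts = bucket_counts(sizes, limits)
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]
```

Feeding it one list of sizes per category (ham, spam, lists, ...) gives one curve per category.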


In the attached bytype-mymail.png, I plotted the percentage of mails
you would find in a bucket for a given size. You can see that the
bugmail and mailing lists have very definite profiles. I assume this is
because of the strict rules on mailing lists (no attachments except
patches, no HTML). In addition, I added the lkml.

Here, it seems spam and inbox are quite similar, and that I send many
mails < 2 KB. It is interesting to see that the profile of the mails I
receive is very different from that of the mails I send.
In the cumulative view (bytype-mymail-cum.png), you can also see the
definite shape of mailing lists and bugmail.
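
For the cumulative view, the per-bucket counts only need a running sum. A minimal sketch, assuming the counts are already available as a list with one entry per bucket (the function name is hypothetical):

```python
from itertools import accumulate


def cumulative_percentages(counts):
    """Percentage of mails at or below each bucket limit."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * running / total for running in accumulate(counts)]
```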

Let's take a closer look at the difference between spam and ham mails.
The separation is correct (no mails are wrongly categorized).
mymail-ham-spam-ratio.png shows that below 1500 bytes, the share of
spam is extremely high. The smallest human-written mail I possess is
1100 bytes. Generated mails are usually very short, very often less
than 1 KB, and so is a certain class of spam (e.g. a one-liner and a
link).

Another class of spam can be seen in mymail-spam-percentage.png
between 12 KB and 30 KB, and between 50 KB and 60 KB (where a data
point is missing, I did not send any mails, thus the ratio is
infinite). This is HTML spam.
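
The spam share and the ratios per bucket can be derived from two count lists. The sketch below is only an illustration (the function names are invented); it returns None for a bucket where the value is undefined, which corresponds to the missing data points in the plots.

```python
def spam_share(spam_counts, ham_counts):
    """Spam as a percentage of spam + ham per bucket; None for empty buckets."""
    shares = []
    for spam, ham in zip(spam_counts, ham_counts):
        total = spam + ham
        shares.append(100.0 * spam / total if total else None)
    return shares


def spam_ratio(spam_counts, other_counts):
    """spam / other per bucket; None where 'other' is zero (ratio is infinite)."""
    return [spam / other if other else None
            for spam, other in zip(spam_counts, other_counts)]
```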

What can be learned from this is that an effective anti-spam measure
against small mails (<1500 bytes) might be to make accepting a mail
take at least as long as the connection would need to be open for 1500
bytes. Shortening spam messages to increase the number of spam mails
would then become infeasible for spammers. I can imagine keeping the
connection open, similar to the tarpit/Teergrube technique.
As a side effect, other kinds of generated mail would be affected, but
they shouldn't be at their resource limit (# of connections) anyway.
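
To make the idea concrete, here is a small arithmetic sketch of how long a server might hold the connection before accepting a short mail. It is not an existing qmail feature or patch; the function name and the assumption that the 1500-byte reference mail would arrive at the same rate as the observed one are mine.

```python
def extra_delay(size_bytes, elapsed_seconds, min_size=1500):
    """Seconds to wait so the mail takes as long as min_size bytes would.

    Assumes the transfer rate observed for this mail also applies to a
    hypothetical min_size-byte mail from the same sender.
    """
    if size_bytes <= 0 or size_bytes >= min_size:
        return 0.0  # large enough already (or nothing measurable)
    target = elapsed_seconds * min_size / size_bytes
    return target - elapsed_seconds
```

For example, a 500-byte mail that arrived in 1 second would be held for 2 more seconds, so that it costs the sender as much connection time as a 1500-byte mail.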

So far I have laid out my arguments under the assumption that my mail
box is representative of the whole world. The sample of mails is
large enough to describe my own mail usage. But I am a hacker, I use
PGP, and I do not use HTML mail; thus, I am not an average email user.
My mail box is not representative of the whole world.

I received samples from two people (thanks again), with 30991 and 8759
sizes. Presumably these are not spam-free. In others-mail.png you see
them plotted, as well as my ham, my spam and the lkml.

My goal was to come up with a realistic email generator for network
simulations (human-sent email), which I modelled after my ham by laying
three lines through the graph in others-mail.png (knots at 3KB and
70KB, and hard limits >800Bytes, <10MB).
My conclusion is that the mail size distribution is a sum of various
classes that cannot really be distinguished, but can be approximated by
three Zipfian distributions.
It is also important to keep in mind that mobile device email usage will
have a different profile of mail sizes than desktop usage (due to the
input devices and the limited number of files to attach).
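
A generator along these lines could look roughly like the sketch below. It is not the attached create_traffic.c. The knots (3KB, 70KB) and hard limits (800 bytes, 10MB) are the ones stated above, but the exponents and segment weights are placeholders I made up, since the fitted values are not given here; each segment is treated as a truncated power law and sampled by inverting its CDF.

```python
import random

KB = 1024
MB = 1024 * KB

# (lower bound, upper bound, exponent, weight)
# Bounds come from the knots and limits above; exponent and weight
# are PLACEHOLDERS, not fitted values.
SEGMENTS = [
    (800, 3 * KB, 1.0, 0.5),
    (3 * KB, 70 * KB, 1.5, 0.4),
    (70 * KB, 10 * MB, 1.2, 0.1),
]


def sample_power_law(lo, hi, alpha, u):
    """Inverse-CDF sample of p(x) ~ x**-alpha on [lo, hi] for u in [0, 1)."""
    if alpha == 1.0:
        return lo * (hi / lo) ** u
    k = 1.0 - alpha
    return (lo ** k + u * (hi ** k - lo ** k)) ** (1.0 / k)


def sample_mail_size(rng=random):
    """Pick a segment by weight, then draw a size in bytes from it."""
    weights = [seg[3] for seg in SEGMENTS]
    lo, hi, alpha, _ = rng.choices(SEGMENTS, weights=weights)[0]
    size = int(sample_power_law(lo, hi, alpha, rng.random()))
    return max(lo, min(hi, size))  # guard against rounding at the edges
```

Passing a seeded random.Random instance as rng makes a simulation run reproducible.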

The Zipfian distribution can be applied to many types of traffic (as
Adamic and Huberman (2002) show here
http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf ),
but email is a special case due to its size limit (it used to be 10MB,
and mostly still is). I suspect this is the reason the distribution gets
flatter after 70KB: attachments larger than 10MB are not possible. On
the other hand, it might just be the usage class of binary attachments
poking out.

In case anyone wonders why I don't write a paper about it: some
of it I'll use for my course paper, but on its own it is too thin and
I'd need more samples. I will probably do a (more beautiful) writeup on
my blog though: http://johannes.jakeapp.com/blog/

I also attached the source code of my mail generator (which I intend
to use in combination with ns-2 or ns-3).

I'm ready for your storm of criticism now :-)

Have a fun day,
Johannes

--
Johannes Buchner
mail: e0625457 [at] student
xmpp: buchner.johannes [at] amessage
icq: 163390666
skype:johannes_buchner
I welcome PGP/GPG encrypted/signed mails!
Attachments: others-mail.png (68.7 KB)
  mymail-spam-percentage.png (56.2 KB)
  mymail-sent-spam-ratio.png (36.6 KB)
  mymail-ham-spam-ratio.png (34.1 KB)
  bytype-mymail-cum.png (113 KB)
  bytype-mymail.png (93.1 KB)
  create_traffic.c (2.33 KB)

