
Mailing List Archive: Qmail: users

Email size distribution

 

 



e0625457 at student

Aug 15, 2009, 3:25 AM

Post #1 of 5 (2080 views)
Permalink
Email size distribution

Dear mail server admins :-)

I am trying to analyse the size distribution of emails, and whether it
is Zipf-distributed or if there is a better approximation.
I have my own mails obviously, and found an old analysis at [0], but I
would like to have an up-to-date and somewhat representative sample.

For this, I would like to ask for your assistance: Could you, if you
have access to a large number of mail boxes, obtain the file sizes of
each mail, and email me that list?

E.g. if every mail is a file in the folder inbox/
$ find inbox/ -type f -printf '%s\n'

If you'd prefer to use a backup tar, you can do something like
$ tar -tvzf mails.tar.gz |awk '{ print $3;}'

Any compressed format would be fine. The more files, the better ;-)
Spam-free would also be preferred. Please also let me know if
your user base might be a slightly biased sample.

If you wish, I can let you know my findings. It would help me a
lot.

Best regards,
Johannes
(Student at the TU Vienna, Austria)

[0] http://osdir.com/ml/freebsd.devel.net/2002-10/msg00203.html
--
Johannes Buchner
mail: e0625457 [at] student
xmpp: buchner.johannes [at] amessage
icq: 163390666
skype:johannes_buchner
I welcome PGP/GPG encrypted/signed mails!


lists-qmail at maexotic

Aug 16, 2009, 12:48 PM

Post #2 of 5 (1764 views)
Permalink
Re: Email size distribution [In reply to]

On Sat, Aug 15, 2009 at 10:25:48PM +1200, Johannes Buchner wrote:
> For this, I would like to ask for your assistance: Could you, if you
> have access to a large number of mail boxes, obtain the file sizes of
> each mail, and email me that list?

I may be wrong, but I don't think that taking the sizes of eMails from the
mailboxes is the same as taking the sizes of eMails flowing through the
system.
eMails in the boxes relate to eMails the users find worthy to keep,
whereas they may consider e.g. short messages (think sms) or status
messages from social networks or even mailing list messages more temporary
and delete them after reading.
So you wouldn't measure "email sizes", but "sizes of emails users think
it is worth keeping", which may (or may not) be a big difference.

IMHO, a much more appropriate way would be to use the sizes from the logfiles,
like
fgrep -h ': bytes ' /home/service/qmail-send/log/main/* | awk '{print $6}'
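
The same extraction can be sketched in Python, for anyone who wants to post-process the sizes right away. This is only a sketch: it assumes the usual qmail-send "info msg N: bytes B from <sender> ..." log lines, and the sample lines below (timestamps, addresses, numbers) are made up for illustration.

```python
import re

# Matches qmail-send lines such as:
#   @4000... info msg 1234: bytes 5678 from <a@example.org> qp 42 uid 1000
MSG_BYTES = re.compile(r"info msg \d+: bytes (\d+) ")


def sizes_from_log(lines):
    """Yield the message size in bytes for every 'info msg' log line."""
    for line in lines:
        match = MSG_BYTES.search(line)
        if match:
            yield int(match.group(1))


# Made-up sample lines, only for illustration.
sample_log = [
    "@400000004a87d1c30a1b2c3d new msg 1234",
    "@400000004a87d1c30a1b2c4e info msg 1234: bytes 5678 from <a@example.org> qp 42 uid 1000",
    "@400000004a87d1c30a1b2c5f starting delivery 1: msg 1234 to local b@example.org",
    "@400000004a87d1c40a1b2c6a info msg 1235: bytes 901 from <> qp 43 uid 1000",
]
sizes = list(sizes_from_log(sample_log))
```

To run it on real logs, pass an open file object (or several chained together) instead of `sample_log`.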

\Maex


e0625457 at student

Aug 16, 2009, 10:13 PM

Post #3 of 5 (2006 views)
Permalink
Re: Email size distribution [In reply to]

Hi Markus!

On Sun, 16 Aug 2009 21:48:51 +0200
Markus Stumpf <lists-qmail [at] maexotic> wrote:

> On Sat, Aug 15, 2009 at 10:25:48PM +1200, Johannes Buchner wrote:
> > For this, I would like to ask for your assistance: Could you, if you
> > have access to a large number of mail boxes, obtain the file sizes
> > of each mail, and email me that list?
>
> I may be wrong, but I don't think that taking the sizes of eMails from the
> mailboxes is the same as taking the sizes of eMails flowing through
> the system.
> eMails in the boxes relate to eMails the users find worthy to keep,
> whereas they may consider e.g. short messages (think sms) or status
> messages from social networks or even mailing list messages more
> temporary and delete them after reading.
> So you wouldn't measure "email sizes", but "sizes of emails users
> think it is worth keeping", which may (or may not) be a big
> difference.
You are absolutely right. In the end, I am trying to model the way
users use (send) emails.

Obviously, today email is used in various ways, usage classes if you
will:
- Mailing lists and Newsletters
- Bugmails and automated emails
- Spam
- File attachments
- Text-only vs. HTML-Mails
All these are usually mixed up in the inbox. I actually didn't want
to look at the differences, but I couldn't get around doing it:

I analysed the sizes of my mails, which I separated into the following
categories: ham (1108), spam (4350), mailing lists (various lists,
1420), bugmail (bugzilla, flyspray, 970), sent (170).

I am always looking at buckets (how many mails are smaller than size
x, but bigger than the previous bucket's limit). I used the limits: 1024
1536 2048 2560 3072 4096 5120 6144 7168 8192 10240 12288 14336
16384 20480 24576 28672 32768 40960 49152 53248 57344 65536 131072
262144 524288 1048576 10264576 and 1073741824 (bytes).
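
A minimal sketch of this bucketing in Python, using the limits listed above verbatim. Two things are assumptions, since they are not spelled out here: a mail exactly as large as a limit is counted in that limit's bucket, and anything beyond the last limit is put into the last bucket. The function name `bucket_shares` is made up for the example.

```python
from bisect import bisect_left
from collections import Counter

# Upper bucket limits in bytes, as listed in the post.
LIMITS = [
    1024, 1536, 2048, 2560, 3072, 4096, 5120, 6144, 7168, 8192,
    10240, 12288, 14336, 16384, 20480, 24576, 28672, 32768, 40960,
    49152, 53248, 57344, 65536, 131072, 262144, 524288, 1048576,
    10264576, 1073741824,
]


def bucket_shares(sizes):
    """Map each non-empty bucket's upper limit to its share of mails in percent.

    A mail of `size` bytes falls into the first bucket whose limit is >= size
    (assumption: the upper limit is inclusive).
    """
    counts = Counter()
    for size in sizes:
        index = min(bisect_left(LIMITS, size), len(LIMITS) - 1)
        counts[LIMITS[index]] += 1
    total = sum(counts.values())
    return {limit: 100.0 * counts[limit] / total
            for limit in LIMITS if counts[limit]}


shares = bucket_shares([500, 1024, 1100, 3000])
```

Summing the shares over increasing limits gives the cumulative view.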


In the attached bytype-mymail.png, I plotted the percentage of mails
you would find in a bucket for a given size. You can see that the
bugmail and mailing lists have very definite profiles. I assume this is
because of the strict rules in mailing lists (no attachments except
patches, no html). In addition, I added the lkml.

Here, it seems spam and inbox are quite similar, and that I send many
mails < 2kb. It is interesting to see that the profile of the mails I
receive is very different from that of the mails I send.
In the cumulative view (bytype-mymail-cum.png), you can also see the
definite shape of mailing lists and bugmail.

Let's take a closer look at the difference between spam and ham mails:
The separation is correct (there are no mails wrongly categorized).
mymail-ham-spam-ratio.png shows that below 1500 bytes, the share of
spam is incredibly high. The smallest human-written mail I possess is 1100
bytes. Generated mails are usually very short, and very often less
than 1kb, and a certain class of spam is too (e.g. one-liners and a
link).

Another class of spam can be seen in mymail-spam-percentage.png
between 12KB and 30KB, and 50KB and 60KB (where the data point is
missing I did not send any mails, thus the ratio is infinite). This is
HTML spam.

What can be learned from this is that an effective measure against
small spam mails (<1500 bytes) might be to make the acceptance of any
mail take at least as long as the transfer of 1500 bytes would.
Shortening spam messages to increase the number of spam mails sent would
then become infeasible for spammers. I can imagine keeping the connection
open, similar to the tarpit/Teergrube technique.
As a side effect, other kinds of generated mail would be affected too;
they shouldn't be at their resource limits (# of connections), though.
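
To make the idea concrete, here is a tiny sketch of how long a server would hold the connection. The 1500-byte threshold is the one from above; the reference transfer rate `bytes_per_second` is purely hypothetical and would have to be chosen by the admin. This only computes the delay, it is not a patch for any mail server.

```python
MIN_BILLED_BYTES = 1500  # threshold below which the spam share is very high


def acceptance_delay(size_bytes, bytes_per_second):
    """Extra seconds to keep the connection open before accepting a mail.

    Mails smaller than MIN_BILLED_BYTES are delayed by the time the missing
    bytes would have taken at the (hypothetical) reference rate, so every
    mail costs the sender at least as much time as a 1500-byte mail.
    """
    missing = max(0, MIN_BILLED_BYTES - size_bytes)
    return missing / bytes_per_second


delay_small = acceptance_delay(500, 1000)   # 1000 missing bytes
delay_large = acceptance_delay(2000, 1000)  # already above the threshold
```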

So far I laid out my arguments following the assumption that my mail
box is representative of the whole world. The sample of mails is
large enough to tell about my mail usage. I am a hacker, I use PGP, I do
not use HTML mail, thus, I am not an average email user. My mail box is
not representative of the whole world.

I received samples from two people (thanks again), with 30991 and 8759
sizes. Presumably these are not spam-free. In others-mail.png you see
them plotted, as well as my ham, my spam and the lkml.

My goal was to come up with a realistic email generator for network
simulations (human-sent email), which I modelled after my ham by laying
three lines through the graph in others-mail.png (knots at 3KB and
70KB, and hard limits >800Bytes, <10MB).
My conclusion is that the mail distribution is a sum of various
classes that cannot really be distinguished, but can be approximated by
three Zipfian distributions.
It is also important to keep in mind that mobile device email usage will
have a different profile of mail sizes than desktop usage (due to the
input devices and the limited number of files to attach).
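
For readers without the attachment, the idea can be sketched in Python as follows. This is not the attached create_traffic.c. It treats each of the three straight lines in the log-log plot as a power-law segment of the density, joined continuously at the knots. The knots and hard limits are the ones given above; the three exponents in `EXPONENTS` are hypothetical placeholders, not values read off the plots.

```python
import math
import random

# Segment borders in bytes: hard lower limit, two knots, hard upper limit.
KNOTS = [800, 3 * 1024, 70 * 1024, 10 * 1024 * 1024]
# Hypothetical slopes: density ~ size ** -a on each segment.
EXPONENTS = [0.5, 1.5, 1.0]


def _segment_mass(lo, hi, a, c):
    """Integral of c * x**-a over [lo, hi]."""
    if math.isclose(a, 1.0):
        return c * math.log(hi / lo)
    return c * (hi ** (1 - a) - lo ** (1 - a)) / (1 - a)


def _build_segments():
    segments = []
    c = 1.0
    for i, a in enumerate(EXPONENTS):
        if i > 0:
            # Pick c so that the density is continuous at the knot.
            c *= KNOTS[i] ** (a - EXPONENTS[i - 1])
        lo, hi = KNOTS[i], KNOTS[i + 1]
        segments.append((lo, hi, a, _segment_mass(lo, hi, a, c)))
    return segments


SEGMENTS = _build_segments()


def sample_size(rng=random):
    """Draw one mail size in bytes from the piecewise power-law density."""
    weights = [segment[3] for segment in SEGMENTS]
    lo, hi, a, _ = rng.choices(SEGMENTS, weights=weights)[0]
    u = rng.random()
    if math.isclose(a, 1.0):
        x = lo * (hi / lo) ** u
    else:
        inv = 1 - a
        x = (lo ** inv + u * (hi ** inv - lo ** inv)) ** (1 / inv)
    # Clamp to guard against floating-point rounding at the borders.
    return min(max(int(x), KNOTS[0]), KNOTS[-1])


rng = random.Random(1)
generated = [sample_size(rng) for _ in range(1000)]
```

Each segment is sampled by inverting its cumulative distribution, so no rejection step is needed.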

The Zipfian distribution can be applied to many types of traffic (as
Adamic and Huberman (2002) show here
http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf ),
but email is a special case due to its size limit (used to be 10MB,
still is mostly). I suspect this is the reason the distribution gets
flatter after 70KB: Attachments larger than 10MB are not possible. On
the other hand it might just be the usage class of binary attachments
poking out.
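
One rough way to check how Zipf-like a sample is: sort the sizes in descending order and fit a straight line to log(size) against log(rank). The sketch below does this with an ordinary least-squares fit. It is only a quick visual-style check under the assumption of a pure rank-size law, not a proper statistical test, and the function name `zipf_exponent` is made up for the example.

```python
import math


def zipf_exponent(sizes):
    """Negated least-squares slope of log(size) against log(rank).

    Needs at least two positive sizes. For an ideal Zipf law
    (size proportional to 1 / rank) the result is 1.0.
    """
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic sample that follows size = 1e6 / rank exactly.
ideal = [1_000_000 / rank for rank in range(1, 101)]
exponent = zipf_exponent(ideal)
```

If the fitted exponent changes a lot between size ranges, a single Zipfian distribution is a poor description of the sample.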

In case anyone wonders why I don't write a paper about it, well, some
of it I'll use for my course paper, on its own it is too thin and I'd
need more samples. I will probably make a (more beautiful) writeup on my
blog though: http://johannes.jakeapp.com/blog/

I also attached the source code of my mail generator (which I intend
to use in combination with ns-2 or ns-3).

I'm ready for your storm of criticism now :-)

Have a fun day,
Johannes

--
Johannes Buchner
mail: e0625457 [at] student
xmpp: buchner.johannes [at] amessage
icq: 163390666
skype:johannes_buchner
I welcome PGP/GPG encrypted/signed mails!
Attachments: others-mail.png (68.7 KB)
  mymail-spam-percentage.png (56.2 KB)
  mymail-sent-spam-ratio.png (36.6 KB)
  mymail-ham-spam-ratio.png (34.1 KB)
  bytype-mymail-cum.png (113 KB)
  bytype-mymail.png (93.1 KB)
  create_traffic.c (2.33 KB)


feh at fehcom

Aug 17, 2009, 2:17 PM

Post #4 of 5 (1750 views)
Permalink
Re: Email size distribution [In reply to]

Hi Johannes,

thanks for your interesting studies and samples.

However, if your goal is to 'identify' spam mails, or at least to 'define
a probability of spam mails based on their size', I very much doubt that
this yields practical results.

Years ago, I did something comparable:

http://www.fehcom.de/qmail.html (check for chapter 7.2 of my qmail book)

In any case, check the email-size distribution not only against the Zipf
distribution, but also test the assumption that it is binomially
distributed (or even NBD).

Your double-log plots don't show much; rather, they hide things.

Apply the following cases:

* Spam mails.
* Automated mails (by systems).
* Mails generated by humans and the current mail clients.

Differentiate between HTML mails and flat mails.

regards.
--eh.





--
Dr. Erwin Hoffmann | FEHCom | http://www.fehcom.de


e0625457 at student

Aug 26, 2009, 7:40 AM

Post #5 of 5 (1657 views)
Permalink
Re: Email size distribution [In reply to]

Hi everyone,

Thanks to the people who sent me data. I published the plots here:
http://johannes.jakeapp.com/blog/?p=674

Kind regards,
Johannes




--
Johannes Buchner
mail: e0625457 [at] student
xmpp: buchner.johannes [at] amessage
icq: 163390666
skype:johannes_buchner
I welcome PGP/GPG encrypted/signed mails!
