Mailing List Archive: ClamAV: devel

clamAV scanning algorithm

 

 



tomb.fish at gmail

Dec 2, 2008, 5:02 PM

Post #1 of 18 (1895 views)
Permalink
clamAV scanning algorithm

Hi,

I am new to ClamAV and I am wondering how files are scanned.

Does it work like this:
1. A PE section is taken from the file to be scanned
2. Its MD5 is calculated
3. That MD5 is compared to all signatures in the ClamAV database
4. If there is a match, a virus is found.

I have simplified this, but please let me know if the above steps for
scanning files are right.

Regards,
Tom
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
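
The four steps listed in the post above amount to a hash lookup. A minimal sketch of that idea follows, assuming a hypothetical in-memory dictionary in place of a real signature database; the names `build_hash_db` and `scan_by_md5` are made up for illustration and are not ClamAV functions. It hashes a whole buffer rather than a PE section, and it uses a dictionary lookup so the digest is not compared against every entry in turn.

```python
import hashlib
from typing import Dict, Optional


def build_hash_db(samples: Dict[str, bytes]) -> Dict[str, str]:
    """Map the MD5 hex digest of each sample to its (hypothetical) name."""
    return {hashlib.md5(data).hexdigest(): name for name, data in samples.items()}


def scan_by_md5(data: bytes, hash_db: Dict[str, str]) -> Optional[str]:
    """Return the signature name if the MD5 of `data` is known, else None."""
    return hash_db.get(hashlib.md5(data).hexdigest())


# Stand-in database with one harmless entry.
db = build_hash_db({"Example.Sample": b"not really malware"})

print(scan_by_md5(b"not really malware", db))   # Example.Sample
print(scan_by_md5(b"not really malware!", db))  # None
```

Changing a single byte changes the digest, so a lookup like this only recognises exact copies of a known sample.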


edwintorok at gmail

Dec 3, 2008, 11:58 AM

Post #2 of 18 (1826 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-03 03:02, Thomasz Blaszczyk wrote:
> Hi,
>

Hi,


> I am new to CLAMAV & I am just wonder how files are scanned.
>
> Does it work like:
> 1. PE section is taken from file to be scanned
>

It is much more than that: ClamAV can also process a variety of archive
formats, containers, and executable packers.
Also, PE files aren't the only files that can carry malware; scripts can
contain malware too.

Have a look at filetypes_int.h for the file types we support. New file
type definitions can be added via database updates.
> 2. MD5 is calculated
>

Correct, but ClamAV also uses pattern matchers (Aho-Corasick and an
extended version of Boyer-Moore), not only MD5.
See signatures.pdf for the kinds of patterns it supports (in particular,
the AC matcher supports wildcards).

So ClamAV actually tries to match those patterns inside the file. It
also has some heuristic and algorithmic detections.

An MD5 is calculated for the entire file, and an MD5 is calculated per
PE section too.

> 3. That MD5 is compared to all signatures in ClamAV Database
>

Using a BM matcher, yes. Not sequentially.


> 4. If match virus is found.
>

Yes.

> I have simplified this. But please let me know if I am right in above
> steps for scanning files.

If you only have a database with md5 loaded, and disable archives, and
disable algorithmic scans, and heuristics, and disable html, mbox
formats, then yes ;)
In practice, ClamAV does much more than just matching an MD5.

Best regards,
--Edwin
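
To make the pattern-matching part of the reply above more concrete, here is a minimal sketch of a textbook Aho-Corasick matcher: it searches for many byte patterns in a single pass over the data. This is an illustration under simplifying assumptions (no wildcards, dictionary-based transitions) and not ClamAV's own implementation; `build_automaton` and `search` are illustrative names.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a list of byte patterns."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # indices of patterns that end at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(idx)
    # Breadth-first pass: depth-1 states keep the root as failure link.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for b, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and b not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(b, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(data, automaton):
    """Return (end_position, pattern_index) for every match in `data`."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, b in enumerate(data):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for idx in out[state]:
            hits.append((pos, idx))
    return hits


tiny = bytes.fromhex("B89A020000C3")
automaton = build_automaton([tiny, b"EICAR"])
print(search(b"\x90\x90" + tiny + b"..EICAR", automaton))  # [(7, 0), (14, 1)]
```

Every state is a node in a trie, so the tables grow with the total length of the patterns.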


tomb.fish at gmail

Dec 3, 2008, 2:41 PM

Post #3 of 18 (1827 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

Thank you for the reply,

Török Edwin, very, very good web seminar!

I have 2 more questions:

1) I'd like to measure and compare the performance of the AC and BM algorithms.

clamscan displays a 'Time' in its scan summary. Does this time include
disk access and the building of the signature tree for AC (phase 1) or BM?
I am wondering whether I can use this time or should add my own timestamps.

>Time: 2.189 sec (0 m 2 s)

2) I've downloaded the EICAR anti-virus test file and created a 10-byte file
(see the logs below). Then I appended EICAR to this file. Why doesn't
clamscan find a signature in this file?



LOGS:
1. Creating a 10-byte file

tomb[at]tomb_localhost ~/projects/aau/virus_scanner/clamav-0.94.1/database $ time dd if=/dev/urandom of=../../testbox/new10bytes.com bs=10 count=1
1+0 records in
1+0 records out
10 bytes (10 B) copied, 4.8609e-05 s, 206 kB/s

real 0m0.001s
user 0m0.000s
sys 0m0.000s

2. Testbox folder contains:

tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ ls -l
total 8
-rw-r--r-- 1 tomb tomb 68 Dec 3 22:26 eicar.com
-rw-r--r-- 1 tomb tomb 10 Dec 3 22:27 new10bytes.com
tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ hexdump eicar.com
0000000 3558 214f 2550 4140 5b50 5c34 5a50 3558
0000010 2834 5e50 3729 4343 3729 247d 4945 4143
0000020 2d52 5453 4e41 4144 4452 412d 544e 5649
0000030 5249 5355 542d 5345 2d54 4946 454c 2421
0000040 2b48 2a48
0000044
tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ hexdump new10bytes.com
0000000 05b6 1256 0057 d6b2 9740
000000a

3.
The 68 bytes of EICAR have been appended to the end of the randomly
generated new10bytes.com

tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ cat eicar.com >> new10bytes.com
tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ hexdump new10bytes.com
0000000 05b6 1256 0057 d6b2 9740 3558 214f 2550
0000010 4140 5b50 5c34 5a50 3558 2834 5e50 3729
0000020 4343 3729 247d 4945 4143 2d52 5453 4e41
0000030 4144 4452 412d 544e 5649 5249 5355 542d
0000040 5345 2d54 4946 454c 2421 2b48 2a48
000004e

4.
Why is the signature not found in this file?

tomb[at]tomb_localhost ~/projects/aau/virus_scanner/testbox $ clamscan new10bytes.com

new10bytes.com: OK

----------- SCAN SUMMARY -----------
Known viruses: 455125
Engine version: 0.94.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 0.00 MB
Time: 2.194 sec (0 m 2 s)

-------
Thanks in advance,
Tom


joe at thrallingpenguin

Dec 3, 2008, 3:24 PM

Post #4 of 18 (1831 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

See:

http://www.eicar.org/anti_virus_test_file.htm

Specifically:

"Any anti-virus product that supports the EICAR test file should
detect it in any file providing that the file starts with the
following 68 characters, and is exactly 68 bytes long"

Best Regards,
Joseph Benden

.--.
|o_o |
|:_/ |
// \ \
(| | )
/'\_ _/`\
\___)=(___/
http://www.ThrallingPenguin.com/
--------------------------------
We design, develop, and extend
software technologies for the
most demanding business
applications, as well as
offer VoIP Consulting
services.



On Dec 3, 2008, at 5:41 PM, Thomasz Blaszczyk wrote:

> 2) I've downloaded Eicar Test Anti-Virus File and crated 10bytes file.
> (See logs below) Then I've appended Eicar to this file. Why clamscan
> doesn't find a signature in this file?
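
The hexdump listings posted earlier in the thread can be checked against the quoted requirement. The sketch below assumes they are default `hexdump` output, which prints 16-bit little-endian words, so each pair of bytes appears swapped; `decode_hexdump` is an illustrative helper, not part of any tool mentioned here. Decoding both dumps shows that the test string starts at offset 10 in the appended file rather than at offset 0.

```python
def decode_hexdump(text: str) -> bytes:
    """Decode default `hexdump` output (16-bit little-endian words) to bytes."""
    out = bytearray()
    total = None
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 1:
            # The final line holds only the total length, in hexadecimal.
            total = int(fields[0], 16)
            continue
        for word in fields[1:]:
            value = int(word, 16)
            out.append(value & 0xFF)   # low byte is stored first
            out.append(value >> 8)
    return bytes(out if total is None else out[:total])


# Dump of eicar.com, copied from the logs in the thread.
EICAR_DUMP = """
0000000 3558 214f 2550 4140 5b50 5c34 5a50 3558
0000010 2834 5e50 3729 4343 3729 247d 4945 4143
0000020 2d52 5453 4e41 4144 4452 412d 544e 5649
0000030 5249 5355 542d 5345 2d54 4946 454c 2421
0000040 2b48 2a48
0000044
"""

# Dump of new10bytes.com after eicar.com was appended to it.
APPENDED_DUMP = """
0000000 05b6 1256 0057 d6b2 9740 3558 214f 2550
0000010 4140 5b50 5c34 5a50 3558 2834 5e50 3729
0000020 4343 3729 247d 4945 4143 2d52 5453 4e41
0000030 4144 4452 412d 544e 5649 5249 5355 542d
0000040 5345 2d54 4946 454c 2421 2b48 2a48
000004e
"""

eicar = decode_hexdump(EICAR_DUMP)
appended = decode_hexdump(APPENDED_DUMP)
print(len(eicar), len(appended))   # 68 78
print(appended.find(eicar))        # 10
```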



edwintorok at gmail

Dec 4, 2008, 12:57 AM

Post #5 of 18 (1813 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-04 00:41, Thomasz Blaszczyk wrote:
> Thank you for reply,
>
> Török Edwin, Very, very good web seminar!
>

Thanks

> I have 2 more questions:
>
> 1) I'd like to measure & compare performance of AC & BM algorithms.
>
> clamscan displays in 'scan summary' a 'time'. Does this time include
> disc access, signature tree building in AC(phase1) or BM
> Just wonder If I can use this time or I should figure out new timestamps.
>

It includes all of the above: it is the time from the launch of clamscan
(after options are parsed), till the scan is complete.

Best regards,
--Edwin
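
Since the reported time covers everything from launch to the end of the scan, comparing matchers calls for timing the phases separately. A minimal sketch of that approach follows; `load_patterns` and `scan` are hypothetical stand-ins for database loading and matching, not ClamAV functions, and the naive scan is deliberately simple.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def load_patterns(count):
    # Stand-in for database loading: `count` distinct 6-byte patterns.
    return {i.to_bytes(6, "big") for i in range(1, count + 1)}


def scan(data, patterns):
    # Stand-in for the matcher: test every 6-byte window for membership.
    return sum(1 for i in range(len(data) - 5) if data[i:i + 6] in patterns)


patterns, load_s = timed(load_patterns, 10_000)
hits, scan_s = timed(scan, bytes(200_000), patterns)
print(f"load: {load_s:.3f} s  scan: {scan_s:.3f} s  total: {load_s + scan_s:.3f} s")
```

With the phases separated, a slow database load no longer hides differences in the matching step.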


tomb.fish at gmail

Dec 6, 2008, 5:55 AM

Post #6 of 18 (1817 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

Thanks, Joseph, for the answer.

The quote appears too restrictive: I found that the file can be
longer, as long as it starts with the EICAR string.

> "Any anti-virus product that supports the EICAR test file should
> detect it in any file providing that the file starts with the
> following 68 characters, and is exactly 68 bytes long"
>
> Best Regards,
> Joseph Benden


tomb.fish at gmail

Dec 17, 2008, 8:12 AM

Post #7 of 18 (1779 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

Hi,

I have noticed what looks like a limitation in ClamAV. When scanning one
file takes longer than 1 second, the entire file scan is dropped. To
compare the performance of BM and AC I need to remove that limitation.
Where is this per-file scan time defined?
Are there any command-line options I can use to remove it?

Thanks in advance,
Tom


edwintorok at gmail

Dec 17, 2008, 8:20 AM

Post #8 of 18 (1773 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-17 18:12, Thomasz Blaszczyk wrote:
> Hi,
>
> I have notice kind of limitation in ClamAV. When time of scanning one
> file is longer than 1 sec, the entire file scan is droped.

There is no such limitation in ClamAV.

Best regards,
--Edwin


tomb.fish at gmail

Dec 17, 2008, 8:37 AM

Post #9 of 18 (1776 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

OK, it seems that limits.maxfilesize limits scans to 10MB, but I am able to
scan files of up to 25MB; see below.
(When I scan a 30MB file, the data scanned is 0. Why is that, given that I
am able to scan nearly 25MB?)
Every byte in the sample files is 'B8'.

ls -l
total 60656
-rw-r--r-- 1 root root 16000000 Dec 17 16:08 16MB
-rw-r--r-- 1 root root 2000000 Dec 17 16:07 2MB
-rw-r--r-- 1 root root 32000000 Dec 17 16:08 32MB
-rw-r--r-- 1 root root 4000000 Dec 17 16:08 4MB
-rw-r--r-- 1 root root 8000000 Dec 17 16:08 8MB
-rw-r--r-- 1 root root 27 Dec 17 12:41 database_sig.ndb
drwx------ 2 root root 16384 Dec 17 11:58 lost+found
-rw-r--r-- 1 root root 0 Dec 17 16:38 testbed
cat database.ndb (only one signature)
TinyVirus:0:*:B89A020000C3
2MB: OK

----------- SCAN SUMMARY -----------
Known viruses: 1
Engine version: 0.94.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 1.91 MB
Time: 0.077 sec (0 m 0 s)
2MB: OK

----------- SCAN SUMMARY -----------
Known viruses: 1
Engine version: 0.94.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 7.63 MB
Time: 0.309 sec (0 m 0 s)
8MB: OK

----------- SCAN SUMMARY -----------
Known viruses: 1
Engine version: 0.94.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 15.26 MB
Time: 0.582 sec (0 m 0 s)
16MB: OK

----------- SCAN SUMMARY -----------
Known viruses: 1
Engine version: 0.94.1
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 24.79 MB
Time: 0.995 sec (0 m 0 s)
25MB: OK
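
The "Data scanned" figures above line up with the listed file sizes if "MB" is read as 1024 * 1024 bytes. That reading is an assumption based only on the numbers in this post. A quick check:

```python
def to_mb(size_bytes: int) -> str:
    """Format a byte count in units of 1024 * 1024 bytes, two decimals."""
    return f"{size_bytes / (1024 * 1024):.2f}"


for size in (2_000_000, 8_000_000, 16_000_000):
    print(size, "bytes ->", to_mb(size), "MB")
# 2000000 bytes -> 1.91 MB
# 8000000 bytes -> 7.63 MB
# 16000000 bytes -> 15.26 MB
```

The 24.79 MB figure does not match any file in the directory listing; under the same assumption it would correspond to roughly 26 million bytes.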


edwintorok at gmail

Dec 17, 2008, 8:47 AM

Post #10 of 18 (1779 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-17 18:37, Thomasz Blaszczyk wrote:
> ok, it seems that limits.maxfilesize limits to 10MB, but I am able to
> scan up to 25MB files. see below:
> (when I scan 30MB file the data scanned is 0, Why is like that? and I
> am able to scan nearly 25MB)
>

Read the archives of -users. This question has been repeatedly raised there.

--Edwin


tomb.fish at gmail

Dec 17, 2008, 10:27 AM

Post #11 of 18 (1826 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

I just got first results here,
http://omploader.org/vMTExNA

What do you think about them?

I am wondering about BM: it is not as efficient as AC, and I cannot see
why BM is used for static signatures. AC works better here...

Or am I missing something?

I copied the first 20 signatures from main.ndb and used them for the
performance measurements.

Looking forward to feedback,
Thx,Tom



On Wed, Dec 17, 2008 at 6:10 PM, Thomasz Blaszczyk <tomb.fish[at]gmail.com> wrote:
> Thx, found it;)


--
"Stay Hungry. Stay Foolish." Steve Jobs


tomb.fish at gmail

Dec 17, 2008, 10:29 AM

Post #12 of 18 (1769 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

I also changed all 20 signatures to the format:

<name>:0:*:<signature>

Regards,
Tom
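
For reference, a line in this format has four colon-separated fields: name, target type, offset, and hex body. The sketch below parses such a line and tests it against a buffer. It handles only what appears in this thread (a plain hex body with offset `*` or a decimal offset) and none of the wildcard syntax; `NdbSignature`, `parse_ndb_line` and `matches` are illustrative names, not ClamAV APIs.

```python
from typing import NamedTuple


class NdbSignature(NamedTuple):
    name: str
    target: str
    offset: str
    pattern: bytes


def parse_ndb_line(line: str) -> NdbSignature:
    """Parse `name:target:offset:hex`; wildcards in the hex body are not handled."""
    name, target, offset, hex_body = line.strip().split(":")[:4]
    return NdbSignature(name, target, offset, bytes.fromhex(hex_body))


def matches(sig: NdbSignature, data: bytes) -> bool:
    # Offset "*" means the pattern may occur anywhere in the data;
    # otherwise this sketch expects a plain decimal offset.
    if sig.offset == "*":
        return sig.pattern in data
    start = int(sig.offset)
    return data[start:start + len(sig.pattern)] == sig.pattern


sig = parse_ndb_line("TinyVirus:0:*:B89A020000C3")
print(matches(sig, b"\xb8" * 1000))             # False
print(matches(sig, b"\x00" + sig.pattern))      # True
```

A file made only of B8 bytes never contains the full six-byte body, which is consistent with the OK results reported for the test files.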


edwintorok at gmail

Dec 17, 2008, 10:47 AM

Post #13 of 18 (1780 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-17 20:27, Thomasz Blaszczyk wrote:
> I just got first results here,
> http://omploader.org/vMTExNA
>
> What do you think about them?
>

What kind of data was scanned?
Was it hand-crafted, automatically generated, or real world files?
What is the confidence of the values you measured?
(I don't see if you've repeated the experiment or not, there is no
standard deviation, or any other statistical indicator).

> Just wonder about BM, it is not efficient as AC, cannot see point why
> BM is used for static signatures. AC works here better...
>
> or am I missing something,
>

I've already answered this:

if you switch ClamAV to use only AC, you'll notice a significant
performance improvement, at the expense of increased memory usage for
the DB.


> I copied 20 first signature from main.ndb, and use them for performace
> measurements.
>

Results from benchmarks with such a low signature count will be useless
in practice.

Hint: larger tries don't fit in the L2 cache, and they also produce a lot of DTLB misses

Best regards,
--Edwin
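
One way to report the kind of statistical indicator asked for above is to repeat each measurement and give the mean together with the sample standard deviation. A minimal sketch follows; the workload (Python's built-in `bytes.find` over a buffer of B8 bytes) is only a placeholder for whatever is actually being measured, and `benchmark` is an illustrative name.

```python
import statistics
import time


def benchmark(fn, *args, repeats=10):
    """Call fn(*args) `repeats` times; return (mean, sample stdev) in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


data = bytes([0xB8]) * 1_000_000
pattern = bytes.fromhex("B89A020000C3")
mean_s, stdev_s = benchmark(data.find, pattern, repeats=10)
print(f"{mean_s * 1e3:.3f} ms +/- {stdev_s * 1e3:.3f} ms over 10 runs")
```

A difference between two matchers is only meaningful if it is clearly larger than the spread reported for each of them.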


tomb.fish at gmail

Dec 17, 2008, 11:28 AM

Post #14 of 18 (1780 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

> What kind of data was scanned?
> Was it hand-crafted, automatically generated, or real world files?

I create the files by calling fputc('my_byte') in a loop,
i.e.:
file_builder -n sizeoffile -xB8

So each file consists entirely of 'B8' bytes, and I create files of 2MB,
4MB, and so on up to 60MB.

> What is the confidence of the values you measured?
> (I don't see if you've repeated the experiment or not, there is no
> standard deviation, or any other statistical indicator).
Right, there is some deviation. (I repeated each measurement 3 times and
took the average.) I will repeat the measurements and calculate the deviation.


>> Just wonder about BM, it is not efficient as AC, cannot see point why
>> BM is used for static signatures. AC works here better...
>>
>> or am I missing something,
>>
>
> I've already answered this:
>
> if you switch ClamAV to use only AC, you'll notice a significant
> performance improvement, at the expense of increased memory usage for
> the DB.
Right, AC trees are quite large and take a lot of memory.
So BM is only used to save memory? I guess it was implemented first
and some people still feel sentimental about this algorithm :)
After all, AC works faster and handles wildcards...

>
>> I copied 20 first signature from main.ndb, and use them for performace
>> measurements.
>>
>
> Results from benchmarks with such a low signature count will be useless
> in practice.
>
> Hint: larger tries don't fit in L2, and also produce a lot of DTLB misses
>
Thanks for hints
Regards,
Tom

> Best regards,
> --Edwin



--
"Stay Hungry. Stay Foolish." Steve Jobs


edwintorok at gmail

Dec 17, 2008, 11:36 AM

Post #15 of 18 (1773 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

On 2008-12-17 21:28, Thomasz Blaszczyk wrote:
>> What kind of data was scanned?
>> Was it hand-crafted, automatically generated, or real world files?
>>
>
> I create files by calling in loop function: fputc('my_byte')
> i.e:
> file_builder -n sizeoffile -xB8
>
> So entire file consists of bytes 'B8' and I create 2MB, 4MB file, up
> to 60MB files
>

You might want to scan something resembling a real world file, and I'm
not saying to use /dev/urandom instead of B8.
I can think of a much more efficient algorithm to match on B8 bytes...


Best regards,
--Edwin
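
One possible reading of the remark about uniform input: on a file made of a single repeated byte, a skip-based matcher takes exactly the same shift at every step, so the benchmark exercises only one code path. The sketch below uses plain Boyer-Moore-Horspool, a simplified single-pattern relative of Boyer-Moore, and not ClamAV's extended BM matcher; `horspool_search` is an illustrative name.

```python
def horspool_search(pattern: bytes, text: bytes):
    """Return (first match index or -1, number of window alignments tried)."""
    m = len(pattern)
    # Bad-character table: distance from the last occurrence of each byte
    # (ignoring the final pattern byte) to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    pos = 0
    alignments = 0
    while pos + m <= len(text):
        alignments += 1
        if text[pos:pos + m] == pattern:
            return pos, alignments
        # Shift according to the text byte under the last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return -1, alignments


pattern = bytes.fromhex("B89A020000C3")
print(horspool_search(pattern, b"\xb8" * 1000))        # (-1, 199)
print(horspool_search(pattern, b"\xb8" * 7 + pattern))  # (7, 4)
```

On the all-B8 input every window ends in B8, so the shift is always 5 and the number of alignments is fixed by the file size alone. Real files produce a mix of shifts.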


tomb.fish at gmail

Dec 17, 2008, 11:48 AM

Post #16 of 18 (1782 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

> You might want to scan something resembling a real world file, and I'm
> not saying to use /dev/urandom instead of B8.
> I can think of a much more efficient algorithm to match on B8 bytes...

Oh yes, there will be several test cases; B8 bytes is only one of them.
There will also be a test case on DNA sequence scanning :)

Cheers!


gim at skrzynka

Dec 20, 2008, 10:36 AM

Post #17 of 18 (1745 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

Thomasz Blaszczyk in message 'Re: [Clamav-devel] clamAV scanning algorithm' wrote:
> >
> > if you switch ClamAV to use only AC, you'll notice a significant
> > performance improvement, at the expense of increased memory usage for
> > the DB.
> Right, AC trees are quite large and takes lot of memory..
> So BM is only used to save memory? I guess, it was implemented first
> and some people still feel sentiment to this algorithm..:)
> Since AC works faster and handles wildcards...
>

If I remember correctly, it was just the opposite:
AC was implemented first (the version I recall was 0.67;
I haven't looked at the code of earlier versions except
some truly historical stuff like 0.11 or 0.15),
and later, because of the memory problems Edwin has mentioned
at least a few times, BM was added to resolve those
issues.

cheers,
--
main(int a[puts("Michal 'GiM' Spadlinski")]){}
Attachments: signature.asc (0.18 KB)


tomb.fish at gmail

Dec 20, 2008, 4:37 PM

Post #18 of 18 (1733 views)
Permalink
Re: clamAV scanning algorithm [In reply to]

Hey guys,

Thanks for comments!

I got something interesting, with measurements and graphs:
http://omploader.org/vMTFlYw

And I add my comment:
"The intersection points, where Boyer-Moore improves in performance and
Aho-Corasick gets slower, shift to the right as the number of patterns
increases (together with the file size)."

I am looking forward to feedback, especially from Edwin: were the
measurements correct? Do the results conform to your thoughts and
expectations about those algorithms?
GiM, it seems that BM is efficient in some cases; see the link above.
(BTW, thanks for the brief history of the algorithms implemented in previous
ClamAV versions.)

My second comment:

"Each measurement was performed 3 times. Across the 3 trials a small
difference can occur. This deviation is around 30 ms at most and can be
neglected. The average value of the 3 trials has been taken.
Very surprising results can be observed. For a small number of signatures
(20, 100, 200 signatures in the database) Aho-Corasick is much better than
Boyer-Moore: scanning a 2MB file takes 2 times longer with Boyer-Moore
than with Aho-Corasick. But with more than 10,000 signatures in the
database, for the same file size, Boyer-Moore makes up for lost time.
Still, for larger files (greater than 3 MB) with the same
signature database, Aho-Corasick is better."

Greetings,
Tom

--
"Stay Hungry. Stay Foolish." Steve Jobs