
Mailing List Archive: Varnish: Misc

Best practice for not caching content requested by crawlers


damon at huddler-inc

Jul 19, 2012, 10:09 AM

Post #1 of 6
Best practice for not caching content requested by crawlers

Hi Everyone,
We have reason to believe that we have some amount of cache pollution from
crawlers. We came to this conclusion after attempting to determine the size
of our hot data set.

To determine the size of our hot data set, we summed the request sizes for
all objects that had a hit count > 1 over a nine-hour period that included
the day's peaks. The sum came out to about 5GB.

We have allocated 18GB (using malloc) to Varnish, and we are nuking at a
rate of about 80 objects per second on a box whose hit rate is hovering
around 70%. This suggests to me that we have a lot of data in the cache
that is not actively being requested; it's not hot. The goal of this effort
is to more accurately determine whether we need to add Varnish capacity
(more memory). I'm using "Sizing your cache"
<https://www.varnish-cache.org/docs/2.1/tutorial/sizing_your_cache.html>
as a guide and taking the advice there to try to reduce the n_lru_nuked
rate, hopefully driving it to 0.
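
A quick way to watch that counter, assuming the stock varnishstat tool is
available on the box:

    # Dump all counters once and pick out the LRU nuke counter;
    # the goal is for this value to stop increasing.
    varnishstat -1 | grep n_lru_nuked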

As an experiment to both improve our hit rate and ensure we are getting the
most out of the memory allocated to Varnish, I want to explore configuring
Varnish so that it does not cache responses to requests coming from
crawlers. I'm defining crawlers as requests whose User-Agent headers
contain strings like Googlebot, msnbot, etc.

So my question is: what is the best practice for doing this? If a request
comes from a crawler and the object is in the cache, I'm fine serving it
from the cache. However, if the request comes from a crawler and the object
is not in the cache, I don't want Varnish to cache it.

Any suggestions would be appreciated.

Thanks,
Damon


lasse.karstensen at gmail

Jul 20, 2012, 2:04 AM

Post #2 of 6
Re: Best practice for not caching content requested by crawlers

Damon Snyder:
> We have reason to believe that we have some amount of cache pollution from
> crawlers. We believe this to be the case after we attempted to determine
> the size of our hot data set.
[..]
> So my question is, what is the best practice for doing this? If a request
> comes from the crawler and its in the cache, I'm fine serving it from the
> cache. However if the request comes from the crawler and its not in the
> cache, I don't want varnish to cache it.

I'm not sure whether this is a good idea, but you can do it in VCL like
this:

sub vcl_miss {
    if (req.http.user-agent ~ "(?i)yandex|msnbot") {
        return (pass);
    }
}

You can probably use openddr/deviceatlas/$favorite_detectionengine to get
better accuracy than this regex.

--
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/



lasse.karstensen at gmail

Jul 20, 2012, 3:44 AM

Post #3 of 6
Re: Best practice for not caching content requested by crawlers

Lasse Karstensen:
[..]
> sub vcl_miss {
> if (req.http.user-agent ~ "(?i)yandex|msnbot") {
> return(pass);
> }
> }
> You can probably use openddr/deviceatlas/$favorite_detectionengine to get
> better accuracy than this regex.

I took a look at some access logs and updated devicedetect.vcl a bit, so it
now has rudimentary bot detection:

https://github.com/varnish/varnish-devicedetect/blob/master/devicedetect.vcl


--
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/



damon at huddler-inc

Jul 24, 2012, 9:53 AM

Post #4 of 6
Re: Best practice for not caching content requested by crawlers

Hi Lasse,
Thanks! I forgot to mention this in the original email, but we are using
Varnish 2.1.5. Here is what I ended up doing:

sub vcl_fetch {
    ...

    if (req.http.User-Agent ~
            "(?i)(msn|google|bing|yandex|youdao|exa|mj12|omgili|flr-|ahrefs|blekko)bot" ||
        req.http.User-Agent ~
            "(?i)(magpie|mediapartners|sogou|baiduspider|nutch|yahoo.*slurp|genieo)") {
        set beresp.http.X-Bot-Bypass = "YES";
        set beresp.ttl = 0s;
        return (pass);
    }

    ...
}

The X-Bot-Bypass header was just for testing this configuration. With this
filtering and a lower TTL for some of our other objects, our nuke rate is
now at 0. The hit rate hasn't changed, but I think we need more granularity
in our hit-rate metrics; for example, perhaps we should be looking at
non-bot hit rates.

Thanks,
Damon




lasse.karstensen at gmail

Jul 25, 2012, 4:17 AM

Post #5 of 6
Re: Best practice for not caching content requested by crawlers

Damon Snyder:
> Hi Lasse,
> Thanks! I forgot to mention this in the original email, but we are using
> varnish 2.1.5. Here is what I ended up doing:
> sub vcl_fetch {
>     ...
>     if (req.http.User-Agent ~
>             "(?i)(msn|google|bing|yandex|youdao|exa|mj12|omgili|flr-|ahrefs|blekko)bot" ||
>         req.http.User-Agent ~
>             "(?i)(magpie|mediapartners|sogou|baiduspider|nutch|yahoo.*slurp|genieo)") {
>         set beresp.http.X-Bot-Bypass = "YES";
>         set beresp.ttl = 0s;
>         return (pass);
>     }
>     ...
> }

Hi Damon.

Just a quick note: doing this check in vcl_fetch will lead to serialisation
of backend requests. This will hurt your HTTP response times, and since
these bots take response time into account, it will probably also hurt your
search engine visibility.

I'd advise you to do this test in vcl_miss, and also not to override
beresp.ttl, so that Varnish stores the hit_for_pass object for a while.

If you need the debug header, you can store it temporarily in
req.http.x-bot-bypass and then check/set resp.http.x-bot-bypass in
vcl_deliver.
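
A minimal sketch of that arrangement (untested; the header name and bot
regex are carried over from the earlier examples):

sub vcl_miss {
    if (req.http.User-Agent ~ "(?i)yandex|msnbot") {
        # Mark the request so vcl_deliver can expose the debug header.
        set req.http.X-Bot-Bypass = "YES";
        return (pass);
    }
}

sub vcl_deliver {
    if (req.http.X-Bot-Bypass) {
        set resp.http.X-Bot-Bypass = req.http.X-Bot-Bypass;
    }
}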

--
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/



damon at huddler-inc

Jul 25, 2012, 5:49 PM

Post #6 of 6
Re: Best practice for not caching content requested by crawlers

Hi Lasse,
Correct me if I'm wrong, but vcl_miss is not available in Varnish 2.1
(suggested here:
<https://www.varnish-software.com/static/book/Cache_invalidation.html#naming-confusion>).
Is there a Varnish 2.x approach that would improve the response times?

As an aside, our content is very broad -- there is a LOT of it. It's
unlikely that serialization would be a concern for the bots unless multiple
bots happened to simultaneously hit content that wasn't currently hot.

That being said, we are exploring some form of caching for the bots. The
rules/TTLs should probably look different from those for our normal
traffic.

Thanks for the followup. I really appreciate it.

Damon

