
Mailing List Archive: Varnish: Dev

My random thoughts

 

 



phk at phk

Feb 9, 2006, 1:08 AM

Post #1 of 16
My random thoughts

Here are my random thoughts on Varnish until now. Some of it mirrors
what we talked about in the meeting, some of it is more detailed or
reaches further into speculation.

Poul-Henning


Notes on Varnish
----------------

Philosophy
----------

It is not enough to deliver a technically superior piece of software
if it is not possible for people to deploy it usefully in a sensible
way and a timely fashion.


Deployment scenarios
--------------------

There are two fundamental usage scenarios for Varnish: when the
first machine is brought up to offload a struggling backend and
when a subsequent machine is brought online to help handle the load.


The first (layer of) Varnish
----------------------------

Somebody's webserver is struggling and they decide to try Varnish.

Often this will be a skunkworks operation with some random PC
purloined from wherever it wasn't being used and the Varnish "HOWTO"
in one hand.

If they do it in an orderly fashion before things reach panic
proportions, a sensible model is to set up the Varnish box, test it
out from your own browser and see that it answers correctly. Test it
some more and then add the IP# to the DNS records so that it takes
50% of the load off the backend.

If it happens as firefighting at 3AM, the backend will be moved to
another IP, the Varnish box given the main IP, and things had better
work really well, really fast.

In both cases, it would be ideal if all that is necessary to tell
Varnish are two pieces of information:

Storage location
Alternatively we can offer an "auto" setting that makes
Varnish discover what is available and use what it finds.

DNS or IP# of backend.

IP# is useful when the DNS settings are not quite certain
or when split DNS horizon setups are used.

Ideally this can be done on the commandline so that there is no
configuration file to edit to get going, just

varnish -d /home/varnish -s backend.example.dom

and you're off running.

A text, curses or HTML based facility to give some instant
feedback and stats is necessary.

If circumstances are not conducive to a structured approach, it should
be possible to repeat this process and set up N independent Varnish
boxes and get some sort of relief without having to read any further
documentation.


The subsequent (layers of) Varnish
----------------------------------

This is what happens once everybody has caught their breath,
and where we start to talk about Varnish clusters.

We can assume that at this point, the already installed Varnish
machines have been configured more precisely and that people
have studied Varnish configuration to some level of detail.

When Varnish machines are put in a cluster, the administrator should
be able to consider the cluster as a unit and not have to think and
interact with the individual nodes.

Some sort of central management node or facility must exist and
it would be preferable if this was not a physical but a logical
entity so that it can follow the admin to the beach. Ideally it
would give basic functionality in any browser, even mobile phones.

The focus here is scalability; we want to avoid per-machine
configuration if at all possible. Ideally, preconfigured hardware
can be plugged into power and net, find an address with DHCP, contact
a preconfigured management node, get a configuration and start working.

But we also need to think about how we keep a site of Varnish
machines from acting like a stampeding horde when the power or
connectivity is brought back after a disruption. Some sort of
slow starting ("warm-up" ?) must be implemented to prevent them
from hitting the backend with full force.

An important aspect of cluster operations is giving a statistically
meaningful judgement of the cluster size, in particular answering
the question "would adding another machine help ?" precisely.

We should have a facility that allows the administrator to type
in a REGEXP/URL and have all the nodes answer with a checksum, age
and expiry timer for any documents they have which match. The
results should be grouped by URL and checksum.


Technical concepts
------------------

We want the central Varnish process to be just that: one process, and
we want to keep it small and efficient at all costs.

Code that will not be used for the central functionality should not
be part of the central process. For instance code to parse, validate
and interpret the (possibly) complex configuration file should be a
separate program.

Depending on the situation, the Varnish process can either invoke
this program via a pipe or receive the ready-to-use data structures
via a network connection.

Exported data from the Varnish process should be made as cheap as
possible, likely shared memory. That will allow us to deploy separate
processes for log-grabbing, statistics monitoring and similar
"off-duty" tasks and let the central process get on with the
important job.


Backend interaction
-------------------

We need a way to tune the backend interaction further than what the
HTTP protocol offers out of the box.

We can assume that all documents we get from the backend have an
expiry timer; if not, we will set a default timer (configurable, of
course).

But we need further policy than that. Amongst the questions we have
to ask are:

How long after expiry can we serve a cached copy
of this document while we have reason to believe the backend
can supply us with an update ?

How long after expiry can we serve a cached copy
of this document if the backend does not reply or is
unreachable ?

If we cannot serve this document out of cache and the backend
cannot inform us, what do we serve instead (404 ? A default
document of some sort ?)

Should we just not serve this page at all if we are in a
bandwidth crush (DoS/stampede) situation ?

It may also make sense to have an "emergency detector" which triggers
when the backend is overloaded and offers a scaling factor for all
timeouts when in such an emergency state. Something like "If
the average response time of the backend rises above 10 seconds,
multiply all expiry timers by two".
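
A minimal sketch, in C, of how these policy knobs might hang together;
the struct, the field names and the scaling rule are all assumptions
for illustration, not a committed design:

#include <time.h>

/* Hypothetical per-document policy; all names are illustrative. */
struct expiry_policy {
	time_t	grace_backend_ok;	/* serve this long past expiry while backend looks healthy */
	time_t	grace_backend_down;	/* serve this long past expiry while backend is unreachable */
	double	emergency_factor;	/* scale timers by this in an emergency state */
};

/* Return nonzero if a cached object may still be served at 'now'. */
static int
may_serve(const struct expiry_policy *p, time_t expiry, time_t now,
    int backend_up, int emergency)
{
	time_t grace = backend_up ? p->grace_backend_ok : p->grace_backend_down;

	if (emergency)
		grace = (time_t)(grace * p->emergency_factor);
	return (now <= expiry + grace);
}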

It probably also makes sense to have a bandwidth/request traffic
shaper for backend traffic to prevent any one Varnish machine from
pummeling the backend in case of attacks or misconfigured
expiry headers.


Startup/consistency
-------------------

We need to decide what to do about the cache when the Varnish
process starts. There may be a difference between it starting for the
first time after the machine has booted and when it is subsequently
(re)started.

By far the easiest thing to do is to disregard the cache; that saves
a lot of code for locating and validating the contents, but this
carries a penalty in backend or cluster fetches whenever a node
comes up. Let's call this the "transient cache model".

The alternative is to allow persistently cached contents to be used
according to configured criteria:

Can expired contents be served if we can't contact the
backend ? (dangerous...)

Can unexpired contents be served if we can't contact the
backend ? If so, how much past the expiry ?

It is a very good question how big a fraction of the persistent
cache would be usable after typical downtimes:

After a Varnish process restart: Nearly all.

After a power-failure ? Probably at least half, but probably
not the half that contains the busiest pages.

And we need to take into consideration whether validating the format
and contents of the cache might take more resources and time than
getting the content from the backend.

Off the top of my head, I would prefer the transient model any day
because of the simplicity and lack of potential consistency problems,
but if the load on the back end is intolerable this may not be
practically feasible.

The best way to decide is to carefully analyze a number of cold
starts and cache content replacement traces.

The choice we make does affect the storage management part of Varnish,
but I see that as being modular in any case, so it may merely be
that some storage modules come up clean on any start while others
will come up with existing objects cached.


Clustering
----------

I'm somewhat torn on clustering for traffic purposes. For admin
and management: yes, certainly. But starting to pass objects from
one machine in a cluster to another is likely to just be a waste
of time and code.

Today one can trivially fit 1TB into a 1U machine so the partitioning
argument for cache clusters doesn't sound particularly urgent to me.

If all machines in the cluster have sufficient cache capacity, the
other remaining argument is backend offloading, which would likely
be better addressed by implementing a 1:10 style two-layer cluster
with the second level node possibly having twice the storage of
the front row nodes.

The coordination necessary for keeping track of, or discovering in
real-time, who has a given object can easily turn into a traffic
and cpu load nightmare.

And from a performance point of view, it only reduces quality:
First we send out a discovery multicast, then we wait some amount
of time to see if a response arrives, and only then do we start
to ask the backend for the object. With a two-level cluster
we can ask the layer-two node right away and if it doesn't have
the object it can ask the back-end right away, no timeout is
involved in that.

Finally, consider the impact on a cluster of a "must get" object
like an IMG tag with a misspelled URL. Every hit on the front page
results in one get of the wrong URL. One machine in the cluster
asks everybody else in the cluster "do you have this URL" every
time somebody gets the front page.

If we implement a negative feedback protocol ("No I don't"), then
each hit on the wrong URL will result in N+1 packets (assuming multicast).

If we use a silent negative protocol the result is less severe for
the machine that got the request, but still everybody wakes up
to find out that no, we didn't have that URL.

Negative caching can mitigate this to some extent.


Privacy
-------

Configuration data and instructions passed back and forth should
be encrypted and signed if so configured. Using PGP keys is
a very tempting and simple solution which would pave the way for
administrators typing a short ascii encoded pgp signed message
into an SMS from their Bahamas beach vacation...


Implementation ideas
--------------------

The simplest storage method mmap(2)'s a disk or file and puts
objects into the virtual memory on page aligned boundaries,
using a small struct for metadata. Data is not persistent
across reboots. Object free is incredibly cheap. Object
allocation should reuse recently freed space if at all possible.
"First free hole" is probably a good allocation strategy.
Sendfile can be used if file-backed. If nothing else, disks
can be used by making a 1-file filesystem on them.
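
As a rough illustration of this, here is a minimal C sketch that maps
a storage file and hands out page-aligned space from it; the metadata
struct and the bump allocator (a stand-in for "first free hole") are
assumptions, not a finished design:

#include <sys/mman.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

struct obj_meta {		/* small per-object metadata, illustrative only */
	size_t	len;
	time_t	expiry;
};

static void *arena;
static size_t arena_len, arena_used;

/* Map the whole storage file into our address space. */
static int
storage_init(const char *path)
{
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return (-1);
	arena_len = (size_t)lseek(fd, 0, SEEK_END);
	arena = mmap(NULL, arena_len, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	close(fd);
	return (arena == MAP_FAILED ? -1 : 0);
}

/* Bump allocator placeholder; a real one would reuse freed holes. */
static void *
storage_alloc(size_t len)
{
	size_t pg = (size_t)sysconf(_SC_PAGESIZE);
	size_t need = (len + pg - 1) & ~(pg - 1);	/* round up to a page */

	if (arena_used + need > arena_len)
		return (NULL);
	arena_used += need;
	return ((char *)arena + arena_used - need);
}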

More complex storage methods are object-per-file and
object-in-database models. They are relatively trivial and well
understood. May offer persistence.

Read-Only storage methods may make sense for getting hold
of static emergency contents from CD-ROM etc.

Treat each disk arm as a separate storage unit and keep track of
service time (if possible) to decide storage scheduling.

Avoid regular expressions at runtime. If the config file contains
regexps, compile them into executable code and dlopen() it
into the Varnish process. Use versioning and refcounts to
do memory management on such segments.
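
A minimal sketch of the dlopen() idea: the config compiler would emit
a C file with a match function, build it into a shared object, and the
running process would load it. The "cfg_match" symbol name and the
struct are assumptions for illustration:

#include <dlfcn.h>
#include <stddef.h>

/* The generated .so is assumed to export: int cfg_match(const char *url); */
typedef int cfg_match_f(const char *);

struct matcher {
	void		*dlh;		/* dlopen handle */
	cfg_match_f	*match;
	int		refcount;	/* lets old versions be unloaded safely */
};

static int
matcher_load(struct matcher *m, const char *sopath)
{
	m->dlh = dlopen(sopath, RTLD_NOW | RTLD_LOCAL);
	if (m->dlh == NULL)
		return (-1);
	m->match = (cfg_match_f *)dlsym(m->dlh, "cfg_match");
	if (m->match == NULL) {
		dlclose(m->dlh);
		return (-1);
	}
	m->refcount = 1;
	return (0);
}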

Avoid committing transmit buffer space until we have a bandwidth
estimate for the client. One possible way: send the HTTP header
and time the ACKs coming back, then calculate the transmit buffer
size and send the object. This makes DoS attacks more harmless and
mitigates traffic stampedes.
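
One crude way to turn the ACK timing into a buffer size, as a sketch;
the two-second target and the floor/cap values are made-up numbers:

#include <stddef.h>

/* Size the transmit buffer from how long the client took to ACK
 * the HTTP header.  All constants are illustrative. */
static size_t
tx_buffer_size(size_t hdr_bytes, double ack_seconds)
{
	double bps = hdr_bytes / ack_seconds;	/* rough bandwidth estimate */
	size_t sz = (size_t)(bps * 2.0);	/* ~2 seconds worth of data */

	if (sz < 4096)
		sz = 4096;			/* floor: one page */
	if (sz > 1048576)
		sz = 1048576;			/* cap: 1 MB */
	return (sz);
}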

Kill all TCP connections after N seconds; nobody waits an hour
for a web-page to load.

Abuse mitigation interface to firewall/traffic shaping: allow
the central node to put an IP/net into traffic shaping firewall
rules or take it out again. A monitor/interface process (not the
main Varnish process) calls a script to configure the firewalling.

"Warm-up" instructions can take a number of forms and we don't know
what is the most efficient or most usable. Here are some ideas:

Start at these URL's then...

... follow all links down to N levels.

... follow all links that match REGEXP no deeper than N levels down.

... follow N random links no deeper than M levels down.

... load N objects by following random links no deeper than
M levels down.

But...

... never follow any links that match REGEXP

... never pick up objects larger than N bytes

... never pick up objects older than T seconds


It makes a lot of sense to not actually implement this in the main
Varnish process, but rather supply a template perl or python script
that primes the cache by requesting the objects through Varnish.
(That would require us to listen separately on 127.0.0.1
so the perl script can get in touch with Varnish while in warm-up.)
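
For illustration, such a primer could be as small as this libcurl
sketch, which fetches a list of URLs through a Varnish listener on
127.0.0.1; the port number is an assumption:

#include <stdio.h>
#include <curl/curl.h>

/* Prime the cache by requesting each URL through Varnish. */
int
main(int argc, char **argv)
{
	CURL *curl;
	int i;

	curl_global_init(CURL_GLOBAL_DEFAULT);
	curl = curl_easy_init();
	if (curl == NULL)
		return (1);
	/* 8080 is an assumed warm-up listener, not a documented default. */
	curl_easy_setopt(curl, CURLOPT_PROXY, "http://127.0.0.1:8080");
	for (i = 1; i < argc; i++) {
		curl_easy_setopt(curl, CURLOPT_URL, argv[i]);
		if (curl_easy_perform(curl) != CURLE_OK)
			fprintf(stderr, "failed: %s\n", argv[i]);
	}
	curl_easy_cleanup(curl);
	curl_global_cleanup();
	return (0);
}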

One interesting but quite likely overengineered option in the
cluster case is if the central monitor tracks a fraction of the
requests through the logs of the running machines in the cluster,
spots the hot objects and tells the warming-up Varnish which objects
to get and from where.


In the cluster configuration, it is probably best to run the cluster
interaction in a separate process rather than in the main Varnish
process. Information from Varnish to the cluster would go through the
shared memory, but we don't want to implement locking in the shmem,
so some sort of back-channel (UNIX domain or UDP socket ?) is necessary.

If we have such a "supervisor" process, it could also be tasked
with restarting the varnish process if vital signs fail: a time
stamp in the shmem or kill -0 $pid.
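
A sketch of that vital-signs check, assuming the child keeps a
heartbeat timestamp in the shared memory segment; the struct layout
and the 30-second limit are made up for illustration:

#include <sys/types.h>
#include <signal.h>
#include <time.h>

struct shm_vitals {			/* illustrative layout only */
	volatile time_t	heartbeat;	/* updated regularly by the child */
};

/* Return nonzero if the Varnish child should be restarted. */
static int
child_is_dead(pid_t pid, const struct shm_vitals *v)
{
	if (kill(pid, 0) == -1)			/* process gone entirely ? */
		return (1);
	if (time(NULL) - v->heartbeat > 30)	/* heartbeat gone stale ? */
		return (1);
	return (0);
}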

It may even make sense to run the "supervisor" process in stand-alone
mode as well, where it can offer an HTML based interface
to the Varnish process (via shmem).

For cluster use the user would probably just pass an extra argument
when he starts up Varnish:

varnish -c $cluster_args $other_args

vs

varnish $other_args

and a "varnish" shell script will Do The Right Thing.


Shared memory
-------------

The shared memory layout needs to be thought about somewhat. On one
hand we want it to be stable enough to allow people to write programs
or scripts that inspect it, on the other hand doing it entirely in
ascii is both slow and prone to race conditions.

The various data types in the shared memory can either be
put into one single segment (= 1 file) or into individual segments
(= multiple files). I don't think the number of small data types
is big enough to make the latter impractical.

Storing the "big overview" data in shmem in ASCII or HTML would
allow one to point cat(1) or a browser directly at the mmaped file
with no interpretation necessary, a big plus in my book.

Similarly, if we don't update them too often, statistics could be stored
in shared memory in perl/awk friendly ascii format.

But the logfile will have to be (one or more) FIFO logs, probably at least
three in fact: Good requests, Bad requests, and exception messages.

If we decide to make log entries fixed length, we could make them ascii
so that a simple "sort -n /tmp/shmem.log" would put them in order after
a leading numeric timestamp, but it is probably better to provide a
utility to cat/tail -f the log and keep the log in a bytestring FIFO
format. Overruns should be marked in the output.
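
One way the fixed-length record idea could look, as a sketch; the
field names and sizes are assumptions:

#include <stdint.h>

/*
 * Illustrative fixed-size shmem log record.  A leading ASCII decimal
 * timestamp would make "sort -n" work on a raw dump; a binary header
 * like this one instead trades that for compactness and an explicit
 * sequence number, whose gaps mark overruns in the output.
 */
struct shmlog_rec {
	uint64_t	seq;		/* gaps here mean lost records */
	uint64_t	timestamp;	/* microseconds since the epoch */
	uint8_t		tag;		/* good request / bad request / exception */
	char		data[111];	/* NUL-terminated payload; 128 bytes total */
};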


*END*
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


des at linpro

Feb 9, 2006, 6:51 AM

Post #3 of 16
My random thoughts [In reply to]

Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
> Here are my random thoughts on Varnish until now.

Thank you. I will try to take the time to read them and comment
tomorrow; I am currently busy preparing for a trade show early next
week.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no


des at linpro

Feb 10, 2006, 11:09 AM

Post #5 of 16
My random thoughts [In reply to]

Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
> It is not enough to deliver a technically superior piece of software,
> if it is not possible for people to deploy it usefully in a sensible
> way and timely fashion.

I tend to favor usability over performance. I believe you tend to
favor performance over usability. Hopefully, our opposing tendencies
will combine and the result will be a perfect balance ;)

> In both cases, it would be ideal if all that is necessary to tell
> Varnish are two pieces of information:
>
> Storage location
> Alternatively we can offer an "auto" setting that makes
> Varnish discover what is available and use what it finds.

I want Varnish to support multiple storage backends:

- quick and dirty squid-like hashed directories, to begin with

- fancy block storage straight to disk (or to a large preallocated
file) like you suggested

- memcached

> Ideally this can be done on the commandline so that there is no
> configuration file to edit to get going, just
>
> varnish -d /home/varnish -s backend.example.dom

This would use hashed directories if /home/varnish is a directory, and
block storage if it's a file or device node.

> We need to decide what to do about the cache when the Varnish
> process starts. There may be a difference between it starting
> first time after the machine booted and when it is subsequently
> (re)started.

This might vary depending on which storage backend is used. With
memcached, for instance, there is a possibility that varnish
restarted, but memcached is still running and still has a warm cache;
and if memcached also restarted, it will transparently obtain any
cached object from its peers. The disadvantage with memcached is that
we can't sendfile() from it.

> By far the easiest thing to do is to disregard the cache, that saves
> a lot of code for locating and validating the contents, but this
> carries a penalty in backend or cluster fetches whenever a node
> comes up. Let's call this the "transient cache model".

Another issue is that a persistent cache must store both data and
metadata on disk, rather than just store data on disk and metadata in
memory. This complicates not only the logic but also the storage
format.

> Can expired contents be served if we can't contact the
> backend ? (dangerous...)

Dangerous, but highly desirable in certain circumstances. I need to
locate the architecture notes I wrote last fall and place them online;
I spent quite some time thinking about and describing how this could
/ should be done.

> It is a very good question how big a fraction of the persistent
> cache would be usable after typical downtimes:
>
> After a Varnish process restart: Nearly all.
>
> After a power-failure ? Probably at least half, but probably
> not the half that contains the most busy pages.

When using direct-to-disk storage, we can (fairly) easily design the
storage format in such a way that updates are atomic, and make liberal
use of fsync() or similar to ensure (to the extent possible) that the
cache is in a consistent state after a power failure.
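
A sketch of one way to get that atomicity on a raw store: write the
object data, fsync(), and only then write a small commit record that
makes the object visible; names and layout here are illustrative:

#include <sys/types.h>
#include <unistd.h>

/* Two-phase update: if power fails before the commit record is
 * durable, the object simply does not exist after restart. */
static int
atomic_store(int fd, off_t data_off, const void *data, size_t len,
    off_t commit_off, const void *commit, size_t commit_len)
{
	if (pwrite(fd, data, len, data_off) != (ssize_t)len)
		return (-1);
	if (fsync(fd) == -1)		/* data is durable before... */
		return (-1);
	if (pwrite(fd, commit, commit_len, commit_off) != (ssize_t)commit_len)
		return (-1);
	return (fsync(fd));		/* ...the commit record points at it */
}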

> Off the top of my head, I would prefer the transient model any day
> because of the simplicity and lack of potential consistency problems,
> but if the load on the back end is intolerable this may not be
> practically feasible.

How about this: we start with the transient model, and add persistence
later.

> If all machines in the cluster have sufficient cache capacity, the
> other remaining argument is backend offloading, that would likely
> be better mitigated by implementing a 1:10 style two-layer cluster
> with the second level node possibly having twice the storage of
> the front row nodes.

Multiple cache layers may give rise to undesirable and possibly
unpredictable interaction (compare this to tunneling TCP/IP over TCP,
with both TCP layers battling each other's congestion control)

> Finally, consider the impact on a cluster of a "must get" object
> like an IMG tag with a misspelled URL. Every hit on the front page
> results in one get of the wrong URL. One machine in the cluster
> asks everybody else in the cluster "do you have this URL" every
> time somebody gets the front page.

Not if we implement negative caching, which we have to anyway -
otherwise all those requests go to the backend, which gets bogged down
sending out 404s.

> If we implement a negative feedback protocol ("No I don't"), then
> each hit on the wrong URL will result in N+1 packets (assuming
> multicast).

Or we can just ignore queries for documents which we don't have; the
requesting node will simply request the document from the
backend if no reply arrives within a short timeout (~1s).

> Configuration data and instructions passed forth and back should
> be encrypted and signed if so configured. Using PGP keys is
> a very tempting and simple solution which would pave the way for
> administrators typing a short ascii encoded pgp signed message
> into a SMS from their Bahamas beach vacation...

Unfortunately, PGP is very slow, so it should only be used to
communicate with some kind of configuration server, not with the cache
itself.

> The simplest storage method mmap(2)'s a disk or file and puts
> objects into the virtual memory on page aligned boundaries,
> using a small struct for metadata. Data is not persistent
> across reboots. Object free is incredibly cheap. Object
> allocation should reuse recently freed space if at all possible.
> "First free hole" is probably a good allocation strategy.
> Sendfile can be used if filebacked. If nothing else disks
> can be used by making a 1-file filesystem on them.

hmm, I believe you can sendfile() /dev/zero if you use that trick to
get a private mmap()ed arena.

> Avoid regular expressions at runtime. If config file contains
> regexps, compile them into executable code and dlopen() it
> into the Varnish process. Use versioning and refcounts to
> do memory management on such segments.

Unlike regexps, globs can be evaluated very efficiently.

> It makes a lot of sense to not actually implement this in the main
> Varnish process, but rather supply a template perl or python script
> that primes the cache by requesting the objects through Varnish.
> (That would require us to listen separately on 127.0.0.1
> so the perlscript can get in touch with Varnish while in warm-up.)

This can easily be done with existing software like w3mir.

> One interesting but quite likely overengineered option in the
> cluster case is if the central monitor tracks a fraction of the
> requests through the logs of the running machines in the cluster,
> spots the hot objects and tell the warming up varnish what objects
> to get and from where.

You can probably do this in ~50 lines of Perl using Net::HTTP.

> In the cluster configuration, it is probably best to run the cluster
> interaction in a separate process rather than the main Varnish
> process. From Varnish to cluster info would go through the shared
> memory, but we don't want to implement locking in the shmem so
> some sort of back-channel (UNIX domain or UDP socket ?) is necessary.

Distributed lock managers are *hard*... but we don't need locking for
simple stuff like reading logs out of shmem.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no


phk at phk

Feb 10, 2006, 11:42 AM

Post #7 of 16
My random thoughts [In reply to]

In message <ujrhd772ii0.fsf at cat.linpro.no>, Dag-Erling Smørgrav writes:
>Poul-Henning Kamp <phk at phk.freebsd.dk> writes:

>> In both cases, it would be ideal if all that is necessary to tell
>> Varnish are two pieces of information:
>>
>> Storage location
>> Alternatively we can offer an "auto" setting that makes
>> Varnish discover what is available and use what it finds.
>
>I want Varnish to support multiple storage backends:
>
> - quick and dirty squid-like hashed directories, to begin with

That's actually slow and dirty. So I'd prefer to wait with this
one until we know we need it (ie: persistence).

> - fancy block storage straight to disk (or to a large preallocated
> file) like you suggested

This is actually the simpler one to implement: make one file,
mmap it, sendfile from it.

I don't see any advantage to memcached right off the bat, but I
may become wiser later on.

Memcached is intended for when your app needs a shared memory
interface, which is then simulated using network.

Our app is network oriented and we know a lot more about our
data than memcached would, so we can do the networking more
efficiently ourselves.

>> By far the easiest thing to do is to disregard the cache, that saves
>> a lot of code for locating and validating the contents, but this
>> carries a penalty in backend or cluster fetches whenever a node
>> comes up. Let's call this the "transient cache model".
>
>Another issue is that a persistent cache must store both data and
>metadata on disk, rather than just store data on disk and metadata in
>memory. This complicates not only the logic but also the storage
>format.

Yes, although we can get pretty far with mmap on this too.

>> It is a very good question how big a fraction of the persistent
>> cache would be usable after typical downtimes:
>>
>> After a Varnish process restart: Nearly all.
>>
>> After a power-failure ? Probably at least half, but probably
>> not the half that contains the most busy pages.
>
>When using direct-to-disk storage, we can (fairly) easily design the
>storage format in such a way that updates are atomic, and make liberal
>use of fsync() or similar to ensure (to the extent possible) that the
>cache is in a consistent state after a power failure.

I meant "usable" as in "will be asked for", ie: usable for improving
the hitrate.

>How about this: we start with the transient model, and add persistence
>later.

My idea exactly :-)

Since I expect the storage to be pluggable, this should be pretty
straightforward.

>> If all machines in the cluster have sufficient cache capacity, the
>> other remaining argument is backend offloading, that would likely
>> be better mitigated by implementing a 1:10 style two-layer cluster
>> with the second level node possibly having twice the storage of
>> the front row nodes.
>
>Multiple cache layers may give rise to undesirable and possibly
>unpredictable interaction (compare this to tunneling TCP/IP over TCP,
>with both TCP layers battling each other's congestion control)

I doubt it. The front end Varnish fetches from the backend
into its store and from there another thread will serve the
users, so the two TCP connections are not interacting directly.

>Or we can just ignore queries for documents which we don't have; the
>requesting node will have a simply request the document from the
>backend if no reply arrives within a short timeout (~1s).

I want to avoid any kind of timeouts like that. One slight bulge
in your load and everybody times out and hits the backend.

>Unfortunately, PGP is very slow, so it should only be used to
>communicate with some kind of configuration server, not with the cache
>itself.

Absolutely. My plan was to have the "management process" do that.

>unlike regexps, globs can be evaluated very efficiently.

But more efficiently still if compiled into C code.

>> It makes a lot of sense to not actually implement this in the main
>> Varnish process, but rather supply a template perl or python script
>> that primes the cache by requesting the objects through Varnish.

>This can easily be done with existing software like w3mir.
>[...]
>You can probably do this in ~50 lines of Perl using Net::HTTP.

Sounds like you just won this bite :-)

>Distributed lock managers are *hard*...

Nobody is talking about distributed lock managers. The shared
memory is strictly local to the machine and r/o by everybody else
than the main Varnish process.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


des at linpro

Feb 11, 2006, 1:23 PM

Post #9 of 16
My random thoughts [In reply to]

"Poul-Henning Kamp" <phk at phk.freebsd.dk> writes:
> "Dag-Erling Sm?rgrav" <des at des.no> writes:
> > Multiple cache layers may give rise to undesirable and possibly
> > unpredictable interaction (compare this to tunneling TCP/IP over TCP,
> > with both TCP layers battling each other's congestion control)
> I doubt it. The front end Varnish fetches from the backend
> into its store and from there another thread will serve the
> users, so the two TCP connections are not interacting directly.

You took me a little too literally. What I meant is that we may see
undesirable interaction between the two layers, for instance in the
area of expiry handling (what will the front layer think when the rear
layer sends it expired documents?).

> > Unfortunately, PGP is very slow, so it should only be used to
> > communicate with some kind of configuration server, not with the cache
> > itself.
> Absolutely. My plan was to have the "management process" do that.

Hmm, we might as well go right ahead and call it a FEP :)

(see http://www.jargon.net/jargonfile/b/box.html if you didn't catch
the reference)

> > unlike regexps, globs can be evaluated very efficiently.
> But more efficiently still if compiled into C code.

I don't think so, but I may have overlooked something.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no


andersb at vgnett

Feb 12, 2006, 2:54 PM

Post #11 of 16 (553 views)
Permalink
My random thoughts [In reply to]

Good work guys. I had a great time reading the notes.

Here comes the sys.adm approach.

P.S. The sys.adm approach can easily be seen as an overengineered
solution; don't treat my approach as a must-have, more as a nice-to-have.


>Notes on Varnish
>----------------
>
>Philosophy
>----------
>
>It is not enough to deliver a technically superior piece of software,
>if it is not possible for people to deploy it usefully in a sensible
>way and timely fashion.
>[...]
>If circumstances are not conductive to strucured approach, it should
>be possible to repeat this process and set up N independent Varnish
>boxes and get some sort of relief without having to read any further
>documentation.

I think these are reasonable scenarios and solutions.

>
>The subsequent (layers of) Varnish
>----------------------------------
>
>[...]
>When Varnish machines are put in a cluster, the administrator should
>be able to consider the cluster as a unit and not have to think and
>interact with the individual nodes.

That would be great. Imho far too little software acts like this.
There could be a good reason for that, but I wouldn't know.

>Some sort of central management node or facility must exist and
>it would be preferable if this was not a physical but a logical
>entity so that it can follow the admin to the beach. Ideally it
>would give basic functionality in any browser, even mobile phones.

A web-browser interface and a CLI should cover 99% of use. An easy
protocol/API would make it possible for anybody to write their own
interface to the central management node.

>The focus here is scaleability, we want to avoid per-machine
>configuration if at all possible. Ideally, preconfigured hardware
>can be plugged into power and net, find an address with DHCP, contact
>preconfigured management node, get a configuration and start working.

This would ease many things. If one makes an image of some sort, one
does not have to build a new image for every config change (if that
happens more often than software updates).

>But we also need to think about how we avoid a site of Varnish
>machines from acting like a stampeeding horde when the power or
>connectivity is brought back after a disruption. Some sort of
>slow starting ("warm-up" ?) must be implemented to prevent them
>from hitting all the backend with the full force.

Yes. As you said in Oslo, Poul, this could be a killer-app feature for some
sites.

>An important aspect of cluster operations is giving a statistically
>meaninful judgement of the cluster size, in particular answering
>the question "would adding another machine help ?" precisely.

Is this possible? It would involve knowing how the backend would cope
with added load.
One thing is to measure how it's doing right now (response time), but
predicting added load is hard.
My guess is also that the only reason somebody would ask "would adding
another machine help ?" is if the CPU or bandwidth was exhausted on the
accelerator(s) in place, and one really needed to do something anyway. The
only other reason I can think of is response time from the accelerator, and
then we are back at the load-prediction problem.

>We should have a facility that allows the administrator to type
>in a REGEXP/URL and have all the nodes answer with a checksum, age
>and expiry timer for any documents they have which match. The
>results should be grouped by URL and checksum.

Not only the admin needs this. It's great when programmers/implementors
need to debug how "good" the new/old application caches.
In a world of rapid development, little or no time is often given to
checking the "cacheability" of the app.
A "check www.rapiddev.com/newapp/*" after a couple of clicks on the app
could save developers a huge amount of time, and reduce backend load
immensely.

>
>Technical concepts
>------------------
>
>We want the central Varnish process to be that, just one process, and
>we want to keep it small and efficient at all cost.

Yes. When you say 1 process, you mean 1 process per CPU/Core?

>Code that will not be used for the central functionality should not
>be part of the central process. For instance code to parse, validate
>and interpret the (possibly) complex configuration file should be a
>separate program.

Let's list possible processes:

1. Varnish main.
2. Disk/storage process.
3. Config process/program.
4. Management process.
5. Logger/stats.

>Depending on the situation, the Varnish process can either invoke
>this program via a pipe or receive the ready to use data structures
>via a network connection.
>
>Exported data from the Varnish process should be made as cheap as
>possible, likely shared memory. That will allow us to deploy separate
>processes for log-grabbing, statistics monitoring and similar
>"off-duty" tasks and let the central process get on with the
>important job.

Sounds great.

>
>Backend interaction
>-------------------
>
>We need a way to tune the backend interaction further than what the
>HTTP protocol offers out of the box.
>
>We can assume that all documents we get from the backend has an
>expiry timer, if not we will set a default timer (configurable of
>course).
>
>But we need further policy than that. Amongst the questions we have
>to ask are:
>
> How long time after the expiry can we serve a cached copy
> of this document while we have reason to belive the backend
> can supply us with an update ?
>
> How long time after the expiry can we serve a cached copy
> of this document if the backend does not reply or is
> unreachable.
>
> If we cannot serve this document out of cache and the backend
> cannot inform us, what do we serve instead (404 ? A default
> document of some sort ?)
>
> Should we just not serve this page at all if we are in a
> bandwidth crush (DoS/stampede) situation ?

You are correct. Did you mean to ask the user, or did you mean questions to
answer in a specification?
I think the best approach is to ask the user, and let him answer in the
config. I can see as many answers to these questions (and more) as there
are websites :) Also a site might answer differently in different
scenarios.
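
For illustration, the kind of per-site record the config program
could hand the main process for these questions (all field names
invented):

    /*
     * Per-site answers to the policy questions above; every
     * field would come from the config file.
     */
    struct expiry_policy {
        unsigned grace_backend_up;    /* seconds we may serve stale
                                       * while the backend looks OK */
        unsigned grace_backend_down;  /* ... while it is down */
        int      serve_default_doc;   /* 0 = plain 404 instead */
        int      drop_when_crushed;   /* refuse during DoS/stampede */
    };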

>It may also make sense to have a "emergency detector" which triggers
>when the backend is overloaded and offer a scaling factor for all
>timeouts for when in such an emergency state. Something like "If
>the average response time of the backend rises above 10 seconds,
>multiply all expiry timers by two".

Good idea. Once again I opt for a config choice on that one.
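
Something like this, perhaps (untested sketch; the 10 s threshold
and the factor 2 come from the example above and would of course be
config knobs):

    /*
     * Exponentially smoothed backend response time; scale expiry
     * timers while it stays above the configured limit.
     */
    static double avg_rt;    /* smoothed response time, seconds */

    void
    note_backend_rt(double rt)
    {
        avg_rt = 0.9 * avg_rt + 0.1 * rt;
    }

    unsigned
    effective_ttl(unsigned ttl)
    {
        return (avg_rt > 10.0 ? ttl * 2 : ttl);
    }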

>It probably also makes sense to have a bandwidth/request traffic
>shaper for backend traffic to prevent any one Varnish machine from
>pummeling the backend in case of attacks or misconfigured
>expiry headers.

Good idea, but this one I am unsure about. The reason: one more thing that
can make the accelerator behave in a way you don't understand.
You are delivering stale documents from the accelerator. You start
"debugging". "Hmm, most of the requests are served from the backend in a
timely fashion..." You debug more and start examining the headers. I can see
myself going through loads of different stuff, and then: "Ahh, the traffic
shaper..."
As I said, I like the idea, but too many rules for backoffs will make the
sys.admin scratch his head even more.
Can we come up with a way for Varnish to tell the sys.adm. "Hey, you are
delivering stales here. Because ..." Or is this overengineering?
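
A token bucket would do, as long as it says why it kicked in.
Untested sketch (rate and burst numbers are made up):

    #include <stdio.h>
    #include <time.h>

    static double tokens = 100.0;   /* start with a full bucket */
    static const double rate = 50.0, burst = 100.0;
    static time_t last;

    /*
     * Returns 0 when the fetch must be skipped (serve stale),
     * and logs the reason so the sys.adm is not left guessing.
     */
    int
    may_fetch_backend(void)
    {
        time_t now = time(NULL);

        if (last != 0)
            tokens += (double)(now - last) * rate;
        last = now;
        if (tokens > burst)
            tokens = burst;
        if (tokens < 1.0) {
            fprintf(stderr,
                "serving stale: backend fetch rate-limited "
                "by traffic shaper\n");
            return (0);
        }
        tokens -= 1.0;
        return (1);
    }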

>
>Startup/consistency
>-------------------
>
>We need to decide what to do about the cache when the Varnish
>process starts. There may be a difference between it starting
>first time after the machine booted and when it is subsequently
>(re)started.
>
>By far the easiest thing to do is to disregard the cache, that saves
>a lot of code for locating and validating the contents, but this
>carries a penalty in backend or cluster fetches whenever a node
>comes up. Lets call this the "transient cache model"

I agree with Dag here. Let's start with the "transient cache model" and add
more later.
We will discuss some scenarios at spec-writing time, and maybe come up with
some models for later implementation.
Better dig out those architecture notes, Dag :)

>The alternative is to allow persistently cached contents to be used
>according to configured criteria:
>[...]
>The choice we make does affect the storage management part of Varnish,
>but I see that is being modular in any instance, so it may merely be
>that some storage modules come up clean on any start while other
>will come up with existing objects cached.

Ironically, at VG the stuff that can be cached long (JPGs, GIFs etc.) is
cheap to produce, while the costly stuff, the documents that cost CPU to
make, cannot be cached for long.
I would not be surprised if it's like that in many places.

>
>Clustering
>----------
>
>I'm somewhat torn on clustering for traffic purposes. For admin
>and management: Yes, certainly, but starting to pass objects from
>one machine in a cluster to another is likely to be just be a waste
>of time and code.
>
>Today one can trivially fit 1TB into a 1U machine so the partitioning
>argument for cache clusters doesn't sound particularly urgent to me.
>
>If all machines in the cluster have sufficient cache capacity, the
>other remaining argument is backend offloading, that would likely
>be better mitigated by implementing a 1:10 style two-layer cluster
>with the second level node possibly having twice the storage of
>the front row nodes.

I am also torn here.
A part of me says: hey, there is ICP v2 and such, let's use it, it's good
economy.
Another part is thinking that ICP works at its best when you have many
accelerators, and if Varnish can deliver what we hope, not many frontends
are needed for most sites in the world :) At that level, you can for sure
deliver the extra content ICP and such would save you from.
I know that in saying that I am sacrificing design because of
implementation, but there it is.

>The coordination necessary for keeping track of, or discovering in
>real-time, who has a given object can easily turn into a traffic
>and cpu load nightmare.
>
>And from a performance point of view, it only reduces quality:
>First we send out a discovery multicast, then we wait some amount
>of time to see if a response arrives only then should we start
>to ask the backend for the object. With a two-level cluster
>we can ask the layer-two node right away and if it doesn't have
>the object it can ask the back-end right away, no timeout is
>involved in that.

A note. One of the reasons to be wary of two-level clusters in my opinion
is this: if you cache a document from the backend at the lowest level for say
2 min., and the layer above comes and gets it 1 min. into those 2 min.,
looks up in its config and finds out this is a 2 min. cache document, the
document will be served up to 1 min. stale before a refresh. This could of
course be solved with Expires tags, but it makes sys.adm's wary.
Dag also noted problems with this when we have a two-layer approach and
the first layer is in backoff-mode.
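
The arithmetic: cached at the lower level at t=0 with a 2 min. TTL,
fetched by the upper level at t=60 s with its own relative 2 min.
TTL, the document gets served until t=180 s, 60 s past the real
expiry at t=120 s. Honouring the absolute expiry instead avoids
that; a sketch:

    #include <time.h>

    /*
     * Remaining lifetime computed from the absolute Expires
     * timestamp, instead of restarting a relative TTL at the
     * time of the upper-level fetch.
     */
    time_t
    remaining_ttl(time_t expires_abs)
    {
        time_t now = time(NULL);

        return (expires_abs > now ? expires_abs - now : 0);
    }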

>Finally Consider the impact on a cluster of a "must get" object
>like an IMG tag with a misspelled URL. Every hit on the front page
>results in one get of the wrong URL. One machine in the cluster
>ask everybody else in the cluster "do you have this URL" every
>time somebody gets the frontpage.
>[...]
>Negative caching can mitigate this to some extent.
>
>
>Privacy
>-------
>
>Configuration data and instructions passed forth and back should
>be encrypted and signed if so configured. Using PGP keys is
>a very tempting and simple solution which would pave the way for
>administrators typing a short ascii encoded pgp signed message
>into a SMS from their Bahamas beach vacation...

Bahamas? Vacation? :)

>
>Implementation ideas
>--------------------
>
>The simplest storage method mmap(2)'s a disk or file and puts
>objects into the virtual memory on page aligned boundaries,
>using a small struct for metadata. Data is not persistant
>across reboots. Object free is incredibly cheap. Object
>allocation should reuse recently freed space if at all possible.
>"First free hole" is probably a good allocation strategy.
>Sendfile can be used if filebacked. If nothing else disks
>can be used by making a 1-file filesystem on them.
>
>More complex storage methods are object per file and object
>in database models. They are relatively trival and well
>understood. May offer persistence.

Dag says:

>- quick and dirty squid-like hashed directories, to begin with
>
> - fancy block storage straight to disk (or to a large preallocated
> file) like you suggested
>
> - memcached

As Poul later comments, squid is slow and dirty. Let's try to avoid it.
I am fine with fancy block storage, and I am tempted to suggest Berkeley DB.
I have always pictured Varnish with a Berkeley DB backend. Why? I _think_
it is fast (only website info to go on here).

http://www.sleepycat.com/products/bdb.html

It's block storage, and wildcard purge could potentially be as easy as:
delete from table where URL like '%bye-bye%';
Another thing I am just gonna base on my wildest fantasies: could we use
Berkeley DB replication to make a cache up-to-date after downtime?
Would be fun, wouldn't it? :)

I also like memcached, and I am excited to hear Poul suggest that we build
a "better" approach.
When I read that, I must admit that my first thought was that it would be
really nice if this is a daemon/shmem process that one can build a php (or
whatever) interface against. This is out of scope, but imagine you have
full access to the cache-data in php, if only in RO mode. That means you
can build php apps with a superquick backend with loads of metadata. :)

>Read-Only storage methods may make sense for getting hold
>of static emergency contents from CD-ROM etc.

Nice feature.

>Treat each disk arm as a separate storage unit and keep track of
>service time (if possible) to decide storage scheduling.
>
>Avoid regular expressions at runtime. If config file contains
>regexps, compile them into executable code and dlopen() it
>into the Varnish process. Use versioning and refcounts to
>do memory management on such segments.

I smell a glob vs. compiled regexp showdown. Hehe.
My only contribution here would be: don't do it with Java regexps :)

>Avoid committing transmit buffer space until we have bandwidth
>estimate for client. One possible way: Send HTTP header
>and time ACKs getting back, then calculate transmit buffer size
>and send object. This makes DoS attacks more harmless and
>mitigates traffic stampedes.

Yes. Are you thinking of writing a FreeBSD kernel module (accept_filter)
for this? Like accf_http.
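
Enabling the existing accf_http filter is just a setsockopt(2) on
the listen socket (FreeBSD-only sketch below); the ACK-timing part
would need new kernel code, though:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <string.h>

    /*
     * Ask the kernel to complete accept(2) only when a full
     * HTTP request has arrived ("httpready" is accf_http).
     */
    int
    enable_httpready(int listen_fd)
    {
        struct accept_filter_arg afa;

        memset(&afa, 0, sizeof afa);
        strcpy(afa.af_name, "httpready");
        return (setsockopt(listen_fd, SOL_SOCKET, SO_ACCEPTFILTER,
            &afa, sizeof afa));
    }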


>Kill all TCP connections after N seconds, nobody waits an hour
>for a web-page to load.
>
>Abuse mitigation interface to firewall/traffic shaping: Allow
>the central node to put an IP/Net into traffic shaping or take
>it out of traffic shaping firewall rules. Monitor/interface
>process (not main Varnish process) calls script to config
>firewalling.

This sounds like a really good feature. Hope it can be solved in Linux as
well. Not sure they have the fancy IPFW filters etc.
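
As long as the monitor process just calls a script, the OS
specifics stay out of Varnish entirely. A sketch (the script path
and its arguments are invented):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Hand the actual ipfw/iptables work to an admin-editable
     * script; "ip" must be validated before it gets here, since
     * it ends up on a shell command line.
     */
    int
    shape_ip(const char *ip, int enable)
    {
        char cmd[256];

        snprintf(cmd, sizeof cmd,
            "/usr/local/etc/varnish-shape.sh %s %s",
            enable ? "add" : "del", ip);
        return (system(cmd));
    }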

>"Warm-up" instructions can take a number of forms and we don't know
>what is the most efficient or most usable. Here are some ideas:
>[...]
>
>One interesting but quite likely overengineered option in the
>cluster case is if the central monitor tracks a fraction of the
>requests through the logs of the running machines in the cluster,
>spots the hot objects and tell the warming up varnish what objects
>to get and from where.

>>This can easily be done with existing software like w3mir.
>>[...]
>>You can probably do this in ~50 lines of Perl using Net::HTTP.

>>>Sounds like you just won this bite :-)

Nice :) But I am not sure this is as "easy" as it sounds at first.

>In the cluster configuration, it is probably best to run the cluster
>interaction in a separate process rather than the main Varnish
>process. From Varnish to cluster info would go through the shared
>memory, but we don't want to implement locking in the shmem so
>some sort of back-channel (UNIX domain or UDP socket ?) is necessary.
>
>If we have such an "supervisor" process, it could also be tasked
>with restarting the varnish process if vitals signs fail: A time
>stamp in the shmem or kill -0 $pid.

You've got to like programs that keep themselves alive.
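
Something like this in the supervisor loop, maybe (sketch; the
shmem layout and the 30 s limit are invented):

    #include <sys/types.h>
    #include <signal.h>
    #include <time.h>

    struct shm_vitals {
        volatile time_t heartbeat;   /* bumped by the main process */
    };

    /* Dead if the pid is gone or the heartbeat has gone stale. */
    int
    varnish_alive(const struct shm_vitals *v, pid_t pid)
    {
        if (kill(pid, 0) != 0)
            return (0);
        return (time(NULL) - v->heartbeat < 30);
    }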

>It may even make sense to run the "supervisor" process in stand
>alone mode as well, there it can offer a HTML based interface
>to the Varnish process (via shmem).
>
>For cluster use the user would probably just pass an extra argument
>when he starts up Varnish:
>
> varnish -c $cluster_args $other_args
>vs
>
> varnish $other_args
>
>and a "varnish" shell script will Do The Right Thing.

That's what we should aim at.

>Shared memory
>-------------
>
>The shared memory layout needs to be thought about somewhat. On one
>hand we want it to be stable enough to allow people to write programs
>or scripts that inspect it, on the other hand doing it entirely in
>ascii is both slow and prone to race conditions.
>
>The various different data types in the shared memory can either be
>put into one single segment(= 1 file) or into individual segments
>(= multiple files). I don't think the number of small data types to
>be big enough to make the latter impractical.
>
>Storing the "big overview" data in shmem in ASCII or HTML would
>allow one to point cat(1) or a browser directly at the mmaped file
>with no interpretation necessary, a big plus in my book.
>
>Similarly, if we don't update them too often, statistics could be stored
>in shared memory in perl/awk friendly ascii format.

Having the stats in HTML, or at least in ASCII, would be a big plus.
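
For instance, the main process could render its counters as
awk-friendly ASCII into the mapped segment every few seconds
(sketch, invented fields):

    #include <stdio.h>
    #include <stddef.h>

    struct stats {
        unsigned long hits, misses, bytes_out;
    };

    /* cat(1) on the mapped file then shows "hits 4711" etc. */
    void
    render_stats(char *shm, size_t len, const struct stats *st)
    {
        snprintf(shm, len, "hits %lu\nmisses %lu\nbytes_out %lu\n",
            st->hits, st->misses, st->bytes_out);
    }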

>But the logfile will have to be (one or more) FIFO logs, probably at least
>three in fact: Good requests, Bad requests, and exception messages.

And a debug log. The squid model is not too bad there, only poorly
documented.
In short it's a bitmask configuration: 1=some part a, 4=some part b, ...,
128=some part i.
Debug=133 enables a, b and i (1+4+128=133).
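
In C terms it is just a bitmask (sketch):

    #define DBG_A 1u
    #define DBG_B 4u
    #define DBG_I 128u

    unsigned debug_mask = 133;   /* == DBG_A | DBG_B | DBG_I */

    /* if (debug_mask & DBG_B) log_debug(...); */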

I mentioned at the meeting some URLs that would provide some relevant
reading:

http://www.web-cache.com/

is old but good. It lists all relevant protocols:

http://www.web-cache.com/Writings/protocols-standards.html

and other written things:

http://www.web-cache.com/writings.html

Here is also the Hypertext Caching Protocol, an alternative to and
improvement on ICP, which I referred to as WCCP at the last meeting.
Another RFC to take a look at might be the Web Cache Invalidation Protocol
(WCIP).
Here is what ESI.org has to say about WCIP: http://www.esi.org/tfaq.html#q8
And here is their approach: http://www.esi.org/invalidation_protocol_1-0.html

Sorry about all the text :)

P.S. I was not on the list when Poul wrote the first post, so I don't have
the ID either. My post will come as a separate one.

Anders Berg


des at linpro

Feb 17, 2006, 5:47 AM

Post #13 of 16 (563 views)
Permalink
My random thoughts [In reply to]

"Anders Berg" <andersb at vgnett.no> writes:
> "Dag-Erling Sm?rgrav" <des at linpro.no> writes:
> > - quick and dirty squid-like hashed directories, to begin with
> As Poul later comments, squid is slow and dirty. Let's try to avoid it.

I just mentioned it as a way of getting a storage backend up and
running quickly so we can concentrate on other stuff.

> I am fine with fancy block storage, and I am tempted to suggest
> Berkeley DB. I have always pictured Varnish with a Berkeley DB
> backend. Why? I _think_ it is fast (only website info to go on
> here).
>
> http://www.sleepycat.com/products/bdb.html
>
> It's block storage, and wildcard purge could potentially be as easy as:
> delete from table where URL like '%bye-bye%';

Berkeley DB does not have an SQL interface or any kind of query
engine.
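
To make that concrete, a wildcard purge with the Berkeley DB C API
is a full cursor scan plus fnmatch(3), not a one-line query. Sketch
(error handling mostly omitted; keys are assumed to be
NUL-terminated URL strings):

    #include <db.h>
    #include <fnmatch.h>
    #include <string.h>

    void
    purge_glob(DB *dbp, const char *pattern)
    {
        DBC *dbc;
        DBT key, data;

        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        if (dbp->cursor(dbp, NULL, &dbc, 0) != 0)
            return;
        /* Walk every record and delete the ones that match. */
        while (dbc->c_get(dbc, &key, &data, DB_NEXT) == 0) {
            if (fnmatch(pattern, (const char *)key.data, 0) == 0)
                dbc->c_del(dbc, 0);
        }
        dbc->c_close(dbc);
    }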

> "Poul-Henning Kamp" <phk at phk.freebsd.dk> writes:
> > Abuse mitigation interface to firewall/traffic shaping: Allow
> > the central node to put an IP/Net into traffic shaping or take
> > it out of traffic shaping firewall rules. Monitor/interface
> > process (not main Varnish process) calls script to config
> > firewalling.
> This sounds like a really good feature. Hope it can be solved in
> Linux as well. Not sure they have the fancy IPFW filters etc.

They have iptables and other equivalents.

DES
--
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no


andersb at vgnett

Feb 17, 2006, 10:11 AM

Post #15 of 16 (554 views)
Permalink
My random thoughts [In reply to]

> "Dag-Erling Sm?rgrav" <des at linpro.no> writes:
>> I am fine with fancy block storage, and I am tempted to suggest
>> Berkeley DB. I have always pictured Varnish with a Berkeley DB
>> backend. Why? I _think_ it is fast (only website info to go on
>> here).
>>
>> http://www.sleepycat.com/products/bdb.html
>>
>> It's block storage, and wildcard purge could potentially be as easy as:
>> delete from table where URL like '%bye-bye%';
>
> Berkeley DB does not have an SQL interface or any kind of query
> engine.

Okay, I knew it did not have an SQL interface, but not that it did not
deliver a query engine of some sort. Anyway, Berkeley DB (now Oracle-owned
:)) does say this on their homepage:

"Berkeley DB is the ideal choice for static queries over dynamic data,
while traditional relational databases are well suited for dynamic queries
over static data."

I did not paste this in to argue that you and Berkeley have a different
definition of queries :) But rather that the "queries" we are gonna use
for this stay the same, and the data is dynamic. So at first glance it looks
to be right for us if it's _fast_. But no fear, I can kill darlings :)


>> "Poul-Henning Kamp" <phk at phk.freebsd.dk> writes:
>> > Abuse mitigation interface to firewall/traffic shaping: Allow
>> > the central node to put an IP/Net into traffic shaping or take
>> > it out of traffic shaping firewall rules. Monitor/interface
>> > process (not main Varnish process) calls script to config
>> > firewalling.
>> This sounds like a really good feature. Hope it can be solved in
>> Linux as well. Not sure they have the fancy IPFW filters etc.
>
> They have iptables and other equivalents.

Brilliant. Now let's pray they work the way they should, and are dynamic :)

Anders Berg

