
phk at phk
Jul 17, 2006, 12:02 PM
In message <B06E690C-CA93-4547-AD2A-61246521142D at vgnett.no>, Anders Berg writes:

>The reason Anders N. asks about this is how Squid works today. The
>squid.conf file leaves you with an option to specify how much RAM you
>wanna use for Squid.

And this is right where the trouble starts.

Squid is written for a machine model that has not existed since 1980,
when 3BSD was released.

In that model, a process has some amount of "memory", and either all
of that "memory" is present in RAM or none of it is. When RAM grew
short, an entire process was swapped out (hence the name: "swap out
one process for another").

In that environment, it makes a lot of sense to tell Squid how much
RAM it can use, because there is some magic size where the best
performance compromise for the entire machine is reached.

We spent a lot of time tuning stuff like that in the 1980s: we told
sort(1) how many records to sort in memory and to switch to merging
temporary files if it found more, etc. etc.

Virtual memory, on the other hand, means that the kernel "fakes"
things such that the process has access to the entire address space
(i.e. 2^32 bytes or 2^64 bytes). The operating system does this by
tracking which pages are used, which are modified and all that stuff.

In a VM system, what you think of as "RAM" is not RAM in the hardware
sense. You may in fact have all of it accessible in hardware RAM, but
if the system is short of memory, you won't: some of it will be
"paged out to disk" or, because we sloppily adopted the old
terminology, "swapped out".

The real trouble starts when Squid decides that an object in its
"RAM" should be purged to disk. Quite likely, the operating system
already found that out earlier, so the "RAM" is already on disk,
somewhere in the paging (or swap) partition.

So what happens is that first we do a disk read to pull in the "RAM",
then we write it to disk some other place. Twice as much I/O for no
gain.

The same pattern happens all over Squid, and that is responsible for
the observed "once squid starts paging, it goes straight south".

It doesn't help in this context that Squid stores headers and body in
the same place. That means that if the "RAM" of some object has been
paged out, we have to page it in to see the headers, even for a
conditional request which ends up not transferring the object's body.

>Your answer was detailed Poul-Henning, but
>what will prevent this from happening in Varnish? Lets say you have
>2 applications running on a Varnish box, and both use the memory
>model Varnish uses, what will happen in the long run with a lot of
>traffic?

Every program running in a VM system has a function which describes
how fast it reaches its goal for a given number of pages of hardware
memory it has access to.

Unfortunately the function also has other variables: the input to the
program, the timing of its interaction with the world (how long must
it wait for disk I/O, etc.) and the state of all sorts of kernel
caches come into play.

There is no way to predict the function realistically. You can measure
it under some set of circumstances and get an idea of how it looks.

The only trick there really is to writing a VM kernel is being good
at estimating this function on the fly.

If two processes run at the same time, and they both need more hw-RAM
pages than the system has, the kernel will be flipping some pages
between them.

When a program accesses a page which is not "resident", the kernel
will hunt around for a page that doesn't look used (ideally: doesn't
look like it _will be_ used (soon)), write that page to disk and
reassign the page to the faulting process, possibly after filling it
from disk first.

In the meantime the process (or at least the thread) cannot do
anything.

If you're just one page short, there is undoubtedly some page in the
process which is seldom used: the first bit of the program which is
only used during startup, some table of error messages that are only
accessed when there is an error, etc.

As memory pressure increases, more and more such pages will be paged
out. At some point, we get to pages which are in fact used every so
often, and then it starts hurting performance.

The thing to remember in writing programs for virtual memory systems
is therefore not to be careful about how much memory you allocate,
but instead to be careful about how much of it you use.

Something as simple as variable order in the source code can affect
this:

	int busy_integer_variable;
	char seldomly_used_error_string[5000];
	double often_used_number;

With a pagesize of 4096, this will take up two pages, both of which
will be busy. Flipping the order:

	int busy_integer_variable;
	double often_used_number;
	char seldomly_used_error_string[5000];

means that one page will seldom be used, and the other will be used
all the time. (The example also improves CPU-cache hit rates, but
forget that for now.)

What Varnish does is to rely on the kernel to do this work.

Instead of trying to control how much memory we use and partition our
data into the fast stuff which should be in RAM and the slow stuff
which we can put on the disk, we simply operate on one data set, but
make sure to arrange our data such that the kernel can easily page out
data which we don't use, without us needing to get involved.

Therefore all object storage in Varnish is allocated on a page-aligned
boundary. That means that entire objects can be paged out without
affecting the neighbor objects.

Yes, this may waste up to 4095 bytes per object for padding, but you'd
be surprised what you save in performance.

>http://www.tns-gallup.no/index.asp?
>type=tabelno_url&did=185235&sort=uv&sort_ret=desc&UgeSelect=&path_by_id=
>/12000/12003/12077/12266&aid=12266
>
>will show what I mean.
>They have "few" users and sessions, but loads
>of pageviews, also at any given time many thousand "wares" are "hot"
>for the user, not a few popular articles. This does something to mem
>usage and I/O.

You're thinking about memory in the old-fashioned terms here :-)

Try this: imagine the disk is the real memory and the RAM is only a
cache.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.