2008-01-15

Is swap space obsolete?

Is swap space obsolete?
2008-01-07 12:18:56

昨天在鼎好花了305元买了一根杂牌的、Hyundai芯片的、DDR400 1G内存条。我的机器得到了升华:主板只有两根DDR插槽,撤掉一根256M的DELL原装内存,换上这根1G的,内存达到了空前的1.2G。作为顺理成 章的好处,我应该可以关掉swap来充分利用我的RAM。但是在此之前,我习惯性的作了一个检索,以证明我的做法也是其他电脑人士的做法,结果是我看到这 样一篇文章:Is swap space obsolete?

它让我还是为Ubuntu 7.10保留了128M swap space....

Is swap space obsolete?

There was a thread on the CLUG list recently about whether it was still useful to have swap space, now that it's quite affordable to have a gigabyte or more of memory on a desktop machine. I think it is.

Some people have the idea that touching swap space at all is a sign that the machine is very overloaded, and so you ought to avoid it at all costs, by adding enough memory that the machine never needs to swap. This may have been true on Unix ten years ago, and may still be true on some systems for all I know but it's not true for Linux.

The meaning of the term swap has changed over time. It used to mean that entire tasks were moved out to disk, and they'd stay there until it was necessary to run them again. You can do this on machines without paged MMUs, and perhaps it was simpler to implement. However, these days almost all machines have MMUs, and so we use paging instead, where particular chunks of the program (typically 4kB) can move in or out independently. This gets more use out of the memory you have, because many programs run quite happily with only part of their virtual memory in RAM. Linux doesn't implement old-style whole-program swapping at all, and there does not seem to be any reason to add it.

Swap这个术语在过去用来这样一种场景/技术:为节省内存,一个task被整个移到磁盘上,直到它需要被再次执行的时候再被调度回内存。对于没有MMU的机器来说,这是可行的而且实现也比较简单,但是实际情况是绝大部分机器都有MMU,所以我们的调度单位往往是page(一般是4k)而不是整个task。况且,Linux也没有实现前面提到的老式的、基于调度整个taskSwap,因为觉得没有必要。

I'll recapitulate the way VM works, and in particular the ways it is different on Linux from in your average computer science textbook. The basic idea is that we have a relatively small fast RAM, and a slower larger disk, and we want to get the best performance out of the combination. I will skip some details and special cases for simplicity.

下面简单说说VM(Virtual Memory)的工作机制

All memory pages on Linux can be grouped into four classes. Firstly, there are kernel pages which are fixed in memory and never swapped. (Some other systems have pageable kernels, but at the moment the Linux developers consider it too risky.) Then there is program text: the contents of /bin/sh or /lib/libdl-2.3.2.so. These are read-only in memory, and so are always exactly the same as the file on disk. There are file-backed pages, which might have changes in memory that haven't been written out yet. Finally there are memory pages that don't correspond to any file on disk: this includes the stack and heap variables of running tasks. When a program does a malloc(), it allocates memory of this type. Pages in this last category are called anonymous mappings, because they don't correspond to any file name.

Linux上所有的memory page被划分成四类:

- kernel pages,他们不会被swap

- read-only pages

- file-backed pages,主要是与磁盘文件相关的

- anonymous pages,与磁盘file不相关的pages,主要用来保存正在运行的taskstack变量,heap变量

There is no separate disk cache in Linux, like there is on old-Unix or on DOS. Instead, we try to keep the most useful parts of the disk in memory as cached pages. Linux usually doesn't directly modify the disk: rather, changes are made to the files in memory and then they're flushed out.

与老式的UnixDos一样,Linux没有单独的disk cache,作为替代,Linux把那些有用的disk部分放在内存中作为cached pages.

You'll notice that the free-memory measure on Linux machines is normally pretty low, even when the machine has plenty of memory for the tasks it's running. This is normal and intentional: the kernel tries to keep memory filled up with cached pages so that if those files are accessed again it won't have to go to disk. The free pool indicates just a few spare pages that are ready for immediate allocation. One time when the free memory will be large is shortly after bootup when the kernel just hasn't read in very much of the disk yet. Another time you'll have a lot of free memory is shortly after a large program exited: it had a lot of data pages in RAM, but those pages were deleted and so there's no useful information in them anymore.

We talk of pages as being clean when the in-memory version is the same as the one on disk, and dirty when they've been changed since being read in. Data(Williamski: Data or dirty here? should be ‘dirty’ I guess) pages need to get written back to disk eventually, and the kernel generally does this in the background. You can force all dirty pages to be written out using the sync system call.

clean page指的是其在内存中的内容与磁盘内容一致;dirty page指的是不一致的情况。

The kernel can discard a clean page whenever it needs the memory for something else, because it knows it can always get the data back from disk. However, dirty pages need to be saved to disk before they can be reused. We call this eviction. So flushing pages in the background has two purposes: it helps protect data from sudden power cuts, and more importantly it means there are plenty of clean pages that can be reused when a process needs memory. So efficient is this flushing that at the moment my machine has only four dirty pages out of 256,000 (by grep nr_dirty /proc/vmstat).

As the kernel allocates memory, it firstly takes pages from the free pool. If that drops too low, it needs to free up more memory. Where does that come from? It needs to discard a clean page to make room. If there aren't any suitable clean pages then it needs to flush a dirty page to disk, then use it. This is very slow because the allocation can't continue until the disk write has finished, so the kernel tries very hard to avoid this by always having some clean pages around. (Remember the whole point of the VM algorithm is to avoid ever having to wait for a disk access to complete, by keeping pages in memory that are likely to be used again.)

File-backed pages can be flushed by writing them back to their file on disk. But anonymous mappings by definition don't have any backing file. Where can they be flushed to? Swap space, of course. Swap partitions or files on Linux hold pages that aren't backed by a file.

对于File-backed pages来说,它们的回写比较容易,因为它们在磁盘上有对应的文件,那anonymous pages呢?它们在磁盘上没有对应的文件,怎么办?导出主题:它们需要被回写到swap space上。

If you don't have swap space, then anonymous mappings can't be flushed. They have to stay in memory until they're deleted. The kernel can only obtain clean memory and free memory by flushing out file-backed pages: programs, libraries, and data files. Not having swap space constrains and unbalances the kernel's page allocation. However unlikely it is that the data pages will be used again — even if they're never used again — they still need to stay in memory sucking up precious RAM. That means the kernel has to do more work to write out file-backed pages, and to read them back in after they're discarded. The kernel needs to throw out relatively valuable file-backed pages, because it has nowhere to write relatively worthless anonymous pages.

所以作者认为如果没有swap space的话,anonymous pages没法回写到磁盘,这样也就没法让出它们占据的空间,这会给kernel在调度内存 (VM的管理)上带来不小的开销(见上面的粗体部分)

Not only this, but flushing pages to swap is actually a bit easier and quicker than flushing them to disk: the code is much simpler, and there are no directory trees to update. The swap file/partition is just an array of pages. This is another reason to give the kernel the option of flushing to swap as well as to the filesystem.

况且回写anonymous pagesswap space的效率(从实现和速度方面)比回写到disk要高。

As I write this, my 1024MB machine has 184MB of swap used out of 1506MB, and only 17MB of memory free. On old-Unix this would indicate a perilous situation: with numbers like this it would be grinding. But Linux is perfectly happy with these numbers: the disk is idle and it responds well.

The 184MB constitutes tasks that are running in the background, like the getty on the text console, or the gdm login manager. Neither of them has done anything much since I logged in days ago. From a certain overoptimizing point of view I ought to get rid of those tasks — although for the login manager it might be hard. But even then, there's probably lots of memory used for features of programs I am running that don't get invoked very often.

With swap, that memory is written to disk and costs very little. Without swap, it would be cluttering up RAM, as if I was down to only 840MB. Everything else would need to page a bit harder, but it wouldn't be obvious why.

Disk is cheap, so allocate a gigabyte or two for swap.

On BSD people used to advise allocating as much swap as memory, or maybe two or three times as much. Although the VM design is completely different, it's still a good rule of thumb. If anything, disk has gotten relatively cheaper over time: a typical developer machine now has 1GB of memory, but 200GB of disk. Spending one half or one percent of your disk on swap can probably improve performance.

If you are short on disk, as I am on my laptop, then use a swap file instead of a swap partition so that you can shrink or grow it more easily. (I think there is still a limit of 2GB per swap target, but you can create as many as you like.) Swap files might be slightly slower, but it's much better than not having it at all. If you ever see it get close to full, add some more.

Understanding the Linux Virtual Memory Manager has an enormous amount of detail on how this works in 2.4 (and I hope it doesn't contradict me!)

Q: 2.6会有什么变化呢?是不是这个观点还使用呢?

O'Reilly's System Performance Tuning approaches this from a sysadmin's point of view, but it mostly describes the way swap works under Solaris.

posted Fri 2 Jul 2004 in /software/linux-kernel | link

No comments: