<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/page-writeback.c, branch v3.3.5</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/mm/page-writeback.c?h=v3.3.5</id>
<link rel='self' href='https://git.amat.us/linux/atom/mm/page-writeback.c?h=v3.3.5'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2012-01-11T00:59:59Z</updated>
<entry>
<title>Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux</title>
<updated>2012-01-11T00:59:59Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-01-11T00:59:59Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=001a541ea9163ace5e8243ee0e907ad80a4c0ec2'/>
<id>urn:sha1:001a541ea9163ace5e8243ee0e907ad80a4c0ec2</id>
<content type='text'>
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
  writeback: balanced_rate cannot exceed write bandwidth
  writeback: do strict bdi dirty_exceeded
  writeback: avoid tiny dirty poll intervals
  writeback: max, min and target dirty pause time
  writeback: dirty ratelimit - think time compensation
  btrfs: fix dirtied pages accounting on sub-page writes
  writeback: fix dirtied pages accounting on redirty
  writeback: fix dirtied pages accounting on sub-page writes
  writeback: charge leaked page dirties to active tasks
  writeback: Include all dirty inodes in background writeback
</content>
</entry>
<entry>
<title>mm: try to distribute dirty pages fairly across zones</title>
<updated>2012-01-11T00:30:43Z</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@redhat.com</email>
</author>
<published>2012-01-10T23:07:49Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=a756cf5908530e8b40bdf569eb48b40139e8d7fd'/>
<id>urn:sha1:a756cf5908530e8b40bdf569eb48b40139e8d7fd</id>
<content type='text'>
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones.  And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list.  This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.

Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place.  As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon.  The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
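
A hedged sketch of that placement decision (illustrative kernel-style C,
not the literal diff: the loop shape is an assumption and try_alloc_from()
is a hypothetical helper, while zone_dirty_ok() is the per-zone limit
check described here):

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		/* skip zones already at their share of dirty pages */
		if ((gfp_mask &amp; __GFP_WRITE) &amp;&amp; !zone_dirty_ok(zone))
			continue;
		page = try_alloc_from(zone);	/* hypothetical helper */
		if (page)
			return page;
	}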

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case.  With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations.  Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the number of pages that
"spill over" is itself limited by the lower zones' dirty constraints,
and is thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation.  Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.

			Test results

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 &lt;&lt; 15))

		seconds			nr_vmscan_write
		        (stddev)	       min|     median|        max
xfs
vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
patched:	 550.996( 3.802)	     0.000|      0.000|      0.000

fuse-ntfs
vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
patched:	 558.049(17.914)	     0.000|      0.000|     43.000

btrfs
vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
patched:	 563.365(11.368)	     0.000|      0.000|   1362.000

ext4
vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
patched:	 568.806(17.496)	     0.000|      0.000|      0.000

Signed-off-by: Johannes Weiner &lt;jweiner@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Tested-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: writeback: cleanups in preparation for per-zone dirty limits</title>
<updated>2012-01-11T00:30:43Z</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@redhat.com</email>
</author>
<published>2012-01-10T23:07:44Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ccafa2879fb8d13b8031337a8743eac4189e5d6e'/>
<id>urn:sha1:ccafa2879fb8d13b8031337a8743eac4189e5d6e</id>
<content type='text'>
The next patch will introduce per-zone dirty limiting functions in
addition to the traditional global dirty limiting.

Rename determine_dirtyable_memory() to global_dirtyable_memory() before
adding the zone-specific version, and fix up its documentation.

Also, move the functions to determine the dirtyable memory and the
function to calculate the dirty limit based on that together so that their
relationship is more apparent and that they can be commented on as a
group.

Signed-off-by: Johannes Weiner &lt;jweiner@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Acked-by: Mel Gorman &lt;mel@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: exclude reserved pages from dirtyable memory</title>
<updated>2012-01-11T00:30:43Z</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@redhat.com</email>
</author>
<published>2012-01-10T23:07:42Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d'/>
<id>urn:sha1:ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d</id>
<content type='text'>
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.

This patch:

The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:

       +----------+ ---
       |   anon   |  |
       +----------+  |
       |          |  |
       |          |  -- dirty limit new    -- flusher new
       |   file   |  |                     |
       |          |  |                     |
       |          |  -- dirty limit old    -- flusher old
       |          |                        |
       +----------+                       --- reclaim
       | reserved |
       +----------+
       |  kernel  |
       +----------+

This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages.  The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free.  We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
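
A hedged sketch of the reserve computation (illustrative kernel-style C;
the exact bookkeeping in the patch may differ):

	unsigned long reserve = 0;
	struct zone *zone;
	int i;

	for_each_zone(zone) {
		unsigned long lowmem = 0;

		/* largest lowmem reserve this zone may have to honor */
		for (i = 0; i &lt; MAX_NR_ZONES; i++)
			lowmem = max(lowmem, zone-&gt;lowmem_reserve[i]);
		reserve += lowmem + high_wmark_pages(zone);
	}
	/* 'reserve' is then subtracted from global dirtyable memory */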

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix.  In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.

[akpm@linux-foundation.org: fix highmem build]
Signed-off-by: Johannes Weiner &lt;jweiner@redhat.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/page-writeback.c: make determine_dirtyable_memory static again</title>
<updated>2012-01-11T00:30:41Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2012-01-10T23:06:57Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=1edf223485c42c99655dcd001db1e46ad5e5d2d7'/>
<id>urn:sha1:1edf223485c42c99655dcd001db1e46ad5e5d2d7</id>
<content type='text'>
The tracing ring-buffer used this function briefly, but not anymore.
Make it local to the writeback code again.

Also, move the function so that no forward declaration needs to be
reintroduced.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: move code out of buffer.c</title>
<updated>2012-01-04T03:54:07Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-09-16T06:31:11Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ff01bb4832651c6d25ac509a06a10fcbd75c461c'/>
<id>urn:sha1:ff01bb4832651c6d25ac509a06a10fcbd75c461c</id>
<content type='text'>
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked too
obsolete to be worth moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Al Viro &lt;viro@ZenIV.linux.org.uk&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>writeback: balanced_rate cannot exceed write bandwidth</title>
<updated>2011-12-18T06:20:33Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-08-03T20:30:36Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=bdaac4902a8225bf247ecaeac46c4b2980cc70e5'/>
<id>urn:sha1:bdaac4902a8225bf247ecaeac46c4b2980cc70e5</id>
<content type='text'>
Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
makes the gnuplot figures more readable.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit &lt;= write_bw

The singular points originate from dirty_rate in the formula below:

        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
where
	dirty_rate = (number of page dirties in the past 200ms) / 200ms

In the extreme case, if all dd tasks suddenly get blocked on something
else and hence no pages are dirtied at all, dirty_rate will be 0 and
balanced_dirty_ratelimit will be inf. This could happen in reality.

Note that these huge singular points are not a real threat, since they
are _guaranteed_ to be filtered out by the
	min(balanced_dirty_ratelimit, task_ratelimit)
line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
number of dirty pages, which will never _suddenly_ fly away like
balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
will be cut down to the level of task_ratelimit.

There won't be tiny singular points though, as long as the dirty pages
lie inside the dirty throttling region (above the freerun region),
because there the dd tasks will be throttled by balance_dirty_pages()
and won't be able to suddenly dirty many more pages than average.
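
A hedged model of the resulting clamp (self-contained C, not the kernel
code; the explicit zero-dirty_rate guard is an assumption made for the
sketch):

	static unsigned long balanced_rate(unsigned long task_ratelimit,
					   unsigned long write_bw,
					   unsigned long dirty_rate)
	{
		unsigned long long rate;

		if (dirty_rate == 0)		/* all dirtiers stalled */
			return write_bw;
		rate = (unsigned long long)task_ratelimit * write_bw;
		rate /= dirty_rate;	/* source of the singular points */
		/* with N dirtiers, balanced rate = write_bw / N &lt;= write_bw */
		return rate &lt; write_bw ? rate : write_bw;
	}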

Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: do strict bdi dirty_exceeded</title>
<updated>2011-12-18T06:20:31Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-12-04T03:26:01Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=82791940545be38810dfd5e03ee701e749f04aab'/>
<id>urn:sha1:82791940545be38810dfd5e03ee701e749f04aab</id>
<content type='text'>
This helps to reduce dirty throttling polls and hence CPU overheads.

bdi-&gt;dirty_exceeded typically only helps when suddenly starting 100+
dd's on a disk, in which case the dd's may need to poll
balance_dirty_pages() earlier than tsk-&gt;nr_dirtied_pause would allow.

CC: Jan Kara &lt;jack@suse.cz&gt;
CC: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: avoid tiny dirty poll intervals</title>
<updated>2011-12-18T06:20:30Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-12-06T19:17:17Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=5b9b357435a51ff14835c06d8b00765a4c68f313'/>
<id>urn:sha1:5b9b357435a51ff14835c06d8b00765a4c68f313</id>
<content type='text'>
The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k.
Shaohua root-caused it to the much smaller dirty pause times and hence
the much more frequent invocations of the IO-less balance_dirty_pages().
Since fio_mmap_randwrite_64k effectively contains both reads and writes,
the more frequent pauses triggered more idling in the cfq IO scheduler.

The solution is to increase pause time all the way up to the max 200ms
in this case, which is found to restore most performance. This will help
reduce CPU overheads in other cases, too.

Note that I don't expect many performance-critical workloads to run this
access pattern: the mmap read-on-write is rather inefficient and could
be avoided by doing normal write syscalls.

CC: Jan Kara &lt;jack@suse.cz&gt;
CC: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Reported-by: Li Shaohua &lt;shaohua.li@intel.com&gt;
Tested-by: Li Shaohua &lt;shaohua.li@intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: max, min and target dirty pause time</title>
<updated>2011-12-18T06:20:28Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-11-30T17:08:55Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=7ccb9ad5364d6ac0c803096c67e76a7545cf7a77'/>
<id>urn:sha1:7ccb9ad5364d6ac0c803096c67e76a7545cf7a77</id>
<content type='text'>
Control the pause time and the call intervals to balance_dirty_pages()
with three parameters:

1) max_pause, limited by bdi_dirty and MAX_PAUSE

2) the target pause time, which grows with the number of dd tasks
   and is normally limited by max_pause/2

3) the minimal pause, set to half the target pause; it is used to
   skip short sleeps and accumulate them into bigger ones

The typical behaviors after patch:

- whenever task_ratelimit falls far below dirty_ratelimit, the pause time
  will remain constant at max_pause and nr_dirtied_pause will fluctuate
  with task_ratelimit

- in the normal cases, nr_dirtied_pause will remain stable (keeping pace
  with dirty_ratelimit) and the pause time will fluctuate with
  task_ratelimit

In summary, one of the two has to fluctuate with task_ratelimit, because

	task_ratelimit = nr_dirtied_pause / pause

We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
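
A hedged model of the three parameters (self-contained C with
illustrative values: the real max_pause also depends on bdi_dirty, and
the logarithmic target scaling is inferred from the text below):

	#define MAX_PAUSE	200	/* ms */

	static void pause_params(int nr_tasks, int *max_p, int *tgt_p,
				 int *min_p)
	{
		int n, target = 10;	/* ~10ms for a single dd */

		/* target grows by ~10ms per doubling of dirtier tasks */
		for (n = nr_tasks; n &gt; 1; n &gt;&gt;= 1)
			target += 10;
		*max_p = MAX_PAUSE;
		*tgt_p = target &lt; *max_p / 2 ? target : *max_p / 2;
		*min_p = *tgt_p / 2;	/* used to skip tiny sleeps */
	}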

The notable behavior changes are:

- in stable workloads, there will no longer be the sudden big trajectory
  switching of nr_dirtied_pause that concerned Peter. It will be as
  smooth as dirty_ratelimit and change proportionally with it (as
  always, assuming bdi bandwidth does not fluctuate across 2^N lines,
  otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)

- in the rare cases when something keeps task_ratelimit far below
  dirty_ratelimit, the smoothness can no longer be retained and
  nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
  (not that destructive but still not good) bug that
	  dirty_ratelimit gets brought down undesirably
	  &lt;= balanced_dirty_ratelimit is under estimated
	  &lt;= weakly executed task_ratelimit
	  &lt;= pause goes too large and gets trimmed down to max_pause
	  &lt;= nr_dirtied_pause (based on dirty_ratelimit) is set too large
	  &lt;= dirty_ratelimit being much larger than task_ratelimit

- introduce min_pause to avoid small pause sleeps

- when the pause is trimmed down to max_pause, try to compensate for it
  at the next pause time

The "refactor" type of changes are:

The max_pause equation is transformed slightly to make it more
efficient.

We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
is effectively equal to the original scaling of max_pause by (N * 20ms),
because the original code implicitly set target_pause ~= max_pause / 2.
Based on the same implicit ratio, target_pause starts at 10ms for 1 dd.

CC: Jan Kara &lt;jack@suse.cz&gt;
CC: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
</feed>
