<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/fs/pipe.c, branch v3.0.36</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/fs/pipe.c?h=v3.0.36</id>
<link rel='self' href='https://git.amat.us/linux/atom/fs/pipe.c?h=v3.0.36'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2012-05-07T15:56:36Z</updated>
<entry>
<title>pipes: add a "packetized pipe" mode for writing</title>
<updated>2012-05-07T15:56:36Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-04-29T20:12:42Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=beed6c2e00e0dde6722b590e6a02c20248224c68'/>
<id>urn:sha1:beed6c2e00e0dde6722b590e6a02c20248224c68</id>
<content type='text'>
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.

The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with EINVAL on kernels that do not support this interface.
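
A minimal user-space sketch of the interface just described (error
handling trimmed; payload strings and sizes purely illustrative):

#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;limits.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
        int fd[2];
        char buf[PIPE_BUF];     /* PIPE_BUF is enough for any one packet */
        ssize_t n;

        if (pipe2(fd, O_DIRECT) &lt; 0) {
                perror("pipe2");        /* EINVAL: no packet mode here */
                return 1;
        }

        /* each sub-PIPE_BUF write becomes one packet... */
        write(fd[1], "first", 5);
        write(fd[1], "second!", 7);

        /* ...and each read returns exactly one packet */
        n = read(fd[0], buf, sizeof(buf));      /* n == 5 */
        n = read(fd[0], buf, sizeof(buf));      /* n == 7 */
        printf("second packet was %zd bytes\n", n);
        return 0;
}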

Tested-by: Michael Tokarev &lt;mjt@tls.msk.ru&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Ian Kent &lt;raven@themaw.net&gt;
Cc: Thomas Meyer &lt;thomas@m3y3r.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Fix broken "pipe: use event aware wakeups" optimization</title>
<updated>2011-01-21T00:21:59Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-01-21T00:21:59Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=28e58ee8ce1f0e69c207f747b7b9054b071e328d'/>
<id>urn:sha1:28e58ee8ce1f0e69c207f747b7b9054b071e328d</id>
<content type='text'>
Commit e462c448fdc8 ("pipe: use event aware wakeups") optimized the pipe
event wakeup calls to avoid wakeups if the events do not match the
requested set.

However, the optimization was buggy, in that it didn't actually use the
correct sets for the events: when we make room for more data to be
written, the pipe poll() routine will return both the POLLOUT _and_
POLLWRNORM bits.  Similarly for read.

And most critically, when a pipe is released, that will potentially
result in POLLHUP|POLLERR (depending on whether it was the last reader
or writer), not just the regular POLLIN|POLLOUT.

This bug showed itself as a hung gnome-screensaver-dialog process, stuck
forever (or at least until it was poked by a signal or by being traced)
in a poll() system call.
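
The fix is to pass the set that poll() would actually report now;
roughly (names as in fs/pipe.c of this era, details abbreviated):

/* reader made room, so writers can make progress: */
wake_up_interruptible_sync_poll(&amp;pipe-&gt;wait, POLLOUT | POLLWRNORM);

/* writer added data, so readers can make progress: */
wake_up_interruptible_sync_poll(&amp;pipe-&gt;wait, POLLIN | POLLRDNORM);

/* last reader or writer gone: poll() may now see POLLHUP/POLLERR */
wake_up_interruptible_sync_poll(&amp;pipe-&gt;wait,
                POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM |
                POLLERR | POLLHUP);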

Cc: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sanitize vfsmount refcounting changes</title>
<updated>2011-01-16T18:47:07Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-01-15T03:30:21Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=f03c65993b98eeb909a4012ce7833c5857d74755'/>
<id>urn:sha1:f03c65993b98eeb909a4012ce7833c5857d74755</id>
<content type='text'>
Instead of splitting the refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL -&gt;mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows a nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
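
In slightly simplified code, with br_read_lock/br_write_lock as the
shared/exclusive flavours of the vfsmount lock and the per-cpu helpers
paraphrased:

static void mntput_no_expire(struct vfsmount *mnt)
{
        br_read_lock(vfsmount_lock);
        if (likely(mnt-&gt;mnt_longterm)) {
                mnt_dec_count(mnt);     /* blind per-cpu decrement */
                br_read_unlock(vfsmount_lock);
                return;                 /* cannot be the final ref */
        }
        br_read_unlock(vfsmount_lock);

        br_write_lock(vfsmount_lock);
        mnt_dec_count(mnt);
        if (mnt_get_count(mnt)) {       /* sum over all cpus */
                br_write_unlock(vfsmount_lock);
                return;
        }
        br_write_unlock(vfsmount_lock);
        /* ... actually tear the vfsmount down ... */
}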

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6</title>
<updated>2011-01-13T18:27:28Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-01-13T18:27:28Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=b2034d474b7e1e8578bd5c2977024b51693269d9'/>
<id>urn:sha1:b2034d474b7e1e8578bd5c2977024b51693269d9</id>
<content type='text'>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
  fs: add documentation on fallocate hole punching
  Gfs2: fail if we try to use hole punch
  Btrfs: fail if we try to use hole punch
  Ext4: fail if we try to use hole punch
  Ocfs2: handle hole punching via fallocate properly
  XFS: handle hole punching via fallocate properly
  fs: add hole punching to fallocate
  vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
  fix signedness mess in rw_verify_area() on 64bit architectures
  fs: fix kernel-doc for dcache::prepend_path
  fs: fix kernel-doc for dcache::d_validate
  sanitize ecryptfs -&gt;mount()
  switch afs
  move internal-only parts of ncpfs headers to fs/ncpfs
  switch ncpfs
  switch 9p
  pass default dentry_operations to mount_pseudo()
  switch hostfs
  switch affs
  switch configfs
  ...
</content>
</entry>
<entry>
<title>pipe: use event aware wakeups</title>
<updated>2011-01-13T16:03:15Z</updated>
<author>
<name>Davide Libenzi</name>
<email>davidel@xmailserver.org</email>
</author>
<published>2011-01-13T01:00:25Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e462c448fdc89252d631b26ff0ed4f7ad6fe8ed2'/>
<id>urn:sha1:e462c448fdc89252d631b26ff0ed4f7ad6fe8ed2</id>
<content type='text'>
Send the events the wakeup refers to, so that epoll, and even the new poll
code in fs/select.c, can avoid wakeups if the events do not match the
requested set.
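
In fs/pipe.c the pattern is, schematically (the exact event sets used
here were later tightened up by a follow-up fix, also in this log):

/* before: wakes every waiter, whatever events it asked for */
wake_up_interruptible_sync(&amp;pipe-&gt;wait);

/* after: the wakeup key carries the now-ready events, so keyed
 * waiters (epoll, and poll via fs/select.c) whose requested set
 * does not match can stay asleep */
wake_up_interruptible_sync_poll(&amp;pipe-&gt;wait, POLLIN);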

Signed-off-by: Davide Libenzi &lt;davidel@xmailserver.org&gt;
Acked-by: David S. Miller &lt;davem@davemloft.net&gt;
Acked-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pass default dentry_operations to mount_pseudo()</title>
<updated>2011-01-13T01:03:43Z</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-01-12T21:59:34Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c74a1cbb3cac348f276fabc381758f5b0b4713b2'/>
<id>urn:sha1:c74a1cbb3cac348f276fabc381758f5b0b4713b2</id>
<content type='text'>
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>fs: scale mntget/mntput</title>
<updated>2011-01-07T06:50:33Z</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@kernel.dk</email>
</author>
<published>2011-01-07T06:50:11Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=b3e19d924b6eaf2ca7d22cba99a517c5171007b6'/>
<id>urn:sha1:b3e19d924b6eaf2ca7d22cba99a517c5171007b6</id>
<content type='text'>
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
and those lookups often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. Doing that requires communication with all other CPUs
that may have taken a reference.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
  for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only take the global sum when the local count reaches 0 (this
  is difficult for vfsmounts, because we can't hold preempt off for the life of
  a reference, so a counter would need to be per-thread or tied strongly to a
  particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
  the total difference and hence find the refcount when summing all CPUs. Then,
  keep a single integer "long" refcount for slow and long lasting references,
  and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
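
A sketch of that last scheme (helper and field names illustrative, not
the literal kernel ones):

static inline void mnt_inc_count(struct vfsmount *mnt)
{
        this_cpu_inc(*mnt-&gt;mnt_count);  /* no atomic op, no shared line */
}

static inline void mnt_dec_count(struct vfsmount *mnt)
{
        this_cpu_dec(*mnt-&gt;mnt_count);
}

/* Only valid once writers are excluded and the long count is 0:
 * each per-cpu value is a difference, their sum is the refcount. */
static unsigned int mnt_get_count(struct vfsmount *mnt)
{
        unsigned int count = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                count += *per_cpu_ptr(mnt-&gt;mnt_count, cpu);
        return count;
}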

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However, the code is otherwise
bigger and heavier, so single-threaded performance is basically a wash.

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</content>
</entry>
<entry>
<title>fs: improve scalability of pseudo filesystems</title>
<updated>2011-01-07T06:50:32Z</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@kernel.dk</email>
</author>
<published>2011-01-07T06:50:07Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4b936885ab04dc6e0bb0ef35e0e23c1a7364d9e5'/>
<id>urn:sha1:4b936885ab04dc6e0bb0ef35e0e23c1a7364d9e5</id>
<content type='text'>
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem to need connected
dentries, because by definition they are disconnected.
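
For fs/pipe.c this means allocating the dentry unhashed and without a
parent; schematically, using the d_alloc_pseudo() helper this series
introduces:

/* before: a child of the shared pipefs root, contending on it */
dentry = d_alloc(pipe_mnt-&gt;mnt_sb-&gt;s_root, &amp;name);

/* after: a disconnected dentry; no parent lock or child list */
dentry = d_alloc_pseudo(pipe_mnt-&gt;mnt_sb, &amp;name);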

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</content>
</entry>
<entry>
<title>fs: dcache reduce branches in lookup path</title>
<updated>2011-01-07T06:50:28Z</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@kernel.dk</email>
</author>
<published>2011-01-07T06:49:55Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=fb045adb99d9b7c562dc7fef834857f78249daa1'/>
<id>urn:sha1:fb045adb99d9b7c562dc7fef834857f78249daa1</id>
<content type='text'>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry-&gt;d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.&gt;]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)-&gt;d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&amp;\1, \2);/' -i
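
The resulting fast path is a single flag test instead of two pointer
chases; e.g. for revalidation, schematically:

/* d_set_d_op() caches which hooks exist as DCACHE_OP_* flag bits */
if (dentry-&gt;d_flags &amp; DCACHE_OP_REVALIDATE)
        status = dentry-&gt;d_op-&gt;d_revalidate(dentry, nd);
/* vs. the old: if (dentry-&gt;d_op &amp;&amp; dentry-&gt;d_op-&gt;d_revalidate) */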

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</content>
</entry>
<entry>
<title>fs: avoid inode RCU freeing for pseudo fs</title>
<updated>2011-01-07T06:50:26Z</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@kernel.dk</email>
</author>
<published>2011-01-07T06:49:50Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ff0c7d15f9787b7e8c601533c015295cc68329f8'/>
<id>urn:sha1:ff0c7d15f9787b7e8c601533c015295cc68329f8</id>
<content type='text'>
Pseudo filesystems that don't put their inodes on an RCU list, and whose
inodes are not reachable by rcu-walk dentries, do not need to RCU-free
their inodes.
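
Schematically, such a filesystem's -&gt;destroy_inode() can then free the
inode immediately (handler name hypothetical; free_inode_nonrcu() is
the helper added for this):

static void pipefs_destroy_inode(struct inode *inode)
{
        /* no rcu-walk dentry can still see this inode, so the
         * RCU grace period can be skipped entirely */
        free_inode_nonrcu(inode);
}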

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</content>
</entry>
</feed>
