Age | Commit message (Collapse) | Author |
|
[ This combines upstream commit
2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)
The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.
4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)
This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)
In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.
Call tcp_push() only if MSG_MORE is not set in the flags parameter.
This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.
In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)
Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e847469bf77a1d339274074ed068d461f0c872bc upstream.
comparing be32 values for < is not doing the right thing...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 72094e43e3af5020510f920321d71f1798fa896d upstream.
le16, not le32...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 28748b325dc2d730ccc312830a91c4ae0c0d9379 upstream.
le16, not le32...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e1bf4cc620fd143766ddfcee3b004a1d1bb34fd0 upstream.
it's le16, not le32 or le64...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3a251f04fe97c3d335b745c98e4b377e3c3899f2 upstream.
It's le16, not le32...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6ed3cf2cdfce4c9f1d73171bd3f27d9cb77b734e upstream.
->root_flags is __le64 and all accesses to it go through the helpers
that do proper conversions. Except for btrfs_root_readonly(), which
checks bit 0 as in host-endian...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit efe39651f08813180f37dc508d950fc7d92b29a8 upstream.
Restore the original logics ("fail on mountpoints, negatives and in
case of fh_compose() failures"). Since commit 8177e (nfsd: clean up
readdirplus encoding) that got broken -
rv = fh_compose(fhp, exp, dchild, &cd->fh);
if (rv)
goto out;
if (!dchild->d_inode)
goto out;
rv = 0;
out:
is equivalent to
rv = fh_compose(fhp, exp, dchild, &cd->fh);
out:
and the second check has no effect whatsoever...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 02f5fde5df0ea930e70f93763dd48beff182b208 upstream.
->ts_id_status gets nfs errno, i.e. it's already big-endian; no need
to apply htonl() to it. Broken by commit 174568 (NFSD: Added TEST_STATEID
operation) last year...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 04da6e9d63427b2d0fd04766712200c250b3278f upstream.
nfsd_open() already returns an NFS error value; only vfs_test_lock()
result needs to be fed through nfserrno(). Broken by commit 55ef12
(nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT)
three years ago...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 96f6f98501196d46ce52c2697dd758d9300c63f5 upstream.
..._want_write() returns -EROFS on failure, _not_ an NFS error value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit af1584f570b19b0285e4402a0b54731495d31784 upstream.
->ee_len is __le16, so assigning cpu_to_le32() to it is going to do
Bad Things(tm) on big-endian hosts...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 99aa78466777083255b876293e9e83dec7cd809a upstream.
flush request is issued in transaction commit code path, so looks using
GFP_KERNEL to allocate memory for flush request bio falls into the classic
deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are
using GFP.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9cd70b347e9761ea2d2ac3d758c529a48a8193e6 upstream.
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8e62c2de6e23e5c1fee04f59de51b54cc2868ca5 upstream.
This reverts commit 5500cdbe14d7435e04f66ff3cfb8ecd8b8e44ebf.
We've had a number of complaints of early enospc that bisect down
to this patch. We'll hae to fix the reservations differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7a3ae2f8c8c8432e65467b7fc84d5deab04061a0 upstream.
In commit 4692cf58 we introduced new backref walking code for btrfs. This
assumes we're searching live roots, which requires a transaction context.
While scrubbing, however, we must not join a transaction because this could
deadlock with the commit path. Additionally, what scrub really wants to do
is resolving a logical address in the commit root it's currently checking.
This patch adds support for logical to path resolving on commit roots and
makes scrub use that.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 20e0fa98b751facf9a1101edaefbc19c82616a68 upstream.
_copy_from_pages() used to copy data from the temporary buffer to the
user passed buffer is passed the wrong size parameter when copying
data. res.acl_len contains both the bitmap and acl lenghts while
acl_len contains the acl length after adjusting for the bitmap size.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 66189be74ff5f9f3fd6444315b85be210d07cef2 upstream.
We can deadlock if we have a write oplock and two processes
use the same file handle. In this case the first process can't
unlock its lock if the second process blocked on the lock in the
same time.
Fix it by using posix_lock_file rather than posix_lock_file_wait
under cinode->lock_mutex. If we request a blocking lock and
posix_lock_file indicates that there is another lock that prevents
us, wait untill that lock is released and restart our call.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit de5b8e8e047534aac6bc9803f96e7257436aef9c upstream.
If you try to set grace_period or timeout via a module parameter
to lockd, and do this on a big-endian machine where
sizeof(int) != sizeof(unsigned long)
it won't work. This number given will be effectively shifted right
by the difference in those two sizes.
So cast kp->arg properly to get correct result.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e59d27e05a6435f8c04d5ad843f37fa795f2eaaa upstream.
Firstly, task->tk_status will always return negative error values,
so the current tests for 'NFS4ERR_DELEG_REVOKED' etc. are all being
ignored.
Secondly, clean up the code so that we only need to test
task->tk_status once!
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 05e9cfb408b24debb3a85fd98edbfd09dd148881 upstream.
We can currently loop forever in nfs4_lookup_root() and in
nfs41_proc_secinfo_no_name(), if the first iteration returns a
NFS4ERR_DELAY or something else that causes exception.retry to get
set.
Reported-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d97d32edcd732110758799ae60af725e5110b3dc upstream.
When an IO error happens during inode deletion run from
xlog_recover_process_iunlinks() filesystem gets shutdown. Thus any subsequent
attempt to read buffers fails. Code in xlog_recover_process_iunlinks() does not
count with the fact that read of a buffer which was read a while ago can
really fail which results in the oops on
agi = XFS_BUF_TO_AGI(agibp);
Fix the problem by cleaning up the buffer handling in
xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer
lock but keep buffer reference to AG buffer. That is enough for buffer to stay
pinned in memory and we don't have to call xfs_read_agi() all the time.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b18dafc86bb879d2f38a1743985d7ceb283c2f4d upstream.
In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
case; in two subcases the inode i_lock is properly released but this
does not occur in the -ELOOP subcase.
This seems to have been introduced by commit 1836750115f2 ("fix loop
checks in d_materialise_unique()").
Signed-off-by: Michel Lespinasse <walken@google.com>
[ Added a comment, and moved the unlock to where we generate the -ELOOP,
which seems to be more natural.
You probably can't actually trigger this without a buggy network file
server - d_materialize_unique() is for finding aliases on non-local
filesystems, and the d_ancestor() case is for a hardlinked directory
loop.
But we should be robust in the case of such buggy servers anyway. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 31d4f3a2f3c73f279ff96a7135d7202ef6833f12 upstream.
Explicitly test for an extent whose length is zero, and flag that as a
corrupted extent.
This avoids a kernel BUG_ON assertion failure.
Tested: Without this patch, the file system image found in
tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a
kernel panic. With this patch, an ext4 file system error is noted
instead, and the file system is marked as being corrupted.
https://bugzilla.kernel.org/show_bug.cgi?id=42859
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 491caa43639abcffaa645fbab372a7ef4ce2975c upstream.
The following command line will leave the aio-stress process unkillable
on an ext4 file system (in my case, mounted on /mnt/test):
aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2
This is using the aio-stress program from the xfstests test suite.
That particular command line tells aio-stress to do random writes to
20 files from 20 threads (one thread per file). The files are NOT
preallocated, so you will get writes to random offsets within the
file, thus creating holes and extending i_size. It also opens the
file with O_DIRECT and O_SYNC.
On to the problem. When an I/O requires unwritten extent conversion,
it is queued onto the completed_io_list for the ext4 inode. Two code
paths will pull work items from this list. The first is the
ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
which is called via the fsync path (and O_SYNC handling, as well).
There are two issues I've found in these code paths. First, if the
fsync path beats the work routine to a particular I/O, the work
routine will free the io_end structure! It does not take into account
the fact that the io_end may still be in use by the fsync path. I've
fixed this issue by adding yet another IO_END flag, indicating that
the io_end is being processed by the fsync path.
The second problem is that the work routine will make an assignment to
io->flag outside of the lock. I have witnessed this result in a hang
at umount. Moving the flag setting inside the lock resolved that
problem.
The problem was introduced by commit b82e384c7b ("ext4: optimize
locking for end_io extent conversion"), which first appeared in 3.2.
As such, the fix should be backported to that release (probably along
with the unwritten extent conversion race fix).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream.
The following comment in ext4_end_io_dio caught my attention:
/* XXX: probably should move into the real I/O completion handler */
inode_dio_done(inode);
The truncate code takes i_mutex, then calls inode_dio_wait. Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order. Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."
The fix is pretty straight-forward: don't call inode_dio_done until the
extent conversion is complete.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3d2b158262826e8b75bbbfb7b97010838dd92ac7 upstream.
Ext4 does not support data journalling with delayed allocation enabled.
We even do not allow to mount the file system with delayed allocation
and data journalling enabled, however it can be set via FS_IOC_SETFLAGS
so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file
system mounted with delayed allocation (default) and that's where
problem arises. The easies way to reproduce this problem is with the
following set of commands:
mkfs.ext4 /dev/sdd
mount /dev/sdd /mnt/test1
dd if=/dev/zero of=/mnt/test1/file bs=1M count=4
chattr +j /mnt/test1/file
dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc
chattr -j /mnt/test1/file
Additionally it can be reproduced quite reliably with xfstests 272 and
269. In fact the above reproducer is a part of test 272.
To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if
the file system is mounted with delayed allocation. This can be easily
done by fixing ext4_should_*_data() functions do ignore data journal
flag when delalloc is set (suggested by Ted). We also have to set the
appropriate address space operations for the inode (again, ignoring data
journal flag if delalloc enabled).
Additionally this commit introduces ext4_inode_journal_mode() function
because ext4_should_*_data() has already had a lot of common code and
this change is putting it all into one function so it is easier to
read.
Successfully tested with xfstests in following configurations:
delalloc + data=ordered
delalloc + data=writeback
data=journal
nodelalloc + data=ordered
nodelalloc + data=writeback
nodelalloc + data=journal
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 15291164b22a357cb211b618adfef4fa82fc0de3 upstream.
journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
state ala discard_buffer(), but does not touch _Delay or _Unwritten as
discard_buffer() does.
This can be problematic in some areas of the ext4 code which assume
that if they have found a buffer marked unwritten or delay, then it's
a live one. Perhaps those spots should check whether it is mapped
as well, but if jbd2 is going to tear down a buffer, let's really
tear it down completely.
Without this I get some fsx failures on sub-page-block filesystems
up until v3.2, at which point 4e96b2dbbf1d7e81f22047a50f862555a6cb87cb
and 189e868fa8fdca702eb9db9d8afc46b5cb9144c9 make the failures go
away, because buried within that large change is some more flag
clearing. I still think it's worth doing in jbd2, since
->invalidatepage leads here directly, and it's the right place
to clear away these flags.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9a3ba432330e504ac61ff0043dbdaba7cea0e35a upstream.
Prevent the state manager from filling up system logs when recovery
fails on the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4e474a00d7ff746ed177ddae14fa8b2d4bad7a00 upstream.
Protect code accessing ctl_table by grabbing the header with grab_header()
and after releasing with sysctl_head_finish(). This is needed if poll()
is called in entries created by modules: currently only hostname and
domainname support poll(), but this bug may be triggered when/if modules
use it and if user called poll() in a file that doesn't support it.
Dave Jones reported the following when using a syscall fuzzer while
hibernating/resuming:
RIP: 0010:[<ffffffff81233e3e>] [<ffffffff81233e3e>] proc_sys_poll+0x4e/0x90
RAX: 0000000000000145 RBX: ffff88020cab6940 RCX: 0000000000000000
RDX: ffffffff81233df0 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88020cab6940
[ ... ]
Code: 00 48 89 fb 48 89 f1 48 8b 40 30 4c 8b 60 e8 b8 45 01 00 00 49 83
7c 24 28 00 74 2e 49 8b 74 24 30 48 85 f6 74 24 48 85 c9 75 32 <8b> 16
b8 45 01 00 00 48 63 d2 49 39 d5 74 10 8b 06 48 98 48 89
If an entry goes away while we are polling() it, ctl_table may not exist
anymore.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1b26c9b334044cff6d1d2698f2be41bc7d9a0864 upstream.
The namespace cleanup path leaks a dentry which holds a reference count
on a network namespace. Keeping that network namespace from being freed
when the last user goes away. Leaving things like vlan devices in the
leaked network namespace.
If you use ip netns add for much real work this problem becomes apparent
pretty quickly. It light testing the problem hides because frequently
you simply don't notice the leak.
Use d_set_d_op() so that DCACHE_OP_* flags are set correctly.
This issue exists back to 3.0.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ce85852b90a214cf577fc1b4f49d99fd7e98784a upstream.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
controller to the next
commit 1daaae8fa4afe3df78ca34e724ed7e8187e4eb32 upstream.
This patch fixes an issue when cifs_mount receives a
STATUS_BAD_NETWORK_NAME error during cifs_get_tcon but is able to
continue after an DFS ROOT referral. In this case, the return code
variable is not reset prior to trying to mount from the system referred
to. Thus, is_path_accessible is not executed and the final DFS referral
is not performed causing a mount error.
Use case: In DNS, example.com resolves to the secondary AD server
ad2.example.com Our primary domain controller is ad1.example.com and has
a DFS redirection set up from \\ad1\share\Users to \\files\share\Users.
Mounting \\example.com\share\Users fails.
Regression introduced by commit 724d9f1.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru
Signed-off-by: Thomas Hadig <thomas@intapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 10b9b98e41ba248a899f6175ce96ee91431b6194 upstream.
Some servers sets this value less than 50 that was hardcoded and
we lost the connection if when we exceed this limit. Fix this by
respecting this value - not sending more than the server allows.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f30d500f809eca67a21704347ab14bb35877b5ee upstream.
When we get concurrent lookups of the same inode that is not in the
per-AG inode cache, there is a race condition that triggers warnings
in unlock_new_inode() indicating that we are initialising an inode
that isn't in a the correct state for a new inode.
When we do an inode lookup via a file handle or a bulkstat, we don't
serialise lookups at a higher level through the dentry cache (i.e.
pathless lookup), and so we can get concurrent lookups of the same
inode.
The race condition is between the insertion of the inode into the
cache in the case of a cache miss and a concurrently lookup:
Thread 1 Thread 2
xfs_iget()
xfs_iget_cache_miss()
xfs_iread()
lock radix tree
radix_tree_insert()
rcu_read_lock
radix_tree_lookup
lock inode flags
XFS_INEW not set
igrab()
unlock inode flags
rcu_read_unlock
use uninitialised inode
.....
lock inode flags
set XFS_INEW
unlock inode flags
unlock radix tree
xfs_setup_inode()
inode flags = I_NEW
unlock_new_inode()
WARNING as inode flags != I_NEW
This can lead to inode corruption, inode list corruption, etc, and
is generally a bad thing to occur.
Fix this by setting XFS_INEW before inserting the inode into the
radix tree. This will ensure any concurrent lookup will find the new
inode with XFS_INEW set and that forces the lookup to wait until the
XFS_INEW flag is removed before allowing the lookup to succeed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3114ea7a24d3264c090556a2444fc6d2c06176d4 upstream.
If a setattr() fails because of an NFS4ERR_OPENMODE error, it is
probably due to us holding a read delegation. Ensure that the
recovery routines return that delegation in this case.
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a1d0b5eebc4fd6e0edb02688b35f17f67f42aea5 upstream.
If we know that the delegation stateid is bad or revoked, we need to
remove that delegation as soon as possible, and then mark all the
stateids that relied on that delegation for recovery. We cannot use
the delegation as part of the recovery process.
Also note that NFSv4.1 uses a different error code (NFS4ERR_DELEG_REVOKED)
to indicate that the delegation was revoked.
Finally, ensure that setlk() and setattr() can both recover safely from
a revoked delegation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a05b0855fd15504972dba2358e5faa172a1e50ba upstream.
Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as
explained below
Thread A:
read() on hugetlbfs
hugetlbfs_read() called
i_mutex grabbed
hugetlbfs_read_actor() called
__copy_to_user() called
page fault is triggered
Thread B, sharing address space with A:
mmap() the same file
->mmap_sem is grabbed on task_B->mm->mmap_sem
hugetlbfs_file_mmap() is called
attempt to grab ->i_mutex and block waiting for A to give it up
Thread A:
pagefault handled blocked on attempt to grab task_A->mm->mmap_sem,
which happens to be the same thing as task_B->mm->mmap_sem. Block waiting
for B to give it up.
AFAIU the i_mutex locking was added to hugetlbfs_read() as per
http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take
care of the race between truncate and read. This patch fixes this by
looking at page->mapping under lock_page() (find_lock_page()) to ensure
that the inode didn't get truncated in the range during a parallel read.
Ideally we can extend the patch to make sure we don't increase i_size in
mmap. But that will break userspace, because applications will now have
to use truncate(2) to increase i_size in hugetlbfs.
Based on the original patch from Hillf Danton.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 93518dd2ebafcc761a8637b2877008cfd748c202 upstream.
This patch fixies follwing two memory leak patterns that reported by kmemleak.
sysfs_sd_setsecdata() is called during sys_lsetxattr() operation.
It checks sd->s_iattr is NULL or not. Then if it is NULL, it calls
sysfs_init_inode_attrs() to allocate memory.
That code is this.
iattrs = sd->s_iattr;
if (!iattrs)
iattrs = sysfs_init_inode_attrs(sd);
The iattrs recieves sysfs_init_inode_attrs()'s result, but sd->s_iattr
doesn't know the address. so it needs to set correct address to
sd->s_iattr to free memory in other function.
unreferenced object 0xffff880250b73e60 (size 32):
comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
hex dump (first 32 bytes):
73 79 73 74 65 6d 5f 75 3a 6f 62 6a 65 63 74 5f system_u:object_
72 3a 73 79 73 66 73 5f 74 3a 73 30 00 00 00 00 r:sysfs_t:s0....
backtrace:
[<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98
[<ffffffff811270ab>] __kmalloc+0x100/0x12c
[<ffffffff8120775a>] context_struct_to_string+0x106/0x210
[<ffffffff81207cc1>] security_sid_to_context_core+0x10b/0x129
[<ffffffff812090ef>] security_sid_to_context+0x10/0x12
[<ffffffff811fb0da>] selinux_inode_getsecurity+0x7d/0xa8
[<ffffffff811fb127>] selinux_inode_getsecctx+0x22/0x2e
[<ffffffff811f4d62>] security_inode_getsecctx+0x16/0x18
[<ffffffff81191dad>] sysfs_setxattr+0x96/0x117
[<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9
[<ffffffff811543d9>] vfs_setxattr+0x83/0xa1
[<ffffffff811544c6>] setxattr+0xcf/0x101
[<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f
[<ffffffff814efda9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88024163c5a0 (size 96):
comm "systemd", pid 1, jiffies 4294683888 (age 94.553s)
hex dump (first 32 bytes):
00 00 00 00 ed 41 00 00 00 00 00 00 00 00 00 00 .....A..........
00 00 00 00 00 00 00 00 0c 64 42 4f 00 00 00 00 .........dBO....
backtrace:
[<ffffffff814cb1d0>] kmemleak_alloc+0x73/0x98
[<ffffffff81127402>] kmem_cache_alloc_trace+0xc4/0xee
[<ffffffff81191cbe>] sysfs_init_inode_attrs+0x2a/0x83
[<ffffffff81191dd6>] sysfs_setxattr+0xbf/0x117
[<ffffffff811542f0>] __vfs_setxattr_noperm+0x73/0xd9
[<ffffffff811543d9>] vfs_setxattr+0x83/0xa1
[<ffffffff811544c6>] setxattr+0xcf/0x101
[<ffffffff81154745>] sys_lsetxattr+0x6a/0x8f
[<ffffffff814efda9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
`
Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
number of possible wakeup paths in epoll is causing a few applications
to longer work (dovecot for one).
The original patch is really about limiting the amount of epoll nesting
(since epoll fds can be attached to other fds). Thus, we probably can
allow an unlimited number of paths of depth 1. My current patch limits
it at 1000. And enforce the limits on paths that have a greater depth.
This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge some more email patches from Andrew Morton:
"A couple of nilfs fixes"
* emailed from Andrew Morton <akpm@linux-foundation.org>:
nilfs2: fix NULL pointer dereference in nilfs_load_super_block()
nilfs2: clamp ns_r_segments_percentage to [1, 99]
|
|
According to the report from Slicky Devil, nilfs caused kernel oops at
nilfs_load_super_block function during mount after he shrank the
partition without resizing the filesystem:
BUG: unable to handle kernel NULL pointer dereference at 00000048
IP: [<d0d7a08e>] nilfs_load_super_block+0x17e/0x280 [nilfs2]
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
...
Call Trace:
[<d0d7a87b>] init_nilfs+0x4b/0x2e0 [nilfs2]
[<d0d6f707>] nilfs_mount+0x447/0x5b0 [nilfs2]
[<c0226636>] mount_fs+0x36/0x180
[<c023d961>] vfs_kern_mount+0x51/0xa0
[<c023ddae>] do_kern_mount+0x3e/0xe0
[<c023f189>] do_mount+0x169/0x700
[<c023fa9b>] sys_mount+0x6b/0xa0
[<c04abd1f>] sysenter_do_call+0x12/0x28
Code: 53 18 8b 43 20 89 4b 18 8b 4b 24 89 53 1c 89 43 24 89 4b 20 8b 43
20 c7 43 2c 00 00 00 00 23 75 e8 8b 50 68 89 53 28 8b 54 b3 20 <8b> 72
48 8b 7a 4c 8b 55 08 89 b3 84 00 00 00 89 bb 88 00 00 00
EIP: [<d0d7a08e>] nilfs_load_super_block+0x17e/0x280 [nilfs2] SS:ESP 0068:ca9bbdcc
CR2: 0000000000000048
This turned out due to a defect in an error path which runs if the
calculated location of the secondary super block was invalid.
This patch fixes it and eliminates the reported oops.
Reported-by: Slicky Devil <slicky.dvl@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Slicky Devil <slicky.dvl@gmail.com>
Cc: <stable@vger.kernel.org> [2.6.30+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ns_r_segments_percentage is read from the disk. Bogus or malicious
value could cause integer overflow and malfunction due to meaningless
disk usage calculation. This patch reports error when mounting such
bogus volumes.
Signed-off-by: Haogang Chen <haogangchen@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When writing files to afs I sometimes hit a BUG:
kernel BUG at fs/afs/rxrpc.c:179!
With a backtrace of:
afs_free_call
afs_make_call
afs_fs_store_data
afs_vnode_store_data
afs_write_back_from_locked_page
afs_writepages_region
afs_writepages
The cause is:
ASSERT(skb_queue_empty(&call->rx_queue));
Looking at a tcpdump of the session the abort happens because we
are exceeding our disk quota:
rx abort fs reply store-data error diskquota exceeded (32)
So the abort error is valid. We hit the BUG because we haven't
freed all the resources for the call.
By freeing any skbs in call->rx_queue before calling afs_free_call
we avoid hitting leaking memory and avoid hitting the BUG.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A read of a large file on an afs mount failed:
# cat junk.file > /dev/null
cat: junk.file: Bad message
Looking at the trace, call->offset wrapped since it is only an
unsigned short. In afs_extract_data:
_enter("{%u},{%zu},%d,,%zu", call->offset, len, last, count);
...
if (call->offset < count) {
if (last) {
_leave(" = -EBADMSG [%d < %zu]", call->offset, count);
return -EBADMSG;
}
Which matches the trace:
[cat ] ==> afs_extract_data({65132},{524},1,,65536)
[cat ] <== afs_extract_data() = -EBADMSG [0 < 65536]
call->offset went from 65132 to 0. Fix this by making call->offset an
unsigned int.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but |