aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-04-03nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATINGJeff Layton
commit 4db72b40fdbc706f8957e9773ae73b1574b8c694 upstream. If the setting of NFS_INO_INVALIDATING gets reordered to before the clearing of NFS_INO_INVALID_DATA, then another task may hit a race window where both appear to be clear, even though the inode's pages are still in need of invalidation. Fix this by adding the appropriate memory barriers. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03NFS: Fix races in nfs_revalidate_mappingTrond Myklebust
commit 17dfeb9113397a6119091a491ef7182649f0c5a9 upstream. Commit d529ef83c355f97027ff85298a9709fe06216a66 (NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping) introduces a potential race, since it doesn't test the value of nfsi->cache_validity and set the bitlock in nfsi->flags atomically. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mappingJeff Layton
commit d529ef83c355f97027ff85298a9709fe06216a66 upstream. There is a possible race in how the nfs_invalidate_mapping function is handled. Currently, we go and invalidate the pages in the file and then clear NFS_INO_INVALID_DATA. The problem is that it's possible for a stale page to creep into the mapping after the page was invalidated (i.e., via readahead). If another writer comes along and sets the flag after that happens but before invalidate_inode_pages2 returns then we could clear the flag without the cache having been properly invalidated. So, we must clear the flag first and then invalidate the pages. Doing this however, opens another race: It's possible to have two concurrent read() calls that end up in nfs_revalidate_mapping at the same time. The first one clears the NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping. Just before calling that though, the other task races in, checks the flag and finds it cleared. At that point, it trusts that the mapping is good and gets the lock on the page, allowing the read() to be satisfied from the cache even though the data is no longer valid. These effects are easily manifested by running diotest3 from the LTP test suite on NFS. That program does a series of DIO writes and buffered reads. The operations are serialized and page-aligned but the existing code fails the test since it occasionally allows a read to come out of the cache incorrectly. While mixing direct and buffered I/O isn't recommended, I believe it's possible to hit this in other ways that just use buffered I/O, though that situation is much harder to reproduce. The problem is that the checking/clearing of that flag and the invalidation of the mapping really need to be atomic. Fix this by serializing concurrent invalidations with a bitlock. At the same time, we also need to allow other places that check NFS_INO_INVALID_DATA to check whether we might be in the middle of invalidating the file, so fix up a couple of places that do that to look for the new NFS_INO_INVALIDATING flag. Doing this requires us to be careful not to set the bitlock unnecessarily, so this code only does that if it believes it will be doing an invalidation. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03pnfs: fix BUG in filelayout_recover_commit_reqsWeston Andros Adamson
commit 471252cd8b34b0609973740b25dcd1ff01dc1889 upstream. cond_resched_lock(cinfo->lock) is called everywhere else while holding the cinfo->lock spinlock. Not holding this lock while calling transfer_commit_list in filelayout_recover_commit_reqs causes the BUG below. It's true that we can't hold this lock while calling pnfs_put_lseg, because that might try to lock the inode lock - which might be the same lock as cinfo->lock. To reproduce, mount a 2 DS pynfs server and run an O_DIRECT command that crosses a stripe boundary and is not page aligned, such as: dd if=/dev/zero of=/mnt/f bs=17000 count=1 oflag=direct BUG: sleeping function called from invalid context at linux/fs/nfs/nfs4filelayout.c:1161 in_atomic(): 0, irqs_disabled(): 0, pid: 27, name: kworker/0:1 2 locks held by kworker/0:1/27: #0: (events){.+.+.+}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5 #1: ((&dreq->work)){+.+...}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5 CPU: 0 PID: 27 Comm: kworker/0:1 Not tainted 3.13.0-rc3-branch-dros_testing+ #21 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Workqueue: events nfs_direct_write_schedule_work [nfs] 0000000000000000 ffff88007a39bbb8 ffffffff81491256 ffff88007b87a130 ffff88007a39bbd8 ffffffff8105f103 ffff880079614000 ffff880079617d40 ffff88007a39bc20 ffffffffa011603e ffff880078988b98 0000000000000000 Call Trace: [<ffffffff81491256>] dump_stack+0x4d/0x66 [<ffffffff8105f103>] __might_sleep+0x100/0x105 [<ffffffffa011603e>] transfer_commit_list+0x94/0xf1 [nfs_layout_nfsv41_files] [<ffffffffa01160d6>] filelayout_recover_commit_reqs+0x3b/0x68 [nfs_layout_nfsv41_files] [<ffffffffa00ba53a>] nfs_direct_write_reschedule+0x9f/0x1d6 [nfs] [<ffffffff810705df>] ? mark_lock+0x1df/0x224 [<ffffffff8106e617>] ? trace_hardirqs_off_caller+0x37/0xa4 [<ffffffff8106e691>] ? trace_hardirqs_off+0xd/0xf [<ffffffffa00ba8f8>] nfs_direct_write_schedule_work+0x9d/0xb7 [nfs] [<ffffffff810501d7>] ? process_one_work+0x175/0x3a5 [<ffffffff81050258>] process_one_work+0x1f6/0x3a5 [<ffffffff810501d7>] ? process_one_work+0x175/0x3a5 [<ffffffff8105187e>] worker_thread+0x149/0x1f5 [<ffffffff81051735>] ? rescuer_thread+0x28d/0x28d [<ffffffff81056d74>] kthread+0xd2/0xda [<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61 [<ffffffff8149e66c>] ret_from_fork+0x7c/0xb0 [<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61 Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03nfs: increment i_dio_count for reads, tooChristoph Hellwig
commit 1f90ee27461e31a1c18e5d819f6ea6f5c7304b16 upstream. i_dio_count is used to protect dio access against truncate. We want to make sure there are no dio reads pending either when doing a truncate. I suspect on plain NFS things might work even without this, but once we use a pnfs layout driver that access backing devices directly things will go bad without the proper synchronization. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03nfs: defer inode_dio_done call until size update is doneChristoph Hellwig
commit 2a009ec98cce440c0992fc9a2353e96cdb0b048b upstream. We need to have the I/O fully finished before telling the truncate code that we are done. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03nfs: fix size updates for aio writesChristoph Hellwig
commit 9811cd57f4c6b5b60ec104de68a88303717e3106 upstream. nfs_file_direct_write only updates the inode size if it succeeded and returned the number of bytes written. But in the AIO case nfs_direct_wait turns the return value into -EIOCBQUEUED and we skip the size update. Instead the aio completion path should updated it, which this patch does. The implementation is a little hacky because there is no obvious way to find out we are called for a write in nfs_direct_complete. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03nfs: fix dead code of ipv6_addr_scopeAlexander Aring
commit a8c2275493b866961f4429a741251c630c4fc6d7 upstream. The correct way to check on IPV6_ADDR_SCOPE_LINKLOCAL is to check with the ipv6_addr_src_scope function. Currently this can't be work, because ipv6_addr_scope returns a int with a mask of IPV6_ADDR_SCOPE_MASK (0x00f0U) and IPV6_ADDR_SCOPE_LINKLOCAL is 0x02. So the condition is always false. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recoveryTrond Myklebust
commit f22e5edd2244609aed3906207a62223e7707a34d upstream. Andy Adamson reports: The state manager is recovering expired state and recovery OPENs are being processed. If kswapd is pruning inodes at the same time, a deadlock can occur when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the resultant layoutreturn gets an error that the state mangager is to handle, causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq. At the same time an open is waiting for the inode deletion to complete in __wait_on_freeing_inode. If the open is either the open called by the state manager, or an open from the same open owner that is holding the NFSv4 sequence id which causes the OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue, then the state is deadlocked with kswapd. The fix is simply to have layoutreturn ignore all errors except NFS4ERR_DELAY. We already know that layouts are dropped on all server reboots, and that it has to be coded to deal with the "forgetful client model" that doesn't send layoutreturns. Reported-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1385402270-14284-1-git-send-email-andros@netapp.com Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-03nfs: use IS_ROOT not DCACHE_DISCONNECTEDJ. Bruce Fields
commit a3f432bfd06a4ec3b812e32d3266e0d1ad75d008 upstream. This check was added by Al Viro with d9e80b7de91db05c1c4d2e5ebbfd70b3b3ba0e0f "nfs d_revalidate() is too trigger-happy with d_drop()", with the explanation that we don't want to remove the root of a disconnected tree, which will still be included on the s_anon list. But DCACHE_DISCONNECTED does *not* actually identify dentries that are disconnected from the dentry tree or hashed on s_anon. IS_ROOT() is the way to do that. Also add a comment from Al's commit to remind us why this check is there. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-31fs/proc/proc_devtree.c: remove empty /proc/device-tree when no openfirmware ↵Dave Jones
exists. commit c1d867a54d426b45da017fbe8e585f8a3064ce8d upstream. Distribution kernels might want to build in support for /proc/device-tree for kernels that might end up running on hardware that doesn't support openfirmware. This results in an empty /proc/device-tree existing. Remove it if the OFW root node doesn't exist. This situation actually confuses grub2, resulting in install failures. grub2 sees the /proc/device-tree and picks the wrong install target cf. http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311 grub should be more robust, but still, leaving an empty proc dir seems pointless. Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-26NFSv4: Fix another nfs4_sequence corruptorTrond Myklebust
commit b7e63a1079b266866a732cf699d8c4d61391bbda upstream. nfs4_release_lockowner needs to set the rpc_message reply to point to the nfs4_sequence_res in order to avoid another Oopsable situation in nfs41_assign_slot. Fixes: fbd4bfd1d9d21 (NFS: Add nfs4_sequence calls for RELEASE_LOCKOWNER) Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-24bio-integrity: Fix bio_integrity_verify segment start bugNicholas Bellinger
commit 5837c80e870bc3b12ac6a98cdc9ce7a9522a8fb6 upstream. This patch addresses a bug in bio_integrity_verify() code that has been causing DIF READ verify operations to be silently skipped. The issue is that bio->bi_idx will have been incremented within bio_advance() code in the normal blk_update_request() -> req_bio_endio() completion path, and bio_integrity_verify() is using bio_for_each_segment() which starts the bio segment walk at the current bio->bi_idx. So instead use bio_for_each_segment_all() to always start the bio segment walk from zero, regardless of the current bio->bi_idx value after bio_advance() has been called. (Context change for v3.10.y -> v3.13.y code - nab) Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22fs/proc/base.c: fix GPF in /proc/$PID/map_filesArtem Fetishev
commit 70335abb2689c8cd5df91bf2d95a65649addf50b upstream. The expected logic of proc_map_files_get_link() is either to return 0 and initialize 'path' or return an error and leave 'path' uninitialized. By the time dname_to_vma_addr() returns 0 the corresponding vma may have already be gone. In this case the path is not initialized but the return value is still 0. This results in 'general protection fault' inside d_path(). Steps to reproduce: CONFIG_CHECKPOINT_RESTORE=y fd = open(...); while (1) { mmap(fd, ...); munmap(fd, ...); } ls -la /proc/$PID/map_files Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991 Signed-off-by: Artem Fetishev <artem_fetishev@epam.com> Signed-off-by: Aleksandr Terekhov <aleksandr_terekhov@epam.com> Reported-by: <wiebittewas@gmail.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateidTrond Myklebust
commit e1253be0ece1a95a02c7f5843194877471af8179 upstream. When nfs4_set_rw_stateid() can fails by returning EIO to indicate that the stateid is completely invalid, then it makes no sense to have it trigger a retry of the READ or WRITE operation. Instead, we should just have it fall through and attempt a recovery. This fixes an infinite loop in which the client keeps replaying the same bad stateid back to the server. Reported-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22NFS: Fix a delegation callback raceTrond Myklebust
commit 755a48a7a4eb05b9c8424e3017d947b2961a60e0 upstream. The clean-up in commit 36281caa839f ended up removing a NULL pointer check that is needed in order to prevent an Oops in nfs_async_inode_return_delegation(). Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/5313E9F6.2020405@intel.com Fixes: 36281caa839f (NFSv4: Further clean-ups of delegation stateid validation) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22ocfs2 syncs the wrong range...Al Viro
commit 1b56e98990bcdbb20b9fab163654b9315bf158e8 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22ocfs2: fix quota file corruptionJan Kara
commit 15c34a760630ca2c803848fba90ca0646a9907dd upstream. Global quota files are accessed from different nodes. Thus we cannot cache offset of quota structure in the quota file after we drop our node reference count to it because after that moment quota structure may be freed and reallocated elsewhere by a different node resulting in corruption of quota file. Fix the problem by clearing dq_off when we are releasing dquot structure. We also remove the DB_READ_B handling because it is useless - DQ_ACTIVE_B is set iff DQ_READ_B is set. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-14Fix mountpoint reference leakage in linkatOleg Drokin
commit d22e6338db7f613dd4f6095c190682fcc519e4b7 upstream. Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] [js: the second path_put is not in 3.12 yet, hunk removed] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: do not take the nested ep->mtx on EPOLL_CTL_DELJason Baron
commit 4ff36ee94d93ddb4b7846177f1118d9aa33408e2 upstream. The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock. That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple topologies"). The acquistion of the ep->mtx for the destination 'ep' was added such that a concurrent EPOLL_CTL_ADD operation would see the correct state of the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links') However, by simply not acquiring the lock, we do not serialize behind the ep->mtx from the add path, and thus may perform a full path check when if we had waited a little longer it may not have been necessary. However, this is a transient state, and performing the full loop checking in this case is not harmful. The important point is that we wouldn't miss doing the full loop checking when required, since EPOLL_CTL_ADD always locks any 'ep's that its operating upon. The reason we don't need to do lock ordering in the add path, is that we are already are holding the global 'epmutex' whenever we do the double lock. Further, the original posting of this patch, which was tested for the intended performance gains, did not perform this additional locking. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: do not take global 'epmutex' for simple topologiesJason Baron
commit 67347fe4e6326338ee217d7eb826bedf30b2e155 upstream. When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached directly to a wakeup source, we do not need to take the global 'epmutex', unless the epoll file descriptor is nested. The purpose of taking the 'epmutex' on add is to prevent complex topologies such as loops and deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD operations. However, for the simple case of an epoll file descriptor attached directly to a wakeup source (with no nesting), we do not need to hold the 'epmutex'. This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb performance: "On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In addition the benchmark when from scaling well on 10 sockets to scaling well on just over 40 sockets. ... Currently the benchmark stops scaling at around 40-44 sockets but it seems like I found a second unrelated bottleneck." [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc] Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12epoll: optimize EPOLL_CTL_DEL using rcuJason Baron
commit ae10b2b4eb01bedc91d29d5c5bb9e416fd806c40 upstream. Nathan Zimmer found that once we get over 10+ cpus, the scalability of SPECjbb falls over due to the contention on the global 'epmutex', which is taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by using rcu to guard against any concurrent traversals. Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for simple topologies. IE when adding a link from an epoll file descriptor to a wakeup source, where the epoll file descriptor is not nested. This patch (of 2): Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by converting the file->f_ep_links list into an rcu one. In this way, we can traverse the epoll network on the add path in parallel with deletes. Since deletes can't create loops or worse wakeup paths, this is safe. This patch in combination with the patch "epoll: Do not take global 'epmutex' for simple topologies", shows a dramatic performance improvement in scalability for SPECjbb. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> CC: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05quota: Fix race between dqput() and dquot_scan_active()Jan Kara
commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream. Currently last dqput() can race with dquot_scan_active() causing it to call callback for an already deactivated dquot. The race is as follows: CPU1 CPU2 dqput() spin_lock(&dq_list_lock); if (atomic_read(&dquot->dq_count) > 1) { - not taken if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); ->release_dquot(dquot); if (atomic_read(&dquot->dq_count) > 1) - not taken dquot_scan_active() spin_lock(&dq_list_lock); if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) - not taken atomic_inc(&dquot->dq_count); spin_unlock(&dq_list_lock); - proceeds to release dquot ret = fn(dquot, priv); - called for inactive dquot Fix the problem by making sure possible ->release_dquot() is finished by the time we call the callback and new calls to it will notice reference dquot_scan_active() has taken and bail out. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05fs: fix iversion handlingChristoph Hellwig
commit dff6efc326a4d5f305797d4a6bba14f374fdd633 upstream. Currently notify_change directly updates i_version for size updates, which not only is counter to how all other fields are updated through struct iattr, but also breaks XFS, which need inode updates to happen under its own lock, and synchronized to the structure that gets written to the log. Remove the update in the common code, and it to btrfs and ext4, XFS already does a proper updaste internally and currently gets a double update with the existing code. IMHO this is 3.13 and -stable material and should go in through the XFS tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05NFS fix error return in nfs4_select_rw_stateidAndy Adamson
commit 146d70caaa1b87f64597743429d7da4b8073d0c9 upstream. Do not return an error when nfs4_copy_delegation_stateid succeeds. Signed-off-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1392737765-41942-1-git-send-email-andros@netapp.com Fixes: ef1820f9be27b (NFSv4: Don't try to recover NFSv4 locks when...) Cc: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05CIFS: Fix too big maxBuf size for SMB3 mountsPavel Shilovsky
commit 2365c4eaf077c48574ab6f143960048fc0f31518 upstream. SMB3 servers can respond with MaxTransactSize of more than 4M that can cause a memory allocation error returned from kmalloc in a lock codepath. Also the client doesn't support multicredit requests now and allows buffer sizes of 65536 bytes only. Set MaxTransactSize to this maximum supported value. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05cifs: ensure that uncached writes handle unmapped areas correctlyJeff Layton
commit 5d81de8e8667da7135d3a32a964087c0faf5483f upstream. It's possible for userland to pass down an iovec via writev() that has a bogus user pointer in it. If that happens and we're doing an uncached write, then we can end up getting less bytes than we expect from the call to iov_iter_copy_from_user. This is CVE-2014-0069 cifs_iovec_write isn't set up to handle that situation however. It'll blindly keep chugging through the page array and not filling those pages with anything useful. Worse yet, we'll later end up with a negative number in wdata->tailsz, which will confuse the sending routines and cause an oops at the very least. Fix this by having the copy phase of cifs_iovec_write stop copying data in this situation and send the last write as a short one. At the same time, we want to avoid sending a zero-length write to the server, so break out of the loop and set rc to -EFAULT if that happens. This also allows us to handle the case where no address in the iovec is valid. [Note: Marking this for stable on v3.4+ kernels, but kernels as old as v2.6.38 may have a similar problem and may need similar fix] Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05jbd2: fix use after free in jbd2_journal_start_reserved()Dan Carpenter
commit 92e3b40537707001d17bbad800d150ab04e53bf4 upstream. If start_this_handle() fails then it leads to a use after free of "handle". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: don't leave i_crtime.tv_sec uninitializedTheodore Ts'o
commit 19ea80603715d473600cd993b9987bc97d042e02 upstream. If the i_crtime field is not present in the inode, don't leave the field uninitialized. Fixes: ef7f38359 ("ext4: Add nanosecond timestamps") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix online resize with a non-standard blocks per group settingTheodore Ts'o
commit 3d2660d0c9c2f296837078c189b68a47f6b2e3b5 upstream. The set_flexbg_block_bitmap() function assumed that the number of blocks in a blockgroup was sb->blocksize * 8, which is normally true, but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block bitmap corruption after: mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jon Bernard <jbernard@tuxion.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix online resize with very large inode tablesTheodore Ts'o
commit b93c95353413041a8cebad915a8109619f66bcc6 upstream. If a file system has a large number of inodes per block group, all of the metadata blocks in a flex_bg may be larger than what can fit in a single block group. Unfortunately, ext4_alloc_group_tables() in resize.c was never tested to see if it would handle this case correctly, and there were a large number of bugs which caused the following sequence to result in a BUG_ON: kernel bug at fs/ext4/resize.c:409! ... call trace: [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0 [<ffffffff811b9df2>] ? final_putname+0x22/0x50 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0 rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180 This can be reproduced with the following command sequence: mke2fs -t ext4 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G To fix this, we need to make sure the right thing happens when a block group's inode table straddles two block groups, which means the following bugs had to be fixed: 1) Not clearing the BLOCK_UNINIT flag in the second block group in ext4_alloc_group_tables --- the was proximate cause of the BUG_ON. 2) Incorrectly determining how many block groups contained contiguous free blocks in ext4_alloc_group_tables(). 3) Incorrectly setting the start of the next block range to be marked in use after a discontinuity in setup_new_flex_group_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: don't try to modify s_flags if the the file system is read-onlyTheodore Ts'o
commit 23301410972330c0ae9a8afc379ba2005e249cc6 upstream. If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix error paths in swap_inode_boot_loader()Zheng Liu
commit 30d29b119ef01776e0a301444ab24defe8d8bef3 upstream. In swap_inode_boot_loader() we forgot to release ->i_mutex and resume unlocked dio for inode and inode_bl if there is an error starting the journal handle. This commit fixes this issue. Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05ext4: fix xfstest generic/299 block validity failuresEric Whitney
commit 15cc17678547676c82a5da9ccf357447333fc342 upstream. Commit a115f749c1 (ext4: remove wait for unwritten extent conversion from ext4_truncate) exposed a bug in ext4_ext_handle_uninitialized_extents(). It can be triggered by xfstest generic/299 when run on a test file system created without a journal. This test continuously fallocates and truncates files to which random dio/aio writes are simultaneously performed by a separate process. The test completes successfully, but if the test filesystem is mounted with the block_validity option, a warning message stating that a logical block has been mapped to an illegal physical block is posted in the kernel log. The bug occurs when an extent is being converted to the written state by ext4_end_io_dio() and ext4_ext_handle_uninitialized_extents() discovers a mapping for an existing uninitialized extent. Although it sets EXT4_MAP_MAPPED in map->m_flags, it fails to set map->m_pblk to the discovered physical block number. Because map->m_pblk is not otherwise initialized or set by this function or its callers, its uninitialized value is returned to ext4_map_blocks(), where it is stored as a bogus mapping in the extent status tree. Since map->m_pblk can accidentally contain illegal values that are larger than the physical size of the file system, calls to check_block_validity() in ext4_map_blocks() that are enabled if the block_validity mount option is used can fail, resulting in the logged warning message. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05bio-integrity: Fix bio_integrity_verify segment start bugNicholas Bellinger
commit 5837c80e870bc3b12ac6a98cdc9ce7a9522a8fb6 upstream. This patch addresses a bug in bio_integrity_verify() code that has been causing DIF READ verify operations to be silently skipped. The issue is that bio->bi_idx will have been incremented within bio_advance() code in the normal blk_update_request() -> req_bio_endio() completion path, and bio_integrity_verify() is using bio_for_each_segment() which starts the bio segment walk at the current bio->bi_idx. So instead use bio_for_each_segment_all() to always start the bio segment walk from zero, regardless of the current bio->bi_idx value after bio_advance() has been called. (Context change for v3.10.y -> v3.13.y code - nab) Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Cc: stable@kernel.dk # >= v3.10 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2014-03-03NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFSTrond Myklebust
commit fd1defc257e2b12ab69bc0b379105c00eca4e112 upstream. Commit aa9c2669626c (NFS: Client implementation of Labeled-NFS) introduces a performance regression. When nfs_zap_caches_locked is called, it sets the NFS_INO_INVALID_LABEL flag irrespectively of whether or not the NFS server supports security labels. Since that flag is never cleared, it means that all calls to nfs_revalidate_inode() will now trigger an on-the-wire GETATTR call. This patch ensures that we never set the NFS_INO_INVALID_LABEL unless the server advertises support for labeled NFS. It also causes nfs_setsecurity() to clear NFS_INO_INVALID_LABEL when it has successfully set the security label for the inode. Finally it gets rid of the NFS_INO_INVALID_LABEL cruft from nfs_update_inode, which has nothing to do with labeled NFS. Reported-by: Neil Brown <neilb@suse.de> Cc: stable@vger.kernel.org # 3.11+ Tested-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: fix data corruption when reading/updating compressed extentsFilipe David Borba Manana
commit a2aa75e18a21b21952dc6daa9bac7c9f4426f81f upstream. When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: fix tree mod loggingFilipe David Borba Manana
commit 5de865eebb8330eee19c37b31fb6f315a09d4273 upstream. While running the test btrfs/004 from xfstests in a loop, it failed about 1 time out of 20 runs in my desktop. The failure happened in the backref walking part of the test, and the test's error message was like this: btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000 +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000 @@ -1,3 +1,8 @@ QA output created by 004 *** test backref walking -*** done +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got: + ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 Failures: btrfs/004 Failed 1 of 1 tests But immediately after the test finished, the btrfs inspect-internal command returned the expected output: $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 inode 405 offset 454656 root 258 inode 405 offset 454656 root 5 It turned out this was because the btrfs_search_old_slot() calls performed during backref walking (backref.c:__resolve_indirect_ref) were not finding anything. The reason for this turned out to be that the tree mod logging code was not logging some node multi-step operations atomically, therefore btrfs_search_old_slot() callers iterated often over an incomplete tree that wasn't fully consistent with any tree state from the past. Besides missing items, this often (but not always) resulted in -EIO errors during old slot searches, reported in dmesg like this: [ 4299.933936] ------------[ cut here ]------------ [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]() [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007 [ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70 [ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000 [ 4299.933987] Call Trace: [ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76 [ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0 [ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20 [ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs] [ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50 [ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs] [ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs] [ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs] [ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs] [ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs] [ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs] [ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00 [ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40 [ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934104] ---[ end trace 48f0cfc902491414 ]--- [ 4299.934378] btrfs bad fsid on block 0 These tree mod log operations that must be performed atomically, tree_mod_log_free_eb, tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to be performed atomically before the following commit: c8cc6341653721b54760480b0d0d9b5f09b46741 (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations) That change removed the atomicity of such operations. This patch restores the atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem structures, so it has to do the allocations using GFP_NOFS before acquiring the mod log lock. This issue has been experienced by several users recently, such as for example: http://www.spinics.net/lists/linux-btrfs/msg28574.html After running the btrfs/004 test for 679 consecutive iterations with this patch applied, I didn't ran into the issue anymore. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: return immediately if tree log mod is not necessaryFilipe David Borba Manana
commit 783577663507411e36e459390ef056556e93ef29 upstream. In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-22block: Fix nr_vecs for inline integrity vectorsMartin K. Petersen
commit 087787959ce851d7bbb19f10f6e9241b7f85a3ca upstream. Commit 9f060e2231ca changed the way we handle allocations for the integrity vectors. When the vectors are inline there is no associated slab and consequently bvec_nr_vecs() returns 0. Ensure that we check against BIP_INLINE_VECS in that case. Reported-by: David Milburn <dmilburn@redhat.com> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22lockd: send correct lock when granting a delayed lock.NeilBrown
commit 2ec197db1a56c9269d75e965f14c344b58b2a4f6 upstream. If an NFS client attempts to get a lock (using NLM) and the lock is not available, the server will remember the request and when the lock becomes available it will send a GRANT request to the client to provide the lock. If the client already held an adjacent lock, the GRANT callback will report the union of the existing and new locks, which can confuse the client. This happens because __posix_lock_file (called by vfs_lock_file) updates the passed-in file_lock structure when adjacent or over-lapping locks are found. To avoid this problem we take a copy of the two fields that can be changed (fl_start and fl_end) before the call and restore them afterwards. An alternate would be to allocate a 'struct file_lock', initialise it, use locks_copy_lock() to take a copy, then locks_release_private() after the vfs_lock_file() call. But that is a lot more work. Reported-by: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -- v1 had a couple of issues (large on-stack struct and didn't really work properly). This version is much better tested. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-02-22retrieving CIFS ACLs when mounted with SMB2 fails dropping sessionSteve French
commit 83e3bc23ef9ce7c03b7b4e5d3d790246ea59db3e upstream. The get/set ACL xattr support for CIFS ACLs attempts to send old cifs dialect protocol requests even when mounted with SMB2 or later dialects. Sending cifs requests on an smb2 session causes problems - the server drops the session due to the illegal request. This patch makes CIFS ACL operations protocol specific to fix that. Attempting to query/set CIFS ACLs for SMB2 will now return EOPNOTSUPP (until we add worker routines for sending query ACL requests via SMB2) instead of sending invalid (cifs) requests. A separate followon patch will be needed to fix cifs_acl_to_fattr (which takes a cifs specific u16 fid so can't be abstracted to work with SMB2 until that is changed) and will be needed to fix mount problems when "cifsacl" is specified on mount with e.g. vers=2.1 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22Add protocol specific operation for CIFS xattrsSteve French
commit d979f3b0a1f0b5499ab85e68cdf02b56852918b6 upstream. Changeset 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 added protocol operations for get/setxattr to avoid calling cifs operations on smb2/smb3 mounts for xattr operations and this changeset adds the calls to cifs specific protocol operations for xattrs (in order to reenable cifs support for xattrs which was temporarily disabled by the previous changeset. We do not have SMB2/SMB3 worker function for setting xattrs yet so this only enables it for cifs. CCing stable since without these two small changsets (its small coreq 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 is also needed) calling getfattr/setfattr on smb2/smb3 mounts causes problems. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22CIFS: Fix SMB2 mounts so they don't try to set or get xattrs via cifsSteve French
commit 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 upstream. When mounting with smb2 (or smb2.1 or smb3) we need to check to make sure that attempts to query or set extended attributes do not attempt to send the request with the older cifs protocol instead (eventually we also need to add the support in SMB2 to query/set extended attributes but this patch prevents us from using the wrong protocol for extended attribute operations). Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22mm: fix page leak at nfs_symlink()Rafael Aquini
commit a0b54adda3fe4b4cc6d28f2a9217cd35d1aa888c upstream. Changes in commit a0b8cab3b9b2 ("mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API") have introduced a call to add_to_page_cache_lru() which causes a leak in nfs_symlink() as now the page gets an extra refcount that is not dropped. Jan Stancek observed and reported the leak effect while running test8 from Connectathon Testsuite. After several iterations over the test case, which creates several symlinks on a NFS mountpoint, the test system was quickly getting into an out-of-memory scenario. This patch fixes the page leak by dropping that extra refcount add_to_page_cache_lru() is grabbing. Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Jeff Layton <jlayton@redhat.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmemEric W. Biederman
commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream. Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order > 0 allocations can fail when there are copious amounts of free memory available. Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cong Wang <cwang@twopensource.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irqKOSAKI Motohiro
commit 227d53b397a32a7614667b3ecaf1d89902fb6c12 upstream. To use spin_{un}lock_irq is dangerous if caller disabled interrupt. During aio buffer migration, we have a possibility to see the following call stack. aio_migratepage [disable interrupt] migrate_page_copy clear_page_dirty_for_io set_page_dirty __set_page_dirty_buffers __set_page_dirty spin_lock_irq This mean, current aio migration is a deadlockable. spin_lock_irqsave is a safer alternative and we should use it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Rientjes rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20NFSv4: Fix memory corruption in nfs4_proc_open_confirmTrond Myklebust
commit 17ead6c85c3d0ef57a14d1373f1f1cee2ce60ea8 upstream. nfs41_wake_and_assign_slot() relies on the task->tk_msg.rpc_argp and task->tk_msg.rpc_resp always pointing to the session sequence arguments. nfs4_proc_open_confirm tries to pull a fast one by reusing the open sequence structure, thus causing corruption of the NFSv4 slot table. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueueTrond Myklebust
commit 20b9a9024540a775395d5d1f41eec0ec6ec41f9b upstream. There may still be timers active on the session waitqueues. Make sure that we kill them before freeing the memory. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20Btrfs: disable snapshot aware defrag for nowJosef Bacik
commit 8101c8dbf6243ba517aab58d69bf1bc37d8b7b9c upstream. It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>