linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2011-10-03	writeback: update dirtied_when for synced inode to prevent livelock	Wu Fengguang
	commit 94c3dcbb0b0cdfd82cedd21705424d8044edc42c upstream. Explicitly update .dirtied_when on synced inodes, so that they are no longer considered for writeback in the next round. It can prevent both of the following livelock schemes: - while true; do echo data >> f; done - while true; do touch f; done (in theory) The exact livelock condition is, during sync(1): (1) no new inodes are dirtied (2) an inode being actively dirtied On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX. When finished, it will be redirty_tail()ed because it's still dirty and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when on condition (1). The sync work will then revisit it on the next queue_io() and find it eligible again because its old ->dirtied_when predates the sync work start time. We'll do more aggressive "keep writeback as long as we wrote something" logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will no longer be enough to stop sync livelock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage	Wu Fengguang
	commit 6e6938b6d3130305a5960c86b1a9b21e58cf6144 upstream. sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit f446daaea9, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	teach /proc/$pid/numa_maps about transparent hugepages	Dave Hansen
	commit 32ef43848f283e0ef945d3c67e851c143fea3970 upstream. This is modeled after the smaps code. It detects transparent hugepages and then does a single gather_stats() for the page as a whole. This has two benifits: 1. It is more efficient since it does many pages in a single shot. 2. It does not have to break down the huge page. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	break out numa_maps gather_pte_stats() checks	Dave Hansen
	commit 3200a8aaab0c9ccdc0f59b0dac2d4a47029137fa upstream. gather_pte_stats() does a number of checks on a target page to see whether it should even be considered for statistics. This breaks that code out in to a separate function so that we can use it in the transparent hugepage case in the next patch. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	make /proc/$pid/numa_maps gather_stats() take variable page size	Dave Hansen
	commit eb4866d0066ffd5446751c102d64feb3318d8bd1 upstream. We need to teach the numa_maps code about transparent huge pages. The first step is to teach gather_stats() that the pte it is dealing with might represent more than one page. Note that will we use this in a moment for transparent huge pages since they have use a single pmd_t which _acts_ as a "surrogate" for a bunch of smaller pte_t's. I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to figure out how many _bytes_ "dirty=1" means, you must first know the hugetlbfs page size. That's easier said than done especially if you don't have visibility in to the mount. But, that's probably a discussion for another day especially since it would change behavior to fix it. But, just in case anyone wonders why this patch only passes a '1' in the hugetlb case... Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	Fix the conflict between rwpidforward and rw mount options	Steve French
	commit c9c7fa0064f4afe1d040e72f24c2256dd8ac402d upstream. Both these options are started with "rw" - that's why the first one isn't switched on even if it is specified. Fix this by adding a length check for "rw" option check. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	cifs: fix possible memory corruption in CIFSFindNext	Jeff Layton
	commit 9438fabb73eb48055b58b89fc51e0bc4db22fabd upstream. The name_len variable in CIFSFindNext is a signed int that gets set to the resume_name_len in the cifs_search_info. The resume_name_len however is unsigned and for some infolevels is populated directly from a 32 bit value sent by the server. If the server sends a very large value for this, then that value could look negative when converted to a signed int. That would make that value pass the PATH_MAX check later in CIFSFindNext. The name_len would then be used as a length value for a memcpy. It would then be treated as unsigned again, and the memcpy scribbles over a ton of memory. Fix this by making the name_len an unsigned value in CIFSFindNext. Reported-by: Darren Lavender <dcl@hppine99.gbr.hp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	restore pinning the victim dentry in vfs_rmdir()/vfs_rename_dir()	Al Viro
	commit 1d2ef5901483004d74947bbf78d5146c24038fe7 upstream. We used to get the victim pinned by dentry_unhash() prior to commit 64252c75a219 ("vfs: remove dget() from dentry_unhash()") and ->rmdir() and ->rename() instances relied on that; most of them don't care, but ones that used d_delete() themselves do. As the result, we are getting rmdir() oopses on NFS now. Just grab the reference before locking the victim and drop it explicitly after unlocking, same as vfs_rename_other() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Use protocol-defined value for lock/getlock 'type' field.	Jim Garlick
	commit 51b8b4fb32271d39fbdd760397406177b2b0fd36 upstream. Signed-off-by: Jim Garlick <garlick@llnl.gov> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Always ask new inode in lookup for cache mode disabled	Aneesh Kumar K.V
	commit 73f507171cfa407b19f254aef95cbb058c8180cf upstream. This make sure we don't end up reusing the unlinked inode object. The ideal way is to use inode i_generation. But i_generation is not available in userspace always. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Add OS dependent open flags in 9p protocol	Aneesh Kumar K.V
	commit f88657ce3f9713a0c62101dffb0e972a979e77b9 upstream. Some of the flags are OS/arch dependent we add a 9p protocol value which maps to asm-generic/fcntl.h values in Linux Based on the original patch from Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> [extra comments from author as to why this needs to go to stable: Earlier for different operation such as open we used the values of open flag as defined by the OS. But some of these flags such as O_DIRECT are arch dependent. So if we have the 9p client and server running on different architectures, we end up with client sending client architecture value of these open flag and server will try to map these values to what its architecture states. For ex: O_DIRECT on a x86 client maps to #define O_DIRECT 00040000 Where as on sparc server it will maps to #define O_DIRECT 0x100000 Hence we need to map these open flags to OS/arch independent flag values. Getting these changes to an early version of kernel ensures us that we work with different combination of client and server. We should ideally backport this patch to all possible kernel version.] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Harsh Prateek Bora <harsh@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Don't update file type when updating file attributes	Aneesh Kumar K.V
	commit 45089142b1497dab2327d60f6c71c40766fc3ea4 upstream. We should only update attributes that we can change on stat2inode. Also do file type initialization in v9fs_init_inode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Add fid before dentry instantiation	Aneesh Kumar K.V
	commit 5441ae5eb3614d3c28f77073370738a2820c88e4 upstream. d_instantiate marks the dentry positive. So a parallel lookup and mkdir of the directory can find dentry that doesn't have fid attached. This can result in both the code path doing v9fs_fid_add which results in v9fs_dentry leak. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	9p: close ACL leaks	Al Viro
	commit 1ec95bf34d976b38897d1977b155a544d77b05e7 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Always ask new inode in create	Aneesh Kumar K.V
	commit ed80fcfac2565fa866d93ba14f0e75de17a8223e upstream. This make sure we don't end up reusing the unlinked inode object. The ideal way is to use inode i_generation. But i_generation is not available in userspace always. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: Fix invalid mount options/args	Prem Karat
	commit a2dd43bb0d7b9ce28f8a39254c25840c0730498e upstream. Without this fix, if any invalid mount options/args are passed while mouting the 9p fs, no error (-EINVAL) is returned and default arg value is assigned. This fix returns -EINVAL when an invalid arguement is found while parsing mount options. Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	fs/9p: When doing inode lookup compare qid details and inode mode bits.	Aneesh Kumar K.V
	commit fd2421f54423f307ecd31bdebdca6bc317e0c492 upstream. This make sure we don't use wrong inode from the inode hash. The inode number of the file deleted is reused by the next file system object created and if we only use inode number for inode hash lookup we could end up with wrong struct inode. Also compare inode generation number. Not all Linux file system provide st_gen in userspace. So it could be 0; Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-03	Avoid dereferencing a 'request_queue' after last close.	NeilBrown
	commit 94007751bb02797ba87bac7aacee2731ac2039a3 upstream. On the last close of an 'md' device which as been stopped, the device is destroyed and in particular the request_queue is freed. The free is done in a separate thread so it might happen a short time later. __blkdev_put calls bdev_inode_switch_bdi after ->release has been called. Since commit f758eeabeb96f878c860e8f110f94ec8820822a9 bdev_inode_switch_bdi will dereference the 'old' bdi, which lives inside a request_queue, to get a spin lock. This causes the last close on an md device to sometime take a spin_lock which lives in freed memory - which results in an oops. So move the called to bdev_inode_switch_bdi before the call to ->release. Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message	Miklos Szeredi
	commit c2183d1e9b3f313dd8ba2b1b0197c8d9fb86a7ae upstream. FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the message processing could overrun and result in a "kernel BUG at fs/fuse/dev.c:629!" Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	ext4: fix nomblk_io_submit option so it correctly converts uninit blocks	Theodore Ts'o
	commit 9dd75f1f1a02d656a11a7b9b9e6c2759b9c1e946 upstream. Bug discovered by Jan Kara: Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the old IO submission code but apparently it forgot to return the old handling of uninitialized buffers so we unconditionnaly call block_write_full_page() without specifying end_io function. So AFAICS we never convert unwritten extents to written in some cases. For example when I mount the fs as: mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do int fd = open(argv[1], O_RDWR \| O_CREAT \| O_TRUNC, 0600); char buf[1024]; memset(buf, 'a', sizeof(buf)); fallocate(fd, 0, 0, 16384); write(fd, buf, sizeof(buf)); I get a file full of zeros (after remounting the filesystem so that pagecache is dropped) instead of seeing the first KB contain 'a's. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.	Tao Ma
	commit 32c80b32c053dc52712dedac5e4d0aa7c93fc353 upstream. EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clear the flag and decrease the counter in the same time. We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so it will go nagative and causes some process to wait forever. Part of the patch came from Eric in his e-mail, but it doesn't fix the problem met by Michael actually. http://marc.info/?l=linux-ext4&m=131316851417460&w=2 Reported-and-Tested-by: Michael Tokarev<mjt@tls.msk.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode	Jiaying Zhang
	commit 2581fdc810889fdea97689cb62481201d579c796 upstream. Flush inode's i_completed_io_list before calling ext4_io_wait to prevent the following deadlock scenario: A page fault happens while some process is writing inode A. During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still hold before the page fault happened, we enter a deadlock. Also moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion, ext4_evict_inode() is called before ext4_destroy_inode() and in ext4_evict_inode(), we may call ext4_truncate() without holding i_mutex lock. As a result, there is a race between flush_completed_IO that is called from ext4_ext_truncate() and ext4_end_io_work, which may cause corruption on an io_end structure. This change moves ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode() to ext4_evict_inode() to resolve the race between ext4_truncate() and ext4_end_io_work during inode deletion. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	ext4: Fix ext4_should_writeback_data() for no-journal mode	Curt Wohlgemuth
	commit 441c850857148935babe000fc2ba1455fe54a6a9 upstream. ext4_should_writeback_data() had an incorrect sequence of tests to determine if it should return 0 or 1: in particular, even in no-journal mode, 0 was being returned for a non-regular-file inode. This meant that, in non-journal mode, we would use ext4_journalled_aops for directories, symlinks, and other non-regular files. However, calling journalled aop callbacks when there is no valid handle, can cause problems. This would cause a kernel crash with Jan Kara's commit 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data"), because we now dereference 'handle' in ext4_journalled_write_end(). I also added BUG_ONs to check for a valid handle in the obviously journal-only aops callbacks. I tested this running xfstests with a scratch device in these modes: - no-journal - data=ordered - data=writeback - data=journal All work fine; the data=journal run has many failures and a crash in xfstests 074, but this is no different from a vanilla kernel. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	Btrfs: fix an oops of log replay	liubo
	commit 34f3e4f23ca3d259fe078f62a128d97ca83508ef upstream. When btrfs recovers from a crash, it may hit the oops below: ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:4580! [...] RIP: 0010:[<ffffffffa03df251>] [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs] [...] Call Trace: [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs] [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs] [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs] [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs] [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs] [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs] [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs] [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs] [...] This comes from that while replaying an inode ref item, we forget to check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree, then we will come to conflict corners which lead to BUG_ON(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Tested-by: Andy Lutomirski <luto@mit.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	Btrfs: detect wether a device supports discard	Josef Bacik
	commit d5e2003c2bcda93a8f2e668eb4642d70c9c38301 upstream. We have a problem where if a user specifies discard but doesn't actually support it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem because this gets called (in a fashion) from the tree log recovery code, which has a nice little BUG_ON(ret) after it, which causes us to fail the tree log replay. So instead detect wether our devices support discard when we're adding them and then don't issue discards if we know that the device doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard just in case we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets	Trond Myklebust
	commit 910ac68a2b80c7de95bc8488734067b1bb15d583 upstream. If the client is in the process of resetting the session when it receives a callback, then returning NFS4ERR_DELAY may cause a deadlock with the DESTROY_SESSION call. Basically, if the client returns NFS4ERR_DELAY in response to the CB_SEQUENCE call, then the server is entitled to believe that the client is busy because it is already processing that call. In that case, the server is perfectly entitled to respond with a NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call. Fix this by having the client reply with a NFS4ERR_BADSESSION in response to the callback if it is resetting the session. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	NFSv4.1: Fix the callback 'highest_used_slotid' behaviour	Trond Myklebust
	commit 55a673990ec04cf63005318bcf08c2b0046e5778 upstream. Currently, there is no guarantee that we will call nfs4_cb_take_slot() even though nfs4_callback_compound() will consistently call nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field. The result is that we can trigger the BUG_ON() upon the next call to nfs4_cb_take_slot(). This patch fixes the above problem by using the slot id that was taken in the CB_SEQUENCE operation as a flag for whether or not we need to call nfs4_cb_free_slot(). It also fixes an atomicity problem: we need to set tbl->highest_used_slotid atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up racing with the various tests in nfs4_begin_drain_session(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	pnfs-obj: Bug when we are running out of bio	Boaz Harrosh
	commit 20618b21da0796115e81906d24ff1601552701b7 upstream. When we have a situation that the number of pages we want to encode is bigger then the size of the bio. (Which can currently happen only when all IO is going to a single device .e.g group_width==1) then the IO is submitted short and we report back only the amount of bytes we actually wrote/read and all is fine. BUT ... There was a bug that the current length counter was advanced before the fail to add the extra page, and we come to a situation that the CDB length was one-page longer then the actual bio size, which is of course rejected by the osd-target. While here also fix the bio size calculation, in the case that we received more then one group of devices. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	pnfs-obj: Fix the comp_index != 0 case	Boaz Harrosh
	commit 9af7db3228acc286c50e3a0f054ec982efdbc6c6 upstream. There were bugs in the case of partial layout where olo_comp_index is not zero. This used to work and was tested but one of the later cleanup SQUASHMEs broke it and was not tested since. Also add a dprint that specify those received layout parameters. Everything else was already printed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	possible memory corruption on mount	Steve French
	commit 13589c437daf4c8e429b3236c0b923de1c9420d8 upstream. CIFS cleanup_volume_info_contents() looks like having a memory corruption problem. When UNCip is set to "&vol->UNC[2]" in cifs_parse_mount_options(), it should not be kfree()-ed in cleanup_volume_info_contents(). Introduced in commit b946845a9dc523c759cae2b6a0f6827486c3221a Signed-off-by: J.R. Okajima <hooanon05@yahoo.co.jp> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	befs: Validate length of long symbolic links.	Timo Warns
	commit 338d0f0a6fbc82407864606f5b64b75aeb3c70f2 upstream. Signed-off-by: Timo Warns <warns@pre-sense.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29	cifs: demote cERROR in build_path_from_dentry to cFYI	Jeff Layton
	commit fa71f447065f676157ba6a2c121ba419818fc559 upstream. Running the cthon tests on a recent kernel caused this message to pop occasionally: CIFS VFS: did not end path lookup where expected namelen is 0 Some added debugging showed that namelen and dfsplen were both 0 when this occurred. That means that the read_seqretry returned true. Assuming that the comment inside the if statement is true, this should be harmless and just means that we raced with a rename. If that is the case, then there's no need for alarm and we can demote this to cFYI. While we're at it, print the dfsplen too so that we can see what happened here if the message pops during debugging. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	ext4: Properly count journal credits for long symlinks	Eric Sandeen
	commit 8c20871998c082f6fbc963f1449a5ba5140ee39a upstream. Commit df5e6223407e ("ext4: fix deadlock in ext4_symlink() in ENOSPC conditions") recalculated the number of credits needed for a long symlink, in the process of splitting it into two transactions. However, the first credit calculation under-counted because if selinux is enabled, credits are needed to create the selinux xattr as well. Overrunning the reservation will result in an OOPS in jbd2_journal_dirty_metadata() due to this assert: J_ASSERT_JH(jh, handle->h_buffer_credits > 0); Fix this by increasing the reservation size. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	ext3: Properly count journal credits for long symlinks	Eric Sandeen
	commit d2db60df1e7eb39cf0f378dfc4dd8813666d46ef upstream. Commit ae54870a1dc9 ("ext3: Fix lock inversion in ext3_symlink()") recalculated the number of credits needed for a long symlink, in the process of splitting it into two transactions. However, the first credit calculation under-counted because if selinux is enabled, credits are needed to create the selinux xattr as well. Overrunning the reservation will result in an OOPS in journal_dirty_metadata() due to this assert: J_ASSERT_JH(jh, handle->h_buffer_credits > 0); Fix this by increasing the reservation size. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	eCryptfs: Return error when lower file pointer is NULL	Tyler Hicks
	commit f61500e000eedc0c7a0201200a7f00ba5529c002 upstream. When an eCryptfs inode's lower file has been closed, and the pointer has been set to NULL, return an error when trying to do a lower read or write rather than calling BUG(). https://bugzilla.kernel.org/show_bug.cgi?id=37292 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	Ecryptfs: Add mount option to check uid of device being mounted = expect uid	John Johansen
	commit 764355487ea220fdc2faf128d577d7f679b91f97 upstream. Close a TOCTOU race for mounts done via ecryptfs-mount-private. The mount source (device) can be raced when the ownership test is done in userspace. Provide Ecryptfs a means to force the uid check at mount time. Signed-off-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	cifs: convert prefixpath delimiters in cifs_build_path_to_root	Jeff Layton
	commit f9e8c45002cacad536b338dfa9e910e341a49c31 upstream. Regression from 2.6.39... The delimiters in the prefixpath are not being converted based on whether posix paths are in effect. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=727834 Reported-and-Tested-by: Iain Arnell <iarnell@gmail.com> Reported-by: Patrick Oltmann <patrick.oltmann@gmx.net> Cc: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	cifs: cope with negative dentries in cifs_get_root	Jeff Layton
	commit 80975d21aae2136ccae1ce914a1602dc1d8b0795 upstream. The loop around lookup_one_len doesn't handle the case where it might return a negative dentry, which can cause an oops on the next pass through the loop. Check for that and break out of the loop with an error of -ENOENT if there is one. Fixes the panic reported here: https://bugzilla.redhat.com/show_bug.cgi?id=727927 Reported-by: TR Bentley <home@trarbentley.net> Reported-by: Iain Arnell <iarnell@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15	CIFS: Fix missing a decrement of inFlight value	Pavel Shilovsky
	commit 0193e072268fe62c4b19ad4b05cd0d4b23c43bb9 upstream. if we failed on getting mid entry in cifs_call_async. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	GFS2: Fix mount hang caused by certain access pattern to sysfs files	Steven Whitehouse
	commit 19237039919088781b4191a00bdc1284d8fea1dd upstream. Depending upon the order of userspace/kernel during the mount process, this can result in a hang without the _all version of the completion. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	proc: fix a race in do_io_accounting()	Vasiliy Kulikov
	commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0 upstream. If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges. Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise. Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race. The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand(). Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	NFS: Fix spurious readdir cookie loop messages	Trond Myklebust
	commit 0c0308066ca53fdf1423895f3a42838b67b3a5a8 upstream. If the directory contents change, then we have to accept that the file->f_pos value may shrink if we do a 'search-by-cookie'. In that case, we should turn off the loop detection and let the NFS client try to recover. The patch also fixes a second loop detection bug by ensuring that after turning on the ctx->duped flag, we read at least one new cookie into ctx->dir_cookie before attempting to match with ctx->dup_cookie. Reported-by: Petr Vandrovec <petr@vandrovec.name> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation()	Trond Myklebust
	commit ed1e6211a0a134ff23592c6f057af982ad5dab52 upstream. nfs_mark_return_delegation() is usually called without any locking, and so it is not safe to dereference delegation->inode. Since the inode is only used to discover the nfs_client anyway, it makes more sense to have the callers pass a valid pointer to the nfs_server as a parameter. Reported-by: Ian Kent <raven@themaw.net> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	nfsd4: fix file leak on open_downgrade	J. Bruce Fields
	commit f197c27196a5e7631b89e2e92daa096fcf7c302c upstream. Stateid's hold a read reference for a read open, a write reference for a write open, and an additional one of each for each read+write open. The latter wasn't getting put on a downgrade, so something like: open RW open R downgrade to R was resulting in a file leak. Also fix an imbalance in an error path. Regression from 7d94784293096c0a46897acdb83be5abd9278ece "nfsd4: fix downgrade/lock logic". Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	nfsd4: remember to put RW access on stateid destruction	J. Bruce Fields
	commit 499f3edc23ca0431f3a0a6736b3a40944c81bf3b upstream. Without this, for example, open read open read+write close will result in a struct file leak. Regression from 7d94784293096c0a46897acdb83be5abd9278ece "nfsd4: fix downgrade/lock logic". Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	nfsd: don't break lease on CLAIM_DELEGATE_CUR	Casey Bodley
	commit 0c12eaffdf09466f36a9ffe970dda8f4aeb6efc0 upstream. CLAIM_DELEGATE_CUR is used in response to a broken lease; allowing it to break the lease and return EAGAIN leaves the client unable to make progress in returning the delegation nfs4_get_vfs_file() now takes struct nfsd4_open for access to the claim type, and calls nfsd_open() with NFSD_MAY_NOT_BREAK_LEASE when claim type is CLAIM_DELEGATE_CUR Signed-off-by: Casey Bodley <cbodley@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	eCryptfs: Unlock keys needed by ecryptfsd	Tyler Hicks
	commit b2987a5e05ec7a1af7ca42e5d5349d7a22753031 upstream. Fixes a regression caused by b5695d04634fa4ccca7dcbc05bb4a66522f02e0b Kernel keyring keys containing eCryptfs authentication tokens should not be write locked when calling out to ecryptfsd to wrap and unwrap file encryption keys. The eCryptfs kernel code can not hold the key's write lock because ecryptfsd needs to request the key after receiving such a request from the kernel. Without this fix, all file opens and creates will timeout and fail when using the eCryptfs PKI infrastructure. This is not an issue when using passphrase-based mount keys, which is the most widely deployed eCryptfs configuration. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Roberto Sassu <roberto.sassu@polito.it> Tested-by: Roberto Sassu <roberto.sassu@polito.it> Tested-by: Alexis Hafner1 <haf@zurich.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	ecryptfs: Make inode bdi consistent with superblock bdi	Thieu Le
	commit 985ca0e626e195ea08a1a82b8dbeb6719747429a upstream. Make the inode mapping bdi consistent with the superblock bdi so that dirty pages are flushed properly. Signed-off-by: Thieu Le <thieule@chromium.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	ext3: Fix oops in ext3_try_to_allocate_with_rsv()	Jan Kara
	commit ad95c5e9bc8b5885f94dce720137cac8fa8da4c9 upstream. Block allocation is called from two places: ext3_get_blocks_handle() and ext3_xattr_block_set(). These two callers are not necessarily synchronized because xattr code holds only xattr_sem and i_mutex, and ext3_get_blocks_handle() may hold only truncate_mutex when called from writepage() path. Block reservation code does not expect two concurrent allocations to happen to the same inode and thus assertions can be triggered or reservation structure corruption can occur. Fix the problem by taking truncate_mutex in xattr code to serialize allocations. CC: Sage Weil <sage@newdream.net> Reported-by: Fyodor Ustinov <ufm@ufm.su> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04	ext4: free allocated and pre-allocated blocks when check_eofblocks_fl fails	Jiaying Zhang
	commit 575a1d4bdfa2ea9fc10733013136145b497e1be0 upstream. Upon corrupted inode or disk failures, we may fail after we already allocate some blocks from the inode or take some blocks from the inode's preallocation list, but before we successfully insert the corresponding extent to the extent tree. In this case, we should free any allocated blocks and discard the inode's preallocated blocks because the entries in the inode's preallocation list may be in an inconsistent state. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>