linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2014-04-26	jffs2: remove from wait queue after schedule()	Li Zefan
	commit 3ead9578443b66ddb3d50ed4f53af8a0c0298ec5 upstream. @wait is a local variable, so if we don't remove it from the wait queue list, later wake_up() may end up accessing invalid memory. This was spotted by eyes. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	jffs2: avoid soft-lockup in jffs2_reserve_space_gc()	Li Zefan
	commit 13b546d96207c131eeae15dc7b26c6e7d0f1cad7 upstream. We triggered soft-lockup under stress test on 2.6.34 kernel. BUG: soft lockup - CPU#1 stuck for 60009ms! [lockf2.test:14488] ... [<bf09a4d4>] (jffs2_do_reserve_space+0x420/0x440 [jffs2]) [<bf09a528>] (jffs2_reserve_space_gc+0x34/0x78 [jffs2]) [<bf0a1350>] (jffs2_garbage_collect_dnode.isra.3+0x264/0x478 [jffs2]) [<bf0a2078>] (jffs2_garbage_collect_pass+0x9c0/0xe4c [jffs2]) [<bf09a670>] (jffs2_reserve_space+0x104/0x2a8 [jffs2]) [<bf09dc48>] (jffs2_write_inode_range+0x5c/0x4d4 [jffs2]) [<bf097d8c>] (jffs2_write_end+0x198/0x2c0 [jffs2]) [<c00e00a4>] (generic_file_buffered_write+0x158/0x200) [<c00e14f4>] (__generic_file_aio_write+0x3a4/0x414) [<c00e15c0>] (generic_file_aio_write+0x5c/0xbc) [<c012334c>] (do_sync_write+0x98/0xd4) [<c0123a84>] (vfs_write+0xa8/0x150) [<c0123d74>] (sys_write+0x3c/0xc0)] Fix this by adding a cond_resched() in the while loop. [akpm@linux-foundation.org: don't initialize `ret'] Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	jffs2: Fix crash due to truncation of csize	Ajesh Kunhipurayil Vijayan
	commit 41bf1a24c1001f4d0d41a78e1ac575d2f14789d7 upstream. mounting JFFS2 partition sometimes crashes with this call trace: [ 1322.240000] Kernel bug detected[#1]: [ 1322.244000] Cpu 2 [ 1322.244000] $ 0 : 0000000000000000 0000000000000018 000000003ff00070 0000000000000001 [ 1322.252000] $ 4 : 0000000000000000 c0000000f3980150 0000000000000000 0000000000010000 [ 1322.260000] $ 8 : ffffffffc09cd5f8 0000000000000001 0000000000000088 c0000000ed300de8 [ 1322.268000] $12 : e5e19d9c5f613a45 ffffffffc046d464 0000000000000000 66227ba5ea67b74e [ 1322.276000] $16 : c0000000f1769c00 c0000000ed1e0200 c0000000f3980150 0000000000000000 [ 1322.284000] $20 : c0000000f3a80000 00000000fffffffc c0000000ed2cfbd8 c0000000f39818f0 [ 1322.292000] $24 : 0000000000000004 0000000000000000 [ 1322.300000] $28 : c0000000ed2c0000 c0000000ed2cfab8 0000000000010000 ffffffffc039c0b0 [ 1322.308000] Hi : 000000000000023c [ 1322.312000] Lo : 000000000003f802 [ 1322.316000] epc : ffffffffc039a9f8 check_tn_node+0x88/0x3b0 [ 1322.320000] Not tainted [ 1322.324000] ra : ffffffffc039c0b0 jffs2_do_read_inode_internal+0x1250/0x1e48 [ 1322.332000] Status: 5400f8e3 KX SX UX KERNEL EXL IE [ 1322.336000] Cause : 00800034 [ 1322.340000] PrId : 000c1004 (Netlogic XLP) [ 1322.344000] Modules linked in: [ 1322.348000] Process jffs2_gcd_mtd7 (pid: 264, threadinfo=c0000000ed2c0000, task=c0000000f0e68dd8, tls=0000000000000000) [ 1322.356000] Stack : c0000000f1769e30 c0000000ed010780 c0000000ed010780 c0000000ed300000 c0000000f1769c00 c0000000f3980150 c0000000f3a80000 00000000fffffffc c0000000ed2cfbd8 ffffffffc039c0b0 ffffffffc09c6340 0000000000001000 0000000000000dec ffffffffc016c9d8 c0000000f39805a0 c0000000f3980180 0000008600000000 0000000000000000 0000000000000000 0000000000000000 0001000000000dec c0000000f1769d98 c0000000ed2cfb18 0000000000010000 0000000000010000 0000000000000044 c0000000f3a80000 c0000000f1769c00 c0000000f3d207a8 c0000000f1769d98 c0000000f1769de0 ffffffffc076f9c0 0000000000000009 0000000000000000 0000000000000000 ffffffffc039cf90 0000000000000017 ffffffffc013fbdc 0000000000000001 000000010003e61c ... [ 1322.424000] Call Trace: [ 1322.428000] [<ffffffffc039a9f8>] check_tn_node+0x88/0x3b0 [ 1322.432000] [<ffffffffc039c0b0>] jffs2_do_read_inode_internal+0x1250/0x1e48 [ 1322.440000] [<ffffffffc039cf90>] jffs2_do_crccheck_inode+0x70/0xd0 [ 1322.448000] [<ffffffffc03a1b80>] jffs2_garbage_collect_pass+0x160/0x870 [ 1322.452000] [<ffffffffc03a392c>] jffs2_garbage_collect_thread+0xdc/0x1f0 [ 1322.460000] [<ffffffffc01541c8>] kthread+0xb8/0xc0 [ 1322.464000] [<ffffffffc0106d18>] kernel_thread_helper+0x10/0x18 [ 1322.472000] [ 1322.472000] Code: 67bd0050 94a4002c 2c830001 <00038036> de050218 2403fffc 0080a82d 00431824 24630044 [ 1322.480000] ---[ end trace b052bb90e97dfbf5 ]--- The variable csize in structure jffs2_tmp_dnode_info is of type uint16_t, but it is used to hold the compressed data length(csize) which is declared as uint32_t. So, when the value of csize exceeds 16bits, it gets truncated when assigned to tn->csize. This is causing a kernel BUG. Changing the definition of csize in jffs2_tmp_dnode_info to uint32_t fixes the issue. Signed-off-by: Ajesh Kunhipurayil Vijayan <ajesh@broadcom.com> Signed-off-by: Kamlakant Patel <kamlakant.patel@broadcom.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	jffs2: Fix segmentation fault found in stress test	Kamlakant Patel
	commit 3367da5610c50e6b83f86d366d72b41b350b06a2 upstream. Creating a large file on a JFFS2 partition sometimes crashes with this call trace: [ 306.476000] CPU 13 Unable to handle kernel paging request at virtual address c0000000dfff8002, epc == ffffffffc03a80a8, ra == ffffffffc03a8044 [ 306.488000] Oops[#1]: [ 306.488000] Cpu 13 [ 306.492000] $ 0 : 0000000000000000 0000000000000000 0000000000008008 0000000000008007 [ 306.500000] $ 4 : c0000000dfff8002 000000000000009f c0000000e0007cde c0000000ee95fa58 [ 306.508000] $ 8 : 0000000000000001 0000000000008008 0000000000010000 ffffffffffff8002 [ 306.516000] $12 : 0000000000007fa9 000000000000ff0e 000000000000ff0f 80e55930aebb92bb [ 306.524000] $16 : c0000000e0000000 c0000000ee95fa5c c0000000efc80000 ffffffffc09edd70 [ 306.532000] $20 : ffffffffc2b60000 c0000000ee95fa58 0000000000000000 c0000000efc80000 [ 306.540000] $24 : 0000000000000000 0000000000000004 [ 306.548000] $28 : c0000000ee950000 c0000000ee95f738 0000000000000000 ffffffffc03a8044 [ 306.556000] Hi : 00000000000574a5 [ 306.560000] Lo : 6193b7a7e903d8c9 [ 306.564000] epc : ffffffffc03a80a8 jffs2_rtime_compress+0x98/0x198 [ 306.568000] Tainted: G W [ 306.572000] ra : ffffffffc03a8044 jffs2_rtime_compress+0x34/0x198 [ 306.580000] Status: 5000f8e3 KX SX UX KERNEL EXL IE [ 306.584000] Cause : 00800008 [ 306.588000] BadVA : c0000000dfff8002 [ 306.592000] PrId : 000c1100 (Netlogic XLP) [ 306.596000] Modules linked in: [ 306.596000] Process dd (pid: 170, threadinfo=c0000000ee950000, task=c0000000ee6e0858, tls=0000000000c47490) [ 306.608000] Stack : 7c547f377ddc7ee4 7ffc7f967f5d7fae 7f617f507fc37ff4 7e7d7f817f487f5f 7d8e7fec7ee87eb3 7e977ff27eec7f9e 7d677ec67f917f67 7f3d7e457f017ed7 7fd37f517f867eb2 7fed7fd17ca57e1d 7e5f7fe87f257f77 7fd77f0d7ede7fdb 7fba7fef7e197f99 7fde7fe07ee37eb5 7f5c7f8c7fc67f65 7f457fb87f847e93 7f737f3e7d137cd9 7f8e7e9c7fc47d25 7dbb7fac7fb67e52 7ff17f627da97f64 7f6b7df77ffa7ec5 80057ef17f357fb3 7f767fa27dfc7fd5 7fe37e8e7fd07e53 7e227fcf7efb7fa1 7f547e787fa87fcc 7fcb7fc57f5a7ffb 7fc07f6c7ea97e80 7e2d7ed17e587ee0 7fb17f9d7feb7f31 7f607e797e887faa 7f757fdd7c607ff3 7e877e657ef37fbd 7ec17fd67fe67ff7 7ff67f797ff87dc4 7eef7f3a7c337fa6 7fe57fc97ed87f4b 7ebe7f097f0b8003 7fe97e2a7d997cba 7f587f987f3c7fa9 ... [ 306.676000] Call Trace: [ 306.680000] [<ffffffffc03a80a8>] jffs2_rtime_compress+0x98/0x198 [ 306.684000] [<ffffffffc0394f10>] jffs2_selected_compress+0x110/0x230 [ 306.692000] [<ffffffffc039508c>] jffs2_compress+0x5c/0x388 [ 306.696000] [<ffffffffc039dc58>] jffs2_write_inode_range+0xd8/0x388 [ 306.704000] [<ffffffffc03971bc>] jffs2_write_end+0x16c/0x2d0 [ 306.708000] [<ffffffffc01d3d90>] generic_file_buffered_write+0xf8/0x2b8 [ 306.716000] [<ffffffffc01d4e7c>] __generic_file_aio_write+0x1ac/0x350 [ 306.720000] [<ffffffffc01d50a0>] generic_file_aio_write+0x80/0x168 [ 306.728000] [<ffffffffc021f7dc>] do_sync_write+0x94/0xf8 [ 306.732000] [<ffffffffc021ff6c>] vfs_write+0xa4/0x1a0 [ 306.736000] [<ffffffffc02202e8>] SyS_write+0x50/0x90 [ 306.744000] [<ffffffffc0116cc0>] handle_sys+0x180/0x1a0 [ 306.748000] [ 306.748000] Code: 020b202d 0205282d 90a50000 <90840000> 14a40038 00000000 0060602d 0000282d 016c5823 [ 306.760000] ---[ end trace 79dd088435be02d0 ]--- Segmentation fault This crash is caused because the 'positions' is declared as an array of signed short. The value of position is in the range 0..65535, and will be converted to a negative number when the position is greater than 32767 and causes a corruption and crash. Changing the definition to 'unsigned short' fixes this issue Signed-off-by: Jayachandran C <jchandra@broadcom.com> Signed-off-by: Kamlakant Patel <kamlakant.patel@broadcom.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	ext4: fix partial cluster handling for bigalloc file systems	Eric Whitney
	commit c06344939422bbd032ac967223a7863de57496b5 upstream. Commit 9cb00419fa, which enables hole punching for bigalloc file systems, exposed a bug introduced by commit 6ae06ff51e in an earlier release. When run on a bigalloc file system, xfstests generic/013, 068, 075, 083, 091, 100, 112, 127, 263, 269, and 270 fail with e2fsck errors or cause kernel error messages indicating that previously freed blocks are being freed again. The latter commit optimizes the selection of the starting extent in ext4_ext_rm_leaf() when hole punching by beginning with the extent supplied in the path argument rather than with the last extent in the leaf node (as is still done when truncating). However, the code in rm_leaf that initially sets partial_cluster to track cluster sharing on extent boundaries is only guaranteed to run if rm_leaf starts with the last node in the leaf. Consequently, partial_cluster is not correctly initialized when hole punching, and a cluster on the boundary of a punched region that should be retained may instead be deallocated. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	ext4: fix error return from ext4_ext_handle_uninitialized_extents()	Eric Whitney
	commit ce37c42919608e96ade3748fe23c3062a0a966c5 upstream. Commit 3779473246 breaks the return of error codes from ext4_ext_handle_uninitialized_extents() in ext4_ext_map_blocks(). A portion of the patch assigns that function's signed integer return value to an unsigned int. Consequently, negatively valued error codes are lost and can be treated as a bogus allocated block count. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	Btrfs: skip submitting barrier for missing device	Hidetoshi Seto
	commit f88ba6a2a44ee98e8d59654463dc157bb6d13c43 upstream. I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	bdi: avoid oops on device removal	Jan Kara
	commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream. After commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") when device is removed while we are writing to it we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing BDI_registered bit in bdi_unregister() and checking it before scheduling of any flushing work. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26	backing_dev: fix hung task on sync	Derek Basehore
	commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream. bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to writeback dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a work_queue. This fixes the problem by using queue_delayed_work() instead. This is a regression caused by commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue"). The reason that this causes a problem is that laptop-mode will change the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. We fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in future. This function doesn't change the timer if it is already armed. For the same reason, we also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()] Signed-off-by: Derek Basehore <dbasehore@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zento.linux.org.uk> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Benson Leung <bleung@chromium.org> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Luigi Semenzato <semenzato@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03	mm: close PageTail race	David Rientjes
	commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream. Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03	ext4: atomically set inode->i_flags in ext4_set_inode_flags()	Theodore Ts'o
	commit 00a1a053ebe5febcfc2ec498bd894f035ad2aa06 upstream. Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	bio-integrity: Fix bio_integrity_verify segment start bug	Nicholas Bellinger
	commit 5837c80e870bc3b12ac6a98cdc9ce7a9522a8fb6 upstream. This patch addresses a bug in bio_integrity_verify() code that has been causing DIF READ verify operations to be silently skipped. The issue is that bio->bi_idx will have been incremented within bio_advance() code in the normal blk_update_request() -> req_bio_endio() completion path, and bio_integrity_verify() is using bio_for_each_segment() which starts the bio segment walk at the current bio->bi_idx. So instead use bio_for_each_segment_all() to always start the bio segment walk from zero, regardless of the current bio->bi_idx value after bio_advance() has been called. (Context change for v3.10.y -> v3.13.y code - nab) Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	Fix mountpoint reference leakage in linkat	Oleg Drokin
	commit d22e6338db7f613dd4f6095c190682fcc519e4b7 upstream. Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	Btrfs: fix data corruption when reading/updating compressed extents	Filipe David Borba Manana
	commit a2aa75e18a21b21952dc6daa9bac7c9f4426f81f upstream. When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	fs/proc/base.c: fix GPF in /proc/$PID/map_files	Artem Fetishev
	commit 70335abb2689c8cd5df91bf2d95a65649addf50b upstream. The expected logic of proc_map_files_get_link() is either to return 0 and initialize 'path' or return an error and leave 'path' uninitialized. By the time dname_to_vma_addr() returns 0 the corresponding vma may have already be gone. In this case the path is not initialized but the return value is still 0. This results in 'general protection fault' inside d_path(). Steps to reproduce: CONFIG_CHECKPOINT_RESTORE=y fd = open(...); while (1) { mmap(fd, ...); munmap(fd, ...); } ls -la /proc/$PID/map_files Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991 Signed-off-by: Artem Fetishev <artem_fetishev@epam.com> Signed-off-by: Aleksandr Terekhov <aleksandr_terekhov@epam.com> Reported-by: <wiebittewas@gmail.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid	Trond Myklebust
	commit e1253be0ece1a95a02c7f5843194877471af8179 upstream. When nfs4_set_rw_stateid() can fails by returning EIO to indicate that the stateid is completely invalid, then it makes no sense to have it trigger a retry of the READ or WRITE operation. Instead, we should just have it fall through and attempt a recovery. This fixes an infinite loop in which the client keeps replaying the same bad stateid back to the server. Reported-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1393954269-3974-1-git-send-email-andros@netapp.com Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	NFS: Fix a delegation callback race	Trond Myklebust
	commit 755a48a7a4eb05b9c8424e3017d947b2961a60e0 upstream. The clean-up in commit 36281caa839f ended up removing a NULL pointer check that is needed in order to prevent an Oops in nfs_async_inode_return_delegation(). Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/5313E9F6.2020405@intel.com Fixes: 36281caa839f (NFSv4: Further clean-ups of delegation stateid validation) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	ocfs2 syncs the wrong range...	Al Viro
	commit 1b56e98990bcdbb20b9fab163654b9315bf158e8 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23	ocfs2: fix quota file corruption	Jan Kara
	commit 15c34a760630ca2c803848fba90ca0646a9907dd upstream. Global quota files are accessed from different nodes. Thus we cannot cache offset of quota structure in the quota file after we drop our node reference count to it because after that moment quota structure may be freed and reallocated elsewhere by a different node resulting in corruption of quota file. Fix the problem by clearing dq_off when we are releasing dquot structure. We also remove the DB_READ_B handling because it is useless - DQ_ACTIVE_B is set iff DQ_READ_B is set. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	quota: Fix race between dqput() and dquot_scan_active()	Jan Kara
	commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream. Currently last dqput() can race with dquot_scan_active() causing it to call callback for an already deactivated dquot. The race is as follows: CPU1 CPU2 dqput() spin_lock(&dq_list_lock); if (atomic_read(&dquot->dq_count) > 1) { - not taken if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); ->release_dquot(dquot); if (atomic_read(&dquot->dq_count) > 1) - not taken dquot_scan_active() spin_lock(&dq_list_lock); if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) - not taken atomic_inc(&dquot->dq_count); spin_unlock(&dq_list_lock); - proceeds to release dquot ret = fn(dquot, priv); - called for inactive dquot Fix the problem by making sure possible ->release_dquot() is finished by the time we call the callback and new calls to it will notice reference dquot_scan_active() has taken and bail out. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	fs: fix iversion handling	Christoph Hellwig
	commit dff6efc326a4d5f305797d4a6bba14f374fdd633 upstream. Currently notify_change directly updates i_version for size updates, which not only is counter to how all other fields are updated through struct iattr, but also breaks XFS, which need inode updates to happen under its own lock, and synchronized to the structure that gets written to the log. Remove the update in the common code, and it to btrfs and ext4, XFS already does a proper updaste internally and currently gets a double update with the existing code. IMHO this is 3.13 and -stable material and should go in through the XFS tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	CIFS: Fix too big maxBuf size for SMB3 mounts	Pavel Shilovsky
	commit 2365c4eaf077c48574ab6f143960048fc0f31518 upstream. SMB3 servers can respond with MaxTransactSize of more than 4M that can cause a memory allocation error returned from kmalloc in a lock codepath. Also the client doesn't support multicredit requests now and allows buffer sizes of 65536 bytes only. Set MaxTransactSize to this maximum supported value. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	cifs: ensure that uncached writes handle unmapped areas correctly	Jeff Layton
	commit 5d81de8e8667da7135d3a32a964087c0faf5483f upstream. It's possible for userland to pass down an iovec via writev() that has a bogus user pointer in it. If that happens and we're doing an uncached write, then we can end up getting less bytes than we expect from the call to iov_iter_copy_from_user. This is CVE-2014-0069 cifs_iovec_write isn't set up to handle that situation however. It'll blindly keep chugging through the page array and not filling those pages with anything useful. Worse yet, we'll later end up with a negative number in wdata->tailsz, which will confuse the sending routines and cause an oops at the very least. Fix this by having the copy phase of cifs_iovec_write stop copying data in this situation and send the last write as a short one. At the same time, we want to avoid sending a zero-length write to the server, so break out of the loop and set rc to -EFAULT if that happens. This also allows us to handle the case where no address in the iovec is valid. [Note: Marking this for stable on v3.4+ kernels, but kernels as old as v2.6.38 may have a similar problem and may need similar fix] Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	ext4: don't leave i_crtime.tv_sec uninitialized	Theodore Ts'o
	commit 19ea80603715d473600cd993b9987bc97d042e02 upstream. If the i_crtime field is not present in the inode, don't leave the field uninitialized. Fixes: ef7f38359 ("ext4: Add nanosecond timestamps") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	ext4: fix online resize with a non-standard blocks per group setting	Theodore Ts'o
	commit 3d2660d0c9c2f296837078c189b68a47f6b2e3b5 upstream. The set_flexbg_block_bitmap() function assumed that the number of blocks in a blockgroup was sb->blocksize * 8, which is normally true, but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block bitmap corruption after: mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jon Bernard <jbernard@tuxion.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	ext4: fix online resize with very large inode tables	Theodore Ts'o
	commit b93c95353413041a8cebad915a8109619f66bcc6 upstream. If a file system has a large number of inodes per block group, all of the metadata blocks in a flex_bg may be larger than what can fit in a single block group. Unfortunately, ext4_alloc_group_tables() in resize.c was never tested to see if it would handle this case correctly, and there were a large number of bugs which caused the following sequence to result in a BUG_ON: kernel bug at fs/ext4/resize.c:409! ... call trace: [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0 [<ffffffff811b9df2>] ? final_putname+0x22/0x50 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0 rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180 This can be reproduced with the following command sequence: mke2fs -t ext4 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G To fix this, we need to make sure the right thing happens when a block group's inode table straddles two block groups, which means the following bugs had to be fixed: 1) Not clearing the BLOCK_UNINIT flag in the second block group in ext4_alloc_group_tables --- the was proximate cause of the BUG_ON. 2) Incorrectly determining how many block groups contained contiguous free blocks in ext4_alloc_group_tables(). 3) Incorrectly setting the start of the next block range to be marked in use after a discontinuity in setup_new_flex_group_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	ext4: don't try to modify s_flags if the the file system is read-only	Theodore Ts'o
	commit 23301410972330c0ae9a8afc379ba2005e249cc6 upstream. If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06	ext4: fix error paths in swap_inode_boot_loader()	Zheng Liu
	commit 30d29b119ef01776e0a301444ab24defe8d8bef3 upstream. In swap_inode_boot_loader() we forgot to release ->i_mutex and resume unlocked dio for inode and inode_bl if there is an error starting the journal handle. This commit fixes this issue. Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22	block: Fix nr_vecs for inline integrity vectors	Martin K. Petersen
	commit 087787959ce851d7bbb19f10f6e9241b7f85a3ca upstream. Commit 9f060e2231ca changed the way we handle allocations for the integrity vectors. When the vectors are inline there is no associated slab and consequently bvec_nr_vecs() returns 0. Ensure that we check against BIP_INLINE_VECS in that case. Reported-by: David Milburn <dmilburn@redhat.com> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22	lockd: send correct lock when granting a delayed lock.	NeilBrown
	commit 2ec197db1a56c9269d75e965f14c344b58b2a4f6 upstream. If an NFS client attempts to get a lock (using NLM) and the lock is not available, the server will remember the request and when the lock becomes available it will send a GRANT request to the client to provide the lock. If the client already held an adjacent lock, the GRANT callback will report the union of the existing and new locks, which can confuse the client. This happens because __posix_lock_file (called by vfs_lock_file) updates the passed-in file_lock structure when adjacent or over-lapping locks are found. To avoid this problem we take a copy of the two fields that can be changed (fl_start and fl_end) before the call and restore them afterwards. An alternate would be to allocate a 'struct file_lock', initialise it, use locks_copy_lock() to take a copy, then locks_release_private() after the vfs_lock_file() call. But that is a lot more work. Reported-by: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -- v1 had a couple of issues (large on-stack struct and didn't really work properly). This version is much better tested. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-02-22	retrieving CIFS ACLs when mounted with SMB2 fails dropping session	Steve French
	commit 83e3bc23ef9ce7c03b7b4e5d3d790246ea59db3e upstream. The get/set ACL xattr support for CIFS ACLs attempts to send old cifs dialect protocol requests even when mounted with SMB2 or later dialects. Sending cifs requests on an smb2 session causes problems - the server drops the session due to the illegal request. This patch makes CIFS ACL operations protocol specific to fix that. Attempting to query/set CIFS ACLs for SMB2 will now return EOPNOTSUPP (until we add worker routines for sending query ACL requests via SMB2) instead of sending invalid (cifs) requests. A separate followon patch will be needed to fix cifs_acl_to_fattr (which takes a cifs specific u16 fid so can't be abstracted to work with SMB2 until that is changed) and will be needed to fix mount problems when "cifsacl" is specified on mount with e.g. vers=2.1 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22	Add protocol specific operation for CIFS xattrs	Steve French
	commit d979f3b0a1f0b5499ab85e68cdf02b56852918b6 upstream. Changeset 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 added protocol operations for get/setxattr to avoid calling cifs operations on smb2/smb3 mounts for xattr operations and this changeset adds the calls to cifs specific protocol operations for xattrs (in order to reenable cifs support for xattrs which was temporarily disabled by the previous changeset. We do not have SMB2/SMB3 worker function for setting xattrs yet so this only enables it for cifs. CCing stable since without these two small changsets (its small coreq 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 is also needed) calling getfattr/setfattr on smb2/smb3 mounts causes problems. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22	CIFS: Fix SMB2 mounts so they don't try to set or get xattrs via cifs	Steve French
	commit 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 upstream. When mounting with smb2 (or smb2.1 or smb3) we need to check to make sure that attempts to query or set extended attributes do not attempt to send the request with the older cifs protocol instead (eventually we also need to add the support in SMB2 to query/set extended attributes but this patch prevents us from using the wrong protocol for extended attribute operations). Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22	fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem	Eric W. Biederman
	commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream. Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order > 0 allocations can fail when there are copious amounts of free memory available. Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cong Wang <cwang@twopensource.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20	mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq	KOSAKI Motohiro
	commit 227d53b397a32a7614667b3ecaf1d89902fb6c12 upstream. To use spin_{un}lock_irq is dangerous if caller disabled interrupt. During aio buffer migration, we have a possibility to see the following call stack. aio_migratepage [disable interrupt] migrate_page_copy clear_page_dirty_for_io set_page_dirty __set_page_dirty_buffers __set_page_dirty spin_lock_irq This mean, current aio migration is a deadlockable. spin_lock_irqsave is a safer alternative and we should use it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Rientjes rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20	Btrfs: disable snapshot aware defrag for now	Josef Bacik
	commit 8101c8dbf6243ba517aab58d69bf1bc37d8b7b9c upstream. It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	pnfs: Proper delay for NFS4ERR_RECALLCONFLICT in layout_get_done	Boaz Harrosh
	commit ed7e5423014ad89720fcf315c0b73f2c5d0c7bd2 upstream. An NFS4ERR_RECALLCONFLICT is returned by server from a GET_LAYOUT only when a Server Sent a RECALL do to that GET_LAYOUT, or the RECALL and GET_LAYOUT crossed on the wire. In any way this means we want to wait at most until in-flight IO is finished and the RECALL can be satisfied. So a proper wait here is more like 1/10 of a second, not 15 seconds like we have now. In case of a server bug we delay exponentially longer on each retry. Current code totally craps out performance of very large files on most pnfs-objects layouts, because of how the map changes when the file has grown into the next raid group. [Stable: This will patch back to 3.9. If there are earlier still maintained trees, please tell me I'll send a patch] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	nfs4: fix discover_server_trunking use after free	Weston Andros Adamson
	commit abad2fa5ba67725a3f9c376c8cfe76fbe94a3041 upstream. If clp is new (cl_count = 1) and it matches another client in nfs4_discover_server_trunking, the nfs_put_client will free clp before ->cl_preserve_clid is set. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	NFSv4.1: Handle errors correctly in nfs41_walk_client_list	Trond Myklebust
	commit 64590daa9e0dfb3aad89e3ab9230683b76211d5b upstream. Both nfs41_walk_client_list and nfs40_walk_client_list expect the 'status' variable to be set to the value -NFS4ERR_STALE_CLIENTID if the loop fails to find a match. The problem is that the 'pos->cl_cons_state > NFS_CS_READY' changes the value of 'status', and sets it either to the value '0' (which indicates success), or to the value EINTR. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	nfs4.1: properly handle ENOTSUP in SECINFO_NO_NAME	Weston Andros Adamson
	commit 78b19bae0813bd6f921ca58490196abd101297bd upstream. Don't check for -NFS4ERR_NOTSUPP, it's already been mapped to -ENOTSUPP by nfs4_stat_to_errno. This allows the client to mount v4.1 servers that don't support SECINFO_NO_NAME by falling back to the "guess and check" method of nfs4_find_root_sec. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	NFSv4: OPEN must handle the NFS4ERR_IO return code correctly	Trond Myklebust
	commit c7848f69ec4a8c03732cde5c949bd2aa711a9f4b upstream. decode_op_hdr() cannot distinguish between an XDR decoding error and the perfectly valid errorcode NFS4ERR_IO. This is normally not a problem, but for the particular case of OPEN, we need to be able to increment the NFSv4 open sequence id when the server returns a valid response. Reported-by: J Bruce Fields <bfields@fieldses.org> Link: http://lkml.kernel.org/r/20131204210356.GA19452@fieldses.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	ore: Fix wrong math in allocation of per device BIO	Boaz Harrosh
	commit aad560b7f63b495f48a7232fd086c5913a676e6f upstream. At IO preparation we calculate the max pages at each device and allocate a BIO per device of that size. The calculation was wrong on some unaligned corner cases offset/length combination and would make prepare return with -ENOMEM. This would be bad for pnfs-objects that would in that case IO through MDS. And fatal for exofs were it would fail writes with EIO. Fix it by doing the proper math, that will work in all cases. (I ran a test with all possible offset/length combinations this time round). Also when reading we do not need to allocate for the parity units since we jump over them. Also lower the max_io_length to take into account the parity pages so not to allocate BIOs bigger than PAGE_SIZE Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	fs/compat: fix lookup_dcookie() parameter handling	Heiko Carstens
	commit d8d14bd09cddbaf0168d61af638455a26bd027ff upstream. Commit d5dc77bfeeab ("consolidate compat lookup_dcookie()") coverted all architectures to the new compat_sys_lookup_dcookie() syscall. The "len" paramater of the new compat syscall must have the type compat_size_t in order to enforce zero extension for architectures where the ABI requires that the caller of a function performed zero and/or sign extension to 64 bit of all parameters. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	fs/compat: fix parameter handling for compat readv/writev syscalls	Heiko Carstens
	commit dfd948e32af2e7b28bcd7a490c0a30d4b8df2a36 upstream. We got a report that the pwritev syscall does not work correctly in compat mode on s390. It turned out that with commit 72ec35163f9f ("switch compat readv/writev variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a couple of syscall parameters because the some parameter types haven't been converted from unsigned long to compat_ulong_t. This is needed for architectures where the ABI requires that the caller of a function performed zero and/or sign extension to 64 bit of all parameters. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	compat: fix sys_fanotify_mark	Heiko Carstens
	commit 592f6b842f64e416c7598a1b97c649b34241e22d upstream. Commit 91c2e0bcae72 ("unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE") added a new unified compat fanotify_mark syscall to be used by all architectures. Unfortunately the unified version merges the split mask parameter in a wrong way: the lower and higher word got swapped. This was discovered with glibc's tst-fanotify test case. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reported-by: Andreas Krebbel <krebbel@linux.vnet.ibm.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13	fuse: fix pipe_buf_operations	Miklos Szeredi
	commit 28a625cbc2a14f17b83e47ef907b2658576a32aa upstream. Having this struct in module memory could Oops when if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06	btrfs: restrict snapshotting to own subvolumes	David Sterba
	commit d024206133ce21936b3d5780359afc00247655b7 upstream. Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06	Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot()	Wang Shilong
	commit 90515e7f5d7d24cbb2a4038a3f1b5cfa2921aa17 upstream. We may return early in btrfs_drop_snapshot(), we shouldn't call btrfs_std_err() for this case, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06	vfs: Is mounted should be testing mnt_ns for NULL or error.	Eric W. Biederman
	commit 260a459d2e39761fbd39803497205ce1690bc7b1 upstream. A bug was introduced with the is_mounted helper function in commit f7a99c5b7c8bd3d3f533c8b38274e33f3da9096e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Jun 9 00:59:08 2012 -0400 get rid of ->mnt_longterm it's enough to set ->mnt_ns of internal vfsmounts to something distinct from all struct mnt_namespace out there; then we can just use the check for ->mnt_ns != NULL in the fast path of mntput_no_expire() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> The intent was to test if the real_mount(vfsmount)->mnt_ns was NULL_OR_ERR but the code is actually testing real_mount(vfsmount) and always returning true. The result is d_absolute_path returning paths it should be hiding. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06	ext4: avoid clearing beyond i_blocks when truncating an inline data file	Theodore Ts'o
	commit 09c455aaa8f47a94d5bafaa23d58365768210507 upstream. A missing cast means that when we are truncating a file which is less than 60 bytes, we don't clear the correct area of memory, and in fact we can end up truncating the next inode in the inode table, or worse yet, some other kernel data structure. Addresses-Coverity-Id: #751987 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>