aboutsummaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2012-08-25ext4: fix kernel BUG on large-scale rm -rf commandsTheodore Ts'o
commit 89a4e48f8479f8145eca9698f39fe188c982212f upstream. Commit 968dee7722: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit 968dee7722 was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-25ext4: fix long mount times on very big file systemsTheodore Ts'o
commit 0548bbb85337e532ca2ed697c3e9b227ff2ed4b4 upstream. Commit 8aeb00ff85a: "ext4: fix overhead calculation used by ext4_statfs()" introduced a O(n**2) calculation which makes very large file systems take forever to mount. Fix this with an optimization for non-bigalloc file systems. (For bigalloc file systems the overhead needs to be set in the the superblock.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-25ext4: don't call ext4_error while block group is lockedTheodore Ts'o
commit 7a4c5de27efa4c2ecca87af0a3deea63446367e2 upstream. While in ext4_validate_block_bitmap(), if an block allocation bitmap is found to be invalid, we call ext4_error() while the block group is still locked. This causes ext4_commit_super() to call a function which might sleep while in an atomic context. There's no need to keep the block group locked at this point, so hoist the ext4_error() call up to ext4_validate_block_bitmap() and release the block group spinlock before calling ext4_error(). The reported stack trace can be found at: http://article.gmane.org/gmane.comp.file-systems.ext4/33731 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-25ext4: avoid kmemcheck complaint from reading uninitialized memoryTheodore Ts'o
commit 7e731bc9a12339f344cddf82166b82633d99dd86 upstream. Commit 03179fe923 introduced a kmemcheck complaint in ext4_da_get_block_prep() because we save and restore ei->i_da_metadata_calc_last_lblock even though it is left uninitialized in the case where i_da_metadata_calc_len is zero. This doesn't hurt anything, but silencing the kmemcheck complaint makes it easier for people to find real bugs. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631 (which is marked as a regression). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-25ext4: make sure the journal sb is written in ext4_clear_journal_err()Theodore Ts'o
commit d796c52ef0b71a988364f6109aeb63d79c5b116b upstream. After we transfer set the EXT4_ERROR_FS bit in the file system superblock, it's not enough to call jbd2_journal_clear_err() to clear the error indication from journal superblock --- we need to call jbd2_journal_update_sb_errno() as well. Otherwise, when the root file system is mounted read-only, the journal is replayed, and the error indicator is transferred to the superblock --- but the s_errno field in the jbd2 superblock is left set (since although we cleared it in memory, we never flushed it out to disk). This can end up confusing e2fsck. We should make e2fsck more robust in this case, but the kernel shouldn't be leaving things in this confused state, either. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: use s_csum_seed instead of i_csum_seed for xattr blockTao Ma
commit 41eb70dde42b2360074a559a6f1fc49860a50179 upstream. In xattr block operation, we use h_refcount to indicate whether the xattr block is shared among many inodes. And xattr block csum uses s_csum_seed if it is shared and i_csum_seed if it belongs to one inode. But this has a problem. So consider the block is shared first bewteen inode A and B, and B has some xattr update and CoW the xattr block. When it updates the *old* xattr block(because of the h_refcount change) and calls ext4_xattr_release_block, we has no idea that inode A is the real owner of the *old* xattr block and we can't use the i_csum_seed of inode A either in xattr block csum calculation. And I don't think we have an easy way to find inode A. So this patch just removes the tricky i_csum_seed and we now uses s_csum_seed every time for the xattr block csum. The corresponding patch for the e2fsprogs will be sent in another patch. This is spotted by xfstests 117. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: use proper csum calculation in ext4_renameTao Ma
commit ef58f69c3c34f6377f1e21d3533c806dbd980ad0 upstream. In ext4_rename, when the old name is a dir, we need to change ".." to its new parent and journal the change, so with metadata_csum enabled, we have to re-calc the csum. As the first block of the dir can be either a htree root or a normal directory block and we have different csum calculation for these 2 types, we have to choose the right one in ext4_rename. btw, it is found by xfstests 013. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: undo ext4_calc_metadata_amount if we fail to claim spaceTheodore Ts'o
commit 03179fe92318e7934c180d96f12eff2cb36ef7b6 upstream. The function ext4_calc_metadata_amount() has side effects, although it's not obvious from its function name. So if we fail to claim space, regardless of whether we retry to claim the space again, or return an error, we need to undo these side effects. Otherwise we can end up incorrectly calculating the number of metadata blocks needed for the operation, which was responsible for an xfstests failure for test #271 when using an ext2 file system with delalloc enabled. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: don't let i_reserved_meta_blocks go negativeBrian Foster
commit 97795d2a5b8d3c8dc4365d4bd3404191840453ba upstream. If we hit a condition where we have allocated metadata blocks that were not appropriately reserved, we risk underflow of ei->i_reserved_meta_blocks. In turn, this can throw sbi->s_dirtyclusters_counter significantly out of whack and undermine the nondelalloc fallback logic in ext4_nonda_switch(). Warn if this occurs and set i_allocated_meta_blocks to avoid this problem. This condition is reproduced by xfstests 270 against ext2 with delalloc enabled: Mar 28 08:58:02 localhost kernel: [ 171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28 Mar 28 08:58:02 localhost kernel: [ 171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost 270 ultimately fails with an inconsistent filesystem and requires an fsck to repair. The cause of the error is an underflow in ext4_da_update_reserve_space() due to an unreserved meta block allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: fix hole punch failure when depth is greater than 0Ashish Sangwan
commit 968dee77220768a5f52cf8b21d0bdb73486febef upstream. Whether to continue removing extents or not is decided by the return value of function ext4_ext_more_to_rm() which checks 2 conditions: a) if there are no more indexes to process. b) if the number of entries are decreased in the header of "depth -1". In case of hole punch, if the last block to be removed is not part of the last extent index than this index will not be deleted, hence the number of valid entries in the extent header of "depth - 1" will remain as it is and ext4_ext_more_to_rm will return 0 although the required blocks are not yet removed. This patch fixes the above mentioned problem as instead of removing the extents from the end of file, it starts removing the blocks from the particular extent from which removing blocks is actually required and continue backward until done. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: fix overhead calculation used by ext4_statfs()Theodore Ts'o
commit 952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream. Commit f975d6bcc7a introduced bug which caused ext4_statfs() to miscalculate the number of file system overhead blocks. This causes the f_blocks field in the statfs structure to be larger than it should be. This would in turn cause the "df" output to show the number of data blocks in the file system and the number of data blocks used to be larger than they should be. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09ext4: pass a char * to ext4_count_free() instead of a buffer_head ptrTheodore Ts'o
commit f6fb99cadcd44660c68e13f6eab28333653621e6 upstream. Make it possible for ext4_count_free to operate on buffers and not just data in buffer_heads. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-18ext4: fix duplicated mnt_drop_write call in EXT4_IOC_MOVE_EXTAl Viro
Caused, AFAICS, by mismerge in commit ff9cb1c4eead ("Merge branch 'for_linus' into for_linus_merged") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07ext4: don't set i_flags in EXT4_IOC_SETFLAGSTao Ma
Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to change the i_flags automatically but fails to remove the error setting of i_flags. So we still have the problem of trashing state flags. Fix this by removing the assignment. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-06-07ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bgTheodore Ts'o
Ext3 filesystems that are converted to use as many ext4 file system features as possible will enable uninit_bg to speed up e2fsck times. These file systems will have a native ext3 layout of inode tables and block allocation bitmaps (as opposed to ext4's flex_bg layout). Unfortunately, in these cases, when first allocating a block in an uninitialized block group, ext4 would incorrectly calculate the number of free blocks in that block group, and then errorneously report that the file system was corrupt: EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd This problem can be reproduced via: mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g mount -t ext4 /dev/vdd /mnt fallocate -l 4600m /mnt/test The problem was caused by a bone headed mistake in the check to see if a particular metadata block was part of the block group. Many thanks to Kees Cook for finding and bisecting the buggy commit which introduced this bug (commit fd034a84e1, present since v3.2). Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Kees Cook <keescook@chromium.org> Cc: stable@kernel.org
2012-06-01Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull Ext4 updates from Theodore Ts'o: "The major new feature added in this update is Darrick J Wong's metadata checksum feature, which adds crc32 checksums to ext4's metadata fields. There is also the usual set of cleanups and bug fixes." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits) ext4: hole-punch use truncate_pagecache_range jbd2: use kmem_cache_zalloc wrapper instead of flag ext4: remove mb_groups before tearing down the buddy_cache ext4: add ext4_mb_unload_buddy in the error path ext4: don't trash state flags in EXT4_IOC_SETFLAGS ext4: let getattr report the right blocks in delalloc+bigalloc ext4: add missing save_error_info() to ext4_error() ext4: add debugging trigger for ext4_error() ext4: protect group inode free counting with group lock ext4: use consistent ssize_t type in ext4_file_write() ext4: fix format flag in ext4_ext_binsearch_idx() ext4: cleanup in ext4_discard_allocated_blocks() ext4: return ENOMEM when mounts fail due to lack of memory ext4: remove redundundant "(char *) bh->b_data" casts ext4: disallow hard-linked directory in ext4_lookup ext4: fix potential integer overflow in alloc_flex_gd() ext4: remove needs_recovery in ext4_mb_init() ext4: force ro mount if ext4_setup_super() fails ext4: fix potential NULL dereference in ext4_free_inodes_counts() ext4/jbd2: add metadata checksumming to the list of supported features ...
2012-06-01ext4: hole-punch use truncate_pagecache_rangeHugh Dickins
When truncating a file, we unmap pages from userspace first, as that's usually more efficient than relying, page by page, on the fallback in truncate_inode_page() - particularly if the file is mapped many times. Do the same when punching a hole: 3.4 added truncate_pagecache_range() to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead of calling truncate_inode_pages_range() directly. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-31ext4: remove mb_groups before tearing down the buddy_cacheSalman Qazi
We can't have references held on pages in the s_buddy_cache while we are trying to truncate its pages and put the inode. All the pages must be gone before we reach clear_inode. This can only be gauranteed if we can prevent new users from grabbing references to s_buddy_cache's pages. The original bug can be reproduced and the bug fix can be verified by: while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \ umount /export/hda3/ram0; done & while true; do cat /proc/fs/ext4/ram0/mb_groups; done Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-31ext4: add ext4_mb_unload_buddy in the error pathSalman Qazi
ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching ext4_mb_unload_buddy when it fails a memory allocation. Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-31ext4: don't trash state flags in EXT4_IOC_SETFLAGSTheodore Ts'o
In commit 353eb83c we removed i_state_flags with 64-bit longs, But when handling the EXT4_IOC_SETFLAGS ioctl, we replace i_flags directly, which trashes the state flags which are stored in the high 32-bits of i_flags on 64-bit platforms. So use the the ext4_{set,clear}_inode_flags() functions which use atomic bit manipulation functions instead. Reported-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-31ext4: let getattr report the right blocks in delalloc+bigallocTao Ma
In delayed allocation, i_reserved_data_blocks now indicates clusters, not blocks. So report it in the right number. This can be easily exposed by the following command: echo foo > blah; du -hc blah; sync; du -hc blah Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-30ext4: add missing save_error_info() to ext4_error()Theodore Ts'o
The ext4_error() function is missing a call to save_error_info(). Since this is the function which marks the file system as containing an error, this oversight (which was introduced in 2.6.36) is quite significant, and should be backported to older stable kernels with high urgency. Reported-by: Ken Sumrall <ksumrall@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ksumrall@google.com Cc: stable@kernel.org
2012-05-30ext4: add debugging trigger for ext4_error()Theodore Ts'o
Make it easy to test whether or not the error handling subsystem in ext4 is working correctly. This allows us to simulate an ext4_error() by echoing a string to /sys/fs/ext4/<dev>/trigger_fs_error. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ksumrall@google.com
2012-05-28ext4: protect group inode free counting with group lockTao Ma
Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: use consistent ssize_t type in ext4_file_write()Zheng Liu
The generic_file_aio_write() function returns ssize_t, and ext4_file_write() returns a ssize_t, so use a ssize_t to collect the return value from generic_file_aio_write(). It shouldn't matter since the VFS read/write paths shouldn't allow a read greater than MAX_INT, but there was previously a bug in the AIO code paths, and it's best if we use a consistent type so that the return value from generic_file_aio_write() can't get truncated. Reported-by: Jouni Siren <jouni.siren@iki.fi> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: fix format flag in ext4_ext_binsearch_idx()Zheng Liu
fix ext_debug format flag in ext4_ext_binsearch_idx(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: cleanup in ext4_discard_allocated_blocks()Zheng Liu
remove 'len' variable in ext4_discard_allocated_blocks() because it is useless. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: return ENOMEM when mounts fail due to lack of memoryTheodore Ts'o
This is a port of the ext3 commit: 4569cd1b0d9 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: remove redundundant "(char *) bh->b_data" castsTheodore Ts'o
The b_data field of the buffer_head is already a char *, so there's no point casting it to a char *. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: disallow hard-linked directory in ext4_lookupAndreas Dilger
A hard-linked directory to its parent can cause the VFS to deadlock, and is a sign of a corrupted file system. So detect this case in ext4_lookup(), before the rmdir() lockup scenario can take place. Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-28ext4: fix potential integer overflow in alloc_flex_gd()Haogang Chen
In alloc_flex_gd(), when flexbg_size is large, kmalloc size would overflow and flex_gd->groups would point to a buffer smaller than expected, causing OOB accesses when it is used. Note that in ext4_resize_fs(), flexbg_size is calculated using sbi->s_log_groups_per_flex, which is read from the disk and only bounded to [1, 31]. The patch returns NULL for too large flexbg_size. Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Haogang Chen <haogangchen@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-28ext4: remove needs_recovery in ext4_mb_init()Akira Fujita
needs_recovery in ext4_mb_init() is not used, remove it. Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28ext4: force ro mount if ext4_setup_super() failsEric Sandeen
If ext4_setup_super() fails i.e. due to a too-high revision, the error is logged in dmesg but the fs is not mounted RO as indicated. Tested by: # mkfs.ext4 -r 4 /dev/sdb6 # mount /dev/sdb6 /mnt/test # dmesg | grep "too high" [164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode # grep sdb6 /proc/mounts /dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-28ext4: fix potential NULL dereference in ext4_free_inodes_counts()Dan Carpenter
The ext4_get_group_desc() function returns NULL on error, and ext4_free_inodes_count() function dereferences it without checking. There is a check on the next line, but it's too late. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-05-28Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds
Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
2012-05-27ext4/jbd2: add metadata checksumming to the list of supported featuresDarrick J. Wong
Activate the metadata checksumming feature by adding it to ext4 and jbd2's lists of supported features. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-27jbd2: enable journal clients to enable v2 checksummingDarrick J. Wong
Add in the necessary code so that journal clients can enable the new journal checksumming features. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-25Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, ext3 and quota fixes from Jan Kara: "Interesting bits are: - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since quota code does not need i_mutex anymore in any unusual way. - backport (from ext4) of a fix of a checkpointing bug (missing cache flush) that could lead to fs corruption on power failure The rest are just random small fixes & cleanups." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: trivial fix to comment for ext2_free_blocks ext2: remove the redundant comment for ext2_export_ops ext3: return 32/64-bit dir name hash according to usage type quota: Get rid of nested I_MUTEX_QUOTA locking subclass quota: Use precomputed value of sb_dqopt in dquot_quota_sync ext2: Remove i_mutex use from ext2_quota_write() reiserfs: Remove i_mutex use from reiserfs_quota_write() ext4: Remove i_mutex use from ext4_quota_write() ext3: Remove i_mutex use from ext3_quota_write() quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG jbd: Write journal superblock with WRITE_FUA after checkpointing jbd: protect all log tail updates with j_checkpoint_mutex jbd: Split updating of journal superblock and marking journal empty ext2: do not register write_super within VFS ext2: Remove s_dirt handling ext2: write superblock only once on unmount ext3: update documentation with barrier=1 default ext3: remove max_debt in find_group_orlov() jbd: Refine commit writeout logic
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-21ext4: enable the 64-bit jbd2 feature based on the 64-bit ext4 featureTheodore Ts'o
Previously we were only enabling the 64-bit jbd2 feature if the number of blocks in the file system was greater 2**32-1. The problem with this is that it makes it harder to test the 64-bit journal code paths with small file systems, since a small test file system would with the 64-bit ext4 feature enable would use a 64-bit file system on-disk data structures, but use a 32-bit journal. This would also cause problems when trying to do an online resize to grow the filesystem above the 2**32-1 boundary. Fortunately the patch to support online resize for 64-bit file systems hasn't been merged yet, so this problem hasn't arisen in practice. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-15userns: Convert ext4 to user kuid/kgid where appropriateEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15ext4: Remove i_mutex use from ext4_quota_write()Jan Kara
We don't need i_mutex in ext4_quota_write() because writes to quota file are serialized by dqio_mutex anyway. Changes to quota files outside of quota code are forbidded and enforced by NOATIME and IMMUTABLE bits. Signed-off-by: Jan Kara <jack@suse.cz>
2012-05-10vfs: make it possible to access the dentry hash/len as one 64-bit entryLinus Torvalds
This allows comparing hash and len in one operation on 64-bit architectures. Right now only __d_lookup_rcu() takes advantage of this, since that is the case we care most about. The use of anonymous struct/unions hides the alternate 64-bit approach from most users, the exception being a few cases where we initialize a 'struct qstr' with a static initializer. This makes the problematic cases use a new QSTR_INIT() helper function for that (but initializing just the name pointer with a "{ .name = xyzzy }" initializer remains valid, as does just copying another qstr structure). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06vfs: Rename end_writeback() to clear_inode()Jan Kara
After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-04-30ext4: remove unnecessary check in add_dirent_to_buf()Theodore Ts'o
None of this function callers ever pass in a NULL inode pointer, so this check is unnecessary, and the else clause is dead code. (This change should make the code coverage people a little happier. :-) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29ext4: add checksums to the MMP blockDarrick J. Wong
Compute and verify a checksum for the MMP block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29ext4: make block group checksums use metadata_csum algorithmDarrick J. Wong
metadata_csum supersedes uninit_bg. Convert the ROCOMPAT uninit_bg flag check to a helper function that covers both, and make the checksum calculation algorithm use either crc16 or the metadata_csum chosen algorithm depending on which flag is set. Print a warning if we try to mount a filesystem with both feature flags set. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29ext4: Calculate and verify checksums of extended attribute blocksDarrick J. Wong
Calculate and verify the checksums of extended attribute blocks. This only applies to separate EA blocks that are pointed to by inode->i_file_acl (i.e. external EA blocks); the checksum lives in the EA header. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29ext4: calculate and verify checksums of directory leaf blocksDarrick J. Wong
Calculate and verify the checksums for directory leaf blocks (i.e. blocks that only contain actual directory entries). The checksum lives in what looks to be an unused directory entry with a 0 name_len at the end of the block. This scheme is not used for internal htree nodes because the mechanism in place there only costs one dx_entry, whereas the "empty" directory entry would cost two dx_entries. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29ext4: Calculate and verify checksums for htree nodesDarrick J. Wong
Calculate and verify the checksum for directory index tree (htree) node blocks. The checksum is stored in the last 4 bytes of the htree block and requires the dx_entry array to stop 1 dx_entry short of the end of the block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>