path: root/fs
Age | Commit message | Author
2010-01-18 | quota: Fix dquot_transfer for filesystems different from ext4 | Jan Kara
commit 05b5d898235401c489c68e1f3bc5706a29ad5713 upstream. Commit fd8fbfc1 modified the way we find amount of reserved space belonging to an inode. The amount of reserved space is checked from dquot_transfer and thus inode_reserved_space gets called even for filesystems that don't provide get_reserved_space callback which results in a BUG. Fix the problem by checking get_reserved_space callback and return 0 if the filesystem does not provide it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
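The guard described in this message can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the names `sketch_dquot_ops`, `sketch_inode`, `sketch_inode_reserved_space` and `sketch_ext4_like_reserved` are hypothetical stand-ins, not the kernel's definitions. Only the idea of checking the `get_reserved_space` callback and returning 0 when it is absent comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode;

/* Hypothetical stand-in for a filesystem's quota operations table. */
struct sketch_dquot_ops {
    /* Optional callback: filesystems that do not reserve space leave it NULL. */
    long long (*get_reserved_space)(struct sketch_inode *inode);
};

/* Hypothetical stand-in for an inode with a pointer to its quota operations. */
struct sketch_inode {
    const struct sketch_dquot_ops *dq_op;
    long long reserved_bytes;
};

/* Return the reserved space, or 0 when the filesystem provides no callback. */
long long sketch_inode_reserved_space(struct sketch_inode *inode)
{
    if (!inode->dq_op || !inode->dq_op->get_reserved_space)
        return 0;
    return inode->dq_op->get_reserved_space(inode);
}

/* Example callback for a filesystem that does track reserved space. */
long long sketch_ext4_like_reserved(struct sketch_inode *inode)
{
    return inode->reserved_bytes;
}
```

Calling `sketch_inode_reserved_space()` on an inode whose operations table has a NULL callback yields 0 instead of dereferencing a NULL function pointer.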
2010-01-18 | fasync: split 'fasync_helper()' into separate add/remove functions | Linus Torvalds
commit 53281b6d34d44308372d16acb7fb5327609f68b6 upstream.

Yes, the add and remove cases do share the same basic loop and the locking, but the compiler can inline and then CSE some of the end result anyway. And splitting it up makes the code way easier to follow, and makes it clearer exactly what the semantics are.

In particular, we must make sure that the FASYNC flag in file->f_flags exactly matches the state of "is this file on any fasync list", since not only is that flag visible to user space (F_GETFL), but we also use that flag to check whether we need to remove any fasync entries on file close.

We got that wrong for the case of a mixed use of file locking (which tries to remove any fasync entries for file leases) and fasync.

Splitting the function up also makes it possible to do some future optimizations without making the function even messier. In particular, since the FASYNC flag has to match the state of "is this on a list", we can do the following future optimizations:

 - on remove, we don't even need to get the locks and traverse the list if FASYNC isn't set, since we can know a priori that there is no point (this is effectively the same optimization that we already do in __fput() wrt removing fasync on file close)

 - on add, we can use the FASYNC flag to decide whether we are changing an existing entry or need to allocate a new one.

but this is just the cleanup + fix for the FASYNC flag.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Tested-by: Tavis Ormandy <taviso@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | ext4: fix sleep inside spinlock issue with quota and dealloc (#14739) | Dmitry Monakhov
commit 39bc680a8160bb9d6743f7873b535d553ff61058 upstream. Unlock i_block_reservation_lock before vfs_dq_reserve_block(). This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=14739 Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | ext4: Convert to generic reserved quota's space management. | Dmitry Monakhov
commit a9e7f4472075fb6937c545af3f6329e9946bbe66 upstream. This patch also fixes write vs chown race condition. Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | quota: decouple fs reserved space from quota reservation | Dmitry Monakhov
commit fd8fbfc1709822bd94247c5b2ab15a5f5041e103 upstream.

Currently inode_reservation is managed by the fs itself, and this reservation is transferred on dquot_transfer(). This means that inode_reservation must always be in sync with dquot->dq_dqb.dqb_rsvspace. Otherwise dquot_transfer() will result in incorrect quota (the WARN_ON in dquot_claim_reserved_space() will be triggered). This is not easy because of complex locking order issues, for example http://bugzilla.kernel.org/show_bug.cgi?id=14739

The patch introduces a quota reservation field for each fs-inode (an fs-specific inode is used in order to prevent bloating the generic vfs inode). This reservation is managed by the quota code internally, similar to i_blocks/i_bytes, and may not always be in sync with the internal fs reservation.

Also perform some code rearrangement:
- Unify dquot_alloc_space() and dquot_reserve_space()
- Unify dquot_release_reserved_space() and dquot_free_space()
- Add the missing warning update to release_rsv(): dquot_release_reserved_space() must call flush_warnings() as dquot_free_space() does.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | Add unlocked version of inode_add_bytes() function | Dmitry Monakhov
commit b462707e7ccad058ae151e5c5b06eb5cadcb737f upstream. Quota code requires an unlocked version of this function. Of course we could just copy-paste the code, but copy-pasting is always evil. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | generic_permission: MAY_OPEN is not write access | Serge E. Hallyn
commit 7ea6600148c265b1fd53e521022b1d7aec81d974 upstream. generic_permission was refusing CAP_DAC_READ_SEARCH-enabled processes from opening DAC-protected files read-only, because do_filp_open adds MAY_OPEN to the open mask. Ignore MAY_OPEN. After this patch, CAP_DAC_READ_SEARCH is again sufficient to open(fname, O_RDONLY) on a file to which DAC otherwise refuses us read permission. Reported-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
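The effect of ignoring MAY_OPEN can be sketched as below. This is a simplified userspace illustration, not the kernel code: the helper name `sketch_read_search_suffices` is hypothetical, the numeric mask values are assumptions chosen to mirror the kernel's flag layout, and the real function performs further checks (for example on directories) that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed mask bits for this sketch; treat the numeric values as illustrative. */
#define MAY_EXEC  1
#define MAY_WRITE 2
#define MAY_READ  4
#define MAY_OPEN  32

/*
 * Hypothetical helper: for a regular file, is CAP_DAC_READ_SEARCH alone
 * enough to satisfy this request? Bits that are not real access modes,
 * such as MAY_OPEN, are masked off before the comparison.
 */
bool sketch_read_search_suffices(int mask)
{
    mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
    return mask == MAY_READ;
}
```

Without the masking step, a request of `MAY_READ | MAY_OPEN` would not compare equal to `MAY_READ`, which matches the refusal described in the message.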
2010-01-06 | XFS bug in log recover with quota (bugzilla id 855) | Jan Rekorajski
commit 8ec6dba2581754e375be66f7bedd708d856d8b30 upstream. Hi, I was hit by a bug in linux 2.6.31 when XFS is not able to recover the log after a crash if fs was mounted with quotas. Gory details in XFS bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855. It looks like wrong struct is used in buffer length check, and the following patch should fix the problem. xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes long, and this is exactly what I see in system logs - "XFS: dquot too small (104) in xlog_recover_do_dquot_trans." Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Cc: Simon Kirby <sim@hostway.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
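The size mismatch described here can be reproduced with two stand-in structures. The struct and function names below are hypothetical. Only the sizes (104 bytes, and 104+32 bytes) are taken from the message, and the field names are placeholders rather than the XFS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the 104-byte on-disk dquot described in the message. */
struct sketch_disk_dquot {
    unsigned char bytes[104];
};

/* Stand-in for the larger block wrapper: 104 bytes plus 32 bytes of fill. */
struct sketch_dqblk {
    struct sketch_disk_dquot diskdq;
    unsigned char fill[32];
};

/* Length check against the wrong (larger) structure. */
bool sketch_len_ok_buggy(size_t len)
{
    return len >= sizeof(struct sketch_dqblk);
}

/* Length check against the structure that is actually logged. */
bool sketch_len_ok_fixed(size_t len)
{
    return len >= sizeof(struct sketch_disk_dquot);
}
```

A 104-byte logged item fails the first check and passes the second, which is consistent with the "dquot too small (104)" message quoted above.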
2010-01-06 | udf: Try harder when looking for VAT inode | Jan Kara
commit e971b0b9e0dd50d9ceecb67a6a6ab80a80906033 upstream. Some disks do not contain VAT inode in the last recorded block as required by the standard but a few blocks earlier (or the number of recorded blocks is wrong). So look for the VAT inode a bit before the end of the media. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06 | cifs: NULL out tcon, pSesInfo, and srvTcp pointers when chasing DFS referrals | Jeff Layton
commit a2934c7b363ddcc001964f2444649f909e583bef upstream. The scenario is this: The kernel gets EREMOTE and starts chasing a DFS referral at mount time. The tcon reference is put, which puts the session reference too, but neither pointer is zeroed out. The mount gets retried (goto try_mount_again) with new mount info. Session setup fails and rc ends up being non-zero. The code then falls through to the end and tries to put the previously freed tcon pointer again. Oops at: cifs_put_smb_ses+0x14/0xd0 Fix this by moving the initialization of the rc variable and the tcon, pSesInfo and srvTcp pointers below the try_mount_again label. Also, add a FreeXid() before the goto to prevent xid "leaks". Signed-off-by: Jeff Layton <jlayton@redhat.com> Reported-by: Gustavo Carvalho Homem <gustavo@angulosolido.pt> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18 | jffs2: Fix long-standing bug with symlink garbage collection. | David Woodhouse
commit 2e16cfca6e17ae37ae21feca080a6f2eca9087dc upstream. Ever since jffs2_garbage_collect_metadata() was first half-written in February 2001, it's been broken on architectures where 'char' is signed. When garbage collecting a symlink with target length above 127, the payload length would end up negative, causing interesting and bad things to happen. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
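The symptom described here, a length above 127 turning negative where 'char' is signed, can be shown in a few lines. This sketch assumes the length ends up stored in a char-sized signed variable. The function names are hypothetical, and `signed char` is used explicitly so the behaviour does not depend on the build architecture. The conversion of an out-of-range value is implementation-defined in C, though common compilers wrap it.

```c
#include <assert.h>

/* Store a symlink target length in a signed char, as on signed-'char' targets. */
int sketch_len_via_signed_char(int target_len)
{
    signed char len = (signed char)target_len; /* values above 127 do not fit */
    return len;
}

/* Store the same length in an int, which holds it without loss. */
int sketch_len_via_int(int target_len)
{
    int len = target_len;
    return len;
}
```

With a target length of 200, the first helper returns a negative number on common compilers, while the second returns 200.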
2009-12-18 | ext3: Fix data / filesystem corruption when write fails to copy data | Jan Kara
commit 68eb3db08344286733adac48304d9fb7a0e53b27 upstream. When ext3_write_begin fails after allocating some blocks or generic_perform_write fails to copy data to write, we truncate blocks already instantiated beyond i_size. Although these blocks were never inside i_size, we have to truncate pagecache of these blocks so that corresponding buffers get unmapped. Otherwise subsequent __block_prepare_write (called because we are retrying the write) will find the buffers mapped, not call ->get_block, and thus the page will be backed by already freed blocks leading to filesystem and data corruption. Reported-by: James Y Knight <foom@fuhm.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18 | jbd2: don't wipe the journal on a failed journal checksum | Theodore Ts'o
commit e6a47428de84e19fda52f21ab73fde2906c40d09 upstream. If there is a failed journal checksum, don't reset the journal. This allows for userspace programs to decide how to recover from this situation. It may be that ignoring the journal checksum failure might be a better way of recovering the file system. Once we add per-block checksums, we can definitely do better. Until then, a system administrator can try backing up the file system image (or taking a snapshot) and trying to determine experimentally whether ignoring the checksum failure or aborting the journal replay results in less data loss. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18 | hfs: fix a potential buffer overflow | Amerigo Wang
commit ec81aecb29668ad71f699f4e7b96ec46691895b6 upstream. A specially-crafted Hierarchical File System (HFS) filesystem could cause a buffer overflow to occur in a process's kernel stack during a memcpy() call within the hfs_bnode_read() function (at fs/hfs/bnode.c:24). The attacker can provide the source buffer and length, and the destination buffer is a local variable of a fixed length. This local variable (passed as "&entry" from fs/hfs/dir.c:112 and allocated on line 60) is stored in the stack frame of hfs_bnode_read()'s caller, which is hfs_readdir(). Because the hfs_readdir() function executes upon any attempt to read a directory on the filesystem, it gets called whenever a user attempts to inspect any filesystem contents. [amwang@redhat.com: modify this patch and fix coding style problems] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eugene Teo <eteo@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Anderson <anderson@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
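One general way to defend against this class of overflow is to validate the attacker-controlled length against the size of the fixed destination before copying. The sketch below illustrates that idea only. The names `sketch_read_entry` and `SKETCH_ENTRY_SIZE` are hypothetical, the 64-byte size is arbitrary, and this is not claimed to be the exact form of the upstream patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_ENTRY_SIZE 64 /* hypothetical size of the fixed destination */

/*
 * Copy a record of src_len bytes into a fixed-size buffer.
 * Lengths that are negative or larger than the buffer are rejected
 * with -EIO instead of being passed to memcpy().
 */
int sketch_read_entry(unsigned char dst[SKETCH_ENTRY_SIZE],
                      const unsigned char *src, int src_len)
{
    if (src_len < 0 || (size_t)src_len > SKETCH_ENTRY_SIZE)
        return -EIO;
    memcpy(dst, src, (size_t)src_len);
    return 0;
}
```

The check runs before any bytes are copied, so an oversized record from a crafted image never reaches the stack buffer.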
2009-12-18 | devpts_get_tty() should validate inode | Sukadev Bhattiprolu
commit edfacdd6f81119b9005615593f2cbd94b8c7e2d8 upstream.

devpts_get_tty() assumes that the inode passed in is associated with a valid pty. But if the only reference to the pty is via a bind-mount, the inode passed to devpts_get_tty(), while valid, would refer to a pty that no longer exists.

With a lot of debug effort, Grzegorz Nosek developed a small program (see below) to reproduce a crash on recent kernels. This crash is a regression introduced by the commit:

    commit 527b3e4773628b30d03323a2cb5fb0d84441990f
    Author: Sukadev Bhattiprolu <sukadev@us.ibm.com>
    Date: Mon Oct 13 10:43:08 2008 +0100

To fix, ensure that the dentry associated with the inode has not yet been deleted/unhashed by devpts_pty_kill().

See also: https://lists.linux-foundation.org/pipermail/containers/2009-July/019273.html

tty-bug.c:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <sys/signal.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <linux/fs.h>

    void dummy(int sig)
    {
    }

    static int child(void *unused)
    {
        int fd;

        signal(SIGINT, dummy);
        signal(SIGHUP, dummy);
        pause(); /* cheesy synchronisation to wait for /dev/pts/0 to appear */

        mount("/dev/pts/0", "/dev/console", NULL, MS_BIND, NULL);
        sleep(2);

        fd = open("/dev/console", O_RDWR);
        dup(0);
        dup(0);
        write(1, "Hello world!\n", sizeof("Hello world!\n")-1);
        return 0;
    }

    int main(void)
    {
        pid_t pid;
        char *stack;
        int fd;

        stack = malloc(16384);
        pid = clone(child, stack+16384, CLONE_NEWNS|SIGCHLD, NULL);

        /* keep the ptmx descriptor so unlockpt()/grantpt() can use it */
        fd = open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK);

        unlockpt(fd);
        grantpt(fd);

        sleep(2);
        kill(pid, SIGHUP);
        sleep(1);
        return 0; /* exit before child opens /dev/console */
    }

Reported-by: Grzegorz Nosek <root@localdomain.pl>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18 | debugfs: fix create mutex racy fops and private data | Mathieu Desnoyers
commit d3a3b0adad0865c12e39b712ca89efbd0a3a0dbc upstream. Setting fops and private data outside of the mutex at debugfs file creation introduces a race where the files can be opened with the wrong file operations and private data. It is easy to trigger with a process waiting on file creation notification. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem) | Theodore Ts'o
(cherry picked from commit fab3a549e204172236779f502eccb4f9bf0dc87d)

Fix the following potential circular locking dependency between mm->mmap_sem and ei->i_data_sem:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.32-04115-gec044c5 #37
    -------------------------------------------------------
    ureadahead/1855 is trying to acquire lock:
     (&mm->mmap_sem){++++++}, at: [<ffffffff81107224>] might_fault+0x5c/0xac

    but task is already holding lock:
     (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&ei->i_data_sem){++++..}:
           [<ffffffff81099bfa>] __lock_acquire+0xb67/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81516633>] down_read+0x51/0x84
           [<ffffffff811a2414>] ext4_get_blocks+0x50/0x2a5
           [<ffffffff811a3453>] ext4_get_block+0xab/0xef
           [<ffffffff81154f39>] do_mpage_readpage+0x198/0x48d
           [<ffffffff81155360>] mpage_readpages+0xd0/0x114
           [<ffffffff811a104b>] ext4_readpages+0x1d/0x1f
           [<ffffffff810f8644>] __do_page_cache_readahead+0x12f/0x1bc
           [<ffffffff810f86f2>] ra_submit+0x21/0x25
           [<ffffffff810f0cfd>] filemap_fault+0x19f/0x32c
           [<ffffffff81107b97>] __do_fault+0x55/0x3a2
           [<ffffffff81109db0>] handle_mm_fault+0x327/0x734
           [<ffffffff8151aaa9>] do_page_fault+0x292/0x2aa
           [<ffffffff81518205>] page_fault+0x25/0x30
           [<ffffffff812a34d8>] clear_user+0x38/0x3c
           [<ffffffff81167e16>] padzero+0x20/0x31
           [<ffffffff81168b47>] load_elf_binary+0x8bc/0x17ed
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff81166d64>] load_script+0x1b8/0x1cc
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff8113255f>] do_execve+0x1ce/0x2cf
           [<ffffffff81027494>] sys_execve+0x43/0x5a
           [<ffffffff8102918a>] stub_execve+0x6a/0xc0

    -> #0 (&mm->mmap_sem){++++++}:
           [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81107251>] might_fault+0x89/0xac
           [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
           [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
           [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
           [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
           [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
           [<ffffffff811392ca>] sys_ioctl+0x56/0x79
           [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    1 lock held by ureadahead/1855:
     #0: (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    stack backtrace:
    Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
    Call Trace:
     [<ffffffff81098c70>] print_circular_bug+0xa8/0xb7
     [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
     [<ffffffff8102f229>] ? sched_clock+0x9/0xd
     [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81107251>] might_fault+0x89/0xac
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81124b44>] ? __kmalloc+0x13b/0x18c
     [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
     [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
     [<ffffffff811bca0b>] ? ext4_ext_fiemap_cb+0x0/0x157
     [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
     [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
     [<ffffffff8129f6d0>] ? __up_read+0x8d/0x95
     [<ffffffff81517fb5>] ? retint_swapgs+0x13/0x1b
     [<ffffffff811392ca>] sys_ioctl+0x56/0x79
     [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT | Akira Fujita
(cherry picked from commit 4a58579b9e4e2a35d57e6c9c8483e52f6f1b7fd6)

This patch fixes three problems in the handling of the EXT4_IOC_MOVE_EXT ioctl:

1. In the current EXT4_IOC_MOVE_EXT, there are read access mode checks for the original and donor files, but they allow illegal write access to the donor file, since the donor file is overwritten by the original file's data. To fix this problem, change the access mode checks of the original (r->r/w) and donor (r->w) files.

2. Disallow the use of donor files that have the setuid or setgid bit set.

3. Call mnt_want_write() and mnt_drop_write() before and after calling ext4_move_extents() to get write access to the mount.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Wait for proper transaction commit on fsync | Jan Kara
(cherry picked from commit b436b9bef84de6893e86346d8fbf7104bc520645) We cannot rely on buffer dirty bits during fsync because pdflush can come before fsync is called and clear dirty bits without forcing a transaction commit. What we do is that we track which transaction has last changed the inode and which transaction last changed allocation and force it to disk on fsync. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix incorrect block reservation on quota transfer. | Dmitry Monakhov
(cherry picked from commit 194074acacebc169ded90a4657193f5180015051) Inside a ->setattr() call both ATTR_UID and ATTR_GID may be valid. This means that we may end up transferring all quotas, so we have to reserve QUOTA_DEL_BLOCKS for all quotas, as we do in the case of QUOTA_INIT_BLOCKS. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: quota macros cleanup | Dmitry Monakhov
(cherry picked from commit 5aca07eb7d8f14d90c740834d15ca15277f4820c) Currently all quota block reservation macros contain a hard-coded "2", aka the MAXQUOTAS value. This is no good because in some places it is not obvious what this digit represents. Let's introduce a new macro with a self-descriptive name. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: ext4_get_reserved_space() must return bytes instead of blocks | Dmitry Monakhov
(cherry picked from commit 8aa6790f876e81f5a2211fe1711a5fe3fe2d7b20) Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
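The unit mismatch named in this commit's subject comes down to a shift by the block-size exponent. The sketch below assumes a power-of-two block size described by its log2 (`blkbits`). The function name `sketch_reserved_bytes` is hypothetical and is not the ext4 implementation.

```c
#include <assert.h>

/*
 * Convert a count of reserved blocks into bytes.
 * blkbits is log2 of the block size, e.g. 12 for 4096-byte blocks.
 */
unsigned long long sketch_reserved_bytes(unsigned long long blocks,
                                         unsigned int blkbits)
{
    return blocks << blkbits;
}
```

For example, 3 reserved blocks of 4096 bytes correspond to 12288 bytes. Returning the bare block count would under-report the reservation by a factor of the block size.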
2009-12-14 | ext4: remove blocks from inode prealloc list on failure | Curt Wohlgemuth
(cherry picked from commit b844167edc7fcafda9623955c05e4c1b3c32ebc7) This fixes a leak of blocks in an inode prealloc list if device failures cause ext4_mb_mark_diskspace_used() to fail. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: wait for log to commit when umounting | Josef Bacik
(cherry picked from commit d4edac314e9ad0b21ba20ba8bc61b61f186f79e1) There is a potential race when a transaction is committing right when the file system is being unmounted. The race arises because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super before the commit code calls a callback so the mballoc code can release freed blocks in the transaction, resulting in a panic when trying to access the freed s_group_info. The fix is to wait for the transaction to finish committing before we shut down the multiblock allocator. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Avoid data / filesystem corruption when write fails to copy data | Jan Kara
(cherry picked from commit b9a4207d5e911b938f73079a83cc2ae10524ec7f) When ext4_write_begin fails after allocating some blocks or generic_perform_write fails to copy data to write, we truncate blocks already instantiated beyond i_size. Although these blocks were never inside i_size, we have to truncate the pagecache of these blocks so that corresponding buffers get unmapped. Otherwise subsequent __block_prepare_write (called because we are retrying the write) will find the buffers mapped, not call ->get_block, and thus the page will be backed by already freed blocks leading to filesystem and data corruption. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks() | Roel Kluin
(cherry picked from commit c09eef305dd43846360944ad072f051f964fa383) Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer() | Theodore Ts'o
(cherry picked from commit e6ec116b67f46e0e7808276476554727b2e6240b) OOM happens. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: move_extent_per_page() cleanup | Akira Fujita
(cherry picked from commit ac48b0a1d068887141581bea8285de5fcab182b0) Integrate duplicate lines (acquire/release semaphore and invalidate extent cache in move_extent_per_page()) into mext_replace_branches(), to reduce source and object code size. Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: initialize moved_len before calling ext4_move_extents() | Kazuya Mio
(cherry picked from commit 446aaa6e7e993b38a6f21c6acfa68f3f1af3dbe3) The move_extent.moved_len field is used to pass back the number of exchanged blocks to user space. Currently the caller must clear this field; but we spend more code space checking for this requirement than simply zeroing the field ourselves, so let's just make life easier for everyone all around. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT | Akira Fujita
(cherry picked from commit 94d7c16cbbbd0e03841fcf272bcaf0620ad39618)

At the beginning of ext4_move_extents(), we call ext4_discard_preallocations() to discard the inode PAs of the orig and donor inodes. But in the following case, blocks can be double freed, so move ext4_discard_preallocations() to the end of ext4_move_extents().

1. Discard inode PAs of orig and donor inodes with ext4_discard_preallocations() in ext4_move_extents().

   orig : [ DATA1 ]
   donor: [ DATA2 ]

2. While data blocks are being exchanged between the orig and donor inodes, new inode PAs are created for orig by another process's block allocation (since there are semaphore gaps in ext4_move_extents()), and the new inode PAs are partially used (2-1).

   2-1 Create new inode PAs for orig inode
   orig : [ DATA1 | used PA1 | free PA1 ]
   donor: [ DATA2 ]

3. The donor inode, which now holds the old orig inode's blocks, is deleted after EXT4_IOC_MOVE_EXT finishes (3-1, 3-2). So the block bitmap entries corresponding to the old orig inode's blocks are freed.

   3-1 After EXT4_IOC_MOVE_EXT finished
   orig : [ DATA2 | free PA1 ]
   donor: [ DATA1 | used PA1 ]

   3-2 Delete donor inode
   orig : [ DATA2 | free PA1 ]
   donor: [ FREE SPACE(DATA1) | FREE SPACE(used PA1) ]

4. The double free of blocks occurs when close() is called on the orig inode, because ext4_discard_preallocations() for the orig inode frees used PA1 and free PA1, though used PA1 was already freed in step 3.

   4-1 Double-free of blocks occurs
   orig : [ DATA2 | FREE SPACE(free PA1) ]
   donor: [ FREE SPACE(DATA1) | DOUBLE FREE(used PA1) ]

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: make "norecovery" an alias for "noload" | Eric Sandeen
(cherry picked from commit e3bb52ae2bb9573e84c17b8e3560378d13a5c798) Users on the linux-ext4 list recently complained about differences across filesystems w.r.t. how to mount without a journal replay. In the discussion it was noted that xfs's "norecovery" option is perhaps more descriptively accurate than "noload," so let's make that an alias for ext4. Also show this status in /proc/mounts. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: make trim/discard optional (and off by default) | Eric Sandeen
(cherry picked from commit 5328e635315734d42080de9a5a1ee87bf4cae0a4) It is anticipated that when sb_issue_discard starts doing real work on trim-capable devices, we may see issues. Make this mount-time optional, and default it to off until we know that things are working out OK. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix error handling in ext4_ind_get_blocks() | Jan Kara
(cherry picked from commit 2bba702d4f88d7b010ec37e2527b552588404ae7) When an error happened in ext4_splice_branch we failed to notice that in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem by checking for error properly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: avoid issuing unnecessary barriers | Theodore Ts'o
(cherry picked from commit 6b17d902fdd241adfa4ce780df20547b28bf5801) We don't need to issue an I/O barrier on an error or if we force commit because we are doing data journaling. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix block validity checks so they work correctly with meta_bg | Theodore Ts'o
(cherry picked from commit 1032988c71f3f85483b2b4319684d1205a704c02) The block validity checks used by ext4_data_block_valid() weren't correctly written to check file systems with the meta_bg feature. Fix this. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix uninit block bitmap initialization when s_meta_first_bg is non-zero | Theodore Ts'o
(cherry picked from commit 8dadb198cb70ef811916668fe67eeec82e8858dd) The number of old-style block group descriptor blocks is s_meta_first_bg when the meta_bg feature flag is set. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: don't update the superblock in ext4_statfs() | Theodore Ts'o
(cherry picked from commit 3f8fb9490efbd300887470a2a880a64e04dcc3f5) commit a71ce8c6c9bf269b192f352ea555217815cf027e updated ext4_statfs() to update the on-disk superblock counters, but modified this buffer directly without any journaling of the change. This is one of the accesses that was causing the crc errors in journal replay as seen in kernel.org bugzilla #14354. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: journal all modifications in ext4_xattr_set_handle | Eric Sandeen
(cherry picked from commit 86ebfd08a1930ccedb8eac0aeb1ed4b8b6a41dbc) ext4_xattr_set_handle() was zeroing out an inode outside of journaling constraints; this is one of the accesses that was causing the crc errors in journal replay as seen in kernel.org bugzilla #14354. Reviewed-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix i_flags access in ext4_da_writepages_trans_blocks() | Julia Lawall
(cherry picked from commit 30c6e07a92ea4cb87160d32ffa9bce172576ae4c) We need to be testing the i_flags field in the ext4 specific portion of the inode, instead of the (confusingly aliased) i_flags field in the generic struct inode. Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: make sure directory and symlink blocks are revoked | Theodore Ts'o
(cherry picked from commit 50689696867d95b38d9c7be640a311494a04fb86) When an inode gets unlinked, the functions ext4_clear_blocks() and ext4_remove_blocks() call ext4_forget() for all the buffer heads corresponding to the deleted inode's data blocks. If the inode is a directory or a symlink, the is_metadata parameter must be non-zero so ext4_forget() will revoke them via jbd2_journal_revoke(). Otherwise, if these blocks are reused for a data file, and the system crashes before a journal checkpoint, the journal replay could end up corrupting these data blocks. Thanks to Curt Wohlgemuth for pointing out potential problems in this area. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: plug a buffer_head leak in an error path of ext4_iget() | Theodore Ts'o
(cherry picked from commit 567f3e9a70d71e5c9be03701b8578be77857293b) One of the invalid error paths in ext4_iget() forgot to brelse() the inode buffer head. Fix it by adding a brelse() in the common error return path, which also simplifies the function. Thanks to Andi Kleen <ak@linux.intel.com> for reporting the problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14 | ext4: fix possible recursive locking warning in EXT4_IOC_MOVE_EXT | Akira Fujita
(cherry picked from commit 49bd22bc4d603a2a4fc2a6a60e156cbea52eb494) If CONFIG_PROVE_LOCKING is enabled, the double_down_write_data_sem() will trigger a false-positive warning of a recursive lock. Since we take i_data_sem for the two inodes ordered by their inode numbers, this isn't a problem. Use of down_write_nested() will notify the lock dependency checker machinery that there is no problem here. This problem was reported by Brian Rogers: http://marc.info/?l=linux-ext4&m=125115356928011&w=1 Reported-by: Brian Rogers <brian@xyzw.org> Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
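The ordering rule mentioned in this message, taking the two locks in inode-number order, is what keeps two concurrent callers from acquiring them in opposite orders. The sketch below shows only that ordering decision. The type `sketch_inode_ref` and the function `sketch_lock_order` are hypothetical, and the actual semaphore operations and the lockdep annotation are deliberately left out.

```c
#include <assert.h>

/* Hypothetical stand-in for an inode, reduced to its inode number. */
struct sketch_inode_ref {
    unsigned long i_ino;
};

/*
 * Decide the locking order for two inodes: the one with the smaller
 * inode number is always locked first, whichever way the arguments
 * are passed in.
 */
void sketch_lock_order(struct sketch_inode_ref *a, struct sketch_inode_ref *b,
                       struct sketch_inode_ref **first,
                       struct sketch_inode_ref **second)
{
    if (a->i_ino < b->i_ino) {
        *first = a;
        *second = b;
    } else {
        *first = b;
        *second = a;
    }
}
```

Because the result does not depend on argument order, callers passing (orig, donor) and (donor, orig) agree on which lock comes first. That is why the recursive-lock report described above is a false positive.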
2009-12-14ext4: fix lock order problem in ext4_move_extents()Akira Fujita
(cherry picked from commit fc04cb49a898c372a22b21fffc47f299d8710801) ext4_move_extents() checks the logical block contiguousness of the original file with ext4_ext_find_extent() and mext_next_extent(). Therefore the extent that the ext4_ext_path structure indicates must not be changed between those two functions. But in the current implementation, there is no i_data_sem protection between ext4_ext_find_extent() and mext_next_extent(). So the extent that the ext4_ext_path structure indicates may be overwritten by delalloc. As a result, ext4_move_extents() will exchange the wrong blocks between the original and donor files. I changed where i_data_sem is acquired and released to solve this problem. Moreover, I changed move_extent_per_page() to start the transaction first, and then acquire i_data_sem. Without this change, there is a possibility of a deadlock between mmap() and ext4_move_extents(): * NOTE: "A", "B" and "C" mean different processes A-1: ext4_move_extents() acquires i_data_sem of two inodes. B: do_page_fault() starts the transaction (T), and then tries to acquire i_data_sem. But process "A" is already holding it, so it is kept waiting. C: While "A" and "B" are running, kjournald2 tries to commit transaction (T) but it is still being updated, so kjournald2 waits for it. A-2: Calls ext4_journal_start() while holding i_data_sem, but transaction (T) is locked. Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14ext4: fix the returned block count if EXT4_IOC_MOVE_EXT failsAkira Fujita
(cherry picked from commit f868a48d06f8886cb0367568a12367fa4f21ea0d) If the EXT4_IOC_MOVE_EXT ioctl fails, the number of blocks that were exchanged before the failure should be returned to the userspace caller. Unfortunately, currently if the block size is not the same as the page size, the block count that is returned is the page-aligned block count instead of the actual block count. This commit addresses this bug. Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14ext4: avoid divide by zero when trying to mount a corrupted file systemTheodore Ts'o
(cherry picked from commit 503358ae01b70ce6909d19dd01287093f6b6271c) If s_log_groups_per_flex is greater than 31, then groups_per_flex will overflow and cause a divide by zero error. This can cause a kernel BUG if such a file system is mounted. Thanks to Nageswara R Sastry for analyzing the failure and providing an initial patch. http://bugzilla.kernel.org/show_bug.cgi?id=14287 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
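A minimal sketch of the kind of range check that avoids this overflow, written as user-space C. `toy_groups_per_flex()` is a hypothetical helper, and treating values below 1 as "flex groups disabled" is an assumption of this sketch rather than something the message states.

```c
#include <stdint.h>

/*
 * Compute 2^log_groups_per_flex, or return 0 when the on-disk value is
 * outside the range that a 32-bit shift can represent. A caller must
 * treat 0 as "no flex groups" and never divide by it.
 */
static uint32_t toy_groups_per_flex(unsigned int log_groups_per_flex)
{
    /* Assumed valid range for this sketch: 1..31 inclusive. */
    if (log_groups_per_flex < 1 || log_groups_per_flex > 31)
        return 0;
    return (uint32_t)1 << log_groups_per_flex;
}
```

Validating the exponent before shifting matters because a shift by 32 or more on a 32-bit type is undefined in C, so checking the result afterwards is not reliable.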
2009-12-14ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPCTheodore Ts'o
(cherry picked from commit 2de770a406b06dfc619faabbf5d85c835ed3f2e1) Previously add_dirent_to_buf() did not free its passed-in buffer head in the case of ENOSPC, since in some cases the caller still needed it. However, this led to potential buffer head leaks since not all callers dealt with this correctly. Fix this by simplifying the freeing convention; now add_dirent_to_buf() *never* frees the passed-in buffer head, and leaves that to the responsibility of its caller. This makes things cleaner and easier to prove that the code is neither leaking buffer heads nor calling brelse() one time too many. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
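The "callee never frees, caller always frees" convention can be sketched in user-space C as below. All names (`toy_buf`, `toy_add_entry()`, `toy_add_to_dir()`) are hypothetical, and the reference count is a simplified stand-in for buffer head get/brelse() pairing.

```c
#include <errno.h>

/* Hypothetical buffer with a reference count and some free space. */
struct toy_buf {
    int refcount;
    int free_space;
};

/* Callee: may fail with -ENOSPC, but never drops the caller's reference. */
static int toy_add_entry(struct toy_buf *buf, int len)
{
    if (buf->free_space < len)
        return -ENOSPC;
    buf->free_space -= len;
    return 0;
}

/* Caller: takes one reference and releases it exactly once on every path. */
static int toy_add_to_dir(struct toy_buf *buf, int len)
{
    int err;

    buf->refcount++;               /* stands in for reading the block */
    err = toy_add_entry(buf, len);
    buf->refcount--;               /* stands in for the single brelse() */
    return err;
}
```

With a single release point in the caller, the success and ENOSPC paths are balanced by construction, which is the property the message says became easier to verify.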
2009-12-14ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/OMingming
(cherry picked from commit ba230c3f6dc88ec008806adb27b12088486d508e) To prepare for a direct I/O write, we need to split the unwritten extents before submitting the I/O. When no extents needed to be split, ext4_split_unwritten_extents() was incorrectly returning 0 instead of the size of uninitialized extents. This bug caused the wrong return value to be sent back to the VFS code when it gets called from the async I/O path, leading to an unnecessary fallback to buffered I/O. This bug also hid the fact that the check to see whether or not a split would be necessary was incorrect; we can only skip splitting the extent if the write completely covers the uninitialized extent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
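The "write completely covers the extent" condition can be illustrated with a plain interval check. This is a sketch under assumptions: `toy_write_covers_extent()` is hypothetical, ranges are given as (start block, length), and the kernel's actual test may be expressed differently.

```c
#include <stdint.h>

/*
 * Returns non-zero when the write range [w_start, w_start + w_len)
 * contains the whole extent [e_start, e_start + e_len). Only in that
 * case can the split be skipped; a partial overlap still needs one.
 */
static int toy_write_covers_extent(uint64_t w_start, uint64_t w_len,
                                   uint64_t e_start, uint64_t e_len)
{
    return w_start <= e_start &&
           w_start + w_len >= e_start + e_len;
}
```

For an extent covering blocks 100 to 109, a write of blocks 102 to 105 returns 0 here, so the extent would have to be split around the written range.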
2009-12-14ext4: code clean up for dio fallocate handlingMingming
(cherry picked from commit 4b70df181611012a3556f017b57dfcef7e1d279f) The ext4_debug() call in ext4_end_io_dio() should be moved after the check to make sure that io_end is non-NULL. The comment above ext4_get_block_dio_write() ("Maximum number of blocks...") is a duplicate; the original and correct comment is above the #define DIO_MAX_BLOCKS up above. Based on review comments from Curt Wohlgemuth. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14ext4: skip conversion of uninit extents after direct IO if there isn't anyMingming
(cherry picked from commit 5f5249507e4b5c4fc0f9c93f33d133d8c95f47e1) At the end of direct I/O operation, ext4_ext_direct_IO() always called ext4_convert_unwritten_extents(), regardless of whether there were any unwritten extents involved in the I/O or not. This commit adds a state flag so that ext4_ext_direct_IO() only calls ext4_convert_unwritten_extents() when necessary. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
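The state-flag idea can be sketched in user-space C as follows. The flag name, the struct, and both helpers are hypothetical; the counter exists only so the sketch can show how often the conversion step would run.

```c
/* Hypothetical per-inode state bit: "this direct I/O touched an
 * unwritten extent, so a conversion is pending". */
#define TOY_STATE_DIO_UNWRITTEN 0x1u

struct toy_inode_state {
    unsigned int flags;
    int conversions;   /* counts how often the conversion step ran */
};

/* Called when block mapping encounters an unwritten extent. */
static void toy_note_unwritten(struct toy_inode_state *s)
{
    s->flags |= TOY_STATE_DIO_UNWRITTEN;
}

/* Called at the end of the I/O: convert only if the flag was set. */
static void toy_finish_dio(struct toy_inode_state *s)
{
    if (s->flags & TOY_STATE_DIO_UNWRITTEN) {
        s->conversions++;                       /* stands in for the conversion */
        s->flags &= ~TOY_STATE_DIO_UNWRITTEN;   /* clear for the next I/O */
    }
}
```

Clearing the flag after the conversion keeps a later I/O that touches no unwritten extents from paying for the conversion again.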
2009-12-14ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extentsMingming
(cherry picked from commit 109f55651954def97fa41ee71c464d268c512ab0) After a direct I/O request covering an uninitialized extent (i.e., created using the fallocate system call) or a hole in a file, ext4 will convert the uninitialized extent so it is marked as initialized by calling ext4_convert_unwritten_extents(). This function returns zero on success. This return value was getting returned by ext4_direct_IO(); however the file system's direct_IO function is supposed to return the number of bytes read or written on success. By returning zero, it confused the direct I/O code into falling back to buffered I/O unnecessarily. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
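The return-value rule described above can be sketched as a small user-space helper. `toy_direct_io_result()` is hypothetical; it assumes negative values are error codes and that the conversion step reports 0 on success, as the message says.

```c
/*
 * Combine the byte count from the I/O with the status of the later
 * extent conversion. A conversion error replaces the result; a
 * successful conversion (0) must not, otherwise the caller would see
 * "0 bytes transferred" and fall back to buffered I/O.
 */
static long toy_direct_io_result(long bytes_done, int convert_err)
{
    if (bytes_done > 0 && convert_err < 0)
        return convert_err;
    return bytes_done;
}
```

Keeping the byte count in its own variable and overwriting it only on a conversion error is what preserves the direct_IO contract in this sketch.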