linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2011-04-17	install_special_mapping skips security_file_mmap check.	Tavis Ormandy
	commit 462e635e5b73ba9a4c03913b77138cd57ce4b050 upstream. The install_special_mapping routine (used, for example, to setup the vdso) skips the security check before insert_vm_struct, allowing a local attacker to bypass the mmap_min_addr security restriction by limiting the available pages for special mappings. bprm_mm_init() also skips the check, and although I don't think this can be used to bypass any restrictions, I don't see any reason not to have the security check. $ uname -m x86_64 $ cat /proc/sys/vm/mmap_min_addr 65536 $ cat install_special_mapping.s section .bss resb BSS_SIZE section .text global _start _start: mov eax, __NR_pause int 0x80 $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o $ ./install_special_mapping & [1] 14303 $ cat /proc/14303/maps 0000f000-00010000 r-xp 00000000 00:00 0 [vdso] 00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping 00011000-ffffe000 rwxp 00000000 00:00 0 [stack] It's worth noting that Red Hat are shipping with mmap_min_addr set to 4096. Signed-off-by: Tavis Ormandy <taviso@google.com> Acked-by: Kees Cook <kees@ubuntu.com> Acked-by: Robert Swiecki <swiecki@google.com> [ Changed to not drop the error code - akpm ] Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	NFS: Fix fcntl F_GETLK not reporting some conflicts	Sergey Vlasov
	commit 21ac19d484a8ffb66f64487846c8d53afef04d2b upstream. The commit 129a84de2347002f09721cda3155ccfd19fade40 (locks: fix F_GETLK regression (failure to find conflicts)) fixed the posix_test_lock() function by itself, however, its usage in NFS changed by the commit 9d6a8c5c213e34c475e72b245a8eb709258e968c (locks: give posix_test_lock same interface as ->lock) remained broken - subsequent NFS-specific locking code received F_UNLCK instead of the user-specified lock type. To fix the problem, fl->fl_type needs to be saved before the posix_test_lock() call and restored if no local conflicts were reported. Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892 Tested-by: Alexander Morozov <amorozov@etersoft.ru> Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	nfsd: Fix possible BUG_ON firing in set_change_info	Neil Brown
	commit c1ac3ffcd0bc7e9617f62be8c7043d53ab84deac upstream. If vfs_getattr in fill_post_wcc returns an error, we don't set fh_post_change. For NFSv4, this can result in set_change_info triggering a BUG_ON. i.e. fh_post_saved being zero isn't really a bug. So: - instead of BUGging when fh_post_saved is zero, just clear ->atomic. - if vfs_getattr fails in fill_post_wcc, take a copy of i_ctime anyway. This will be used i seg_change_info, but not overly trusted. - While we are there, remove the pointless 'if' statements in set_change_info. There is no harm setting all the values. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	NFS: Fix panic after nfs_umount()	Chuck Lever
	commit 5b362ac3799ff4225c40935500f520cad4d7ed66 upstream. After a few unsuccessful NFS mount attempts in which the client and server cannot agree on an authentication flavor both support, the client panics. nfs_umount() is invoked in the kernel in this case. Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to write off the end of the rpc_clnt's iostat array. This is because the mount client's nrprocs field is initialized with the count of defined procedures (two: MNT and UMNT), rather than the size of the client's proc array (four). The fix is to use the same initialization technique used by most other upper layer clients in the kernel. Introduced by commit 0b524123, which failed to update nrprocs when support was added for UMNT in the kernel. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302 BugLink: http://bugs.launchpad.net/bugs/683938 Reported-by: Stefan Bader <stefan.bader@canonical.com> Tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	fuse: fix ioctl when server is 32bit	Miklos Szeredi
	commit d9d318d39dd5cb686660504a3565aac453709ccc upstream. If a 32bit CUSE server is run on 64bit this results in EIO being returned to the caller. The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct iovec', which is different on 32bit and 64bit archs. Work around this by looking at the size of the reply to determine which struct was used. This is only needed if CONFIG_COMPAT is defined. A more permanent fix for the interface will be to use the same struct on both 32bit and 64bit. Reported-by: "ccmail111" <ccmail111@yahoo.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Tejun Heo <tj@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	fuse: verify ioctl retries	Miklos Szeredi
	commit 7572777eef78ebdee1ecb7c258c0ef94d35bad16 upstream. Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY doesn't overflow iov_length(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Tejun Heo <tj@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	exec: make argv/envp memory visible to oom-killer	Oleg Nesterov
	commit 3c77f845722158206a7209c45ccddc264d19319c upstream. Brad Spengler published a local memory-allocation DoS that evades the OOM-killer (though not the virtual memory RLIMIT): http://www.grsecurity.net/~spender/64bit_dos.c execve()->copy_strings() can allocate a lot of memory, but this is not visible to oom-killer, nobody can see the nascent bprm->mm and take it into account. With this patch get_arg_page() increments current's MM_ANONPAGES counter every time we allocate the new page for argv/envp. When do_execve() succeds or fails, we change this counter back. Technically this is not 100% correct, we can't know if the new page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but I don't think this really matters and everything becomes correct once exec changes ->mm or fails. Compared to upstream: before 2.6.36 kernel, oom-killer's badness() takes mm->total_vm into account and nothing else. So acct_arg_size() has to play with this counter too. Reported-by: Brad Spengler <spender@grsecurity.net> Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	fuse: fix attributes after open(O_TRUNC)	Ken Sumrall
	commit a0822c55779d9319939eac69f00bb729ea9d23da upstream. The attribute cache for a file was not being cleared when a file is opened with O_TRUNC. If the filesystem's open operation truncates the file ("atomic_o_trunc" feature flag is set) then the kernel should invalidate the cached st_mtime and st_ctime attributes. Also i_size should be explicitly be set to zero as it is used sometimes without refreshing the cache. Signed-off-by: Ken Sumrall <ksumrall@android.com> Cc: Anfei <anfei.zhou@gmail.com> Cc: "Anand V. Avati" <avati@gluster.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	bio: take care not overflow page count when mapping/copying user data	Jens Axboe
	commit cb4644cac4a2797afc847e6c92736664d4b0ea34 upstream. If the iovec is being set up in a way that causes uaddr + PAGE_SIZE to overflow, we could end up attempting to map a huge number of pages. Check for this invalid input type. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	eCryptfs: Clear LOOKUP_OPEN flag when creating lower file	Tyler Hicks
	commit 2e21b3f124eceb6ab5a07c8a061adce14ac94e14 upstream. eCryptfs was passing the LOOKUP_OPEN flag through to the lower file system, even though ecryptfs_create() doesn't support the flag. A valid filp for the lower filesystem could be returned in the nameidata if the lower file system's create() function supported LOOKUP_OPEN, possibly resulting in unencrypted writes to the lower file. However, this is only a potential problem in filesystems (FUSE, NFS, CIFS, CEPH, 9p) that eCryptfs isn't known to support today. https://bugs.launchpad.net/ecryptfs/+bug/641703 Reported-by: Kevin Buhr Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	block: limit vec count in bio_kmalloc() and bio_alloc_map_data()	Jens Axboe
	commit f3f63c1c28bc861a931fac283b5bc3585efb8967 upstream. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-04-17	Fix sget() race with failing mount	Al Viro
	commit 7a4dec53897ecd3367efb1e12fe8a4edc47dc0e9 upstream. If sget() finds a matching superblock being set up, it'll grab an active reference to it and grab s_umount. That's fine - we'll wait for completion of foofs_get_sb() that way. However, if said foofs_get_sb() fails we'll end up holding the halfway-created superblock. deactivate_locked_super() called by foofs_get_sb() will just unlock the sucker since we are holding another active reference to it. What we need is a way to tell if superblock has been successfully set up. Unfortunately, neither ->s_root nor the check for MS_ACTIVE quite fit. Cheap and easy way, suitable for backport: new flag set by the (only) caller of ->get_sb(). If that flag isn't present by the time sget() grabbed s_umount on preexisting superblock it has found, it's seeing a stillborn and should just bury it with deactivate_locked_super() (and repeat the search). Longer term we want to set that flag in ->get_sb() instances (and check for it to distinguish between "sget() found us a live sb" and "sget() has allocated an sb, we need to set it up" in there, instead of checking ->s_root as we do now). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	pipe: fix failure to return error code on ->confirm()	Nicolas Kaiser
	commit e5953cbdff26f7cbae7eff30cd9b18c4e19b7594 upstream. The arguments were transposed, we want to assign the error code to 'ret', which is being returned. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	mm: Move vma_stack_continue into mm.h	Stefan Bader
	commit 39aa3cb3e8250db9188a6f1e3fb62ffa1a717678 upstream. So it can be used by all that need to check for that. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	execve: make responsive to SIGKILL with large arguments	Roland McGrath
	commit 9aea5a65aa7a1af9a4236dfaeb0088f1624f9919 upstream. An execve with a very large total of argument/environment strings can take a really long time in the execve system call. It runs uninterruptibly to count and copy all the strings. This change makes it abort the exec quickly if sent a SIGKILL. Note that this is the conservative change, to interrupt only for SIGKILL, by using fatal_signal_pending(). It would be perfectly correct semantics to let any signal interrupt the string-copying in execve, i.e. use signal_pending() instead of fatal_signal_pending(). We'll save that change for later, since it could have user-visible consequences, such as having a timer set too quickly make it so that an execve can never complete, though it always happened to work before. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	execve: improve interactivity with large arguments	Roland McGrath
	commit 7993bc1f4663c0db67bb8f0d98e6678145b387cd upstream. This adds a preemption point during the copying of the argument and environment strings for execve, in copy_strings(). There is already a preemption point in the count() loop, so this doesn't add any new points in the abstract sense. When the total argument+environment strings are very large, the time spent copying them can be much more than a normal user time slice. So this change improves the interactivity of the rest of the system when one process is doing an execve with very large arguments. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	setup_arg_pages: diagnose excessive argument size	Roland McGrath
	commit 1b528181b2ffa14721fb28ad1bd539fe1732c583 upstream. The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not check the size of the argument/environment area on the stack. When it is unworkably large, shift_arg_pages() hits its BUG_ON. This is exploitable with a very large RLIMIT_STACK limit, to create a crash pretty easily. Check that the initial stack is not too large to make it possible to map in any executable. We're not checking that the actual executable (or intepreter, for binfmt_elf) will fit. So those mappings might clobber part of the initial stack mapping. But that is just userland lossage that userland made happen, not a kernel problem. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ocfs2: Don't walk off the end of fast symlinks.	Joel Becker
	commit 1fc8a117865b54590acd773a55fbac9221b018f0 upstream. ocfs2 fast symlinks are NUL terminated strings stored inline in the inode data area. However, disk corruption or a local attacker could, in theory, remove that NUL. Because we're using strlen() (my fault, introduced in a731d1 when removing vfs_follow_link()), we could walk off the end of that string. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	inotify: fix inotify oneshot support	Eric Paris
	commit ff311008ab8d2f2cfdbbefd407d1b05acc8164b2 upstream. During the large inotify rewrite to fsnotify I completely dropped support for IN_ONESHOT. Reimplement that support. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	dasd: use correct label location for diag fba disks	Peter Oberparleiter
	commit cffab6bc5511cd6f67a60bf16b62de4267b68c4c upstream. Partition boundary calculation fails for DASD FBA disks under the following conditions: - disk is formatted with CMS FORMAT with a blocksize of more than 512 bytes - all of the disk is reserved to a single CMS file using CMS RESERVE - the disk is accessed using the DIAG mode of the DASD driver Under these circumstances, the partition detection code tries to read the CMS label block containing partition-relevant information from logical block offset 1, while it is in fact located at physical block offset 1. Fix this problem by using the correct CMS label block location depending on the device type as determined by the DASD SENSE ID information. Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	xfs: prevent reading uninitialized stack memory	Dan Rosenberg
	commit a122eb2fdfd78b58c6dd992d6f4b1aaef667eef9 upstream. The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12 bytes of uninitialized stack memory, because the fsxattr struct declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero) the 12-byte fsx_pad member before copying it back to the user. This patch takes care of it. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags	Dmitry Monakhov
	commit 84a8dce2710cc425089a2b92acc354d4fbb5788d upstream. A few functions were still modifying i_flags in a racy manner. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	inotify: send IN_UNMOUNT events	Eric Paris
	commit 611da04f7a31b2208e838be55a42c7a1310ae321 upstream. Since the .31 or so notify rewrite inotify has not sent events about inodes which are unmounted. This patch restores those events. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	aio: check for multiplication overflow in do_io_submit	Jeff Moyer
	commit 75e1c70fc31490ef8a373ea2a4bea2524099b478 upstream. Tavis Ormandy pointed out that do_io_submit does not do proper bounds checking on the passed-in iocb array: if (unlikely(nr < 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp))))) return -EFAULT; ^^^^^^^^^^^^^^^^^^ The attached patch checks for overflow, and if it is detected, the number of iocbs submitted is scaled down to a number that will fit in the long. This is an ok thing to do, as sys_io_submit is documented as returning the number of iocbs submitted, so callers should handle a return value of less than the 'nr' argument passed in. Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	char: Mark /dev/zero and /dev/kmem as not capable of writeback	Jan Kara
	commit 371d217ee1ff8b418b8f73fb2a34990f951ec2d4 upstream. These devices don't do any writeback but their device inodes still can get dirty so mark bdi appropriately so that bdi code does the right thing and files inodes to lists of bdi carrying the device inodes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	NFS: Fix a typo in nfs_sockaddr_match_ipaddr6	Trond Myklebust
	commit b20d37ca9561711c6a3c4b859c2855f49565e061 upstream. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	binfmt_misc: fix binfmt_misc priority	Jan Sembera
	commit ee3aebdd8f5f8eac41c25c80ceee3d728f920f3b upstream. Commit 74641f584da ("alpha: binfmt_aout fix") (May 2009) introduced a regression - binfmt_misc is now consulted after binfmt_elf, which will unfortunately break ia32el. ia32 ELF binaries on ia64 used to be matched using binfmt_misc and executed using wrapper. As 32bit binaries are now matched by binfmt_elf before bindmt_misc kicks in, the wrapper is ignored. The fix increases precedence of binfmt_misc to the original state. Signed-off-by: Jan Sembera <jsembera@suse.cz> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Richard Henderson <rth@twiddle.net Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	sysfs: checking for NULL instead of ERR_PTR	Dan Carpenter
	commit 57f9bdac2510cd7fda58e4a111d250861eb1ebeb upstream. d_path() returns an ERR_PTR and it doesn't return NULL. Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ocfs2: Fix incorrect checksum validation error	Sunil Mushran
	commit f5ce5a08a40f2086435858ddc80cb40394b082eb upstream. For local mounts, ocfs2_read_locked_inode() calls ocfs2_read_blocks_sync() to read the inode off the disk. The latter first checks to see if that block is cached in the journal, and, if so, returns that block. That is ok. But ocfs2_read_locked_inode() goes wrong when it tries to validate the checksum of such blocks. Blocks that are cached in the journal may not have had their checksum computed as yet. We should not validate the checksums of such blocks. Fixes ossbz#1282 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1282 Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Singed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	fuse: flush background queue on connection close	Miklos Szeredi
	commit 595afaf9e6ee1b48e13ec4b8bcc8c7dee888161a upstream. David Bartly reported that fuse can hang in fuse_get_req_nofail() when the connection to the filesystem server is no longer active. If bg_queue is not empty then flush_bg_queue() called from request_end() can put more requests on to the pending queue. If this happens while ending requests on the processing queue then those background requests will be queued to the pending list and never ended. Another problem is that fuse_dev_release() didn't wake up processes sleeping on blocked_waitq. Solve this by: a) flushing the background queue before calling end_requests() on the pending and processing queues b) setting blocked = 0 and waking up processes waiting on blocked_waitq() Thanks to David for an excellent bug report. Reported-by: David Bartley <andareed@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: fix freeze deadlock under IO	Eric Sandeen
	commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f upstream. [The 6b0310fb below references the mainline version of what has also been cherry picked into this 34-stable branch] Commit 6b0310fbf087ad6 caused a regression resulting in deadlocks when freezing a filesystem which had active IO; the vfs_check_frozen level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing through. Duh. Changing the test to FREEZE_TRANS should let the normal freeze syncing get through the fs, but still block any transactions from starting once the fs is completely frozen. I tested this by running fsstress in the background while periodically snapshotting the fs and running fsck on the result. I ran into occasional deadlocks, but different ones. I think this is a fine fix for the problem at hand, and the other deadlocky things will need more investigation. Reported-by: Phillip Susi <psusi@cfl.rr.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	CIFS: Remove __exit mark from cifs_exit_dns_resolver()	David Howells
	commit 51c20fcced5badee0e2021c6c89f44aa3cbd72aa upstream. Remove the __exit mark from cifs_exit_dns_resolver() as it's called by the module init routine in case of error, and so may have been discarded during linkage. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Make fsync sync new parent directories in no-journal mode	Frank Mayhar
	commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c upstream. Add a new ext4 state to tell us when a file has been newly created; use that state in ext4_sync_file in no-journal mode to tell us when we need to sync the parent directory as well as the inode and data itself. This fixes a problem in which a panic or power failure may lose the entire file even when using fsync, since the parent directory entry is lost. Addresses-Google-Bug: #2480057 Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Fix compat EXT4_IOC_ADD_GROUP	Ben Hutchings
	commit 4d92dc0f00a775dc2e1267b0e00befb783902fe7 upstream. struct ext4_new_group_input needs to be converted because u64 has only 32-bit alignment on some 32-bit architectures, notably i386. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Conditionally define compat ioctl numbers	Ben Hutchings
	commit 899ad0cea6ad7ff4ba24b16318edbc3cbbe03fad upstream. It is unnecessary, and in general impossible, to define the compat ioctl numbers except when building the filesystem with CONFIG_COMPAT defined. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: restart ext4_ext_remove_space() after transaction restart	Dmitry Monakhov
	commit 0617b83fa239db9743a18ce6cc0e556f4d0fd567 upstream. If i_data_sem was internally dropped due to transaction restart, it is necessary to restart path look-up because extents tree was possibly modified by ext4_get_block(). https://bugzilla.kernel.org/show_bug.cgi?id=15827 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted	Theodore Ts'o
	commit 786ec7915e530936b9eb2e3d12274145cab7aa7d upstream. Dimitry Monakhov discovered an edge case where it was possible for the EXT4_EOFBLOCKS_FL flag could get cleared unnecessarily. This is true; I have a test case that can be exercised via downloading and decompressing the file: wget ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-testcases/eofblocks-fl-test-case.img.bz2 bunzip2 eofblocks-fl-test-case.img dd if=/dev/zero of=eofblocks-fl-test-case.img bs=1k seek=17925 bs=1k count=1 conv=notrunc However, triggering it in real life is highly unlikely since it requires an extremely fragmented sparse file with a hole in exactly the right place in the extent tree. (It actually took quite a bit of work to generate this test case.) Still, it's nice to get even extreme corner cases to be correct, so this patch makes sure that we don't clear the EXT4_EOFBLOCKS_FL incorrectly even in this corner case. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Avoid crashing on NULL ptr dereference on a filesystem error	Theodore Ts'o
	commit f70f362b4a6fe47c239dbfb3efc0cc2c10e4f09c upstream. If the EOFBLOCK_FL flag is set when it should not be and the inode is zero length, then eh_entries is zero, and ex is NULL, so dereferencing ex to print ex->ee_block causes a kernel OOPS in ext4_ext_map_blocks(). On top of that, the error message which is printed isn't very helpful. So we fix this by printing something more explanatory which doesn't involve trying to print ex->ee_block. Addresses-Google-Bug: #2655740 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Use bitops to read/modify i_flags in struct ext4_inode_info	Dmitry Monakhov
	commit 12e9b892002d9af057655d35b44db8ee9243b0dc upstream. At several places we modify EXT4_I(inode)->i_flags without holding i_mutex (ext4_do_update_inode, ...). These modifications are racy and we can lose updates to i_flags. So convert handling of i_flags to use bitops which are atomic. https://bugzilla.kernel.org/show_bug.cgi?id=15792 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Show journal_checksum option	Jan Kara
	commit 39a4bade8c1826b658316d66ee81c09b0a4d7d42 upstream. We failed to show journal_checksum option in /proc/mounts. Fix it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: check for a good block group before loading buddy pages	Curt Wohlgemuth
	commit 8a57d9d61a6e361c7bb159dda797672c1df1a691 upstream. This adds a new field in ext4_group_info to cache the largest available block range in a block group; and don't load the buddy pages until after we've done a sanity check on the block group. With large allocation requests (e.g., fallocate(), 8MiB) and relatively full partitions, it's easy to have no block groups with a block extent large enough to satisfy the input request length. This currently causes the loop during cr == 0 in ext4_mb_regular_allocator() to load the buddy bitmap pages for EVERY block group. That can be a lot of pages. The patch below allows us to call ext4_mb_good_group() BEFORE we load the buddy pages (although we have check again after we lock the block group). Addresses-Google-Bug: #2578108 Addresses-Google-Bug: #2704453 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate	Nikanth Karthikesan
	commit 6d19c42b7cf81c39632b6d4dbc514e8449bcd346 upstream. Currently using posix_fallocate one can bypass an RLIMIT_FSIZE limit and create a file larger than the limit. Add a check for that. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Amit Arora <aarora@in.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Remove extraneous newlines in ext4_msg() calls	Curt Wohlgemuth
	commit fbe845ddf368f77f86aa7500f8fd2690f54c66a8 upstream. Addresses-Google-Bug: #2562325 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: init statistics after journal recovery	Dmitry Monakhov
	commit 84061e07c5fbbbf9dc8aef8fb750fc3a2dfc31f3 upstream. Currently block/inode/dir counters initialized before journal was recovered. In fact after journal recovery this info will probably change. And freeblocks it critical for correct delalloc mode accounting. https://bugzilla.kernel.org/show_bug.cgi?id=15768 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: clean up inode bitmaps manipulation in ext4_free_inode	Dmitry Monakhov
	commit d17413c08cd2b1dd2bf2cfdbb0f7b736b2b2b15c upstream. - Reorganize locking scheme to batch two atomic operation in to one. This also allow us to state what healthy group must obey following rule ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM); - Fix possible undefined pointer dereference. - Even if group descriptor stats aren't accessible we have to update inode bitmaps. - Move non-group members update out of group_lock. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: Do not zero out uninitialized extents beyond i_size	Dmitry Monakhov
	commit 21ca087a3891efab4d45488db8febee474d26c68 upstream. The extents code will sometimes zero out blocks and mark them as initialized instead of splitting an extent into several smaller ones. This optimization however, causes problems if the extent is beyond i_size because fsck will complain if there are uninitialized blocks after i_size as this can not be distinguished from an inode that has an incorrect i_size field. https://bugzilla.kernel.org/show_bug.cgi?id=15742 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: don't scan/accumulate more pages than mballoc will allocate	Eric Sandeen
	commit c445e3e0a5c2804524dec6e55f66d63f6bc5bc3e upstream. There was a bug reported on RHEL5 that a 10G dd on a 12G box had a very, very slow sync after that. At issue was the loop in write_cache_pages scanning all the way to the end of the 10G file, even though the subsequent call to mpage_da_submit_io would only actually write a smallish amt; then we went back to the write_cache_pages loop ... wasting tons of time in calling __mpage_da_writepage for thousands of pages we would just revisit (many times) later. Upstream it's not such a big issue for sys_sync because we get to the loop with a much smaller nr_to_write, which limits the loop. However, talking with Aneesh he realized that fsync upstream still gets here with a very large nr_to_write and we face the same problem. This patch makes mpage_add_bh_to_extent stop the loop after we've accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately causes the write_cache_pages loop to break. Repeating the test with a dirty_ratio of 80 (to leave something for fsync to do), I don't see huge IO performance gains, but the reduction in cpu usage is striking: 80% usage with stock, and 2% with the below patch. Instrumenting the loop in write_cache_pages clearly shows that we are wasting time here. Eventually we need to change mpage_da_map_pages() also submit its I/O to the block layer, subsuming mpage_da_submit_io(), and then change it call ext4_get_blocks() multiple times. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: stop issuing discards if not supported by device	Eric Sandeen
	commit a30eec2a8650a77f754e84b2e15f062fe652baa7 upstream. Turn off issuance of discard requests if the device does not support it - similar to the action we take for barriers. This will save a little computation time if a non-discardable device is mounted with -o discard, and also makes it obvious that it's not doing what was asked at mount time ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: don't return to userspace after freezing the fs with a mutex held	Eric Sandeen
	commit 6b0310fbf087ad6e9e3b8392adca97cd77184084 upstream. ext4_freeze() used jbd2_journal_lock_updates() which takes the j_barrier mutex, and then returns to userspace. The kernel does not like this: ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ lvcreate/1075 is leaving the kernel with locks still held! 1 lock held by lvcreate/1075: #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>] jbd2_journal_lock_updates+0xe1/0xf0 Use vfs_check_frozen() added to ext4_journal_start_sb() and ext4_force_commit() instead. Addresses-Red-Hat-Bugzilla: #568503 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-01-06	ext4: fix quota accounting in case of fallocate	Dmitry Monakhov
	commit 35121c9860316d7799cea0fbc359a9186e7c2747 upstream. allocated_meta_data is already included in 'used' variable. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>