Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits)
ceph: update discussion list address in MAINTAINERS
ceph: some documentation fixes
ceph: fix use after free on mds __unregister_request
ceph: avoid loaded term 'OSD' in documentation
ceph: fix possible double-free of mds request reference
ceph: fix session check on mds reply
ceph: handle kmalloc() failure
ceph: propagate mds session allocation failures to caller
ceph: make write_begin wait propagate ERESTARTSYS
ceph: fix snap rebuild condition
ceph: avoid reopening osd connections when address hasn't changed
ceph: rename r_sent_stamp r_stamp
ceph: fix connection fault con_work reentrancy problem
ceph: prevent dup stale messages to console for restarting mds
ceph: fix pg pool decoding from incremental osdmap update
ceph: fix mds sync() race with completing requests
ceph: only release unused caps with mds requests
ceph: clean up handle_cap_grant, handle_caps wrt session mutex
ceph: fix session locking in handle_caps, ceph_check_caps
ceph: drop unnecessary WARN_ON in caps migration
...
|
|
In commit 9df93939b735 ("ext3: Use bitops to read/modify
EXT3_I(inode)->i_state") ext3 changed its internal 'i_state' variable to
use bitops for its state handling. However, unlike the same ext4
change, it didn't actually change the name of the field when it changed
the semantics of it.
As a result, an old use of 'i_state' remained in fs/ext3/ialloc.c that
initialized the field to EXT3_STATE_NEW. And that does not work
_at_all_ when we're now working with individually named bits rather than
values that get masked. So the code tried to mark the state to be new,
but in actual fact set the field to EXT3_STATE_JDATA. Which makes no
sense at all, and screws up all the code that checks whether the inode
was newly allocated.
In particular, it made the xattr code unhappy, and caused various random
behavior, like apparently
https://bugzilla.redhat.com/show_bug.cgi?id=577911
So fix the initialization, and rename the field to match ext4 so that we
don't have this happen again.
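For illustration only, a minimal sketch of the failure mode (the flag names
and bit values below are hypothetical, not ext3's actual definitions):

  /* Flags used to be masks; after the conversion they are bit numbers
   * meant for set_bit()/test_bit() on an unsigned long. */
  enum { STATE_JDATA_BIT, STATE_NEW_BIT };	/* hypothetically 0 and 1 */

  static unsigned long i_state_flags;

  static void mark_inode_new(void)
  {
          /* Buggy leftover: assigns the bit *number* 1, which as a raw
           * value sets bit 0 (STATE_JDATA_BIT), not the "new" bit. */
          /* i_state_flags = STATE_NEW_BIT; */

          /* Fixed: set the bit by number, as set_bit() would. */
          i_state_flags |= 1UL << STATE_NEW_BIT;
  }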
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
instances. Change the remaining instances. This makes the debugfs file
display the time mark and the owner's description again.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There was a use after free in __unregister_request that would trigger
whenever the request map held the last reference. This appears to have
triggered an oops during 'umount -f' when requests are being torn down.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix imperfect completion wait in nilfs_wait_on_logs
nilfs2: fix hang-up of cleaner after log writer returned with error
nilfs2: fix duplicate call to nilfs_segctor_cancel_freev
|
|
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Restore LOOKUP_DIRECTORY hint handling in final lookup on open()
|
|
Lose want_dir argument, while we are at it - since now
nd->flags & LOOKUP_DIRECTORY is equivalent to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
ext4: Don't use delayed allocation by default when used instead of ext3
ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
ext4: Fix estimate of # of blocks needed to write indirect-mapped files
|
|
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: don't try to decode GETATTR if DELEGRETURN returned error
sunrpc: handle allocation errors from __rpc_lookup_create()
SUNRPC: Fix the return value of rpc_run_bc_task()
SUNRPC: Fix a use after free bug with the NFSv4.1 backchannel
SUNRPC: Fix a potential memory leak in auth_gss
NFS: Prevent another deadlock in nfs_release_page()
|
|
Sparse complained about this missing spin_unlock()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
do_sync_read/write() should set kiocb.ki_nbytes to be consistent with
do_sync_readv_writev().
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix an incorrect for-loop in elf_core_vma_data_size(). The advance-pointer
statement lacks an assignment:
CC fs/binfmt_elf_fdpic.o
fs/binfmt_elf_fdpic.c: In function 'elf_core_vma_data_size':
fs/binfmt_elf_fdpic.c:1593: warning: statement with no effect
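A hypothetical reconstruction of the shape of the bug (structure and
function names invented); the bare pointer expression does nothing, so the
loop never advances:

  struct vma { struct vma *vm_next; unsigned long len; };

  static unsigned long vma_data_size(struct vma *head)
  {
          struct vma *v;
          unsigned long size = 0;

          /* Buggy: "v->vm_next" has no effect -> warning + endless loop.
           *   for (v = head; v; v->vm_next)
           * Fixed: actually assign the next pointer. */
          for (v = head; v; v = v->vm_next)
                  size += v->len;

          return size;
  }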
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A size smaller than the minimum blocksize can't be used; it ends up being
handled like a size of 0.
For the extended partition itself, this makes sure to use a size of at
least one logical sector.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Daniel Taylor <Daniel.Taylor@wdc.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In order to use disks larger than 2TiB on Windows XP, it is necessary to
use 4096-byte logical sectors in an MBR.
Although the kernel storage layer and the functions called from msdos.c use
"sector_t" internally, msdos.c itself still used u32 variables, which
prevented it from handling these XP-compatible large disks.
This patch changes the internal variables to "sector_t".
Daniel said: "In the near future, WD will be releasing products that need
this patch".
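Illustrative arithmetic only (the numbers and variable names are made up,
not the driver's code): converting a 32-bit LBA from 4096-byte logical
sectors into the kernel's 512-byte units wraps a u32 once the disk passes
2 TiB, while a 64-bit sector_t-sized value does not.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint32_t lba = 800000000U;              /* 32-bit MBR entry */
          uint32_t logical_sector = 4096;
          uint32_t to_512 = logical_sector / 512; /* 8 */

          uint32_t bad  = lba * to_512;            /* wraps past 2^32 */
          uint64_t good = (uint64_t)lba * to_512;  /* sector_t-sized  */

          printf("u32: %u  64-bit: %llu\n", bad, (unsigned long long)good);
          return 0;
  }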
[hirofumi@mail.parknet.co.jp: tweaks and fix]
Signed-off-by: Daniel Taylor <daniel.taylor@wdc.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
"m" is never NULL here. We need a different test for the end of list
condition.
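A minimal sketch of the pitfall, assuming the kernel's list helpers
(structure and function names invented): after list_for_each_entry() the
cursor is never NULL, so the end-of-list test must compare against the list
head.

  #include <linux/list.h>

  struct item { struct list_head list; int val; };

  static struct item *find_item(struct list_head *head, int val)
  {
          struct item *m;

          list_for_each_entry(m, head, list)
                  if (m->val == val)
                          break;

          /* Wrong: m is never NULL here, even when nothing matched. */
          /* if (!m) return NULL; */

          /* Right: a full traversal leaves the cursor embedded in head. */
          if (&m->list == head)
                  return NULL;
          return m;
  }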
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The reiserfs journal behaves inconsistently when determining whether to
allow a mount of a read-only device.
This is due to the use of the continue_replay variable to short circuit
the journal scanning. If it's set, it's assumed that there are
transactions to replay, but there may not be. If it's unset, it's assumed
that there aren't any, and that may not be the case either.
I've observed two failure cases:
1) Where a clean file system on a read-only device refuses to mount
2) Where a clean file system on a read-only device passes the
optimization and then tries writing the journal header to update
the latest mount id.
The former is easily observable by using a freshly created file system on
a read-only loopback device.
This patch moves the check into journal_read_transaction, where it can
bail out before it's about to replay a transaction. That way it can go
through and skip transactions where appropriate, yet still refuse to mount
a file system with outstanding transactions.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 57fe60df ("reiserfs: add atomic addition of selinux attributes
during inode creation") contains a bug that will cause it to oops when
mounting a file system that didn't previously contain extended attributes
on a system using security.* xattrs.
The issue is that while creating the privroot during mount
reiserfs_security_init calls reiserfs_xattr_jcreate_nblocks which
dereferences the xattr root. The xattr root doesn't exist, so we get an
oops.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=15309
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fs/binfmt_aout.c: In function `aout_core_dump':
fs/binfmt_aout.c:125: warning: passing argument 2 of `dump_write' makes pointer from integer without a cast
include/linux/coredump.h:12: note: expected `const void *' but argument is of type `long unsigned int'
fs/binfmt_aout.c:132: warning: passing argument 2 of `dump_write' makes pointer from integer without a cast
include/linux/coredump.h:12: note: expected `const void *' but argument is of type `long unsigned int'
due to dump_write() expecting a user void *. Fold casts into the
START_DATA/START_STACK macros and shut up the warnings.
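A simplified sketch of the approach, not the real aout macros (the struct
and field names are invented):

  struct aout_dump { unsigned long start_data, start_stack; };

  /* Before: the macro expands to an unsigned long, so
   *   dump_write(file, START_DATA(u), size)
   * warns about making a pointer from an integer without a cast:
   *   #define START_DATA(u)   ((u)->start_data)
   *
   * After: fold the cast into the macro (in the kernel this would be a
   * "void __user *" cast), leaving the call sites untouched. */
  #define START_DATA(u)   ((void *)(u)->start_data)
  #define START_STACK(u)  ((void *)(u)->start_stack)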
Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Cc: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When used_dirs was introduced for the flex_groups struct, it looks
like the accounting was not put into place properly, in some places
manipulating free_inodes rather than used_dirs.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
When ext4 driver is used to mount a filesystem instead of the ext3 file
system driver (through CONFIG_EXT4_USE_FOR_EXT23), do not enable delayed
allocation by default since some ext3 users and application writers have
developed unfortunate expectations about the safety of writing files on
systems subject to sudden and violent death without using fsync().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Oops. (Blush.)
Thanks to Sedat Dilek for pointing this out.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
nilfs_wait_on_logs can return before all bio requests have completed when
it hits an error. This synchronization fault may cause unexpected results,
for instance, accesses to freed segment buffers from an end-bio callback
routine.
This fixes the issue by ensuring that nilfs_wait_on_logs waits for all
given logs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
|
|
According to a report from Andreas Beckmann (Message-ID:
<4BA54677.3090902@abeckmann.de>), nilfs on the 2.6.33 kernel got stuck
after a disk-full error.
This turned out to be a regression caused by the log writer updates merged
in kernel 2.6.33. nilfs_segctor_abort_construction, which is a cleanup
function for error cases, was skipping writeback completion for some logs.
This fixes the bug and resolves the hang.
Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable <stable@kernel.org> [2.6.33.x]
|
|
Clear pointer to mds request after dropping the reference to
ensure we don't drop it again, as there is at least one error
path through this function that does not reset fi->last_readdir
to a new value.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Fix a broken check that a reply came back from the same MDS we sent the
request to. I don't think a case that actually triggers this would ever
come up in practice, but it's clearly wrong and easy to fix.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Return ERR_PTR(-ENOMEM) if kmalloc() fails. We handle allocation
failures the same way later in the function.
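A minimal sketch of the pattern, with an invented structure name:

  #include <linux/slab.h>
  #include <linux/err.h>

  struct thing { int id; };	/* invented for illustration */

  static struct thing *alloc_thing(void)
  {
          struct thing *t = kmalloc(sizeof(*t), GFP_NOFS);

          if (!t)
                  return ERR_PTR(-ENOMEM);	/* callers check IS_ERR() */
          return t;
  }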
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Return error to original caller if register_session() fails.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Currently, if the wait_event_interruptible is interrupted, we
return EAGAIN unconditionally and loop, such that we aren't, in
fact, interruptible. So, propagate ERESTARTSYS if we get it.
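A kernel-context sketch under assumed names (wait_for_ready() and
write_begin_like() are stand-ins, not the ceph functions), showing
-ERESTARTSYS escaping the retry loop instead of being folded into -EAGAIN:

  #include <linux/errno.h>
  #include <linux/fs.h>

  /* Assumed helper: wraps wait_event_interruptible() and returns 0,
   * -EAGAIN, or -ERESTARTSYS. */
  static int wait_for_ready(struct inode *inode);

  static int write_begin_like(struct inode *inode)
  {
          int ret;

          for (;;) {
                  ret = wait_for_ready(inode);
                  if (ret == -ERESTARTSYS)
                          return ret;	/* fixed: let the signal through */
                  if (ret != -EAGAIN)
                          return ret;	/* 0 (ready) or a hard error */
                  /* -EAGAIN: genuinely retry */
          }
  }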
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
We were rebuilding the snap context when it was not necessary
(i.e. when the realm seq hadn't changed _and_ the parent seq
was still older), which caused page snapc pointers to not match
the realm's snapc pointer (even though the snap context itself
was identical). This confused begin_write and put it into an
endless loop.
The correct logic is: rebuild snapc if _my_ realm seq changed, or
if my parent realm's seq is newer than mine (and thus mine needs
to be rebuilt too).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
We get a fault callback on _every_ tcp connection fault. Normally, we
want to reopen the connection when that happens. If the address we have
is bad, however, and connection attempts always result in a connection
refused or similar error, explicitly closing and reopening the msgr
connection just prevents the messenger's backoff logic from kicking in.
The result can be a console full of
[ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed
Instead, if we get a fault, and have outstanding requests, but the osd
address hasn't changed and the connection never successfully connected in
the first place, do nothing to the osd connection. The messenger layer
will back off and retry periodically, because we never connected and thus
the lossy bit is not set.
Instead, touch each request's r_stamp so that handle_timeout can tell the
request is still alive and kicking.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Make variable name slightly more generic, since it will (soon)
reflect either the time the request was sent OR the time it was
last determined to be still retrying.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The messenger fault was clearing the BUSY bit, for reasons unclear. This
made it possible for the con->ops->fault function to reopen the connection,
and requeue work in the workqueue--even though the current thread was
already in con_work.
This avoids a problem where the client busy loops with connection failures
on an unreachable OSD, but doesn't address the root cause of that problem.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Prevent duplicate 'mds0 caps stale' message from spamming the console every
few seconds while the MDS restarts. Set s_renew_requested earlier, so that
we only print the message once, even if we don't send an actual request.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The incremental map decoding of pg pool updates wasn't skipping
the snaps and removed_snaps vectors. This caused osd requests
to stall when pool snapshots were created or fs snapshots were
deleted. Use a common helper for full and incremental map
decoders that decodes pools properly.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The wait_unsafe_requests() helper dropped the mdsc mutex to wait
for each request to complete, and then examined r_node to get the
next request after retaking the lock. But the request completion
removes the request from the tree, so r_node was always undefined
at this point. Since it's a small race, it usually led to a
valid request, but not always. The result was an occasional
crash in rb_next() while dereferencing node->rb_left.
Fix this by clearing the rb_node when removing the request from
the request tree, and not walking off into the weeds when we
are done waiting for a request. Since the request we waited on
will _always_ be out of the request tree, take a ref on the next
request, in the hopes that it won't be. But if it is, it's ok:
we can start over from the beginning (and traverse over older read
requests again).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release
with MDS requests (e.g. setattr). We don't carry refs on most caps, so
this code worked most of the time, but for setattr (utimes) we try to
drop Fscr.
This causes cap state to get slightly out of sync with reality, and may
result in subsequent mds revoke messages getting ignored.
Fix by only releasing unused caps.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Drop session mutex unconditionally in handle_cap_grant, and do the
check_caps from the handle_cap_grant helper. This avoids using a magic
return value.
Also avoid using a flag variable in the IMPORT case and call
check_caps at the appropriate point.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Passing a session pointer to ceph_check_caps() used to mean it would leave
the session mutex locked. That wasn't always possible if it wasn't passed
CHECK_CAPS_AUTHONLY. It could unlock the passed session and lock a
different session mutex, which was clearly wrong, and also emitted a
warning when a racing CPU retook it and we did an unlock from the wrong
context.
This was only a problem when there was more than one MDS.
First, make ceph_check_caps unconditionally drop the session mutex, so that
it is free to lock other sessions as needed. Then adjust the one caller
that passes in a session (handle_cap_grant) accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
If we don't have the exported cap it's because we already released it. No
need to WARN.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
This causes an oops when debug output is enabled and we kick
an osd request with no current r_osd (sometime after an osd
failure). Check the pointer before dereferencing.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Previously we would decode state directly into our current ticket_handler.
This is problematic if for some reason we fail to decode, because we end
up with half new state and half old state.
We are probably already in bad shape if we get an update we can't decode,
but we may as well be tidy anyway. Decode into new_* temporaries and
update the ticket_handler only on success.
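A sketch of the decode-into-temporaries pattern with invented field names
and assumed decode helpers (not the actual ceph auth code):

  struct ticket { unsigned long expires; unsigned int secret_id; };

  /* Assumed decoders: return 0 on success, negative error otherwise. */
  static int decode_ulong(const void **p, const void *end, unsigned long *v);
  static int decode_uint(const void **p, const void *end, unsigned int *v);

  static int update_ticket(struct ticket *t, const void **p, const void *end)
  {
          unsigned long new_expires;
          unsigned int new_secret_id;
          int ret;

          ret = decode_ulong(p, end, &new_expires);
          if (ret)
                  return ret;		/* t is left untouched */
          ret = decode_uint(p, end, &new_secret_id);
          if (ret)
                  return ret;

          /* Commit only after everything decoded cleanly. */
          t->expires = new_expires;
          t->secret_id = new_secret_id;
          return 0;
  }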
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
It seems clear from the surrounding code that xpermits is allowed to be
NULL here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The reply parsing code attempts to decode the GETATTR response even if
the DELEGRETURN portion of the compound returned an error. The GETATTR
response won't actually exist if that's the case and we're asking the
parser to read past the end of the response.
This bug is fairly benign. The parser catches this without reading past
the end of the response and decode_getfattr returns -EIO. Earlier
kernels however had decode_op_hdr using the READ_BUF macro, and this
bug would make this printk pop any time the client got an error from
a delegreturn:
kernel: decode_op_hdr: reply buffer overflowed in line XXXX
More recent kernels seem to have replaced this printk with a dprintk.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Andreas Beckmann reported that nilfs logged the following warnings when it
hit a disk-full condition:
nilfs_sufile_do_cancel_free: segment 0 must be clean
nilfs_sufile_do_cancel_free: segment 1 must be clean
These arise from a duplicate call to nilfs_segctor_cancel_freev in an error
path of the log writer. This fixes the issue.
Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
|
|
Release the old ticket_blob buffer when we get an updated service ticket
from the monitor. Previously these were getting leaked.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The buffer size was incorrectly calculated for the ceph_x_encrypt()
encapsulated ticket blob. Use a helper (with correct arithmetic) and
BUG out if we were wrong.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
We were failing to reconnect to services due to an old authenticator, even
though we had the new ticket, because we weren't properly retrying the
connect handshake, because we were calling an old/incorrect helper that
left in_base_pos incorrect. The result was a failure to reconnect to the
OSD or MDS (with an authentication error) if the MDS restarted after the
service had been up a few hours (long enough for the original authenticator
to be invalid). This was only a problem if the AUTH_X authentication was
enabled.
Now that the 'negotiate' and 'connect' stages are fully separated, we
should use the prepare_read_connect() helper instead, and remove the
obsolete one.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero, such that the issued caps were non-zero and
__ceph_is_any_caps(ci) was true. This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.
Fix this by dropping any residual i_snap_realm ref in destroy_inode.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be
protected by realm->inodes_with_caps_lock. This bug would only have bitten
us in a rare race with a realm split (during some snap creations).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Added assertion, and cleared one case where the implemented caps were
not following the issued caps.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|