linux/ipc, branch v2.6.35

ipc/sem.c: bugfix for semop() not reporting successful operation

2010-07-20T23:25:40Z

The last change to improve the scalability moved the actual wake-up out of the section that is protected by spin_lock(sma->sem_perm.lock). This means that IN_WAKEUP can be in queue.status even when the spinlock is acquired by the current task. Thus the same loop that is performed when queue.status is read without the spinlock acquired must be performed when the spinlock is acquired. Thanks to kamezawa.hiroyu@jp.fujitsu.com for noticing lack of the memory barrier. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255 [akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace] Signed-off-by: Manfred Spraul Reported-by: Luca Tettamanti Tested-by: Luca Tettamanti Reported-by: Christoph Lameter Cc: Maciej Rutecki Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mqueue doesn't need make_bad_inode()

2010-06-04T21:16:27Z

It never hashes them anyway and does final iput() immediately afterwards. With ->drop_inode() being generic_delete_inode()... Signed-off-by: Al Viro

drop unused dentry argument to ->fsync

2010-05-28T02:05:02Z

Signed-off-by: Christoph Hellwig Signed-off-by: Al Viro

ipc/sem.c: use ERR_CAST

2010-05-27T16:12:49Z

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // Signed-off-by: Julia Lawall Cc: Manfred Spraul Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

ipc/sem.c: update description of the implementation

2010-05-27T16:12:49Z

ipc/sem.c begins with a 15 year old description about bugs in the initial implementation in Linux-1.0. The patch replaces that with a top level description of the current code. A TODO could be derived from this text: The opengroup man page for semop() does not mandate FIFO. Thus there is no need for a semaphore array list of pending operations. If - this list is removed - the per-semaphore array spinlock is removed (possible if there is no list to protect) - sem_otime is moved into the semaphores and calculated on demand during semctl() then the array would be read-mostly - which would significantly improve scaling for applications that use semaphore arrays with lots of entries. The price would be expensive semctl() calls: for(i=0;isem_nsems;i++) spin_lock(sma->sem_lock); for(i=0;isem_nsems;i++) spin_unlock(sma->sem_lock); I'm not sure if the complexity is worth the effort, thus here is the documentation of the current behavior first. Signed-off-by: Manfred Spraul Cc: Chris Mason Cc: Zach Brown Cc: Jens Axboe Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

ipc/sem.c: move wake_up_process out of the spinlock section

2010-05-27T16:12:49Z

The wake-up part of semtimedop() consists out of two steps: - the right tasks must be identified. - they must be woken up. Right now, both steps run while the array spinlock is held. This patch reorders the code and moves the actual wake_up_process() behind the point where the spinlock is dropped. The code also moves setting sem->sem_otime to one place: It does not make sense to set the last modify time multiple times. [akpm@linux-foundation.org: repair kerneldoc] [akpm@linux-foundation.org: fix uninitialised retval] Signed-off-by: Manfred Spraul Cc: Chris Mason Cc: Zach Brown Cc: Jens Axboe Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

ipc/sem.c: optimize update_queue() for bulk wakeup calls

2010-05-27T16:12:49Z

The following series of patches tries to fix the spinlock contention reported by Chris Mason - his benchmark exposes problems of the current code: - In the worst case, the algorithm used by update_queue() is O(N^2). Bulk wake-up calls can enter this worst case. The patch series fix that. Note that the benchmark app doesn't expose the problem, it just should be fixed: Real world apps might do the wake-ups in another order than perfect FIFO. - The part of the code that runs within the semaphore array spinlock is significantly larger than necessary. The patch series fixes that. This change is responsible for the main improvement. - The cacheline with the spinlock is also used for a variable that is read in the hot path (sem_base) and for a variable that is unnecessarily written to multiple times (sem_otime). The last step of the series cacheline-aligns the spinlock. This patch: The SysV semaphore code allows to perform multiple operations on all semaphores in the array as atomic operations. After a modification, update_queue() checks which of the waiting tasks can complete. The algorithm that is used to identify the tasks is O(N^2) in the worst case. For some cases, it is simple to avoid the O(N^2). The patch adds a detection logic for some cases, especially for the case of an array where all sleeping tasks are single sembuf operations and a multi-sembuf operation is used to wake up multiple tasks. A big database application uses that approach. The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of the patch breaks that. [akpm@linux-foundation.org: make do_smart_update() static] Signed-off-by: Manfred Spraul Cc: Chris Mason Cc: Zach Brown Cc: Jens Axboe Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

2010-05-25T15:07:02Z

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan Acked-by: WANG Cong Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2010-05-20T00:11:10Z

* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Add clocksource_register_hz/khz interface posix-cpu-timers: Optimize run_posix_cpu_timers() time: Remove xtime_cache mqueue: Convert message queue timeout to use hrtimers hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME timers: Introduce the concept of timer slack for legacy timers ntp: Remove tickadj ntp: Make time_adjust static time: Add xtime, wall_to_monotonic to feature-removal-schedule timer: Try to survive timer callback preempt_count leak timer: Split out timer function call timer: Print function name for timer callbacks modifying preemption count time: Clean up warp_clock() cpu-timers: Avoid iterating over all threads in fastpath_timer_check() cpu-timers: Change SIGEV_NONE timer implementation cpu-timers: Return correct previous timer reload value cpu-timers: Cleanup arm_timer() cpu-timers: Simplify RLIMIT_CPU handling

mqueue: fix kernel BUG caused by double free() on mq_open()

2010-05-12T00:33:42Z

In case of aborting because we reach the maximum amount of memory which can be allocated to message queues per user (RLIMIT_MSGQUEUE), we would try to free the message area twice when bailing out: first by the error handling code itself, and then later when cleaning up the inode through delete_inode(). Signed-off-by: André Goddard Rosa Cc: Alexey Dobriyan Cc: Al Viro Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds