<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel, branch v3.0.3</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/kernel?h=v3.0.3</id>
<link rel='self' href='https://git.amat.us/linux/atom/kernel?h=v3.0.3'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2011-08-16T01:31:32Z</updated>
<entry>
<title>futex: Fix regression with read only mappings</title>
<updated>2011-08-16T01:31:32Z</updated>
<author>
<name>Shawn Bohrer</name>
<email>sbohrer@rgmadvisors.com</email>
</author>
<published>2011-06-30T16:21:32Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=3169336d0e0e6a41049b6798404a6ec1eb3ca80d'/>
<id>urn:sha1:3169336d0e0e6a41049b6798404a6ec1eb3ca80d</id>
<content type='text'>
commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae upstream.

commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 fixed two problems:  First, It
prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW
MAP_PRIVATE futex operations by forcing the COW to occur by
unconditionally performing a write access get_user_pages_fast() to get
the page.  The commit also introduced a user-mode regression in that it
broke futex operations on read-only memory maps.  For example, this
breaks workloads that have one or more reader processes doing a
FUTEX_WAIT on a futex within a read only shared file mapping, and a
writer processes that has a writable mapping issuing the FUTEX_WAKE.

This fixes the regression for valid futex operations on RO mappings by
trying a RO get_user_pages_fast() when the RW get_user_pages_fast()
fails. This change makes it necessary to also check for invalid use
cases, such as anonymous RO mappings (which can never change) and the
ZERO_PAGE which the commit referenced above was written to address.

This patch does restore the original behavior with RO MAP_PRIVATE
mappings, which have inherent user-mode usage problems and don't really
make sense.  With this patch performing a FUTEX_WAIT within a RO
MAP_PRIVATE mapping will be successfully woken provided another process
updates the region of the underlying mapped file.  However, the mmap()
man page states that for a MAP_PRIVATE mapping:

  It is unspecified whether changes made to the file after
  the mmap() call are visible in the mapped region.

So user-mode users attempting to use futex operations on RO MAP_PRIVATE
mappings are depending on unspecified behavior.  Additionally a
RO MAP_PRIVATE mapping could fail to wake up in the following case.

  Thread-A: call futex(FUTEX_WAIT, memory-region-A).
            get_futex_key() return inode based key.
            sleep on the key
  Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
  Thread-B: write memory-region-A.
            COW happen. This process's memory-region-A become related
            to new COWed private (ie PageAnon=1) page.
  Thread-B: call futex(FUETX_WAKE, memory-region-A).
            get_futex_key() return mm based key.
            IOW, we fail to wake up Thread-A.

Once again doing something like this is just silly and users who do
something like this get what they deserve.

While RO MAP_PRIVATE mappings are nonsensical, checking for a private
mapping requires walking the vmas and was deemed too costly to avoid a
userspace hang.

This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.

Reported-by: David Oliver &lt;david@rgmadvisors.com&gt;
Signed-off-by: Shawn Bohrer &lt;sbohrer@rgmadvisors.com&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: peterz@infradead.org
Cc: eric.dumazet@gmail.com
Cc: zvonler@rgmadvisors.com
Cc: hughd@google.com
Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>mm/futex: fix futex writes on archs with SW tracking of dirty &amp; young</title>
<updated>2011-08-05T04:58:38Z</updated>
<author>
<name>Benjamin Herrenschmidt</name>
<email>benh@kernel.crashing.org</email>
</author>
<published>2011-07-26T00:12:32Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=b045b9a265fb46d8197b7d78aff1a8f6ab8e23df'/>
<id>urn:sha1:b045b9a265fb46d8197b7d78aff1a8f6ab8e23df</id>
<content type='text'>
commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream.

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...

So I think you essentially hang in the kernel.

The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping &amp; fork or something like that.

On archs who use SW tracking of dirty &amp; young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.

However that isn't the case.  get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state.  It will eventually
update those bits ...  in the struct page, but not in the PTE.

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job.  The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Reported-by: Shan Hai &lt;haishan.bai@gmail.com&gt;
Tested-by: Shan Hai &lt;haishan.bai@gmail.com&gt;
Cc: David Laight &lt;David.Laight@ACULAB.COM&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Darren Hart &lt;darren.hart@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>tracing: Have "enable" file use refcounts like the "filter" file</title>
<updated>2011-08-05T04:58:38Z</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2011-07-05T18:32:51Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ff7b3dc6a634832a912445db8bffd18b05c15043'/>
<id>urn:sha1:ff7b3dc6a634832a912445db8bffd18b05c15043</id>
<content type='text'>
commit 40ee4dffff061399eb9358e0c8fcfbaf8de4c8fe upstream.

The "enable" file for the event system can be removed when a module
is unloaded and the event system only has events from that module.
As the event system nr_events count goes to zero, it may be freed
if its ref_count is also set to zero.

Like the "filter" file, the "enable" file may be opened by a task and
referenced later, after a module has been unloaded and the events for
that event system have been removed.

Although the "filter" file referenced the event system structure,
the "enable" file only references a pointer to the event system
name. Since the name is freed when the event system is removed,
it is possible that an access to the "enable" file may reference
a freed pointer.

Update the "enable" file to use the subsystem_open() routine that
the "filter" file uses, to keep a reference to the event system
structure while the "enable" file is opened.

Reported-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>tracing: Fix bug when reading system filters on module removal</title>
<updated>2011-08-05T04:58:38Z</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2011-07-05T15:36:06Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=f35869d69b65ac9323a0d7ee13c062e1636d4a1b'/>
<id>urn:sha1:f35869d69b65ac9323a0d7ee13c062e1636d4a1b</id>
<content type='text'>
commit e9dbfae53eeb9fc3d4bb7da3df87fa9875f5da02 upstream.

The event system is freed when its nr_events is set to zero. This happens
when a module created an event system and then later the module is
removed. Modules may share systems, so the system is allocated when
it is created and freed when the modules are unloaded and all the
events under the system are removed (nr_events set to zero).

The problem arises when a task opened the "filter" file for the
system. If the module is unloaded and it removed the last event for
that system, the system structure is freed. If the task that opened
the filter file accesses the "filter" file after the system has
been freed, the system will access an invalid pointer.

By adding a ref_count, and using it to keep track of what
is using the event system, we can free it after all users
are finished with the event system.

Reported-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>perf: Fix software event overflow</title>
<updated>2011-08-05T04:58:35Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-07-28T18:47:10Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=462fee3af72df0de7b60b96c525ffe8baf4db0f0'/>
<id>urn:sha1:462fee3af72df0de7b60b96c525ffe8baf4db0f0</id>
<content type='text'>
The below patch is for -stable only, upstream has a much larger patch
that contains the below hunk in commit a8b0ca17b80e92faab46ee7179ba9e99ccb61233

Vince found that under certain circumstances software event overflows
go wrong and deadlock. Avoid trying to delete a timer from the timer
callback.

Reported-by: Vince Weaver &lt;vweaver1@eecs.utk.edu&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip</title>
<updated>2011-07-20T22:56:25Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-07-20T22:56:25Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=cf6ace16a3cd8b728fb0afa68368fd40bbeae19f'/>
<id>urn:sha1:cf6ace16a3cd8b728fb0afa68368fd40bbeae19f</id>
<content type='text'>
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  signal: align __lock_task_sighand() irq disabling and RCU
  softirq,rcu: Inform RCU of irq_exit() activity
  sched: Add irq_{enter,exit}() to scheduler_ipi()
  rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
  rcu: Streamline code produced by __rcu_read_unlock()
  rcu: Fix RCU_BOOST race handling current-&gt;rcu_read_unlock_special
  rcu: decrease rcu_report_exp_rnp coupling with scheduler
</content>
</entry>
<entry>
<title>Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent</title>
<updated>2011-07-20T18:59:26Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2011-07-20T18:59:26Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=d1e9ae47a0285d3f1699e8219ce50f656243b93f'/>
<id>urn:sha1:d1e9ae47a0285d3f1699e8219ce50f656243b93f</id>
<content type='text'>
</content>
</entry>
<entry>
<title>signal: align __lock_task_sighand() irq disabling and RCU</title>
<updated>2011-07-20T18:04:54Z</updated>
<author>
<name>Paul E. McKenney</name>
<email>paul.mckenney@linaro.org</email>
</author>
<published>2011-07-19T10:25:36Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=a841796f11c90d53dbac773be56b04fbee8af272'/>
<id>urn:sha1:a841796f11c90d53dbac773be56b04fbee8af272</id>
<content type='text'>
The __lock_task_sighand() function calls rcu_read_lock() with interrupts
and preemption enabled, but later calls rcu_read_unlock() with interrupts
disabled.  It is therefore possible that this RCU read-side critical
section will be preempted and later RCU priority boosted, which means that
rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but
with interrupts disabled. This results in lockdep splats, so this commit
nests the RCU read-side critical section within the interrupt-disabled
region of code.  This prevents the RCU read-side critical section from
being preempted, and thus prevents the attempt to deboost with interrupts
disabled.

It is quite possible that a better long-term fix is to make rt_mutex_unlock()
disable irqs when acquiring the rt_mutex structure's -&gt;wait_lock.

Signed-off-by: Paul E. McKenney &lt;paul.mckenney@linaro.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</content>
</entry>
<entry>
<title>softirq,rcu: Inform RCU of irq_exit() activity</title>
<updated>2011-07-20T17:50:12Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-07-19T22:32:00Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ec433f0c51527426989ea8a38a856d810d739414'/>
<id>urn:sha1:ec433f0c51527426989ea8a38a856d810d739414</id>
<content type='text'>
The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level.  This fails because exit_irq()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level.  This situation
can result in failures as follows:

 $task			IRQ		SoftIRQ

 rcu_read_lock()

 /* do stuff */

 &lt;preempt&gt; |= UNLOCK_BLOCKED

 rcu_read_unlock()
   --t-&gt;rcu_read_lock_nesting

			irq_enter();
			/* do stuff, don't use RCU */
			irq_exit();
			  sub_preempt_count(IRQ_EXIT_OFFSET);
			  invoke_softirq()

					ttwu();
					  spin_lock_irq(&amp;pi-&gt;lock)
					  rcu_read_lock();
					  /* do stuff */
					  rcu_read_unlock();
					    rcu_read_unlock_special()
					      rcu_report_exp_rnp()
					        ttwu()
					          spin_lock_irq(&amp;pi-&gt;lock) /* deadlock */

   rcu_read_unlock_special(t);

Ed can simply trigger this 'easy' because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
first, but even without that the above happens.

Cure this by also excluding softirqs from the
rcu_read_unlock_special() handler and ensuring the force_irqthreads
ksoftirqd/# wakeup is done from full softirq context.

[ Alternatively, delaying the -&gt;rcu_read_lock_nesting decrement
  until after the special handling would make the thing more robust
  in the face of interrupts as well.  And there is a separate patch
  for that. ]

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reported-and-tested-by: Ed Tomlinson &lt;edt@aei.ca&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</content>
</entry>
<entry>
<title>sched: Add irq_{enter,exit}() to scheduler_ipi()</title>
<updated>2011-07-20T17:50:11Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-07-19T22:07:25Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c5d753a55ac92e09816d410cd17093813f1a904b'/>
<id>urn:sha1:c5d753a55ac92e09816d410cd17093813f1a904b</id>
<content type='text'>
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.

Now that we do do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things.

This affects things like timekeeping, NO_HZ and RCU, basically
everything with a hook in irq_enter/exit.

Explicit examples of things going wrong are:

  sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
                    a new reading from GTOD and TSC. Without this
                    callback, time is stuck in the past.

  RCU -- needs in_irq() to work in order to avoid some nasty deadlocks

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
</content>
</entry>
</feed>
