<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/kernel, branch v2.6.13-rc7</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/kernel?h=v2.6.13-rc7</id>
<link rel='self' href='https://git.amat.us/linux/atom/kernel?h=v2.6.13-rc7'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2005-08-24T03:02:52Z</updated>
<entry>
<title>[PATCH] cpu_exclusive sched domains on partial nodes temp fix</title>
<updated>2005-08-24T03:02:52Z</updated>
<author>
<name>Paul Jackson</name>
<email>pj@sgi.com</email>
</author>
<published>2005-08-23T08:04:27Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=d10689b68aff7b48e3de1a3f7fcd6567bd2905af'/>
<id>urn:sha1:d10689b68aff7b48e3de1a3f7fcd6567bd2905af</id>
<content type='text'>
This keeps the kernel/cpuset.c routine update_cpu_domains() from
invoking the sched.c routine partition_sched_domains() if the cpuset in
question doesn't fall on node boundaries.

I have boot tested this on an SN2, and with the help of a couple of ad
hoc printk's, determined that it does indeed avoid calling the
partition_sched_domains() routine on partial nodes.

I did not directly verify that this avoids setting up bogus sched
domains or avoids the oops that Hawkes saw.

This patch imposes a silent artificial constraint on which cpusets can
be used to define dynamic sched domains.

This patch should allow proceeding with this new feature in 2.6.13 for
the configurations in which it is useful (node alligned sched domains)
while avoiding trying to setup sched domains in the less useful cases
that can cause the kernel corruption and oops.

Signed-off-by: Paul Jackson &lt;pj@sgi.com&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Acked-by: Dinakar Guniguntala &lt;dino@in.ibm.com&gt;
Acked-by: John Hawkes &lt;hawkes@sgi.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] preempt race in getppid</title>
<updated>2005-08-23T18:44:29Z</updated>
<author>
<name>David Meybohm</name>
<email>dmeybohmlkml@bellsouth.net</email>
</author>
<published>2005-08-22T20:11:08Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4c5640cb5f5a6fd780d99397eca028b575cb1206'/>
<id>urn:sha1:4c5640cb5f5a6fd780d99397eca028b575cb1206</id>
<content type='text'>
With CONFIG_PREEMPT &amp;&amp; !CONFIG_SMP, it's possible for sys_getppid to
return a bogus value if the parent's task_struct gets reallocated after
current-&gt;group_leader-&gt;real_parent is read:

        asmlinkage long sys_getppid(void)
        {
                int pid;
                struct task_struct *me = current;
                struct task_struct *parent;

                parent = me-&gt;group_leader-&gt;real_parent;
RACE HERE =&gt;    for (;;) {
                        pid = parent-&gt;tgid;
        #ifdef CONFIG_SMP
        {
                        struct task_struct *old = parent;

                        /*
                         * Make sure we read the pid before re-reading the
                         * parent pointer:
                         */
                        smp_rmb();
                        parent = me-&gt;group_leader-&gt;real_parent;
                        if (old != parent)
                                continue;
        }
        #endif
                        break;
                }
                return pid;
        }

If the process gets preempted at the indicated point, the parent process
can go ahead and call exit() and then get wait()'d on to reap its
task_struct. When the preempted process gets resumed, it will not do any
further checks of the parent pointer on !CONFIG_SMP: it will read the
bad pid and return.

So, the same algorithm used when SMP is enabled should be used when
preempt is enabled, which will recheck -&gt;real_parent in this case.

Signed-off-by: David Meybohm &lt;dmeybohmlkml@bellsouth.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] Make RLIMIT_NICE ranges consistent with getpriority(2)</title>
<updated>2005-08-18T19:53:58Z</updated>
<author>
<name>Matt Mackall</name>
<email>mpm@selenic.com</email>
</author>
<published>2005-08-18T18:24:19Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=024f474795af7a0d41bd6d60061d78bd66d13f56'/>
<id>urn:sha1:024f474795af7a0d41bd6d60061d78bd66d13f56</id>
<content type='text'>
As suggested by Michael Kerrisk &lt;mtk-manpages@gmx.net&gt;, make RLIMIT_NICE
consistent with getpriority before it becomes available in released glibc.

Signed-off-by: Matt Mackall &lt;mpm@selenic.com&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Acked-by: Chris Wright &lt;chrisw@osdl.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] NPTL signal delivery deadlock fix</title>
<updated>2005-08-17T19:52:04Z</updated>
<author>
<name>Bhavesh P. Davda</name>
<email>bhavesh@avaya.com</email>
</author>
<published>2005-08-17T18:26:33Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=dd12f48d4e8774415b528d3991ae47c28f26e1ac'/>
<id>urn:sha1:dd12f48d4e8774415b528d3991ae47c28f26e1ac</id>
<content type='text'>
This bug is quite subtle and only happens in a very interesting
situation where a real-time threaded process is in the middle of a
coredump when someone whacks it with a SIGKILL.  However, this deadlock
leaves the system pretty hosed and you have to reboot to recover.

Not good for real-time priority-preemption applications like our
telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR)
processes, many of them multi-threaded, interacting with each other for
high volume call processing.

Acked-by: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] remove name length check in a workqueue</title>
<updated>2005-08-10T18:55:19Z</updated>
<author>
<name>James Bottomley</name>
<email>James.Bottomley@SteelEye.com</email>
</author>
<published>2005-08-10T18:29:15Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=606867443764edac5a2c542f2fa0a12ef7a7c7fd'/>
<id>urn:sha1:606867443764edac5a2c542f2fa0a12ef7a7c7fd</id>
<content type='text'>
We have a chek in there to make sure that the name won't overflow
task_struct.comm[], but it's triggering for scsi with lots of HBAs, only
scsi is using single-threaded workqueues which don't append the "/%d"
anyway.

All too hard.  Just kill the BUG_ON.

Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;

[ kthread_create() uses vsnprintf() and limits the thing, so no
  actual overflow can actually happen regardless ]

Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] cpuset release ABBA deadlock fix</title>
<updated>2005-08-09T19:08:22Z</updated>
<author>
<name>Paul Jackson</name>
<email>pj@sgi.com</email>
</author>
<published>2005-08-09T17:07:59Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=3077a260e9f316b611436b1506eec9cc5c4f8aa6'/>
<id>urn:sha1:3077a260e9f316b611436b1506eec9cc5c4f8aa6</id>
<content type='text'>
Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.

For a particular usage pattern, creating and destroying cpusets fairly
frequently using notify_on_release, on a very large system, this deadlock
can be seen every few days.  If you are not using the cpuset
notify_on_release feature, you will never see this deadlock.

The existing code, on task exit (or cpuset deletion) did:

  get cpuset_sem
  if cpuset marked notify_on_release and is ready to release:
    compute cpuset path relative to /dev/cpuset mount point
    call_usermodehelper() forks /sbin/cpuset_release_agent with path
  drop cpuset_sem

Unfortunately, the fork in call_usermodehelper can allocate memory, and
allocating memory can require cpuset_sem, if the mems_generation values
changed in the interim.  This results in an ABBA deadlock, trying to obtain
cpuset_sem when it is already held by the current task.

To fix this, I put the cpuset path (which must be computed while holding
cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.

So the new logic is:

  get cpuset_sem
  if cpuset marked notify_on_release and is ready to release:
    compute cpuset path relative to /dev/cpuset mount point
    stash path in kmalloc'd buffer
  drop cpuset_sem
  call_usermodehelper() forks /sbin/cpuset_release_agent with path
  free path

The sharp eyed reader might notice that this patch does not contain any
calls to kmalloc.  The existing code in the check_for_release() routine was
already kmalloc'ing a buffer to hold the cpuset path.  In the old code, it
just held the buffer for a few lines, over the cpuset_release_agent() call
that in turn invoked call_usermodehelper().  In the new code, with the
application of this patch, it returns that buffer via the new char
**ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
which is called after cpuset_sem is dropped.  Whereas the old code has just
one call to cpuset_release_agent(), right in the check_for_release()
routine, the new code has three calls to cpuset_release_agent(), from the
various places that a cpuset can be released.

This patch has been build and booted on SN2, and passed a stress test that
previously hit the deadlock within a few seconds.

Signed-off-by: Paul Jackson &lt;pj@sgi.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] revert "timer exit cleanup"</title>
<updated>2005-08-04T23:57:49Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@osdl.org</email>
</author>
<published>2005-08-04T23:49:32Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c306895167c8384b88bc02945a0d226a04218fa5'/>
<id>urn:sha1:c306895167c8384b88bc02945a0d226a04218fa5</id>
<content type='text'>
Revert this June 17 patch: it broke persistence of timers across execve().

Cc: Roland McGrath &lt;roland@redhat.com&gt;
Cc: george anzinger &lt;george@mvista.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] Remove suspend() calls from shutdown path</title>
<updated>2005-08-04T15:20:47Z</updated>
<author>
<name>Benjamin Herrenschmidt</name>
<email>benh@kernel.crashing.org</email>
</author>
<published>2005-08-04T09:36:26Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c36f19e02a96488f550fdb678c92500afca3109b'/>
<id>urn:sha1:c36f19e02a96488f550fdb678c92500afca3109b</id>
<content type='text'>
This removes the calls to device_suspend() from the shutdown path that
were added sometime during 2.6.13-rc*.  They aren't working properly on
a number of configs (I got reports from both ppc powerbook users and x86
users) causing the system to not shutdown anymore.

I think it isn't the right approach at the moment anyway.  We have
already a shutdown() callback for the drivers that actually care about
shutdown and the suspend() code isn't yet in a good enough shape to be
so much generalized.  Also, the semantics of suspend and shutdown are
slightly different on a number of setups and the way this was patched in
provides little way for drivers to cleanly differenciate.  It should
have been at least a different message.

For 2.6.13, I think we should revert to 2.6.12 behaviour and have a
working suspend back.

Signed-off-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] Module per-cpu alignment cannot always be met</title>
<updated>2005-08-02T04:38:01Z</updated>
<author>
<name>Rusty Russell</name>
<email>rusty@rustcorp.com.au</email>
</author>
<published>2005-08-02T04:11:47Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=842bbaaa7394820c8f1fe0629cd15478653caf86'/>
<id>urn:sha1:842bbaaa7394820c8f1fe0629cd15478653caf86</id>
<content type='text'>
The module code assumes noone will ever ask for a per-cpu area more than
SMP_CACHE_BYTES aligned.  However, as these cases show, gcc asks sometimes
asks for 32-byte alignment for the per-cpu section on a module, and if
CONFIG_X86_L1_CACHE_SHIFT is 4, we hit that BUG_ON().  This is obviously an
unusual combination, as there have been few reports, but better to warn
than die.

See:
	http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0768.html

And more recently:
	http://bugs.gentoo.org/show_bug.cgi?id=97006

Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
<entry>
<title>[PATCH] remove sys_set_zone_reclaim()</title>
<updated>2005-08-01T17:03:56Z</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2005-08-01T11:39:13Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=6cb54819d7b1867053e2dfd8c0ca3a8dc65a7eff'/>
<id>urn:sha1:6cb54819d7b1867053e2dfd8c0ca3a8dc65a7eff</id>
<content type='text'>
This removes sys_set_zone_reclaim() for now.  While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity.  I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.

Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]

Secondly, it was added based on a single datapoint from Martin:

 http://marc.theaimsgroup.com/?l=linux-mm&amp;m=111763597218177&amp;w=2

where Martin characterizes the numbers the following way:

 ' Run-to-run variability for "make -j" is huge, so these numbers aren't
   terribly useful except to see that with reclaim the benchmark still
   finishes in a reasonable amount of time. '

in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?

I'd really suggest to first walk the walk and see what's needed to get
stable &amp; predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!

The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.

The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes.  We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
</feed>
