Age | Commit message (Collapse) | Author |
|
commit c79c49826270b8b0061b2fca840fc3f013c8a78a upstream.
The git commit 8eaffa67b43e99ae581622c5133e20b0f48bcef1
(xen/pat: Disable PAT support for now) explains in details why
we want to disable PAT for right now. However that
change was not enough and we should have also disabled
the pat_enabled value. Otherwise we end up with:
mmap-example:3481 map pfn expected mapping type write-back for
[mem 0x00010000-0x00010fff], got uncached-minus
------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0()
mem 0x00010000-0x00010fff], got uncached-minus
------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774
untrack_pfn+0xb8/0xd0()
...
Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu
Call Trace:
[<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0
[<ffffffff81156c1c>] unmap_single_vma+0xac/0x100
[<ffffffff81157459>] unmap_vmas+0x49/0x90
[<ffffffff8115f808>] exit_mmap+0x98/0x170
[<ffffffff810559a4>] mmput+0x64/0x100
[<ffffffff810560f5>] dup_mm+0x445/0x660
[<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510
[<ffffffff81057931>] do_fork+0x91/0x350
[<ffffffff81057c76>] sys_clone+0x16/0x20
[<ffffffff816ccbf9>] stub_clone+0x69/0x90
[<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f
---[ end trace 4918cdd0a4c9fea4 ]---
(a similar message shows up if you end up launching 'mcelog')
The call chain is (as analyzed by Liu, Jinsong):
do_fork
--> copy_process
--> dup_mm
--> dup_mmap
--> copy_page_range
--> track_pfn_copy
--> reserve_pfn_range
--> line 624: flags != want_flags
It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR
(_PAGE_CACHE_UC_MINUS).
Stefan Bader dug in this deep and found out that:
"That makes it clearer as this will do
reserve_memtype(...)
--> pat_x_mtrr_type
--> mtrr_type_lookup
--> __mtrr_type_lookup
And that can return -1/0xff in case of MTRR not being enabled/initialized. Which
is not the case (given there are no messages for it in dmesg). This is not equal
to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS.
It looks like the problem starts early in reserve_memtype:
if (!pat_enabled) {
/* This is identical to page table setting without PAT */
if (new_type) {
if (req_type == _PAGE_CACHE_WC)
*new_type = _PAGE_CACHE_UC_MINUS;
else
*new_type = req_type & _PAGE_CACHE_MASK;
}
return 0;
}
This would be what we want, that is clearing the PWT and PCD flags from the
supported flags - if pat_enabled is disabled."
This patch does that - disabling PAT.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 76eaca031f0af2bb303e405986f637811956a422 upstream.
There is a loophole between Xen's current implementation of
pv-spinlocks and the scheduler. This was triggerable through
a testcase until v3.6 changed the TLB flushing code. The
problem potentially is still there just not observable in the
same way.
What could happen was (is):
1. CPU n tries to schedule task x away and goes into a slow
wait for the runq lock of CPU n-# (must be one with a lower
number).
2. CPU n-#, while processing softirqs, tries to balance domains
and goes into a slow wait for its own runq lock (for updating
some records). Since this is a spin_lock_irqsave in softirq
context, interrupts will be re-enabled for the duration of
the poll_irq hypercall used by Xen.
3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives
an interrupt (e.g. endio) and when processing the interrupt,
tries to wake up task x. But that is in schedule and still
on_cpu, so try_to_wake_up goes into a tight loop.
4. The runq lock of CPU n-# gets unlocked, but the message only
gets sent to the first waiter, which is CPU n-# and that is
busily stuck.
5. CPU n-# never returns from the nested interruption to take and
release the lock because the scheduler uses a busy wait.
And CPU n never finishes the task migration because the unlock
notification only went to CPU n-#.
To avoid this and since the unlocking code has no real sense of
which waiter is best suited to grab the lock, just send the IPI
to all of them. This causes the waiters to return from the hyper-
call (those not interrupted at least) and do active spinlocking.
BugLink: http://bugs.launchpad.net/bugs/1011792
Acked-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13d2b4d11d69a92574a55bfd985cfb0ca77aebdc upstream.
This fixes CVE-2013-0228 / XSA-42
Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user
in 32bit PV guest can use to crash the > guest with the panic like this:
-------------
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/vbd-51712/block/xvda/dev
Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4
mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last
unloaded: scsi_wait_scan]
Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1
EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0
EIP is at xen_iret+0x12/0x2b
EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010
ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0
DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069
Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000)
Stack:
00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000
Call Trace:
Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00
8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40
10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02
EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0
general protection fault: 0000 [#2]
---[ end trace ab0d29a492dcd330 ]---
Kernel panic - not syncing: Fatal exception
Pid: 1250, comm: r Tainted: G D ---------------
2.6.32-356.el6.i686 #1
Call Trace:
[<c08476df>] ? panic+0x6e/0x122
[<c084b63c>] ? oops_end+0xbc/0xd0
[<c084b260>] ? do_general_protection+0x0/0x210
[<c084a9b7>] ? error_code+0x73/
-------------
Petr says: "
I've analysed the bug and I think that xen_iret() cannot cope with
mangled DS, in this case zeroed out (null selector/descriptor) by either
xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT
entry was invalidated by the reproducer. "
Jan took a look at the preliminary patch and came up a fix that solves
this problem:
"This code gets called after all registers other than those handled by
IRET got already restored, hence a null selector in %ds or a non-null
one that got loaded from a code or read-only data descriptor would
cause a kernel mode fault (with the potential of crashing the kernel
as a whole, if panic_on_oops is set)."
The way to fix this is to realize that the we can only relay on the
registers that IRET restores. The two that are guaranteed are the
%cs and %ss as they are always fixed GDT selectors. Also they are
inaccessible from user mode - so they cannot be altered. This is
the approach taken in this patch.
Another alternative option suggested by Jan would be to relay on
the subtle realization that using the %ebp or %esp relative references uses
the %ss segment. In which case we could switch from using %eax to %ebp and
would not need the %ss over-rides. That would also require one extra
instruction to compensate for the one place where the register is used
as scaled index. However Andrew pointed out that is too subtle and if
further work was to be done in this code-path it could escape folks attention
and lead to accidents.
Reviewed-by: Petr Matousek <pmatouse@redhat.com>
Reported-by: Petr Matousek <pmatouse@redhat.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 95a7d76897c1e7243d4137037c66d15cbf2cce76 upstream.
As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
hypervisor to do a TLB flush on all active vCPUs. If instead
we were using the generic one (which ends up being xen_flush_tlb)
we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
before we make that hypercall the kernel will IPI all of the
vCPUs (even those that were asleep from the hypervisor
perspective). The end result is that we needlessly wake them
up and do a TLB flush when we can just let the hypervisor
do it correctly.
This patch gives around 50% speed improvement when migrating
idle guest's from one host to another.
Oracle-bug: 14630170
Tested-by: Jingjie Jiang <jingjie.jiang@oracle.com>
Suggested-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cd0608e71e9757f4dae35bcfb4e88f4d1a03a8ab upstream.
The hypervisor will trap it. However without this patch,
we would crash as the .read_tscp is set to NULL. This patch
fixes it and sets it to the native_read_tscp call.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a7bbda5b1ab0e02622761305a32dc38735b90b2 upstream.
We actually do not do anything about it. Just return a default
value of zero and if the kernel tries to write anything but 0
we BUG_ON.
This fixes the case when an user tries to suspend the machine
and it blows up in save_processor_state b/c 'read_cr8' is set
to NULL and we get:
kernel BUG at /home/konrad/ssd/linux/arch/x86/include/asm/paravirt.h:100!
invalid opcode: 0000 [#1] SMP
Pid: 2687, comm: init.late Tainted: G O 3.6.0upstream-00002-gac264ac-dirty #4 Bochs Bochs
RIP: e030:[<ffffffff814d5f42>] [<ffffffff814d5f42>] save_processor_state+0x212/0x270
.. snip..
Call Trace:
[<ffffffff810733bf>] do_suspend_lowlevel+0xf/0xac
[<ffffffff8107330c>] ? x86_acpi_suspend_lowlevel+0x10c/0x150
[<ffffffff81342ee2>] acpi_suspend_enter+0x57/0xd5
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8d54db795dfb1049d45dc34f0dddbc5347ec5642 upstream.
The hypervisor is in charge of allocating the proper "NUMA" memory
and dealing with the CPU scheduler to keep them bound to the proper
NUMA node. The PV guests (and PVHVM) have no inkling of where they
run and do not need to know that right now. In the future we will
need to inject NUMA configuration data (if a guest spans two or more
NUMA nodes) so that the kernel can make the right choices. But those
patches are not yet present.
In the meantime, disable the NUMA capability in the PV guest, which
also fixes a bootup issue. Andre says:
"we see Dom0 crashes due to the kernel detecting the NUMA topology not
by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA).
This will detect the actual NUMA config of the physical machine, but
will crash about the mismatch with Dom0's virtual memory. Variation of
the theme: Dom0 sees what it's not supposed to see.
This happens with the said config option enabled and on a machine where
this scanning is still enabled (K8 and Fam10h, not Bulldozer class)
We have this dump then:
NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
Scanning NUMA topology in Northbridge 24
Number of physical nodes 4
Node 0 MemBase 0000000000000000 Limit 0000000040000000
Node 1 MemBase 0000000040000000 Limit 0000000138000000
Node 2 MemBase 0000000138000000 Limit 00000001f8000000
Node 3 MemBase 00000001f8000000 Limit 0000000238000000
Initmem setup node 0 0000000000000000-0000000040000000
NODE_DATA [000000003ffd9000 - 000000003fffffff]
Initmem setup node 1 0000000040000000-0000000138000000
NODE_DATA [0000000137fd9000 - 0000000137ffffff]
Initmem setup node 2 0000000138000000-00000001f8000000
NODE_DATA [00000001f095e000 - 00000001f0984fff]
Initmem setup node 3 00000001f8000000-0000000238000000
Cannot find 159744 bytes in node 3
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar
RIP: e030:[<ffffffff81d220e6>] [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
.. snip..
[<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178
[<ffffffff81d23348>] sparse_init+0xe4/0x25a
[<ffffffff81d16840>] paging_init+0x13/0x22
[<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b
[<ffffffff81683954>] ? printk+0x3c/0x3e
[<ffffffff81d01a38>] start_kernel+0xe5/0x468
[<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1
[<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36
[<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c
"
so we just disable NUMA scanning by setting numa_off=1.
Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2fc136eecd0c647a6b13fcd00d0c41a1a28f35a5 upstream.
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.
Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c96aae1f7f393387d160211f60398d58463a7e65 upstream.
When we are finished with return PFNs to the hypervisor, then
populate it back, and also mark the E820 MMIO and E820 gaps
as IDENTITY_FRAMEs, we then call P2M to set areas that can
be used for ballooning. We were off by one, and ended up
over-writting a P2M entry that most likely was an IDENTITY_FRAME.
For example:
1-1 mapping on 40000->40200
1-1 mapping on bc558->bc5ac
1-1 mapping on bc5b4->bc8c5
1-1 mapping on bc8c6->bcb7c
1-1 mapping on bcd00->100000
Released 614 pages of unused memory
Set 277889 page(s) to 1-1 mapping
Populating 40200-40466 pfn range: 614 pages added
=> here we set from 40466 up to bc559 P2M tree to be
INVALID_P2M_ENTRY. We should have done it up to bc558.
The end result is that if anybody is trying to construct
a PTE for PFN bc558 they end up with ~PAGE_PRESENT.
Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b9e0d95c041ca2d7ad297ee37c2e9cfab67a188f upstream.
When the frontend and the backend reside on the same domain, even if we
add pages to the m2p_override, these pages will never be returned by
mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will
always fail, so the pfn of the frontend will be returned instead
(resulting in a deadlock because the frontend pages are already locked).
INFO: task qemu-system-i38:1085 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-system-i38 D ffff8800cfc137c0 0 1085 1 0x00000000
ffff8800c47ed898 0000000000000282 ffff8800be4596b0 00000000000137c0
ffff8800c47edfd8 ffff8800c47ec010 00000000000137c0 00000000000137c0
ffff8800c47edfd8 00000000000137c0 ffffffff82213020 ffff8800be4596b0
Call Trace:
[<ffffffff81101ee0>] ? __lock_page+0x70/0x70
[<ffffffff81a0fdd9>] schedule+0x29/0x70
[<ffffffff81a0fe80>] io_schedule+0x60/0x80
[<ffffffff81101eee>] sleep_on_page+0xe/0x20
[<ffffffff81a0e1ca>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81101ed7>] __lock_page+0x67/0x70
[<ffffffff8106f750>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811867e6>] ? bio_add_page+0x36/0x40
[<ffffffff8110b692>] set_page_dirty_lock+0x52/0x60
[<ffffffff81186021>] bio_set_pages_dirty+0x51/0x70
[<ffffffff8118c6b4>] do_blockdev_direct_IO+0xb24/0xeb0
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff8118ca95>] __blockdev_direct_IO+0x55/0x60
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff811e91c8>] ext3_direct_IO+0xf8/0x390
[<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
[<ffffffff81004b60>] ? xen_mc_flush+0xb0/0x1b0
[<ffffffff81104027>] generic_file_aio_read+0x737/0x780
[<ffffffff813bedeb>] ? gnttab_map_refs+0x15b/0x1e0
[<ffffffff811038f0>] ? find_get_pages+0x150/0x150
[<ffffffff8119736c>] aio_rw_vect_retry+0x7c/0x1d0
[<ffffffff811972f0>] ? lookup_ioctx+0x90/0x90
[<ffffffff81198856>] aio_run_iocb+0x66/0x1a0
[<ffffffff811998b8>] do_io_submit+0x708/0xb90
[<ffffffff81199d50>] sys_io_submit+0x10/0x20
[<ffffffff81a18d69>] system_call_fastpath+0x16/0x1b
The explanation is in the comment within the code:
We need to do this because the pages shared by the frontend
(xen-blkfront) can be already locked (lock_page, called by
do_read_cache_page); when the userspace backend tries to use them
with direct_IO, mfn_to_pfn returns the pfn of the frontend, so
do_blockdev_direct_IO is going to try to lock the same pages
again resulting in a deadlock.
A simplified call graph looks like this:
pygrub QEMU
-----------------------------------------------
do_read_cache_page io_submit
| |
lock_page ext3_direct_IO
|
bio_add_page
|
lock_page
Internally the xen-blkback uses m2p_add_override to swizzle (temporarily)
a 'struct page' to have a different MFN (so that it can point to another
guest). It also can easily find out whether another pfn corresponding
to the mfn exists in the m2p, and can set the FOREIGN bit
in the p2m, making sure that mfn_to_pfn returns the pfn of the backend.
This allows the backend to perform direct_IO on these pages, but as a
side effect prevents the frontend from using get_user_pages_fast on
them while they are being shared with the backend.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5e626254206a709c6e937f3dda69bf26c7344f6f upstream.
Xen PV kernels allow access to the APERF/MPERF registers to read the
effective frequency. Access to the MSRs is however redirected to the
currently scheduled physical CPU, making consecutive read and
compares unreliable. In addition each rdmsr traps into the hypervisor.
So to avoid bogus readouts and expensive traps, disable the kernel
internal feature flag for APERF/MPERF if running under Xen.
This will
a) remove the aperfmperf flag from /proc/cpuinfo
b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
use the feature to improve scheduling (by default disabled)
c) not mislead the cpufreq driver to use the MSRs
This does not cover userland programs which access the MSRs via the
device file interface, but this will be addressed separately.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1ab46fd319bcf1fcd9fb6311727d532b580e4eba upstream.
Stub out MSR methods that aren't actually needed. This fixes a crash
as Xen Dom0 on AMD Trinity systems. A bigger patch should be added to
remove the paravirt machinery completely for the methods which
apparently have no users!
Reported-by: Andre Przywara <andre.przywara@amd.com>
Link: http://lkml.kernel.org/r/20120530222356.GA28417@andromeda.dapyr.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The accessing PCI configuration space with the PCI BIOS32 service does
not work in PV guests.
On systems without MMCONFIG or where the BIOS hasn't marked the
MMCONFIG region as reserved in the e820 map, the BIOS service is
probed (even though direct access is preferred) and this hangs.
CC: stable@kernel.org
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Fixed compile error when CONFIG_PCI is not set]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
If I try to do "cat /sys/kernel/debug/kernel_page_tables"
I end up with:
BUG: unable to handle kernel paging request at ffffc7fffffff000
IP: [<ffffffff8106aa51>] ptdump_show+0x221/0x480
PGD 0
Oops: 0000 [#1] SMP
CPU 0
.. snip..
RAX: 0000000000000000 RBX: ffffc00000000fff RCX: 0000000000000000
RDX: 0000800000000000 RSI: 0000000000000000 RDI: ffffc7fffffff000
which is due to the fact we are trying to access a PFN that is not
accessible to us. The reason (at least in this case) was that
PGD[256] is set to __HYPERVISOR_VIRT_START which was setup (by the
hypervisor) to point to a read-only linear map of the MFN->PFN array.
During our parsing we would get the MFN (a valid one), try to look
it up in the MFN->PFN tree and find it invalid and return ~0 as PFN.
Then pte_mfn_to_pfn would happilly feed that in, attach the flags
and return it back to the caller. 'ptdump_show' bitshifts it and
gets and invalid value that it tries to dereference.
Instead of doing all of that, we detect the ~0 case and just
return !_PAGE_PRESENT.
This bug has been in existence .. at least until 2.6.37 (yikes!)
CC: stable@kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
On x86_64 on AMD machines where the first APIC_ID is not zero, we get:
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x10] enabled)
BIOS bug: APIC version is 0 for CPU 1/0x10, fixing up to 0x10
BIOS bug: APIC version mismatch, boot CPU: 0, CPU 1: version 10
which means that when the ACPI processor driver loads and
tries to parse the _Pxx states it fails to do as, as it
ends up calling acpi_get_cpuid which does this:
for_each_possible_cpu(i) {
if (cpu_physical_id(i) == apic_id)
return i;
}
And the bootup CPU, has not been found so it fails and returns -1
for the first CPU - which then subsequently in the loop that
"acpi_processor_get_info" does results in returning an error, which
means that "acpi_processor_add" failing and per_cpu(processor)
is never set (and is NULL).
That means that when xen-acpi-processor tries to load (much much
later on) and parse the P-states it gets -ENODEV from
acpi_processor_register_performance() (which tries to read
the per_cpu(processor)) and fails to parse the data.
Reported-by-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
Suggested-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
[v2: Bit-shift APIC ID by 24 bits]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
In xen_restore_fl_direct(), xen_force_evtchn_callback() was being
called even if no events were pending. This resulted in (depending on
workload) about a 100 times as many xen_version hypercalls as
necessary.
Fix this by correcting the sense of the conditional jump.
This seems to give a significant performance benefit for some
workloads.
There is some subtle tricksy "..since the check here is trying to
check both pending and masked in a single cmpw, but I think this is
correct. It will call check_events now only when the combined
mask+pending word is 0x0001 (aka unmasked, pending)." (Ian)
CC: stable@kernel.org
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
When we boot on a machine that can hotplug CPUs and we
are using 'dom0_max_vcpus=X' on the Xen hypervisor line
to clip the amount of CPUs available to the initial domain,
we get this:
(XEN) Command line: com1=115200,8n1 dom0_mem=8G noreboot dom0_max_vcpus=8 sync_console mce_verbosity=verbose console=com1,vga loglvl=all guest_loglvl=all
.. snip..
DMI: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.99.99.x032.072520111118 07/25/2011
.. snip.
SMP: Allowing 64 CPUs, 32 hotplug CPUs
installing Xen timer for CPU 7
cpu 7 spinlock event irq 361
NMI watchdog: disabled (cpu7): hardware events not enabled
Brought up 8 CPUs
.. snip..
[acpi processor finds the CPUs are not initialized and starts calling
arch_register_cpu, which creates /sys/devices/system/cpu/cpu8/online]
CPU 8 got hotplugged
CPU 9 got hotplugged
CPU 10 got hotplugged
.. snip..
initcall 1_acpi_battery_init_async+0x0/0x1b returned 0 after 406 usecs
calling erst_init+0x0/0x2bb @ 1
[and the scheduler sticks newly started tasks on the new CPUs, but
said CPUs cannot be initialized b/c the hypervisor has limited the
amount of vCPUS to 8 - as per the dom0_max_vcpus=8 flag.
The spinlock tries to kick the other CPU, but the structure for that
is not initialized and we crash.]
BUG: unable to handle kernel paging request at fffffffffffffed8
IP: [<ffffffff81035289>] xen_spin_lock+0x29/0x60
PGD 180d067 PUD 180e067 PMD 0
Oops: 0002 [#1] SMP
CPU 7
Modules linked in:
Pid: 1, comm: swapper/0 Not tainted 3.4.0-rc2upstream-00001-gf5154e8 #1 Intel Corporation S2600CP/S2600CP
RIP: e030:[<ffffffff81035289>] [<ffffffff81035289>] xen_spin_lock+0x29/0x60
RSP: e02b:ffff8801fb9b3a70 EFLAGS: 00010282
With this patch, we cap the amount of vCPUS that the initial domain
can run, to exactly what dom0_max_vcpus=X has specified.
In the future, if there is a hypercall that will allow a running
domain to expand past its initial set of vCPUS, this patch should
be re-evaluated.
CC: stable@kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
There are exactly four users of __monitor and __mwait:
- cstate.c (which allows acpi_processor_ffh_cstate_enter to be called
when the cpuidle API drivers are used. However patch
"cpuidle: replace xen access to x86 pm_idle and default_idle"
provides a mechanism to disable the cpuidle and use safe_halt.
- smpboot (which allows mwait_play_dead to be called). However
safe_halt is always used so we skip that.
- intel_idle (same deal as above).
- acpi_pad.c. This the one that we do not want to run as we
will hit the below crash.
Why do we want to expose MWAIT_LEAF in the first place?
We want it for the xen-acpi-processor driver - which uploads
C-states to the hypervisor. If MWAIT_LEAF is set, the cstate.c
sets the proper address in the C-states so that the hypervisor
can benefit from using the MWAIT functionality. And that is
the sole reason for using it.
Without this patch, if a module performs mwait or monitor we
get this:
invalid opcode: 0000 [#1] SMP
CPU 2
.. snip..
Pid: 5036, comm: insmod Tainted: G O 3.4.0-rc2upstream-dirty #2 Intel Corporation S2600CP/S2600CP
RIP: e030:[<ffffffffa000a017>] [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
RSP: e02b:ffff8801c298bf18 EFLAGS: 00010282
RAX: ffff8801c298a010 RBX: ffffffffa03b2000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8801c29800d8 RDI: ffff8801ff097200
RBP: ffff8801c298bf18 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: ffffffffa000a000 R14: 0000005148db7294 R15: 0000000000000003
FS: 00007fbb364f2700(0000) GS:ffff8801ff08c000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000179f038 CR3: 00000001c9469000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 5036, threadinfo ffff8801c298a000, task ffff8801c29cd7e0)
Stack:
ffff8801c298bf48 ffffffff81002124 ffffffffa03b2000 00000000000081fd
000000000178f010 000000000178f030 ffff8801c298bf78 ffffffff810c41e6
00007fff3fb30db9 00007fff3fb30db9 00000000000081fd 0000000000010000
Call Trace:
[<ffffffff81002124>] do_one_initcall+0x124/0x170
[<ffffffff810c41e6>] sys_init_module+0xc6/0x220
[<ffffffff815b15b9>] system_call_fastpath+0x16/0x1b
Code: <0f> 01 c8 31 c0 0f 01 c9 c9 c3 00 00 00 00 00 00 00 00 00 00 00 00
RIP [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
RSP <ffff8801c298bf18>
---[ end trace 16582fc8a3d1e29a ]---
Kernel panic - not syncing: Fatal exception
With this module (which is what acpi_pad.c would hit):
MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("mwait_check_and_back");
MODULE_LICENSE("GPL");
MODULE_VERSION();
static int __init mwait_check_init(void)
{
__monitor((void *)¤t_thread_info()->flags, 0, 0);
__mwait(0, 0);
return 0;
}
static void __exit mwait_check_exit(void)
{
}
module_init(mwait_check_init);
module_exit(mwait_check_exit);
Reported-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
This reverts commit b960d6c43a63ebd2d8518b328da3816b833ee8cc.
If we have another thread (very likely) touched the list, we
end up hitting a problem "that the next element is wrong because
we should be able to cope with that. The problem is that the
next->next pointer would be set LIST_POISON1. " (Stefano's
comment on the patch).
Reverting for now.
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
stable/for-linus-3.4
* commit 'c104f1fa1ecf4ee0fc06e31b1f77630b2551be81': (14566 commits)
cpufreq: OMAP: fix build errors: depends on ARCH_OMAP2PLUS
sparc64: Eliminate obsolete __handle_softirq() function
sparc64: Fix bootup crash on sun4v.
kconfig: delete last traces of __enabled_ from autoconf.h
Revert "kconfig: fix __enabled_ macros definition for invisible and un-selected symbols"
kconfig: fix IS_ENABLED to not require all options to be defined
irq_domain: fix type mismatch in debugfs output format
staging: android: fix mem leaks in __persistent_ram_init()
staging: vt6656: Don't leak memory in drivers/staging/vt6656/ioctl.c::private_ioctl()
staging: iio: hmc5843: Fix crash in probe function.
panic: fix stack dump print on direct call to panic()
drivers/rtc/rtc-pl031.c: enable clock on all ST variants
Revert "mm: vmscan: fix misused nr_reclaimed in shrink_mem_cgroup_zone()"
hugetlb: fix race condition in hugetlb_fault()
drivers/rtc/rtc-twl.c: use static register while reading time
drivers/rtc/rtc-s3c.c: add placeholder for driver private data
drivers/rtc/rtc-s3c.c: fix compilation error
MAINTAINERS: add PCDP console maintainer
memcg: do not open code accesses to res_counter members
drivers/rtc/rtc-efi.c: fix section mismatch warning
...
|
|
Use list_for_each_entry_safe and remove the spin_lock acquisition in
m2p_find_override.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull xen fixes from Konrad Rzeszutek Wilk:
"Two fixes for regressions:
* one is a workaround that will be removed in v3.5 with proper fix in
the tip/x86 tree,
* the other is to fix drivers to load on PV (a previous patch made
them only load in PVonHVM mode).
The rest are just minor fixes in the various drivers and some cleanup
in the core code."
* tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/pcifront: avoid pci_frontend_enable_msix() falsely returning success
xen/pciback: fix XEN_PCI_OP_enable_msix result
xen/smp: Remove unnecessary call to smp_processor_id()
xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
xen: only check xen_platform_pci_unplug if hvm
|
|
There is an extra and unnecessary call to smp_processor_id()
in cpu_bringup(). Remove it.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
io-apic entries'
The above mentioned patch checks the IOAPIC and if it contains
-1, then it unmaps said IOAPIC. But under Xen we get this:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
PGD 0
Oops: 0002 [#1] SMP
CPU 0
Modules linked in:
Pid: 1, comm: swapper/0 Not tainted 3.2.10-3.fc16.x86_64 #1 Dell Inc. Inspiron
1525 /0U990C
RIP: e030:[<ffffffff8134e51f>] [<ffffffff8134e51f>] xen_irq_init+0x1f/0xb0
RSP: e02b: ffff8800d42cbb70 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000ffffffef RCX: 0000000000000001
RDX: 0000000000000040 RSI: 00000000ffffffef RDI: 0000000000000001
RBP: ffff8800d42cbb80 R08: ffff8800d6400000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffef
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000010
FS: 0000000000000000(0000) GS:ffff8800df5fe000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0:000000008005003b
CR2: 0000000000000040 CR3: 0000000001a05000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/0 (pid: 1, threadinfo ffff8800d42ca000, task ffff8800d42d0000)
Stack:
00000000ffffffef 0000000000000010 ffff8800d42cbbe0 ffffffff8134f157
ffffffff8100a9b2 ffffffff8182ffd1 00000000000000a0 00000000829e7384
0000000000000002 0000000000000010 00000000ffffffff 0000000000000000
Call Trace:
[<ffffffff8134f157>] xen_bind_pirq_gsi_to_irq+0x87/0x230
[<ffffffff8100a9b2>] ? check_events+0x12+0x20
[<ffffffff814bab42>] xen_register_pirq+0x82/0xe0
[<ffffffff814bac1a>] xen_register_gsi.part.2+0x4a/0xd0
[<ffffffff814bacc0>] acpi_register_gsi_xen+0x20/0x30
[<ffffffff8103036f>] acpi_register_gsi+0xf/0x20
[<ffffffff8131abdb>] acpi_pci_irq_enable+0x12e/0x202
[<ffffffff814bc849>] pcibios_enable_device+0x39/0x40
[<ffffffff812dc7ab>] do_pci_enable_device+0x4b/0x70
[<ffffffff812dc878>] __pci_enable_device_flags+0xa8/0xf0
[<ffffffff812dc8d3>] pci_enable_device+0x13/0x20
The reason we are dying is b/c the call acpi_get_override_irq() is used,
which returns the polarity and trigger for the IRQs. That function calls
mp_find_ioapics to get the 'struct ioapic' structure - which along with the
mp_irq[x] is used to figure out the default values and the polarity/trigger
overrides. Since the mp_find_ioapics now returns -1 [b/c the IOAPIC is filled
with 0xffffffff], the acpi_get_override_irq() stops trying to lookup in the
mp_irq[x] the proper INT_SRV_OVR and we can't install the SCI interrupt.
The proper fix for this is going in v3.5 and adds an x86_io_apic_ops
struct so that platforms can override it. But for v3.4 lets carry this
work-around. This patch does that by providing a slightly different variant
of the fake IOAPIC entries.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull DMA mapping branch from Marek Szyprowski:
"Short summary for the whole series:
A few limitations have been identified in the current dma-mapping
design and its implementations for various architectures. There exist
more than one function for allocating and freeing the buffers:
currently these 3 are used dma_{alloc, free}_coherent,
dma_{alloc,free}_writecombine, dma_{alloc,free}_noncoherent.
For most of the systems these calls are almost equivalent and can be
interchanged. For others, especially the truly non-coherent ones
(like ARM), the difference can be easily noticed in overall driver
performance. Sadly not all architectures provide implementations for
all of them, so the drivers might need to be adapted and cannot be
easily shared between different architectures. The provided patches
unify all these functions and hide the differences under the already
existing dma attributes concept. The thread with more references is
available here:
http://www.spinics.net/lists/linux-sh/msg09777.html
These patches are also a prerequisite for unifying DMA-mapping
implementation on ARM architecture with the common one provided by
dma_map_ops structure and extending it with IOMMU support. More
information is available in the following thread:
http://thread.gmane.org/gmane.linux.kernel.cross-arch/12819
More works on dma-mapping framework are planned, especially in the
area of buffer sharing and managing the shared mappings (together with
the recently introduced dma_buf interface: commit d15bd7ee445d
"dma-buf: Introduce dma buffer sharing mechanism").
The patches in the current set introduce a new alloc/free methods
(with support for memory attributes) in dma_map_ops structure, which
will later replace dma_alloc_coherent and dma_alloc_writecombine
functions."
People finally started piping up with support for merging this, so I'm
merging it as the last of the pending stuff from the merge window.
Looks like pohmelfs is going to wait for 3.5 and more external support
for merging.
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
common: DMA-mapping: add NON-CONSISTENT attribute
common: DMA-mapping: add WRITE_COMBINE attribute
common: dma-mapping: introduce mmap method
common: dma-mapping: remove old alloc_coherent and free_coherent methods
Hexagon: adapt for dma_map_ops changes
Unicore32: adapt for dma_map_ops changes
Microblaze: adapt for dma_map_ops changes
SH: adapt for dma_map_ops changes
Alpha: adapt for dma_map_ops changes
SPARC: adapt for dma_map_ops changes
PowerPC: adapt for dma_map_ops changes
MIPS: adapt for dma_map_ops changes
X86 & IA64: adapt for dma_map_ops changes
common: dma-mapping: introduce generic alloc() and free() methods
|
|
This has been obsolescent for a while, fix documentation and
misc comments.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Adapt core x86 and IA64 architecture code for dma_map_ops changes: replace
alloc/free_coherent with generic alloc/free methods.
Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
[removed swiotlb related changes and replaced it with wrappers,
merged with IA64 patch to avoid inter-patch dependences in intel-iommu code]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tony Luck <tony.luck@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull more xen updates from Konrad Rzeszutek Wilk:
"One tiny feature that accidentally got lost in the initial git pull:
* Add fast-EOI acking of interrupts (clear a bit instead of
hypercall)
And bug-fixes:
* Fix CPU bring-up code missing a call to notify other subsystems.
* Fix reading /sys/hypervisor even if PVonHVM drivers are not loaded.
* In Xen ACPI processor driver: remove too verbose WARN messages, fix
up the Kconfig dependency to be a module by default, and add
dependency on CPU_FREQ.
* Disable CPU frequency drivers from loading when booting under Xen
(as we want the Xen ACPI processor to be used instead).
* Cleanups in tmem code."
* tag 'stable/for-linus-3.4-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/acpi: Fix Kconfig dependency on CPU_FREQ
xen: initialize platform-pci even if xen_emul_unplug=never
xen/smp: Fix bringup bug in AP code.
xen/acpi: Remove the WARN's as they just create noise.
xen/tmem: cleanup
xen: support pirq_eoi_map
xen/acpi-processor: Do not depend on CPU frequency scaling drivers.
xen/cpufreq: Disable the cpu frequency scaling drivers from loading.
provide disable_cpufreq() function to disable the API.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull xen updates from Konrad Rzeszutek Wilk:
"which has three neat features:
- PV multiconsole support, so that there can be hvc1, hvc2, etc; This
can be used in HVM and in PV mode.
- P-state and C-state power management driver that uploads said power
management data to the hypervisor. It also inhibits cpufreq
scaling drivers to load so that only the hypervisor can make power
management decisions - fixing a weird perf bug.
There is one thing in the Kconfig that you won't like: "default y
if (X86_ACPI_CPUFREQ = y || X86_POWERNOW_K8 = y)" (note, that it
all depends on CONFIG_XEN which depends on CONFIG_PARAVIRT which by
default is off). I've a fix to convert that boolean expression
into "default m" which I am going to post after the cpufreq git
pull - as the two patches to make this work depend on a fix in Dave
Jones's tree.
- Function Level Reset (FLR) support in the Xen PCI backend.
Fixes:
- Kconfig dependencies for Xen PV keyboard and video
- Compile warnings and constify fixes
- Change over to use percpu_xxx instead of this_cpu_xxx"
Fix up trivial conflicts in drivers/tty/hvc/hvc_xen.c due to changes to
a removed commit.
* tag 'stable/for-linus-3.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen kconfig: relax INPUT_XEN_KBDDEV_FRONTEND deps
xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.
xen: constify all instances of "struct attribute_group"
xen/xenbus: ignore console/0
hvc_xen: introduce HVC_XEN_FRONTEND
hvc_xen: implement multiconsole support
hvc_xen: support PV on HVM consoles
xenbus: don't free other end details too early
xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it.
xen/setup/pm/acpi: Remove the call to boot_option_idle_override.
xenbus: address compiler warnings
xen: use this_cpu_xxx replace percpu_xxx funcs
xen/pciback: Support pci_reset_function, aka FLR or D3 support.
pci: Introduce __pci_reset_function_locked to be used when holding device_lock.
xen: Utilize the restore_msi_irqs hook.
|
|
The CPU hotplug code has now a callback to help bring up the CPU.
Without the call we end up getting:
BUG: soft lockup - CPU#0 stuck for 29s! [migration/0:6]
Modules linked in:
CPU ] Pid: 6, comm: migration/0 Not tainted 3.3.0upstream-01180-ged378a5 #1 Dell Inc. PowerEdge T105 /0RR825
RIP: e030:[<ffffffff810d3b8b>] [<ffffffff810d3b8b>] stop_machine_cpu_stop+0x7b/0xf0
RSP: e02b:ffff8800ceaabdb0 EFLAGS: 00000293
.. snip..
Call Trace:
[<ffffffff810d3b10>] ? stop_one_cpu_nowait+0x50/0x50
[<ffffffff810d3841>] cpu_stopper_thread+0xf1/0x1c0
[<ffffffff815a9776>] ? __schedule+0x3c6/0x760
[<ffffffff815aa749>] ? _raw_spin_unlock_irqrestore+0x19/0x30
[<ffffffff810d3750>] ? res_counter_charge+0x150/0x150
[<ffffffff8108dc76>] kthread+0x96/0xa0
[<ffffffff815b27e4>] kernel_thread_helper+0x4/0x10
[<ffffffff815aacbc>] ? retint_restore_ar
Thix fixes it.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
By using the functionality provided by "[CPUFREQ]: provide
disable_cpuidle() function to disable the API."
Under the Xen hypervisor we do not want the initial domain to exercise
the cpufreq scaling drivers. This is b/c the Xen hypervisor is
in charge of doing this as well and we can end up with both the
Linux kernel and the hypervisor trying to change the P-states
leading to weird performance issues.
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Fix compile error spotted by Benjamin Schweikert <b.schweikert@googlemail.com>]
|
|
For the hypervisor to take advantage of the MWAIT support it needs
to extract from the ACPI _CST the register address. But the
hypervisor does not have the support to parse DSDT so it relies on
the initial domain (dom0) to parse the ACPI Power Management information
and push it up to the hypervisor. The pushing of the data is done
by the processor_harveset_xen module which parses the information that
the ACPI parser has graciously exposed in 'struct acpi_processor'.
For the ACPI parser to also expose the Cx states for MWAIT, we need
to expose the MWAIT capability (leaf 1). Furthermore we also need to
expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
function.
The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
operations, but it can't do it since it needs to be backwards compatible.
Instead we choose to use the native CPUID to figure out if the MWAIT
capability exists and use the XEN_SET_PDC query hypercall to figure out
if the hypervisor wants us to expose the MWAIT_LEAF capability or not.
Note: The XEN_SET_PDC query was implemented in c/s 23783:
"ACPI: add _PDC input override mechanism".
With this in place, instead of
C3 ACPI IOPORT 415
we get now
C3:ACPI FFH INTEL MWAIT 0x20
Note: The cpu_idle which would be calling the mwait variants for idling
never gets set b/c we set the default pm_idle to be the hypercall variant.
Acked-by: Jan Beulich <JBeulich@suse.com>
[v2: Fix missing header file include and #ifdef]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
We needed that call in the past to force the kernel to use
default_idle (which called safe_halt, which called xen_safe_halt).
But set_pm_idle_to_default() does now that, so there is no need
to use this boot option operand.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
[Pls also look at https://lkml.org/lkml/2012/2/10/228]
Using of PAT to change pages from WB to WC works quite nicely.
Changing it back to WB - not so much. The crux of the matter is
that the code that does this (__page_change_att_set_clr) has only
limited information so when it tries to the change it gets
the "raw" unfiltered information instead of the properly filtered one -
and the "raw" one tell it that PSE bit is on (while infact it
is not). As a result when the PTE is set to be WB from WC, we get
tons of:
:WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
:Hardware name: HP xw4400 Workstation
.. snip..
:Pid: 27, comm: kswapd0 Tainted: G W 3.2.2-1.fc16.x86_64 #1
:Call Trace:
: [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
: [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
: [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
: [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
: [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
: [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
: [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
: [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
: [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
: [<ffffffff8100a9b2>] ? check_events+0x12/0x20
: [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
: [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
: [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
: [<ffffffff8112ba04>] shrink_slab+0x154/0x310
: [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
: [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
: [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
: [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
: [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
: [<ffffffff8108fb6c>] kthread+0x8c/0xa0
for every page. The proper fix for this is has been posted
and is https://lkml.org/lkml/2012/2/10/228
"x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations."
along with a detailed descrip |