aboutsummaryrefslogtreecommitdiff
path: root/Documentation/x86
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/x86')
-rw-r--r--Documentation/x86/00-INDEX16
-rw-r--r--Documentation/x86/boot.txt139
-rw-r--r--Documentation/x86/early-microcode.txt42
-rw-r--r--Documentation/x86/earlyprintk.txt41
-rw-r--r--Documentation/x86/entry_64.txt95
-rw-r--r--Documentation/x86/exception-tables.txt292
-rw-r--r--Documentation/x86/i386/IO-APIC.txt2
-rw-r--r--Documentation/x86/x86_64/boot-options.txt138
-rw-r--r--Documentation/x86/x86_64/kernel-stacks6
-rw-r--r--Documentation/x86/x86_64/machinecheck8
-rw-r--r--Documentation/x86/x86_64/mm.txt13
-rw-r--r--Documentation/x86/zero-page.txt6
12 files changed, 703 insertions, 95 deletions
diff --git a/Documentation/x86/00-INDEX b/Documentation/x86/00-INDEX
index dbe3377754a..692264456f0 100644
--- a/Documentation/x86/00-INDEX
+++ b/Documentation/x86/00-INDEX
@@ -1,4 +1,20 @@
00-INDEX
- this file
+boot.txt
+ - List of boot protocol versions
+early-microcode.txt
+ - How to load microcode from an initrd-CPIO archive early to fix CPU issues.
+earlyprintk.txt
+ - Using earlyprintk with a USB2 debug port key.
+entry_64.txt
+ - Describe (some of the) kernel entry points for x86.
+exception-tables.txt
+ - why and how Linux kernel uses exception tables on x86
mtrr.txt
- how to use x86 Memory Type Range Registers to increase performance
+pat.txt
+ - Page Attribute Table intro and API
+usb-legacy-support.txt
+ - how to fix/avoid quirks when using emulated PS/2 mouse/keyboard.
+zero-page.txt
+ - layout of the first page of memory.
diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
index 8da3a795083..a75e3adaa39 100644
--- a/Documentation/x86/boot.txt
+++ b/Documentation/x86/boot.txt
@@ -54,6 +54,13 @@ Protocol 2.10: (Kernel 2.6.31) Added a protocol for relaxed alignment
beyond the kernel_alignment added, new init_size and
pref_address fields. Added extended boot loader IDs.
+Protocol 2.11: (Kernel 3.6) Added a field for offset of EFI handover
+ protocol entry point.
+
+Protocol 2.12: (Kernel 3.8) Added the xloadflags field and extension fields
+ to struct boot_params for loading bzImage and ramdisk
+ above 4G in 64bit.
+
**** MEMORY LAYOUT
The traditional memory map for the kernel loader, used for Image or
@@ -175,11 +182,11 @@ Offset Proto Name Meaning
0226/1 2.02+(3 ext_loader_ver Extended boot loader version
0227/1 2.02+(3 ext_loader_type Extended boot loader ID
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
-022C/4 2.03+ ramdisk_max Highest legal initrd address
+022C/4 2.03+ initrd_addr_max Highest legal initrd address
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
0235/1 2.10+ min_alignment Minimum alignment, as a power of two
-0236/2 N/A pad3 Unused
+0236/2 2.12+ xloadflags Boot protocol option flags
0238/4 2.06+ cmdline_size Maximum size of the kernel command line
023C/4 2.07+ hardware_subarch Hardware subarchitecture
0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data
@@ -189,6 +196,7 @@ Offset Proto Name Meaning
of struct setup_data
0258/8 2.10+ pref_address Preferred loading address
0260/4 2.10+ init_size Linear memory required during initialization
+0264/4 2.11+ handover_offset Offset of handover entry point
(1) For backwards compatibility, if the setup_sects field contains 0, the
real value is 4.
@@ -363,12 +371,13 @@ Protocol: 2.00+
ext_loader_type <- 0x05
ext_loader_ver <- 0x23
- Assigned boot loader ids:
+ Assigned boot loader ids (hexadecimal):
+
0 LILO (0x00 reserved for pre-2.00 bootloader)
1 Loadlin
2 bootsect-loader (0x20, all other values reserved)
3 Syslinux
- 4 Etherboot/gPXE
+ 4 Etherboot/gPXE/iPXE
5 ELILO
7 GRUB
8 U-Boot
@@ -376,8 +385,12 @@ Protocol: 2.00+
A Gujin
B Qemu
C Arcturus Networks uCbootloader
+ D kexec-tools
E Extended (see ext_loader_type)
F Special (0xFF = undefined)
+ 10 Reserved
+ 11 Minimal Linux Bootloader <http://sebastian-plotz.blogspot.de>
+ 12 OVMF UEFI virtualization stack
Please contact <hpa@zytor.com> if you need a bootloader ID
value assigned.
@@ -521,7 +534,7 @@ Protocol: 2.02+
zero, the kernel will assume that your boot loader does not support
the 2.02+ protocol.
-Field name: ramdisk_max
+Field name: initrd_addr_max
Type: read
Offset/size: 0x22c/4
Protocol: 2.03+
@@ -574,6 +587,30 @@ Protocol: 2.10+
misaligned kernel. Therefore, a loader should typically try each
power-of-two alignment from kernel_alignment down to this alignment.
+Field name: xloadflags
+Type: read
+Offset/size: 0x236/2
+Protocol: 2.12+
+
+ This field is a bitmask.
+
+ Bit 0 (read): XLF_KERNEL_64
+ - If 1, this kernel has the legacy 64-bit entry point at 0x200.
+
+ Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G
+ - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G.
+
+ Bit 2 (read): XLF_EFI_HANDOVER_32
+ - If 1, the kernel supports the 32-bit EFI handoff entry point
+ given at handover_offset.
+
+ Bit 3 (read): XLF_EFI_HANDOVER_64
+ - If 1, the kernel supports the 64-bit EFI handoff entry point
+ given at handover_offset + 0x200.
+
+ Bit 4 (read): XLF_EFI_KEXEC
+ - If 1, the kernel supports kexec EFI boot with EFI runtime support.
+
Field name: cmdline_size
Type: read
Offset/size: 0x238/4
@@ -599,6 +636,8 @@ Protocol: 2.07+
0x00000000 The default x86/PC environment
0x00000001 lguest
0x00000002 Xen
+ 0x00000003 Moorestown MID
+ 0x00000004 CE4100 TV Platform
Field name: hardware_subarch_data
Type: write (subarch-dependent)
@@ -620,10 +659,11 @@ Protocol: 2.08+
The payload may be compressed. The format of both the compressed and
uncompressed data should be determined using the standard magic
numbers. The currently supported compression formats are gzip
- (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A) and LZMA
- (magic number 5D 00). The uncompressed payload is currently always ELF
- (magic number 7F 45 4C 46).
-
+ (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
+ (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number
+ 02 21). The uncompressed payload is currently always ELF (magic
+ number 7F 45 4C 46).
+
Field name: payload_length
Type: read
Offset/size: 0x24c/4
@@ -672,7 +712,7 @@ Protocol: 2.10+
Field name: init_size
Type: read
-Offset/size: 0x25c/4
+Offset/size: 0x260/4
This field indicates the amount of linear contiguous memory starting
at the kernel runtime start address that the kernel needs before it
@@ -688,6 +728,16 @@ Offset/size: 0x25c/4
else
runtime_start = pref_address
+Field name: handover_offset
+Type: read
+Offset/size: 0x264/4
+
+ This field is the offset from the beginning of the kernel image to
+ the EFI handover protocol entry point. Boot loaders using the EFI
+ handover protocol to boot the kernel should jump to this offset.
+
+ See EFI HANDOVER PROTOCOL below for more details.
+
**** THE IMAGE CHECKSUM
@@ -994,7 +1044,7 @@ boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as that
described in zero-page.txt.
-After setupping the struct boot_params, the boot loader can load the
+After setting up the struct boot_params, the boot loader can load the
32/64-bit kernel in the same way as that of 16-bit boot protocol.
In 32-bit boot protocol, the kernel is started by jumping to the
@@ -1004,7 +1054,72 @@ In 32-bit boot protocol, the kernel is started by jumping to the
At entry, the CPU must be in 32-bit protected mode with paging
disabled; a GDT must be loaded with the descriptors for selectors
__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
-segment; __BOOS_CS must have execute/read permission, and __BOOT_DS
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
address of the struct boot_params; %ebp, %edi and %ebx must be zero.
+
+**** 64-bit BOOT PROTOCOL
+
+For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader
+and we need a 64-bit boot protocol.
+
+In 64-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+could be allocated anywhere (even above 4G) and initialized to all zero.
+Then, the setup header at offset 0x01f1 of kernel image on should be
+loaded into struct boot_params and examined. The end of setup header
+can be calculated as follows:
+
+ 0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as described
+in zero-page.txt.
+
+After setting up the struct boot_params, the boot loader can load
+64-bit kernel in the same way as that of 16-bit boot protocol, but
+kernel could be loaded above 4G.
+
+In 64-bit boot protocol, the kernel is started by jumping to the
+64-bit kernel entry point, which is the start address of loaded
+64-bit kernel plus 0x200.
+
+At entry, the CPU must be in 64-bit mode with paging enabled.
+The range with setup_header.init_size from start address of loaded
+kernel and zero page and command line buffer get ident mapping;
+a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base
+address of the struct boot_params.
+
+**** EFI HANDOVER PROTOCOL
+
+This protocol allows boot loaders to defer initialisation to the EFI
+boot stub. The boot loader is required to load the kernel/initrd(s)
+from the boot media and jump to the EFI handover protocol entry point
+which is hdr->handover_offset bytes from the beginning of
+startup_{32,64}.
+
+The function prototype for the handover entry point looks like this,
+
+ efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
+
+'handle' is the EFI image handle passed to the boot loader by the EFI
+firmware, 'table' is the EFI system table - these are the first two
+arguments of the "handoff state" as described in section 2.3 of the
+UEFI specification. 'bp' is the boot loader-allocated boot params.
+
+The boot loader *must* fill out the following fields in bp,
+
+ o hdr.code32_start
+ o hdr.cmd_line_ptr
+ o hdr.cmdline_size
+ o hdr.ramdisk_image (if applicable)
+ o hdr.ramdisk_size (if applicable)
+
+All other fields should be zero.
diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt
new file mode 100644
index 00000000000..d62bea6796d
--- /dev/null
+++ b/Documentation/x86/early-microcode.txt
@@ -0,0 +1,42 @@
+Early load microcode
+====================
+By Fenghua Yu <fenghua.yu@intel.com>
+
+Kernel can update microcode in early phase of boot time. Loading microcode early
+can fix CPU issues before they are observed during kernel boot time.
+
+Microcode is stored in an initrd file. The microcode is read from the initrd
+file and loaded to CPUs during boot time.
+
+The format of the combined initrd image is microcode in cpio format followed by
+the initrd image (maybe compressed). Kernel parses the combined initrd image
+during boot time. The microcode file in cpio name space is:
+on Intel: kernel/x86/microcode/GenuineIntel.bin
+on AMD : kernel/x86/microcode/AuthenticAMD.bin
+
+During BSP boot (before SMP starts), if the kernel finds the microcode file in
+the initrd file, it parses the microcode and saves matching microcode in memory.
+If matching microcode is found, it will be uploaded in BSP and later on in all
+APs.
+
+The cached microcode patch is applied when CPUs resume from a sleep state.
+
+There are two legacy user space interfaces to load microcode, either through
+/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
+in sysfs.
+
+In addition to these two legacy methods, the early loading method described
+here is the third method with which microcode can be uploaded to a system's
+CPUs.
+
+The following example script shows how to generate a new combined initrd file in
+/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and
+original initrd image /boot/initrd-3.5.0.img.
+
+mkdir initrd
+cd initrd
+mkdir -p kernel/x86/microcode
+cp ../microcode.bin kernel/x86/microcode/GenuineIntel.bin (or AuthenticAMD.bin)
+find . | cpio -o -H newc >../ucode.cpio
+cd ..
+cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img
diff --git a/Documentation/x86/earlyprintk.txt b/Documentation/x86/earlyprintk.txt
index 607b1a01606..688e3eeed21 100644
--- a/Documentation/x86/earlyprintk.txt
+++ b/Documentation/x86/earlyprintk.txt
@@ -7,7 +7,7 @@ and two USB cables, connected like this:
[host/target] <-------> [USB debug key] <-------> [client/console]
-1. There are three specific hardware requirements:
+1. There are a number of specific hardware requirements:
a.) Host/target system needs to have USB debug port capability.
@@ -33,7 +33,7 @@ and two USB cables, connected like this:
...
( If your system does not list a debug port capability then you probably
- wont be able to use the USB debug key. )
+ won't be able to use the USB debug key. )
b.) You also need a Netchip USB debug cable/key:
@@ -42,7 +42,35 @@ and two USB cables, connected like this:
This is a small blue plastic connector with two USB connections,
it draws power from its USB connections.
- c.) Thirdly, you need a second client/console system with a regular USB port.
+ c.) You need a second client/console system with a high speed USB 2.0
+ port.
+
+ d.) The Netchip device must be plugged directly into the physical
+ debug port on the "host/target" system. You cannot use a USB hub in
+ between the physical debug port and the "host/target" system.
+
+ The EHCI debug controller is bound to a specific physical USB
+ port and the Netchip device will only work as an early printk
+ device in this port. The EHCI host controllers are electrically
+ wired such that the EHCI debug controller is hooked up to the
+ first physical and there is no way to change this via software.
+ You can find the physical port through experimentation by trying
+ each physical port on the system and rebooting. Or you can try
+ and use lsusb or look at the kernel info messages emitted by the
+ usb stack when you plug a usb device into various ports on the
+ "host/target" system.
+
+ Some hardware vendors do not expose the usb debug port with a
+ physical connector and if you find such a device send a complaint
+ to the hardware vendor, because there is no reason not to wire
+ this port into one of the physically accessible ports.
+
+ e.) It is also important to note, that many versions of the Netchip
+ device require the "client/console" system to be plugged into the
+ right and side of the device (with the product logo facing up and
+ readable left to right). The reason being is that the 5 volt
+ power supply is taken from only one side of the device and it
+ must be the side that does not get rebooted.
2. Software requirements:
@@ -56,6 +84,13 @@ and two USB cables, connected like this:
(If you are using Grub, append it to the 'kernel' line in
/etc/grub.conf)
+ On systems with more than one EHCI debug controller you must
+ specify the correct EHCI debug controller number. The ordering
+ comes from the PCI bus enumeration of the EHCI controllers. The
+ default with no number argument is "0" the first EHCI debug
+ controller. To use the second EHCI debug controller, you would
+ use the command line: "earlyprintk=dbgp1"
+
NOTE: normally earlyprintk console gets turned off once the
regular console is alive - use "earlyprintk=dbgp,keep" to keep
this channel open beyond early bootup. This can be useful for
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 00000000000..bc7226ef505
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,95 @@
+This file documents some of the kernel entries in
+arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+
+The x86 architecture has quite a few different ways to jump into
+kernel code. Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
+and arch/x86/ia32/ia32entry.S.
+
+The IDT vector assignments are listed in arch/x86/include/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
+ either way.
+
+ - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
+ code
+
+ - interrupt: An array of entries. Every IDT vector that doesn't
+ explicitly point somewhere else gets set to the corresponding
+ value in interrupts. These point to a whole array of
+ magically-generated functions that make their way to do_IRQ with
+ the interrupt number as a parameter.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+ like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here. The different x86-64 entries
+have different calling conventions. The syscall and sysenter
+instructions have their own peculiar calling conventions. Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right. (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky. Swapgs
+toggles whether gs is the kernel gs or the user gs. The swapgs
+instruction is rather fragile: it must nest perfectly and only in
+single depth, it should only be used if entering from user mode to
+kernel mode and then when returning to user-space, and precisely
+so. If we mess that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack:
+
+ xorl %ebx,%ebx
+ testl $3,CS+8(%rsp)
+ je error_kernelspace
+ SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies):
+
+ movl $1,%ebx
+ movl $MSR_GS_BASE,%ecx
+ rdmsr
+ testl %edx,%edx
+ js 1f /* negative -> in kernel */
+ SWAPGS
+ xorl %ebx,%ebx
+1: ret
+
+and the whole paranoid non-paranoid macro complexity is about whether
+to suffer that RDMSR cost.
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+So we try only to mark those entry methods 'paranoid' that absolutely
+need the more expensive check for the GS base - and we generate all
+'normal' entry points with the regular (faster) entry macros.
diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt
new file mode 100644
index 00000000000..32901aa36f0
--- /dev/null
+++ b/Documentation/x86/exception-tables.txt
@@ -0,0 +1,292 @@
+ Kernel level exception handling in Linux
+ Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
+
+When a process runs in kernel mode, it often has to access user
+mode memory whose address has been passed by an untrusted program.
+To protect itself the kernel has to verify this address.
+
+In older versions of Linux this was done with the
+int verify_area(int type, const void * addr, unsigned long size)
+function (which has since been replaced by access_ok()).
+
+This function verified that the memory area starting at address
+'addr' and of size 'size' was accessible for the operation specified
+in type (read or write). To do this, verify_read had to look up the
+virtual memory area (vma) that contained the address addr. In the
+normal case (correctly working program), this test was successful.
+It only failed for a few buggy programs. In some kernel profiling
+tests, this normally unneeded verification used up a considerable
+amount of time.
+
+To overcome this situation, Linus decided to let the virtual memory
+hardware present in every Linux-capable CPU handle this test.
+
+How does this work?
+
+Whenever the kernel tries to access an address that is currently not
+accessible, the CPU generates a page fault exception and calls the
+page fault handler
+
+void do_page_fault(struct pt_regs *regs, unsigned long error_code)
+
+in arch/x86/mm/fault.c. The parameters on the stack are set up by
+the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
+regs is a pointer to the saved registers on the stack, error_code
+contains a reason code for the exception.
+
+do_page_fault first obtains the unaccessible address from the CPU
+control register CR2. If the address is within the virtual address
+space of the process, the fault probably occurred, because the page
+was not swapped in, write protected or something similar. However,
+we are interested in the other case: the address is not valid, there
+is no vma that contains this address. In this case, the kernel jumps
+to the bad_area label.
+
+There it uses the address of the instruction that caused the exception
+(i.e. regs->eip) to find an address where the execution can continue
+(fixup). If this search is successful, the fault handler modifies the
+return address (again regs->eip) and returns. The execution will
+continue at the address in fixup.
+
+Where does fixup point to?
+
+Since we jump to the contents of fixup, fixup obviously points
+to executable code. This code is hidden inside the user access macros.
+I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
+as an example. The definition is somewhat hard to follow, so let's peek at
+the code generated by the preprocessor and the compiler. I selected
+the get_user call in drivers/char/sysrq.c for a detailed examination.
+
+The original code in sysrq.c line 587:
+ get_user(c, buf);
+
+The preprocessor output (edited to become somewhat readable):
+
+(
+ {
+ long __gu_err = - 14 , __gu_val = 0;
+ const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
+ if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
+ (((sizeof(*(buf))) <= 0xC0000000UL) &&
+ ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
+ do {
+ __gu_err = 0;
+ switch ((sizeof(*(buf)))) {
+ case 1:
+ __asm__ __volatile__(
+ "1: mov" "b" " %2,%" "b" "1\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl %3,%0\n"
+ " xor" "b" " %" "b" "1,%" "b" "1\n"
+ " jmp 2b\n"
+ ".section __ex_table,\"a\"\n"
+ " .align 4\n"
+ " .long 1b,3b\n"
+ ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
+ ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
+ break;
+ case 2:
+ __asm__ __volatile__(
+ "1: mov" "w" " %2,%" "w" "1\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl %3,%0\n"
+ " xor" "w" " %" "w" "1,%" "w" "1\n"
+ " jmp 2b\n"
+ ".section __ex_table,\"a\"\n"
+ " .align 4\n"
+ " .long 1b,3b\n"
+ ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+ ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
+ break;
+ case 4:
+ __asm__ __volatile__(
+ "1: mov" "l" " %2,%" "" "1\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl %3,%0\n"
+ " xor" "l" " %" "" "1,%" "" "1\n"
+ " jmp 2b\n"
+ ".section __ex_table,\"a\"\n"
+ " .align 4\n" " .long 1b,3b\n"
+ ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+ ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
+ break;
+ default:
+ (__gu_val) = __get_user_bad();
+ }
+ } while (0) ;
+ ((c)) = (__typeof__(*((buf))))__gu_val;
+ __gu_err;
+ }
+);
+
+WOW! Black GCC/assembly magic. This is impossible to follow, so let's
+see what code gcc generates:
+
+ > xorl %edx,%edx
+ > movl current_set,%eax
+ > cmpl $24,788(%eax)
+ > je .L1424
+ > cmpl $-1073741825,64(%esp)
+ > ja .L1423
+ > .L1424:
+ > movl %edx,%eax
+ > movl 64(%esp),%ebx
+ > #APP
+ > 1: movb (%ebx),%dl /* this is the actual user access */
+ > 2:
+ > .section .fixup,"ax"
+ > 3: movl $-14,%eax
+ > xorb %dl,%dl
+ > jmp 2b
+ > .section __ex_table,"a"
+ > .align 4
+ > .long 1b,3b
+ > .text
+ > #NO_APP
+ > .L1423:
+ > movzbl %dl,%esi
+
+The optimizer does a good job and gives us something we can actually
+understand. Can we? The actual user access is quite obvious. Thanks
+to the unified address space we can just access the address in user
+memory. But what does the .section stuff do?????
+
+To understand this we have to look at the final kernel:
+
+ > objdump --section-headers vmlinux
+ >
+ > vmlinux: file format elf32-i386
+ >
+ > Sections:
+ > Idx Name Size VMA LMA File off Algn
+ > 0 .text 00098f40 c0100000 c0100000 00001000 2**4
+ > CONTENTS, ALLOC, LOAD, READONLY, CODE
+ > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0
+ > CONTENTS, ALLOC, LOAD, READONLY, CODE
+ > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2
+ > CONTENTS, ALLOC, LOAD, READONLY, DATA
+ > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2
+ > CONTENTS, ALLOC, LOAD, READONLY, DATA
+ > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4
+ > CONTENTS, ALLOC, LOAD, DATA
+ > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2
+ > ALLOC
+ > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0
+ > CONTENTS, READONLY
+ > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0
+ > CONTENTS, READONLY
+
+There are obviously 2 non standard ELF sections in the generated object
+file. But first we want to find out what happened to our code in the
+final kernel executable:
+
+ > objdump --disassemble --section=.text vmlinux
+ >
+ > c017e785 <do_con_write+c1> xorl %edx,%edx
+ > c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax
+ > c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax)
+ > c017e793 <do_con_write+cf> je c017e79f <do_con_write+db>
+ > c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1)
+ > c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3>
+ > c017e79f <do_con_write+db> movl %edx,%eax
+ > c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx
+ > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
+ > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
+
+The whole user memory access is reduced to 10 x86 machine instructions.
+The instructions bracketed in the .section directives are no longer
+in the normal execution path. They are located in a different section
+of the executable file:
+
+ > objdump --disassemble --section=.fixup vmlinux
+ >
+ > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
+ > c0199ffa <.fixup+10ba> xorb %dl,%dl
+ > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
+
+And finally:
+ > objdump --full-contents --section=__ex_table vmlinux
+ >
+ > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
+ > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
+ > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
+
+or in human readable byte order:
+
+ > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................
+ > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
+ ^^^^^^^^^^^^^^^^^
+ this is the interesting part!
+ > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................
+
+What happened? The assembly directives
+
+.section .fixup,"ax"
+.section __ex_table,"a"
+
+told the assembler to move the following code to the specified
+sections in the ELF object file. So the instructions
+3: movl $-14,%eax
+ xorb %dl,%dl
+ jmp 2b
+ended up in the .fixup section of the object file and the addresses
+ .long 1b,3b
+ended up in the __ex_table section of the object file. 1b and 3b
+are local labels. The local label 1b (1b stands for next label 1
+backward) is the address of the instruction that might fault, i.e.
+in our case the address of the label 1 is c017e7a5:
+the original assembly code: > 1: movb (%ebx),%dl
+and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
+
+The local label 3 (backwards again) is the address of the code to handle
+the fault, in our case the actual value is c0199ff5:
+the original assembly code: > 3: movl $-14,%eax
+and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
+
+The assembly code
+ > .section __ex_table,"a"
+ > .align 4
+ > .long 1b,3b
+
+becomes the value pair
+ > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
+ ^this is ^this is
+ 1b 3b
+c017e7a5,c0199ff5 in the exception table of the kernel.
+
+So, what actually happens if a fault from kernel mode with no suitable
+vma occurs?
+
+1.) access to invalid address:
+ > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
+2.) MMU generates exception
+3.) CPU calls do_page_fault
+4.) do page fault calls search_exception_table (regs->eip == c017e7a5);
+5.) search_exception_table looks up the address c017e7a5 in the
+ exception table (i.e. the contents of the ELF section __ex_table)
+ and returns the address of the associated fault handle code c0199ff5.
+6.) do_page_fault modifies its own return address to point to the fault
+ handle code and returns.
+7.) execution continues in the fault handling code.
+8.) 8a) EAX becomes -EFAULT (== -14)
+ 8b) DL becomes zero (the value we "read" from user space)
+ 8c) execution continues at local label 2 (address of the
+ instruction immediately after the faulting user access).
+
+The steps 8a to 8c in a certain way emulate the faulting instruction.
+
+That's it, mostly. If you look at our example, you might ask why
+we set EAX to -EFAULT in the exception handler code. Well, the
+get_user macro actually returns a value: 0, if the user access was
+successful, -EFAULT on failure. Our original code did not test this
+return value, however the inline assembly code in get_user tries to
+return -EFAULT. GCC selected EAX to return this value.
+
+NOTE:
+Due to the way that the exception table is built and needs to be ordered,
+only use exceptions for code in the .text section. Any other section
+will cause the exception table to not be sorted correctly, and the
+exceptions will fail.
diff --git a/Documentation/x86/i386/IO-APIC.txt b/Documentation/x86/i386/IO-APIC.txt
index 30b4c714fbe..15f5baf7e1b 100644
--- a/Documentation/x86/i386/IO-APIC.txt
+++ b/Documentation/x86/i386/IO-APIC.txt
@@ -87,7 +87,7 @@ your PCI configuration:
echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
-note that this script wont work if you have skipped a few slots or if your
+note that this script won't work if you have skipped a few slots or if your
board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
connected in some strange way). E.g. if in the above case you have your SCSI
card (IRQ11) in Slot3, and have Slot1 empty:
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 2db5893d6c9..5223479291a 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -5,21 +5,58 @@ only the AMD64 specific ones are listed here.
Machine check
- mce=off disable machine check
- mce=bootlog Enable logging of machine checks left over from booting.
- Disabled by default on AMD because some BIOS leave bogus ones.
- If your BIOS doesn't do that it's a good idea to enable though
- to make sure you log even machine check events that result
- in a reboot. On Intel systems it is enabled by default.
+ Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
+
+ mce=off
+ Disable machine check
+ mce=no_cmci
+ Disable CMCI(Corrected Machine Check Interrupt) that
+ Intel processor supports. Usually this disablement is
+ not recommended, but it might be handy if your hardware
+ is misbehaving.
+ Note that you'll get more problems without CMCI than with
+ due to the shared banks, i.e. you might get duplicated
+ error logs.
+ mce=dont_log_ce
+ Don't make logs for corrected errors. All events reported
+ as corrected are silently cleared by OS.
+ This option will be useful if you have no interest in any
+ of corrected errors.
+ mce=ignore_ce
+ Disable features for corrected errors, e.g. polling timer
+ and CMCI. All events reported as corrected are not cleared
+ by OS and remained in its error banks.
+ Usually this disablement is not recommended, however if
+ there is an agent checking/clearing corrected errors
+ (e.g. BIOS or hardware monitoring applications), conflicting
+ with OS's error handling, and you cannot deactivate the agent,
+ then this option will be a help.
+ mce=bootlog
+ Enable logging of machine checks left over from booting.
+ Disabled by default on AMD because some BIOS leave bogus ones.
+ If your BIOS doesn't do that it's a good idea to enable though
+ to make sure you log even machine check events that result
+ in a reboot. On Intel systems it is enabled by default.
mce=nobootlog
Disable boot machine check logging.
- mce=tolerancelevel (number)
+ mce=tolerancelevel[,monarchtimeout] (number,number)
+ tolerance levels:
0: always panic on uncorrected errors, log corrected errors
1: panic or SIGBUS on uncorrected errors, log corrected errors
2: SIGBUS or log uncorrected errors, log corrected errors
3: never panic or SIGBUS, log all errors (for testing only)
Default is 1
Can be also set using sysfs which is preferable.
+ monarchtimeout:
+ Sets the time in us to wait for other CPUs on machine checks. 0
+ to disable.
+ mce=bios_cmci_threshold
+ Don't overwrite the bios-set CMCI threshold. This boot option
+ prevents Linux from overwriting the CMCI threshold set by the
+ bios. Without this option, Linux always sets the CMCI
+ threshold to 1. Enabling this may make memory predictive failure
+ analysis less effective if the bios sets thresholds for memory
+ errors since we will not see details for all errors.
nomce (for compatibility with i386): same as mce=off
@@ -41,33 +78,11 @@ APICs
no_timer_check Don't check the IO-APIC timer. This can work around
problems with incorrect timer initialization on some boards.
-
- apicmaintimer Run time keeping from the local APIC timer instead
- of using the PIT/HPET interrupt for this. This is useful
- when the PIT/HPET interrupts are unreliable.
-
- noapicmaintimer Don't do time keeping using the APIC timer.
- Useful when this option was auto selected, but doesn't work.
-
apicpmtimer
Do APIC timer calibration using the pmtimer. Implies
apicmaintimer. Useful when your PIT timer is totally
broken.
-Early Console
-
- syntax: earlyprintk=vga
- earlyprintk=serial[,ttySn[,baudrate]]
-
- The early console is useful when the kernel crashes before the
- normal console is initialized. It is not enabled by
- default because it has some cosmetic problems.
- Append ,keep to not disable it when the real console takes over.
- Only vga or serial at a time, not both.
- Currently only ttyS0 and ttyS1 are supported.
- Interaction with the standard serial driver is not very good.
- The VGA output is eventually overwritten by the real console.
-
Timing
notsc
@@ -75,10 +90,6 @@ Timing
This can be used to work around timing problems on multiprocessor systems
with not properly synchronized CPUs.
- report_lost_ticks
- Report when timer interrupts are lost because some code turned off
- interrupts for too long.
-
nohpet
Don't use the HPET timer.
@@ -125,30 +136,19 @@ Non Executable Mappings
on Enable(default)
off Disable
-SMP
-
- additional_cpus=NUM Allow NUM more CPUs for hotplug
- (defaults are specified by the BIOS, see Documentation/x86/x86_64/cpu-hotplug-spec)
-
NUMA
numa=off Only set up a single NUMA node spanning all memory.
numa=noacpi Don't parse the SRAT table for NUMA setup
- numa=fake=CMDLINE
- If a number, fakes CMDLINE nodes and ignores NUMA setup of the
- actual machine. Otherwise, system memory is configured
- depending on the sizes and coefficients listed. For example:
- numa=fake=2*512,1024,4*256,*128
- gives two 512M nodes, a 1024M node, four 256M nodes, and the
- rest split into 128M chunks. If the last character of CMDLINE
- is a *, the remaining memory is divided up equally among its
- coefficient:
- numa=fake=2*512,2*
- gives two 512M nodes and the rest split into two nodes.
- Otherwise, the remaining system RAM is allocated to an
- additional node.
+ numa=fake=<size>[MG]
+ If given as a memory unit, fills all system RAM with nodes of
+ size interleaved over physical nodes.
+
+ numa=fake=<N>
+ If given as an integer, fills all system RAM with N fake nodes
+ interleaved over physical nodes.
ACPI
@@ -163,15 +163,20 @@ ACPI
acpi=noirq Don't route interrupts
+ acpi=nocmcff Disable firmware first mode for corrected errors. This
+ disables parsing the HEST CMC error source to check if
+ firmware has set the FF flag. This may result in
+ duplicate corrected error reports.
+
PCI
- pci=off Don't use PCI
- pci=conf1 Use conf1 access.
- pci=conf2 Use conf2 access.
- pci=rom Assign ROMs.
- pci=assign-busses Assign busses
- pci=irqmask=MASK Set PCI interrupt mask to MASK
- pci=lastbus=NUMBER Scan upto NUMBER busses, no matter what the mptable says.
+ pci=off Don't use PCI
+ pci=conf1 Use conf1 access.
+ pci=conf2 Use conf2 access.
+ pci=rom Assign ROMs.
+ pci=assign-busses Assign busses
+ pci=irqmask=MASK Set PCI interrupt mask to MASK
+ pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says.
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
IOMMU (input/output memory management unit)
@@ -182,7 +187,7 @@ IOMMU (input/output memory management unit)
(e.g. because you have < 3 GB memory).
Kernel boot message: "PCI-DMA: Disabling IOMMU"
- 2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
+ 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
Kernel boot message: "PCI-DMA: using GART IOMMU"
3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
@@ -269,23 +274,8 @@ IOMMU (input/output memory management unit)
Debugging
- oops=panic Always panic on oopses. Default is to just kill the process,
- but there is a small probability of deadlocking the machine.
- This will also cause panics on machine check exceptions.
- Useful together with panic=30 to trigger a reboot.
-
kstack=N Print N words from the kernel stack in oops dumps.
- pagefaulttrace Dump all page faults. Only useful for extreme debugging
- and will create a lot of output.
-
- call_trace=[old|both|newfallback|new]
- old: use old inexact backtracer
- new: use new exact dwarf2 unwinder
- both: print entries from both
- newfallback: use new unwinder but fall back to old if it gets
- stuck (default)
-
Miscellaneous
nogbpages
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index 5ad65d51fb9..a01eec5d1d0 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -18,9 +18,9 @@ specialized stacks contain no useful data. The main CPU stacks are:
Used for external hardware interrupts. If this is the first external
hardware interrupt (i.e. not a nested hardware interrupt) then the
kernel switches from the current task to the interrupt stack. Like
- the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS),
- this gives more room for kernel interrupt processing without having
- to increase the size of every per thread stack.
+ the split thread and interrupt stacks on i386, this gives more room
+ for kernel interrupt processing without having to increase the size
+ of every per thread stack.
The interrupt stack is also used when processing a softirq.
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
index a05e58e7b15..b1fb3027328 100644
--- a/Documentation/x86/x86_64/machinecheck
+++ b/Documentation/x86/x86_64/machinecheck
@@ -41,7 +41,9 @@ check_interval
the polling interval. When the poller stops finding MCEs, it
triggers an exponential backoff (poll less often) on the polling
interval. The check_interval variable is both the initial and
- maximum polling interval.
+ maximum polling interval. 0 means no polling for corrected machine
+ check errors (but some corrected errors might be still reported
+ in other ways)
tolerant
Tolerance level. When a machine check exception occurs for a non
@@ -67,6 +69,10 @@ trigger
Program to run when a machine check event is detected.
This is an alternative to running mcelog regularly from cron
and allows to detect events faster.
+monarch_timeout
+ How long to wait for the other CPUs to machine check too on a
+ exception. 0 to disable waiting for other CPUs.
+ Unit: us
TBD document entries for AMD threshold interrupt configuration
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index d6498e3cd71..afe68ddbe6a 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,8 +12,12 @@ ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
+ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
+... unused hole ...
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
-ffffffffa0000000 - fffffffffff00000 (=1536 MB) module mapping space
+ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
@@ -26,4 +30,11 @@ reference.
Current X86-64 implementations only support 40 bits of address space,
but we support up to 46 bits. This expands into MBZ space in the page tables.
+->trampoline_pgd:
+
+We map EFI runtime services in the aforementioned PGD in the virtual
+range of 64Gb (arbitrarily set, can be raised if needed)
+
+0xffffffef00000000 - 0xffffffff00000000
+
-Andi Kleen, Jul 2004
diff --git a/Documentation/x86/zero-page.txt b/Documentation/x86/zero-page.txt
index 4f913857b8a..199f453cb4d 100644
--- a/Documentation/x86/zero-page.txt
+++ b/Documentation/x86/zero-page.txt
@@ -12,11 +12,16 @@ Offset Proto Name Meaning
000/040 ALL screen_info Text mode or frame buffer information
(struct screen_info)
040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info)
+058/008 ALL tboot_addr Physical address of tboot shared page
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table)
+0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends
+0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
+0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits
+0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits
140/080 ALL edid_info Video mode setup (struct edid_info)
1C0/020 ALL efi_info EFI 32 information (struct efi_info)
1E0/004 ALL alk_mem_k Alternative mem check, in KB
@@ -25,6 +30,7 @@ Offset Proto Name Meaning
1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below)
1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer
(below)
+1EF/001 ALL sentinel Used to detect broken bootloaders
290/040 ALL edd_mbr_sig_buffer EDD MBR signatures
2D0/A00 ALL e820_map E820 memory map table
(array of struct e820entry)