diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2011-07-22 17:05:15 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2011-07-22 17:05:15 -0700 |
commit | 8e204874db000928e37199c2db82b7eb8966cc3c (patch) | |
tree | eae66035cb761c3c5a79e98b92280b5156bc01ef /Documentation | |
parent | 3e0b8df79ddb8955d2cce5e858972a9cfe763384 (diff) | |
parent | aafade242ff24fac3aabf61c7861dfa44a3c2445 (diff) |
Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, vdso: Do not allocate memory for the vDSO
clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option
x86, vdso: Drop now wrong comment
Document the vDSO and add a reference parser
ia64: Replace clocksource.fsys_mmio with generic arch data
x86-64: Move vread_tsc and vread_hpet into the vDSO
clocksource: Replace vread with generic arch data
x86-64: Add --no-undefined to vDSO build
x86-64: Allow alternative patching in the vDSO
x86: Make alternative instruction pointers relative
x86-64: Improve vsyscall emulation CS and RIP handling
x86-64: Emulate legacy vsyscalls
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Map the HPET NX
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Give vvars their own page
x86-64: Document some of entry_64.S
x86-64: Fix alignment of jiffies variable
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/stable/vdso | 27 | ||||
-rw-r--r-- | Documentation/vDSO/parse_vdso.c | 256 | ||||
-rw-r--r-- | Documentation/vDSO/vdso_test.c | 111 | ||||
-rw-r--r-- | Documentation/x86/entry_64.txt | 98 |
4 files changed, 492 insertions, 0 deletions
diff --git a/Documentation/ABI/stable/vdso b/Documentation/ABI/stable/vdso new file mode 100644 index 00000000000..8a1cbb59449 --- /dev/null +++ b/Documentation/ABI/stable/vdso @@ -0,0 +1,27 @@ +On some architectures, when the kernel loads any userspace program it +maps an ELF DSO into that program's address space. This DSO is called +the vDSO and it often contains useful and highly-optimized alternatives +to real syscalls. + +These functions are called just like ordinary C function according to +your platform's ABI. Call them from a sensible context. (For example, +if you set CS on x86 to something strange, the vDSO functions are +within their rights to crash.) In addition, if you pass a bad +pointer to a vDSO function, you might get SIGSEGV instead of -EFAULT. + +To find the DSO, parse the auxiliary vector passed to the program's +entry point. The AT_SYSINFO_EHDR entry will point to the vDSO. + +The vDSO uses symbol versioning; whenever you request a symbol from the +vDSO, specify the version you are expecting. + +Programs that dynamically link to glibc will use the vDSO automatically. +Otherwise, you can use the reference parser in Documentation/vDSO/parse_vdso.c. + +Unless otherwise noted, the set of symbols with any given version and the +ABI of those symbols is considered stable. It may vary across architectures, +though. + +(As of this writing, this ABI documentation as been confirmed for x86_64. + The maintainers of the other vDSO-using architectures should confirm + that it is correct for their architecture.)
\ No newline at end of file diff --git a/Documentation/vDSO/parse_vdso.c b/Documentation/vDSO/parse_vdso.c new file mode 100644 index 00000000000..85870208edc --- /dev/null +++ b/Documentation/vDSO/parse_vdso.c @@ -0,0 +1,256 @@ +/* + * parse_vdso.c: Linux reference vDSO parser + * Written by Andrew Lutomirski, 2011. + * + * This code is meant to be linked in to various programs that run on Linux. + * As such, it is available with as few restrictions as possible. This file + * is licensed under the Creative Commons Zero License, version 1.0, + * available at http://creativecommons.org/publicdomain/zero/1.0/legalcode + * + * The vDSO is a regular ELF DSO that the kernel maps into user space when + * it starts a program. It works equally well in statically and dynamically + * linked binaries. + * + * This code is tested on x86_64. In principle it should work on any 64-bit + * architecture that has a vDSO. + */ + +#include <stdbool.h> +#include <stdint.h> +#include <string.h> +#include <elf.h> + +/* + * To use this vDSO parser, first call one of the vdso_init_* functions. + * If you've already parsed auxv, then pass the value of AT_SYSINFO_EHDR + * to vdso_init_from_sysinfo_ehdr. Otherwise pass auxv to vdso_init_from_auxv. + * Then call vdso_sym for each symbol you want. For example, to look up + * gettimeofday on x86_64, use: + * + * <some pointer> = vdso_sym("LINUX_2.6", "gettimeofday"); + * or + * <some pointer> = vdso_sym("LINUX_2.6", "__vdso_gettimeofday"); + * + * vdso_sym will return 0 if the symbol doesn't exist or if the init function + * failed or was not called. vdso_sym is a little slow, so its return value + * should be cached. + * + * vdso_sym is threadsafe; the init functions are not. + * + * These are the prototypes: + */ +extern void vdso_init_from_auxv(void *auxv); +extern void vdso_init_from_sysinfo_ehdr(uintptr_t base); +extern void *vdso_sym(const char *version, const char *name); + + +/* And here's the code. */ + +#ifndef __x86_64__ +# error Not yet ported to non-x86_64 architectures +#endif + +static struct vdso_info +{ + bool valid; + + /* Load information */ + uintptr_t load_addr; + uintptr_t load_offset; /* load_addr - recorded vaddr */ + + /* Symbol table */ + Elf64_Sym *symtab; + const char *symstrings; + Elf64_Word *bucket, *chain; + Elf64_Word nbucket, nchain; + + /* Version table */ + Elf64_Versym *versym; + Elf64_Verdef *verdef; +} vdso_info; + +/* Straight from the ELF specification. */ +static unsigned long elf_hash(const unsigned char *name) +{ + unsigned long h = 0, g; + while (*name) + { + h = (h << 4) + *name++; + if (g = h & 0xf0000000) + h ^= g >> 24; + h &= ~g; + } + return h; +} + +void vdso_init_from_sysinfo_ehdr(uintptr_t base) +{ + size_t i; + bool found_vaddr = false; + + vdso_info.valid = false; + + vdso_info.load_addr = base; + + Elf64_Ehdr *hdr = (Elf64_Ehdr*)base; + Elf64_Phdr *pt = (Elf64_Phdr*)(vdso_info.load_addr + hdr->e_phoff); + Elf64_Dyn *dyn = 0; + + /* + * We need two things from the segment table: the load offset + * and the dynamic table. + */ + for (i = 0; i < hdr->e_phnum; i++) + { + if (pt[i].p_type == PT_LOAD && !found_vaddr) { + found_vaddr = true; + vdso_info.load_offset = base + + (uintptr_t)pt[i].p_offset + - (uintptr_t)pt[i].p_vaddr; + } else if (pt[i].p_type == PT_DYNAMIC) { + dyn = (Elf64_Dyn*)(base + pt[i].p_offset); + } + } + + if (!found_vaddr || !dyn) + return; /* Failed */ + + /* + * Fish out the useful bits of the dynamic table. + */ + Elf64_Word *hash = 0; + vdso_info.symstrings = 0; + vdso_info.symtab = 0; + vdso_info.versym = 0; + vdso_info.verdef = 0; + for (i = 0; dyn[i].d_tag != DT_NULL; i++) { + switch (dyn[i].d_tag) { + case DT_STRTAB: + vdso_info.symstrings = (const char *) + ((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; + case DT_SYMTAB: + vdso_info.symtab = (Elf64_Sym *) + ((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; + case DT_HASH: + hash = (Elf64_Word *) + ((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; + case DT_VERSYM: + vdso_info.versym = (Elf64_Versym *) + ((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; + case DT_VERDEF: + vdso_info.verdef = (Elf64_Verdef *) + ((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; + } + } + if (!vdso_info.symstrings || !vdso_info.symtab || !hash) + return; /* Failed */ + + if (!vdso_info.verdef) + vdso_info.versym = 0; + + /* Parse the hash table header. */ + vdso_info.nbucket = hash[0]; + vdso_info.nchain = hash[1]; + vdso_info.bucket = &hash[2]; + vdso_info.chain = &hash[vdso_info.nbucket + 2]; + + /* That's all we need. */ + vdso_info.valid = true; +} + +static bool vdso_match_version(Elf64_Versym ver, + const char *name, Elf64_Word hash) +{ + /* + * This is a helper function to check if the version indexed by + * ver matches name (which hashes to hash). + * + * The version definition table is a mess, and I don't know how + * to do this in better than linear time without allocating memory + * to build an index. I also don't know why the table has + * variable size entries in the first place. + * + * For added fun, I can't find a comprehensible specification of how + * to parse all the weird flags in the table. + * + * So I just parse the whole table every time. + */ + + /* First step: find the version definition */ + ver &= 0x7fff; /* Apparently bit 15 means "hidden" */ + Elf64_Verdef *def = vdso_info.verdef; + while(true) { + if ((def->vd_flags & VER_FLG_BASE) == 0 + && (def->vd_ndx & 0x7fff) == ver) + break; + + if (def->vd_next == 0) + return false; /* No definition. */ + + def = (Elf64_Verdef *)((char *)def + def->vd_next); + } + + /* Now figure out whether it matches. */ + Elf64_Verdaux *aux = (Elf64_Verdaux*)((char *)def + def->vd_aux); + return def->vd_hash == hash + && !strcmp(name, vdso_info.symstrings + aux->vda_name); +} + +void *vdso_sym(const char *version, const char *name) +{ + unsigned long ver_hash; + if (!vdso_info.valid) + return 0; + + ver_hash = elf_hash(version); + Elf64_Word chain = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket]; + + for (; chain != STN_UNDEF; chain = vdso_info.chain[chain]) { + Elf64_Sym *sym = &vdso_info.symtab[chain]; + + /* Check for a defined global or weak function w/ right name. */ + if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC) + continue; + if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL && + ELF64_ST_BIND(sym->st_info) != STB_WEAK) + continue; + if (sym->st_shndx == SHN_UNDEF) + continue; + if (strcmp(name, vdso_info.symstrings + sym->st_name)) + continue; + + /* Check symbol version. */ + if (vdso_info.versym + && !vdso_match_version(vdso_info.versym[chain], + version, ver_hash)) + continue; + + return (void *)(vdso_info.load_offset + sym->st_value); + } + + return 0; +} + +void vdso_init_from_auxv(void *auxv) +{ + Elf64_auxv_t *elf_auxv = auxv; + for (int i = 0; elf_auxv[i].a_type != AT_NULL; i++) + { + if (elf_auxv[i].a_type == AT_SYSINFO_EHDR) { + vdso_init_from_sysinfo_ehdr(elf_auxv[i].a_un.a_val); + return; + } + } + + vdso_info.valid = false; +} diff --git a/Documentation/vDSO/vdso_test.c b/Documentation/vDSO/vdso_test.c new file mode 100644 index 00000000000..fff633432df --- /dev/null +++ b/Documentation/vDSO/vdso_test.c @@ -0,0 +1,111 @@ +/* + * vdso_test.c: Sample code to test parse_vdso.c on x86_64 + * Copyright (c) 2011 Andy Lutomirski + * Subject to the GNU General Public License, version 2 + * + * You can amuse yourself by compiling with: + * gcc -std=gnu99 -nostdlib + * -Os -fno-asynchronous-unwind-tables -flto + * vdso_test.c parse_vdso.c -o vdso_test + * to generate a small binary with no dependencies at all. + */ + +#include <sys/syscall.h> +#include <sys/time.h> +#include <unistd.h> +#include <stdint.h> + +extern void *vdso_sym(const char *version, const char *name); +extern void vdso_init_from_sysinfo_ehdr(uintptr_t base); +extern void vdso_init_from_auxv(void *auxv); + +/* We need a libc functions... */ +int strcmp(const char *a, const char *b) +{ + /* This implementation is buggy: it never returns -1. */ + while (*a || *b) { + if (*a != *b) + return 1; + if (*a == 0 || *b == 0) + return 1; + a++; + b++; + } + + return 0; +} + +/* ...and two syscalls. This is x86_64-specific. */ +static inline long linux_write(int fd, const void *data, size_t len) +{ + + long ret; + asm volatile ("syscall" : "=a" (ret) : "a" (__NR_write), + "D" (fd), "S" (data), "d" (len) : + "cc", "memory", "rcx", + "r8", "r9", "r10", "r11" ); + return ret; +} + +static inline void linux_exit(int code) +{ + asm volatile ("syscall" : : "a" (__NR_exit), "D" (code)); +} + +void to_base10(char *lastdig, uint64_t n) +{ + while (n) { + *lastdig = (n % 10) + '0'; + n /= 10; + lastdig--; + } +} + +__attribute__((externally_visible)) void c_main(void **stack) +{ + /* Parse the stack */ + long argc = (long)*stack; + stack += argc + 2; + + /* Now we're pointing at the environment. Skip it. */ + while(*stack) + stack++; + stack++; + + /* Now we're pointing at auxv. Initialize the vDSO parser. */ + vdso_init_from_auxv((void *)stack); + + /* Find gettimeofday. */ + typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz); + gtod_t gtod = (gtod_t)vdso_sym("LINUX_2.6", "__vdso_gettimeofday"); + + if (!gtod) + linux_exit(1); + + struct timeval tv; + long ret = gtod(&tv, 0); + + if (ret == 0) { + char buf[] = "The time is .000000\n"; + to_base10(buf + 31, tv.tv_sec); + to_base10(buf + 38, tv.tv_usec); + linux_write(1, buf, sizeof(buf) - 1); + } else { + linux_exit(ret); + } + + linux_exit(0); +} + +/* + * This is the real entry point. It passes the initial stack into + * the C entry point. + */ +asm ( + ".text\n" + ".global _start\n" + ".type _start,@function\n" + "_start:\n\t" + "mov %rsp,%rdi\n\t" + "jmp c_main" + ); diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt new file mode 100644 index 00000000000..7869f14d055 --- /dev/null +++ b/Documentation/x86/entry_64.txt @@ -0,0 +1,98 @@ +This file documents some of the kernel entries in +arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from +an email from Ingo Molnar: + +http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> + +The x86 architecture has quite a few different ways to jump into +kernel code. Most of these entry points are registered in +arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S +and arch/x86/ia32/ia32entry.S. + +The IDT vector assignments are listed in arch/x86/include/irq_vectors.h. + +Some of these entries are: + + - system_call: syscall instruction from 64-bit code. + + - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall + either way. + + - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit + code + + - interrupt: An array of entries. Every IDT vector that doesn't + explicitly point somewhere else gets set to the corresponding + value in interrupts. These point to a whole array of + magically-generated functions that make their way to do_IRQ with + the interrupt number as a parameter. + + - emulate_vsyscall: int 0xcc, a special non-ABI entry used by + vsyscall emulation. + + - APIC interrupts: Various special-purpose interrupts for things + like TLB shootdown. + + - Architecturally-defined exceptions like divide_error. + +There are a few complexities here. The different x86-64 entries +have different calling conventions. The syscall and sysenter +instructions have their own peculiar calling conventions. Some of +the IDT entries push an error code onto the stack; others don't. +IDT entries using the IST alternative stack mechanism need their own +magic to get the stack frames right. (You can find some +documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, +Volume 3, Chapter 6.) + +Dealing with the swapgs instruction is especially tricky. Swapgs +toggles whether gs is the kernel gs or the user gs. The swapgs +instruction is rather fragile: it must nest perfectly and only in +single depth, it should only be used if entering from user mode to +kernel mode and then when returning to user-space, and precisely +so. If we mess that up even slightly, we crash. + +So when we have a secondary entry, already in kernel mode, we *must +not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's +not switched/swapped yet. + +Now, there's a secondary complication: there's a cheap way to test +which mode the CPU is in and an expensive way. + +The cheap way is to pick this info off the entry frame on the kernel +stack, from the CS of the ptregs area of the kernel stack: + + xorl %ebx,%ebx + testl $3,CS+8(%rsp) + je error_kernelspace + SWAPGS + +The expensive (paranoid) way is to read back the MSR_GS_BASE value +(which is what SWAPGS modifies): + + movl $1,%ebx + movl $MSR_GS_BASE,%ecx + rdmsr + testl %edx,%edx + js 1f /* negative -> in kernel */ + SWAPGS + xorl %ebx,%ebx +1: ret + +and the whole paranoid non-paranoid macro complexity is about whether +to suffer that RDMSR cost. + +If we are at an interrupt or user-trap/gate-alike boundary then we can +use the faster check: the stack will be a reliable indicator of +whether SWAPGS was already done: if we see that we are a secondary +entry interrupting kernel mode execution, then we know that the GS +base has already been switched. If it says that we interrupted +user-space execution then we must do the SWAPGS. + +But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, +which might have triggered right after a normal entry wrote CS to the +stack but before we executed SWAPGS, then the only safe way to check +for GS is the slower method: the RDMSR. + +So we try only to mark those entry methods 'paranoid' that absolutely +need the more expensive check for the GS base - and we generate all +'normal' entry points with the regular (faster) entry macros. |