diff options
Diffstat (limited to 'Documentation/powerpc')
| -rw-r--r-- | Documentation/powerpc/00-INDEX | 17 | ||||
| -rw-r--r-- | Documentation/powerpc/cpu_families.txt | 221 | ||||
| -rw-r--r-- | Documentation/powerpc/cpu_features.txt | 10 | ||||
| -rw-r--r-- | Documentation/powerpc/firmware-assisted-dump.txt | 270 | ||||
| -rw-r--r-- | Documentation/powerpc/mpc52xx.txt | 12 | ||||
| -rw-r--r-- | Documentation/powerpc/phyp-assisted-dump.txt | 127 | ||||
| -rw-r--r-- | Documentation/powerpc/pmu-ebb.txt | 137 | ||||
| -rw-r--r-- | Documentation/powerpc/ptrace.txt | 17 | ||||
| -rw-r--r-- | Documentation/powerpc/sound.txt | 81 | ||||
| -rw-r--r-- | Documentation/powerpc/transactional_memory.txt | 198 | ||||
| -rw-r--r-- | Documentation/powerpc/zImage_layout.txt | 47 |
11 files changed, 867 insertions, 270 deletions
diff --git a/Documentation/powerpc/00-INDEX b/Documentation/powerpc/00-INDEX index 5620fb5ac42..6db73df0427 100644 --- a/Documentation/powerpc/00-INDEX +++ b/Documentation/powerpc/00-INDEX @@ -5,19 +5,28 @@ please mail me. 00-INDEX - this file +bootwrapper.txt + - Information on how the powerpc kernel is wrapped for boot on various + different platforms. cpu_features.txt - info on how we support a variety of CPUs with minimal compile-time options. eeh-pci-error-recovery.txt - info on PCI Bus EEH Error Recovery +firmware-assisted-dump.txt + - Documentation on the firmware assisted dump mechanism "fadump". hvcs.txt - IBM "Hypervisor Virtual Console Server" Installation Guide +kvm_440.txt + - Various notes on the implementation of KVM for PowerPC 440. mpc52xx.txt - Linux 2.6.x on MPC52xx family -sound.txt - - info on sound support under Linux/PPC -zImage_layout.txt - - info on the kernel images for Linux/PPC +pmu-ebb.txt + - Description of the API for using the PMU with Event Based Branches. qe_firmware.txt - describes the layout of firmware binaries for the Freescale QUICC Engine and the code that parses and uploads the microcode therein. +ptrace.txt + - Information on the ptrace interfaces for hardware debug registers. +transactional_memory.txt + - Overview of the Power8 transactional memory support. diff --git a/Documentation/powerpc/cpu_families.txt b/Documentation/powerpc/cpu_families.txt new file mode 100644 index 00000000000..fc08e22feb1 --- /dev/null +++ b/Documentation/powerpc/cpu_families.txt @@ -0,0 +1,221 @@ +CPU Families +============ + +This document tries to summarise some of the different cpu families that exist +and are supported by arch/powerpc. + + +Book3S (aka sPAPR) +------------------ + + - Hash MMU + - Mix of 32 & 64 bit + + +--------------+ +----------------+ + | Old POWER | --------------> | RS64 (threads) | + +--------------+ +----------------+ + | + | + v + +--------------+ +----------------+ +------+ + | 601 | --------------> | 603 | ---> | e300 | + +--------------+ +----------------+ +------+ + | | + | | + v v + +--------------+ +----------------+ +-------+ + | 604 | | 750 (G3) | ---> | 750CX | + +--------------+ +----------------+ +-------+ + | | | + | | | + v v v + +--------------+ +----------------+ +-------+ + | 620 (64 bit) | | 7400 | | 750CL | + +--------------+ +----------------+ +-------+ + | | | + | | | + v v v + +--------------+ +----------------+ +-------+ + | POWER3/630 | | 7410 | | 750FX | + +--------------+ +----------------+ +-------+ + | | + | | + v v + +--------------+ +----------------+ + | POWER3+ | | 7450 | + +--------------+ +----------------+ + | | + | | + v v + +--------------+ +----------------+ + | POWER4 | | 7455 | + +--------------+ +----------------+ + | | + | | + v v + +--------------+ +-------+ +----------------+ + | POWER4+ | --> | 970 | | 7447 | + +--------------+ +-------+ +----------------+ + | | | + | | | + v v v + +--------------+ +-------+ +----------------+ + | POWER5 | | 970FX | | 7448 | + +--------------+ +-------+ +----------------+ + | | | + | | | + v v v + +--------------+ +-------+ +----------------+ + | POWER5+ | | 970MP | | e600 | + +--------------+ +-------+ +----------------+ + | + | + v + +--------------+ + | POWER5++ | + +--------------+ + | + | + v + +--------------+ +-------+ + | POWER6 | <-?-> | Cell | + +--------------+ +-------+ + | + | + v + +--------------+ + | POWER7 | + +--------------+ + | + | + v + +--------------+ + | POWER7+ | + +--------------+ + | + | + v + +--------------+ + | POWER8 | + +--------------+ + + + +---------------+ + | PA6T (64 bit) | + +---------------+ + + +IBM BookE +--------- + + - Software loaded TLB. + - All 32 bit + + +--------------+ + | 401 | + +--------------+ + | + | + v + +--------------+ + | 403 | + +--------------+ + | + | + v + +--------------+ + | 405 | + +--------------+ + | + | + v + +--------------+ + | 440 | + +--------------+ + | + | + v + +--------------+ +----------------+ + | 450 | --> | BG/P | + +--------------+ +----------------+ + | + | + v + +--------------+ + | 460 | + +--------------+ + | + | + v + +--------------+ + | 476 | + +--------------+ + + +Motorola/Freescale 8xx +---------------------- + + - Software loaded with hardware assist. + - All 32 bit + + +-------------+ + | MPC8xx Core | + +-------------+ + + +Freescale BookE +--------------- + + - Software loaded TLB. + - e6500 adds HW loaded indirect TLB entries. + - Mix of 32 & 64 bit + + +--------------+ + | e200 | + +--------------+ + + + +--------------------------------+ + | e500 | + +--------------------------------+ + | + | + v + +--------------------------------+ + | e500v2 | + +--------------------------------+ + | + | + v + +--------------------------------+ + | e500mc (Book3e) | + +--------------------------------+ + | + | + v + +--------------------------------+ + | e5500 (64 bit) | + +--------------------------------+ + | + | + v + +--------------------------------+ + | e6500 (HW TLB) (Multithreaded) | + +--------------------------------+ + + +IBM A2 core +----------- + + - Book3E, software loaded TLB + HW loaded indirect TLB entries. + - 64 bit + + +--------------+ +----------------+ + | A2 core | --> | WSP | + +--------------+ +----------------+ + | + | + v + +--------------+ + | BG/Q | + +--------------+ diff --git a/Documentation/powerpc/cpu_features.txt b/Documentation/powerpc/cpu_features.txt index ffa4183fdb8..ae09df8722c 100644 --- a/Documentation/powerpc/cpu_features.txt +++ b/Documentation/powerpc/cpu_features.txt @@ -11,10 +11,10 @@ split instruction and data caches, and if the CPU supports the DOZE and NAP sleep modes. Detection of the feature set is simple. A list of processors can be found in -arch/ppc/kernel/cputable.c. The PVR register is masked and compared with each -value in the list. If a match is found, the cpu_features of cur_cpu_spec is -assigned to the feature bitmask for this processor and a __setup_cpu function -is called. +arch/powerpc/kernel/cputable.c. The PVR register is masked and compared with +each value in the list. If a match is found, the cpu_features of cur_cpu_spec +is assigned to the feature bitmask for this processor and a __setup_cpu +function is called. C code may test 'cur_cpu_spec[smp_processor_id()]->cpu_features' for a particular feature bit. This is done in quite a few places, for example @@ -51,6 +51,6 @@ should be used in the majority of cases. The END_FTR_SECTION macros are implemented by storing information about this code in the '__ftr_fixup' ELF section. When do_cpu_ftr_fixups -(arch/ppc/kernel/misc.S) is invoked, it will iterate over the records in +(arch/powerpc/kernel/misc.S) is invoked, it will iterate over the records in __ftr_fixup, and if the required feature is not present it will loop writing nop's from each BEGIN_FTR_SECTION to END_FTR_SECTION. diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt new file mode 100644 index 00000000000..3007bc98af2 --- /dev/null +++ b/Documentation/powerpc/firmware-assisted-dump.txt @@ -0,0 +1,270 @@ + + Firmware-Assisted Dump + ------------------------ + July 2011 + +The goal of firmware-assisted dump is to enable the dump of +a crashed system, and to do so from a fully-reset system, and +to minimize the total elapsed time until the system is back +in production use. + +- Firmware assisted dump (fadump) infrastructure is intended to replace + the existing phyp assisted dump. +- Fadump uses the same firmware interfaces and memory reservation model + as phyp assisted dump. +- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore + in the ELF format in the same way as kdump. This helps us reuse the + kdump infrastructure for dump capture and filtering. +- Unlike phyp dump, userspace tool does not need to refer any sysfs + interface while reading /proc/vmcore. +- Unlike phyp dump, fadump allows user to release all the memory reserved + for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem. +- Once enabled through kernel boot parameter, fadump can be + started/stopped through /sys/kernel/fadump_registered interface (see + sysfs files section below) and can be easily integrated with kdump + service start/stop init scripts. + +Comparing with kdump or other strategies, firmware-assisted +dump offers several strong, practical advantages: + +-- Unlike kdump, the system has been reset, and loaded + with a fresh copy of the kernel. In particular, + PCI and I/O devices have been reinitialized and are + in a clean, consistent state. +-- Once the dump is copied out, the memory that held the dump + is immediately available to the running kernel. And therefore, + unlike kdump, fadump doesn't need a 2nd reboot to get back + the system to the production configuration. + +The above can only be accomplished by coordination with, +and assistance from the Power firmware. The procedure is +as follows: + +-- The first kernel registers the sections of memory with the + Power firmware for dump preservation during OS initialization. + These registered sections of memory are reserved by the first + kernel during early boot. + +-- When a system crashes, the Power firmware will save + the low memory (boot memory of size larger of 5% of system RAM + or 256MB) of RAM to the previous registered region. It will + also save system registers, and hardware PTE's. + + NOTE: The term 'boot memory' means size of the low memory chunk + that is required for a kernel to boot successfully when + booted with restricted memory. By default, the boot memory + size will be the larger of 5% of system RAM or 256MB. + Alternatively, user can also specify boot memory size + through boot parameter 'fadump_reserve_mem=' which will + override the default calculated size. Use this option + if default boot memory size is not sufficient for second + kernel to boot successfully. + +-- After the low memory (boot memory) area has been saved, the + firmware will reset PCI and other hardware state. It will + *not* clear the RAM. It will then launch the bootloader, as + normal. + +-- The freshly booted kernel will notice that there is a new + node (ibm,dump-kernel) in the device tree, indicating that + there is crash data available from a previous boot. During + the early boot OS will reserve rest of the memory above + boot memory size effectively booting with restricted memory + size. This will make sure that the second kernel will not + touch any of the dump memory area. + +-- User-space tools will read /proc/vmcore to obtain the contents + of memory, which holds the previous crashed kernel dump in ELF + format. The userspace tools may copy this info to disk, or + network, nas, san, iscsi, etc. as desired. + +-- Once the userspace tool is done saving dump, it will echo + '1' to /sys/kernel/fadump_release_mem to release the reserved + memory back to general use, except the memory required for + next firmware-assisted dump registration. + + e.g. + # echo 1 > /sys/kernel/fadump_release_mem + +Please note that the firmware-assisted dump feature +is only available on Power6 and above systems with recent +firmware versions. + +Implementation details: +---------------------- + +During boot, a check is made to see if firmware supports +this feature on that particular machine. If it does, then +we check to see if an active dump is waiting for us. If yes +then everything but boot memory size of RAM is reserved during +early boot (See Fig. 2). This area is released once we finish +collecting the dump from user land scripts (e.g. kdump scripts) +that are run. If there is dump data, then the +/sys/kernel/fadump_release_mem file is created, and the reserved +memory is held. + +If there is no waiting dump data, then only the memory required +to hold CPU state, HPTE region, boot memory dump and elfcore +header, is reserved at the top of memory (see Fig. 1). This area +is *not* released: this region will be kept permanently reserved, +so that it can act as a receptacle for a copy of the boot memory +content in addition to CPU state and HPTE region, in the case a +crash does occur. + + o Memory Reservation during first kernel + + Low memory Top of memory + 0 boot memory size | + | | |<--Reserved dump area -->| + V V | Permanent Reservation V + +-----------+----------/ /----------+---+----+-----------+----+ + | | |CPU|HPTE| DUMP |ELF | + +-----------+----------/ /----------+---+----+-----------+----+ + | ^ + | | + \ / + ------------------------------------------- + Boot memory content gets transferred to + reserved area by firmware at the time of + crash + Fig. 1 + + o Memory Reservation during second kernel after crash + + Low memory Top of memory + 0 boot memory size | + | |<------------- Reserved dump area ----------- -->| + V V V + +-----------+----------/ /----------+---+----+-----------+----+ + | | |CPU|HPTE| DUMP |ELF | + +-----------+----------/ /----------+---+----+-----------+----+ + | | + V V + Used by second /proc/vmcore + kernel to boot + Fig. 2 + +Currently the dump will be copied from /proc/vmcore to a +a new file upon user intervention. The dump data available through +/proc/vmcore will be in ELF format. Hence the existing kdump +infrastructure (kdump scripts) to save the dump works fine with +minor modifications. + +The tools to examine the dump will be same as the ones +used for kdump. + +How to enable firmware-assisted dump (fadump): +------------------------------------- + +1. Set config option CONFIG_FA_DUMP=y and build kernel. +2. Boot into linux kernel with 'fadump=on' kernel cmdline option. +3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline + to specify size of the memory to reserve for boot memory dump + preservation. + +NOTE: If firmware-assisted dump fails to reserve memory then it will + fallback to existing kdump mechanism if 'crashkernel=' option + is set at kernel cmdline. + +Sysfs/debugfs files: +------------ + +Firmware-assisted dump feature uses sysfs file system to hold +the control files and debugfs file to display memory reserved region. + +Here is the list of files under kernel sysfs: + + /sys/kernel/fadump_enabled + + This is used to display the fadump status. + 0 = fadump is disabled + 1 = fadump is enabled + + This interface can be used by kdump init scripts to identify if + fadump is enabled in the kernel and act accordingly. + + /sys/kernel/fadump_registered + + This is used to display the fadump registration status as well + as to control (start/stop) the fadump registration. + 0 = fadump is not registered. + 1 = fadump is registered and ready to handle system crash. + + To register fadump echo 1 > /sys/kernel/fadump_registered and + echo 0 > /sys/kernel/fadump_registered for un-register and stop the + fadump. Once the fadump is un-registered, the system crash will not + be handled and vmcore will not be captured. This interface can be + easily integrated with kdump service start/stop. + + /sys/kernel/fadump_release_mem + + This file is available only when fadump is active during + second kernel. This is used to release the reserved memory + region that are held for saving crash dump. To release the + reserved memory echo 1 to it: + + echo 1 > /sys/kernel/fadump_release_mem + + After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region + file will change to reflect the new memory reservations. + + The existing userspace tools (kdump infrastructure) can be easily + enhanced to use this interface to release the memory reserved for + dump and continue without 2nd reboot. + +Here is the list of files under powerpc debugfs: +(Assuming debugfs is mounted on /sys/kernel/debug directory.) + + /sys/kernel/debug/powerpc/fadump_region + + This file shows the reserved memory regions if fadump is + enabled otherwise this file is empty. The output format + is: + <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> + + e.g. + Contents when fadump is registered during first kernel + + # cat /sys/kernel/debug/powerpc/fadump_region + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 + + Contents when fadump is active during second kernel + + # cat /sys/kernel/debug/powerpc/fadump_region + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 + : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 + +NOTE: Please refer to Documentation/filesystems/debugfs.txt on + how to mount the debugfs filesystem. + + +TODO: +----- + o Need to come up with the better approach to find out more + accurate boot memory size that is required for a kernel to + boot successfully when booted with restricted memory. + o The fadump implementation introduces a fadump crash info structure + in the scratch area before the ELF core header. The idea of introducing + this structure is to pass some important crash info data to the second + kernel which will help second kernel to populate ELF core header with + correct data before it gets exported through /proc/vmcore. The current + design implementation does not address a possibility of introducing + additional fields (in future) to this structure without affecting + compatibility. Need to come up with the better approach to address this. + The possible approaches are: + 1. Introduce version field for version tracking, bump up the version + whenever a new field is added to the structure in future. The version + field can be used to find out what fields are valid for the current + version of the structure. + 2. Reserve the area of predefined size (say PAGE_SIZE) for this + structure and have unused area as reserved (initialized to zero) + for future field additions. + The advantage of approach 1 over 2 is we don't need to reserve extra space. +--- +Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> +This document is based on the original documentation written for phyp +assisted dump by Linas Vepstas and Manish Ahuja. diff --git a/Documentation/powerpc/mpc52xx.txt b/Documentation/powerpc/mpc52xx.txt index 10dd4ab93b8..0d540a31ea1 100644 --- a/Documentation/powerpc/mpc52xx.txt +++ b/Documentation/powerpc/mpc52xx.txt @@ -2,7 +2,7 @@ Linux 2.6.x on MPC52xx family ----------------------------- For the latest info, go to http://www.246tNt.com/mpc52xx/ - + To compile/use : - U-Boot: @@ -10,23 +10,23 @@ To compile/use : if you wish to ). # make lite5200_defconfig # make uImage - + then, on U-boot: => tftpboot 200000 uImage => tftpboot 400000 pRamdisk => bootm 200000 400000 - + - DBug: # <edit Makefile to set ARCH=ppc & CROSS_COMPILE=... ( also EXTRAVERSION if you wish to ). # make lite5200_defconfig # cp your_initrd.gz arch/ppc/boot/images/ramdisk.image.gz - # make zImage.initrd - # make + # make zImage.initrd + # make then in DBug: DBug> dn -i zImage.initrd.lite5200 - + Some remarks : - The port is named mpc52xxx, and config options are PPC_MPC52xx. The MGT5100 diff --git a/Documentation/powerpc/phyp-assisted-dump.txt b/Documentation/powerpc/phyp-assisted-dump.txt deleted file mode 100644 index ad340205d96..00000000000 --- a/Documentation/powerpc/phyp-assisted-dump.txt +++ /dev/null @@ -1,127 +0,0 @@ - - Hypervisor-Assisted Dump - ------------------------ - November 2007 - -The goal of hypervisor-assisted dump is to enable the dump of -a crashed system, and to do so from a fully-reset system, and -to minimize the total elapsed time until the system is back -in production use. - -As compared to kdump or other strategies, hypervisor-assisted -dump offers several strong, practical advantages: - --- Unlike kdump, the system has been reset, and loaded - with a fresh copy of the kernel. In particular, - PCI and I/O devices have been reinitialized and are - in a clean, consistent state. --- As the dump is performed, the dumped memory becomes - immediately available to the system for normal use. --- After the dump is completed, no further reboots are - required; the system will be fully usable, and running - in its normal, production mode on its normal kernel. - -The above can only be accomplished by coordination with, -and assistance from the hypervisor. The procedure is -as follows: - --- When a system crashes, the hypervisor will save - the low 256MB of RAM to a previously registered - save region. It will also save system state, system - registers, and hardware PTE's. - --- After the low 256MB area has been saved, the - hypervisor will reset PCI and other hardware state. - It will *not* clear RAM. It will then launch the - bootloader, as normal. - --- The freshly booted kernel will notice that there - is a new node (ibm,dump-kernel) in the device tree, - indicating that there is crash data available from - a previous boot. It will boot into only 256MB of RAM, - reserving the rest of system memory. - --- Userspace tools will parse /sys/kernel/release_region - and read /proc/vmcore to obtain the contents of memory, - which holds the previous crashed kernel. The userspace - tools may copy this info to disk, or network, nas, san, - iscsi, etc. as desired. - - For Example: the values in /sys/kernel/release-region - would look something like this (address-range pairs). - CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: / - DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A - --- As the userspace tools complete saving a portion of - dump, they echo an offset and size to - /sys/kernel/release_region to release the reserved - memory back to general use. - - An example of this is: - "echo 0x40000000 0x10000000 > /sys/kernel/release_region" - which will release 256MB at the 1GB boundary. - -Please note that the hypervisor-assisted dump feature -is only available on Power6-based systems with recent -firmware versions. - -Implementation details: ----------------------- - -During boot, a check is made to see if firmware supports -this feature on this particular machine. If it does, then -we check to see if a active dump is waiting for us. If yes -then everything but 256 MB of RAM is reserved during early -boot. This area is released once we collect a dump from user -land scripts that are run. If there is dump data, then -the /sys/kernel/release_region file is created, and -the reserved memory is held. - -If there is no waiting dump data, then only the highest -256MB of the ram is reserved as a scratch area. This area -is *not* released: this region will be kept permanently -reserved, so that it can act as a receptacle for a copy -of the low 256MB in the case a crash does occur. See, -however, "open issues" below, as to whether -such a reserved region is really needed. - -Currently the dump will be copied from /proc/vmcore to a -a new file upon user intervention. The starting address -to be read and the range for each data point in provided -in /sys/kernel/release_region. - -The tools to examine the dump will be same as the ones -used for kdump. - -General notes: --------------- -Security: please note that there are potential security issues -with any sort of dump mechanism. In particular, plaintext -(unencrypted) data, and possibly passwords, may be present in -the dump data. Userspace tools must take adequate precautions to -preserve security. - -Open issues/ToDo: ------------- - o The various code paths that tell the hypervisor that a crash - occurred, vs. it simply being a normal reboot, should be - reviewed, and possibly clarified/fixed. - - o Instead of using /sys/kernel, should there be a /sys/dump - instead? There is a dump_subsys being created by the s390 code, - perhaps the pseries code should use a similar layout as well. - - o Is reserving a 256MB region really required? The goal of - reserving a 256MB scratch area is to make sure that no - important crash data is clobbered when the hypervisor - save low mem to the scratch area. But, if one could assure - that nothing important is located in some 256MB area, then - it would not need to be reserved. Something that can be - improved in subsequent versions. - - o Still working the kdump team to integrate this with kdump, - some work remains but this would not affect the current - patches. - - o Still need to write a shell script, to copy the dump away. - Currently I am parsing it manually. diff --git a/Documentation/powerpc/pmu-ebb.txt b/Documentation/powerpc/pmu-ebb.txt new file mode 100644 index 00000000000..73cd163dbfb --- /dev/null +++ b/Documentation/powerpc/pmu-ebb.txt @@ -0,0 +1,137 @@ +PMU Event Based Branches +======================== + +Event Based Branches (EBBs) are a feature which allows the hardware to +branch directly to a specified user space address when certain events occur. + +The full specification is available in Power ISA v2.07: + + https://www.power.org/documentation/power-isa-version-2-07/ + +One type of event for which EBBs can be configured is PMU exceptions. This +document describes the API for configuring the Power PMU to generate EBBs, +using the Linux perf_events API. + + +Terminology +----------- + +Throughout this document we will refer to an "EBB event" or "EBB events". This +just refers to a struct perf_event which has set the "EBB" flag in its +attr.config. All events which can be configured on the hardware PMU are +possible "EBB events". + + +Background +---------- + +When a PMU EBB occurs it is delivered to the currently running process. As such +EBBs can only sensibly be used by programs for self-monitoring. + +It is a feature of the perf_events API that events can be created on other +processes, subject to standard permission checks. This is also true of EBB +events, however unless the target process enables EBBs (via mtspr(BESCR)) no +EBBs will ever be delivered. + +This makes it possible for a process to enable EBBs for itself, but not +actually configure any events. At a later time another process can come along +and attach an EBB event to the process, which will then cause EBBs to be +delivered to the first process. It's not clear if this is actually useful. + + +When the PMU is configured for EBBs, all PMU interrupts are delivered to the +user process. This means once an EBB event is scheduled on the PMU, no non-EBB +events can be configured. This means that EBB events can not be run +concurrently with regular 'perf' commands, or any other perf events. + +It is however safe to run 'perf' commands on a process which is using EBBs. The +kernel will in general schedule the EBB event, and perf will be notified that +its events could not run. + +The exclusion between EBB events and regular events is implemented using the +existing "pinned" and "exclusive" attributes of perf_events. This means EBB +events will be given priority over other events, unless they are also pinned. +If an EBB event and a regular event are both pinned, then whichever is enabled +first will be scheduled and the other will be put in error state. See the +section below titled "Enabling an EBB event" for more information. + + +Creating an EBB event +--------------------- + +To request that an event is counted using EBB, the event code should have bit +63 set. + +EBB events must be created with a particular, and restrictive, set of +attributes - this is so that they interoperate correctly with the rest of the +perf_events subsystem. + +An EBB event must be created with the "pinned" and "exclusive" attributes set. +Note that if you are creating a group of EBB events, only the leader can have +these attributes set. + +An EBB event must NOT set any of the "inherit", "sample_period", "freq" or +"enable_on_exec" attributes. + +An EBB event must be attached to a task. This is specified to perf_event_open() +by passing a pid value, typically 0 indicating the current task. + +All events in a group must agree on whether they want EBB. That is all events +must request EBB, or none may request EBB. + +EBB events must specify the PMC they are to be counted on. This ensures +userspace is able to reliably determine which PMC the event is scheduled on. + + +Enabling an EBB event +--------------------- + +Once an EBB event has been successfully opened, it must be enabled with the +perf_events API. This can be achieved either via the ioctl() interface, or the +prctl() interface. + +However, due to the design of the perf_events API, enabling an event does not +guarantee that it has been scheduled on the PMU. To ensure that the EBB event +has been scheduled on the PMU, you must perform a read() on the event. If the +read() returns EOF, then the event has not been scheduled and EBBs are not +enabled. + +This behaviour occurs because the EBB event is pinned and exclusive. When the +EBB event is enabled it will force all other non-pinned events off the PMU. In +this case the enable will be successful. However if there is already an event +pinned on the PMU then the enable will not be successful. + + +Reading an EBB event +-------------------- + +It is possible to read() from an EBB event. However the results are +meaningless. Because interrupts are being delivered to the user process the +kernel is not able to count the event, and so will return a junk value. + + +Closing an EBB event +-------------------- + +When an EBB event is finished with, you can close it using close() as for any +regular event. If this is the last EBB event the PMU will be deconfigured and +no further PMU EBBs will be delivered. + + +EBB Handler +----------- + +The EBB handler is just regular userspace code, however it must be written in +the style of an interrupt handler. When the handler is entered all registers +are live (possibly) and so must be saved somehow before the handler can invoke +other code. + +It's up to the program how to handle this. For C programs a relatively simple +option is to create an interrupt frame on the stack and save registers there. + +Fork +---- + +EBB events are not inherited across fork. If the child process wishes to use +EBBs it should open a new event for itself. Similarly the EBB state in +BESCR/EBBHR/EBBRR is cleared across fork(). diff --git a/Documentation/powerpc/ptrace.txt b/Documentation/powerpc/ptrace.txt index f4a5499b7bc..99c5ce88d0f 100644 --- a/Documentation/powerpc/ptrace.txt +++ b/Documentation/powerpc/ptrace.txt @@ -40,6 +40,7 @@ features will have bits indicating whether there is support for: #define PPC_DEBUG_FEATURE_INSN_BP_MASK 0x2 #define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x4 #define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x8 +#define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x10 2. PTRACE_SETHWDEBUG @@ -127,6 +128,22 @@ Some examples of using the structure to: p.addr2 = (uint64_t) end_range; p.condition_value = 0; +- set a watchpoint in server processors (BookS) + + p.version = 1; + p.trigger_type = PPC_BREAKPOINT_TRIGGER_RW; + p.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE; + or + p.addr_mode = PPC_BREAKPOINT_MODE_EXACT; + + p.condition_mode = PPC_BREAKPOINT_CONDITION_NONE; + p.addr = (uint64_t) begin_range; + /* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where + * addr2 - addr <= 8 Bytes. + */ + p.addr2 = (uint64_t) end_range; + p.condition_value = 0; + 3. PTRACE_DELHWDEBUG Takes an integer which identifies an existing breakpoint or watchpoint diff --git a/Documentation/powerpc/sound.txt b/Documentation/powerpc/sound.txt deleted file mode 100644 index df23d95e03a..00000000000 --- a/Documentation/powerpc/sound.txt +++ /dev/null @@ -1,81 +0,0 @@ - Information about PowerPC Sound support -===================================================================== - -Please mail me (Cort Dougan, cort@fsmlabs.com) if you have questions, -comments or corrections. - -Last Change: 6.16.99 - -This just covers sound on the PReP and CHRP systems for now and later -will contain information on the PowerMac's. - -Sound on PReP has been tested and is working with the PowerStack and IBM -Power Series onboard sound systems which are based on the cs4231(2) chip. -The sound options when doing the make config are a bit different from -the default, though. - -The I/O base, irq and dma lines that you enter during the make config -are ignored and are set when booting according to the machine type. -This is so that one binary can be used for Motorola and IBM machines -which use different values and isn't allowed by the driver, so things -are hacked together in such a way as to allow this information to be -set automatically on boot. - -1. Motorola PowerStack PReP machines - - Enable support for "Crystal CS4232 based (PnP) cards" and for the - Microsoft Sound System. The MSS isn't used, but some of the routines - that the CS4232 driver uses are in it. - - Although the options you set are ignored and determined automatically - on boot these are included for information only: - - (830) CS4232 audio I/O base 530, 604, E80 or F40 - (10) CS4232 audio IRQ 5, 7, 9, 11, 12 or 15 - (6) CS4232 audio DMA 0, 1 or 3 - (7) CS4232 second (duplex) DMA 0, 1 or 3 - - This will allow simultaneous record and playback, as 2 different dma - channels are used. - - The sound will be all left channel and very low volume since the - auxiliary input isn't muted by default. I had the changes necessary - for this in the kernel but the sound driver maintainer didn't want - to include them since it wasn't common in other machines. To fix this - you need to mute it using a mixer utility of some sort (if you find one - please let me know) or by patching the driver yourself and recompiling. - - There is a problem on the PowerStack 2's (PowerStack Pro's) using a - different irq/drq than the kernel expects. Unfortunately, I don't know - which irq/drq it is so if anyone knows please email me. - - Midi is not supported since the cs4232 driver doesn't support midi yet. - -2. IBM PowerPersonal PReP machines - - I've only tested sound on the Power Personal Series of IBM workstations - so if you try it on others please let me know the result. I'm especially - interested in the 43p's sound system, which I know nothing about. - - Enable support for "Crystal CS4232 based (PnP) cards" and for the - Microsoft Sound System. The MSS isn't used, but some of the routines - that the CS4232 driver uses are in it. - - Although the options you set are ignored and determined automatically - on boot these are included for information only: - - (530) CS4232 audio I/O base 530, 604, E80 or F40 - (5) CS4232 audio IRQ 5, 7, 9, 11, 12 or 15 - (1) CS4232 audio DMA 0, 1 or 3 - (7) CS4232 second (duplex) DMA 0, 1 or 3 - (330) CS4232 MIDI I/O base 330, 370, 3B0 or 3F0 - (9) CS4232 MIDI IRQ 5, 7, 9, 11, 12 or 15 - - This setup does _NOT_ allow for recording yet. - - Midi is not supported since the cs4232 driver doesn't support midi yet. - -2. IBM CHRP - - I have only tested this on the 43P-150. Build the kernel with the cs4232 - set as a module and load the module with irq=9 dma=1 dma2=2 io=0x550 diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.txt new file mode 100644 index 00000000000..9791e98ab49 --- /dev/null +++ b/Documentation/powerpc/transactional_memory.txt @@ -0,0 +1,198 @@ +Transactional Memory support +============================ + +POWER kernel support for this feature is currently limited to supporting +its use by user programs. It is not currently used by the kernel itself. + +This file aims to sum up how it is supported by Linux and what behaviour you +can expect from your user programs. + + +Basic overview +============== + +Hardware Transactional Memory is supported on POWER8 processors, and is a +feature that enables a different form of atomic memory access. Several new +instructions are presented to delimit transactions; transactions are +guaranteed to either complete atomically or roll back and undo any partial +changes. + +A simple transaction looks like this: + +begin_move_money: + tbegin + beq abort_handler + + ld r4, SAVINGS_ACCT(r3) + ld r5, CURRENT_ACCT(r3) + subi r5, r5, 1 + addi r4, r4, 1 + std r4, SAVINGS_ACCT(r3) + std r5, CURRENT_ACCT(r3) + + tend + + b continue + +abort_handler: + ... test for odd failures ... + + /* Retry the transaction if it failed because it conflicted with + * someone else: */ + b begin_move_money + + +The 'tbegin' instruction denotes the start point, and 'tend' the end point. +Between these points the processor is in 'Transactional' state; any memory +references will complete in one go if there are no conflicts with other +transactional or non-transactional accesses within the system. In this +example, the transaction completes as though it were normal straight-line code +IF no other processor has touched SAVINGS_ACCT(r3) or CURRENT_ACCT(r3); an +atomic move of money from the current account to the savings account has been +performed. Even though the normal ld/std instructions are used (note no +lwarx/stwcx), either *both* SAVINGS_ACCT(r3) and CURRENT_ACCT(r3) will be +updated, or neither will be updated. + +If, in the meantime, there is a conflict with the locations accessed by the +transaction, the transaction will be aborted by the CPU. Register and memory +state will roll back to that at the 'tbegin', and control will continue from +'tbegin+4'. The branch to abort_handler will be taken this second time; the +abort handler can check the cause of the failure, and retry. + +Checkpointed registers include all GPRs, FPRs, VRs/VSRs, LR, CCR/CR, CTR, FPCSR +and a few other status/flag regs; see the ISA for details. + +Causes of transaction aborts +============================ + +- Conflicts with cache lines used by other processors +- Signals +- Context switches +- See the ISA for full documentation of everything that will abort transactions. + + +Syscalls +======== + +Performing syscalls from within transaction is not recommended, and can lead +to unpredictable results. + +Syscalls do not by design abort transactions, but beware: The kernel code will +not be running in transactional state. The effect of syscalls will always +remain visible, but depending on the call they may abort your transaction as a +side-effect, read soon-to-be-aborted transactional data that should not remain +invisible, etc. If you constantly retry a transaction that constantly aborts +itself by calling a syscall, you'll have a livelock & make no progress. + +Simple syscalls (e.g. sigprocmask()) "could" be OK. Even things like write() +from, say, printf() should be OK as long as the kernel does not access any +memory that was accessed transactionally. + +Consider any syscalls that happen to work as debug-only -- not recommended for +production use. Best to queue them up till after the transaction is over. + + +Signals +======= + +Delivery of signals (both sync and async) during transactions provides a second +thread state (ucontext/mcontext) to represent the second transactional register +state. Signal delivery 'treclaim's to capture both register states, so signals +abort transactions. The usual ucontext_t passed to the signal handler +represents the checkpointed/original register state; the signal appears to have +arisen at 'tbegin+4'. + +If the sighandler ucontext has uc_link set, a second ucontext has been +delivered. For future compatibility the MSR.TS field should be checked to +determine the transactional state -- if so, the second ucontext in uc->uc_link +represents the active transactional registers at the point of the signal. + +For 64-bit processes, uc->uc_mcontext.regs->msr is a full 64-bit MSR and its TS +field shows the transactional mode. + +For 32-bit processes, the mcontext's MSR register is only 32 bits; the top 32 +bits are stored in the MSR of the second ucontext, i.e. in +uc->uc_link->uc_mcontext.regs->msr. The top word contains the transactional +state TS. + +However, basic signal handlers don't need to be aware of transactions +and simply returning from the handler will deal with things correctly: + +Transaction-aware signal handlers can read the transactional register state +from the second ucontext. This will be necessary for crash handlers to +determine, for example, the address of the instruction causing the SIGSEGV. + +Example signal handler: + + void crash_handler(int sig, siginfo_t *si, void *uc) + { + ucontext_t *ucp = uc; + ucontext_t *transactional_ucp = ucp->uc_link; + + if (ucp_link) { + u64 msr = ucp->uc_mcontext.regs->msr; + /* May have transactional ucontext! */ +#ifndef __powerpc64__ + msr |= ((u64)transactional_ucp->uc_mcontext.regs->msr) << 32; +#endif + if (MSR_TM_ACTIVE(msr)) { + /* Yes, we crashed during a transaction. Oops. */ + fprintf(stderr, "Transaction to be restarted at 0x%llx, but " + "crashy instruction was at 0x%llx\n", + ucp->uc_mcontext.regs->nip, + transactional_ucp->uc_mcontext.regs->nip); + } + } + + fix_the_problem(ucp->dar); + } + +When in an active transaction that takes a signal, we need to be careful with +the stack. It's possible that the stack has moved back up after the tbegin. +The obvious case here is when the tbegin is called inside a function that +returns before a tend. In this case, the stack is part of the checkpointed +transactional memory state. If we write over this non transactionally or in +suspend, we are in trouble because if we get a tm abort, the program counter and +stack pointer will be back at the tbegin but our in memory stack won't be valid +anymore. + +To avoid this, when taking a signal in an active transaction, we need to use +the stack pointer from the checkpointed state, rather than the speculated +state. This ensures that the signal context (written tm suspended) will be +written below the stack required for the rollback. The transaction is aborted +because of the treclaim, so any memory written between the tbegin and the +signal will be rolled back anyway. + +For signals taken in non-TM or suspended mode, we use the +normal/non-checkpointed stack pointer. + + +Failure cause codes used by kernel +================================== + +These are defined in <asm/reg.h>, and distinguish different reasons why the +kernel aborted a transaction: + + TM_CAUSE_RESCHED Thread was rescheduled. + TM_CAUSE_TLBI Software TLB invalide. + TM_CAUSE_FAC_UNAV FP/VEC/VSX unavailable trap. + TM_CAUSE_SYSCALL Currently unused; future syscalls that must abort + transactions for consistency will use this. + TM_CAUSE_SIGNAL Signal delivered. + TM_CAUSE_MISC Currently unused. + TM_CAUSE_ALIGNMENT Alignment fault. + TM_CAUSE_EMULATE Emulation that touched memory. + +These can be checked by the user program's abort handler as TEXASR[0:7]. If +bit 7 is set, it indicates that the error is consider persistent. For example +a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.q + +GDB +=== + +GDB and ptrace are not currently TM-aware. If one stops during a transaction, +it looks like the transaction has just started (the checkpointed state is +presented). The transaction cannot then be continued and will take the failure +handler route. Furthermore, the transactional 2nd register state will be +inaccessible. GDB can currently be used on programs using TM, but not sensibly +in parts within transactions. diff --git a/Documentation/powerpc/zImage_layout.txt b/Documentation/powerpc/zImage_layout.txt deleted file mode 100644 index 048e0150f57..00000000000 --- a/Documentation/powerpc/zImage_layout.txt +++ /dev/null @@ -1,47 +0,0 @@ - Information about the Linux/PPC kernel images -===================================================================== - -Please mail me (Cort Dougan, cort@fsmlabs.com) if you have questions, -comments or corrections. - -This document is meant to answer several questions I've had about how -the PReP system boots and how Linux/PPC interacts with that mechanism. -It would be nice if we could have information on how other architectures -boot here as well. If you have anything to contribute, please -let me know. - - -1. PReP boot file - - This is the file necessary to boot PReP systems from floppy or - hard drive. The firmware reads the PReP partition table entry - and will load the image accordingly. - - To boot the zImage, copy it onto a floppy with dd if=zImage of=/dev/fd0h1440 - or onto a PReP hard drive partition with dd if=zImage of=/dev/sda4 - assuming you've created a PReP partition (type 0x41) with fdisk on - /dev/sda4. - - The layout of the image format is: - - 0x0 +------------+ - | | PReP partition table entry - | | - 0x400 +------------+ - | | Bootstrap program code + data - | | - | | - +------------+ - | | compressed kernel, elf header removed - +------------+ - | | initrd (if loaded) - +------------+ - | | Elf section table for bootstrap program - +------------+ - - -2. MBX boot file - - The MBX boards can load an elf image, and relocate it to the - proper location in memory - it copies the image to the location it was - linked at. |
