From e2c3a49c8029ebd9ef530101cc24c66562e3dff5 Mon Sep 17 00:00:00 2001 From: mike-m Date: Fri, 7 May 2010 00:28:04 +0000 Subject: Revert r103213. It broke several sections of live website. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@103219 91177308-0d34-0410-b5e6-96231b3b80d8 --- docs/BitCodeFormat.html | 1163 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1163 insertions(+) create mode 100644 docs/BitCodeFormat.html (limited to 'docs/BitCodeFormat.html') diff --git a/docs/BitCodeFormat.html b/docs/BitCodeFormat.html new file mode 100644 index 0000000000..f1ddefdea9 --- /dev/null +++ b/docs/BitCodeFormat.html @@ -0,0 +1,1163 @@ + + + + + LLVM Bitcode File Format + + + +
LLVM Bitcode File Format
+
  1. Abstract
  2. Overview
  3. Bitstream Format
    1. Magic Numbers
    2. Primitives
    3. Abbreviation IDs
    4. Blocks
    5. Data Records
    6. Abbreviations
    7. Standard Blocks
  4. Bitcode Wrapper Format
  5. LLVM IR Encoding
    1. Basics
    2. MODULE_BLOCK Contents
    3. PARAMATTR_BLOCK Contents
    4. TYPE_BLOCK Contents
    5. CONSTANTS_BLOCK Contents
    6. FUNCTION_BLOCK Contents
    7. TYPE_SYMTAB_BLOCK Contents
    8. VALUE_SYMTAB_BLOCK Contents
    9. METADATA_BLOCK Contents
    10. METADATA_ATTACHMENT Contents
+
+

Written by Chris Lattner and Joshua Haberman.

+
+ + +
Abstract
+ + +
+ +

This document describes the LLVM bitstream file format and the encoding of +the LLVM IR into it.

+ +
+ + +
Overview
+ + +
+ +

+What is commonly known as the LLVM bitcode file format (also, sometimes +anachronistically known as bytecode) is actually two things: a bitstream container format +and an encoding of LLVM IR into the container format.

+ +

+The bitstream format is an abstract encoding of structured data, very +similar to XML in some ways. Like XML, bitstream files contain tags, and nested +structures, and you can parse the file without having to understand the tags. +Unlike XML, the bitstream format is a binary encoding, and unlike XML it +provides a mechanism for the file to self-describe "abbreviations", which are +effectively size optimizations for the content.

+ +

LLVM IR files may be optionally embedded into a wrapper structure that makes it easy to embed extra data +along with LLVM IR files.

+ +

This document first describes the LLVM bitstream format, then the wrapper format, and finally the record structure used by LLVM IR files.

+ +
+ + +
Bitstream Format
+ + +
+ +

+The bitstream format is literally a stream of bits, with a very simple +structure. This structure consists of the following concepts: +

+ + + +

Note that the llvm-bcanalyzer tool can be +used to dump and inspect arbitrary bitstreams, which is very useful for +understanding the encoding.

+ +
+ + +
Magic Numbers +
+ +
+ +

The first two bytes of a bitcode file are 'BC' (0x42, 0x43). +The second two bytes are an application-specific magic number. Generic +bitcode tools can look at only the first two bytes to verify the file is +bitcode, while application-specific programs will want to look at all four.
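The two-level check described above can be sketched in a few lines of Python. This is only an illustration: the function names looks_like_bitcode and application_magic are hypothetical and do not come from any LLVM tool.

```python
def looks_like_bitcode(data: bytes) -> bool:
    """Generic check: only the first two bytes, 'B' (0x42) and 'C' (0x43)."""
    return len(data) >= 2 and data[0] == 0x42 and data[1] == 0x43


def application_magic(data: bytes) -> bytes:
    """Return bytes 2..3, the application-specific half of the magic number."""
    if len(data) < 4 or not looks_like_bitcode(data):
        raise ValueError("not a bitcode stream")
    return data[2:4]
```

A generic tool would stop after looks_like_bitcode, while an application-specific reader would go on to compare application_magic against the value it expects.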

+ +
+ + +
Primitives +
+ +
+ +

+A bitstream literally consists of a stream of bits, which are read in order +starting with the least significant bit of each byte. The stream is made up of a +number of primitive values that encode a stream of unsigned integer values. +These integers are encoded in two ways: either as Fixed +Width Integers or as Variable Width +Integers. +

+ +
+ + +
Fixed Width Integers +
+ +
+ +

Fixed-width integer values have their low bits emitted directly to the file. For example, a 3-bit integer value encodes 1 as 001. Fixed-width integers are used when there is a well-known number of options for a field. For example, boolean values are usually encoded with a 1-bit wide integer.
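To make the bit order concrete, here is a minimal reader sketch. It assumes that bits are consumed from the least significant bit of each byte, as stated under Primitives, and that the first bit read becomes the least significant bit of the value; the second point is an assumption of this sketch rather than something the text spells out. The class name BitReader is hypothetical.

```python
class BitReader:
    """Reads fixed-width unsigned integers from a byte buffer, LSB first."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position, measured in bits

    def read_fixed(self, width: int) -> int:
        value = 0
        for i in range(width):
            byte = self.data[self.pos >> 3]        # byte holding the next bit
            bit = (byte >> (self.pos & 7)) & 1     # least significant bit first
            value |= bit << i                      # first bit read is bit 0
            self.pos += 1
        return value
```

With the single byte 0b00001001, read_fixed(3) yields 1 and a following read_fixed(1) yields 1, which leaves the reader at bit position 4.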

+ +
+ + +
Variable Width +Integers
+ +
+ +

Variable-width integer (VBR) values encode values of arbitrary size, +optimizing for the case where the values are small. Given a 4-bit VBR field, +any 3-bit value (0 through 7) is encoded directly, with the high bit set to +zero. Values larger than N-1 bits emit their bits in a series of N-1 bit +chunks, where all but the last set the high bit.

+ +

For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a vbr4 value. The first set of four bits indicates the value 3 (011) with a continuation piece (indicated by a high bit of 1). The next set of four bits indicates a value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value 27.
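The chunking rule can be sketched as follows. The helpers vbr_chunks and vbr_value are hypothetical names; each chunk is returned as an integer of the given width, in the order in which it would be emitted.

```python
from typing import List


def vbr_chunks(value: int, width: int) -> List[int]:
    """Split an unsigned value into VBR chunks of `width` bits each."""
    if value < 0 or width < 2:
        raise ValueError("VBR needs an unsigned value and a width of at least 2")
    data_bits = width - 1          # payload bits per chunk
    high = 1 << data_bits          # continuation flag (the high bit)
    chunks = []
    while value >= high:
        chunks.append((value & (high - 1)) | high)  # low bits plus continuation
        value >>= data_bits
    chunks.append(value)           # last chunk has the high bit clear
    return chunks


def vbr_value(chunks: List[int], width: int) -> int:
    """Reassemble the value from chunks produced by vbr_chunks."""
    data_bits = width - 1
    mask = (1 << data_bits) - 1
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & mask) << (i * data_bits)
    return value
```

For the example in the text, vbr_chunks(27, 4) gives the two chunks 0b1011 and 0b0011.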

+ +
+ + +
6-bit characters
+ +
+ +

6-bit characters encode common characters into a fixed 6-bit field. They +represent the following characters with the following 6-bit values:

+ +
+
+'a' .. 'z' —  0 .. 25
+'A' .. 'Z' — 26 .. 51
+'0' .. '9' — 52 .. 61
+       '.' — 62
+       '_' — 63
+
+
+ +

This encoding is only suitable for encoding characters and strings that +consist only of the above characters. It is completely incapable of encoding +characters not in the set.
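The table above maps directly onto a 64-character alphabet. A small sketch, with hypothetical helper names char6_encode and char6_decode:

```python
import string
from typing import List

# 'a'..'z' -> 0..25, 'A'..'Z' -> 26..51, '0'..'9' -> 52..61, '.' -> 62, '_' -> 63
CHAR6_ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"


def char6_encode(text: str) -> List[int]:
    """Return the 6-bit code for each character, or raise if one is not encodable."""
    codes = []
    for ch in text:
        index = CHAR6_ALPHABET.find(ch)
        if index < 0:
            raise ValueError(f"{ch!r} cannot be represented as a 6-bit character")
        codes.append(index)
    return codes


def char6_decode(codes: List[int]) -> str:
    """Inverse of char6_encode."""
    return "".join(CHAR6_ALPHABET[code] for code in codes)
```

A writer could use the ValueError as the signal to fall back to a more general encoding for that string.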

+ +
+ + +
Word Alignment
+ +
+ +

Occasionally, it is useful to emit zero bits until the bitstream is a +multiple of 32 bits. This ensures that the bit position in the stream can be +represented as a multiple of 32-bit words.
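The number of zero bits needed is simple modular arithmetic. A one-line sketch, where padding_to_word is a hypothetical helper name:

```python
def padding_to_word(bit_pos: int) -> int:
    """Number of zero bits to emit so that bit_pos becomes a multiple of 32."""
    return -bit_pos % 32
```

For example, a stream that is currently 37 bits long needs 27 zero bits to reach the next 32-bit boundary, and a stream that is already aligned needs none.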

+ +
+ + + +
Abbreviation IDs +
+ +
+ +

+A bitstream is a sequential series of Blocks and Data Records. Both of these start with an abbreviation ID encoded as a fixed-bitwidth field. The width is specified by the current block, as described below. The value of the abbreviation ID specifies either a builtin ID (which has a special meaning, defined below) or one of the abbreviation IDs defined for the current block by the stream itself.

+ +

+The set of builtin abbrev IDs is: +

+ + + +

Abbreviation IDs 4 and above are defined by the stream itself, and specify +an abbreviated record encoding.

+ +
+ + +
Blocks +
+ +
+ +

+Blocks in a bitstream denote nested regions of the stream, and are identified by a content-specific id number (for example, LLVM IR uses an ID of 12 to represent function bodies). Block IDs 0-7 are reserved for standard blocks whose meaning is defined by Bitcode; block IDs 8 and greater are application specific. Nested blocks capture the hierarchical structure of the data encoded in them, and various properties are associated with blocks as the file is parsed. Because each block records its own length, a reader can skip a block in constant time, whether it only wants a summary of the blocks present or it has reached data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.

+ +

+When reading and encoding the stream, several properties are maintained for the +block. In particular, each block maintains: +

+ +
    +
  1. A current abbrev id width. This value starts at 2 at the beginning of the stream, and is set every time a block record is entered. The block entry specifies the abbrev id width for the body of the block.
  2. A set of abbreviations. Abbreviations may be defined within a block, in which case they are only defined in that block (neither subblocks nor enclosing blocks see the abbreviation). Abbreviations can also be defined inside a BLOCKINFO block, in which case they are defined in all blocks that match the ID that the BLOCKINFO block is describing.
+ +

+As sub-blocks are entered, these properties are saved and the new sub-block has its own set of abbreviations and its own abbrev id width. When a sub-block is popped, the saved values are restored.
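The save and restore behaviour is naturally a stack. The sketch below is illustrative only: BlockScope is a hypothetical class, abbreviations are represented as plain strings, and seeding a new block with abbreviations from a BLOCKINFO block is left out.

```python
from typing import List, Tuple


class BlockScope:
    """Tracks the two per-block properties: abbrev id width and abbreviations."""

    def __init__(self) -> None:
        self.abbrev_width = 2            # width at the beginning of the stream
        self.abbrevs: List[str] = []     # abbreviations visible in this block
        self._saved: List[Tuple[int, List[str]]] = []

    def enter_block(self, new_abbrev_width: int) -> None:
        # Save the enclosing block's state, then start with a fresh set.
        self._saved.append((self.abbrev_width, self.abbrevs))
        self.abbrev_width = new_abbrev_width
        self.abbrevs = []

    def exit_block(self) -> None:
        # Restore whatever the enclosing block had when the sub-block was entered.
        self.abbrev_width, self.abbrevs = self._saved.pop()
```

Because the new block starts from an empty list, abbreviations defined in an enclosing block are not visible inside it, and anything defined inside it disappears on exit.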

+ +
+ + +
ENTER_SUBBLOCK +Encoding
+ +
+ +

[ENTER_SUBBLOCK, blockid:vbr8, newabbrevlen:vbr4, <align32bits>, blocklen:32]

+ +

+The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block +record. The blockid value is encoded as an 8-bit VBR identifier, and +indicates the type of block being entered, which can be +a standard block or an application-specific block. +The newabbrevlen value is a 4-bit VBR, which specifies the abbrev id +width for the sub-block. The blocklen value is a 32-bit aligned value +that specifies the size of the subblock in 32-bit words. This value allows the +reader to skip over the entire block in one jump. +
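The layout can be sketched as a list of bits in emission order. Several things here are assumptions of the sketch, not statements from the text above: the numeric value 1 for the ENTER_SUBBLOCK builtin ID (the list of builtin IDs is not reproduced in this text), the convention that the first emitted bit is the least significant bit of each field, and the reading that blocklen counts words starting immediately after the blocklen field. All function names are hypothetical.

```python
from typing import List

ENTER_SUBBLOCK = 1  # assumed builtin abbreviation ID


def fixed_bits(value: int, width: int) -> List[int]:
    """Bits of a fixed-width field, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def vbr_bits(value: int, width: int) -> List[int]:
    """Bits of a VBR field: chunks of `width` bits, high bit marks continuation."""
    data_bits = width - 1
    high = 1 << data_bits
    bits: List[int] = []
    while value >= high:
        bits += fixed_bits((value & (high - 1)) | high, width)
        value >>= data_bits
    return bits + fixed_bits(value, width)


def enter_subblock_bits(start_pos: int, abbrev_width: int, block_id: int,
                        new_abbrev_len: int, block_len_words: int) -> List[int]:
    """Bits for one ENTER_SUBBLOCK header that begins at bit `start_pos`."""
    bits = fixed_bits(ENTER_SUBBLOCK, abbrev_width)
    bits += vbr_bits(block_id, 8)
    bits += vbr_bits(new_abbrev_len, 4)
    bits += [0] * (-(start_pos + len(bits)) % 32)   # <align32bits>
    bits += fixed_bits(block_len_words, 32)
    return bits


def block_end_bit(pos_after_blocklen: int, block_len_words: int) -> int:
    """Bit position just past the block, assuming blocklen counts from here."""
    return pos_after_blocklen + 32 * block_len_words
```

For a block with ID 8 entered at bit 32 with an abbrev id width of 2, the header occupies 64 bits: 14 bits of fields, 18 bits of padding, then the 32-bit length. A reader that does not care about the block can jump straight to block_end_bit.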

+ +
+ + +
END_BLOCK +Encoding
+ +
+ +

[END_BLOCK, <align32bits>]

+ +

+The END_BLOCK abbreviation ID specifies the end of the current block record. Its end is aligned to 32 bits to ensure that the size of the block is a whole multiple of 32 bits.

+ +
+ + + + +
Data Records +
+ +
+

+Data records consist of a record code and a number of (up to) 64-bit +integer values. The interpretation of the code and values is +application specific and may vary between different block types. +Records can be encoded either using an unabbrev record, or with an +abbreviation. In the LLVM IR format, for example, there is a record +which encodes the target triple of a module. The code is +MODULE_CODE_TRIPLE, and the values of the record are the +ASCII codes for the characters in the string. +

+ +
+ + +
UNABBREV_RECORD +Encoding
+ +
+ +

[UNABBREV_RECORD, code:vbr6, numops:vbr6, op0:vbr6, op1:vbr6, ...]

+ +

+An UNABBREV_RECORD provides a default fallback encoding, which is both +completely general and extremely inefficient. It can describe an arbitrary +record by emitting the code and operands as VBRs. +

+ +

+For example, emitting an LLVM IR target triple as an unabbreviated record +requires emitting the UNABBREV_RECORD abbrevid, a vbr6 for the +MODULE_CODE_TRIPLE code, a vbr6 for the length of the string, which is +equal to the number of operands, and a vbr6 for each character. Because there +are no letters with values less than 32, each letter would need to be emitted as +at least a two-part VBR, which means that each letter would require at least 12 +bits. This is not an efficient encoding, but it is fully general. +
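The cost argument can be checked with a little arithmetic. The sketch below counts bits only and emits nothing; the helper names are hypothetical.

```python
from typing import List


def vbr_bit_count(value: int, width: int) -> int:
    """Number of bits needed to emit an unsigned value as a VBR of this width."""
    data_bits = width - 1
    chunks = 1
    while value >= (1 << data_bits):
        value >>= data_bits
        chunks += 1
    return chunks * width


def unabbrev_record_bits(abbrev_width: int, code: int, operands: List[int]) -> int:
    """Total size of [UNABBREV_RECORD, code, numops, op0, op1, ...] in bits."""
    total = abbrev_width                          # the UNABBREV_RECORD abbrev id
    total += vbr_bit_count(code, 6)               # record code
    total += vbr_bit_count(len(operands), 6)      # number of operands
    total += sum(vbr_bit_count(op, 6) for op in operands)
    return total
```

A vbr6 chunk carries five payload bits, so any value of 32 or more needs at least two chunks. With an abbrev id width of 3 and the TRIPLE record code 2, the four-character string "abcd" costs 3 + 6 + 6 + 4 * 12 = 63 bits.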

+ +
+ + +
Abbreviated Record +Encoding
+ +
+ +

[<abbrevid>, fields...]

+ +

+An abbreviated record is an abbreviation ID followed by a set of fields that are encoded according to the abbreviation definition. This allows records to be encoded significantly more densely than records encoded with the UNABBREV_RECORD type, and allows the abbreviation types to be specified in the stream itself, which makes the files completely self-describing. The actual encoding of abbreviations is defined below.

+ +

The record code, which is the first field of an abbreviated record, +may be encoded in the abbreviation definition (as a literal +operand) or supplied in the abbreviated record (as a Fixed or VBR +operand value).

+ +
+ + +
Abbreviations +
+ +
+

+Abbreviations are an important form of compression for bitstreams. The idea is +to specify a dense encoding for a class of records once, then use that encoding +to emit many records. It takes space to emit the encoding into the file, but +the space is recouped (hopefully plus some) when the records that use it are +emitted. +

+ +

+Abbreviations can be determined dynamically per client, per file. Because the abbreviations are stored in the bitstream itself, different streams of the same format can contain different sets of abbreviations according to the needs of the specific stream. As a concrete example, LLVM IR files usually emit an abbreviation for binary operators. If a specific LLVM module contains few or no binary operators, the abbreviation does not need to be emitted.

+
+ + +
DEFINE_ABBREV + Encoding
+ +
+ +

[DEFINE_ABBREV, numabbrevops:vbr5, abbrevop0, abbrevop1, ...]

+ +

+A DEFINE_ABBREV record adds an abbreviation to the list of currently +defined abbreviations in the scope of this block. This definition only exists +inside this immediate block — it is not visible in subblocks or enclosing +blocks. Abbreviations are implicitly assigned IDs sequentially starting from 4 +(the first application-defined abbreviation ID). Any abbreviations defined in a +BLOCKINFO record for the particular block type +receive IDs first, in order, followed by any +abbreviations defined within the block itself. Abbreviated data records +reference this ID to indicate what abbreviation they are invoking. +
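The ID assignment rule can be sketched as follows. The function name and the use of strings to stand in for abbreviation definitions are illustrative only.

```python
from typing import Dict, List

FIRST_APPLICATION_ABBREV_ID = 4  # IDs below this are builtin


def assign_abbrev_ids(blockinfo_abbrevs: List[str],
                      local_abbrevs: List[str]) -> Dict[int, str]:
    """Map abbreviation IDs to definitions for one block.

    BLOCKINFO-provided abbreviations for this block type come first, in order,
    followed by abbreviations defined inside the block itself.
    """
    ordered = list(blockinfo_abbrevs) + list(local_abbrevs)
    return {FIRST_APPLICATION_ABBREV_ID + i: abbrev
            for i, abbrev in enumerate(ordered)}
```

With two abbreviations from BLOCKINFO and one defined locally, the local one receives ID 6.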

+ +

+An abbreviation definition consists of the DEFINE_ABBREV abbrevid +followed by a VBR that specifies the number of abbrev operands, then the abbrev +operands themselves. Abbreviation operands come in three forms. They all start +with a single bit that indicates whether the abbrev operand is a literal operand +(when the bit is 1) or an encoding operand (when the bit is 0). +

+ +
    +
  1. Literal operands: [1:1, litvalue:vbr8]. Literal operands specify that the value in the result is always a single specific value. This specific value is emitted as a vbr8 after the bit indicating that it is a literal operand.
  2. Encoding info without data: [0:1, encoding:3]. Operand encodings that do not have extra data are just emitted as their code.
  3. Encoding info with data: [0:1, encoding:3, value:vbr5]. Operand encodings that do have extra data are emitted as their code, followed by the extra data.
+ +

The possible operand encodings are:

+ + + +

+For example, target triples in LLVM modules are encoded as a record of the +form [TRIPLE, 'a', 'b', 'c', 'd']. Consider if the bitstream emitted +the following abbrev entry: +

+ +
+
+[0, Fixed, 4]
+[0, Array]
+[0, Char6]
+
+
+ +

+When emitting a record with this abbreviation, the above entry would be emitted +as: +

+ +
+

+[4:abbrevwidth, 2:4, 4:vbr6, 0:6, 1:6, 2:6, 3:6]

+
+ +

These values are:

+ +
    +
  1. The first value, 4, is the abbreviation ID for this abbreviation.
  2. The second value, 2, is the record code for TRIPLE records within LLVM IR file MODULE_BLOCK blocks.
  3. The third value, 4, is the length of the array.
  4. The rest of the values are the char6 encoded values for "abcd".
+ +

+With this abbreviation, the triple is emitted with only 37 bits (assuming an abbrev id width of 3). Without the abbreviation, significantly more space would be required to emit the target triple. Also, because the TRIPLE value is not emitted as a literal in the abbreviation, the abbreviation can also be used for any other string value.
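The 37-bit figure can be reproduced by adding up the field widths of the example abbreviation: a Fixed(4) record code, a vbr6 array length, and one Char6 element per character. The function name below is hypothetical, and the sketch only counts bits.

```python
def abbreviated_string_record_bits(text: str, abbrev_width: int = 3) -> int:
    """Size in bits of a record emitted with [Fixed(4), Array, Char6]."""
    # The array length is a vbr6: five payload bits per 6-bit chunk.
    length, length_chunks = len(text), 1
    while length >= 32:
        length >>= 5
        length_chunks += 1

    bits = abbrev_width          # abbreviation ID
    bits += 4                    # record code as a 4-bit fixed field
    bits += 6 * length_chunks    # array length as vbr6
    bits += 6 * len(text)        # one Char6 per character
    return bits
```

For "abcd" this gives 3 + 4 + 6 + 24 = 37 bits, which matches the figure in the text.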

+ +
+ + +
Standard Blocks +
+ +
+ +

+In addition to the basic block structure and record encodings, the bitstream +also defines specific built-in block types. These block types specify how the +stream is to be decoded or other metadata. In the future, new standard blocks +may be added. Block IDs 0-7 are reserved for standard blocks. +

+ +
+ + +
#0 - BLOCKINFO +Block
+ +
+ +

+The BLOCKINFO block allows the description of metadata for other +blocks. The currently specified records are: +

+ +
+
+[SETBID (#1), blockid]
+[DEFINE_ABBREV, ...]
+[BLOCKNAME, ...name...]
+[SETRECORDNAME, RecordID, ...name...]
+
+
+ +

+The SETBID record (code 1) indicates which block ID is being +described. SETBID records can occur multiple times throughout the +block to change which block ID is being described. There must be +a SETBID record prior to any other records. +

+ +

+Standard DEFINE_ABBREV records can occur inside BLOCKINFO +blocks, but unlike their occurrence in normal blocks, the abbreviation is +defined for blocks matching the block ID we are describing, not the +BLOCKINFO block itself. The abbreviations defined +in BLOCKINFO blocks receive abbreviation IDs as described +in DEFINE_ABBREV. +
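The way SETBID steers later DEFINE_ABBREV records can be sketched as a small loop. Records are modelled here as (kind, payload) tuples purely for illustration; this is not how a real reader represents them, and collect_blockinfo is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def collect_blockinfo(records: List[Tuple[str, object]]) -> Dict[int, List[object]]:
    """Group abbreviations found in a BLOCKINFO block by the block ID they describe."""
    current_block_id: Optional[int] = None
    table: Dict[int, List[object]] = {}
    for kind, payload in records:
        if kind == "SETBID":
            # Switch which block ID the following records describe.
            current_block_id = int(payload)
            table.setdefault(current_block_id, [])
        elif kind == "DEFINE_ABBREV":
            if current_block_id is None:
                raise ValueError("DEFINE_ABBREV seen before any SETBID record")
            table[current_block_id].append(payload)
    return table
```

Since SETBID may appear several times, abbreviations for the same block ID accumulate in the order they are encountered.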

+ +

The BLOCKNAME record (code 2) can optionally occur in this block. The elements of +the record are the bytes of the string name of the block. llvm-bcanalyzer can use +this to dump out bitcode files symbolically.

+ +

The SETRECORDNAME record (code 3) can also optionally occur in this block. The +first operand value is a record ID number, and the rest of the elements of the record are +the bytes for the string name of the record. llvm-bcanalyzer can use +this to dump out bitcode files symbolically.

+ +

+Note that although the data in BLOCKINFO blocks is described as +"metadata," the abbreviations they contain are essential for parsing records +from the corresponding blocks. It is not safe to skip them. +

+ +
+ + +
Bitcode Wrapper Format
+ + +
+ +

+Bitcode files for LLVM IR may optionally be wrapped in a simple wrapper +structure. This structure contains a simple header that indicates the offset +and size of the embedded BC file. This allows additional information to be +stored alongside the BC file. The structure of this file header is: +

+ +
+

+[Magic:32, Version:32, Offset:32, Size:32, CPUType:32]

+
+ +

+Each field is a 32-bit value stored in little-endian form (as with the rest of the bitcode file fields). The Magic number is always 0x0B17C0DE and the version is currently always 0. The Offset field is the offset in bytes to the start of the bitcode stream in the file, and the Size field is the size in bytes of the stream. CPUType is a target-specific value that can be used to encode the CPU of the target.
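Since the header is five little-endian 32-bit fields, it can be unpacked with Python's struct module. The type and function names below are hypothetical, and the sketch does no validation beyond the magic number and the header length.

```python
import struct
from typing import NamedTuple

WRAPPER_MAGIC = 0x0B17C0DE


class WrapperHeader(NamedTuple):
    magic: int
    version: int
    offset: int    # byte offset of the embedded bitcode stream
    size: int      # size of the embedded stream in bytes
    cpu_type: int


def parse_wrapper_header(data: bytes) -> WrapperHeader:
    """Decode the 20-byte wrapper header at the start of `data`."""
    if len(data) < 20:
        raise ValueError("buffer too short for a wrapper header")
    header = WrapperHeader(*struct.unpack_from("<5I", data, 0))
    if header.magic != WRAPPER_MAGIC:
        raise ValueError("not a bitcode wrapper")
    return header


def embedded_bitcode(data: bytes) -> bytes:
    """Return the bytes of the wrapped bitcode stream."""
    header = parse_wrapper_header(data)
    return data[header.offset:header.offset + header.size]
```

Because the fields are little-endian, a wrapped file begins with the bytes DE C0 17 0B.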

+ +
+ + +
LLVM IR Encoding
+ + +
+ +

+LLVM IR is encoded into a bitstream by defining blocks and records. It uses +blocks for things like constant pools, functions, symbol tables, etc. It uses +records for things like instructions, global variable descriptors, type +descriptions, etc. This document does not describe the set of abbreviations +that the writer uses, as these are fully self-described in the file, and the +reader is not allowed to build in any knowledge of this. +

+ +
+ + +
Basics +
+ + +
LLVM IR Magic Number
+ +
+ +

+The magic number for LLVM IR files is: +

+ +
+

+[0x0:4, 0xC:4, 0xE:4, 0xD:4]

+
+ +

+When combined with the bitcode magic number and viewed as bytes, this is +"BC 0xC0DE". +
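The byte view follows from the bit order of the stream. The sketch below assumes the magic consists of the four 4-bit fixed-width values 0x0, 0xC, 0xE and 0xD, and that each byte is filled starting from its least significant bit, as described under Primitives. The helper name pack_nibbles is hypothetical.

```python
from typing import List


def pack_nibbles(nibbles: List[int]) -> bytes:
    """Pack consecutive 4-bit fields into bytes, first field in the low half."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of 4-bit fields")
    out = bytearray()
    for low, high in zip(nibbles[0::2], nibbles[1::2]):
        out.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(out)


# 'B', 'C', then the application-specific magic for LLVM IR.
LLVM_IR_MAGIC = b"BC" + pack_nibbles([0x0, 0xC, 0xE, 0xD])
```

The first pair (0x0, 0xC) becomes the byte 0xC0 and the second pair (0xE, 0xD) becomes 0xDE, giving the four bytes 42 43 C0 DE.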

+ +
+ + +
Signed VBRs
+ +
+ +

+Variable Width Integer encoding is an efficient way to encode arbitrary sized unsigned values, but is extremely inefficient for encoding signed values, because a negative value would otherwise be treated as a maximally large unsigned value.

+ +

+As such, signed VBR values of a specific width are emitted as follows: +

+ + + +

+With this encoding, small positive and small negative values can both +be emitted efficiently. Signed VBR encoding is used in +CST_CODE_INTEGER and CST_CODE_WIDE_INTEGER records +within CONSTANTS_BLOCK blocks. +
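The exact rules are not reproduced in this text, so the following is a sketch under an assumption: the magnitude is shifted left by one bit and the low bit carries the sign, which is the convention LLVM's writer is understood to use. The resulting unsigned payload would then be emitted as an ordinary VBR. The function names are hypothetical, and the most negative 64-bit value, which needs special handling, is not covered.

```python
def signed_vbr_payload(value: int) -> int:
    """Unsigned payload for a signed value: magnitude << 1, low bit = sign."""
    if value >= 0:
        return value << 1
    return ((-value) << 1) | 1


def signed_from_payload(payload: int) -> int:
    """Inverse of signed_vbr_payload."""
    magnitude = payload >> 1
    return -magnitude if payload & 1 else magnitude
```

Under this convention -1 becomes the payload 3 and fits in a single small VBR chunk, instead of being treated as a huge unsigned number.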

+ +
+ + + +
LLVM IR Blocks
+ +
+ +

+LLVM IR is defined with the following blocks: +

+ + + +
+ + +
MODULE_BLOCK Contents +
+ +
+ +

The MODULE_BLOCK block (id 8) is the top-level block for LLVM +bitcode files, and each bitcode file must contain exactly one. In +addition to records (described below) containing information +about the module, a MODULE_BLOCK block may contain the +following sub-blocks: +

+ + + +
+ + +
MODULE_CODE_VERSION Record +
+ +
+ +

[VERSION, version#]

+ +

The VERSION record (code 1) contains a single value +indicating the format version. Only version 0 is supported at this +time.

+
+ + +
MODULE_CODE_TRIPLE Record +
+ +
+

[TRIPLE, ...string...]

+ +

The TRIPLE record (code 2) contains a variable number of +values representing the bytes of the target triple +specification string.

+
+ + +
MODULE_CODE_DATALAYOUT Record +
+ +
+

[DATALAYOUT, ...string...]

+ +

The DATALAYOUT record (code 3) contains a variable number of +values representing the bytes of the target datalayout +specification string.

+
+ + +
MODULE_CODE_ASM Record +
+ +
+

[ASM, ...string...]

+ +

The ASM record (code 4) contains a variable number of +values representing the bytes of module asm strings, with +individual assembly blocks separated by newline (ASCII 10) characters.

+
+ + +
MODULE_CODE_SECTIONNAME Record +
+ +
+

[SECTIONNAME, ...string...]

+ +

The SECTIONNAME record (code 5) contains a variable number +of values representing the bytes of a single section name +string. There should be one SECTIONNAME record for each +section name referenced (e.g., in global variable or function +section attributes) within the module. These records can be +referenced by the 1-based index in the section fields of +GLOBALVAR or FUNCTION records.

+
+ + +
MODULE_CODE_DEPLIB Record +
+ +
+

[DEPLIB, ...string...]

+ +

The DEPLIB record (code 6) contains a variable number of +values representing the bytes of a single dependent library name +string, one of the libraries mentioned in a deplibs +declaration. There should be one DEPLIB record for each +library name referenced.

+
+ + +
MODULE_CODE_GLOBALVAR Record +
+ +
+

[GLOBALVAR, pointer type, isconst, initid, linkage, alignment, section, visibility, threadlocal]

+ +

The GLOBALVAR record (code 7) marks the declaration or +definition of a global variable. The operand fields are:

+ + +
+ + +
MODULE_CODE_FUNCTION Record +
+ +
+ +

[FUNCTION, type, callingconv, isproto, linkage, paramattr, alignment, section, visibility, gc]

+ +

The FUNCTION record (code 8) marks the declaration or +definition of a function. The operand fields are:

+ + +
+ + +
MODULE_CODE_ALIAS Record +
+ +
+ +

[ALIAS, alias type, aliasee val#, linkage, visibility]

+ +

The ALIAS record (code 9) marks the definition of an alias. The operand fields are:

+ + +
+ + +
MODULE_CODE_PURGEVALS Record +
+ +
+

[PURGEVALS, numvals]

+ +

The PURGEVALS record (code 10) resets the module-level +value list to the size given by the single operand value. Module-level +value list items are added by GLOBALVAR, FUNCTION, +and ALIAS records. After a PURGEVALS record is seen, +new value indices will start from the given numvals value.

+
+ + +
MODULE_CODE_GCNAME Record +
+ +
+

[GCNAME, ...string...]

+ +

The GCNAME record (code 11) contains a variable number of +values representing the bytes of a single garbage collector name +string. There should be one GCNAME record for each garbage +collector name referenced in function gc attributes within +the module. These records can be referenced by 1-based index in the gc +fields of FUNCTION records.

+
+ + +
PARAMATTR_BLOCK Contents +
+ +
+ +

The PARAMATTR_BLOCK block (id 9) ... +

+ +
+ + + +
PARAMATTR_CODE_ENTRY Record +
+ +
+ +

[ENTRY, paramidx0, attr0, paramidx1, attr1...]

+ +

The ENTRY record (code 1) ... +

+
+ + +
TYPE_BLOCK Contents +
+ +
+ +

The TYPE_BLOCK block (id 10) ... +

+ +
+ + + +
CONSTANTS_BLOCK Contents +
+ +
+ +

The CONSTANTS_BLOCK block (id 11) ... +

+ +
+ + + +
FUNCTION_BLOCK Contents +
+ +
+ +

The FUNCTION_BLOCK block (id 12) ... +

+ +

In addition to the record types described below, a +FUNCTION_BLOCK block may contain the following sub-blocks: +

+ + + +
+ + + +
TYPE_SYMTAB_BLOCK Contents +
+ +
+ +

The TYPE_SYMTAB_BLOCK block (id 13) ... +

+ +
+ + + +
VALUE_SYMTAB_BLOCK Contents +
+ +
+ +

The VALUE_SYMTAB_BLOCK block (id 14) ... +

+ +
+ + + +
METADATA_BLOCK Contents +
+ +
+ +

The METADATA_BLOCK block (id 15) ... +

+ +
+ + + +
METADATA_ATTACHMENT Contents +
+ +
+ +

The METADATA_ATTACHMENT block (id 16) ... +

+ +
+ + + +
+
Chris Lattner
+The LLVM Compiler Infrastructure
+Last modified: $Date$ +
+ + -- cgit v1.2.3-70-g09d2