diff options
author | Chris Lattner <sabre@nondot.org> | 2007-05-18 05:51:57 +0000 |
---|---|---|
committer | Chris Lattner <sabre@nondot.org> | 2007-05-18 05:51:57 +0000 |
commit | 18e85be368508de2ef713c7fc6c55a8f5380fbd9 (patch) | |
tree | c7cbaf688fac1d635146b729efff399056feb362 /docs | |
parent | 520f3c98ae9965185243e65357ac37196a91a904 (diff) |
merge in from mainline
git-svn-id: https://llvm.org/svn/llvm-project/llvm/branches/release_20@37210 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs')
-rw-r--r-- | docs/BitCodeFormat.html | 609 |
1 files changed, 581 insertions, 28 deletions
diff --git a/docs/BitCodeFormat.html b/docs/BitCodeFormat.html index 0579a42115..7194c7a6d3 100644 --- a/docs/BitCodeFormat.html +++ b/docs/BitCodeFormat.html @@ -1,59 +1,612 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>LLVM Bitcode File Format</title> <link rel="stylesheet" href="llvm.css" type="text/css"> - <style type="text/css"> - TR, TD { border: 2px solid gray; padding-left: 4pt; padding-right: 4pt; - padding-top: 2pt; padding-bottom: 2pt; } - TH { border: 2px solid gray; font-weight: bold; font-size: 105%; } - TABLE { text-align: center; border: 2px solid black; - border-collapse: collapse; margin-top: 1em; margin-left: 1em; - margin-right: 1em; margin-bottom: 1em; } - .td_left { border: 2px solid gray; text-align: left; } - </style> </head> <body> <div class="doc_title"> LLVM Bitcode File Format </div> <ol> <li><a href="#abstract">Abstract</a></li> - <li><a href="#concepts">Concepts</a></li> + <li><a href="#overview">Overview</a></li> + <li><a href="#bitstream">Bitstream Format</a> + <ol> + <li><a href="#magic">Magic Numbers</a></li> + <li><a href="#primitives">Primitives</a></li> + <li><a href="#abbrevid">Abbreviation IDs</a></li> + <li><a href="#blocks">Blocks</a></li> + <li><a href="#datarecord">Data Records</a></li> + <li><a href="#abbreviations">Abbreviations</a></li> + <li><a href="#stdblocks">Standard Blocks</a></li> + </ol> + </li> + <li><a href="#llvmir">LLVM IR Encoding</a> + <ol> + <li><a href="#basics">Basics</a></li> + </ol> + </li> </ol> <div class="doc_author"> - <p>Written by <a href="mailto:rspencer@x10sys.com">Reid Spencer</a> and - <a href="mailto:sabre@nondot.org">Chris Lattner</a>. + <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>. </p> </div> + <!-- *********************************************************************** --> -<div class="doc_section"> <a name="abstract">Abstract </a></div> +<div class="doc_section"> <a name="abstract">Abstract</a></div> <!-- *********************************************************************** --> + <div class="doc_text"> -<p>This document describes the LLVM bitcode file format. It specifies -the binary encoding rules of the bitcode file format so that -equivalent systems can encode bitcode files correctly. The LLVM -bitcode representation is used to store the intermediate -representation on disk in a compacted form.</p> -<p>This document supercedes the LLVM bytecode file format for the 2.0 -release.</p> + +<p>This document describes the LLVM bitstream file format and the encoding of +the LLVM IR into it.</p> + </div> + <!-- *********************************************************************** --> -<div class="doc_section"> <a name="concepts">Concepts</a> </div> +<div class="doc_section"> <a name="overview">Overview</a></div> <!-- *********************************************************************** --> + <div class="doc_text"> -<p>This section describes the general concepts of the bitcode file -format without getting into specific layout details. It is recommended -that you read this section thoroughly before interpreting the detailed -descriptions.</p> + +<p> +What is commonly known as the LLVM bitcode file format (also, sometimes +anachronistically known as bytecode) is actually two things: a <a +href="#bitstream">bitstream container format</a> +and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p> + +<p> +The bitstream format is an abstract encoding of structured data, very +similar to XML in some ways. Like XML, bitstream files contain tags, and nested +structures, and you can parse the file without having to understand the tags. +Unlike XML, the bitstream format is a binary encoding, and unlike XML it +provides a mechanism for the file to self-describe "abbreviations", which are +effectively size optimizations for the content.</p> + +<p>This document first describes the LLVM bitstream format, then describes the +record structure used by LLVM IR files. +</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="bitstream">Bitstream Format</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +The bitstream format is literally a stream of bits, with a very simple +structure. This structure consists of the following concepts: +</p> + +<ul> +<li>A "<a href="#magic">magic number</a>" that identifies the contents of + the stream.</li> +<li>Encoding <a href="#primitives">primitives</a> like variable bit-rate + integers.</li> +<li><a href="#blocks">Blocks</a>, which define nested content.</li> +<li><a href="#datarecord">Data Records</a>, which describe entities within the + file.</li> +<li>Abbreviations, which specify compression optimizations for the file.</li> +</ul> + +<p>Note that the <a +href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be +used to dump and inspect arbitrary bitstreams, which is very useful for +understanding the encoding.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="magic">Magic Numbers</a> +</div> + +<div class="doc_text"> + +<p>The first four bytes of the stream identify the encoding of the file. This +is used by a reader to know what is contained in the file.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="primitives">Primitives</a> +</div> + +<div class="doc_text"> + +<p> +A bitstream literally consists of a stream of bits. This stream is made up of a +number of primitive values that encode a stream of unsigned integer values. +These +integers are are encoded in two ways: either as <a href="#fixedwidth">Fixed +Width Integers</a> or as <a href="#variablewidth">Variable Width +Integers</a>. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a> +</div> + +<div class="doc_text"> + +<p>Fixed-width integer values have their low bits emitted directly to the file. + For example, a 3-bit integer value encodes 1 as 001. Fixed width integers + are used when there are a well-known number of options for a field. For + example, boolean values are usually encoded with a 1-bit wide integer. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="variablewidth">Variable Width +Integers</a></div> + +<div class="doc_text"> + +<p>Variable-width integer (VBR) values encode values of arbitrary size, +optimizing for the case where the values are small. Given a 4-bit VBR field, +any 3-bit value (0 through 7) is encoded directly, with the high bit set to +zero. Values larger than N-1 bits emit their bits in a series of N-1 bit +chunks, where all but the last set the high bit.</p> + +<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a +vbr4 value. The first set of four bits indicates the value 3 (011) with a +continuation piece (indicated by a high bit of 1). The next word indicates a +value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value +27. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div> + +<div class="doc_text"> + +<p>6-bit characters encode common characters into a fixed 6-bit field. They +represent the following characters with the following 6-bit values:</p> + +<ul> +<li>'a' .. 'z' - 0 .. 25</li> +<li>'A' .. 'Z' - 26 .. 52</li> +<li>'0' .. '9' - 53 .. 61</li> +<li>'.' - 62</li> +<li>'_' - 63</li> +</ul> + +<p>This encoding is only suitable for encoding characters and strings that +consist only of the above characters. It is completely incapable of encoding +characters not in the set.</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div> + +<div class="doc_text"> + +<p>Occasionally, it is useful to emit zero bits until the bitstream is a +multiple of 32 bits. This ensures that the bit position in the stream can be +represented as a multiple of 32-bit words.</p> + +</div> + + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a> +</div> + +<div class="doc_text"> + +<p> +A bitstream is a sequential series of <a href="#blocks">Blocks</a> and +<a href="#datarecord">Data Records</a>. Both of these start with an +abbreviation ID encoded as a fixed-bitwidth field. The width is specified by +the current block, as described below. The value of the abbreviation ID +specifies either a builtin ID (which have special meanings, defined below) or +one of the abbreviation IDs defined by the stream itself. +</p> + +<p> +The set of builtin abbrev IDs is: +</p> + +<ul> +<li>0 - <a href="#END_BLOCK">END_BLOCK</a> - This abbrev ID marks the end of the + current block.</li> +<li>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a> - This abbrev ID marks the + beginning of a new block.</li> +<li>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a> - This defines a new + abbreviation.</li> +<li>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a> - This ID specifies the + definition of an unabbreviated record.</li> +</ul> + +<p>Abbreviation IDs 4 and above are defined by the stream itself, and specify +an <a href="#abbrev_records">abbreviated record encoding</a>.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="blocks">Blocks</a> +</div> + +<div class="doc_text"> + +<p> +Blocks in a bitstream denote nested regions of the stream, and are identified by +a content-specific id number (for example, LLVM IR uses an ID of 12 to represent +function bodies). Nested blocks capture the hierachical structure of the data +encoded in it, and various properties are associated with blocks as the file is +parsed. Block definitions allow the reader to efficiently skip blocks +in constant time if the reader wants a summary of blocks, or if it wants to +efficiently skip data they do not understand. The LLVM IR reader uses this +mechanism to skip function bodies, lazily reading them on demand. +</p> + +<p> +When reading and encoding the stream, several properties are maintained for the +block. In particular, each block maintains: +</p> + +<ol> +<li>A current abbrev id width. This value starts at 2, and is set every time a + block record is entered. The block entry specifies the abbrev id width for + the body of the block.</li> + +<li>A set of abbreviations. Abbreviations may be defined within a block, or + they may be associated with all blocks of a particular ID. +</li> +</ol> + +<p>As sub blocks are entered, these properties are saved and the new sub-block +has its own set of abbreviations, and its own abbrev id width. When a sub-block +is popped, the saved values are restored.</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>, + <align32bits>, blocklen<sub>32</sub>]</tt></p> + +<p> +The ENTER_SUBBLOCK abbreviation ID specifies the start of a new block record. +The <tt>blockid</tt> value is encoded as a 8-bit VBR identifier, and indicates +the type of block being entered (which is application specific). The +<tt>newabbrevlen</tt> value is a 4-bit VBR which specifies the +abbrev id width for the sub-block. The <tt>blocklen</tt> is a 32-bit aligned +value that specifies the size of the subblock, in 32-bit words. This value +allows the reader to skip over the entire block in one jump. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[END_BLOCK, <align32bits>]</tt></p> + +<p> +The END_BLOCK abbreviation ID specifies the end of the current block record. +Its end is aligned to 32-bits to ensure that the size of the block is an even +multiple of 32-bits.</p> + +</div> + + + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="datarecord">Data Records</a> +</div> + +<div class="doc_text"> +<p> +Data records consist of a record code and a number of (up to) 64-bit integer +values. The interpretation of the code and values is application specific and +there are multiple different ways to encode a record (with an unabbrev record +or with an abbreviation). In the LLVM IR format, for example, there is a record +which encodes the target triple of a module. The code is MODULE_CODE_TRIPLE, +and the values of the record are the ascii codes for the characters in the +string.</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>, + op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p> + +<p>An UNABBREV_RECORD provides a default fallback encoding, which is both +completely general and also extremely inefficient. It can describe an arbitrary +record, by emitting the code and operands as vbrs.</p> + +<p>For example, emitting an LLVM IR target triple as an unabbreviated record +requires emitting the UNABBREV_RECORD abbrevid, a vbr6 for the +MODULE_CODE_TRIPLE code, a vbr6 for the length of the string (which is equal to +the number of operands), and a vbr6 for each character. Since there are no +letters with value less than 32, each letter would need to be emitted as at +least a two-part VBR, which means that each letter would require at least 12 +bits. This is not an efficient encoding, but it is fully general.</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[<abbrevid>, fields...]</tt></p> + +<p>An abbreviated record is a abbreviation id followed by a set of fields that +are encoded according to the <a href="#abbreviations">abbreviation +definition</a>. This allows records to be encoded significantly more densely +than records encoded with the <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a> +type, and allows the abbreviation types to be specified in the stream itself, +which allows the files to be completely self describing. The actual encoding +of abbreviations is defined below. +</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="abbreviations">Abbreviations</a> +</div> + +<div class="doc_text"> +<p> +Abbreviations are an important form of compression for bitstreams. The idea is +to specify a dense encoding for a class of records once, then use that encoding +to emit many records. It takes space to emit the encoding into the file, but +the space is recouped (hopefully plus some) when the records that use it are +emitted. +</p> + +<p> +Abbreviations can be determined dynamically per client, per file. Since the +abbreviations are stored in the bitstream itself, different streams of the same +format can contain different sets of abbreviations if the specific stream does +not need it. As a concrete example, LLVM IR files usually emit an abbreviation +for binary operators. If a specific LLVM module contained no or few binary +operators, the abbreviation does not need to be emitted. +</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV + Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1, + ...]</tt></p> + +<p>An abbreviation definition consists of the DEFINE_ABBREV abbrevid followed +by a VBR that specifies the number of abbrev operands, then the abbrev +operands themselves. Abbreviation operands come in three forms. They all start +with a single bit that indicates whether the abbrev operand is a literal operand +(when the bit is 1) or an encoding operand (when the bit is 0).</p> + +<ol> +<li>Literal operands - <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt> - +Literal operands specify that the value in the result +is always a single specific value. This specific value is emitted as a vbr8 +after the bit indicating that it is a literal operand.</li> +<li>Encoding info without data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>]</tt> + - Operand encodings that do not have extra data are just emitted as their code. +</li> +<li>Encoding info with data - <tt>[0<sub>1</sub>, encoding<sub>3</sub>, +value<sub>vbr5</sub>]</tt> - Operand encodings that do have extra data are +emitted as their code, followed by the extra data. +</li> +</ol> + +<p>The possible operand encodings are:</p> + +<ul> +<li>1 - Fixed - The field should be emitted as a <a + href="#fixedwidth">fixed-width value</a>, whose width + is specified by the encoding operand.</li> +<li>2 - VBR - The field should be emitted as a <a + href="#variablewidth">variable-width value</a>, whose width + is specified by the encoding operand.</li> +<li>3 - Array - This field is an array of values. The element type of the array + is specified by the next encoding operand.</li> +<li>4 - Char6 - This field should be emitted as a <a href="#char6">char6-encoded + value</a>.</li> +</ul> + +<p>For example, target triples in LLVM modules are encoded as a record of the +form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>. Consider if the bitstream emitted +the following abbrev entry:</p> + +<ul> +<li><tt>[0, Fixed, 4]</tt></li> +<li><tt>[0, Array]</tt></li> +<li><tt>[0, Char6]</tt></li> +</ul> + +<p>When emitting a record with this abbreviation, the above entry would be +emitted as:</p> + +<p><tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, + 0<sub>6</sub>, 1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt></p> + +<p>These values are:</p> + +<ol> +<li>The first value, 4, is the abbreviation ID for this abbreviation.</li> +<li>The second value, 2, is the code for TRIPLE in LLVM IR files.</li> +<li>The third value, 4, is the length of the array.</li> +<li>The rest of the values are the char6 encoded values for "abcd".</li> +</ol> + +<p>With this abbreviation, the triple is emitted with only 37 bits (assuming a +abbrev id width of 3). Without the abbreviation, significantly more space would +be required to emit the target triple. Also, since the TRIPLE value is not +emitted as a literal in the abbreviation, the abbreviation can also be used for +any other string value. +</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="stdblocks">Standard Blocks</a> +</div> + +<div class="doc_text"> + +<p> +In addition to the basic block structure and record encodings, the bitstream +also defines specific builtin block types. These block types specify how the +stream is to be decoded or other metadata. In the future, new standard blocks +may be added. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="BLOCKINFO">#0 - BLOCKINFO +Block</a></div> + +<div class="doc_text"> + +<p>The BLOCKINFO block allows the description of metadata for other blocks. The + currently specified records are:</p> + +<ul> +<li><tt>[SETBID (#1), blockid]</tt></li> +<li><tt>[DEFINE_ABBREV, ...]</tt></li> +</ul> + +<p> +The SETBID record indicates which block ID is being described. The standard +DEFINE_ABBREV record specifies an abbreviation. The abbreviation is associated +with the record ID, and any records with matching ID automatically get the +abbreviation. +</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>LLVM IR is encoded into a bitstream by defining blocks and records. It uses +blocks for things like constant pools, functions, symbol tables, etc. It uses +records for things like instructions, global variable descriptors, type +descriptions, etc. This document does not describe the set of abbreviations +that the writer uses, as these are fully self-described in the file, and the +reader is not allowed to build in any knowledge of this.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="basics">Basics</a> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_magic">LLVM IR Magic Number</a></div> + +<div class="doc_text"> + +<p> +The magic number for LLVM IR files is: +</p> + +<p><tt>['B'<sub>8</sub>, 'C'<sub>8</sub>, 0x0<sub>4</sub>, 0xC<sub>4</sub>, +0xE<sub>4</sub>, 0xD<sub>4</sub>]</tt></p> + +<p>When viewed as bytes, this is "BC 0xC0DE".</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_signed_vbr">Signed VBRs</a></div> + +<div class="doc_text"> + +<p> +<a href="#variablewidth">Variable Width Integers</a> are an efficient way to +encode arbitrary sized unsigned values, but is an extremely inefficient way to +encode signed values (as signed values are otherwise treated as maximally large +unsigned values).</p> + +<p>As such, signed vbr values of a specific width are emitted as follows:</p> + +<ul> +<li>Positive values are emitted as vbrs of the specified width, but with their + value shifted left by one.</li> +<li>Negative values are emitted as vbrs of the specified width, but the negated + value is shifted left by one, and the low bit is set.</li> +</ul> + +<p>With this encoding, small positive and small negative values can both be +emitted efficiently.</p> + +</div> + + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_blocks">LLVM IR Blocks</a></div> + +<div class="doc_text"> + +<p> +LLVM IR is defined with the following blocks: +</p> + +<ul> +<li>8 - MODULE_BLOCK - This is the top-level block that contains the + entire module, and describes a variety of per-module information.</li> +<li>9 - PARAMATTR_BLOCK - This enumerates the parameter attributes.</li> +<li>10 - TYPE_BLOCK - This describes all of the types in the module.</li> +<li>11 - CONSTANTS_BLOCK - This describes constants for a module or + function.</li> +<li>12 - FUNCTION_BLOCK - This describes a function body.</li> +<li>13 - TYPE_SYMTAB_BLOCK - This describes the type symbol table.</li> +<li>14 - VALUE_SYMTAB_BLOCK - This describes a value symbol table.</li> +</ul> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="MODULE_BLOCK">MODULE_BLOCK Contents</a> +</div> + +<div class="doc_text"> + +<p> +</p> + </div> + + <!-- *********************************************************************** --> <hr> <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> <a href="http://validator.w3.org/check/referer"><img src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> -<a href="mailto:rspencer@x10sys.com">Reid Spencer</a> and <a - href="mailto:sabre@nondot.org">Chris Lattner</a><br> + <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> Last modified: $Date$ </address> |