diff options
author | mike-m <mikem.llvm@gmail.com> | 2010-05-07 00:28:04 +0000 |
---|---|---|
committer | mike-m <mikem.llvm@gmail.com> | 2010-05-07 00:28:04 +0000 |
commit | e2c3a49c8029ebd9ef530101cc24c66562e3dff5 (patch) | |
tree | 91bf9600cc8df90cf99751a8f8bafc317cffc91e /docs/BitCodeFormat.html | |
parent | c10b5afbe8138b0fdf3af4ed3e1ddf96cf3cb4cb (diff) |
Revert r103213. It broke several sections of live website.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@103219 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs/BitCodeFormat.html')
-rw-r--r-- | docs/BitCodeFormat.html | 1163 |
1 files changed, 1163 insertions, 0 deletions
diff --git a/docs/BitCodeFormat.html b/docs/BitCodeFormat.html new file mode 100644 index 0000000000..f1ddefdea9 --- /dev/null +++ b/docs/BitCodeFormat.html @@ -0,0 +1,1163 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> +<html> +<head> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <title>LLVM Bitcode File Format</title> + <link rel="stylesheet" href="llvm.css" type="text/css"> +</head> +<body> +<div class="doc_title"> LLVM Bitcode File Format </div> +<ol> + <li><a href="#abstract">Abstract</a></li> + <li><a href="#overview">Overview</a></li> + <li><a href="#bitstream">Bitstream Format</a> + <ol> + <li><a href="#magic">Magic Numbers</a></li> + <li><a href="#primitives">Primitives</a></li> + <li><a href="#abbrevid">Abbreviation IDs</a></li> + <li><a href="#blocks">Blocks</a></li> + <li><a href="#datarecord">Data Records</a></li> + <li><a href="#abbreviations">Abbreviations</a></li> + <li><a href="#stdblocks">Standard Blocks</a></li> + </ol> + </li> + <li><a href="#wrapper">Bitcode Wrapper Format</a> + </li> + <li><a href="#llvmir">LLVM IR Encoding</a> + <ol> + <li><a href="#basics">Basics</a></li> + <li><a href="#MODULE_BLOCK">MODULE_BLOCK Contents</a></li> + <li><a href="#PARAMATTR_BLOCK">PARAMATTR_BLOCK Contents</a></li> + <li><a href="#TYPE_BLOCK">TYPE_BLOCK Contents</a></li> + <li><a href="#CONSTANTS_BLOCK">CONSTANTS_BLOCK Contents</a></li> + <li><a href="#FUNCTION_BLOCK">FUNCTION_BLOCK Contents</a></li> + <li><a href="#TYPE_SYMTAB_BLOCK">TYPE_SYMTAB_BLOCK Contents</a></li> + <li><a href="#VALUE_SYMTAB_BLOCK">VALUE_SYMTAB_BLOCK Contents</a></li> + <li><a href="#METADATA_BLOCK">METADATA_BLOCK Contents</a></li> + <li><a href="#METADATA_ATTACHMENT">METADATA_ATTACHMENT Contents</a></li> + </ol> + </li> +</ol> +<div class="doc_author"> + <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a> + and <a href="http://www.reverberate.org">Joshua Haberman</a>. +</p> +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="abstract">Abstract</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>This document describes the LLVM bitstream file format and the encoding of +the LLVM IR into it.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="overview">Overview</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +What is commonly known as the LLVM bitcode file format (also, sometimes +anachronistically known as bytecode) is actually two things: a <a +href="#bitstream">bitstream container format</a> +and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p> + +<p> +The bitstream format is an abstract encoding of structured data, very +similar to XML in some ways. Like XML, bitstream files contain tags, and nested +structures, and you can parse the file without having to understand the tags. +Unlike XML, the bitstream format is a binary encoding, and unlike XML it +provides a mechanism for the file to self-describe "abbreviations", which are +effectively size optimizations for the content.</p> + +<p>LLVM IR files may be optionally embedded into a <a +href="#wrapper">wrapper</a> structure that makes it easy to embed extra data +along with LLVM IR files.</p> + +<p>This document first describes the LLVM bitstream format, describes the +wrapper format, then describes the record structure used by LLVM IR files. +</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="bitstream">Bitstream Format</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +The bitstream format is literally a stream of bits, with a very simple +structure. This structure consists of the following concepts: +</p> + +<ul> +<li>A "<a href="#magic">magic number</a>" that identifies the contents of + the stream.</li> +<li>Encoding <a href="#primitives">primitives</a> like variable bit-rate + integers.</li> +<li><a href="#blocks">Blocks</a>, which define nested content.</li> +<li><a href="#datarecord">Data Records</a>, which describe entities within the + file.</li> +<li>Abbreviations, which specify compression optimizations for the file.</li> +</ul> + +<p>Note that the <a +href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be +used to dump and inspect arbitrary bitstreams, which is very useful for +understanding the encoding.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="magic">Magic Numbers</a> +</div> + +<div class="doc_text"> + +<p>The first two bytes of a bitcode file are 'BC' (0x42, 0x43). +The second two bytes are an application-specific magic number. Generic +bitcode tools can look at only the first two bytes to verify the file is +bitcode, while application-specific programs will want to look at all four.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="primitives">Primitives</a> +</div> + +<div class="doc_text"> + +<p> +A bitstream literally consists of a stream of bits, which are read in order +starting with the least significant bit of each byte. The stream is made up of a +number of primitive values that encode a stream of unsigned integer values. +These integers are encoded in two ways: either as <a href="#fixedwidth">Fixed +Width Integers</a> or as <a href="#variablewidth">Variable Width +Integers</a>. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a> +</div> + +<div class="doc_text"> + +<p>Fixed-width integer values have their low bits emitted directly to the file. + For example, a 3-bit integer value encodes 1 as 001. Fixed width integers + are used when there are a well-known number of options for a field. For + example, boolean values are usually encoded with a 1-bit wide integer. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="variablewidth">Variable Width +Integers</a></div> + +<div class="doc_text"> + +<p>Variable-width integer (VBR) values encode values of arbitrary size, +optimizing for the case where the values are small. Given a 4-bit VBR field, +any 3-bit value (0 through 7) is encoded directly, with the high bit set to +zero. Values larger than N-1 bits emit their bits in a series of N-1 bit +chunks, where all but the last set the high bit.</p> + +<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a +vbr4 value. The first set of four bits indicates the value 3 (011) with a +continuation piece (indicated by a high bit of 1). The next word indicates a +value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value +27. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div> + +<div class="doc_text"> + +<p>6-bit characters encode common characters into a fixed 6-bit field. They +represent the following characters with the following 6-bit values:</p> + +<div class="doc_code"> +<pre> +'a' .. 'z' — 0 .. 25 +'A' .. 'Z' — 26 .. 51 +'0' .. '9' — 52 .. 61 + '.' — 62 + '_' — 63 +</pre> +</div> + +<p>This encoding is only suitable for encoding characters and strings that +consist only of the above characters. It is completely incapable of encoding +characters not in the set.</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div> + +<div class="doc_text"> + +<p>Occasionally, it is useful to emit zero bits until the bitstream is a +multiple of 32 bits. This ensures that the bit position in the stream can be +represented as a multiple of 32-bit words.</p> + +</div> + + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a> +</div> + +<div class="doc_text"> + +<p> +A bitstream is a sequential series of <a href="#blocks">Blocks</a> and +<a href="#datarecord">Data Records</a>. Both of these start with an +abbreviation ID encoded as a fixed-bitwidth field. The width is specified by +the current block, as described below. The value of the abbreviation ID +specifies either a builtin ID (which have special meanings, defined below) or +one of the abbreviation IDs defined for the current block by the stream itself. +</p> + +<p> +The set of builtin abbrev IDs is: +</p> + +<ul> +<li><tt>0 - <a href="#END_BLOCK">END_BLOCK</a></tt> — This abbrev ID marks + the end of the current block.</li> +<li><tt>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a></tt> — This + abbrev ID marks the beginning of a new block.</li> +<li><tt>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt> — This defines + a new abbreviation.</li> +<li><tt>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> — This ID + specifies the definition of an unabbreviated record.</li> +</ul> + +<p>Abbreviation IDs 4 and above are defined by the stream itself, and specify +an <a href="#abbrev_records">abbreviated record encoding</a>.</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="blocks">Blocks</a> +</div> + +<div class="doc_text"> + +<p> +Blocks in a bitstream denote nested regions of the stream, and are identified by +a content-specific id number (for example, LLVM IR uses an ID of 12 to represent +function bodies). Block IDs 0-7 are reserved for <a href="#stdblocks">standard blocks</a> +whose meaning is defined by Bitcode; block IDs 8 and greater are +application specific. Nested blocks capture the hierarchical structure of the data +encoded in it, and various properties are associated with blocks as the file is +parsed. Block definitions allow the reader to efficiently skip blocks +in constant time if the reader wants a summary of blocks, or if it wants to +efficiently skip data it does not understand. The LLVM IR reader uses this +mechanism to skip function bodies, lazily reading them on demand. +</p> + +<p> +When reading and encoding the stream, several properties are maintained for the +block. In particular, each block maintains: +</p> + +<ol> +<li>A current abbrev id width. This value starts at 2 at the beginning of + the stream, and is set every time a + block record is entered. The block entry specifies the abbrev id width for + the body of the block.</li> + +<li>A set of abbreviations. Abbreviations may be defined within a block, in + which case they are only defined in that block (neither subblocks nor + enclosing blocks see the abbreviation). Abbreviations can also be defined + inside a <tt><a href="#BLOCKINFO">BLOCKINFO</a></tt> block, in which case + they are defined in all blocks that match the ID that the BLOCKINFO block is + describing. +</li> +</ol> + +<p> +As sub blocks are entered, these properties are saved and the new sub-block has +its own set of abbreviations, and its own abbrev id width. When a sub-block is +popped, the saved values are restored. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>, + <align32bits>, blocklen<sub>32</sub>]</tt></p> + +<p> +The <tt>ENTER_SUBBLOCK</tt> abbreviation ID specifies the start of a new block +record. The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and +indicates the type of block being entered, which can be +a <a href="#stdblocks">standard block</a> or an application-specific block. +The <tt>newabbrevlen</tt> value is a 4-bit VBR, which specifies the abbrev id +width for the sub-block. The <tt>blocklen</tt> value is a 32-bit aligned value +that specifies the size of the subblock in 32-bit words. This value allows the +reader to skip over the entire block in one jump. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[END_BLOCK, <align32bits>]</tt></p> + +<p> +The <tt>END_BLOCK</tt> abbreviation ID specifies the end of the current block +record. Its end is aligned to 32-bits to ensure that the size of the block is +an even multiple of 32-bits. +</p> + +</div> + + + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="datarecord">Data Records</a> +</div> + +<div class="doc_text"> +<p> +Data records consist of a record code and a number of (up to) 64-bit +integer values. The interpretation of the code and values is +application specific and may vary between different block types. +Records can be encoded either using an unabbrev record, or with an +abbreviation. In the LLVM IR format, for example, there is a record +which encodes the target triple of a module. The code is +<tt>MODULE_CODE_TRIPLE</tt>, and the values of the record are the +ASCII codes for the characters in the string. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>, + op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p> + +<p> +An <tt>UNABBREV_RECORD</tt> provides a default fallback encoding, which is both +completely general and extremely inefficient. It can describe an arbitrary +record by emitting the code and operands as VBRs. +</p> + +<p> +For example, emitting an LLVM IR target triple as an unabbreviated record +requires emitting the <tt>UNABBREV_RECORD</tt> abbrevid, a vbr6 for the +<tt>MODULE_CODE_TRIPLE</tt> code, a vbr6 for the length of the string, which is +equal to the number of operands, and a vbr6 for each character. Because there +are no letters with values less than 32, each letter would need to be emitted as +at least a two-part VBR, which means that each letter would require at least 12 +bits. This is not an efficient encoding, but it is fully general. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record +Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[<abbrevid>, fields...]</tt></p> + +<p> +An abbreviated record is a abbreviation id followed by a set of fields that are +encoded according to the <a href="#abbreviations">abbreviation definition</a>. +This allows records to be encoded significantly more densely than records +encoded with the <tt><a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> type, +and allows the abbreviation types to be specified in the stream itself, which +allows the files to be completely self describing. The actual encoding of +abbreviations is defined below. +</p> + +<p>The record code, which is the first field of an abbreviated record, +may be encoded in the abbreviation definition (as a literal +operand) or supplied in the abbreviated record (as a Fixed or VBR +operand value).</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="abbreviations">Abbreviations</a> +</div> + +<div class="doc_text"> +<p> +Abbreviations are an important form of compression for bitstreams. The idea is +to specify a dense encoding for a class of records once, then use that encoding +to emit many records. It takes space to emit the encoding into the file, but +the space is recouped (hopefully plus some) when the records that use it are +emitted. +</p> + +<p> +Abbreviations can be determined dynamically per client, per file. Because the +abbreviations are stored in the bitstream itself, different streams of the same +format can contain different sets of abbreviations according to the needs +of the specific stream. +As a concrete example, LLVM IR files usually emit an abbreviation +for binary operators. If a specific LLVM module contained no or few binary +operators, the abbreviation does not need to be emitted. +</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV + Encoding</a></div> + +<div class="doc_text"> + +<p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1, + ...]</tt></p> + +<p> +A <tt>DEFINE_ABBREV</tt> record adds an abbreviation to the list of currently +defined abbreviations in the scope of this block. This definition only exists +inside this immediate block — it is not visible in subblocks or enclosing +blocks. Abbreviations are implicitly assigned IDs sequentially starting from 4 +(the first application-defined abbreviation ID). Any abbreviations defined in a +<tt>BLOCKINFO</tt> record for the particular block type +receive IDs first, in order, followed by any +abbreviations defined within the block itself. Abbreviated data records +reference this ID to indicate what abbreviation they are invoking. +</p> + +<p> +An abbreviation definition consists of the <tt>DEFINE_ABBREV</tt> abbrevid +followed by a VBR that specifies the number of abbrev operands, then the abbrev +operands themselves. Abbreviation operands come in three forms. They all start +with a single bit that indicates whether the abbrev operand is a literal operand +(when the bit is 1) or an encoding operand (when the bit is 0). +</p> + +<ol> +<li>Literal operands — <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt> +— Literal operands specify that the value in the result is always a single +specific value. This specific value is emitted as a vbr8 after the bit +indicating that it is a literal operand.</li> +<li>Encoding info without data — <tt>[0<sub>1</sub>, + encoding<sub>3</sub>]</tt> — Operand encodings that do not have extra + data are just emitted as their code. +</li> +<li>Encoding info with data — <tt>[0<sub>1</sub>, encoding<sub>3</sub>, +value<sub>vbr5</sub>]</tt> — Operand encodings that do have extra data are +emitted as their code, followed by the extra data. +</li> +</ol> + +<p>The possible operand encodings are:</p> + +<ul> +<li>Fixed (code 1): The field should be emitted as + a <a href="#fixedwidth">fixed-width value</a>, whose width is specified by + the operand's extra data.</li> +<li>VBR (code 2): The field should be emitted as + a <a href="#variablewidth">variable-width value</a>, whose width is + specified by the operand's extra data.</li> +<li>Array (code 3): This field is an array of values. The array operand + has no extra data, but expects another operand to follow it, indicating + the element type of the array. When reading an array in an abbreviated + record, the first integer is a vbr6 that indicates the array length, + followed by the encoded elements of the array. An array may only occur as + the last operand of an abbreviation (except for the one final operand that + gives the array's type).</li> +<li>Char6 (code 4): This field should be emitted as + a <a href="#char6">char6-encoded value</a>. This operand type takes no + extra data. Char6 encoding is normally used as an array element type. + </li> +<li>Blob (code 5): This field is emitted as a vbr6, followed by padding to a + 32-bit boundary (for alignment) and an array of 8-bit objects. The array of + bytes is further followed by tail padding to ensure that its total length is + a multiple of 4 bytes. This makes it very efficient for the reader to + decode the data without having to make a copy of it: it can use a pointer to + the data in the mapped in file and poke directly at it. A blob may only + occur as the last operand of an abbreviation.</li> +</ul> + +<p> +For example, target triples in LLVM modules are encoded as a record of the +form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>. Consider if the bitstream emitted +the following abbrev entry: +</p> + +<div class="doc_code"> +<pre> +[0, Fixed, 4] +[0, Array] +[0, Char6] +</pre> +</div> + +<p> +When emitting a record with this abbreviation, the above entry would be emitted +as: +</p> + +<div class="doc_code"> +<p> +<tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, 0<sub>6</sub>, +1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt> +</p> +</div> + +<p>These values are:</p> + +<ol> +<li>The first value, 4, is the abbreviation ID for this abbreviation.</li> +<li>The second value, 2, is the record code for <tt>TRIPLE</tt> records within LLVM IR file <tt>MODULE_BLOCK</tt> blocks.</li> +<li>The third value, 4, is the length of the array.</li> +<li>The rest of the values are the char6 encoded values + for <tt>"abcd"</tt>.</li> +</ol> + +<p> +With this abbreviation, the triple is emitted with only 37 bits (assuming a +abbrev id width of 3). Without the abbreviation, significantly more space would +be required to emit the target triple. Also, because the <tt>TRIPLE</tt> value +is not emitted as a literal in the abbreviation, the abbreviation can also be +used for any other string value. +</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="stdblocks">Standard Blocks</a> +</div> + +<div class="doc_text"> + +<p> +In addition to the basic block structure and record encodings, the bitstream +also defines specific built-in block types. These block types specify how the +stream is to be decoded or other metadata. In the future, new standard blocks +may be added. Block IDs 0-7 are reserved for standard blocks. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="BLOCKINFO">#0 - BLOCKINFO +Block</a></div> + +<div class="doc_text"> + +<p> +The <tt>BLOCKINFO</tt> block allows the description of metadata for other +blocks. The currently specified records are: +</p> + +<div class="doc_code"> +<pre> +[SETBID (#1), blockid] +[DEFINE_ABBREV, ...] +[BLOCKNAME, ...name...] +[SETRECORDNAME, RecordID, ...name...] +</pre> +</div> + +<p> +The <tt>SETBID</tt> record (code 1) indicates which block ID is being +described. <tt>SETBID</tt> records can occur multiple times throughout the +block to change which block ID is being described. There must be +a <tt>SETBID</tt> record prior to any other records. +</p> + +<p> +Standard <tt>DEFINE_ABBREV</tt> records can occur inside <tt>BLOCKINFO</tt> +blocks, but unlike their occurrence in normal blocks, the abbreviation is +defined for blocks matching the block ID we are describing, <i>not</i> the +<tt>BLOCKINFO</tt> block itself. The abbreviations defined +in <tt>BLOCKINFO</tt> blocks receive abbreviation IDs as described +in <tt><a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt>. +</p> + +<p>The <tt>BLOCKNAME</tt> record (code 2) can optionally occur in this block. The elements of +the record are the bytes of the string name of the block. llvm-bcanalyzer can use +this to dump out bitcode files symbolically.</p> + +<p>The <tt>SETRECORDNAME</tt> record (code 3) can also optionally occur in this block. The +first operand value is a record ID number, and the rest of the elements of the record are +the bytes for the string name of the record. llvm-bcanalyzer can use +this to dump out bitcode files symbolically.</p> + +<p> +Note that although the data in <tt>BLOCKINFO</tt> blocks is described as +"metadata," the abbreviations they contain are essential for parsing records +from the corresponding blocks. It is not safe to skip them. +</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="wrapper">Bitcode Wrapper Format</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +Bitcode files for LLVM IR may optionally be wrapped in a simple wrapper +structure. This structure contains a simple header that indicates the offset +and size of the embedded BC file. This allows additional information to be +stored alongside the BC file. The structure of this file header is: +</p> + +<div class="doc_code"> +<p> +<tt>[Magic<sub>32</sub>, Version<sub>32</sub>, Offset<sub>32</sub>, +Size<sub>32</sub>, CPUType<sub>32</sub>]</tt> +</p> +</div> + +<p> +Each of the fields are 32-bit fields stored in little endian form (as with +the rest of the bitcode file fields). The Magic number is always +<tt>0x0B17C0DE</tt> and the version is currently always <tt>0</tt>. The Offset +field is the offset in bytes to the start of the bitcode stream in the file, and +the Size field is the size in bytes of the stream. CPUType is a target-specific +value that can be used to encode the CPU of the target. +</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p> +LLVM IR is encoded into a bitstream by defining blocks and records. It uses +blocks for things like constant pools, functions, symbol tables, etc. It uses +records for things like instructions, global variable descriptors, type +descriptions, etc. This document does not describe the set of abbreviations +that the writer uses, as these are fully self-described in the file, and the +reader is not allowed to build in any knowledge of this. +</p> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="basics">Basics</a> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_magic">LLVM IR Magic Number</a></div> + +<div class="doc_text"> + +<p> +The magic number for LLVM IR files is: +</p> + +<div class="doc_code"> +<p> +<tt>[0x0<sub>4</sub>, 0xC<sub>4</sub>, 0xE<sub>4</sub>, 0xD<sub>4</sub>]</tt> +</p> +</div> + +<p> +When combined with the bitcode magic number and viewed as bytes, this is +<tt>"BC 0xC0DE"</tt>. +</p> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_signed_vbr">Signed VBRs</a></div> + +<div class="doc_text"> + +<p> +<a href="#variablewidth">Variable Width Integer</a> encoding is an efficient way to +encode arbitrary sized unsigned values, but is an extremely inefficient for +encoding signed values, as signed values are otherwise treated as maximally large +unsigned values. +</p> + +<p> +As such, signed VBR values of a specific width are emitted as follows: +</p> + +<ul> +<li>Positive values are emitted as VBRs of the specified width, but with their + value shifted left by one.</li> +<li>Negative values are emitted as VBRs of the specified width, but the negated + value is shifted left by one, and the low bit is set.</li> +</ul> + +<p> +With this encoding, small positive and small negative values can both +be emitted efficiently. Signed VBR encoding is used in +<tt>CST_CODE_INTEGER</tt> and <tt>CST_CODE_WIDE_INTEGER</tt> records +within <tt>CONSTANTS_BLOCK</tt> blocks. +</p> + +</div> + + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="ir_blocks">LLVM IR Blocks</a></div> + +<div class="doc_text"> + +<p> +LLVM IR is defined with the following blocks: +</p> + +<ul> +<li>8 — <a href="#MODULE_BLOCK"><tt>MODULE_BLOCK</tt></a> — This is the top-level block that + contains the entire module, and describes a variety of per-module + information.</li> +<li>9 — <a href="#PARAMATTR_BLOCK"><tt>PARAMATTR_BLOCK</tt></a> — This enumerates the parameter + attributes.</li> +<li>10 — <a href="#TYPE_BLOCK"><tt>TYPE_BLOCK</tt></a> — This describes all of the types in + the module.</li> +<li>11 — <a href="#CONSTANTS_BLOCK"><tt>CONSTANTS_BLOCK</tt></a> — This describes constants for a + module or function.</li> +<li>12 — <a href="#FUNCTION_BLOCK"><tt>FUNCTION_BLOCK</tt></a> — This describes a function + body.</li> +<li>13 — <a href="#TYPE_SYMTAB_BLOCK"><tt>TYPE_SYMTAB_BLOCK</tt></a> — This describes the type symbol + table.</li> +<li>14 — <a href="#VALUE_SYMTAB_BLOCK"><tt>VALUE_SYMTAB_BLOCK</tt></a> — This describes a value symbol + table.</li> +<li>15 — <a href="#METADATA_BLOCK"><tt>METADATA_BLOCK</tt></a> — This describes metadata items.</li> +<li>16 — <a href="#METADATA_ATTACHMENT"><tt>METADATA_ATTACHMENT</tt></a> — This contains records associating metadata with function instruction values.</li> +</ul> + +</div> + +<!-- ======================================================================= --> +<div class="doc_subsection"><a name="MODULE_BLOCK">MODULE_BLOCK Contents</a> +</div> + +<div class="doc_text"> + +<p>The <tt>MODULE_BLOCK</tt> block (id 8) is the top-level block for LLVM +bitcode files, and each bitcode file must contain exactly one. In +addition to records (described below) containing information +about the module, a <tt>MODULE_BLOCK</tt> block may contain the +following sub-blocks: +</p> + +<ul> +<li><a href="#BLOCKINFO"><tt>BLOCKINFO</tt></a></li> +<li><a href="#PARAMATTR_BLOCK"><tt>PARAMATTR_BLOCK</tt></a></li> +<li><a href="#TYPE_BLOCK"><tt>TYPE_BLOCK</tt></a></li> +<li><a href="#TYPE_SYMTAB_BLOCK"><tt>TYPE_SYMTAB_BLOCK</tt></a></li> +<li><a href="#VALUE_SYMTAB_BLOCK"><tt>VALUE_SYMTAB_BLOCK</tt></a></li> +<li><a href="#CONSTANTS_BLOCK"><tt>CONSTANTS_BLOCK</tt></a></li> +<li><a href="#FUNCTION_BLOCK"><tt>FUNCTION_BLOCK</tt></a></li> +<li><a href="#METADATA_BLOCK"><tt>METADATA_BLOCK</tt></a></li> +</ul> + +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="MODULE_CODE_VERSION">MODULE_CODE_VERSION Record</a> +</div> + +<div class="doc_text"> + +<p><tt>[VERSION, version#]</tt></p> + +<p>The <tt>VERSION</tt> record (code 1) contains a single value +indicating the format version. Only version 0 is supported at this +time.</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="MODULE_CODE_TRIPLE">MODULE_CODE_TRIPLE Record</a> +</div> + +<div class="doc_text"> +<p><tt>[TRIPLE, ...string...]</tt></p> + +<p>The <tt>TRIPLE</tt> record (code 2) contains a variable number of +values representing the bytes of the <tt>target triple</tt> +specification string.</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="MODULE_CODE_DATALAYOUT">MODULE_CODE_DATALAYOUT Record</a> +</div> + +<div class="doc_text"> +<p><tt>[DATALAYOUT, ...string...]</tt></p> + +<p>The <tt>DATALAYOUT</tt> record (code 3) contains a variable number of +values representing the bytes of the <tt>target datalayout</tt> +specification string.</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="MODULE_CODE_ASM">MODULE_CODE_ASM Record</a> +</div> + +<div class="doc_text"> +<p><tt>[ASM, ...string...]</tt></p> + +<p>The <tt>ASM</tt> record (code 4) contains a variable number of +values representing the bytes of <tt>module asm</tt> strings, with +individual assembly blocks separated by newline (ASCII 10) characters.</p> +</div> + +<!-- _______________________________________________________________________ --> +<div class="doc_subsubsection"><a name="MODULE_CODE_SECTIONNAME">MODULE_CODE_SECTIONNAME Record</a> +</div> + +<div class="doc_text"> +<p><tt>[SECTIONNAME, ...string...]</tt></p> + +<p>The <tt>SECTIONNAME</tt> record (code 5) contains a variable number +of values representing the bytes of a single section name +string. There should be one <tt>SECTIONNAME</tt> record for each +section name referenced (e.g., in global variable or function +<tt>section</tt> attribu |