<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Precompiled Header and Modules Internals</title>
  <link type="text/css" rel="stylesheet" href="../menu.css">
  <link type="text/css" rel="stylesheet" href="../content.css">
  <style type="text/css">
    td {
    vertical-align: top;
    }
  </style>
</head>

<body>

<!--#include virtual="../menu.html.incl"-->

<div id="content">

<h1>Precompiled Header and Modules Internals</h1>

  <p>This document describes the design and implementation of Clang's
  precompiled headers (PCH) and modules. If you are interested in the end-user
  view, please see the <a
   href="UsersManual.html#precompiledheaders">User's Manual</a>.</p>

  <p><b>Table of Contents</b></p>
  <ul>
    <li><a href="#usage">Using Precompiled Headers with
    <tt>clang</tt></a></li>
    <li><a href="#philosophy">Design Philosophy</a></li>
    <li><a href="#contents">AST File Contents</a>
      <ul>
        <li><a href="#metadata">Metadata Block</a></li>
        <li><a href="#sourcemgr">Source Manager Block</a></li>
        <li><a href="#preprocessor">Preprocessor Block</a></li>
        <li><a href="#types">Types Block</a></li>
        <li><a href="#decls">Declarations Block</a></li>
        <li><a href="#stmt">Statements and Expressions</a></li>
        <li><a href="#idtable">Identifier Table Block</a></li>
        <li><a href="#method-pool">Method Pool Block</a></li>
      </ul>
    </li>
    <li><a href="#tendrils">AST Reader Integration Points</a></li>
    <li><a href="#chained">Chained precompiled headers</a></li>
    <li><a href="#modules">Modules</a></li>
</ul>
    
<h2 id="usage">Using Precompiled Headers with <tt>clang</tt></h2>

<p>The Clang compiler frontend, <tt>clang -cc1</tt>, supports two command line
options for generating and using PCH files.</p>

<p>To generate PCH files using <tt>clang -cc1</tt>, use the option
<b><tt>-emit-pch</tt></b>:</p>

<pre> $ clang -cc1 test.h -emit-pch -o test.h.pch </pre>

<p>This option is transparently used by <tt>clang</tt> when generating
PCH files. The resulting PCH file contains the serialized form of the
compiler's internal representation after it has completed parsing and
semantic analysis. The PCH file can then be used as a prefix header
with the <b><tt>-include-pch</tt></b> option:</p>

<pre>
  $ clang -cc1 -include-pch test.h.pch test.c -o test.s
</pre>

<h2 id="philosophy">Design Philosophy</h2>
  
<p>Precompiled headers are meant to improve overall compile times for
  projects, so the design of precompiled headers is entirely driven by
  performance concerns. The use case for precompiled headers is
  relatively simple: when there is a common set of headers that is
  included in nearly every source file in the project, we
  <i>precompile</i> that bundle of headers into a single precompiled
  header (PCH file). Then, when compiling the source files in the
  project, we load the PCH file first (as a prefix header), which acts
  as a stand-in for that bundle of headers.</p>

<p>A precompiled header implementation improves performance when:</p>
<ul>
  <li>Loading the PCH file is significantly faster than re-parsing the
  bundle of headers stored within the PCH file. Thus, a precompiled
  header design attempts to minimize the cost of reading the PCH
  file. Ideally, this cost should not vary with the size of the
  precompiled header file.</li>
  
  <li>The cost of generating the PCH file initially is not so large
  that it counters the per-source-file performance improvement due to
  eliminating the need to parse the bundled headers in the first
  place. This is particularly important on multi-core systems, because
  PCH file generation serializes the build when all compilations
  require the PCH file to be up-to-date.</li>
</ul>

<p>Modules, as implemented in Clang, use the same mechanisms as
precompiled headers to save a serialized AST file (one per module) and
to load those AST files. From an implementation standpoint, modules are
a generalization of precompiled headers that lifts a number of
restrictions placed on precompiled headers: in particular, that a
translation unit can use only one precompiled header, and that it must
be included at the beginning of the translation unit. The extensions to
the AST file format required for modules are discussed in the section on
<a href="#modules">modules</a>.</p>

<p>Clang's AST files are designed with a compact on-disk
representation, which minimizes both creation time and the time
required to initially load the AST file. The AST file itself contains
a serialized representation of Clang's abstract syntax trees and
supporting data structures, stored using the same compressed bitstream
as <a href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitcode
file format</a>.</p>

<p>Clang's AST files are loaded "lazily" from disk. When an
AST file is initially loaded, Clang reads only a small amount of data
from the AST file to establish where certain important data structures
are stored. The amount of data read in this initial load is
independent of the size of the AST file, such that a larger AST file
does not lead to longer AST load times. The actual header data in the
AST file--macros, functions, variables, types, etc.--is loaded only
when it is referenced from the user's code, at which point only that
entity (and those entities it depends on) are deserialized from the
AST file. With this approach, the cost of using an AST file
for a translation unit is proportional to the amount of code actually
used from the AST file, rather than being proportional to the size of
the AST file itself.</p> 
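<p>The lazy-loading idea can be illustrated with a small sketch. This is
not Clang's bitstream format or its reader: the eight-byte header, the
offset index, and the names <code>LazyEntityFile</code> and
<code>write_toy_file</code> are all assumptions made up for the example.
It shows only that the initial load touches a fixed number of bytes, and
that each entity is read from the file the first time it is requested.</p>

```python
import io
import struct


def write_toy_file(entities):
    """Build a toy file: header (count, index offset), offset index, payloads."""
    payloads = [text.encode("utf-8") for text in entities]
    index_offset = 8
    data_offset = index_offset + 8 * len(payloads)
    index = b""
    body = b""
    for payload in payloads:
        # Each index slot records where one payload starts and how long it is.
        index += struct.pack("=II", data_offset + len(body), len(payload))
        body += payload
    return struct.pack("=II", len(payloads), index_offset) + index + body


class LazyEntityFile:
    """Hypothetical reader that deserializes entities only on demand."""

    def __init__(self, stream):
        self.stream = stream
        self.bytes_loaded = 0
        self.cache = {}
        # Initial load: 8 bytes, no matter how many entities the file holds.
        self.count, self.index_offset = struct.unpack("=II", self._read(0, 8))

    def _read(self, offset, size):
        self.stream.seek(offset)
        self.bytes_loaded += size
        return self.stream.read(size)

    def get(self, entity_id):
        """Read one entity on first use; later requests are served from cache."""
        if entity_id not in self.cache:
            slot = self._read(self.index_offset + 8 * entity_id, 8)
            offset, length = struct.unpack("=II", slot)
            self.cache[entity_id] = self._read(offset, length).decode("utf-8")
        return self.cache[entity_id]
```

<p>In this sketch, <code>bytes_loaded</code> grows only with the entities
that are actually requested, which mirrors the claim that the cost of
using an AST file tracks the amount of code used from it.</p>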

<p>When given the <code>-print-stats</code> option, Clang produces
statistics describing how much of the AST file was actually
loaded from disk. For a simple "Hello, World!" program that includes
the Apple <code>Cocoa.h</code> header (which is built as a precompiled
header), this option illustrates how little of the actual precompiled
header is required:</p>

<pre>
*** PCH Statistics:
  933 stat cache hits
  4 stat cache misses
  895/39981 source location entries read (2.238563%)
  19/15315 types read (0.124061%)
  20/82685 declarations read (0.024188%)
  154/58070 identifiers read (0.265197%)
  0/7260 selectors read (0.000000%)
  0/30842 statements read (0.000000%)
  4/8400 macros read (0.047619%)
  1/4995 lexical declcontexts read (0.020020%)
  0/4413 visible declcontexts read (0.000000%)
  0/7230 method pool entries read (0.000000%)
  0 method pool misses
</pre>

<p>For this small program, only a tiny fraction of the source
locations, types, declarations, identifiers, and macros were actually
deserialized from the precompiled header. These statistics can be
useful to determine whether the AST file implementation can
be improved by making more of the implementation lazy.</p>
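<p>When comparing such reports across builds, it can help to extract the
read/total pairs programmatically. The following is a minimal sketch
that assumes the report keeps the "N/M category read (P%)" line shape
shown above; the function name <code>parse_read_fractions</code> is
hypothetical and is not part of Clang.</p>

```python
import re

# Matches lines such as "  19/15315 types read (0.124061%)".
STAT_LINE = re.compile(r"^\s*(\d+)/(\d+) (.+) read \(")


def parse_read_fractions(report):
    """Return {category: (read, total, percent)} for each 'N/M ... read' line."""
    result = {}
    for line in report.splitlines():
        match = STAT_LINE.match(line)
        if match is None:
            continue  # e.g. the header line or the stat cache counters
        read, total = int(match.group(1)), int(match.group(2))
        percent = 100.0 * read / total if total else 0.0
        result[match.group(3)] = (read, total, percent)
    return result
```

<p>Categories with the highest percentages are the ones where the
reader is least lazy for a given program.</p>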

<p>Precompiled headers can be chained. When you create a PCH while
including an existing PCH, Clang can create the new PCH by referencing
the original file and only writing the new data to the new file. For
example, you could create a PCH out of all the headers that are very
commonly used throughout your project, and then create a PCH for every
single source file in the project that includes the code that is
specific to that file, so that recompiling the file itself is very fast,
without duplicating the data from the common headers for every
file. The mechanisms behind chained precompiled headers are discussed
in a <a href="#chained">later section</a>.</p>

<h2 id="contents">AST File Contents</h2>

<img src="PCHLayout.png" style="float:right" alt="Precompiled header layout">

<p>Clang's AST files are organized into several different
blocks, each of which contains the serialized representation of a part
of Clang's internal representation. Each of the blocks corresponds to
either a block or a record within <a
 href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitstream
format</a>. The contents of each of these logical blocks are described
below.</p>

<p>For a given AST file, the <a
href="http://llvm.org/cmds/llvm-bcanalyzer.html"><code>llvm-bcanalyzer</code></a>
utility can be used to examine the actual structure of the bitstream
for the AST file. This information can be used both to help
understand the structure of the AST file and to isolate
areas where AST files can still be optimized, e.g., through
the introduction of abbreviations.</p>

<h3 id="metadata">Metadata Block</h3>

<p>The metadata block contains several records that provide
information about how the AST file was built. This metadata
is primarily used to validate the use of an AST file. For
example, a precompiled header built for a 32-bit x86 target cannot be used
when compiling for a 64-bit x86 target. The metadata block contains
information about:</p>

<dl>
  <dt>Language options</dt>
  <dd>Describes the particular language dialect used to compile the
AST file, including major options (e.g., Objective-C support) and more
minor options (e.g., support for "//" comments). The contents of this
record correspond to the <code>LangOptions</code> class.</dd>
  
  <dt>Target architecture</dt>
  <dd>The target triple that describes the architecture, platform, and
ABI for which the AST file was generated, e.g.,
<code>i386-apple-darwin9</code>.</dd>
  
  <dt>AST version</dt>
  <dd>The major and minor version numbers of the AST file
format. Changes in the minor version number should not affect backward
compatibility, while changes in the major version number imply that a
newer compiler cannot read an older precompiled header (and
vice-versa).</dd>

  <dt>Original file name</dt>
  <dd>The full path of the header that was used to generate the
AST file.</dd>

  <dt>Predefines buffer</dt>
  <dd>Although not explicitly stored as part of the metadata, the
predefines buffer is used in the validation of the AST file.
The predefines buffer itself contains code generated by the compiler
to initialize the preprocessor state according to the current target,
platform, and command-line options. For example, the predefines buffer
will contain "<code>#define __STDC__ 1</code>" when we are compiling C
without Microsoft extensions. The predefines buffer itself is stored
within the <a href="#sourcemgr">source manager block</a>, but its
contents are verified along with the rest of the metadata.</dd>

</dl>

<p>A chained PCH file (that is, one that references another PCH) and a
module (which may import other modules) have additional metadata
containing the list of all AST files that this AST file depends
on. Each of those files will be loaded along with this AST file.</p>

<p>For chained precompiled headers, the language options, target
architecture and predefines buffer data are taken from the end of the
chain, since they have to match anyway.</p>
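<p>The validation described in this section can be sketched as a
comparison between the metadata stored in the AST file and the metadata
of the current compilation. The class <code>AstFileMetadata</code>, its
fields, and <code>validation_errors</code> are hypothetical names for
illustration. Comparing the predefines buffer as an exact string is a
simplifying assumption, not a description of how Clang checks it.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AstFileMetadata:
    """Hypothetical stand-in for the records in the metadata block."""
    language_options: frozenset  # e.g. frozenset({"objc"})
    target_triple: str           # e.g. "i386-apple-darwin9"
    version_major: int
    version_minor: int
    predefines: str              # text of the predefines buffer


def validation_errors(stored, current):
    """List the reasons, if any, that a stored AST file cannot be reused."""
    errors = []
    # Only the major version gates compatibility; minor changes are tolerated.
    if stored.version_major != current.version_major:
        errors.append("AST format major version mismatch")
    if stored.target_triple != current.target_triple:
        errors.append("target mismatch: %s vs %s"
                      % (stored.target_triple, current.target_triple))
    if stored.language_options != current.language_options:
        errors.append("language options mismatch")
    if stored.predefines != current.predefines:
        errors.append("predefines buffer mismatch")
    return errors
```

<p>With this sketch, an AST file built for
<code>i386-apple-darwin9</code> and checked against a compilation for a
64-bit x86 triple yields a single target mismatch, while a difference in
the minor version alone yields no errors.</p>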

<h3 id="sourcemgr">Source Manager Block</h3>

<p>The source manager block contains the serialized representation of
Clang's <a
 href="InternalsManual.html#SourceLocation">SourceManager</a> class,
which handles the mapping from source locations (as represented in
Clang's abstract syntax tree) into actual column/line positions within
a source file or macro instantiation. The AST file's
representation of the source manager also includes information about
all of the headers that were (transitively) included when building the
AST file.</p>

<p>The bulk of the source manager block is dedicated to information
about the various files, buffers, and macro instantiations to which
a source location can refer. Each of these is referenced by a numeric
"file ID", which is a unique number (allocated starting at 1) stored
in the source location. Clang serializes the information for each kind
of file ID, along with an index that maps file IDs to the position
within the AST file where the information about that file ID is
stored. The data associated with a file ID is loaded only when
required by the front end, e.g., to emit a diagnostic that includes a
macro instantiation history inside the header itself.</p>

<p>The source manager block also contains information about all of the
headers that were included when building the AST file. This
includes information about the controlling macro for the header (e.g.,
when the preprocessor identified that the contents of the header
depended on a macro like <code>LLVM_CLANG_SOURCEMANAGER_H</code>)
along with a cached version of the results of the <code>stat()</code>
system calls performed when building the AST file. The
latter is particularly useful in reducing system time when searching
for include files.</p>

<h3 id="preprocessor">Preprocessor Block</h3>

<p>The preprocessor block contains the serialized representation of
the preprocessor. Specifically, it contains all of the macros that
have been defined by the end of the header used to build the
AST file, along with the token sequences that comprise each
macro. The macro definitions are only read from the AST file when the
name of the macro first occurs in the program. This lazy loading of
macro definitions is triggered by lookups into the <a
 href="#idtable">identifier table</a>.</p>
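<p>As a rough illustration of this trigger, the sketch below defers
splitting a macro body into tokens until the macro's name is first
looked up. The class <code>LazyMacroTable</code> and its dictionary of
serialized bodies are assumptions standing in for the identifier table
and the AST file; Clang's actual data structures differ.</p>

```python
class LazyMacroTable:
    """Hypothetical identifier lookup that loads macro bodies on first use."""

    def __init__(self, serialized_macros):
        # Maps a macro name to its stored body text (stand-in for the AST file).
        self._serialized = serialized_macros
        self._loaded = {}
        self.macros_read = 0

    def lookup(self, name):
        """Return the macro's token list, or None if the name is not a macro."""
        if name in self._loaded:
            return self._loaded[name]
        body = self._serialized.get(name)
        if body is None:
            return None
        # First occurrence of the name: "deserialize" the token sequence now.
        self.macros_read += 1
        self._loaded[name] = body.split()
        return self._loaded[name]
```

<p>Macros whose names never occur in the program are never tokenized,
so <code>macros_read</code> stays small for programs that use few of the
header's macros.</p>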

<h3 id="types">Types Block</h3>

<p>The types block contains the serialized representation of all of
the types referenced in the translation unit. Each Clang type node
(<code>PointerType</code>, <code>FunctionProtoType</code>, etc.) has a
corresponding record type in the AST file. When types are deserialized
from the AST file, the data within the record is used to
reconstruct the appropriate type node using the AST context.</p>

<p>Each type has a unique type ID, which is an integer that uniquely
identifies that type. Type ID 0 represents the NULL type, type IDs
less than <code>NUM_PREDEF_TYPE_IDS</code> represent predefined types
(<code>void</code>, <code>float</code>, etc.), while other
"user-defined" type IDs are assigned consecutively from
<code>NUM_PREDEF_TYPE_IDS</code> upward as the types are encountered.
The AST file has an associated mappi