diff options
author | mike-m <mikem.llvm@gmail.com> | 2010-05-07 00:28:04 +0000 |
---|---|---|
committer | mike-m <mikem.llvm@gmail.com> | 2010-05-07 00:28:04 +0000 |
commit | e2c3a49c8029ebd9ef530101cc24c66562e3dff5 (patch) | |
tree | 91bf9600cc8df90cf99751a8f8bafc317cffc91e /docs/tutorial | |
parent | c10b5afbe8138b0fdf3af4ed3e1ddf96cf3cb4cb (diff) |
Revert r103213. It broke several sections of live website.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@103219 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs/tutorial')
-rw-r--r-- | docs/tutorial/LangImpl1.html | 348 | ||||
-rw-r--r-- | docs/tutorial/LangImpl2.html | 1233 | ||||
-rw-r--r-- | docs/tutorial/LangImpl3.html | 1269 | ||||
-rw-r--r-- | docs/tutorial/LangImpl4.html | 1132 | ||||
-rw-r--r-- | docs/tutorial/LangImpl5-cfg.png | bin | 0 -> 38586 bytes | |||
-rw-r--r-- | docs/tutorial/LangImpl5.html | 1777 | ||||
-rw-r--r-- | docs/tutorial/LangImpl6.html | 1814 | ||||
-rw-r--r-- | docs/tutorial/LangImpl7.html | 2164 | ||||
-rw-r--r-- | docs/tutorial/LangImpl8.html | 365 | ||||
-rw-r--r-- | docs/tutorial/Makefile | 28 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl1.html | 365 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl2.html | 1045 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl3.html | 1093 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl4.html | 1029 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl5.html | 1569 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl6.html | 1574 | ||||
-rw-r--r-- | docs/tutorial/OCamlLangImpl7.html | 1907 | ||||
-rw-r--r-- | docs/tutorial/index.html | 48 |
18 files changed, 18760 insertions, 0 deletions
diff --git a/docs/tutorial/LangImpl1.html b/docs/tutorial/LangImpl1.html new file mode 100644 index 0000000000..66843db5d3 --- /dev/null +++ b/docs/tutorial/LangImpl1.html @@ -0,0 +1,348 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> + +<html> +<head> + <title>Kaleidoscope: Tutorial Introduction and the Lexer</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="author" content="Chris Lattner"> + <link rel="stylesheet" href="../llvm.css" type="text/css"> +</head> + +<body> + +<div class="doc_title">Kaleidoscope: Tutorial Introduction and the Lexer</div> + +<ul> +<li><a href="index.html">Up to Tutorial Index</a></li> +<li>Chapter 1 + <ol> + <li><a href="#intro">Tutorial Introduction</a></li> + <li><a href="#language">The Basic Language</a></li> + <li><a href="#lexer">The Lexer</a></li> + </ol> +</li> +<li><a href="LangImpl2.html">Chapter 2</a>: Implementing a Parser and AST</li> +</ul> + +<div class="doc_author"> + <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="intro">Tutorial Introduction</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial +runs through the implementation of a simple language, showing how fun and +easy it can be. This tutorial will get you up and started as well as help to +build a framework you can extend to other languages. The code in this tutorial +can also be used as a playground to hack on other LLVM specific things. +</p> + +<p> +The goal of this tutorial is to progressively unveil our language, describing +how it is built up over time. This will let us cover a fairly broad range of +language design and LLVM-specific usage issues, showing and explaining the code +for it all along the way, without overwhelming you with tons of details up +front.</p> + +<p>It is useful to point out ahead of time that this tutorial is really about +teaching compiler techniques and LLVM specifically, <em>not</em> about teaching +modern and sane software engineering principles. In practice, this means that +we'll take a number of shortcuts to simplify the exposition. For example, the +code leaks memory, uses global variables all over the place, doesn't use nice +design patterns like <a +href="http://en.wikipedia.org/wiki/Visitor_pattern">visitors</a>, etc... but it +is very simple. If you dig in and use the code as a basis for future projects, +fixing these deficiencies shouldn't be hard.</p> + +<p>I've tried to put this tutorial together in a way that makes chapters easy to +skip over if you are already familiar with or are uninterested in the various +pieces. The structure of the tutorial is: +</p> + +<ul> +<li><b><a href="#language">Chapter #1</a>: Introduction to the Kaleidoscope +language, and the definition of its Lexer</b> - This shows where we are going +and the basic functionality that we want it to do. In order to make this +tutorial maximally understandable and hackable, we choose to implement +everything in C++ instead of using lexer and parser generators. LLVM obviously +works just fine with such tools, feel free to use one if you prefer.</li> +<li><b><a href="LangImpl2.html">Chapter #2</a>: Implementing a Parser and +AST</b> - With the lexer in place, we can talk about parsing techniques and +basic AST construction. This tutorial describes recursive descent parsing and +operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific, +the code doesn't even link in LLVM at this point. :)</li> +<li><b><a href="LangImpl3.html">Chapter #3</a>: Code generation to LLVM IR</b> - +With the AST ready, we can show off how easy generation of LLVM IR really +is.</li> +<li><b><a href="LangImpl4.html">Chapter #4</a>: Adding JIT and Optimizer +Support</b> - Because a lot of people are interested in using LLVM as a JIT, +we'll dive right into it and show you the 3 lines it takes to add JIT support. +LLVM is also useful in many other ways, but this is one simple and "sexy" way +to shows off its power. :)</li> +<li><b><a href="LangImpl5.html">Chapter #5</a>: Extending the Language: Control +Flow</b> - With the language up and running, we show how to extend it with +control flow operations (if/then/else and a 'for' loop). This gives us a chance +to talk about simple SSA construction and control flow.</li> +<li><b><a href="LangImpl6.html">Chapter #6</a>: Extending the Language: +User-defined Operators</b> - This is a silly but fun chapter that talks about +extending the language to let the user program define their own arbitrary +unary and binary operators (with assignable precedence!). This lets us build a +significant piece of the "language" as library routines.</li> +<li><b><a href="LangImpl7.html">Chapter #7</a>: Extending the Language: Mutable +Variables</b> - This chapter talks about adding user-defined local variables +along with an assignment operator. The interesting part about this is how +easy and trivial it is to construct SSA form in LLVM: no, LLVM does <em>not</em> +require your front-end to construct SSA form!</li> +<li><b><a href="LangImpl8.html">Chapter #8</a>: Conclusion and other useful LLVM +tidbits</b> - This chapter wraps up the series by talking about potential +ways to extend the language, but also includes a bunch of pointers to info about +"special topics" like adding garbage collection support, exceptions, debugging, +support for "spaghetti stacks", and a bunch of other tips and tricks.</li> + +</ul> + +<p>By the end of the tutorial, we'll have written a bit less than 700 lines of +non-comment, non-blank, lines of code. With this small amount of code, we'll +have built up a very reasonable compiler for a non-trivial language including +a hand-written lexer, parser, AST, as well as code generation support with a JIT +compiler. While other systems may have interesting "hello world" tutorials, +I think the breadth of this tutorial is a great testament to the strengths of +LLVM and why you should consider it if you're interested in language or compiler +design.</p> + +<p>A note about this tutorial: we expect you to extend the language and play +with it on your own. Take the code and go crazy hacking away at it, compilers +don't need to be scary creatures - it can be a lot of fun to play with +languages!</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="language">The Basic Language</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>This tutorial will be illustrated with a toy language that we'll call +"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>" (derived +from "meaning beautiful, form, and view"). +Kaleidoscope is a procedural language that allows you to define functions, use +conditionals, math, etc. Over the course of the tutorial, we'll extend +Kaleidoscope to support the if/then/else construct, a for loop, user defined +operators, JIT compilation with a simple command line interface, etc.</p> + +<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a +64-bit floating point type (aka 'double' in C parlance). As such, all values +are implicitly double precision and the language doesn't require type +declarations. This gives the language a very nice and simple syntax. For +example, the following simple example computes <a +href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers:</a></p> + +<div class="doc_code"> +<pre> +# Compute the x'th fibonacci number. +def fib(x) + if x < 3 then + 1 + else + fib(x-1)+fib(x-2) + +# This expression will compute the 40th number. +fib(40) +</pre> +</div> + +<p>We also allow Kaleidoscope to call into standard library functions (the LLVM +JIT makes this completely trivial). This means that you can use the 'extern' +keyword to define a function before you use it (this is also useful for mutually +recursive functions). For example:</p> + +<div class="doc_code"> +<pre> +extern sin(arg); +extern cos(arg); +extern atan2(arg1 arg2); + +atan2(sin(.4), cos(42)) +</pre> +</div> + +<p>A more interesting example is included in Chapter 6 where we write a little +Kaleidoscope application that <a href="LangImpl6.html#example">displays +a Mandelbrot Set</a> at various levels of magnification.</p> + +<p>Lets dive into the implementation of this language!</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="lexer">The Lexer</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>When it comes to implementing a language, the first thing needed is +the ability to process a text file and recognize what it says. The traditional +way to do this is to use a "<a +href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner') +to break the input up into "tokens". Each token returned by the lexer includes +a token code and potentially some metadata (e.g. the numeric value of a number). +First, we define the possibilities: +</p> + +<div class="doc_code"> +<pre> +// The lexer returns tokens [0-255] if it is an unknown character, otherwise one +// of these for known things. +enum Token { + tok_eof = -1, + + // commands + tok_def = -2, tok_extern = -3, + + // primary + tok_identifier = -4, tok_number = -5, +}; + +static std::string IdentifierStr; // Filled in if tok_identifier +static double NumVal; // Filled in if tok_number +</pre> +</div> + +<p>Each token returned by our lexer will either be one of the Token enum values +or it will be an 'unknown' character like '+', which is returned as its ASCII +value. If the current token is an identifier, the <tt>IdentifierStr</tt> +global variable holds the name of the identifier. If the current token is a +numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use +global variables for simplicity, this is not the best choice for a real language +implementation :). +</p> + +<p>The actual implementation of the lexer is a single function named +<tt>gettok</tt>. The <tt>gettok</tt> function is called to return the next token +from standard input. Its definition starts as:</p> + +<div class="doc_code"> +<pre> +/// gettok - Return the next token from standard input. +static int gettok() { + static int LastChar = ' '; + + // Skip any whitespace. + while (isspace(LastChar)) + LastChar = getchar(); +</pre> +</div> + +<p> +<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read +characters one at a time from standard input. It eats them as it recognizes +them and stores the last character read, but not processed, in LastChar. The +first thing that it has to do is ignore whitespace between tokens. This is +accomplished with the loop above.</p> + +<p>The next thing <tt>gettok</tt> needs to do is recognize identifiers and +specific keywords like "def". Kaleidoscope does this with this simple loop:</p> + +<div class="doc_code"> +<pre> + if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* + IdentifierStr = LastChar; + while (isalnum((LastChar = getchar()))) + IdentifierStr += LastChar; + + if (IdentifierStr == "def") return tok_def; + if (IdentifierStr == "extern") return tok_extern; + return tok_identifier; + } +</pre> +</div> + +<p>Note that this code sets the '<tt>IdentifierStr</tt>' global whenever it +lexes an identifier. Also, since language keywords are matched by the same +loop, we handle them here inline. Numeric values are similar:</p> + +<div class="doc_code"> +<pre> + if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ + std::string NumStr; + do { + NumStr += LastChar; + LastChar = getchar(); + } while (isdigit(LastChar) || LastChar == '.'); + + NumVal = strtod(NumStr.c_str(), 0); + return tok_number; + } +</pre> +</div> + +<p>This is all pretty straight-forward code for processing input. When reading +a numeric value from input, we use the C <tt>strtod</tt> function to convert it +to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing +sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as +if you typed in "1.23". Feel free to extend it :). Next we handle comments: +</p> + +<div class="doc_code"> +<pre> + if (LastChar == '#') { + // Comment until end of line. + do LastChar = getchar(); + while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); + + if (LastChar != EOF) + return gettok(); + } +</pre> +</div> + +<p>We handle comments by skipping to the end of the line and then return the +next token. Finally, if the input doesn't match one of the above cases, it is +either an operator character like '+' or the end of the file. These are handled +with this code:</p> + +<div class="doc_code"> +<pre> + // Check for end of file. Don't eat the EOF. + if (LastChar == EOF) + return tok_eof; + + // Otherwise, just return the character as its ascii value. + int ThisChar = LastChar; + LastChar = getchar(); + return ThisChar; +} +</pre> +</div> + +<p>With this, we have the complete lexer for the basic Kaleidoscope language +(the <a href="LangImpl2.html#code">full code listing</a> for the Lexer is +available in the <a href="LangImpl2.html">next chapter</a> of the tutorial). +Next we'll <a href="LangImpl2.html">build a simple parser that uses this to +build an Abstract Syntax Tree</a>. When we have that, we'll include a driver +so that you can use the lexer and parser together. +</p> + +<a href="LangImpl2.html">Next: Implementing a Parser and AST</a> +</div> + +<!-- *********************************************************************** --> +<hr> +<address> + <a href="http://jigsaw.w3.org/css-validator/check/referer"><img + src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> + <a href="http://validator.w3.org/check/referer"><img + src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> + + <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> + <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> + Last modified: $Date$ +</address> +</body> +</html> diff --git a/docs/tutorial/LangImpl2.html b/docs/tutorial/LangImpl2.html new file mode 100644 index 0000000000..9c13b486fa --- /dev/null +++ b/docs/tutorial/LangImpl2.html @@ -0,0 +1,1233 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> + +<html> +<head> + <title>Kaleidoscope: Implementing a Parser and AST</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="author" content="Chris Lattner"> + <link rel="stylesheet" href="../llvm.css" type="text/css"> +</head> + +<body> + +<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div> + +<ul> +<li><a href="index.html">Up to Tutorial Index</a></li> +<li>Chapter 2 + <ol> + <li><a href="#intro">Chapter 2 Introduction</a></li> + <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li> + <li><a href="#parserbasics">Parser Basics</a></li> + <li><a href="#parserprimexprs">Basic Expression Parsing</a></li> + <li><a href="#parserbinops">Binary Expression Parsing</a></li> + <li><a href="#parsertop">Parsing the Rest</a></li> + <li><a href="#driver">The Driver</a></li> + <li><a href="#conclusions">Conclusions</a></li> + <li><a href="#code">Full Code Listing</a></li> + </ol> +</li> +<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li> +</ul> + +<div class="doc_author"> + <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language +with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in +<a href="LangImpl1.html">Chapter 1</a>, to build a full <a +href="http://en.wikipedia.org/wiki/Parsing">parser</a> for +our Kaleidoscope language. Once we have a parser, we'll define and build an <a +href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax +Tree</a> (AST).</p> + +<p>The parser we will build uses a combination of <a +href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent +Parsing</a> and <a href= +"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence +Parsing</a> to parse the Kaleidoscope language (the latter for +binary expressions and the former for everything else). Before we get to +parsing though, lets talk about the output of the parser: the Abstract Syntax +Tree.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>The AST for a program captures its behavior in such a way that it is easy for +later stages of the compiler (e.g. code generation) to interpret. We basically +want one object for each construct in the language, and the AST should closely +model the language. In Kaleidoscope, we have expressions, a prototype, and a +function object. We'll start with expressions first:</p> + +<div class="doc_code"> +<pre> +/// ExprAST - Base class for all expression nodes. +class ExprAST { +public: + virtual ~ExprAST() {} +}; + +/// NumberExprAST - Expression class for numeric literals like "1.0". +class NumberExprAST : public ExprAST { + double Val; +public: + NumberExprAST(double val) : Val(val) {} +}; +</pre> +</div> + +<p>The code above shows the definition of the base ExprAST class and one +subclass which we use for numeric literals. The important thing to note about +this code is that the NumberExprAST class captures the numeric value of the +literal as an instance variable. This allows later phases of the compiler to +know what the stored numeric value is.</p> + +<p>Right now we only create the AST, so there are no useful accessor methods on +them. It would be very easy to add a virtual method to pretty print the code, +for example. Here are the other expression AST node definitions that we'll use +in the basic form of the Kaleidoscope language: +</p> + +<div class="doc_code"> +<pre> +/// VariableExprAST - Expression class for referencing a variable, like "a". +class VariableExprAST : public ExprAST { + std::string Name; +public: + VariableExprAST(const std::string &name) : Name(name) {} +}; + +/// BinaryExprAST - Expression class for a binary operator. +class BinaryExprAST : public ExprAST { + char Op; + ExprAST *LHS, *RHS; +public: + BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) + : Op(op), LHS(lhs), RHS(rhs) {} +}; + +/// CallExprAST - Expression class for function calls. +class CallExprAST : public ExprAST { + std::string Callee; + std::vector<ExprAST*> Args; +public: + CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) + : Callee(callee), Args(args) {} +}; +</pre> +</div> + +<p>This is all (intentionally) rather straight-forward: variables capture the +variable name, binary operators capture their opcode (e.g. '+'), and calls +capture a function name as well as a list of any argument expressions. One thing +that is nice about our AST is that it captures the language features without +talking about the syntax of the language. Note that there is no discussion about +precedence of binary operators, lexical structure, etc.</p> + +<p>For our basic language, these are all of the expression nodes we'll define. +Because it doesn't have conditional control flow, it isn't Turing-complete; +we'll fix that in a later installment. The two things we need next are a way +to talk about the interface to a function, and a way to talk about functions +themselves:</p> + +<div class="doc_code"> +<pre> +/// PrototypeAST - This class represents the "prototype" for a function, +/// which captures its name, and its argument names (thus implicitly the number +/// of arguments the function takes). +class PrototypeAST { + std::string Name; + std::vector<std::string> Args; +public: + PrototypeAST(const std::string &name, const std::vector<std::string> &args) + : Name(name), Args(args) {} +}; + +/// FunctionAST - This class represents a function definition itself. +class FunctionAST { + PrototypeAST *Proto; + ExprAST *Body; +public: + FunctionAST(PrototypeAST *proto, ExprAST *body) + : Proto(proto), Body(body) {} +}; +</pre> +</div> + +<p>In Kaleidoscope, functions are typed with just a count of their arguments. +Since all values are double precision floating point, the type of each argument +doesn't need to be stored anywhere. In a more aggressive and realistic +language, the "ExprAST" class would probably have a type field.</p> + +<p>With this scaffolding, we can now talk about parsing expressions and function +bodies in Kaleidoscope.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserbasics">Parser Basics</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Now that we have an AST to build, we need to define the parser code to build +it. The idea here is that we want to parse something like "x+y" (which is +returned as three tokens by the lexer) into an AST that could be generated with +calls like this:</p> + +<div class="doc_code"> +<pre> + ExprAST *X = new VariableExprAST("x"); + ExprAST *Y = new VariableExprAST("y"); + ExprAST *Result = new BinaryExprAST('+', X, Y); +</pre> +</div> + +<p>In order to do this, we'll start by defining some basic helper routines:</p> + +<div class="doc_code"> +<pre> +/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current +/// token the parser is looking at. getNextToken reads another token from the +/// lexer and updates CurTok with its results. +static int CurTok; +static int getNextToken() { + return CurTok = gettok(); +} +</pre> +</div> + +<p> +This implements a simple token buffer around the lexer. This allows +us to look one token ahead at what the lexer is returning. Every function in +our parser will assume that CurTok is the current token that needs to be +parsed.</p> + +<div class="doc_code"> +<pre> + +/// Error* - These are little helper functions for error handling. +ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} +PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } +FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } +</pre> +</div> + +<p> +The <tt>Error</tt> routines are simple helper routines that our parser will use +to handle errors. The error recovery in our parser will not be the best and +is not particular user-friendly, but it will be enough for our tutorial. These +routines make it easier to handle errors in routines that have various return +types: they always return null.</p> + +<p>With these basic helper functions, we can implement the first +piece of our grammar: numeric literals.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserprimexprs">Basic Expression + Parsing</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>We start with numeric literals, because they are the simplest to process. +For each production in our grammar, we'll define a function which parses that +production. For numeric literals, we have: +</p> + +<div class="doc_code"> +<pre> +/// numberexpr ::= number +static ExprAST *ParseNumberExpr() { + ExprAST *Result = new NumberExprAST(NumVal); + getNextToken(); // consume the number + return Result; +} +</pre> +</div> + +<p>This routine is very simple: it expects to be called when the current token +is a <tt>tok_number</tt> token. It takes the current number value, creates +a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally +returns.</p> + +<p>There are some interesting aspects to this. The most important one is that +this routine eats all of the tokens that correspond to the production and +returns the lexer buffer with the next token (which is not part of the grammar +production) ready to go. This is a fairly standard way to go for recursive +descent parsers. For a better example, the parenthesis operator is defined like +this:</p> + +<div class="doc_code"> +<pre> +/// parenexpr ::= '(' expression ')' +static ExprAST *ParseParenExpr() { + getNextToken(); // eat (. + ExprAST *V = ParseExpression(); + if (!V) return 0; + + if (CurTok != ')') + return Error("expected ')'"); + getNextToken(); // eat ). + return V; +} +</pre> +</div> + +<p>This function illustrates a number of interesting things about the +parser:</p> + +<p> +1) It shows how we use the Error routines. When called, this function expects +that the current token is a '(' token, but after parsing the subexpression, it +is possible that there is no ')' waiting. For example, if the user types in +"(4 x" instead of "(4)", the parser should emit an error. Because errors can +occur, the parser needs a way to indicate that they happened: in our parser, we +return null on an error.</p> + +<p>2) Another interesting aspect of this function is that it uses recursion by +calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call +<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle +recursive grammars, and keeps each production very simple. Note that +parentheses do not cause construction of AST nodes themselves. While we could +do it this way, the most important role of parentheses are to guide the parser +and provide grouping. Once the parser constructs the AST, parentheses are not +needed.</p> + +<p>The next simple production is for handling variable references and function +calls:</p> + +<div class="doc_code"> +<pre> +/// identifierexpr +/// ::= identifier +/// ::= identifier '(' expression* ')' +static ExprAST *ParseIdentifierExpr() { + std::string IdName = IdentifierStr; + + getNextToken(); // eat identifier. + + if (CurTok != '(') // Simple variable ref. + return new VariableExprAST(IdName); + + // Call. + getNextToken(); // eat ( + std::vector<ExprAST*> Args; + if (CurTok != ')') { + while (1) { + ExprAST *Arg = ParseExpression(); + if (!Arg) return 0; + Args.push_back(Arg); + + if (CurTok == ')') break; + + if (CurTok != ',') + return Error("Expected ')' or ',' in argument list"); + getNextToken(); + } + } + + // Eat the ')'. + getNextToken(); + + return new CallExprAST(IdName, Args); +} +</pre> +</div> + +<p>This routine follows the same style as the other routines. (It expects to be +called if the current token is a <tt>tok_identifier</tt> token). It also has +recursion and error handling. One interesting aspect of this is that it uses +<em>look-ahead</em> to determine if the current identifier is a stand alone +variable reference or if it is a function call expression. It handles this by +checking to see if the token after the identifier is a '(' token, constructing +either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate. +</p> + +<p>Now that we have all of our simple expression-parsing logic in place, we can +define a helper function to wrap it together into one entry point. We call this +class of expressions "primary" expressions, for reasons that will become more +clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to +parse an arbitrary primary expression, we need to determine what sort of +expression it is:</p> + +<div class="doc_code"> +<pre> +/// primary +/// ::= identifierexpr +/// ::= numberexpr +/// ::= parenexpr +static ExprAST *ParsePrimary() { + switch (CurTok) { + default: return Error("unknown token when expecting an expression"); + case tok_identifier: return ParseIdentifierExpr(); + case tok_number: return ParseNumberExpr(); + case '(': return ParseParenExpr(); + } +} +</pre> +</div> + +<p>Now that you see the definition of this function, it is more obvious why we +can assume the state of CurTok in the various functions. This uses look-ahead +to determine which sort of expression is being inspected, and then parses it +with a function call.</p> + +<p>Now that basic expressions are handled, we need to handle binary expressions. +They are a bit more complex.</p> + +</div> + +<!-- *********************************************************************** --> +<div class="doc_section"><a name="parserbinops">Binary Expression + Parsing</a></div> +<!-- *********************************************************************** --> + +<div class="doc_text"> + +<p>Binary expressions are significantly harder to parse because they are often +ambiguous. For example, when given the string "x+y*z", the parser can choose +to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from +mathematics, we expect the later parse, because "*" (multiplication) has +higher <em>precedence</em> than "+" (addition).</p> + +<p>There are many ways to handle this, but an elegant and efficient way is to +use <a href= +"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence +Parsing</a>. This parsing technique uses the precedence of binary operators to +guide recursion. To start with, we need a table of precedences:</p> + +<div class="doc_code"> +<pre> +/// BinopPrecedence - This holds the precedence for each binary operator that is +/// defined. +static std::map<char, int> BinopPrecedence; + +/// GetTokPrecedence - Get the preceden |