diff options
Diffstat (limited to 'docs/tutorial/LangImpl2.html')
-rw-r--r-- | docs/tutorial/LangImpl2.html | 1231 |
1 files changed, 0 insertions, 1231 deletions
diff --git a/docs/tutorial/LangImpl2.html b/docs/tutorial/LangImpl2.html deleted file mode 100644 index 292dd4e516..0000000000 --- a/docs/tutorial/LangImpl2.html +++ /dev/null @@ -1,1231 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" - "http://www.w3.org/TR/html4/strict.dtd"> - -<html> -<head> - <title>Kaleidoscope: Implementing a Parser and AST</title> - <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - <meta name="author" content="Chris Lattner"> - <link rel="stylesheet" href="../_static/llvm.css" type="text/css"> -</head> - -<body> - -<h1>Kaleidoscope: Implementing a Parser and AST</h1> - -<ul> -<li><a href="index.html">Up to Tutorial Index</a></li> -<li>Chapter 2 - <ol> - <li><a href="#intro">Chapter 2 Introduction</a></li> - <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li> - <li><a href="#parserbasics">Parser Basics</a></li> - <li><a href="#parserprimexprs">Basic Expression Parsing</a></li> - <li><a href="#parserbinops">Binary Expression Parsing</a></li> - <li><a href="#parsertop">Parsing the Rest</a></li> - <li><a href="#driver">The Driver</a></li> - <li><a href="#conclusions">Conclusions</a></li> - <li><a href="#code">Full Code Listing</a></li> - </ol> -</li> -<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li> -</ul> - -<div class="doc_author"> - <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> -</div> - -<!-- *********************************************************************** --> -<h2><a name="intro">Chapter 2 Introduction</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language -with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in -<a href="LangImpl1.html">Chapter 1</a>, to build a full <a -href="http://en.wikipedia.org/wiki/Parsing">parser</a> for -our Kaleidoscope language. Once we have a parser, we'll define and build an <a -href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax -Tree</a> (AST).</p> - -<p>The parser we will build uses a combination of <a -href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent -Parsing</a> and <a href= -"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence -Parsing</a> to parse the Kaleidoscope language (the latter for -binary expressions and the former for everything else). Before we get to -parsing though, lets talk about the output of the parser: the Abstract Syntax -Tree.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>The AST for a program captures its behavior in such a way that it is easy for -later stages of the compiler (e.g. code generation) to interpret. We basically -want one object for each construct in the language, and the AST should closely -model the language. In Kaleidoscope, we have expressions, a prototype, and a -function object. We'll start with expressions first:</p> - -<div class="doc_code"> -<pre> -/// ExprAST - Base class for all expression nodes. -class ExprAST { -public: - virtual ~ExprAST() {} -}; - -/// NumberExprAST - Expression class for numeric literals like "1.0". -class NumberExprAST : public ExprAST { - double Val; -public: - NumberExprAST(double val) : Val(val) {} -}; -</pre> -</div> - -<p>The code above shows the definition of the base ExprAST class and one -subclass which we use for numeric literals. The important thing to note about -this code is that the NumberExprAST class captures the numeric value of the -literal as an instance variable. This allows later phases of the compiler to -know what the stored numeric value is.</p> - -<p>Right now we only create the AST, so there are no useful accessor methods on -them. It would be very easy to add a virtual method to pretty print the code, -for example. Here are the other expression AST node definitions that we'll use -in the basic form of the Kaleidoscope language: -</p> - -<div class="doc_code"> -<pre> -/// VariableExprAST - Expression class for referencing a variable, like "a". -class VariableExprAST : public ExprAST { - std::string Name; -public: - VariableExprAST(const std::string &name) : Name(name) {} -}; - -/// BinaryExprAST - Expression class for a binary operator. -class BinaryExprAST : public ExprAST { - char Op; - ExprAST *LHS, *RHS; -public: - BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) - : Op(op), LHS(lhs), RHS(rhs) {} -}; - -/// CallExprAST - Expression class for function calls. -class CallExprAST : public ExprAST { - std::string Callee; - std::vector<ExprAST*> Args; -public: - CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) - : Callee(callee), Args(args) {} -}; -</pre> -</div> - -<p>This is all (intentionally) rather straight-forward: variables capture the -variable name, binary operators capture their opcode (e.g. '+'), and calls -capture a function name as well as a list of any argument expressions. One thing -that is nice about our AST is that it captures the language features without -talking about the syntax of the language. Note that there is no discussion about -precedence of binary operators, lexical structure, etc.</p> - -<p>For our basic language, these are all of the expression nodes we'll define. -Because it doesn't have conditional control flow, it isn't Turing-complete; -we'll fix that in a later installment. The two things we need next are a way -to talk about the interface to a function, and a way to talk about functions -themselves:</p> - -<div class="doc_code"> -<pre> -/// PrototypeAST - This class represents the "prototype" for a function, -/// which captures its name, and its argument names (thus implicitly the number -/// of arguments the function takes). -class PrototypeAST { - std::string Name; - std::vector<std::string> Args; -public: - PrototypeAST(const std::string &name, const std::vector<std::string> &args) - : Name(name), Args(args) {} -}; - -/// FunctionAST - This class represents a function definition itself. -class FunctionAST { - PrototypeAST *Proto; - ExprAST *Body; -public: - FunctionAST(PrototypeAST *proto, ExprAST *body) - : Proto(proto), Body(body) {} -}; -</pre> -</div> - -<p>In Kaleidoscope, functions are typed with just a count of their arguments. -Since all values are double precision floating point, the type of each argument -doesn't need to be stored anywhere. In a more aggressive and realistic -language, the "ExprAST" class would probably have a type field.</p> - -<p>With this scaffolding, we can now talk about parsing expressions and function -bodies in Kaleidoscope.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="parserbasics">Parser Basics</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>Now that we have an AST to build, we need to define the parser code to build -it. The idea here is that we want to parse something like "x+y" (which is -returned as three tokens by the lexer) into an AST that could be generated with -calls like this:</p> - -<div class="doc_code"> -<pre> - ExprAST *X = new VariableExprAST("x"); - ExprAST *Y = new VariableExprAST("y"); - ExprAST *Result = new BinaryExprAST('+', X, Y); -</pre> -</div> - -<p>In order to do this, we'll start by defining some basic helper routines:</p> - -<div class="doc_code"> -<pre> -/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current -/// token the parser is looking at. getNextToken reads another token from the -/// lexer and updates CurTok with its results. -static int CurTok; -static int getNextToken() { - return CurTok = gettok(); -} -</pre> -</div> - -<p> -This implements a simple token buffer around the lexer. This allows -us to look one token ahead at what the lexer is returning. Every function in -our parser will assume that CurTok is the current token that needs to be -parsed.</p> - -<div class="doc_code"> -<pre> - -/// Error* - These are little helper functions for error handling. -ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} -PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } -FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } -</pre> -</div> - -<p> -The <tt>Error</tt> routines are simple helper routines that our parser will use -to handle errors. The error recovery in our parser will not be the best and -is not particular user-friendly, but it will be enough for our tutorial. These -routines make it easier to handle errors in routines that have various return -types: they always return null.</p> - -<p>With these basic helper functions, we can implement the first -piece of our grammar: numeric literals.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="parserprimexprs">Basic Expression Parsing</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>We start with numeric literals, because they are the simplest to process. -For each production in our grammar, we'll define a function which parses that -production. For numeric literals, we have: -</p> - -<div class="doc_code"> -<pre> -/// numberexpr ::= number -static ExprAST *ParseNumberExpr() { - ExprAST *Result = new NumberExprAST(NumVal); - getNextToken(); // consume the number - return Result; -} -</pre> -</div> - -<p>This routine is very simple: it expects to be called when the current token -is a <tt>tok_number</tt> token. It takes the current number value, creates -a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally -returns.</p> - -<p>There are some interesting aspects to this. The most important one is that -this routine eats all of the tokens that correspond to the production and -returns the lexer buffer with the next token (which is not part of the grammar -production) ready to go. This is a fairly standard way to go for recursive -descent parsers. For a better example, the parenthesis operator is defined like -this:</p> - -<div class="doc_code"> -<pre> -/// parenexpr ::= '(' expression ')' -static ExprAST *ParseParenExpr() { - getNextToken(); // eat (. - ExprAST *V = ParseExpression(); - if (!V) return 0; - - if (CurTok != ')') - return Error("expected ')'"); - getNextToken(); // eat ). - return V; -} -</pre> -</div> - -<p>This function illustrates a number of interesting things about the -parser:</p> - -<p> -1) It shows how we use the Error routines. When called, this function expects -that the current token is a '(' token, but after parsing the subexpression, it -is possible that there is no ')' waiting. For example, if the user types in -"(4 x" instead of "(4)", the parser should emit an error. Because errors can -occur, the parser needs a way to indicate that they happened: in our parser, we -return null on an error.</p> - -<p>2) Another interesting aspect of this function is that it uses recursion by -calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call -<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle -recursive grammars, and keeps each production very simple. Note that -parentheses do not cause construction of AST nodes themselves. While we could -do it this way, the most important role of parentheses are to guide the parser -and provide grouping. Once the parser constructs the AST, parentheses are not -needed.</p> - -<p>The next simple production is for handling variable references and function -calls:</p> - -<div class="doc_code"> -<pre> -/// identifierexpr -/// ::= identifier -/// ::= identifier '(' expression* ')' -static ExprAST *ParseIdentifierExpr() { - std::string IdName = IdentifierStr; - - getNextToken(); // eat identifier. - - if (CurTok != '(') // Simple variable ref. - return new VariableExprAST(IdName); - - // Call. - getNextToken(); // eat ( - std::vector<ExprAST*> Args; - if (CurTok != ')') { - while (1) { - ExprAST *Arg = ParseExpression(); - if (!Arg) return 0; - Args.push_back(Arg); - - if (CurTok == ')') break; - - if (CurTok != ',') - return Error("Expected ')' or ',' in argument list"); - getNextToken(); - } - } - - // Eat the ')'. - getNextToken(); - - return new CallExprAST(IdName, Args); -} -</pre> -</div> - -<p>This routine follows the same style as the other routines. (It expects to be -called if the current token is a <tt>tok_identifier</tt> token). It also has -recursion and error handling. One interesting aspect of this is that it uses -<em>look-ahead</em> to determine if the current identifier is a stand alone -variable reference or if it is a function call expression. It handles this by -checking to see if the token after the identifier is a '(' token, constructing -either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate. -</p> - -<p>Now that we have all of our simple expression-parsing logic in place, we can -define a helper function to wrap it together into one entry point. We call this -class of expressions "primary" expressions, for reasons that will become more -clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to -parse an arbitrary primary expression, we need to determine what sort of -expression it is:</p> - -<div class="doc_code"> -<pre> -/// primary -/// ::= identifierexpr -/// ::= numberexpr -/// ::= parenexpr -static ExprAST *ParsePrimary() { - switch (CurTok) { - default: return Error("unknown token when expecting an expression"); - case tok_identifier: return ParseIdentifierExpr(); - case tok_number: return ParseNumberExpr(); - case '(': return ParseParenExpr(); - } -} -</pre> -</div> - -<p>Now that you see the definition of this function, it is more obvious why we -can assume the state of CurTok in the various functions. This uses look-ahead -to determine which sort of expression is being inspected, and then parses it -with a function call.</p> - -<p>Now that basic expressions are handled, we need to handle binary expressions. -They are a bit more complex.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="parserbinops">Binary Expression Parsing</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>Binary expressions are significantly harder to parse because they are often -ambiguous. For example, when given the string "x+y*z", the parser can choose -to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from -mathematics, we expect the later parse, because "*" (multiplication) has -higher <em>precedence</em> than "+" (addition).</p> - -<p>There are many ways to handle this, but an elegant and efficient way is to -use <a href= -"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence -Parsing</a>. This parsing technique uses the precedence of binary operators to -guide recursion. To start with, we need a table of precedences:</p> - -<div class="doc_code"> -<pre> -/// BinopPrecedence - This holds the precedence for each binary operator that is -/// defined. -static std::map<char, int> BinopPrecedence; - -/// GetTokPrecedence - Get the precedence of the pending binary operator token. -static int GetTokPrecedence() { - if (!isascii(CurTok)) - return -1; - - // Make sure it's a declared binop. - int TokPrec = BinopPrecedence[CurTok]; - if (TokPrec <= 0) return -1; - return TokPrec; -} - -int main() { - // Install standard binary operators. - // 1 is lowest precedence. - BinopPrecedence['<'] = 10; - BinopPrecedence['+'] = 20; - BinopPrecedence['-'] = 20; - BinopPrecedence['*'] = 40; // highest. - ... -} -</pre> -</div> - -<p>For the basic form of Kaleidoscope, we will only support 4 binary operators -(this can obviously be extended by you, our brave and intrepid reader). The -<tt>GetTokPrecedence</tt> function returns the precedence for the current token, -or -1 if the token is not a binary operator. Having a map makes it easy to add -new operators and makes it clear that the algorithm doesn't depend on the -specific operators involved, but it would be easy enough to eliminate the map -and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use -a fixed-size array).</p> - -<p>With the helper above defined, we can now start parsing binary expressions. -The basic idea of operator precedence parsing is to break down an expression -with potentially ambiguous binary operators into pieces. Consider ,for example, -the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this -as a stream of primary expressions separated by binary operators. As such, -it will first parse the leading primary expression "a", then it will see the -pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses -are primary expressions, the binary expression parser doesn't need to worry -about nested subexpressions like (c+d) at all. -</p> - -<p> -To start, an expression is a primary expression potentially followed by a -sequence of [binop,primaryexpr] pairs:</p> - -<div class="doc_code"> -<pre> -/// expression -/// ::= primary binoprhs -/// -static ExprAST *ParseExpression() { - ExprAST *LHS = ParsePrimary(); - if (!LHS) return 0; - - return ParseBinOpRHS(0, LHS); -} -</pre> -</div> - -<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for -us. It takes a precedence and a pointer to an expression for the part that has been -parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is -allowed to be empty, in which case it returns the expression that is passed into -it. In our example above, the code passes the expression for "a" into -<tt>ParseBinOpRHS</tt> and the current token is "+".</p> - -<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em> -minimal operator precedence</em> that the function is allowed to eat. For -example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is -passed in a precedence of 40, it will not consume any tokens (because the -precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts -with:</p> - -<div class="doc_code"> -<pre> -/// binoprhs -/// ::= ('+' primary)* -static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { - // If this is a binop, find its precedence. - while (1) { - int TokPrec = GetTokPrecedence(); - - // If this is a binop that binds at least as tightly as the current binop, - // consume it, otherwise we are done. - if (TokPrec < ExprPrec) - return LHS; -</pre> -</div> - -<p>This code gets the precedence of the current token and checks to see if if is -too low. Because we defined invalid tokens to have a precedence of -1, this -check implicitly knows that the pair-stream ends when the token stream runs out -of binary operators. If this check succeeds, we know that the token is a binary -operator and that it will be included in this expression:</p> - -<div class="doc_code"> -<pre> - // Okay, we know this is a binop. - int BinOp = CurTok; - getNextToken(); // eat binop - - // Parse the primary expression after the binary operator. - ExprAST *RHS = ParsePrimary(); - if (!RHS) return 0; -</pre> -</div> - -<p>As such, this code eats (and remembers) the binary operator and then parses -the primary expression that follows. This builds up the whole pair, the first of -which is [+, b] for the running example.</p> - -<p>Now that we parsed the left-hand side of an expression and one pair of the -RHS sequence, we have to decide which way the expression associates. In -particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)". -To determine this, we look ahead at "binop" to determine its precedence and -compare it to BinOp's precedence (which is '+' in this case):</p> - -<div class="doc_code"> -<pre> - // If BinOp binds less tightly with RHS than the operator after RHS, let - // the pending operator take RHS as its LHS. - int NextPrec = GetTokPrecedence(); - if (TokPrec < NextPrec) { -</pre> -</div> - -<p>If the precedence of the binop to the right of "RHS" is lower or equal to the -precedence of our current operator, then we know that the parentheses associate -as "(a+b) binop ...". In our example, the current operator is "+" and the next -operator is "+", we know that they have the same precedence. In this case we'll -create the AST node for "a+b", and then continue parsing:</p> - -<div class="doc_code"> -<pre> - ... if body omitted ... - } - - // Merge LHS/RHS. - LHS = new BinaryExprAST(BinOp, LHS, RHS); - } // loop around to the top of the while loop. -} -</pre> -</div> - -<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next -iteration of the loop, with "+" as the current token. The code above will eat, -remember, and parse "(c+d)" as the primary expression, which makes the -current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with -"*" as the binop to the right of the primary. In this case, the precedence of "*" is -higher than the precedence of "+" so the if condition will be entered.</p> - -<p>The critical question left here is "how can the if condition parse the right -hand side in full"? In particular, to build the AST correctly for our example, -it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to -do this is surprisingly simple (code from the above two blocks duplicated for -context):</p> - -<div class="doc_code"> -<pre> - // If BinOp binds less tightly with RHS than the operator after RHS, let - // the pending operator take RHS as its LHS. - int NextPrec = GetTokPrecedence(); - if (TokPrec < NextPrec) { - <b>RHS = ParseBinOpRHS(TokPrec+1, RHS); - if (RHS == 0) return 0;</b> - } - // Merge LHS/RHS. - LHS = new BinaryExprAST(BinOp, LHS, RHS); - } // loop around to the top of the while loop. -} -</pre> -</div> - -<p>At this point, we know that the binary operator to the RHS of our primary -has higher precedence than the binop we are currently parsing. As such, we know -that any sequence of pairs whose operators are all higher precedence than "+" -should be parsed together and returned as "RHS". To do this, we recursively -invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum -precedence required for it to continue. In our example above, this will cause -it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS -of the '+' expression.</p> - -<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed -and added to the AST. With this little bit of code (14 non-trivial lines), we -correctly handle fully general binary expression parsing in a very elegant way. -This was a whirlwind tour of this code, and it is somewhat subtle. I recommend -running through it with a few tough examples to see how it works. -</p> - -<p>This wraps up handling of expressions. At this point, we can point the -parser at an arbitrary token stream and build an expression from it, stopping -at the first token that is not part of the expression. Next up we need to -handle function definitions, etc.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="parsertop">Parsing the Rest</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p> -The next thing missing is handling of function prototypes. In Kaleidoscope, -these are used both for 'extern' function declarations as well as function body -definitions. The code to do this is straight-forward and not very interesting -(once you've survived expressions): -</p> - -<div class="doc_code"> -<pre> -/// prototype -/// ::= id '(' id* ')' -static PrototypeAST *ParsePrototype() { - if (CurTok != tok_identifier) - return ErrorP("Expected function name in prototype"); - - std::string FnName = IdentifierStr; - getNextToken(); - - if (CurTok != '(') - return ErrorP("Expected '(' in prototype"); - - // Read the list of argument names. - std::vector<std::string> ArgNames; - while (getNextToken() == tok_identifier) - ArgNames.push_back(IdentifierStr); - if (CurTok != ')') - return ErrorP("Expected ')' in prototype"); - - // success. - getNextToken(); // eat ')'. - - return new PrototypeAST(FnName, ArgNames); -} -</pre> -</div> - -<p>Given this, a function definition is very simple, just a prototype plus -an expression to implement the body:</p> - -<div class="doc_code"> -<pre> -/// definition ::= 'def' prototype expression -static FunctionAST *ParseDefinition() { - getNextToken(); // eat def. - PrototypeAST *Proto = ParsePrototype(); - if (Proto == 0) return 0; - - if (ExprAST *E = ParseExpression()) - return new FunctionAST(Proto, E); - return 0; -} -</pre> -</div> - -<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as -well as to support forward declaration of user functions. These 'extern's are just -prototypes with no body:</p> - -<div class="doc_code"> -<pre> -/// external ::= 'extern' prototype -static PrototypeAST *ParseExtern() { - getNextToken(); // eat extern. - return ParsePrototype(); -} -</pre> -</div> - -<p>Finally, we'll also let the user type in arbitrary top-level expressions and -evaluate them on the fly. We will handle this by defining anonymous nullary -(zero argument) functions for them:</p> - -<div class="doc_code"> -<pre> -/// toplevelexpr ::= expression -static FunctionAST *ParseTopLevelExpr() { - if (ExprAST *E = ParseExpression()) { - // Make an anonymous proto. - PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); - return new FunctionAST(Proto, E); - } - return 0; -} -</pre> -</div> - -<p>Now that we have all the pieces, let's build a little driver that will let us -actually <em>execute</em> this code we've built!</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="driver">The Driver</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>The driver for this simply invokes all of the parsing pieces with a top-level -dispatch loop. There isn't much interesting here, so I'll just include the -top-level loop. See <a href="#code">below</a> for full code in the "Top-Level -Parsing" section.</p> - -<div class="doc_code"> -<pre> -/// top ::= definition | external | expression | ';' -static void MainLoop() { - while (1) { - fprintf(stderr, "ready> "); - switch (CurTok) { - case tok_eof: return; - case ';': getNextToken(); break; // ignore top-level semicolons. - case tok_def: HandleDefinition(); break; - case tok_extern: HandleExtern(); break; - default: HandleTopLevelExpression(); break; - } - } -} -</pre> -</div> - -<p>The most interesting part of this is that we ignore top-level semicolons. -Why is this, you ask? The basic reason is that if you type "4 + 5" at the -command line, the parser doesn't know whether that is the end of what you will type -or not. For example, on the next line you could type "def foo..." in which case -4+5 is the end of a top-level expression. Alternatively you could type "* 6", -which would continue the expression. Having top-level semicolons allows you to -type "4+5;", and the parser will know you are done.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="conclusions">Conclusions</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p>With just under 400 lines of commented code (240 lines of non-comment, -non-blank code), we fully defined our minimal language, including a lexer, -parser, and AST builder. With this done, the executable will validate -Kaleidoscope code and tell us if it is grammatically invalid. For -example, here is a sample interaction:</p> - -<div class="doc_code"> -<pre> -$ <b>./a.out</b> -ready> <b>def foo(x y) x+foo(y, 4.0);</b> -Parsed a function definition. -ready> <b>def foo(x y) x+y y;</b> -Parsed a function definition. -Parsed a top-level expr -ready> <b>def foo(x y) x+y );</b> -Parsed a function definition. -Error: unknown token when expecting an expression -ready> <b>extern sin(a);</b> -ready> Parsed an extern -ready> <b>^D</b> -$ -</pre> -</div> - -<p>There is a lot of room for extension here. You can define new AST nodes, -extend the language in many ways, etc. In the <a href="LangImpl3.html">next -installment</a>, we will describe how to generate LLVM Intermediate -Representation (IR) from the AST.</p> - -</div> - -<!-- *********************************************************************** --> -<h2><a name="code">Full Code Listing</a></h2> -<!-- *********************************************************************** --> - -<div> - -<p> -Here is the complete code listing for this and the previous chapter. -Note that it is fully self-contained: you don't need LLVM or any external -libraries at all for this. (Besides the C and C++ standard libraries, of -course.) To build this, just compile with:</p> - -<div class="doc_code"> -<pre> -# Compile -clang++ -g -O3 toy.cpp -# Run -./a.out -</pre> -</div> - -<p>Here is the code:</p> - -<div class="doc_code"> -<pre> -#include <cstdio> -#include <cstdlib> -#include <string> -#include <map> -#include <vector> - -//===----------------------------------------------------------------------===// -// Lexer -//===----------------------------------------------------------------------===// - -// The lexer returns tokens [0-255] if it is an unknown character, otherwise one -// of these for known things. -enum Token { - tok_eof = -1, - - // commands - tok_def = -2, tok_extern = -3, - - // primary - tok_identifier = -4, tok_number = -5 -}; - -static std::string IdentifierStr; // Filled in if tok_identifier -static double NumVal; // Filled in if tok_number - -/// gettok - Return the next token from standard input. -static int gettok() { - static int LastChar = ' '; - - // Skip any whitespace. - while (isspace(LastChar)) - LastChar = getchar(); - - if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* - IdentifierStr = LastChar; - while (isalnum((LastChar = getchar()))) - IdentifierStr += LastChar; - - if (IdentifierStr == "def") return tok_def; - if (IdentifierStr == "extern") return tok_extern; - return tok_identifier; - } - - if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ - std::string NumStr; - do { - NumStr += LastChar; - LastChar = getchar(); - } while (isdigit(LastChar) || LastChar == '.'); - - NumVal = strtod(NumStr.c_str(), 0); - return tok_number; - } - - if (LastChar == '#') { - // Comment until end of line. - do LastChar = getchar(); - while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); - - if (LastChar != EOF) - return gettok(); - } - - // Check for end of file. Don't eat the EOF. - if (LastChar == EOF) - return tok_eof; - - // Otherwise, just return the character as its ascii value. - int ThisChar = LastChar; - LastChar = getchar(); - return ThisChar; -} - -//===----------------------------------------------------------------------===// -// Abstract Syntax Tree (aka Parse Tree) -//===----------------------------------------------------------------------===// - -/// ExprAST - Base class for all expression nodes. -class ExprAST { -public: - virtual ~ExprAST() {} -}; - -/// NumberExprAST - Expression class for numeric literals like "1.0". -class NumberExprAST : public ExprAST { - double Val; -public: - NumberExprAST(double val) : Val(val) {} -}; - -/// VariableExprAST - Expression class for referencing a variable, like "a". -class VariableExprAST : public ExprAST { - std::string Name; -public: - VariableExprAST(const std::string &name) : Name(name) {} -}; - -/// BinaryExprAST - Expression class for a binary operator. -class BinaryExprAST : public ExprAST { - char Op; - ExprAST *LHS, *RHS; -public: - BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) - : Op(op), LHS(lhs), RHS(rhs) {} -}; - -/// CallExprAST - Expression class for function calls. -class CallExprAST : public ExprAST { - std::string Callee; - std::vector<ExprAST*> Args; -public: - CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) - : Callee(callee), Args(args) {} -}; - -/// PrototypeAST - This class represents the "prototype" for a function, -/// which captures its name, and its argument names (thus implicitly the number -/// of arguments the function takes). -class PrototypeAST { - std::string Name; - std::vector<std::string> Args; -public: - PrototypeAST(const std::string &name, const std::vector<std::string> &args) - : Name(name), Args(args) {} - -}; - -/// FunctionAST - This class represents a function definition itself. -class FunctionAST { - PrototypeAST *Proto; - ExprAST *Body; -public: - FunctionAST(PrototypeAST *proto, ExprAST *body) - : Proto(proto), Body(body) {} - -}; - -//===----------------------------------------------------------------------===// -// Parser -//===----------------------------------------------------------------------===// - -/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current -/// token the parser is looking at. getNextToken reads another token from the -/// lexer and updates CurTok with its results. -static int CurTok; -static int getNextToken() { - return CurTok = gettok(); -} - -/// BinopPrecedence - This holds the precedence for each binary operator that is -/// defined. -static std::map<char, int> BinopPrecedence; - -/// GetTokPrecedence - Get the precedence of the pending binary operator token. -static int GetTokPrecedence() { - if (!isascii(CurTok)) - return -1; - - // Make sure it's a declared binop. - int TokPrec = BinopPrecedence[CurTok]; - if (TokPrec <= 0) return -1; - return TokPrec; -} - -/// Error* - These are little helper functions for error handling. -ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} -PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } -FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } - -static ExprAST *ParseExpression(); - -/// identifierexpr -/// ::= identifier -/// ::= identifier '(' expression* ')' -static ExprAST *ParseIdentifierExpr() { - std::string IdName = IdentifierStr; - |