Date: Wed, 5 Dec 2012 00:26:32 +0000 Subject: docs: Sphinxify `docs/tutorial/` Sorry for the massive commit, but I just wanted to knock this one down and it is really straightforward. There are still a couple trivial (i.e. not related to the content) things left to fix: - Use of raw HTML links where :doc:`...` and :ref:`...` could be used instead. If you are a newbie and want to help fix this it would make for some good bite-sized patches; more experienced developers should be focusing on adding new content (to this tutorial or elsewhere, but please _do not_ waste your time on formatting when there is such dire need for documentation (see docs/SphinxQuickstartTemplate.rst to get started writing)). - Highlighting of the kaleidoscope code blocks (currently left as bare `::`). I will be working on writing a custom Pygments highlighter for this, mostly as training for maintaining the `llvm` code-block's lexer in-tree. I want to do this because I am extremely unhappy with how it just "gives up" on the slightest deviation from the expected syntax and leaves the whole code-block un-highlighted. More generally I am looking at writing some Sphinx extensions and keeping them in-tree as well, to support common use cases that currently have no good solution (like "monospace text inside a link"). 
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@169343 91177308-0d34-0410-b5e6-96231b3b80d8 --- docs/tutorial/LangImpl1.html | 348 ------ docs/tutorial/LangImpl1.rst | 280 +++++ docs/tutorial/LangImpl2.html | 1231 --------------------- docs/tutorial/LangImpl2.rst | 1098 +++++++++++++++++++ docs/tutorial/LangImpl3.html | 1268 ---------------------- docs/tutorial/LangImpl3.rst | 1162 ++++++++++++++++++++ docs/tutorial/LangImpl4.html | 1152 -------------------- docs/tutorial/LangImpl4.rst | 1063 ++++++++++++++++++ docs/tutorial/LangImpl5.html | 1772 ------------------------------ docs/tutorial/LangImpl5.rst | 1609 +++++++++++++++++++++++++++ docs/tutorial/LangImpl6.html | 1829 ------------------------------- docs/tutorial/LangImpl6.rst | 1728 +++++++++++++++++++++++++++++ docs/tutorial/LangImpl7.html | 2164 ------------------------------------- docs/tutorial/LangImpl7.rst | 2005 ++++++++++++++++++++++++++++++++++ docs/tutorial/LangImpl8.html | 359 ------ docs/tutorial/LangImpl8.rst | 269 +++++ docs/tutorial/OCamlLangImpl1.html | 365 ------- docs/tutorial/OCamlLangImpl1.rst | 288 +++++ docs/tutorial/OCamlLangImpl2.html | 1043 ------------------ docs/tutorial/OCamlLangImpl2.rst | 899 +++++++++++++++ docs/tutorial/OCamlLangImpl3.html | 1093 ------------------- docs/tutorial/OCamlLangImpl3.rst | 964 +++++++++++++++++ docs/tutorial/OCamlLangImpl4.html | 1026 ------------------ docs/tutorial/OCamlLangImpl4.rst | 918 ++++++++++++++++ docs/tutorial/OCamlLangImpl5.html | 1560 -------------------------- docs/tutorial/OCamlLangImpl5.rst | 1365 +++++++++++++++++++++++ docs/tutorial/OCamlLangImpl6.html | 1574 --------------------------- docs/tutorial/OCamlLangImpl6.rst | 1444 +++++++++++++++++++++++++ docs/tutorial/OCamlLangImpl7.html | 1904 -------------------------------- docs/tutorial/OCamlLangImpl7.rst | 1726 +++++++++++++++++++++++++++++ docs/tutorial/OCamlLangImpl8.html | 359 ------ docs/tutorial/OCamlLangImpl8.rst | 269 +++++ docs/tutorial/index.rst | 44 +- 33 
files changed, 17106 insertions(+), 19072 deletions(-) delete mode 100644 docs/tutorial/LangImpl1.html create mode 100644 docs/tutorial/LangImpl1.rst delete mode 100644 docs/tutorial/LangImpl2.html create mode 100644 docs/tutorial/LangImpl2.rst delete mode 100644 docs/tutorial/LangImpl3.html create mode 100644 docs/tutorial/LangImpl3.rst delete mode 100644 docs/tutorial/LangImpl4.html create mode 100644 docs/tutorial/LangImpl4.rst delete mode 100644 docs/tutorial/LangImpl5.html create mode 100644 docs/tutorial/LangImpl5.rst delete mode 100644 docs/tutorial/LangImpl6.html create mode 100644 docs/tutorial/LangImpl6.rst delete mode 100644 docs/tutorial/LangImpl7.html create mode 100644 docs/tutorial/LangImpl7.rst delete mode 100644 docs/tutorial/LangImpl8.html create mode 100644 docs/tutorial/LangImpl8.rst delete mode 100644 docs/tutorial/OCamlLangImpl1.html create mode 100644 docs/tutorial/OCamlLangImpl1.rst delete mode 100644 docs/tutorial/OCamlLangImpl2.html create mode 100644 docs/tutorial/OCamlLangImpl2.rst delete mode 100644 docs/tutorial/OCamlLangImpl3.html create mode 100644 docs/tutorial/OCamlLangImpl3.rst delete mode 100644 docs/tutorial/OCamlLangImpl4.html create mode 100644 docs/tutorial/OCamlLangImpl4.rst delete mode 100644 docs/tutorial/OCamlLangImpl5.html create mode 100644 docs/tutorial/OCamlLangImpl5.rst delete mode 100644 docs/tutorial/OCamlLangImpl6.html create mode 100644 docs/tutorial/OCamlLangImpl6.rst delete mode 100644 docs/tutorial/OCamlLangImpl7.html create mode 100644 docs/tutorial/OCamlLangImpl7.rst delete mode 100644 docs/tutorial/OCamlLangImpl8.html create mode 100644 docs/tutorial/OCamlLangImpl8.rst (limited to 'docs/tutorial') diff --git a/docs/tutorial/LangImpl1.html b/docs/tutorial/LangImpl1.html deleted file mode 100644 index a65646f286..0000000000 --- a/docs/tutorial/LangImpl1.html +++ /dev/null @@ -1,348 +0,0 @@ - - - - - Kaleidoscope: Tutorial Introduction and the Lexer - - - - - - - -

Kaleidoscope: Tutorial Introduction and the Lexer

- -

Up to Tutorial Index
Chapter 1 -
1. Tutorial Introduction
2. The Basic Language
3. The Lexer
-
Chapter 2: Implementing a Parser and AST

- -

Written by Chris Lattner

- - -

Tutorial Introduction

- - -

- -

Welcome to the "Implementing a language with LLVM" tutorial. This tutorial -runs through the implementation of a simple language, showing how fun and -easy it can be. This tutorial will get you up and started as well as help to -build a framework you can extend to other languages. The code in this tutorial -can also be used as a playground to hack on other LLVM specific things. -

- -

-The goal of this tutorial is to progressively unveil our language, describing -how it is built up over time. This will let us cover a fairly broad range of -language design and LLVM-specific usage issues, showing and explaining the code -for it all along the way, without overwhelming you with tons of details up -front.

- -

It is useful to point out ahead of time that this tutorial is really about -teaching compiler techniques and LLVM specifically, not about teaching -modern and sane software engineering principles. In practice, this means that -we'll take a number of shortcuts to simplify the exposition. For example, the -code leaks memory, uses global variables all over the place, doesn't use nice -design patterns like visitors, etc... but it -is very simple. If you dig in and use the code as a basis for future projects, -fixing these deficiencies shouldn't be hard.

- -

I've tried to put this tutorial together in a way that makes chapters easy to -skip over if you are already familiar with or are uninterested in the various -pieces. The structure of the tutorial is: -

- -

Chapter #1: Introduction to the Kaleidoscope -language, and the definition of its Lexer - This shows where we are going -and the basic functionality that we want it to do. In order to make this -tutorial maximally understandable and hackable, we choose to implement -everything in C++ instead of using lexer and parser generators. LLVM obviously -works just fine with such tools, feel free to use one if you prefer.
Chapter #2: Implementing a Parser and -AST - With the lexer in place, we can talk about parsing techniques and -basic AST construction. This tutorial describes recursive descent parsing and -operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific, -the code doesn't even link in LLVM at this point. :)
Chapter #3: Code generation to LLVM IR - -With the AST ready, we can show off how easy generation of LLVM IR really -is.
Chapter #4: Adding JIT and Optimizer -Support - Because a lot of people are interested in using LLVM as a JIT, -we'll dive right into it and show you the 3 lines it takes to add JIT support. -LLVM is also useful in many other ways, but this is one simple and "sexy" way -to show off its power. :)
Chapter #5: Extending the Language: Control -Flow - With the language up and running, we show how to extend it with -control flow operations (if/then/else and a 'for' loop). This gives us a chance -to talk about simple SSA construction and control flow.
Chapter #6: Extending the Language: -User-defined Operators - This is a silly but fun chapter that talks about -extending the language to let the user program define their own arbitrary -unary and binary operators (with assignable precedence!). This lets us build a -significant piece of the "language" as library routines.
Chapter #7: Extending the Language: Mutable -Variables - This chapter talks about adding user-defined local variables -along with an assignment operator. The interesting part about this is how -easy and trivial it is to construct SSA form in LLVM: no, LLVM does not -require your front-end to construct SSA form!
Chapter #8: Conclusion and other useful LLVM -tidbits - This chapter wraps up the series by talking about potential -ways to extend the language, but also includes a bunch of pointers to info about -"special topics" like adding garbage collection support, exceptions, debugging, -support for "spaghetti stacks", and a bunch of other tips and tricks.

- -

By the end of the tutorial, we'll have written a bit less than 700 -non-comment, non-blank lines of code. With this small amount of code, we'll -have built up a very reasonable compiler for a non-trivial language including -a hand-written lexer, parser, AST, as well as code generation support with a JIT -compiler. While other systems may have interesting "hello world" tutorials, -I think the breadth of this tutorial is a great testament to the strengths of -LLVM and why you should consider it if you're interested in language or compiler -design.

- -

A note about this tutorial: we expect you to extend the language and play -with it on your own. Take the code and go crazy hacking away at it, compilers -don't need to be scary creatures - it can be a lot of fun to play with -languages!

- -

- - -

The Basic Language

- - -

- -

This tutorial will be illustrated with a toy language that we'll call -"Kaleidoscope" (derived -from Greek roots meaning "beautiful, form, and view"). -Kaleidoscope is a procedural language that allows you to define functions, use -conditionals, math, etc. Over the course of the tutorial, we'll extend -Kaleidoscope to support the if/then/else construct, a for loop, user defined -operators, JIT compilation with a simple command line interface, etc.

- -

Because we want to keep things simple, the only datatype in Kaleidoscope is a -64-bit floating point type (aka 'double' in C parlance). As such, all values -are implicitly double precision and the language doesn't require type -declarations. This gives the language a very nice and simple syntax. For -example, the following simple example computes Fibonacci numbers:

- -

-# Compute the x'th fibonacci number.
-def fib(x)
-  if x < 3 then
-    1
-  else
-    fib(x-1)+fib(x-2)
-
-# This expression will compute the 40th number.
-fib(40)
-

- -

We also allow Kaleidoscope to call into standard library functions (the LLVM -JIT makes this completely trivial). This means that you can use the 'extern' -keyword to define a function before you use it (this is also useful for mutually -recursive functions). For example:

- -

-extern sin(arg);
-extern cos(arg);
-extern atan2(arg1 arg2);
-
-atan2(sin(.4), cos(42))
-

- -

A more interesting example is included in Chapter 6 where we write a little -Kaleidoscope application that displays -a Mandelbrot Set at various levels of magnification.

- -

Let's dive into the implementation of this language!

- -

- - -

The Lexer

- - -

- -

When it comes to implementing a language, the first thing needed is -the ability to process a text file and recognize what it says. The traditional -way to do this is to use a "lexer" (aka 'scanner') -to break the input up into "tokens". Each token returned by the lexer includes -a token code and potentially some metadata (e.g. the numeric value of a number). -First, we define the possibilities: -

- -

-// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
-// of these for known things.
-enum Token {
-  tok_eof = -1,
-
-  // commands
-  tok_def = -2, tok_extern = -3,
-
-  // primary
-  tok_identifier = -4, tok_number = -5,
-};
-
-static std::string IdentifierStr;  // Filled in if tok_identifier
-static double NumVal;              // Filled in if tok_number
-

- -

Each token returned by our lexer will either be one of the Token enum values -or it will be an 'unknown' character like '+', which is returned as its ASCII -value. If the current token is an identifier, the IdentifierStr -global variable holds the name of the identifier. If the current token is a -numeric literal (like 1.0), NumVal holds its value. Note that we use -global variables for simplicity, this is not the best choice for a real language -implementation :). -
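To make this encoding concrete, here is a small sketch of how a consumer could tell the two kinds of return value apart. The helper `describeToken` is hypothetical and not part of the tutorial code; it relies only on the fact that the enum values are negative while "unknown" characters come back as values in the range 0-255.

```cpp
#include <string>

// Same token codes as the enum shown above.
enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// Hypothetical helper (not in the tutorial): turn a gettok()-style return
// value into a readable name. Negative values are enum tokens; values in
// [0, 255] are "unknown" characters returned as their character code.
static std::string describeToken(int Tok) {
  switch (Tok) {
  case tok_eof:        return "eof";
  case tok_def:        return "def";
  case tok_extern:     return "extern";
  case tok_identifier: return "identifier";
  case tok_number:     return "number";
  default:
    if (Tok >= 0 && Tok <= 255)
      return std::string("char '") + static_cast<char>(Tok) + "'";
    return "invalid";
  }
}
```

Because the two ranges never overlap, a plain `int` is enough to carry either kind of token, which is why `gettok` returns `int` rather than `Token`.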

- -

The actual implementation of the lexer is a single function named -gettok. The gettok function is called to return the next token -from standard input. Its definition starts as:

- -

-/// gettok - Return the next token from standard input.
-static int gettok() {
-  static int LastChar = ' ';
-
-  // Skip any whitespace.
-  while (isspace(LastChar))
-    LastChar = getchar();
-

- -

-gettok works by calling the C getchar() function to read -characters one at a time from standard input. It eats them as it recognizes -them and stores the last character read, but not processed, in LastChar. The -first thing that it has to do is ignore whitespace between tokens. This is -accomplished with the loop above.

- -

The next thing gettok needs to do is recognize identifiers and -specific keywords like "def". Kaleidoscope does this with this simple loop:

- -

-  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
-    IdentifierStr = LastChar;
-    while (isalnum((LastChar = getchar())))
-      IdentifierStr += LastChar;
-
-    if (IdentifierStr == "def") return tok_def;
-    if (IdentifierStr == "extern") return tok_extern;
-    return tok_identifier;
-  }
-

- -

Note that this code sets the 'IdentifierStr' global whenever it -lexes an identifier. Also, since language keywords are matched by the same -loop, we handle them here inline. Numeric values are similar:

- -

-  if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
-    std::string NumStr;
-    do {
-      NumStr += LastChar;
-      LastChar = getchar();
-    } while (isdigit(LastChar) || LastChar == '.');
-
-    NumVal = strtod(NumStr.c_str(), 0);
-    return tok_number;
-  }
-

- -

This is all pretty straight-forward code for processing input. When reading -a numeric value from input, we use the C strtod function to convert it -to a numeric value that we store in NumVal. Note that this isn't doing -sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as -if you typed in "1.23". Feel free to extend it :). Next we handle comments: -

- -

-  if (LastChar == '#') {
-    // Comment until end of line.
-    do LastChar = getchar();
-    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
-    
-    if (LastChar != EOF)
-      return gettok();
-  }
-

- -

We handle comments by skipping to the end of the line and then returning the -next token. Finally, if the input doesn't match one of the above cases, it is -either an operator character like '+' or the end of the file. These are handled -with this code:

- -

-  // Check for end of file.  Don't eat the EOF.
-  if (LastChar == EOF)
-    return tok_eof;
-  
-  // Otherwise, just return the character as its ascii value.
-  int ThisChar = LastChar;
-  LastChar = getchar();
-  return ThisChar;
-}
-
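The fragments above are easier to experiment with when they read from a string instead of standard input. The following is a sketch under that assumption: `StringLexer`, `nextChar`, and `tokenize` are hypothetical names introduced here, the globals become members, and `nextChar` stands in for `getchar()`. The lexing logic itself is unchanged from the fragments shown.

```cpp
#include <cctype>
#include <cstdio>   // EOF
#include <cstdlib>  // strtod
#include <string>
#include <vector>

enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// Hypothetical wrapper (not in the tutorial): same steps as gettok(), but
// state lives in members and characters come from a string.
struct StringLexer {
  std::string Input;
  std::size_t Pos = 0;
  int LastChar = ' ';
  std::string IdentifierStr; // Filled in if tok_identifier
  double NumVal = 0;         // Filled in if tok_number

  explicit StringLexer(const std::string &In) : Input(In) {}

  // Stand-in for getchar(): next character, or EOF at end of input.
  int nextChar() {
    return Pos < Input.size() ? static_cast<unsigned char>(Input[Pos++]) : EOF;
  }

  int gettok() {
    while (isspace(LastChar)) // Skip any whitespace.
      LastChar = nextChar();

    if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
      IdentifierStr = static_cast<char>(LastChar);
      while (isalnum((LastChar = nextChar())))
        IdentifierStr += static_cast<char>(LastChar);
      if (IdentifierStr == "def") return tok_def;
      if (IdentifierStr == "extern") return tok_extern;
      return tok_identifier;
    }

    if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
      std::string NumStr;
      do {
        NumStr += static_cast<char>(LastChar);
        LastChar = nextChar();
      } while (isdigit(LastChar) || LastChar == '.');
      NumVal = strtod(NumStr.c_str(), nullptr);
      return tok_number;
    }

    if (LastChar == '#') { // Comment until end of line.
      do LastChar = nextChar();
      while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
      if (LastChar != EOF)
        return gettok();
    }

    if (LastChar == EOF) // Don't eat the EOF.
      return tok_eof;

    int ThisChar = LastChar; // Any other character is returned as itself.
    LastChar = nextChar();
    return ThisChar;
  }
};

// Collect every token code up to and including tok_eof.
static std::vector<int> tokenize(const std::string &Src) {
  StringLexer L(Src);
  std::vector<int> Toks;
  int T;
  do {
    T = L.gettok();
    Toks.push_back(T);
  } while (T != tok_eof);
  return Toks;
}
```

With this sketch, `tokenize("def fib(x) x+1 # c")` yields nine codes: `tok_def`, `tok_identifier`, `'('`, `tok_identifier`, `')'`, `tok_identifier`, `'+'`, `tok_number`, and `tok_eof`. The trailing comment produces no token of its own.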

- -

With this, we have the complete lexer for the basic Kaleidoscope language -(the full code listing for the Lexer is -available in the next chapter of the tutorial). -Next we'll build a simple parser that uses this to -build an Abstract Syntax Tree. When we have that, we'll include a driver -so that you can use the lexer and parser together. -
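One of the extensions invited earlier is stricter handling of numeric literals such as "1.23.45.67". A minimal sketch, assuming we keep `strtod` and simply check how much of the string it consumed through its second argument; the helper name `parseNumberStrict` is hypothetical and not part of the tutorial:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not in the tutorial): convert NumStr to a double and
// report whether strtod consumed the entire string.
static bool parseNumberStrict(const std::string &NumStr, double &Out) {
  const char *Begin = NumStr.c_str();
  char *End = nullptr;
  double Val = strtod(Begin, &End);
  if (End == Begin || *End != '\0')
    return false; // Nothing parsed, or leftover text such as ".45.67".
  Out = Val;
  return true;
}
```

In the number branch of `gettok`, the call to `strtod` could be replaced by this helper, with a failed conversion reported as an error instead of silently producing 1.23. How the lexer should signal that error is left open here.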

- -Next: Implementing a Parser and AST -

- - -

- - Chris Lattner
- The LLVM Compiler Infrastructure
- Last modified: $Date$ -

- - diff --git a/docs/tutorial/LangImpl1.rst b/docs/tutorial/LangImpl1.rst new file mode 100644 index 0000000000..eb84e4c923 --- /dev/null +++ b/docs/tutorial/LangImpl1.rst @@ -0,0 +1,280 @@ +================================================= +Kaleidoscope: Tutorial Introduction and the Lexer +================================================= + +.. contents:: + :local: + +Written by `Chris Lattner `_ + +Tutorial Introduction +===================== + +Welcome to the "Implementing a language with LLVM" tutorial. This +tutorial runs through the implementation of a simple language, showing +how fun and easy it can be. This tutorial will get you up and started as +well as help to build a framework you can extend to other languages. The +code in this tutorial can also be used as a playground to hack on other +LLVM specific things. + +The goal of this tutorial is to progressively unveil our language, +describing how it is built up over time. This will let us cover a fairly +broad range of language design and LLVM-specific usage issues, showing +and explaining the code for it all along the way, without overwhelming +you with tons of details up front. + +It is useful to point out ahead of time that this tutorial is really +about teaching compiler techniques and LLVM specifically, *not* about +teaching modern and sane software engineering principles. In practice, +this means that we'll take a number of shortcuts to simplify the +exposition. For example, the code leaks memory, uses global variables +all over the place, doesn't use nice design patterns like +`visitors `_, etc... but +it is very simple. If you dig in and use the code as a basis for future +projects, fixing these deficiencies shouldn't be hard. + +I've tried to put this tutorial together in a way that makes chapters +easy to skip over if you are already familiar with or are uninterested +in the various pieces. 
The structure of the tutorial is: + +- `Chapter #1 <#language>`_: Introduction to the Kaleidoscope + language, and the definition of its Lexer - This shows where we are + going and the basic functionality that we want it to do. In order to + make this tutorial maximally understandable and hackable, we choose + to implement everything in C++ instead of using lexer and parser + generators. LLVM obviously works just fine with such tools, feel free + to use one if you prefer. +- `Chapter #2 `_: Implementing a Parser and AST - + With the lexer in place, we can talk about parsing techniques and + basic AST construction. This tutorial describes recursive descent + parsing and operator precedence parsing. Nothing in Chapters 1 or 2 + is LLVM-specific, the code doesn't even link in LLVM at this point. + :) +- `Chapter #3 `_: Code generation to LLVM IR - With + the AST ready, we can show off how easy generation of LLVM IR really + is. +- `Chapter #4 `_: Adding JIT and Optimizer Support + - Because a lot of people are interested in using LLVM as a JIT, + we'll dive right into it and show you the 3 lines it takes to add JIT + support. LLVM is also useful in many other ways, but this is one + simple and "sexy" way to show off its power. :) +- `Chapter #5 `_: Extending the Language: Control + Flow - With the language up and running, we show how to extend it + with control flow operations (if/then/else and a 'for' loop). This + gives us a chance to talk about simple SSA construction and control + flow. +- `Chapter #6 `_: Extending the Language: + User-defined Operators - This is a silly but fun chapter that talks + about extending the language to let the user program define their own + arbitrary unary and binary operators (with assignable precedence!). + This lets us build a significant piece of the "language" as library + routines. 
+- `Chapter #7 `_: Extending the Language: Mutable + Variables - This chapter talks about adding user-defined local + variables along with an assignment operator. The interesting part + about this is how easy and trivial it is to construct SSA form in + LLVM: no, LLVM does *not* require your front-end to construct SSA + form! +- `Chapter #8 `_: Conclusion and other useful LLVM + tidbits - This chapter wraps up the series by talking about + potential ways to extend the language, but also includes a bunch of + pointers to info about "special topics" like adding garbage + collection support, exceptions, debugging, support for "spaghetti + stacks", and a bunch of other tips and tricks. + +By the end of the tutorial, we'll have written a bit less than 700 lines +of non-comment, non-blank, lines of code. With this small amount of +code, we'll have built up a very reasonable compiler for a non-trivial +language including a hand-written lexer, parser, AST, as well as code +generation support with a JIT compiler. While other systems may have +interesting "hello world" tutorials, I think the breadth of this +tutorial is a great testament to the strengths of LLVM and why you +should consider it if you're interested in language or compiler design. + +A note about this tutorial: we expect you to extend the language and +play with it on your own. Take the code and go crazy hacking away at it, +compilers don't need to be scary creatures - it can be a lot of fun to +play with languages! + +The Basic Language +================== + +This tutorial will be illustrated with a toy language that we'll call +"`Kaleidoscope `_" (derived +from "meaning beautiful, form, and view"). Kaleidoscope is a procedural +language that allows you to define functions, use conditionals, math, +etc. Over the course of the tutorial, we'll extend Kaleidoscope to +support the if/then/else construct, a for loop, user defined operators, +JIT compilation with a simple command line interface, etc. 
+ +Because we want to keep things simple, the only datatype in Kaleidoscope +is a 64-bit floating point type (aka 'double' in C parlance). As such, +all values are implicitly double precision and the language doesn't +require type declarations. This gives the language a very nice and +simple syntax. For example, the following simple example computes +`Fibonacci numbers: `_ + +:: + + # Compute the x'th fibonacci number. + def fib(x) + if x < 3 then + 1 + else + fib(x-1)+fib(x-2) + + # This expression will compute the 40th number. + fib(40) + +We also allow Kaleidoscope to call into standard library functions (the +LLVM JIT makes this completely trivial). This means that you can use the +'extern' keyword to define a function before you use it (this is also +useful for mutually recursive functions). For example: + +:: + + extern sin(arg); + extern cos(arg); + extern atan2(arg1 arg2); + + atan2(sin(.4), cos(42)) + +A more interesting example is included in Chapter 6 where we write a +little Kaleidoscope application that `displays a Mandelbrot +Set `_ at various levels of magnification. + +Lets dive into the implementation of this language! + +The Lexer +========= + +When it comes to implementing a language, the first thing needed is the +ability to process a text file and recognize what it says. The +traditional way to do this is to use a +"`lexer `_" (aka +'scanner') to break the input up into "tokens". Each token returned by +the lexer includes a token code and potentially some metadata (e.g. the +numeric value of a number). First, we define the possibilities: + +.. code-block:: c++ + + // The lexer returns tokens [0-255] if it is an unknown character, otherwise one + // of these for known things. 
+ enum Token { + tok_eof = -1, + + // commands + tok_def = -2, tok_extern = -3, + + // primary + tok_identifier = -4, tok_number = -5, + }; + + static std::string IdentifierStr; // Filled in if tok_identifier + static double NumVal; // Filled in if tok_number + +Each token returned by our lexer will either be one of the Token enum +values or it will be an 'unknown' character like '+', which is returned +as its ASCII value. If the current token is an identifier, the +``IdentifierStr`` global variable holds the name of the identifier. If +the current token is a numeric literal (like 1.0), ``NumVal`` holds its +value. Note that we use global variables for simplicity, this is not the +best choice for a real language implementation :). + +The actual implementation of the lexer is a single function named +``gettok``. The ``gettok`` function is called to return the next token +from standard input. Its definition starts as: + +.. code-block:: c++ + + /// gettok - Return the next token from standard input. + static int gettok() { + static int LastChar = ' '; + + // Skip any whitespace. + while (isspace(LastChar)) + LastChar = getchar(); + +``gettok`` works by calling the C ``getchar()`` function to read +characters one at a time from standard input. It eats them as it +recognizes them and stores the last character read, but not processed, +in LastChar. The first thing that it has to do is ignore whitespace +between tokens. This is accomplished with the loop above. + +The next thing ``gettok`` needs to do is recognize identifiers and +specific keywords like "def". Kaleidoscope does this with this simple +loop: + +.. 
code-block:: c++ + + if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* + IdentifierStr = LastChar; + while (isalnum((LastChar = getchar()))) + IdentifierStr += LastChar; + + if (IdentifierStr == "def") return tok_def; + if (IdentifierStr == "extern") return tok_extern; + return tok_identifier; + } + +Note that this code sets the '``IdentifierStr``' global whenever it +lexes an identifier. Also, since language keywords are matched by the +same loop, we handle them here inline. Numeric values are similar: + +.. code-block:: c++ + + if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ + std::string NumStr; + do { + NumStr += LastChar; + LastChar = getchar(); + } while (isdigit(LastChar) || LastChar == '.'); + + NumVal = strtod(NumStr.c_str(), 0); + return tok_number; + } + +This is all pretty straight-forward code for processing input. When +reading a numeric value from input, we use the C ``strtod`` function to +convert it to a numeric value that we store in ``NumVal``. Note that +this isn't doing sufficient error checking: it will incorrectly read +"1.23.45.67" and handle it as if you typed in "1.23". Feel free to +extend it :). Next we handle comments: + +.. code-block:: c++ + + if (LastChar == '#') { + // Comment until end of line. + do LastChar = getchar(); + while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); + + if (LastChar != EOF) + return gettok(); + } + +We handle comments by skipping to the end of the line and then return +the next token. Finally, if the input doesn't match one of the above +cases, it is either an operator character like '+' or the end of the +file. These are handled with this code: + +.. code-block:: c++ + + // Check for end of file. Don't eat the EOF. + if (LastChar == EOF) + return tok_eof; + + // Otherwise, just return the character as its ascii value. 
+ int ThisChar = LastChar; + LastChar = getchar(); + return ThisChar; + } + +With this, we have the complete lexer for the basic Kaleidoscope +language (the `full code listing `_ for the Lexer +is available in the `next chapter `_ of the tutorial). +Next we'll `build a simple parser that uses this to build an Abstract +Syntax Tree `_. When we have that, we'll include a +driver so that you can use the lexer and parser together. + +`Next: Implementing a Parser and AST `_ + diff --git a/docs/tutorial/LangImpl2.html b/docs/tutorial/LangImpl2.html deleted file mode 100644 index 292dd4e516..0000000000 --- a/docs/tutorial/LangImpl2.html +++ /dev/null @@ -1,1231 +0,0 @@ - - - - - Kaleidoscope: Implementing a Parser and AST - - - - - - - -

Kaleidoscope: Implementing a Parser and AST

- -

Up to Tutorial Index
Chapter 2 -
-
Chapter 3: Code generation to LLVM IR

- -

Written by Chris Lattner

- - -

Chapter 2 Introduction

- - -

- -

Welcome to Chapter 2 of the "Implementing a language -with LLVM" tutorial. This chapter shows you how to use the lexer, built in -Chapter 1, to build a full parser for -our Kaleidoscope language. Once we have a parser, we'll define and build an Abstract Syntax -Tree (AST).

- -

The parser we will build uses a combination of Recursive Descent -Parsing and Operator-Precedence -Parsing to parse the Kaleidoscope language (the latter for -binary expressions and the former for everything else). Before we get to -parsing though, let's talk about the output of the parser: the Abstract Syntax -Tree.

- -

- - -

The Abstract Syntax Tree (AST)

- - -

- -

The AST for a program captures its behavior in such a way that it is easy for -later stages of the compiler (e.g. code generation) to interpret. We basically -want one object for each construct in the language, and the AST should closely -model the language. In Kaleidoscope, we have expressions, a prototype, and a -function object. We'll start with expressions first:

- -

-/// ExprAST - Base class for all expression nodes.
-class ExprAST {
-public:
-  virtual ~ExprAST() {}
-};
-
-/// NumberExprAST - Expression class for numeric literals like "1.0".
-class NumberExprAST : public ExprAST {
-  double Val;
-public:
-  NumberExprAST(double val) : Val(val) {}
-};
-

- -

The code above shows the definition of the base ExprAST class and one -subclass which we use for numeric literals. The important thing to note about -this code is that the NumberExprAST class captures the numeric value of the -literal as an instance variable. This allows later phases of the compiler to -know what the stored numeric value is.

- -

Right now we only create the AST, so there are no useful accessor methods on -them. It would be very easy to add a virtual method to pretty print the code, -for example. Here are the other expression AST node definitions that we'll use -in the basic form of the Kaleidoscope language: -

- -

-/// VariableExprAST - Expression class for referencing a variable, like "a".
-class VariableExprAST : public ExprAST {
-  std::string Name;
-public:
-  VariableExprAST(const std::string &name) : Name(name) {}
-};
-
-/// BinaryExprAST - Expression class for a binary operator.
-class BinaryExprAST : public ExprAST {
-  char Op;
-  ExprAST *LHS, *RHS;
-public:
-  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 
-    : Op(op), LHS(lhs), RHS(rhs) {}
-};
-
-/// CallExprAST - Expression class for function calls.
-class CallExprAST : public ExprAST {
-  std::string Callee;
-  std::vector<ExprAST*> Args;
-public:
-  CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
-    : Callee(callee), Args(args) {}
-};
-

- -

This is all (intentionally) rather straight-forward: variables capture the -variable name, binary operators capture their opcode (e.g. '+'), and calls -capture a function name as well as a list of any argument expressions. One thing -that is nice about our AST is that it captures the language features without -talking about the syntax of the language. Note that there is no discussion about -precedence of binary operators, lexical structure, etc.
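The virtual pretty-printing method mentioned earlier could be sketched as follows. This is an illustration only: the `print` method and the `toString` helper are assumptions added here, not part of the tutorial's classes, and the output format (fully parenthesized binary expressions) is one arbitrary choice among many.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

class ExprAST {
public:
  virtual ~ExprAST() {}
  // Assumed addition: write a textual form of this node to OS.
  virtual void print(std::ostream &OS) const = 0;
};

class NumberExprAST : public ExprAST {
  double Val;
public:
  NumberExprAST(double val) : Val(val) {}
  void print(std::ostream &OS) const override { OS << Val; }
};

class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &name) : Name(name) {}
  void print(std::ostream &OS) const override { OS << Name; }
};

class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS; // Raw pointers, as in the tutorial (which leaks them).
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {}
  void print(std::ostream &OS) const override {
    OS << '(';
    LHS->print(OS);
    OS << ' ' << Op << ' ';
    RHS->print(OS);
    OS << ')';
  }
};

class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<ExprAST*> Args;
public:
  CallExprAST(const std::string &callee, const std::vector<ExprAST*> &args)
    : Callee(callee), Args(args) {}
  void print(std::ostream &OS) const override {
    OS << Callee << '(';
    for (std::size_t i = 0; i != Args.size(); ++i) {
      if (i != 0)
        OS << ", ";
      Args[i]->print(OS);
    }
    OS << ')';
  }
};

// Hypothetical convenience wrapper: render a node into a std::string.
static std::string toString(const ExprAST &E) {
  std::ostringstream OS;
  E.print(OS);
  return OS.str();
}
```

A `BinaryExprAST('+', x, y)` built from two variable nodes renders as `(x + y)`. Printing the parentheses unconditionally sidesteps precedence, which fits the point above that the AST itself carries no precedence information.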

- -

For our basic language, these are all of the expression nodes we'll define. -Because it doesn't have conditional control flow, it isn't Turing-complete; -we'll fix that in a later installment. The two things we need next are a way -to talk about the interface to a function, and a way to talk about functions -themselves:

- -

-/// PrototypeAST - This class represents the "prototype" for a function,
-/// which captures its name, and its argument names (thus implicitly the number
-/// of arguments the function takes).
-class PrototypeAST {
-  std::string Name;
-  std::vector<std::string> Args;
-public:
-  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
-    : Name(name), Args(args) {}
-};
-
-/// FunctionAST - This class represents a function definition itself.
-class FunctionAST {
-  PrototypeAST *Proto;
-  ExprAST *Body;
-public:
-  FunctionAST(PrototypeAST *proto, ExprAST *body)
-    : Proto(proto), Body(body) {}
-};
-

- -

In Kaleidoscope, functions are typed with just a count of their arguments. Since all values are double precision floating point, the type of each argument doesn't need to be stored anywhere. In a more aggressive and realistic language, the "ExprAST" class would probably have a type field.

- -

With this scaffolding, we can now talk about parsing expressions and function bodies in Kaleidoscope.

- -

- - -

Parser Basics

- - -

- -

Now that we have an AST to build, we need to define the parser code to build it. The idea here is that we want to parse something like "x+y" (which is returned as three tokens by the lexer) into an AST that could be generated with calls like this:

- -

-  ExprAST *X = new VariableExprAST("x");
-  ExprAST *Y = new VariableExprAST("y");
-  ExprAST *Result = new BinaryExprAST('+', X, Y);
-

- -

In order to do this, we'll start by defining some basic helper routines:

- -

-/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
-/// token the parser is looking at.  getNextToken reads another token from the
-/// lexer and updates CurTok with its results.
-static int CurTok;
-static int getNextToken() {
-  return CurTok = gettok();
-}
-

- -

This implements a simple token buffer around the lexer. This allows us to look one token ahead at what the lexer is returning. Every function in our parser will assume that CurTok is the current token that needs to be parsed.
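To see the buffer in isolation, here is a minimal sketch that pairs it with a stand-in lexer. The `FakeInput` string and this version of `gettok()` are assumptions made for the sketch: the fake lexer returns one character per token and then -1, the value the tutorial's lexer uses for end of file.

```cpp
// Hypothetical stand-in for the real lexer: each character of FakeInput is
// one token, and -1 is returned once the input is exhausted.
static const char *FakeInput = "x+y";
static int gettok() { return *FakeInput ? *FakeInput++ : -1; }

// The token buffer itself, unchanged from the tutorial: CurTok always holds
// the token the parser is currently looking at.
static int CurTok;
static int getNextToken() { return CurTok = gettok(); }
```

After each call to `getNextToken()`, parsing code can inspect `CurTok` as often as it likes without consuming anything, which is what gives the parser its single token of look-ahead.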

- -

-
-/// Error* - These are little helper functions for error handling.
-ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str); return 0; }
-PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
-FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
-

- -

The Error routines are simple helper routines that our parser will use to handle errors. The error recovery in our parser will not be the best and is not particularly user-friendly, but it will be enough for our tutorial. These routines make it easier to handle errors in routines that have various return types: they always return null.

- -

With these basic helper functions, we can implement the first piece of our grammar: numeric literals.

- -

- - -

Basic Expression Parsing

- - -

- -

We start with numeric literals, because they are the simplest to process. For each production in our grammar, we'll define a function which parses that production. For numeric literals, we have:

- -

-/// numberexpr ::= number
-static ExprAST *ParseNumberExpr() {
-  ExprAST *Result = new NumberExprAST(NumVal);
-  getNextToken(); // consume the number
-  return Result;
-}
-

- -

This routine is very simple: it expects to be called when the current token is a tok_number token. It takes the current number value, creates a NumberExprAST node, advances the lexer to the next token, and finally returns.

- -

There are some interesting aspects to this. The most important one is that this routine eats all of the tokens that correspond to the production and returns the lexer buffer with the next token (which is not part of the grammar production) ready to go. This is a fairly standard way to go for recursive descent parsers. For a better example, the parenthesis operator is defined like this:

- -

-/// parenexpr ::= '(' expression ')'
-static ExprAST *ParseParenExpr() {
-  getNextToken();  // eat (.
-  ExprAST *V = ParseExpression();
-  if (!V) return 0;
-  
-  if (CurTok != ')')
-    return Error("expected ')'");
-  getNextToken();  // eat ).
-  return V;
-}
-

- -

This function illustrates a number of interesting things about the parser:

- -

1) It shows how we use the Error routines. When called, this function expects that the current token is a '(' token, but after parsing the subexpression, it is possible that there is no ')' waiting. For example, if the user types in "(4 x" instead of "(4)", the parser should emit an error. Because errors can occur, the parser needs a way to indicate that they happened: in our parser, we return null on an error.

- -

2) Another interesting aspect of this function is that it uses recursion by calling ParseExpression (we will soon see that ParseExpression can call ParseParenExpr). This is powerful because it allows us to handle recursive grammars, and keeps each production very simple. Note that parentheses do not cause construction of AST nodes themselves. While we could do it this way, the most important role of parentheses is to guide the parser and provide grouping. Once the parser constructs the AST, parentheses are not needed.

- -

The next simple production is for handling variable references and function calls:

- -

-/// identifierexpr
-///   ::= identifier
-///   ::= identifier '(' expression* ')'
-static ExprAST *ParseIdentifierExpr() {
-  std::string IdName = IdentifierStr;
-  
-  getNextToken();  // eat identifier.
-  
-  if (CurTok != '(') // Simple variable ref.
-    return new VariableExprAST(IdName);
-  
-  // Call.
-  getNextToken();  // eat (
-  std::vector<ExprAST*> Args;
-  if (CurTok != ')') {
-    while (1) {
-      ExprAST *Arg = ParseExpression();
-      if (!Arg) return 0;
-      Args.push_back(Arg);
-
-      if (CurTok == ')') break;
-
-      if (CurTok != ',')
-        return Error("Expected ')' or ',' in argument list");
-      getNextToken();
-    }
-  }
-
-  // Eat the ')'.
-  getNextToken();
-  
-  return new CallExprAST(IdName, Args);
-}
-

- -

This routine follows the same style as the other routines. (It expects to be called if the current token is a tok_identifier token.) It also has recursion and error handling. One interesting aspect of this is that it uses look-ahead to determine if the current identifier is a standalone variable reference or if it is a function call expression. It handles this by checking to see if the token after the identifier is a '(' token, constructing either a VariableExprAST or CallExprAST node as appropriate.

- -

Now that we have all of our simple expression-parsing logic in place, we can define a helper function to wrap it together into one entry point. We call this class of expressions "primary" expressions, for reasons that will become more clear later in the tutorial. In order to parse an arbitrary primary expression, we need to determine what sort of expression it is:

- -

-/// primary
-///   ::= identifierexpr
-///   ::= numberexpr
-///   ::= parenexpr
-static ExprAST *ParsePrimary() {
-  switch (CurTok) {
-  default: return Error("unknown token when expecting an expression");
-  case tok_identifier: return ParseIdentifierExpr();
-  case tok_number:     return ParseNumberExpr();
-  case '(':            return ParseParenExpr();
-  }
-}
-

- -

Now that you see the definition of this function, it is more obvious why we can assume the state of CurTok in the various functions. This uses look-ahead to determine which sort of expression is being inspected, and then parses it with a function call.

- -

Now that basic expressions are handled, we need to handle binary expressions. They are a bit more complex.

- -

- - -

Binary Expression Parsing

- - -

- -

Binary expressions are significantly harder to parse because they are often ambiguous. For example, when given the string "x+y*z", the parser can choose to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from mathematics, we expect the latter parse, because "*" (multiplication) has higher precedence than "+" (addition).

- -

There are many ways to handle this, but an elegant and efficient way is to use Operator-Precedence Parsing. This parsing technique uses the precedence of binary operators to guide recursion. To start with, we need a table of precedences:

- -

-/// BinopPrecedence - This holds the precedence for each binary operator that is
-/// defined.
-static std::map<char, int> BinopPrecedence;
-
-/// GetTokPrecedence - Get the precedence of the pending binary operator token.
-static int GetTokPrecedence() {
-  if (!isascii(CurTok))
-    return -1;
-    
-  // Make sure it's a declared binop.
-  int TokPrec = BinopPrecedence[CurTok];
-  if (TokPrec <= 0) return -1;
-  return TokPrec;
-}
-
-int main() {
-  // Install standard binary operators.
-  // 1 is lowest precedence.
-  BinopPrecedence['<'] = 10;
-  BinopPrecedence['+'] = 20;
-  BinopPrecedence['-'] = 20;
-  BinopPrecedence['*'] = 40;  // highest.
-  ...
-}
-

- -

For the basic form of Kaleidoscope, we will only support 4 binary operators (this can obviously be extended by you, our brave and intrepid reader). The GetTokPrecedence function returns the precedence for the current token, or -1 if the token is not a binary operator. Having a map makes it easy to add new operators and makes it clear that the algorithm doesn't depend on the specific operators involved, but it would be easy enough to eliminate the map and do the comparisons in the GetTokPrecedence function. (Or just use a fixed-size array.)
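The fixed-size array alternative could look roughly like the following. This is only a sketch under stated assumptions: the names `PrecTable` and `GetPrecedenceOf` are hypothetical, and the function takes the token as a parameter instead of reading CurTok so that it stands alone.

```cpp
// Hypothetical array-backed precedence table. The index is the operator's
// ASCII code; a stored value of 0 means "not a binary operator".
static const int *PrecTable() {
  static int Table[128];     // zero-initialized because it has static storage
  static bool Ready = false;
  if (!Ready) {
    // Same four operators and precedences as the map-based version.
    Table['<'] = 10;
    Table['+'] = 20;
    Table['-'] = 20;
    Table['*'] = 40;
    Ready = true;
  }
  return Table;
}

// Returns the precedence of Tok, or -1 if Tok is not a declared binop.
// Negative token codes (such as tok_eof) fall outside the table.
static int GetPrecedenceOf(int Tok) {
  if (Tok < 0 || Tok >= 128)
    return -1;
  int Prec = PrecTable()[Tok];
  return Prec <= 0 ? -1 : Prec;
}
```

One side effect of this design is that a lookup never modifies the table, whereas indexing a `std::map` with `operator[]` inserts a zero entry for any key that is missing.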

- -

With the helper above defined, we can now start parsing binary expressions. The basic idea of operator precedence parsing is to break down an expression with potentially ambiguous binary operators into pieces. Consider, for example, the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this as a stream of primary expressions separated by binary operators. As such, it will first parse the leading primary expression "a", then it will see the pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses are primary expressions, the binary expression parser doesn't need to worry about nested subexpressions like (c+d) at all.

- -

To start, an expression is a primary expression potentially followed by a sequence of [binop,primaryexpr] pairs:

- -

-/// expression
-///   ::= primary binoprhs
-///
-static ExprAST *ParseExpression() {
-  ExprAST *LHS = ParsePrimary();
-  if (!LHS) return 0;
-  
-  return ParseBinOpRHS(0, LHS);
-}
-

- -

ParseBinOpRHS is the function that parses the sequence of pairs for us. It takes a precedence and a pointer to an expression for the part that has been parsed so far. Note that "x" is a perfectly valid expression; as such, "binoprhs" is allowed to be empty, in which case it returns the expression that is passed into it. In our example above, the code passes the expression for "a" into ParseBinOpRHS and the current token is "+".

- -

The precedence value passed into ParseBinOpRHS indicates the minimal operator precedence that the function is allowed to eat. For example, if the current pair stream is [+, x] and ParseBinOpRHS is passed in a precedence of 40, it will not consume any tokens (because the precedence of '+' is only 20). With this in mind, ParseBinOpRHS starts with:

- -

-/// binoprhs
-///   ::= ('+' primary)*
-static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
-  // If this is a binop, find its precedence.
-  while (1) {
-    int TokPrec = GetTokPrecedence();
-    
-    // If this is a binop that binds at least as tightly as the current binop,
-    // consume it, otherwise we are done.
-    if (TokPrec < ExprPrec)
-      return LHS;
-

- -

This code gets the precedence of the current token and checks to see if it is too low. Because we defined invalid tokens to have a precedence of -1, this check implicitly knows that the pair-stream ends when the token stream runs out of binary operators. If we get past this check (that is, the precedence is not too low), we know that the token is a binary operator and that it will be included in this expression:

- -

-    // Okay, we know this is a binop.
-    int BinOp = CurTok;
-    getNextToken();  // eat binop
-    
-    // Parse the primary expression after the binary operator.
-    ExprAST *RHS = ParsePrimary();
-    if (!RHS) return 0;
-

- -

As such, this code eats (and remembers) the binary operator and then parses the primary expression that follows. This builds up the whole pair, the first of which is [+, b] for the running example.

- -

Now that we parsed the left-hand side of an expression and one pair of the RHS sequence, we have to decide which way the expression associates. In particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)". To determine this, we look ahead at "binop" to determine its precedence and compare it to BinOp's precedence (which is '+' in this case):

- -

-    // If BinOp binds less tightly with RHS than the operator after RHS, let
-    // the pending operator take RHS as its LHS.
-    int NextPrec = GetTokPrecedence();
-    if (TokPrec < NextPrec) {
-

- -

If the precedence of the binop to the right of "RHS" is lower than or equal to the precedence of our current operator, then we know that the parentheses associate as "(a+b) binop ...". In our example, the current operator is "+" and the next operator is "+", so they have the same precedence. In this case we'll create the AST node for "a+b", and then continue parsing:

- -

-      ... if body omitted ...
-    }
-    
-    // Merge LHS/RHS.
-    LHS = new BinaryExprAST(BinOp, LHS, RHS);
-  }  // loop around to the top of the while loop.
-}
-

- -

In our example above, this will turn "a+b+" into "(a+b)" and execute the next iteration of the loop, with "+" as the current token. The code above will eat and remember the "+", then parse "(c+d)" as the primary expression, which makes the current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with "*" as the binop to the right of the primary. In this case, the precedence of "*" is higher than the precedence of "+", so the body of the 'if' will be entered.

- -

The critical question left here is "how can the if condition parse the right hand side in full"? In particular, to build the AST correctly for our example, it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to do this is surprisingly simple (code from the above two blocks duplicated for context):

- -

-    // If BinOp binds less tightly with RHS than the operator after RHS, let
-    // the pending operator take RHS as its LHS.
-    int NextPrec = GetTokPrecedence();
-    if (TokPrec < NextPrec) {
-      RHS = ParseBinOpRHS(TokPrec+1, RHS);
-      if (RHS == 0) return 0;
-    }
-    // Merge LHS/RHS.
-    LHS = new BinaryExprAST(BinOp, LHS, RHS);
-  }  // loop around to the top of the while loop.
-}
-

- -

At this point, we know that the binary operator to the RHS of our primary has higher precedence than the binop we are currently parsing. As such, we know that any sequence of pairs whose operators are all higher precedence than "+" should be parsed together and returned as "RHS". To do this, we recursively invoke the ParseBinOpRHS function specifying "TokPrec+1" as the minimum precedence required for it to continue. In our example above, this will cause it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS of the '+' expression.

- -

Finally, on the next iteration of the while loop, the "+g" piece is parsed and added to the AST. With this little bit of code (14 non-trivial lines), we correctly handle fully general binary expression parsing in a very elegant way. This was a whirlwind tour of this code, and it is somewhat subtle. I recommend running through it with a few tough examples to see how it works.
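One way to run through a few examples is a toy version of the same algorithm that builds a fully parenthesized string instead of an AST. This is a sketch under simplifying assumptions that are not part of the tutorial: every token is a single character, identifiers are single letters, error handling is omitted, and the names `Input`, `Pos`, `Cur`, `CurPrecedence`, and `Parenthesize` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>

static std::string Input;   // toy token stream: one character per token
static std::size_t Pos;     // index of the current token
static std::map<char, int> Prec = {{'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

// Current token, or '\0' once the input is exhausted.
static char Cur() { return Pos < Input.size() ? Input[Pos] : '\0'; }

// Precedence of the current token, or -1 if it is not a binary operator.
static int CurPrecedence() {
  std::map<char, int>::const_iterator It = Prec.find(Cur());
  return It == Prec.end() ? -1 : It->second;
}

static std::string ParseExpr();

// primary ::= letter | '(' expression ')'
static std::string ParsePrimary() {
  if (Cur() == '(') {
    ++Pos;                           // eat '('
    std::string V = ParseExpr();
    if (Cur() == ')') ++Pos;         // eat ')'; a real parser would report an error
    return V;                        // grouping parentheses add no node
  }
  char C = Cur();
  if (C != '\0') ++Pos;              // eat the identifier
  return std::string(1, C);
}

// Same control flow as ParseBinOpRHS in the text, with strings as the "AST".
static std::string ParseBinOpRHS(int ExprPrec, std::string LHS) {
  while (true) {
    int TokPrec = CurPrecedence();
    if (TokPrec < ExprPrec)
      return LHS;                    // operator binds too loosely, or no operator

    char BinOp = Cur();
    ++Pos;                           // eat binop
    std::string RHS = ParsePrimary();

    // If the next operator binds tighter, let it take RHS as its LHS first.
    int NextPrec = CurPrecedence();
    if (TokPrec < NextPrec)
      RHS = ParseBinOpRHS(TokPrec + 1, RHS);

    LHS = "(" + LHS + BinOp + RHS + ")";   // merge LHS/RHS
  }
}

static std::string ParseExpr() { return ParseBinOpRHS(0, ParsePrimary()); }

// Hypothetical entry point: parse Src and return the grouping that was chosen.
static std::string Parenthesize(const std::string &Src) {
  Input = Src;
  Pos = 0;
  return ParseExpr();
}
```

Calling `Parenthesize("a+b+(c+d)*e*f+g")` should yield `(((a+b)+(((c+d)*e)*f))+g)`, which matches the grouping described in the walk-through: "(c+d)*e*f" ends up as the right-hand side of the second "+", and "+g" is attached last.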

- -

This wraps up handling of expressions. At this point, we can point the parser at an arbitrary token stream and build an expression from it, stopping at the first token that is not part of the expression. Next up we need to handle function definitions, etc.

- -

- - -

Parsing the Rest

- - -

- -

The next thing missing is handling of function prototypes. In Kaleidoscope, these are used both for 'extern' function declarations as well as function body definitions. The code to do this is straightforward and not very interesting (once you've survived expressions):

- -

-/// prototype
-///   ::= id '(' id* ')'
-static PrototypeAST *ParsePrototype() {
-  if (CurTok != tok_identifier)
-    return ErrorP("Expected function name in prototype");
-
-  std::string FnName = IdentifierStr;
-  getNextToken();
-  
-  if (CurTok != '(')
-    return ErrorP("Expected '(' in prototype");
-  
-  // Read the list of argument names.
-  std::vector<std::string> ArgNames;
-  while (getNextToken() == tok_identifier)
-    ArgNames.push_back(IdentifierStr);
-  if (CurTok != ')')
-    return ErrorP("Expected ')' in prototype");
-  
-  // success.
-  getNextToken();  // eat ')'.
-  
-  return new PrototypeAST(FnName, ArgNames);
-}
-

- -

Given this, a function definition is very simple, just a prototype plus an expression to implement the body:

- -

-/// definition ::= 'def' prototype expression
-static FunctionAST *ParseDefinition() {
-  getNextToken();  // eat def.
-  PrototypeAST *Proto = ParsePrototype();
-  if (Proto == 0) return 0;
-
-  if (ExprAST *E = ParseExpression())
-    return new FunctionAST(Proto, E);
-  return 0;
-}
-

- -

In addition, we support 'extern' to declare functions like 'sin' and 'cos' as well as to support forward declaration of user functions. These 'extern's are just prototypes with no body:

- -

-/// external ::= 'extern' prototype
-static PrototypeAST *ParseExtern() {
-  getNextToken();  // eat extern.
-  return ParsePrototype();
-}
-

- -

Finally, we'll also let the user type in arbitrary top-level expressions and evaluate them on the fly. We will handle this by defining anonymous nullary (zero argument) functions for them:

- -

-/// toplevelexpr ::= expression
-static FunctionAST *ParseTopLevelExpr() {
-  if (ExprAST *E = ParseExpression()) {
-    // Make an anonymous proto.
-    PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
-    return new FunctionAST(Proto, E);
-  }
-  return 0;
-}
-

- -

Now that we have all the pieces, let's build a little driver that will let us actually execute this code we've built!

- -

- - -

The Driver

- - -

- -

The driver for this simply invokes all of the parsing pieces with a top-level dispatch loop. There isn't much interesting here, so I'll just include the top-level loop. See below for full code in the "Top-Level Parsing" section.

- -

-/// top ::= definition | external | expression | ';'
-static void MainLoop() {
-  while (1) {
-    fprintf(stderr, "ready> ");
-    switch (CurTok) {
-    case tok_eof:    return;
-    case ';':        getNextToken(); break;  // ignore top-level semicolons.
-    case tok_def:    HandleDefinition(); break;
-    case tok_extern: HandleExtern(); break;
-    default:         HandleTopLevelExpression(); break;
-    }
-  }
-}
-

- -

The most interesting part of this is that we ignore top-level semicolons. Why is this, you ask? The basic reason is that if you type "4 + 5" at the command line, the parser doesn't know whether that is the end of what you will type or not. For example, on the next line you could type "def foo..." in which case 4+5 is the end of a top-level expression. Alternatively you could type "* 6", which would continue the expression. Having top-level semicolons allows you to type "4+5;", and the parser will know you are done.

- -

- - -

LLVM Tutorial: Table of Contents

Kaleidoscope: Tutorial Introduction and the Lexer

Tutorial Introduction

The Basic Language

The Lexer

Kaleidoscope: Implementing a Parser and AST

Chapter 2 Introduction

The Abstract Syntax Tree (AST)

Parser Basics

Basic Expression Parsing

Binary Expression Parsing

Parsing the Rest

The Driver

Conclusions