Diffstat (limited to 'third_party/ply/doc/ply.html')
-rw-r--r-- | third_party/ply/doc/ply.html | 3262 |
1 files changed, 3262 insertions, 0 deletions
diff --git a/third_party/ply/doc/ply.html b/third_party/ply/doc/ply.html new file mode 100644 index 00000000..fdcd88a5 --- /dev/null +++ b/third_party/ply/doc/ply.html @@ -0,0 +1,3262 @@ +<html> +<head> +<title>PLY (Python Lex-Yacc)</title> +</head> +<body bgcolor="#ffffff"> + +<h1>PLY (Python Lex-Yacc)</h1> + +<b> +David M. Beazley <br> +dave@dabeaz.com<br> +</b> + +<p> +<b>PLY Version: 3.4</b> +<p> + +<!-- INDEX --> +<div class="sectiontoc"> +<ul> +<li><a href="#ply_nn1">Preface and Requirements</a> +<li><a href="#ply_nn1">Introduction</a> +<li><a href="#ply_nn2">PLY Overview</a> +<li><a href="#ply_nn3">Lex</a> +<ul> +<li><a href="#ply_nn4">Lex Example</a> +<li><a href="#ply_nn5">The tokens list</a> +<li><a href="#ply_nn6">Specification of tokens</a> +<li><a href="#ply_nn7">Token values</a> +<li><a href="#ply_nn8">Discarded tokens</a> +<li><a href="#ply_nn9">Line numbers and positional information</a> +<li><a href="#ply_nn10">Ignored characters</a> +<li><a href="#ply_nn11">Literal characters</a> +<li><a href="#ply_nn12">Error handling</a> +<li><a href="#ply_nn13">Building and using the lexer</a> +<li><a href="#ply_nn14">The @TOKEN decorator</a> +<li><a href="#ply_nn15">Optimized mode</a> +<li><a href="#ply_nn16">Debugging</a> +<li><a href="#ply_nn17">Alternative specification of lexers</a> +<li><a href="#ply_nn18">Maintaining state</a> +<li><a href="#ply_nn19">Lexer cloning</a> +<li><a href="#ply_nn20">Internal lexer state</a> +<li><a href="#ply_nn21">Conditional lexing and start conditions</a> +<li><a href="#ply_nn21">Miscellaneous Issues</a> +</ul> +<li><a href="#ply_nn22">Parsing basics</a> +<li><a href="#ply_nn23">Yacc</a> +<ul> +<li><a href="#ply_nn24">An example</a> +<li><a href="#ply_nn25">Combining Grammar Rule Functions</a> +<li><a href="#ply_nn26">Character Literals</a> +<li><a href="#ply_nn26">Empty Productions</a> +<li><a href="#ply_nn28">Changing the starting symbol</a> +<li><a href="#ply_nn27">Dealing With Ambiguous Grammars</a> +<li><a href="#ply_nn28">The parser.out file</a> +<li><a href="#ply_nn29">Syntax Error Handling</a> +<ul> +<li><a href="#ply_nn30">Recovery and resynchronization with error rules</a> +<li><a href="#ply_nn31">Panic mode recovery</a> +<li><a href="#ply_nn35">Signaling an error from a production</a> +<li><a href="#ply_nn32">General comments on error handling</a> +</ul> +<li><a href="#ply_nn33">Line Number and Position Tracking</a> +<li><a href="#ply_nn34">AST Construction</a> +<li><a href="#ply_nn35">Embedded Actions</a> +<li><a href="#ply_nn36">Miscellaneous Yacc Notes</a> +</ul> +<li><a href="#ply_nn37">Multiple Parsers and Lexers</a> +<li><a href="#ply_nn38">Using Python's Optimized Mode</a> +<li><a href="#ply_nn44">Advanced Debugging</a> +<ul> +<li><a href="#ply_nn45">Debugging the lex() and yacc() commands</a> +<li><a href="#ply_nn46">Run-time Debugging</a> +</ul> +<li><a href="#ply_nn39">Where to go from here?</a> +</ul> +</div> +<!-- INDEX --> + + + +<H2><a name="ply_nn1"></a>1. Preface and Requirements</H2> + + +<p> +This document provides an overview of lexing and parsing with PLY. +Given the intrinsic complexity of parsing, I would strongly advise +that you read (or at least skim) this entire document before jumping +into a big development project with PLY. +</p> + +<p> +PLY-3.0 is compatible with both Python 2 and Python 3. Be aware that +Python 3 support is new and has not been extensively tested (although +all of the examples and unit tests pass under Python 3.0). If you are +using Python 2, you should try to use Python 2.4 or newer. 
Although PLY +works with versions as far back as Python 2.2, some of its optional features +require more modern library modules. +</p> + +<H2><a name="ply_nn1"></a>2. Introduction</H2> + + +PLY is a pure-Python implementation of the popular compiler +construction tools lex and yacc. The main goal of PLY is to stay +fairly faithful to the way in which traditional lex/yacc tools work. +This includes supporting LALR(1) parsing as well as providing +extensive input validation, error reporting, and diagnostics. Thus, +if you've used yacc in another programming language, it should be +relatively straightforward to use PLY. + +<p> +Early versions of PLY were developed to support an Introduction to +Compilers Course I taught in 2001 at the University of Chicago. In this course, +students built a fully functional compiler for a simple Pascal-like +language. Their compiler, implemented entirely in Python, had to +include lexical analysis, parsing, type checking, type inference, +nested scoping, and code generation for the SPARC processor. +Approximately 30 different compiler implementations were completed in +this course. Most of PLY's interface and operation has been influenced by common +usability problems encountered by students. Since 2001, PLY has +continued to be improved as feedback has been received from users. +PLY-3.0 represents a major refactoring of the original implementation +with an eye towards future enhancements. + +<p> +Since PLY was primarily developed as an instructional tool, you will +find it to be fairly picky about token and grammar rule +specification. In part, this +added formality is meant to catch common programming mistakes made by +novice users. However, advanced users will also find such features to +be useful when building complicated grammars for real programming +languages. It should also be noted that PLY does not provide much in +the way of bells and whistles (e.g., automatic construction of +abstract syntax trees, tree traversal, etc.). Nor would I consider it +to be a parsing framework. Instead, you will find a bare-bones, yet +fully capable lex/yacc implementation written entirely in Python. + +<p> +The rest of this document assumes that you are somewhat familar with +parsing theory, syntax directed translation, and the use of compiler +construction tools such as lex and yacc in other programming +languages. If you are unfamilar with these topics, you will probably +want to consult an introductory text such as "Compilers: Principles, +Techniques, and Tools", by Aho, Sethi, and Ullman. O'Reilly's "Lex +and Yacc" by John Levine may also be handy. In fact, the O'Reilly book can be +used as a reference for PLY as the concepts are virtually identical. + +<H2><a name="ply_nn2"></a>3. PLY Overview</H2> + + +PLY consists of two separate modules; <tt>lex.py</tt> and +<tt>yacc.py</tt>, both of which are found in a Python package +called <tt>ply</tt>. The <tt>lex.py</tt> module is used to break input text into a +collection of tokens specified by a collection of regular expression +rules. <tt>yacc.py</tt> is used to recognize language syntax that has +been specified in the form of a context free grammar. <tt>yacc.py</tt> uses LR parsing and generates its parsing tables +using either the LALR(1) (the default) or SLR table generation algorithms. + +<p> +The two tools are meant to work together. Specifically, +<tt>lex.py</tt> provides an external interface in the form of a +<tt>token()</tt> function that returns the next valid token on the +input stream. 
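To make this division of labor concrete, here is a minimal, self-contained sketch of how
the two modules are typically combined. The token and grammar rules below are purely
illustrative; the <tt>t_</tt> and <tt>p_</tt> naming conventions they rely on are explained
in detail in the Lex and Yacc sections that follow.

<blockquote>
<pre>
import ply.lex as lex
import ply.yacc as yacc

# --- Lexer specification: token names and t_ rules (read by lex.lex()) ---
tokens = ('NUMBER', 'PLUS')

t_PLUS   = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

# --- Parser specification: grammar rules in p_ functions (read by yacc.yacc()) ---
def p_expr_plus(p):
    'expr : expr PLUS NUMBER'
    p[0] = p[1] + p[3]

def p_expr_number(p):
    'expr : NUMBER'
    p[0] = p[1]

def p_error(p):
    pass

lexer  = lex.lex()      # build the lexer from the t_ definitions above
parser = yacc.yacc()    # build the LALR(1) parser and cache its tables
result = parser.parse('1 + 2 + 3', lexer=lexer)    # result is 6
</pre>
</blockquote>
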
<tt>yacc.py</tt> calls this repeatedly to retrieve
+tokens and invoke grammar rules. The output of <tt>yacc.py</tt> is
+often an Abstract Syntax Tree (AST). However, this is entirely up to
+the user. If desired, <tt>yacc.py</tt> can also be used to implement
+simple one-pass compilers.
+
+<p>
+Like its Unix counterpart, <tt>yacc.py</tt> provides most of the
+features you expect including extensive error checking, grammar
+validation, support for empty productions, error tokens, and ambiguity
+resolution via precedence rules. In fact, everything that is possible in traditional yacc
+should be supported in PLY.
+
+<p>
+The primary difference between
+<tt>yacc.py</tt> and Unix <tt>yacc</tt> is that <tt>yacc.py</tt>
+doesn't involve a separate code-generation process.
+Instead, PLY relies on reflection (introspection)
+to build its lexers and parsers. Unlike traditional lex/yacc which
+require a special input file that is converted into a separate source
+file, the specifications given to PLY <em>are</em> valid Python
+programs. This means that there are no extra source files nor is
+there a special compiler construction step (e.g., running yacc to
+generate Python code for the compiler). Since the generation of the
+parsing tables is relatively expensive, PLY caches the results and
+saves them to a file. If no changes are detected in the input source,
+the tables are read from the cache. Otherwise, they are regenerated.
+
+<H2><a name="ply_nn3"></a>4. Lex</H2>
+
+
+<tt>lex.py</tt> is used to tokenize an input string. For example, suppose
+you're writing a programming language and a user supplied the following input string:
+
+<blockquote>
+<pre>
+x = 3 + 42 * (s - t)
+</pre>
+</blockquote>
+
+A tokenizer splits the string into individual tokens:
+
+<blockquote>
+<pre>
+'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'
+</pre>
+</blockquote>
+
+Tokens are usually given names to indicate what they are. For example:
+
+<blockquote>
+<pre>
+'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
+'LPAREN','ID','MINUS','ID','RPAREN'
+</pre>
+</blockquote>
+
+More specifically, the input is broken into pairs of token types and values. For example:
+
+<blockquote>
+<pre>
+('ID','x'), ('EQUALS','='), ('NUMBER','3'),
+('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
+('LPAREN','('), ('ID','s'), ('MINUS','-'),
+('ID','t'), ('RPAREN',')')
+</pre>
+</blockquote>
+
+The identification of tokens is typically done by writing a series of regular expression
+rules. The next section shows how this is done using <tt>lex.py</tt>.
+
+<H3><a name="ply_nn4"></a>4.1 Lex Example</H3>
+
+
+The following example shows how <tt>lex.py</tt> is used to write a simple tokenizer.
+
+<blockquote>
+<pre>
+# ------------------------------------------------------------
+# calclex.py
+#
+# tokenizer for a simple expression evaluator for
+# numbers and +,-,*,/
+# ------------------------------------------------------------
+import ply.lex as lex
+
+# List of token names. This is always required
+tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+)
+
+# Regular expression rules for simple tokens
+t_PLUS = r'\+'
+t_MINUS = r'-'
+t_TIMES = r'\*'
+t_DIVIDE = r'/'
+t_LPAREN = r'\('
+t_RPAREN = r'\)'
+
+# A regular expression rule with some action code
+def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+# Define a rule so we can track line numbers
+def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+# A string containing ignored characters (spaces and tabs)
+t_ignore = ' \t'
+
+# Error handling rule
+def t_error(t):
+ print "Illegal character '%s'" % t.value[0]
+ t.lexer.skip(1)
+
+# Build the lexer
+lexer = lex.lex()
+
+</pre>
+</blockquote>
+To use the lexer, you first need to feed it some input text using
+its <tt>input()</tt> method. After that, repeated calls
+to <tt>token()</tt> produce tokens. The following code shows how this
+works:
+
+<blockquote>
+<pre>
+
+# Test it out
+data = ''' +3 + 4 * 10 + + -20 *2 +'''
+
+# Give the lexer some input
+lexer.input(data)
+
+# Tokenize
+while True:
+ tok = lexer.token()
+ if not tok: break # No more input
+ print tok
+</pre>
+</blockquote>
+
+When executed, the example will produce the following output:
+
+<blockquote>
+<pre>
+$ python example.py
+LexToken(NUMBER,3,2,1)
+LexToken(PLUS,'+',2,3)
+LexToken(NUMBER,4,2,5)
+LexToken(TIMES,'*',2,7)
+LexToken(NUMBER,10,2,10)
+LexToken(PLUS,'+',3,14)
+LexToken(MINUS,'-',3,16)
+LexToken(NUMBER,20,3,18)
+LexToken(TIMES,'*',3,20)
+LexToken(NUMBER,2,3,21)
+</pre>
+</blockquote>
+
+Lexers also support the iteration protocol. So, you can write the above loop as follows:
+
+<blockquote>
+<pre>
+for tok in lexer:
+ print tok
+</pre>
+</blockquote>
+
+The tokens returned by <tt>lexer.token()</tt> are instances
+of <tt>LexToken</tt>. This object has
+attributes <tt>tok.type</tt>, <tt>tok.value</tt>,
+<tt>tok.lineno</tt>, and <tt>tok.lexpos</tt>. The following code shows an example of
+accessing these attributes:
+
+<blockquote>
+<pre>
+# Tokenize
+while True:
+ tok = lexer.token()
+ if not tok: break # No more input
+ print tok.type, tok.value, tok.lineno, tok.lexpos
+</pre>
+</blockquote>
+
+The <tt>tok.type</tt> and <tt>tok.value</tt> attributes contain the
+type and value of the token itself.
+<tt>tok.lineno</tt> and <tt>tok.lexpos</tt> contain information about
+the location of the token. <tt>tok.lexpos</tt> is the index of the
+token relative to the start of the input text.
+
+<H3><a name="ply_nn5"></a>4.2 The tokens list</H3>
+
+
+All lexers must provide a list <tt>tokens</tt> that defines all of the possible token
+names that can be produced by the lexer. This list is always required
+and is used to perform a variety of validation checks. The tokens list is also used by the
+<tt>yacc.py</tt> module to identify terminals.
+
+<p>
+In the example, the following code specifies the token names:
+
+<blockquote>
+<pre>
+tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+)
+</pre>
+</blockquote>
+
+<H3><a name="ply_nn6"></a>4.3 Specification of tokens</H3>
+
+
+Each token is specified by writing a regular expression rule. Each of these rules is
+defined by making declarations with a special prefix <tt>t_</tt> to indicate that it
+defines a token.
For simple tokens, the regular expression can +be specified as strings such as this (note: Python raw strings are used since they are the +most convenient way to write regular expression strings): + +<blockquote> +<pre> +t_PLUS = r'\+' +</pre> +</blockquote> + +In this case, the name following the <tt>t_</tt> must exactly match one of the +names supplied in <tt>tokens</tt>. If some kind of action needs to be performed, +a token rule can be specified as a function. For example, this rule matches numbers and +converts the string into a Python integer. + +<blockquote> +<pre> +def t_NUMBER(t): + r'\d+' + t.value = int(t.value) + return t +</pre> +</blockquote> + +When a function is used, the regular expression rule is specified in the function documentation string. +The function always takes a single argument which is an instance of +<tt>LexToken</tt>. This object has attributes of <tt>t.type</tt> which is the token type (as a string), +<tt>t.value</tt> which is the lexeme (the actual text matched), <tt>t.lineno</tt> which is the current line number, and <tt>t.lexpos</tt> which +is the position of the token relative to the beginning of the input text. +By default, <tt>t.type</tt> is set to the name following the <tt>t_</tt> prefix. The action +function can modify the contents of the <tt>LexToken</tt> object as appropriate. However, +when it is done, the resulting token should be returned. If no value is returned by the action +function, the token is simply discarded and the next token read. + +<p> +Internally, <tt>lex.py</tt> uses the <tt>re</tt> module to do its patten matching. When building the master regular expression, +rules are added in the following order: +<p> +<ol> +<li>All tokens defined by functions are added in the same order as they appear in the lexer file. +<li>Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions +are added first). +</ol> +<p> +Without this ordering, it can be difficult to correctly match certain types of tokens. For example, if you +wanted to have separate tokens for "=" and "==", you need to make sure that "==" is checked first. By sorting regular +expressions in order of decreasing length, this problem is solved for rules defined as strings. For functions, +the order can be explicitly controlled since rules appearing first are checked first. + +<p> +To handle reserved words, you should write a single rule to match an +identifier and do a special name lookup in a function like this: + +<blockquote> +<pre> +reserved = { + 'if' : 'IF', + 'then' : 'THEN', + 'else' : 'ELSE', + 'while' : 'WHILE', + ... +} + +tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values()) + +def t_ID(t): + r'[a-zA-Z_][a-zA-Z_0-9]*' + t.type = reserved.get(t.value,'ID') # Check for reserved words + return t +</pre> +</blockquote> + +This approach greatly reduces the number of regular expression rules and is likely to make things a little faster. + +<p> +<b>Note:</b> You should avoid writing individual rules for reserved words. For example, if you write rules like this, + +<blockquote> +<pre> +t_FOR = r'for' +t_PRINT = r'print' +</pre> +</blockquote> + +those rules will be triggered for identifiers that include those words as a prefix such as "forget" or "printed". This is probably not +what you want. + +<H3><a name="ply_nn7"></a>4.4 Token values</H3> + + +When tokens are returned by lex, they have a value that is stored in the <tt>value</tt> attribute. Normally, the value is the text +that was matched. 
However, the value can be assigned to any Python object. For instance, when lexing identifiers, you may +want to return both the identifier name and information from some sort of symbol table. To do this, you might write a rule like this: + +<blockquote> +<pre> +def t_ID(t): + ... + # Look up symbol table information and return a tuple + t.value = (t.value, symbol_lookup(t.value)) + ... + return t +</pre> +</blockquote> + +It is important to note that storing data in other attribute names is <em>not</em> recommended. The <tt>yacc.py</tt> module only exposes the +contents of the <tt>value</tt> attribute. Thus, accessing other attributes may be unnecessarily awkward. If you +need to store multiple values on a token, assign a tuple, dictionary, or instance to <tt>value</tt>. + +<H3><a name="ply_nn8"></a>4.5 Discarded tokens</H3> + + +To discard a token, such as a comment, simply define a token rule that returns no value. For example: + +<blockquote> +<pre> +def t_COMMENT(t): + r'\#.*' + pass + # No return value. Token discarded +</pre> +</blockquote> + +Alternatively, you can include the prefix "ignore_" in the token declaration to force a token to be ignored. For example: + +<blockquote> +<pre> +t_ignore_COMMENT = r'\#.*' +</pre> +</blockquote> + +Be advised that if you are ignoring many different kinds of text, you may still want to use functions since these provide more precise +control over the order in which regular expressions are matched (i.e., functions are matched in order of specification whereas strings are +sorted by regular expression length). + +<H3><a name="ply_nn9"></a>4.6 Line numbers and positional information</H3> + + +<p>By default, <tt>lex.py</tt> knows nothing about line numbers. This is because <tt>lex.py</tt> doesn't know anything +about what constitutes a "line" of input (e.g., the newline character or even if the input is textual data). +To update this information, you need to write a special rule. In the example, the <tt>t_newline()</tt> rule shows how to do this. + +<blockquote> +<pre> +# Define a rule so we can track line numbers +def t_newline(t): + r'\n+' + t.lexer.lineno += len(t.value) +</pre> +</blockquote> +Within the rule, the <tt>lineno</tt> attribute of the underlying lexer <tt>t.lexer</tt> is updated. +After the line number is updated, the token is simply discarded since nothing is returned. + +<p> +<tt>lex.py</tt> does not perform and kind of automatic column tracking. However, it does record positional +information related to each token in the <tt>lexpos</tt> attribute. Using this, it is usually possible to compute +column information as a separate step. For instance, just count backwards until you reach a newline. + +<blockquote> +<pre> +# Compute column. +# input is the input text string +# token is a token instance +def find_column(input,token): + last_cr = input.rfind('\n',0,token.lexpos) + if last_cr < 0: + last_cr = 0 + column = (token.lexpos - last_cr) + 1 + return column +</pre> +</blockquote> + +Since column information is often only useful in the context of error handling, calculating the column +position can be performed when needed as opposed to doing it for each token. + +<H3><a name="ply_nn10"></a>4.7 Ignored characters</H3> + + +<p> +The special <tt>t_ignore</tt> rule is reserved by <tt>lex.py</tt> for characters +that should be completely ignored in the input stream. +Usually this is used to skip over whitespace and other non-essential characters. 
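For example, the calculator lexer shown earlier skips spaces and tabs with a single
declaration (this repeats the rule from the <tt>calclex.py</tt> example above):

<blockquote>
<pre>
t_ignore = ' \t'
</pre>
</blockquote>

Note that <tt>t_ignore</tt> is a plain string of characters to skip, not a regular expression.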
+Although it is possible to define a regular expression rule for whitespace in a manner +similar to <tt>t_newline()</tt>, the use of <tt>t_ignore</tt> provides substantially better +lexing performance because it is handled as a special case and is checked in a much +more efficient manner than the normal regular expression rules. + +<H3><a name="ply_nn11"></a>4.8 Literal characters</H3> + + +<p> +Literal characters can be specified by defining a variable <tt>literals</tt> in your lexing module. For example: + +<blockquote> +<pre> +literals = [ '+','-','*','/' ] +</pre> +</blockquote> + +or alternatively + +<blockquote> +<pre> +literals = "+-*/" +</pre> +</blockquote> + +A literal character is simply a single character that is returned "as is" when encountered by the lexer. Literals are checked +after all of the defined regular expression rules. Thus, if a rule starts with one of the literal characters, it will always +take precedence. +<p> +When a literal token is returned, both its <tt>type</tt> and <tt>value</tt> attributes are set to the character itself. For example, <tt>'+'</tt>. + +<H3><a name="ply_nn12"></a>4.9 Error handling</H3> + + +<p> +Finally, the <tt>t_error()</tt> +function is used to handle lexing errors that occur when illegal +characters are detected. In this case, the <tt>t.value</tt> attribute contains the +rest of the input string that has not been tokenized. In the example, the error function +was defined as follows: + +<blockquote> +<pre> +# Error handling rule +def t_error(t): + print "Illegal character '%s'" % t.value[0] + t.lexer.skip(1) +</pre> +</blockquote> + +In this case, we simply print the offending character and skip ahead one character by calling <tt>t.lexer.skip(1)</tt>. + +<H3><a name="ply_nn13"></a>4.10 Building and using the lexer</H3> + + +<p> +To build the lexer, the function <tt>lex.lex()</tt> is used. This function +uses Python reflection (or introspection) to read the the regular expression rules +out of the calling context and build the lexer. Once the lexer has been built, two methods can +be used to control the lexer. + +<ul> +<li><tt>lexer.input(data)</tt>. Reset the lexer and store a new input string. +<li><tt>lexer.token()</tt>. Return the next token. Returns a special <tt>LexToken</tt> instance on success or +None if the end of the input text has been reached. +</ul> + +The preferred way to use PLY is to invoke the above methods directly on the lexer object returned by the +<tt>lex()</tt> function. The legacy interface to PLY involves module-level functions <tt>lex.input()</tt> and <tt>lex.token()</tt>. +For example: + +<blockquote> +<pre> +lex.lex() +lex.input(sometext) +while 1: + tok = lex.token() + if not tok: break + print tok +</pre> +</blockquote> + +<p> +In this example, the module-level functions <tt>lex.input()</tt> and <tt>lex.token()</tt> are bound to the <tt>input()</tt> +and <tt>token()</tt> methods of the last lexer created by the lex module. This interface may go away at some point so +it's probably best not to use it. + +<H3><a name="ply_nn14"></a>4.11 The @TOKEN decorator</H3> + + +In some applications, you may want to define build tokens from as a series of +more complex regular expression rules. For example: + +<blockquote> +<pre> +digit = r'([0-9])' +nondigit = r'([_A-Za-z])' +identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)' + +def t_ID(t): + # want docstring to be identifier above. ????? + ... 
+</pre> +</blockquote> + +In this case, we want the regular expression rule for <tt>ID</tt> to be one of the variables above. However, there is no +way to directly specify this using a normal documentation string. To solve this problem, you can use the <tt>@TOKEN</tt> +decorator. For example: + +<blockquote> +<pre> +from ply.lex import TOKEN + +@TOKEN(identifier) +def t_ID(t): + ... +</pre> +</blockquote> + +This will attach <tt>identifier</tt> to the docstring for <tt>t_ID()</tt> allowing <tt>lex.py</tt> to work normally. An alternative +approach this problem is to set the docstring directly like this: + +<blockquote> +<pre> +def t_ID(t): + ... + +t_ID.__doc__ = identifier +</pre> +</blockquote> + +<b>NOTE:</b> Use of <tt>@TOKEN</tt> requires Python-2.4 or newer. If you're concerned about backwards compatibility with older +versions of Python, use the alternative approach of setting the docstring directly. + +<H3><a name="ply_nn15"></a>4.12 Optimized mode</H3> + + +For improved performance, it may be desirable to use Python's +optimized mode (e.g., running Python with the <tt>-O</tt> +option). However, doing so causes Python to ignore documentation +strings. This presents special problems for <tt>lex.py</tt>. To +handle this case, you can create your lexer using +the <tt>optimize</tt> option as follows: + +<blockquote> +<pre> +lexer = lex.lex(optimize=1) +</pre> +</blockquote> + +Next, run Python in its normal operating mode. When you do +this, <tt>lex.py</tt> will write a file called <tt>lextab.py</tt> to +the current directory. This file contains all of the regular +expression rules and tables used during lexing. On subsequent +executions, +<tt>lextab.py</tt> will simply be imported to build the lexer. This +approach substantially improves the startup time of the lexer and it +works in Python's optimized mode. + +<p> +To change the name of the lexer-generated file, use the <tt>lextab</tt> keyword argument. For example: + +<blockquote> +<pre> +lexer = lex.lex(optimize=1,lextab="footab") +</pre> +</blockquote> + +When running in optimized mode, it is important to note that lex disables most error checking. Thus, this is really only recommended +if you're sure everything is working correctly and you're ready to start releasing production code. + +<H3><a name="ply_nn16"></a>4.13 Debugging</H3> + + +For the purpose of debugging, you can run <tt>lex()</tt> in a debugging mode as follows: + +<blockquote> +<pre> +lexer = lex.lex(debug=1) +</pre> +</blockquote> + +<p> +This will produce various sorts of debugging information including all of the added rules, +the master regular expressions used by the lexer, and tokens generating during lexing. +</p> + +<p> +In addition, <tt>lex.py</tt> comes with a simple main function which +will either tokenize input read from standard input or from a file specified +on the command line. To use it, simply put this in your lexer: +</p> + +<blockquote> +<pre> +if __name__ == '__main__': + lex.runmain() +</pre> +</blockquote> + +Please refer to the "Debugging" section near the end for some more advanced details +of debugging. + +<H3><a name="ply_nn17"></a>4.14 Alternative specification of lexers</H3> + + +As shown in the example, lexers are specified all within one Python module. If you want to +put token rules in a different module from the one in which you invoke <tt>lex()</tt>, use the +<tt>module</tt> keyword argument. 
+
+<p>
+For example, you might have a dedicated module that just contains
+the token rules:
+
+<blockquote>
+<pre>
+# module: tokrules.py
+# This module just contains the lexing rules
+
+# List of token names. This is always required
+tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+)
+
+# Regular expression rules for simple tokens
+t_PLUS = r'\+'
+t_MINUS = r'-'
+t_TIMES = r'\*'
+t_DIVIDE = r'/'
+t_LPAREN = r'\('
+t_RPAREN = r'\)'
+
+# A regular expression rule with some action code
+def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+# Define a rule so we can track line numbers
+def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+# A string containing ignored characters (spaces and tabs)
+t_ignore = ' \t'
+
+# Error handling rule
+def t_error(t):
+ print "Illegal character '%s'" % t.value[0]
+ t.lexer.skip(1)
+</pre>
+</blockquote>
+
+Now, if you wanted to build a tokenizer from these rules within a different module, you would do the following (shown for Python interactive mode):
+
+<blockquote>
+<pre>
+>>> import ply.lex as lex
+>>> import tokrules
+>>> <b>lexer = lex.lex(module=tokrules)</b>
+>>> lexer.input("3 + 4")
+>>> lexer.token()
+LexToken(NUMBER,3,1,0)
+>>> lexer.token()
+LexToken(PLUS,'+',1,2)
+>>> lexer.token()
+LexToken(NUMBER,4,1,4)
+>>> lexer.token()
+None
+>>>
+</pre>
+</blockquote>
+
+The <tt>module</tt> option can also be used to define lexers from instances of a class. For example:
+
+<blockquote>
+<pre>
+import ply.lex as lex
+
+class MyLexer:
+ # List of token names. This is always required
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+ # Regular expression rules for simple tokens
+ t_PLUS = r'\+'
+ t_MINUS = r'-'
+ t_TIMES = r'\*'
+ t_DIVIDE = r'/'
+ t_LPAREN = r'\('
+ t_RPAREN = r'\)'
+
+ # A regular expression rule with some action code
+ # Note addition of self parameter since we're in a class
+ def t_NUMBER(self,t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+ # Define a rule so we can track line numbers
+ def t_newline(self,t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+ # A string containing ignored characters (spaces and tabs)
+ t_ignore = ' \t'
+
+ # Error handling rule
+ def t_error(self,t):
+ print "Illegal character '%s'" % t.value[0]
+ t.lexer.skip(1)
+
+ <b># Build the lexer
+ def build(self,**kwargs):
+ self.lexer = lex.lex(module=self, **kwargs)</b>
+
+ # Test it out
+ def test(self,data):
+ self.lexer.input(data)
+ while True:
+ tok = self.lexer.token()
+ if not tok: break
+ print tok
+
+# Build the lexer and try it out
+m = MyLexer()
+m.build() # Build the lexer
+m.test("3 + 4") # Test it
+</pre>
+</blockquote>
+
+
+When building a lexer from a class, <em>you should construct the lexer from
+an instance of the class</em>, not the class object itself. This is because
+PLY only works properly if the lexer actions are defined by bound methods.
+
+<p>
+When using the <tt>module</tt> option to <tt>lex()</tt>, PLY collects symbols
+from the underlying object using the <tt>dir()</tt> function. There is no
+direct access to the <tt>__dict__</tt> attribute of the object supplied as a
+module value.
+
+<P>
+Finally, if you want to keep things nicely encapsulated, but don't want to use a
+full-fledged class definition, lexers can be defined using closures. For example:
+
+<blockquote>
+<pre>
+import ply.lex as lex
+
+# List of token names.
This is always required +tokens = ( + 'NUMBER', + 'PLUS', + 'MINUS', + 'TIMES', + 'DIVIDE', + 'LPAREN', + 'RPAREN', +) + +def MyLexer(): + # Regular expression rules for simple tokens + t_PLUS = r'\+' + t_MINUS = r'-' + t_TIMES = r'\*' + t_DIVIDE = r'/' + t_LPAREN = r'\(' + t_RPAREN = r'\)' + + # A regular expression rule with some action code + def t_NUMBER(t): + r'\d+' + t.value = int(t.value) + return t + + # Define a rule so we can track line numbers + def t_newline(t): + r'\n+' + t.lexer.lineno += len(t.value) + + # A string containing ignored characters (spaces and tabs) + t_ignore = ' \t' + + # Error handling rule + def t_error(t): + print "Illegal character '%s'" % t.value[0] + t.lexer.skip(1) + + # Build the lexer from my environment and return it + return lex.lex() +</pre> +</blockquote> + + +<H3><a name="ply_nn18"></a>4.15 Maintaining state</H3> + + +In your lexer, you may want to maintain a variety of state +information. This might include mode settings, symbol tables, and +other details. As an example, suppose that you wanted to keep +track of how many NUMBER tokens had been encountered. + +<p> +One way to do this is to keep a set of global variables in the module +where you created the lexer. For example: + +<blockquote> +<pre> +num_count = 0 +def t_NUMBER(t): + r'\d+' + global num_count + num_count += 1 + t.value = int(t.value) + return t +</pre> +</blockquote> + +If you don't like the use of a global variable, another place to store +information is inside the Lexer object created by <tt>lex()</tt>. +To this, you can use the <tt>lexer</tt> attribute of tokens passed to +the various rules. For example: + +<blockquote> +<pre> +def t_NUMBER(t): + r'\d+' + t.lexer.num_count += 1 # Note use of lexer attribute + t.value = int(t.value) + return t + +lexer = lex.lex() +lexer.num_count = 0 # Set the initial count +</pre> +</blockquote> + +This latter approach has the advantage of being simple and working +correctly in applications where multiple instantiations of a given +lexer exist in |