diff options
author | Daniel Jasper <djasper@google.com> | 2012-12-07 20:34:49 +0000 |
---|---|---|
committer | Daniel Jasper <djasper@google.com> | 2012-12-07 20:34:49 +0000 |
commit | 8784b03d35ff99d1851ef63bc3cccb651c7672d9 (patch) | |
tree | b83d7a529e92366ae441545940477a0738e94b4d | |
parent | afa7cae15b117c4b75794c6c32424953d94b4359 (diff) |
AST matcher tutorial (Part I)
This an AST matcher tutorial based on Sam Panzer's document
(https://docs.google.com/a/google.com/document/d/1oTkVLhCdRJUEH1_LDaQdXqe8-aOqT5GLDL9e4MhoFF8/edit).
Checking in now although some parts might be a bit rough so others can
help improving it.
git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@169632 91177308-0d34-0410-b5e6-96231b3b80d8
-rw-r--r-- | docs/LibASTMatchersTutorial.html | 533 |
1 files changed, 533 insertions, 0 deletions
diff --git a/docs/LibASTMatchersTutorial.html b/docs/LibASTMatchersTutorial.html new file mode 100644 index 0000000000..222ce70dfa --- /dev/null +++ b/docs/LibASTMatchersTutorial.html @@ -0,0 +1,533 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" + "http://www.w3.org/TR/html4/strict.dtd"> +<html> +<head> +<title>Tutorial for building tools using LibTooling and LibASTMatchers</title> +<link type="text/css" rel="stylesheet" href="../menu.css" /> +<link type="text/css" rel="stylesheet" href="../content.css" /> +</head> +<body> + +<!--#include virtual="../menu.html.incl"--> + +<div id="content"> + +<h1>Tutorial for building tools using LibTooling and LibASTMatchers</h1> +<p>This document is intended to show how to build a useful source-to-source +translation tool based on Clang's <a href="LibTooling.html">LibTooling</a>. It +is explicitly aimed at people who are new to Clang, so all you should need is a +working knowledge of C++ and the command line.</p> + +<p>In order to work on the compiler, you need some basic knowledge of the +abstract syntax tree (AST). To this end, the reader is incouraged to skim the +<a href="http://clang.llvm.org/docs/IntroductionToTheClangAST.html">Introduction +to the Clang AST</a></p> + +<!-- ======================================================================= --> +<h2 id="obtainingclang">Step 0: Obtaining Clang</h2> +<!-- ======================================================================= --> +As Clang is part of the LLVM project, you'll need to download LLVM's source code +first. Both Clang and LLVM are maintained as Subversion repositories, but we'll +be accessing them through the git mirror. For further information, see the +<a href="http://llvm.org/docs/GettingStarted.html">getting started guide</a>. + +<pre class="doc_code"> + mkdir ~/clang-llvm && cd ~/clang-llvm + git clone http://llvm.org/git/llvm.git + cd llvm/tools + git clone http://llvm.org/git/clang.git +</pre> + +Next you need to obtain the CMake build system and Ninja build tool. You may +already have CMake installed, but current binary versions of CMake aren't built +with Ninja support. + +<pre class="doc_code"> + cd ~/clang-llvm + git clone https://github.com/martine/ninja.git + cd ninja + git checkout release + ./bootstrap.py + sudo cp ninja /usr/bin/ + + cd ~/clang-llvm + git clone git://cmake.org/stage/cmake.git + cd cmake + git checkout next + ./bootstrap + make + sudo make install +</pre> + +<p>Okay. Now we'll build Clang!</p> + +<pre class="doc_code"> + cd ~/clang-llvm + mkdir build && cd build + cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. + ninja + ninja check # Test LLVM only. + ninja clang-test # Test Clang only. + ninja install +</pre> + +<p>And we're live.</p> + +<p>All of the tests should pass, though there is a (very) small chance that you +can catch LLVM and Clang out of sync. Running <tt>'git svn rebase'</tt> in both +the llvm and clang directories should fix any problems.</p> + +<p>Finally, we want to set Clang as its own compiler.</p> + +<pre class="doc_code"> + cd ~/clang-llvm/build + ccmake ../llvm +</pre> + +<p>The second command will bring up a GUI for configuring Clang. You need to set +the entry for <tt>CMAKE_CXX_COMPILER</tt>. Press <tt>'t'</tt> to turn on +advanced mode. Scroll down to <tt>CMAKE_CXX_COMPILER</tt>, and set it to +<tt>/usr/bin/clang++</tt>, or wherever you installed it. Press <tt>'c'</tt> to +configure, then <tt>'g'</tt> to generate CMake's files.</p> + +<p>Finally, run ninja one last time, and you're done.</p> + +<!-- ======================================================================= --> +<h2 id="clangtool">Step 1: Create a ClangTool</h2> +<!-- ======================================================================= --> +<p>Now that we have enough background knowledge, it's time to create the +simplest productive ClangTool in existence: a syntax checker. While this already +exists as <tt>clang-check</tt>, it's important to understand what's going +on.</p> + +<p>First, we'll need to create a new directory for our tool and tell CMake that +it exists. As this is not going to be a core clang tool, it will live in the +<tt>tools/extra</tt> repository.</p> + +<pre class="doc_code"> + cd ~/clang-llvm/llvm/tools/clang + mkdir tools/extra/loop-convert + echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt + vim tools/extra/loop-convert/CMakeLists.txt +</pre> + +CMakeLists.txt should have the following contents: +<pre class="doc_code"> + set(LLVM_LINK_COMPONENTS support) + set(LLVM_USED_LIBS clangTooling clangBasic clangAST) + + add_clang_executable(loop-convert + LoopConvert.cpp + ) + target_link_libraries(loop-convert + clangTooling + clangBasic + clangASTMatchers + ) +</pre> + +<p>With that done, Ninja will be able to compile our tool. Let's give it +something to compile! Put the following into +<tt>tools/extra/loop-convert/LoopConvert.cpp</tt>. A detailed explanation of why +the different parts are needed can be found in the +<a href="LibTooling.html">LibTooling documentation</a>.</p> + +<pre> + // Declares clang::SyntaxOnlyAction. + #include "clang/Frontend/FrontendActions.h" + #include "clang/Tooling/CommonOptionsParser.h" + #include "clang/Tooling/Tooling.h" + // Declares llvm::cl::extrahelp. + #include "llvm/Support/CommandLine.h" + + using namespace clang::tooling; + using namespace llvm; + + // CommonOptionsParser declares HelpMessage with a description of the common + // command-line options related to the compilation database and input files. + // It's nice to have this help message in all tools. + static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); + + // A help message for this specific tool can be added afterwards. + static cl::extrahelp MoreHelp("\nMore help text..."); + + int main(int argc, const char **argv) { + CommonOptionsParser OptionsParser(argc, argv); + ClangTool Tool(OptionsParser.GetCompilations(), + OptionsParser.GetSourcePathList()); + return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>()); + } +</pre> + +<p>And that's it! You can compile our new tool by running ninja from the +<tt>build</tt> directory.</p> + +<pre class="doc_code"> + cd ~/clang-llvm/build + ninja +</pre> + +<p>You should now be able to run the syntax checker, which is located in +<tt>~/clang-llvm/build/bin</tt>, on any source file. Try it!</p> + +<pre class="doc_code"> + cat "void main() {}" > test.cpp + bin/loop-convert test.cpp -- +</pre> + +<p>Note the two dashes after we specify the source file. The additional options +for the compiler are passed after the dashes rather than loading them from a +compilation database - there just aren't any options needed right now.</p> + +<!-- ======================================================================= --> +<h2 id="learnastmatchers">Intermezzo: Learn AST matcher basics</h2> +<!-- ======================================================================= --> +<p>Clang recently introduced the <a href=LibASTMatchers.html>ASTMatcher +library</a> to provide a simple, powerful, and concise way to describe specific +patterns in the AST. Implemented as a DSL powered by macros and templates (see +<a href="../doxygen/ASTMatchers_8h_source.html">ASTMatchers.h</a> if you're +curious), matchers offer the feel of algebraic data types common to functional +programming languages.</p> + +<p>For example, suppose you wanted to examine only binary operators. There is a +matcher to do exactly that, conveniently named <tt>binaryOperator</tt>. I'll +give you one guess what this matcher does:</p> + +<pre class="doc_code"> + binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) +</pre> + +<p>Shockingly, it will match against addition expressions whose left hand side +is exactly the literal 0. It will not match against other forms of 0, such as +<tt>'\0'</tt> or <tt>NULL</tt>, but it will match against macros that expand to +0. The matcher will also not match against calls to the overloaded operator +<tt>'+'</tt>, as there is a separate <tt>operatorCallExpr</tt> matcher to handle +overloaded operators.</p> + +<p>There are AST matchers to match all the different nodes of the AST, narrowing +matchers to only match AST nodes fulfilling specific criteria, and traversal +matchers to get from one kind of AST node to another. For a complete list of AST +matchers, take a look at the <a href="LibASTMatchersReference.html">AST Matcher +References</a></p> + +<p>All matcher that are nouns describe entities in the AST and can be bound, +so that they can be referred to whenever a match is found. To do so, simply call +the method <tt>bind</tt> on these matchers, e.g.:</p> +<pre class="doc_code"> + variable(hasType(isInteger())).bind("intvar") +</pre> + +<!-- ======================================================================= --> +<h2 id="usingastmatchers">Step 2: Using AST matchers</h2> +<!-- ======================================================================= --> +<p>Okay, on to using matchers for real. Let's start by defining a matcher which +will capture all <tt>for</tt> statements that define a new variable +initialized to zero. Let's start with matching all <tt>for</tt> loops:</p> + +<pre class="doc_code"> + forStmt() +</pre> + +<p>Next, we want to specify that a single variable is declared in the first +portion of the loop, so we can extend the matcher to</p> +<pre class="doc_code"> + forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) +</pre> + +<p>Finally, we can add the condition that the variable is initialized to +zero.</p> +<pre class="doc_code"> + forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( + hasInitializer(integerLiteral(equals(0)))))))) +</pre> + +</p>It is fairly easy to read and understand the matcher definition ("match +loops whose init portion declares a single variable which is initialized to the +integer literal 0"), but deciding that every piece is necessary is more +difficult. Note that this matcher will not match loops whose variables are +initialized to <tt>'\0'</tt>, <tt>0.0</tt>, <tt>NULL</tt>, or any form of zero +besides the integer 0.</p> + +<p>The last step is giving the matcher a name and binding the <tt>ForStmt</tt> +as we will want to do something with it:</p> +<pre class="doc_code"> + StatementMatcher LoopMatcher = + forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( + hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); +</pre> + +<p>Once you have defined your matchers, you will need to add a little more +scaffolding in order to run them. Matchers are paired with a +<tt>MatchCallback</tt> and registered with a <tt>MatchFinder</tt> object, then +run from a <tt>ClangTool</tt>. More code!</p> + +Add the following to <tt>LoopConvert.cpp</tt>: +<pre class="doc_code"> + StatementMatcher LoopMatcher = + forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( + hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); + + class LoopPrinter : public MatchFinder::MatchCallback { + public : + virtual void run(const MatchFinder::MatchResult &Result) { + if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) + FS->dump(); + }; +</pre> + +And change <tt>main()</tt> to: +<pre class="doc_code"> + int main(int argc, const char **argv) { + CommonOptionsParser OptionsParser(argc, argv); + ClangTool Tool(OptionsParser.GetCompilations(), + OptionsParser.GetSourcePathList()); + + LoopPrinter Printer; + MatchFinder Finder; + Finder.addMatcher(LoopMatcher, &Printer); + + return Tool.run(newFrontendActionFactory(&Finder)); + } +</pre> + +<p>Now, you should be able to recompile and run the code to discover for loops. +Create a new file with a few examples, and test out our new handiwork:</p> + +<pre class="doc_code"> + cd ~/clang-llvm/llvm/llvm_build/ + ninja loop-convert + vim ~/test-files/simple-loops.cc + bin/loop-convert ~/test-files/simple-loops.cc +</pre> + +<!-- FIXME: add example step-2a --> + +<!-- ======================================================================= --> +<h2 id="morematchers">Step 3.5: More Complicated Matchers</h2> +<!-- ======================================================================= --> +<p>Our simple matcher is capable of discovering for loops, but we would still +need to filter out many more ourselves. We can do a good portion of the +remaining work with some cleverly chosen matchers, but first we need to decide +exactly which properties we want to allow.</p> + +<p>How can we characterize for loops over arrays which would be eligible for +translation to range-based syntax? Range based loops over arrays of size +<tt>N</tt> that:</p> +<ul> + <li>start at index <tt>0</tt></li> + <li>iterate consecutively</li> + <li>end at index <tt>N-1</tt></li> +</ul> + +<p>We already check for (1), so all we need to add is a check to the loop's +condition to ensure that the loop's index variable is compared against +<tt>N</tt> and another check to ensure that the increment step just increments +this same variable. The matcher for (2) is straightforward: require a pre- or +post-increment of the same variable declared in the init portion.</p> + +<p>Unfortunately, such a matcher is impossible to write. Matchers contain no +logic for comparing two arbitrary AST nodes and determining whether or not they +are equal, so the best we can do is matching more than we would like to allow, +and punting extra comparisons to the callback.</p> + +<p>In any case, we can start building this sub-matcher. We can require that the +increment step be a unary increment like this:</p> + +<pre class="doc_code"> + hasIncrement(unaryOperator(hasOperatorName("++"))) +</pre> + +<p>Specifying what is incremented introduces another quirk of Clang's AST: +Usages of variables are represented as <tt>DeclRefExpr</tt>'s ("declaration +reference expressions") because they are expressions which refer to variable +declarations. To find a <tt>unaryOperator</tt> that refers to a specific +declaration, we can simply add a second condition to it:</p> +<pre class="doc_code"> + hasIncrement(unaryOperator( + hasOperatorName("++"), + hasUnaryOperand(declRefExpr()))) +</pre> + +<p>Furthermore, we can restrict our matcher to only match if the incremented +variable is an integer:</p> +<pre class="doc_code"> + hasIncrement(unaryOperator( + hasOperatorName("++"), + hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) +</pre> + +</p>And the last step will be to attach an identifier to this variable, so that +we can retrieve it in the callback:</p> +<pre class="doc_code"> + hasIncrement(unaryOperator( + hasOperatorName("++"), + hasUnaryOperand(declRefExpr(to( + varDecl(hasType(isInteger())).bind("incrementVariable")))))) +</pre> + +<p>We can add this code to the definition of <tt>LoopMatcher</tt> and make sure +that our program, outfitted with the new matcher, only prints out loops that +declare a single variable initialized to zero and have an increment step +consisting of a unary increment of some variable.</p> + +<!-- FIXME: add example step-2b --> + +<p>Now, we just need to add a matcher to check if the condition part of the +<tt>for</tt> loop compares a variable against the size of the array. There is +only one problem - we don't know which array we're iterating over without +looking at the body of the loop! We are again restricted to approximating the +result we want with matchers, filling in the details in the callback. So we +start with:</p> +<pre class="doc_code"> + hasCondition(binaryOperator(hasOperatorName("<")) +</pre> + +<p>It makes sense to ensure that the left-hand side is a reference to a +variable, and that the right-hand side has integer type.</p> +<pre class="doc_code"> + hasCondition(binaryOperator( + hasOperatorName("<"), + hasRHS(expr(hasType(isInteger()))), + hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))))) +</pre> + +<!-- FIXME: add example step-2c --> + +<p>Why? Because it doesn't work. Of the three loops provided in +<tt>test-files/simple.cpp</tt>, zero of them have a matching condition. A quick +look at the AST dump of the first for loop, produced by the previous iteration +of loop-convert, shows us the answer:</p> + +<pre class="doc_code"> + (ForStmt 0x173b240 + (DeclStmt 0x173afc8 + 0x173af50 "int i = + (IntegerLiteral 0x173afa8 'int' 0)") + <<<NULL>>> + (BinaryOperator 0x173b060 '_Bool' '<' + (ImplicitCastExpr 0x173b030 'int' <LValueToRValue> + (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) + (ImplicitCastExpr 0x173b048 'int' <LValueToRValue> + (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) + (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' + (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) + (CompoundStatement … +</pre> + +<p>We already know that the declaration and increments both match, or this loop +wouldn't have been dumped. The culprit lies in the implicit cast applied to the +first operand (i.e. the LHS) of the less-than operator, an L-value to R-value +conversion applied to the expression referencing <tt>i</tt>. Thankfully, the +matcher library offers a solution to this problem in the form of +<tt>ignoringParenImpCasts</tt>, which instructs the matcher to ignore implicit +casts and parentheses before continuing to match. Adjusting the condition +operator will restore the desired match.</p> + +<pre class="doc_code"> + hasCondition(binaryOperator( + hasOperatorName("<"), + hasLHS(expr(hasType(isInteger()))), + hasRHS(ignoringParenImpCasts(declRefExpr( + to(varDecl(hasType(isInteger())))))))) +</pre> + +<p>After adding binds to the expressions we wished to capture and extracting the +identifier strings into variables, we have array-step-2 completed.</p> + +<!-- ======================================================================= --> +<h2 id="usingastmatchers">Step 4: Retrieving Matched Nodes</h2> +<!-- ======================================================================= --> +<p>So far, the matcher callback isn't very interesting: it just dumps the loop's +AST. At some point, we will need to make changes to the input source code. Next, +we'll work on using the nodes we bound in the previous step.</p> + +<p>The <tt>MatchFinder::run()</tt> callback takes a +<tt>MatchFinder::MatchResult&</tt> as its parameter. We're most interested in +its <tt>Context</tt> and <tt>Nodes</tt> members. Clang uses the +<tt>ASTContext</tt> class to represent contextual information about the AST, as +the name implies, though the most functionally important detail is that several +operations require an <tt>ASTContext*</tt> parameter. More immediately useful is +the set of matched nodes, and how we retrieve them.</p> + +<!-- FIXME: Where is this binding described? --> + +<p>Since we bind three variables (identified by ConditionVarName, +InitVarName, and IncrementVarName), we can obtain the matched nodes by using the +<tt>getNodeAs()</tt> member function.</p> + +<p>In <tt>LoopActions.cpp</tt>:</p> +<pre class="doc_code"> + #include "clang/AST/ASTContext.h" + + void LoopPrinter::run(const MatchFinder::MatchResult &Result) { + ASTContext *Context = Result.Context; + const ForStmt *FS = Result.Nodes.getStmtAs<ForStmt>(LoopName); + // We do not want to convert header files! + if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc())) + return; + const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>(IncrementVarName); + const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>(ConditionVarName); + const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>(InitVarName); +</pre> + +<p>Now that we have the three variables, represented by their respective +declarations, let's make sure that they're all the same, using a helper function +I call <tt>areSameVariable()</tt>.</p> +<pre class="doc_code"> + if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) + return; + llvm::outs() << "Potential array-based loop discovered.\n"; +} +</pre> + +<p>If execution reaches the end of <tt>LoopPrinter::run()</tt>, we know that the +loop shell that looks like</p> +<pre class="doc_code"> + for (int i= 0; i < expr(); ++i) { ... } +</pre> + +<p>For now, we will just print a message explaining that we found a loop. The +next section will deal with recursively traversing the AST to discover all +changes needed.</p> + +<p>As a side note, here is the implementation of <tt>areSameVariable</tt>. Clang +associates a <tt>VarDecl</tt> with each variable to represent the variable's +declaration. Since the "canonical" form of each declaration is unique by +address, all we need to do is make sure neither <tt>ValueDecl</tt> (base class +of <tt>VarDecl</tt>) is <tt>NULL</tt> and compare the canonical Decls.</p> +<pre class="doc_code"> + static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { + return First && Second && + First->getCanonicalDecl() == Second->getCanonicalDecl(); + } +</pre> + +<p>It's not as trivial to test if two expressions are the same, though Clang has +already done the hard work for us by providing a way to canonicalize +expressions:</p> +<pre class="doc_code"> + static bool areSameExpr(ASTContext* Context, const Expr *First, + const Expr *Second) { + if (!First || !Second) + return false; + llvm::FoldingSetNodeID FirstID, SecondID; + First->Profile(FirstID, *Context, true); + Second->Profile(SecondID, *Context, true); + return FirstID == SecondID; + } +</pre> + +<!-- FIXME: Add code example. --> + +<p>This code relies on the comparison between two +<tt>llvm::FoldingSetNodeIDs</tt>. As the documentation for +<tt>Stmt::Profile()</tt> indicates, the <tt>Profile()</tt> member function +builds a description of a node in the AST, based on its properties, along with +those of its children. <tt>FoldingSetNodeID</tt> then serves as a hash we can +use to compare expressions. We will need <tt>areSameExpr</tt> later. Before you +run the new code on the additional loops added to test-files/simple.cpp, try to +figure out which ones will be considered potentially convertible.</p> + +</div> +</body> +</html> |