The following is a tutorial on how to use the Haskell grammar-combinators parser library.
The grammar-combinators library is a next-generation parser library which addresses the fundamental limitations inherent to current parser combinator libraries. For more info, see the project's website.
In the following text, we define an example grammar and use it with some grammar algorithms. This text assumes you are more or less familiar with common context-free grammar terminology and parser combinators. This text does not attempt to explain the internals of the grammar-combinators library, but tries to explain it from the point of view of a library user.
First install the grammar-combinators library through Cabal with the following commands:
cabal update
cabal install grammar-combinators
The rest of this text contains code fragments. These code fragments can be copied into a Haskell source file and executed directly. You can also download the source code of this page and export its contents to a Haskell script that you can work with in GHCi (all examples can be executed):
darcs get http://code.haskell.org/grammar-combinators/
cd grammar-combinators/website
make
ghci tutorial.hs
We will use some GHC extensions (you can't use grammar-combinators without these).
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeFamilies #-}
The base import for grammar-combinators is the following:
import Text.GrammarCombinators.Base
And beyond that, we use algorithms from the following modules.
import Text.GrammarCombinators.Utils.PrintGrammar
import Text.GrammarCombinators.Utils.CalcFirst
import Text.GrammarCombinators.Parser.Packrat
import Text.GrammarCombinators.Parser.Parsec
import Text.GrammarCombinators.Parser.UUParse
import Text.GrammarCombinators.Transform.FilterDies
import Text.GrammarCombinators.Transform.FoldLoops
import Text.GrammarCombinators.Transform.UniformPaull
import Text.GrammarCombinators.Transform.UnfoldDead
The grammar we will work with is the following (in a BNF-like notation).
Line ::= Expr EOF
Expr ::= Expr '+' Term
| Term
Term ::= Term '*' Factor
| Factor
Factor ::= '(' Expr ')'
| Digit+
Digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
To define this grammar using the grammar-combinators library, we start by defining data types for the Abstract Syntax Tree (AST) of the grammar. This is important because it defines the structure of the grammar and the structural relations between the non-terminals.
newtype Line = SExpr Expr
data Expr = Sum Expr Term
| STerm Term
data Term = Product Term Factor
| SFactor Factor
data Factor = Paren Expr
| LiteralNumber [Digit]
newtype Digit = MkDigit Char
We can see from these definitions that for example an expression must either be a sum of a subexpression and a term or it must be a single term itself.
Next, we define a GADT representing the domain (the set of non-terminals) of our grammar. The ArithDomain ix
functor contains proof terms proving that the five AST data types are elements of the domain of our grammar.
data ArithDomain ix where
Line :: ArithDomain Line
Expr :: ArithDomain Expr
Term :: ArithDomain Term
Factor :: ArithDomain Factor
Digit :: ArithDomain Digit
Note that the GADT's constructors have the same names as the corresponding non-terminal types (this is possible because Haskell separates value level and type level name spaces).
Any domain needs to implement some type classes. The FoldFam
type class defines a foldFam
operation folding a function over all non-terminals in the domain. ShowFam
defines function showIdx
returning a String
representation of a non-terminal. EqFam
defines the function overrideIdx
, which allows type-safe overriding a domain function for a single non-terminal. Finally, MemoFam
defines fromMemo
and toMemo
, allowing memoization of a domain function. The Domain
type class is empty, but requires the other four type classes to be implemented.
All of the following is boilerplate, and it is probably possible to generate this code from the domain definition automatically using Template Haskell, but this is not yet implemented. Contributions are welcome ;).
instance FoldFam ArithDomain where
foldFam f n = f Line $ f Expr $ f Term $ f Factor $ f Digit n
instance ShowFam ArithDomain where
showIdx Line = "Line"
showIdx Expr = "Expr"
showIdx Term = "Term"
showIdx Factor = "Factor"
showIdx Digit = "Digit"
instance EqFam ArithDomain where
overrideIdx _ Line v Line = v
overrideIdx _ Expr v Expr = v
overrideIdx _ Term v Term = v
overrideIdx _ Factor v Factor = v
overrideIdx _ Digit v Digit = v
overrideIdx f _ _ idx = f idx
instance MemoFam ArithDomain where
data Memo ArithDomain v = MemoArithDomain (v Line) (v Expr) (v Term) (v Factor) (v Digit)
toMemo f = MemoArithDomain (f Line) (f Expr) (f Term) (f Factor) (f Digit)
fromMemo (MemoArithDomain v _ _ _ _) Line = v
fromMemo (MemoArithDomain _ v _ _ _) Expr = v
fromMemo (MemoArithDomain _ _ v _ _) Term = v
fromMemo (MemoArithDomain _ _ _ v _) Factor = v
fromMemo (MemoArithDomain _ _ _ _ v) Digit = v
instance Domain ArithDomain
Next, we need to define the pattern functor PFArith
for our domain, defining the structure of the non-terminals in a more general way than the AST defined above. It is a GADT with constructors analogous to the AST data types defined above, but with recursive positions replaced by corresponding values of a given subtree representation functor r. Pattern functor values are tagged with the AST type they represent.
data PFArith r ix where
LineF :: r Expr -> PFArith r Line
SumF :: r Expr -> r Term -> PFArith r Expr
SubtractionF :: r Expr -> r Term -> PFArith r Expr
SingleTermF :: r Term -> PFArith r Expr
ProductF :: r Term -> r Factor -> PFArith r Term
QuotientF :: r Term -> r Factor -> PFArith r Term
SingleFactorF :: r Factor -> PFArith r Term
ParenthesizedF :: r Expr -> PFArith r Factor
NumberF :: [r Digit] -> PFArith r Factor
DigitF :: Char -> PFArith r Digit
The following type instance registers PFArith
as the pattern functor for domain ArithDomain
.
type instance PF ArithDomain = PFArith
We can now define our grammar. The grammar is represented by a function grammarArith, which returns for each non-terminal its production rules.
grammarArith :: ExtendedContextFreeGrammar ArithDomain Char
grammarArith Line =
LineF $>> ref Expr >>>* endOfInput
grammarArith Expr =
SubtractionF $>> ref Expr >>>* token '-' >>> ref Term
||| SumF $>> ref Expr >>>* token '+' >>> ref Term
||| SingleTermF $>> ref Term
grammarArith Term =
SingleFactorF $>> ref Factor
||| QuotientF $>> ref Term >>>* token '/' >>> ref Factor
||| ProductF $>> ref Term >>>* token '*' >>> ref Factor
grammarArith Factor =
NumberF $>> many1Ref Digit
||| ParenthesizedF $>>* token '(' >>> ref Expr >>>* token ')'
grammarArith Digit =
DigitF $>> tokenRange ['0' .. '9']
The production rules are composed using the following combinators:
ruleA >>> ruleB
: Sequence rules ruleA
and ruleB
, applying the result of the first to the second. Similar to the Applicative (<*>)
combinator.epsilon v
: A rule that always succeeds, returning value v
f $>>> rule
: Apply function f
to the right-hand side rule. Identical to epsilon f >>> rule
ruleA >>>* ruleB
: Sequence rules ruleA
and ruleB
, throwing away the result of ruleB
and returning that of ruleA
. Identical to epsilon const >>> ruleA >>> ruleB
v $>>>* rule
: Throw away result of rule
and replace it with value v
. Identical to epsilon v >>>* rule
ref idx
: A recursive reference to non-terminal idx. You should always use this and never include non-terminal idx's production rules directly (which you could do using 'grammarArith idx' instead). It is this combinator that is mainly responsible for the increased power of the grammar-combinators library when compared to traditional parser combinator libraries.token c
: Primitive token c
. c
is a character in the above grammar, but any type that is an instance of type class Token
is allowed.tokenRange r
: Match any token from the given range r
.manyRef idx
or many1Ref idx
: match zero or more or one or more times the given non-terminal idx
.The grammar's type is ExtendedContextFreeGrammar ArithDomain Char
, meaning that it is an extended context-free grammar for domain ArithDomain
and token type Char
. It is a context-free grammar because it uses the ref
recursion combinator and it is extended because it additionally uses many1Ref
. Several other types of grammars exist in the grammar-combinators library.
We can now already do some useful stuff with the newly defined grammar. For example, we can print a textual representation of the grammar using the printGrammar
algorithm.
This gives the following result:
*Main> putStr $ printGrammar grammarArith
<Line> ::= <Expr> EOI
<Expr> ::= (<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>
<Term> ::= <Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>)
<Factor> ::= <Digit>+ | ('(' <Expr> ')')
<Digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
We could also compute FIRST sets for non-terminals with the calcFirst
algorithm:
*Main> calcFirst grammarArith Line
FS {firstSet = fromList "(0123456789", canBeEmpty = False, canBeEOI = False}
The most important thing to do with a grammar is of course to parse strings according to it. To do this, we first need to obtain a processing grammar: a grammar combined with a set of semantic actions (a semantic processor). The standard trivialProcessor
returns unit values wrapped in the Multirec identity functor K0
for all non-terminals. A parser that does not produce semantic values is often called a recognizer.
recognizeGrammarArith :: ProcessingExtendedContextFreeGrammar ArithDomain Char (K0 ())
recognizeGrammarArith = applyProcessorE grammarArith trivialProcessor
Note that we absolutely need to provide a type signature for recognizeGrammarArith
, because otherwise GHC will give a strange error message:
No instances for (LoopParserRule p ArithDomain (K0 ()),
TokenParserRule p Char)
If we try to use this recognizer grammar with the Packrat parser, we get the following error.
*Main> parsePackrat recognizeGrammarArith Line "123"
<interactive>:1:13:
Could not deduce (LoopParserRule p ArithDomain (K0 ()))
from the context (ParserRule p,
RecParserRule p ArithDomain (K0 ()),
TokenParserRule p Char)
arising from a use of `recognizeGrammarArith'
at <interactive>:1:13-33
This error message is not very clear (if you're not the grammar-combinators library author ;)), but what it means is that we are trying to use the packrat parser with an extended context-free grammar, which it is not capable of. The solution to this problem is to use a first grammar transformation, in this case, the equivalent of the standard translation of extended context-free grammars to context-free grammars, which we call loop folding.
Applying the foldLoops algorithm to our arithmetic expressions recognizer grammar gives us the following definition:
recognizeGrammarArithFolded :: ProcessingContextFreeGrammar (FoldLoopsDomain ArithDomain) Char (FoldLoopsValue (K0 ()))
recognizeGrammarArithFolded = foldAndProcessLoops recognizeGrammarArith
We get a new, non-extended, grammar over a new domain FoldLoopsDomain ArithDomain with a new semantic value family FoldLoopsValue (K0 ()). If we print this new grammar, we get the following:
*Main> putStr $ printGrammar recognizeGrammarArithFolded
<Line*> ::= (<Line> <Line*>) | epsilon
<Expr*> ::= (<Expr> <Expr*>) | epsilon
<Term*> ::= (<Term> <Term*>) | epsilon
<Factor*> ::= (<Factor> <Factor*>) | epsilon
<Digit*> ::= (<Digit> <Digit*>) | epsilon
<Line> ::= <Expr> EOI
<Expr> ::= (<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>
<Term> ::= <Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>)
<Factor> ::= (<Digit> <Digit*>) | ('(' <Expr> ')')
<Digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
We can see that the foldLoops
algorithm transformed the original grammar as follows: for each non-terminal idx
, add a new non-terminal idx\*
, representing a Kleene-*-quantified version of the base non-terminal. Additionally, quantified references (using manyRef and many1Ref) are translated into normal references to the new Kleene-* non-terminals.
We can now try to parse a string with this grammar, but GHCi complains that it cannot print a value of type K0 () Line
.
*Main> parsePackrat recognizeGrammarArithFolded (FLBase Line) "123"
<interactive>:1:0:
No instance for (Show (K0 () Line))
arising from a use of `print' at <interactive>:1:0-59
Possible fix: add an instance declaration for (Show (K0 () Line))
In a stmt of a 'do' expression: print it
We are not sure why Multirec doesn't provide this by default, but we can easily add a suitable Show instance:
instance (Show v) => Show (K0 v ix) where
show (K0 v) = "K0 (" ++ show v ++ ")"
When we now try to parse, we go into an infinite loop:
*Main> parsePackrat recognizeGrammarArithFolded (FLBase Line) "123"
^CInterrupted.
The problem is a well-known problem for parser combinator libraries, namely that they can't handle grammars with left-recursion (which means that a production rule directly references itself in the left-most position (like Term
and Expr
above). Standard parser combinator libraries cannot handle this problem. In fact, they cannot even detect that it occurs.
One of the most important advantages of the grammar-combinators library is that we can automatically fix this problem.
Grammar-combinators actually provides two transformations that remove left-recursion from a grammar: the left-corner transformation and a uniform variant of the classic Paull transformation. We will use the latter in this text, because it is currently better optimized for grammars that only feature indirect left-recursion.
recognizeGrammarArithTP :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue (K0 ()))
recognizeGrammarArithTP = transformUniformPaullE recognizeGrammarArith
The transformation gives us a new, transformed, grammar over new domain UPDomain ArithDomain
with transformed semantic value family UPValue (K0 ())
.
Printing this grammar gives us a grammar that is rather difficult to read:
*Main> putStr $ printGrammar recognizeGrammarArithTP
<Line> ::= <Line_head> <Line_tail>*
<Line_head> ::= die | ((<Expr> | die | (die <Expr>) | die | (die <Expr>)) EOI) | die | (die <Expr> EOI) | die
<Line_tail> ::= die
<Expr> ::= <Expr_head> <Expr_tail>*
<Expr_head> ::= die | ((die | ((((die | die) <Expr_tail>*) | die | (die <Expr>) | die | (die <Expr>)) '-')) <Term>) | die | ((die | ((((die | die) <Expr_tail>*) | die | (die <Expr>) | die | (die <Expr>)) '+')) <Term>) | <Term> | die | (die <Term>) | die | (die ((<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>)) | die
<Expr_tail> ::= ('-' <Term>) | ('+' <Term>) | die
<Term> ::= <Term_head> <Term_tail>*
<Term_head> ::= <Factor> | die | (die <Factor>) | die | ((die | ((((die | die) <Term_tail>*) | die | (die <Term>) | die | (die <Term>)) '/')) <Factor>) | die | ((die | ((((die | die) <Term_tail>*) | die | (die <Term>) | die | (die <Term>)) '*')) <Factor>) | die | (die (<Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>))) | die
<Term_tail> ::= ('/' <Factor>) | ('*' <Factor>) | die
<Factor> ::= <Factor_head> <Factor_tail>*
<Factor_head> ::= <Digit>+ | die | (die <Digit>+) | die | ((die | (('(' | die | ((die | die | die) '(')) <Expr>) | die | (die '(' <Expr>)) ')') | die | (die (<Digit>+ | ('(' <Expr> ')'))) | die
<Factor_tail> ::= die
<Digit> ::= <Digit_head> <Digit_tail>*
<Digit_head> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | die | (die ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')) | die | (die ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')) | die
<Digit_tail> ::= die
The grammar is actually cluttered by many die
rules (can never match), and non-terminals that can never match anything (e.g. Line_tail
). We can remove these from the grammar using algorithms filterDies
and unfoldDead
:
recognizeGrammarArithTPF :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue (K0 ()))
recognizeGrammarArithTPF = filterDiesE (unfoldDeadE recognizeGrammarArithTP)
Additionally, we use print algorithm printReachableGrammar
, which only prints non-terminals that are reachable from a given start non-terminal:
*Main> putStr $ printReachableGrammar recognizeGrammarArithTPF (UPBase Line)
<Digit_head> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<Digit> ::= <Digit_head>
<Factor_head> ::= <Digit>+ | ('(' <Expr> ')')
<Factor> ::= <Factor_head>
<Term_head> ::= <Factor>
<Term_tail> ::= ('/' <Factor>) | ('*' <Factor>)
<Term> ::= <Term_head> <Term_tail>*
<Expr_head> ::= <Term>
<Expr_tail> ::= ('-' <Term>) | ('+' <Term>)
<Expr> ::= <Expr_head> <Expr_tail>*
<Line_head> ::= <Expr> EOI
<Line> ::= <Line_head>
The grammar is now pretty readable and actually corresponds perfectly to what we would get after doing a standard manual removal of left recursion.
After folding loops in the transformed grammar, we can parse using the packrat parser:
recognizeGrammarArithTPFF :: ProcessingContextFreeGrammar (FoldLoopsDomain (UPDomain ArithDomain)) Char (FoldLoopsValue (UPValue (K0 ())))
recognizeGrammarArithTPFF = foldAndProcessLoops recognizeGrammarArithTPF
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123"
Parsed FLBV {unFLBV = UPBV {unUPBV = K0 (())}} _
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123+"
NoParse
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
Parsed FLBV {unFLBV = UPBV {unUPBV = K0 (())}} _
Our parser does not yet produce useful result values (because we used trivialProcessor
), but it already correctly recognizes correct strings in our language. For succesful parses, we get a wrapped version of the value K0 ()
produced by trivialProcessor
.
In order to produce meaningful result values using our grammar, we need a suitable semantic processor. We first define its semantic value family, which defines the set of values it produces.
data family ArithValue ix
newtype instance ArithValue Line = ArithValueL Int deriving (Show)
newtype instance ArithValue Expr = ArithValueE Int deriving (Show)
newtype instance ArithValue Term = ArithValueT Int deriving (Show)
newtype instance ArithValue Factor = ArithValueF Int deriving (Show)
newtype instance ArithValue Digit = ArithValueD Char deriving (Show)
A semantic value family is a type family that specifies for each non-terminal what its semantic value is. In this case, we have an Int for all non-terminals except Digit, for which we use a character.
We now define a semantic processor using this semantic value family:
calcArith :: Processor ArithDomain ArithValue
calcArith Line (LineF (ArithValueE e)) = ArithValueL e
calcArith Expr (SumF (ArithValueE e) (ArithValueT t)) = ArithValueE $ e + t
calcArith Expr (SingleTermF (ArithValueT t)) = ArithValueE t
calcArith Term (ProductF (ArithValueT e) (ArithValueF t)) = ArithValueT $ e * t
calcArith Term (SingleFactorF (ArithValueF t)) = ArithValueT t
calcArith Factor (ParenthesizedF (ArithValueE e)) = ArithValueF e
calcArith Factor (NumberF ds) = ArithValueF $ read $ map unArithValueD ds
calcArith Digit (DigitF c) = ArithValueD c
unArithValueD :: ArithValue Digit -> Char
unArithValueD (ArithValueD c) = c
The function calcArith
is given the non-terminal matched, and the components of the matched production rule that the grammar stored in the pattern functor value. For example, for a match of the Expr
non-terminal, using the production rule "Expr ::= Expr '+' Term" (2nd case in the calcArith definition, PFArith
constructor SumF
), the value of the left-hand side and the right-hand side expressions are unwrapped, added and wrapped in the ArithValue
constructor for Expr
again. Note how all of this corresponds almost perfectly to what you will find in a syntax-directed translation definition.
We can now repeat the grammar transformations that we used with the trivialProcessor
before to get the following:
calcGrammarArith :: ProcessingExtendedContextFreeGrammar ArithDomain Char ArithValue
calcGrammarArith = applyProcessorE grammarArith calcArith
calcGrammarArithTP :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue ArithValue)
calcGrammarArithTP = transformUniformPaullE calcGrammarArith
calcGrammarArithTPF :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue ArithValue)
calcGrammarArithTPF = filterDiesE (unfoldDeadE calcGrammarArithTP)
calcGrammarArithTPFF :: ProcessingContextFreeGrammar (FoldLoopsDomain (UPDomain ArithDomain)) Char (FoldLoopsValue (UPValue ArithValue))
calcGrammarArithTPFF = foldAndProcessLoops calcGrammarArithTPF
And parsing now gives us:
*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123"
Parsed FLBV {unFLBV = UPBV {unUPBV = ArithValueL 123}} _
*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123+"
NoParse
*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
Parsed FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}} _
Our grammar can now indeed calculate the values for the strings that it parses. Note how we have used a single grammar with multiple semantic processors, without using the AST as an intermediate representation.
Note how we can actually use the exact same grammar with Parsec or UUParse parse algorithms:
*Main> parseParsec calcGrammarArithTPFF (FLBase (UPBase Line)) "" "123+12"
Right (FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}})
*Main> parseUU calcGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}}
One of the features we can take advantage of by using UUParse is its error correction facilities. If we parse a faulty string, we get a corrected result, and there is a way to obtain a list of errors it found and corrections it applied.
*Main> parseUU calcGrammarArithTPFF (FLBase (UPBase Line)) (0 :: Int) "123+(12"
FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}}