Building a lexer and parser with Scala's Parser Combinators
As part of an ongoing project at e.near, one of our Scala teams was recently tasked with building an interpreter for executing workflows modelled in a textual DSL. These workflows had to be validated for errors, compiled to a simpler bytecode-like representation and stored, so that they could be interpreted with as little overhead as possible.
To deal with the parsing phases, we stumbled upon Scala’s Parser Combinators, which turned out to be a great tool for writing modular and composable lexers and parsers in idiomatic Scala, with little to no boilerplate code.
While there are a few Web resources on how to use parser combinators for building simple parsers, we found none that describe how to build full lexical and syntactic analyzers from scratch.
We will be going through the process of building a lexer, to scan text into a sequence of tokens, and a parser, to parse said tokens into an abstract syntax tree (AST).
Language overview
In this “workflow code”, a program implements a workflow, i.e. an executable directed acyclic graph of instructions. Below is an example of a valid program. Keep in mind that blocks are delimited by indentation, in a similar way to languages like Python.
read input name, country
switch:
  country == "PT" ->
    call service "A"
    exit
  otherwise ->
    call service "B"
    switch:
      name == "unknown" ->
        exit
      otherwise ->
        call service "C"
        exit
We want this to be parsed into:
AndThen(
  ReadInput(List(name, country)),
  Choice(List(
    IfThen(
      Equals(country, PT),
      AndThen(CallService(A), Exit)
    ),
    OtherwiseThen(
      AndThen(
        CallService(B),
        Choice(List(
          IfThen(Equals(name, unknown), Exit),
          OtherwiseThen(AndThen(CallService(C), Exit))
        ))
      )
    )
  ))
)
A possible BNF representation for this grammar is outlined below.
<block> ::= (<statement>)+
<statement> ::= "exit"
| "read input" (<identifier> ",")* <identifier>
| "call service" <stringLiteral>
| "switch" ":" INDENT (<ifThen>)+ [otherwiseThen] DEDENT
<ifThen> ::= <condition> "->" INDENT <block> DEDENT
<otherwiseThen> ::= "otherwise" "->" INDENT <block> DEDENT
<condition> ::= <identifier> "==" <stringLiteral>
Parser combinators
A parser combinator is simply a function that accepts parsers as input and returns a new parser as output, in the same way that a higher-order function accepts other functions as input and produces a new function as output. As an example, assuming we have a parser int that recognizes integer literals and a parser plus that recognizes the ‘+’ character, we can combine them into a parser that recognizes the sequence int plus int as an integer addition.
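To make this concrete, here is a minimal sketch of that example built on the library’s RegexParsers (the names AdditionParser, int, plus and addition are ours, not from the project):

import scala.util.parsing.combinator.RegexParsers

object AdditionParser extends RegexParsers {
  // recognizes one or more digits and converts them to an Int
  def int: Parser[Int] = "[0-9]+".r ^^ (_.toInt)
  // recognizes the '+' character
  def plus: Parser[String] = "+"
  // combines both parsers: an addition is an int, a '+' and another int
  def addition: Parser[Int] = int ~ plus ~ int ^^ { case l ~ _ ~ r => l + r }
}

With this in scope, AdditionParser.parse(AdditionParser.addition, "1 + 2") succeeds with the value 3.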
Scala’s parser combinators, once part of the standard library and now distributed as a separate module, are hosted at: https://github.com/scala/scala-parser-combinators
To use them, you will simply need the following dependency in your project:
"org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"
Building a Lexer
We will be needing tokens for identifiers and string literals, as well as for all the reserved keywords and punctuation marks: exit, read input, call service, switch, otherwise, :, ->, == and the comma.
We also need to produce artificial tokens that represent increases and decreases in indentation: INDENT and DEDENT, respectively. Please ignore these for now, as we will be going over them at a later stage.
sealed trait WorkflowToken
case class IDENTIFIER(str: String) extends WorkflowToken
case class LITERAL(str: String) extends WorkflowToken
case class INDENTATION(spaces: Int) extends WorkflowToken
case object EXIT extends WorkflowToken
case object READINPUT extends WorkflowToken
case object CALLSERVICE extends WorkflowToken
case object SWITCH extends WorkflowToken
case object OTHERWISE extends WorkflowToken
case object COLON extends WorkflowToken
case object ARROW extends WorkflowToken
case object EQUALS extends WorkflowToken
case object COMMA extends WorkflowToken
case object INDENT extends WorkflowToken
case object DEDENT extends WorkflowToken
Our lexer extends RegexParsers, a subtype of Parsers tailored for building character parsers with the help of regular expressions. It provides implicit conversions from String and Regex to Parser[String], allowing us to use them as the starting point for composing increasingly complex parsers.
object WorkflowLexer extends RegexParsers {
Let’s start by specifying which characters should be ignored as whitespace. We cannot ignore \n, since we need it to recognize the level of indentation defined by the number of spaces that follow it. Every other whitespace character can be ignored.
override def skipWhitespace = true
override val whiteSpace = "[ \t\r\f]+".r
Now let’s try building a parser for identifiers:
def identifier: Parser[IDENTIFIER] = {
"[a-zA-Z_][a-zA-Z0-9_]*".r ^^ { str => IDENTIFIER(str) }
}
The ^^ operator acts as a map over the parse result. The regex "[a-zA-Z_][a-zA-Z0-9_]*".r is implicitly converted to an instance of Parser[String], on which we map a function String => IDENTIFIER, thus returning an instance of Parser[IDENTIFIER].
The parsers for string literals and indentations are similar:
def literal: Parser[LITERAL] = {
""""[^"]*"""".r ^^ { str =>
val content = str.substring(1, str.length - 1)
LITERAL(content)
}
}
def indentation: Parser[INDENTATION] = {
"\n[ ]*".r ^^ { whitespace =>
val nSpaces = whitespace.length - 1
INDENTATION(nSpaces)
}
}
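As a quick sanity check, here is a hypothetical REPL session (the exact ParseResult formatting may vary between library versions):

scala> WorkflowLexer.parse(WorkflowLexer.literal, "\"hello\"")
res0: WorkflowLexer.ParseResult[LITERAL] = [1.8] parsed: LITERAL(hello)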
Generating parsers for keywords is trivial:
def exit = "exit" ^^ (_ => EXIT)
def readInput = "read input" ^^ (_ => READINPUT)
def callService = "call service" ^^ (_ => CALLSERVICE)
def switch = "switch" ^^ (_ => SWITCH)
def otherwise = "otherwise" ^^ (_ => OTHERWISE)
def colon = ":" ^^ (_ => COLON)
def arrow = "->" ^^ (_ => ARROW)
def equals = "==" ^^ (_ => EQUALS)
def comma = "," ^^ (_ => COMMA)
We will now compose all of these into a parser capable of recognizing every token, taking advantage of the following operators: | (or), for recognizing any of our token parsers; rep1, which recognizes one or more repetitions of its argument; and phrase, which attempts to consume all input until no more is left.
def tokens: Parser[List[WorkflowToken]] = {
phrase(rep1(exit | readInput | callService | switch | otherwise | colon | arrow
| equals | comma | literal | identifier | indentation)) ^^ { rawTokens =>
processIndentations(rawTokens)
}
}
Note that the order of operands matters when dealing with ambiguity. If we were to place identifier before exit, readInput, etc., our parser would never recognize them as special keywords, since they would be successfully parsed as identifiers instead (the input exit, for instance, would be scanned as IDENTIFIER(exit)).
Processing indentation
We apply a brief post-processing step to our parse result with the processIndentations method. This is used to produce the artificial INDENT and DEDENT tokens from the INDENTATION tokens. Each increase in indentation level is pushed onto a stack, producing an INDENT, while each decrease pops levels off the stack, producing a DEDENT for each one.
private def processIndentations(tokens: List[WorkflowToken],
                                indents: List[Int] = List(0)): List[WorkflowToken] = {
  tokens.headOption match {
    // if there is an increase in indentation level, we push this new level onto the stack
    // and produce an INDENT
    case Some(INDENTATION(spaces)) if spaces > indents.head =>
      INDENT :: processIndentations(tokens.tail, spaces :: indents)
    // if there is a decrease, we pop from the stack until we have matched the new level,
    // producing a DEDENT for each pop
    case Some(INDENTATION(spaces)) if spaces < indents.head =>
      val (dropped, kept) = indents.partition(_ > spaces)
      (dropped map (_ => DEDENT)) ::: processIndentations(tokens.tail, kept)
    // if the indentation level stays unchanged, no tokens are produced
    case Some(INDENTATION(spaces)) if spaces == indents.head =>
      processIndentations(tokens.tail, indents)
    // any other token is passed through unchanged
    case Some(token) =>
      token :: processIndentations(tokens.tail, indents)
    // the final step is to produce a DEDENT for each indentation level still remaining,
    // thus "closing" the remaining open INDENTs
    case None =>
      indents.filter(_ > 0).map(_ => DEDENT)
  }
}
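To make the stack bookkeeping concrete, consider a small hypothetical trace for a block that indents by two spaces and then falls back to column zero:

// hypothetical input:
//   List(SWITCH, COLON, INDENTATION(2), EXIT, INDENTATION(0))
// output of processIndentations:
//   List(SWITCH, COLON, INDENT, EXIT, DEDENT)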
And we’re all set! This token parser will produce a ParseResult[List[WorkflowToken]] by consuming a Reader[Char]. RegexParsers defines its own Reader[Char], which is internally called by the parse method it provides. Let’s then define an apply method for WorkflowLexer:
trait WorkflowCompilationError
case class WorkflowLexerError(msg: String) extends WorkflowCompilationError
object WorkflowLexer extends RegexParsers {
...
def apply(code: String): Either[WorkflowLexerError, List[WorkflowToken]] = {
parse(tokens, code) match {
case NoSuccess(msg, next) => Left(WorkflowLexerError(msg))
case Success(result, next) => Right(result)
}
}
}
Let’s try our lexer on the example above:
scala> WorkflowLexer(code)
res0: Either[WorkflowLexerError,List[WorkflowToken]] = Right(List(READINPUT, IDENTIFIER(name), COMMA,
IDENTIFIER(country), SWITCH, COLON, INDENT, IDENTIFIER(country), EQUALS, LITERAL(PT), ARROW, INDENT,
CALLSERVICE, LITERAL(A), EXIT, DEDENT, OTHERWISE, ARROW, INDENT, CALLSERVICE, LITERAL(B), SWITCH, COLON,
INDENT, IDENTIFIER(name), EQUALS, LITERAL(unknown), ARROW, INDENT, EXIT, DEDENT, OTHERWISE, ARROW,
INDENT, CALLSERVICE, LITERAL(C), EXIT, DEDENT, DEDENT, DEDENT, DEDENT))
Building a Parser
Now that we have taken care of the lexical analysis, we are still missing the syntactic analysis step, i.e. transforming a sequence of tokens into an abstract syntax tree (AST). Unlike RegexParsers, which operates on characters, we now need a parser that works over our WorkflowTokens.
object WorkflowParser extends Parsers {
override type Elem = WorkflowToken
We also need to define a Reader[WorkflowToken], which the parser will use to read from a sequence of WorkflowTokens. This is pretty straightforward:
import scala.util.parsing.input.{NoPosition, Position, Reader}

class WorkflowTokenReader(tokens: Seq[WorkflowToken]) extends Reader[WorkflowToken] {
  override def first: WorkflowToken = tokens.head
  override def atEnd: Boolean = tokens.isEmpty
  override def pos: Position = NoPosition
  override def rest: Reader[WorkflowToken] = new WorkflowTokenReader(tokens.tail)
}
Moving on with the parser implementation, the process is similar to the one used to build the lexer. We define simple parsers and compose them into more complex ones. Only this time around our parsers will be returning ASTs instead of tokens:
sealed trait WorkflowAST
case class AndThen(step1: WorkflowAST, step2: WorkflowAST) extends WorkflowAST
case class ReadInput(inputs: Seq[String]) extends WorkflowAST
case class CallService(serviceName: String) extends WorkflowAST
case class Choice(alternatives: Seq[ConditionThen]) extends WorkflowAST
case object Exit extends WorkflowAST
sealed trait ConditionThen { def thenBlock: WorkflowAST }
case class IfThen(predicate: Condition, thenBlock: WorkflowAST) extends ConditionThen
case class OtherwiseThen(thenBlock: WorkflowAST) extends ConditionThen
sealed trait Condition
case class Equals(factName: String, factValue: String) extends Condition
Being a WorkflowToken parser, we inherit an implicit conversion from WorkflowToken to Parser[WorkflowToken]. This is useful for parsing parameterless tokens, such as EXIT, CALLSERVICE, etc. For IDENTIFIER and LITERAL we can instead pattern match on the tokens with the accept method.
private def identifier: Parser[IDENTIFIER] = {
accept("identifier", { case id @ IDENTIFIER(name) => id })
}
private def literal: Parser[LITERAL] = {
accept("string literal", { case lit @ LITERAL(name) => lit })
}
Grammar rules may be implemented as such:
def condition: Parser[Equals] = {
  (identifier ~ EQUALS ~ literal) ^^ { case IDENTIFIER(name) ~ _ ~ LITERAL(value) => Equals(name, value) }
}
This is similar to what we did previously to produce tokens: here we map the parse result (a composition of the results of the parsers identifier, EQUALS and literal) to an instance of Equals. Notice how pattern matching may be used to expressively unpack the result of a parser composition by sequencing (i.e. the ~ operator).
The implementation of the remaining rules looks very much like the grammar we have specified above:
def program: Parser[WorkflowAST] = {
phrase(block)
}
def block: Parser[WorkflowAST] = {
rep1(statement) ^^ { case stmtList => stmtList reduceRight AndThen }
}
def statement: Parser[WorkflowAST] = {
val exit = EXIT ^^ (_ => Exit)
val readInput = READINPUT ~ rep(identifier ~ COMMA) ~ identifier ^^ {
case read ~ inputs ~ IDENTIFIER(lastInput) => ReadInput(inputs.map(_._1.str) ++ List(lastInput))
}
val callService = CALLSERVICE ~ literal ^^ {
case call ~ LITERAL(serviceName) => CallService(serviceName)
}
val switch = SWITCH ~ COLON ~ INDENT ~ rep1(ifThen) ~ opt(otherwiseThen) ~ DEDENT ^^ {
case _ ~ _ ~ _ ~ ifs ~ otherwise ~ _ => Choice(ifs ++ otherwise)
}
exit | readInput | callService | switch
}
def ifThen: Parser[IfThen] = {
(condition ~ ARROW ~ INDENT ~ block ~ DEDENT) ^^ {
case cond ~ _ ~ _ ~ block ~ _ => IfThen(cond, block)
}
}
def otherwiseThen: Parser[OtherwiseThen] = {
(OTHERWISE ~ ARROW ~ INDENT ~ block ~ DEDENT) ^^ {
case _ ~ _ ~ _ ~ block ~ _ => OtherwiseThen(block)
}
}
Just like we did with the lexer, we define an apply method whose Either result can later be chained monadically to express a pipeline of operations:
case class WorkflowParserError(msg: String) extends WorkflowCompilationError
object WorkflowParser extends Parsers {
...
def apply(tokens: Seq[WorkflowToken]): Either[WorkflowParserError, WorkflowAST] = {
val reader = new WorkflowTokenReader(tokens)
program(reader) match {
case NoSuccess(msg, next) => Left(WorkflowParserError(msg))
case Success(result, next) => Right(result)
}
}
}
Pipelining
We have now covered both the lexical and syntactic analyzers. All that’s left is to chain them together:
object WorkflowCompiler {
def apply(code: String): Either[WorkflowCompilationError, WorkflowAST] = {
for {
tokens <- WorkflowLexer(code).right
ast <- WorkflowParser(tokens).right
} yield ast
}
}
Let’s try it on our sample program:
scala> WorkflowCompiler(code)
res0: Either[WorkflowCompilationError,WorkflowAST] = Right(AndThen(ReadInput(List(name, country)),
Choice(List(IfThen(Equals(country,PT),AndThen(CallService(A),Exit)),OtherwiseThen(AndThen(CallService(B),
Choice(List(IfThen(Equals(name,unknown),Exit), OtherwiseThen(AndThen(CallService(C),Exit))))))))))
Great! Our compiler has proven to be capable of parsing valid programs.
Error handling
Let’s now try to parse a syntactically invalid program. Suppose that we had forgotten to quote PT in the first switch case:
scala> WorkflowCompiler(invalidCode)
res1: Either[WorkflowCompilationError,WorkflowAST] = Left(WorkflowParserError(string literal expected))
We get a clear error message reporting an expected string literal. But where? It would be good to know the source location of the error. Luckily for us, Scala’s parser combinators support recording a token’s original source location as it is parsed.
In order to do this, first of all, our WorkflowToken and WorkflowAST traits must be mixed in with Positional. This provides a mutable pos variable and a setPos method, which may be used once to decorate an instance with its line and column numbers. Secondly, we must use the positioned operator on each of our parsers. For instance, the parser for the IDENTIFIER token becomes:
def identifier: Parser[IDENTIFIER] = positioned {
"[a-zA-Z_][a-zA-Z0-9_]*".r ^^ { str => IDENTIFIER(str) }
}
One ugly side effect of the Positional mixin is that all of our tokens must now become case classes instead of case objects, since each one now holds mutable state.
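For instance, a sketch of the revised token hierarchy (only two tokens shown):

import scala.util.parsing.input.Positional

sealed trait WorkflowToken extends Positional
case class IDENTIFIER(str: String) extends WorkflowToken
case class EXIT() extends WorkflowToken // formerly a case object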
Our WorkflowCompilationError subtypes now include the location info…
case class WorkflowLexerError(location: Location, msg: String) extends WorkflowCompilationError
case class WorkflowParserError(location: Location, msg: String) extends WorkflowCompilationError
case class Location(line: Int, column: Int) {
override def toString = s"$line:$column"
}
…which is reported by each phase’s apply method:
def apply(code: String): Either[WorkflowLexerError, List[WorkflowToken]] = {
parse(tokens, code) match {
case NoSuccess(msg, next) => Left(WorkflowLexerError(Location(next.pos.line, next.pos.column), msg))
case Success(result, next) => Right(result)
}
}
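The same applies to the parser phase, with one extra detail: our WorkflowTokenReader returned NoPosition, so for parser errors to carry a location it must instead expose the position of its first token. A sketch, relying on the tokens now being Positional:

class WorkflowTokenReader(tokens: Seq[WorkflowToken]) extends Reader[WorkflowToken] {
  override def first: WorkflowToken = tokens.head
  override def atEnd: Boolean = tokens.isEmpty
  // report the position recorded by the lexer for the next token, if any
  override def pos: Position = tokens.headOption.map(_.pos).getOrElse(NoPosition)
  override def rest: Reader[WorkflowToken] = new WorkflowTokenReader(tokens.tail)
}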
Let’s now try to parse some invalid code again:
scala> WorkflowCompiler(invalidCode)
res1: Either[WorkflowCompilationError,WorkflowAST] = Left(WorkflowParserError(3:14,string literal expected))
Final notes
That’s it! We went from splitting a text stream into a sequence of tokens to finally assembling them into a typed abstract syntax tree, a much more effective representation for reasoning about a program.
We can now extend the compiler to perform other non-parsing related tasks, such as validation (for instance, making sure that all code paths end with the exit keyword) or the most obvious one: code generation, i.e. traversing the AST to generate a sequence of instructions.
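A code generator could look something like the following sketch, where the generate helper and the textual instruction format are purely hypothetical:

// traverses the AST, emitting one pseudo-instruction per step
def generate(ast: WorkflowAST): List[String] = ast match {
  case AndThen(step1, step2) => generate(step1) ++ generate(step2)
  case ReadInput(inputs)     => List(s"READ ${inputs.mkString(",")}")
  case CallService(name)     => List(s"CALL $name")
  case Exit                  => List("EXIT")
  case Choice(alternatives)  => alternatives.toList.flatMap {
    case IfThen(Equals(k, v), block) => s"IF $k == $v" :: generate(block)
    case OtherwiseThen(block)        => "OTHERWISE" :: generate(block)
  }
}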
In case you want to keep playing with Scala’s parser combinators, the final code used for this tutorial is available at: https://github.com/enear/parser-combinators-tutorial