A Tiny Introduction to Parsers
2020-07-08
A Question
The other day, someone asked how parsers worked on The Programmer’s Hangout Discord server. Here’s an an abridged version of my explanation.
My Response
A parser simply takes a stream of data and from it produces a datastructure. This stream is usually a stream of tokens, which are groups of tagged delimiters and data used to construct the datastructure.
Let’s assume you’re building a parser for a programming language; the parser converts a file to an abstract syntax tree. We start with a file; in this case, it’s AverageScript:
for i in [1, 2, 3] {
print(i)
}
And lex it into a series of tokens by walking over the file linearly. So, for instance:
for: keyword
i: identifier
in: keyword
[: delimiter
1: number
,: separator
2: number
... snip ...
}: delimiter
This is the stream of tokens that will be fed into the parser to form the AST. So, how exactly does a parser work?
There are quite a few different kinds of parsers, but they all essentially reduce to the same set of steps:
- Look at the next few tokens to determine what is being parsed
- For each sub-section of what is being parsed, repeat this process.
- Using this information, build the datastructure.
Most parsers are recursive for this reason; for example, in most languages, a function call can contain another function call:
foo(bar(1, 3))
Instead of manually writing out cases for nesting, after we determine we’re looking at a function, we simply parse the contents of foo(...)
according to some pre-defined rule.
So returning to our original case: how might a parser parse the for i in ...
?
Assuming some form of a recursive descent parser:
- The parser sees
for
and determines a for loop is being parsed; In this made-up language, a for loop has three components, an identifier, an iterator, and a body. - It takes the next token,
i
and checks that it’s an identifier. - It takes the next token,
in
and makes sure it’s the in keyword. - It recursively evaluates the next series of tokens as an expression
- It sees
[
so it starts parsing as a list - It sees
2
and adds it to said list - And so on until the entire list is parsed
- It sees
- It checks for
{
delimiting a for body - It recursively parses the next few tokens as a block of code
- and so on until the entire code block is parsed
- It checks for
}
, ending the for loop - It constructs a for node and returns it.
The end of this process might result in an AST that looks like so:
For {
identifier: i,
expression: List {
items: [1, 2, 3]
},
body: Code {
statements: [
Function {
identifier: print,
arguments: [i],
},
],
},
}
This AST can then be walked, interpreted, compiled, etc. There exist common formats to describe parsers; one such format is Extended Backus-Naur Form. Many parsing techniques exist; what I described models a LL(1) parser. Many parsing models exist; to learn more about them, I recommend you check the Parsing Wikipedia page.
So, to summarize:
- A parser takes a stream of data and builds a data-structure
- Parsers generally work in a recursive manner by defining a set of rules that can be efficiently transversed.
- Parsers can be formally described.
- Many different parsing methods exist.
I hope this helped clear some things up 🙂.