Parser Combinators
Today we will explore how to build a small parser combinator library in Haskell from scratch. This blog post is the result of an experiment to see if I could actually implement this by only looking at the base and text documentation, explicitly without looking at other parser implementations or examples.
I think most other Haskell parser examples will work on String
s, but since String
come with a lot of downsides I will try to run our parser on Text
from the text package and see where that gets me. Thus, our parser will take a Text
and return the leftover unparsed Text
with a parsed result a
or a parse error:
type ParseError = T.Text
newtype Parser a
= Parser { runParser :: T.Text -> (T.Text, Either ParseError a) }
To go on, we should implement useful type classes such as Functor
, Applicative
, Alternative
and Monad
. This will give us lots of combinators for free.
Now let’s define a Functor
instance for our Parser
. The complete minimal definition is fmap :: (a -> b) -> f a -> f b
(apply a function a -> b
to the “contained” value of f
):
instance Functor Parser where
fmap f (Parser parse) =
Parser $ \txt ->
let (rest, result) = parse txt
in (rest, fmap f result)
Next up is the Applicative
instance which requires the definitions of pure :: a -> f a
to lift a value a
into the f
“universe” (in our case Parser
), and (<*>) :: f (a -> b) -> f a -> f b
to sequentially apply two parsers and combine the result.
instance Applicative Parser where
pure val = Parser $ \txt -> (txt, Right val)
(Parser funParser) <*> continue =
Parser $ \txt ->
let (rest, result) = funParser txt
in case result of
Left err -> (rest, Left err)
Right f -> runParser (fmap f continue) rest
The <*>
implementation is a little bit more interesting, so here is what it does in words: first, we run the left hand parser to receive the function (a -> b
) which we must then apply to the value of the second parser. Next, we either propagate failure, or we use our previously defined Functor
instance to convert the second parser Parser a
to a Parser b
and run that on the left over of the left hand parser.
After defining Applicative
we will also implement it’s close friend and very handy Alternative
:
instance Alternative Parser where
empty = Parser $ \txt -> (txt, Left "Parsing failed!")
(Parser pa) <|> otherParser =
Parser $ \txt ->
case pa txt of
full@(_, Right _) -> full
_ -> runParser otherParser txt
empty
is just a failing Parser
that does nothing - this is because we can not invent an arbitrary a
. <|>
will first try to run the left hand side parser, and if that succeeds then it will return the result. Otherwise, the right hand parser is run.
Now it’s time for becoming a Monad
!
instance Monad Parser where
return = pure
fail errMsg = Parser $ \txt -> (txt, Left $ T.pack errMsg)
(Parser parse) >>= next =
Parser $ \txt ->
let (leftOver, res) = parse txt
in case res of
Left errMsg -> (leftOver, Left errMsg)
Right val -> runParser (next val) leftOver
With all those abstract concepts implemented we are ready to write concrete parsers. Let’s start out by writing a parser that reads input until the predicate on each subsequent character fails:
satisfy :: (Char -> Bool) -> Parser T.Text
satisfy f =
Parser $ \txt ->
let (matches, rest) = T.span f txt
in (rest, Right matches)
This is very simple because the text
library already provides a function span :: (Char -> Bool) -> Text -> (Text, Text)
that essentially does the heavy lifting efficiently for us. We also want a satisfy1
function, that requires that we read at least one character:
satisfy1 :: (Char -> Bool) -> Parser T.Text
satisfy1 f =
satisfy f >>= \res ->
do when (T.null res) $ fail "satisfy1 didn't read anything!"
pure res
This combination gives us skileWhile
and skipWhile1
for free:
skipWhile :: (Char -> Bool) -> Parser ()
skipWhile = void . satisfy
skipWhile1 :: (Char -> Bool) -> Parser ()
skipWhile1 = void . satisfy1
Now we’ll write a parser for a specific single character Char
and a whole string T.Text
.
char :: Char -> Parser Char
char c =
Parser $ \txt ->
case T.uncons txt of
Just (firstC, rest) | firstC == c -> (rest, Right c)
_ -> (txt, Left $ T.pack $ "Expected a " ++ show c)
string :: T.Text -> Parser T.Text
string t =
Parser $ \txt ->
let tlen = T.length t
in if T.take tlen txt == t
then (T.drop tlen txt, Right t)
else (txt, Left $ T.pack $ "Expected " ++ show t)
To implement parsers for Int
and Double
we will cheat a little and use the read :: Read a => String -> a
function from base. Usually I’d go for readMay :: Read a => String -> Maybe a
from the safe package, but thanks to our already defined parser combinators we can be quite sure that our read
will not crash at runtime:
numStarter :: Parser T.Text
numStarter =
do optNeg <- optional (char '-')
rest <- satisfy1 isDigit
pure $ maybe rest (`T.cons` rest) optNeg
int :: Parser Int
int = fmap (read . T.unpack) numStarter
double :: Parser Double
double =
do firstPart <- numStarter
secondPart <-
optional $
do ch <- char '.'
rest <- satisfy1 isDigit
pure (ch `T.cons` rest)
pure $ (read . T.unpack) (firstPart <> fromMaybe "" secondPart)
Now is probably a good time to define some unit tests for our parsers. We use the excellent HTF package for this:
test_char :: IO ()
test_char =
do assertEqual ("Fooo", Right 'c') (runParser (char 'c') "cFooo")
assertEqual ("Fooo", Left "Expected a 'c'") (runParser (char 'c') "Fooo")
assertEqual ("", Left "Expected a 'c'") (runParser (char 'c') "")
test_string :: IO ()
test_string =
do assertEqual ("Fooo", Right "cc") (runParser (string "cc") "ccFooo")
assertEqual ("Fooo", Left "Expected \"cc\"") (runParser (string "cc") "Fooo")
assertEqual ("", Left "Expected \"cc\"") (runParser (string "cc") "")
test_int :: IO ()
test_int =
do assertEqual ("bar", Right 23) (runParser int "23bar")
assertEqual ("bar", Right (-23)) (runParser int "-23bar")
assertEqual (".bar", Right 23) (runParser int "23.bar")
assertEqual ("a23.bar", Left "skipWhile1 didn't ready anything!") (runParser int "a23.bar")
assertEqual ("", Left "skipWhile1 didn't ready anything!") (runParser int "")
test_double :: IO ()
test_double =
do assertEqual ("bar", Right 23) (runParser double "23bar")
assertEqual ("bar", Right (-23)) (runParser double "-23bar")
assertEqual ("bar", Right 23.2) (runParser double "23.2bar")
assertEqual (".bar", Right 23) (runParser double "23.bar")
assertEqual ("a23.bar", Left "skipWhile1 didn't ready anything!") (runParser double "a23.bar")
assertEqual ("", Left "skipWhile1 didn't ready anything!") (runParser double "")
Great, our basic building blocks seem to be working! As you can see the error messages our parsers produce are not quite useful (yet?), but this might be material for a possible blog post in the near future.
Now let’s write a parser for this simple data file:
language: haskell; type: functional;
language: purescript; type: functional;
language: java; type: oop;
We would like to parse it into the following Haskell data structures:
data LanguageType
= LanguageTypeFunctional
| LanguageTypeOOP
deriving (Show, Eq)
data Language
= Language
{ l_name :: T.Text
, l_type :: LanguageType
} deriving (Show, Eq)
type LangList = [Language]
We start off by writing parsers for the building blocks, with tests:
langType :: Parser LanguageType
langType =
LanguageTypeFunctional <$ string "functional"
<|> LanguageTypeOOP <$ string "oop"
langName :: Parser T.Text
langName = satisfy1 (\c -> not (isSpace c) && c /= ';')
test_langType :: IO ()
test_langType =
do assertEqual ("", Right LanguageTypeFunctional) (runParser langType "functional")
assertEqual ("", Right LanguageTypeOOP) (runParser langType "oop")
assertEqual ("foobar", Left "Expected \"oop\"") (runParser langType "foobar")
test_langName :: IO ()
test_langName =
do assertEqual ("", Right "haskell") (runParser langName "haskell")
assertEqual ("", Right "java") (runParser langName "java")
assertEqual (" bar baz", Right "java") (runParser langName "java bar baz")
and combining them to parse a single row:
skipVertSpace :: Parser ()
skipVertSpace = skipWhile (\c -> c == ' ' || c == '\t')
lang :: Parser Language
lang =
do void $ string "language:" *> skipVertSpace
name <- langName
skipVertSpace
void $ char ';'
skipVertSpace
void $ string "type:" *> skipVertSpace
ty <- langType
skipVertSpace
void $ char ';'
skipVertSpace
pure (Language name ty)
test_lang :: IO ()
test_lang =
do assertEqual ("", Right $ Language "haskell" LanguageTypeFunctional)
(runParser lang "language: haskell; type: functional;")
assertEqual ("", Right $ Language "java" LanguageTypeOOP)
(runParser lang "language:java; type:oop; ")
assertEqual ("language1:!java; type:oop; ", Left "Expected \"language:\"")
(runParser lang "language1:!java; type:oop; ")
To write a parser for the whole file, we need to introduce two new parser combinators. sepBy
will be used to parse values separated by a separator:
sepBy :: Parser val -> Parser sep -> Parser [val]
sepBy valP sepP =
do listHead <- optional valP
case listHead of
Nothing -> pure []
Just x ->
do rest <- many (sepP *> valP)
pure (x : rest)
and endOfInput
is a combinator to check if we consumed all input:
endOfInput :: Parser ()
endOfInput =
Parser $ \txt ->
if T.null txt
then (txt, Right ())
else (txt, Left "Expecting endOfInput")
Putting it all together, our file parser (with test of course!) will look like this:
langFile :: Parser LangList
langFile =
(lang `sepBy` char '\n') <* skipWhile isSpace <* endOfInput
test_langFile :: IO ()
test_langFile =
assertEqual ("", Right langList) (runParser langFile sampleFile)
where
langList =
[ Language "haskell" LanguageTypeFunctional
, Language "purescript" LanguageTypeFunctional
, Language "java" LanguageTypeOOP
]
sampleFile =
T.unlines
[ "language: haskell; type: functional;"
, "language: purescript; type: functional;"
, "language: java; type: oop;"
]
Success! Now this is just the beginning of a parser combinator library, there are still many areas to be explored such as nicer error messages, backtracking, performance concerns and of course more combinators! You should probably use one of the awesome parser combinator libraries out there that address these issues:
- attoparsec: Addresses backtracking and performance concerns
- parsec: Addresses error messages and backtracking (
try
operator) - megaparsec: modern version of
parsec
- (and many more…)
That’s all for now - a working project can be found on GitHub: agrafix/parser-playground. To build and run the tests, clone the project an run stack test
.
Feel free to join the discussion on reddit or read the original paper