SuperRecord: Anonymous Records for Haskell
Many Haskell programs that reach a certain size need to solve at least two core problems. First, all your logic runs in some kind of environment that provides logging, configuration, other external data such as templates, and/or some global application state. This environment must be managed somehow and correctly passed to the specific logic. Secondly, you will need to read and write data in the real world, for example to talk to a JSON REST API. Let's take a detailed look at these problems and how we can solve them today.
Motivation
Environments
A common, simplified example of an environment looks similar to this:
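(A minimal sketch, reconstructed from the field names used below; the exact set of fields is illustrative.)

```haskell
data Logger = Logger
  { l_logInfo  :: String -> IO ()  -- log an informational message
  , l_logError :: String -> IO ()  -- log an error message
  }

data Config = Config
  { c_port     :: Int              -- port the service listens on
  , c_hostname :: String           -- hostname the service runs under
  }

data Env = Env
  { e_logger :: Logger
  , e_config :: Config
  }
```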
The `Env` is created once at the start of the program and then passed to all components of the application. This can easily be achieved by simply passing it to every function as a parameter:
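(A hypothetical call chain threading `Env` through explicitly; the function names here are made up for illustration.)

```haskell
runApp :: Env -> IO ()
runApp env =
  do l_logInfo (e_logger env) "starting up"
     doWork env

doWork :: Env -> IO ()
doWork = doDeepWork  -- imagine several more layers here

doDeepWork :: Env -> IO ()
doDeepWork env =
  l_logInfo (e_logger env)
    ("working on port " ++ show (c_port (e_config env)))
```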
Now, for this example, this may look like a reasonable thing to do, but the approach has two drawbacks. As the program grows, we have to pass this `env :: Env` around many times, which results in slightly obfuscated code. More importantly, if a function further down the stack only needs a small portion of `Env`, for example only `l_logInfo` of `Logger` and `c_port` of `Config`, we will either need to introduce a new data type containing only the needed fields (`LogInfoPort`) and convert the `Env` to that, or we keep taking `Env` as a parameter and have a hard time reasoning about what the function does and reusing or testing it, because we then need to fill in all the other unused fields with garbage.
To solve the problem of passing things around to everything, we can use the `ReaderT Env IO a` monad. This gets rid of having to pass the parameter around all the time. To attack the second problem, we can introduce type classes that denote single components of our config:
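(A sketch of such component classes for the `Env` above; the class and accessor names are assumptions.)

```haskell
class HasLogger env where
  getLogger :: env -> Logger

class HasConfig env where
  getConfig :: env -> Config

instance HasLogger Env where
  getLogger = e_logger

instance HasConfig Env where
  getConfig = e_config
```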
And then describe what our function needs using those constraints:
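(Continuing the sketch, `doDeepWork` from above can now state exactly which components it uses:)

```haskell
import Control.Monad.Reader (ReaderT, ask, liftIO)

doDeepWork :: (HasLogger env, HasConfig env) => ReaderT env IO ()
doDeepWork =
  do env <- ask
     liftIO $
       l_logInfo (getLogger env)
         ("working on port " ++ show (c_port (getConfig env)))
```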
This idea is explored in depth in a recent FPComplete post by Michael Snoyman. The idea is pretty decent, but it also has some drawbacks: how do we express nested constraints? If we, for example, only need `c_port` of `Config`, we could add a new type class (sketched below) and write two instances for it, one for `Config` and one for `Env`. The result is lots of type classes, and we start to lose track of what happens again. Another drawback is that while we can now construct smaller environments for testing (or for reuse), we still need to introduce new types and write type class instances for every smaller bit.
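(A sketch of such a nested accessor class and its two instances:)

```haskell
class HasPort env where
  getPort :: env -> Int

instance HasPort Config where
  getPort = c_port

instance HasPort Env where
  getPort = c_port . e_config
```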
We will look into solutions, but let's introduce the other problem first.
Talking to the real world
Whether talking to a REST API or writing a REST service, one always needs to specify the structure of the data being accepted or sent. A common path in Haskell is to use JSON as the data format, with `aeson` as the library and the `ToJSON` and `FromJSON` type classes specifying the structure of the JSON being sent or parsed. This requires you to define a Haskell type for each JSON payload you want to send. For example, when implementing a REST service, we would define a `RequestXX` and a `ResponseXX` type for each endpoint and implement `FromJSON RequestXX` and `ToJSON ResponseXX`. In our handler, we would first read the request into `RequestXX`, deconstruct and work on the data of the `RequestXX` type, and then pack the results back into a `ResponseXX` type. For good maintainability, code reuse, and testability we should not implement business logic functions in terms of any `RequestXX` or `ResponseXX` type, so we always end up writing conversions between business logic types and these request/response types. Also, if our request/response types only slightly differ between endpoints, we need to introduce yet another type (and implement serialization/parsing again). Some of these problems can be partially resolved as in the `ReaderT` case: we could also write `HasXX` type classes and implement instances for our request/response types, but that does not solve the problem of defining many types and writing many similar JSON parsers/serializers, and it comes with the drawbacks mentioned above.
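(To illustrate the boilerplate, a single hypothetical endpoint might already require something like this; all names are invented for the example:)

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, ToJSON)
import GHC.Generics (Generic)

data RequestSignup = RequestSignup
  { rq_email    :: String
  , rq_password :: String
  } deriving (Generic)

instance FromJSON RequestSignup  -- via Generic defaults

data ResponseSignup = ResponseSignup
  { rsp_userId :: Int
  } deriving (Generic)

instance ToJSON ResponseSignup   -- via Generic defaults
```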
A solution?
One way to solve these problems is via anonymous records. Let's take a look.
Anonymous records
An anonymous record is similar to a `data` type, but instead of defining it up front with a `data` declaration, we can declare and use it on the fly. Here's an example from the Haskell superrecord package, which implements the idea of anonymous records:
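(Reconstructed from the package README; `rnil` is the empty record and `&` prepends a field.)

```haskell
{-# LANGUAGE DataKinds, OverloadedLabels, TypeOperators #-}
import SuperRecord

person =
    #name := "Alexander"
  & #age := (23 :: Int)
  & rnil
```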
The type of `person` is `person :: Rec '["name" := String, "age" := Int]`. This basically means `person` is a record (`Rec`) that has the fields `name` and `age` of types `String` and `Int`. We will explain the concrete meaning and machinery later on. Using Haskell's native `data` types, it would look like this:
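```haskell
data Person = Person
  { name :: String
  , age  :: Int
  }

person :: Person
person = Person { name = "Alexander", age = 23 }
```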
One trivial example of why the first representation is beneficial is that we can write a function that requires at least a field `name`:
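(A sketch; the function name `greet` is made up:)

```haskell
greet :: Has "name" r String => Rec r -> String
greet record = "Hello " ++ get #name record
```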
where the `Has "name" r String` constraint means: "the record of type `Rec r` must have a field `name` of type `String`". This function can work on many values, like:
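```haskell
greet (#name := "Alex" & rnil)
greet (#name := "Alex" & #age := (23 :: Int) & rnil)
greet (#age := (23 :: Int) & #name := "Alex" & rnil)
```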
whereas the equivalent function written against the native `Person` type can only work on values of type `Person`.
Again, this could be solved using type classes, but you'd still need to write a new `PersonX` type for each of the three examples above and implement a `HasName` instance for all of them. Circling back to the "environment problem", we can now model our environments as anonymous records and use the `Has` constraint to state exactly what we need! For example, for `doDeepWork` from above:
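(One possible version; the field names `logInfo` and `port` are assumptions, flattening the environment for brevity:)

```haskell
doDeepWork ::
     (Has "logInfo" env (String -> IO ()), Has "port" env Int)
  => Rec env -> IO ()
doDeepWork env =
  get #logInfo env ("working on port " ++ show (get #port env))
```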
We can also easily express nested environment constraints:
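(A sketch using a record-valued field; `config` and `port` are again assumed names:)

```haskell
doNestedWork ::
     (Has "config" env (Rec cfg), Has "port" cfg Int)
  => Rec env -> IO ()
doNestedWork env =
  print (get #port (get #config env))
```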
The big advantage here is that we do not need to introduce any new data types or type classes; we can simply write down which dependencies our functions actually have, and then, after combining them or when testing them, provide just what they need. For example:
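(A test can construct exactly the environment the function needs, and nothing more:)

```haskell
testDeepWork :: IO ()
testDeepWork =
  doDeepWork (#logInfo := putStrLn & #port := (8080 :: Int) & rnil)
```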
Moving on to our real-world data problem: these records already look a lot like JSON, so they seem to fit our request/response problem pretty well. In fact, the `superrecord` library has a JSON representation built in that works as expected:
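(For example, in GHCi, using aeson's `encode`; the exact key order in the output may differ:)

```haskell
> encode (#name := "Alexander" & #age := (23 :: Int) & rnil)
"{\"name\":\"Alexander\",\"age\":23}"
```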
This means that we no longer need to write `RequestXX` and `ResponseXX` types; instead, we can directly parse into a `superrecord` record and read/write fields as needed when converting to/from business logic types, with the JSON parsing/serialization taken care of automatically. There's even an interesting library in the works by Denis Redozubov that could one day provide automatic migrations on top of that.
Related work
Before looking at how `superrecord` works, let's discuss the existing options. We discovered the following Haskell packages: `labels`, `vinyl`, `rawr`, and `bookkeeper`.
vinyl
The vinyl package is one of the oldest packages, with its first release in 2012. The core type of `vinyl` is:
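```haskell
data Rec :: (u -> *) -> [u] -> * where
  RNil :: Rec f '[]
  (:&) :: !(f r) -> !(Rec f rs) -> Rec f (r ': rs)
```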
This is basically a heterogeneous linked list that tracks its contents at the type level. It's very general: using the first type parameter, you can control the structure of the individual values. If we provide `Identity`, we basically get a heterogeneous list. If we instead provide a labeled wrapper, roughly like vinyl's `ElField`:
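```haskell
-- roughly vinyl's labeled wrapper (the exact definition varies by version):
data ElField (field :: (Symbol, *)) where
  Field :: KnownSymbol s => !t -> ElField '(s, t)
```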
we can add a label to each value and thus get the desired anonymous records described earlier. The library provides many useful combinators to work with these, but it comes with a major drawback: the core type is a linked list! Each field access is therefore O(n) (compared to O(1) for native `data` types), which practically means that reading fields "at the back of" the record takes longer as the record grows. The library also does not have out-of-the-box support for OverloadedLabels syntax like `get #somefield`, but this could trivially be added.
bookkeeper
The bookkeeper package is more concrete than `vinyl`: it focuses on anonymous records and encourages the use of `OverloadedLabels`. Take a look at the example from the README:
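(Paraphrased from memory of the README; `emptyBook` and `=:` are bookkeeper's combinators:)

```haskell
jane :: Book '[ "age" :=> Int, "name" :=> String ]
jane = emptyBook
     & #age =: 30
     & #name =: "Jane"
```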
It also provides a type class to convert to native Haskell types of the same structure (via `Generic`), but unfortunately the core data type is essentially again a linked list, with the corresponding performance degradation compared to native Haskell `data` types.
rawr and labels
Both the rawr and labels packages are built around Haskell tuples. Thus, they do not have a core data type; instead, they build up records using type classes defined on tuples and a `Field` data type. Taken from the `labels` package:
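```haskell
-- the labeled-field type from the labels package (roughly):
data label := value = KnownSymbol label => Proxy label := value
```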
Records look like this: `(#foo := "hi", #bar := 123)`. This is an interesting idea, especially as GHC can optimize these tuples like native `data` types, giving similar performance: the type classes explicitly encode a field read as getting the n-th element of the tuple. The major drawback is that defining new type class instances for these records is very tedious, as one must use code generation (e.g. TemplateHaskell) to generate instances for all tuple combinations up to a certain size. For usual type classes, O(n^2) instances are needed (where n is the number of fields you would like to support); for type classes that append two records, one would need O(n^4) instances!
SuperRecord
The drawbacks of the libraries mentioned above resulted in the design and implementation of `superrecord`. The goals are to provide fast anonymous records while still having a manageable way to define type class instances. This resulted in the idea of a heterogeneous array holding the values while tracking the contents at the type level. We define a core data type:
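```haskell
-- the core type: a frozen SmallArray# of type-erased values, with the
-- field labels and types tracked in the type-level list lts
data Rec (lts :: [*]) = Rec { _unRec :: SmallArray# Any }
```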
At the type level, we track which field sits at which position in the physical storage array (in our case, equal to its position in the type-level list) and what the type of that field is. We erase the types at the lowest level (all elements are `Any`) and do not use something like `Data.Dynamic`, to avoid the overhead of useless casting: if our type families are correct, we certainly know the types of all elements. We did not pick `Vector` from the vector library as our backing type, as it internally still does some bounds checking (which we don't need, for the same reason). Also, `Vector` is represented as an `Array#`, but records are typically smaller than 128 fields, and we directly freeze all arrays to retain functional semantics, so `SmallArray#` is the better-suited representation from a space and performance point of view (e.g. no card table is needed).
With this definition at hand, we can now start building our record library. We also need a type for labeled fields, similar to the `labels` approach:
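```haskell
data label := value = KnownSymbol label => FldProxy label := !value
```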
with which we can now define functions for building a `Rec`. `FldProxy` is `data FldProxy (t :: Symbol) = FldProxy`, which allows writing a non-orphan instance `IsLabel l (FldProxy l)` to enable `OverloadedLabels` and the `get #field` notation. To create an empty record, we write:
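(A sketch close to the actual implementation, using GHC primops; it needs `MagicHash` and `UnboxedTuples`, plus imports from `GHC.Exts`, `GHC.IO (IO(..))`, and `System.IO.Unsafe`.)

```haskell
rnil :: Rec '[]
rnil =
  unsafePerformIO $! IO $ \s# ->
    case newSmallArray# 0# (error "No value") s# of
      (# s'#, arr# #) ->
        case unsafeFreezeSmallArray# arr# s'# of
          (# s''#, a# #) -> (# s''#, Rec a# #)
```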
This allocates a new `SmallArray#` with zero elements and directly freezes it. On the type level, we track that the record is empty: `Rec '[]`. To add a field and value to the record, we define:
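(Again a sketch close to the actual implementation; `KeyDoesNotExist` and `RecSize` are the type families discussed just below.)

```haskell
rcons ::
     forall l t lts s.
     (RecSize lts ~ s, KnownNat s, KeyDoesNotExist l lts)
  => l := t -> Rec lts -> Rec (l := t ': lts)
rcons (_ := val) (Rec vec#) =
  unsafePerformIO $! IO $ \s# ->
    case newSmallArray# (size# +# 1#) (error "No value") s# of
      (# s'#, arr# #) ->
        case copySmallArray# vec# 0# arr# 0# size# s'# of
          s''# ->
            case writeSmallArray# arr# size# (unsafeCoerce# val) s''# of
              s'''# ->
                case unsafeFreezeSmallArray# arr# s'''# of
                  (# s''''#, a# #) -> (# s''''#, Rec a# #)
  where
    !(I# size#) = fromIntegral (natVal' (proxy# :: Proxy# s))
```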
This allocates a new `SmallArray#` one element larger than the given array, copies all the existing elements (physically, pointers) into the new array, and writes the new element into the free slot. Finally, the array is frozen again. The new element must also be cast to `Any` using `unsafeCoerce`. At the type level, we check that the provided key `l` does not already exist in the existing record `Rec lts`, and we compute the size of the existing record:
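(Sketches of the two type-level helpers:)

```haskell
type family RecSize (lts :: [*]) :: Nat where
  RecSize '[] = 0
  RecSize (l := t ': rest) = 1 + RecSize rest

type family KeyDoesNotExist (l :: Symbol) (lts :: [*]) :: Constraint where
  KeyDoesNotExist l '[] = ()  -- key is free
  KeyDoesNotExist l (l := t ': rest) =
    TypeError ('Text "Duplicate key: " ':<>: 'Text l)
  KeyDoesNotExist q (l := t ': rest) = KeyDoesNotExist q rest
```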
Finally, we add the label and the new field's type to our record type and get `Rec (l := t ': lts)`. Note that the physical order of the elements is the reverse of the order on the type level. There's no upside or downside to this; we could just as well copy the old array to an offset of 1 and write the new element to position 0. We only need to keep this in mind when combining two records and when reading from or writing to them.
Reading a field from the record is now simply:
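(A sketch close to the library's definition; the `Has l lts v` constraint seen earlier bundles these constraints.)

```haskell
get ::
     forall l v lts.
     (RecTy l lts ~ v, KnownNat (RecVecIdxPos l lts))
  => FldProxy l -> Rec lts -> v
get _ (Rec vec#) =
  let !(I# index#) =
        fromIntegral (natVal' (proxy# :: Proxy# (RecVecIdxPos l lts)))
  in case indexSmallArray# vec# index# of
       (# a# #) -> unsafeCoerce# a#
```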
using `RecTy l lts` to compute the type of the field with label `l` in the record `Rec lts`, and `RecVecIdxPos l lts` to compute the physical index position of the label `l` in the record `Rec lts`.
Using `natVal'`, we bring the index position to the value level and read our `SmallArray#` at that position, using `unsafeCoerce` to cast the element back to its original type. Setting a field is implemented using the same information as `get` and `rcons`. All other operations are either implemented in terms of `get` and `set`, or leverage the presented ideas to compute physical locations from the type structure.
We can also convert our records to and from native Haskell `data` types by leveraging `GHC.Generics` in a very straightforward way:
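(A usage sketch, assuming superrecord's `fromNative`/`toNative` and a `Generic` instance for the `Person` type from earlier:)

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics (Generic)

data Person = Person { name :: String, age :: Int }
  deriving (Show, Generic)

personRec :: Rec '["name" := String, "age" := Int]
personRec = fromNative (Person "Alexander" 23)

personBack :: Person
personBack = toNative personRec
```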
To implement type classes like `ToJSON`, we implement a reflection mechanism that allows a function, given some constraints `c`, to be applied to each field and value of a record. For example, converting any `Rec lts` to an aeson `Value` would look like this:
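(A condensed sketch; `reflectRec`, `RecApply`, and `ConstC` are superrecord's reflection machinery, and the helper's shape is assumed to be roughly as in the comment.)

```haskell
-- assumed shape of the reflection helper:
--   reflectRec
--     :: RecApply lts lts c
--     => Proxy c
--     -> (forall a. c a => String -> a -> r)
--     -> Rec lts
--     -> [r]
instance RecApply lts lts (ConstC ToJSON) => ToJSON (Rec lts) where
  toJSON record =
    object $
      reflectRec
        (Proxy :: Proxy (ConstC ToJSON))
        (\lbl v -> T.pack lbl .= v)
        record
```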
This confirms that we reached our first goal: we can reason about the structure of `Rec lts` using type classes and type families, allowing us to write general type class instances and transformations without code generation or other boilerplate.
The library also provides many more type class instances and combinators (like appending two records), which can be found in the superrecord Haddock documentation.
Benchmarks
To confirm that our second goal, performance, was met, we conducted some benchmarks. We also looked at the generated assembly to confirm that a simple field get with `superrecord` compiles to the same code as a native field read. We benchmark against the three other approaches: native `data` types, tuples (via `labels`), and linked lists (via `bookkeeper`).
| library | get | nested get |
| --- | --- | --- |
| native | 7.7 | 17.1 |
| labels | 8.1 | 20.2 |
| bookkeeper | 9.3 | 24.2 |
| superrecord | 8.0 | 23.0 |
(all times in ns)
As we can see, for simple field reads we are competitive. At the time of writing, I am unsure why the nested get is slower; the number of pointers that need to be followed should be roughly similar, so this will have to be researched further.
| library | JSON read/write |
| --- | --- |
| native (Generics) | 334.9 |
| superrecord | 274.1 |
(all times in µs)
The `superrecord` JSON roundtrip is faster than the one generated from native types using `Generics`. The `superrecord` variant is optimized here: we first allocate one single `SmallArray#` with enough room for all fields and then fill it field by field without any copying. More benchmarks can be found in the repository.
Outlook
One idea that surfaced during development was that it may be possible to inline nested records into a single `SmallArray#` to speed up nested gets and sets. The problem is that this would remove sharing when one extracts a sub-record and works with it, but the usual code paths we found in our applications mostly read individual (leaf) fields, which would indeed benefit from such an approach.
Another area of exploration would be database libraries, where SQL is written in a type-driven DSL (for example opaleye). In the case of joins, anonymous records should greatly simplify library APIs.
Conclusion
The end result is pretty satisfying: a practical library for anonymous records that is both fast and has an ergonomic interface for using and extending it.