lamdera/containers

lamdera/containers

Released on 2024-09-22View source code

This package was released along with version 1.3 of the Lamdera compiler. It uses kernel code^(?) to allow Elm programmers to finally have a Dict and Set type that does not require comparable keys^(?) or require any workarounds, while also being similar in performance to the built-in Dict and Set types.

In this article I want to show a discovery I made while working on this. It's a set of significant trade-offs I faced that, as far as I can tell, are unavoidable regardless of how you go about designing a Dict/Set data structure.

But first some thanks are in order!

Thank you Robin for the original implementation of this package. You handled most of the work for me!
Thank you Ambue for letting me work on this package during work hours.
Thank you miniBill for the Dict test suite. Without it I would have missed some critical bugs.

So here's what I discovered

First, for the purposes of explaining this discovery, lets say a Dict is a type with the following API:

get : key -> Dict key value -> Maybe value

insert : key -> value -> Dict key value -> Dict key value

fromList : List (key, value) -> Dict key value

toList : Dict key value -> List (key, value)

Now let me list some properties that are really nice to have in a Dict type:

The key type variable doesn't have to be comparable
Two dicts with exactly the same key-value pairs are equal regardless of insertion order. For example, fromList [ ("X", 0), ("Y", 1) ] == fromList [ ("Y", 1), ("X", 0) ]
If dictA == dictB then f dictA == f dictB where f can be any function^(?)
Renaming/reordering record fields or custom type variants^(?) should never change the order of key-value pairs returned from toList

My discovery is that, as far as I can tell, regardless of what programming language you use or performance characteristics you allow for, it's impossible to have more than three of these properties in a Dict type.

For the sake of brevity, I'll refer to these properties by number through-out the rest of the article.

Why should you care?

I expect some readers will wonder why these properties matter. Maybe they feel arbitrary. Let me make a case for why they are important:

We want keys to be non-comparable for the sake of convenience and type-safety. It's annoying when I have type Id = Id String or { x : Int, y : Int } and I have to convert them into a comparable type before they can be used as a key. This leads to boilerplate code and increased risk of mixing up types which can lead to bugs.
Having two Dicts be equal based on contents, regardless of insertion order is nice but often not that big a deal. But for Sets it's kind of their whole purpose. And Sets are essentially just an alias for Dict key ()
dictA == dictB implies f dictA == f dictB is quite handy:
- If two values are equal and the same function is applied to both, you can save yourself CPU time by only computing one and reusing the result. This can also be used for caching.
- It's easier to reason about what some code does if you are certain that == means two values are indistinguishable no matter what you do to them.
- Any code written in Elm will have this property. It would be a shame for lamdera/containers to introduce some kernel code that violates this!
People often mention that when refactoring in Elm, they feel safe making large changes to their code. Part of why this is true is because renaming or reordering functions/types/fields/variants will never affect the runtime behavior of Elm code (with the sole exception being field order affecting record constructors^(?)). It would be a shame to lose this just because, Dict.toList would change when renaming/reordering record fields or custom type variants.

Rock solid mathematical proof

I don't have a mathematical proof to back this claim but let me explain why I believe it is true.

First, lets consider the built-in Dict type. Which of the 4 properties does it support?

Well it fails on 1. You can only have comparable keys.
It passes on 2. You'll find that fromList [ ("X", 0), ("Y", 1) ] == fromList [ ("Y", 1), ("X", 0) ] is indeed true.
Passes on 3. dictA == dictB implies f dictA == f dictB
And passes on 4. Since the built-in Dict only allows for comparable keys, records and custom types aren't allowed. Which in turn means there's no way for renames or reorderings to affect toList

It's nice that the built-in Dict passes on 2, 3, and 4. But comparable keys are really restrictive! So let's try allowing for non-comparable keys while trying to keep those other nice properties.

Well, the question we immediately encounter is, how should toList sort the list of key-value pairs it returns? With the built-in Dict this is easy because all of the keys are comparable values. But what do we do if we have non-comparable keys? For example, suppose we have the following custom type being used as our key.

type Color
    = Red
    | Green
    | Blue

myDict =
    fromList [ (Blue, 2), (Green, 1), (Red, 0) ]

We have three options:

Sort based on variant names. For example alphabetically sort the names in ascending order. Which gives us toList myDict == [ (Blue, 2), (Green, 1), (Red, 0) ]
Sort by the order the variants are defined in code. That gives us toList myDict == [ (Red, 0), (Green, 1), (Blue, 2) ]
Sort the list based on the order the key-value pairs were added. Then we get toList myDict == [ (Blue, 2), (Green, 1), (Red, 0) ] which is just the original list.

Do any of these approaches to sorting let us keep all 4 of the nice-to-have properties?

Violates 4. If we rename a variant, it could potentially change its alphabetical ordering and thereby change the output of toList.
Also violates 4. If we reorder our variants, that will change the output of toList

This does not violate 4! But what about 2 and 3?

Well it depends on how == is implemented for our Dict type. Supposed dictA == dictB is the same as writing toList dictA == toList dictB. In that case 2 fails in the following example:

a = fromList [ ("X", 0), ("Y", 1) ] == fromList [ ("Y", 1), ("X", 0) ]

-- is converted into this because of how we defined == to work for dict equality
a = toList (fromList [ ("X", 0), ("Y", 1) ]) == toList (fromList [ ("Y", 1), ("X", 0) ])

-- simplifies to this since ordering by insertion means toList and fromList cancel eachother out
a = [ ("X", 0), ("Y", 1) ] == [ ("Y", 1), ("X", 0) ]

-- which is
a = False

On the bright side, property 3 is valid at least!

Okay well how about we make it so dictA == dictB checks that both dicts have the same key-values pairs while ignoring order? In that case 2 is valid! But let's look at 3. Consider this code:
```
dictA = fromList [ ("X", 0), ("Y", 1) ]
dictB = fromList [ ("Y", 1), ("X", 0) ]

a = dictA == dictB --> True

b = toList dictA == toList dictB --> False
```
This violates 3 which says that if dictA == dictB is true, then that implies f dictA == f dictB is also true for any function f. In our example that's not the case when f is toList

You might argue I've skipped an obvious approach to sorting toList's output. Just don't sort at all! We'll leave it as an implementation detail of our Dict type. Maybe a hashmap or binary tree?

Unfortunately that still sorts it, just in a way that probably ends up depending on some combination of variant name, variant order, and insertion order.

For example, if you use a hashmap, how will you hash custom types? Maybe hash (variantName + data)? Or perhaps hash (variantIndex + data)? If you don't use any of those, what is left? If you use a binary tree then instead of a hash function you need some kind of comparable function internally but you have the same problem. You need to compare keys based on something. Even if you don't care about performance and your dict is just a list with == used on every existing key to check for duplicates you end up with it depending on insertion order.

It's starting to feel like a game of whack-a-mole isn't it? Every time we try to force one property to be valid, another one breaks. Maybe we can solve this by thinking outside of the box?

Time for some lateral thinking

In our custom type example, what if we could allow the user to define a function for each non-comparable type that tells the dict how to sort it? Maybe we could introduce some new syntax and have it look like this:

type Color
    = Red
    | Green
    | Blue
    compareWith(compareColor)

compareColor : Color -> Color -> Order
compareColor a b =
    Basics.compare (colorToInt a) (colorToInt b)

colorToInt : Color -> Order
colorToInt a =
    case a of
        Red -> 0
        Green -> 1
        Blue -> 2

I'd argue you this doesn't give you all 4 properties. We're instead giving up 1 (non-comparable keys) but with a way to make custom types comparable. Languages like Haskell support this but it comes with its own set of trade-offs (for example, how does this work with anonymous/structural records).

How about a framing challenge? From the start I said our Dict type has the following functions

get : key -> Dict key value -> Maybe value

insert : key -> value -> Dict key value -> Dict key value

fromList : List (key, value) -> Dict key value

toList : Dict key value -> List (key, value)

but toList is causing us lots of trouble. What if we just remove it?

Well we can. But we'd also need to remove any other functions that iterate over the dict's key-value pairs such as merge and foldl. That's quite limiting.

Okay what if we keep toList but change it to this?

toList : (key -> key -> Order) -> Dict key value -> List (key, value)

Now the user has to choose how to sort the key-value pairs, problem solved!

That indeed solves it but it's not ideal. The first reason is that it's inconvenient. The second is that there are some types that can't easily be sorted by the user. For example

import SomePackage exposing (OpaqueType)

dict : Dict OpaqueType Int
dict = fromList [ ... ]

list = toList sortBy dict

sortBy = Debug.todo "How do I sort an opaque type that doesn't expose any data I can use to sort with?"

Maybe this situation is rare enough that this approach is practical? Hard to say.

Conclusion

It wasn't at all obvious to me from the start that I'd encounter so many trade-offs when all I wanted was a Dict type that didn't demand comparable keys. While I independently discovered this, I'm sure either other people have also figured this out, or alternatively, I've made a mistake somewhere and my conclusions are incorrect. I sure hope it's the latter. I really want all 4 of those properties in a dict package...

Oh. And you probably want to know which properties I ended up choosing for the Dict and Set type in lamdera/containers. It was difficult to decide but here's what I settled on:

Obviously I included support for non-comparable keys. This whole package would be pointless without it.
I gave up fromList [ ("X", 0), ("Y", 1) ] == fromList [ ("Y", 1), ("X", 0) ]. This means if you want to check if two Dicts or Sets are equal, you'll need to use an extra function called unorderedEquals. This isn't ideal since you can't use it on a larger data structure that contains a Set.^(?)
I kept 3.
I kept 4.

Both 3 and 4 are preserved for the same reason. They are properties guaranteed by Elm and that ultimately seemed more valuable than 2.

Follow-up questions I've gotten

Is it possible to pick any combination of three properties?

It is! Let's look at the possibilities:

If we pick all but 1 then we have the built-in Dict.
If we pick all but 2 then you get this package.
If we pick all but 3, one example would be a Dict with a toList function that returns keys by insertion order and == ignores insertion order. There no existing packages that support this since Elm guarantees property 3. You'd need kernel code to implement it.
If we pick all but 4, toList sorts keys alphabetically, == ignores order, keys are not required to be comparable, but as a result, renaming fields/variants can affect toList. Like with 3, there's no existing packages that do this since Elm guarantees 4 and you'd need kernel code to work around that.

And of course, you can pick fewer than three properties. Elm is rare in that it guarantees properties 3 and 4 for any user code. Most languages don't and that probably extends to their respective Dict types.

Can this package be used with the normal Elm compiler?

This package is only supported by the Lamdera compiler^(?). In part this is because it contains kernel code and it's unlikely I'd be allowed to publish it to the Elm package ecosystem. But even if I was allowed, it also overrides Debug.toString and == which is only possible due to some changes to the compiler.