Paul Bone

Explaining GADTs

I’ve started working on Plasma's type system, I’m currently implementing basic algebraic data types (ADTs) (sum and product types), which I’m quite comfortable with. I’ve also heard about these things called generalised algebraic data types (GADTs) but until recently I found it difficult to understand what they were and more importantly when are they useful. This article attempts to answer those questions. I assume that the reader is comfortable with ADTs and Haskell syntax.

String formatting

The problem we’re going to apply this to is string formatting.

data Format = Show
            | Quote
            | Pad Char Int

printf :: Show a => Format -> a -> String
printf Show s = show s
printf Quote s = "\"" ++ (escape (show s)) ++ "\""
printf (Pad c n) s = pad c n $ show s

-- I've included these functions for completeness.
pad c n s = let len = length s
                in if len >= n then s
                               else (replicate (n - len) c) ++ s

escape [] = []
escape (x:xs0) = case x of
                '"' -> "\\\"" ++ xs
                _   -> x:xs
    where xs = escape xs0

This Haskell code compiles and works correctly, it’s simple and I’m almost happy with it. The printf function is the main part of this program, it simply uses its first argument to decide how to format its second argument. The first argument is a format type, which allows for three different formatting options, the first just calls show, the second escapes and quotes the string, and the third will pad the string - useful for tabular data. The pad and escape functions are included for completeness. The show function belongs to the Show typeclass, our printf function requires the type variable a to be in this typeclass, this is what Show a => means in printf 's type signature.

The problem

One thing I really like when presenting information is to be able to format numbers with thousands separators, let’s add that.

data Format = Show
            | Quote
            | Pad Char Int
            | ThousandsNumber

printf :: Show a => Format -> a -> String
printf Show s = show s
printf Quote s = "\"" ++ (escape (show s)) ++ "\""
printf (Pad c n) s = pad c n $ show s
printf ThousandsNumber n = thousandsNumber n

-- Included for completeness
thousandsNumber :: Integer -> String
thousandsNumber n =
    if n < 1000 then show n
                else let bigpart = show $ n `div` 1000
                         littlepart = printf (Pad '0' 3) $ show (n `mod` 1000)
                     in bigpart ++ "," ++ littlepart

It didn’t work:

adt.hs:13:44:
    Couldn't match expected type ‘Integer’ with actual type ‘a’
      ‘a’ is a rigid type variable bound by
          the type signature for printf :: Show a => Format -> a -> String
          at adt.hs:9:11
    Relevant bindings include
      n :: a (bound at adt.hs:13:24)
      printf :: Format -> a -> String (bound at adt.hs:10:1)
    In the first argument of ‘thousandsNumber’, namely ‘n’
    In the expression: thousandsNumber n

This shouldn’t be surprising. The new clause for printf expects its parameter to be an Integer, but its signature (and my intentions) require it to be polymorphic (a). This is exactly the error that GHC is reporting.

At this point I’d like to acknowledge that this example is a little contrived. It would be simple at this point to make each Format a separate function that does the formatting, and is either passed to printf as a higher order value or printf is removed and these functions are used directly. However for the sake of explaining GADTs let’s assume that for some reason we need to stick with the design we have. Note that this would be a more realistic example if we can imagine more than one printf function that all format data, but might format it differently, eg: a formatter for HTML that might implement the Pad format by using right alignment rather than inserting spaces.

The ADT solution

To solve this problem without GADTs we can introduce a new type (Value) which allows us to implement the ThousandsNumber case for Integer but not for String:

data Format = Show
            | Quote
            | Pad Char Int
            | ThousandsNumber

data Value = String String
           | Integer Integer

-- Value implements show, which makes it easy to keep most of the code the
-- same as before.
instance Show Value where
    show (String s) = s
    show (Integer i) = show i

printf :: Format -> Value -> String
printf Show v = show v
printf Quote v = "\"" ++ (escape (show v)) ++ "\""
printf (Pad c n) v = pad c n $ show v
printf ThousandsNumber (Integer n) = thousandsNumber n
printf ThousandsNumber (String _) = error "Unhandled case"

I have omitted a small modification required to the thousandsNumber function.

The main thing to notice with this solution, is that a call like

printf ThousandsNumber (String "foo")

will throw an exception. Or the code could be changed to return a Maybe, allowing the error to be caught more easily. Regardless of how errors are raised, they can be raised, and I wouldn’t disagree with anyone who wanted to say that this code isn’t really type safe. This is the main problem that the GADT solution solves.

There are some other problems with this solution:

  • The code is now more complicated.

  • printf is no-longer polymorphic on its second argument.

The latter problem can also be fixed using GADTs.

The GADT solution

We’ll take this a bit at a time, explaining each part and how it works: First enable the GADT language extension:

{-#LANGUAGE GADTs #-}

Here’s the tricky bit, the GADT itself:

Before (ADT)
data Format = Show
            | Quote
            | Pad Char Int
            | ThousandsNumber
After (GADT)
data Format a where
    Raw :: Format String
    Quote :: Format String
    Pad :: Char -> Int -> Format String
    ThousandsNumber :: Format Integer

The very first difference between these data types is that the in the GADT case I’ve added a type variable. This is not part of understanding GADTs but it is part of understanding why I’m using GADTs in this program, this variable represents the type of data that is being formatted, since what we’re trying to do is say that ThousandsNumber is only applicable to Integer.

The syntax is also quite different, which makes sense once we understand what is going on. A normal ADT (left) has the same type (Format) for all of its constructors. A GADT however (right) can associate a different but related type with each of its constructors. Therefore the GADT syntax allows us to give a type to each constructor, each constructor has the form:

Name :: Type

The first two constructors' types are easy enough to understand. They are both Format String. The second type is a function with two parameters. I found this initially confusing, until I realized that this is the type of Pad before it is given its arguments (the character to pad with, and the total width); with suitable arguments it simply has the type Format String. We can see the same argument types in the same constructor on the left, it’s just a different syntax. Finally the fourth constructor has the type Format Integer, it formats integers, that’s what we’ve been working towards.

Now we’re ready to understand what a GADT is. It is like an ADT, except that each constructor may have a different type. The value of a variable now plays a role in deciding the variable’s type, this is a step towards dependent types, but it does not itself, make this dependent typing.

Finally, let’s see the printf function:

printf :: Format a -> a -> String
printf Raw s = s
printf Quote s = "\"" ++ (escape s) ++ "\""
printf (Pad c n) s = pad c n s
printf ThousandsNumber n = thousandsNumber n

Now that the Format type is parametric, we have to adjust the type signature here. We’ve also removed the Show a constraint, and not calling show as we once were. We can do this in each case because the constructor is known, the actual type of the type variable a is also known, and in the first three cases known to be String. This makes each of the first three cases type correct, for example in the case for Raw this means that s must be a String, and s is the RHS of this case, and printf needs to return a string, then this is type correct.

In the ThousandsNumber case, the GADT ensures that n is an Integer, because the type variable a is known to be Integer. This is the same type that the ThousandsNumber function requires, and therefore this case is also type correct.

This is why we made the Format type parametric, it allows us to link the type of each constructor to the type of the second argument to printf.

Improving the GADT solution

One problem with the above solution is that it is no-longer polymorphic when it should be. This was deliberate and made explaining GADTs easier. Now we will show how GADTs can interact with typeclass constraints

data Format a where
    Show :: (Show a) => Format a
    Quote :: (Show a) => Format a
    Pad :: (Show a) => Char -> Int -> Format a
    ThousandsNumber :: Format Integer

printf :: Format a -> a -> String
printf Show s = show s
printf Quote s = "\"" ++ (escape (show s)) ++ "\""
printf (Pad c n) s = pad c n $ show s
printf ThousandsNumber n = thousandsNumber n

In this solution typeclass constraints have been added to the first three data constructors. They are part of the type of each data constructor and will affect printf’s type when it is called with a particular constructor. This means that printf now looks more like the original printf, it contains calls to show.

It is very likely that the type of ThousandsNumber can be relaxed to work with any Integral type, in a similar way to the other constructors.

Are GADTs useful?

Maybe. In this article I tried to use an example that most programmers are somewhat likely to encounter. However as we mentioned above this could be solved in other ways, unless how you formatted something changed with output format. For example to right-align something on a console you would normally add spaces to the left, but in HTML you would use a style on that block. Therefore I think this example is still meaningful, I’ve just not shown that in my code. Another common example used is the one on the GADT Wikipedia page which is a bit more specialised and therefore some developers may find it less motivating.

I’ve never used GADTs "in anger" (in a real project), so I’m not aware of there are any pitfalls that you may be likely to encounter. This is going to be mostly because Mercury, which I tend to use most of the time, does not support GADTs. Also, until recently I did not know exactly what they were.

Finally, I think they could be useful, but they won’t be useful most of the time. As far as I can tell GADTs will rarely be the only solution to a given problem. If your language of choice already supports them, then that’s cool, you’ve got an extra tool in your belt. If not, it’s probably not worth worrying about, you can almost certainly get by with slightly less elegant code. It seems that the law of diminishing returns applies to type system features.

I’m no expert on GADTs, so if you disagree with anything I’ve said here, or can otherwise provide some information about how useful they are in practice, I’d be grateful.