Pattern matching records

#programming #elm

A blog post popped up on my feed whose main point is that tagged union types (called custom types in Elm) are overrated for implementing intermediate representations. To be clear, the author is not saying that tagged union types are overrated in general, only that their suitability for intermediate representations within compilers is not that big of a win.

It's easy to see why tagged union types are thought of as perfect for internal representations of code in compilers. The grammar of a language already looks rather like a tagged union type. You might represent the grammar of an Elm expression using Elm's custom types something like the following:

```elm
type Expr
    = IntLiteral Int
    | FloatLiteral Float
    | StringLiteral String
    | Name String
    | FunctionApplication Expr (List Expr)
    | LetExpr (List LetDefs) Expr
    | CaseExpr Expr (List (Pattern, Expr))
    | RecordExpr (List (Name, Expr))
```

This is missing a couple of expressions, such as the record update expression, but you get the idea. It can indeed form the basis of your type for the abstract syntax tree, but typically you will need to store much more information. One problem is that much of the information you wish to store is the same for all kinds of expression. This means you end up with either a record type in which one field is the tagged union type, or you stuff all of the information into each constructor of the tagged union type.

For example, when you detect an error in the program you will want to print out the offending expression, or at least some of it, so you need to store some representation of the actual source code of the expression in question. There are a couple of ways of doing this. You can directly store the string, which ends up using quite a bit of memory because you store the same parts of the string over and over again as you parse down the tree. Alternatively, you can store just the start and end character positions of the expression; for most expressions you will never need the text, and when you do need to print out an error you can recover the original string from those positions. Either way, let's assume you have some type SourceCode that relates to the source code of a particular expression. We can update our abstract syntax tree in one of two ways, first of all with a record type:

```elm
type alias Expr =
    { source : SourceCode
    , tree : ExprTree
    }

type ExprTree
    = IntLiteral Int
    | FloatLiteral Float
    | StringLiteral String
    | Name String
    | FunctionApplication Expr (List Expr)
    | LetExpr (List LetDefs) Expr
    | CaseExpr Expr (List (Pattern, Expr))
    | RecordExpr (List (Name, Expr))
```

Note that these are mutually recursive types: inside the ExprTree type we reference Expr, and inside the Expr type we reference ExprTree. The alternative is to push the source code into each of the constructors:

```elm
type Expr
    = IntLiteral SourceCode Int
    | FloatLiteral SourceCode Float
    | StringLiteral SourceCode String
    | Name SourceCode String
    | FunctionApplication SourceCode Expr (List Expr)
    | LetExpr SourceCode (List LetDefs) Expr
    | CaseExpr SourceCode Expr (List (Pattern, Expr))
    | RecordExpr SourceCode (List (Name, Expr))
```
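
In both versions the SourceCode type itself has been left abstract. A minimal sketch, assuming we go with the character-position approach described above rather than storing the string itself:

```elm
-- A minimal sketch of SourceCode, assuming we store only the start and end
-- character positions of the expression within the original input.
type alias SourceCode =
    { start : Int
    , end : Int
    }
```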

The downside of the latter is that you cannot refer to expr.source. When the meta-data you're interested in is the original source code of the expression, you probably only use it when printing out an error message, so that is probably not much pain. However, a compiler might end up with quite a bit of extra information, such as the names used in an expression, its inferred type, and so on. Each of these bits of meta-data might be used more frequently, but it's still not much pain to call a function to get some meta-data rather than access it via a record field, as sketched below.
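
For instance, here is what such an accessor might look like for the constructor-based representation; the function name sourceOf is my own:

```elm
-- A sketch of retrieving the meta-data with a function rather than a record
-- field access; every constructor carries its SourceCode as its first argument.
sourceOf : Expr -> SourceCode
sourceOf expr =
    case expr of
        IntLiteral source _ ->
            source

        FloatLiteral source _ ->
            source

        StringLiteral source _ ->
            source

        Name source _ ->
            source

        FunctionApplication source _ _ ->
            source

        LetExpr source _ _ ->
            source

        CaseExpr source _ _ ->
            source

        RecordExpr source _ ->
            source
```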

The downside of the former approach is that pattern matching deeper than the first layer is awkward, or, as in Elm, not possible at all. Unfortunately, transformations inside compilers can often make good use of nested patterns. For example, suppose I want to take all (fully applied) uses of Maybe.withDefault and transform them into case expressions (so that the default value is never needlessly calculated). That is, I want to turn:

```elm
Maybe.withDefault (createBlankUser model) mUser
```

into:

```elm
case mUser of
    Nothing ->
        createBlankUser model
    Just user ->
        user
```

Using the latter form I can do the pattern match like so:

```elm
case expr of
    FunctionApplication
        applSource
        (Name _ "Maybe.withDefault")
        [ defaultExpr, maybeExpr ] ->
        CaseExpr
            applSource
            maybeExpr
            [ ( NamePattern "Nothing", defaultExpr )
            , ( TaggedPattern "Just" (NamePattern "expr"), Name "expr" )
            ]

    ... recursively apply this transformation ...
```

I've fudged this a little because we didn't specify an abstract syntax tree type for the patterns, but the ones we are producing should probably have source information, which would presumably have to be some value indicating there is no source because the pattern is the result of an internal transformation. Anyway, the point here is that I can use this kind of nested pattern matching to find a tree of the right 'shape'. The key part of the above code is the nested pattern (FunctionApplication applSource (Name _ "Maybe.withDefault") [ defaultExpr, maybeExpr ]), which is a pretty neat way of saying "Match any application of the name expression 'Maybe.withDefault' to exactly two further expressions".

Now with the former, record type representation we cannot do this, because in Elm we cannot pattern match on the values of record fields. You have to do case expr.tree of, so you get some less nice code:

```elm
case expr.tree of
    FunctionApplication fun [ defaultExpr, maybeExpr ] ->
        case fun.tree of
            Name "Maybe.withDefault" ->
                { source = expr.source
                , tree =
                    CaseExpr
                        maybeExpr
                        [ ( NamePattern "Nothing", defaultExpr )
                        , ( TaggedPattern "Just" (NamePattern "expr"), Name "expr" )
                        ]
                }

    ... recursively apply this transformation ...
```

I quite like this representation. It would perhaps be a little nicer if I could do:

```elm
case expr.tree of
    FunctionApplication { tree = Name "Maybe.withDefault" } [ defaultExpr, maybeExpr ] ->
        { source = expr.source
        , tree =
            CaseExpr
                maybeExpr
                [ ( NamePattern "Nothing", defaultExpr )
                , ( TaggedPattern "Just" (NamePattern "expr"), Name "expr" )
                ]
        }

    ... recursively apply this transformation ...
```

In reality, and I think this is the main point behind the linked blog post, neither of these patterns quite works. You cannot really pattern match in this way, because instead of asking whether the function being applied is literally written as Maybe.withDefault, we want to know whether that is the function being applied. We have to take care of at least two scenarios. The first is that withDefault has been imported unqualified with import Maybe exposing (withDefault); if you do not catch this case nothing bad happens, but it is a common way to import that function. The second scenario is much less common, but missing it could actually mean your optimiser introduces a bug: someone has made an alternative Maybe module that exports a withDefault, and imported it whilst aliasing it to Maybe, a la import Helpers.Maybe as Maybe. Presumably that withDefault does something different, so the optimisation in question would not be appropriate. Unlikely, but you have to handle such a case. A more common, related case: if you allow the unqualified withDefault, then what about someone who has written import Result exposing (withDefault)? A final possibility is import Maybe as M followed by a use of M.withDefault. Note that in many of these cases you basically need to know the current scope, so you simply won't be able to make this test using a pattern.
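
To make that concrete, here is a hedged sketch of the kind of scope-aware check you end up writing instead of a pattern; the Scope type and its field names are entirely hypothetical:

```elm
type alias Scope =
    { exposedFromMaybe : List String -- names exposed from the real Maybe module
    , aliasesForMaybe : List String -- module names currently referring to the real Maybe module
    }


-- Does the given name, as written in the source, refer to Maybe.withDefault
-- in the current scope? A real implementation would also have to consider
-- shadowing and conflicting unqualified imports.
refersToMaybeWithDefault : Scope -> String -> Bool
refersToMaybeWithDefault scope name =
    case String.split "." name of
        [ unqualified ] ->
            unqualified == "withDefault"
                && List.member "withDefault" scope.exposedFromMaybe

        [ moduleName, function ] ->
            function == "withDefault"
                && List.member moduleName scope.aliasesForMaybe

        _ ->
            False
```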

The point is that, in reality, the ability to nest patterns like this is not that useful, because ultimately the patterns you want to match against are just not that simple.

# Polymorphism

One further point in this somewhat rambling post. Sometimes in a compiler you will want to 'add' meta-data to the abstract syntax tree as you complete passes over it. In the non-record representation this is quite elegant; you can do the following:

```elm
type Expr a
    = IntLiteral a Int
    | FloatLiteral a Float
    | StringLiteral a String
    | Name a String
    | FunctionApplication a (Expr a) (List (Expr a))
    | LetExpr a (List LetDefs) (Expr a)
    | CaseExpr a (Expr a) (List (Pattern, Expr a))
    | RecordExpr a (List (Name, Expr a))

type alias ParsedExpr =
    Expr { source : SourceCode }
```

Then you can write a compiler pass that changes the type of the meta-data, for example one that adds to it:

```elm
type alias NamedExpr =
    Expr
        { source : SourceCode
        , names : List Name
        }

usedNames : ParsedExpr -> NamedExpr
usedNames expr =
    case expr of
        IntLiteral meta i ->
            let
                newMeta =
                    { source = meta.source
                    , names = []
                    }
            in
            IntLiteral newMeta i

        ...

        Name meta s ->
            let
                newMeta =
                    { source = meta.source
                    , names = [ s ]
                    }
            in
            Name newMeta s

        ...
```

Or you could write a pass that throws some meta-data away. For example, once you have type checked the program you know you won't need the 'source' any more, so you can now write passes such as the one detailed above that transforms the tree, except that any generated expressions no longer need to be given 'source' code, which is great because such generated expressions obviously have no representation in the original source code.
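
A hedged sketch of what the meta-data might look like after such a pass; the names here are my own, and Type is just a placeholder for whatever the type checker actually produces:

```elm
-- A placeholder for the type checker's output; the real definition would of
-- course be richer than this.
type Type
    = IntType
    | FloatType
    | StringType

-- After type checking we can drop the source and keep only the inferred type,
-- so generated expressions only have to supply an inferred type.
type alias TypedExpr =
    Expr { inferredType : Type }
```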

Using the record-based expression representation, it's a little more awkward to allow for different types of meta-data:

```elm
type alias Expr a =
    { meta : a
    , tree : ExprTree a
    }

type ExprTree a
    = IntLiteral Int
    | FloatLiteral Float
    | StringLiteral String
    | Name String
    | FunctionApplication (Expr a) (List (Expr a))
    | LetExpr (List (LetDefs a)) (Expr a)
    | CaseExpr (Expr a) (List ( Pattern a, Expr a ))
    | RecordExpr (List ( Name, Expr a ))
```
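
To illustrate the awkwardness, here is a hedged sketch of a generic pass over this representation that only changes the meta-data. Because the record and tree types are mutually recursive you need two mutually recursive functions, and the helpers mapLetDefs and mapPattern are assumed to exist for the LetDefs and Pattern types, which were never defined in this post:

```elm
mapMeta : (a -> b) -> Expr a -> Expr b
mapMeta f expr =
    { meta = f expr.meta
    , tree = mapTree f expr.tree
    }

mapTree : (a -> b) -> ExprTree a -> ExprTree b
mapTree f tree =
    case tree of
        IntLiteral i ->
            IntLiteral i

        FloatLiteral x ->
            FloatLiteral x

        StringLiteral s ->
            StringLiteral s

        Name s ->
            Name s

        FunctionApplication fun args ->
            FunctionApplication (mapMeta f fun) (List.map (mapMeta f) args)

        LetExpr defs body ->
            -- mapLetDefs is an assumed helper for the (undefined) LetDefs type
            LetExpr (List.map (mapLetDefs f) defs) (mapMeta f body)

        CaseExpr scrutinee branches ->
            CaseExpr (mapMeta f scrutinee)
                (List.map
                    -- mapPattern is an assumed helper for the (undefined) Pattern type
                    (\( pattern, branch ) -> ( mapPattern f pattern, mapMeta f branch ))
                    branches
                )

        RecordExpr fields ->
            RecordExpr (List.map (Tuple.mapSecond (mapMeta f)) fields)
```

With the constructor-based representation from the previous section the same pass is essentially a single recursive function over Expr, since the meta-data lives in the one type you are already walking.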