Why do functional programmers model things as data?

Functional programmers tend to prefer pure, immutable data to represent their domain. We talk about many of the benefits of such an approach. But we focus on one in particular: that good data representations can reduce complexity by reducing the number of if statements you need.

Transcript

Eric Normand: Why do functional programmers like to represent things as data?

Hi, my name is Eric Normand. These are my thoughts on functional programming. This is one of those thoughts that I'm not quite sure about.

Mostly, I'm not sure about the explanation that I'm able to give. I am sure that from my experience this is true. I feel like I'm making leaps that are not justified when I'm trying to explain it more rationally.

If you have an idea about how I could explain this better or even just nitpick on my explanation so I can get better at it, ask questions, etc. Here it goes. I'm going to try it anyway and we'll see.

One of the biggest problems in software is complexity. We have these really big complex systems and we can't understand them, at least not all at once because they're too complicated. We have to break it up into small pieces or we could try to reduce the complexity.

We've already talked about essential complexity versus accidental complexity. I'm just talking about complexity in general right now. One source of complexity is branches. Every conditional you have in your program means at least two branches and those two branches will multiply the number of code paths that you have in your program.

This is a traditional or classical way to measure complexity. It's called cyclometric complexity. It's where you measure the number of code paths that your program could go through. Again, it's all about not knowing what's the next thing that's going to happen because you have a conditional, could be this branch, could be that branch, you don't know. It's just hard to keep in your head.

Imagine a program with 100 lines but no conditionals. Just do this, do this, do this, do this, do this versus 100 lines of do this, but then IF this is true, then do that, but otherwise, do this. It just starts to get harder to understand just because the branch is through there.

What functional programming does, one way that it tries to reduce complexity, is by eliminating branches. We do that by modeling things as data. This is where the explanation is starting to make some leaps.

The reason we like to model things as data is because data has a well-understood structure to it. It's a limited structure, something like a pure function. Inside it could have branches. It could have a ton of stuff. It's Turing Complete. It could calculate anything.

Whereas if you have a piece of data, there's only certain cases that it can have. For example, an array is going to have an empty case. It's going to have a singular case. It's going to have a plural case if you want to divide it up that way. There's a certain number of cases. The plural case is like two or more. You can see how it's different from the singular case anyway.

You could have an infinite number of things in that array but we tend to think of it in those terms. There's a lot of other structure to it. For instance, the zero index is the first thing. There's no negatives. Then there's a limit however big that array is. There's a maximum index. They're all integers. These are all structural constraints that the data structures that we use have.

It's those constraints that drastically simplify our implementations. If your implementation uses an array, there's only certain stuff you can do with it. This is in addition to all of its other benefits that it's serializable and deserializable, of course.

It's also often on a literal representation. You could include it right in your program. You could store it in a database if you had a database that had a similar structure. You could do that.

This is something else. This is saying it's much more constrained and so it reduces the complexity. Everything you could move from a Turing Complete function into a highly constrained data structure is going to simplify your code. There's another benefit to data which is that it's not opaque. A function is opaque.

When you have a function, sure as a human you can read the code that generated that function, you can understand it to a certain extent. When you are past a function, let's anthropomorphize it, you are a function.

You get past a function as an argument, the only thing you can do to that function is call it. There's no way to understand what it's going to do besides just doing it. That's different from data.

Data typically has an API, a HashMap. You can ask, "What are all your keys? What are all your values?" Give me all your keys and values in a sequence. There's all sorts of stuff that you can do, add stuff to the HashMap, remove stuff.

It has its own API but it's a well-known, well-understood API that is limited but gives you properties that you want to be able to take advantage of. It's a function. The nice thing about it is all you can do is run it. The downside is you don't know what it's going to do until you run it.

That means that you can have a piece of data that is interpreted in different ways. This is sort of the value of a database. You store all these bits of data in there and then you can query them in different ways that you hadn't planned for before. It gives you a huge amount of flexibility. I do want to focus mostly on this complexity argument.

The main reason we do it is because it's less complex. Just thinking of the branches, that is my best explanation that I've managed of why the data representation is less complex. It has fewer branches.

It has known branches, so what you do is you start modeling your domain with data structures. You have to choose what kind of data will best represent this domain idea. What you can do is you can see, you can analyze that my domain needs zero or more items in it.

You think, "Well, that sounds perfect for an array." You have this very close fit between your domain and your data representation or your domain model. The domain and its model in a close fit, there's going to be complexity in your domain.

We talked about that. That's called the central complexity. That essential complexity is exactly mirrored in the array and it has nothing else. There's no accidental complexity because the domain had zero or more cases.

Now you're modeling that with something with zero or more cases. Imagine you had a different scenario where you had one or more items. The array can handle that, but it also has this extra case where it's empty, that it's handling that, too. You have this danger that your data structure is a little bit more complex than your domain.

Your domain model is one case more complex than the domain. You're adding in complexity. You're adding in that accidental complexity. You're going to have to deal with that somehow. You might deal with it with a conditional. When you create that array, you might say, "Hey, is this empty?"

Whatever you're creating as a constructor, as some factory, you're saying, "Is this array empty?" If it is, then throw in an exception or do something to indicate this failed. Can't do it. Empty, one or more. There is no empty case. Or, you could do something else where your constructor requires a first item before it will make the thing, before it will make the array for you.

It will put that first item in for you and ensure that there's at least one. There's two ways to do it but notice they both add a little complexity that you have to add in yourself because they're trying to move the complexity from the data structure which allows the zero case, the empty case.

They're trying to move it away and make sure that that's impossible. You're shifting it around. That becomes the design question where this complexity goes, how to minimize its impact on the rest of the system. That's my explanation. I'll just summarize real briefly. Modeling stuff with data has a lot of benefits but the main benefit is that it reduces complexity.

It does so because our data, the ways we can represent data in our languages, is constrained. Much more constrained than a Turing Complete function for sure. Those constraints are what give us the material, the grain that lets us build the main models that can have the same structure as the domain. At least very close.

The closer you get the fewer workarounds you have to do in your code to avoid those corner cases, the cases where it's a misfit. The more of those you avoid the less accidental complexity you have. You're going to capture as close to just essential complexity as possible.

Please let me know what you think. I love getting emails from my listeners. You can email me at eric@lispcast.com. I'm also on Twitter. You can follow me. I love to tweet out little quotes and stuff from this podcast so you can learn about stuff that way.

Get your fix on Twitter if you want. Love to have discussions on Twitter. I'm @ericnormand. Follow me. Mention me. Let's get talking. Awesome. Thanks so much. See you later.