Whitespace, the Language

One of the esoteric programming languages that a fair number of people have heard of, is “Whitespace”. I'm sure this has nothing at all to do with jokes like its source code being so fantastically efficient when being printed out. The language was actually meant as a joke (at least that's how one of its creators puts it when he mentions the language in some of his talks). But even though people are aware of the language's existence, they rarely know how it works. Let's change that, because it's really not all that hard.

At its core, the language describes operations on a stack machine, that has access to a heap. The only data type of the machine are integer values of arbitrary size. There are operations for the machine, that use these values to encode characters as well. And that's about it. The rest is just a peculiar way to encode operations and values using three ASCII whitespace characters (TAB, SPACE and LINEFEED).

Operations are grouped by function. These groups determine an operation's encoding prefix, that the language spec calls “Instruction Modification Parameter”. Many of the operations have no arguments, as they create and consume arguments on the machine's stack. Some however do take an argument: Integers and Labels. Label arguments are used by flow control operations; and integer arguments are used by some of the stack manipulation operations.

Arithmetic Operation: Integer Division

Their encoding is similar: Both use strings of spaces and tabs, that are terminated by linefeeds. In labels, the spaces and tabs have no special semantics. At least not within the language specification; more on that later. In integers, tabs encode ones and spaces encode zeroes. Something to note about such integer literals is that they do not use two's complement to encode negative numbers. Instead, they use the literal's leftmost bit as a signedness bit: Tab means negative number, space means positive number. That makes encoding arbitrarily wide integers straight-forward.

Integer Literals

When you take a look at actual whitespace programs, you'll sometimes notice extremely long labels. Oftentimes with that, there seems to be a silent convention to use chunks of eight characters to encode eight bits (same semantics as in number literals as to what characters encode ones and zeroes) that are turned into a positive number, which is then mapped to the ASCII encoding (seven would have sufficed, but that's not what the programs I've seen use).

When you try and implement this language, you'll notice a couple of things your machine implementation needs: A stack obviously, since whitespace is a stack-manipulating language. Another stack, used as a callstack, since the language has Call and Return operations. It also needs a heap, mapping addresses to integers. Finally you'll need program memory and a program counter register. You might want a jump-table too, to deal with translating labels to addresses. That's not strictly required, though: You could just translate all labels to addresses before loading the program into your machine.

When I digged deep enough into the language spec to figure this out, I was intrigued enough to actually do yet another implementation of the language. It's called SpaceMan and it is available at gitlab.com/ft/spaceman as well as github.com/ft/spaceman.

I've added an org-mode conversion of the original language homepage, because that one is currently only available via archive.org. When trying some of the more complex examples you can find on the net, I was running into problems. My implementation failed to even parse them. I was verifying my code for quite some time, until I concluded that it was implementing the parser correctly. So I looked at other implementations. And it turned out most of them implemented two additional stack-manipulating operations: Copy and Slide. Apparently, they were added to a later specification of the language. I couldn't find such a spec on the net, though (not that I invested a lot of time — see the update at the end of the post for a resolution to this). However, after implementing these two, spaceman could run the most elaborate examples that I could find online, like a sudoku solver. I've added those two additional operations to the included version of the language spec.

I'm using Megaparsec for parsing purposes. And with a couple of utilities put in place, writing the parser becomes rather pleasant:

stackParser :: Parser StackOperation
stackParser = do
  (      try $ imp [ space ])
  (      try $ Push  <$> number [ space ])
    <|> (try $ operation        [ linefeed, space    ] Duplicate)
    <|> (try $ operation        [ linefeed, tabular  ] Swap)
    <|> (try $ operation        [ linefeed, linefeed ] Drop)
    <|> (try $ Copy  <$> number [ tabular,  space    ])
    <|> (try $ Slide <$> number [ tabular,  linefeed ])

When implementing the language's operations, you'll find that you're facing lots of common instructions that manipulate the virtual machine. You put those common tasks into functions, of course, and like any designer of an assembly language worth their salt, you obviously give your instructions slightly cryptic three letter names. With those, implementing the stack-manipulating operations looks like this:

eval :: WhitespaceMachine -> StackOperation -> IO WhitespaceMachine
eval m (Push n)  = return $ pci $ psh [n] m
eval m Duplicate = return $ pci $ psh h m               where h     = peek 1 m
eval m Swap      = return $ pci $ psh [b,a] $ drp 2 m   where [a,b] = peek 2 m
eval m Drop      = return $ pci $ drp 1 m
eval m (Copy i)  = return $ pci $ psh [n] m             where n = ref  i m
eval m (Slide n) = return $ pci $ psh h $ drp (n+1) m   where h = peek 1 m

Implementing the other groups of operations looks similar. I sort of like it. Each of them would basically fit onto an overhead slide.

As it turns out, editing whitespace programs is tough work. Doing it directly is best done in an hex-editor. But spaceman has a feature, that makes it dump the programs syntax-tree to stdout. And those program dumps are actually executable programs. So if you'd like to edit a whitespace program, you can dump it into a file, edit its AST and then run that program to yield the changed whitespace program.

“Yet another whitespace implemention created” achievement unlocked, I guess.

[Update] Andrew Archibald informs me, that if you spend a little more time looking for the language specification, that includes Slide and Copy, you will find http://compsoc.dur.ac.uk/whitespace/index.php (via archive.org), that contains v0.3 of the whitespace specification, that adds these operations.