One of the esoteric programming languages that a fair number of people have heard of, is “Whitespace”. I'm sure this has nothing at all to do with jokes like its source code being so fantastically efficient when being printed out. The language was actually meant as a joke (at least that's how one of its creators puts it when he mentions the language in some of his talks). But even though people are aware of the language's existence, they rarely know how it works. Let's change that, because it's really not all that hard.
At its core, the language describes operations on a stack machine, that has
access to a heap. The only data type of the machine are integer values of
arbitrary size. There are operations for the machine, that use these values to
encode characters as well. And that's about it. The rest is just a peculiar way
to encode operations and values using three ASCII whitespace characters (TAB
,
SPACE
and LINEFEED
).
Operations are grouped by function. These groups determine an operation's encoding prefix, that the language spec calls “Instruction Modification Parameter”. Many of the operations have no arguments, as they create and consume arguments on the machine's stack. Some however do take an argument: Integers and Labels. Label arguments are used by flow control operations; and integer arguments are used by some of the stack manipulation operations.
Their encoding is similar: Both use strings of spaces and tabs, that are terminated by linefeeds. In labels, the spaces and tabs have no special semantics. At least not within the language specification; more on that later. In integers, tabs encode ones and spaces encode zeroes. Something to note about such integer literals is that they do not use two's complement to encode negative numbers. Instead, they use the literal's leftmost bit as a signedness bit: Tab means negative number, space means positive number. That makes encoding arbitrarily wide integers straight-forward.
When you take a look at actual whitespace programs, you'll sometimes notice extremely long labels. Oftentimes with that, there seems to be a silent convention to use chunks of eight characters to encode eight bits (same semantics as in number literals as to what characters encode ones and zeroes) that are turned into a positive number, which is then mapped to the ASCII encoding (seven would have sufficed, but that's not what the programs I've seen use).
When you try and implement this language, you'll notice a couple of things your
machine implementation needs: A stack obviously, since whitespace is a
stack-manipulating language. Another stack, used as a callstack, since the
language has Call
and Return
operations. It also needs a heap, mapping
addresses to integers. Finally you'll need program memory and a program counter
register. You might want a jump-table too, to deal with translating labels to
addresses. That's not strictly required, though: You could just translate all
labels to addresses before loading the program into your machine.
When I digged deep enough into the language spec to figure this out, I was
intrigued enough to actually do yet another implementation of the language.
It's called SpaceMan
and it is available at
gitlab.com/ft/spaceman as well as
github.com/ft/spaceman.
I've added an org-mode conversion of the original language
homepage,
because that one is currently only available via archive.org. When trying some
of the more complex examples you can find on the net, I was running into
problems. My implementation failed to even parse them. I was verifying my code
for quite some time, until I concluded that it was implementing the parser
correctly. So I looked at other implementations. And it turned out most of them
implemented two additional stack-manipulating operations: Copy
and Slide
.
Apparently, they were added to a later specification of the language. I
couldn't find such a spec on the net, though (not that I invested a lot of time
— see the update at the end of the post for a resolution to this). However,
after implementing these two, spaceman could run the most elaborate examples
that I could find online, like a sudoku solver. I've added those two additional
operations to the included version of the language
spec.
I'm using Megaparsec for parsing purposes. And with a couple of utilities put in place, writing the parser becomes rather pleasant:
stackParser :: Parser StackOperation
stackParser = do
( try $ imp [ space ])
( try $ Push <$> number [ space ])
<|> (try $ operation [ linefeed, space ] Duplicate)
<|> (try $ operation [ linefeed, tabular ] Swap)
<|> (try $ operation [ linefeed, linefeed ] Drop)
<|> (try $ Copy <$> number [ tabular, space ])
<|> (try $ Slide <$> number [ tabular, linefeed ])
When implementing the language's operations, you'll find that you're facing lots of common instructions that manipulate the virtual machine. You put those common tasks into functions, of course, and like any designer of an assembly language worth their salt, you obviously give your instructions slightly cryptic three letter names. With those, implementing the stack-manipulating operations looks like this:
eval :: WhitespaceMachine -> StackOperation -> IO WhitespaceMachine
eval m (Push n) = return $ pci $ psh [n] m
eval m Duplicate = return $ pci $ psh h m where h = peek 1 m
eval m Swap = return $ pci $ psh [b,a] $ drp 2 m where [a,b] = peek 2 m
eval m Drop = return $ pci $ drp 1 m
eval m (Copy i) = return $ pci $ psh [n] m where n = ref i m
eval m (Slide n) = return $ pci $ psh h $ drp (n+1) m where h = peek 1 m
Implementing the other groups of operations looks similar. I sort of like it. Each of them would basically fit onto an overhead slide.
As it turns out, editing whitespace programs is tough work. Doing it directly is best done in an hex-editor. But spaceman has a feature, that makes it dump the programs syntax-tree to stdout. And those program dumps are actually executable programs. So if you'd like to edit a whitespace program, you can dump it into a file, edit its AST and then run that program to yield the changed whitespace program.
“Yet another whitespace implemention created” achievement unlocked, I guess.
[Update] Andrew Archibald informs me, that if you spend a little more time
looking for the language specification, that includes Slide
and Copy
, you
will find http://compsoc.dur.ac.uk/whitespace/index.php (via
archive.org),
that contains v0.3 of the whitespace specification, that adds these operations.