The unreasonable effectiveness of language models for source code

This post is written in whatever the intersection of “short research paper” genre and “frustrated rant” genre is. Caveat emptor.

The history of computational linguistics, tl;dr version

If we start the timeline of computational linguistics in 1950, when Alan Turing introduced his famous test for conversational agents, its first 40 years were marked by linguistics experts using their deep knowledge of natural languages to write rule-based algorithms for things like machine translation [Forcada et al 2011]. But the rise of sequence-to-sequence neural networks, which could be trained on giant corpora of text and required no linguistic knowledge from the humans involved, precipitated the slow decline of classical computational linguistics into its current state, succinctly captured by Fred Jelinek’s famous quip that every time he fired a linguist, the performance of the speech recognizer went up. It’s not exactly clear when the turning point was; perhaps it was BERT [Devlin et al 2018], though this direction was already obvious to Jelinek two decades earlier. And then OpenAI threw a few zettaFLOPS [Appendix D, Brown et al 2020] at the problem, giving us a model that writes books and powers things like AI Dungeon, a video game that writes itself as you play it.

The history of automatic programming, tl;dr version

The history of automatic programming is, likewise, full of fascinating ideas like deductive programming and algorithms operating on abstract syntax trees. And then OpenAI put all of that aside and threw a few zettaFLOPS at the problem [Chen et al 2021]. Worse yet, they had the indecency to do it while I was in the middle of my PhD, busy making various clumsy attempts to attack the problem of automatic programming that now look quite useless.

The part where I throw away a year’s worth of work

So at this point it’s probably blindingly obvious where this is going. Why use tree neural networks, probabilistic grammars, genetic programming, or deductive programming if you can just do the following (a rough code sketch follows the list):

  1. Train an autoregressive neural network with 12 billion parameters [Chen et al 2021] on all of GitHub.
  2. Write a “header” for the source file with a comment describing in plain English what the program should do.
  3. Autocomplete the header with the autoregressive model.
  4. Since the model is probabilistic, if the output program doesn’t pass the tests, just sample again.
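
To make the recipe concrete, here is a minimal sketch of steps 2–4. Every name here is an assumption of mine, not something prescribed by Codex or the benchmark: `sample_completion` stands in for a call to a code model (any completions API sampled at non-zero temperature will do), the candidate program is assumed to define a function called `solve`, and the test cases are plain input/output pairs.

```python
from typing import List, Optional, Tuple

def sample_completion(prompt: str) -> str:
    """Stand-in for a call to an autoregressive code model (e.g. Codex).

    In practice this would hit a completions API with a non-zero temperature,
    so every call can return a different continuation of the prompt.
    """
    raise NotImplementedError("plug in your model API of choice here")

def passes_tests(program_src: str, tests: List[Tuple[str, str]]) -> bool:
    """Execute the candidate and check its `solve` function on each I/O pair."""
    scope: dict = {}
    try:
        exec(program_src, scope)  # the candidate is expected to define solve()
        return all(str(scope["solve"](inp)) == out for inp, out in tests)
    except Exception:
        return False  # crashes and wrong answers count the same way

def synthesize(task_description: str,
               tests: List[Tuple[str, str]],
               budget: int = 1000) -> Optional[str]:
    # Step 2: a "header" comment describing the task in plain English.
    prompt = f'"""{task_description}"""\ndef solve(x):\n'
    for _ in range(budget):
        # Step 3: let the model autocomplete the header into a full program.
        candidate = prompt + sample_completion(prompt)
        # Step 4: the model is probabilistic, so if the candidate fails the
        # tests, simply sample again until the budget runs out.
        if passes_tests(candidate, tests):
            return candidate
    return None
```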

Here’s a quick comparison I ran, pitting Codex against PushGP [Spector and Robinson 2002], a state-of-the-art genetic programming system, on Program Synthesis Benchmark 2 [Helmuth and Kelly 2021]. The comparison covers 20 of the benchmark’s college-level programming tasks. In addition to a text description of each task, the benchmark provides a collection of test cases in I/O format. For example:

Task: 
Given an integer x, return "Fizz" if x is divisible by 3, "Buzz" if x is divisible by 5, "FizzBuzz" if x is divisible by 3 and 5, and a string version of x if none of the above hold.
Test input: 1
Test output: 1
Test input: 20
Test output: Buzz
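
For reference, this is what a passing program for the task above looks like, written directly from the description; the function name `solve` is just a convention I picked, not something the benchmark prescribes.

```python
def solve(x: int) -> str:
    # Check divisibility by both 3 and 5 first, otherwise the
    # "FizzBuzz" branch would never be reached.
    if x % 3 == 0 and x % 5 == 0:
        return "FizzBuzz"
    if x % 3 == 0:
        return "Fizz"
    if x % 5 == 0:
        return "Buzz"
    return str(x)

# The two I/O pairs quoted above:
assert solve(1) == "1"
assert solve(20) == "Buzz"
```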

In the table below, success means that, out of the programs generated by PushGP or Codex, at least one passed 100% of the test cases. EPG stands for excess programs generated: the number of programs generated during a single experiment that did not pass the test cases, i.e. how many program synthesis retries were needed. The PushGP experiments were run by Helmuth and Kelly; unfortunately, they don’t report EPG for every experiment, but their setup yields a lower bound: each task is attempted in 100 independent runs, and every failed run must have generated its full budget of 300 thousand programs. The Codex experiments were run by me.

| Task | PushGP success | Codex success | Codex EPG | PushGP EPG |
| --- | --- | --- | --- | --- |
| basement | + | + | 1 | >2.97×10^7 |
| bouncing-balls | - | - | 1000 | 3×10^7 |
| bowling | - | - | 1000 | 3×10^7 |
| camel-case | + | - | 1000 | >2.97×10^7 |
| coin-sums | + | - | 1000 | >2.94×10^7 |
| cut-vector | - | - | 1000 | 3×10^7 |
| dice-game | - | - | 1000 | 3×10^7 |
| find-pair | + | + | 51 | >2.88×10^7 |
| fizz-buzz | + | + | 0 | >2.25×10^7 |
| fuel-cost | + | + | 0 | >1.5×10^7 |
| gcd | + | - | 1000 | >2.76×10^7 |
| indices-of-substring | - | + | 0 | 3×10^7 |
| leaders | - | - | 1000 | 3×10^7 |
| luhn | - | - | 1000 | 3×10^7 |
| mastermind | - | - | 1000 | 3×10^7 |
| middle-character | + | + | 6 | >1.29×10^7 |
| shopping-list | - | - | 125 | 3×10^7 |
| square-digits | - | + | 3 | 3×10^7 |
| twitter | + | + | 0 | >2.94×10^7 |
| vector-distance | - | - | 1000 | 3×10^7 |
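
To make the PushGP EPG column concrete, the lower bounds are just the number of failed runs times the per-run budget. A tiny sketch of that arithmetic, using the 100-run, 300,000-program setup reported by Helmuth and Kelly:

```python
RUNS_PER_TASK = 100                 # independent PushGP runs per task
PROGRAMS_PER_FAILED_RUN = 300_000   # a failed run exhausts its full budget

def pushgp_epg_lower_bound(failed_runs: int) -> int:
    """Every failed run contributes at least its whole program budget."""
    return failed_runs * PROGRAMS_PER_FAILED_RUN

# Example: the ">2.25×10^7" entry for fizz-buzz corresponds to
# 75 failed runs out of the 100.
print(pushgp_epg_lower_bound(75))   # 22500000
```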

We can see that Codex was able to produce a 100% working program for 8 out of the 20 tasks, while PushGP solved 9. However, PushGP had to generate tens of millions of programs per task to get there, while with Codex the very first sampled program was often correct.

And this is just the beginning. The right way to tackle this problem is probably to generate an initial solution with Codex and then move to a debugging phase, in which a specialized module (probably also a neural network) looks at the discrepancies between the program’s actual outputs and the expected outputs and fixes the program accordingly. But even this naive “first draft” of a Codex-based program synthesis engine already matches a complex genetic programming engine of the pre-GPT generation on solved tasks, while generating orders of magnitude fewer candidate programs.
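
For what it’s worth, here is a hypothetical sketch of that two-phase loop, under the same assumptions as the earlier snippet: `sample_completion` and `propose_fix` are stand-ins for model calls (the latter being the speculative debugging module), and the candidate program is assumed to define `solve`.

```python
from typing import List, Optional, Tuple

def sample_completion(prompt: str) -> str:
    """Stand-in for the draft-generating model (e.g. Codex)."""
    raise NotImplementedError

def propose_fix(program_src: str, report: str) -> str:
    """Stand-in for the hypothetical debugging module: given the current
    program and a textual report of expected-vs-actual discrepancies,
    return a revised program."""
    raise NotImplementedError

def failing_cases(program_src: str, tests: List[Tuple[str, str]]):
    """Return (input, expected, actual) triples for the tests the program fails."""
    scope: dict = {}
    failures = []
    try:
        exec(program_src, scope)
        for inp, expected in tests:
            actual = str(scope["solve"](inp))
            if actual != expected:
                failures.append((inp, expected, actual))
    except Exception as e:
        failures.append(("<crash>", "", repr(e)))
    return failures

def synthesize_and_debug(task: str,
                         tests: List[Tuple[str, str]],
                         max_rounds: int = 5) -> Optional[str]:
    # Phase 1: first draft from the plain-English description alone.
    prompt = f'"""{task}"""\ndef solve(x):\n'
    program = prompt + sample_completion(prompt)
    for _ in range(max_rounds):
        failures = failing_cases(program, tests)
        if not failures:
            return program  # every test case passes
        # Phase 2: show the debugger where expected and actual outputs diverge.
        report = "\n".join(f"input={i} expected={e} got={a}"
                           for i, e, a in failures)
        program = propose_fix(program, report)
    return None
```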