Factorize your language models

A traditional language model is a probability distribution $p_\text{LM}(t)$ predicting how likely text $t$ is to appear in a training corpus approximating all of language. A 2023 language model (GPT-4 [OpenAI 2023], Claude [Bai et al 2022], Alpaca [Taori et al 2023]) is a probability distribution $p_\text{LLM}(t)$ estimating the likelihood that a certain text $t$ satisfies several constraints:

  1. $t$ is grammatically correct according to the norms of one of the languages represented in the corpus
  2. the user reading $t$ will find it useful: full of interesting, truthful and actionable information
  3. $t$ does not contain “unsafe” information: advocacy or instructions for illegal or “dangerous” behavior, prosecuted opinions, etc.

These constraints aren’t separate modules in the model architecture: a single transformer neural network [Vaswani et al 2017] is trained to accomplish all of these goals at once (and to gracefully navigate the contradictions between them). Separate structures for grammar, usefulness and safety, possibly in the form of specialized attention heads [Voita et al 2019], may or may not emerge as a side effect of the training procedure. Even if they do, the distribution of responsibilities between individual components of the transformer is unknown, even to the developers of the model, and notoriously hard to study [Olah 2022]. As a result, any attempt to edit or rebalance the constraints requires costly [u/AxeLond 2020] retraining.

Some last-minute adjustments can be made via prompt addition [Liu et al 2023, section 2], that is, replacing $p_\text{LLM}(t)$ with $$ p^\prime_\text{LLM}(t)=p_\text{LLM}(t_\text{prefix}+t) $$ But not all biases introduced in the training phase can be counteracted this way. If the model is trained to assign $p(t)=0$ to any text containing a “dangerous” word, and next year the word is no longer considered “dangerous”, concatenating tokens to $t$ will not help.
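To make the limitation concrete, here is a minimal sketch of prompt addition over token sequences. `score_llm` is a hypothetical stand-in for whatever returns $\log p_\text{LLM}(t)$ in your setup; the banned-word rule is a made-up example, not a real model’s behavior.

```python
import math

def score_llm(tokens: list[str]) -> float:
    """Hypothetical stand-in for log p_LLM(t); imagine it calls a real model."""
    banned = {"dangerous_word"}
    if any(tok in banned for tok in tokens):
        return -math.inf  # trained to assign p(t) = 0 to such texts
    return -1.0 * len(tokens)  # dummy score, for illustration only

def score_with_prefix(tokens: list[str], prefix: list[str]) -> float:
    """Prompt addition: p'_LLM(t) = p_LLM(t_prefix + t)."""
    return score_llm(prefix + tokens)

# No choice of prefix can rescue a text whose probability is already zero:
print(score_with_prefix(["dangerous_word", "is", "fine", "now"],
                        ["please", "be", "lenient"]))  # -inf
```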

Wouldn’t it be great to be able to change one of the constraints, say switch the usefulness function from rewarding the kind of clarity expected from a legal text to simplicity and entertainment value of a children’s novel, all while keeping grammar and safety constraints constant? We can achieve this flexibility by respecting the software engineering principle of separation of concerns [Martin 2000], namely by splitting the distribution into $N$ factors $$ p_\text{LLM}(t)=\prod_{i=1}^{N} p_i(t) $$ for example $$ p_\text{LLM}(t)=p_\text{grammar}(t)p_\text{useful}(t)p_\text{safe}(t) $$ This way we can customize the model at inference time, by pretraining a repository of factors and then mixing and matching them as required: $p_\text{style}(t)$ to prioritize texts that follow the Elements of Style [Strunk 1918], $p_\text{lawyer}(t)$ to pessimize texts that can get your company in legal trouble, etc. And if it turns out that the desired generative model cannot be assembled out of existing factors, one can train a new (small) one while re-using evergreen factors like $p_\text{grammar}(t)$ and thus vastly reduce computational costs.
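As a rough illustration of the mixing and matching (the `Factor` type and the factor names are invented for this post, not an existing library), an assembled model could score candidate texts under a product of factor distributions, i.e. a sum of log-probabilities:

```python
from typing import Callable, Sequence

# A factor is anything that maps a text to a log-probability (or log-score).
Factor = Callable[[str], float]

def factored_log_prob(text: str, factors: Sequence[Factor]) -> float:
    # log p_LLM(t) = sum_i log p_i(t)  <=>  p_LLM(t) = prod_i p_i(t)
    return sum(f(text) for f in factors)

def rerank(candidates: Sequence[str], factors: Sequence[Factor]) -> str:
    # Pick the candidate text the assembled model likes best.
    return max(candidates, key=lambda t: factored_log_prob(t, factors))

# Swap one factor while keeping the evergreen ones (all names hypothetical):
# legal_assistant = [p_grammar, p_legal_clarity, p_safe]
# bedtime_stories = [p_grammar, p_kids_style, p_safe]
```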

“But wait!” I hear you proclaim. Perhaps the factors don’t need to be retrained as often as a single model, but we just went from 1 LLM (L for large) to $N$, where $N$ might be in the hundreds. Doesn’t that negate the benefits of higher reusability? I don’t think so, because in practice the factors will share a common encoding stack, i.e. $$ p_i(t) = q_i(f_\text{enc}(t)) $$ If the $q_i$ are much smaller than $f_\text{enc}$ in terms of neural network layers, the total parameter footprint of all the factors is similar to that of $f_\text{enc}$ alone. Hence we can expect this approach to use about as many parameters as a monolithic LLM, but to retrain them less often.
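Here is a toy PyTorch sketch of that sharing, with made-up sizes and factor names: one shared encoder standing in for $f_\text{enc}$, several small heads standing in for the $q_i$, and a parameter count showing the heads are cheap next to the encoder.

```python
import torch.nn as nn

# Toy sketch: one shared encoder f_enc, several small factor heads q_i.
# The point is the parameter comparison, not the modeling details.
D_MODEL, N_LAYERS = 512, 12

f_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=N_LAYERS,
)

def make_head() -> nn.Module:
    # q_i: maps the shared encoding to a scalar log-score for the text.
    return nn.Sequential(nn.Linear(D_MODEL, 128), nn.ReLU(), nn.Linear(128, 1))

heads = nn.ModuleDict({name: make_head() for name in ("grammar", "useful", "safe")})

n_enc = sum(p.numel() for p in f_enc.parameters())
n_heads = sum(p.numel() for p in heads.parameters())
print(f"encoder: {n_enc:,} params, all factor heads combined: {n_heads:,} params")
# The heads add comparatively little, so N factors cost roughly one encoder's
# worth of parameters.
```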

So spare your GPU budget and the planet [Luccioni, Viguier, Ligozat 2022] and factorize. Your. Language. Models.

P.S. A reader notes that this approach assumes independence of factor distributions, which doesn’t hold in general. I think that if pairs of factors with obvious correlations (think $p_\text{science}$ and $p_\text{biology}$) are avoided, the factors can be assumed approximately independent and the $p_\text{LLM}$ formula deemed approximately correct. Every $p$ involved in language modeling is an approximation anyway.

P.P.S. If you’re working on a software tool for this and/or an empirical evaluation paper, please reach out and let’s factorize language models together.

P.P.P.S. This work is taxpayer-funded.