Modeling Sequences with Quantum States
In the past few months, I've shared a few mathematical ideas that I think are pretty neat: drawing matrices as bipartite graphs, picturing linear maps as tensor network diagrams, and understanding the linear algebraic (or "quantum") versions of probabilities.
These ideas are all related by a project I've been working on with Miles Stoudenmire—a research scientist at the Flatiron Institute—and John Terilla—a mathematician at CUNY and Tunnel. We recently posted a paper on the arXiv: "Modeling sequences with quantum states: a look under the hood," and today I'd like to tell you a little about it.
Before jumping in, let's warm up with a question:
Question: Suppose we have a set of data points drawn from a (possibly unknown) probability distribution $\pi$. Can we use the data points to define a new probability distribution that estimates $\pi$, with the goal of generating new data?
To illustrate this idea, here are a couple of examples.
Bitstrings
Suppose our data points are bitstrings—sequences of 0s and 1s—all of length $16$, for example. There are $2^{16}=65,536$ such strings, but suppose we only have a small fraction of them—a couple thousand, say—that we drew from some probability distribution $\pi$. We may or may not know what $\pi$ is, but either way we'd like to use information about the strings we do have in order to take a best guess at $\pi$. Once we have this best guess, we can generate new bitstrings by sampling from the distribution.
Natural language
Or suppose our data points are meaningful sentences—sequences of English words. There are tons of possibilities, but suppose we only have a small fraction of them—pages from Wikipedia or books from your local library. There is some probability distribution on natural language (e.g. the probability of "cute little dog" is higher than the probability of "tomato rickety blue"), though we don't have access to it. But we might like to use the data we do have to estimate the probability distribution on language in order to generate new, plausible text. That's what language models do.
These examples, I hope, help put today's blog post in context. We're interested in modeling probability distributions given samples from some dataset. There is mathematical theory that motivates the model, and there is a training algorithm to produce the model. What's more, it uses only basic tools from linear algebra, though I'll say more on that later. (We're also interested in predicting how well one of these generative models performs, given only the number of samples used to train it. But more on that later, too.)
Like the examples above, the data points we consider are sequences—strings of symbols from some finite alphabet $A$. This alphabet could be the set of bits $\{0,1\}$ or the set of 26 letters of the English alphabet, or the set of all words from Wikipedia, or whatever you like. Either way, $A$ is just a finite set. And for simplicity, let's assume our sequences all have the same length, $N$. I'll denote the set of all sequences of length $N$ by $A^N.$
Back to the bitstrings
For a concrete example, let's think back to bitstrings so that $A=\{0,1\}$. Now here's a nice fact to know: bitstrings come in two flavors: even and odd. A bitstring has even parity if it has an even number of 1s. It has odd parity if it has an odd number of 1s.
What's more, exactly half of all bitstrings of a given length have even parity, and the other half have odd parity. So consider the distribution that is uniform on even-parity strings: each of the $2^{N-1}$ even-parity bitstrings of length $N$ gets probability $2^{-(N-1)}$, and every odd-parity string gets probability 0. This is a simple probability distribution, so let's keep it in mind.
If you like, imagine that bitstrings are the sentences in our language, and the only "valid" sentences—i.e. the only expressions with meaning—are those with even parity. So in this language 01100 is a valid expression, whereas 10011 is not. This is analogous to how "cute little dog" is meaningful in English (it has a non-zero probability of being spoken), whereas "tomato rickety blue" is not (the probability that you'll say that phrase is, I assume, zero).
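To make this concrete, here's a tiny sketch in Python (my own illustration, not code from the paper) of the even-parity distribution for a small length $N$:

```python
from itertools import product

N = 4  # small enough to print everything; the paper's experiments use N = 16

# All bitstrings of length N, as tuples of 0s and 1s.
bitstrings = list(product([0, 1], repeat=N))

# Uniform on even-parity strings, zero on odd-parity strings.
pi = {s: 2.0 ** -(N - 1) if sum(s) % 2 == 0 else 0.0 for s in bitstrings}

assert abs(sum(pi.values()) - 1.0) < 1e-12  # the probabilities sum to 1
```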
So the goal is two-fold:
- To find a model that can infer this probability distribution if you show it a few examples.
- To understand the theory well enough to predict how well the model will perform based on the number of examples used.
Can we do it?
Yes!
That's what the paper's about. We present the model, the training algorithm, and (a roadmap for) the theoretical analysis for any choice of alphabet $A$, and we work with the even-parity dataset to illustrate the ideas.
We have a theoretical model. We have a deterministic algorithm to train the model. The algorithm uses only simple ideas from linear algebra, so all aspects can be understood clearly and concretely. There's a way to measure how well the model works, and there is an experiment to back it all up.
I won't get into the details today, but I will say a few words about our model—we've already seen it on Math3ma before. In short, we model probability distributions by density operators, a special kind of linear transformation. I'll encourage you to check out those old blog posts—Part 1 and Part 2—for all the details: how to get a density operator from a probability distribution, and why this is a good thing to do. (For the Cliffs-Notes version, check out this Twitter thread.) But the main idea is very simple and worth sharing again, so I'll briefly recap it here.
Modeling Probabilities with Linear Algebra
A probability distribution on a finite set $S$ is simply a function $\pi\colon S\to [0,1]$ whose values add up to one: $\sum_{s\in S}\pi(s) = 1$. If $S$ has $n$ ordered elements $s_1,\ldots,s_n$, then we can think of $\pi$ as the $n$-tuple $(\pi(s_1),\ldots,\pi(s_n))$, whose entries add up to 1.
That $n$-tuple, you can imagine, defines a point in $\mathbb{R}^n$, and that point defines a vector that starts at the origin and terminates at it. We like unit vectors, though (this is a discussion about probability, after all, and we like when things add up to 1!), so let's make one small adjustment: take the square root of each coordinate. The squares of the new coordinates sum to 1, so the result is a unit vector. Call it $\psi_\pi$; it is our model for $\pi$.
We then proceed to define an operator $\rho_\pi$ on $\mathbb{R}^n$ by orthogonal projection onto the line spanned by $\psi_\pi$. This operator, it turns out, is a density operator and the diagonal entries of its matrix representation are the original probabilities $\pi$. For this reason, density operators are the linear algebraic analogues of probability distributions.
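In case it helps to see this in code, here's a minimal numpy sketch (my own, not from the paper) that builds $\psi_\pi$ and $\rho_\pi$ for a small distribution and checks that the diagonal of $\rho_\pi$ recovers $\pi$:

```python
import numpy as np

# A probability distribution on a 4-element set, written as a vector.
pi = np.array([0.1, 0.2, 0.3, 0.4])

# The model vector psi: take the square root of each coordinate.
psi = np.sqrt(pi)
assert np.isclose(np.linalg.norm(psi), 1.0)  # its squared entries sum to 1

# The density operator rho: orthogonal projection onto the line spanned by psi.
rho = np.outer(psi, psi)

# The diagonal of rho recovers the original probabilities.
assert np.allclose(np.diag(rho), pi)
```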
So here's the upshot. By passing from the function $\pi\colon S\to [0,1]$ to the density operator $\rho_\pi$ on $\mathbb{R}^S$, we've passed from the world of classical probability to the world of linear algebra. And the fact that the entries of $\psi_\pi$ are square roots of probabilities means we are in the land of "2-norm probability." This usually goes by the name of quantum probability—and density operators are called quantum states—but it's really just another way to think about probability. It's just linear algebra! So while our paper is entitled "Modeling Sequences with Quantum States," you don't have to be a quantum physicist to understand it. It could just as well have been titled "Modeling Probabilities with Linear Algebra."
Speaking of the paper, let's get back to it. I'll close with a few words about how these ideas help us reach the two goals listed above.
Modeling Sequences with Quantum States
In Sections 2 and 3 we detail this idea of using an empirical probability distribution $\pi$ on a set (a "training set") of sequences $S\subset A^N$ to obtain a vector $\psi$ (and ultimately the density operator $\rho_\pi$). This vector is essentially just the sum of the training samples weighted by the square roots of their probabilities, and it lives in $\mathbb{R}^{A^N}$ which, crucially, is isomorphic to the $N$-fold tensor product $\mathbb{R}^A \otimes\cdots\otimes\mathbb{R}^A$. That means we can depict $\psi$ using tensor network diagram notation as below. Section 4 has lots of pictures like this, where we describe a training algorithm that takes us from $\psi$ to a new vector called $\psi_{\text{MPS}}$.
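For the curious, here's a small numpy sketch (again my own illustration, with a toy two-bit training set, not the paper's code) of building $\psi$ as the weighted sum of training samples inside the tensor product:

```python
import numpy as np
from collections import Counter

# A toy "training set" of even-parity bitstrings (duplicates allowed).
samples = [(0, 0), (1, 1), (1, 1), (0, 0)]
N = 2   # sequence length
d = 2   # alphabet size |A|

# Empirical distribution pi on the training set.
counts = Counter(samples)
pi = {s: c / len(samples) for s, c in counts.items()}

# Standard basis vectors e_0, e_1 of R^A.
e = np.eye(d)

# psi = sum over samples s of sqrt(pi(s)) * (e_{s_1} (x) ... (x) e_{s_N}).
psi = np.zeros(d ** N)
for s, p in pi.items():
    vec = np.array([1.0])
    for symbol in s:
        vec = np.kron(vec, e[symbol])
    psi += np.sqrt(p) * vec

assert np.isclose(np.linalg.norm(psi), 1.0)  # psi is a unit vector
```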
Vectors like the one shown on the right are called matrix product states (MPS). MPS are nice because far fewer parameters are needed to describe them than to describe a generic vector in a tensor product. This is especially useful when you're working in ultra-large-dimensional spaces. So MPS are, in a word, efficient. And here's the punchline: the classical probability distribution defined by (the squares of the entries of) $\psi_\text{MPS}$ is very close to the true probability distribution that we wanted to infer from the dataset!
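To put a rough number on "far fewer parameters": a generic vector in $(\mathbb{R}^A)^{\otimes N}$ requires $|A|^N$ entries, while an MPS whose internal (bond) dimensions are at most $r$ is specified by roughly $N|A|r^2$ numbers. For instance, with $|A|=2$, $N=16$, and a small bond dimension like $r=2$, that's $65{,}536$ entries versus on the order of a hundred parameters.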
The training algorithm to produce $\psi_{\text{MPS}}$ consists of successive applications of a singular value decomposition (SVD) on reduced density operators. The reason we do this—and the reason why an MPS is such a natural choice of model—is spelled out in the story I told in those old articles on quantum probability theory. The moral of that story was "the eigenvectors of reduced density operators contain conditional probabilistic information about your data." I won't explain this idea here—go read the old articles! or the Twitter thread!—but just know that we capitalize on it in the algorithm.
Each tensor (blue node above) in our MPS model is an array of eigenvectors of some reduced density operator obtained from $\psi$. As a result, our model knows what "words" go together to form "meaningful sentences" just based on the statistics of the training data. And since the algorithm is deterministic and it's all just simple linear algebra, we can predict when the model will break down, and we know exactly how and why it happens.
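If you'd like to see the connection between SVD and reduced density operators spelled out in code, here's a short numpy sketch (my own illustration): reshape $\psi$ into a matrix $M$ by splitting the sites into two groups; the reduced density operator on the first group is then $MM^\top$, whose nonzero eigenvalues are the squared singular values of $M$ and whose corresponding eigenvectors are $M$'s left singular vectors.

```python
import numpy as np

d, N, k = 2, 4, 2  # split the N sites into the first k and the remaining N - k

# Any unit vector in the N-fold tensor product of R^d will do for this illustration.
psi = np.random.randn(d ** N)
psi /= np.linalg.norm(psi)

# Reshape psi into a matrix M: rows indexed by the first k sites,
# columns indexed by the remaining N - k sites.
M = psi.reshape(d ** k, d ** (N - k))

# The reduced density operator on the first k sites.
rho_reduced = M @ M.T

# Its nonzero eigenvalues are the squared singular values of M,
# and the corresponding eigenvectors are M's left singular vectors.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
evals = np.linalg.eigvalsh(rho_reduced)

assert np.allclose(np.sort(evals)[::-1][: S.size], S ** 2)
```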
By the way, you might recall that every matrix has a singular value decomposition. We're just using a higher-order analogue of this: any vector in a tensor product of spaces has an MPS factorization. In fact, it's more than an analogy—it's a corollary! The way to factor a vector—like the blue blob above—into an MPS is by successive applications of SVD. And that's the heart of the algorithm.
Now, however, you might be thinking, "Wait, you can't apply SVD to a vector. What does it mean to 'factor a vector'?" But the idea is not foreign. Any vector in a tensor product of spaces (like the blue blob above) can be reshaped into a matrix—if you're using tensor diagram notation, "just bend the wires around"—and you can apply SVD to that matrix. I've written about this idea before in The Tensor Product, Demystified. We also review it and much more in Section 2, which contains lots of mathematical background.
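Here's a short numpy sketch of that reshape-and-SVD idea, carried out repeatedly to produce an MPS. (This is the textbook successive-SVD factorization, offered only to illustrate what "factoring a vector" means; the paper's training algorithm in Section 4 works with reduced density operators, as described above.)

```python
import numpy as np

def mps_factorization(psi, d, N, tol=1e-12):
    """Factor a vector psi in the N-fold tensor product of R^d into a list
    of MPS tensors by repeated reshape-and-SVD, keeping singular values > tol."""
    cores = []
    rank = 1
    remainder = psi.reshape(rank, -1)
    for _ in range(N - 1):
        # "Bend the wires": group the (bond, physical) indices into rows.
        mat = remainder.reshape(rank * d, -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        keep = S > tol
        U, S, Vt = U[:, keep], S[keep], Vt[keep, :]
        cores.append(U.reshape(rank, d, -1))  # one MPS tensor
        rank = S.size
        remainder = np.diag(S) @ Vt
    cores.append(remainder.reshape(rank, d, 1))  # the last tensor
    return cores

# Example: factor a random unit vector with d = 2, N = 4.
d, N = 2, 4
psi = np.random.randn(d ** N)
psi /= np.linalg.norm(psi)
cores = mps_factorization(psi, d, N)

# Contracting the cores back together recovers psi.
recon = cores[0]
for core in cores[1:]:
    recon = np.tensordot(recon, core, axes=([-1], [0]))
assert np.allclose(recon.reshape(-1), psi)
```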
So the heart of the algorithm is really very simple, and yet it goes by a fancy name in the physics literature: the density matrix renormalization group (DMRG) procedure, introduced in 1992 by physicist Steven White. DMRG is usually used to find a particular eigenvector of a linear operator (the Hamiltonian of a quantum many-body system), but we instead apply it to the vector $\psi$, the sum of our data points.
What's nice is that results are easily interpretable since it's all just linear algebra. So in Section 5, we're able to give an "under the hood" analysis of the algorithm, which then allows us to make predictions about the model's ability to generalize. There's a bit of combinatorics involved, and we end up representing matrices as bipartite graphs, as on the right, for bookkeeping purposes. This is what inspired the article Viewing Matrices and Probability as Graphs from earlier this year.
In the end, we can predict how well the model will perform just knowing the fraction of samples used to train it. We do that in Section 6, which contains experimental results using the iTensor library on the even-parity dataset of bitstrings of length 16.
There's lots more to say, but I'll stop here. I may share more about the mathematics in the future. In the meantime, feel free to dive in!
Update Jan. 6, 2020: I gave a talk at the MIT category theory seminar on December 5, 2019 to explain some of the ideas here. Here's the video: