This is an adaptation of a talk I gave at Microsoft Research in November 2018.
I exposit the sampling techniques used in my recommendation systems work and its follow-ups in dequantized machine learning:
- Tang – A quantum-inspired algorithm for recommendation systems;
- Tang – Quantum-inspired classical algorithms for principal component analysis and supervised clustering;
- Gilyén, Lloyd, Tang – Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension;
- Chia, Lin, Wang – Quantum-inspired sublinear classical algorithms for solving low-rank linear systems.
The core ideas used are super simple. The goal of this blog post is to break down these ideas into intuition relevant for quantum researchers and create more understanding of this machine learning paradigm.
Notation is defined in the Glossary.
The intended audience is researchers comfortable with probability and linear algebra (SVD, in particular). Basic quantum knowledge helps with intuition, but is not essential: everything from The model onward is purely classical. The appendix is optional and explains the dequantized techniques in more detail.
An introduction to dequantization
The best, most sought-after quantum algorithms are those that take in raw, classical input and give some classical output. For example, Shor’s algorithm for factoring takes this form. These classical-to-classical algorithms (a term I invented for this post) have the best chance to be efficiently implemented in practice: all you need is a scalable quantum computer. (It’s just that easy!)
Nevertheless, many quantum algorithms aren’t so nice. Most well-known QML algorithms convert input quantum states to a desired output state or value. Thus, they do not provide a routine to get necessary copies of these input states (a state preparation routine) and a strategy to extract information from an output state. Both are essential to making the algorithm useful.
An example of an algorithm that is not classical-to-classical is the swap test. If we have many copies of the quantum states $|a\rangle$ and $|b\rangle$, then the swap test estimates their inner product $\langle a|b\rangle$ in time polylogarithmic in dimension. While this routine seems much faster than naively computing $\sum_{i=1}^n \bar{a}_i b_i$ classically, we can only run this algorithm if we know how to prepare the states $|a\rangle$ and $|b\rangle$. It may well be the case that state preparation is too expensive for the input vectors, making the quantum algorithm as slow as the classical algorithm. This illustrates the format and failings of most QML algorithms.
You might then ask: can we fill in the missing routines in QML algorithms to get a classical-to-classical algorithm that’s provably fast and useful? This is an open research problem: see Scott Aaronson’s piece on QML [1]. We have a variety of partial results towards the affirmative, but as far as I know, they don’t answer the question unless you’re loose with your definitions of at least one of “classical”, “provably fast”, or “useful”. So let’s settle for a simpler question.
How can we compare the speed of quantum algorithms with quantum input and quantum output to classical algorithms with classical input and classical output? Quantum machine learning algorithms can be exponentially faster than the best standard classical algorithms for similar tasks, but this comparison is unfair because the quantum algorithms get outside help through input state preparation. We want a classical model that helps its algorithms stand a chance against quantum algorithms, while still ensuring that they can be run in nearly all circumstances one would run the quantum algorithm. The answer I propose: compare quantum algorithms with quantum state preparation to classical algorithms with sample and query access to input.
The model
Before we proceed with definitions, we’ll establish some conventions. First, we generally consider our input as being some vector in $\mathbb{C}^n$ or $\mathbb{R}^n$, subject to an access model to be described. Second, we’ll only concern ourselves with an algorithm’s query complexity, the number of accesses to the input. Our algorithms will have query complexity independent of input dimensions and polynomial in other parameters. If we assume that each access costs (say) $O(1)$ or $O(\log n)$, the time complexity is still polylogarithmic in input dimension and at most polynomially worse in other parameters.
Now, we define query access to input; we can get query access simply by having the input in RAM.
Definition. We have query access to $x \in \mathbb{C}^n$ (denoted $Q(x)$) if, given $i \in [n]$, we can efficiently compute $x_i$.
If we have $x$ stored normally as an array in our classical computer’s memory, we have $Q(x)$ because finding the $i$th entry of $x$ can be done with the code `x[i]`.
This notion of access can represent more than just memory: we can also have $Q(x)$ if $x$ is implicitly described.
For example, consider $x$ the vector of squares: $x_i = i^2$ for all $i$.
We can have $Q(x)$ access to $x$ without writing $x$ in memory.
This will be important for the algorithms to come.
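To make the two flavors of query access concrete, here is a minimal Python sketch. The names `QueryAccess`, `from_array`, and `squares` are made up for illustration, and indices are 0-based (so entry `i` of the vector of squares holds `(i + 1)**2`).

```python
from typing import Callable, Sequence


class QueryAccess:
    """Query access to a length-n vector: given an index i, return entry i."""

    def __init__(self, n: int, entry: Callable[[int], complex]):
        self.n = n
        self._entry = entry

    def query(self, i: int) -> complex:
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self._entry(i)


def from_array(x: Sequence[complex]) -> QueryAccess:
    # Explicit storage: a query is just an array lookup.
    return QueryAccess(len(x), lambda i: x[i])


def squares(n: int) -> QueryAccess:
    # Implicit description: each entry is computed on demand, nothing is stored.
    return QueryAccess(n, lambda i: (i + 1) ** 2)
```

Note that `squares(10**12)` is perfectly usable even though the vector it describes would never fit in memory.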
Definition. We have sample and query access to $x \in \mathbb{C}^n$ (denoted $SQ(x)$) if we have query access to $x$; can produce independent random samples $i \in [n]$ where we sample $i$ with probability $|x_i|^2/\|x\|^2$; and can query for $\|x\|$.
Sampling and query access to $x$ will be our classical analogue to assuming quantum state preparation of copies of $|x\rangle$. This should make some intuitive sense: our classical analogue has the standard assumption of query access to input, along with samples, which are essentially measurements of $|x\rangle$ in the computational basis. Knowledge of $\|x\|$ is for normalization issues, and is often assumed for quantum algorithms as well (though for both classical and quantum algorithms, often approximate knowledge suffices).
Example. Like query access, we can get efficient sample and query access from an explicit memory structure. To get $SQ(x)$ for a bit vector $x \in \{0,1\}^n$, store the number of nonzero entries $z$ and a sorted array of the 1-indices $D$. For example, we could store $x = (1, 1, 0, 0, 1, 0, 0, 0)$ as $z = 3$ and $D = \{1, 2, 5\}$.
Then we can find $x_i$ by checking if $i \in D$, we can sample from $x$ by picking an index from $D$ uniformly at random, and we know $\|x\|$, since it’s just $\sqrt{z}$. This generalizes to an efficient binary search tree data structure for $SQ(x)$ for any $x \in \mathbb{C}^n$.
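Here is a rough Python sketch of both structures, under some simplifying assumptions. `BitVectorSQ` follows the bit-vector example directly (with 0-based indices). `DenseVectorSQ` is a stand-in for the general case: it uses a static prefix-sum array with binary search rather than a binary search tree, so it supports sampling in logarithmic time but not fast updates. Both class names are hypothetical.

```python
import bisect
import math
import random


class BitVectorSQ:
    """SQ access to a 0/1 vector stored as its sorted list of 1-indices."""

    def __init__(self, n, one_indices):
        self.n = n
        self.D = sorted(one_indices)   # sorted array of 1-indices
        self.z = len(self.D)           # number of nonzero entries

    def query(self, i):
        # Binary search for i in D.
        pos = bisect.bisect_left(self.D, i)
        return 1 if pos < self.z and self.D[pos] == i else 0

    def sample(self, rng=random):
        # All nonzero entries have equal squared magnitude, so the
        # norm-squared distribution is uniform over D.
        return rng.choice(self.D)

    def norm(self):
        return math.sqrt(self.z)


class DenseVectorSQ:
    """SQ access to a real or complex vector via prefix sums of |x_i|^2."""

    def __init__(self, x):
        self.x = list(x)
        self.prefix = []
        total = 0.0
        for v in self.x:
            total += abs(v) ** 2
            self.prefix.append(total)
        self.sq_norm = total

    def query(self, i):
        return self.x[i]

    def sample(self, rng=random):
        # Inverse-CDF sampling: index i comes out w.p. |x_i|^2 / ||x||^2.
        u = rng.random() * self.sq_norm
        return min(bisect.bisect_right(self.prefix, u), len(self.x) - 1)

    def norm(self):
        return math.sqrt(self.sq_norm)
```

For instance, `DenseVectorSQ([3.0, 0.0, 4.0])` has norm 5 and returns index 0 about 36% of the time, index 2 about 64% of the time, and index 1 never.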
We can also define sample and query access to matrices as just sample and query access to vectors “in” the matrix.
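Definition. For $A \in \mathbb{C}^{m\times n}$, $SQ(A)$ is defined as $SQ(A_i)$ for $A_i$ the rows of $A$, along with $SQ(\tilde{A})$ for $\tilde{A}$ the vector of row norms (so $\tilde{A}_i = \|A_i\|$).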
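As a sketch of what sample and query access to a matrix could look like in code: keep one sampler per row, plus one sampler over the squared row norms. The names `make_sampler` and `MatrixSQ` are illustrative, and the prefix-sum samplers are an assumption of this sketch (any structure giving norm-squared samples would do).

```python
import bisect
import math
import random


def make_sampler(weights):
    """Return a function sampling index i with probability weights[i] / sum(weights)."""
    prefix, total = [], 0.0
    for w in weights:
        total += w
        prefix.append(total)

    def sample(rng=random):
        u = rng.random() * total
        return min(bisect.bisect_right(prefix, u), len(prefix) - 1)

    return sample


class MatrixSQ:
    """SQ access to a matrix: SQ access to each row plus SQ access to the row norms."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.row_sq_norms = [sum(abs(v) ** 2 for v in r) for r in self.rows]
        self.row_samplers = [make_sampler([abs(v) ** 2 for v in r]) for r in self.rows]
        # Sampling from the vector of row norms picks row i w.p. ||A_i||^2 / ||A||_F^2.
        self.sample_row = make_sampler(self.row_sq_norms)

    def query(self, i, j):
        return self.rows[i][j]

    def row_norm(self, i):
        return math.sqrt(self.row_sq_norms[i])

    def frobenius_norm(self):
        return math.sqrt(sum(self.row_sq_norms))

    def sample_in_row(self, i, rng=random):
        return self.row_samplers[i](rng)

    def sample_entry(self, rng=random):
        # Row by norm, then column within the row: (i, j) w.p. |A_ij|^2 / ||A||_F^2.
        i = self.sample_row(rng)
        return i, self.sample_in_row(i, rng)
```

The two-stage `sample_entry` is the reason row norms are part of the definition: it lets us treat the whole matrix as one long vector with sample and query access.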
By replacing quantum states with these classical analogues, we form a model based on sample and query access which we codify with the informal definition of “dequantization”.
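Definition. Let $\mathcal{A}$ be a quantum algorithm with input $|\phi_1\rangle, \ldots, |\phi_C\rangle$ and output either a state $|\psi\rangle$ or a value $\lambda$. We say we dequantize $\mathcal{A}$ if we describe a classical algorithm that, given $SQ(\phi_1, \ldots, \phi_C)$, can evaluate queries to $SQ(\psi)$ or output $\lambda$, with similar guarantees to $\mathcal{A}$ and query complexity $\mathrm{poly}(C)$.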
That is, given sample and query access to the inputs, we can output sample and query access to a desired vector or a desired value, with at most polynomially larger query complexity.
We justify why this model is a reasonable point of comparison two sections from now, in Implications. Next, though, we will jump into how to build these dequantized protocols.
Quantum for the quantum-less
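So far, all dequantized results revolve around three dequantized protocols that we piece together into more useful tasks. In query complexity independent of $m$ and $n$, we can perform the following:
1. (Inner Product) For $x, y \in \mathbb{C}^n$, given $SQ(x)$ and $Q(y)$, we can estimate $\langle x, y\rangle$ to $\varepsilon\|x\|\|y\|$ error with probability $\geq 1-\delta$;
2. (Thin Matrix-Vector) For $V \in \mathbb{C}^{n\times k}$ and $w \in \mathbb{C}^k$, given $SQ(V^\dagger)$ and $Q(w)$, we can simulate $SQ(Vw)$ with $\mathrm{poly}(k)$ queries;
3. (Low-rank Approximation) For $A \in \mathbb{C}^{m\times n}$, given $SQ(A)$ and some threshold $k$, we can output a description of a low-rank approximation of $A$ with $\mathrm{poly}(k)$ queries.
Specifically, our output is $SQ(S, \hat{U}, \hat{\Sigma})$ for $S \in \mathbb{C}^{\ell\times n}$, $\hat{U} \in \mathbb{C}^{\ell\times k}$, and $\hat{\Sigma} \in \mathbb{C}^{k\times k}$ ($\ell = \mathrm{poly}(k)$), and this implicitly describes the low-rank approximation to $A$, $D := A\hat{V}\hat{V}^\dagger$ for $\hat{V} := S^\dagger\hat{U}\hat{\Sigma}^{-1}$ (notice rank $D \leq k$).
This matrix satisfies the following low-rank guarantee with probability $\geq 0.9$: for $\sigma := \sqrt{2/k}\,\|A\|_F$, and $A_\sigma := \sum_{\sigma_i \geq \sigma} \sigma_i u_i v_i^\dagger$ (using $A$’s SVD),
$$\|A - D\|_F^2 \leq \|A - A_\sigma\|_F^2 + 0.01\|A\|_F^2.$$
This guarantee is non-standard: instead of $A_k$, we use $A_\sigma$. This makes our promise weaker, since it is useless if $A$ has no large singular values.
For intuition, it’s helpful to think of $D$ as $A$ multiplied with a “projector” $\hat{V}\hat{V}^\dagger$ that projects the rows of $A$ onto the columns of $\hat{V}$, where these columns are “singular vectors” (approximately orthonormal, and with corresponding “singular values” $\hat{\sigma}_1, \ldots, \hat{\sigma}_k$ that are encoded in the diagonal matrix $\hat{\Sigma}$).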
The first two protocols are dequantized swap tests and the third is essentially a dequantized variant of phase estimation seen in quantum recommendation systems [2].
Now, we describe how these techniques are used to get the results for recommendation systems, supervised clustering, and low-rank matrix inversion. We defer the important details of models and error analyses to Implications, instead focusing on the algorithms themselves and how they use dequantized protocols.
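Supervised clustering
We want to find the distance from a point $p \in \mathbb{R}^n$ to the centroid (average) of a cluster of points $q_1, \ldots, q_{m-1} \in \mathbb{R}^n$. If we assume sample and query access to the data points, computing $\|p - \frac{1}{m-1}(q_1 + \cdots + q_{m-1})\|^2$ reduces to computing $\|w^T M\|^2$ for
$$M := \begin{bmatrix} p/\|p\| \\ -q_1/\|q_1\| \\ \vdots \\ -q_{m-1}/\|q_{m-1}\| \end{bmatrix} \in \mathbb{R}^{m\times n}, \qquad w := \begin{bmatrix} \|p\| \\ \|q_1\|/(m-1) \\ \vdots \\ \|q_{m-1}\|/(m-1) \end{bmatrix} \in \mathbb{R}^m.$$
$SQ$ access to $p, q_1, \ldots, q_{m-1}$ gives $SQ$ access to $M$ and $w$, so the supervised clustering problem reduces to the following:
Problem. For $M \in \mathbb{R}^{m\times n}$ and $w \in \mathbb{R}^m$, given $SQ(M)$ and $Q(w)$, approximate $w^T M M^T w$ to additive $\varepsilon$ error.
Algorithm. We can write $w^T M M^T w$ as the inner product of an order three tensor; through basic tensor arithmetic, it is equal to $\langle u, v\rangle$, where $u, v \in \mathbb{R}^{m\times n\times m}$ are
$$u := \sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^m M_{ij}\|M_k\|\,|i\rangle|j\rangle|k\rangle, \qquad v := \sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^m \frac{w_i w_k M_{kj}}{\|M_k\|}\,|i\rangle|j\rangle|k\rangle.$$
Since $\|u\| = \|M\|_F^2$ and $\|v\| = \|w\|^2$, applying the algorithm for inner product (1) gives the desired approximation with $O\big(\frac{\|w\|^4\|M\|_F^4}{\varepsilon^2}\log\frac{1}{\delta}\big)$ samples and queries.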
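As a sanity check on this reduction, here is a small Python sketch that computes the tensor inner product exactly (no sampling) and compares it to the distance to the centroid. It assumes one particular way of setting up the reduction, which is my reading rather than something pinned down by the text: the matrix `M` stacks the normalized point and the negated normalized cluster points as rows, and `w` holds the corresponding norms, with the cluster norms divided by the cluster size. The function names are hypothetical, and all points are assumed nonzero.

```python
import math


def clustering_reduction(p, qs):
    """Build M (unit-norm rows) and w such that w^T M = p - mean(qs)."""
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    M = [[t / norm(p) for t in p]] + [[-t / norm(q) for t in q] for q in qs]
    w = [norm(p)] + [norm(q) / len(qs) for q in qs]
    return M, w


def tensor_inner_product(M, w):
    """Sum over (i, j, k) of u_ijk * v_ijk with
    u_ijk = M_ij * ||M_k|| and v_ijk = w_i * w_k * M_kj / ||M_k||."""
    row_norm = [math.sqrt(sum(t * t for t in row)) for row in M]
    total = 0.0
    for i, Mi in enumerate(M):
        for k, Mk in enumerate(M):
            for j in range(len(Mi)):
                total += (Mi[j] * row_norm[k]) * (w[i] * w[k] * Mk[j] / row_norm[k])
    return total
```

The dequantized algorithm never runs this triple loop, of course. It samples $(i, j, k)$ from the first tensor and queries the second, which is exactly the inner product protocol.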
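Recommendation systems
We want to randomly sample a product $j \in [n]$ that is a good recommendation for a particular user $i \in [m]$, given incomplete data on user-product preferences. If we store this data in a matrix $A \in \mathbb{R}^{m\times n}$ with sampling and query access, in the right model, finding good recommendations reduces to:
Problem. For a matrix $A \in \mathbb{R}^{m\times n}$ along with a row $i \in [m]$, given $SQ(A)$, approximately sample from $D_i$ where $D$ is a sufficiently good low-rank approximation of $A$.
Remark. This task is essentially a variant of PCA, since a low-rank decomposition is dimensionality reduction of the matrix, viewed as a set of row vectors. This is the “dequantized PCA” I refer to in other work [3].
Algorithm. Apply (3) to get $SQ(S, \hat{U}, \hat{\Sigma})$ for a low-rank approximation $D = A\hat{V}\hat{V}^\dagger$. It turns out that this low-rank approximation is good enough to get good recommendations. So it suffices to sample from $D_i = A_i\hat{V}\hat{V}^\dagger$, where $\hat{V} = S^\dagger\hat{U}\hat{\Sigma}^{-1}$ with $\hat{V} \in \mathbb{C}^{n\times k}$.
Approximate $A_i S^\dagger$ in $\ell^2$ norm using inner product protocols (1), one for each of the $\ell$ rows of $S$; call the estimate $\mathrm{est}$. Next, compute $\mathrm{est}\,\hat{U}\hat{\Sigma}^{-2}\hat{U}^\dagger$ with naive matrix-vector multiplication. Finally, sample from $\mathrm{est}\,\hat{U}\hat{\Sigma}^{-2}\hat{U}^\dagger S$, which is a thin matrix-vector product (2).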
An aside. This gives an exponential speedup over previous classical results from 15-20 years ago [4]. The story here is quite odd. From what I can tell, researchers at the time knew the important (read: hard) part of the algorithm, how to compute low-rank approximations fast, but didn’t notice that the resulting description of the low-rank approximation could be used to sample the desired recommendations in sublinear time, which I think is much easier to understand. This gave me anxiety during research, since I figured there was no way this would have been overlooked. I’m glad these fears were unfounded; it’s cool that this quantum perspective made this step natural and obvious!
Low-rank matrix inversion
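The goal here is to mimic a quantum algorithm that can solve systems of equations $Ax = b$ for $A$ low-rank. The dequantized version of this is:
Problem. For a low-rank matrix $A \in \mathbb{R}^{m\times n}$ and a vector $b \in \mathbb{R}^m$, given $SQ(A)$ and $SQ(b)$, (approximately) respond to requests for $SQ(A^+b)$, where $A^+$ is the pseudoinverse of $A$.
Algorithm. Use the low-rank approximation protocol (3) to get $SQ(S, \hat{U}, \hat{\Sigma})$. From applying the matrix-vector protocol (2), we have $SQ(\hat{V})$, where $\hat{V} := S^\dagger\hat{U}\hat{\Sigma}^{-1}$; with some analysis we can show that the columns of $\hat{V}$ behave like the right singular vectors of $A$. Further, $\hat{\Sigma}_{ii}$ behaves like their approximate singular values. Using this information, we can approximate the vector we want to sample from:
$$A^+ b = (A^TA)^+A^Tb \approx \sum_{i=1}^k \frac{1}{\hat{\Sigma}_{ii}^2}\,\hat{v}_i\hat{v}_i^TA^Tb, \quad \text{for } \hat{v}_i \text{ the columns of } \hat{V}.$$
We approximate $\hat{v}_i^TA^Tb$ to additive error for all $i$ by noticing that $\hat{v}_i^TA^Tb = \mathrm{tr}(A^Tb\hat{v}_i^T)$ is an inner product of the order two tensors $A$ and $b\hat{v}_i^T$. Thus, we can apply (1), since being given $SQ(A)$ implies $SQ(a)$ for $a \in \mathbb{R}^{mn}$ being $A$ viewed as a long vector. Finally, using (2), sample from the linear combination using these estimates and $\hat{\Sigma}$.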
Implications
We have just described examples of dequantized algorithms for the following problems:
- Recommendation systems [5, 2] (this classical algorithm exponentially improves on the previous best!)
- Supervised clustering [3, 7]
- Low-rank matrix inversion [8, 9, 10]
We address here what to take away from these results.
For quantum computing
The most important conclusion, in my opinion, is a heuristic:
Heuristic 1. Linear algebra problems in low-dimensional spaces (constant, say, or polylogarithmic) likely can be dequantized.
The intuition for this heuristic is that, if your problem operates in a subspace of such low dimension, the main challenge is “finding” this subspace and rotating to it. Then, we can think about our problem as lying in $\mathbb{C}^d$ where $d$ is small, and can solve it with a simple polynomial-time (in $d$) algorithm. Finding the subspace is an unordered search problem if you squint, so it can’t be sped up much by exploiting quantum.
Remark. There are high-dimensional problems that cannot be dequantized; for example, given $SQ(v)$ for $v \in \mathbb{C}^n$, approximately sampling from $Hv$, where $H$ is the Hadamard matrix, takes a number of queries polynomial in $n$ rather than polylogarithmic (this is the Fourier Sampling problem [11]).
Why do we care about dequantizing algorithms? As the name suggests, I argue that this is a reasonable classical analogue to quantum machine learning algorithms.
Heuristic 2. For machine learning problems, SQ assumptions are more reasonable than state preparation assumptions.
That is, the practical task of preparing quantum states is probably always harder than the practical task of preparing sample and query access. Practically, this makes sense, since for state preparation we need, well, quantum computers.
Quantum computing applications that are realizable with zero qubits!
Even assuming the existence of a practical quantum computer, there is evidence that state preparation assumptions are still harder to satisfy than sample and query access, up to polynomial slowdown. For example, preparing a generic quantum state $|v\rangle$ corresponding to an input vector $v \in \mathbb{C}^n$ takes $\Omega(\sqrt{n})$ quantum queries to $v$ in general, while responding to $SQ(v)$ accesses takes $\Theta(n)$ classical queries. Because dequantized algorithms run in time polynomial in $\log n$, this means that getting SQ access to a generic vector is much more expensive than running the algorithm.
Of course, we can also consider special classes of vectors where quantum state preparation is easier, but generally SQ access gets proportionally faster as well. For example, we can quickly prepare vectors where all entries have roughly equal magnitude (think vectors whose entries are either $+1$ or $-1$), but correspondingly, we can compute SQ accesses to such vectors similarly quickly.
On the classical side, the assumption of SQ access is on par with other typical assumptions to make machine learning algorithms sublinear:
- There is a classical dynamic data structure that supports SQ access, fast updates, and sparsity in log time.
- Given an input vector as a list of nonzero entries, sampling from it takes time linear in sparsity.
- $k$ independent samples can be prepared with one pass through the data in $O(k)$ space.
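The one-pass claim in the last bullet can be sketched with weighted reservoir sampling. This is a standard technique I am swapping in for illustration, not necessarily the one the cited works use; the function name `one_pass_samples` is made up. Each of the `k` slots independently keeps the current item with probability equal to its squared magnitude over the running total.

```python
import random


def one_pass_samples(stream, k, rng=random):
    """Draw k independent indices, each with probability |x_i|^2 / ||x||^2,
    from a single pass over (index, value) pairs using O(k) memory."""
    total = 0.0
    reservoir = [None] * k
    for index, value in stream:
        weight = abs(value) ** 2
        if weight == 0:
            continue
        total += weight
        for slot in range(k):
            # Each slot independently switches to the new index w.p. weight / total.
            if rng.random() < weight / total:
                reservoir[slot] = index
    return reservoir, total  # total is ||x||^2
```

A usage example: `one_pass_samples(enumerate(x), 100)` returns 100 samples from `x` together with its squared norm, after reading each entry exactly once. The per-entry update cost is linear in `k` in this naive form.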
To summarize these heuristics: quantum machine learning for low-dimensional datasets will probably never get speedups as significant as, say, Shor’s algorithm, even in best-case scenarios. Unfortunately, QML algorithms for low-dimensional problems were the most practical ones in the literature, so with this research it’s unclear what the state of the field is today.
The story might not be over, though. We know that quantum computers can “efficiently solve” high-dimensional linear algebra problems [12]; however, this assumes that we have some way to evolve a quantum system precisely according to input data, a much harder problem than the linear algebra itself. Nevertheless, I hold out hope that this result can be applied to achieve exponential speedups in machine learning or elsewhere.
For classical computing
I am cautiously optimistic about the implications of this work for classical computing. The major advantage of dequantized algorithms is sheer speed (asymptotically, at least). However, the issues listed below prevent dequantized algorithms from being strict improvements over current algorithms.
- Gaining SQ access to input typically requires preliminary data processing or the use of a data structure. This means that dequantized algorithms can’t be plugged into existing systems without large amounts of computation.
- SQ access to output might not always be useful or practical.
- Current dequantized algorithms have large error compared to standard techniques.
- Current algorithms have large theoretical exponents, so right now we don’t know whether they run quickly in practice. I expect we can cut down these exponents greatly.
If I had to guess, the best chance for success in dequantized techniques remains recommendation systems, since speed matters significantly in that context. I view the other algorithms as significantly less likely to see use in practice, though probably more likely than their corresponding quantum algorithms.
Regardless, these works fit nicely into the classical literature: dequantized quantum machine learning is just a nicely modular, quantum-inspired form of randomized numerical linear algebra.
Appendix: More details
As a reminder, here are the three techniques:
- Inner Product
- Thin Matrix-Vector
- Low-rank Approximation
Below, we explain (1) and (2) fully, and give a rough sketch of (3).
1. Estimating inner products
First, we give a basic way of estimating the mean of an arbitrary distribution with finite variance.
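Fact. For $\{X_{i,j}\}$ i.i.d random variables with mean $\mu$ and variance $\sigma^2$, let
$$Y := \underset{j \in [q]}{\operatorname{median}}\ \underset{i \in [p]}{\operatorname{mean}}\ X_{i,j}, \qquad p = O\Big(\frac{1}{\varepsilon^2}\Big),\quad q = O\Big(\log\frac{1}{\delta}\Big).$$
Then $|Y - \mu| \leq \varepsilon\sigma$ with probability $\geq 1-\delta$, using only $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ copies of $X$.
Proof sketch. The proof follows from two facts: first, the median of $C_1, \ldots, C_q$ is at least $\lambda$ precisely when at least half of the $C_j$ are at least $\lambda$; second, Chebyshev’s inequality (applied to the mean).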
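A minimal Python sketch of this median-of-means estimator follows. The constants 4 and 8 are one valid choice that I worked out from Chebyshev and Hoeffding for this sketch; they are an assumption here, not constants taken from the papers.

```python
import math
import random
import statistics


def median_of_means(draw, eps, delta):
    """Estimate E[X] to within eps * std(X) with probability >= 1 - delta.

    draw() returns one fresh, independent sample of X.
    """
    # Chebyshev: a mean of 4 / eps^2 samples is off by more than eps * std
    # with probability at most 1/4.
    per_mean = math.ceil(4 / eps**2)
    # Hoeffding: at least half of q such means are bad w.p. at most exp(-q / 8).
    # The count is made odd so that the median is one of the means.
    n_means = math.ceil(8 * math.log(1 / delta)) | 1
    means = [
        sum(draw() for _ in range(per_mean)) / per_mean
        for _ in range(n_means)
    ]
    return statistics.median(means)
```

The point of taking a median rather than one big mean is the logarithmic dependence on the failure probability, which a plain Chebyshev bound does not give.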
Estimating the inner product is just a basic corollary of this estimator.
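Proposition. For $x, y \in \mathbb{C}^n$, given $SQ(x)$ and $Q(y)$, we can estimate $\langle x, y\rangle$ to $\varepsilon\|x\|\|y\|$ error with probability $\geq 1-\delta$ with query complexity $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$.
Proof. Sample $s$ from $x$ and let $Z := \frac{y_s}{x_s}\|x\|^2$, which has mean $\langle x, y\rangle$ and variance at most $\|x\|^2\|y\|^2$. Apply the Fact with $X_{i,j}$ being independent copies of $Z$.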
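Here is a rough Python sketch of the inner product estimator for real vectors. Two assumptions to flag: the prefix-sum array built at the start stands in for sample access to the first vector (in the model this is given, so its linear-time construction should not be counted), and the constants in the median-of-means step are my own choice. The function name `estimate_inner_product` and the callback `y_query` are hypothetical.

```python
import bisect
import math
import random
import statistics


def estimate_inner_product(x, y_query, eps, delta, rng=random):
    """Estimate sum_i x_i * y_i to additive error eps * ||x|| * ||y||.

    x is a real list standing in for SQ(x); y_query(i) stands in for Q(y).
    Only O(log(1/delta) / eps^2) entries of y are ever read.
    """
    # Stand-in for SQ(x): prefix sums of x_i^2 for inverse-CDF sampling.
    prefix, sq_norm = [], 0.0
    for v in x:
        sq_norm += v * v
        prefix.append(sq_norm)

    def draw():
        s = min(bisect.bisect_right(prefix, rng.random() * sq_norm), len(x) - 1)
        # Z = (y_s / x_s) * ||x||^2 has mean <x, y> and variance <= ||x||^2 ||y||^2.
        return y_query(s) / x[s] * sq_norm

    per_mean = math.ceil(4 / eps**2)
    n_means = math.ceil(8 * math.log(1 / delta)) | 1
    means = [sum(draw() for _ in range(per_mean)) / per_mean for _ in range(n_means)]
    return statistics.median(means)
```

Notice that the number of draws depends only on the error and failure parameters, never on the length of the vectors.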
2. Thin matrix-vector product with rejection sampling
We first go over rejection sampling, a naive way to efficiently generate samples from a specified distribution from samples from another distribution.
Input: samples from distribution $P$
Output: samples from distribution $Q$
- Pull a sample $s$ from $P$;
- Compute $r_s = \frac{Q(s)}{M P(s)}$ for some constant $M$;
- Output $s$ with probability $r_s$ and restart otherwise.
Fact. If $r_i \leq 1$ for all $i$, then the above procedure is well-defined and outputs a sample from $Q$ in $M$ iterations in expectation.
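The rejection sampling loop is short enough to write out in full. In this sketch the argument names (`sample_p`, `p`, `q`, `M`) are my own, with `p` and `q` being functions that return the probability of an outcome under the proposal and target distributions.

```python
import random


def rejection_sample(sample_p, p, q, M, rng=random):
    """Turn samples from distribution p into one sample from distribution q.

    Requires q(s) <= M * p(s) for every outcome s. Returns the sample and
    the number of iterations used (M in expectation).
    """
    iterations = 0
    while True:
        iterations += 1
        s = sample_p()
        r = q(s) / (M * p(s))      # acceptance probability, must be <= 1
        if rng.random() < r:
            return s, iterations
```

For example, turning a fair coin into a coin with 90% heads works with `M = 1.8`: heads is always accepted, tails is accepted with probability 1/9, and each sample takes 1.8 iterations on average.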
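Proposition. For $V \in \mathbb{R}^{n\times k}$ and $w \in \mathbb{R}^k$, given $SQ(V^\dagger)$ and $Q(w)$, we can simulate $SQ(Vw)$ with expected query complexity $\tilde{O}(k^2 C(V, w))$, where
$$C(V, w) := \frac{\sum_{j=1}^k \|w_j V^{(j)}\|^2}{\|Vw\|^2}.$$
We can compute entries $(Vw)_i$ with $O(k)$ queries.
We can sample using rejection sampling:
- $P$ is the distribution formed by sampling from $V^{(j)}$ with probability proportional to $\|w_j V^{(j)}\|^2$;
- $Q$ is the target $Vw$.
Taking $M = k\,C(V, w)$, this gives
$$r_i = \frac{(Vw)_i^2}{k\sum_{j=1}^k (w_j V_{ij})^2}.$$
Notice that we can compute these $r_i$’s (in fact, despite that we cannot compute probabilities from the target distribution), and that the rejection sampling guarantee $r_i \leq 1$ is satisfied (via Cauchy-Schwarz).
The probability of success is $\frac{\|Vw\|^2}{k\sum_{j=1}^k \|w_j V^{(j)}\|^2}$. Thus, to estimate the norm of $Vw$, it suffices to estimate the probability of success of this rejection sampling process. We can view this as estimating the heads probability of a biased coin, where the coin is heads if rejection sampling succeeds and tails otherwise. Through a Chernoff bound, we see that the average of $O\big(k\,C(V, w)\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ “coin flips” is within a $(1 \pm \varepsilon)$ factor of the success probability with probability $\geq 1-\delta$, which gives $\|Vw\|^2$ to the same relative error; each coin flip costs $O(k)$ queries and samples.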
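Putting the pieces together, here is a sketch in Python of sampling from a thin matrix-vector product with real entries. As assumptions of this sketch: the matrix is passed as a list of its columns, the per-column samplers are built on the spot (in the model they would be given), and the names `make_sampler` and `sample_from_Vw` are invented. The proposal picks a column with probability proportional to its weighted squared norm and then a row within that column; the acceptance probability is the squared entry of the product divided by the number of columns times the sum of squared terms.

```python
import bisect
import math
import random


def make_sampler(weights, rng):
    prefix, total = [], 0.0
    for w in weights:
        total += w
        prefix.append(total)
    return lambda: min(bisect.bisect_right(prefix, rng.random() * total), len(prefix) - 1)


def sample_from_Vw(columns, w, rng=random):
    """Sample row index i with probability (Vw)_i^2 / ||Vw||^2 (real entries).

    columns[j] is column j of V. Returns the index and the number of attempts.
    """
    k = len(w)
    col_samplers = [make_sampler([v * v for v in col], rng) for col in columns]
    col_weights = [w[j] ** 2 * sum(v * v for v in columns[j]) for j in range(k)]
    pick_column = make_sampler(col_weights, rng)
    attempts = 0
    while True:
        attempts += 1
        j = pick_column()              # column j w.p. proportional to ||w_j V^(j)||^2
        i = col_samplers[j]()          # row i w.p. V_ij^2 / ||V^(j)||^2
        terms = [w[c] * columns[c][i] for c in range(k)]       # k queries
        r = sum(terms) ** 2 / (k * sum(t * t for t in terms))  # <= 1 by Cauchy-Schwarz
        if rng.random() < r:
            return i, attempts
```

The fraction of attempts that succeed is the "biased coin" from the text, so averaging `1 / attempts`-style statistics over many calls is what lets us recover the norm of the product. When the columns nearly cancel, the success probability is tiny and the sampler is slow, which is exactly what the cancellation factor in the proposition measures.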
3. Low-rank approximation, briefly
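Proposition. For $A \in \mathbb{C}^{m\times n}$, given $SQ(A)$ and some threshold $k$, we can output a description of a low-rank approximation of $A$.
Specifically, our output is $SQ(S, \hat{U}, \hat{\Sigma})$ for $S \in \mathbb{C}^{\ell\times n}$, $\hat{U} \in \mathbb{C}^{\ell\times k}$, $\hat{\Sigma} \in \mathbb{C}^{k\times k}$ ($\ell = \mathrm{poly}(k)$), and this implicitly describes the low-rank approximation to $A$, $D := A\hat{V}\hat{V}^\dagger$ for $\hat{V} := S^\dagger\hat{U}\hat{\Sigma}^{-1}$ (notice rank $D \leq k$).
This matrix satisfies the following low-rank guarantee with probability $\geq 0.9$: for $\sigma := \sqrt{2/k}\,\|A\|_F$, and $A_\sigma := \sum_{\sigma_i \geq \sigma} \sigma_i u_i v_i^\dagger$ (using SVD),
$$\|A - D\|_F^2 \leq \|A - A_\sigma\|_F^2 + 0.01\|A\|_F^2.$$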
This algorithm comes from the 1998 paper of Frieze, Kannan, and Vempala [13]. See the recent survey [14] by Kannan and Vempala for an overview of these techniques, and see Woodruff’s textbook [15] for a discussion of more general techniques. The form I state above is a simple variant that I discuss in my recommendation systems paper [5].
The core piece of analysis is the following theorem (sometimes called the Approximate Matrix Product property in the literature).
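Theorem. Let $S \in \mathbb{C}^{\ell\times n}$, where $S_i$ is $\frac{\|A\|_F}{\sqrt{\ell}}\frac{A_j}{\|A_j\|}$ with probability $\frac{\|A_j\|^2}{\|A\|_F^2}$ (so $j$ is sampled from $\tilde{A}$). For sufficiently small $\varepsilon$ and $\ell = \Omega\big(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$, with probability $\geq 1-\delta$,
$$\|S^\dagger S - A^\dagger A\|_F \leq \varepsilon\|A\|_F^2.$$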
This looks like a further higher-order (two order two tensor inner product) generalization of inner product (two order one tensor inner product) and thin matrix-vector (order two and order one tensor inner product); it’s possible that a clever rephrasing of this result in the model could make the low-rank approximation result more quantum-ic.
We now sketch the algorithm along with intuition: it’s most useful to consider the low-rank approximation task as one of finding large approximate singular vectors. First, sample $\ell$ rows of $A$ according to $\ell^2$ norm, and consider the matrix of these rows, all renormalized to have the same length. This is the $S$ that we output. By the above theorem, $S^\dagger S \approx A^\dagger A$ with good probability, which implies that the large right singular vectors of $S$ (eigenvectors of $S^\dagger S$) approximate the large right singular vectors of $A$ (eigenvectors of $A^\dagger A$).
Next, we can perform the same process to $S^\dagger$: sample rows of $S^\dagger$ and get a normalized submatrix $W \in \mathbb{C}^{\ell\times\ell}$ such that $WW^\dagger \approx SS^\dagger$. Since $W$ is a constant-sized matrix, we can compute $\hat{U}$ and $\hat{\Sigma}$, the large left singular vectors and values of $W$, which approximate the large left singular vectors and values of $S$. Then, $\hat{V} := S^\dagger\hat{U}\hat{\Sigma}^{-1}$ translates these large left singular vectors to their corresponding right singular vectors and rescales them accordingly, giving the approximate singular vectors of $A$ as desired.
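The first step of this sketch, sampling and rescaling rows so that the Gram matrix is preserved in expectation, is easy to try out. Below is a small pure-Python illustration for real matrices stored as lists of rows. The function names are hypothetical, the rows are sampled with `random.choices` instead of through SQ access, and the SVD step is left out to keep the example free of dependencies.

```python
import math
import random


def sample_rows(A, ell, rng=random):
    """Return S: ell rows of A drawn w.p. ||A_j||^2 / ||A||_F^2, each rescaled
    to have norm ||A||_F / sqrt(ell), so that E[S^T S] = A^T A."""
    sq_norms = [sum(v * v for v in row) for row in A]
    fro_sq = sum(sq_norms)
    picks = rng.choices(range(len(A)), weights=sq_norms, k=ell)
    return [
        [v * math.sqrt(fro_sq / (ell * sq_norms[j])) for v in A[j]]
        for j in picks
    ]


def gram(M):
    """Compute M^T M for a list-of-rows matrix."""
    n = len(M[0])
    return [[sum(row[a] * row[b] for row in M) for b in range(n)] for a in range(n)]


def frobenius_distance(X, Y):
    return math.sqrt(sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)))
```

Comparing `gram(sample_rows(A, ell))` against `gram(A)` for growing `ell` shows the Frobenius distance shrinking roughly like one over the square root of `ell`, relative to the squared Frobenius norm of `A`.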
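Glossary
For natural numbers $m, n$, vector $v \in \mathbb{C}^n$ and matrix $A \in \mathbb{C}^{m\times n}$:
$[n]$ denotes $\{1, \ldots, n\}$; $O$ and $\tilde{O}$ are big O notation (with $\tilde{O}$ hiding logarithmic factors); $A_i$ and $A^{(i)}$ denote the $i$th row of $A$ and the $i$th column of $A$; $\|v\|$ denotes the $\ell^2$ norm of $v$, $\sqrt{\sum_{i=1}^n |v_i|^2}$;
$|v\rangle$ is bra-ket notation: kets are column vectors $|v\rangle$, bras are row vectors $\langle v| := (|v\rangle)^\dagger$, standard basis vectors are denoted $|i\rangle$, and the tensor product of $|a\rangle$ and $|b\rangle$ is denoted $|a\rangle|b\rangle$. Of course, these are all really quantum states, but that’s only relevant for quantum algorithms: for my purposes, I use $v$ and $|v\rangle$ interchangeably to refer to vectors. (I ignore normalization, but those issues can be dealt with.)
The singular value decomposition (SVD) of $A$ is a decomposition $A = U\Sigma V^\dagger$, where $U \in \mathbb{C}^{m\times m}$ and $V \in \mathbb{C}^{n\times n}$ are unitary and $\Sigma \in \mathbb{R}^{m\times n}$ is diagonal. In other words, $A = \sum_{i=1}^{\min(m,n)} \sigma_i u_i v_i^\dagger$ for $u_i$ and $v_i$ the columns of $U$ and $V$, respectively, and $\sigma_i$ the diagonal entries of $\Sigma$. By convention, $\sigma_1 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$.
Using $A$’s SVD, we can define basic linear algebraic objects. $\|A\| = \sigma_1$ is the spectral norm of $A$. $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$ is the Frobenius norm of $A$. $A_k = \sum_{i=1}^k \sigma_i u_i v_i^\dagger$ is an optimal rank $k$ approximation to $A$ in both spectral and Frobenius norm. $A^+ = \sum_{\sigma_i > 0} \frac{1}{\sigma_i} v_i u_i^\dagger$ is $A$’s pseudoinverse.
I define $Q(x)$, $SQ(x)$, and $SQ(A)$ in An introduction to dequantization.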