Matrix multiplication algorithm
Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.
Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of field operations to multiply two matrices over that field. Better asymptotic bounds on the time required to multiply matrices have been known since the Strassen's algorithm in the 1960s, but the optimal time remains unknown., the best bound on the asymptotic complexity of a matrix multiplication algorithm is time, given by Alman, Duan, Williams, Xu, Xu, and Zhou. However, this algorithm is a galactic algorithm because of the large constants and cannot be realized practically.
Iterative algorithm
The definition of matrix multiplication is that if for an matrix and an matrix, then is an matrix with entriesFrom this, a simple algorithm can be constructed which loops over the indices from 1 through and from 1 through, computing the above using a nested loop:
- Input: matrices and
- Let be a new matrix of the appropriate size
- For from 1 to :
- * For from 1 to :
- ** Let
- ** For from 1 to :
- *** Set
- ** Set
- Return
Cache behavior
The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm;which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.
In particular, in the idealized case of a fully associative cache consisting of bytes and bytes per cache line, the above algorithm is sub-optimal for and stored in row-major order. When, every iteration of the inner loop incurs a cache miss when accessing an element of. This means that the algorithm incurs cache misses in the worst case., the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.
The optimal variant of the iterative algorithm for and in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size by :
- Input: matrices and
- Let be a new matrix of the appropriate size
- Pick a tile size
- For from 1 to in steps of :
- * For from 1 to in steps of :
- ** For from 1 to in steps of :
- *** Multiply and into, that is:
- *** For from to :
- **** For from to :
- ***** Let
- ***** For from to :
- ****** Set
- ***** Set
- Return
Divide-and-conquer algorithm
An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioningwhich works for all square matrices whose dimensions are powers of two, i.e., the shapes are for some. The matrix product is now
which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication as its base case.
The complexity of this algorithm as a function of is given by the recurrence
accounting for the eight recursive calls on matrices of size and to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution, the same as the iterative algorithm.
Non-square matrices
A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice splits matrices in two instead of four submatrices, as follows.Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.
- Inputs: matrices of size, of size.
- Base case: if is below some threshold, use an unrolled version of the iterative algorithm.
- Recursive cases:
Cache behavior
The number of cache misses incurred by this algorithm, on a machine with lines of ideal cache, each of size bytes, is bounded by
Sub-cubic algorithms
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2×2 matrices which requires only 7 multiplications, at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of. Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm, but it is faster in cases where or so and appears in several libraries, such as BLAS. It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the big-O notation has its merits. A table that compares key aspects of the improved version based on recursive multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, gives the dimensions of the matrix and designates the memory size.
| Year | Reference | #matrix multiplications per step | #matrix additions per step | total arithmetic operations | total I/O-complexity |
| 1969 | Strassen | 7 | 18 | ||
| 1971 | Winograd | 7 | 15 | ||
| 2017 | Karstadt, Schwartz | 7 | 12 | ||
| 2023 | Schwartz, Vaknin | 7 | 12 |
It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert showed that such an algorithm requires at least 15 additions ; however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work Beniamini et el. applied this base-change trick to more general decompositions than 2×2-block matrices and improved the leading constant for their run times.
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The matrix multiplication exponent, usually denoted, is the smallest real number for which any matrix over a field can be multiplied together using field operations. The current best bound on is, by Alman, Duan, Williams, Xu, Xu, and Zhou. This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990. The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two -matrices with fewer than multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the big-O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers. Victor Pan proposed so-called feasible sub-cubic matrix multiplication algorithms with an exponent slightly above 2.77, but in return with a much smaller hidden constant coefficient.
Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices, and, verifies in time if.