Prefix sum
In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes (running totals) of the input sequence:

y0 = x0
y1 = x0 + x1
y2 = x0 + x1 + x2
...
For instance, the prefix sums of the natural numbers are the triangular numbers:

input numbers  1  2  3   4   5   6  ...
prefix sums    1  3  6  10  15  21  ...
Prefix sums are trivial to compute in sequential models of computation, by using the formula yi = yi−1 + xi to compute each output value in sequence order. However, despite their ease of computation, prefix sums are a useful primitive in certain algorithms such as counting sort,
and they form the basis of the scan higher-order function in functional programming languages. Prefix sums have also been much studied in parallel algorithms, both as a test problem to be solved and as a useful primitive to be used as a subroutine in other parallel algorithms.
Abstractly, a prefix sum requires only a binary associative operator ⊕, making it useful for many applications from calculating well-separated pair decompositions of points to string processing.
Mathematically, the operation of taking prefix sums can be generalized from finite to infinite sequences; in that context, a prefix sum is known as a partial sum of a series. Prefix summation or partial summation form linear operators on the vector spaces of finite or infinite sequences; their inverses are finite difference operators.
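As a concrete illustration, here is a minimal Python sketch of the sequential computation, cross-checked against itertools.accumulate, the standard library's scan (the function name prefix_sums is illustrative):

```python
from itertools import accumulate

def prefix_sums(xs):
    """Inclusive prefix sums via the sequential recurrence y[i] = y[i-1] + x[i]."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

# The prefix sums of the natural numbers are the triangular numbers.
print(prefix_sums([1, 2, 3, 4, 5, 6]))       # [1, 3, 6, 10, 15, 21]
print(list(accumulate([1, 2, 3, 4, 5, 6])))  # [1, 3, 6, 10, 15, 21]
```

Because only a binary associative operator is required, accumulate also accepts other operations, for example accumulate(xs, operator.mul).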
Scan higher order function
In functional programming terms, the prefix sum may be generalized to any binary operation ⊕; the higher order function resulting from this generalization is called a scan, and it is closely related to the fold operation. Both the scan and the fold operations apply the given binary operation to the same sequence of values, but differ in that the scan returns the whole sequence of results from the binary operation, whereas the fold returns only the final result. For instance, the sequence of factorial numbers may be generated by a scan of the natural numbers using multiplication instead of addition:

1, 2, 6, 24, 120, 720, ...

Inclusive and exclusive scans
Programming language and library implementations of scan may be either inclusive or exclusive. An inclusive scan includes input xi when computing output yi, while an exclusive scan does not. In the latter case, implementations either leave y0 undefined or accept a separate "initial" value with which to seed the scan. Either type of scan can be transformed into the other: an inclusive scan can be transformed into an exclusive scan by shifting the array produced by the scan right by one element and inserting the identity value at the left of the array. Conversely, an exclusive scan can be transformed into an inclusive scan by shifting the array produced by the scan left and inserting the sum of the last element of the scan and the last element of the input array at the right of the array.

The following table lists examples of the inclusive and exclusive scan functions provided by a few programming languages and libraries:
| Language/library | Inclusive scan | Exclusive scan |
|---|---|---|
| APL | +\ | |
| C++ | std::inclusive_scan | std::exclusive_scan |
| Clojure | reductions | reductions (with an initial value) |
| CUDA (Thrust) | inclusive_scan | exclusive_scan |
| F# | | Seq.scan |
| Haskell | scanl1 | scanl |
| HPF | sum_prefix | sum_prefix (with exclusive=.true.) |
| Java | Arrays.parallelPrefix | |
| Kotlin | runningReduce | runningFold |
| MPI | MPI_Scan | MPI_Exscan |
| Python | itertools.accumulate (when initial is None) | itertools.accumulate (when initial is not None) |
| Rust | scan | |
| Scala | | scanLeft |
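The conversions between the two kinds of scan described above can be sketched in Python (a sketch assuming addition, whose identity value is 0; the function names are illustrative):

```python
def inclusive_to_exclusive(scan, identity=0):
    # Shift right by one element and insert the identity value at the left.
    return [identity] + scan[:-1]

def exclusive_to_inclusive(scan, xs):
    # Shift left by one element and append the sum of the last scan
    # element and the last input element at the right.
    return scan[1:] + [scan[-1] + xs[-1]]

xs = [3, 1, 4, 1, 5]
inclusive = [3, 4, 8, 9, 14]
exclusive = [0, 3, 4, 8, 9]
print(inclusive_to_exclusive(inclusive))      # [0, 3, 4, 8, 9]
print(exclusive_to_inclusive(exclusive, xs))  # [3, 4, 8, 9, 14]
```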
The directive-based OpenMP parallel programming model supports both inclusive and exclusive scans beginning with Version 5.0.
Parallel algorithms
There are two key algorithms for computing a prefix sum in parallel. The first offers a shorter span and more parallelism but is not work-efficient. The second is work-efficient but requires double the span and offers less parallelism. These are presented in turn below.

Algorithm 1: Shorter span, more parallel
Hillis and Steele present the following parallel prefix sum algorithm:
for i <- 0 to ⌈log2 n⌉ - 1 do
    for j <- 0 to n - 1 do in parallel
        if j < 2^i then
            x[i+1][j] <- x[i][j]
        else
            x[i+1][j] <- x[i][j] + x[i][j - 2^i]
In the above, the notation x[i][j] means the value of the jth element of array x in timestep i.
With a single processor this algorithm would run in O(n log n) time. However, if the machine has at least n processors to perform the inner loop in parallel, the algorithm as a whole runs in O(log n) time, the number of iterations of the outer loop.
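The algorithm can be simulated sequentially in Python; copying the array in each round models all elements being updated in parallel from the previous timestep (the simulation itself still performs the algorithm's O(n log n) work):

```python
import math

def hillis_steele_scan(xs):
    """Simulate the Hillis-Steele data-parallel inclusive scan."""
    x = list(xs)
    n = len(x)
    if n < 2:
        return x
    for i in range(math.ceil(math.log2(n))):
        prev = x[:]  # snapshot of timestep i; all PEs read from it "in parallel"
        for j in range(2 ** i, n):
            x[j] = prev[j] + prev[j - 2 ** i]
    return x

print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```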
Algorithm 2: Work-efficient
A work-efficient parallel prefix sum can be computed by the following steps.
- Compute the sums of consecutive pairs of items in which the first item of the pair has an even index: z0 = x0 + x1, z1 = x2 + x3, etc.
- Recursively compute the prefix sum w0, w1, w2, ... of the sequence z0, z1, z2, ...
- Express each term of the final sequence y0, y1, y2, ... as the sum of up to two terms of these intermediate sequences: y0 = x0, y1 = z0, y2 = z0 + x2, y3 = w1, etc. After the first value, each successive number yi is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence.
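The three steps can be sketched as a recursive Python function (a sequential sketch; in the parallel setting the pair sums of step 1 and the combinations of step 3 are each computed in parallel):

```python
def work_efficient_scan(xs):
    """Recursive work-efficient inclusive scan (Algorithm 2)."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Step 1: sums of consecutive pairs whose first item has an even index.
    z = [xs[2 * i] + xs[2 * i + 1] for i in range(n // 2)]
    # Step 2: recursively scan the half-length sequence of pair sums.
    w = work_efficient_scan(z)
    # Step 3: odd positions are copied from w; even positions add one input value.
    y = [0] * n
    y[0] = xs[0]
    for i in range(1, n):
        y[i] = w[i // 2] if i % 2 == 1 else y[i - 1] + xs[i]
    return y

print(work_efficient_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```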
Discussion
Each of the preceding algorithms runs in O(log n) time. However, the former takes exactly log2 n steps, while the latter requires 2 log2 n − 2 steps. For the 16-input examples illustrated, Algorithm 1 is 12-way parallel while Algorithm 2 is only 4-way parallel. However, Algorithm 2 is work-efficient (it performs only a constant factor of the O(n) work required by the sequential algorithm) while Algorithm 1 is work-inefficient (it performs asymptotically more work, O(n log n), than is required sequentially). Consequently, Algorithm 1 is likely to perform better when abundant parallelism is available, but Algorithm 2 is likely to perform better when parallelism is more limited.

Parallel algorithms for prefix sums can often be generalized to other scan operations on associative binary operations, and they can also be computed efficiently on modern parallel hardware such as a GPU. The idea of building in hardware a functional unit dedicated to computing multi-parameter prefix-sums was patented by Uzi Vishkin.
Many parallel implementations follow a two pass procedure where partial prefix sums are calculated in the first pass on each processing unit; the prefix sum of these partial sums is then calculated and broadcast back to the processing units for a second pass using the now known prefix as the initial value. Asymptotically this method takes approximately two read operations and one write operation per item.
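This two-pass procedure can be sketched sequentially in Python, with the p processing units simulated by a loop over blocks (the even block partitioning is illustrative):

```python
def two_pass_scan(xs, p):
    """Two passes: local prefix sums per unit, then broadcast offsets."""
    n = len(xs)
    bounds = [k * n // p for k in range(p + 1)]
    # Pass 1: each unit computes the prefix sum of its own block.
    local = []
    for k in range(p):
        acc, out = 0, []
        for v in xs[bounds[k]:bounds[k + 1]]:
            acc += v
            out.append(acc)
        local.append(out)
    # Prefix sum of the block totals; each unit receives its offset.
    offsets, total = [], 0
    for out in local:
        offsets.append(total)
        total += out[-1] if out else 0
    # Pass 2: each unit adds its now-known offset to its local results.
    return [v + off for out, off in zip(local, offsets) for v in out]

print(two_pass_scan([1, 2, 3, 4, 5, 6, 7, 8], 3))  # [1, 3, 6, 10, 15, 21, 28, 36]
```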
Concrete implementations of prefix sum algorithms
An implementation of a parallel prefix sum algorithm, like other parallel algorithms, has to take the parallelization architecture of the platform into account. More specifically, multiple algorithms exist which are adapted for platforms working on shared memory as well as algorithms which are well suited for platforms using distributed memory, relying on message passing as the only form of interprocess communication.

Shared memory: Two-level algorithm
The following algorithm assumes a shared memory machine model; all processing elements (PEs) have access to the same memory. A version of this algorithm is implemented in the Multi-Core Standard Template Library (MCSTL), a parallel implementation of the C++ standard template library which provides adapted versions for parallel computing of various algorithms.

In order to concurrently calculate the prefix sum over n data elements with p processing elements, the data is divided into p + 1 blocks, each containing n/(p + 1) elements. Note that although the algorithm divides the data into p + 1 blocks, only p processing elements run in parallel at a time.
In a first sweep, each PE calculates a local prefix sum for its block. The last block does not need to be calculated, since these prefix sums are only calculated as offsets to the prefix sums of succeeding blocks and the last block is by definition not succeeded.
The offsets which are stored in the last position of each block are accumulated in a prefix sum of their own and stored in their succeeding positions. Since p is a small number, it is faster to do this sequentially; for a large p, this step could be done in parallel as well.
A second sweep is performed. This time the first block does not have to be processed, since it does not need to account for the offset of a preceding block. However, in this sweep the last block is included instead and the prefix sums for each block are calculated taking the prefix sum block offsets calculated in the previous sweep into account.
function prefix_sum(x):
    // sweep 1: each PE computes the local prefix sum of its block (last block skipped)
    // serial step: a prefix sum over the per-block totals yields the block offsets
    // sweep 2: each PE adds its block offset to its local prefix sums (first block skipped)
Improvement: If the number of blocks is so large that the serial step becomes time-consuming on a single processor, the Hillis and Steele algorithm can be used to accelerate the second phase.
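The two sweeps can be simulated sequentially in Python, using p + 1 blocks for p processing elements (an assumption consistent with the two-level scheme above; the names and the even block partitioning are illustrative):

```python
def two_level_scan(xs, p):
    """Two-level shared-memory scan: sweep 1 skips the last block,
    a serial step scans the block totals, sweep 2 skips the first block."""
    n = len(xs)
    nblocks = p + 1
    bounds = [k * n // nblocks for k in range(nblocks + 1)]
    y = list(xs)
    # Sweep 1 (one PE per block): local prefix sums of blocks 0..p-1.
    for k in range(p):
        for j in range(bounds[k] + 1, bounds[k + 1]):
            y[j] += y[j - 1]
    # Serial step: exclusive prefix sum of the block totals gives the offsets.
    offsets = [0] * nblocks
    for k in range(1, nblocks):
        last = bounds[k] - 1  # last position of block k-1 holds its total
        offsets[k] = offsets[k - 1] + (y[last] if bounds[k] > bounds[k - 1] else 0)
    # Sweep 2 (one PE per block): blocks 1..p add their offsets; the last
    # block, skipped in sweep 1, is scanned here with its offset as seed.
    for k in range(1, p):
        for j in range(bounds[k], bounds[k + 1]):
            y[j] += offsets[k]
    acc = offsets[p]
    for j in range(bounds[p], bounds[p + 1]):
        acc += y[j]
        y[j] = acc
    return y

print(two_level_scan([1, 2, 3, 4, 5, 6, 7, 8], 3))  # [1, 3, 6, 10, 15, 21, 28, 36]
```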
Distributed memory: Hypercube algorithm
The Hypercube Prefix Sum Algorithm is well adapted for distributed memory platforms and works with the exchange of messages between the processing elements. It assumes the number of processor elements (PEs) participating in the algorithm to be equal to the number of corners in a d-dimensional hypercube, i.e., p = 2^d.

Throughout the algorithm, each PE is seen as a corner in a hypothetical hypercube with knowledge of the total prefix sum σ as well as the prefix sum x of all elements up to itself, both in its own hypercube.
- The algorithm starts by assuming every PE is the single corner of a zero-dimensional hypercube and therefore σ and x are equal to the local prefix sum of its own elements.
- The algorithm goes on by unifying hypercubes which are adjacent along one dimension. During each unification, σ is exchanged and aggregated between the two hypercubes, which keeps the invariant that all PEs at corners of this new hypercube store the total prefix sum of this newly unified hypercube in their variable σ. However, only the hypercube containing the PEs with the higher index also adds this σ to their local variable x, keeping the invariant that x only stores the value of the prefix sum of all elements at PEs with indices smaller than or equal to their own index.
Assuming a duplex communication model where the σ of two adjacent PEs in different hypercubes can be exchanged in both directions in one communication step, this means d = log2 p communication startups.
i := Index of own processor element
m := prefix sum of local elements of this PE
d := number of dimensions of the hyper cube
x = m; // Invariant: The prefix sum up to this PE in the current subcube
σ = m; // Invariant: The prefix sum of all elements in the current subcube
for (k = 0; k <= d - 1; k++) {
    y = σ @ PE(i XOR 2^k)  // Get the total prefix sum of the opposing subcube along dimension k
    σ = σ + y              // Aggregate the prefix sum of both subcubes
    if (bit k of i is set) // If this PE is in the higher half along dimension k
        x = x + y          // Add the total of the lower subcube to the own prefix sum
}
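The loop can be simulated sequentially in Python, with one aggregated local value m per PE and the message exchange modeled by array reads (a sketch; a real implementation exchanges σ via point-to-point messages):

```python
def hypercube_scan(local_sums):
    """Simulate the hypercube prefix sum over p = 2**d PEs,
    each initially holding the sum m of its local elements."""
    p = len(local_sums)
    d = p.bit_length() - 1
    assert p == 1 << d, "number of PEs must be a power of two"
    x = list(local_sums)      # prefix sum up to each PE in its current subcube
    sigma = list(local_sums)  # total of each PE's current subcube
    for k in range(d):
        # Each PE i exchanges sigma with its partner i XOR 2^k.
        y = [sigma[i ^ (1 << k)] for i in range(p)]
        for i in range(p):
            sigma[i] += y[i]  # total of the newly unified subcube
            if i & (1 << k):  # PE lies in the higher half along dimension k
                x[i] += y[i]  # add the lower half's total
    return x

print(hypercube_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```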