Spatial architecture


In computer science, spatial architectures are a kind of computer architecture leveraging many collectively coordinated and directly communicating processing elements to quickly and efficiently run highly parallelizable kernels.
The "spatial" term comes from processing element instances being typically arranged in an array or grid, both logically and in the silicon design.
Their most common workloads consist of matrix multiplications, convolutions, or, in general, tensor contractions.
As such, spatial architectures are often used in AI accelerators.
The key goal of a spatial architecture is to reduce the latency and power consumption of running very large kernels through the exploitation of scalable parallelism and data reuse.
Consider a kernel, i.e. a function to be applied to several inputs, expressed as one or more loops; running it on a spatial architecture means distributing its computations among processing elements while ensuring that their data dependencies land either within the same element or within the same region of elements.
While spatial architectures can be designed or programmed to support different algorithms, each workload must then be mapped onto the processing elements using specialized dataflows.
Formulating a mapping involves assigning each operation to a processing element and scheduling the ensuing data movements, all tuned to maximize data parallelism and reuse.
Spatial architectures can be classified as SPMD array processors, in that each processing element runs the same operations on a different subset of the data, while still being programmed through a single mapping.
The architecture of an individual processing element can then itself belong to any Flynn class.
In particular, spatial architectures are well suited for applications whose dataflow exhibits producer-consumer relationships or can leverage efficient data sharing among a region of PEs.
Spatial architectures can typically be found as hardware accelerators in heterogeneous systems, under the broader category of manycore processor.

Design details

The core element of a spatial architecture is its multidimensional array of processing elements.
Each processing element is simple, typically a multiply-and-accumulate functional unit, a stripped-down core, or application-specific logic.
Processing elements are connected to each other and to the memory hierarchy through buses, a network on chip, or even asynchronous logic.
The memory hierarchy is explicitly managed and may consist of multiple on-chip buffers, like register files, scratchpads, and FIFOs, backed by large off-chip DRAM and non-volatile memories.
The number of processing elements, interconnect bandwidth, and amount of on-chip memory vary widely between designs and target applications, ranging from thousands of processing elements and tens of megabytes of memory for high-performance computing down to tens of elements and a few kilobytes at the edge.
The key performance metrics for a spatial architecture are its consumed energy and latency when running a given workload.
Due to technology and bandwidth limitations, the energy and latency required to access larger memories, such as DRAM, dominate those of computation, costing hundreds of times more than accesses to storage located near the processing elements.
A spatial architecture's memory hierarchy is therefore designed to localize most repeated accesses in faster and more efficient on-chip memories, exploiting data reuse to minimize costly accesses to the outer levels.
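As an illustration, the following Python sketch uses a simple, assumed cost model (the per-access energies and the three-level hierarchy are hypothetical, chosen only to reflect the orders of magnitude discussed above) to estimate the memory-access energy of a matrix multiplication with and without on-chip reuse.

# Illustrative only: per-access energies (in picojoules) are assumed values,
# chosen so that DRAM costs a few hundred times more than a PE register.
ENERGY_PJ = {"dram": 200.0, "scratchpad": 6.0, "register": 1.0}

def matmul_traffic_energy(M, K, N, with_reuse):
    """Rough memory-access energy for Out[M][N] += W[M][K] * In[K][N]."""
    macs = M * K * N
    if with_reuse:
        # Each operand element leaves DRAM once, is staged in the scratchpad,
        # and all remaining accesses hit registers next to the PEs.
        dram = M * K + K * N + M * N
        scratchpad = dram
        register = 3 * macs      # W, In and Out accessed once per MAC
    else:
        # Worst case: every multiply-accumulate fetches its operands from DRAM.
        dram, scratchpad, register = 3 * macs, 0, 0
    picojoules = (dram * ENERGY_PJ["dram"] + scratchpad * ENERGY_PJ["scratchpad"]
                  + register * ENERGY_PJ["register"])
    return picojoules / 1e12     # joules

print("no reuse:   %.5f J" % matmul_traffic_energy(128, 64, 256, False))
print("with reuse: %.5f J" % matmul_traffic_energy(128, 64, 256, True))

Even under this crude model, keeping most accesses on-chip reduces the memory-traffic energy by roughly two orders of magnitude.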

Data reuse

The mechanisms that enable reuse in spatial architectures are multicast and reduction.
Reuse can be further classified as spatial and temporal.
Spatial architectures' interconnects can support spatial multicast, where a single read from an outer memory serves multiple writes to inner memory instances, and spatial reduction, where reads from multiple inner memories are accumulated into a single write to an outer one.
These can be implemented either with direct element-to-element forwarding, like in systolic arrays, or on the interconnect during memory accesses.
Temporal reuse occurs when the same value is retained in a memory while being read and/or updated in place multiple times without it being re-fetched from another memory.
Consider the case of kernels that can be computed with parallel ALU-like processing elements, such as matrix multiplications and convolutions.
Direct inter-processing-element communication can be used effectively for passing partial sums to achieve spatially distributed accumulation, or sharing the same input data for parallel computation without repeated accesses to outer memories.
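The following Python sketch is a behavioral model of this idea, not any particular hardware design: it mimics a chain of processing elements that forward partial sums to their neighbor, so that a dot product is reduced spatially and only the last element writes to the outer memory.

def spatial_dot_product(weights, inputs):
    """Model a 1-D chain of PEs, one per (weight, input) pair.

    Each PE multiplies its locally held weight and input, adds the partial
    sum forwarded by its left neighbor, and passes the result to the right.
    Only the last PE writes to the outer memory: a spatial reduction.
    """
    partial = 0  # value travelling along the PE-to-PE links
    for w, x in zip(weights, inputs):  # one loop iteration per PE
        partial = partial + w * x
    return partial  # a single outer write instead of one per PE

# The same inputs could also be multicast to several such chains computing
# different output elements, sharing one outer-memory read among them.
print(spatial_dot_product([1, 2, 3], [4, 5, 6]))  # 32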
The amount of data reuse that can be exploited is a property of the kernel being run, and can be inferred by analyzing its data dependencies.
When the kernel can be expressed as a loop nest, reuse arises from subsequent iterations accessing, in part, the same values.
This overlap is a form of access locality and constitutes a reuse opportunity, often called "stationarity" in the context of spatial architectures.
For kernels presenting affine transformations of indices, like convolutions and, more generally, stencil patterns, the partial overlap arising from the computation's sliding window also yields a reuse opportunity, known as the "ghost zone" or "halo".
Naturally, spatial architectures are more effective the more reuse opportunities are present. At the same time, limited hardware resources mean that not all opportunities can be leveraged at once, requiring proper planning of the computation to exploit the most effective ones.
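As a concrete illustration of how reuse opportunities can be read off a kernel's data dependencies, the short Python sketch below counts, for a 1-D convolution with arbitrary example sizes, how many output windows touch each input element; any count above one is a halo reuse opportunity, while the reuse of each filter weight across all windows is a form of stationarity.

from collections import Counter

def convolution_reuse(input_len=16, filter_len=4):
    """Count how often each input index is needed across sliding windows."""
    output_len = input_len - filter_len + 1
    uses = Counter()
    for o in range(output_len):        # one iteration per output element
        for f in range(filter_len):    # sliding window over the input
            uses[o + f] += 1           # Out[o] += Filter[f] * In[o + f]
    return uses

uses = convolution_reuse()
# Interior input elements are read filter_len = 4 times: the halo overlap.
print(max(uses.values()))  # 4
# Each filter weight is needed by all 13 output positions: weight stationarity.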

Mapping computations

To run a kernel on a spatial architecture, a mapping must be constructed, detailing how the execution will unfold.
Mapping a workload to a spatial architecture requires binding each of its computations to a processing element and then scheduling both computations and the data movements required to support them.
A good choice of mapping is crucial to maximize performance.
The starting point for a mapping is the loop nest representation of a kernel.
To leverage parallelism and data reuse simultaneously, iterations must be divided between processing elements while taking care that inter-iteration data dependencies are handled by the same element or neighborhood of elements.
For illustrative purposes, the following mapping example focuses on a matrix multiplication, but everything remains generalizable to any data-parallel kernel.
This choice stems from the fact that most works on spatial architectures focus on neural network support and related optimizations.
Note that a similar parallelization of matrix multiplication has also been discussed in the context of multiprocessor systems.
All mappings can be constructed through three loop transformations of the original kernel:
  • Loop tiling: loops are split into progressively smaller blocks, yielding block matrices small enough to fit in inner memories. This is repeated for every memory in the hierarchy, creating for each a nested copy of the original loops. At any moment during execution, a memory only needs to store the data required by the iterations of its own copies of the loops and of the inner ones.
  • Parallelization: similar to tiling, except that the resulting tiles are processed concurrently across multiple processing elements rather than sequentially.
  • Computation ordering: loops can be arbitrarily reordered inside each copy of the original loop nest. This alters the data access pattern, changing which values are used in consecutive iterations and when. A minimal sketch of these transformations follows this list.
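The following Python sketch illustrates the three transformations on a single loop of length M; the tile size T, the body() placeholder, and the comments on parallelization and ordering are purely illustrative assumptions, not a fixed notation.

M, T = 128, 16  # illustrative loop length and tile size (T divides M)

def body(m):
    pass  # placeholder for the original loop body

# Original loop: for m in range(M): body(m)

# Loop tiling: one copy of the loop per memory level, iterating over tiles.
for m_outer in range(M // T):      # iterations handled by the outer memory
    for m_inner in range(T):       # iterations handled by an inner memory
        body(m_outer * T + m_inner)

# Parallelization would turn the inner copy into a "pfor": its T iterations
# run concurrently on T processing elements instead of sequentially.

# Computation ordering: when several loops coexist at the same level (e.g.
# over m_outer and n_outer for a matrix), they may be freely interchanged,
# changing which operand stays stationary between consecutive iterations.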
Each memory hierarchy level is in charge of progressing through its assigned iterations.
After parallelization, each processing element runs the same inner loops on partially different data.
Exploited data reuse is implicit in this formalism.
Tiling enables temporal reuse, while parallelization enables spatial reuse. Finally, the order of computation determines which values can actually undergo reuse.
Consider, for this example, a spatial architecture comprising a large scratchpad storing operands in their entirety and an array of processing elements, each with a small register file and a multiply-and-accumulate unit.
Then, let the original kernel be this matrix multiplication:

M, K, N = 128, 64, 256
for m in [0:M):
  for k in [0:K):
    for n in [0:N):
      Out[m][n] += W[m][k] * In[k][n]

A complete mapping, with a tiled and parallelized kernel and a fully specified dataflow, can be written as follows. Iterations that are distributed across processing elements to run concurrently are marked with pfor:

1. outer memory:
for m_mem in [0:8):
  for n_mem in [0:256):
    2. processing element array:
    pfor m_par in [0:16):
      pfor k_par in [0:16):
        3. PE registers:
        for n_pe in [0:1):
          for k_pe in [0:4):
            Out[m][n] += W[m][k] * In[k][n]

with m = m_mem*16 + m_par, k = k_par*4 + k_pe, and n = n_mem + n_pe.
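As a sanity check, the following Python sketch is a purely functional model of the mapping above: the pfor loops are run sequentially, since their iterations are independent across processing elements, and the single-iteration n_pe loop is omitted. It verifies that the mapped loop nest matches the untransformed kernel.

import random

M, K, N = 128, 64, 256
W  = [[random.random() for _ in range(K)] for _ in range(M)]
In = [[random.random() for _ in range(N)] for _ in range(K)]

# Reference result from the untransformed kernel.
Ref = [[sum(W[m][k] * In[k][n] for k in range(K)) for n in range(N)]
       for m in range(M)]

Out = [[0.0] * N for _ in range(M)]
for m_mem in range(8):                      # 1. outer memory
    for n_mem in range(256):
        for m_par in range(16):             # 2. PE array (pfor)
            for k_par in range(16):         #    (pfor, outputs reduced)
                for k_pe in range(4):       # 3. PE registers
                    m = m_mem * 16 + m_par
                    k = k_par * 4 + k_pe
                    n = n_mem
                    Out[m][n] += W[m][k] * In[k][n]

assert all(abs(Out[m][n] - Ref[m][n]) < 1e-9
           for m in range(M) for n in range(N))
print("mapped execution matches the reference kernel")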

Temporal data reuse can be observed in the form of stationarity, with outputs accumulated in-place during k_pe iterations and weights remaining stationary during n_mem ones.
Spatial reuse occurs as the reduction of outputs over k_par and the multicast of inputs throughout m_par.
Weights, in contrast, are not spatially reused: each processing element holds a unique tile of them.
When inferring the reuse opportunities exploited by a mapping, any loop with a single iteration must be ignored, and, for each operand, only the loops affecting its indices need to be considered.
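This rule can be made mechanical. The following Python sketch is a hypothetical helper (not an established tool) that, given each loop's level, kind, and trip count from the mapping above and the indices each operand depends on, lists the reuse opportunities: single-iteration loops are skipped, and an operand can be reused over a loop only if that loop's index does not appear among its own.

# Hypothetical helper: classify the reuse opportunities of a mapping.
# Each loop: (name, level, kind, trips); kind is "for" or "pfor".
LOOPS = [
    ("m_mem", "outer memory", "for", 8),
    ("n_mem", "outer memory", "for", 256),
    ("m_par", "PE array", "pfor", 16),
    ("k_par", "PE array", "pfor", 16),
    ("n_pe", "PE registers", "for", 1),
    ("k_pe", "PE registers", "for", 4),
]
# Each operand maps to the set of kernel indices it depends on.
OPERANDS = {"W": {"m", "k"}, "In": {"k", "n"}, "Out": {"m", "n"}}

def reuse_opportunities(loops, operands):
    found = []
    for name, level, kind, trips in loops:
        if trips == 1:               # single-iteration loops are ignored
            continue
        index = name.split("_")[0]   # the kernel index this loop iterates
        for op, deps in operands.items():
            if index not in deps:    # operand untouched by this loop
                what = "multicast/reduction" if kind == "pfor" else "stationary"
                found.append(f"{op} {what} over {name} ({level})")
    return found

for line in reuse_opportunities(LOOPS, OPERANDS):
    print(line)

Among other opportunities, this reports W stationary over n_mem, Out stationary over k_pe, In multicast over m_par, and Out reduced over k_par, matching the analysis above.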

Mapping optimization

The number of possible mappings varies depending on the target hardware, but it easily runs into the billions due to the many possible combinations of the above choices.
As a result, finding the best set of transformations yielding the highest data reuse, and thus the lowest running latency and power consumption for a spatial architecture and kernel, is a challenging optimization problem.
Most spatial architecture designs have been developed together with a tailored mapping technique.
The complexity of the problem has, however, motivated the development of dedicated mapping tools that can work for a variety of spatial architectures and implement general heuristics that consistently find good mappings within reasonable time.
Techniques successfully applied to the problem include:
  • Pruned search: since many mappings yield identical behavior, redundant ones can be pruned; a random search over the remaining space is then often able to find good mappings. A minimal sketch of this approach follows the list.
  • Genetic algorithms have been used to iteratively improve an initial set of diverse random mappings by transplanting between them the most successful loop transformations.
  • Simulated annealing also starts from a random pool of mappings, iteratively applying to them random transformations. Each transformation is then retained with a probability that is proportional to its performance improvement and inversely proportional to the elapsed time since the start of the exploration.
  • Integer programming can be applied by reformulating the problem of assigning kernel iterations to different loops and reordering them as a generalized assignment problem. It can then be solved through dedicated tools, like Gurobi.
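As a minimal sketch of the pruned-search approach (the buffer capacity, the traffic model that counts only outer-memory accesses, and the sample size are all assumed for illustration), the following Python code enumerates tilings of the matrix multiplication used earlier, prunes those that do not fit the assumed buffer, treats all outer-loop orderings as equivalent under the model, and randomly samples the remaining space.

import random
from itertools import product

M, K, N = 128, 64, 256
BUFFER_WORDS = 4096  # assumed on-chip buffer capacity, in matrix elements

def divisors(x):
    return [d for d in range(1, x + 1) if x % d == 0]

def traffic(tile_m, tile_k, tile_n):
    """Outer-memory accesses for a two-level tiling, ignoring loop order:
    each W tile is re-fetched for every n tile, each In tile for every
    m tile, and each Out tile is read and written once per k tile."""
    m_outer, k_outer, n_outer = M // tile_m, K // tile_k, N // tile_n
    return M * K * n_outer + K * N * m_outer + 2 * M * N * k_outer

# Pruning: tilings whose tiles exceed the buffer are infeasible, and all
# orderings of the outer loops behave identically under this cost model,
# so only distinct feasible tile shapes are kept as candidates.
candidates = [(tm, tk, tn)
              for tm, tk, tn in product(divisors(M), divisors(K), divisors(N))
              if tm * tk + tk * tn + tm * tn <= BUFFER_WORDS]

# Random search over the pruned space.
sampled = random.sample(candidates, min(200, len(candidates)))
best = min(sampled, key=lambda t: traffic(*t))
print("best (tile_m, tile_k, tile_n):", best, "-> traffic:", traffic(*best))

A production mapper would replace this traffic function with a detailed analytical or simulated model of the target architecture, but the overall structure of pruning followed by search is the same.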