Communication-avoiding algorithm
Communication-avoiding algorithms minimize the movement of data within a memory hierarchy in order to improve running time and energy consumption. They minimize the total of two costs: arithmetic and communication. Communication, in this context, refers to moving data, either between levels of memory or between multiple processors over a network; it is much more expensive than arithmetic.
Formal theory
Two-level memory model
A common computational model in analyzing communication-avoiding algorithms is the two-level memory model:
- There is one processor and two levels of memory.
- Level 1 memory is infinitely large. Level 0 memory ("cache") has size M.
- In the beginning, the input resides in level 1. In the end, the output resides in level 1.
- The processor can only operate on data in cache (level 0).
- The goal is to minimize data transfers between the two levels of memory.
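As an illustrative sketch of this model (the class and its eviction policy are hypothetical, not from the source), one can simulate a bounded fast memory and count every word moved between the two levels:

```python
# Toy sketch of the two-level memory model: slow memory holds everything,
# fast memory ("cache") holds at most M words, and we count each word
# moved between the two levels. Eviction here is arbitrary for simplicity.

class TwoLevelMemory:
    def __init__(self, M):
        self.M = M              # fast-memory capacity in words
        self.fast = set()       # addresses currently in fast memory
        self.transfers = 0      # words moved between the two levels

    def load(self, addr):
        """Bring one word into fast memory, evicting arbitrarily if full."""
        if addr not in self.fast:
            if len(self.fast) >= self.M:
                self.fast.pop()     # arbitrary eviction
            self.fast.add(addr)
            self.transfers += 1

mem = TwoLevelMemory(M=4)
for addr in [0, 1, 2, 0, 1, 5, 6, 7]:
    mem.load(addr)
print(mem.transfers)   # 6 misses; the two repeated accesses hit fast memory
```

A communication-avoiding algorithm is one that orders its accesses so that `transfers` stays as small as possible for the given M.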
Matrix multiplication
Corollary 6.2: Any conventional O(n³) matrix multiplication in the two-level memory model must move Ω(n³/√M) words between the two levels of memory. More general results for other numerical linear algebra operations can be found in the literature.
Motivation
Consider the following running-time model:
- Measure of computation = time per FLOP = γ
- Measure of communication = time per word of data moved = β

Total running time = γ · (number of FLOPs) + β · (number of words moved).
Since β ≫ γ as measured in both time and energy, communication cost dominates computation cost. Technological trends indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices. It has also been predicted that the gap between DRAM access time and FLOP rate will grow 100× over the coming decade, in order to balance power usage between processors and DRAM.
| | FLOP rate | DRAM bandwidth | Network bandwidth |
| --- | --- | --- | --- |
| Annual improvement | 59% | 23% | 26% |
Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.
United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress.
Objectives
Communication-avoiding algorithms are designed with the following objectives:
- Reorganize algorithms to reduce communication across all levels of the memory hierarchy.
- Attain the lower bound on communication when possible.
Matrix multiplication example
Let A, B and C be square matrices of order n × n. The following naive algorithm implements C = C + A * B:

for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)

Arithmetic cost: n²(2n − 1) for sufficiently large n, or O(n³).
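The triple loop above, transcribed directly into runnable Python (0-based indices; matrices stored as lists of lists):

```python
def matmul_naive(A, B, C):
    """C = C + A * B for square matrices stored as lists of lists."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# Small check: multiplying A by the identity leaves A unchanged.
A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]
C = [[0, 0], [0, 0]]
print(matmul_naive(A, B, C))   # [[1, 2], [3, 4]]
```

Each of the n² entries of C requires n multiplications and n − 1 additions, giving the n²(2n − 1) FLOP count above.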
Rewriting this algorithm with its communication cost labelled at each step:

for i = 1 to n
    {read row i of A into fast memory}               - n² reads
    for j = 1 to n
        {read C(i,j) into fast memory}               - n² reads
        {read column j of B into fast memory}        - n³ reads
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}           - n² writes
Fast memory may be defined as the local processor memory (cache) of size M, and slow memory as the DRAM.
Communication cost: n³ + 3n² reads/writes, or O(n³).
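Instrumenting the loop nest to count slow-memory traffic reproduces this total; a sketch (the function name and counting variables are mine, but the counting scheme follows the annotated loop above):

```python
def naive_matmul_traffic(n):
    """Count slow-memory reads/writes of the naive algorithm, assuming
    each row of A is read once per i, and C(i,j) and column j of B are
    read for every (i, j) pair, as annotated in the text."""
    reads = writes = 0
    for i in range(n):
        reads += n                # row i of A: n words
        for j in range(n):
            reads += 1            # C(i,j)
            reads += n            # column j of B: n words
            # inner k-loop runs entirely from fast memory: no traffic
            writes += 1           # write C(i,j) back
    return reads + writes

n = 10
print(naive_matmul_traffic(n), n**3 + 3 * n**2)   # both 1300
```

The counter confirms n³ + 3n² words moved: the n³ term comes from re-reading a column of B for every entry of C.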
Since total running time = γ·O(n³) + β·O(n³) and β ≫ γ, the communication cost is dominant. The blocked matrix multiplication algorithm reduces this dominant term:
Blocked (tiled) matrix multiplication
Consider A, B and C to be n/b-by-n/b matrices of b-by-b sub-blocks, where b is called the block size; assume three b-by-b blocks fit in fast memory.

for i = 1 to n/b
    for j = 1 to n/b
        {read block C(i,j) into fast memory}      - b² × (n/b)² = n² reads
        for k = 1 to n/b
            {read block A(i,k) into fast memory}  - b² × (n/b)³ = n³/b reads
            {read block B(k,j) into fast memory}  - b² × (n/b)³ = n³/b reads
            C(i,j) = C(i,j) + A(i,k) * B(k,j)     - {b-by-b matrix multiply in fast memory}
        {write block C(i,j) back to slow memory}  - b² × (n/b)² = n² writes
Communication cost: 2n³/b + 2n² reads/writes ≪ 2n³ arithmetic cost
Making b as large as possible, i.e. 3b² ≤ M so that b = (M/3)^(1/2), we achieve the following communication lower bound:

2√3 · n³/√M + 2n² = Ω(n³/√M)
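A runnable sketch of the blocked algorithm (variable names are mine), checking both correctness against a direct triple-loop product and the 2n³/b + 2n² traffic count:

```python
def blocked_matmul(A, B, n, b):
    """Blocked C = A * B; returns (C, words moved between slow and fast memory)."""
    assert n % b == 0
    C = [[0] * n for _ in range(n)]
    traffic = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            traffic += b * b                      # read block C(i,j)
            for k in range(0, n, b):
                traffic += 2 * b * b              # read blocks A(i,k) and B(k,j)
                for ii in range(i, i + b):        # b-by-b block multiply in fast memory
                    for jj in range(j, j + b):
                        for kk in range(k, k + b):
                            C[ii][jj] += A[ii][kk] * B[kk][jj]
            traffic += b * b                      # write block C(i,j) back
    return C, traffic

n, b = 8, 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
C, traffic = blocked_matmul(A, B, n, b)
print(traffic == 2 * n**3 // b + 2 * n**2)   # True
```

Doubling b from 4 to 8 (when M permits) halves the dominant 2n³/b term, which is exactly why the algorithm picks b as large as fast memory allows.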