Stream processing


In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems use streaming algorithms to trace parallel processing for data streams. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.
The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data, a series of operations is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted, in order to minimize the loss in bandwidth, associated with external memory interaction. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks. Stream processing hardware can use scoreboarding, for example, to initiate a direct memory access when dependencies become known. The elimination of manual DMA management reduces software complexity, and an associated elimination for hardware cached I/O, reduces the data area expanse that has to be involved with service by specialized computational units such as arithmetic logic units.
During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL.

Applications

Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications but less so for general purpose processing with more randomized data access. By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor design may be tuned for maximum efficiency or a trade-off for flexibility.
Stream processing is especially suitable for applications that exhibit three application characteristics:
  • Compute intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity.
  • Data parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records.
  • Data locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.
Examples of records within streams include:
  • In graphics, each record might be the vertex, normal, and color information for a triangle;
  • In image processing, each record might be a single pixel from an image;
  • In a video encoder, each record may be 256 pixels forming a macroblock of data; or
  • In wireless signal processing, each record could be a sequence of samples received from an antenna.
For each record we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.

Code examples

By way of illustration, the following code fragments demonstrate detection of patterns within event streams. The first is an example of processing a data stream using a continuous SQL query. This code fragment illustrates a JOIN of two data streams, one for stock orders, and one for the resulting stock trades. The query outputs a stream of all Orders matched by a Trade within one second of the Order being placed. The output stream is sorted by timestamp, in this case, the timestamp from the Orders stream.

SELECT DataStream
Orders.TimeStamp, Orders.orderId, Orders.ticker,
Orders.amount, Trade.amount
FROM Orders
JOIN Trades OVER
ON Orders.orderId = Trades.orderId;

Another sample code fragment detects weddings among a flow of external "events" such as church bells ringing, the appearance of a man in a tuxedo or morning suit, a woman in a flowing white gown and rice flying through the air. A "complex" or "composite" event is what one infers from the individual simple events: a wedding is happening.

WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND

WITHIN 2 hours
ACTION Wedding

Comparison to prior parallel paradigms

Basic computers started from a sequential execution paradigm. Traditional CPUs are SISD based, which means they conceptually perform only one operation at a time.
As the computing needs of the world evolved, the amount of data to be managed increased very quickly. It was obvious that the sequential programming model could not cope with the increased need for processing power. Various efforts have been spent on finding alternative ways to perform massive amounts of computations but the only solution was to exploit some level of parallel execution. The result of those efforts was SIMD, a programming paradigm which allowed applying one instruction to multiple instances of data. Most of the time, SIMD was being used in a SWAR environment. By using more complicated structures, one could also have MIMD parallelism.
Although those two paradigms were efficient, real-world implementations were plagued with limitations from memory alignment problems to synchronization issues and limited parallelism. Only few SIMD processors survived as stand-alone components; most were embedded in standard CPUs.
Consider a simple program adding up two arrays containing 100 4-component vectors.

Conventional, sequential paradigm


for

This is the sequential paradigm that is most familiar. Variations do exist, but they ultimately boil down to that construct.

Parallel SIMD paradigm, packed registers (SWAR)


// for each vector
for

This is actually oversimplified. It assumes the instruction vector_sum works. Although this is what happens with instruction intrinsics, much information is actually not taken into account here such as the number of vector components and their data format. This is done for clarity.
You can see however, this method reduces the number of decoded instructions from numElements * componentsPerElement to numElements. The number of jump instructions is also decreased, as the loop is run fewer times. These gains result from the parallel execution of the four mathematical operations.
What happened however is that the packed SIMD register holds a certain amount of data so it's not possible to get more parallelism. The speed up is somewhat limited by the assumption we made of performing four parallel operations.

Parallel stream paradigm (SIMD/MIMD)


// This is a fictional language for demonstration purposes.
elements = array streamElement
kernel = instance streamKernel
result = kernel.invoke

In this paradigm, the whole dataset is defined, rather than each component block being defined separately. Describing the set of data is assumed to be in the first two rows. After that, the result is inferred from the sources and kernel. For simplicity, there's a 1:1 mapping between input and output data but this does not need to be. Applied kernels can also be much more complex.
An implementation of this paradigm can "unroll" a loop internally. This allows throughput to scale with chip complexity, easily utilizing hundreds of ALUs. The elimination of complex data patterns makes much of this extra power available.
While stream processing is a branch of SIMD/MIMD processing, they must not be confused. Although SIMD implementations can often work in a "streaming" manner, their performance is not comparable: the model envisions a very different usage pattern which allows far greater performance by itself.
It has been noted that when applied on generic processors such as standard CPU, only a 1.5x speedup can be reached. By contrast, ad-hoc stream processors easily reach over 10x performance, mainly attributed to the more efficient memory access and higher levels of parallel processing.
Although there are various degrees of flexibility allowed by the model, stream processors usually impose some limitations on the kernel or stream size. For example, consumer hardware often lacks the ability to perform high-precision math, lacks complex indirection chains or presents lower limits on the number of instructions which can be executed.

Research

stream processing projects included the Stanford Real-Time Programmable Shading Project started in 1999.
A prototype called Imagine was developed in 2002.
A project called Merrimac ran until about 2004.
AT&T also researched stream-enhanced processors as graphics processing units rapidly evolved in both speed and functionality. Since these early days, dozens of stream processing languages have been developed, as well as specialized hardware.