Quasi-delay-insensitive circuit


Quasi-delay-insensitive (QDI) design is an asynchronous circuit design methodology employed in digital logic design. Developed in response to the performance challenges of building sub-micron, multi-core architectures with conventional synchronous designs, QDI circuits exhibit lower power consumption, extremely fine-grain pipelining, high circuit robustness against process–voltage–temperature variations, on-demand operation, and data-dependent completion time.

Overview

Advantages
  • Robust against process variation, temperature fluctuation, circuit redesign, and FPGA remapping.
  • Natural event sequencing facilitates complex control circuitry.
  • Automatic clock gating and compute-dependent cycle time can save dynamic power and increase throughput by optimizing for average-case workload characteristics instead of worst-case.
Disadvantages
QDI circuits have been used to manufacture a large number of research chips, a small selection of which follows.
  • Caltech's asynchronous microprocessor and MIPS R3000 clone
  • Tokyo University's TITAC and TITAC-2 processors

Theory

The simplest QDI circuit is a ring oscillator implemented using a cycle of inverters. Each gate drives two events on its output node: either the pull-up network drives the node's voltage from GND to Vdd, or the pull-down network drives it from Vdd to GND. For a ring of three inverters, this gives the ring oscillator six events in total.
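The event count can be checked with a minimal sketch, assuming a ring of three inverters (the count implied by six events at two per gate). The function name `ring_oscillator_events`, the node numbering, and the initial node values are illustrative assumptions, not part of any standard tool.

```python
def ring_oscillator_events(stages=3, count=6):
    """Return the first `count` events of an inverter ring as (node, direction) pairs."""
    if stages < 3 or stages % 2 == 0:
        raise ValueError("stages must be an odd number of at least 3")
    # nodes[i] is the output of inverter i; its input is nodes[i - 1] (wrapping around).
    # Alternating initial values leave exactly one inverter with output equal to input.
    nodes = [1 if i % 2 == 0 else 0 for i in range(stages)]
    events = []
    for _ in range(count):
        # An inverter is enabled to fire when its output equals its input.
        enabled = [i for i in range(stages) if nodes[i] == nodes[i - 1]]
        assert len(enabled) == 1, "exactly one transition is pending at a time"
        i = enabled[0]
        nodes[i] = 1 - nodes[i]
        events.append((i, "+" if nodes[i] else "-"))
    return events


if __name__ == "__main__":
    for node, direction in ring_oscillator_events():
        print(f"n{node}{direction}")
```

After six events every node has risen once and fallen once, and the ring is back in its initial state.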
Multiple cycles may be connected using a multi-input gate. A c-element, which waits for its inputs to match before copying the value to its output, may be used to synchronize multiple cycles. If one cycle reaches the c-element before another, it is forced to wait. Synchronizing three or more of these cycles creates a pipeline allowing the cycles to trigger one after another.
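The waiting behaviour of a c-element can be illustrated with a small behavioural model. This is a sketch only: the class name `CElement` and its `update` method are hypothetical, and the model ignores delays and electrical detail.

```python
class CElement:
    """Behavioural model of a c-element: the output follows the inputs only when they all match."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        """Apply new input values and return the (possibly unchanged) output."""
        if not inputs:
            raise ValueError("a c-element needs at least one input")
        if all(value == inputs[0] for value in inputs):
            # All inputs agree, so the output copies their common value.
            self.out = inputs[0]
        # Otherwise the gate holds its previous output (state-holding behaviour).
        return self.out


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "->", c.update(a, b))
```

The first input to arrive changes nothing by itself; the output switches only once the slower input catches up, which is how one cycle is forced to wait for another.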
If cycles are known to be mutually exclusive, then they may be connected using combinational logic. This allows the active cycle to continue regardless of the inactive cycles, and is generally used to implement delay-insensitive encodings.
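As one illustration, a dual-rail (1-of-2) code is a common delay-insensitive encoding; the text above does not name a specific encoding, so its use here is an assumption. Because the two rails are mutually exclusive, a plain OR of the rails serves as the validity signal. The helper names below are hypothetical.

```python
NEUTRAL = (0, 0)  # neither rail raised: no data present


def encode(bit):
    """Encode one bit as (false_rail, true_rail); exactly one rail is high."""
    return (0, 1) if bit else (1, 0)


def is_valid(rails):
    """Combinational OR of the two mutually exclusive rails."""
    false_rail, true_rail = rails
    if false_rail and true_rail:
        raise ValueError("both rails high: the rails are not mutually exclusive")
    return bool(false_rail or true_rail)


def decode(rails):
    """Return the encoded bit, or raise if the codeword is neutral."""
    if not is_valid(rails):
        raise ValueError("no valid data on the rails")
    return rails[1]


if __name__ == "__main__":
    for bit in (0, 1):
        word = encode(bit)
        print(bit, word, is_valid(word), decode(word))
```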
Managing individual cycles becomes impractical for larger systems, so they are partitioned into processes. Each process describes the interaction between a set of cycles grouped into channels, and the process boundary breaks these cycles into channel ports. Each port has a set of request nodes that tend to encode data and acknowledge nodes that tend to be dataless. The process that drives the request is the sender, while the process that drives the acknowledgement is the receiver. The sender and receiver communicate using certain protocols, and the sequential triggering of communication actions from one process to the next is modeled as a token traversing the pipeline.
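The text does not say which protocol is used. As one possibility, a four-phase (return-to-zero) handshake can be sketched as follows; the protocol choice, the function name `four_phase_transfer`, and the event labels are assumptions made for illustration.

```python
def four_phase_transfer(values):
    """Move each value from a sender to a receiver, recording the handshake events."""
    req_data = None   # request nodes: carry the data, None when neutral
    ack = 0           # acknowledge node: dataless
    received = []
    trace = []
    for value in values:
        # Sender drives the request with data.
        req_data = value
        trace.append("req+")
        # Receiver sees valid data, stores it, and raises the acknowledge.
        received.append(req_data)
        ack = 1
        trace.append("ack+")
        # Sender sees the acknowledge and returns the request to neutral.
        req_data = None
        trace.append("req-")
        # Receiver sees the neutral request and lowers the acknowledge.
        ack = 0
        trace.append("ack-")
    assert req_data is None and ack == 0  # channel is idle again
    return received, trace


if __name__ == "__main__":
    data, events = four_phase_transfer([3, 1, 4])
    print(data)
    print(" ".join(events))
```

Each group of four events corresponds to one token passing from the sender to the receiver.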

Stability and non-interference

The correct operation of a QDI circuit requires that events be limited to monotonic digital transitions. Instability or interference can force the system into illegal states, causing incorrect or unstable results, deadlock, and circuit damage. The previously described cyclic structure that ensures stability is called acknowledgement. A transition T1 acknowledges another T2 if there is a causal sequence of events from T1 to T2 that prevents T2 from occurring until T1 has completed. For a delay-insensitive (DI) circuit, every transition must acknowledge every input to its associated gate. For a QDI circuit, there are a few exceptions in which the stability property is maintained using timing assumptions guaranteed with layout constraints rather than causality.

Isochronic fork assumption

An isochronic fork is a wire fork in which one end does not acknowledge the transition driving the wire. A good example of such a fork can be found in the standard implementation of a pre-charge half buffer. There are two types of isochronic forks. An asymmetric isochronic fork assumes that the transition on the non-acknowledging end happens before or when the transition has been observed on the acknowledging end. A symmetric isochronic fork ensures that both ends observe the transition simultaneously. In QDI circuits, every transition that drives a wire fork must be acknowledged by at least one end of that fork. This concept was first introduced by A. J. Martin to distinguish between asynchronous circuits that satisfy QDI requirements and those that do not. Martin also established that it is impossible to design useful systems without including at least some isochronic forks, given reasonable assumptions about the available circuit elements. Isochronic forks were long thought to be the weakest compromise away from fully delay-insensitive systems.
In fact, every CMOS gate has one or more internal isochronic forks between the pull-up and pull-down networks. The pull-down network only acknowledges the up-going transitions of the inputs while the pull-up network only acknowledges the down-going transitions.
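The split between the two networks can be made concrete with boolean conduction functions. The sketch below assumes a two-input NAND and a two-input c-element as examples, and its function names are hypothetical. It checks non-interference (the two networks never conduct together) and lists the input combinations in which neither network conducts.

```python
from itertools import product


def nand_pull_up(a, b):
    # PMOS transistors in parallel: conducts when any input is low.
    return (not a) or (not b)


def nand_pull_down(a, b):
    # NMOS transistors in series: conducts only when every input is high.
    return a and b


def c_element_pull_up(a, b):
    return (not a) and (not b)


def c_element_pull_down(a, b):
    return a and b


def is_non_interfering(pull_up, pull_down, n_inputs):
    """True if no input combination turns on both networks at once."""
    return not any(
        pull_up(*inputs) and pull_down(*inputs)
        for inputs in product([False, True], repeat=n_inputs)
    )


def floating_states(pull_up, pull_down, n_inputs):
    """Input combinations in which neither network drives the output."""
    return [
        inputs
        for inputs in product([False, True], repeat=n_inputs)
        if not pull_up(*inputs) and not pull_down(*inputs)
    ]


if __name__ == "__main__":
    print(is_non_interfering(nand_pull_up, nand_pull_down, 2))
    print(floating_states(c_element_pull_up, c_element_pull_down, 2))
```

In both examples the pull-down function is true only after inputs have risen and the pull-up function only after inputs have fallen, which mirrors the statement that each network acknowledges one direction of input transition.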

Adversarial path assumption

The adversarial path assumption also deals with wire forks, but is ultimately weaker than the isochronic fork assumption. At some point in the circuit after a wire fork, the two paths must merge back into one. The adversarial path is the one that fails to acknowledge the transition on the wire fork. This assumption states that the transition propagating down the acknowledging path reaches the merge point after the transition propagating down the adversarial path would have. This effectively extends the isochronic fork assumption beyond the confines of the forked wire and into the connected paths of gates.
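Read as a timing constraint, the assumption amounts to comparing two path delays between the fork and the merge point. The following sketch assumes each path is summarised by a list of per-stage delays in arbitrary units; the function names and the numbers are hypothetical, not measured data.

```python
def path_delay(stage_delays):
    """Total delay from the wire fork to the merge point along one path."""
    return sum(stage_delays)


def adversarial_path_assumption_holds(acknowledging_delays, adversarial_delays):
    """True if the adversarial path settles before the acknowledging path arrives."""
    return path_delay(adversarial_delays) < path_delay(acknowledging_delays)


def timing_margin(acknowledging_delays, adversarial_delays):
    """Slack by which the acknowledging path trails the adversarial path."""
    return path_delay(acknowledging_delays) - path_delay(adversarial_delays)


if __name__ == "__main__":
    acknowledging = [12, 9, 15]   # hypothetical: three gates on the acknowledging path
    adversarial = [5, 7]          # hypothetical: wire plus one gate on the adversarial path
    print(adversarial_path_assumption_holds(acknowledging, adversarial))
    print(timing_margin(acknowledging, adversarial))
```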

Half-cycle timing assumption

This assumption relaxes the QDI requirements a little further in the quest for performance. A c-element is effectively three gates (the logic, the driver, and the feedback) and is non-inverting. This becomes cumbersome and expensive if a large amount of logic is needed. The acknowledgement theorem states that the driver must acknowledge the logic. The half-cycle timing assumption assumes that the driver and feedback will stabilize before the inputs to the logic are allowed to switch. This allows the designer to use the output of the logic directly, bypassing the driver and making shorter cycles for higher-frequency processing.

Atomic complex gates

Much of the automatic synthesis literature uses atomic complex gates. A tree of gates is assumed to transition completely before any of the inputs at the leaves of the tree are allowed to switch again. While this assumption allows automatic synthesis tools to bypass the bubble reshuffling problem, the reliability of these gates tends to be difficult to guarantee.

Relative timing

Relative timing is a framework for making and implementing arbitrary timing assumptions in QDI circuits. It represents a timing assumption as a virtual causality arc that completes a broken cycle in the event graph. This allows designers to reason about timing assumptions as a way to realize circuits with higher throughput and energy efficiency by systematically sacrificing robustness.

Representations

Communicating hardware processes (CHP)

Communicating hardware processes is a program notation for QDI circuits inspired by Tony Hoare's communicating sequential processes and Edsger W. Dijkstra's guarded commands. The syntax is described below in descending precedence.
  • Skip: skip does nothing. It simply acts as a placeholder for pass-through conditions.
  • Dataless assignment: a+ sets the voltage of the node a to Vdd while a- sets the voltage of a to GND.
  • Assignment: a := e evaluates the expression e then assigns the resulting value to the variable a.
  • Send: X!e evaluates the expression e then sends the resulting value across the channel X. X! is a dataless send.
  • Receive: X?a waits until there is a valid value on the channel X then assigns that value to the variable a. X? is a dataless receive.
  • Probe: #X returns the value waiting on the channel X without executing the receive.
  • Simultaneous composition: S * T executes the process fragments S and T at the same time.
  • Internal parallel composition: S, T executes the process fragments S and T in any order.
  • Sequential composition: S; T executes the process fragment S followed by T.
  • Parallel composition: S || T executes the process fragments S and T in any order. This is functionally equivalent to internal parallel composition but with lower precedence.
  • Deterministic selection: [G0 -> S0[]G1 -> S1[]...[]Gn -> Sn] implements choice in which G0,G1,...,Gn are guards, which are dataless boolean expressions or data expressions that are implicitly cast using a validity check, and S0,S1,...,Sn are process fragments. Deterministic selection waits until one of the guards evaluates to Vdd, then proceeds to execute the guard's associated process fragment. If two guards evaluate to Vdd during the same window of time, an error occurs. [G] is shorthand for [G -> skip] and simply implements a wait.
  • Non-deterministic selection: [G0 -> S0:G1 -> S1:...:Gn -> Sn] is the same as deterministic selection except that more than one guard is allowed to evaluate to Vdd. Only the process fragment associated with the first guard to evaluate to Vdd is executed.
  • Repetition: *[G0 -> S0[]G1 -> S1[]...[]Gn -> Sn] or *[G0 -> S0:G1 -> S1:...:Gn -> Sn] is similar to the associated selection statements except that the action is repeated while any guard evaluates to Vdd. *[S] is shorthand for *[Vdd -> S] and implements infinite repetition.
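The behaviour of deterministic selection and repetition can be mimicked in software. The sketch below is an assumption-laden illustration rather than a CHP implementation: guards and process fragments are modelled as Python callables, a selection with no true guard returns False instead of waiting, and the helper names `select`, `repeat`, and `gcd_guarded` are hypothetical. The example program is the classic guarded-command loop that repeatedly subtracts the smaller of two values from the larger.

```python
def select(branches):
    """Run the fragment of the single true guard.

    `branches` is a list of (guard, fragment) pairs of zero-argument callables.
    Returns True if a fragment ran, False if no guard was true.
    """
    enabled = [fragment for guard, fragment in branches if guard()]
    if len(enabled) > 1:
        # Deterministic selection treats simultaneously true guards as an error.
        raise RuntimeError("more than one guard is true")
    if not enabled:
        return False  # hardware would wait here; the sketch just reports it
    enabled[0]()
    return True


def repeat(branches):
    """Repetition: keep selecting while any guard is true."""
    while select(branches):
        pass


def gcd_guarded(a, b):
    """Subtract the smaller value from the larger until the two are equal."""
    state = {"a": a, "b": b}

    def reduce_a():
        state["a"] -= state["b"]

    def reduce_b():
        state["b"] -= state["a"]

    repeat([
        (lambda: state["a"] > state["b"], reduce_a),
        (lambda: state["b"] > state["a"], reduce_b),
    ])
    return state["a"]


if __name__ == "__main__":
    print(gcd_guarded(12, 18))
```

The two guards can never be true together, so the loop satisfies the mutual-exclusion requirement of deterministic selection and stops once neither guard holds.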