Long-tail traffic
A long-tailed or heavy-tailed distribution is one that assigns relatively high probabilities to regions far from the mean or median. A more formal mathematical definition is given below. In the context of teletraffic engineering a number of quantities of interest have been shown to have a long-tailed distribution. For example, if we consider the sizes of files transferred from a web server, then, to a good degree of accuracy, the distribution is heavy-tailed, that is, there are a large number of small files transferred but, crucially, the number of very large files transferred remains a major component of the volume downloaded.
Many processes are technically long-range dependent but not self-similar. The differences between these two phenomena are subtle. Heavy-tailed refers to a probability distribution, and long-range dependent refers to a property of a time series and so these should be used with care and a distinction should be made. The terms are distinct although superpositions of samples from heavy-tailed distributions aggregate to form long-range dependent time series.
Additionally, there is Brownian motion which is self-similar but not long-range dependent.
Overview
The design of robust and reliable networks and network services has become an increasingly challenging task in today's Internet world. To achieve this goal, understanding the characteristics of Internet traffic plays a more and more critical role. Empirical studies of measured traffic traces have led to the wide recognition of self-similarity in network traffic.Self-similar Ethernet traffic exhibits dependencies over a long range of time scales. This is to be contrasted with telephone traffic which is Poisson in its arrival and departure process.
With many time-series if the series is averaged then the data begins to look smoother. However, with self-similar data, one is confronted with traces that are spiky and bursty, even at large scales. Such behaviour is caused by strong dependence in the data: large values tend to come in clusters, and clusters of clusters, etc. This can have far-reaching consequences for network performance.
Heavy-tail distributions have been observed in many natural phenomena including both physical and sociological phenomena. Mandelbrot established the use of heavy-tail distributions to model real-world fractal phenomena, e.g. Stock markets, earthquakes, and the weather.
Ethernet, WWW, SS7, TCP, FTP, TELNET and VBR video traffic is self-similar.
Self-similarity in packetised data networks can be caused by the distribution of file sizes, human interactions and/or Ethernet dynamics. Self-similar and long-range dependent characteristics in computer networks present a fundamentally different set of problems to people doing analysis and/or design of networks, and many of the previous assumptions upon which systems have been built are no longer valid in the presence of self-similarity.
Short-range dependence vs. long-range dependence
Long-range and short-range dependent processes are characterised by their autocovariance functions.In short-range dependent processes, the coupling between values at different times decreases rapidly as the time difference increases.
- The sum of the autocorrelation function over all lags is finite.
- As the lag increases, the autocorrelation function of short-range dependent processes decays quickly.
- The area under the autocorrelation function summed over all lags is infinite.
- The decay of the autocorrelation function is often assumed to have the specific functional form,
Long-range dependence as a consequence of mathematical convergence
Such power law scaling of the autocorrelation function can be shown to be biconditionally related to a power law relationship between the variance and the mean, when evaluated from sequences by the method of expanding bins. This variance to mean power law is an inherent feature of a family of statistical distributions called the Tweedie exponential dispersion models. Much as the central limit theorem explains how certain types of random data converge towards the form of a normal distribution there exists a related theorem, the Tweedie convergence theorem that explains how other types of random data will converge towards the form of these Tweedie distributions, and consequently express both the variance to mean power law and a power law decay in their autocorrelation functions.The Poisson distribution and traffic
Before the heavy-tail distribution is introduced mathematically, the memoryless Poisson distribution, used to model traditional telephony networks, is briefly reviewed below. For more details, see the article on the Poisson distribution.Assuming pure-chance arrivals and pure-chance terminations leads to the following:
- The number of call arrivals in a given time has a Poisson distribution, i.e.:
- The number of call departures in a given time also has a Poisson distribution, i.e.:
- The intervals, T, between call arrivals and departures are intervals between independent, identically distributed random events. It can be shown that these intervals have a negative exponential distribution, i.e.:
Information on the fundamentals of statistics and probability theory can be found in the external links section.
The heavy-tail distribution
Heavy-tail distributions have properties that are qualitatively different from commonly used distributions such as the exponential distribution.The Hurst parameter H is a measure of the level of self-similarity of a time series that exhibits long-range dependence, to which the heavy-tail distribution can be applied. H takes on values from 0.5 to 1. A value of 0.5 indicates the data is uncorrelated or has only short-range correlations. The closer H is to 1, the greater the degree of persistence or long-range dependence.
Typical values of the Hurst parameter, H:
- Any pure random process has H = 0.5
- Phenomena with H > 0.5 typically have a complex process structure.
This means that regardless of the distribution for small values of the random variable, if the asymptotic shape of the distribution is hyperbolic, it is heavy-tailed. The simplest heavy-tail distribution is the Pareto distribution which is hyperbolic over its entire range. Complementary distribution functions for the exponential and Pareto distributions are shown below. Shown on the left is a graph of the distributions shown on linear axes, spanning a large domain. To its right is a graph of the complementary distribution functions over a smaller domain, and with a logarithmic range.
If the logarithm of the range of an exponential distribution is taken, the resulting plot is linear. In contrast, that of the heavy-tail distribution is still curvilinear. These characteristics can be clearly seen on the graph above to the right. A characteristic of long-tail distributions is that if the logarithm of both the range and the domain is taken, the tail of the long-tail distribution is approximately linear over many orders of magnitude. In the graph above left, the condition for the existence of a heavy-tail distribution, as previously presented, is not met by the curve labelled "Gamma-Exponential Tail".
The probability mass function of a heavy-tail distribution is given by:
and its cumulative distribution function is given by:
where k represents the smallest value the random variable can take.
Readers interested in a more rigorous mathematical treatment of the subject are referred to the external links section.
What causes long-tail traffic?
In general, there are three main theories for the causes of long-tail traffic. First, is a cause based in the application layer which theorizes that user session durations vary with a long-tail distribution due to the file size distribution. If the distribution of file sizes is heavy-tailed then the superposition of many file transfers in a client/server network environment will be long-range dependent. Additionally, this causal mechanism is robust with respect to changes in network resources and network topology. This is currently the most popular explanation in the engineering literature and the one with the most empirical evidence through observed file size distributions.Second, is a transport layer cause which theorizes that the feedback between multiple TCP streams due to TCP's congestion avoidance algorithm in moderate to high packet loss situations causes self-similar traffic or at least allows it to propagate. However, this is believed only to be a significant factor at relatively short timescales and not the long-term cause of self-similar traffic.
Finally, is a theorized link layer cause which is predicated based on physics simulations of packet switching networks on simulated topologies. At a critical packet creation rate, the flow in a network becomes congested and exhibits 1/f noise and long-tail traffic characteristics. There have been criticisms on these sorts of models though as being unrealistic in that network traffic is long-tailed even in non-congested regions and at all levels of traffic.
Simulation showed that long-range dependence could arise in the queue
length dynamics at a given node within a communications network even when the traffic sources are free of long-range dependence. The mechanism for this is believed to relate to feedback from routing effects in the simulation.