Bit array
A bit array is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.
Definition
A bit array is a mapping from some domain to values in the set. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. As with other arrays, the access to a single bit can be managed by applying an index to the array. Assuming its size to be n bits, the array can be used to specify a subset of the domain, where a -bit indicates the presence and a -bit the absence of a number in the set. This set data structure uses about n/''w words of space, where w'' is the number of bits in each machine word. Whether the least significant bit or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred.A finite binary relation may be represented by a bit array called a logical matrix. In the calculus of relations, these arrays are composed with matrix multiplication where the arithmetic is Boolean, and such a composition represents composition of relations.
Basic operations
Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:Use
OR to set a bit to one:11101010
OR 00000100
= 11101110
AND to set a bit to zero:11101010
AND 11111101
= 11101000
AND to determine if a bit is set, by zero-testing:11101010
AND 00000001
= 00000000
11101010
AND 00000010
= 00000010
XOR to invert or toggle a bit:11101010
XOR 00000100
= 11101110
11101110
XOR 00000100
= 11101010
NOT to invert all bits:NOT 10110010
= 01001101
To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.
Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/''w simple bit operations each, as well as the complement of either:
for i from 0 to n/w-1
complement_a := not a
union := a or b
intersection := a and b
difference := a and
If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n''/w memory accesses are required:
for i from 0 to n/w-1
index := 0 // if needed
word := a
for b from 0 to w-1
value := word and 1 ≠ 0
word := word shift right 1
// do something with value
index := index + 1 // if needed
Both of these code samples exhibit ideal locality of reference, which will subsequently receive large performance boost from a data cache. If a cache line is k words, only about n/''wk'' cache misses will occur.
More complex operations
As with character strings it is straightforward to define length, substring, lexicographical compare, concatenation, reverse operations. The implementation of some of these operations is sensitive to endianness.Population / Hamming weight
If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation.Inversion
Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words.When this operation is not available on the processor, it's still possible to proceed by successive passes, in this example on 32 bits:
exchange two 16-bit halfwords
exchange bytes by pairs
...
swap bits by pairs
swap bits
The last operation can be written | ).
Find first one
The find first set or find first one operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support and efficient algorithms for its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size find first one to longer arrays, one can find the first nonzero word and then run find first one on that word. The related operations find first zero, count leading zeros, count leading ones, count trailing zeros, count trailing ones, and log base 2 can also be extended to a bit array in a straightforward manner.Compression
A bit array is the most dense storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one is independent. But most data are not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can be compressed. Run-length encoding is commonly used to compress these long streams. However, most compressed data formats are not so easy to access randomly; also by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism. Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words.Advantages and disadvantages
Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:- They are extremely compact; no other data structures can store n independent pieces of data in n/''w'' words.
- They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
- Because of their ability to exploit bit-level parallelism, limit memory access, and maximally use the data cache, they often outperform many other data structures on practical data sets, even those that are more asymptotically efficient.
- Without compression, they are wasteful set data structures for sparse sets in both time and space. For such applications, compressed bit arrays, Judy arrays, tries, or even Bloom filters should be considered instead.
- Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing.
Applications
Bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware.
Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel.
Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on bit arrays that accept either false positives or false negatives.
Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.
Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.
In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when, or roughly the term occurs in at least 38% of documents.