Find first set
In computer software and hardware, find first set or find first one is a bit operation that, given an unsigned machine word, designates the index or position of the least significant bit set to one in the word counting from the least significant bit position. A nearly equivalent operation is count trailing zeros or number of trailing zeros, which counts the number of zero bits following the least significant one bit. The complementary operation that finds the index or position of the most significant set bit is log base 2, so called because it computes the binary logarithm. This is closely related to count leading zeros or number of leading zeros, which counts the number of zero bits preceding the most significant one bit.
There are two common variants of find first set, the POSIX definition which starts indexing of bits at 1, herein labelled ffs, and the variant which starts indexing of bits at zero, which is equivalent to ctz and so will be called by that name.
Most modern CPU instruction set architectures provide one or more of these as hardware operators; software emulation is usually provided for any that aren't available, either as compiler intrinsics or in system libraries.
Examples
Given the following 32-bit word:The count trailing zeros operation would return 3, while the count leading zeros operation returns 16. The count leading zeros operation depends on the word size: if this 32-bit word were truncated to a 16-bit word, count leading zeros would return zero. The find first set operation would return 4, indicating the 4th position from the right. The truncated log base 2 is 15.
Similarly, given the following 32-bit word, the bitwise negation of the above word:
The count trailing ones operation would return 3, the count leading ones operation would return 16, and the find first zero operation ffz would return 4.
If the word is zero, count leading zeros and count trailing zeros both return the number of bits in the word, while ffs returns zero. Both log base 2 and zero-based implementations of find first set generally return an undefined result for the zero word.
Hardware support
Many architectures include instructions to rapidly perform find first set and/or related operations, listed below. The most common operation is count leading zeros, likely because all other operations can be implemented efficiently in terms of it.| Platform | Mnemonic | Name | Operand widths | Description | On application to 0 |
| ARM except Cortex-M0/M0+/M1/M23 | clz | Count Leading Zeros | 32 | clz | 32 |
| ARM | clz | Count Leading Zeros | 32, 64 | clz | Operand width |
| AVR32 | clz | Count Leading Zeros | 32 | clz | 32 |
| DEC Alpha | ctlz | Count Leading Zeros | 64 | clz | 64 |
| DEC Alpha | cttz | Count Trailing Zeros | 64 | ctz | 64 |
| Intel 80386 and later | bsf | Bit Scan Forward | 16, 32, 64 | ctz | Undefined; sets zero flag |
| Intel 80386 and later | bsr | Bit Scan Reverse | 16, 32, 64 | Log base 2 | Undefined; sets zero flag |
| x86 supporting BMI1 or ABM | lzcnt | Count Leading Zeros | 16, 32, 64 | clz | Operand width; sets carry flag |
| x86 supporting BMI1 | tzcnt | Count Trailing Zeros | 16, 32, 64 | ctz | Operand width; sets carry flag |
| Itanium | clz | Count Leading Zeros | 64 | clz | 64 |
| MIPS32/MIPS64 | clz | Count Leading Zeros in Word | 32, 64 | clz | Operand width |
| MIPS32/MIPS64 | clo | Count Leading Ones in Word | 32, 64 | clo | Operand width |
| Motorola 68020 and later | bfffo | Find First One in Bit Field | Arbitrary | Log base 2 | Field offset + field width |
| PDP-10 | jffo | Jump if Find First One | 36 | clz | 0; no operation |
| POWER/PowerPC/Power ISA | cntlz/cntlzw/cntlzd | Count Leading Zeros | 32, 64 | clz | Operand width |
| Power ISA 3.0 and later | cnttzw/cnttzd | Count Trailing Zeros | 32, 64 | ctz | Operand width |
| RISC-V | clz | Count Leading Zeros | 32, 64 | clz | Operand width |
| RISC-V | ctz | Count Trailing Zeros | 32, 64 | ctz | Operand width |
| SPARC Oracle Architecture 2011 and later | lzcnt | Leading Zero Count | 64 | clz | 64 |
| VAX | ffs | Find First Set | 0–32 | ctz | Operand width; sets zero flag |
| IBM z/Architecture | flogr | Find Leftmost One | 64 | clz | 64 |
| IBM z/Architecture | vclz | Vector Count Leading Zeroes | 8, 16, 32, 64 | clz | Operand width |
| IBM z/Architecture | vctz | Vector Count Trailing Zeroes | 8, 16, 32, 64 | ctz | Operand width |
On some Alpha platforms CTLZ and CTTZ are emulated in software.
Tool and library support
A number of compiler and library vendors supply compiler intrinsics or library functions to perform find first set and/or related operations, which are frequently implemented in terms of the hardware instructions above:| Tool/library | Name | Type | Input type | Notes | On application to 0 |
| C23 standard library | stdc_first_trailing_onestdc_trailing_zerosstdc_first_leading_onestdc_leading_zeros | Library function | unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long | Functions for zero and ones are also provided. | 0 Operand width |
| C++20 standard library | bit_ceil bit_floorbit_widthcountl_zero countl_onecountr_zero countr_one | Library function | unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long | ||
| POSIX.1 compliant libc 4.3BSD libc OS X 10.3 libc | ffs | Library function | int | Includes glibc. POSIX does not supply the complementary log base 2 / clz. | 0 |
| FreeBSD 5.3 libc OS X 10.4 libc | ffslflsflsl | Library function | int, long | fls computes + 1. | 0 |
| FreeBSD 7.1 libc | ffsllflsll | Library function | long long | 0 | |
| GCC 3.4.0 Clang 5.x | __builtin_ffs__builtin_clz__builtin_ctz | Built-in functions | unsigned int, unsigned long, unsigned long long, uintmax_t | GCC documentation considers result undefined clz and ctz on 0. | 0 |
| Visual Studio 2005 | _BitScanForward_BitScanReverse | Compiler intrinsics | unsigned long, unsigned __int64 | Separate return value to indicate zero input | Undefined |
| Visual Studio 2008 | __lzcnt | Compiler intrinsic | unsigned short, unsigned int, unsigned __int64 | Relies on hardware support for the lzcnt instruction introduced in BMI1 or ABM. | Operand width |
| Visual Studio 2012 | _arm_clz | Compiler intrinsic | unsigned int | Relies on hardware support for the clz instruction introduced in the ARM microarchitectures|ARMv5T architecture and later]. | ? |
| Intel C++ Compiler | _bit_scan_forward_bit_scan_reverse | Compiler intrinsics | int | Undefined | |
| Nvidia CUDA | __clz | Functions | 32-bit, 64-bit | Compiles to fewer instructions on the GeForce 400 series | 32 |
| Nvidia CUDA | __ffs | Functions | 32-bit, 64-bit | Compiles to fewer instructions on the GeForce 400 series | 0 |
| LLVM | llvm.ctlz.*llvm.cttz.* | Intrinsic | 8, 16, 32, 64, 256 | LLVM assembly language | Operand width, if 2nd argument is 0; undefined otherwise |
GHC 7.10, in Data.Bits | countLeadingZeroscountTrailingZeros | Library function | FiniteBits b => b | Haskell programming language | Operand width |
| Rust standard library | highest_oneleading_zeroslowest_onetrailing_zeros | Library method | All primitive integer types and other integer types such as NonZero | highest_one and lowest_one are not yet stable. | |
| Zig | @clz@ctz | builtin function | integer or vector |
Properties and relations
If bits are labeled starting at 1, then count trailing zeros and find first set operations are related by . If bits are labeled starting at, then count trailing zeros and find first set are exactly equivalent operations. Given bits per word, the is easily computed from the and vice versa by.As demonstrated in the example above, the find first zero, count leading ones, and count trailing ones operations can be implemented by negating the input and using find first set, count leading zeros, and count trailing zeros. The reverse is also true.
On platforms with an efficient log2 operation such as M68000, can be computed by:
where denotes bitwise AND and denotes the two's complement of. The expression clears all but the least-significant bit, so that the most- and least-significant bit are the same.
On platforms with an efficient count leading zeros operation such as ARM and PowerPC, can be computed by:
Conversely, on machines without or operators, can be computed using, albeit inefficiently:
On platforms with an efficient Hamming weight operation such as SPARC's
POPC or Blackfin's ONES, there is:where denotes bitwise exclusive-OR, denotes bitwise OR and denotes bitwise negation.
The inverse problem can be computed with a left-shift.
Find first set and related operations can be extended to arbitrarily large bit arrays in a straightforward manner by starting at one end and proceeding until a word that is not all-zero or not all-one is encountered. A tree data structure that recursively uses bitmaps to track which words are nonzero can accelerate this.
Software emulation
Most CPUs dating from the late 1980s onward have bit operators for ffs or equivalent, but a few modern ones like some of the ARM-Mx series do not. In lieu of hardware operators for ffs, clz and ctz, software can emulate them with shifts, integer arithmetic and bitwise operators. There are several approaches depending on architecture of the CPU and to a lesser extent, the programming language semantics and compiler code generation quality. The approaches may be loosely described as linear search, binary search, search+table lookup, de Bruijn multiplication, floating point conversion/exponent extract, and bit operator methods. There are tradeoffs between execution time and storage space as well as portability and efficiency.Software emulations are usually deterministic. They return a defined result for all input values; in particular, the result for an input of all zero bits is usually 0 for ffs, and the bit length of the operand for the other operations.
If one has a hardware clz or equivalent, ctz can be efficiently computed with bit operations, but the converse is not true: clz is not efficient to compute in the absence of a hardware operator.
2n
The function using shifts and bitwise ORs is not efficient to compute as in this 32-bit example and even more inefficient if we have a 64-bit or 128-bit operand:function pow2:
if x = 0 return invalid // invalid is implementation defined
x ← x - 1
for each y in : x ← x |
return x + 1
FFS
Since ffs = ctz + 1 or ffs = ctz, the applicable algorithms for ctz may be used, with a possible final step of adding 1 to the result, and returning 0 instead of the operand length for input of all zero bits.CTZ
The canonical algorithm is a loop counting zeros starting at the LSB until a 1-bit is encountered:function ctz1
if x = 0 return w
t ← 1
r ← 0
while = 0
t ← t << 1
r ← r + 1
return r
This algorithm executes O time and operations, and is impractical in practice due to a large number of conditional branches.
An exception is if the inputs are uniformly distributed. In that case, we can rely on the fact that half the return values will be 0, one quarter will be 1, and so on. The average number of loop iterations per function call is 1, and the algorithm executes in O average-case time.
A lookup table can eliminate most branches:
table = ctz for i in 1..2n-1
function ctz2
if x = 0 return w
r ← 0
while ≠ 0
x ← x >> n
r ← r + n
return r + table
The parameter n is fixed and represents a time–space tradeoff. The loop may also be fully unrolled. But as a linear lookup, this approach is still O in the number of bits in the operand.
If n = 4 is chosen, the table of 16 2-bit entries can be encoded in a single 32-bit constant using SIMD within a register techniques:
// binary
table ← 0x12131210
function ctz2a
if x = 0 return w
r ← 0
while ≠ 0
x ← x >> 4
r ← r + 4
return r + ;
A binary search implementation takes a logarithmic number of operations and branches, as in this 32-bit version:
function ctz3
if x = 0 return 32
n ← 0
if = 0: n ← n + 16, x ← x >> 16
if = 0: n ← n + 8, x ← x >> 8
if = 0: n ← n + 4, x ← x >> 4
if = 0: n ← n + 2, x ← x >> 2
if = 0: n ← n + 1
// Equivalently, n ← n + 1 -
return n
This algorithm can be assisted by a table as well, replacing the last 2 or 3 if statements with a 16- or 256-entry lookup table using the least significant bits of as an index.
As mentioned in, if the hardware has a clz operator, the most efficient approach to computing ctz is thus:
function ctz4
if x = 0 return w
// Isolates the LSB
x ← x & −x
return w − 1 − clz
A similar technique can take advantage of a population count instruction:
function ctz4a
if x = 0 return w
// Makes a mask of the least-significant bits
x ← x ^
return popcount − 1
An algorithm for 32-bit ctz uses de Bruijn sequences to construct a minimal perfect hash function that eliminates all branches.
This algorithm assumes that the result of the multiplication is truncated to 32 bit.
for i from 0 to 31: table ← i // table initialized
function ctz5
if x = 0 return 32
return table
The expression again isolates the least-significant 1 bit. There are then only 32 possible words, which the unsigned multiplication and shift hash to the correct position in the table. This algorithm is branch-free if it does not need to handle the zero input.
The technique can be extended to 64-bit words.
CLZ
The canonical algorithm examines one bit at a time starting from the MSB until a non-zero bit is found, as shown in this example. It executes in O time where n is the bit-length of the operand, and is not a practical algorithm for general use.function clz1
if x = 0 return w
t ← 1 <<
r ← 0
while = 0
t ← t >> 1
r ← r + 1
return r
An improvement on the previous looping approach examines eight bits at a time then uses a 256 entry lookup table for the first non-zero byte. This approach, however, is still O in execution time.
function clz2
if x = 0 return w
t ← 0xff <<
r ← 0
while = 0
t ← t >> 8
r ← r + 8
return r + table
Binary search can reduce execution time to O:
function clz3
if x = 0 return 32
n ← 0
if = 0: n ← n + 16, x ← x << 16
if = 0: n ← n + 8, x ← x << 8
if = 0: n ← n + 4, x ← x << 4
if = 0: n ← n + 2, x ← x << 2
if = 0: n ← n + 1
return n
The fastest portable approaches to simulate clz are a combination of binary search and table lookup: an 8-bit table lookup can replace the bottom 3 branches in binary search. 64-bit operands require an additional branch. A larger width lookup can be used but the maximum practical table size is limited by the size of L1 data cache on modern processors, which is 32 KB for many. Saving a branch is more than offset by the latency of an L1 cache miss.
An algorithm similar to de Bruijn multiplication for CTZ works for CLZ, but rather than isolating the most-significant bit, it rounds up to the nearest integer of the form 2n−1 using shifts and bitwise ORs:
table =
function clz4
for each y in : x ← x |
return table
For processors with deep pipelines, like Prescott and later Intel processors, it may be faster to replace branches by bitwise AND and OR operators to avoid pipeline flushes for mispredicted branches :
function clz5
r = << 4; x >>= r;
q = << 3; x >>= q; r |= q;
q = << 2; x >>= q; r |= q;
q = << 1; x >>= q; r |= q;
r |= ;
return r;
On platforms that provide hardware conversion of integers to floating point, the exponent field can be extracted and subtracted from a constant to compute the count of leading zeros. Corrections are needed to account for rounding errors. Floating point conversion can have substantial latency. This method is highly non-portable and not usually recommended.
int x;
int r;
union t;
t.u = 0x43300000; // LE is 1 for little-endian
t.u = x;
t.d -= 4503599627370496.0;
r = - 0x3FF; // log2
r++; // CLZ
Applications
The count leading zeros operation can be used to efficiently implement normalization, which encodes an integer as m × 2e, where m has its most significant bit in a known position. This can in turn be used to implement Newton–Raphson division, perform integer to floating point conversion in software, and other applications.Count leading zeros can be used to compute the 32-bit predicate "x = y" via the identity, where ">>" is unsigned right shift. It can be used to perform more sophisticated bit operations like finding the first string of n 1 bits. The expression is an effective initial guess for computing the square root of a 32-bit integer using Newton's method. CLZ can efficiently implement null suppression, a fast data compression technique that encodes an integer as the number of leading zero bytes together with the nonzero bytes. It can also efficiently generate exponentially distributed integers by taking the clz of uniformly random integers.
The log base 2 can be used to anticipate whether a multiplication will overflow, since.
Count leading zeros and count trailing zeros can be used together to implement Gosper's loop-detection algorithm, which can find the period of a function of finite range using limited resources.
The binary GCD algorithm spends many cycles removing trailing zeros; this can be replaced by a count trailing zeros followed by a shift. A similar loop appears in computations of the hailstone sequence.
A bit array can be used to implement a priority queue. In this context, find first set is useful in implementing the "pop" or "pull highest priority element" operation efficiently. The Linux kernel real-time scheduler internally uses
sched_find_first_bit for this purpose.The count trailing zeros operation gives a simple optimal solution to the Tower of Hanoi problem: the disks are numbered from zero, and at move k, disk number ctz is moved the minimum possible distance to the right. It can also generate a Gray code by taking an arbitrary word and flipping bit ctz at step k.