Operational taxonomic unit

An operational taxonomic unit is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy.
Nowadays, however, the term is commonly used in a different context and refers to clusters of organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene. In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S or 18S rRNA marker gene sequence datasets.

Molecular OTU by clustering of marker gene sequences

In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison.
Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97-99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity is also common, also known as single variants.
It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing.

OTU clustering approaches

There are three main approaches to clustering OTUs:De novo, for which the clustering is based on similarities between sequencing reads.

Closed-reference, for which the clustering is performed against a reference database of sequences.
Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo.

Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST.

OTU clustering algorithms

Hierarchical clustering algorithms : uclust & cd-hit & ESPRIT
Bayesian clustering: CROP

Molecular OTU by other methods

In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement.
Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis.
Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs.