Source attribution
In the field of epidemiology, source attribution refers to a category of methods with the objective of reconstructing the transmission of an infectious disease from a specific source, such as a population, individual, or location. For example, source attribution methods may be used to trace the origin of a new pathogen that recently crossed from another host species into humans, or from one geographic region to another. It may be used to determine the common source of an outbreak of a foodborne infectious disease, such as a contaminated water supply. Finally, source attribution may be used to estimate the probability that an infection was transmitted from one specific individual to another, i.e., "who infected whom".
Source attribution can play an important role in public health surveillance and management of infectious disease outbreaks. In practice, it tends to be a problem of statistical inference, because transmission events are seldom observed directly and may have occurred in the distant past. Thus, there is an unavoidable level of uncertainty when reconstructing transmission events from residual evidence, such as the spatial distribution of the disease. As a result, source attribution models often employ Bayesian methods that can accommodate substantial uncertainty in model parameters.
Molecular source attribution is a subfield of source attribution that uses the molecular characteristics of the pathogen — most often its nucleic acid genome — to reconstruct transmission events. Many infectious diseases are routinely detected or characterized through genetic sequencing, which can be faster than culturing isolates in a reference laboratory and can identify specific strains of the pathogen at substantially higher precision than laboratory assays, such as antibody-based assays or drug susceptibility tests. On the other hand, analyzing the genetic sequence data requires specialized computational methods to fit models of transmission. Consequently, molecular source attribution is a highly interdisciplinary area of molecular epidemiology that incorporates concepts and skills from mathematical statistics and modeling, microbiology, public health and computational biology.
There are generally two ways that molecular data are used for source attribution. First, infections can be categorized into different "subtypes" that each corresponds to a unique molecular variety, or a cluster of similar varieties. Source attribution can then be inferred from the similarity of subtypes. Individual infections that belong to the same subtype are more likely to be related epidemiologically, including direct source-recipient transmission, because they have not substantially evolved away from their common ancestor. Similarly, we assume the true source population will have frequencies of subtypes that are more similar to the recipient population, relative to other potential sources. Second, molecular sequences from different infections can be directly compared to reconstruct a phylogenetic tree, which represents how they are related by common ancestors. The resulting phylogeny can approximate the transmission history, and a variety of methods have been developed to adjust for confounding factors.
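The population-level comparison of subtype frequencies can be illustrated with a minimal sketch. It is not a specific published attribution model: it simply ranks candidate sources by the total variation distance between their subtype frequencies and those of the recipient population. The function names, the labels `source_A` and `source_B`, and the toy subtype data are all hypothetical.

```python
from collections import Counter


def subtype_frequencies(subtypes):
    """Convert a list of subtype labels into relative frequencies."""
    counts = Counter(subtypes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def frequency_distance(p, q):
    """Total variation distance: 0 for identical frequencies, 1 for no shared subtypes."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


def rank_sources(recipient, sources):
    """Return (source name, distance) pairs, most similar source first."""
    r = subtype_frequencies(recipient)
    dists = {
        name: frequency_distance(r, subtype_frequencies(samples))
        for name, samples in sources.items()
    }
    return sorted(dists.items(), key=lambda item: item[1])


# Hypothetical toy data: subtype labels observed in each population.
recipient = ["ST1", "ST1", "ST2", "ST3"]
sources = {
    "source_A": ["ST1", "ST1", "ST1", "ST2"],
    "source_B": ["ST4", "ST4", "ST3", "ST5"],
}
ranking = rank_sources(recipient, sources)
```

In this toy data set, `source_A` ranks first because it shares most of its subtypes with the recipient population. Real applications typically replace the simple distance with a statistical model that accounts for sampling uncertainty.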
Due to the associated stigma and the criminalization of transmission for specific infectious diseases, molecular source attribution at the level of individuals can be a controversial use of data that was originally collected in a healthcare setting, with potentially severe legal consequences for individuals who become identified as putative sources. In these contexts, the development and application of molecular source attribution techniques may involve trade-offs between public health responsibilities and individual rights to data privacy.
Microbial subtyping
Microbial subtyping or strain typing is the use of laboratory methods to assign microbial samples to subtypes, which are predefined classifications based on distinct characteristics. The assignment of specimens to subtypes can provide a basis of source attribution, since we assume that a pathogen undergoes minimal change when transmitted to an uninfected host.
Therefore, infections of the same subtype are implied to be epidemiologically related, i.e., linked by one or more recent transmission events.
The assumption that the pathogen is unchanged when transmitted is generally reasonable if the rate of evolution for the pathogen is slower than the rate of transmission, such that few mutations are observed on an epidemiological time scale.
For example, suppose host A is infected by a pathogen that we have categorized as subtype 1.
Host A is more likely to have been infected by host B, who also carries the subtype 1 pathogen, than by host C, who carries the subtype 2 pathogen.
In other words, transmission from host B is a more parsimonious explanation if there is a relatively small probability that the pathogen population in host C evolved from subtype 1 to subtype 2 after transmission to host A.
Today it is more common to use genetic sequencing to characterize the microbial sample at the level of its nucleotide sequence by sequencing the whole genome or portions thereof.
However, other molecular methods such as restriction fragment length polymorphism analysis played an important role in microbial subtyping before genetic sequencing became an affordable and ubiquitous technology in reference laboratories.
Sequence-based typing methods confer an advantage over other laboratory methods because there is an enormous number of potential subtypes that can be resolved at the level of the genetic sequence.
Consider the above example again; however, this time host A carries the same infection subtype as many other hosts.
In this case we would have no information to differentiate between these hosts as the potential source of host A's infection.
Our ability to identify potential sources, therefore, depends on having a sufficient number of different subtypes.
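The two scenarios above can be sketched in a few lines. This is only an illustration under the stated assumption that a shared subtype implies a possible epidemiological link; the host labels and subtype assignments are hypothetical.

```python
def candidate_sources(recipient, subtypes):
    """List every other host that carries the same subtype as the recipient."""
    target = subtypes[recipient]
    return sorted(
        host for host, subtype in subtypes.items()
        if host != recipient and subtype == target
    )


# First scenario: only host B shares subtype 1 with host A.
few_matches = {"A": 1, "B": 1, "C": 2}

# Second scenario: subtype 1 is common, so the subtype alone cannot
# distinguish between hosts B, C and D as the source.
many_matches = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2}

narrow = candidate_sources("A", few_matches)
broad = candidate_sources("A", many_matches)
```

Here `narrow` contains a single candidate, whereas `broad` contains three hosts that are indistinguishable at this subtype resolution.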
However, defining too many subtypes in the population makes it likely that every individual carries a unique subtype, especially for rapidly evolving pathogens that can accumulate high levels of genetic diversity in a relatively short period of time.
Hence, there exists an intermediate level of subtype resolution that confers the greatest amount of information for source attribution.
When source attribution is considered for a pathogen with high diversity, such that most specimens have unique genetic sequences, it is useful to group multiple unique sequences with a clustering method.
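One simple way to group unique sequences is to link any two sequences whose pairwise distance falls at or below a threshold. The sketch below assumes aligned sequences of equal length, uses the number of mismatched positions (Hamming distance) as the distance, and forms single-linkage clusters; the sequences, names and thresholds are hypothetical, and other distance measures or clustering rules may be preferred in practice.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster(seqs, threshold):
    """Single-linkage clustering: link every pair at distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if hamming(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical aligned sequences, each one unique.
seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGTTCGA",
    "s4": "TTGTACCC",
}
strict = cluster(seqs, 0)   # every sequence is its own subtype
relaxed = cluster(seqs, 1)  # near-identical sequences are grouped
coarse = cluster(seqs, 4)   # everything collapses into one cluster
```

The three thresholds mirror the trade-off described above: a threshold of zero gives one subtype per specimen, a very permissive threshold gives a single uninformative cluster, and an intermediate value groups closely related infections.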
Single and multi-locus typing
Before whole-genome sequencing was cost-effective, targeting a specific part of the pathogen genome was an important step to facilitate microbial subtyping. For example, the ribosomal gene 16S is a standard target for identifying bacteria, in part because it is present across all known species and contains a mixture of conserved and variable regions.
Within a pathogen species, sequencing targets tended to be selected on the basis of their length, ubiquity and exposure to diversifying selection, which may be dictated by the function of the gene product for expressed regions.
For example, so-called "housekeeping" or core genes have indispensable biological functions, such as copying genetic material or building proteins.
These genes are often preferred candidates for microbial subtyping because they are less likely to be absent from a given genome.
Gene presence/absence is particularly relevant for bacteria where genetic material is frequently exchanged through horizontal gene transfer.
Targeting multiple regions of the pathogen genome confers greater precision to distinguish between lineages, since the chance to observe informative genetic differences between infections is increased.
This approach is referred to as multi-locus sequence typing.
Similar to single-locus typing, MLST requires the selection of specific loci to target for sequencing.
Moreover, for subtyping to be consistent across laboratories a reference database must be maintained that maps sequences from single or multiple loci to a fixed notation of allele numbers or designations.
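The role of the reference database can be sketched as two lookup tables: one that maps the sequence observed at each locus to an allele number, and one that maps the resulting allele profile to a subtype designation. The loci, sequences, allele numbers and `ST` labels below are hypothetical and far shorter than real alleles; actual MLST schemes typically use several loci and curated public databases.

```python
# Hypothetical reference database: locus -> {allele sequence: allele number}.
ALLELES = {
    "locusA": {"ACGT": 1, "ACGA": 2},
    "locusB": {"GGCC": 1, "GGCT": 2},
}

# Fixed locus order used to build an allele profile.
LOCI = ("locusA", "locusB")

# Hypothetical profile table: allele profile -> subtype designation.
PROFILES = {(1, 1): "ST1", (1, 2): "ST2", (2, 1): "ST3"}


def assign_sequence_type(sample):
    """Return (allele profile, subtype); unknown entries are reported as None."""
    profile = tuple(ALLELES[locus].get(sample[locus]) for locus in LOCI)
    if None in profile:
        # At least one locus carries an allele that is not in the database.
        return profile, None
    # A known set of alleles may still form a combination not yet catalogued.
    return profile, PROFILES.get(profile)


known = assign_sequence_type({"locusA": "ACGT", "locusB": "GGCT"})
novel_allele = assign_sequence_type({"locusA": "ACGT", "locusB": "GGTT"})
novel_profile = assign_sequence_type({"locusA": "ACGA", "locusB": "GGCT"})
```

Because every laboratory resolves sequences against the same tables, two laboratories that observe the same alleles report the same designation; novel alleles or novel combinations would need to be submitted to the database before they receive one.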
Whole genome sequencing
Although single- and multiple-locus subtyping is still predominantly used for molecular epidemiology, ongoing improvements in sequencing technologies and computing power continue to lower the barrier to whole-genome sequencing. Next-generation sequencing (NGS) technologies provide cost-effective methods to generate whole genome sequences from a given sample by individually amplifying and sequencing templates in parallel using customized technologies such as sequencing-by-synthesis.
Shotgun sequencing applications of NGS generate full-length genome sequences by shearing the nucleic acid extracted from the sample into small fragments, which are converted into a sequencing library. A de novo sequence assembler program then reconstitutes the genome sequence from the resulting sequence fragments.
Alternatively, short reads can be mapped to a reference genome sequence that has been converted into an index for efficient lookup of exact substring matches.
This approach can be faster than de novo assembly, but relies on having a reference genome that is sufficiently similar to the genome sequence of the sample.
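The idea of indexing a reference for exact substring lookup can be illustrated with a toy seed-and-extend mapper. This is a simplified sketch rather than the algorithm of any particular read mapper: it uses a plain dictionary of k-mers instead of a compressed index, ignores insertions, deletions and reverse-complement reads, and all names, sequences and parameters are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (start position, mismatches) of the best placement, or None."""
    # Seeding: each exact k-mer match votes for an implied read start position.
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1

    # Extension: compare the full read against each candidate placement.
    best = None
    for start in votes:
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            if best is None or (mismatches, start) < (best[1], best[0]):
                best = (start, mismatches)
    return best


reference = "ACGTTGCATGACCGTA"  # hypothetical reference genome
k = 4
index = build_index(reference, k)

exact = map_read("GCATGACC", reference, index, k)      # identical to the reference
variant = map_read("GCATGTCC", reference, index, k)    # one substitution
unmapped = map_read("TTTTTTTT", reference, index, k)   # no shared k-mers
```

The third read shows the limitation noted above: when a read shares no exact k-mer with the reference, for example because the sample has diverged too far from it, the read cannot be placed at all.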
While NGS makes it feasible to simultaneously generate full-length genome sequences from hundreds of pathogen samples in a single run, it introduces a number of other challenges.
For instance, NGS platforms tend to have higher sequencing error rates than conventional sequencing, and regions of the genome with long stretches of repetitive sequence can be difficult to reassemble.
Whole genome sequencing can confer a significant advantage for source attribution over single- or multiple-locus subtyping.
Sequencing the entire genome is the maximal extent of multi-locus typing, in that all possible loci are covered.
Having whole genome sequences will tend to make one-to-one subtyping less useful, since most genomes will be unique by at least one mutation for rapidly evolving pathogens.
Consequently, applications of WGS for source attribution at a population level will likely have to cluster similar genomes together.
The breadth of coverage offered by WGS is more advantageous for the epidemiology of bacterial pathogens than viruses.
Bacterial genomes tend to be longer, ranging from about 10⁶ to 10⁷ base pairs, whereas virus genomes seldom exceed 10⁶ base pairs.
In addition, bacteria tend to evolve at a slower rate than viruses, so mutations tend to be distributed more sparsely throughout a bacterial genome.
For example, WGS data revealed differences between isolates of Burkholderia pseudomallei from Australia and Cambodia that had otherwise appeared to be identical by multi-locus subtyping due to convergent evolution.
WGS has also been utilized in several recent studies to resolve transmission networks of Mycobacterium tuberculosis in greater detail, because isolates with identical multi-locus subtypes were frequently separated by large numbers of nucleotide differences in the full genome sequence, which comprises roughly 4.3 million nucleotides encoding over 4,000 genes.