Single-nucleotide polymorphism
In genetics and bioinformatics, a single-nucleotide polymorphism (SNP) is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently large fraction of the population, many publications do not apply such a frequency threshold.
For example, a G nucleotide present at a specific location in a reference genome may be replaced by an A in a minority of individuals. The two possible nucleotide variations of this SNP – G or A – are called alleles.
SNPs can help explain differences in susceptibility to a wide range of diseases across a population. For example, a common SNP in the CFH gene is associated with increased risk of age-related macular degeneration. Differences in the severity of an illness or response to treatments may also be manifestations of genetic variations caused by SNPs. For example, two common SNPs in the APOE gene, rs429358 and rs7412, lead to three major APO-E alleles with different associated risks for development of Alzheimer's disease and age at onset of the disease.
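As a minimal sketch of how two SNPs can define three major alleles, the lookup below maps the nucleotides at rs429358 and rs7412 on a single chromosome to the commonly used ε2/ε3/ε4 labels. It assumes forward-strand nucleotides and already phased data (both positions read from the same chromosome copy). The function name `apoe_allele` and the labels `"e2"`, `"e3"`, `"e4"` are illustrative choices, not part of any standard tool.

```python
# Haplotype lookup: (nucleotide at rs429358, nucleotide at rs7412) -> allele label.
# Assumes forward-strand bases and phased input (both SNPs on the same chromosome).
APOE_HAPLOTYPES = {
    ("T", "T"): "e2",  # epsilon-2
    ("T", "C"): "e3",  # epsilon-3, the most common allele
    ("C", "C"): "e4",  # epsilon-4
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """Return the APOE allele label for one chromosome copy."""
    key = (rs429358.upper(), rs7412.upper())
    # The fourth combination (C, T) is very rare and is reported as "other" here.
    return APOE_HAPLOTYPES.get(key, "other")


# One individual, two chromosome copies (hypothetical data).
print(apoe_allele("T", "C"), apoe_allele("C", "C"))
```

An individual's APOE genotype is then the pair of labels from the two chromosome copies.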
Single nucleotide substitutions with an allele frequency of less than 1% are sometimes called single-nucleotide variants. "Variant" may also be used as a general term for any single nucleotide change in a DNA sequence, encompassing both common SNPs and rare mutations, whether germline or somatic. The term single-nucleotide variant has therefore been used to refer to point mutations found in cancer cells. DNA variants must also commonly be taken into consideration in molecular diagnostics applications such as designing PCR primers to detect viruses, in which the viral RNA or DNA sample may contain single-nucleotide variants. However, this nomenclature uses arbitrary distinctions and is not used consistently across all fields; the resulting disagreement has prompted calls for a more consistent framework for naming differences in DNA sequences between two samples.
Types
Single-nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions. SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code. SNPs in the coding region are of two types: synonymous SNPs and nonsynonymous SNPs. Synonymous SNPs do not affect the protein sequence, while nonsynonymous SNPs change the amino acid sequence of the protein.
- SNPs in non-coding regions can manifest in a higher risk of cancer, and may affect mRNA structure and disease susceptibility. Non-coding SNPs can also alter the level of expression of a gene, as an eQTL.
- SNPs in coding regions:
- * synonymous substitutions by definition do not result in a change of amino acid in the protein, but can still affect its function in other ways. For example, a seemingly silent mutation in the multidrug resistance gene 1 (MDR1), which codes for a cellular membrane pump that expels drugs from the cell, can slow down translation and allow the peptide chain to fold into an unusual conformation, causing the mutant pump to be less functional. In MDR1, the C3435T polymorphism changes the codon ATC to ATT at amino acid position 1145; both codons encode isoleucine.
- * nonsynonymous substitutions:
- ** missense – a single base change results in a different amino acid in the protein, which can cause the protein to malfunction and lead to disease
- ** nonsense – point mutation in a sequence of DNA that results in a premature stop codon, or a nonsense codon in the transcribed mRNA, and in a truncated, incomplete, and usually nonfunctional protein product.
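The categories above can be sketched as a small classifier that translates the reference and variant codons with the standard genetic code and compares the results. This is an illustrative sketch only: the function name `classify_substitution` is hypothetical, and the "stop-loss" outcome is an extra case not covered in the list above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change within one DNA codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        raise ValueError("codons must be three DNA bases (A, C, G, T)")
    differences = sum(r != a for r, a in zip(ref_codon, alt_codon))
    if differences != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # premature stop codon
    if ref_aa == "*":
        return "stop-loss"  # not one of the categories listed above
    return "missense"


print(classify_substitution("ATC", "ATT"))  # isoleucine -> isoleucine
print(classify_substitution("CGT", "CTT"))  # arginine -> leucine
print(classify_substitution("TGG", "TGA"))  # tryptophan -> stop
```

The classifier only looks at the encoded amino acid, so it cannot capture effects such as the altered translation speed described for synonymous changes.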
Frequency
More than 600 million SNPs have been identified across the human genome in the world's population. A typical genome differs from the reference human genome at 4–5 million sites, most of which consist of SNPs and short indels.
Within a genome
The genomic distribution of SNPs is not homogeneous; SNPs occur more frequently in non-coding regions than in coding regions or, more generally, than in regions where natural selection is acting and "fixing" the allele of the SNP that constitutes the most favorable genetic adaptation. Other factors, like genetic recombination and mutation rate, can also determine SNP density.
SNP density can be predicted by the presence of microsatellites: AT microsatellites in particular are potent predictors of SNP density, with long repeat tracts tending to be found in regions of significantly reduced SNP density and low GC content.
Within a population
Since there are variations between human populations, a SNP allele that is common in one geographical or ethnic group may be rarer in another. However, this pattern of variation is relatively rare; in a global sample of 67.3 million SNPs, the Human Genome Diversity Project "found no such private variants that are fixed in a given continent or major region. The highest frequencies are reached by a few tens of variants present at >70% in Africa, the Americas, and Oceania. By contrast, the highest frequency variants private to Europe, East Asia, the Middle East, or Central and South Asia reach just 10 to 30%."
Within a population, SNPs can be assigned a minor allele frequency (MAF), the lowest allele frequency at a locus that is observed in a particular population. For a SNP with two alleles, this is simply the lesser of the two allele frequencies.
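As a small worked example of the minor allele frequency, the sketch below counts alleles across diploid genotypes and returns the lesser of the two frequencies. The genotype encoding (two-character strings such as "GA") and the function names are assumptions made for illustration, and the sample data are invented.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'GG' or 'GA'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes):
    """Lesser of the two allele frequencies at a biallelic site."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) == 1:
        return 0.0  # site is monomorphic in this sample
    if len(freqs) != 2:
        raise ValueError("expected a biallelic site")
    return min(freqs.values())


# Hypothetical sample of 10 individuals: 15 G alleles and 5 A alleles in total.
sample = ["GG"] * 6 + ["GA"] * 3 + ["AA"] * 1
print(minor_allele_frequency(sample))
```

Because the frequency is computed per population sample, the same SNP can have different minor allele frequencies, or even a different minor allele, in different groups.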
With this knowledge, scientists have developed new methods for analyzing population structures in less studied species. By using pooling techniques, the cost of the analysis is significantly lowered. These techniques are based on sequencing a population in a pooled sample instead of sequencing every individual within the population by itself. With new bioinformatics tools, it is possible to investigate population structure, gene flow, and gene migration by observing the allele frequencies within the entire population. These protocols also make it possible to combine the advantages of SNPs with those of microsatellite markers. However, some information is lost in the process, such as linkage disequilibrium and zygosity information.
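The loss of zygosity information in pooled samples can be illustrated with a short sketch. Under the simplifying assumption that every individual contributes equally to the pool and that reads are sampled without bias, the allele frequency is estimated from read counts alone, and two populations with different genotype compositions can yield identical pooled counts. All names and data below are hypothetical.

```python
from collections import Counter


def pooled_allele_frequency(read_counts, allele):
    """Estimate an allele frequency from read counts at one site of a pooled sample."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads cover this site")
    return read_counts.get(allele, 0) / total


def allele_counts(genotypes):
    """Collapse individual diploid genotypes into the allele counts a pool would contain."""
    return Counter(allele for genotype in genotypes for allele in genotype)


# Two hypothetical populations of 8 individuals with the same allele content:
# in the first the A allele sits in heterozygotes, in the second in homozygotes.
heterozygous_carriers = ["GA"] * 4 + ["GG"] * 4
homozygous_carriers = ["AA"] * 2 + ["GG"] * 6

# Pooling erases the difference between the two populations.
print(allele_counts(heterozygous_carriers) == allele_counts(homozygous_carriers))

# Frequency estimate from hypothetical read counts at the site.
print(pooled_allele_frequency({"G": 90, "A": 30}, "A"))
```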
Applications
Single nucleotide polymorphisms serve as powerful molecular markers in contemporary genetic research and clinical practice. Association studies, particularly genome-wide association studies, represent the primary application of SNP technology for identifying genetic variants linked to human diseases and traits. These comprehensive analyses examine hundreds of thousands of genetic markers simultaneously to detect statistical associations between specific SNPs and phenotypic characteristics, enabling researchers to uncover genetic contributions to complex disorders including cardiovascular disease, diabetes, and neurological conditions.
The development of tag SNP methodology has significantly enhanced the efficiency of genomic studies by exploiting patterns of linkage disequilibrium across the human genome. Tag SNPs function as representative markers that capture genetic variation within specific chromosomal regions, allowing researchers to survey large genomic areas without genotyping every individual variant. This approach reduces both the financial cost and computational burden of large-scale genetic studies while maintaining sufficient power to detect disease-associated loci. The selection of optimal tag SNPs relies on sophisticated algorithms that identify markers capable of capturing the maximum amount of genetic information within defined genomic intervals.
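One simple way to pick tag SNPs, shown here only as a sketch, is a greedy procedure over pairwise r² values: repeatedly choose the SNP that is in strong linkage disequilibrium with the most SNPs not yet covered. This is not necessarily the algorithm used by any particular study or tool. The threshold of 0.8, the SNP names, and the r² values are assumptions for illustration.

```python
def select_tag_snps(snps, r2, threshold=0.8):
    """Greedily choose tag SNPs so every SNP is a tag or has r2 >= threshold with one.

    snps: list of SNP identifiers.
    r2:   dict mapping frozenset({snp_a, snp_b}) to a pairwise r-squared value;
          missing pairs are treated as r2 = 0.
    """
    def covers(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs (first one wins ties).
        best = max(snps, key=lambda s: sum(covers(s, u) for u in untagged))
        tags.append(best)
        untagged -= {u for u in untagged if covers(best, u)}
    return tags


# Hypothetical region with four SNPs; snp2 is strongly correlated with snp1 and snp3.
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp2", "snp3")): 0.85,
    frozenset(("snp1", "snp3")): 0.60,
    frozenset(("snp3", "snp4")): 0.10,
}
print(select_tag_snps(snps, r2))
```

In this invented example two tags stand in for four SNPs. A greedy choice is easy to follow but is not guaranteed to find the smallest possible tag set.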
Haplotype reconstruction represents another fundamental application where SNPs enable the characterization of inherited genetic blocks. Researchers utilize dense SNP maps to identify and analyze haplotype structures, which consist of sets of closely linked alleles that tend to be transmitted together through generations. These haplotype patterns provide insights into population history, demographic events, and evolutionary processes that have shaped contemporary genetic diversity. The International HapMap Project exemplified this application by creating comprehensive maps of common haplotype patterns across diverse human populations.
Linkage disequilibrium analysis forms the theoretical foundation for many SNP-based applications in population genetics and disease mapping. This phenomenon describes the non-random association of alleles at different genomic positions, which occurs when variants are inherited together more frequently than would be expected by chance alone. The extent of linkage disequilibrium between SNPs depends primarily on physical distance along chromosomes and local recombination rates, with closer variants generally showing stronger associations. Understanding these patterns enables researchers to predict which SNPs will provide redundant information and guides the selection of informative markers for association studies.
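The non-random association described above is commonly quantified with the statistics D, D′ and r². The sketch below computes them for two biallelic SNPs from phased haplotype counts. The function name and the count-based input format are assumptions for illustration, and the absolute value of D′ is reported.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r2) for two biallelic SNPs from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first SNP and
    allele B at the second, and so on for the other three combinations.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total  # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / total  # frequency of allele B at the second SNP
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic in the sample")

    # D: observed haplotype frequency minus the frequency expected under independence.
    d = p_AB - p_A * p_B
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r2: squared correlation between the alleles at the two SNPs.
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


print(linkage_disequilibrium(50, 0, 0, 50))    # alleles always inherited together
print(linkage_disequilibrium(25, 25, 25, 25))  # alleles combine at random
```

A pair of SNPs with r² close to 1 carries largely redundant information, which is why such values guide the choice of informative markers.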
In genetic epidemiology, SNPs have emerged as essential tools for investigating disease transmission patterns and population structure. Whole-genome sequencing approaches utilize SNP variation to define transmission clusters in infectious disease outbreaks, where cases showing similar genetic profiles may represent linked transmission events. This application has proven particularly valuable for tuberculosis surveillance and contact tracing, where traditional epidemiological methods may fail to identify all transmission links. Additionally, SNP-based analyses contribute to understanding population stratification and ancestry, which are crucial factors in designing appropriate study controls and interpreting association results across diverse ethnic groups.
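Defining transmission clusters from SNP variation can be sketched as counting pairwise SNP differences between isolates and linking any pair at or below a distance threshold. The example assumes equal-length sequences of aligned variable sites, and it uses single-linkage grouping with a threshold of 1 SNP purely for illustration; real surveillance pipelines choose thresholds and clustering rules for the pathogen in question. Isolate names and sequences are invented.

```python
def snp_distance(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def transmission_clusters(isolates, max_distance):
    """Single-linkage clusters: isolates joined by a chain of pairs within max_distance."""
    names = list(isolates)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if snp_distance(isolates[a], isolates[b]) <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical isolates described by ten variable sites each.
isolates = {
    "case1": "ACGTACGTAC",
    "case2": "ACGTACGTAT",  # 1 SNP from case1
    "case3": "ACGTACGAAT",  # 1 SNP from case2, 2 SNPs from case1
    "case4": "TGCATGCATG",  # unrelated
}
print(transmission_clusters(isolates, max_distance=1))
```

With single linkage, case1 and case3 end up in the same cluster through case2 even though they differ from each other by 2 SNPs, which is one reason the choice of clustering rule matters.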