Single-nucleotide polymorphism


In genetics and bioinformatics, a single-nucleotide polymorphism (SNP) is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently large fraction of the population, many publications do not apply such a frequency threshold.
For example, a G nucleotide present at a specific location in a reference genome may be replaced by an A in a minority of individuals. The two possible nucleotide variations of this SNP – G or A – are called alleles.
SNPs can help explain differences in susceptibility to a wide range of diseases across a population. For example, a common SNP in the CFH gene is associated with increased risk of age-related macular degeneration. Differences in the severity of an illness or response to treatments may also be manifestations of genetic variations caused by SNPs. For example, two common SNPs in the APOE gene, rs429358 and rs7412, lead to three major APO-E alleles with different associated risks for development of Alzheimer's disease and age at onset of the disease.
Single nucleotide substitutions with an allele frequency of less than 1% are sometimes called single-nucleotide variants. "Variant" may also be used as a general term for any single nucleotide change in a DNA sequence, encompassing both common SNPs and rare mutations, whether germline or somatic. The term single-nucleotide variant has therefore been used to refer to point mutations found in cancer cells. DNA variants must also commonly be taken into consideration in molecular diagnostics applications such as designing PCR primers to detect viruses, in which the viral RNA or DNA sample may contain single-nucleotide variants. However, this nomenclature uses arbitrary distinctions and is not used consistently across all fields; the resulting disagreement has prompted calls for a more consistent framework for naming differences in DNA sequences between two samples.

Types

Single-nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions. SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code.
SNPs in the coding region are of two types: synonymous SNPs and nonsynonymous SNPs. Synonymous SNPs do not affect the protein sequence, while nonsynonymous SNPs change the amino acid sequence of the protein.
  • SNPs in non-coding regions can manifest in a higher risk of cancer, and may affect mRNA structure and disease susceptibility. Non-coding SNPs can also alter the level of expression of a gene, as an eQTL (expression quantitative trait locus).
  • SNPs in coding regions:
      • Synonymous substitutions by definition do not result in a change of amino acid in the protein, but can still affect its function in other ways. An example is a seemingly silent mutation in the multidrug resistance gene 1, which codes for a cellular membrane pump that expels drugs from the cell. Such a mutation can slow down translation and allow the peptide chain to fold into an unusual conformation, causing the mutant pump to be less functional; for instance, the C3435T polymorphism changes the codon ATC to ATT at position 1145 (both encode isoleucine).
      • Nonsynonymous substitutions:
          • missense: a single base change results in a change of amino acid in the protein, which can cause the protein to malfunction and lead to disease.
          • nonsense: a point mutation in a sequence of DNA that results in a premature stop codon, or a nonsense codon in the transcribed mRNA, and in a truncated, incomplete, and usually nonfunctional protein product.
SNPs that are not in protein-coding regions may still affect gene splicing, transcription factor binding, messenger RNA degradation, or the sequence of noncoding RNA. A SNP of this type that affects gene expression is referred to as an eSNP (expression SNP) and may lie upstream or downstream of the gene.
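The coding-region categories described above can be illustrated with a short Python sketch. It assumes the standard genetic code and that the reference codon and the position of the substituted base within it are already known; the function name `classify_coding_snp` and the extra "stop-lost" label are illustrative choices, not part of any established tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a single-base substitution inside a codon (hypothetical helper).

    ref_codon: three-letter DNA codon on the coding strand, e.g. "ATC"
    offset:    0-based position of the substituted base within the codon
    alt_base:  the base that replaces the reference base
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        # Not covered in the text above; labeled separately to avoid misclassification.
        return "stop-lost"
    return "missense"


if __name__ == "__main__":
    print(classify_coding_snp("ATC", 2, "T"))  # ATC -> ATT, both isoleucine
    print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, tryptophan -> stop
```

The sketch looks only at the amino acid change, so it cannot capture the functional effects of synonymous substitutions on translation or folding mentioned above.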

Frequency

More than 600 million SNPs have been identified across the human genome in the world's population. A typical genome differs from the reference human genome at 4–5 million sites, most of which consist of SNPs and short indels.

Within a genome

The genomic distribution of SNPs is not homogeneous; SNPs occur in non-coding regions more frequently than in coding regions or, more generally, than in regions where natural selection is acting and "fixing" the allele of the SNP that constitutes the most favorable genetic adaptation. Other factors, like genetic recombination and mutation rate, can also determine SNP density.
SNP density can be predicted by the presence of microsatellites: AT microsatellites in particular are potent predictors of SNP density, with long repeat tracts tending to be found in regions of significantly reduced SNP density and low GC content.

Within a population

Since there are variations between human populations, a SNP allele that is common in one geographical or ethnic group may be rarer in another. However, this pattern of variation is relatively rare; in a global sample of 67.3 million SNPs, the Human Genome Diversity Project "found no such private variants that are fixed in a given continent or major region. The highest frequencies are reached by a few tens of variants present at >70% in Africa, the Americas, and Oceania. By contrast, the highest frequency variants private to Europe, East Asia, the Middle East, or Central and South Asia reach just 10 to 30%."
Within a population, SNPs can be assigned a minor allele frequency (MAF), the lowest allele frequency at a locus that is observed in a particular population. For a single-nucleotide polymorphism with two alleles, this is simply the lesser of the two allele frequencies.
With this knowledge, scientists have developed new methods for analyzing population structures in less studied species. By using pooling techniques, the cost of the analysis is significantly lowered. These techniques are based on sequencing a population in a pooled sample instead of sequencing every individual within the population separately. With new bioinformatics tools, it is possible to investigate population structure, gene flow, and gene migration by observing the allele frequencies within the entire population. These protocols also make it possible to combine the advantages of SNPs with those of microsatellite markers. However, some information is lost in the process, such as linkage disequilibrium and zygosity information.
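As a minimal sketch of the minor allele frequency calculation, the following Python function counts alleles across diploid genotypes at one biallelic locus. The genotype encoding (two-letter strings such as "GA") and the function name are assumptions made for illustration.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency at one biallelic locus.

    genotypes: iterable of two-character strings, one per diploid individual,
               e.g. ["GG", "GA", "AA"] (an assumed encoding).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    if len(counts) < 2:
        # Only one allele observed: the locus is monomorphic in this sample.
        return 0.0
    return min(counts.values()) / total


if __name__ == "__main__":
    sample = ["GG", "GA", "GG", "AA", "GG"]  # 7 G alleles, 3 A alleles
    print(minor_allele_frequency(sample))
```

Because the result depends on which individuals are sampled, the same SNP can have different minor allele frequencies in different populations.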

Applications

Single nucleotide polymorphisms serve as powerful molecular markers in contemporary genetic research and clinical practice. Association studies, particularly genome-wide association studies (GWAS), represent the primary application of SNP technology for identifying genetic variants linked to human diseases and traits. These comprehensive analyses examine hundreds of thousands of genetic markers simultaneously to detect statistical associations between specific SNPs and phenotypic characteristics, enabling researchers to uncover genetic contributions to complex disorders including cardiovascular disease, diabetes, and neurological conditions.
The development of tag SNP methodology has significantly enhanced the efficiency of genomic studies by exploiting patterns of linkage disequilibrium across the human genome. Tag SNPs function as representative markers that capture genetic variation within specific chromosomal regions, allowing researchers to survey large genomic areas without genotyping every individual variant. This approach reduces both the financial cost and computational burden of large-scale genetic studies while maintaining sufficient power to detect disease-associated loci. The selection of optimal tag SNPs relies on sophisticated algorithms that identify markers capable of capturing the maximum amount of genetic information within defined genomic intervals.
Haplotype reconstruction represents another fundamental application where SNPs enable the characterization of inherited genetic blocks. Researchers utilize dense SNP maps to identify and analyze haplotype structures, which consist of sets of closely linked alleles that tend to be transmitted together through generations. These haplotype patterns provide insights into population history, demographic events, and evolutionary processes that have shaped contemporary genetic diversity. The International HapMap Project exemplified this application by creating comprehensive maps of common haplotype patterns across diverse human populations.
Linkage disequilibrium analysis forms the theoretical foundation for many SNP-based applications in population genetics and disease mapping. This phenomenon describes the non-random association of alleles at different genomic positions, which occurs when variants are inherited together more frequently than would be expected by chance alone. The extent of linkage disequilibrium between SNPs depends primarily on physical distance along chromosomes and local recombination rates, with closer variants generally showing stronger associations. Understanding these patterns enables researchers to predict which SNPs will provide redundant information and guides the selection of informative markers for association studies.
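The non-random association described above is commonly quantified with the coefficient D and the squared correlation r². The following Python sketch computes both for a pair of biallelic SNPs, assuming phased haplotypes are available as two-character strings (the allele at the first SNP followed by the allele at the second); the encoding and the function name are illustrative assumptions.

```python
def linkage_disequilibrium(haplotypes, allele_1, allele_2):
    """Compute D and r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of two-character strings (assumed encoding), e.g. "AG"
                means allele A at the first SNP and allele G at the second.
    allele_1:   the allele of interest at the first SNP
    allele_2:   the allele of interest at the second SNP
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n

    # D: observed haplotype frequency minus the frequency expected under independence.
    d = p12 - p1 * p2
    denominator = p1 * (1 - p1) * p2 * (1 - p2)
    # r^2 is undefined when either SNP is monomorphic in the sample.
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


if __name__ == "__main__":
    linked = ["AG", "AG", "CT", "CT"]       # alleles always travel together
    unlinked = ["AG", "AT", "CG", "CT"]     # all combinations equally common
    print(linkage_disequilibrium(linked, "A", "G"))
    print(linkage_disequilibrium(unlinked, "A", "G"))
```

An r² close to 1 indicates that one SNP carries nearly the same information as the other, which is the property exploited when choosing tag SNPs.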
In genetic epidemiology, SNPs have emerged as essential tools for investigating disease transmission patterns and population structure. Whole-genome sequencing approaches utilize SNP variation to define transmission clusters in infectious disease outbreaks, where cases showing similar genetic profiles may represent linked transmission events. This application has proven particularly valuable for tuberculosis surveillance and contact tracing, where traditional epidemiological methods may fail to identify all transmission links. Additionally, SNP-based analyses contribute to understanding population stratification and ancestry, which are crucial factors in designing appropriate study controls and interpreting association results across diverse ethnic groups.

Importance

Variations in the DNA sequences of humans can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. SNPs are also critical for personalized medicine. Examples include biomedical research, forensics, pharmacogenetics, and disease causation, as outlined below.

Clinical research

Genome-wide association study (GWAS)

One of the main contributions of SNPs to clinical research is the genome-wide association study (GWAS). Genome-wide genetic data can be generated by multiple technologies, including SNP arrays and whole genome sequencing. GWAS has been commonly used to identify SNPs associated with diseases or with clinical phenotypes or traits. Since GWAS is a genome-wide assessment, a large sample size is required to obtain sufficient statistical power to detect all possible associations. Some SNPs have a relatively small effect on diseases or clinical phenotypes or traits. To estimate study power, the genetic model for the disease needs to be considered, such as dominant, recessive, or additive effects. Due to genetic heterogeneity, GWAS analysis must be adjusted for race.
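To make the idea of a statistical association concrete, the following Python sketch applies a basic allelic chi-square test to a single SNP in a case-control design. It is a simplified illustration under stated assumptions (allele counts already tallied, no adjustment for covariates or population structure, no multiple-testing correction) and is not how any particular GWAS pipeline works; the function name is hypothetical.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    case_counts, control_counts: (minor_allele_count, major_allele_count)
    Returns (chi_square_statistic, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_minor, col_major = a + c, b + d
    if 0 in (row_cases, row_controls, col_minor, col_major):
        raise ValueError("a row or column of the table is empty")

    chi2 = n * (a * d - b * c) ** 2 / (
        row_cases * row_controls * col_minor * col_major
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: the minor allele is more common in cases than in controls.
    chi2, p = allelic_chi_square((60, 40), (40, 60))
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

In a genome-wide setting the same kind of test is repeated for a very large number of SNPs, so a far stricter significance threshold than the conventional 0.05 is needed, which is one reason large sample sizes are required.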

Candidate gene association study

Candidate gene association studies were commonly used in genetic research before the invention of high-throughput genotyping or sequencing technologies. A candidate gene association study investigates a limited number of pre-specified SNPs for association with diseases or with clinical phenotypes or traits. It is therefore a hypothesis-driven approach. Since only a limited number of SNPs are tested, a relatively small sample size is sufficient to detect the association. The candidate gene association approach is also commonly used to confirm findings from GWAS in independent samples.

Homozygosity mapping in disease

Genome-wide SNP data can be used for homozygosity mapping. Homozygosity mapping is a method used to identify homozygous autosomal recessive loci, which can be a powerful tool to map genomic regions or genes that are involved in disease pathogenesis.

Methylation patterns

Preliminary results have identified SNPs as important components of the epigenetic program in organisms. Moreover, cosmopolitan studies in European and South Asian populations have revealed the influence of SNPs on the methylation of specific CpG sites. In addition, meQTL enrichment analysis using GWAS databases has demonstrated that those associations are important for the prediction of biological traits.

Forensic sciences

SNPs have historically been used to match a forensic DNA sample to a suspect, but this use has been made obsolete by advances in STR-based DNA fingerprinting techniques. However, the development of next-generation sequencing technology may allow for more opportunities to use SNPs for phenotypic clues such as ethnicity, hair color, and eye color with a good probability of a match. This can additionally be applied to increase the accuracy of facial reconstructions by providing information that may otherwise be unknown, and this information can be used to help identify suspects even without an STR DNA profile match.
One drawback of SNPs relative to STRs is that each SNP yields less information than an STR, so more SNPs must be analyzed before a profile of a suspect can be created. Additionally, SNP analysis relies heavily on the presence of a database for comparative analysis of samples. However, for degraded or small-volume samples, SNP techniques are an excellent alternative to STR methods. SNPs offer an abundance of potential markers, their analysis can be fully automated, and the required fragment length may be reduced to less than 100 bp.

Pharmacogenetics

Pharmacogenetics focuses on identifying genetic variations, including SNPs, associated with differential responses to treatment. Many drug-metabolizing enzymes, drug targets, or target pathways can be influenced by SNPs. SNPs that affect the activity of drug-metabolizing enzymes can change drug pharmacokinetics, while SNPs that affect a drug target or its pathway can change drug pharmacodynamics. Therefore, SNPs are potential genetic markers that can be used to predict drug exposure or the effectiveness of a treatment. Pharmacogenetic study on a genome-wide scale is called pharmacogenomics. Pharmacogenetics and pharmacogenomics are important in the development of precision medicine, especially for life-threatening diseases such as cancers.

Disease

Only a small number of SNPs in the human genome may have an impact on human diseases. Large-scale GWAS have been carried out for the most important human diseases, including heart diseases, metabolic diseases, autoimmune diseases, and neurodegenerative and psychiatric disorders. Most of the SNPs with relatively large effects on these diseases have been identified. These findings have significantly improved understanding of disease pathogenesis and molecular pathways, and facilitated the development of better treatments. Further GWAS with larger sample sizes will reveal the SNPs with relatively small effects on diseases. For common and complex diseases, such as type-2 diabetes, rheumatoid arthritis, and Alzheimer's disease, multiple genetic factors are involved in disease etiology. In addition, gene-gene interaction and gene-environment interaction also play an important role in disease initiation and progression.

Examples

As there are for genes, bioinformatics databases exist for SNPs.
  • dbSNP is a SNP database from the National Center for Biotechnology Information. dbSNP listed 149,735,377 SNPs in humans.
  • is a compendium of SNPs from multiple data sources including dbSNP.
  • SNPedia is a wiki-style database supporting personal genome annotation, interpretation and analysis.
  • The OMIM database describes the association between polymorphisms and diseases
  • dbSAP – single amino-acid polymorphism database for protein variation detection
  • The Human Gene Mutation Database provides gene mutations causing or associated with human inherited diseases and functional SNPs
  • The International HapMap Project, where researchers are identifying Tag SNPs to be able to determine the collection of haplotypes present in each subject.
  • GWAS Central allows users to visually interrogate the actual summary-level association data in one or more genome-wide association studies.
The International SNP Map working group mapped the sequence flanking each SNP by alignment to the genomic sequence of large-insert clones in GenBank. These alignments were converted to chromosomal coordinates, which are shown in Table 1. This list has greatly increased since, with, for instance, the Kaviar database now listing 162 million single nucleotide variants.
Chromosome  Length (bp)      All SNPs                  TSC SNPs
                             Total SNPs   kb per SNP   Total SNPs   kb per SNP
1           214,066,000      129,931      1.65         75,166       2.85
2           222,889,000      103,664      2.15         76,985       2.90
3           186,938,000      93,140       2.01         63,669       2.94
4           169,035,000      84,426       2.00         65,719       2.57
5           170,954,000      117,882      1.45         63,545       2.69
6           165,022,000      96,317       1.71         53,797       3.07
7           149,414,000      71,752       2.08         42,327       3.53
8           125,148,000      57,834       2.16         42,653       2.93
9           107,440,000      62,013       1.73         43,020       2.50
10          127,894,000      61,298       2.09         42,466       3.01
11          129,193,000      84,663       1.53         47,621       2.71
12          125,198,000      59,245       2.11         38,136       3.28
13          93,711,000       53,093       1.77         35,745       2.62
14          89,344,000       44,112       2.03         29,746       3.00
15          73,467,000       37,814       1.94         26,524       2.77
16          74,037,000       38,735       1.91         23,328       3.17
17          73,367,000       34,621       2.12         19,396       3.78
18          73,078,000       45,135       1.62         27,028       2.70
19          56,044,000       25,676       2.18         11,185       5.01
20          63,317,000       29,478       2.15         17,051       3.71
21          33,824,000       20,916       1.62         9,103        3.72
22          33,786,000       28,410       1.19         11,056       3.06
X           131,245,000      34,842       3.77         20,400       6.43
Y           21,753,000       4,193        5.19         1,784        12.19
RefSeq      15,696,674       14,534       1.08         -            -
Totals      2,710,164,000    1,419,190    1.91         887,450      3.05
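The "kb per SNP" figures in the table above are consistent with dividing each length, expressed in kilobases, by the corresponding SNP count. The short Python sketch below reproduces a few of them; the function name is an illustrative choice, and the assumption that lengths are given in base pairs is inferred from the arithmetic rather than stated in the source.

```python
def kb_per_snp(length_bp, snp_count):
    """Average spacing between SNPs in kilobases, rounded to two decimals."""
    return round(length_bp / 1000 / snp_count, 2)


if __name__ == "__main__":
    # (label, length in bp, total SNPs) taken from the "All SNPs" columns.
    rows = [
        ("1", 214_066_000, 129_931),
        ("Y", 21_753_000, 4_193),
        ("Totals", 2_710_164_000, 1_419_190),
    ]
    for label, length_bp, snp_count in rows:
        print(label, kb_per_snp(length_bp, snp_count))
```

A smaller value means a denser SNP map, so the table shows much sparser coverage on the X and Y chromosomes than on the autosomes.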

Nomenclature

The nomenclature for SNPs includes several variations for an individual SNP and lacks a common consensus.
The rs### standard is that which has been adopted by dbSNP and uses the prefix "rs", for "reference SNP", followed by a unique and arbitrary number. SNPs are frequently referred to by their dbSNP rs number, as in the examples above.
The Human Genome Variation Society uses a standard which conveys more information about the SNP. Examples are:
  • c.76A>T: "c." for coding region, followed by a number for the position of the nucleotide, followed by a one-letter abbreviation for the nucleotide, followed by a greater than sign to indicate substitution, followed by the abbreviation of the nucleotide which replaces the former
  • p.Ser123Arg: "p." for protein, followed by a three-letter abbreviation for the amino acid, followed by a number for the position of the amino acid, followed by the abbreviation of the amino acid which replaces the former.
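The three notations described above are regular enough to be recognized with simple patterns. The following Python sketch parses only the forms shown here (an rs number, a simple coding substitution, and a simple protein substitution); it is not a full implementation of the Human Genome Variation Society recommendations, and the function name and returned field names are hypothetical.

```python
import re

# Patterns for the simple forms described in the text only.
RS_PATTERN = re.compile(r"^rs(\d+)$")
CODING_PATTERN = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
PROTEIN_PATTERN = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_snp_name(name):
    """Parse a SNP identifier into a dictionary, or return None if unrecognized."""
    match = RS_PATTERN.match(name)
    if match:
        return {"kind": "rs", "number": int(match.group(1))}
    match = CODING_PATTERN.match(name)
    if match:
        position, ref, alt = match.groups()
        return {"kind": "coding", "position": int(position), "ref": ref, "alt": alt}
    match = PROTEIN_PATTERN.match(name)
    if match:
        ref, position, alt = match.groups()
        return {"kind": "protein", "position": int(position), "ref": ref, "alt": alt}
    return None


if __name__ == "__main__":
    for name in ["rs429358", "c.76A>T", "p.Ser123Arg"]:
        print(name, parse_snp_name(name))
```

Unlike the coding and protein forms, an rs number carries no positional information on its own and must be looked up in dbSNP.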

SNP analysis

SNPs can be easily assayed because they have only two possible alleles and three possible genotypes involving the two alleles (homozygous A, homozygous B and heterozygous AB), which has led to many possible techniques for analysis. These include: DNA sequencing; capillary electrophoresis; mass spectrometry; single-strand conformation polymorphism (SSCP); single base extension; electrochemical analysis; denaturing HPLC and gel electrophoresis; restriction fragment length polymorphism; and hybridization analysis.

Programs for prediction of SNP effects

An important group of SNPs are those that correspond to missense mutations causing an amino acid change at the protein level. A point mutation of a particular residue can have different effects on protein function. Usually, a change between amino acids with similar size and physico-chemical properties has a mild effect, and vice versa. Similarly, if a SNP disrupts secondary structure elements, the mutation may affect the structure and function of the whole protein. Using these simple rules and many other rules derived by machine learning, a group of programs for the prediction of SNP effects was developed:
  • This program provides insight into how a laboratory induced missense or nonsynonymous mutation will affect protein function based on physical properties of the amino acid and sequence homology.
  • estimates the potential deleteriousness of mutations resulted from altering their protein functions. It is based on the assumption that variations observed in closely related species are more significant when assessing conservation compared to those in distantly related species.
  • MutationTaster:
  • from the Ensembl project
  • : This program provides a 3D representation of the protein affected, highlighting the amino acid change so doctors can determine pathogenicity of the mutant protein.
  • is a database which maps variants to experimental and predicted protein structures.
  • is a tool which provides a stereochemical report on the effect of missense variants on protein structure.