Human genome
The human genome is a complete set of DNA sequences for each of the 22 autosomes and the two distinct sex chromosomes. A small DNA molecule is found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome.
Human genomes include both genes and various other types of functional DNA elements. The latter is a diverse category that includes regulatory DNA, scaffolding regions, telomeres, centromeres, and origins of replication. In addition, there are large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of the human genome.
Some of the genome is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.
Size of the human genome
In 2000, scientists reported the sequencing of 88% of the human genome, but as of 2020, at least 8% was still missing. In 2021, scientists reported sequencing a complete female genome. The human Y chromosome, which is found in all males, was sequenced completely in January 2022 using a different cell line; it consists of 62,460,029 base pairs.
The current version of the standard reference genome is called GRCh38.p14. It consists of 22 autosomes plus one copy of the X chromosome and one copy of the Y chromosome. It contains approximately 3.1 billion base pairs. This represents the size of a composite genome based on data from multiple individuals, but it is a good indication of the typical amount of DNA in a haploid set of chromosomes because the Y chromosome is quite small. Most human cells are diploid, so they contain twice as much DNA.
In 2023, a draft human pangenome reference was published. It is based on 47 genomes from people of varied ethnicity. Plans are underway for an improved reference capturing still more biodiversity from a still wider sample.
While there are significant differences among the genomes of human individuals, these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees.
Molecular organization and gene content
The total length of the human reference genome does not represent the sequence of any specific individual, nor does it represent the sequence of all of the DNA found within a cell. The human reference genome only includes one copy of each of the paired, homologous autosomes plus one copy of each of the two sex chromosomes. The total amount of DNA in this reference genome is 3.1 billion base pairs.
Protein-coding genes
Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes can lead to the production of many more unique proteins than the number of protein-coding genes.
The human reference genome contains somewhere between 19,000 and 20,000 protein-coding genes. These genes contain an average of 10 introns and the average size of an intron is about 6 kb. This means that the average size of a protein-coding gene is about 62 kb and these genes take up about 40% of the genome.
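The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above; the function names are illustrative, and the roughly 2 kb left over after subtracting the introns from the 62 kb average is simply the remainder implied by the text, not an independently sourced exon size.

```python
# Back-of-the-envelope check of the gene-size figures quoted in the text.
GENOME_BP = 3_100_000_000   # reference genome size (about 3.1 billion bp)
INTRONS_PER_GENE = 10       # average number of introns per gene
INTRON_BP = 6_000           # average intron size (about 6 kb)
GENE_BP = 62_000            # average protein-coding gene size (about 62 kb)


def intron_span_bp(introns_per_gene: int, intron_bp: int) -> int:
    """Total intron length in an average gene."""
    return introns_per_gene * intron_bp


def genome_fraction(gene_count: int, gene_bp: int, genome_bp: int) -> float:
    """Fraction of the genome occupied by gene_count genes of size gene_bp."""
    return gene_count * gene_bp / genome_bp


if __name__ == "__main__":
    introns = intron_span_bp(INTRONS_PER_GENE, INTRON_BP)
    print(f"intron DNA per gene: {introns} bp")
    print(f"remainder per gene:  {GENE_BP - introns} bp")
    for count in (19_000, 20_000):
        share = genome_fraction(count, GENE_BP, GENOME_BP)
        print(f"{count} genes occupy about {share:.0%} of the genome")
```

With 19,000 to 20,000 genes the result falls between 38% and 40%, which matches the "about 40%" stated above.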
Exon sequences consist of coding DNA and untranslated regions at either end of the mature mRNA. The total amount of coding DNA is about 1–2% of the genome.
Many people divide the genome into coding and non-coding DNA based on the idea that coding DNA is the most important functional component of the genome. About 98–99% of the human genome is non-coding DNA.
Non-coding genes
Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and RNA processing. Noncoding genes include those for tRNAs, ribosomal RNAs, microRNAs, snRNAs and long non-coding RNAs. The number of reported non-coding genes continues to rise slowly but the exact number in the human genome is yet to be determined. Many RNAs are thought to be non-functional.
Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.
Pseudogenes
Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. The number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during molecular evolution.
The olfactory receptor gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.
Regulatory DNA sequences
The human genome has many different regulatory sequences which are crucial to controlling gene expression. Some scientists believe that these sequences make up 8% of the genome, but other scientists predict that 20% or more of the human genome might be devoted to regulatory sequences.
A value of 8% would correspond to approximately 10,000 bp of regulatory DNA per gene and a value of 20% corresponds to 25,000 bp of regulatory DNA per gene. Many scientists think that these estimates are unreasonably high and conflict with the view that only 10% of the genome is functional and 90% is junk DNA.
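The per-gene values follow from dividing the regulatory share of the genome by a gene count. The text does not state which gene count was used; the sketch below assumes a total of about 25,000 genes, chosen only because it reproduces the quoted figures. Both that count and the function name are assumptions made for illustration.

```python
# Rough check of the "regulatory DNA per gene" figures quoted in the text.
GENOME_BP = 3_100_000_000      # reference genome size (about 3.1 billion bp)
ASSUMED_GENE_COUNT = 25_000    # assumption: not given in the text


def regulatory_bp_per_gene(fraction: float,
                           genome_bp: int = GENOME_BP,
                           gene_count: int = ASSUMED_GENE_COUNT) -> float:
    """Average regulatory DNA per gene if `fraction` of the genome is regulatory."""
    return fraction * genome_bp / gene_count


if __name__ == "__main__":
    for fraction in (0.08, 0.20):
        per_gene = regulatory_bp_per_gene(fraction)
        print(f"{fraction:.0%} regulatory -> about {per_gene:,.0f} bp per gene")
```

Under this assumption, 8% gives roughly 9,900 bp per gene and 20% gives roughly 24,800 bp per gene. Using only the 19,000 to 20,000 protein-coding genes as the divisor would give larger values.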
Regulatory sequences have been known since the late 1960s. The first identification of regulatory sequences in the human genome relied on recombinant DNA technology. Later with the advent of genomic sequencing, the identification of these sequences could be inferred by evolutionary conservation. The evolutionary branch between the primates and mouse, for example, occurred 70–90 million years ago. So computer comparisons of gene sequences that identify conserved sequences will be an indication of their importance in functions such as gene regulation.
The results indicate that about 10% of the human genome is conserved. Several hundred thousand human genomes have been sequenced and there is a considerable amount of variation between individuals. Only 10% of the genome seems to be protected from mutations by purifying selection.
As of 2012, efforts have shifted toward finding interactions between DNA and regulatory proteins using the ChIP-Seq technique, or finding gaps where the DNA is not packaged by histones. Both approaches indicate where potential regulatory sequences lie in the investigated cell type.
Repetitive DNA sequences
Repetitive DNA sequences comprise approximately 50% of the human genome.
About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies. The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for genealogical DNA testing and forensic DNA analysis.
Repeated sequences of fewer than ten nucleotides are termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of particular importance, as they sometimes occur within coding regions of genes for proteins and may lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)n within the Huntingtin gene on human chromosome 4. Telomeres end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)n.
Tandem repeats of longer sequences are termed minisatellites.
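As an illustration of what a tandem repeat looks like in sequence data, the following Python sketch scans a DNA string for short units repeated back to back, such as the CAG trinucleotide repeat associated with Huntington's disease or the TTAGGG repeat found at human telomeres. The function name, the default unit lengths (2 to 6 nucleotides) and the minimum copy number are arbitrary choices made for this example; real repeat-finding tools also handle imperfect repeats, which this sketch does not.

```python
import re


def find_tandem_repeats(seq: str, unit_min: int = 2, unit_max: int = 6,
                        min_copies: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    and the backreference \\1 requires exact adjacent copies of that unit.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # Made-up sequences for demonstration only.
    trinucleotide = "GATTC" + "CAG" * 5 + "TTGAC"
    telomere_like = "TTAGGG" * 4
    print(find_tandem_repeats(trinucleotide))
    print(find_tandem_repeats(telomere_like))
```

Counting the number of copies at a given locus is the basic idea behind using these highly variable repeats to distinguish individuals.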
Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, Alu, has about 50,000 active copies, and can be inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active copies per genome. Together with non-functional relics of old transposons, they account for over half of total human DNA. Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent endogenous retroviruses, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations. There are also a significant number of retroviruses in human DNA, at least 3 of which have been proven to possess an important function.
Mobile elements within the human genome can be classified into LTR retrotransposons, SINEs including Alu elements, LINEs, SVAs and Class II DNA transposons.