Intron
An intron is any nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word intron is derived from the term intragenic region, i.e., a region inside a gene. The term intron refers to both the DNA sequence within a gene and the corresponding RNA sequence in RNA transcripts. Introns are removed from the transcript by RNA splicing, and the non-intron sequences that become joined by this processing to form the mature RNA are called exons.
Introns are found in the genes of most eukaryotes and many eukaryotic viruses, and they can be located in both protein-coding genes and genes that function as RNA. There are four main types of introns: tRNA introns, group I introns, group II introns, and spliceosomal introns. Introns are rare in Bacteria and Archaea.
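The relationship between introns, exons, and the mature RNA can be illustrated with a minimal sketch. It treats a transcript as a plain string and removes a given list of intron intervals, joining the remaining exons in order. The function name `splice`, the coordinate convention, and the toy sequence are all illustrative assumptions; nothing here models how any real splicing machinery recognizes an intron.

```python
def splice(pre_rna: str, introns: list[tuple[int, int]]) -> str:
    """Return the mature RNA: exons joined in order, introns removed.

    `introns` holds hypothetical (start, end) pairs, 0-based and
    end-exclusive, giving the position of each intron in `pre_rna`.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        # Reject overlapping, empty, or out-of-range intervals.
        if start < position or start >= end or end > len(pre_rna):
            raise ValueError(f"invalid intron interval: ({start}, {end})")
        exons.append(pre_rna[position:start])  # exon before this intron
        position = end                         # skip over the intron
    exons.append(pre_rna[position:])           # final exon
    return "".join(exons)


# Invented toy transcript: three exons (upper case), two introns (lower case).
pre_rna = "AUGGCC" + "guaaguuuag" + "UUCGAA" + "guacgcacag" + "UGGUAA"
introns = [(6, 16), (22, 32)]

mature = splice(pre_rna, introns)
print(mature)  # AUGGCCUUCGAAUGGUAA
```

The same interval list applies equally to the DNA sequence of the gene and to its RNA transcript, which mirrors the way the term intron is used for both.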
Discovery and etymology
Introns were first discovered in protein-coding genes of adenovirus, and were subsequently identified in genes encoding transfer RNA and ribosomal RNA. Introns are now known to occur within a wide variety of genes in organisms, including bacteria, and in viruses, within all of the biological kingdoms. The fact that genes were split or interrupted by introns was discovered independently in a number of labs in 1977, including those run by Phillip Allen Sharp and Richard J. Roberts, who shared the Nobel Prize in Physiology or Medicine in 1993 for the discovery. Other labs that contributed to the discovery were those of Louise Chow and Thomas Broker. Much of the work in the Sharp lab was done by postdoctoral fellow Susan Berget.
The term intron was introduced by American biochemist Walter Gilbert:
"The notion of the cistron ... must be replaced by that of a transcription unit containing regions which will be lost from the mature messenger – which I suggest we call introns – alternating with regions which will be expressed – exons."
The term intron also refers to intracistron, i.e., an additional piece of DNA that arises within a cistron.
Although introns are sometimes called intervening sequences, the term "intervening sequence" can refer to any of several families of internal nucleic acid sequences that are not present in the final gene product, including inteins, untranslated regions, and nucleotides removed by RNA editing, in addition to introns.
Distribution
The frequency of introns varies widely among genomes across the spectrum of biological organisms. For example, introns are extremely common within the nuclear genome of jawed vertebrates, where protein-coding genes almost always contain multiple introns, while introns are rare within the nuclear genes of some eukaryotic microorganisms, for example baker's/brewer's yeast. In contrast, the mitochondrial genomes of vertebrates are entirely devoid of introns, while those of eukaryotic microorganisms may contain many introns.
[Figure: Simple illustration of an unspliced mRNA precursor, with two introns and three exons. After the introns have been removed via splicing, the mature mRNA sequence is ready for translation.]
A particularly extreme case is the Drosophila DhDhc7 gene, which contains a ≥3.6 megabase intron that takes roughly three days to transcribe. At the other extreme, a 2015 study suggests that the shortest known metazoan intron is a 30-base-pair intron in the human MST1L gene. The shortest known introns overall belong to the heterotrich ciliates, such as Stentor coeruleus, in which most introns are 15 or 16 bp long.
Classification
Splicing of all intron-containing RNA molecules is superficially similar, as described above. However, different types of introns were identified through the examination of intron structure by DNA sequence analysis, together with genetic and biochemical analysis of RNA splicing reactions. At least four distinct classes of introns have been identified:
- Introns in nuclear protein-coding genes that are removed by spliceosomes
- Introns in nuclear and archaeal transfer RNA genes that are removed by proteins
- Self-splicing group I introns that are removed by RNA catalysis
- Self-splicing group II introns that are removed by RNA catalysis
Spliceosomal introns
Nuclear pre-mRNA introns are characterized by specific intron sequences located at the boundaries between introns and exons. These sequences are recognized by spliceosomal RNA molecules when the splicing reactions are initiated. In addition, they contain a branch point, a particular nucleotide sequence near the 3' end of the intron that becomes covalently linked to the 5' end of the intron during the splicing process, generating a branched intron. Apart from these three short conserved elements, nuclear pre-mRNA intron sequences are highly variable. Nuclear pre-mRNA introns are often much longer than their surrounding exons.
tRNA introns
Transfer RNA introns that depend upon proteins for removal occur at a specific location within the anticodon loop of unspliced tRNA precursors, and are removed by a tRNA splicing endonuclease. The exons are then linked together by a second protein, the tRNA splicing ligase. Note that self-splicing introns are also sometimes found within tRNA genes.
Group I and group II introns
Group I and group II introns are found in genes encoding proteins, transfer RNA and ribosomal RNA in a very wide range of living organisms. Following transcription into RNA, group I and group II introns also make extensive internal interactions that allow them to fold into a specific, complex three-dimensional architecture. These complex architectures allow some group I and group II introns to be self-splicing, that is, the intron-containing RNA molecule can rearrange its own covalent structure so as to precisely remove the intron and link the exons together in the correct order. In some cases, particular intron-binding proteins are involved in splicing, acting in such a way that they assist the intron in folding into the three-dimensional structure that is necessary for self-splicing activity. Group I and group II introns are distinguished by different sets of internal conserved sequences and folded structures, and by the fact that splicing of RNA molecules containing group II introns generates branched introns, while group I introns use a non-encoded guanosine nucleotide to initiate splicing, adding it on to the 5'-end of the excised intron.
On the accuracy of splicing
The spliceosome is a very complex structure containing up to one hundred proteins and five different RNAs. The substrate of the reaction is a long RNA molecule, and the transesterification reactions catalyzed by the spliceosome require the bringing together of sites that may be thousands of nucleotides apart. All biochemical reactions are associated with known error rates, and the more complicated the reaction, the higher the error rate. Therefore, it is not surprising that the splicing reaction catalyzed by the spliceosome has a significant error rate even though there are spliceosome accessory factors that suppress the accidental cleavage of cryptic splice sites.
Under ideal circumstances, the splicing reaction is likely to be 99.999% accurate, so that the correct exons are joined and the correct intron is deleted. However, these ideal conditions require very close matches to the best splice site sequences and the absence of any competing cryptic splice site sequences within the introns, and those conditions are rarely met in large eukaryotic genes that may cover more than 40 kilobase pairs. Recent studies have shown that the actual error rate can be considerably higher than 10⁻⁵ and may be as high as 2% or 3% errors per gene. Additional studies suggest that the error rate is no less than 0.1% per intron. This relatively high level of splicing errors explains why most splice variants are rapidly degraded by nonsense-mediated decay.
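The per-intron and per-gene figures above can be related with a short calculation. The sketch below assumes that each intron is spliced independently with the same error probability, which is a simplification introduced here and not something the cited studies are stated to assume; the intron counts are likewise illustrative, and the function name is hypothetical.

```python
def per_gene_error_rate(per_intron_error: float, n_introns: int) -> float:
    """Probability that at least one of n independent splicing events fails."""
    return 1.0 - (1.0 - per_intron_error) ** n_introns


# Compare the idealized rate (1e-5) with the 0.1% per-intron lower bound.
for per_intron in (1e-5, 1e-3):
    for n_introns in (1, 10, 20, 30):
        rate = per_gene_error_rate(per_intron, n_introns)
        print(f"per-intron {per_intron:.0e}, {n_introns:2d} introns: "
              f"{rate:.4%} of transcripts mis-spliced")
```

Under these assumptions, a per-intron error rate of 0.1% gives roughly 2% mis-spliced transcripts for a gene with 20 introns and roughly 3% for a gene with 30, so the two estimates quoted above are mutually compatible for intron-rich genes.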
The presence of sloppy binding sites within genes causes splicing errors, and it may seem strange that these sites have not been eliminated by natural selection. The argument for their persistence is similar to the argument for junk DNA:
"Although mutations which create or disrupt binding sites may be slightly deleterious, the large number of possible such mutations makes it inevitable that some will reach fixation in a population. This is particularly relevant in species, such as humans, with relatively small long-term effective population sizes. It is plausible, then, that the human genome carries a substantial load of suboptimal sequences which cause the generation of aberrant transcript isoforms. In this study, we present direct evidence that this is indeed the case."
While the catalytic reaction may be accurate enough for effective processing most of the time, the overall error rate may be partly limited by the fidelity of transcription because transcription errors will introduce mutations that create cryptic splice sites. In addition, the transcription error rate of 10⁻⁵ to 10⁻⁶ is high enough that one in every 25,000 transcribed exons will have an incorporation error in one of the splice sites, leading to a skipped intron or a skipped exon. Almost all multi-exon genes will produce incorrectly spliced transcripts, but the frequency of this background noise will depend on the size of the genes, the number of introns, and the quality of the splice site sequences.
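The text does not state how the one-in-25,000 figure was derived. The sketch below shows one set of assumptions that reproduces it: a transcription error rate per nucleotide at the upper end of the quoted range, and four essential splice-site nucleotides flanking each exon. The count of four and the function name are assumptions made here for illustration only.

```python
def exons_per_splice_site_error(error_rate_per_nt: float,
                                critical_nt_per_exon: int) -> float:
    """Average number of transcribed exons per exon carrying a splice-site error."""
    # Probability that at least one critical nucleotide is misincorporated.
    p_hit = 1.0 - (1.0 - error_rate_per_nt) ** critical_nt_per_exon
    return 1.0 / p_hit


# Assumed: 4 essential splice-site nucleotides per exon (hypothetical value).
for rate in (1e-5, 1e-6):
    n = exons_per_splice_site_error(rate, critical_nt_per_exon=4)
    print(f"error rate {rate:.0e} per nucleotide: about 1 in {n:,.0f} exons")
```

With these inputs the upper rate gives about one affected exon in 25,000 and the lower rate about one in 250,000, so the quoted figure corresponds to the high end of the transcription error range under this assumption.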
In some cases, splice variants will be produced by mutations in the gene. These can be single-nucleotide polymorphisms (SNPs) that create a cryptic splice site or mutate a functional site. They can also be somatic cell mutations that affect splicing in a particular tissue or a cell line. When the mutant allele is in a heterozygous state, this will result in production of two abundant splice variants: one functional, and one non-functional. In the homozygous state, the mutant alleles may cause a genetic disease, such as the hemophilia found in descendants of Queen Victoria, where a mutation in one of the introns in a blood clotting factor gene creates a cryptic 3' splice site, resulting in aberrant splicing. A significant fraction of human deaths by disease may be caused by mutations that interfere with normal splicing, mostly by creating cryptic splice sites.
Incorrectly spliced transcripts can easily be detected and their sequences entered into the online databases. They are usually described as "alternatively spliced" transcripts, which can be confusing because the term does not distinguish between real, biologically relevant, alternative splicing and processing noise due to splicing errors. One of the central issues in the field of alternative splicing is working out the differences between these two possibilities. Many scientists have argued that the null hypothesis should be splicing noise, putting the burden of proof on those who claim biologically relevant alternative splicing. According to those scientists, the claim of function must be accompanied by convincing evidence that multiple functional products are produced from the same gene.