Transcriptome


The transcriptome is the set of all RNA molecules in a cell or a population of cells. It includes all of the functional RNA molecules and all other transcripts that may arise by spurious transcription or transcription of non-functional regions such as pseudogenes or virus fragments. A major goal of modern molecular biology is to determine which transcripts are functional and which ones are junk RNA.
The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription. The functional part of the transcriptome is dynamic — it changes with cell type, developmental stage, environment, and stimuli — and therefore represents the active gene expression state rather than the static DNA sequence.
Eukaryotic transcriptomes tend to be more complex than bacterial transcriptomes and the transcriptomes of multicellular eukaryotes are even more complex than those of unicellular eukaryotes.

Etymology and history

The word transcriptome is a portmanteau of the words transcript and genome. It appeared along with other neologisms formed using the suffixes -ome and -omics to denote studies conducted on a genome-wide scale in the life sciences and technology. As such, transcriptome and transcriptomics were among the first such words to emerge, along with genome and proteome. The first study presenting a cDNA library for silk moth mRNA was published in 1979. The first seminal study to mention and investigate the transcriptome of an organism was published in 1997; it described 60,633 transcripts expressed in S. cerevisiae using serial analysis of gene expression. With the rise of high-throughput technologies and bioinformatics, and the subsequent increase in computational power, it became increasingly efficient and easy to characterize and analyze enormous amounts of data. Attempts to characterize the transcriptome became more prominent with the advent of automated DNA sequencing during the 1980s. During the 1990s, expressed sequence tag sequencing was used to identify genes and their fragments. This was followed by techniques such as serial analysis of gene expression, cap analysis of gene expression, and massively parallel signature sequencing.

Transcription

The transcriptome encompasses all the ribonucleic acid transcripts present in a given organism or experimental sample. The functional component of the transcriptome includes RNAs that carry the genetic information responsible for converting DNA into an organism's phenotype. A gene gives rise to a single-stranded RNA molecule through a molecular process known as transcription; this RNA is complementary to the strand of DNA from which it originated. The enzyme RNA polymerase attaches to the template DNA strand and catalyzes the addition of ribonucleotides to the 3' end of the growing RNA transcript.
In order to initiate its function, RNA polymerase needs to recognize a promoter sequence, located near the transcription start site that defines the beginning of the gene. This process is usually mediated and regulated by transcription factors. Transcription ends at a terminator site that defines the other end of the gene. The terminator site is often identified by termination sequences.
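The complementarity between the template DNA strand and the resulting RNA can be illustrated with a minimal sketch. The function name and the example sequence below are hypothetical, and the code models only base pairing; it ignores promoters, terminators, and RNA processing.

```python
# Toy illustration: derive the RNA complementary to a DNA template strand.
# Base-pairing rules: A pairs with U in RNA, T with A, G with C, C with G.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') for a template strand written 3'->5'.

    Because ribonucleotides are added to the 3' end of the growing RNA,
    reading the template 3'->5' yields the transcript in 5'->3' order.
    """
    return "".join(COMPLEMENT[base] for base in template_3_to_5.upper())


# Hypothetical template strand, written 3'->5'.
rna = transcribe("TACGGATTC")
print(rna)  # AUGCCUAAG
```

An unknown character in the input raises a KeyError, which is a deliberate choice here so that malformed sequences are not silently transcribed.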

Types of RNA transcripts

Almost all functional transcripts are derived from known genes. The only exceptions are a small number of transcripts that might play a direct role in regulating gene expression near the promoters of known genes.
Genes occupy most of a prokaryotic genome, so most of the genome is transcribed. Many eukaryotic genomes are very large, and known genes may take up only a fraction of the genome. In mammals, for example, known genes account for only 40-50% of the genome. Nevertheless, identified transcripts often map to a much larger fraction of the genome, suggesting that the transcriptome contains spurious transcripts that do not come from genes. Some of these transcripts are known to be non-functional because they map to transcribed pseudogenes or degenerate transposons and viruses. Others map to unidentified regions of the genome that may be junk DNA.
Spurious transcription is very common in eukaryotes, especially those with large genomes that might contain a lot of junk DNA. Some scientists claim that if a transcript has not been assigned to a known gene then the default assumption must be that it is junk RNA until it has been shown to be functional. This would mean that much of the transcriptome in species with large genomes is probably junk RNA.
The transcriptome includes the transcripts of protein-coding genes as well as the transcripts of non-coding genes.
  • Ribosomal RNA/rRNA: Usually the most abundant RNA in the transcriptome.
  • Long non-coding RNA/lncRNA: Non-coding RNA transcripts that are more than 200 nucleotides long. Members of this group comprise the largest fraction of the non-coding transcriptome other than introns. It is not known how many of these transcripts are functional and how many are junk RNA.
  • transfer RNA/tRNA
  • micro RNA/miRNA: 19-24 nucleotides long. Micro RNAs up- or downregulate expression levels of mRNAs by the process of RNA interference at the post-transcriptional level.
  • small interfering RNA/siRNA: 20-24 nt
  • small nucleolar RNA/snoRNA
  • Piwi-interacting RNA/piRNA: 24-31 nt. They interact with Piwi proteins of the Argonaute family and have a function in targeting and cleaving transposons.
  • enhancer RNA/eRNA

Scope of study

In the human genome, all genes are transcribed into RNA, because being transcribed is part of how a molecular gene is defined. The transcriptome consists of coding regions of mRNA plus non-coding untranslated regions, introns, non-coding RNAs, and spurious non-functional transcripts.
Several factors make the content of the transcriptome difficult to establish. These include alternative splicing, RNA editing and alternative transcription, among others. Additionally, transcriptome techniques capture only the transcription occurring in a sample at a specific time point, whereas the content of the transcriptome can change during differentiation. The main aims of transcriptomics are to "catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs; to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications; and to quantify the changing expression levels of each transcript during development and under different conditions".
The term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type. Unlike the genome, which is roughly fixed for a given cell line, the transcriptome can vary with external environmental conditions. Because it includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time, with the exception of mRNA degradation phenomena such as transcriptional attenuation. The study of transcriptomics examines the expression level of RNAs in a given cell population, often focusing on mRNA, but sometimes including others such as tRNAs and sRNAs.

Methods of construction

Transcriptomics is the quantitative science that encompasses the assignment of a list of strings (sequence reads) to objects (transcripts). To calculate the expression strength, the density of reads corresponding to each object is counted. Initially, transcriptomes were analyzed and studied using expressed sequence tag libraries and serial and cap analysis of gene expression.
Currently, the two main transcriptomics techniques are DNA microarrays and RNA-Seq. Both techniques require RNA isolation through RNA extraction techniques, followed by separation of the RNA from other cellular components and enrichment of mRNA.
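The idea of read density can be made concrete with a small sketch. It uses transcripts per million (TPM), one common length-aware normalization that the text above does not name; it is swapped in here purely as an illustration. The function names, transcript names, counts, and lengths are all hypothetical.

```python
def reads_per_kilobase(counts, lengths_bp):
    """Read density: reads assigned to a transcript per kilobase of its length."""
    return {name: counts[name] / (lengths_bp[name] / 1000) for name in counts}


def tpm(counts, lengths_bp):
    """Scale read densities so that they sum to one million across transcripts."""
    rpk = reads_per_kilobase(counts, lengths_bp)
    total = sum(rpk.values())
    if total == 0:
        # No reads at all: report zero expression rather than divide by zero.
        return {name: 0.0 for name in rpk}
    return {name: value / total * 1_000_000 for name, value in rpk.items()}


# Hypothetical read counts and transcript lengths (base pairs).
counts = {"geneA": 500, "geneB": 1000, "geneC": 250}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}

for name, value in tpm(counts, lengths).items():
    print(f"{name}: {value:.0f} TPM")
```

In this example geneB has the most reads but the lowest density, because its reads are spread over a transcript four times longer than geneA; that is the reason density, rather than the raw count, is used as the measure of expression strength.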
There are two general methods of inferring transcriptome sequences. One approach maps sequence reads onto a reference genome, either of the organism itself or of a closely related species. The other approach, de novo transcriptome assembly, uses software to infer transcripts directly from short sequence reads and is used in organisms with genomes that are not sequenced.
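The reference-based approach can be sketched as a toy example. It assigns each read to the first reference sequence that contains it as an exact substring, which is a deliberate simplification: real aligners use indexed search and tolerate mismatches, gaps, and splicing. The function name and all sequences are hypothetical.

```python
def map_reads(reads, reference):
    """Assign each read to the first reference sequence that contains it exactly.

    Returns a dict of per-sequence read counts and the number of unmapped reads.
    """
    counts = {name: 0 for name in reference}
    unmapped = 0
    for read in reads:
        for name, sequence in reference.items():
            if read in sequence:
                counts[name] += 1
                break
        else:
            # The inner loop finished without a match in any reference sequence.
            unmapped += 1
    return counts, unmapped


# Hypothetical reference transcripts and short reads (cDNA alphabet).
reference = {
    "tx1": "ATGGCCATTGTAATGGGCCGC",
    "tx2": "ATGCCCGGGAAATTTAGA",
}
reads = ["GCCATTG", "GGGAAAT", "TTGTAAT", "CCCCCCC"]

counts, unmapped = map_reads(reads, reference)
print(counts)    # {'tx1': 2, 'tx2': 1}
print(unmapped)  # 1
```

De novo assembly has no such reference to search against and must instead infer transcripts from overlaps among the reads themselves, which this sketch does not attempt.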

DNA microarrays

The first transcriptome studies were based on microarray techniques. Microarrays consist of thin glass layers with spots on which oligonucleotides, known as "probes", are arrayed; each spot contains a known DNA sequence.
When performing microarray analyses, mRNA is collected from a control and an experimental sample, the latter usually representative of a disease. The RNA of interest is converted to cDNA to increase its stability and labeled with fluorophores of two colors, usually green and red, one for each group. The cDNA is spread onto the surface of the microarray, where it hybridizes with the oligonucleotides on the chip, and the chip is then scanned with a laser. The fluorescence intensity at each spot of the microarray corresponds to the level of gene expression, and the color of the fluorescence indicates which of the two samples exhibits higher levels of the mRNA of interest.
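The two-color comparison described above is often summarized per spot as a log2 ratio of the two fluorescence intensities. The following is a minimal sketch under that assumption; the function name, the probe names, the intensity values, and the pseudocount of 1 (added to avoid taking the logarithm of zero) are all hypothetical choices, not part of any particular platform's pipeline.

```python
import math


def log2_ratio(experimental, control, pseudocount=1.0):
    """Log2 ratio of experimental to control intensity for one spot.

    Positive values mean higher signal in the experimental sample,
    negative values mean higher signal in the control.
    """
    return math.log2((experimental + pseudocount) / (control + pseudocount))


# Hypothetical background-corrected intensities: (experimental, control).
spots = {
    "probe1": (1599, 399),
    "probe2": (199, 799),
    "probe3": (500, 500),
}

for probe, (experimental, control) in spots.items():
    print(probe, log2_ratio(experimental, control))
# probe1 2.0
# probe2 -2.0
# probe3 0.0
```

A log scale is convenient because a fourfold increase and a fourfold decrease are placed symmetrically around zero (+2 and -2).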
One microarray usually contains enough oligonucleotides to represent all known genes; however, data obtained using microarrays does not provide information about unknown genes. During the 2010s, microarrays were almost completely replaced by next-generation techniques that are based on DNA sequencing.

RNA sequencing

RNA sequencing is a next-generation sequencing technology; as such it requires only a small amount of RNA and no previous knowledge of the genome. It allows for both qualitative and quantitative analysis of RNA transcripts, the former allowing discovery of new transcripts and the latter a measure of relative quantities for transcripts in a sample.
The three main steps in sequencing the transcriptome of any biological sample are RNA purification, synthesis of an RNA or cDNA library, and sequencing of the library. The RNA purification process is different for short and long RNAs. This step is usually followed by an assessment of RNA quality, with the purpose of avoiding contaminants such as DNA or technical contaminants related to sample processing. RNA quality is measured using UV spectrometry, with an absorbance peak at 260 nm. RNA integrity can also be analyzed quantitatively by comparing the ratio and intensity of 28S RNA to 18S RNA, reported as the RNA Integrity Number score. Since mRNA is the species of interest and represents only 3% of the total RNA content, the RNA sample should be treated to remove rRNA, tRNA, and tissue-specific RNA transcripts.
Library preparation, which aims to produce short cDNA fragments, begins with fragmentation of the RNA into pieces between 50 and 300 base pairs in length. Fragmentation can be enzymatic, chemical or mechanical. Reverse transcription is then used to convert the RNA templates into cDNA, and three priming methods can be used to achieve this: oligo-dT priming, random primers, or ligation of special adaptor oligos.
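The UV absorbance measurement mentioned above can be turned into numbers with a short sketch. It relies on two widely used laboratory conventions that are not stated in the text: an absorbance of 1.0 at 260 nm (1 cm path length) corresponds to roughly 40 µg/mL of RNA, and an A260/A280 ratio of about 2.0 is generally taken to indicate RNA with little protein contamination. The function names and readings are hypothetical.

```python
# Conventional conversion factor (an assumption, not from the text above):
# A260 of 1.0 at a 1 cm path length is roughly 40 micrograms of RNA per mL.
RNA_UG_PER_ML_PER_A260 = 40.0


def rna_concentration(a260, dilution_factor=1.0):
    """Estimate RNA concentration in micrograms per mL from absorbance at 260 nm."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio; values near 2.0 are conventionally read as clean RNA."""
    return a260 / a280


# Hypothetical readings for a sample diluted tenfold before measurement.
a260, a280 = 0.5, 0.25
print(rna_concentration(a260, dilution_factor=10))  # 200.0
print(purity_ratio(a260, a280))                     # 2.0
```

These values describe concentration and purity only; integrity, as summarized by the RNA Integrity Number, requires a separate electrophoretic measurement and is not modeled here.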