Bioinformatics


Bioinformatics is an interdisciplinary field of science that develops computational methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, data science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. This process can sometimes be referred to as computational biology, however the distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.
Computational, statistical, and computer programming techniques have been used for computer simulation analyses of biological queries. They include reused specific analysis "pipelines", particularly in the field of genomics, such as by the identification of genes and single nucleotide polymorphisms. These pipelines are used to better understand the genetic basis of disease, unique adaptations, desirable properties, or differences between populations. Bioinformatics also includes proteomics, which aims to understand the organizational principles within nucleic acid and protein sequences.
Image and signal processing allow the extraction of useful results from large amounts of raw data. It aids in sequencing and annotating genomes and their observed mutations. Bioinformatics includes text mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatic tools aid in comparing, analyzing, and interpreting genetic and genomic data and in the understanding of evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA, RNA, proteins as well as biomolecular interactions.

History

The first definition of the term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970, to refer to the study of information processes in biotic systems. This definition placed bioinformatics as a field parallel to biochemistry.
Bioinformatics and computational biology involved the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.
Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence, soft computing, data mining, image processing, and computer simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics.

Sequences

There has been a tremendous advance in speed and cost reduction since the completion of the Human Genome Project, with some labs able to sequence over 100,000 billion bases each year, and a full genome can be sequenced for $1,000 or less.
Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. Margaret Oakley Dayhoff, a pioneer in the field, compiled one of the first protein sequence databases, initially published as books as well as methods of sequence alignment and molecular evolution. Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released online with Tai Te Wu between 1980 and 1991.
In the 1970s, new techniques for sequencing DNA were applied to bacteriophage MS2 and øX174, and the extended nucleotide sequences were then parsed with informational and statistical algorithms. These studies illustrated that well known features, such as the coding segments and the triplet code, are revealed in straightforward statistical analyses and were the proof of the concept that bioinformatics would be insightful.

Goals

In order to study how normal cellular activities are altered in different disease states, raw biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data. This also includes nucleotide and amino acid sequences, protein domains, and protein structures.
Important sub-disciplines within bioinformatics and computational biology include:
  • Development and implementation of computer programs to efficiently access, manage, and use various types of information.
  • Development of new mathematical algorithms and statistical measures to assess relationships among members of large data sets. For example, there are methods to locate a gene within a sequence, to predict protein structure and/or function, and to cluster protein sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches is its focus on developing and applying computationally intensive techniques to achieve this goal. Examples include: pattern recognition, data mining, machine learning algorithms, and visualization. Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis.
Bioinformatics entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
Over the past few decades, rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to these mathematical and computing approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

Sequence analysis

Since the bacteriophage Phage Φ-X174 was sequenced in 1977, the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species. With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Computer programs such as BLAST are used routinely to search sequences—as of 2008, from more than 260,000 organisms, containing over 190 billion nucleotides.

DNA sequencing

Before sequences can be analyzed, they are obtained from a data storage bank, such as GenBank. DNA sequencing is still a non-trivial problem as the raw data may be noisy or affected by weak signals. Algorithms have been developed for base calling for the various experimental approaches to DNA sequencing.

Sequence assembly

Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to obtain complete gene or genome sequences. The shotgun sequencing technique generates the sequences of many thousands of small DNA fragments. The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced, and genome assembly algorithms are a critical area of bioinformatics research.

Genome annotation

In genomics, annotation refers to the process of marking the stop and start regions of genes and other biological features in a sequenced DNA sequence. Many genomes are too large to be annotated by hand. As the rate of sequencing exceeds the rate of genome annotation, genome annotation has become the new bottleneck in bioinformatics.
Genome annotation can be classified into three levels: the nucleotide, protein, and process levels.
Gene finding is a chief aspect of nucleotide-level annotation. For complex genomes, a combination of ab initio gene prediction and sequence comparison with expressed sequence databases and other organisms can be successful. Nucleotide-level annotation also allows the integration of genome sequence with other genetic and physical maps of the genome.
The principal aim of protein-level annotation is to assign function to the protein products of the genome. Databases of protein sequences and functional domains and motifs are used for this type of annotation. About half of the predicted proteins in a new genome sequence tend to have no obvious function.
Understanding the function of genes and their products in the context of cellular and organismal physiology is the goal of process-level annotation. An obstacle of process-level annotation has been the inconsistency of terms used by different model systems. The Gene Ontology Consortium is helping to solve this problem.
The first description of a comprehensive annotation system was published in 1995 by The Institute for Genomic Research, which performed the first complete sequencing and analysis of the genome of a free-living organism, the bacterium Haemophilus influenzae. The system identifies the genes encoding all proteins, transfer RNAs, ribosomal RNAs, in order to make initial functional assignments. The GeneMark program trained to find protein-coding genes in Haemophilus influenzae is constantly changing and improving.
Following the goals that the Human Genome Project left to achieve after its closure in 2003, the ENCODE project was developed by the National Human Genome Research Institute. This project is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays, technologies able to automatically generate large amounts of data at a dramatically reduced per-base cost but with the same accuracy and fidelity.