Protein engineering
Protein engineering is the process of developing useful or valuable proteins through the design and production of unnatural polypeptides, often by altering amino acid sequences found in nature. It is a young discipline, and much research is aimed at understanding protein folding and recognition in order to derive protein design principles. It has been used to improve the function of many enzymes for industrial catalysis. It is also a product and services market, with an estimated value of $168 billion by 2017.
There are two general strategies for protein engineering: rational protein design and directed evolution. These methods are not mutually exclusive; researchers often apply both. In the future, more detailed knowledge of protein structure and function, together with advances in high-throughput screening, may greatly expand the abilities of protein engineering. Eventually, even unnatural amino acids may be included, via newer methods such as an expanded genetic code, which allow novel amino acids to be encoded in the genetic code.
Applications span many fields, including medicine and industrial bioprocessing.
Approaches
Rational design
In rational protein design, a scientist uses detailed knowledge of the structure and function of a protein to make desired changes. In general, this has the advantage of being inexpensive and technically easy, since site-directed mutagenesis methods are well developed. Its major drawback is that detailed structural knowledge of a protein is often unavailable and, even when it is available, the effects of various mutations can be very difficult to predict, since structural information most often provides only a static picture of a protein structure. Programs such as Folding@home and Foldit have used crowdsourcing techniques to gain insight into the folding motifs of proteins.
Computational protein design algorithms seek to identify novel amino acid sequences that are low in energy when folded to the pre-specified target structure. While the sequence-conformation space that needs to be searched is large, the most challenging requirement for computational protein design is a fast, yet accurate, energy function that can distinguish optimal sequences from similar suboptimal ones.
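The idea of searching sequence space for low-energy sequences on a fixed target structure can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the target structure is reduced to a string of buried (B) and exposed (E) sites, the energy function `toy_energy` merely counts hydrophobic/polar mismatches, and the search is a plain Metropolis Monte Carlo walk. Real design software uses far more detailed energy functions and search strategies.

```python
import math
import random

# Hypothetical residue classes, chosen only for this illustration.
HYDROPHOBIC = set("AVILMFW")
POLAR = set("STNQDEKR")
ALPHABET = sorted(HYDROPHOBIC | POLAR)


def toy_energy(sequence, burial):
    """Toy energy: +1 for every residue whose class does not suit its site.

    `burial` is a string of 'B' (buried) and 'E' (exposed) sites standing in
    for the target structure. Buried sites favor hydrophobic residues and
    exposed sites favor polar ones. This is not a physical energy model.
    """
    energy = 0
    for residue, site in zip(sequence, burial):
        if site == "B" and residue not in HYDROPHOBIC:
            energy += 1
        elif site == "E" and residue not in POLAR:
            energy += 1
    return energy


def design_sequence(burial, steps=5000, temperature=0.5, seed=0):
    """Metropolis Monte Carlo search for a low-energy sequence."""
    rng = random.Random(seed)
    current = [rng.choice(ALPHABET) for _ in burial]
    current_e = toy_energy(current, burial)
    best, best_e = current[:], current_e
    for _ in range(steps):
        # Propose a single-residue substitution at a random position.
        pos = rng.randrange(len(burial))
        old = current[pos]
        current[pos] = rng.choice(ALPHABET)
        new_e = toy_energy(current, burial)
        delta = new_e - current_e
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann-like probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = current[:], current_e
        else:
            current[pos] = old  # reject the move and restore the residue
    return "".join(best), best_e


if __name__ == "__main__":
    sequence, energy = design_sequence("BEBBEEBE")
    print(sequence, energy)
```

Even in this toy setting the cost of each step is dominated by the energy evaluation, which is why a fast yet discriminating energy function matters so much in practice.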
Multiple sequence alignment
Without structural information about a protein, sequence analysis is often useful in elucidating information about the protein. These techniques involve alignment of target protein sequences with other related protein sequences. This alignment can show which amino acids are conserved between species and are therefore likely to be important for the function of the protein. These analyses can help to identify hot spot amino acids that can serve as the target sites for mutations. Multiple sequence alignment utilizes databases such as PREFAB, SABMARK, OXBENCH, IRMBASE, and BALIBASE in order to cross-reference target protein sequences with known sequences. Multiple sequence alignment techniques are listed below.
Clustal W
This method begins by performing pairwise alignment of sequences using k-tuple or Needleman–Wunsch methods. These methods calculate a matrix that depicts the pairwise similarity among the sequence pairs. Similarity scores are then transformed into distance scores that are used to produce a guide tree using the neighbor joining method. This guide tree is then employed to yield a multiple sequence alignment.
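The first two steps described above (pairwise global alignment, then conversion of similarity into distance) can be sketched as follows. This is a minimal illustration, not the code of any alignment package: the scoring parameters (`match`, `mismatch`, `gap`) and the distance definition (one minus the fraction of identical aligned columns) are assumptions, sequences are assumed to be non-empty, and the neighbor-joining guide tree is not implemented.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distance(a, b):
    """Distance = 1 - fraction of identical columns in the alignment."""
    _, aligned_a, aligned_b = needleman_wunsch(a, b)
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y)
    return 1.0 - identical / len(aligned_a)


def distance_matrix(sequences):
    """All-against-all distances, the input a guide-tree method would use."""
    return [[pairwise_distance(s, t) for t in sequences] for s in sequences]


if __name__ == "__main__":
    for row in distance_matrix(["GATTACA", "GACTACA", "GATTA"]):
        print(["%.2f" % d for d in row])
```

A guide-tree method such as neighbor joining would take the resulting matrix as its input and determine the order in which sequences are progressively aligned.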
Clustal Omega
This method is capable of aligning up to 190,000 sequences by utilizing the k-tuple method. Next, sequences are clustered using the mBed and k-means methods. A guide tree is then constructed using the UPGMA method, and the HHalign package uses this guide tree to generate the multiple sequence alignment.
MAFFT
This method utilizes the fast Fourier transform (FFT). Each amino acid sequence is converted into a sequence composed of volume and polarity values for each amino acid residue, and this new sequence is used to find homologous regions.
K-Align
This method utilizes the Wu-Manber approximate string matching algorithm to generate multiple sequence alignments.
Multiple sequence comparison by log expectation (MUSCLE)
This method utilizes k-mer and Kimura distances to generate multiple sequence alignments.
T-Coffee
This method utilizes tree-based consistency objective functions for alignment evolution. This method has been shown to be 5–10% more accurate than Clustal W.
Coevolutionary analysis
Coevolutionary analysis is also known as correlated mutation, covariation, or co-substitution. This type of rational design involves reciprocal evolutionary changes at evolutionarily interacting loci. Generally, this method begins with the generation of a curated multiple sequence alignment for the target sequence. This alignment is then refined manually by removing highly gapped sequences, as well as sequences with low sequence identity, which increases the quality of the alignment. Next, the manually processed alignment is used for coevolutionary measurements with distinct correlated mutation algorithms. These algorithms produce a coevolution scoring matrix. This matrix is filtered by applying various significance tests to extract significant coevolution values and remove background noise. Coevolutionary measurements are further evaluated to assess their performance and stringency. Finally, the results from this coevolutionary analysis are validated experimentally.
Structural prediction
De novo generation of proteins benefits from knowledge of existing protein structures, which assists with the prediction of new protein structures. Methods for protein structure prediction fall under one of the following four classes: ab initio, fragment based methods, homology modeling, and protein threading.
Ab initio
These methods involve free modeling without using any structural information about the template. Ab initio methods aim to predict the native structure of a protein, which corresponds to the global minimum of its free energy. Some examples of ab initio methods are AMBER, GROMOS, GROMACS, CHARMM, OPLS, and ENCEPP12. The general steps for ab initio methods begin with the geometric representation of the protein of interest. Next, a potential energy function model for the protein is developed. This model can be created using either molecular mechanics potentials or protein structure derived potential functions. Following the development of a potential model, energy search techniques including molecular dynamics simulations, Monte Carlo simulations, and genetic algorithms are applied to the protein.
Fragment based
These methods use database information regarding structures to match homologous structures to the created protein sequences. These homologous structures are assembled to give compact structures using scoring and optimization procedures, with the goal of achieving the lowest potential energy score. Webservers for fragment information are I-TASSER, ROSETTA, ROSETTA@home, FRAGFOLD, CABS fold, PROFESY, CREF, QUARK, UNDERTAKER, HMM, and ANGLOR.
Homology modeling
These methods are based upon the homology of proteins and are also known as comparative modeling. The first step in homology modeling is generally the identification of template sequences of known structure that are homologous to the query sequence. Next, the query sequence is aligned to the template sequence. Following the alignment, the structurally conserved regions are modeled using the template structure. This is followed by the modeling of side chains and loops that are distinct from the template. Finally, the modeled structure undergoes refinement and quality assessment. Servers available for homology modeling include SWISS MODEL, MODELLER, ReformAlign, PyMOD, TIP-STRUCTFAST, COMPASS, 3d-PSSM, SAMT02, SAMT99, HHPRED, FUGUE, 3D-JIGSAW, META-PP, ROSETTA, and I-TASSER.
Protein threading
Protein threading can be used when a reliable homologue for the query sequence cannot be found. This method begins by obtaining a query sequence and a library of template structures. Next, the query sequence is threaded over known template structures. The resulting candidate models are scored using scoring functions based upon potential energy models of both the query and the template sequence. The match with the lowest potential energy is then selected. Methods and servers for retrieving threading data and performing calculations include GenTHREADER, pGenTHREADER, pDomTHREADER, ORFEUS, PROSPECT, BioShell-Threading, FFAS03, RaptorX, HHPred, LOOPP server, Sparks-X, SEGMER, THREADER2, ESYPRED3D, LIBRA, TOPITS, RAPTOR, COTH, and MUSTER.
For more information on rational design, see site-directed mutagenesis.
Multivalent binding
Multivalent binding can be used to increase binding specificity and affinity through avidity effects. Having multiple binding domains in a single biomolecule or complex increases the likelihood of other interactions occurring via individual binding events. Avidity, or effective affinity, can be much higher than the sum of the individual affinities, providing a cost- and time-effective tool for targeted binding.
Multivalent proteins
Multivalent proteins are relatively easy to produce by post-translational modifications or by multiplying the protein-coding DNA sequence. The main advantage of multivalent and multispecific proteins is that they can increase the effective affinity of a known protein for its target. In the case of an inhomogeneous target, using a combination of proteins that results in multispecific binding can increase specificity, which has high applicability in protein therapeutics.
The most common examples of multivalent binding are antibodies, and there is extensive research on bispecific antibodies. Applications of bispecific antibodies cover a broad spectrum that includes diagnosis, imaging, prophylaxis, and therapy.