Pseudo K-tuple nucleotide composition

Pseudo K-tuple nucleotide composition, or PseKNC, is a method for converting a nucleotide sequence into a numerical vector that can be used as input to pattern recognition techniques. In practice, the K-tuple is usually a dinucleotide (K = 2) or a trinucleotide (K = 3), in which case the technique is also called PseDNC or PseTNC, respectively.
The method was derived from an analogous method in proteomics known as PseAAC that is applied to protein sequences.

Background

PseAAC

PseKNC was derived from pseudo amino acid composition (PseAAC), an analogous method in proteomics. Before PseAAC, investigations relied either on sequential models for predicting certain protein properties, or on a discrete model that represents a protein as a vector of twenty elements, each giving the frequency of one amino acid in the protein. The discrete model, however, fails to account for sequence-order information. The PseAAC model extends the 20-element vector of the discrete model with λ components, each of which captures some sequence-order information, and the extended vector becomes the basis for making predictions.
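The limitation of the discrete model can be illustrated with a minimal Python sketch. The function name `amino_acid_composition` and the two example peptides are illustrative assumptions, not part of any published tool; the sketch only shows that two proteins with the same residues in a different order map to the same 20-element vector.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes (fixed order for the vector).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(protein):
    """Return the 20-element vector of normalized amino acid frequencies."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two hypothetical peptides: same residues, reversed order.
a = amino_acid_composition("MKVLA")
b = amino_acid_composition("ALVKM")
print(len(a))   # 20
print(a == b)   # True: the discrete model cannot tell the two apart
```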

Analogous problem in genomics

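Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition involves a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:
$$\mathbf{D} = [\, f(\mathrm{AA}) \;\; f(\mathrm{AC}) \;\; f(\mathrm{AG}) \;\; f(\mathrm{AT}) \;\; \cdots \;\; f(\mathrm{TT}) \,]^{\mathbf{T}}$$
where D is the vector representing the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence; the remaining 15 elements are defined in the same way. A trinucleotide representation, with 4³ = 64 elements, can be denoted as:
$$\mathbf{D} = [\, f(\mathrm{AAA}) \;\; f(\mathrm{AAC}) \;\; f(\mathrm{AAG}) \;\; f(\mathrm{AAT}) \;\; \cdots \;\; f(\mathrm{TTT}) \,]^{\mathbf{T}}$$
As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition, or PseKNC, was proposed.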
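The discrete K-tuple model described in this section can be sketched in a few lines of Python. The function name `ktuple_composition` and the choice to count overlapping windows and skip windows containing ambiguous bases are assumptions made for illustration.

```python
from itertools import product

def ktuple_composition(seq, k=2, alphabet="ACGT"):
    """Return (labels, frequencies) for all 4**k K-tuples of a DNA sequence.

    Overlapping windows are counted; windows containing characters outside
    the alphabet (e.g. N) are skipped. Both choices are assumptions.
    """
    seq = seq.upper()
    labels = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(labels, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no valid K-tuples")
    return labels, [counts[label] / total for label in labels]

labels, freqs = ktuple_composition("ACGTACGT", k=2)
print(len(freqs))                  # 16 dinucleotide frequencies
print(labels[:4])                  # ['AA', 'AC', 'AG', 'AT']
print(freqs[labels.index("AC")])   # AC occurs in 2 of the 7 windows
```

Calling the same function with `k=3` yields the 64-element trinucleotide vector. Any rearrangement of the sequence that preserves the K-tuple counts produces an identical vector, which is the loss of sequence-order information noted above.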

PseKNC

PseKNC extends the discrete model by adding λ components that represent sequence-order and physicochemical properties of the nucleotide sequence. The original K-tuple nucleotide composition (KNC) model involves 4^K components; for dinucleotides, where K = 2, this gives 4² = 16 components. The extension by PseKNC therefore results in 4^K + λ components.
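The following Python sketch shows how such an extended vector might be assembled. The article does not give the formula, so the sketch assumes one commonly described formulation: each of the λ extra components is a correlation factor, the mean squared difference in physicochemical properties between dinucleotides a fixed distance apart, scaled by a weight and normalized together with the K-tuple frequencies. The names `pseknc`, `correlation_factor`, and `weight` are hypothetical, and `TOY_PROPERTIES` holds made-up placeholder numbers, not measured physicochemical values.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

# Placeholder table: two made-up "properties" per dinucleotide, scaled to [0, 1].
# These are NOT real physicochemical measurements.
TOY_PROPERTIES = {
    dn: (i / 15.0, (i * 7 % 16) / 15.0) for i, dn in enumerate(DINUCLEOTIDES)
}

def kmer_frequencies(seq, k):
    """Normalized frequencies of all 4**k K-tuples (overlapping windows)."""
    labels = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(labels, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return [counts[label] / windows for label in labels]

def correlation_factor(seq, lag, properties):
    """Mean squared property difference between dinucleotides `lag` apart."""
    n_pairs = len(seq) - lag - 1
    total = 0.0
    for i in range(n_pairs):
        p = properties[seq[i:i + 2]]
        q = properties[seq[i + lag:i + lag + 2]]
        total += sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
    return total / n_pairs

def pseknc(seq, k=2, lam=3, weight=0.5, properties=TOY_PROPERTIES):
    """Return a vector of 4**k + lam components (assumed formulation)."""
    seq = seq.upper()
    if set(seq) - set(BASES):
        raise ValueError("sequence must contain only A, C, G, T")
    if len(seq) < k or lam >= len(seq) - 1:
        raise ValueError("sequence too short for the chosen k and lam")
    freqs = kmer_frequencies(seq, k)
    thetas = [correlation_factor(seq, j, properties) for j in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

vec = pseknc("ACGTTGCAAGCT", k=2, lam=3)
print(len(vec))  # 19, i.e. 4**2 + 3
```

In this sketch the first 4^K entries carry the local composition and the last λ entries change when the order of the sequence changes, which is the information the plain discrete model lacks. The parameters λ and the weight would normally be tuned for the prediction task at hand.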

Applications

A wide range of applications has been developed around the PseKNC method. For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.

Web servers

For the convenience of the scientific community, a freely available web server called PseKNC and an open-source package called PseKNC-General were developed in 2013 and 2014, respectively. Both can convert large-scale sequence datasets into pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau–Broto autocorrelation coefficient, Moran autocorrelation coefficient, Geary autocorrelation coefficient, Type I PseKNC and Type II PseKNC.
Another web server, Pse-in-One, lets users choose among the existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, together with any of the physicochemical property combinations available for those methods.