Shapiro–Senapathy algorithm
The algorithm
A splice site defines the boundary between a coding exon and a non-coding intron in eukaryotic genes. The S&S algorithm uses a sliding window, equal in length to the splice site motif, to scan a gene sequence and detect potential splice sites. For each window position, the algorithm calculates a score by comparing the nucleotide sequence to a Position Weight Matrix derived from known splice sites. The result is expressed as a score on a 0 to 100 scale, indicating how likely the sequence is to function as a donor or acceptor splice site.
A large proportion of disease-causing mutations in the human genome affect splice sites. Clinical genomics studies analyze the splice site scores generated by the S&S algorithm to predict the consequences of splice site mutations, including exon skipping and intron retention. The algorithm's sensitivity to single-nucleotide changes allows it to identify mutations that may impair RNA splicing and contribute to disease.
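The sliding-window scoring described above can be sketched as follows. This is a minimal illustration, not the published implementation: the matrix `TOY_DONOR_PWM` holds made-up percentages (not the S&S frequencies), the function names and the threshold of 70 are assumptions, and the score uses a simple min-max rescaling of the summed column weights to a 0 to 100 range.

```python
# Toy donor-site matrix: percentage of each nucleotide at 9 positions
# (3 exonic, 6 intronic). Illustrative placeholders, NOT the S&S values.
TOY_DONOR_PWM = [
    {"A": 35, "C": 35, "G": 20, "T": 10},
    {"A": 60, "C": 10, "G": 15, "T": 15},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 15, "C": 15, "G": 20, "T": 50},
]


def window_score(window, pwm=TOY_DONOR_PWM):
    """Sum the column weights for the window and rescale to 0..100."""
    total = sum(column[base] for column, base in zip(pwm, window))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return 100.0 * (total - lowest) / (highest - lowest)


def scan(sequence, pwm=TOY_DONOR_PWM, threshold=70.0):
    """Slide a window of the matrix width along the sequence.

    Returns (offset, window, score) for every window scoring at or
    above the threshold; windows with non-ACGT characters are skipped.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) <= set("ACGT"):
            score = window_score(window, pwm)
            if score >= threshold:
                hits.append((start, window, round(score, 1)))
    return hits
```

With this toy matrix, a single-base change such as AAGGTAAAG to AACGTAAAG lowers the window score, which is the kind of sensitivity to point mutations that the text describes.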
In addition to identifying real splice sites, the S&S algorithm has been used to discover cryptic splice sites, that is, alternative splice sites activated by mutations, which may disrupt normal splicing. The algorithm detects mutations that activate cryptic splice sites, which may lie close to real splice sites or deep within non-coding introns. It has thus been used to determine the causes of numerous diseases that are due to cryptic splicing.
Cancer gene discovery using S&S
The S&S algorithm has been used to identify splice-site mutations in genes associated with several cancers. For example, mutations have been found in genes causing commonly occurring cancers, including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis. In addition, splicing mutations have been identified in genes causing less commonly known cancers, including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, osteochondromas, nevoid basal cell carcinoma syndrome, and pheochromocytomas.
Specific mutations in different splice sites in various genes causing breast cancer, ovarian cancer, colon cancer, colorectal cancer, skin cancer, and Fanconi anemia have been uncovered. The donor and acceptor splice site mutations identified by S&S in different genes causing a variety of cancers are shown in Table 1.
Discovery of genes causing inherited disorders using S&S
Specific mutations have been uncovered in different splice sites in various genes that cause inherited disorders, including, for example, type 1 diabetes, hypertension, Marfan syndrome, cardiac diseases, and eye disorders. A few example mutations in the donor and acceptor splice sites of different genes causing a variety of inherited disorders, identified using S&S, are shown in Table 2.
| Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
|---|---|---|---|---|---|
| Diabetes | PTPN22 | Exon 18 | AAGGTAAAG | AACGTAAAG | Skipping of exon 18 |
| Diabetes | TCF1 | Intron 4 | TTTGTGCCCCTCAGG | TTTGTGCCCCTCGGG | Skipping of exon 5 |
| Hypertension | LDL | Intron 10 | TGGGTGCGT | TGGGTGCAT | Normolipidemic to classical heterozygous FH |
| Hypertension | LDLR | Intron 2 | GCTGTGAGT | GCTGTGTGT | Predicted by in silico analysis to cause splicing abnormalities |
| Hypertension | LPL | Intron 2 | ACGGTAAGG | ACGATAAGG | Cryptic splice sites are activated in vivo |
| Marfan syndrome | FBN1 | Intron 46 | CAAGTAAGA | CAAGTAAAA | Exon skipping/cryptic splice site |
| Marfan syndrome | TGFBR2 | Intron 1 | ATCCTGTTTTACAGA | ATCCTGTTTTACGGA | Abnormal splicing |
| Marfan syndrome | FBN2 | Intron 45 | TGGGTAAGT | TGGGGAAGT | Splice site alterations leading to frameshift mutations, causing a truncated protein |
| Cardiac disease | COL1A2 | Intron 46 | GCTGTAAGT | GCTGCAAGT | Permitted almost exclusive use of a cryptic donor site 17 nt upstream in the exon |
| Cardiac disease | MYBPC3 | Intron 5 | CTCCATGCACACAGG | CTCCATGCACACCGG | Abnormal mRNA transcript with a premature stop codon, producing a truncated protein that lacks the binding sites for myosin and titin |
| Cardiac disease | ACTC1 | Intron 1 | TTTTCTTCTCATAGG | TTTTCTTCTTATAGG | No effect |
| Eye disorder | ABCR | Intron 30 | CAGGTACCT | CAGTTACCT | Autosomal recessive RP and CRD |
| Eye disorder | VSX1 | Intron 5 | TTTTTTTTTACAAGG | TATTTTTTTACAAGG | Aberrant splicing |
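To read the table, it can help to locate the single base that differs between each original and mutated sequence. The helper below is a small sketch for that purpose; the function name `substitution` is an assumption, positions are 0-based within the quoted motif (not genomic coordinates), and only a few rows of the table are used.

```python
def substitution(original, mutated):
    """Return (index, ref_base, alt_base) for a single-base difference.

    The index is 0-based within the motif as printed in the table.
    """
    if len(original) != len(mutated):
        raise ValueError("sequences must have equal length")
    diffs = [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(original, mutated))
        if ref != alt
    ]
    if len(diffs) != 1:
        raise ValueError(f"expected one substitution, found {len(diffs)}")
    return diffs[0]


# A few rows taken from the table: (gene, original, mutated).
rows = [
    ("PTPN22", "AAGGTAAAG", "AACGTAAAG"),
    ("LDLR", "GCTGTGAGT", "GCTGTGTGT"),
    ("FBN1", "CAAGTAAGA", "CAAGTAAAA"),
    ("TCF1", "TTTGTGCCCCTCAGG", "TTTGTGCCCCTCGGG"),
]

for gene, ref, alt in rows:
    index, ref_base, alt_base = substitution(ref, alt)
    print(f"{gene}: motif position {index}, {ref_base}>{alt_base}")
```

In the TCF1 row, for instance, the change falls on the A of the AG dinucleotide at the end of the acceptor motif.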
Genes causing immune system disorders
More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.
Xeroderma pigmentosum, an autosomal recessive disorder, is caused by faulty proteins produced through a new preferred splice donor site, which was identified using the S&S algorithm; the result is defective nucleotide excision repair.
Type I Bartter syndrome (BS) is caused by mutations in the SLC12A1 gene. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.
Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, confer cancer susceptibility. The IVS1+5C variant plays a causative role in activating a cryptic splice donor site and in the alternative splicing of intron 1. Analysis with the S&S algorithm shows that guanine at position IVS1+5 is well conserved among primates. This supports the conclusion that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.
Splice site scores calculated according to S&S have been used in studies of EBV infection in X-linked lymphoproliferative disease. S&S has also been applied to familial tumoral calcinosis (FTC), an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, which is caused by aberrant splicing.
Application of S&S in hospitals for clinical practice and research
The Shapiro–Senapathy algorithm has played a significant role in advancing the diagnosis and treatment of human diseases through its application in modern clinical genomics. With the widespread adoption of next-generation sequencing technologies, the S&S algorithm is now routinely integrated into clinical practice by geneticists and diagnostic laboratories. It is implemented in various computational tools such as Human Splicing Finder, Splice Site Finder, and Alamut Visual, which assist in interpreting the functional impact of genetic variants on RNA splicing.
The algorithm is particularly useful in identifying pathogenic splice site mutations in cases where the clinical presentation is unclear or where conventional diagnostic methods have failed to identify a causative gene. Its utility has been demonstrated across diverse patient cohorts, including individuals from different ethnic backgrounds with various cancers and inherited genetic disorders. The following are selected examples illustrating its application in clinical research.
S&S - Algorithm for identifying splice sites, exons and split genes
The Shapiro–Senapathy algorithm was developed to identify splice sites in uncharacterized genomic sequences, with early applications in the Human Genome Project. The method introduced a Position Weight Matrix-based approach to analyze splicing sequences across eukaryotic organisms, marking the first computational framework to systematically define splice sites using probabilistic scoring.
Key innovations of the algorithm included:
- Exon Detection – Exons were defined as sequences bounded by acceptor and donor splice sites with S&S scores above a threshold, requiring an open reading frame for validation.
- Gene Prediction – The method enabled the identification of complete genes by assembling predicted exons, forming a basis for later gene-finding tools.
- Mutation Analysis – The algorithm distinguished deleterious splice-site mutations from neutral variations, which allowed researchers to study disease-linked cryptic splice sites in humans, animals, and plants.
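The exon-detection idea in the list above can be sketched as a pairing of high-scoring acceptor and donor sites, keeping only spans with an open reading frame. This is a simplified illustration under stated assumptions: the site positions are taken as already scored and thresholded, and the names `open_frames` and `candidate_exons` and the length limits are hypothetical, not part of the original method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(segment):
    """Return the reading frames (0, 1, 2) that contain no stop codon."""
    frames = []
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


def candidate_exons(sequence, acceptor_ends, donor_starts,
                    min_len=30, max_len=600):
    """Pair acceptors with downstream donors and keep ORF-compatible spans.

    acceptor_ends: index of the first exonic base after each acceptor
                   site that passed the score threshold.
    donor_starts:  index of the first intronic base (the G of GT) of
                   each donor site that passed the score threshold.
    Returns (start, end, open_frames) with end exclusive.
    """
    exons = []
    for start in sorted(acceptor_ends):
        for end in sorted(donor_starts):
            if min_len <= end - start <= max_len:
                frames = open_frames(sequence[start:end])
                if frames:  # at least one frame free of stop codons
                    exons.append((start, end, frames))
    return exons
```

A gene-prediction step would then chain compatible candidates so that the reading frame is preserved from one exon to the next; that step is omitted here.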
Discovering the mechanisms of aberrant splicing in diseases
The Shapiro–Senapathy algorithm has been used to determine the aberrant splicing mechanisms that result from deleterious mutations in splice sites and cause numerous diseases. Deleterious splice site mutations impair the normal splicing of gene transcripts and thereby make the encoded protein defective. A mutant splice site can become "weak" compared to the original site, so that the mutated splice junction is no longer recognized by the spliceosomal machinery. This can lead to skipping of the exon in the splicing reaction, resulting in the loss of that exon from the spliced mRNA. Alternatively, a partial or complete intron can be included in the mRNA when a mutation makes a splice site unrecognizable. Partial exon skipping or intron inclusion can cause premature termination of translation, producing a defective protein that leads to disease. The S&S algorithm has thus paved the way to determine the mechanisms by which a deleterious mutation can lead to a defective protein, resulting in different diseases depending on which gene is affected.
Examples of splicing aberrations
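The effect of exon skipping on the reading frame can be illustrated with a toy transcript. The three exon sequences below are invented for the illustration and do not correspond to any real gene; the function names are likewise assumptions. The middle exon is 10 nt long, so removing it shifts the frame and exposes an early stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(exons, skipped=()):
    """Join exon sequences in order, omitting indices listed in skipped."""
    return "".join(seq for i, seq in enumerate(exons) if i not in skipped)


def codons_before_stop(mrna):
    """Count codons read from position 0 up to the first in-frame stop."""
    count = 0
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


# Hypothetical three-exon coding sequence (start codon to stop codon).
exons = ["ATGGCTGCA", "GGTCCAGATC", "TGAAGCTTGGCTAA"]

normal = splice(exons)
skipped = splice(exons, skipped={1})  # exon at index 1 is lost

print(codons_before_stop(normal), codons_before_stop(skipped))
```

In the normal transcript the stop codon is reached only at the end, whereas the transcript lacking the middle exon terminates after three codons, mirroring the premature termination described above.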
A mutation in the donor splice site of exon 8 of the MLH1 gene, which led to colorectal cancer, is one example of such a splicing aberration. It shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.
S&S in cryptic splice sites research and medical applications
Splice sites must be identified with high precision because the consensus splice sequences are very short, and gene sequences contain many other sequences that resemble authentic splice sites. These are known as cryptic, non-canonical, or pseudo splice sites. When an authentic splice site is mutated, a cryptic splice site close to the original site may be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may introduce a premature stop codon. The result may be a truncated protein that has lost its function completely.
The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, without an accompanying complementary donor or acceptor site, such a cryptic site is not active and is not used in a splicing reaction. When a neighboring real site is mutated and becomes weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript.
Numerous diseases are caused by cryptic splice site mutations or by the use of cryptic splice sites following mutations in authentic splice sites.
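The comparison between a weakened authentic site and nearby cryptic sites can be sketched as a simple filter. This is an assumption-laden illustration rather than the algorithm itself: the `Site` class, the `max_distance` cutoff, and all scores below are hypothetical, and a real analysis would also need to check for a compatible partner donor or acceptor site, as noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    position: int  # coordinate of the site within the gene sequence
    score: float   # S&S-style score on a 0 to 100 scale


def cryptic_candidates(mutated_authentic, other_sites, max_distance=100):
    """List nearby sites that now outscore the mutated authentic site.

    Sites are returned strongest first. A higher score alone does not
    prove usage; this only flags candidates for closer inspection.
    """
    candidates = [
        site for site in other_sites
        if site.position != mutated_authentic.position
        and abs(site.position - mutated_authentic.position) <= max_distance
        and site.score > mutated_authentic.score
    ]
    return sorted(candidates, key=lambda site: site.score, reverse=True)


# Hypothetical scores: the authentic donor at 500 dropped to 55 after mutation.
authentic = Site(500, 55.0)
others = [Site(483, 78.0), Site(530, 62.0), Site(900, 95.0), Site(510, 40.0)]
print(cryptic_candidates(authentic, others))
```

Here the site 17 nt upstream is flagged first, while the distant high-scoring site is ignored because it lies outside the assumed window.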
S&S in animal and plant genomics research
S&S has also been used in RNA splicing research in many animals and plants.
mRNA splicing plays a fundamental role in the regulation of gene function. A-to-G conversions at splice sites have been shown to cause mRNA mis-splicing in Arabidopsis. In a study of the molecular characterization and evolution of carnivorous sundew class V β-1,3-glucanase, the predicted splicing and exon–intron junctions conformed to the GT/AG rule. Unspliced and spliced transcripts of NAD+-dependent sorbitol dehydrogenase of strawberry were investigated in response to phytohormonal treatments.
Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish. Depletion of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish. The Shapiro–Senapathy algorithm has been used extensively to study the intron-exon organization of fut8 genes. The intron-exon boundaries of Sf9 fut8 agreed with the consensus sequences for splicing donor and acceptor sites, as determined using S&S.