Representative sequences
In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences. In bioinformatics, representative sequences also designate substrings of a sequence that characterize the sequence.
Social sciences
In sequence analysis in social sciences, representative sequences are used to summarize sets of sequences describing, for example, the family life courses or professional careers of several thousand individuals. The identification of representative sequences proceeds from the pairwise dissimilarities between sequences. One typical solution is the medoid sequence, i.e., the observed sequence that minimizes the sum of its distances to all other sequences in the set. Another solution is the densest observed sequence, i.e., the sequence with the greatest number of other sequences in its neighborhood. When the diversity of the sequences is large, a single representative is often insufficient to characterize the set adequately. In such cases, the aim is to find the smallest possible set of representative sequences that covers a given percentage of all sequences.
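The three criteria above can be illustrated with a minimal Python sketch. It is not the implementation of any particular package, and it rests on several assumptions that are not stated in the text: sequences are equal-length strings of state codes, dissimilarity is the Hamming distance, a neighborhood is the set of sequences within a chosen `radius`, and a sequence counts as covered when it lies in the neighborhood of a selected representative. The coverage search uses a simple greedy heuristic, which does not guarantee the smallest possible set. All function names are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: List[Sequence]) -> List[List[int]]:
    """Pairwise dissimilarities (assumption: Hamming distance)."""
    return [[hamming(a, b) for b in seqs] for a in seqs]


def medoid_index(dist: List[List[int]]) -> int:
    """Index of the sequence with the smallest sum of distances to all others."""
    return min(range(len(dist)), key=lambda i: sum(dist[i]))


def densest_index(dist: List[List[int]], radius: int) -> int:
    """Index of the sequence with the most other sequences within `radius`."""
    def neighbours(i: int) -> int:
        return sum(1 for j, d in enumerate(dist[i]) if j != i and d <= radius)
    return max(range(len(dist)), key=neighbours)


def covering_set(dist: List[List[int]], radius: int, coverage: float) -> List[int]:
    """Greedy heuristic: add representatives until the requested share of
    sequences lies within `radius` of at least one of them."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    n = len(dist)
    uncovered = set(range(n))
    chosen: List[int] = []
    while n - len(uncovered) < coverage * n:
        # Pick the sequence whose neighborhood holds the most uncovered sequences.
        best = max(
            range(n),
            key=lambda i: sum(1 for j in uncovered if dist[i][j] <= radius),
        )
        chosen.append(best)
        uncovered -= {j for j in range(n) if dist[best][j] <= radius}
    return chosen


# Toy data (invented for illustration): S = single, M = married, P = parent.
seqs = ["SSMM", "SSMP", "SMMP", "SSMM", "PPPP"]
dist = distance_matrix(seqs)
print(seqs[medoid_index(dist)])
print(seqs[densest_index(dist, radius=1)])
print([seqs[i] for i in covering_set(dist, radius=1, coverage=1.0)])
```

In this toy set the outlying sequence "PPPP" is far from all others, so a single representative cannot cover every sequence within a radius of 1, which is the situation in which a set of representatives becomes useful.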
Another solution is to select the medoids of relative frequency groups. More specifically, the method consists of sorting the sequences, splitting the sorted list into equal-sized groups, and selecting the medoid of each group.
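The sort, split and select steps can be sketched as follows. This is an illustrative Python sketch rather than the implementation of any package. The text does not specify how the sequences are sorted, so the sketch takes the ordering as a caller-supplied `sort_key` function; that parameter, the Hamming distance, and the function names are all assumptions made for the example. When the number of sequences is not divisible by the number of groups, the first groups receive one extra sequence.

```python
from typing import Callable, List, Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def relative_frequency_medoids(
    seqs: List[Sequence],
    n_groups: int,
    sort_key: Callable[[Sequence], float],
    dist: Callable[[Sequence, Sequence], float] = hamming,
) -> List[Sequence]:
    """Sort the sequences, split them into near-equal groups, return group medoids."""
    if not 1 <= n_groups <= len(seqs):
        raise ValueError("n_groups must be between 1 and the number of sequences")

    # Step 1: sort sequence indices by the chosen criterion (assumed, see lead-in).
    order = sorted(range(len(seqs)), key=lambda i: sort_key(seqs[i]))

    # Step 2: split the sorted list into contiguous groups of (nearly) equal size.
    base, extra = divmod(len(order), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(order[start:start + size])
        start += size

    # Step 3: within each group, keep the sequence with the smallest
    # sum of distances to the other members of that group.
    medoids = []
    for group in groups:
        best = min(group, key=lambda i: sum(dist(seqs[i], seqs[j]) for j in group))
        medoids.append(seqs[best])
    return medoids


# Toy data (invented): sort by the time spent in state "S".
seqs = ["SSSS", "SSSM", "SSMM", "SMMM", "MMMM", "MMMP"]
reps = relative_frequency_medoids(seqs, n_groups=2, sort_key=lambda s: s.count("S"))
print(reps)
```

Because every group holds the same share of the sequences, each selected medoid stands for an equal fraction of the data, unlike the coverage-based search where representatives may summarize neighborhoods of very different sizes.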
The methods for identifying representative sequences described above have been implemented in the R package .