Alignment free Dissimilarities for sequence classification

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

One way to represent a DNA sequence is to break it down into substrings of length L, called L-tuples, and count the occurrences of each L-tuple in the sequence. This representation maps a sequence to a numerical feature vector of fixed length, which makes it possible to measure sequence similarity in an alignment-free way simply by applying dissimilarity functions between vectors. This work presents a benchmark study of four alignment-free dissimilarity functions between sequences, computed on their L-tuple representation, for the purpose of sequence classification. In our experiments, we tested the classes of geometric-based, correlation-based and information-based dissimilarities, incorporating them into a nearest neighbor classifier. Results computed on three datasets of nucleosome-forming and nucleosome-inhibiting sequences show that the geometric and correlation dissimilarities are more suitable for nucleosome classification. Finally, their use could be a valid alternative to alignment-based similarity measures, which remain the preferred choice when dealing with sequence similarity problems.
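The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether L-tuples overlap, nor which specific dissimilarity functions were benchmarked, so the sketch assumes overlapping L-tuples over the alphabet ACGT and uses the Euclidean distance as one example of a geometric-based dissimilarity. The function names (`ltuple_counts`, `euclidean`, `nearest_neighbor_label`) and the toy labels are hypothetical.

```python
from itertools import product
import math


def ltuple_counts(seq, L):
    """Map a DNA sequence to a count vector of length 4**L.

    Assumption: L-tuples are taken with a sliding window of step 1
    (overlapping); windows containing symbols outside ACGT are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=L))}
    vec = [0] * len(index)
    for i in range(len(seq) - L + 1):
        j = index.get(seq[i:i + L])
        if j is not None:
            vec[j] += 1
    return vec


def euclidean(u, v):
    """Euclidean distance, used here as an example geometric dissimilarity."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor_label(query, training, L=2, dissimilarity=euclidean):
    """Return the label of the training sequence closest to the query.

    `training` is a list of (sequence, label) pairs; the labels are
    hypothetical placeholders.
    """
    q = ltuple_counts(query, L)
    closest = min(
        training,
        key=lambda item: dissimilarity(q, ltuple_counts(item[0], L)),
    )
    return closest[1]


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the paper's datasets.
    training = [("AAAAAA", "forming"), ("CGCGCG", "inhibiting")]
    print(nearest_neighbor_label("AAAAAT", training, L=2))
```

Any other vector dissimilarity (for instance a correlation-based or information-based one) can be passed through the `dissimilarity` parameter without changing the rest of the sketch.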
Original language: English
Title of host publication: Computational Intelligence Methods for Bioinformatics and Biostatistics, CIBB 2015
Pages: 1-5
Number of pages: 5
Publication status: Published - 2015
