Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

Raffaele Giancarlo, Gianluca Roscigno, Umberto Ferraro Petrillo, Giuseppe Cattaneo

Research output: Article

8 Citations (Scopus)

Abstract

Motivation: Information-theoretic and compositional/linguistic analyses of genomes play a central role in bioinformatics, all the more so now that the associated methodologies are also proving valuable for epigenomic and metagenomic studies. At the core of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications calls for parallel and distributed computing. Indeed, algorithms of that type have been developed to collect k-mer statistics for genome assembly. However, they are so specialized to that domain that they do not extend easily to the computation of informational and linguistic indices concurrently on sets of genomes.

Results: Following the approach, well established in many disciplines and increasingly successful in bioinformatics, of using MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform informational and linguistic analysis of large collections of genomic sequences concurrently on a Hadoop cluster. Our benchmarking of KCH indicates that it is quite effective and versatile. It is also competitive with the parallel and distributed algorithms highly specialized for k-mer statistics collection in genome assembly. In conclusion, KCH is a much-needed addition to the growing number of algorithms and tools that use MapReduce for core bioinformatics applications.
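As an illustration of the k-mer statistics the abstract refers to, the following is a minimal single-machine sketch in Python, written in a map/reduce style. It is not the KCH implementation and does not use Hadoop; the function names, the toy sequences, and the choice of empirical Shannon entropy as an example of an informational index are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

ALPHABET = set("ACGT")  # k-mers are drawn from {A,C,G,T}^k


def map_kmers(sequence, k):
    """Map step (hypothetical): emit (k-mer, 1) for each length-k window.

    Windows containing symbols outside A, C, G, T (e.g. N) are skipped.
    """
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= ALPHABET:
            yield kmer, 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the values emitted for each k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def empirical_entropy(counts):
    """Shannon entropy (in bits) of the observed k-mer distribution."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


def kmer_statistics(genomes, k):
    """Collect k-mer counts separately for every named sequence."""
    return {name: reduce_counts(map_kmers(seq, k))
            for name, seq in genomes.items()}


# Toy input, invented for this example.
genomes = {"toy_a": "ACGTACGT", "toy_b": "AAAAAN"}
stats = kmer_statistics(genomes, 2)
for name, counts in stats.items():
    print(name, dict(counts), empirical_entropy(counts))
```

On a real cluster the map and reduce steps would run on different nodes over partitions of the input; here they are plain Python functions so that the counting logic can be checked in isolation.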
Original language: English
Pages (from-to): 1826-1833
Number of pages: 8
Journal: Bioinformatics
Volume: 34
Publication status: Published - 2018

Fingerprint

Cluster Algorithm
Linguistics
Genomics
MapReduce
Genome
Efficient Algorithms
Bioinformatics
Genes
Computational Biology
Statistics
Parallel algorithms
Parallel and Distributed Computing
Benchmarking
DNA sequences
Distributed computer systems
Parallel processing systems
Distributed Algorithms
Epigenomics
DNA Sequence
Parallel Algorithms

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Giancarlo, R., Roscigno, G., Ferraro Petrillo, U., & Cattaneo, G. (2018). Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. Bioinformatics, 34, 1826-1833.

@article{5c0f5a978e534a878f7e28debf482ac3,
title = "Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms",
keywords = "Biochemistry, Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Statistics and Probability",
author = "Raffaele Giancarlo and Gianluca Roscigno and {Ferraro Petrillo}, Umberto and Giuseppe Cattaneo",
year = "2018",
language = "English",
volume = "34",
pages = "1826--1833",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",

}

TY - JOUR

T1 - Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

AU - Giancarlo, Raffaele

AU - Roscigno, Gianluca

AU - Ferraro Petrillo, Umberto

AU - Cattaneo, Giuseppe

PY - 2018

Y1 - 2018

KW - Biochemistry

KW - Computational Mathematics

KW - Computational Theory and Mathematics

KW - Computer Science Applications

KW - Molecular Biology

KW - Statistics and Probability

UR - http://hdl.handle.net/10447/291365

UR - http://bioinformatics.oxfordjournals.org/

M3 - Article

VL - 34

SP - 1826

EP - 1833

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

ER -