A pipeline for variants discovery using next-generation DNA sequencing data

Research output: Contribution to conference › Other › peer-review


Recent advances in next-generation sequencing (NGS) technology provide a cost-effective approach to large-scale resequencing of livestock samples in order to study several biological phenomena. NGS produces millions of short DNA sequences, which require unbiased, comprehensive searches for variation in order to identify putative causative mutations for economically important traits. The aim of this work was to present a bioinformatics pipeline for variant discovery in the ovine genome. A total of 30 Valle del Belice dairy ewes were used for whole-genome sequencing of pooled libraries prepared using the Illumina Nextera kit. Paired-end sequencing was carried out on an 8-lane flow cell of the Illumina HiScanSQ platform, yielding a total of 1,159,664,912 reads of 101 bp in length. The left and right raw reads were separated into two files and converted to FASTQ format using CASAVA 1.8. The whole procedure was split into different workflows in order to give more flexibility to end users. One workflow verifies the quality of the raw sequencing reads using FastQC and the FASTX-Toolkit, keeping bases with a Phred quality score greater than 20 and trimming reads of poor quality. Another workflow aligns the reads to the Ovis aries 3.1 reference genome using BWA-MEM with standard parameters. The resulting SAM file was converted to a BAM file using SAMtools, then unmapped and duplicate reads were removed using the CleanSam and MarkDuplicates commands of Picard. Then, to obtain more accurate base qualities, the Genome Analysis Toolkit (GATK) was used to locally realign reads so that the number of mismatching bases due to indels is minimized across all reads (IndelRealigner) and to detect systematic errors in base quality scores (BaseRecalibrator).
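The quality-control step keeps bases with a Phred score above 20, that is, bases with an estimated error probability below 1% (Q = -10 log10 P). The following is only a minimal sketch of that idea, not the FASTX-Toolkit algorithm: it assumes Phred+33 encoded quality strings (the encoding written by CASAVA 1.8) and trims low-quality bases from the 3' end only. The function names `phred_scores` and `trim_3prime` are hypothetical and do not come from the pipeline.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Drop trailing bases until one with a score above min_q is reached.

    Illustrative only: real trimmers may use sliding windows or other rules.
    """
    scores = phred_scores(quality, offset)
    end = len(scores)
    # Walk back from the 3' end while the score is not greater than min_q.
    while end > 0 and scores[end - 1] <= min_q:
        end -= 1
    return sequence[:end], quality[:end]


if __name__ == "__main__":
    # 'I' decodes to 40, '#' to 2 and '5' to 20 under Phred+33.
    seq, qual = trim_3prime("ACGTACGT", "IIIIII#5")
    print(seq, qual)
```

In this toy read the last two bases (scores 2 and 20) are removed, because the threshold is strictly "greater than 20" as stated in the abstract.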
In the last workflow, SNPs and indels are identified using the mpileup command of SAMtools. The resulting BCF file is passed to the “bcftools view” tool to be filtered and converted into VCF format. Finally, SnpSift was used for variant annotation. A total of 6,357,170 variants were discovered, of which 5,265,739 were SNPs and 1,091,431 were indels. About 77% of the SNPs were present in the Ovis aries dbSNP v147, while the remaining SNPs were novel. The discovered SNPs must be validated and could then be used in several applications, such as phylogenetic analysis, genome-wide association studies or genomic selection.
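The SNP and indel totals reported above come from counting records in the final VCF file. As a rough illustration of how such a tally could be made, the sketch below classifies each REF/ALT pair by allele length. It is an assumption-laden simplification, not the authors' procedure: the helper names `classify` and `count_variants` are hypothetical, the example records are invented, and multi-allelic sites are counted once per alternate allele.

```python
from collections import Counter


def classify(ref, alt):
    """Label one REF/ALT pair as 'snp', 'indel' or 'other'."""
    # Missing, spanning-deletion and symbolic alleles are not simple variants.
    if alt in (".", "*") or alt.startswith("<"):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variants(vcf_lines):
    """Count variant types over the data lines of a VCF file."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # VCF columns 4 (REF) and 5 (ALT)
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
    return counts


if __name__ == "__main__":
    # Invented records for illustration only.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tA\tG\t50\tPASS\t.",
        "1\t200\t.\tAT\tA\t50\tPASS\t.",
        "2\t300\t.\tC\tT,CA\t50\tPASS\t.",
    ]
    print(count_variants(example))
```

To run it on a real file, the lines of an uncompressed VCF could be passed directly, for instance `count_variants(open(path))`, where `path` is whatever file the last workflow produced.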
Original language: English
Number of pages: 2
Publication status: Published - 2017

