Knowledge of an organism's complete DNA makeup is crucial for resolving genetic and epigenetic variations in the greatest detail. Nowadays, third-generation sequencers provide long, redundant substrings of the DNA, called reads, with an average length of 10 kb, allowing unambiguous sequence alignment and high-quality de novo genome assemblies. However, long reads still have high error rates, up to 15% for Pacific Biosciences technology, which makes it hard to assemble whole, high-quality genomes.
In this work we propose a novel approach to handling erroneous data in the initial step of the assembly pipeline, which consists of finding overlapping reads. The approach is based on k-mers, short substrings of the reads that are shared between them. To discover the overlaps efficiently, we exploit sparse matrix multiplication, achieving true positive rates greater than 90% for several genomes.
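To make the idea concrete, the following is a minimal sketch of k-mer-based overlap detection phrased as a sparse matrix product. It is an illustration under stated assumptions, not the implementation presented in the talk. It assumes a binary matrix A whose rows are reads and whose columns are k-mers, so that entry (i, j) of A·Aᵀ counts the distinct k-mers shared by reads i and j. The function names (`kmers`, `shared_kmer_counts`, `candidate_overlaps`) and the `min_shared` threshold are hypothetical, and the sketch ignores reverse complements and any error-aware k-mer selection.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers (length-k substrings) of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmer_counts(reads, k):
    """Compute the strict upper triangle of A * A^T.

    A is the (assumed) binary reads-by-k-mers matrix: A[i][m] = 1 if read i
    contains k-mer m. It is stored sparsely, column by column: each k-mer
    maps to the list of read indices that contain it.
    """
    columns = defaultdict(list)
    for i, read in enumerate(reads):
        for kmer in kmers(read, k):
            columns[kmer].append(i)  # indices are appended in increasing order

    # Each column contributes 1 to entry (i, j) for every pair of reads
    # that both contain the k-mer; zero entries are never touched.
    counts = defaultdict(int)
    for read_ids in columns.values():
        for i, j in combinations(read_ids, 2):
            counts[(i, j)] += 1
    return dict(counts)


def candidate_overlaps(reads, k, min_shared):
    """Return read pairs (i, j), i < j, sharing at least min_shared k-mers."""
    counts = shared_kmer_counts(reads, k)
    return sorted(pair for pair, n in counts.items() if n >= min_shared)
```

For example, with the toy reads `["ACGTACGGA", "TACGGATTC", "GGGGCCCCA"]` and `k = 4`, the first two reads share the k-mers `TACG`, `ACGG` and `CGGA`, so `candidate_overlaps(reads, 4, 2)` returns `[(0, 1)]`. Real long-read data sets would need a proper sparse matrix library and a threshold tuned to the error rate.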
The NECSTLab is a DEIB laboratory with several research lines on advanced topics in computing systems: from architectural characteristics, to hardware-software codesign methodologies, to security and dependability issues of complex system architectures.
Every week, the "NECST Friday Talk" invites researchers, professionals or entrepreneurs to share their work experiences and the projects they are implementing in the field of "Computing Systems".