Blended Length Genome Sequencing (blend-seq): Combining Short Reads with Low-Coverage Long Reads to Maximize Variant Discovery

bioRxiv

We introduce blend-seq, a method for combining data from traditional short-read sequencing pipelines with low-coverage long reads, with the goal of substantially improving variant discovery for single samples without the full cost of high-coverage long reads. We demonstrate that with only 4x long-read coverage augmenting 30x short reads, we can improve SNP discovery across the genome and achieve precision and recall beyond what is possible with short reads alone, even at very high coverage (60x). For genotype-agnostic discovery of structural variants, we see a threefold improvement in recall while maintaining precision by using the low-coverage long reads on their own. For the more specialized scenario of genotype-aware structural variant calling, we show how combining the long and short reads in a graph-based approach results in better performance than either technology on its own. The observed gains highlight the complementary nature of short- and long-read technologies: long reads help with SNP discovery by mapping better to difficult regions, and they perform better on long insertions and deletions (structural variants) by virtue of their length, while the larger number of short-read layers helps with genotyping structural variants discovered by long reads. In this way, blend-seq offers many of the benefits of long-read pipelines without incurring the cost of high-coverage long reads.