Mathematical and Computational Biology Stream

Algorithmic techniques for DNA sequence analysis

Faculty: Navin Kashyap (ECE), Chirag Jain (CDS)

This project involves developing novel algorithmic techniques and open-source software to address problems in computational biology. The following topics of research are being proposed in this project, a well-chosen subset of which could be covered during the course of a PhD:

  1. Information-optimal graph sparsification for human genome reconstruction: Genome assembly is the task of reconstructing a sequenced genome from a large number of short substrings of that genome. In lay terms, think of it as a jigsaw puzzle where one has to arrange the short string fragments to obtain the final picture, but without prior knowledge of what the complete picture looks like. The latest sequencing technologies and bioinformatics algorithms have shown great promise in achieving high-quality human genome assemblies. Graph-based data structures play a central role in computing a genome assembly: vertices correspond to the string fragments, and edges correspond to the overlaps among them. In such a graph, the genome can be spelled by a walk. A crucial step in this analysis is to sparsify the graph so as to reduce the number of false and redundant overlaps [1]. Current state-of-the-art methods perform graph sparsification using heuristics, without guaranteeing that no information is lost. We seek an information-optimal framework for graph sparsification during genome assembly.
  2. Secure DNA-based storage of digital data: With the rapid accumulation of digital data in all application sectors, long-term storage of data has become a bottleneck. Synthetic DNA has been projected as a promising medium for data storage because of its enormous storage density. For example, at the theoretical maximum, DNA can encode 455 exabytes of data per gram [2]. Over the past decade, several research groups have successfully demonstrated working DNA-based storage systems. In this project, we seek a secure DNA storage system in which the data encoded in a DNA sample can be decoded only by an authorised party [3]. The security mechanism should be robust against future advances in DNA sequencing technology. There could even be a wet-lab component to this project, in which biochemical experiments with synthetic DNA molecules are conducted.
  3. Compression of long-read DNA sequencing data: The ongoing GenomeIndia project aims to complete the sequencing of 20,000 Indian individuals. Collectively, the storage space required to archive the generated data is anticipated to be over two petabytes. Substantial data-storage requirements in genomic research imply that the costs associated with storage and transmission can exceed the cost of generating the genomic data. Off-the-shelf compression tools (e.g., zip) do not exploit the domain-specific characteristics of the data, which has motivated the development of compression techniques specifically targeted at genomic data [4]. We are interested in investigating further improvements to lossless and lossy compression techniques that are applicable to long-read DNA sequencing data. Another important consideration in this context is the ability of the compression algorithm to provide a mechanism for searching within compressed data. For instance, we may wish to identify reads with certain specific characteristics, such as those containing a particular genetic marker or motif, within a large compressed database. Instead of decompressing and then searching, it would be more efficient to search directly within the compressed database, if possible. We would like to develop an understanding of the tradeoffs between the compression ratio achieved and the time taken to search through a large compressed genomic database.
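
As a toy illustration of the overlap graph described in topic 1, the following Python sketch builds a graph whose vertices are reads and whose edges are exact suffix-prefix overlaps, removes transitively implied edges, and spells the genome along a walk. The function names, the error-free reads, the `min_len` threshold, and the example genome are all assumptions made for illustration. The transitive-edge removal shown here is a simplified classical heuristic; it is not the method of [1], nor the information-optimal framework this project seeks.

```python
from itertools import permutations


def overlap_len(a, b, min_len):
    """Length of the longest proper suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Vertices are read indices; graph[u][v] is the overlap length of u -> v."""
    graph = {i: {} for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap_len(reads[i], reads[j], min_len)
        if k:
            graph[i][j] = k
    return graph


def remove_transitive_edges(graph):
    """Drop u -> w whenever edges u -> v and v -> w both exist (simplified heuristic)."""
    sparse = {u: dict(nbrs) for u, nbrs in graph.items()}
    for u, nbrs in graph.items():
        for v in nbrs:
            for w in graph[v]:
                if w != v and w in nbrs:
                    sparse[u].pop(w, None)
    return sparse


def spell_walk(reads, graph, walk):
    """Concatenate the reads along `walk`, skipping each overlapping prefix."""
    seq = reads[walk[0]]
    for u, v in zip(walk, walk[1:]):
        seq += reads[v][graph[u][v]:]
    return seq


# Hypothetical error-free example: length-8 reads sampled every 2 bases.
genome = "ACGTTGCATGGACCTA"
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 2)]

graph = build_overlap_graph(reads, min_len=3)
sparse = remove_transitive_edges(graph)
assembled = spell_walk(reads, sparse, walk=[0, 1, 2, 3, 4])
```

In this toy case the sparsified graph is a single path, so the walk is unambiguous. With real long reads, sequencing errors and genomic repeats create false overlaps, and deciding which edges can be removed without losing information is the open question raised above.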

References

  1. Chirag Jain. Coverage-preserving sparsification of overlap graphs for long-read assembly. bioRxiv, 2022.
  2. George M Church, Yuan Gao, and Sriram Kosuri. Next-generation digital information storage in DNA. Science, 337(6102):1628, 2012.
  3. Praneeth Kumar Vippathalla and Navin Kashyap. The secure storage capacity of a DNA wiretap channel model. arXiv, 2022.
  4. Marek Kokot, Adam Gudyś, Heng Li, and Sebastian Deorowicz. CoLoRd: compressing long reads. Nature Methods, 2022.