Tehran Institute for Advanced Studies (TEIAS)

Using Gapped K-mers to Model Biological Sequences

July 23, 2018


Khatam University, Building No. 2
Address: Mollasadra Blvd., North Shirazi St., East Daneshvar St., No.17.


Dr. Mahmoud Ghandi

Senior Group Leader at the Broad Institute of MIT and Harvard


Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from an inherent limitation: if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training-set k-mer frequencies once k becomes large. In this talk, I will introduce alternative feature sets using gapped k-mers and a general method for robust estimation of k-mer frequencies. I will also show how this can be used to form a new kernel for an SVM classifier (gkm-SVM) to model biological sequences. Finally, I will discuss computational methods to implement these approaches and apply them to real biological datasets.
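
To make the contrast between contiguous and gapped k-mers concrete, here is a minimal Python sketch of the two feature sets. It assumes a gapped k-mer is a window of length l in which k positions are informative and the remaining l - k positions are masked as gaps. The function names, the "-" gap symbol, and the toy sequence are illustrative assumptions; this is not the gkm-SVM software, and it does not show the robust frequency estimation method described in the talk.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_gapped_kmers(seq, l, k):
    """Count gapped k-mers: windows of length l with k informative positions.

    The other l - k positions are masked with '-', so windows that differ
    only at masked positions contribute to the same feature.
    """
    counts = Counter()
    masks = [set(keep) for keep in combinations(range(l), k)]
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for kept in masks:
            feature = "".join(
                c if j in kept else "-" for j, c in enumerate(window)
            )
            counts[feature] += 1
    return counts


seq = "ACGTACGAACGT"  # toy sequence, chosen only for illustration
full = count_kmers(seq, 4)
gapped = count_gapped_kmers(seq, 4, 3)

print(full["ACGT"], full["ACGA"])  # 2 1
print(gapped["ACG-"])              # 3
```

In this toy case the windows ACGT (seen twice) and ACGA (seen once) are separate features when counted as full 4-mers, but both contribute to the gapped feature ACG-. Pooling counts in this way is what makes gapped features less sparse than full-length k-mers of the same span.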


Mahmoud Ghandi is a group leader in Levi Garraway’s lab. He leads the Computational Biology group in the Cancer Cell Line Encyclopedia (CCLE) project, a public collection of data on over 1000 cancer cell lines, as part of the Cancer Program at the Broad Institute of MIT and Harvard. His group uses high-throughput genomics integrated with large-scale small-molecule, shRNA, and CRISPR screens to study the molecular mechanisms of cancer, find new vulnerabilities and therapeutic targets, and investigate the mechanisms of drug resistance. The group also develops and uses state-of-the-art computational methods for integrative analysis of next-generation sequencing data as well as proteomic, metabolomic, microRNA, and epigenetic data to build predictive models for drug sensitivities and to advance our understanding of cancer biology.
Ghandi completed his Ph.D. in biomedical engineering at Johns Hopkins University before joining the Broad Institute in 2012. During his doctoral studies, he undertook two summer internships, one with NextBio in Cupertino, California, and the other with Bristol-Myers Squibb in Syracuse, New York. He also worked at the Advanced Communications Research Institute after completing his B.Sc. in electrical engineering at the Sharif University of Technology in Iran.