Senior Research Project Proposal
Anvita Gupta

  1. Title of Project:

High Performance Computing Clusters for Clinical Therapy


  2. Statement of Purpose:

According to Eric Schmidt, executive chairman of Google, “From the dawn of human history until 2003, mankind generated five exabytes of data. Now we produce five exabytes every two days...and the pace is accelerating.” High performance computing clusters, which are essentially many small computers networked together so tightly that they can be viewed as a single computer, offer a way to process this flood of information. In my research, I aim to build a high performance computing cluster optimized for biological data.


  3. Background:

I have taken AP Computer Science and taught myself programming languages such as R, Python, and MATLAB. I also have some computer hardware experience from robotics clubs in the seventh and eighth grades, and I am interested in gaining more experience building computers and understanding their hardware. Through projects conducted over the past few years with Wright State University, the Research Science Institute, and TGen, I have experience applying artificial intelligence to the treatment of disease. As the amount of available data grows in biology, demographics, and other fields, more efficient computer systems are needed to process it, increasing the demand for high performance computing clusters.


  4. Prior Research:

HPCC Systems, an open-source software framework for high performance computing clusters, was developed by LexisNexis Risk Solutions. The platform supports parallel batch data processing through Thor and online query applications through Roxie. It also provides its own declarative programming language, ECL, optimized for parallel data processing.

Computing platforms optimized for medical applications include QIIME [1], developed by scientists at the University of Colorado. QIIME allows analysis of high-throughput community sequencing data, is optimized for microbial data, and has been applied to understanding the ecology of microbial populations [1]. Other published systems aimed primarily at biological use include SCIRun2 [2]. SCIRun2 demonstrates that HPCs optimized for scientific problems should be built from computational elements (modules) that each do one thing and are connected into a network that makes up the program [2]. In addition, the SCIRun2 work found that scientific programs should consist of both computational and visualization nodes that together orchestrate the solution to a scientific problem.
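To make the module-network idea concrete, here is a minimal Python sketch (not SCIRun2's actual API; the module names and the GC-content workload are hypothetical, chosen only for illustration):

    # Minimal sketch of a dataflow "module network" in the spirit of SCIRun2.
    # Each module does one thing; modules are wired together into a pipeline.

    class Module:
        """A computational element that performs a single task."""
        def run(self, data):
            raise NotImplementedError

    class LoadSequences(Module):
        def run(self, path):
            # Stand-in for reading sequences from a file.
            return ["ACGTGC", "GGTAAT"]

    class ComputeGC(Module):
        def run(self, seqs):
            # Computational node: GC fraction of each sequence.
            return [(s.count("G") + s.count("C")) / len(s) for s in seqs]

    class Visualize(Module):
        def run(self, values):
            # Visualization node: here, a simple text rendering.
            for i, v in enumerate(values):
                print(f"sequence {i}: GC fraction {v:.2f}")

    # Wire the modules into a network (a linear pipeline in this sketch).
    data = "reads.fasta"  # hypothetical input path
    for module in [LoadSequences(), ComputeGC(), Visualize()]:
        data = module.run(data)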

Molecular modeling problems that can be tackled with HPCCs include molecular docking simulations, which require a great amount of computational power [4]. Another computationally demanding problem is all-atom simulation, which models the motion of every atom in a molecule. The largest all-atom biomolecular simulation published to date, of the ribosome, required 1024 CPUs [3]; increasing computational power at lower cost can grow the size of the molecules we are able to simulate. It is also possible that much research on HPCCs for drug discovery has not been published, because large pharmaceutical companies are hesitant to release information about their specialized HPCC systems.


  5. Significance:

Bioinformatics is currently at a crossroads because of the sudden availability of vast amounts of biological data that could transform medicine, including data on metabolites, DNA, RNA, and proteins. However, researchers do not yet understand how these layers of data interact to cause diseases like cancer or Alzheimer's, or how the data can be used to treat diseases like Ebola or malaria. Over the past few months, I have been researching a set of diseases known as “neglected tropical diseases”: although they affect a disproportionate number of people, most of the victims live in developing countries and cannot afford the exorbitant prices of most drugs. Consequently, while nearly one thousand cancer drugs have been developed this year alone, a new medicine for tuberculosis has not been developed in nearly forty years, even in the face of strains completely resistant to antibiotics.

Computational drug discovery promises to be a cheaper route to new medications, but current software is either very expensive or largely inaccurate. The central difficulty is that screening chemical compounds requires enormous computational power, since the number of possible chemical structures is so large. One of the most effective methods of drug development is pharmacophore screening, which takes chemical compounds already known to bind to a protein, isolates the section of the compound that binds (the “pharmacophore”), and screens chemical libraries for compounds containing that motif. Without a supercomputer, which is very expensive, this method is inefficient.
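The motif-matching step at the heart of this kind of screening can be prototyped with open-source tools. The following is a minimal sketch using the RDKit cheminformatics library; the SMARTS pattern and the tiny compound library are hypothetical stand-ins, and a real pharmacophore screen would also consider 3D geometry:

    # Minimal sketch of motif-based library screening with RDKit.
    # The motif and the compound library are hypothetical examples.
    from rdkit import Chem

    # A toy "pharmacophore-like" motif: an aromatic ring bearing an oxygen.
    motif = Chem.MolFromSmarts("c1ccccc1O")

    # A tiny chemical library, encoded as SMILES strings.
    library = {
        "phenol": "Oc1ccccc1",
        "ethanol": "CCO",
        "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    }

    hits = []
    for name, smiles in library.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None and mol.HasSubstructMatch(motif):
            hits.append(name)

    print("compounds containing the motif:", hits)  # phenol, aspirin

At scale, the same loop would run over millions of compounds, which is exactly the kind of workload a computing cluster can parallelize.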

High performance computing clusters are a viable solution: they are collections of multi-core processors (nodes) that compute in parallel, offering the power of a supercomputer at much lower cost. One of the most interesting problems in bioinformatics is designing such clusters, along with more accurate software for predicting drug toxicity, to make drug development cheaper and more accessible. A cheap HPCC optimized for clinical data could have an enormous impact on biological research, and could even be used in doctors' offices to mine patient data for tasks like finding the optimal combination of medications or creating personalized treatment plans. This research will also teach me more about hardware, the hidden backbone of computers. Once my research is complete, I may present my work to bioinformatics startups in Phoenix or the Bay Area. Cheaper computational methods are in great demand, so I anticipate that many researchers and entrepreneurs would be interested in such a high performance computing cluster.


  6. Description:

The research I conduct will be primarily engineering-based: I will build the cluster from scratch, measure its data-processing performance, and then optimize it, repeating this cycle until the cluster improves upon existing home-built HPCCs. As a result of my research, I will produce a fully functioning high performance computing cluster suitable for home or office use, specifically optimized for handling large amounts of clinical and biological data such as sequencing information and patient histories.


  7. Methodology:

First, I will build the high performance computing cluster. Because the cluster is intended for in-home and in-office use, one requirement is that it be portable; portability, in turn, restricts the size of the cluster. Several compute nodes will be controlled by one head node. Since Linux is well suited to servers, the cluster will run a Linux distribution. I will optimize the cluster for handling large amounts of biological data by equipping it with fast processors and a system of distributed, parallel computing. A sketch of how the head node might farm work out to the compute nodes appears below.
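The following minimal sketch illustrates the head-node/compute-node pattern, assuming an MPI installation with the mpi4py Python bindings; the GC-counting workload is a hypothetical stand-in for real biological queries:

    # Minimal head-node / compute-node sketch using MPI via mpi4py.
    # Launch across the cluster with, e.g.: mpirun -n 4 python scatter_demo.py
    # The GC-counting workload is hypothetical, for illustration only.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # rank 0 acts as the head node
    size = comm.Get_size()

    if rank == 0:
        # Head node: split the dataset into one chunk per node.
        sequences = ["ACGTGGC", "TTACGCA", "GGGGCCA", "ATATATA"] * size
        chunks = [sequences[i::size] for i in range(size)]
    else:
        chunks = None

    # Every node receives one chunk of work.
    local = comm.scatter(chunks, root=0)

    # Compute-node work: count G and C bases in the local chunk.
    local_gc = sum(s.count("G") + s.count("C") for s in local)

    # The head node gathers and combines the partial results.
    totals = comm.gather(local_gc, root=0)
    if rank == 0:
        print("total GC bases across the cluster:", sum(totals))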

Cornell has developed a suite of computational biology applications for HPCs (BioHPC) that allows researchers to submit parallel computing jobs in a user-friendly manner [5]. BioHPC is open source, and I will use it as the basis for the cluster's computing environment.

Finally, I will test the processing power of the HPCC by having it search through large amounts of genetic sequencing data. The accuracy of the results and the time taken will be measured and compared with the recorded performance statistics of existing HPCCs. A sketch of such a benchmark appears below.
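As a sketch of the benchmarking step (the synthetic data and the motif-count query are hypothetical stand-ins for the real sequencing datasets and queries):

    # Minimal sketch of a timing benchmark over synthetic sequencing data.
    # Random DNA and a motif-count query stand in for real workloads.
    import random
    import time

    random.seed(0)  # reproducible synthetic data

    # Generate 10 million bases of synthetic DNA.
    genome = "".join(random.choice("ACGT") for _ in range(10_000_000))

    query = "GATTACA"
    start = time.perf_counter()
    hits = genome.count(query)          # the measured workload
    elapsed = time.perf_counter() - start

    print(f"found {hits} occurrences of {query} in {elapsed:.3f} s")
    print(f"throughput: {len(genome) / elapsed / 1e6:.1f} Mbases/s")

On the finished cluster, the same measurement would be taken on the distributed search, so that single-machine and cluster throughput can be compared directly.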


  8. Problems:

One problem I expect initially is the limited amount of publicly available biological data on which to test the performance of my high performance computing cluster. However, by combining multiple data sources from around the web, I anticipate being able to solve this problem. Another anticipated problem is building the cluster in a small space while keeping its temperature low; I aim to address this through thorough research on available ventilation systems for computing clusters, which I can then build into my HPCC.

  9. Bibliography:

[1] Caporaso, J., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335-336.

[2] Zhang, K., Damevski, K., Venkatachalapathy, V., & Parker, S. (2004). SCIRun2: A CCA framework for high performance computing. Proceedings of the Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2004), 72-79.

[3] Sanbonmatsu, K., & Tung, C. (2007). High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology, 157(3), 470-480.

[4] Okimoto, N., Futatsugi, N., Fuji, H., Suenaga, A., Morimoto, G., Yanai, R., ... Case, D. (2009). High-Performance Drug Discovery: Computational Screening by Combining Docking and Molecular Dynamics Simulations. PLoS Computational Biology, 5(10), e1000528.

[5] Pillardy, J. “Computational Biology Applications Suite for High Performance Computing (BioHPC).” Cornell University.