Senior Research Project Proposal
Anvita Gupta
- Title of Project:
High Performance Computing Clusters for Clinical Therapy
- Statement of Purpose:
According to Eric Schmidt, executive chairman of Google, “From the dawn of human history until 2003, mankind generated five exabytes of data. Now we produce five exabytes every two days...and the pace is accelerating.” High performance computing clusters are, in essence, many small computers that work together so tightly that they can be viewed as a single machine, which makes them well suited to processing data at this scale. In my research, I aim to build a high performance computing cluster optimized for biological data.
- Background:
I have taken AP Computer Science and taught myself programming languages including R, Python, and MATLAB. I also have some computer hardware experience from robotics clubs in the seventh and eighth grades, and I am interested in gaining more experience with building computers and understanding their hardware. Through projects conducted over the past few years with Wright State University, the Research Science Institute, and TGen, I have experience applying artificial intelligence to the treatment of disease. As the amount of data available in biology, demographics, and other fields grows, more efficient computer systems are needed to process it, increasing the need for high performance computing clusters.
- Prior Research:
A software framework for high performance computing clusters was first developed by LexisNexis Risk Solutions and released as open-source software. The HPCC platform supports batch-parallel data processing through Thor and online query processing through Roxie. It also provides its own declarative programming language, ECL, optimized for parallel data processing.
Software optimized for medical applications on high performance computers includes QIIME [1], developed by scientists at the University of Colorado. QIIME allows analysis of high-throughput community sequencing data and is optimized for microbial data; it has been applied to understanding the ecology of microbial populations [1]. Other published HPC frameworks aimed primarily at scientific and biological use include SCIRun2 [2]. SCIRun2 demonstrates that HPC software for scientific problems benefits from being built out of small computational elements (modules) that each do one thing and are then connected to form a network that makes up the program [2]. In addition, the SCIRun2 work found that scientific programs should consist of both computational and visualization nodes that together orchestrate the solution to a scientific problem.
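To make the module-network idea concrete, the sketch below is a minimal Python illustration of my own (not SCIRun2's actual CCA interface): each module performs exactly one task, and the modules are wired together into a network that constitutes the program, ending in a visualization module.

    # Minimal sketch of a SCIRun2-style module network (illustrative
    # Python, not SCIRun2's actual interface): each "module" does one
    # thing, and modules are connected to form the program.

    def load_data(path):
        """Computational module: read one numeric value per line."""
        with open(path) as f:
            return [float(line) for line in f if line.strip()]

    def smooth(values, window=3):
        """Computational module: simple moving-average filter."""
        half = window // 2
        out = []
        for i in range(len(values)):
            chunk = values[max(0, i - half):i + half + 1]
            out.append(sum(chunk) / len(chunk))
        return out

    def render(values):
        """Visualization module: crude text plot of the signal."""
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        for v in values:
            print("#" * int(40 * (v - lo) / span))

    def run_network(path):
        # The "network": single-purpose modules connected in sequence.
        render(smooth(load_data(path)))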
Molecular modeling problems that can be addressed with HPCCs include molecular docking simulations, which require enormous computational power [4]. Another computationally demanding problem is all-atom simulation, which models the motion of every atom in a molecule. The largest all-atom biomolecular simulation published to date, of the ribosome, required 1024 CPUs [3]; increasing computational power at lower cost would let us simulate larger molecules. It is also possible that much research on HPCCs for drug discovery goes unpublished because large pharmaceutical companies are hesitant to release information about their specialized HPCC systems.
- Significance:
Currently, bioinformatics is at a crossroads because of the sudden availability of vast amounts of biological data that could transform medicine, including data on metabolites, DNA, RNA, and proteins. However, researchers do not yet understand how these different scales of biological data interact to cause diseases like cancer or Alzheimer's, or how the data can be used to treat diseases like Ebola or malaria. Over the past few months, I have been researching a set of diseases known as “neglected tropical diseases”: although they affect an enormous number of people, most victims live in developing countries and cannot afford the exorbitant prices of most drugs, so the diseases receive little attention from drug developers. Consequently, while nearly one thousand cancer drugs have been developed this year alone, no new medicine for tuberculosis has been developed in nearly forty years, even in the face of strains completely resistant to antibiotics. Computational drug discovery promises to be a cheaper way to develop medication, but current software is either very expensive or largely inaccurate. Computational drug discovery is difficult because screening chemical compounds requires enormous computational power, since the number of possible chemical structures is so high. One of the most effective methods of drug development is pharmacophore screening, which takes chemical compounds already known to bind to a protein, isolates the section of the compound that binds (the “pharmacophore”), and then screens chemical libraries for compounds containing that motif. However, this method is impractical without a supercomputer, which is very expensive.
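As a much-simplified illustration of the final screening step, the Python sketch below uses the open-source RDKit toolkit to filter a tiny stand-in chemical library for a motif written as a SMARTS pattern. The motif and compound library are placeholders of my choosing, and a real pharmacophore screen would also match 3D geometry rather than only a 2D substructure.

    # Simplified sketch of the library-screening step using the
    # open-source RDKit toolkit. The SMARTS motif and SMILES library
    # are placeholders; a real pharmacophore screen also matches 3D
    # geometry, not just a 2D substructure.
    from rdkit import Chem

    # Stand-in binding motif (a carboxylic acid group) as SMARTS.
    pharmacophore = Chem.MolFromSmarts("C(=O)[OH]")

    # Tiny stand-in chemical library of candidate compounds (SMILES).
    library = {
        "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
        "benzene": "c1ccccc1",
        "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    }

    # Keep only the compounds that contain the motif.
    hits = [
        name for name, smiles in library.items()
        if Chem.MolFromSmiles(smiles).HasSubstructMatch(pharmacophore)
    ]
    print(hits)  # ['aspirin', 'ibuprofen']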
High performance computing clusters are a viable solution: they are collections of multi-core processors (nodes) that compute in parallel, offering power comparable to a supercomputer at much lower cost. I think one of the most interesting problems in bioinformatics is designing such clusters, along with more accurate software for predicting drug toxicity, to make drug development cheaper and more accessible; the kind of parallelism involved is sketched below.
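The data parallelism that makes a cluster cost-effective can be sketched even on a single multi-core node. Below, Python's standard multiprocessing module splits a toy screening workload across all available cores; score() is a placeholder for real docking or matching work, and the same pattern extends across the nodes of a full cluster.

    # Toy sketch of data parallelism on one multi-core node: split a
    # screening workload across all CPU cores with the standard
    # multiprocessing module. score() stands in for real per-compound
    # docking or substructure-matching work.
    from multiprocessing import Pool

    def score(compound_id):
        """Placeholder for an expensive per-compound computation."""
        return compound_id, sum(i * i for i in range(100_000)) % 97

    if __name__ == "__main__":
        compounds = range(1_000)      # stand-in chemical library
        with Pool() as pool:          # one worker per core by default
            results = pool.map(score, compounds)
        print(f"scored {len(results)} compounds")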
Developing a cheap HPCC optimized for clinical data could have an enormous impact on biological research, and such a machine could even be used in doctors' offices on patient data for tasks like finding the optimal combination of medications or creating personalized treatment plans. I believe this research will also teach me more about the hardware backbone of computers. Once my findings are complete, I may present my work to bioinformatics startups in Phoenix or the Bay Area. Cheaper computational methods are in great demand, so I anticipate that many researchers and entrepreneurs would be interested in such a high performance computing cluster.
- Description:
The research I conduct will be primarily engineering-based: I will first build the cluster from scratch, measure its data-processing performance, and then optimize the machine. This cycle will be repeated until my HPCC improves upon existing home-built clusters. As a result of my research, I will produce a fully functioning high performance computing cluster suitable for home or office use, specifically optimized for handling large amounts of clinical and biological data such as sequencing information and patient histories.
- Methodology:
First, I will build the high performance computing cluster. The cluster will be designed for in-home and in-office use, so one requirement is that it be portable; portability, in turn, restricts the physical size of the cluster. Several compute nodes will be controlled by one head node. Since Linux is well suited to servers, the cluster will run a Linux distribution. I will optimize the cluster for handling large amounts of biological data by equipping it with fast processors and a system of distributed and parallel computing.
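One common way to program this head-node/compute-node layout is MPI. The sketch below uses the mpi4py package (one option among several, and an assumption on my part rather than a fixed design choice) to have a head process (rank 0) scatter chunks of placeholder data to the compute processes and gather their partial results.

    # Sketch of the head-node / compute-node pattern using MPI via
    # mpi4py (one option among several; the data is a placeholder).
    # Launch with something like: mpiexec -n 4 python cluster_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # 0 = head node, others = compute nodes
    size = comm.Get_size()

    if rank == 0:
        # Head node: split the dataset into one chunk per process.
        data = list(range(100))
        chunks = [data[i::size] for i in range(size)]
    else:
        chunks = None

    # Every process receives its chunk and works on it.
    chunk = comm.scatter(chunks, root=0)
    partial = sum(x * x for x in chunk)   # placeholder computation

    # Head node gathers and combines the partial results.
    totals = comm.gather(partial, root=0)
    if rank == 0:
        print("total:", sum(totals))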
Cornell has developed a suite of computational biology applications for HPC (BioHPC) that lets researchers submit parallel computing jobs in a user-friendly manner [5]. BioHPC is open source, and I will use it as the basis for the cluster's computing environment.
Finally, I will test the processing power of the HPCC by having it search through large amounts of genetic sequencing data. The accuracy of the results and the time taken will be measured and compared with the published performance statistics of existing HPCCs.
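For the timing measurements, a simple harness along these lines (plain Python, with randomly generated placeholder sequences and a made-up motif) records wall-clock time for a sequence search, so that runs on the HPCC can be compared against runs on a reference machine.

    # Minimal benchmarking harness for the sequence-search test. The
    # random sequences and motif are placeholder data; only the timing
    # pattern matters.
    import random
    import time

    def count_motif(sequences, motif):
        """Count (non-overlapping) occurrences of a DNA motif."""
        return sum(seq.count(motif) for seq in sequences)

    random.seed(0)
    sequences = [
        "".join(random.choice("ACGT") for _ in range(10_000))
        for _ in range(500)
    ]

    start = time.perf_counter()
    hits = count_motif(sequences, "GATTACA")
    elapsed = time.perf_counter() - start
    print(f"{hits} matches in {elapsed:.3f} s")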
- Problems:
One problem I expect initially is the limited availability of large, publicly available biological datasets on which to test the cluster's performance. However, by aggregating multiple data sources from around the web, I anticipate being able to solve this problem. Another anticipated difficulty is building the cluster in a small space while keeping it cool; I aim to address this through thorough research on available ventilation systems for computing clusters, which I can then build into my HPCC.
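One plausible way to aggregate public sequence data, sketched below, is to fetch records programmatically from NCBI GenBank using Biopython's Entrez module; the accession numbers and e-mail address here are placeholders, and other repositories could be merged into the test set the same way.

    # Sketch of aggregating public sequence data from NCBI with
    # Biopython's Entrez interface. The accessions and e-mail are
    # placeholders; other sources could be merged the same way.
    from Bio import Entrez

    Entrez.email = "you@example.com"   # NCBI requires a contact address

    def fetch_fasta(accessions):
        """Download the given GenBank records in FASTA format."""
        handle = Entrez.efetch(
            db="nucleotide", id=",".join(accessions),
            rettype="fasta", retmode="text",
        )
        records = handle.read()
        handle.close()
        return records

    # Placeholder accession numbers, for illustration only.
    print(fetch_fasta(["NM_000546", "NM_007294"]))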
- Bibliography:
[1] Caporaso, J., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335–336.
[2] Zhang, K., Damevski, K., Venkatachalapathy, V., & Parker, S. (2004). SCIRun2: A CCA framework for high performance computing. Proceedings of the International Workshop on High-Level Parallel Programming Models and Supportive Environments, 72–79.
[3] Sanbonmatsu, K., & Tung, C. (2007). High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology, 157(3), 470–480.
[4] Okimoto, N., Futatsugi, N., Fuji, H., Suenaga, A., Morimoto, G., Yanai, R., ... Case, D. (2009). High-performance drug discovery: Computational screening by combining docking and molecular dynamics simulations. PLoS Computational Biology, e1000528.
[5] Pillardy, J. Computational Biology Applications Suite for High Performance Computing (BioHPC). Cornell University.