Saturday, March 21, 2015

I'm Back! And Supercomputing Applications

Hello All,

I've been in DC for a lot of the past few weeks (I'm actually there right now!). I've been thinking about some of the potential applications for my supercomputing project, which matters for a later stage of the project: testing my supercomputer's performance on analyzing medical/biological data. Now, "biological data" is a very broad description, so I have to decide what exactly my supercomputer will be used for so I can test it on the appropriate data set.

For the past few years I've been doing a lot of work on computational drug discovery, specifically "teaching" the computer through machine learning to identify drug leads. I've been able to reach 91% accuracy with some of the algorithms I've built, based on only four features. That is with the algorithm trained on approximately 2000 data points. With the millions and millions of data points that I could get from integrating various publicly available databases, this accuracy could go up quite a bit.

Another application where big data analytics is important is personalized medicine. In personalized medicine, you would ideally have the computer go through a person's complete genome and look for the regions that contain mutations. The computer would then have to learn which medicines work best for a person with a certain combination of mutations. A database with this information is currently not available, but research is progressing rapidly in this area, and our supercomputer should be ready to deal with such a personalized medicine database when it arises.

Our supercomputer should be able to perform more than simple queries on this big data set. True big-data analytics involve finding patterns in the data, even when we have not trained the computer on what specifically to look for. This is called unsupervised learning.

An example of SUPERVISED learning would be showing a computer a set of drugs that are labeled as active or inactive and allowing the computer to learn the characteristics of active drugs. The computer can then predict, given the characteristics of a drug it has never seen before, whether that drug will be active. In UNSUPERVISED learning, we would simply give the computer a lot of drugs and their characteristics, without any labels, and the computer would look for patterns. It might cluster the drugs by the characteristics we have given it, and the active drugs might end up in one cluster and the inactive drugs in another. I know for sure that we would want to test how much time it takes for our supercomputer to cluster the biological data we give it (whether for drug discovery or another application), and test the accuracy of the clustering.
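
To make the clustering test a bit more concrete, here is a minimal sketch in Python of what "time the clustering and check its accuracy" could look like. Everything in it is an assumption for illustration: the data is synthetic (two made-up numeric features per "drug"), the clustering method is plain k-means, and the accuracy measure is cluster purity. None of this is the actual data set or algorithm for the project, and the function names are hypothetical.

```python
import random
import time
from collections import Counter


def make_synthetic_drugs(n_per_class, seed=0):
    """Generate FAKE data: two features per 'drug'; label 1 = active, 0 = inactive."""
    rng = random.Random(seed)
    points, labels = [], []
    # Assumed cluster centers, chosen only so the two groups are well separated.
    for label, center in ((1, (0.0, 0.0)), (0, (5.0, 5.0))):
        for _ in range(n_per_class):
            points.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            labels.append(label)
    return points, labels


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: the labels are never used here (unsupervised)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def purity(assignments, labels):
    """Fraction of points that carry the majority label of their cluster."""
    correct = 0
    for c in set(assignments):
        cluster_labels = [l for a, l in zip(assignments, labels) if a == c]
        correct += Counter(cluster_labels).most_common(1)[0][1]
    return correct / len(labels)


points, labels = make_synthetic_drugs(200)

start = time.perf_counter()
assignments = kmeans(points, k=2)
elapsed = time.perf_counter() - start

# The known labels are used only afterwards, to score the clustering.
score = purity(assignments, labels)
print(f"clustered {len(points)} points in {elapsed:.4f} s, purity = {score:.3f}")
```

The same two measurements (wall-clock time and purity against known labels) could be repeated at larger data sizes to see how the machine scales, though a real benchmark would use real data and most likely a parallel clustering library rather than this single-threaded sketch.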

One important principle in design is to always keep the audience in mind. I'm designing this supercomputer, so I need to keep in mind what it will be used for in order to adjust my design accordingly (in terms of both software compatibility and the hardware used). So that's what I've been working on so far.

Will report more soon!
Anvita