Hello All,
I've been in DC for a lot of the past few weeks (I'm actually there right now!). I've been thinking about some of the potential applications for my supercomputing project, which matters for a later stage of the project-- testing my supercomputer's performance on analyzing medical/biological data. Now, "biological data" is a very...broad description, so I have to decide what exactly my supercomputer will be used for so I can test it on the appropriate data set.
For the past few years I've been doing a lot of work on computational drug discovery-- specifically, "teaching" the computer through machine learning to identify drug leads. I've been able to reach 91% accuracy with some of the algorithms I've built, based on only four features, when the algorithm is trained on approximately 2000 data points. With the millions and millions of data points I could get from integrating various publicly available databases, this accuracy could actually go up quite a bit. Another application where big data analytics is important is personalized medicine. In personalized medicine, you would ideally have the computer go through a person's complete genome and look for regions where mutations occur. The computer would then have to learn which medicines work best for a person with a certain combination of mutations. A database with this information is not currently available, but research in this area is progressing rapidly, and our supercomputer should be ready to deal with such a personalized medicine database when it arises.
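For anyone curious what "teaching" the computer looks like in practice, here's a rough sketch (not my actual pipeline) of training a classifier on four features and checking its accuracy with scikit-learn. The feature values and activity labels are random placeholders, so the printed accuracy here is meaningless; the real numbers come from curated drug data.

```python
# Rough sketch of a drug-lead classifier -- NOT my actual pipeline.
# Assumes scikit-learn; the features and labels below are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# ~2000 hypothetical compounds, each described by four features.
X = rng.normal(size=(2000, 4))
y = rng.integers(0, 2, size=2000)   # 1 = active lead, 0 = inactive

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
print(f"mean accuracy: {scores.mean():.2f}")  # ~0.5 on random data; real data is what matters
```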
Our supercomputer should be able to perform more than simple queries on this big data set. True big-data analytics involve finding patterns in the data, even when we have not trained the computer on what specifically to look for. This is called unsupervised learning.
An example of SUPERVISED learning would be showing a computer a set of drugs that are active (and the computer knows they are active) and allowing the computer to learn what the characteristics are of active drugs. The computer can then predict, given the characteristics of a drug it has never seen before, whether that drug will be active. In UNSUPERVISED learning, we would simply give the computer a lot of drugs and their characteristics, and the computer would look for patterns. It might cluster the drugs by the characteristics we have given it, and the active drugs might end up in one cluster and the inactive drugs might end up in another cluster. I know for sure that we would want to test how much time it takes for our supercomputer to cluster the biological data we give it (whether for drug discovery or another application), and test the accuracy of the clustering.
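To make that benchmark idea concrete, here's a small sketch of what the timing and accuracy test could look like, assuming scikit-learn. K-means and the synthetic two-group data are just stand-ins for whichever clustering method and biological data set we end up using.

```python
# Sketch of the clustering benchmark: time the clustering step, then score
# how well the clusters line up with known activity labels.
# Assumes scikit-learn; k-means and the synthetic data are stand-ins.
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic "drug" data: two underlying groups (think active vs. inactive).
X, true_labels = make_blobs(n_samples=2000, n_features=4, centers=2, random_state=0)

start = time.perf_counter()
predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
elapsed = time.perf_counter() - start

print(f"clustering time: {elapsed:.3f} s")
print(f"agreement with known labels (ARI): {adjusted_rand_score(true_labels, predicted):.2f}")
```

On the real system the interesting part is how that timing scales as the data set grows from thousands to millions of points.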
One important principle in design is to always keep the audience in mind. I'm designing this supercomputer, so I need to keep in mind what it will be used for, in order to modify my design accordingly (in terms of both software compatibility and hardware used). So that's what I've been working on so far.
Will report more soon!
Anvita
Of course 91% isn't enough for Anvita Gupta. You're doing amazing things and I'm excited to see the end product soon!
Your explanation of the different types of learning was really helpful. I didn't know there was a distinction between the two. Would it be possible for your supercomputer to take what it learns from clustering drugs to another type of data set? Or is it sort of stuck with one kind of data set?
It can take new drugs and cluster them as well, if that's what you mean! Clustering is a pretty simple process...just looking for similarities among all the data points you give the computer, based on some algorithm. Most clustering algorithms will give the probability that a data point belongs in each cluster.
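If it helps, here's a tiny sketch of what I mean by per-cluster probabilities, using a Gaussian mixture model in scikit-learn as the example clustering method (the data is just a stand-in):

```python
# Tiny sketch of "probability of belonging to each cluster" with a
# Gaussian mixture model. Assumes scikit-learn; the data is a stand-in.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 4)),    # one group of "drugs"
               rng.normal(5, 1, size=(50, 4))])   # another group

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:3])   # per-cluster membership probabilities
print(np.round(probs, 2))          # each row sums to 1 across the clusters
```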
What kinds of features did you include in your drug discovery algorithm?
You noted that your algorithm analyzes drugs based on four features; how did you arrive at that number? Is it possible to consider more in the future, or is there some mechanical limitation?
Hi Ryan,
It's definitely possible to include more. I just happen to use four 3D features that describe the data reasonably well (accuracy greater than 0.9 is considered "excellent"). The rule of thumb is that you should not have more features than 0.1 times the number of data points you have. So if I had 700 data points, seventy features max. Obviously I have plenty of room with only four features, so I'm looking for additional ones to improve the performance.
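Just to spell out the arithmetic on that rule of thumb (this is only the back-of-the-envelope check, not any library function):

```python
# Features-to-data-points rule of thumb: max features ~ 0.1 x data points.
def max_features(n_data_points, ratio=0.1):
    return int(ratio * n_data_points)

print(max_features(700))    # 70
print(max_features(2000))   # 200 -- lots of headroom above the four I use now
```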