Hello All,
Excited to report that my search for a case has come to an end-- at least for the present! We have decided to use...a letter tray from IKEA to house the HPCC! The letter tray actually has the perfect dimensions to hold four microATX boards, and one large space at the bottom to hold the power supply and two hard drives. There are some other issues to consider, though, before building.
In supercomputing, the architecture of the system has to be tailored to the tasks you want to accomplish. For example, if you want to run a task with many small pieces that can be completed independently, it's fine to have one master node and many different slave nodes that don't have to be connected to each other. A "master node" is the main computer that processes your input (the task you have asked the computer to complete), divides the task into smaller ones, delegates each smaller task to one of the "slave nodes", and then combines all the output from the slave nodes and presents it neatly to you, the user.
If a task can be divided cleanly into subtasks that don't rely on each other, that task is known as "massively parallel." For example, if you want to add 5 to 10,000 data points and you have 5 computers, then you can give each computer 2,000 data points and tell it to add five to each of them. The computers don't rely on each other's output and can work in parallel; hence, the task is called "massively parallel."
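To make that a little more concrete, here's a rough Python sketch of the same idea running on a single machine; the process pool stands in for the five slave nodes (on a real cluster, each chunk would be shipped to a separate computer over the network):

# A rough sketch of a massively parallel task: adding 5 to 10,000 data points.
# One machine's process pool stands in for five slave nodes.
from multiprocessing import Pool

def add_five_to_chunk(chunk):
    # Each "slave" works only on its own chunk and never sees the others.
    return [x + 5 for x in chunk]

if __name__ == "__main__":
    data = list(range(10_000))               # 10,000 data points
    chunks = [data[i::5] for i in range(5)]  # split among 5 workers (2,000 each)

    with Pool(processes=5) as pool:
        results = pool.map(add_five_to_chunk, chunks)  # the "master" delegates

    # The "master" combines the outputs and presents them to the user.
    combined = [x for chunk in results for x in chunk]
    print(len(combined), combined[:5])

The only "communication" here is the master handing out chunks and collecting results, which is exactly why the slaves don't need to talk to each other.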
Supercomputers love massively parallel tasks. In such tasks, the slave nodes don't have to communicate with each other and don't have to be connected. Each slave is only connected to the master. This is called a "shared-nothing" architecture. Slave nodes in such cases do not have to have hard drives or permanent storage; they just compute and send their output immediately back to the master node. Each slave node can be different, although they usually aren't.
Many supercomputers today use an alternate architecture built around Apache Hadoop instead. Hadoop is basically a framework that you install on top of your cluster. Hadoop doesn't adhere to the strict master/slave setup described above (it's more egalitarian): it still has master nodes, but the other machines are "worker" nodes. The worker nodes are all identical and have their own hard drives, and they are often connected to each other and to the master node by a network switch.
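If you're wondering what running something on Hadoop actually looks like, here's a rough sketch of the classic word-count example written for Hadoop Streaming, which lets you write the map and reduce steps as ordinary Python scripts that read from stdin and write to stdout (these are just for illustration; we haven't run anything on our cluster yet!):

#!/usr/bin/env python3
# mapper.py - Hadoop sends each worker node a slice of the input on stdin;
# the mapper emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - Hadoop sorts the mapper output by key before handing it to
# the reducer, so all the counts for one word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

Hadoop handles the rest: it splits the input across the worker nodes, sorts the mapper output so each word's counts arrive together at a reducer, and writes the final counts back to its distributed filesystem (HDFS).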
It's kind of a small distinction, but an important one. After all, if we want to run Hadoop, we have to make sure that a hard drive will fit alongside each computer in our case. Hadoop is much better than relational databases at handling large amounts of unstructured data. Unstructured data basically has no rules attached to it: each data point can have as many numbers or words associated with it as it wants. Hadoop has been shown to be very effective for biological data, so we are leaning towards it right now (http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528).
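To show what I mean by "no rules attached", here's a toy illustration (the field names are completely made up) of unstructured records side by side with the kind of fixed schema a relational table would insist on:

# Toy illustration (made-up fields): unstructured records can each carry
# different information, while a relational table forces one fixed schema.
unstructured = [
    {"gene": "BRCA1", "organism": "human", "notes": "tumor suppressor"},
    {"gene": "lacZ", "reads": 48213},                            # different fields...
    {"sample_id": 7, "sequence": "ATCGGCTA", "quality": [30, 31, 29, 40]},
]

# A relational table, by contrast, needs every row to fit the same columns:
#   CREATE TABLE genes (gene TEXT, organism TEXT, reads INTEGER);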
I hope that wasn't too much information thrown at you too quickly. If you have any questions, please let me know in the comments and I will answer them as best I can!
Until Next Time,
Anvita
Hi Anvita,
That is a super interesting post. Thanks for the informative explanation of nodes and computing. It went a little over my head but I like reading and learning about it. I'm curious about who you were working with. Is this project of building the computer part of your internship or is it your separate research project? Also, where does the name Hadoop come from?
Mr. Bloom
Hi Mr. Bloom,
I'm working with a local IT company/vocational training institute called Phoenix Computer Academy, and building the computer is part of my internship, actually!
I looked it up and "Hadoop" is actually the name the inventor's toddler son gave to his yellow stuffed elephant. Turns out the name "Google" was also made up by a child!
It's cool that you were able to find a case that worked so well. I'm curious whether you're going to be using regular hard drives or solid-state drives. Are there any limitations on using either for supercomputing?
Anvita, that sounds like a lot of COMPUTATION (wink). But, wow, really, this is a lot of information. I think you explained it very well and I'm better for the knowledge. I think there is really something fantastic about computer naming (e.g. "super computer", "master node", "slave node"). Keep doing what you're doing!
Hahaha! Computer naming is seriously the best...very feudalistic and "evil genius-y". Do you know that an actual job at Microsoft is to be a "code serf"?
Hi Anvita - When the computations/tasks are interdependent and involve partially or entirely serial computations, how much human interaction is needed to optimize the efficiency and how does it affect the number of data turns per task?
- mac