5 Things to Consider When Deploying High Performance Computing (HPC) to the Cloud
By Boyd Wilson, Sara Jeanes, and Amy Cannon
In case you were unable to attend the Internet2 Global Summit, or could not find a seat in our standing-room-only session, this post expands on the discussion from the HPC in the Cloud BoF session. This is the first of two posts on the topic. The second post will cover the cost and benefit of HPC in the cloud.
People & Disciplines
Using HPC effectively takes a combination of researchers and PIs with domain knowledge and technology practitioners who can facilitate their work. HPC, or parallel computation, opens doors, but the majority of researchers can’t do it alone. CI practitioners play a key role in facilitating the introduction and effective use of HPC.
HPC is currently used in a number of disciplines including the Sciences, Engineering, Arts, Humanities, and Social Sciences. Machine Learning will expand this list, and use cases will continue to grow. That growth puts additional pressure on resources: funding, support, and infrastructure are all squeezed by increased use and decreased budgets.
The type of workload will also dictate what resources are required to run the job. Pleasingly Parallel (P2), High Throughput Computing, and Data Intensive Computing/Big Data applications all have distinct advantages when run in the cloud. The cloud also offers pluggable Graphics Processing Units (GPUs), a uniquely cloudy approach to hardware.
Other workload considerations include how lightly or heavily a job uses the Message Passing Interface (MPI). Field-programmable gate arrays (FPGAs), interactive computation (Jupyter), and real-time workloads all present additional challenges that can be addressed in the cloud but may require some additional work.
HPC job routing is an important and often overlooked component in any HPC stack. Adding the cloud to the mix requires an additional layer of routing. First, the job needs to be reviewed and either run on prem or sent to the cloud, depending on factors including time to complete, size, and available cycles. Data for the job then needs to be staged before the job is launched, and results need to be returned when the job completes. Services like CloudyCluster can handle scaling the cluster on the cloud side, and open source projects like CCQHub can manage routing.
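The routing steps described above (review the job, stage data, launch, return results) can be sketched in a few lines of Python. This is only an illustrative sketch, not how CCQHub or CloudyCluster work: the `Job` fields, the thresholds, and the `stage`, `launch`, and `collect` callables are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description used only for this sketch."""
    name: str
    cores: int
    input_gb: float


def route_job(job, free_onprem_cores, est_queue_wait_hours,
              max_wait_hours, max_stage_gb=500.0):
    """Decide where a job should run: 'onprem' or 'cloud'.

    The rules and thresholds are assumptions for illustration:
    - very large inputs stay on prem, since staging them would be slow;
    - jobs that fit in the free local cycles stay on prem;
    - jobs that would not wait too long in the local queue stay on prem;
    - everything else bursts to the cloud.
    """
    if job.input_gb > max_stage_gb:
        return "onprem"
    if job.cores <= free_onprem_cores:
        return "onprem"
    if est_queue_wait_hours <= max_wait_hours:
        return "onprem"
    return "cloud"


def run_routed(job, target, stage, launch, collect):
    """Stage data, launch the job, then return its results.

    stage, launch, and collect are caller-supplied functions that wrap
    whatever scheduler or transfer tool a site actually uses.
    """
    stage(job, target)
    handle = launch(job, target)
    return collect(handle)
```

A site would replace the simple rules in `route_job` with its own policy, for example one that also weighs cloud cost against the estimated queue wait.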
It is one thing to work alone, but adding collaborators to the mix significantly increases the complexity of any implementation. Federated web authentication is an absolute must. Does the application of choice support Shibboleth or OAuth? Do collaborators have a self-service invite mechanism? Can they use existing collaboration platforms like Google Drive and NET+ Box for data sharing?
Cloud native platforms are also significantly changing the way jobs can be executed. Serverless offerings like Google Cloud Functions, Azure Functions, and AWS Lambda are event triggered, meaning code runs only when there is pending input, without anyone needing to manage the underlying cluster. This is great for ad hoc or intermittent requests and, when tied to a data processing pipeline, can completely overturn the traditional idea of job processing.
The explosion of Machine Learning as a Service offerings makes previously nearly impossible tasks trivial. Recently launched offerings include TensorFlow-based services like GCP ML Engine; conversational interfaces, text-to-speech, and image recognition from Amazon (in the form of Lex, Polly, and Rekognition respectively); Azure Cognitive Services; and, perhaps most famously, Jeopardy champion IBM Watson.
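To make the event-triggered model concrete, here is a minimal sketch of a Python function in the style of an AWS Lambda handler that reacts to new objects landing in a storage bucket. It assumes the function is wired to S3 object-created notifications; `process_object` is a hypothetical placeholder for the real work, and nothing here is tied to a particular pipeline.

```python
from urllib.parse import unquote_plus


def process_object(bucket, key):
    """Hypothetical placeholder for the real per-file work.

    In a real pipeline this might download the object, run an
    analysis step, and write results to another bucket.
    """
    return f"processed s3://{bucket}/{key}"


def handler(event, context):
    """Entry point called by the platform once per incoming event.

    The event is assumed to follow the S3 notification layout, where
    each record names the bucket and the URL-encoded object key.
    """
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_object(bucket, key))
    return {"processed": results}
```

No cluster or scheduler appears anywhere in the code: the platform invokes `handler` when input arrives and nothing runs in between.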
Funding & Integration
The NIH Cloud Credits pilot offers up to $6m for current NIH investigators. Credits are received directly from the cloud provider. In the spring of 2017, the NSF Big Data Sciences and Engineering program ($29m and $9m respectively) offered public cloud credits from AWS, Azure, and GCP, given directly to researchers.
Be sure to check out our next blog post, Cloud vs. Datacenter Costs for High Performance Computing (HPC): A Real World Example.