Jing Xie’s Post

🌊Spot Instance Surfer | 🤖GPU Optimizer

These features help HPC users save 50-80% on EC2 compute costs. Here's how they work.

--

Optimize is a feature we developed for stateful jobs running on EC2 On-Demand. It migrates them to run on EC2 Spot instances:

1. This compute optimization feature uses our memory checkpointing technology to ensure the job doesn't lose state. This is key because, for long-running scientific computing jobs, it is hard to justify restarting what could be hours, days, or even weeks of runtime just to save money.

2. During extremely busy periods, EC2 Spot interruption rates can be as high as 50-80%, and responding to each rebalance notification is very disruptive. If you're running on a Spot instance that actually gets reclaimed and our software can't easily find another Spot instance, we perform a stateful migration to an On-Demand instance, and Optimize then looks for a Spot instance 10, 20, or 30 minutes later to move back to.

--

How many times have you run out of memory? Out of memory protection is a feature we developed for HPC jobs whose memory requirements are hard to predict deterministically before the run:

1. Being able to run on smaller EC2 instances and migrate to larger instances only when needed, without losing state, results in material cost savings.

2. Our software monitors resource utilization on all the worker nodes. When we detect memory utilization approaching 100% and/or swap being hit, we perform a full application checkpoint, migrate you to a larger instance, and resume running.

--

Some of the capabilities we developed to optimize cloud costs are also quite useful for on-prem HPC applications. Drop a comment below if you have questions, or shoot me a DM if you want to chat about how to implement memory checkpointing in your HPC architecture.

#AWS #HPC #SuperComputing #Slurm #PEARC24
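To make the out of memory protection flow concrete, here is a minimal Python sketch of the decision step only: given a memory sample from a worker node, decide whether to checkpoint and move up one instance size. It is an illustration, not the actual product logic. The 95% threshold, the `MemorySample` type, the function names, and the use of the r5 family as the size ladder are all assumptions made for the example; the post only says "approaching 100% and/or swap is hit".

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative size ladder (instance type, memory in GiB), smallest first.
# Assumption: the job stays within one memory-optimized family.
INSTANCE_LADDER = [
    ("r5.large", 16),
    ("r5.xlarge", 32),
    ("r5.2xlarge", 64),
    ("r5.4xlarge", 128),
]


@dataclass
class MemorySample:
    """One hypothetical reading from a worker node."""
    used_fraction: float   # memory in use, 0.0 to 1.0
    swap_used_bytes: int   # any non-zero value means swap has been hit


def needs_migration(sample: MemorySample, threshold: float = 0.95) -> bool:
    """True when memory is near full or the node has started swapping.

    The 0.95 default is an assumed stand-in for "approaching 100%".
    """
    return sample.used_fraction >= threshold or sample.swap_used_bytes > 0


def next_larger(current: str) -> Optional[str]:
    """Return the next instance type up the ladder, or None at the top."""
    names = [name for name, _ in INSTANCE_LADDER]
    index = names.index(current)
    return names[index + 1] if index + 1 < len(names) else None


def plan(current: str, sample: MemorySample, threshold: float = 0.95) -> List[str]:
    """Return the ordered steps to take, or an empty list if nothing to do."""
    if not needs_migration(sample, threshold):
        return []
    target = next_larger(current)
    if target is None:
        return []  # already on the largest size in this sketch
    return ["checkpoint", f"migrate:{target}", "resume"]


if __name__ == "__main__":
    print(plan("r5.large", MemorySample(used_fraction=0.97, swap_used_bytes=0)))
```

The sketch moves up one size at a time; a real system could instead pick a target from the observed growth rate. Starting small and stepping up only when a sample crosses the threshold is what produces the savings described in point 1.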

Jing Xie

7mo

This is an example where the job starts on Spot. There was then a Spot preemption storm, and we used memory checkpointing to move the customer's job to On-Demand (see the 2nd EC2 instance stats). We then used Optimize to go back to Spot instances, and SpotSurfer to survive over 20 Spot interruptions by continually finding other Spot instances to migrate to. OOM protection also bumped the job to a larger-memory instance, saving the customer time and completing the job at a fraction of the On-Demand cost.
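The sequence in this example (Spot, then On-Demand during the storm, then back to Spot) can be sketched as a small placement policy. This is a hedged illustration in Python, not the real Optimize or SpotSurfer implementation: the `Placement` enum and both function names are hypothetical, and the only detail taken from the post is the idea of looking for Spot capacity again 10, 20, and 30 minutes after falling back.

```python
from enum import Enum


class Placement(Enum):
    SPOT = "spot"
    ON_DEMAND = "on-demand"


# Retry points in minutes after a fallback, as mentioned in the post.
RETRY_MINUTES = (10, 20, 30)


def on_interruption(spot_available: bool) -> Placement:
    """Pick where to migrate the checkpointed job when Spot is reclaimed.

    Prefer another Spot instance; fall back to On-Demand if none is found.
    """
    return Placement.SPOT if spot_available else Placement.ON_DEMAND


def optimize(placement: Placement, minutes_on_demand: int,
             spot_available: bool) -> Placement:
    """Move an On-Demand job back to Spot at a retry point if capacity exists.

    Jobs already on Spot, or checked outside a retry point, stay where they are.
    """
    at_retry_point = minutes_on_demand in RETRY_MINUTES
    if placement is Placement.ON_DEMAND and at_retry_point and spot_available:
        return Placement.SPOT
    return placement


if __name__ == "__main__":
    # Preemption storm: no Spot capacity, so the job lands on On-Demand.
    placement = on_interruption(spot_available=False)
    # Ten minutes later capacity is still tight; twenty minutes later it frees up.
    placement = optimize(placement, minutes_on_demand=10, spot_available=False)
    placement = optimize(placement, minutes_on_demand=20, spot_available=True)
    print(placement.value)
```

In both transitions the job state is assumed to travel with the checkpoint, which is why the policy can switch placement without restarting the run.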

Divya Atre

Building brand & demand through content marketing, social media marketing and campaigns

7mo

This is a great feature for HPC users to save on EC2 compute costs. The memory checkpointing technology is key for long-running jobs and the out of memory protection feature is helpful for hard-to-predict memory requirements. It's also interesting to know that these capabilities can be useful for on-prem HPC applications. Thanks for sharing!
