I have rarely used cloud-based HPC clusters for scientific simulations because they are just too expensive. I always have access to clusters at different high-performance computing centers. But those clusters have one serious problem: too many people are waiting in the queues for resources. It’s not uncommon for me to wait a day or even several days for my jobs to start running after I submit them. For runs that take several days, waiting a day may be fine. But sometimes I just need a cluster for something like a 10-minute interactive session to debug internode parallel issues, or for a quick benchmark. It makes no sense at all to wait a whole day just for a 10-minute job.
So I decided to build my own personal HPC cluster with a cloud service provider. It sounds good. I can do pay-as-you-go. And because I will only use this cloud cluster for small jobs, it’s not going to be expensive.
I used Microsoft Azure at some point in the past. But after comparing prices and convenience, I decided to go with Google Cloud Platform.
Everything went fine during the playing-around stage. I was able to build a small cluster with 10 compute nodes running on very cheap instances, the Slurm scheduler, and an NFS server. A Lustre cluster is too complicated, so I passed. Google Cloud Platform doesn’t seem to have InfiniBand. But that’s OK for a personal cluster.
However, terrible things happened when I moved on to more of a production stage. When I tried to replace those very cheap compute nodes with high-performance instances (i.e., C2-series instances) and GPUs, I started to regret using Google Cloud Platform.
First, C2-series instances cannot have GPUs. This is ridiculous. Yes, I know machine learning applications usually don’t need powerful CPUs. BUT I’M NOT RUNNING MACHINE LEARNING APPLICATIONS!!! Many traditional scientific simulation programs exploit all available heterogeneous hardware. My applications need both powerful CPUs and GPUs. It is stupid that GPUs cannot go with powerful instances. It seems Google Cloud Platform only cares about machine learning applications. All their webpages and all their promotions talk about machine learning. They don’t care about other applications that need more powerful machines.
Second, they put ridiculous quotas on how many CPUs and GPUs you can request! The default quota for C2 CPUs is 8 CPU cores, and the quota for V100 GPUs is 1 GPU. What kind of HPC cluster only has 8 CPU cores and 1 GPU? OK, so I can request a quota increase. That’s fine. I submitted a request to increase the C2 CPU quota from 8 to 256 and the V100 GPU quota from 1 to 16. I thought I could at least have a cluster with 8 nodes, each with 32 C2 cores and 2 V100 GPUs. But no. THEY REJECTED MY REQUEST IMMEDIATELY!!! I just don’t understand. 256 CPU cores and 16 GPUs are not crazy numbers. And my cluster would not be running 24/7. It would only be up when I need it. And the rejection came right after I clicked the request submission button. That means no real human even reviewed my small request.
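For the record, the numbers in my request come from simple multiplication. Here is a minimal sketch of that arithmetic in Python. The `required_quota` helper and the dictionary keys are names I made up for illustration; they are not part of any Google Cloud tool or API. The default quotas are the ones I saw on my own account.

```python
# Hypothetical helper for illustration only; not a Google Cloud API.
def required_quota(nodes, cores_per_node, gpus_per_node):
    """Return the total CPU cores and GPUs a cluster of this shape needs."""
    return {
        "c2_cpus": nodes * cores_per_node,
        "v100_gpus": nodes * gpus_per_node,
    }

# Cluster shape I had in mind: 8 nodes, each with 32 C2 cores and 2 V100 GPUs.
requested = required_quota(nodes=8, cores_per_node=32, gpus_per_node=2)

# Default quotas as I saw them on my account.
default = {"c2_cpus": 8, "v100_gpus": 1}

# How far the defaults fall short of the cluster I wanted.
shortfall = {name: requested[name] - default[name] for name in requested}

print(requested)  # {'c2_cpus': 256, 'v100_gpus': 16}
print(shortfall)  # {'c2_cpus': 248, 'v100_gpus': 15}
```

With the default quotas, I can't even build one of those eight nodes.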
This is stupid. I’m willing to pay out of my pocket, but they don’t even bother to review my request.