
[Feature][Docs][Discussion] Provide consistent guidance on resource requests and limits #744

Open
DmitriGekhtman opened this issue Nov 19, 2022 · 7 comments
Labels
discussion (Need community members' input) · docs (Improvements or additions to documentation) · enhancement (New feature or request) · P1 (Issue that should be fixed within a few weeks)

Comments

DmitriGekhtman (Collaborator) commented Nov 19, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Documentation on resource requests and limits should be made clearer and more prominent, perhaps repeated in several places in the docs.

Best practices are laid out here:
https://discuss.ray.io/t/questions-for-configurations-using-helm-chart/7762?u=dmitri
https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources

Subtleties:

We currently make the following recommendations:

  1. KubeRay takes CPU/memory resource info from limits and advertises that info to Ray. For that reason, we recommend setting requests=limits in KubeRay configs. (Otherwise, the Ray pod may be scheduled with less resource availability than is advertised to Ray.)
  2. Because a Ray cluster does not benefit from running multiple pods per node, we recommend sizing pods to be as large as possible. When possible within a user's constraints, it is best to run one Ray pod per Kubernetes node.
  3. Because Ray rounds up the CPU limit to the nearest whole CPU, we recommend setting an integer CPU limit (see the sketch after this list).
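
For concreteness, here is a minimal sketch of what points 1 and 3 look like on a Ray container; the CPU and memory values are placeholders, not a sizing recommendation:

```yaml
# Sketch only: requests equal limits, and the CPU value is an integer so that
# the capacity Ray advertises matches what Kubernetes actually schedules.
resources:
  requests:
    cpu: "4"          # integer CPU, equal to the limit
    memory: 8Gi
  limits:
    cpu: "4"          # KubeRay advertises the limit to Ray
    memory: 8Gi
```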

There are some issues with these recommendations:

  • The documentation is not fully consistent with these recommendations.
  • If you take points 1 and 2 literally, you must determine the exact allocatable resources of a node and then subtract some safety margin to make sure your Ray pods can be scheduled. Another valid approach is to set limits equal to the node's total capacity and requests to something just under the allocatable capacity (see the sketch after this list).
  • Taking points 1 and 3 literally requires setting CPU requests and limits both to at least 1 in all configs. This is difficult for example configs and CI test configs meant to run in CPU-constrained Minikube/Kind environments.
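
As a sketch of the alternative mentioned in the second bullet, assume a hypothetical node with 16 CPU / 64Gi of capacity and roughly 15.5 CPU / 56Gi allocatable after system daemons; the exact numbers depend on the node type and are purely illustrative:

```yaml
# Sketch only: limits match the node's total capacity (what Ray sees),
# while requests sit just under the node's allocatable resources so the
# pod can still be scheduled.
resources:
  requests:
    cpu: "15"
    memory: 52Gi
  limits:
    cpu: "16"
    memory: 64Gi
```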

Use case

Less user confusion when figuring out resources for Ray on K8s.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
DmitriGekhtman added the enhancement, P1, docs, and discussion labels on Nov 19, 2022
Jeffwan (Member) commented Nov 22, 2022

Because Ray does not benefit from running multiple pods per node, we recommend sizing pods to take up entire nodes.

I don't think we can assume only one Ray cluster runs in a Kubernetes cluster, so Ray clusters of different sizes will fill the Kubernetes cluster.

DmitriGekhtman (Collaborator, Author) commented

I've amended that bullet slightly. It's of course subject to the constraints of the user's Kubernetes environment, but generally speaking, it does not make sense to pack many tiny Ray pods into a single K8s node.

DmitriGekhtman (Collaborator, Author) commented

A common setup (and one we've used at Anyscale in the past) is to set up cluster autoscaling such that you get a new K8s node of the appropriate type each time you request a Ray pod. The Ray pod and supporting machinery then fill up the entire K8s node.
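
A minimal sketch of that pattern, assuming a dedicated node pool selected by a hypothetical label and a worker container sized so that only one Ray pod fits per node (all names and values are illustrative):

```yaml
workerGroupSpecs:
  - groupName: large-workers
    replicas: 1
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          example.com/node-pool: ray-workers   # hypothetical node-pool label
        containers:
          - name: ray-worker
            image: rayproject/ray:2.0.0        # any Ray image
            resources:
              requests:
                cpu: "14"                      # large enough that one pod fills a node
                memory: 52Gi
              limits:
                cpu: "14"
                memory: 52Gi
```

With the cluster autoscaler enabled, each additional worker replica stays pending until a fresh node from that pool is provisioned, and then occupies most of it.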

peterghaddad commented

Hey @DmitriGekhtman, for number 2:

Because a Ray cluster does not benefit from running multiple pods per node, we recommend sizing pods to be as large as possible. When possible within a user's constraints it is best to run one Ray pod per Kubernetes node

Is there a way to configure the autoscaler to use STRICT_SPREAD? https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html?highlight=placement

It seems that it is only possible from the SDK and not when configuring the cluster? https://github.com/ray-project/kuberay/blob/3aebd8c9f5ae5d9d9d12489d5636d3cf1b97548e/ray-operator/config/samples/ray-cluster.autoscaler.large.yaml

DmitriGekhtman (Collaborator, Author) commented

STRICT_SPREAD helps to spread Ray tasks across different Ray pods.

To spread Ray pods across different Kubernetes nodes, I think the thing to look into would be pod anti-affinity.
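
For example, something along these lines on the Ray worker pod template; the pod label here is hypothetical, and the selector should match whatever labels are actually on your worker pods:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.example.com/role: ray-worker   # hypothetical label on worker pods
        topologyKey: kubernetes.io/hostname    # spread across distinct nodes
```

preferredDuringSchedulingIgnoredDuringExecution is the softer variant if strict spreading would leave pods unschedulable.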

davidxia (Contributor) commented Feb 26, 2025

I noticed that @ray.remote(num_cpus=0.5) can be fractional. Why doesn't ray start --num-cpus allow fractional values too? It currently requires an integer value.

related docs

DmitriGekhtman (Collaborator, Author) commented

Mostly because the Ray core APIs originally targeted running Ray nodes directly on VMs and bare metal, rather than running Ray nodes as Kubernetes pods.

It would require substantive changes to allow Ray node CPU capacity to be non-integer.

On the other hand, with Ray custom resources, you can express whatever resource accounting semantics you want.
So as a workaround, you can define a custom resource TENTH_OF_A_CPU and annotate a node as having 5 of those.
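
In a KubeRay config this could look roughly like the following under rayStartParams; the resource name is just an example, and the nested quoting reflects that the value is forwarded to ray start --resources=...:

```yaml
rayStartParams:
  resources: '"{\"TENTH_OF_A_CPU\": 5}"'   # hypothetical custom resource: 5 units = 0.5 CPU
```

Tasks and actors can then request it with, e.g., @ray.remote(resources={"TENTH_OF_A_CPU": 1}).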
