
[Feature] Automatically reload pods when/if the image changes in RayCluster CR #234

Closed
jeffreyftang opened this issue Apr 20, 2022 · 3 comments
Labels
enhancement New feature or request operator

@jeffreyftang

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Currently, changes such as the number of worker replicas in the RayCluster CR are detected and acted on by the operator. However, a change to the image is not.

My understanding is that this behavior is implemented in:

func (r *RayClusterReconciler) reconcilePods(instance *rayiov1alpha1.RayCluster) error

It would be nice if this logic could also detect when the pods are running an outdated image by comparing with the CR, and automatically reload the pods.
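A minimal sketch of what such a check might look like; the outdatedPods helper and the hard-coded container names are assumptions for illustration (the names match the sample manifests), not the actual KubeRay implementation:

package controllers

import (
	corev1 "k8s.io/api/core/v1"
)

// outdatedPods is a hypothetical helper: it returns the pods whose ray
// container image no longer matches the image declared in the RayCluster
// CR spec. The reconciler could then delete these pods so they are
// recreated from the current pod template.
func outdatedPods(pods []corev1.Pod, wantImage string) []corev1.Pod {
	var stale []corev1.Pod
	for _, pod := range pods {
		for _, c := range pod.Spec.Containers {
			// "ray-head"/"ray-worker" are the container names used in the
			// sample manifests; real code would resolve the ray container
			// from the group spec instead of hard-coding names.
			if (c.Name == "ray-head" || c.Name == "ray-worker") && c.Image != wantImage {
				stale = append(stale, pod)
				break
			}
		}
	}
	return stale
}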

Use case

Essentially, after upgrading or modifying a ray cluster CR with a new image, I'd like that image to be automatically picked up by any existing pods.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jeffreyftang added the enhancement label Apr 20, 2022
@akanso (Collaborator) commented Apr 20, 2022

That's a great point.

Normally a couple of questions are raised with an image upgrade:

1- is the assumption that the new image is compatible with the old image?

2- should all pods be updated in a single step, or do we perform a rolling upgrade? (sketched below)
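A rough sketch of the rolling variant, reusing the hypothetical outdatedPods helper above: delete at most one stale pod per reconcile pass and let the operator recreate it from the current template, so the cluster loses capacity one pod at a time. The single-step variant would simply delete every stale pod at once. Again an illustrative assumption, not KubeRay code:

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rollOnePod deletes a single outdated pod per reconcile pass. The
// reconciler recreates it from the current template on the next pass,
// picking up the new image, so pods upgrade one at a time.
func rollOnePod(ctx context.Context, c client.Client, stale []corev1.Pod) error {
	if len(stale) == 0 {
		return nil
	}
	return c.Delete(ctx, &stale[0])
}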

@Jeffwan (Collaborator) commented Apr 28, 2022

@jeffreyftang @akanso PR #231 addresses this issue. If you have time, please take another look.

This was to address #155.

@tekumara (Contributor) commented Jul 23, 2022

I tried changing spec.containers[*].image in the RayCluster object from rayproject/ray:1.12.1 to rayproject/ray:1.12.1-py39-cpu, but the Ray head pod is still using rayproject/ray:1.12.1.

$ kubectl get rayclusters.ray.io -o yaml
apiVersion: v1
items:
- apiVersion: ray.io/v1alpha1
  kind: RayCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ray.io/v1alpha1","kind":"RayCluster","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"raycluster-mini","namespace":"default"},"spec":{"headGroupSpec":{"rayStartParams":{"block":"true","dashboard-host":"0.0.0.0","node-ip-address":"$MY_POD_IP","num-cpus":"1","object-store-memory":"100000000","port":"6379"},"replicas":1,"serviceType":"ClusterIP","template":{"metadata":{"annotations":{"key":"value"},"labels":{"groupName":"headgroup","rayCluster":"raycluster-sample","rayNodeType":"head"}},"spec":{"containers":[{"env":[{"name":"MY_POD_IP","valueFrom":{"fieldRef":{"fieldPath":"status.podIP"}}}],"image":"rayproject/ray:1.12.1-py39-cpu","name":"ray-head","ports":[{"containerPort":6379,"name":"gcs-server"},{"containerPort":8265,"name":"dashboard"},{"containerPort":10001,"name":"client"}],"resources":{"limits":{"cpu":1,"memory":"2Gi"},"requests":{"cpu":1,"memory":"2Gi"}}}]}}},"rayVersion":"1.12.1"}}
    creationTimestamp: "2022-07-23T06:00:57Z"
    generation: 2
    labels:
      controller-tools.k8s.io: "1.0"
    name: raycluster-mini
    namespace: default
    resourceVersion: "6445"
    uid: a3a3822d-9514-4ac7-856d-78b0caa75e04
  spec:
    headGroupSpec:
      rayStartParams:
        block: "true"
        dashboard-host: 0.0.0.0
        node-ip-address: $MY_POD_IP
        num-cpus: "1"
        object-store-memory: "100000000"
        port: "6379"
      replicas: 1
      serviceType: ClusterIP
      template:
        metadata:
          annotations:
            key: value
          labels:
            groupName: headgroup
            rayCluster: raycluster-sample
            rayNodeType: head
        spec:
          containers:
          - env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: rayproject/ray:1.12.1-py39-cpu
            name: ray-head
            ports:
            - containerPort: 6379
              name: gcs-server
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            resources:
              limits:
                cpu: 1
                memory: 2Gi
              requests:
                cpu: 1
                memory: 2Gi
    rayVersion: 1.12.1
  status:
    availableWorkerReplicas: 1
    endpoints:
      client: "0"
      dashboard: "0"
      gcs-server: "0"
    lastUpdateTime: "2022-07-23T06:49:11Z"
    state: ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
  
$ kubectl get pods -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      key: value
      ray.io/ha-enabled: "false"
      ray.io/health-state: ""
    creationTimestamp: "2022-07-23T06:00:57Z"
    generateName: raycluster-mini-head-
    labels:
      app.kubernetes.io/created-by: kuberay-operator
      app.kubernetes.io/name: kuberay
      groupName: headgroup
      ray.io/cluster: raycluster-mini
      ray.io/cluster-dashboard: raycluster-mini-dashboard
      ray.io/group: headgroup
      ray.io/identifier: raycluster-mini-head
      ray.io/is-ray-node: "yes"
      ray.io/node-type: head
      rayCluster: raycluster-sample
      rayNodeType: head
    name: raycluster-mini-head-vzpbh
    namespace: default
    ownerReferences:
    - apiVersion: ray.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: RayCluster
      name: raycluster-mini
      uid: a3a3822d-9514-4ac7-856d-78b0caa75e04
    resourceVersion: "2755"
    uid: 12aa224c-118e-4015-ac7e-0f6b2283dfd4
  spec:
    containers:
    - args:
      - 'ulimit -n 65536; ray start --head  --port=6379  --block  --dashboard-host=0.0.0.0  --node-ip-address=$MY_POD_IP  --num-cpus=1  --object-store-memory=100000000  --metrics-export-port=8080  --memory=2147483648 '
      command:
      - /bin/bash
      - -c
      - --
      env:
      - name: MY_POD_IP
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: status.podIP
      - name: RAY_IP
        value: 127.0.0.1
      - name: RAY_PORT
        value: "6379"
      - name: RAY_ADDRESS
        value: 127.0.0.1:6379
      - name: REDIS_PASSWORD
      image: rayproject/ray:1.12.1
      imagePullPolicy: IfNotPresent
      name: ray-head
      ports:
      - containerPort: 6379
        name: gcs-server
        protocol: TCP
      - containerPort: 8265
        name: dashboard
        protocol: TCP
      - containerPort: 10001
        name: client
        protocol: TCP
      - containerPort: 8080
        name: metrics
        protocol: TCP
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: "1"
          memory: 2Gi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /dev/shm
        name: shared-mem
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-79p58
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: k3d-ray-agent-1
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 2Gi
      name: shared-mem
    - name: kube-api-access-79p58
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-07-23T06:00:57Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-07-23T06:03:54Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-07-23T06:03:54Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-07-23T06:00:57Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://d3e579566d58f3c8368fed996e4b8f5d43588c927b057c957211aded47b10e4b
      image: docker.io/rayproject/ray:1.12.1
      imageID: docker.io/rayproject/ray@sha256:fef963a72cd0faf49b8f3bda96dd106f8ab77872b665d2613836abb7239f20ce
      lastState: {}
      name: ray-head
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2022-07-23T06:03:53Z"
    hostIP: 172.22.0.4
    phase: Running
    podIP: 10.42.0.8
    podIPs:
    - ip: 10.42.0.8
    qosClass: Guaranteed
    startTime: "2022-07-23T06:00:57Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""  
