Node is not properly drained #74

Closed
Halama opened this issue Sep 18, 2021 · 3 comments

Halama commented Sep 18, 2021

Image I'm using:
328549459982.dkr.ecr.eu-west-1.amazonaws.com/bottlerocket-update-operator:v0.1.4

Deployment manifest:

apiVersion: v1
kind: Namespace
metadata:
  name: bottlerocket
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: bottlerocket-update-operator-controller
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # Allow the controller to remove Pods running on the Nodes that are updating.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bottlerocket-update-operator-controller
subjects:
  - kind: ServiceAccount
    name: update-operator-controller
    namespace: bottlerocket
roleRef:
  kind: ClusterRole
  name: bottlerocket-update-operator-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: bottlerocket-update-operator-agent
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bottlerocket-update-operator-agent
subjects:
  - kind: ServiceAccount
    name: update-operator-agent
    namespace: bottlerocket
roleRef:
  kind: ClusterRole
  name: bottlerocket-update-operator-agent
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: update-operator-controller
  namespace: bottlerocket
  annotations:
    kubernetes.io/service-account.name: update-operator-controller
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: update-operator-agent
  namespace: bottlerocket
  annotations:
    kubernetes.io/service-account.name: update-operator-agent
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: update-operator-controller
  namespace: bottlerocket
  labels:
    update-operator: controller
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxUnavailable: 100%
  selector:
    matchLabels:
      update-operator: controller
  template:
    metadata:
      namespace: bottlerocket
      labels:
        update-operator: controller
        app: bottlerocket-update-operator
      annotations:
        log: "true"
    spec:
      serviceAccountName: update-operator-controller
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: bottlerocket.aws/updater-interface-version
                    operator: Exists
                  - key: "kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                      - arm64
        # Avoid update-operator's Agent Pods if possible.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 10
              podAffinityTerm:
                topologyKey: bottlerocket.aws/updater-interface-version
                labelSelector:
                  matchExpressions:
                    - key: update-operator
                      operator: In
                      values: ["agent"]
      containers:
        - name: controller
          image: "328549459982.dkr.ecr.${AWS_REGION}.amazonaws.com/bottlerocket-update-operator:v0.1.4"
          imagePullPolicy: Always
          args:
            - -controller
            - -debug
            - -nodeName
            - $(NODE_NAME)
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
---
# This DaemonSet is for Bottlerocket hosts that support updates through the Bottlerocket API (Bottlerocket OS versions >= v0.4.1)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: update-operator-agent-update-api
  namespace: bottlerocket
  labels:
    update-operator: agent
spec:
  selector:
    matchLabels:
      update-operator: agent
  template:
    metadata:
      labels:
        update-operator: agent
        app: bottlerocket-update-operator
      annotations:
        log: "true"
    spec:
      serviceAccountName: update-operator-agent
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "bottlerocket.aws/updater-interface-version"
                    operator: In
                    values:
                      - 2.0.0
                  - key: "kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                      - arm64
      hostPID: true
      containers:
        - name: agent
          image: "328549459982.dkr.ecr.${AWS_REGION}.amazonaws.com/bottlerocket-update-operator:v0.1.4"
          imagePullPolicy: Always
          args:
            - -agent
            - -debug
            - -nodeName
            - $(NODE_NAME)
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
          volumeMounts:
            - name: bottlerocket-api-socket
              mountPath: /run/api.sock
          securityContext:
            seLinuxOptions:
              user: system_u
              role: system_r
              type: super_t
              level: s0
      volumes:
        - name: bottlerocket-api-socket
          hostPath:
            path: /run/api.sock
            type: Socket

Node info:

(image not preserved)

Issue or Feature Request:

Hello, we have noticed that nodes are not properly drained during updates. The update operator doesn't wait until all pods on the node are evicted and reboots the node immediately, which leads to service interruption. Pod eviction probably isn't started at all.

The operator logs "could not drain" with the error User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"kube-system\", followed by "proceeding anyway". See more details below.

Update operator log during reboot:

2021-09-18T15:53:16.000Z bottlerocket-update-operator controller--b4c55546b-sp4br time="2021-09-18T15:53:15Z" level=error msg="could not drain" component=controller error="[cannot delete daemonsets.apps \"kube-proxy\" is forbidden: User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"kube-system\": kube-system/kube-proxy-n2hzd, cannot delete daemonsets.apps \"update-operator-agent-update-api\" is forbidden: User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"bottlerocket\": bottlerocket/update-operator-agent-update-api-pmv6k, cannot delete daemonsets.apps \"datadog-agent\" is forbidden: User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"datadog\": datadog/datadog-agent-cqx69, cannot delete daemonsets.apps \"calico-node\" is forbidden: User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"kube-system\": kube-system/calico-node-s8vrr, cannot delete daemonsets.apps \"fluentd-papertrail-containerd\" is forbidden: User \"system:serviceaccount:bottlerocket:update-operator-controller\" cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"kube-system\": kube-system/fluentd-papertrail-containerd-g29tm]" intent="reboot-update,perform-update,ready update:true" node=ip-10-233-157-101.eu-west-1.compute.internal worker=manager
2021-09-18T15:53:16.000Z bottlerocket-update-operator controller--b4c55546b-sp4br time="2021-09-18T15:53:15Z" level=warning msg="proceeding anyway" component=controller intent="reboot-update,perform-update,ready update:true" node=ip-10-233-157-101.eu-west-1.compute.internal worker=manager
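
For readers triaging a similar log, here is a minimal sketch of pulling the denied verb, resource, and API group out of a "could not drain" line, to see which RBAC rule the controller's service account is missing. The function name `missing_permissions` and the regular expression are illustrative assumptions based only on the message format shown above; they are not part of the operator.

```python
import re

# Matches the RBAC denial fragment embedded in the drain error, e.g.
#   cannot get resource \"daemonsets\" in API group \"apps\" in the namespace \"kube-system\"
# The optional backslash tolerates the escaped quotes seen in the logfmt output.
DENIED = re.compile(
    r'cannot (?P<verb>\w+) resource \\?"(?P<resource>[^"\\]+)\\?" '
    r'in API group \\?"(?P<group>[^"\\]*)\\?" '
    r'in the namespace \\?"(?P<namespace>[^"\\]+)\\?"'
)


def missing_permissions(log_line):
    """Return the set of (verb, resource, api_group) triples denied in a log line."""
    return {
        (m["verb"], m["resource"], m["group"])
        for m in DENIED.finditer(log_line)
    }


# Shortened copy of the first error above (one pod instead of five).
sample = (
    'error="[cannot delete daemonsets.apps \\"kube-proxy\\" is forbidden: '
    'User \\"system:serviceaccount:bottlerocket:update-operator-controller\\" '
    'cannot get resource \\"daemonsets\\" in API group \\"apps\\" '
    'in the namespace \\"kube-system\\": kube-system/kube-proxy-n2hzd]"'
)

print(missing_permissions(sample))  # {('get', 'daemonsets', 'apps')}
```

For the log above this yields a single triple, which matches the `get` on `daemonsets` in the `apps` group that the error message names.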

I then added these permissions to the update operator controller:

  - verbs:
      - get
      - list
    apiGroups:
      - apps
    resources:
      - daemonsets
      - replicasets

With those permissions in place, the following error is logged instead: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore):

2021-09-18T16:17:37.000Z bottlerocket-update-operator controller--b4c55546b-sp4br time="2021-09-18T16:17:36Z" level=error msg="could not drain" component=controller error="cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): bottlerocket/update-operator-agent-update-api-l5jvm, datadog/datadog-agent-tch22, kube-system/calico-node-dhn88, kube-system/fluentd-papertrail-containerd-p65zh, kube-system/kube-proxy-mvxcb" intent="reboot-update,perform-update,ready update:true" node=ip-10-233-156-93.eu-west-1.compute.internal worker=manager
2021-09-18T16:17:37.000Z bottlerocket-update-operator controller--b4c55546b-sp4br time="2021-09-18T16:17:36Z" level=warning msg="proceeding anyway" component=controller intent="reboot-update,perform-update,ready update:true" node=ip-10-233-156-93.eu-west-1.compute.internal worker=manager

Can I somehow configure the update operator to ignore DaemonSets on drain?
Thanks.
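
To make the requested behavior concrete, here is a minimal sketch of what ignoring DaemonSets during a drain amounts to: pods owned by a DaemonSet are skipped and everything else is evicted, while without the option their presence is an error. It works on plain dictionaries shaped like Pod metadata; `plan_drain`, `is_daemonset_managed`, and the `web-1` pod are hypothetical names for illustration, not the operator's or kubectl's code.

```python
def is_daemonset_managed(pod):
    """True if any ownerReference on the pod has kind DaemonSet."""
    owners = pod.get("metadata", {}).get("ownerReferences") or []
    return any(ref.get("kind") == "DaemonSet" for ref in owners)


def pod_key(pod):
    """Format a pod as namespace/name, the way the log above lists pods."""
    meta = pod["metadata"]
    return f'{meta["namespace"]}/{meta["name"]}'


def plan_drain(pods, ignore_daemonsets):
    """Split pods into (to_evict, skipped).

    Without ignore_daemonsets, DaemonSet-managed pods make the drain fail,
    mirroring the "cannot delete DaemonSet-managed Pods" error above.
    """
    skipped = [p for p in pods if is_daemonset_managed(p)]
    to_evict = [p for p in pods if not is_daemonset_managed(p)]
    if skipped and not ignore_daemonsets:
        names = ", ".join(pod_key(p) for p in skipped)
        raise RuntimeError(f"cannot delete DaemonSet-managed Pods: {names}")
    return to_evict, skipped


pods = [
    {"metadata": {"name": "kube-proxy-mvxcb", "namespace": "kube-system",
                  "ownerReferences": [{"kind": "DaemonSet", "name": "kube-proxy"}]}},
    # Hypothetical workload pod, not taken from the report.
    {"metadata": {"name": "web-1", "namespace": "default",
                  "ownerReferences": [{"kind": "ReplicaSet", "name": "web"}]}},
]

to_evict, skipped = plan_drain(pods, ignore_daemonsets=True)
print([pod_key(p) for p in to_evict])  # ['default/web-1']
print([pod_key(p) for p in skipped])   # ['kube-system/kube-proxy-mvxcb']
```

Skipping DaemonSet pods is usually safe because the DaemonSet controller would recreate them on the same node anyway, so evicting them does not help empty the node.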

cbgbt (Contributor) commented Sep 21, 2021

Thanks for opening this report. It seems like the operator doesn't handle error responses from the Drain API as one might expect. I'll look into handling this case better.
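
As a rough illustration of the two behaviors in question, the sketch below contrasts "proceed anyway" with stopping when the drain fails. It is not the operator's implementation: `update_node`, `DrainError`, and the `proceed_on_drain_error` switch are hypothetical names, and `drain` and `reboot` are stand-in callables.

```python
class DrainError(Exception):
    """Raised by the (hypothetical) drain callable when eviction fails."""


def update_node(node, drain, reboot, proceed_on_drain_error=False):
    """Drain a node and reboot it; return True if the reboot was issued.

    With proceed_on_drain_error=True this reproduces the logged behavior
    ("could not drain" followed by "proceeding anyway"). With the default,
    a failed drain leaves the node untouched so pods keep serving traffic.
    """
    try:
        drain(node)
    except DrainError:
        if not proceed_on_drain_error:
            return False  # Skip the reboot; the caller can retry later.
    reboot(node)
    return True


def failing_drain(node):
    # Stand-in for a drain that is rejected, as in the RBAC error above.
    raise DrainError(f"could not drain {node}")


rebooted = []
print(update_node("node-a", failing_drain, rebooted.append))  # False
print(rebooted)                                               # []
```

Which default is right depends on whether an interrupted workload or a delayed update is the worse outcome for the cluster.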

To clarify, in this case are you anticipating that the operator should drain the DaemonSet pod, or would you rather it ignore DaemonSet pods and wait for the rest to be drained?

Halama (Author) commented Sep 23, 2021

Hi Sean,
thanks for the reply.

I would like to configure the operator to ignore DaemonSet pods, the same as running:

kubectl drain ip-10-233-156-21.eu-west-1.compute.internal --delete-local-data --ignore-daemonsets --force

cbgbt (Contributor) commented Feb 15, 2022

Hi Martin. This should be fixed in the latest Update Operator release, 0.2.0.

Please let us know if you continue to have issues using the new release. Thanks!

cbgbt closed this as completed Feb 15, 2022