
Fail to resolve auth-url DNS #9222

@agramner

Description

What happened:

We use the "External OAUTH Authentication" feature with nginx.ingress.kubernetes.io/auth-url set to the DNS name of a service in our cluster (e.g. xxx.xxx.svc.cluster.local). We run ingress-nginx with 3 replicas, one on each of our hosts, with hostNetwork=true.
Intermittently, the "External OAUTH Authentication" feature stops working in one (or more) of our instances because nginx cannot resolve the DNS name of the auth-url (e.g. xxx.xxx.svc.cluster.local).
Log messages (IPs etc. masked with XXX):

2022/10/27 11:23:00 [error] 28#28: *1026 xxx.xxx.svc.cluster.local could not be resolved (110: Operation timed out), client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", subrequest: "/_external-auth-XXX-Prefix", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
XXX - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 502 0 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 0 29.999 [XXX] [] - - - - XX
2022/10/27 11:23:00 [error] 28#28: *1026 auth request unexpected status: 502 while sending to client, client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
10.254.254.17 - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 500 572 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 70 29.999 [XXX] [] - - - - XXX

The pod never seems to recover from this state.

When this happens, it is still possible to nslookup the name and curl the auth-url (xxx.xxx.svc.cluster.local) from within the container. It seems that only nginx cannot resolve the name.

Performing any of these actions fixes the issue:

  • Exec into the container and run nginx -s reload
  • Delete the pod. A new working one is created.
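
The two workarounds above can be run from outside the pod as well (namespace and pod names are placeholders for our masked values):

```shell
# Either reload nginx inside the stuck controller pod...
kubectl -n <ingress-namespace> exec <controller-pod> -c controller -- nginx -s reload

# ...or delete the pod and let the DaemonSet recreate it.
kubectl -n <ingress-namespace> delete pod <controller-pod>
```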

We are not sure what triggers the error in production, but we found one way to provoke it: delete the "core-dns" pods in the "kube-system" namespace.

This doesn't trigger the issue every time, but after a couple of recreations of the "core-dns" pods, at least one of the ingress-nginx instances gets stuck in this "broken" state.

Exec'ing into the container and manually changing the configuration so that proxy_pass uses the URL directly, instead of going through the $target variable set by the following line, fixes the issue, and it can no longer be reproduced by deleting the "core-dns" pods. How can this make any difference?

set $target {{ $externalAuth.URL }};
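
For context, nginx treats the two forms differently: a literal URL in proxy_pass is resolved once, at configuration load/reload time, through the system resolver (/etc/resolv.conf), and is never re-resolved, while a URL held in a variable is resolved at request time through nginx's own resolver directive. A sketch of the difference (service name and resolver address are placeholders, not our actual values):

```nginx
# Literal URL: the name is resolved once when the configuration is
# loaded or reloaded, via the system resolver; never re-resolved.
location /_external-auth-static {
    proxy_pass http://auth.example.svc.cluster.local:8080/validate;
}

# Variable URL: nginx resolves the name at request time through the
# "resolver" directive, re-resolving after the valid= period expires.
resolver 10.96.0.10 valid=30s;  # placeholder cluster DNS address
location /_external-auth-dynamic {
    set $target http://auth.example.svc.cluster.local:8080/validate;
    proxy_pass $target;
}
```

That would explain why the static form keeps working after the "core-dns" pods are recreated (the address was captured at reload time), while the variable form depends on nginx's runtime resolver staying healthy.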

What you expected to happen:
DNS resolution of the auth-url should always work, even if the "core-dns" pods are recreated.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.4.0
  Build:         50be2bf95fd1ef480420e2aa1d6c5c7c138c95ea
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.10

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):
Server Version: v1.23.0

Environment:

  • Cloud provider or hardware configuration:
    Self hosted
  • OS (e.g. from /etc/os-release):
    Ubuntu 18.04.6 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a):
    5.4.0-131-generic #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    kubeadm
  • Basic cluster related info:
    Server Version: v1.23.0
NAME            STATUS   ROLES                  AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
xxx-1   Ready    control-plane,master   16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-1   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-2   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-3   Ready    <none>                 16d   v1.23.3   xxx  <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-4   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
xxx-5   Ready    <none>                 16d   v1.23.3   xxx   <none>        Ubuntu 18.04.6 LTS   5.4.0-131-generic   cri-o://1.21.7
  • How was the ingress-nginx-controller installed:
ingress-nginx:
  controller:
    admissionWebhooks:
      enabled: false
    daemonset:
      useHostPort: true
    dnsPolicy: ClusterFirstWithHostNet
    hostNetwork: true
    kind: DaemonSet
    nodeSelector:
      core: ""
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
  nameOverride: nginx-ingress
  • Current State of the controller:
    • kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=XXX
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nginx-ingress
              app.kubernetes.io/part-of=nginx-ingress
              app.kubernetes.io/version=1.4.0
              helm.sh/chart=ingress-nginx-4.3.0
Annotations:  meta.helm.sh/release-name: XXX
              meta.helm.sh/release-namespace: XXX
Controller:   k8s.io/ingress-nginx
Events:       <none>
  • kubectl -n <ingresscontrollernamespace> get all -A -o wide
  • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
Name:             xxx
Namespace:        xxx
Priority:         0
Service Account:  xxx
Node:             xxx
Start Time:       Thu, 27 Oct 2022 13:19:25 +0200
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=XXX
                  app.kubernetes.io/name=nginx-ingress
                  controller-revision-hash=58d9d684cc
                  pod-template-generation=6
Annotations:      <none>
Status:           Running
IP:              xxx
IPs:
  IP:           xxx
Controlled By:  DaemonSet/xxx
Containers:
  controller:
    Container ID:  cri-o://5a2c60b3f3dc788430e3079f616161e71ec502a8f6acdabfa4c578cb093fda1e
    Image:         registry.k8s.io/ingress-nginx/controller:v1.4.0@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
    Image ID:      registry.k8s.io/ingress-nginx/controller@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
    Ports:         80/TCP, 443/TCP
    Host Ports:    80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/xxx
      --election-id=ingress-controller-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/xxx
    State:          Running
      Started:      Thu, 27 Oct 2022 14:02:08 +0200
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       xxx(v1:metadata.name)
      POD_NAMESPACE:  xxx(v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b5sd5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-b5sd5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              core=
                             kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
  • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Name:                     xxx
Namespace:                xx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=xxx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=nginx-ingress
                          app.kubernetes.io/part-of=nginx-ingress
                          app.kubernetes.io/version=1.4.0
                          helm.sh/chart=ingress-nginx-4.3.0
Annotations:              meta.helm.sh/release-name: xxx
                          meta.helm.sh/release-namespace: xxx
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=xxx,app.kubernetes.io/name=nginx-ingress
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       xxx
IPs:                      xxx
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32655/TCP
Endpoints:                xxx
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32081/TCP
Endpoints:                xxx
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
  • Current state of ingress object, if applicable:
Name:       xxx      
Labels:           app.kubernetes.io/instance=xxx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=xxx
                  helm.sh/chart=xxx-0.1.0
Namespace:        xxx
Address:          
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host             Path  Backends
  ----             ----  --------
  xxx
                   /grafana(/|$)(.*)   grafana:80 (10.0.2.238:3000)
                   /kibana(/|$)(.*)    kibana-kibana:5601 (10.0.2.167:5601)
  *                
                   /grafana(/|$)(.*)   grafana:80 (10.0.2.238:3000)
                   /kibana(/|$)(.*)    kibana-kibana:5601 (10.0.2.167:5601)
Annotations:       cert-manager.io/cluster-issuer: xxx
                   kubernetes.io/ingress.class: nginx
                   meta.helm.sh/release-name: xxx
                   meta.helm.sh/release-namespace: xxx
                   nginx.ingress.kubernetes.io/auth-cache-duration: 200 202 60s, 401 0s
                   nginx.ingress.kubernetes.io/auth-cache-key: $cookie_xxx
                   nginx.ingress.kubernetes.io/auth-response-headers: xxx
                   nginx.ingress.kubernetes.io/auth-signin: /
                   nginx.ingress.kubernetes.io/auth-url: http://xxx.svc.cluster.local:8080/xxx
                   nginx.ingress.kubernetes.io/rewrite-target: /$2
                   nginx.ingress.kubernetes.io/server-snippet:
                     location = /kibana {
                       return 301 /kibana/app/kibana#/discover/xxx...
                     }
                   nginx.ingress.kubernetes.io/ssl-redirect: false
Events:            <none>
  • Others:

How to reproduce this issue:
Delete the "core-dns" pods in "kube-system" namespace a couple of times.
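
Assuming the standard CoreDNS deployment, whose pods carry the k8s-app=kube-dns label (an assumption; adjust the selector for your cluster), the repro can be scripted as:

```shell
# Delete the CoreDNS pods a few times, waiting for them to come back
# between iterations.
for i in 1 2 3; do
  kubectl -n kube-system delete pods -l k8s-app=kube-dns
  kubectl -n kube-system wait --for=condition=Ready pods \
    -l k8s-app=kube-dns --timeout=120s
done
```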

Anything else we need to know:

Metadata

Assignees
No one assigned

Labels
  • kind/support: Categorizes issue or PR as a support question.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.