Description
What happened:
We use "External OAUTH Authentication" feature with nginx.ingress.kubernetes.io/auth-url set to a DNS of a service in our cluster (i.e. xxx.xxx.svc.cluster.local). We run ingress-nginx in 3 replicas, one on each of our hosts with hostNetwork=true.
Intermittently the "External OAUTH Authentication" feature in one (or more) of our instances stops working because nginx can't resolve the DNS of the auth-url (i.e. xxx.xxx.svc.cluster.local).
Log message (masked IP etc with XXX):
2022/10/27 11:23:00 [error] 28#28: *1026 xxx.xxx.svc.cluster.local could not be resolved (110: Operation timed out), client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", subrequest: "/_external-auth-XXX-Prefix", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
XXX - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 502 0 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 0 29.999 [XXX] [] - - - - XX
2022/10/27 11:23:00 [error] 28#28: *1026 auth request unexpected status: 502 while sending to client, client: XXX, server: _, request: "POST /kibana/api/core/capabilities HTTP/2.0", host: "XXX", referrer: "https://XXX/kibana/app/kibana"
10.254.254.17 - - [27/Oct/2022:11:23:00 +0000] "POST /kibana/api/core/capabilities HTTP/2.0" 500 572 "https://XXX/kibana/app/kibana" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36" 70 29.999 [XXX] [] - - - - XXX
The pod never seems to recover from this state.
When this happens it's still possible to nslookup the DNS name and curl the auth-url (xxx.xxx.svc.cluster.local) from within the container. It seems like it's just nginx that can't resolve it.
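For reference, a hedged sketch of that in-container verification (pod, namespace, and port are placeholders; the real names are masked in this report):

```shell
# Open a shell in the affected controller pod.
kubectl -n <ingresscontrollernamespace> exec -it <ingresscontrollerpodname> -- sh

# Inside the container, both of these still succeed while nginx itself fails:
nslookup xxx.xxx.svc.cluster.local
curl -v http://xxx.xxx.svc.cluster.local:8080/
```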
Performing either of these actions fixes the issue (a kubectl sketch of both follows the list):
- Exec into the container and run nginx -s reload
- Delete the pod; a new, working one is created.
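A hedged sketch of both workarounds, using the same placeholder names as above:

```shell
# Workaround 1: reload nginx inside the running container.
kubectl -n <ingresscontrollernamespace> exec <ingresscontrollerpodname> -- nginx -s reload

# Workaround 2: delete the pod; the DaemonSet recreates it.
kubectl -n <ingresscontrollernamespace> delete pod <ingresscontrollerpodname>
```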
We're not sure what triggers the error in production, but we found one way to trigger it: delete the "core-dns" pods in the "kube-system" namespace.
This doesn't trigger the issue every time, but after a couple of recreations of the "core-dns" pods, at least one of the "ingress-nginx" instances gets stuck in this "broken" state.
Exec-ing into the container and manually changing the following line so that the URL is set directly in the proxy_pass directive, instead of going through the variable $target, fixes the issue, and it can no longer be reproduced by deleting the "core-dns" pods. How can this be any different?
set $target {{ $externalAuth.URL }};
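For reference, a minimal sketch of the two variants (the location name and URL are placeholders based on the logs above, not the exact template output). To our understanding, with a variable nginx defers name resolution to request time and goes through its configured resolver directive, while with a literal URL nginx resolves the name once at configuration load via the system resolver and caches it for the lifetime of the workers, which would explain the difference:

```nginx
# Variant A (generated config): proxy_pass through a variable.
# nginx resolves the hostname per request via the "resolver" directive,
# so a stuck runtime resolver fails every auth subrequest.
location = /_external-auth-XXX-Prefix {
    set $target http://xxx.xxx.svc.cluster.local:8080/xxx;
    proxy_pass $target;
}

# Variant B (manual workaround): literal URL in proxy_pass.
# nginx resolves the hostname once when the configuration is loaded
# (via the system resolver, /etc/resolv.conf) and never re-resolves it.
location = /_external-auth-XXX-Prefix {
    proxy_pass http://xxx.xxx.svc.cluster.local:8080/xxx;
}
```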
What you expected to happen:
DNS resolution of the auth-url should always work, even if the "core-dns" pods are recreated.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version):
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.4.0
Build: 50be2bf95fd1ef480420e2aa1d6c5c7c138c95ea
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
Kubernetes version (use kubectl version):
Server Version: v1.23.0
Environment:
- Cloud provider or hardware configuration:
Self hosted
- OS (e.g. from /etc/os-release):
Ubuntu 18.04.6 LTS (Bionic Beaver)
- Kernel (e.g. uname -a):
5.4.0-131-generic #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
kubeadm
- Basic cluster related info:
Server Version: v1.23.0
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
xxx-1 Ready control-plane,master 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-1 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-2 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-3 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-4 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
xxx-5 Ready <none> 16d v1.23.3 xxx <none> Ubuntu 18.04.6 LTS 5.4.0-131-generic cri-o://1.21.7
- How was the ingress-nginx-controller installed:
ingress-nginx:
  controller:
    admissionWebhooks:
      enabled: false
    daemonset:
      useHostPort: true
    dnsPolicy: ClusterFirstWithHostNet
    hostNetwork: true
    kind: DaemonSet
    nodeSelector:
      core: ""
    service:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
  nameOverride: nginx-ingress
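The top-level ingress-nginx: key indicates the chart is installed as a dependency of a parent chart; a hedged sketch of how such values are typically applied (chart path, release name, and namespace are placeholders, not taken from this issue):

```shell
# Assumes Chart.yaml of ./parent-chart lists ingress-nginx 4.3.0 as a
# dependency; the values above then nest under its "ingress-nginx:" key.
helm dependency update ./parent-chart
helm upgrade --install XXX ./parent-chart -n XXX -f values.yaml
```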
- Current State of the controller:
kubectl describe ingressclasses
Name: nginx
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=XXX
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nginx-ingress
app.kubernetes.io/part-of=nginx-ingress
app.kubernetes.io/version=1.4.0
helm.sh/chart=ingress-nginx-4.3.0
Annotations: meta.helm.sh/release-name: XXX
meta.helm.sh/release-namespace: XXX
Controller: k8s.io/ingress-nginx
Events: <none>
kubectl -n <ingresscontrollernamespace> get all -A -o wide
kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
Name: xxx
Namespace: xxx
Priority: 0
Service Account: xxx
Node: xxx
Start Time: Thu, 27 Oct 2022 13:19:25 +0200
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=XXX
app.kubernetes.io/name=nginx-ingress
controller-revision-hash=58d9d684cc
pod-template-generation=6
Annotations: <none>
Status: Running
IP: xxx
IPs:
IP: xxx
Controlled By: DaemonSet/xxx
Containers:
controller:
Container ID: cri-o://5a2c60b3f3dc788430e3079f616161e71ec502a8f6acdabfa4c578cb093fda1e
Image: registry.k8s.io/ingress-nginx/controller:v1.4.0@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
Image ID: registry.k8s.io/ingress-nginx/controller@sha256:34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--publish-service=$(POD_NAMESPACE)/xxx
--election-id=ingress-controller-leader
--controller-class=k8s.io/ingress-nginx
--ingress-class=nginx
--configmap=$(POD_NAMESPACE)/xxx
State: Running
Started: Thu, 27 Oct 2022 14:02:08 +0200
Ready: True
Restart Count: 1
Requests:
cpu: 100m
memory: 90Mi
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: xxx(v1:metadata.name)
POD_NAMESPACE: xxx(v1:metadata.namespace)
LD_PRELOAD: /usr/local/lib/libmimalloc.so
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b5sd5 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-b5sd5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: core=
kubernetes.io/os=linux
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Name: xxx
Namespace: xx
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=xxx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nginx-ingress
app.kubernetes.io/part-of=nginx-ingress
app.kubernetes.io/version=1.4.0
helm.sh/chart=ingress-nginx-4.3.0
Annotations: meta.helm.sh/release-name: xxx
meta.helm.sh/release-namespace: xxx
service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector: app.kubernetes.io/component=controller,app.kubernetes.io/instance=xxx,app.kubernetes.io/name=nginx-ingress
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: xxx
IPs: xxx
Port: http 80/TCP
TargetPort: http/TCP
NodePort: http 32655/TCP
Endpoints: xxx
Port: https 443/TCP
TargetPort: https/TCP
NodePort: https 32081/TCP
Endpoints: xxx
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
- Current state of ingress object, if applicable:
Name: xxx
Labels: app.kubernetes.io/instance=xxx
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=xxx
helm.sh/chart=xxx-0.1.0
Namespace: xxx
Address:
Ingress Class: <none>
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
xxx
/grafana(/|$)(.*) grafana:80 (10.0.2.238:3000)
/kibana(/|$)(.*) kibana-kibana:5601 (10.0.2.167:5601)
*
/grafana(/|$)(.*) grafana:80 (10.0.2.238:3000)
/kibana(/|$)(.*) kibana-kibana:5601 (10.0.2.167:5601)
Annotations: cert-manager.io/cluster-issuer: xxx
kubernetes.io/ingress.class: nginx
meta.helm.sh/release-name: xxx
meta.helm.sh/release-namespace: xxx
nginx.ingress.kubernetes.io/auth-cache-duration: 200 202 60s, 401 0s
nginx.ingress.kubernetes.io/auth-cache-key: $cookie_xxx
nginx.ingress.kubernetes.io/auth-response-headers: xxx
nginx.ingress.kubernetes.io/auth-signin: /
nginx.ingress.kubernetes.io/auth-url: http://xxx.svc.cluster.local:8080/xxx
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/server-snippet:
location = /kibana {
return 301 /kibana/app/kibana#/discover/xxx...
}
nginx.ingress.kubernetes.io/ssl-redirect: false
Events: <none>
- Others:
How to reproduce this issue:
Delete the "core-dns" pods in "kube-system" namespace a couple of times.
Anything else we need to know: