This problem was finally solved after I opened an issue in the Cilium project on GitHub and got help from the community.
Issue: https://github.com/cilium/cilium/issues/20498
Problem Description
Since I plan to learn eBPF, I switched the cluster's CNI plugin from flannel to Cilium. As an eBPF-based CNI implementation, Cilium offers more features than flannel, including but not limited to highly customizable network policies and security hardening.
However, I ran into the following problem while using Cilium. Below are the cluster deployment steps, how the problem surfaced, and its symptoms.
Version information:
- Kubernetes 1.23.0 (kubelet/kubeadm/kubectl all match the cluster version)
- Cilium 1.11.6
- The cluster was initialized with kubeadm init --config kubeadm.conf; the configuration file is shown below.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.153.21
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  imagePullPolicy: IfNotPresent
  name: nm
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: 1.23.0
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.5.0.0/16
scheduler: {}
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
resolvConf: /run/systemd/resolve/resolv.conf
- Then I joined the worker nodes as usual (a sketch of the join command follows the IP list). The test cluster has 1 master and 2 workers, with the following IPs:
- master: 192.168.153.21
- worker1: 192.168.153.22
- worker2: 192.168.153.23
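For completeness, here is a sketch of the join step; the token and discovery hash below are placeholders, the real values come from the output of kubeadm init on the master.
# Run on worker1 and worker2; <token> and <hash> are placeholders printed by kubeadm init
kubeadm join 192.168.153.21:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>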
- Install the network plugin with cilium install, keeping Cilium's settings at their defaults; a sketch of the commands is shown below.
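A sketch of the install and the follow-up check with the cilium CLI (the --version flag is optional; without it the CLI picks a release compatible with the cluster):
# Install Cilium 1.11.6 with default settings
cilium install --version 1.11.6
# Wait until the DaemonSet and operator report ready
cilium status --wait
After the installation, the pods in the cluster looked like this: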
root@nm:/work-place/kubernetes/create-cluster# kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system cilium-99lxc 1/1 Running 0 18m 192.168.153.22 na <none> <none>
kube-system cilium-ct5s7 1/1 Running 0 18m 192.168.153.21 nm <none> <none>
kube-system cilium-drtlh 1/1 Running 0 18m 192.168.153.23 nb <none> <none>
kube-system cilium-operator-5d67fc458d-zxgdd 1/1 Running 0 18m 192.168.153.22 na <none> <none>
kube-system coredns-6d8c4cb4d-jkssb 0/1 Running 8 (2m55s ago) 19m 10.0.0.240 na <none> <none>
kube-system coredns-6d8c4cb4d-psxvw 0/1 CrashLoopBackOff 8 (83s ago) 19m 10.0.2.176 nb <none> <none>
kube-system etcd-nm 1/1 Running 2 25m 192.168.153.21 nm <none> <none>
kube-system kube-apiserver-nm 1/1 Running 2 25m 192.168.153.21 nm <none> <none>
kube-system kube-controller-manager-nm 1/1 Running 2 25m 192.168.153.21 nm <none> <none>
kube-system kube-proxy-hv5nc 1/1 Running 0 24m 192.168.153.22 na <none> <none>
kube-system kube-proxy-pbzlx 1/1 Running 0 24m 192.168.153.23 nb <none> <none>
kube-system kube-proxy-rqpxw 1/1 Running 0 25m 192.168.153.21 nm <none> <none>
kube-system kube-scheduler-nm 1/1 Running 2 25m 192.168.153.21 nm <none> <none>
Before installing Cilium I had already cleaned up the previous CNI configuration and removed everything under /etc/cni/net.d/. Even so, the situation above occurred: the two coredns Pods stayed in the Running state but never became Ready.
I then looked at the coredns logs and the describe output.
root@nm:/work-place/kubernetes/create-cluster# kubectl logs coredns-6d8c4cb4d-jkssb -n kube-system
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.6
linux/amd64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:39983->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:53240->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:49802->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:54428->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:43974->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:37821->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:36545->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:56785->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:47913->192.168.153.2:53: i/o timeout
[ERROR] plugin/errors: 2 7607030484537686268.4300248127207674545. HINFO: read udp 10.0.0.240:38162->192.168.153.2:53: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
So coredns could not reach its upstream DNS server over UDP; 192.168.153.2 here is the gateway of my VM network.
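To narrow down where the DNS traffic gets lost, one quick check (a sketch; it assumes the busybox image, whose nslookup is enough for this) is to query the upstream resolver directly from the node and from a Pod on the pod network:
# From the node itself: should succeed if the resolver at the gateway is healthy
nslookup kubernetes.io 192.168.153.2
# From a Pod on the pod network: a timeout here matches the coredns errors above
kubectl run -it --rm --restart=Never dnstest --image=docker.io/library/busybox -- nslookup kubernetes.io 192.168.153.2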
root@nm:/work-place/kubernetes/create-cluster# kubectl describe pod coredns-6d8c4cb4d-jkssb -n kube-system
Name: coredns-6d8c4cb4d-jkssb
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: na/192.168.153.22
Start Time: Wed, 13 Jul 2022 00:37:51 +0800
Labels: k8s-app=kube-dns
pod-template-hash=6d8c4cb4d
Annotations: <none>
Status: Running
IP: 10.0.0.240
IPs:
IP: 10.0.0.240
Controlled By: ReplicaSet/coredns-6d8c4cb4d
Containers:
coredns:
Container ID: docker://cc35b97903b120cb54765641da47c69ea8c833e6c72958407c7e605a5aa001b4
Image: registry.aliyuncs.com/google_containers/coredns:v1.8.6
Image ID: docker-pullable://registry.aliyuncs.com/google_containers/coredns@sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 13 Jul 2022 00:57:12 +0800
Finished: Wed, 13 Jul 2022 00:59:06 +0800
Ready: False
Restart Count: 8
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v8hzn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
kube-api-access-v8hzn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21m default-scheduler Successfully assigned kube-system/coredns-6d8c4cb4d-jkssb to na
Warning FailedCreatePodSandBox 20m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8ae4c118e4c3ff1c0bd2c601c808cae2c17cbc27552fb148b755b7d798f0bb71" network for pod "coredns-6d8c4cb4d-jkssb": networkPlugin cni failed to set up pod "coredns-6d8c4cb4d-jkssb_kube-system" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?
Normal SandboxChanged 20m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 20m kubelet Container image "registry.aliyuncs.com/google_containers/coredns:v1.8.6" already present on machine
Normal Created 20m kubelet Created container coredns
Normal Started 20m kubelet Started container coredns
Warning Unhealthy 20m (x2 over 20m) kubelet Readiness probe failed: Get "http://10.0.0.240:8181/ready": dial tcp 10.0.0.240:8181: i/o timeout (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 18m (x13 over 20m) kubelet Readiness probe failed: Get "http://10.0.0.240:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 15m (x12 over 19m) kubelet Liveness probe failed: Get "http://10.0.0.240:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Killing 14m kubelet Container coredns failed liveness probe, will be restarted
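The failing probes can also be reproduced by hand from the node that hosts the Pod, which helps confirm that the datapath rather than coredns itself is at fault. A sketch, using the Pod IP from the describe output above:
# Run on node na, which hosts the Pod; on a healthy network both answer immediately
curl -m 3 http://10.0.0.240:8181/ready
curl -m 3 http://10.0.0.240:8080/health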
What describe reports is that the health probes are failing. Oddly, the Cilium Pods themselves were all Running and Ready (although cilium status, shown below, did report cilium-health-ep errors); in any case, with coredns broken the cluster network clearly could not work properly.
root@nm:/work-place/kubernetes/create-cluster# cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: 3 errors
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Hubble: disabled
\__/¯¯\__/ ClusterMesh: disabled
\__/
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium Running: 3
cilium-operator Running: 1
Cluster Pods: 2/2 managed by Cilium
Image versions cilium-operator quay.io/cilium/operator-generic:v1.11.6@sha256:9f6063c7bcaede801a39315ec7c166309f6a6783e98665f6693939cf1701bc17: 1
cilium quay.io/cilium/cilium:v1.11.6@sha256:f7f93c26739b6641a3fa3d76b1e1605b15989f25d06625260099e01c8243f54c: 3
Errors: cilium cilium-hn9g5 controller cilium-health-ep is failing since 27s (21x): Get "http://10.0.2.134:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cilium cilium-7l6br controller cilium-health-ep is failing since 27s (21x): Get "http://10.0.0.36:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cilium cilium-rzkb6 controller cilium-health-ep is failing since 27s (21x): Get "http://10.0.1.222:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
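The cilium-health-ep errors point in the same direction: the agents cannot reach the health endpoints over the pod network. A sketch of further checks with the cilium CLI and the agent Pods that can confirm this:
# End-to-end connectivity test driven by the cilium CLI
cilium connectivity test
# Per-node health view from inside an agent Pod
kubectl -n kube-system exec ds/cilium -- cilium-health status
# Detailed agent status
kubectl -n kube-system exec ds/cilium -- cilium status --verbose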
Additional Information
Some additional information about the key components follows.
Output of cilium sysdump:
root@nm:/work-place/kubernetes/create-cluster# cilium sysdump
🔍 Collecting sysdump with cilium-cli version: v0.11.11, args: [sysdump]
🔍 Collecting Kubernetes nodes
🔍 Collect Kubernetes nodes
🔍 Collecting Kubernetes events
🔍 Collecting Kubernetes pods
🔍 Collect Kubernetes version
🔍 Collecting Kubernetes namespaces
🔍 Collecting Kubernetes services
🔍 Collecting Kubernetes pods summary
🔍 Collecting Kubernetes endpoints
🔍 Collecting Kubernetes network policies
🔍 Collecting Cilium cluster-wide network policies
🔍 Collecting Cilium network policies
🔍 Collecting Cilium local redirect policies
🔍 Collecting Cilium egress NAT policies
🔍 Collecting Cilium endpoints
🔍 Collecting Cilium identities
🔍 Collecting Cilium nodes
🔍 Collecting Ingresses
🔍 Collecting CiliumEnvoyConfigs
🔍 Collecting CiliumClusterwideEnvoyConfigs
🔍 Collecting Cilium etcd secret
🔍 Collecting the Cilium configuration
🔍 Collecting the Cilium daemonset(s)
🔍 Collecting the Hubble daemonset
🔍 Collecting the Hubble Relay deployment
🔍 Collecting the Hubble Relay configuration
🔍 Collecting the Hubble UI deployment
🔍 Collecting the Cilium operator deployment
🔍 Collecting the CNI configuration files from Cilium pods
⚠️ Deployment "hubble-ui" not found in namespace "kube-system" - this is expected if Hubble UI is not enabled
🔍 Collecting the CNI configmap
🔍 Collecting the 'clustermesh-apiserver' deployment
⚠️ Deployment "hubble-relay" not found in namespace "kube-system" - this is expected if Hubble is not enabled
🔍 Collecting gops stats from Cilium pods
🔍 Collecting gops stats from Hubble pods
🔍 Collecting gops stats from Hubble Relay pods
🔍 Collecting 'cilium-bugtool' output from Cilium pods
🔍 Collecting logs from Cilium pods
🔍 Collecting logs from Cilium operator pods
⚠️ Deployment "clustermesh-apiserver" not found in namespace "kube-system" - this is expected if 'clustermesh-apiserver' isn't enabled
🔍 Collecting logs from 'clustermesh-apiserver' pods
🔍 Collecting logs from Hubble pods
🔍 Collecting logs from Hubble Relay pods
🔍 Collecting logs from Hubble UI pods
🔍 Collecting platform-specific data
🔍 Collecting Hubble flows from Cilium pods
⚠️ The following tasks failed, the sysdump may be incomplete:
⚠️ [11] Collecting Cilium egress NAT policies: failed to collect Cilium egress NAT policies: the server could not find the requested resource (get ciliumegressnatpolicies.cilium.io)
⚠️ [12] Collecting Cilium local redirect policies: failed to collect Cilium local redirect policies: the server could not find the requested resource (get ciliumlocalredirectpolicies.cilium.io)
⚠️ [17] Collecting CiliumClusterwideEnvoyConfigs: failed to collect CiliumClusterwideEnvoyConfigs: the server could not find the requested resource (get ciliumclusterwideenvoyconfigs.cilium.io)
⚠️ [18] Collecting CiliumEnvoyConfigs: failed to collect CiliumEnvoyConfigs: the server could not find the requested resource (get ciliumenvoyconfigs.cilium.io)
⚠️ [23] Collecting the Hubble Relay configuration: failed to collect the Hubble Relay configuration: configmaps "hubble-relay-config" not found
⚠️ cniconflist-cilium-7l6br: error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ cniconflist-cilium-hn9g5: command terminated with exit code 1
⚠️ cniconflist-cilium-rzkb6: command terminated with exit code 1
⚠️ gops-cilium-7l6br-memstats: failed to list processes "cilium-7l6br" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ gops-cilium-7l6br-stack: failed to list processes "cilium-7l6br" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ gops-cilium-7l6br-stats: failed to list processes "cilium-7l6br" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ cilium-bugtool-cilium-7l6br: failed to collect 'cilium-bugtool' output for "cilium-7l6br" in namespace "kube-system": error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host:
⚠️ logs-cilium-7l6br-cilium-agent: failed to collect logs for "cilium-7l6br" ("cilium-agent") in namespace "kube-system": Get "https://192.168.153.23:10250/containerLogs/kube-system/cilium-7l6br/cilium-agent?limitBytes=1073741824&sinceTime=2021-07-13T08%3A20%3A54Z&timestamps=true": dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ logs-cilium-operator-5d67fc458d-gjdc6-cilium-operator: failed to collect logs for "cilium-operator-5d67fc458d-gjdc6" ("cilium-operator") in namespace "kube-system": Get "https://192.168.153.23:10250/containerLogs/kube-system/cilium-operator-5d67fc458d-gjdc6/cilium-operator?limitBytes=1073741824&sinceTime=2021-07-13T08%3A20%3A55Z&timestamps=true": dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ logs-cilium-7l6br-mount-cgroup: failed to collect logs for "cilium-7l6br" ("mount-cgroup") in namespace "kube-system": Get "https://192.168.153.23:10250/containerLogs/kube-system/cilium-7l6br/mount-cgroup?limitBytes=1073741824&sinceTime=2021-07-13T08%3A20%3A54Z&timestamps=true": dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ logs-cilium-7l6br-clean-cilium-state: failed to collect logs for "cilium-7l6br" ("clean-cilium-state") in namespace "kube-system": Get "https://192.168.153.23:10250/containerLogs/kube-system/cilium-7l6br/clean-cilium-state?limitBytes=1073741824&sinceTime=2021-07-13T08%3A20%3A54Z&timestamps=true": dial tcp 192.168.153.23:10250: connect: no route to host
⚠️ hubble-flows-cilium-7l6br: failed to collect hubble flows for "cilium-7l6br" in namespace "kube-system": error dialing backend: dial tcp 192.168.153.23:10250: connect: no route to host:
⚠️ Please note that depending on your Cilium version and installation options, this may be expected
🗳 Compiling sysdump
✅ The sysdump has been saved to /work-place/kubernetes/create-cluster/cilium-sysdump-20220713-162053.zip
The coredns ConfigMap. Its forward . /etc/resolv.conf block sends non-cluster queries to the nameservers in the Pod's resolv.conf, which kubelet populates from the host's /run/systemd/resolve/resolv.conf (192.168.153.2 in this setup), consistent with the i/o timeouts above:
root@nm:/work-place/kubernetes/create-cluster# kubectl describe cm coredns -n kube-system
Name: coredns
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
Corefile:
----
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
BinaryData
====
Events: <none>
The kubelet configuration (note the resolvConf setting):
root@nm:/home/lzl# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
The flags kubeadm adds for the kubelet:
root@nm:/home/lzl# cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.6"
OS information:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux nm 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
The output of sysctl -a | grep -w rp_filter:
root@nm:/work-place/kubernetes/create-cluster# sysctl -a | grep -w rp_filter
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.cilium_host.rp_filter = 0
net.ipv4.conf.cilium_net.rp_filter = 0
net.ipv4.conf.cilium_vxlan.rp_filter = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.docker0.rp_filter = 2
net.ipv4.conf.ens33.rp_filter = 2
net.ipv4.conf.lo.rp_filter = 2
net.ipv4.conf.lxc_health.rp_filter = 2
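As the comment in the fix below also notes, the kernel uses max(conf.all.rp_filter, conf.{dev}.rp_filter) for source validation on a device. A small sketch to print the effective value per interface:
# The effective rp_filter of a device is max(conf.all, conf.<dev>)
all=$(sysctl -n net.ipv4.conf.all.rp_filter)
for dev in ens33 lxc_health cilium_host cilium_vxlan; do
    dev_val=$(sysctl -n net.ipv4.conf."$dev".rp_filter)
    echo "$dev effective rp_filter = $(( all > dev_val ? all : dev_val ))"
done
So even though conf.all is 0 here, ens33 and lxc_health are still effectively in loose mode (2), and newly created lxc* interfaces inherit 2 from conf.default.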
As an additional note, everything worked fine when I used flannel for the cluster network.
What I Tried
Before getting that pointer from the Cilium folks, I had made some adjustments based on the links below.
- CoreDNS troubleshooting for loops in Kubernetes clusters: https://github.com/coredns/coredns/blob/master/plugin/loop/README.md#troubleshooting-loops-in-kubernetes-clusters
- A cluster network failure caused by plugin/loop that looks similar to my problem: https://github.com/coredns/coredns/issues/2790
- The same coredns "[ERROR] plugin/errors: 2 ... read udp" upstream-unreachable error seen in this troubleshooting: https://github.com/kubernetes/kubernetes/issues/86762
I also started a busybox Pod and, from inside the container, tried to ping my gateway in both environments: the healthy flannel-based network and the broken Cilium-based network.
In the flannel network the gateway was reachable (the first target, 10.96.0.10, is a Service ClusterIP and typically does not answer ICMP, so the packet loss there is expected):
root@master:/home/lzl/work-place/kubernetes/deploy-k8s# kubectl run -it --rm --restart=Never busybox --image=docker.io/library/busybox sh
If you don't see a command prompt, try pressing enter.
/ # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
/ # ping 192.168.153.2
PING 192.168.153.2 (192.168.153.2): 56 data bytes
64 bytes from 192.168.153.2: seq=0 ttl=127 time=0.458 ms
64 bytes from 192.168.153.2: seq=1 ttl=127 time=0.405 ms
64 bytes from 192.168.153.2: seq=2 ttl=127 time=1.041 ms
^C
--- 192.168.153.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.405/0.634/1.041 ms
In the broken Cilium network, the gateway could not be pinged at all:
root@nm:/work-place/kubernetes/create-cluster# kubectl run -it --rm --restart=Never busybox --image=docker.io/library/busybox sh
If you don't see a command prompt, try pressing enter.
/ # ping 192.168.153.2
PING 192.168.153.2 (192.168.153.2): 56 data bytes
None of these attempts helped.
Resolution
The final fix was provided by vincentmli, as follows.
Manually write the file below into /etc/sysctl.d/ on each node, then reboot the node.
cat /etc/sysctl.d/99-zzz-override_cilium.conf
# Disable rp_filter on Cilium interfaces since it may cause mangled packets to be dropped
net.ipv4.conf.lxc*.rp_filter = 0
net.ipv4.conf.cilium_*.rp_filter = 0
# The kernel uses max(conf.all, conf.{dev}) as its value, so we need to set .all. to 0 as well.
# Otherwise it will overrule the device specific settings.
net.ipv4.conf.all.rp_filter = 0
After doing this on all nodes, coredns finally became healthy, and a test application I deployed was reachable over the network.
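If a full reboot is inconvenient, re-applying the sysctl.d snippets and restarting coredns should have the same effect (a sketch; systemd-sysctl is the component that expands the lxc*/cilium_* globs):
# On every node: re-apply /etc/sysctl.d/*, including the new override file
sudo systemctl restart systemd-sysctl
# Verify that conf.all and the cilium/lxc interfaces now report 0
sysctl -a | grep -w rp_filter
# Restart coredns so it comes back up on the corrected datapath
kubectl -n kube-system rollout restart deployment coredns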
For the reasoning behind this change, see this PR: https://github.com/cilium/cilium/pull/20072
For additional details, see the issue: https://github.com/cilium/cilium/issues/20498