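The project has servers in two subnets, with a network device between them that enforces network security. During testing, cross-subnet access to K8s internal service IPs failed for some nodes: the service IP could be pinged but not reached over TCP or UDP, and the pod IP could neither be pinged nor accessed.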
Server subnets |
---|
192.168.232.0/24 |
192.168.223.0/24 |
Troubleshooting approach
Test network connectivity
The nc command can be used to test network connectivity. To test a TCP connection:
nc -vz 192.168.232.128 10256
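To test a UDP connection (the result is only reliable within the same subnet; across subnets nc reports success by default, because UDP is connectionless and nc treats the absence of an ICMP error as success):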
nc -vz -u 192.168.232.128 8472
traceroute test (-T sends TCP SYN probes, -U sends UDP probes)
traceroute -T -p 10256 192.168.75.134
traceroute -U -p 8472 192.168.75.134
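The TCP check above can be repeated across every node in both subnets. The following is a minimal sketch rather than a definitive tool: `probe_tcp` is a hypothetical helper name, the `NC` variable is an assumption that lets you substitute another netcat binary, and `-w 3` is an assumed 3-second timeout. Only TCP is probed, since UDP results across subnets are not reliable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probe one TCP port on several hosts with nc.
# NC can be overridden (for example NC=ncat); -w 3 is an assumed timeout.
NC="${NC:-nc}"

probe_tcp() {
    local port="$1"; shift
    local host
    for host in "$@"; do
        # -z: only test whether the port accepts a connection, send no data
        if "$NC" -z -w 3 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port open"
        else
            echo "$host:$port FAILED"
        fi
    done
}
```

For example, `probe_tcp 10256 192.168.232.128 192.168.223.131` prints one line per node, which makes it easy to see whether only the cross-subnet nodes fail.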
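Check the K8s flannel configuration
Use the following command to check which IP flannel uses on each node. By default flannel uses the interface that holds the default route.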
[root@lolicp ~]# kubectl get nodes -o yaml|grep 'flannel.alpha.coreos.com'
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"26:62:ce:31:ff:03"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.232.128
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"32:c9:2e:8e:61:f6"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.232.129
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"be:3e:34:69:a7:d9"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.232.130
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"ba:5e:2d:f7:de:c7"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.223.130
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"8e:b9:4e:23:06:c3"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 192.168.223.131
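With more nodes, the annotation dump becomes hard to scan. As a rough sketch, the hypothetical helper `flannel_ips_by_subnet` below reads that grep output on stdin and prints each flannel public IP next to its /24 prefix. The /24 mask is an assumption taken from the subnet table at the top of this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the annotation lines on stdin and print
# "<subnet>/24 <public-ip>" for every flannel public-ip annotation.
# Assumes /24 subnets, as in the table at the top of the article.
flannel_ips_by_subnet() {
    awk '/flannel.alpha.coreos.com\/public-ip:/ {
        ip = $NF                      # last field is the IP address
        n = split(ip, o, ".")
        if (n == 4) print o[1] "." o[2] "." o[3] ".0/24", ip
    }' | LC_ALL=C sort
}
```

Usage: `kubectl get nodes -o yaml | grep 'flannel.alpha.coreos.com' | flannel_ips_by_subnet`. Nodes whose public IP falls in an unexpected subnet stand out immediately.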
You can also check on an individual node which interface flannel uses. Here it is ens224, with address 192.168.232.128.
[root@lolicp ~]# ip -d link show flannel.1
10: flannel.1: mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 26:62:ce:31:ff:03 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 1 local 192.168.232.128 dev ens224 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
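The three fields that matter in this output are the local address, the underlying device, and the destination port. A small sketch, assuming a hypothetical helper named `vxlan_endpoint` that reads the `ip -d link show` output on stdin:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the local IP, underlying device and UDP
# destination port from the "vxlan ..." line of `ip -d link show`.
vxlan_endpoint() {
    awk '$1 == "vxlan" {
        for (i = 1; i < NF; i++) {
            if ($i == "local")   loc  = $(i + 1)
            if ($i == "dev")     dev  = $(i + 1)
            if ($i == "dstport") port = $(i + 1)
        }
        print "local=" loc, "dev=" dev, "dstport=" port
    }'
}
```

Usage: `ip -d link show flannel.1 | vxlan_endpoint`. The `dstport` value is the UDP port that any network device between the two subnets has to allow.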
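tcpdump packet capture
[S] : SYN (start a connection)
[P] : PSH (push data)
[P.] : PSH and ACK (push data)
[F] : FIN (close a connection)
[R] : RST (reset a connection)
[.] : ACK (no flag other than ACK is set)
[S.] : SYN and ACK
[F.] : FIN and ACK
Capture packets with tcpdump on both nodes (redirecting the output to a file makes it easier to filter afterwards):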
tcpdump -i any -s0 -nn -e not port 22 and not host 127.0.0.1
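This is a normal capture: both the request and the reply packets are visible. A request packet means the request has been sent or received; a reply packet means the reply has been sent or received. If the peer never receives the reply, it was blocked by a network device or could not be routed back.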
09:53:14.217481 In 00:50:56:c0:00:03 ethertype IPv4 (0x0800), length 150: 192.168.232.1.53693 > 192.168.232.130.8472: OTV, flags [I] (0x08), overlay 0, instance 1
5a:af:8c:67:f0:d6 > 02:bb:32:12:e0:c2, ethertype IPv4 (0x0800), length 98: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 46, seq 1, length 64
09:53:14.217552 In 5a:af:8c:67:f0:d6 ethertype IPv4 (0x0800), length 100: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 46, seq 1, length 64
09:53:14.217592 Out 02:bb:32:12:e0:c2 ethertype IPv4 (0x0800), length 100: 10.244.2.0 > 10.244.3.0: ICMP echo reply, id 46, seq 1, length 64
09:53:14.217622 Out 00:0c:29:05:bd:55 ethertype IPv4 (0x0800), length 150: 192.168.232.130.59081 > 192.168.223.131.8472: OTV, flags [I] (0x08), overlay 0, instance 1
02:bb:32:12:e0:c2 > 5a:af:8c:67:f0:d6, ethertype IPv4 (0x0800), length 98: 10.244.2.0 > 10.244.3.0: ICMP echo reply, id 46, seq 1, length 64
Because a route is missing, three ICMP destination unreachable packets appear:
09:58:29.540403 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 178: 192.168.232.130 > 192.168.232.130: ICMP host 192.168.223.131 unreachable, length 142
09:58:29.540417 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 178: 192.168.232.130 > 192.168.232.130: ICMP host 192.168.223.131 unreachable, length 142
09:58:29.540418 In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 178: 192.168.232.130 > 192.168.232.130: ICMP host 192.168.223.131 unreachable, length 142
09:58:29.541101 In 00:50:56:c0:00:03 ethertype IPv4 (0x0800), length 150: 192.168.232.1.60053 > 192.168.232.130.8472: OTV, flags [I] (0x08), overlay 0, instance 1
5a:af:8c:67:f0:d6 > 02:bb:32:12:e0:c2, ethertype IPv4 (0x0800), length 98: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 48, seq 4, length 64
09:58:29.541127 In 5a:af:8c:67:f0:d6 ethertype IPv4 (0x0800), length 100: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 48, seq 4, length 64
09:58:29.541139 Out 02:bb:32:12:e0:c2 ethertype IPv4 (0x0800), length 100: 10.244.2.0 > 10.244.3.0: ICMP echo reply, id 48, seq 4, length 64
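Because of an incorrect route, the packet cannot be sent out: the echo reply is generated, but no encapsulated VXLAN packet leaves the node.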
10:35:42.821265 In 00:50:56:c0:00:03 ethertype IPv4 (0x0800), length 150: 192.168.232.1.60681 > 192.168.232.130.8472: OTV, flags [I] (0x08), overlay 0, instance 1
5a:af:8c:67:f0:d6 > 02:bb:32:12:e0:c2, ethertype IPv4 (0x0800), length 98: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 49, seq 3, length 64
10:35:42.821303 In 5a:af:8c:67:f0:d6 ethertype IPv4 (0x0800), length 100: 10.244.3.0 > 10.244.2.0: ICMP echo request, id 49, seq 3, length 64
10:35:42.821326 Out 02:bb:32:12:e0:c2 ethertype IPv4 (0x0800), length 100: 10.244.2.0 > 10.244.3.0: ICMP echo reply, id 49, seq 3, length 64
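The three captures above differ mainly in whether an outer VXLAN packet leaves the node and whether ICMP unreachable messages appear. When the capture has been saved to a file, those counts can be pulled out with a short filter. This is only a sketch: `summarize_vxlan_capture` is a hypothetical name, and it assumes the text format shown above, where tcpdump labels UDP port 8472 traffic as OTV.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a saved text capture read from stdin.
# Counts outer VXLAN packets (shown by tcpdump as "OTV" on port 8472)
# per direction, plus ICMP host unreachable messages.
summarize_vxlan_capture() {
    awk '
        / In .*\.8472: OTV/        { vin++ }
        / Out .*\.8472: OTV/       { vout++ }
        /ICMP host .* unreachable/ { unreach++ }
        END {
            printf "vxlan_in=%d vxlan_out=%d icmp_unreachable=%d\n", vin, vout, unreach
        }'
}
```

Usage, assuming the capture was redirected to a file called `capture.txt`: `summarize_vxlan_capture < capture.txt`. A healthy exchange has matching in and out counts. An out count of zero, or a non-zero unreachable count, points at routing on the node.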
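The following commands map a MAC address from the capture to the corresponding interface:
# Look up the MAC address in the ARP table (neighbor interfaces)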
arp -a -n |grep '00:50:56:c0:00:03'
# Look up the MAC address among the local interfaces
ip a|grep -C 1 '00:50:56:c0:00:03'
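For local interfaces the same lookup can be done without grepping `ip a` output, by reading the `address` file that Linux exposes for each interface under `/sys/class/net`. A minimal sketch, where `iface_for_mac` is a hypothetical helper and the optional second argument (an alternative directory) exists only to make it easy to try out:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every local interface whose
# MAC address equals $1. $2 optionally overrides the sysfs directory.
iface_for_mac() {
    local mac="$1" root="${2:-/sys/class/net}" f
    # sysfs stores MAC addresses in lowercase
    mac="$(printf '%s' "$mac" | tr 'A-F' 'a-f')"
    for f in "$root"/*/address; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "$mac" ]; then
            basename "$(dirname "$f")"
        fi
    done
}
```

Usage: `iface_for_mac 00:0c:29:05:bd:55`. Empty output means the MAC address belongs to another host, in which case the ARP lookup is the one to use.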
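The tests show that the K8s flannel network plugin itself is working correctly, so the problem lies outside the K8s components. Based on the actual situation it can be attributed to a route or network-device fault: here the network device had no policy allowing port 8472, so flannel could not communicate between the cluster nodes.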