My configuration is based on the Cisco Live session BRKDCN-2982, specifically the “Deploy Cilium Over EVPN / VXLAN Fabric” use case.

On-Demand Session Library – Cisco Live On-Demand – Cisco
Configure Kubernetes Pod
Next I created a namespace and pod deployment so I actually have something to advertise into BGP.
#Create namespace
kubectl create namespace plumbing
#Deploy our watersource deployment running an nginx container
kubectl create deployment watersource --image=docker.io/nginx:1.15.8
#Replicate our watersource pod so it's horizontally scaled across our workernodes. Note: each workernode is allocated a /24 subnet from our wider 172.16.0.0/16 PodCIDR, so each replica lands in its node's /24
kubectl scale deployment watersource --replicas=2
Now we confirm our Kubernetes deployment.
kubectl get deployment -o wide

kubectl get pod -o wide

We can see in the above output that our watersource pods are spread across the workernodes, each taking an IP address from its node's /24 within the wider 172.16.0.0/16 PodCIDR.
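If you want to see the per-node /24 allocations directly, the following commands can help; which field is populated depends on the Cilium IPAM mode, so treat this as a hedged example.
#PodCIDRs Cilium has allocated to each node (CiliumNode CRD)
kubectl get ciliumnodes -o custom-columns=NODE:.metadata.name,PODCIDRS:.spec.ipam.podCIDRs
#PodCIDR allocated by Kubernetes itself (may be empty depending on cluster setup)
kubectl get nodes -o custom-columns=NODE:.metadata.name,PODCIDR:.spec.podCIDR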
Cilium eBGP Configuration
Next we can finally configure eBGP in Cilium. I have created a networkconfiguration directory and, within it, a routerbgp.yaml file.

Within the routerbgp.yaml file we have the following configuration.
Key sections here:
- CiliumBGPClusterConfig
- CiliumBGPPeerConfig
- CiliumBGPAdvertisement
Within the CiliumBGPClusterConfig we can see the following:
- nodeSelector: This is important; if you set nodeSelector labels you must label your workernodes to include them in BGP. This lets us exclude our Controlplane (Master Node) from eBGP peering. You can remove the nodeSelector section entirely and all nodes will then peer. The following command sets the node labels; leaf: leaf is what the yaml file references (a quick verification follows this list).
kubectl label nodes workernode1 workernode2 leaf=leaf
- LocalASN (Kubernetes ASN)
- PeerASN (Nexus Spine and Leaf ASN)
- PeerAddress (Loopback999 IP on Nexus Spine and Leaf)
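A quick check that the labels are in place and only the labelled workernodes will attempt to peer:
kubectl get nodes -l leaf=leaf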
Within the CiliumBGPPeerConfig we should focus on the following:
- eBGP multihop: this is important as we're peering with loopbacks on the Nexus leafs, which are more than one hop away from the workernodes.
Finally the CiliumBGPAdvertisement:
- PodCIDR is configured for advertisement; this ensures our Pod subnets within 172.16.0.0/16 are advertised (only the active per-node Pod /24s are advertised, in our case 172.16.1.0/24 and 172.16.2.0/24).
- It's worth noting this is also where we can add advertisement of our Service IPs (a sketch follows the yaml below).
apiVersion: "cilium.io/v2"
kind: CiliumBGPClusterConfig
metadata:
  name: leaf
spec:
  nodeSelector:
    matchLabels:
      leaf: leaf
  bgpInstances:
  - name: "instance-65010"
    localASN: 65010
    peers:
    - name: "peer-65010-leaf1"
      peerASN: 65001
      peerAddress: "10.255.255.1"
      peerConfigRef:
        name: "peer-config-generic"
    - name: "peer-65010-leaf2"
      peerASN: 65001
      peerAddress: "10.255.255.2"
      peerConfigRef:
        name: "peer-config-generic"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPPeerConfig
metadata:
  name: peer-config-generic
spec:
  ebgpMultihop: 5
  families:
  - afi: ipv4
    safi: unicast
    advertisements:
      matchLabels:
        advertise: "pod-cidr"
---
apiVersion: "cilium.io/v2"
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr
  labels:
    advertise: pod-cidr
spec:
  advertisements:
  - advertisementType: "PodCIDR"
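As mentioned above, Service IP advertisement can be added in the same place. I haven't applied this in my lab; the below is a minimal sketch assuming a LoadBalancer Service carrying a hypothetical advertise-me: "true" label, and it reuses the advertise: pod-cidr label so the existing CiliumBGPPeerConfig selector would pick it up.
apiVersion: "cilium.io/v2"
kind: CiliumBGPAdvertisement
metadata:
  name: service-vips             # hypothetical name, not part of my lab config
  labels:
    advertise: pod-cidr          # must match the advertisements selector in CiliumBGPPeerConfig
spec:
  advertisements:
  - advertisementType: "Service"
    service:
      addresses:
      - LoadBalancerIP
    selector:
      matchLabels:
        advertise-me: "true"     # hypothetical Service label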
Once the yaml file is saved we can apply the configuration.
kubectl apply -f networkconfiguration/routerbgp.yaml

We can confirm our BGP peers have been configured with the following command. At this point we haven't configured the Nexus side of the eBGP peering, so our neighbors will stay in Idle.
It's worth noting that if no neighbors appear at all when you run this command, you may have labelled your nodes incorrectly; you should at least be seeing Idle entries.
cilium bgp peers
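Once the sessions establish (after the Nexus configuration below), the Cilium CLI should also be able to show what is being advertised; a hedged example:
cilium bgp routes advertised ipv4 unicast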
Finally, for our lab we must ensure our INTER-VM addressing is configured and our static routes are in place so we can reach the Nexus loopbacks via their anycast gateway.
In my lab this is a temporary solution; a permanent fix would be configuring netplan (I just haven't got round to it yet).
I have also added a static route towards the Spine and Leaf specifically for our inter-pod (172.16.0.0/16) communication, due to our use of native routing.
This static route for the PodCIDR /16, combined with our eBGP advertisements of the per-node Pod /24s, allows pods on separate workernodes to reach each other via the fabric.
#workernode1
sudo ip addr add 192.168.100.252/24 dev ens37
sudo ip link set dev ens37 up
sudo ip route add 10.255.255.1/32 via 192.168.100.10
sudo ip route add 10.255.255.2/32 via 192.168.100.10
#INTER-POD Native Routing
sudo ip route add 172.16.0.0/16 via 192.168.100.10
#workernode2
sudo ip addr add 192.168.100.253/24 dev ens37
sudo ip link set dev ens37 up
sudo ip route add 10.255.255.1/32 via 192.168.100.10
sudo ip route add 10.255.255.2/32 via 192.168.100.10
#INTER-POD Native Routing
sudo ip route add 172.16.0.0/16 via 192.168.100.10
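A quick way to confirm the addressing and static routes have taken effect on each node (the loopbacks themselves won't respond until the Nexus side below is configured):
ip -br addr show dev ens37
ip route get 10.255.255.1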
Nexus BGP Configuration : L3VNI + Local Loopbacks + iBGP
Finally we can bring up our Nexus configuration and ensure our PodCIDR addressing is advertised into our Nexus Spine and Leaf.
It's worth noting the fundamental VXLAN / EVPN configuration is as per my Single-Pod VXLAN / EVPN series of posts and won't be discussed here.
I also won't be discussing the VPC configuration.
As a reminder, here is our routing design.

First we configure our External VRF (L3VNI). It's named External because the addressing we receive from our Kubernetes cluster over eBGP will be advertised out of the Spine and Leaf to allow for egress connectivity.
L3VNI
vrf context External
  vni 212121
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
!
interface nve1
  member vni 212121 associate-vrf
!
vlan 2000
  name K8s-External
  vn-segment 212121
!
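To sanity check the L3VNI association, commands along these lines can be used:
show nve vni
show vrf External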
Now we can configure our localised loopbacks.
Loopbacks
#LEAF-1
interface loopback999
  vrf member External
  ip address 10.255.255.1/32
#LEAF-2
interface loopback999
  vrf member External
  ip address 10.255.255.2/32
Next we configure iBGP within our External VRF between the leafs and ensure we're advertising Loopback999. Typically we don't connect leaf switches directly together (the VPC peer-link being the exception, unless using VPC fabric peering), however to ensure a straightforward iBGP session I've added a link purely for this.
I've added a permit-all route-map for redistributing directly connected routes, and a prefix-list (LF999) matching the Loopback999 address 10.255.255.1/32; a sketch of both follows.
Next-hop-self is important as it allows Cilium to eBGP peer with the secondary VPC switch via the primary over our peer-link, and vice versa, for failover purposes and multi-pathing.
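The route-map and prefix-list themselves aren't shown below; a minimal sketch of what they could look like, with names matching the BGP configuration (use 10.255.255.2/32 in the prefix-list on Leaf-2):
#Permit-all route-map used for redistributing directly connected routes
route-map redist-direct permit 10
!
#Prefix-list matching the local Loopback999 address
ip prefix-list LF999 seq 5 permit 10.255.255.1/32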
iBGP
router bgp 65001
  vrf External
    address-family ipv4 unicast
      redistribute direct route-map redist-direct
    neighbor 192.168.200.2
      remote-as 65001
      address-family ipv4 unicast
        prefix-list LF999 out
        next-hop-self
Now we confirm our iBGP peering is established between Leaf-1 and Leaf-2.

Next we check that our Loopback999 addressing is in the routing table as an iBGP route. This addressing shouldn't be advertised via the L2VPN EVPN address-family.

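For reference, the kind of NX-OS commands behind these checks is along these lines (adjust the loopback address per leaf):
show bgp ipv4 unicast summary vrf External
show ip route 10.255.255.2 vrf External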
Nexus BGP Configuration : L2VNI + Anycast Gateway + VPC Interface
Now we can move on to configuring our L2VNI and anycast gateway, which will be used as the next-hop for Cilium to reach our localised loopbacks.
L2VNI
vlan 300
  name K8s
  vn-segment 160300
!
interface nve1
  member vni 160300
    mcast-group 229.0.100.1
!
evpn
  vni 160300 l2
    rd auto
    route-target import auto
    route-target export auto
Anycast Gateways
#LEAF-1
interface Vlan300
  no shutdown
  vrf member External
  no ip redirects
  ip address 192.168.100.10/24
  fabric forwarding mode anycast-gateway
#LEAF-2
interface Vlan300
  no shutdown
  vrf member External
  no ip redirects
  ip address 192.168.100.10/24
  fabric forwarding mode anycast-gateway
VPC Interface
interface port-channel100
  switchport access vlan 300
  vpc 100
!
interface Ethernet1/12
  switchport access vlan 300
  channel-group 100 mode active
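The VPC and port-channel state can be verified with the usual commands:
show vpc brief
show port-channel summary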
Nexus BGP Configuration : eBGP Configuration
Finally we can configure our eBGP peers from Nexus to Cilium!
- Workernode1: 192.168.100.252, the node IP within the INTER-VM network
- Workernode2: 192.168.100.253, the node IP within the INTER-VM network
- eBGP multihop: needed as we're peering eBGP with a loopback that is more than one hop away
- Update-source Loopback999: sources the eBGP session from the loopback
- As-override: we need external non-Cilium eBGP peers receiving this addressing to see the Spine and Leaf AS only; it replaces the Cilium ASN 65010 with the Nexus ASN 65001 in the AS path
- Disable-peer-as-check: relaxes eBGP loop prevention; I assume this is needed due to as-override, as the Nexus sees its own AS 65001 in advertisements but is still allowed to readvertise them
router bgp 65001
  vrf External
    address-family ipv4 unicast
      maximum-paths 32
    neighbor 192.168.100.252
      remote-as 65010
      update-source loopback999
      ebgp-multihop 5
      address-family ipv4 unicast
        as-override
        disable-peer-as-check
    neighbor 192.168.100.253
      remote-as 65010
      update-source loopback999
      ebgp-multihop 5
      address-family ipv4 unicast
        as-override
        disable-peer-as-check
So now we can confirm our eBGP neighborships, firstly on the Nexus leafs.


Next we confirm our Cilium BGP Peers are as expected.

Finally we can confirm we’re seeing our PodCIDR in our External VRF.


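For reference, on the leafs the equivalent CLI checks would be along these lines:
show bgp ipv4 unicast vrf External
show ip route 172.16.1.0/24 vrf External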
Some interesting points about the routing before we test connectivity!
- 172.16.1.0/24 is Workernode1's Pod subnet; its watersource Pod is 172.16.1.243
- The next-hop is the Workernode1 IP, 192.168.100.252
- 172.16.2.0/24 is Workernode2's Pod subnet; its watersource Pod is 172.16.2.39
- The next-hop is the Workernode2 IP, 192.168.100.253

172.16.1.0/24, ubest/mbest: 1/0
    *via 192.168.100.252, [20/0], 05:14:51, bgp-65001, external, tag 65010
172.16.2.0/24, ubest/mbest: 1/0
    *via 192.168.100.253, [20/0], 05:14:51, bgp-65001, external, tag 65010
Ping checks from Leaf-1 to our Pods at 172.16.2.39 and 172.16.1.243 confirm connectivity from our VXLAN / EVPN fabric to our Kubernetes cluster via our Cilium eBGP routes.


So that’s it, our eBGP connectivity and PodCIDR route advertisement is completed.
Native Routing Test
Additionally, as we're using native routing, we can test connectivity between Pods.
As seen on previous pages I have already deployed and replicated my nginx containers across nodes.

Now I can deploy my nettools container and run a ping test between pods on different Workernodes utilising separate /24 ranges within the PodCIDR 172.16.0.0/16.
apiVersion: v1
kind: Pod
metadata:
  name: nettools
  namespace: default
spec:
  containers:
  - name: nettools
    image: jrecord/nettools:latest
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Never
Once the above yaml file is saved I can deploy the pod using the kubectl apply command.
kubectl apply -f podcommands/nettools.yaml
We can now see our nettools pod is ready to use.

From here we can access the nettools container using the following command.
kubectl exec --stdin --tty nettools -- /bin/bash
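Inside the container the tests are just standard ping and tracepath; for example, using my pod IPs:
#From the nettools pod (172.16.2.63)
ping -c 3 172.16.2.29
ping -c 3 172.16.1.90
tracepath -n 172.16.1.90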
Finally we can see that our nettools pod on 172.16.2.63 can ping both our watersource pods, on 172.16.2.29 and 172.16.1.90.

We can also see the difference in tracepaths between the watersource pod co-located with nettools on workernode2 and the pod deployed separately on workernode1.
The packets sourced from 172.16.2.63 (workernode2) and destined for 172.16.1.90 (workernode1) traverse our Cisco Spine and Leaf.

Future posts:
- Set type-5 gateway + multipath testing
- Advertise ServiceIP and PodCIDR (Cluster Mesh)
