23 May 2020

DHCP Relay issue Meraki MX64



** Update: this issue has been fixed in MX 15.34 **

On 10 Feb 2020 I came across an interesting problem to solve, which has taught me in great detail how DHCP relay works.

I am writing this to explain the bug in the MX64 DHCP relay implementation and to help others understand it.

The client had an IoT device that would work fine for one day and then lose network connectivity completely the next (it would not respond to ping). The only way to fix the issue was to restart the IoT device. All other clients on the same network were fine.

After some head scratching and some observations over a couple of days, I realized the device was dropping off exactly 24 hours after it was rebooted. I checked the DHCP lease time and it was set to 24 hours. I changed it to 48 hours and, sure enough, the device stopped working 48 hours later, so I was on to something. All other clients were fine, so what was special about this one?
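For reference, here is a rough sketch of the timers involved. It assumes the client uses the RFC 2131 defaults (T1 at 50% of the lease, T2 at 87.5%); a server can override both through options, and I did not check what this device actually used. The function name is mine, purely for illustration.

```python
from datetime import timedelta


def lease_timers(lease: timedelta) -> dict:
    """Default DHCP client timers (RFC 2131 section 4.4.5), assumed here."""
    return {
        "T1": lease * 0.5,      # client starts unicast renewal attempts
        "T2": lease * 0.875,    # client starts broadcast rebinding attempts
        "expiry": lease,        # client must stop using the address
    }


for hours in (24, 48):
    timers = lease_timers(timedelta(hours=hours))
    print(f"lease {hours}h:", {name: str(t) for name, t in timers.items()})
```

With a 24 hour lease the renewal attempts start at 12 hours, but the address is only given up when the lease expires, which would match a device dying exactly one lease time after a reboot.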

Out came the Wireshark toolkit. Using Meraki gear is great here, as you can take a pcap trace at any point in the network and load it into Wireshark.

This network did not have a local DHCP server. It was using a DHCP relay (a Cisco DHCP helper address) to reach a Meraki MX64 on another subnet that was running the DHCP server service.

Now, this device did have a DHCP reservation. Maybe this was the issue?

Below is a diagram of the network  (with identifying bits removed)

  

The IoT device broadcasts a request for a DHCP address, and the layer 3 switch then forwards this packet as a unicast to x.y.4.1, the DHCP server.
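A minimal sketch of what the relay step does to the packet, as I understand it from RFC 1542: the relay writes its own interface address into 'giaddr' (if nobody has yet), bumps the hop count, and unicasts the message to the server. The `DhcpMsg` class and the 10.0.x.x addresses are made up for illustration; they stand in for the redacted x.y addresses.

```python
from dataclasses import dataclass, replace

UNSET = "0.0.0.0"


@dataclass(frozen=True)
class DhcpMsg:
    """Only the fields that matter for relaying (hypothetical model)."""
    src: str
    dst: str
    giaddr: str = UNSET
    hops: int = 0


def relay(msg: DhcpMsg, relay_ip: str, server_ip: str) -> DhcpMsg:
    """Turn a client broadcast into a unicast aimed at the DHCP server."""
    # Only the first relay on the path fills in giaddr.
    giaddr = msg.giaddr if msg.giaddr != UNSET else relay_ip
    return replace(msg, src=relay_ip, dst=server_ip,
                   giaddr=giaddr, hops=msg.hops + 1)


client_broadcast = DhcpMsg(src="0.0.0.0", dst="255.255.255.255")
print(relay(client_broadcast, relay_ip="10.0.1.1", server_ip="10.0.4.1"))
```

The important part for the rest of this story is 'giaddr': it is the only thing that tells the server the client lives on another subnet.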

Looking at the .pcap, the initial DHCP request and reply worked, but at renewal time the device would ask to renew and nothing would come back. Eventually the client would give up and unbind its existing IP address.

I needed to get a trace at the DHCP server. When I did, I saw the server was replying at renewal, but with a NAK addressed to 255.255.255.255. The device would never see the NAK, as that address is a broadcast on the x.y.4.0/24 subnet and the device was on the x.y.1.0/24 subnet. To me it seemed common sense that the reply needed to be a unicast, not a broadcast, to get through the routers.
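To make the delivery problem concrete, here is a small sketch. The 10.0.x.x addresses are stand-ins I made up for the redacted x.y subnets, and the helper name is hypothetical. It only models one rule: a packet sent to 255.255.255.255 stays on the sender's own link, because routers do not forward it.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def reaches_client(dst, client, sender_subnet) -> bool:
    """Would a packet sent to dst from sender_subnet arrive at client?"""
    if dst == LIMITED_BROADCAST:
        # Only hosts on the sender's own subnet hear a limited broadcast.
        return client in sender_subnet
    # Unicast: assume normal routing between the subnets works.
    return dst == client


server_subnet = ipaddress.ip_network("10.0.4.0/24")  # stand-in for x.y.4.0/24
iot_device = ipaddress.ip_address("10.0.1.71")       # stand-in for the client

print(reaches_client(LIMITED_BROADCAST, iot_device, server_subnet))  # False
print(reaches_client(iot_device, iot_device, server_subnet))         # True
```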

I opened a case with Meraki (04855962) to ask them why. This was very frustrating: after an hour working with the engineer and getting all the captures, they went off shift and I had to start again with another engineer (the difference between working with Meraki and working with Cisco TAC).

After Meraki reviewed the case, I was told the issue was NOT caused by the Meraki MX64; it was operating correctly.

I was told
-----------------------
I've reviewed your issue and the issue is NOT caused by the MX as per Section 3.2 in 'IETF RFC2131 | Dynamic Host Configuration Protocol' below:

RFC  link:
https://tools.ietf.org/html/rfc2131#page-17
3.2 Client-server interaction - reusing a previously allocated network
    address

      If 'giaddr' is 0x0 in the DHCPREQUEST message, the client is on
      the same subnet as the server.  The server MUST
      broadcast the DHCPNAK message to the 0xffffffff broadcast address
      because the client may not have a correct network address or subnet
      mask, and the client may not be answering ARP requests.
      Otherwise, the server MUST send the DHCPNAK message to the IP
      address of the BOOTP relay agent, as recorded in 'giaddr'.  The
      relay agent will, in turn, forward the message directly to the
      client's hardware address, so that the DHCPNAK can be delivered even
      if the client has moved to a new network.


      Please refer to page 9 of the RFC for more information about the DHCP fields description.
      e.g. giaddr: Relay agent IP address, used in booting via a relay agent.

The RFC2131 said that if the 'giaddr' (Relay agent IP address) in the client's DHCPREQUEST is 0.0.0.0, the MX will consider this DHCPREQUEST is within same broadcast domain of your MX's subnet: x.y.4.0/24. Because the client's requested 'dhcp.ip.client == x.y.1.71' (as per frame.number == 120 in your provided packet capture) is NOT within the same broadcast domain of the MX, the MX's DHCP server will send a DHCPNAK to a broadcast address.

If you are filtering the pcap with the filter: 'dhcp.ip.relay == x.y.1.0/24', you will see there are 2 DHCP relay agent IP addresses: x.y.1.1 and x.y.1.4 which were in the client's DHCP requests and they were the working DHCP renewal requests as per below screenshot. Please note that the source IP of the DHCPREQUEST will be always a DHCP relay agent IP if it is included in the DHCPREQUEST (as per RFC2131).



---------------------------------------------

Well, this looked very technical, quoting the RFC and all, but it did not make any sense. I could NOT believe that the people who wrote the RFC, people smarter than me, had made such a glaring mistake. It was time to download the RFC and look for myself. Seven hours and some experiments later, I had made some progress!

You need to read further in the RFC that Meraki quoted.



My issue was with the renewal process, NOT with the initial DHCP Discover.



From page 31 of the same document Meraki quoted:


DHCPREQUEST generated during RENEWING state:

      'server identifier' MUST NOT be filled in, 'requested IP address'
      option MUST NOT be filled in, 'ciaddr' MUST be filled in with
      client's IP address. In this situation, the client is completely
      configured, and is trying to extend its lease. This message will
      be unicast, so no relay agents will be involved in its
      transmission.  Because 'giaddr' is therefore not filled in, the
      DHCP server will trust the value in 'ciaddr', and use it when
      replying to the client.

      A client MAY choose to renew or extend its lease prior to T1.  The
      server may choose not to extend the lease (as a policy decision by
      the network administrator), but should return a DHCPACK message
      regardless.


The renewal conversation is between the end device and the DHCP server (the relay agent is NOT involved). Why would the client need to put the relay agent IP address in?

The issue in this network is that the broadcast NAK will not even be seen by the DHCP relay agent, as there is a router in between.
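Section 4.3.2 of the RFC tells a server how to work out which kind of DHCPREQUEST it is looking at from the fields that are filled in. A minimal sketch of that decision, as I read it (the function and argument names are my own, not from any DHCP library):

```python
from typing import Optional

UNSET = "0.0.0.0"


def classify_request(server_id: Optional[str],
                     requested_ip: Optional[str],
                     ciaddr: str,
                     sent_broadcast: bool) -> str:
    """Guess the client state behind a DHCPREQUEST from its fields."""
    if server_id is not None:
        return "SELECTING"      # answering a DHCPOFFER
    if requested_ip is not None:
        return "INIT-REBOOT"    # re-verifying a remembered address
    if ciaddr != UNSET:
        # Fully configured client extending its lease.
        return "REBINDING" if sent_broadcast else "RENEWING"
    return "UNKNOWN"


# A renewal like the failing one: no options set, ciaddr filled in, unicast.
print(classify_request(None, None, "10.0.1.71", sent_broadcast=False))
```

The 10.0.1.71 address is a stand-in for the redacted client address. The point is that a RENEWING request is recognisable purely from 'ciaddr' being set, with no help from 'giaddr'.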

I thought this might be a bug in just the IoT device, so I decided to run the same experiment with a Windows 10 PC on the latest build.


NO reservation in Meraki DHCP


On the PC I did an ipconfig /release, then an ipconfig /renew (for the initial request), and then another ipconfig /renew (for the renewal request).




Dell.pcap showed



You cannot tell me that Windows 10 has the same bug! Users were not complaining about this issue, yet they were still affected by it.

I did not think the part of the RFC Meraki was referring to applied to the RENEWAL. The RENEWAL should be a unicast.

Also, why is the DHCP server sending a NAK and not an ACK? (Some logic is wrong here.)

I did not believe I had found a bug in the RFC.
 
In Meraki's example from the Meraki.pcap file, the relay agents are x.y.1.1 and x.y.1.4 (by the way, x.y.1.4 is a Cisco WLAN controller, which also acts like a DHCP helper for the wireless clients).

The Meraki tech's working examples are NOT renewals; they are initial DHCP requests.

I did some testing with the same laptop over wireless, using the x.y.1.4 DHCP helper.

Now it works perfectly, as the WLAN controller PROXIES ALL DHCP and will not let the client talk directly to the DHCP server.

The renewal goes via the proxy, and the giaddr is the WLAN controller.


(DellWireless.pcap) 


BUT THIS IS NOT THE ISSUE I WAS TALKING ABOUT

My issue is when the DHCP client talks directly to the DHCP server to renew as in Dell.pcap



The server should ACK the request and unicast the ACK back to the client.
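Here is a sketch of where a server is supposed to send its reply, based on my reading of section 4.1 of RFC 2131. The function name and the 10.0.1.71 address (a stand-in for the redacted client) are mine; treat it as an illustration of the rules, not as how the MX is implemented.

```python
BROADCAST = "255.255.255.255"
UNSET = "0.0.0.0"


def reply_destination(msg_type: str, giaddr: str, ciaddr: str,
                      yiaddr: str, broadcast_flag: bool) -> str:
    """Pick the destination IP for a server reply."""
    if giaddr != UNSET:
        return giaddr        # relayed request: reply goes to the relay agent
    if msg_type == "DHCPNAK":
        return BROADCAST     # no giaddr: a NAK is always broadcast
    if ciaddr != UNSET:
        return ciaddr        # configured client (renewal): unicast to it
    if broadcast_flag:
        return BROADCAST     # unconfigured client that asked for broadcast
    return yiaddr            # unicast to the address being offered


renewal = dict(giaddr=UNSET, ciaddr="10.0.1.71",
               yiaddr="10.0.1.71", broadcast_flag=False)
print(reply_destination("DHCPACK", **renewal))  # 10.0.1.71
print(reply_destination("DHCPNAK", **renewal))  # 255.255.255.255
```

Seen this way, the broadcast is a side effect: for a direct renewal an ACK travels back as a routable unicast, while a NAK can only ever reach clients on the server's own subnet. So the real question is why the server chose to NAK at all.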



Also, after looking at the .pcap, the client is SPECIFICALLY requesting a unicast reply.


Clients requesting renewal of an existing lease may communicate directly via UDP unicast, since the client already has an established IP address at that point. Additionally, there is a BROADCAST flag (1 bit in 2 byte flags field, where all other bits are reserved and so are set to 0) the client can use to indicate in which way (broadcast or unicast) it can receive the DHCPOFFER: 0x8000 for broadcast, 0x0000 for unicast. Usually, the DHCPOFFER is sent through unicast.
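If you want to check the flag by hand rather than trust a decoder, here is a small sketch that reads it straight out of the fixed BOOTP header. The helper names and the sample header are made up for illustration; the sample is not taken from my captures.

```python
import struct

BROADCAST_BIT = 0x8000  # top bit of the 16-bit flags field


def wants_broadcast(bootp_header: bytes) -> bool:
    """Return True if the client set the BROADCAST flag."""
    # op(1) htype(1) hlen(1) hops(1) xid(4) secs(2) come first,
    # so the flags field starts at byte offset 10.
    (flags,) = struct.unpack_from("!H", bootp_header, 10)
    return bool(flags & BROADCAST_BIT)


def sample_header(flags: int) -> bytes:
    """Build the first 12 bytes of a BOOTREQUEST with the given flags."""
    return struct.pack("!BBBBIHH", 1, 1, 6, 0, 0x12345678, 0, flags)


print(wants_broadcast(sample_header(0x0000)))  # False: unicast reply is fine
print(wants_broadcast(sample_header(0x8000)))  # True: client wants broadcast
```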

In the Dell PC example




From the IoT device




Now when I read the RFC






it has the flag the other way around ??




This may be where the bug came in?



When I look at page 24 of the document Meraki referred to:


A client that can
   receive unicast IP datagrams before its protocol software has been
   configured SHOULD clear the BROADCAST bit to 0.


So it looks like Wireshark is decoding the flag correctly.


At this point I was convinced....  but Meraki was not !!!

So it was back to the lab, where I set up an identical network, this time using a Windows Server 2008 R2 machine as the DHCP server and NOT the Meraki. I wanted to see whether another vendor's DHCP server had the same issue.

 
I removed the MX64 and replaced it with a Windows Server 2008 R2 machine running a DHCP scope for the x.x.1.0/24 subnet and a reservation for x.x.1.33.


So I was just swapping the MX64 for a Microsoft Windows Server 2008 R2 DHCP server; the rest of the network was identical.


Using my Dell laptop:


 The initial DHCP request is broadcast and the client gets its IP via the DHCP proxy (No. 1 to 4 in the capture below).


 The DHCP renewal is a unicast direct to the DHCP server, and the ACK is a unicast to the client (No. 4, 5 below).


 The DHCP release is a unicast direct to the DHCP server (No. 7 below).




This is 100% correct and works PERFECTLY



Compare this to how the MX64 works (explained in detail above)



  The initial DHCP request is broadcast and the client gets its IP via the DHCP proxy.



  The DHCP renewal is a unicast direct to the DHCP server, and the NAK is a broadcast (saying the address is not available).





The device never sees the broadcast, as it is not on the same subnet.



And it keeps trying… ALSO, why does it get a NAK when it has a reservation?


Meraki now believed me, and the "Meraki Support Firewall" would now send this bug to development. (What a battle; you would think they would want to know about bugs in their software.)

I was puzzled why other devices and the WLAN controller were not affected. After some more traces, it turned out all the clients were affected. The difference was that, after the renewal failed, all the other devices fell back to a normal broadcast for a new IP address. This IoT device did not, and just went offline.
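The fallback the other clients performed can be sketched as a tiny state machine, loosely following the client state diagram in RFC 2131 (Figure 5). This is a simplification with event names of my own choosing, and I did not check exactly which step rescued each of the other clients.

```python
# (current state, event) -> next state, simplified from RFC 2131 Figure 5
TRANSITIONS = {
    ("BOUND", "T1 expires"): "RENEWING",         # unicast DHCPREQUEST to server
    ("RENEWING", "DHCPACK"): "BOUND",
    ("RENEWING", "DHCPNAK"): "INIT",
    ("RENEWING", "T2 expires"): "REBINDING",     # broadcast DHCPREQUEST
    ("REBINDING", "DHCPACK"): "BOUND",
    ("REBINDING", "DHCPNAK"): "INIT",
    ("REBINDING", "lease expires"): "INIT",      # broadcast DHCPDISCOVER again
}


def run(events, state="BOUND"):
    """Return the list of states a client passes through for the events."""
    path = [state]
    for event in events:
        # Events that do not apply in the current state are ignored.
        state = TRANSITIONS.get((state, event), state)
        path.append(state)
    return path


# A client that never hears any reply to its renewal attempts:
print(" -> ".join(run(["T1 expires", "T2 expires", "lease expires"])))
# BOUND -> RENEWING -> REBINDING -> INIT
```

A client that follows this all the way back to INIT starts again with a broadcast, which the relay can pick up. A client that stops after the failed renewal, like this IoT device appeared to, just loses its address.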


Mystery solved.

It is now the 23rd of May 2020 and I still don't have any info from Meraki on when this bug will be fixed.


I did expect a bug of this nature to be fixed quite quickly. I suppose this is the difference between Cisco Meraki and Cisco Enterprise products: bugs I have found in Cisco Enterprise products have been hot-fixed within a week.

It is always tricky as a consultant: how can I charge this customer for finding bugs in Meraki devices? I could have stopped when I identified the problem, which is what the customer actually asked for, but I wanted to get the issue actually fixed. I did not expect the resistance I got from Meraki support when trying to do that. Meraki support engineers get paid to work in the lab and isolate issues; I do not. I could not reasonably bill the customer for the 20 hours I have spent working on this simple issue. I think Meraki owe me some "Meraki Swag" for finding this one!
 
** Update: this issue has been fixed in MX 15.34 **