** Update: this issue has been fixed in MX 15.34 **
I am writing this to help others understand and to explain the bug in the MX64 DHCP relay implementation.
The client had an IoT device that would work fine for one day, and the next day it would lose network connectivity completely (it would not even respond to ping). The only way to fix the issue was to restart the IoT device; all other clients on the same network were fine.
After some head scratching and some observations over a couple of days, I realized the client was stopping exactly 24 hours after it was rebooted. I checked the DHCP lease time and it was set to 24 hours. I changed it to 48 hours and, sure enough, the client stopped working 48 hours later, so I was on to something. All other clients were fine, so what was special about this client?
Out came the Wireshark toolkit. Using Meraki gear is great, as you can take a packet capture at any point in the network and load it into Wireshark.
This network did not have a local DHCP server; it was using a DHCP relay (Cisco DHCP helper address) pointing to a Meraki MX64 on another subnet running the DHCP server service.
Now, this device did have a DHCP reservation. Maybe that was the issue?
Below is a diagram of the network (with identifying bits removed)
The IoT device broadcasts for a DHCP address, and the layer 3 switch then forwards this packet as a unicast to x.y.4.1, the DHCP server.
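For anyone who wants to see what that relayed packet looks like, here is a rough scapy sketch for illustration only (the addresses and MAC are placeholders, not taken from the customer network):

from scapy.all import IP, UDP, BOOTP, DHCP

# Rough sketch of the DHCPDISCOVER after the layer 3 switch has relayed it:
# it is now a unicast to the DHCP server, with giaddr set to the relay
# interface on the client subnet. 10.0.1.1 and 10.0.4.1 stand in for
# x.y.1.1 and x.y.4.1.
client_mac = "aa:bb:cc:dd:ee:ff"        # placeholder for the IoT device MAC
relayed_discover = (
    IP(src="10.0.1.1", dst="10.0.4.1")
    / UDP(sport=67, dport=67)           # relay-to-server traffic uses port 67 both ways
    / BOOTP(op=1,                       # BOOTREQUEST
            giaddr="10.0.1.1",          # relay agent IP - what the server keys off
            chaddr=bytes.fromhex(client_mac.replace(":", "")))
    / DHCP(options=[("message-type", "discover"), "end"])
)
relayed_discover.show()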
Looking at the .pcap, the initial DHCP request and reply worked, but when it came to renewal time the device would ask to renew, nothing would come back, and eventually the client would give up and unbind the existing IP address.
I needed to get a trace at the DHCP server. When I did, the server was replying at renewal time, but with a NAK sent to 255.255.255.255. The device would never see this NAK, as it was a broadcast on the x.y.4.0/24 subnet and the device was on the x.y.1.0/24 subnet. To me it seemed common sense that this needed to be a unicast, not a broadcast, to get through the routers.
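If you want to repeat the check on your own capture, a quick scapy sketch like this will pull out the broadcast NAKs from a server-side pcap (the filename is a placeholder; DHCP message type 6 is a NAK):

from scapy.all import rdpcap, IP, DHCP

# Rough sketch: find DHCP NAKs sent to the broadcast address in a server-side
# capture. "server.pcap" is a placeholder filename.
for pkt in rdpcap("server.pcap"):
    if DHCP in pkt and IP in pkt:
        opts = dict(o for o in pkt[DHCP].options if isinstance(o, tuple) and len(o) == 2)
        if opts.get("message-type") == 6 and pkt[IP].dst == "255.255.255.255":
            # This NAK will never cross the router back to the client's subnet.
            print("broadcast NAK:", pkt[IP].src, "->", pkt[IP].dst)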
I opened a case with Meraki (04855962) to ask them why. This was very frustrating, as after an hour working with the engineer and getting all the captures, they went off shift and I had to start again with another engineer (the difference between working with Meraki compared to Cisco TAC).
After Meraki reviewed it, I was told the issue was NOT caused by the Meraki MX64 and that it was operating correctly.
I was told:
-----------------------
I've reviewed your issue and the issue is NOT caused by the MX as per Section 3.2 in 'IETF RFC2131 | Dynamic Host Configuration Protocol' below:
RFC link: https://tools.ietf.org/html/rfc2131#page-17
3.2 Client-server interaction - reusing a previously allocated network address

If 'giaddr' is 0x0 in the DHCPREQUEST message, the client is on the same subnet as the server. The server MUST broadcast the DHCPNAK message to the 0xffffffff broadcast address because the client may not have a correct network address or subnet mask, and the client may not be answering ARP requests.

Otherwise, the server MUST send the DHCPNAK message to the IP address of the BOOTP relay agent, as recorded in 'giaddr'. The relay agent will, in turn, forward the message directly to the client's hardware address, so that the DHCPNAK can be delivered even if the client has moved to a new network.
Please refer to page 9 of the RFC for more information about the DHCP fields description.
e.g. giaddr: Relay agent IP address, used in booting via a relay agent.
The RFC2131 said that if the 'giaddr' (Relay agent IP address) in the client's DHCPREQUEST is 0.0.0.0, the MX will consider this DHCPREQUEST is within same broadcast domain of your MX's subnet: x.y.4.0/24. Because the client's requested 'dhcp.ip.client == x.y.1.71' (as per frame.number == 120 in your provided packet capture) is NOT within the same broadcast domain of the MX, the MX's DHCP server will send a DHCPNAK to a broadcast address.
If you are filtering the pcap with the filter: 'dhcp.ip.relay == x.y.1.0/24', you will see there are 2 DHCP relay agent IP addresses: x.y.1.1 and x.y.1.4 which were in the client's DHCP requests and they were the working DHCP renewal requests as per below screenshot. Please note that the source IP of the DHCPREQUEST will be always a DHCP relay agent IP if it is included in the DHCPREQUEST (as per RFC2131).
---------------------------------------------
Well, this looked very technical, quoting the RFC, but it did not make any sense. I could NOT believe that the people smarter than me who write the RFC had made this glaring mistake. It was time to download the RFC and look for myself. Seven hours and some experiments later, I had made some progress!
You needed to read further in the RFC Meraki quoted. My issue was with the renewing process, NOT with the initial DHCP Discover.
From page 31 of the same document Meraki quoted:

DHCPREQUEST generated during RENEWING state:
'server identifier' MUST NOT be filled in, 'requested IP address' option MUST NOT be filled in, 'ciaddr' MUST be filled in with client's IP address. In this situation, the client is completely configured, and is trying to extend its lease. This message will be unicast, so no relay agents will be involved in its transmission. Because 'giaddr' is therefore not filled in, the DHCP server will trust the value in 'ciaddr', and use it when replying to the client.

A client MAY choose to renew or extend its lease prior to T1. The server may choose not to extend the lease (as a policy decision by the network administrator), but should return a DHCPACK message regardless.
The renew conversation is between the end device and the DHCP server (the relay agent is NOT involved), so why would it need to put the relay agent IP address in? The issue in this network is that the broadcast will not even be seen by the DHCP relay agent, as there is a router in between.
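To spell out how I read that passage, the destination of the reply to a DHCPREQUEST should be chosen roughly as in the sketch below. This is illustrative pseudocode based on my reading of RFC 2131, not Meraki's actual code:

# Sketch of where the reply to a DHCPREQUEST should go, as I read RFC 2131.
# Illustrative only - this is not the MX implementation.
def choose_reply_destination(giaddr: str, ciaddr: str, broadcast_flag: bool) -> str:
    if giaddr != "0.0.0.0":
        # Relayed request: reply to the relay agent, which forwards it to the client.
        return giaddr
    if ciaddr != "0.0.0.0":
        # RENEWING state: unicast request from a fully configured client,
        # so the server should trust ciaddr and unicast the reply straight back.
        return ciaddr
    if broadcast_flag:
        # Client asked for a broadcast reply because it cannot receive unicast yet.
        return "255.255.255.255"
    # Otherwise the offered address (yiaddr) can be used for the reply.
    return "yiaddr"

# What I observed from the MX was effectively "giaddr == 0.0.0.0 means local subnet",
# so the renewal got a NAK sent to 255.255.255.255 that never reached x.y.1.71.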
I thought this might be a BUG just with the IoT device, so I decided to do the same experiment with a Windows 10 PC (latest build), with NO reservation in the Meraki DHCP server.
On the PC I did an ipconfig /release, then ipconfig /renew (for the initial request), and then another ipconfig /renew (for the renewal request).
Dell.pcap showed the same thing.
You cannot tell me that Windows 10 has the same bug?? But users were not complaining about this issue, even though they were still affected by it.
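If you do not have a Windows box handy, the same renewal can be hand-crafted. Here is a rough scapy sketch of the unicast DHCPREQUEST a client sends in the RENEWING state, per the RFC passage above (addresses and MAC are placeholders):

from scapy.all import IP, UDP, BOOTP, DHCP

# Rough sketch of a RENEWING-state DHCPREQUEST: ciaddr filled in, giaddr left at
# 0.0.0.0, no 'requested IP address' or 'server identifier' option, sent unicast
# straight to the DHCP server. 10.0.1.71 and 10.0.4.1 stand in for x.y.1.71 and x.y.4.1.
client_ip  = "10.0.1.71"
server_ip  = "10.0.4.1"
client_mac = "aa:bb:cc:dd:ee:ff"

renewal = (
    IP(src=client_ip, dst=server_ip)      # unicast, no relay agent involved
    / UDP(sport=68, dport=67)
    / BOOTP(op=1, ciaddr=client_ip, giaddr="0.0.0.0",
            chaddr=bytes.fromhex(client_mac.replace(":", "")))
    / DHCP(options=[("message-type", "request"), "end"])
)
renewal.show()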
I did not think what Meraki was referring to in the RFC applies to the RENEWAL. The RENEWAL reply should be a unicast.
Also, why is the DHCP server sending a NAK and not an ACK? (Some logic is wrong here.)
I did not believe I had found a bug in the RFC.
From the Meraki.pcap file in their example, the relay addresses were x.y.1.1 and x.y.1.4 (by the way, x.y.1.4 is a Cisco WLAN controller, which also acts like a DHCP helper for the wireless clients).
The Meraki tech's working examples are NOT renewals; they are initial DHCP requests.
I did some testing using the wireless DHCP helper x.y.1.4 with the same laptop. Now it works perfectly, as the WLAN controller PROXIES ALL DHCP and will not let the client talk directly to the DHCP server. The renewal goes via the PROXY and the giaddr is the WLAN controller (DellWireless.pcap).
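A quick way to see the two behaviours side by side in one capture is to split the DHCPREQUESTs by giaddr. A rough scapy sketch (placeholder filename; DHCP message type 3 is a DHCPREQUEST):

from scapy.all import rdpcap, BOOTP, DHCP

# Rough sketch: classify DHCPREQUESTs as relayed/proxied (giaddr set, e.g. via the
# WLAN controller) versus direct renewals (giaddr zero, ciaddr set).
for pkt in rdpcap("site.pcap"):
    if DHCP not in pkt:
        continue
    opts = dict(o for o in pkt[DHCP].options if isinstance(o, tuple) and len(o) == 2)
    if opts.get("message-type") != 3:
        continue
    b = pkt[BOOTP]
    if b.giaddr != "0.0.0.0":
        print("relayed/proxied request via", b.giaddr)   # reply goes back via the relay
    elif b.ciaddr != "0.0.0.0":
        print("direct renewal from", b.ciaddr)           # reply should be unicast to ciaddr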
BUT THIS IS NOT THE ISSUE I WAS TALKING ABOUT.
My issue is when the DHCP client talks directly to the DHCP server to renew, as in Dell.pcap. It should ACK the request and unicast the reply back to the client.
Also, after looking at the .pcap, the client is SPECIFICALLY requesting a unicast reply.
Clients requesting renewal of an existing lease may communicate directly via UDP unicast, since the client already has an established IP address at that point. Additionally, there is a BROADCAST flag (1 bit in the 2-byte flags field, where all other bits are reserved and so are set to 0) the client can use to indicate in which way (broadcast or unicast) it can receive the DHCPOFFER: 0x8000 for broadcast, 0x0000 for unicast.[8] Usually, the DHCPOFFER is sent through unicast.
In the Dell PC example:
From the IoT device:
Now when I read the RFC, it has the flag the other way around?? This may be where the bug came in?
When I look at the document Meraki referred to, page 24:
A client that can receive unicast IP datagrams before its protocol software has been configured SHOULD clear the BROADCAST bit to 0.
So it looks like Wireshark is decoding it correctly…
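If you want to confirm what your own clients are asking for, the BROADCAST bit is the top bit (0x8000) of the BOOTP flags field. A rough scapy sketch to check it in a capture (placeholder filename):

from scapy.all import rdpcap, BOOTP

# Rough sketch: for each BOOTREQUEST in a capture, report whether the client set
# the BROADCAST bit (0x8000) or asked for a unicast reply (flags 0x0000).
for pkt in rdpcap("clients.pcap"):
    if BOOTP in pkt and pkt[BOOTP].op == 1:            # BOOTREQUEST only
        wants_broadcast = bool(int(pkt[BOOTP].flags) & 0x8000)
        print(pkt[BOOTP].chaddr[:6].hex(),
              "broadcast requested" if wants_broadcast else "unicast requested")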
At this point I was convinced.... but Meraki was not !!!
So back to the lab; I set up an identical network, and this time I used a Windows Server 2008 R2 as the DHCP server instead of the Meraki. I wanted to see if another vendor's DHCP server had the same issue.
I removed the MX64 and replaced it with a Windows 2008 R2 server running a DHCP scope for the x.x.1.0/24 subnet with a reservation for x.x.1.33. So I was just swapping the MX64 for a Microsoft Windows 2008 R2 DHCP server; the network was otherwise identical.
Using my Dell laptop:
DHCP initial address is broadcast and the laptop gets an IP via the DHCP relay (No 1 to 4 in capture below)
DHCP renewal is a unicast direct to the DHCP server, and the ACK is a unicast back to the client (No 4, 5 below)
DHCP release is a unicast direct to the DHCP server (No 7 below)
This is 100% correct and works PERFECTLY.
Compare this to how the MX64 works (explained in detail above):
DHCP initial address is broadcast and the client gets an IP via the DHCP relay
DHCP renewal is a unicast direct to the DHCP server, and the NAK comes back as a broadcast (saying the address is not available)
The device never sees the broadcast, as it is not on the same subnet, and keeps trying… ALSO, why does it get a NAK when it has a reservation?
Meraki now believed me, and the "Meraki Support Firewall" would now send this bug to development. (What a battle; you would think they would want to know about bugs in their software.)
I was puzzled why the other devices and the WLAN controller clients did not seem to be affected. After some more traces, it turned out all the clients were affected. The difference was that, after the renewal failed, all the other devices fell back to a normal broadcast for a new IP address; this IoT device did not, and just went offline.
Mystery solved.
It is now the 23rd of May 2020 and I still don't have any information from Meraki on when this bug will be fixed.
I did expect a bug of this nature to be fixed quite quickly. I suppose this is the difference between Cisco Meraki and Cisco Enterprise products. Bugs I have found in Cisco Enterprise products have been hot fixed within a week.
It is always tricky as a consultant to work out how to charge the customer for finding bugs in Meraki devices. I could have stopped when I identified the problem, which is all the customer actually asked for, but I wanted to get this issue actually fixed, and I did not expect the resistance I got from Meraki support when trying to get the bug fixed. Meraki support engineers get paid to work in the lab and isolate issues; I do not. I could not reasonably bill the customer for the 20 hours I have spent working on this simple issue. I think Meraki owe me some "Meraki Swag" for finding this one!
** Update: this issue has been fixed in MX 15.34 **