OTBR Discovery Proxy Bug: Persistent mDNS Queries for Already-Resolved Hostnames
Introduction
This article examines a bug in the OpenThread Border Router (OTBR) Discovery Proxy: the border router keeps sending mDNS queries for hostnames it has already resolved and answered.
The OTBR bridges a Thread network and an IP network so that devices can discover and reach services across the network boundary. Its Discovery Proxy handles this discovery on behalf of Thread devices. When it receives a unicast DNS query for a hostname, it issues an mDNS query for that name on the adjacent infrastructure link (AIL) and, once the name resolves, returns a unicast DNS response to the client. That completes the request.
In the scenario described here, however, the OTBR continued to send mDNS queries for the same hostname after the response had been delivered. The redundant queries add multicast traffic on the AIL and consume border router resources, so understanding the cause matters for production deployments.
Describe the Bug
The OTBR repeatedly sends mDNS queries for hostnames that it has already resolved and answered via unicast DNS. The behavior was observed in an OTBR build that uses the AdvProxy/DiscProxy functions from the OT core.
When the OTBR receives a unicast DNS query for a hostname that lives on the AIL (Adjacent Infrastructure Link), it starts an mDNS query for that name on the AIL and then sends the client a unicast DNS response. Instead of stopping at that point, it keeps querying for the same hostname. During testing the queries continued for at least 15-20 minutes.
The root cause has not been identified. The redundant queries add traffic on the AIL and load on the OTBR, and the effect would be most noticeable with a high volume of DNS queries or limited bandwidth.
Steps to Reproduce
To reproduce the bug, the following setup and steps were used:
- OT-reference-release build process: Built the OTBR using the ot-reference-release build process with the specified OT commit: 8a19434b8ae56ed5ffbc931d3f1aa212823633c6.
- Modified build flags: Modified the build flags to enable the OT-core versions of Discovery Proxy and AdvProxy. This involved replacing the original build flags with the following configuration:
    readonly OTBR_THREAD_1_4_OPTIONS=(
        ${OTBR_COMMON_OPTIONS[@]}
        "-DOT_THREAD_VERSION=1.4"
        "-DOTBR_DUA_ROUTING=ON"
        "-DOT_DUA=ON"
        "-DOT_MLR=ON"
        "-DOT_DNSSD_DISCOVERY_PROXY=ON"
        "-DOT_SRP_ADV_PROXY=ON"
        "-DOT_MDNS=ON"
        "-DOT_BORDER_ROUTING=ON"
        "-DOT_SRP_CLIENT=ON"
        "-DOT_SRP_SERVER=ON"
        "-DOT_DNS_CLIENT=ON"
        "-DOT_DNSSD_SERVER=ON"
        "-DOT_TCP=ON"
        "-DOT_DNS_CLIENT_OVER_TCP=ON"
        "-DOTBR_TREL=ON"
        "-DOTBR_NAT64=ON"
        "-DOTBR_DHCP6_PD=ON"
    )
- Network setup: Deployed a single OTBR on the AIL with a single FTD (NCS dongle) acting as a DNS client.
- DNS Query: From the FTD client, initiated a DNS query for a hostname on the AIL. For example:
    ot dns resolve4 myrouter.default.service.arpa
Following these steps reproduces the bug: the OTBR keeps sending mDNS queries for the resolved hostname. The modified build flags matter because they select the OT-core versions of the Discovery Proxy and AdvProxy, which is the configuration in which the behavior was observed. A single OTBR with a single FTD (Full Thread Device) client keeps the environment controlled, and a capture tool such as Wireshark confirms the repeated queries.
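To turn a capture into a per-hostname count, the question section of each mDNS packet can be decoded directly. The following is a minimal sketch, assuming the UDP payloads of port 5353 packets have already been exported from the capture as raw bytes. The function names (`read_name`, `question_names`, `count_queries`, `build_query`) are hypothetical helpers written for this article and are not part of OTBR or Wireshark.

```python
import struct
from collections import Counter


def read_name(payload, offset):
    """Decode a DNS name at offset; return (name, offset just past it)."""
    labels = []
    resume = None  # where parsing continues after a compression pointer
    hops = 0
    while True:
        length = payload[offset]
        if length & 0xC0 == 0xC0:  # compression pointer (RFC 1035, 4.1.4)
            target = struct.unpack_from("!H", payload, offset)[0] & 0x3FFF
            if resume is None:
                resume = offset + 2
            offset = target
            hops += 1
            if hops > 16:
                raise ValueError("too many compression pointers")
            continue
        offset += 1
        if length == 0:  # root label terminates the name
            break
        labels.append(payload[offset:offset + length].decode("utf-8", "replace"))
        offset += length
    return ".".join(labels) + ".", (resume if resume is not None else offset)


def question_names(payload):
    """Return the names asked for in a DNS/mDNS query, or [] for a response."""
    _ident, flags, qdcount = struct.unpack_from("!HHH", payload, 0)
    if flags & 0x8000:  # QR bit set: this is a response
        return []
    names, offset = [], 12  # the DNS header is 12 bytes long
    for _ in range(qdcount):
        name, offset = read_name(payload, offset)
        offset += 4  # skip QTYPE and QCLASS
        names.append(name)
    return names


def count_queries(payloads):
    """Count how often each name is queried across many UDP payloads."""
    counts = Counter()
    for payload in payloads:
        counts.update(question_names(payload))
    return counts


def build_query(name, qtype=28):
    """Build a one-question query for testing (28 is the AAAA record type)."""
    header = struct.pack("!HHHHHH", 0, 0, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode() for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)
```

A count for `myrouter.local.` that keeps growing after the client has received its answer would match the behavior reported here.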
Expected Behavior
mDNS queries for a hostname should stop once the DNS response has been sent to the client. After the OTBR receives a unicast DNS query, starts an mDNS query, and answers the client, nothing remains interested in that name, so no further mDNS queries should be sent for it until a client asks again.
This matches how mDNS is meant to be used. RFC 6762 allows a querier to keep re-sending a question (continuous querying), but only for as long as something is still interested in the answer. The log in the next section shows the Discovery Proxy unsubscribing from the name before it sends the response, so the querying should end there. Stopping promptly avoids spending bandwidth and OTBR processing on redundant work.
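The expected lifecycle can be pictured with a small reference-counting model: querying starts with the first subscriber for a name and stops when the last one leaves. This is an illustrative sketch only. `HostSubscriptions` is a hypothetical class invented for this article and says nothing about how OTBR or OpenThread actually track subscriptions.

```python
class HostSubscriptions:
    """Toy reference-counted registry of hostnames with active mDNS interest."""

    def __init__(self):
        self._refs = {}

    def subscribe(self, name):
        """Register interest; return True if an mDNS query should start now."""
        self._refs[name] = self._refs.get(name, 0) + 1
        return self._refs[name] == 1

    def unsubscribe(self, name):
        """Release interest; return True if the mDNS query should stop now."""
        count = self._refs.get(name, 0)
        if count == 0:
            return False  # nothing was subscribed, nothing to stop
        if count == 1:
            del self._refs[name]
            return True  # last subscriber left: querying must stop
        self._refs[name] = count - 1
        return False

    def active(self):
        """Names that should still be queried."""
        return sorted(self._refs)
```

In this model the reported bug corresponds to queries still going out for a name that `active()` no longer lists.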
Console/Log Output
The console output on the FTD client when resolving a hostname looks as follows:
ot dns resolve4 myrouter.default.service.arpa
DNS response for myrouter.default.service.arpa. -
Done
Note that the IPv4 address (A record) of the host "myrouter" is not returned in this case, even though queries for external domains such as "ipv4.google.com" work. That appears to be a separate issue that could be filed on its own. The continuous mDNS queries occur regardless of it and are the focus of this report.
No console log output shows the repeated mDNS queries. The following log was captured when the DNS query arrived from the client, and it contains the last entries related to mDNS/DProxy:
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-: Received IPv6 UDP msg, len:95, chksum:f6f4, ecn:no, from:0xd000, sec:yes, prio:normal, rss:-59.0, radio:15.4
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-: src:[fd00:abba:0:0:7fb:2954:7c9c:7077]:49153
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-: dst:[fd00:abba:0:0:76c4:d7da:8379:1d05]:53
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] DnssdServer---: Received query from fd00:abba:0:0:7fb:2954:7c9c:7077
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] DnssdServer---: A query for 'myrouter.default.service.arpa.'
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Subscribe: myrouter.default.service.arpa.
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: Subscribe host myrouter (total 14)
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo myrouter.local inf 0
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: flags=1073741827, host=myrouter.local., sa_family=10, error=0
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: add address=2001:1c02:1580:8d00:d66a:6aff:fefd:236, ttl=120
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: flags=1073741826, host=myrouter.local., sa_family=2, error=0
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: Host myrouter is resolved successfully: host myrouter.local. addresses 1 ttl 120
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Host discovered: myrouter hostname myrouter.local. addresses 1
Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Unsubscribe: myrouter.default.service.arpa.
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.465 [I] DnssdServer---: Send response, rcode:0
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-: Sent IPv6 UDP msg, len:95, chksum:76f4, ecn:no, to:0xd000, sec:yes, prio:normal, radio:15.4
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-: src:[fd00:abba:0:0:76c4:d7da:8379:1d05]:53
Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-: dst:[fd00:abba:0:0:7fb:2954:7c9c:7077]:49153
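The numeric `flags` and `sa_family` values in the `DNSServiceGetAddrInfo reply` lines are easier to read once decoded. The sketch below assumes the flag bit values from the mDNSResponder `dns_sd.h` header as the author of this example recalls them (0x1, 0x2 and 0x40000000) and the Linux address-family numbers. Check both against the headers on the target system before relying on them. `decode_flags` and `parse_reply` are hypothetical helper names.

```python
import re

# Assumed bit values from dns_sd.h; verify against your copy of the header.
DNSSD_FLAGS = {
    0x1: "kDNSServiceFlagsMoreComing",
    0x2: "kDNSServiceFlagsAdd",
    0x40000000: "kDNSServiceFlagAnsweredFromCache",
}

# Address family numbers as used on Linux.
ADDRESS_FAMILIES = {2: "AF_INET", 10: "AF_INET6"}

REPLY = re.compile(
    r"DNSServiceGetAddrInfo reply: flags=(\d+), host=(\S+), "
    r"sa_family=(\d+), error=(-?\d+)"
)


def decode_flags(value):
    """Return the names of known flag bits, plus any leftover bits as hex."""
    names = [name for bit, name in DNSSD_FLAGS.items() if value & bit]
    unknown = value & ~sum(DNSSD_FLAGS)
    if unknown:
        names.append(hex(unknown))
    return names


def parse_reply(line):
    """Parse one reply log line into a dict, or return None if it differs."""
    match = REPLY.search(line)
    if match is None:
        return None
    flags, host, family, error = match.groups()
    return {
        "host": host,
        "flags": decode_flags(int(flags)),
        "family": ADDRESS_FAMILIES.get(int(family), family),
        "error": int(error),
    }
```

Under these assumptions, both replies in the log carry the answered-from-cache bit, and the second reply is for `AF_INET`. That second reply is not followed by an "add address" line and the host is reported with "addresses 1", which is consistent with the missing A record noted earlier, although the log alone does not prove the connection.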
Because the logs contain no entries for the repeated mDNS queries, the issue is hard to diagnose from logs alone. They cover the initial DNS query processing and the response, but nothing about later mDNS activity. A PCAP file, if provided, can confirm the continuous queries and show their frequency and pattern, which helps narrow down the source of the problem.
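One check the logs do support is whether every Discovery Proxy subscription is released again. The sketch below pairs the `DPROXY` `Subscribe:` and `Unsubscribe:` entries by name, assuming the log line format shown above. `open_subscriptions` is a hypothetical helper, not an OTBR tool.

```python
import re
from collections import Counter

# Matches the DPROXY subscribe/unsubscribe entries in the otbr-agent log.
EVENT = re.compile(r"DPROXY--: (Subscribe|Unsubscribe): (\S+)")


def open_subscriptions(lines):
    """Return {name: balance} for names whose subscribe count is unbalanced."""
    balance = Counter()
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue
        kind, name = match.groups()
        balance[name] += 1 if kind == "Subscribe" else -1
    return {name: count for name, count in balance.items() if count != 0}
```

For the log in this article the result is empty: the name is subscribed and unsubscribed once. The repeated queries therefore cannot be explained by a Discovery Proxy subscription that was left open, at least as far as these log entries show.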
Additional Context
The continuous querying has practical costs. Unnecessary mDNS queries consume bandwidth that other services and devices on the network could use, which matters most in resource-constrained environments such as typical IoT deployments. They also add processing load on the OTBR, which may raise CPU utilization and reduce responsiveness. Networks with a high volume of DNS queries are likely to see the effect amplified, because the redundant queries add to traffic that is already heavy. Identifying the specific conditions that trigger the behavior is the first step toward a mitigation.
Conclusion
In summary, the OTBR Discovery Proxy keeps sending mDNS queries for hostnames it has already resolved, even after it has answered the client. The bug can be reproduced by building the OTBR with the flags listed above, issuing a DNS query from a Thread client, and observing the continuous mDNS queries. The expected behavior is that mDNS queries stop once the DNS response is sent. The console logs do not show the repeated queries, but a PCAP capture can confirm them. The root cause is still unknown, and further investigation is needed to find it and fix the persistent querying.