mDNS Discovery Proxy Bug in OpenThread Border Router: Continuous Host Queries

by StackCamp Team

Introduction

This article describes a bug in the mDNS Discovery Proxy of the OpenThread Border Router (OTBR): the OTBR keeps sending mDNS queries for a hostname after it has already returned the DNS response to the client. The issue was observed in an OTBR build that uses the AdvProxy/DiscProxy functions from the OpenThread core. The extra queries add unnecessary network traffic and may affect the performance of the OTBR. The sections below cover the bug itself, the steps to reproduce it, the expected and observed behavior, the relevant logs, and possible causes.

Bug Description

The core issue is the persistent mDNS querying initiated by the OTBR. When the OTBR receives a unicast DNS query for a hostname on the AIL (Adjacent Infrastructure Link), it correctly starts an mDNS query for the name and answers the client with a unicast DNS response. However, it does not stop the mDNS queries after responding and keeps querying for the same hostname, even for names that have already been resolved and answered. The continuous querying was observed for at least 15-20 minutes, which raises concerns about the efficiency and resource utilization of the OTBR.

Steps to Reproduce

The bug was reproduced with the following build configuration and minimal network setup.

  1. OTBR Build: Use the ot-reference-release build process with OpenThread commit 8a19434b8ae56ed5ffbc931d3f1aa212823633c6, the commit on which the bug was originally observed.

  2. Modified Build Flags: Change the build flags to enable the OpenThread core versions of Discovery Proxy and AdvProxy, the components involved in the bug.

  3. Network Setup: Set up a minimal topology with a single OTBR attached to the AIL and a single FTD (Full Thread Device) acting as the DNS client. Keeping the setup this small rules out interference from other network components.

  4. Build Flags: Use the following build flags (replacing the original ones):

    readonly OTBR_THREAD_1_4_OPTIONS=(
        "${OTBR_COMMON_OPTIONS[@]}"
        "-DOT_THREAD_VERSION=1.4"
        "-DOTBR_DUA_ROUTING=ON"
        "-DOT_DUA=ON"
        "-DOT_MLR=ON"
        "-DOT_DNSSD_DISCOVERY_PROXY=ON"
        "-DOT_SRP_ADV_PROXY=ON"
        "-DOT_MDNS=ON"
        "-DOT_BORDER_ROUTING=ON"
        "-DOT_SRP_CLIENT=ON"
        "-DOT_SRP_SERVER=ON"
        "-DOT_DNS_CLIENT=ON"
        "-DOT_DNSSD_SERVER=ON"
        "-DOT_TCP=ON"
        "-DOT_DNS_CLIENT_OVER_TCP=ON"
        "-DOTBR_TREL=ON"
        "-DOTBR_NAT64=ON"
        "-DOTBR_DHCP6_PD=ON"
    )
    

Of these, OT_DNSSD_DISCOVERY_PROXY, OT_SRP_ADV_PROXY, and OT_MDNS are the options intended to enable the OpenThread core Discovery Proxy, Advertising Proxy, and mDNS features referred to in step 2.

Expected Behavior

The mDNS queries should stop once the DNS response has been sent to the client. After the OTBR has resolved the hostname and answered, there is no further need to query for the same hostname, and stopping avoids unnecessary traffic.

Observed Behavior and Log Output

The observed behavior deviates from this expectation. The OTBR continues to send mDNS queries for hostnames even after the DNS response has been sent to the client, and it does so for an extended period. The logs and network traffic below give some insight into the discrepancy.
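
One way to quantify the repeated queries is to capture mDNS traffic on the AIL (UDP port 5353) and count how many questions name the host. The sketch below is a minimal, illustrative parser for the question section of a DNS message as laid out in RFC 1035. The function names are hypothetical, and the code is not part of OTBR or OpenThread.

```python
import struct


def _read_name(packet: bytes, offset: int):
    """Decode a DNS name starting at offset; return (name, offset after it)."""
    labels = []
    resume = None  # where parsing continues after a compression pointer
    hops = 0
    while True:
        length = packet[offset]
        if length == 0:                      # root label terminates the name
            offset += 1
            break
        if length & 0xC0 == 0xC0:            # compression pointer (RFC 1035, 4.1.4)
            if resume is None:
                resume = offset + 2
            offset = ((length & 0x3F) << 8) | packet[offset + 1]
            hops += 1
            if hops > 16:                    # guard against pointer loops
                raise ValueError("too many compression pointers")
            continue
        labels.append(packet[offset + 1:offset + 1 + length].decode("ascii", "replace"))
        offset += 1 + length
    return ".".join(labels) + ".", (resume if resume is not None else offset)


def mdns_question_names(packet: bytes) -> list[str]:
    """Return the question names in a DNS/mDNS query; [] for responses."""
    # 12-byte header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    _id, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", packet[:12])
    if flags & 0x8000:                       # QR bit set means a response
        return []
    names, offset = [], 12
    for _ in range(qdcount):
        name, offset = _read_name(packet, offset)
        offset += 4                          # skip QTYPE and QCLASS
        names.append(name)
    return names
```

Feeding it the UDP payloads from a packet capture and counting occurrences of "myrouter.local." per minute would show whether queries continue after the DNS response has been sent.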

The console output on the FTD client shows the following:

    ot dns resolve4 myrouter.default.service.arpa
    DNS response for myrouter.default.service.arpa. -
    Done

Note that the response contains no address: the IPv4 address (A record) for the host "myrouter" is not resolved, while queries for external domains such as "ipv4.google.com" work as expected. This could be a separate issue, but resolving it may also shed light on the continuous mDNS querying bug.

The OTBR logs at the time of receiving the DNS query from the client are as follows:

    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-: Received IPv6 UDP msg, len:95, chksum:f6f4, ecn:no, from:0xd000, sec:yes, prio:normal, rss:-59.0, radio:15.4
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-:     src:[fd00:abba:0:0:7fb:2954:7c9c:7077]:49153
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] MeshForwarder-:     dst:[fd00:abba:0:0:76c4:d7da:8379:1d05]:53
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] DnssdServer---: Received query from fd00:abba:0:0:7fb:2954:7c9c:7077
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.462 [I] DnssdServer---: A query for 'myrouter.default.service.arpa.'
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Subscribe: myrouter.default.service.arpa.
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: Subscribe host myrouter (total 14)
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo myrouter.local inf 0
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: flags=1073741827, host=myrouter.local., sa_family=10, error=0
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: add address=2001:1c02:1580:8d00:d66a:6aff:fefd:236, ttl=120
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: DNSServiceGetAddrInfo reply: flags=1073741826, host=myrouter.local., sa_family=2, error=0
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-MDNS----: Host myrouter is resolved successfully: host myrouter.local. addresses 1 ttl 120
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Host discovered: myrouter hostname myrouter.local. addresses 1
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: [INFO]-DPROXY--: Unsubscribe: myrouter.default.service.arpa.
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.465 [I] DnssdServer---: Send response, rcode:0
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-: Sent IPv6 UDP msg, len:95, chksum:76f4, ecn:no, to:0xd000, sec:yes, prio:normal, radio:15.4
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-:     src:[fd00:abba:0:0:76c4:d7da:8379:1d05]:53
    Jul 07 12:17:37 thread-br-core otbr-agent[688]: 00:57:15.473 [I] MeshForwarder-:     dst:[fd00:abba:0:0:7fb:2954:7c9c:7077]:49153

This log excerpt shows the sequence of events when the DNS query is received and processed. Notably, the Discovery Proxy unsubscribes from the hostname (Unsubscribe: myrouter.default.service.arpa.) right after the host is discovered and just before the response is sent. The continued mDNS queries suggest that this unsubscribe does not actually stop them. Further investigation is needed to find where in the code the queries are initiated and why they are not terminated.
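
The flags and sa_family values in the DNSServiceGetAddrInfo replies can be decoded to see what the mDNS daemon reported. The sketch below assumes the kDNSServiceFlagsMoreComing (0x1) and kDNSServiceFlagsAdd (0x2) bit values from dns_sd.h and the Linux address-family numbers (2 for AF_INET, 10 for AF_INET6). The helper name is hypothetical, and all other flag bits are left undecoded.

```python
# Bit values as defined in dns_sd.h (assumed here; only these two are decoded).
FLAG_MORE_COMING = 0x1
FLAG_ADD = 0x2

# Address-family numbers as used on Linux, where the log was captured.
LINUX_FAMILIES = {2: "IPv4", 10: "IPv6"}


def describe_reply(flags: int, sa_family: int) -> dict:
    """Summarize one 'DNSServiceGetAddrInfo reply' log line."""
    return {
        "family": LINUX_FAMILIES.get(sa_family, "unknown"),
        "add": bool(flags & FLAG_ADD),
        "more_coming": bool(flags & FLAG_MORE_COMING),
        # Remaining bits are reported raw rather than interpreted.
        "other_bits": flags & ~(FLAG_ADD | FLAG_MORE_COMING),
    }


# The two replies from the log excerpt.
first_reply = describe_reply(1073741827, 10)
second_reply = describe_reply(1073741826, 2)
```

For the two replies in the log, this gives an IPv6 reply with Add and MoreComing set, followed by an IPv4 reply with Add set. Since the log then reports only one address, the IPv4 result may not have been carried into the answer, which would be consistent with the empty resolve4 response. The excerpt alone does not confirm this.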

Additional Context and Potential Causes

Several factors might contribute to this issue:

  1. Caching Issues: The OTBR might be caching mDNS queries incorrectly, so that queries are repeated even after a response has been sent. The caching mechanism and its interaction with the mDNS query process should be examined.
  2. Timer or Loop Bugs: A timer or loop in the mDNS query logic might not be reset or terminated after a successful resolution.
  3. Subscription Management: The subscribe/unsubscribe mechanism for mDNS queries might have a flaw that leaves the OTBR subscribed to the hostname even after it logs an unsubscribe. The subscription management code needs a thorough review.
  4. Thread Synchronization: If the mDNS query process involves multiple threads, a race condition could cause queries to be initiated repeatedly.
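
The subscription-management hypothesis can be checked against a longer log capture by pairing the DPROXY Subscribe and Unsubscribe lines per name. This is a minimal sketch, assuming the log line format shown in the excerpt above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "[INFO]-DPROXY--: Subscribe: myrouter.default.service.arpa."
_DPROXY_LINE = re.compile(r"DPROXY--: (Subscribe|Unsubscribe): (\S+)")


def outstanding_subscriptions(log_lines) -> dict:
    """Return names whose Subscribe and Unsubscribe counts do not match."""
    balance = Counter()
    for line in log_lines:
        match = _DPROXY_LINE.search(line)
        if match is None:
            continue
        action, name = match.groups()
        balance[name] += 1 if action == "Subscribe" else -1
    # Keep only names with a non-zero balance (positive: still subscribed).
    return {name: count for name, count in balance.items() if count != 0}
```

An empty result while mDNS queries are still visible on the network would suggest that the Discovery Proxy subscriptions are balanced and that the repeated queries originate in a lower layer. A non-empty result would point at the subscription handling itself.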

Conclusion

The persistent mDNS queries in the OpenThread Border Router generate unnecessary network traffic and may degrade performance. The reproduction steps, logs, and candidate causes above give developers a starting point for locating the root cause. Further debugging and code review, focused on those candidate causes, is needed to resolve the issue and prevent it from recurring in future releases.

This analysis is intended as a basis for that investigation by the OpenThread community.