Bug: AMF Crashes on Late SMF Select Data Response from Nudm-SDM
Introduction
This article examines a critical bug in Open5GS version 2.7.5, within the Access and Mobility Management Function (AMF). The AMF crashes when it receives a delayed smf-select-data response from the Nudm-SDM (the Subscriber Data Management service of the Unified Data Management function) after the User Equipment (UE) context has been removed. The scenario exposes a flaw in the AMF's state management, particularly in how it handles asynchronous responses in resource-constrained environments. The sections below cover the steps to reproduce the bug, an analysis of the logs, the expected and observed behaviors, and the implications of the issue.
Understanding the Bug: AMF Crash on Delayed SMF Select Data Response
In the 5G core, the AMF manages UE access and mobility. It interacts with other network functions, including the Nudm-SDM, to retrieve the subscriber data needed for session establishment and management. Under poor network conditions or resource constraints, responses from the Nudm-SDM can be delayed. This is where the bug in Open5GS v2.7.5 surfaces: the AMF crashes when a smf-select-data response arrives after the UE context has already been removed.
This bug stems from missing state validation in the AMF's GMM (5GS Mobility Management, also written 5GMM) state machine. When a delayed response from the Nudm-SDM arrives after the UE context has already been removed, the state machine has no logic to handle the event. This triggers a fatal error in gmm_state_exception with the message "should not be reached," and the AMF crashes. The issue is severe because an AMF crash can disrupt network services, dropping connections and making service unavailable to users.
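To make the failure mode concrete, here is a minimal Python model of the idea, not Open5GS code. The names `GmmState`, `UeContext`, `handle_sdm_response_strict`, and `handle_sdm_response_tolerant` are hypothetical and exist only for this illustration. The strict handler treats any response outside the expected state as unreachable, which loosely mirrors the crash described above, while the tolerant handler validates the state first and drops a late response.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class GmmState(Enum):
    """Simplified, hypothetical subset of UE mobility states."""
    INITIAL_CONTEXT_SETUP = auto()
    DE_REGISTERED = auto()


@dataclass
class UeContext:
    supi: str
    state: GmmState = GmmState.INITIAL_CONTEXT_SETUP
    smf_select_data: Optional[dict] = None


def handle_sdm_response_strict(ue: UeContext, body: dict) -> bool:
    """Accept the response only in the expected state; otherwise fail hard."""
    if ue.state is GmmState.INITIAL_CONTEXT_SETUP:
        ue.smf_select_data = body
        return True
    # Stand-in for the fatal assertion path described in the article.
    raise RuntimeError("should not be reached")


def handle_sdm_response_tolerant(ue: Optional[UeContext], body: dict) -> bool:
    """Validate the UE context and state first; drop stale responses."""
    if ue is None or ue.state is not GmmState.INITIAL_CONTEXT_SETUP:
        return False  # late response: ignore it instead of crashing
    ue.smf_select_data = body
    return True


if __name__ == "__main__":
    ue = UeContext("imsi-466920123456005")
    ue.state = GmmState.DE_REGISTERED  # context released before the reply
    print(handle_sdm_response_tolerant(ue, {}))
```

The design choice in the tolerant handler is to treat a late response as an expected race rather than a programming error, so the process keeps serving other UEs.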
Steps to Reproduce the AMF Crash
Reproducing this bug requires setting up a specific environment that mimics real-world network conditions, particularly resource constraints or network instability. Here's a step-by-step guide to reproduce the AMF crash:
- Environment Setup: Deploy the latest Open5GS source branch using Docker containers. This provides a controlled and isolated environment for testing.
- NF Container Startup: Start all necessary Network Function (NF) containers, including the AMF, Nudm-SDM, and others required for basic network operation.
- Memory Constraints: Apply strict memory limits to the containers or the host system, for example through Docker's resource-limiting options. This simulates a resource-constrained environment, which can exacerbate the timing issues that lead to the bug.
- Trigger UE Registration: Initiate a UE registration procedure. This involves the UE attempting to connect to the network and the AMF interacting with the Nudm-SDM to retrieve subscriber data.
- Observe the Crash: Monitor the AMF logs for any errors or fatal exceptions. The crash typically occurs during the initialization or registration phase when the AMF receives the delayed smf-select-data response.
By following these steps, developers and testers can reliably reproduce the AMF crash and gain a deeper understanding of the underlying issue.
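The memory-constraint step can be scripted. The sketch below builds a `docker update` command that caps a running container's memory. The container names `amf` and `udm` and the `256m` limit are assumptions for illustration; the original report does not state the names or the limits it used.

```python
import shlex
import subprocess
from typing import List


def memory_limit_command(container: str, limit: str = "256m") -> List[str]:
    """Build a `docker update` command that caps memory for one container.

    --memory-swap is set to the same value so the container cannot fall
    back on swap beyond the limit.
    """
    return [
        "docker", "update",
        "--memory", limit,
        "--memory-swap", limit,
        container,
    ]


def apply_memory_limit(container: str, limit: str = "256m") -> None:
    """Run the command; raises CalledProcessError if docker reports failure."""
    subprocess.run(memory_limit_command(container, limit), check=True)


if __name__ == "__main__":
    # Hypothetical container names; adjust to match your deployment.
    for name in ("amf", "udm"):
        print(shlex.join(memory_limit_command(name)))
```

Printing the commands first, as the `__main__` block does, lets you review them before calling `apply_memory_limit` against a live deployment.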
Analyzing the Logs: Tracing the Root Cause
The provided logs offer valuable insights into the sequence of events leading to the AMF crash. Let's dissect the logs to pinpoint the root cause and understand the AMF's behavior:
- Warnings and Information Messages:
  - `WARNING: [suci-0-466-92-0000-0-0-0123456005] Holding NG Context`: This indicates that the AMF is holding the NG context for the specified SUCI (Subscription Concealed Identifier), the concealed form of the subscriber's permanent identifier that the UE sends during initial registration.
  - `WARNING: NAS MAC verification failed`: This warning suggests a potential issue with the integrity of the NAS (Non-Access Stratum) message, which could be due to factors such as synchronization problems or security key mismatches. However, it is likely a red herring in this case, as the crash is triggered by the delayed SBI response.
  - `INFO: RAN_UE_NGAP_ID[165] AMF_UE_NGAP_ID[165] TAC[1] CellID[0x111]`: This provides the Radio Access Network (RAN) UE NGAP ID and the AMF UE NGAP ID, along with the Tracking Area Code (TAC) and Cell ID, which identify the UE's location.
- Debug Messages:
  - A series of `DEBUG` messages trace the AMF's state transitions and interactions with other network functions. These messages are crucial for understanding the control flow and identifying the point of failure.
  - `DEBUG: amf_state_operational(): AMF_EVENT_5GMM_MESSAGE`: This indicates that the AMF state machine is processing a 5GMM (5GS Mobility Management) message.
  - `DEBUG: gmm_state_initial_context_setup(): AMF_EVENT_5GMM_MESSAGE`: This shows that the GMM state machine is in the initial context setup state and is processing a 5GMM message.
  - `DEBUG: [imsi-466920123456005] Service request`: This indicates that the UE is requesting a service.
  - `DEBUG: [imsi-466920123456005] Service reject`: The AMF is rejecting the service request, possibly due to authentication or authorization failures.
  - `DEBUG: UEContextReleaseCommand`: This message indicates that the AMF is initiating the release of the UE context.
- SBI Interaction:
  - `DEBUG: [200:GET] http://172.22.0.35:7777/nudm-sdm/v2/imsi-466920123456005/smf-select-data?plmn-id=%7B%22mcc%22%3A%22466%22%2C%22mnc%22%3A%2292%22%7D`: This log shows the AMF's GET request to the Nudm-SDM for the UE's SMF selection data. The URL-encoded plmn-id parameter decodes to {"mcc":"466","mnc":"92"}, and the 200 prefix suggests the line was logged once the HTTP response arrived.
  - `DEBUG: RECEIVED[92]`: This indicates that the AMF has received a response from the Nudm-SDM.
  - `DEBUG: {