Troubleshooting MLflow NoCredentialsError When Saving to Local Artifacts

by StackCamp Team

Understanding the MLflow NoCredentialsError Bug When Saving to Local Artifacts

This article dives into a common issue encountered by MLflow users: the NoCredentialsError when attempting to save artifacts to local storage. This bug, particularly prevalent when transitioning from cloud-based artifact storage like AWS S3, can be frustrating. We will explore the root causes, provide a detailed analysis of the error, and offer comprehensive solutions to resolve it, ensuring a smooth experience with local artifact storage in MLflow.

Issues Policy Acknowledgement

  • [x] I have read and agree to submit bug reports in accordance with the issues policy

Where Did You Encounter This Bug?

Local machine

MLflow Version

  • Client: 3.1.1
  • Tracking server: 3.1.1

System Information

  • OS Platform and Distribution: ghcr.io/mlflow/mlflow:v3.1.1 (Docker image); Client Version: v1.32.2, Kustomize Version: v5.5.0
  • Python version: Python 3.10.18
  • yarn version, if running the dev UI: none

The Problem: NoCredentialsError with Local Artifact Storage

The core issue reported is a NoCredentialsError that arises even when not using an AWS S3 backend. This is particularly perplexing because the user intended to use local storage after encountering issues with AWS S3 artifact storage (https://github.com/mlflow/mlflow/issues/14226). The user has attempted various configurations, including setting up Kubernetes (K8S) from scratch and using community Helm charts, all resulting in the same error. The tasks are executed from Apache Airflow in a separate K8S namespace.

Deep Dive into the NoCredentialsError

The NoCredentialsError in MLflow typically indicates that the client is trying to reach an artifact store (usually S3) but cannot find the necessary credentials. This can happen even when you intend to use local storage, for a few common reasons (a short diagnostic sketch after the list shows how to check what your process actually sees):

  1. Residual Configuration: MLflow might still be configured to use S3 from a previous setup. This can occur if environment variables or MLflow configurations are not correctly updated when switching to local storage.
  2. Boto3 Interference: The boto3 library, which MLflow uses for S3 interactions, might be implicitly trying to authenticate with AWS. If AWS-related environment variables (e.g., AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) are set, boto3 will attempt to use them, leading to this error if they are invalid or not intended for the current setup.
  3. Default Artifact Root: If the default-artifact-root is not correctly set or is pointing to an S3 bucket, MLflow will try to use S3, triggering the error.
  4. Airflow Context: When running MLflow within Apache Airflow, the Airflow environment might have its own set of configurations that interfere with MLflow's intended storage settings. This is especially true if Airflow is configured to use S3 for its own tasks.
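
Before changing anything, it helps to confirm what the client process actually sees. A minimal diagnostic sketch, assuming it runs in the same environment as the failing task:

import os

import mlflow

# Show which tracking store is in effect and which AWS credential variables
# are visible to this process
print("Tracking URI:", mlflow.get_tracking_uri())
print("AWS-related variables:", [k for k in os.environ if k.startswith("AWS_")])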

Analyzing the Kubernetes Setup

The provided Kubernetes configuration reveals that the MLflow server is set up with a PostgreSQL backend and a persistent volume claim for artifacts. Let's break down the critical parts:

containers:
  - name: mlflow
    image: mlflow:0.0.2
    ports:
    - containerPort: 5000
    env:
      - name: POSTGRES_USER
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-user
      - name: POSTGRES_PASSWORD
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-password
      - name: POSTGRES_DB
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-db
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", "--backend-store-uri", "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres-service:5432/$(POSTGRES_DB)", "--default-artifact-root", "mlflow-artifacts:/artifacts", "--no-serve-artifacts"]
    volumeMounts:
      - name: mlflow-artifacts
        mountPath: /mlflow/artifacts
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"
volumes:
  - name: mlflow-artifacts
    persistentVolumeClaim:
      claimName: mlflow-artifacts-pvc
  • The command section shows that the MLflow server is started with a PostgreSQL backend (--backend-store-uri) and a default artifact root (--default-artifact-root) set to mlflow-artifacts:/artifacts. This suggests an attempt to keep artifacts on the server side rather than in S3.
  • The --no-serve-artifacts flag disables the server's artifact proxy. Note that the mlflow-artifacts:/ scheme relies on the tracking server to proxy artifact transfers, so combining it with --no-serve-artifacts is contradictory and worth revisiting.
  • The volumeMounts section mounts the persistent volume mlflow-artifacts to /mlflow/artifacts inside the container, further reinforcing the intention to use local storage.

Despite these configurations, the NoCredentialsError persists, indicating that something is overriding these settings or that there's an implicit S3 configuration.
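
One way to see what is actually in effect is to ask the tracking server which artifact location it hands out for the experiment. A minimal check, assuming the client points at the in-cluster service URL that appears (commented out) in the reproduction code later in this article:

import mlflow

# Point the client at the tracking server and ask for the experiment's
# artifact location; an s3:// value here would explain the error
mlflow.set_tracking_uri("http://mlflow.mlflow.svc.cluster.local:5000")
experiment = mlflow.get_experiment_by_name("bug-reproduction-experiment")
if experiment is not None:
    print(experiment.artifact_location)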

Tracking Information Analysis

The provided tracking information offers crucial insights:

System information: Linux #1 SMP Tue Nov 5 00:21:55 UTC 2024
Python version: 3.10.18
MLflow version: 3.1.1
MLflow module location: /usr/local/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: file:///mlruns
Registry URI: file:///mlruns
MLflow environment variables:
  MLFLOW_PORT: tcp://10.96.186.129:5000
  MLFLOW_PORT_5000_TCP: tcp://10.96.186.129:5000
  MLFLOW_PORT_5000_TCP_ADDR: 10.96.186.129
  MLFLOW_PORT_5000_TCP_PORT: 5000
  MLFLOW_PORT_5000_TCP_PROTO: tcp
  MLFLOW_SERVICE_HOST: 10.96.186.129
  MLFLOW_SERVICE_PORT: 5000
MLflow dependencies:
  Flask: 3.1.1
  alembic: 1.16.2
  boto3: 1.39.3
  botocore: 1.39.3
  docker: 7.1.0
  fastapi: 0.115.14
  graphene: 3.4.3
  gunicorn: 23.0.0
  matplotlib: 3.10.3
  mlflow-skinny: 3.1.1
  numpy: 2.2.6
  pandas: 2.3.0
  pyarrow: 20.0.0
  scikit-learn: 1.7.0
  scipy: 1.15.3
  sqlalchemy: 2.0.41
  uvicorn: 0.34.3
  • The Tracking URI: file:///mlruns indicates that the tracking URI is set to a local file system, which is correct for local storage.
  • The presence of boto3 and botocore in the dependencies confirms that the AWS SDK is installed, which could be a source of the problem if not configured correctly (the quick check below shows whether boto3 can resolve any credentials at all).
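
If boto3 is being pulled into the artifact path, checking whether it can resolve any credentials at all narrows things down. A quick sketch, assuming it runs where the failing task runs:

import boto3

# Prints None when boto3 cannot find credentials in environment variables,
# config files, or instance metadata -- the situation that raises
# NoCredentialsError once an s3:// upload is attempted
print(boto3.Session().get_credentials())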

Code to Reproduce the Issue

The provided code snippet is a simplified version of an Airflow DAG that demonstrates the issue:

import tempfile
from pathlib import Path

import mlflow

# Dumbed down version of DAG from Airflow
#MLFLOW_TRACKING_URI = "http://mlflow.mlflow.svc.cluster.local:5000"
#mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment("bug-reproduction-experiment")

with mlflow.start_run() as run:
    print(f"Run {run.info.run_id}")

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        sample_file = temp_path / "test_file.txt"

        with open(sample_file, "w") as f:
            f.write("Hello from MLflow!\n")
            f.write(f"Run ID: {run.info.run_id}\n")
            f.write("This is a test file upload.\n")

        mlflow.log_artifact(str(sample_file), artifact_path="outputs")

This code creates a temporary file and attempts to log it as an artifact using mlflow.log_artifact. The NoCredentialsError occurs during this step, specifically when MLflow tries to upload the artifact.

Stack Trace Analysis

The stack trace provides a clear path to the error:

NoCredentialsError: Unable to locate credentials
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 867 in run
...
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 1404 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/client.py", line 2448 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 639 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 176 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 169 in _upload_file
File "/home/airflow/.local/lib/python3.12/site-packages/boto3/s3/inject.py", line 145 in upload_file
...

The key part of the stack trace is the transition from mlflow.log_artifact to mlflow.store.artifact.s3_artifact_repo.py. This clearly shows that MLflow is attempting to use the S3 artifact repository even though the intention is to use local storage. The error originates from boto3, confirming that the AWS SDK is trying to authenticate.
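
Since the artifact repository is selected from the scheme of the run's artifact URI, printing that URI from the same environment as the failing task shows what MLflow has actually been handed. A small sketch, reusing the experiment name from the reproduction code:

import mlflow

mlflow.set_experiment("bug-reproduction-experiment")

# An s3:// scheme here means uploads go through boto3 (hence NoCredentialsError);
# a file:// scheme means artifacts stay on the local filesystem
with mlflow.start_run() as run:
    print(run.info.artifact_uri)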

Resolving the NoCredentialsError: A Step-by-Step Guide

To effectively resolve the NoCredentialsError when saving artifacts locally with MLflow, consider the following steps:

  1. Explicitly Set the Artifact Location: The most reliable solution is to pin the experiment's artifact location to a local URI. A run inherits its artifact URI from its experiment, and mlflow.start_run() does not accept an artifact location directly, so the location must be set when the experiment is created:

    import mlflow

    EXPERIMENT_NAME = "bug-reproduction-experiment"

    # artifact_location can only be set when the experiment is created,
    # so create the experiment explicitly before calling set_experiment
    if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
        mlflow.create_experiment(EXPERIMENT_NAME, artifact_location="file:///mlruns/artifacts")

    mlflow.set_experiment(EXPERIMENT_NAME)

    with mlflow.start_run() as run:
        print(f"Run {run.info.run_id}")
        # Your code to log artifacts

    This method directly specifies the storage location, bypassing any default or residual S3 configurations. Note that the artifact location of an existing experiment cannot be changed through the client APIs; if the experiment was created while S3 was configured, create a fresh experiment with a local artifact_location.

  2. Unset AWS Environment Variables: Ensure that no AWS-related environment variables are set in your environment. This includes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, etc. Unsetting these variables prevents boto3 from attempting to authenticate with AWS.

    In a shell environment, you can unset these variables using:

    unset AWS_ACCESS_KEY_ID
    unset AWS_SECRET_ACCESS_KEY
    unset AWS_REGION

    In a Kubernetes environment, you should remove these environment variables from your pod or deployment configuration.

  3. Verify MLflow Configuration: Double-check the MLflow server configuration to ensure that --default-artifact-root is correctly set to a local path. In the Kubernetes deployment YAML, it should point to a local directory:

    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", "--backend-store-uri", "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres-service:5432/$(POSTGRES_DB)", "--default-artifact-root", "/mlflow/artifacts", "--no-serve-artifacts"]
    

    Also, ensure that the volume mount for mlflow-artifacts is correctly configured.

  4. Check Airflow Environment: If running MLflow within Airflow, ensure that Airflow itself is not configured to use S3 credentials unless explicitly required. Airflow might have its own set of environment variables or configurations that need to be adjusted, as sketched below.
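
    As a stopgap inside the task itself, a hedged sketch: drop the AWS credential variables the worker environment may inject before MLflow resolves the artifact store (the variable names below are the standard AWS ones; adjust to whatever your deployment actually sets):

    import os

    # Remove AWS credential variables so boto3 cannot pick them up when MLflow
    # resolves the artifact store for the run
    for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AWS_REGION"):
        os.environ.pop(var, None)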

  5. Set MLFLOW_S3_ENDPOINT_URL: If, despite intending to use local storage, S3-related code paths are being triggered, setting the MLFLOW_S3_ENDPOINT_URL environment variable to a dummy value might help:

    export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000 # Or any dummy URL
    

    This can prevent MLflow from attempting to connect to a real S3 endpoint.

  6. Review Code for Implicit S3 Calls: Carefully review your code for any implicit calls that might be triggering S3 interactions. This could include custom logging functions or third-party libraries that attempt to use S3.

  7. Restart MLflow Server: After making configuration changes, restart the MLflow server to ensure that the new settings are applied.

Example of Corrected Code

Here’s an updated version of the code snippet that pins both the tracking store and the experiment's artifact location to the local file system:

import tempfile
from pathlib import Path

import mlflow

EXPERIMENT_NAME = "bug-reproduction-experiment"
ARTIFACT_LOCATION = "file:///mlruns/artifacts"  # local path; adjust to your volume mount

# Point the tracking store at the local filesystem so no residual S3 settings apply
mlflow.set_tracking_uri("file:///mlruns")

# artifact_location can only be set at experiment creation time
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
if experiment is None:
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME, artifact_location=ARTIFACT_LOCATION)
else:
    experiment_id = experiment.experiment_id

with mlflow.start_run(experiment_id=experiment_id) as run:
    print(f"Run {run.info.run_id}")

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        sample_file = temp_path / "test_file.txt"

        with open(sample_file, "w") as f:
            f.write("Hello from MLflow!\n")
            f.write(f"Run ID: {run.info.run_id}\n")
            f.write("This is a test file upload.\n")

        mlflow.log_artifact(str(sample_file), artifact_path="outputs")

This corrected code explicitly sets the tracking URI and the experiment's artifact location, ensuring that MLflow saves artifacts to the local file system as intended.

Conclusion

The NoCredentialsError in MLflow when using local artifact storage can be a challenging issue, but it is often the result of misconfigurations or residual settings from previous cloud-based setups. By systematically addressing the potential causes, such as AWS environment variables, MLflow configurations, and the Airflow context, you can resolve this error and ensure that MLflow correctly uses local storage. Explicitly setting the experiment's artifact location is a robust solution that bypasses many common pitfalls. By following the steps outlined in this article, you can ensure a smooth and reliable experience with MLflow, whether you are using local or cloud-based artifact storage.

What component(s) does this bug affect?

  • [x] area/artifacts: Artifact stores and artifact logging
  • [] area/build: Build and test infrastructure for MLflow
  • [] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [] area/docs: MLflow documentation pages
  • [] area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • [] area/examples: Example code
  • [] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [] area/models: MLmodel format, model serialization/deserialization, flavors
  • [] area/projects: MLproject format, project running backends
  • [] area/prompt: MLflow prompt engineering features, prompt templates, and prompt management
  • [] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [] area/server-infra: MLflow Tracking server backend
  • [] area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • [x] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [x] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [] area/windows: Windows support

What language(s) does this bug affect?

  • [] language/r: R APIs and clients
  • [] language/java: Java APIs and clients
  • [] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [] integrations/azure: Azure and Azure ML integrations
  • [] integrations/sagemaker: SageMaker integrations
  • [] integrations/databricks: Databricks integrations