Troubleshooting the MLflow NoCredentialsError When Saving to Local Artifacts
Understanding the MLflow NoCredentialsError Bug When Saving to Local Artifacts
This article dives into a common issue encountered by MLflow users: the `NoCredentialsError` raised when attempting to save artifacts to local storage. This bug, particularly prevalent when transitioning away from cloud-based artifact storage such as AWS S3, can be frustrating. We will explore the root causes, provide a detailed analysis of the error, and offer comprehensive solutions to resolve it, ensuring a smooth experience with local artifact storage in MLflow.
Issues Policy Acknowledgement
- [x] I have read and agree to submit bug reports in accordance with the issues policy
Where Did You Encounter This Bug?
Local machine
MLflow Version
- Client: 3.1.1
- Tracking server: 3.1.1
System Information
- OS Platform and Distribution: ghcr.io/mlflow/mlflow:v3.1.1; Client Version: v1.32.2, Kustomize Version: v5.5.0
- Python version: Python 3.10.18
- yarn version, if running the dev UI: none
The Problem: NoCredentialsError with Local Artifact Storage
The core issue reported is a `NoCredentialsError` that arises even though no AWS S3 backend is in use. This is particularly perplexing because the user intended to use local storage after encountering issues with AWS S3 artifact storage (https://github.com/mlflow/mlflow/issues/14226). The user has attempted various configurations, including setting up Kubernetes (K8s) from scratch and using community Helm charts, all resulting in the same error. The tasks are executed from Apache Airflow in a separate K8s namespace.
Deep Dive into the NoCredentialsError
The `NoCredentialsError` in MLflow typically indicates that the system is trying to access an artifact store (often S3) but cannot find the necessary credentials. This can happen even when you intend to use local storage, for a few common reasons:
- Residual Configuration: MLflow might still be configured to use S3 from a previous setup. This can occur if environment variables or MLflow configurations are not correctly updated when switching to local storage.
- Boto3 Interference: The `boto3` library, which MLflow uses for S3 interactions, might be implicitly trying to authenticate with AWS. If AWS-related environment variables (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are set, `boto3` will attempt to use them, leading to this error if they are invalid or not intended for the current setup.
- Default Artifact Root: If the `default-artifact-root` is not correctly set or is pointing to an S3 bucket, MLflow will try to use S3, triggering the error.
- Airflow Context: When running MLflow within Apache Airflow, the Airflow environment might have its own set of configurations that interfere with MLflow's intended storage settings. This is especially true if Airflow is configured to use S3 for its own tasks.
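Before changing anything, it helps to narrow down which of these applies. The following is a minimal diagnostic sketch (not part of the original report) that prints the artifact URI MLflow resolves for a fresh run and any AWS-related variables visible to the process:

```python
import os

import mlflow

# Print the artifact URI MLflow resolved for a new run; an s3:// scheme here
# means the S3 artifact repository will be used for log_artifact calls.
with mlflow.start_run() as run:
    print("artifact_uri:", run.info.artifact_uri)

# Show any AWS-related variables that boto3 could pick up from the environment.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION", "MLFLOW_S3_ENDPOINT_URL"):
    print(var, "=", os.environ.get(var))
```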
Analyzing the Kubernetes Setup
The provided Kubernetes configuration reveals that the MLflow server is set up with a PostgreSQL backend and a persistent volume claim for artifacts. Let's break down the critical parts:
```yaml
containers:
  - name: mlflow
    image: mlflow:0.0.2
    ports:
      - containerPort: 5000
    env:
      - name: POSTGRES_USER
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-user
      - name: POSTGRES_PASSWORD
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-password
      - name: POSTGRES_DB
        valueFrom:
          secretKeyRef:
            name: postgres-secret
            key: postgres-db
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", "--backend-store-uri", "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres-service:5432/$(POSTGRES_DB)", "--default-artifact-root", "mlflow-artifacts:/artifacts", "--no-serve-artifacts"]
    volumeMounts:
      - name: mlflow-artifacts
        mountPath: /mlflow/artifacts
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"
volumes:
  - name: mlflow-artifacts
    persistentVolumeClaim:
      claimName: mlflow-artifacts-pvc
```
- The `command` section shows that the MLflow server is started with a PostgreSQL backend (`--backend-store-uri`) and a default artifact root (`--default-artifact-root`) set to `mlflow-artifacts:/artifacts`. This scheme is intended for server-proxied artifact storage backed by local disk.
- The `--no-serve-artifacts` flag tells MLflow not to proxy artifact requests. This conflicts with the `mlflow-artifacts:` artifact root, which relies on the server proxying uploads and downloads; with proxying disabled, clients must resolve the artifact location on their own.
- The `volumeMounts` section mounts the persistent volume `mlflow-artifacts` to `/mlflow/artifacts` inside the container, further reinforcing the intention to use local storage.
Despite these configurations, the `NoCredentialsError` persists, indicating that something is overriding these settings or that there's an implicit S3 configuration.
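If the goal is to keep the `mlflow-artifacts:` scheme with artifacts stored on the mounted volume, one plausible fix is to enable proxied artifact serving and point the server at the mount with `--artifacts-destination`. The container command below is a sketch, not taken from the original report:

```yaml
# Hypothetical variant of the container command: --serve-artifacts enables the
# artifact proxy, and --artifacts-destination stores proxied artifacts on the PVC.
command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000",
          "--backend-store-uri", "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres-service:5432/$(POSTGRES_DB)",
          "--artifacts-destination", "/mlflow/artifacts",
          "--serve-artifacts"]
```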
Tracking Information Analysis
The provided tracking information offers crucial insights:
```
System information: Linux #1 SMP Tue Nov 5 00:21:55 UTC 2024
Python version: 3.10.18
MLflow version: 3.1.1
MLflow module location: /usr/local/lib/python3.10/site-packages/mlflow/__init__.py
Tracking URI: file:///mlruns
Registry URI: file:///mlruns
MLflow environment variables:
  MLFLOW_PORT: tcp://10.96.186.129:5000
  MLFLOW_PORT_5000_TCP: tcp://10.96.186.129:5000
  MLFLOW_PORT_5000_TCP_ADDR: 10.96.186.129
  MLFLOW_PORT_5000_TCP_PORT: 5000
  MLFLOW_PORT_5000_TCP_PROTO: tcp
  MLFLOW_SERVICE_HOST: 10.96.186.129
  MLFLOW_SERVICE_PORT: 5000
MLflow dependencies:
  Flask: 3.1.1
  alembic: 1.16.2
  boto3: 1.39.3
  botocore: 1.39.3
  docker: 7.1.0
  fastapi: 0.115.14
  graphene: 3.4.3
  gunicorn: 23.0.0
  matplotlib: 3.10.3
  mlflow-skinny: 3.1.1
  numpy: 2.2.6
  pandas: 2.3.0
  pyarrow: 20.0.0
  scikit-learn: 1.7.0
  scipy: 1.15.3
  sqlalchemy: 2.0.41
  uvicorn: 0.34.3
```
- The `Tracking URI: file:///mlruns` line indicates that the tracking URI is set to the local file system, which is correct for local storage.
- The presence of `boto3` and `botocore` in the dependencies confirms that the AWS SDK is installed, which could be a source of the problem if not configured correctly.
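A quick way to test the `boto3` side in isolation is the following sketch, assuming `boto3` is importable in the task environment:

```python
import boto3

# If this prints None, boto3 has no credentials anywhere in its resolution
# chain (env vars, shared config, instance metadata), and any MLflow code
# path that reaches the S3 artifact repository will raise NoCredentialsError.
print(boto3.Session().get_credentials())
```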
Code to Reproduce the Issue
The provided code snippet is a simplified version of an Airflow DAG that demonstrates the issue:
```python
import tempfile
from pathlib import Path

import mlflow

# Dumbed-down version of the DAG from Airflow
# MLFLOW_TRACKING_URI = "http://mlflow.mlflow.svc.cluster.local:5000"
# mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

mlflow.set_experiment("bug-reproduction-experiment")

with mlflow.start_run() as run:
    print(f"Run {run.info.run_id}")
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        sample_file = temp_path / "test_file.txt"
        with open(sample_file, "w") as f:
            f.write("Hello from MLflow!\n")
            f.write(f"Run ID: {run.info.run_id}\n")
            f.write("This is a test file upload.\n")
        mlflow.log_artifact(str(sample_file), artifact_path="outputs")
```
This code creates a temporary file and attempts to log it as an artifact using `mlflow.log_artifact`. The `NoCredentialsError` occurs during this step, specifically when MLflow tries to upload the artifact.
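To confirm that the failure really originates in the AWS credential chain rather than in MLflow itself, the logging call can be wrapped as in this sketch (continuing the reproduction snippet; note that the exception class comes from `botocore`, not MLflow):

```python
from botocore.exceptions import NoCredentialsError

try:
    mlflow.log_artifact(str(sample_file), artifact_path="outputs")
except NoCredentialsError as exc:
    # Reaching this branch proves an S3 code path was taken without credentials.
    print(f"S3 artifact repository was selected but no AWS credentials were found: {exc}")
    raise
```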
Stack Trace Analysis
The stack trace provides a clear path to the error:
```
NoCredentialsError: Unable to locate credentials
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 867 in run
...
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 1404 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/client.py", line 2448 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 639 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 176 in log_artifact
File "/home/airflow/.local/lib/python3.12/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 169 in _upload_file
File "/home/airflow/.local/lib/python3.12/site-packages/boto3/s3/inject.py", line 145 in upload_file
...
```
The key part of the stack trace is the transition from `mlflow.log_artifact` to `mlflow/store/artifact/s3_artifact_repo.py`. This clearly shows that MLflow is dispatching to the S3 artifact repository even though the intention is to use local storage. The error originates from `boto3`, confirming that the AWS SDK is trying to authenticate.
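MLflow selects the artifact repository purely from the URI scheme of the run's artifact location, so an `s3://` artifact URI is enough to land in `s3_artifact_repo.py`. A minimal illustration using MLflow's registry (an internal API, so subject to change between versions):

```python
from mlflow.store.artifact.artifact_repository_registry import get_artifact_repository

# The repository class is chosen from the URI scheme alone; no credentials
# are consulted at selection time.
print(type(get_artifact_repository("file:///mlruns/0/some-run/artifacts")).__name__)
print(type(get_artifact_repository("s3://some-bucket/artifacts")).__name__)
```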
Resolving the NoCredentialsError: A Step-by-Step Guide
To effectively resolve the `NoCredentialsError` when saving artifacts locally with MLflow, work through the following steps:
1. Explicitly Set the Artifact Location: The most reliable solution is to pin the artifact location when the experiment is created, so MLflow knows exactly where to store artifacts. Note that `mlflow.start_run` does not accept an `artifact_uri` argument; the location is a property of the experiment and is only honored at creation time:

   ```python
   import mlflow

   experiment_name = "bug-reproduction-experiment"

   # artifact_location is applied only when the experiment is first created;
   # an existing experiment keeps the location it was created with.
   if mlflow.get_experiment_by_name(experiment_name) is None:
       mlflow.create_experiment(experiment_name, artifact_location="file:///mlruns/artifacts")
   mlflow.set_experiment(experiment_name)

   with mlflow.start_run() as run:
       print(f"Run {run.info.run_id}")
       # Your code to log artifacts
   ```

   This method directly specifies the storage location, bypassing any default or residual S3 configuration.
2. Unset AWS Environment Variables: Ensure that no AWS-related environment variables are set in your environment, including `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, etc. Unsetting these variables prevents `boto3` from attempting to authenticate with AWS. In a shell environment:

   ```bash
   unset AWS_ACCESS_KEY_ID
   unset AWS_SECRET_ACCESS_KEY
   unset AWS_REGION
   ```

   In a Kubernetes environment, remove these environment variables from your pod or deployment configuration; a defensive in-task sketch follows below.
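   If you cannot control the worker environment, a hypothetical defensive option is to drop the variables inside the task itself, before any MLflow call:

   ```python
   import os

   # Remove any AWS credentials the worker environment injected, before
   # boto3's credential chain has a chance to pick them up.
   for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "AWS_REGION"):
       os.environ.pop(var, None)
   ```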
3. Verify MLflow Server Configuration: Double-check the MLflow server configuration to ensure that `--default-artifact-root` is set to a local path. In the Kubernetes deployment YAML:

   ```yaml
   command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000", "--backend-store-uri", "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@postgres-service:5432/$(POSTGRES_DB)", "--default-artifact-root", "/mlflow/artifacts", "--no-serve-artifacts"]
   ```

   Also, ensure that the volume mount for `mlflow-artifacts` is correctly configured. Keep in mind that a plain local path like `/mlflow/artifacts` is only reachable by clients that share the volume; for clients in other pods, the proxied `mlflow-artifacts:` setup shown earlier is the better fit.
is correctly configured. -
Check Airflow Environment: If running MLflow within Airflow, ensure that Airflow itself is not configured to use S3 credentials unless explicitly required. Airflow might have its own set of environment variables or configurations that need to be adjusted.
5. Set MLFLOW_S3_ENDPOINT_URL: If, despite intending to use local storage, S3-related code paths are still being triggered, setting the `MLFLOW_S3_ENDPOINT_URL` environment variable to a dummy value might help:

   ```bash
   export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000  # Or any dummy URL
   ```

   This can prevent MLflow from attempting to connect to a real S3 endpoint.
6. Review Code for Implicit S3 Calls: Carefully review your code for any implicit calls that might be triggering S3 interactions, such as custom logging functions or third-party libraries that attempt to use S3.
7. Restart the MLflow Server: After making configuration changes, restart the MLflow server so that the new settings take effect.
Example of Corrected Code
Here’s an updated version of the code snippet that pins the experiment's artifact location to the local file system:
```python
import tempfile
from pathlib import Path

import mlflow

EXPERIMENT_NAME = "bug-reproduction-experiment"

# Pin the artifact location at experiment creation time; artifact_location is
# ignored for experiments that already exist.
if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
    mlflow.create_experiment(EXPERIMENT_NAME, artifact_location="file:///mlruns/artifacts")
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run() as run:
    print(f"Run {run.info.run_id}")
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_path = Path(temp_dir)
        sample_file = temp_path / "test_file.txt"
        with open(sample_file, "w") as f:
            f.write("Hello from MLflow!\n")
            f.write(f"Run ID: {run.info.run_id}\n")
            f.write("This is a test file upload.\n")
        mlflow.log_artifact(str(sample_file), artifact_path="outputs")
```
This corrected code explicitly sets the artifact location at experiment creation, ensuring that MLflow saves artifacts to the local file system as intended.
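Continuing from the snippet above, a short check using the standard client API can confirm the artifact landed on local disk:

```python
from mlflow import MlflowClient

# The run's artifact URI should start with file://, and the logged file
# should appear under the "outputs" artifact path.
client = MlflowClient()
print(mlflow.get_run(run.info.run_id).info.artifact_uri)
for item in client.list_artifacts(run.info.run_id, path="outputs"):
    print(item.path)
```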
Conclusion
The `NoCredentialsError` in MLflow when using local artifact storage can be a challenging issue, but it is often the result of misconfiguration or residual settings from a previous cloud-based setup. By systematically addressing the potential causes, such as AWS environment variables, the MLflow server configuration, and the Airflow context, you can resolve this error and ensure that MLflow correctly uses local storage. Explicitly setting the experiment's artifact location is a robust solution that bypasses many common pitfalls. By following the steps outlined in this article, you can ensure a smooth and reliable experience with MLflow, whether you are using local or cloud-based artifact storage.
What component(s) does this bug affect?
- [x] `area/artifacts`: Artifact stores and artifact logging
- [ ] `area/build`: Build and test infrastructure for MLflow
- [ ] `area/deployments`: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] `area/docs`: MLflow documentation pages
- [ ] `area/evaluation`: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- [ ] `area/examples`: Example code
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/models`: MLmodel format, model serialization/deserialization, flavors
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/prompt`: MLflow prompt engineering features, prompt templates, and prompt management
- [ ] `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- [ ] `area/server-infra`: MLflow Tracking server backend
- [ ] `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- [x] `area/tracking`: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [x] `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] `area/windows`: Windows support
What language(s) does this bug affect?
- [ ] `language/r`: R APIs and clients
- [ ] `language/java`: Java APIs and clients
- [ ] `language/new`: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] `integrations/azure`: Azure and Azure ML integrations
- [ ] `integrations/sagemaker`: SageMaker integrations
- [ ] `integrations/databricks`: Databricks integrations