Kiwix Operations Week 27 2025 Routine Discussion

by StackCamp Team

Introduction

This article details the routine operational tasks for Week 27 of 2025 on the Kiwix project. Kiwix, an open-source offline reader for web content, requires regular maintenance and monitoring to run smoothly. The routine covers system upgrades, backups, Kubernetes (k8s) cluster management, statistics monitoring, Grafana dashboard checks, project-specific tasks, and security measures. It is intended as a guide for the assigned personnel who maintain the Kiwix infrastructure. Regular maintenance keeps Kiwix services stable and reliable for users worldwide who rely on offline access to educational content and other resources, and following the routine lets the team address potential issues before they affect that experience.

Nodes System Upgrades

System upgrades are critical to the health and security of the infrastructure. For Kiwix, this means regularly updating the nodes so that they run the latest software versions, including bug fixes, performance improvements, and security patches. Running apt update && apt upgrade is standard practice on Debian-based systems, but it needs to be done methodically across the infrastructure. The bastion, stats, services, storage, demo, and mirrors-qa nodes form the backbone of the Kiwix infrastructure and are upgraded systematically as part of this routine. The worker nodes (imager-worker, ondemand, and sisyphus) are handled differently: important security upgrades are applied as soon as possible, while regular updates are performed monthly to limit the impact on production.

apt update && apt upgrade
  • [ ] Run the upgrade systematically on the bastion, stats, services, storage, demo, and mirrors-qa nodes
  • [ ] Check for and apply important security upgrades on worker nodes ASAP (imager-worker, ondemand, sisyphus)

Regular (non-security) updates of the worker nodes are done separately, on a monthly basis, so as not to impact production.
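To triage the worker nodes, it can help to list only the pending upgrades that come from a security suite. The sketch below is an illustration under assumptions, not the team's actual tooling: it filters the output of apt list --upgradable for packages whose suite name ends in -security (the Debian naming convention, for example bookworm-security). The function name security_upgrades is hypothetical, and so is the SSH host name in the usage comment.

```shell
# Hypothetical helper: reads `apt list --upgradable` output on stdin and
# prints only the names of packages that come from a "-security" suite.
# Assumes Debian-style lines such as:
#   openssl/bookworm-security 3.0.11-1~deb12u2 amd64 [upgradable from: 3.0.11-1~deb12u1]
security_upgrades() {
    # Field 1 is "package/suite[,suite...]"; keep it when a suite ends in -security.
    awk '$1 ~ /\/.*-security(,|$)/ { split($1, parts, "/"); print parts[1] }'
}

# Example use (assumption: the node is reachable over SSH under this name):
#   ssh imager-worker 'apt list --upgradable 2>/dev/null' | security_upgrades
```

An empty result only means that no pending upgrade comes from a security suite; it does not replace reading the relevant security advisories.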

Backups

Reliable backups are paramount for data integrity and disaster recovery. Consistent backups protect Kiwix against data loss caused by hardware failures, software issues, or human error, and up-to-date backups allow a quick restoration after a system failure or data corruption. This section focuses on verifying that all Borg repositories are being updated regularly. BorgBase is a hosting service for Borg repositories, and monitoring these repositories is a crucial part of the routine. Borg is used for its deduplication capabilities, which save storage space and bandwidth. Monitoring the repositories involves checking the time of the last successful backup, the size of the backups, and any error logs. The integrity of backups matters as much as their existence: regular checks should confirm that backups are not only being created but are also restorable.
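The "last successful backup time" check comes down to comparing a timestamp against a maximum acceptable age. The sketch below shows only that comparison. The function name backup_is_stale and the 26-hour threshold in the usage comment are assumptions for illustration, and how the timestamp is obtained (for example from the BorgBase web interface) is left to the operator.

```shell
# Hypothetical helper: succeeds (exit 0) when the last backup is older than
# the allowed age. All arguments are plain integers:
#   $1 = end time of the last archive, in seconds since the Unix epoch
#   $2 = current time, in seconds since the Unix epoch
#   $3 = maximum acceptable age, in hours
backup_is_stale() {
    last=$1 now=$2 max_hours=$3
    age=$(( now - last ))
    [ "$age" -gt $(( max_hours * 3600 )) ]
}

# Example use (assumptions: GNU date, and a placeholder timestamp copied by
# hand from the repository's most recent archive):
#   last=$(date -d '2025-06-30 02:00:00' +%s)
#   backup_is_stale "$last" "$(date +%s)" 26 && echo "backup is stale"
```

A threshold slightly above the backup interval (26 hours for a daily job, in this example) avoids false alarms when a run starts a little late.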

Kubernetes (k8s) Cluster

Managing the Kubernetes (k8s) cluster is essential for the smooth operation of Kiwix services. Kubernetes provides the orchestration needed to deploy, scale, and manage containerized applications, and a well-maintained cluster gives Kiwix applications high availability and efficient resource utilization. This section outlines the checks for maintaining the cluster's health. Regular monitoring of Pod errors and restarts helps identify issues before they escalate. Pods in a CrashLoopBackOff state are failing repeatedly, which can be due to application errors, misconfiguration, or resource constraints. Pod restart counts indicate how stable the applications in the cluster are, and a high count may point to an underlying problem that needs to be addressed. Kubernetes upgrades bring new features, performance improvements, and security patches. The routine is to check whether an upgrade is available and to apply it if applicable and possible.

  • [ ] Check for Pods in Error or CrashLoopBackOff state
k get pods -A -o wide | grep -E 'Error|Crash'
  • [ ] Check Pod restarts
k get pods -A -o wide | pyp -i 'print("\n".join([line for line in l if re.split(r"\s+", line)[4] != "0"]))'
  • [ ] Check if k8s should/could be upgraded
curl -s -H "X-Auth-Token: $SCW_SECRET_KEY" https://api.scaleway.com/k8s/v1/regions/fr-par/clusters/$KIWIX_PROD_CLUSTER | jq ".version,.upgrade_available"
curl -s -H "X-Auth-Token: $SCW_SECRET_KEY" https://api.scaleway.com/k8s/v1/regions/fr-par/versions | jq ".versions[].name"
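The commands above appear to assume that k is a shell alias for kubectl and that pyp is installed. Where pyp is not available, the restart check can be approximated with awk. The sketch below is an alternative under that assumption, not the team's command; the function name restarted_pods is hypothetical.

```shell
# Hypothetical alternative to the pyp filter: reads the output of
# `kubectl get pods -A -o wide` on stdin and prints "namespace/pod restarts"
# for every pod whose RESTARTS count is not 0.
# With -A the columns are NAMESPACE NAME READY STATUS RESTARTS ..., so
# RESTARTS is field 5; a value like "12 (2m ago)" still starts with the count.
restarted_pods() {
    # NR > 1 skips the header line.
    awk 'NR > 1 && $5 != "0" { print $1 "/" $2, $5 }'
}

# Example use:
#   kubectl get pods -A -o wide | restarted_pods
```

Unlike the pyp version, this drops the header line and prints only the namespace, Pod name, and restart count.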

Stats

Monitoring statistics is crucial for understanding how users interact with Kiwix and for identifying areas for improvement. This section focuses on Matomo, the web analytics platform used to track Kiwix usage. Ensuring that download.kiwix.org stats are being recorded lets us follow download patterns and user behavior, which helps optimize content delivery and user experience. Matomo provides detailed insights into user engagement, traffic sources, and popular content, showing which content is most popular, which platforms are being used, and where users are located. Analyzing these stats supports informed decisions about content strategy, platform improvements, and marketing efforts. Regularly checking whether Matomo should be upgraded keeps the analytics platform current with its latest features and security fixes, which is essential for accurate data collection and analysis.

Matomo - stats.kiwix.org

  • [ ] Ensure download.kiwix.org stats are being recorded
  • [ ] Check whether Matomo should be upgraded
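Deciding whether an upgrade is due amounts to comparing the installed version with the latest released one. The sketch below shows one way to make that comparison in shell. It is an illustration only: the function name version_lt is hypothetical, both version numbers are placeholders supplied by the operator, and it assumes a sort that supports -V (as GNU coreutils does).

```shell
# Hypothetical helper: succeeds (exit 0) when version $1 is strictly older
# than version $2. Relies on `sort -V` (version sort); plain dotted versions
# such as 5.3.1 are assumed.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use; both version numbers are placeholders typed in by the operator:
#   version_lt "5.3.1" "5.3.2" && echo "upgrade available"
```

Version sort matters here because a plain string comparison would rank 5.10.0 before 5.9.0.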

Grafana

Grafana is a data visualization tool that provides insights into the performance and health of the Kiwix infrastructure. Its dashboards offer a near real-time view of key metrics, which allows issues to be identified and resolved quickly. This section outlines the dashboards that need to be checked regularly. Disk usage monitoring prevents storage-related problems: an unexpected usage spike can indicate an issue such as log file growth or data corruption. The alert list shows whether any critical issue is waiting to be addressed; Grafana alerts can be configured to notify the team of problems such as high CPU usage or network latency. The Zimfarm dashboard covers the Zimfarm infrastructure, which is responsible for generating ZIM files, so confirming that it looks normal is vital for content availability. The Jobs durations dashboard tracks the execution time of jobs within the Kiwix ecosystem, where abnormal durations can indicate performance bottlenecks or system issues. Cluster resource consumption shows whether the Kubernetes cluster is operating efficiently, since high resource utilization can lead to performance degradation and instability. Finally, checking the mdadm RAID arrays confirms that data redundancy is intact: RAID arrays protect against disk failures, and monitoring their status is essential for data integrity.
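When a RAID dashboard looks suspicious, the state of the arrays can be cross-checked directly on the node. The sketch below is an assumption about how that might be done, not part of the documented routine: it scans /proc/mdstat-style text for a member map containing an underscore (for example [U_]), which marks a missing or failed member. The function name degraded_arrays is hypothetical.

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints the
# name of every md array whose member map (e.g. [UU]) contains "_",
# which marks a missing or failed member.
degraded_arrays() {
    awk '
        # Remember the name of the array currently being described.
        /^md[0-9a-z_]* *:/ { name = $1 }
        # A member map with an underscore means the array is degraded.
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    '
}

# Example use on a node with software RAID:
#   degraded_arrays < /proc/mdstat
```

No output means every array reports all of its members as up.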

Projects

This section covers project-specific tasks that are essential for the overall health and progress of Kiwix. UptimeRobot provides external monitoring and sends alerts when services are down, so reviewing its alerts confirms that Kiwix services are available and responsive. The zimit backlog should stay at a reasonable size: a large backlog can indicate delays in content updates and may require attention. Cloud Code Signing Certificate usage needs to be monitored to make sure enough signings remain, because running out of them can disrupt the release process. Pull requests (PRs) awaiting review should be processed promptly, since timely code reviews help maintain code quality and keep development moving.

Security

Security is a top priority for Kiwix, and this section focuses on measures to protect the infrastructure and data. Dependabot automatically creates PRs to update dependencies, and these PRs often include security fixes. Analyzing and merging them regularly keeps our dependencies up to date with the latest security patches and reduces the risk of a known vulnerability being exploited. Security is an ongoing process, and this kind of proactive maintenance protects the Kiwix project and its users and helps maintain the trust of users and stakeholders.

Note: This is an automatic reminder intended for the assignee(s).