Operations Runbook
How to restart safely
Postgres data lives on a Docker volume (postgres-db-volume) and survives container restarts, so DAG run history, Variables, Connections, and XComs are not lost.
After bringing the stack up, check that all services are healthy:
The airflow-init service runs the DB migrations again on every start (harmless when there are no new migrations) and exits with code 0 before the other services become fully available.
Wait for airflow-init to complete
The webserver and scheduler depend on airflow-init completing successfully (condition: service_completed_successfully). If airflow-init fails, the other services will not start. Check its logs:
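A minimal sketch of the health check described above, assuming the stack is managed with Docker Compose v2 and that the script runs from the directory containing docker-compose.yml. It parses the output of `docker compose ps -a --format json` and lists services that need a closer look. The helper names (`parse_compose_ps`, `services_needing_attention`, `compose_ps_output`) are hypothetical, and the JSON field names (`Service`, `State`, `Health`, `ExitCode`) are what recent Compose versions emit, so check them against your installed version.

```python
import json
import subprocess


def parse_compose_ps(raw: str) -> list:
    """Parse `docker compose ps --format json` output.

    Depending on the Compose version, the output is either a single JSON
    array or one JSON object per line; both shapes are handled.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def services_needing_attention(containers: list, init_service: str = "airflow-init") -> list:
    """Return service names that are not in their expected state."""
    flagged = []
    for c in containers:
        service = c.get("Service", "?")
        if service == init_service:
            # The init container is expected to have exited with code 0.
            if c.get("State") != "exited" or c.get("ExitCode") != 0:
                flagged.append(service)
        elif c.get("State") != "running" or c.get("Health") not in ("", "healthy", None):
            # Long-running services should be running and, if they define
            # a healthcheck, report "healthy" (not "starting" or "unhealthy").
            flagged.append(service)
    return flagged


def compose_ps_output() -> str:
    """Run `docker compose ps`; -a includes the exited init container."""
    return subprocess.run(
        ["docker", "compose", "ps", "-a", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
```

Calling `services_needing_attention(parse_compose_ps(compose_ps_output()))` on the Docker host returns an empty list when everything is in its expected state. For any service it names, `docker compose logs <service>` (for example `docker compose logs airflow-init`) shows the container output.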
Rebuilding the custom image
When the Dockerfile is modified (e.g., new Python package, system dependency):
Pass --no-cache to the build so that Docker rebuilds every layer instead of reusing cached ones.
After the rebuild, verify that the new package is available:
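One way to do this check, sketched under the assumption that you can run a short Python snippet inside the rebuilt container (for example through `docker compose exec`). It uses the standard-library `importlib.metadata` to look up installed distributions. The function names are hypothetical, and the lookup uses the distribution name as given to pip, which can differ from the import name.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report(distributions):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in distributions}
```

Printing `report([...])` with the packages you just added to the Dockerfile shows at a glance whether each one made it into the image and at which version.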
Checking the daily schedule window
To confirm that today's pipeline is on track, check these in order in the Airflow UI:
- By 08:00 UTC: download_ifs_an, download_ifs_an00, download_ifs_fc should be green or yellow (retrying). If all three are red before 10:00 UTC, the IFS source may be unavailable.
- By 10:30 UTC: ifs_process (both Lucia and local) should be green.
- By 14:00 UTC: model_lucia_run should be green (model submitted to Slurm).
- By 18:00 UTC: model_lucia_postprocess should be green (model finished, files on marines server).
- By 16:00 UTC: count_files_MDS result tells you whether today's CMEMS products are published.
If any DAG is stuck in the retrying state at an unexpected hour, check the task logs in the UI to see what specific error is repeating.
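The checklist above can be expressed as a small lookup, sketched here for illustration. The deadlines and names are copied from the list. The state strings (`success` for green, `up_for_retry` for yellow) are an assumption about how UI colours map to Airflow states, and how the `states` dictionary gets populated is left open. count_files_MDS is omitted because its result, not its colour, is what matters.

```python
from datetime import datetime, time, timezone

# (deadline in UTC, names to check, states accepted at that deadline)
CHECKPOINTS = [
    (time(8, 0), ("download_ifs_an", "download_ifs_an00", "download_ifs_fc"),
     {"success", "up_for_retry"}),
    (time(10, 30), ("ifs_process",), {"success"}),
    (time(14, 0), ("model_lucia_run",), {"success"}),
    (time(18, 0), ("model_lucia_postprocess",), {"success"}),
]


def overdue(states: dict, now: datetime) -> list:
    """Return the names whose deadline has passed without an accepted state.

    `states` maps a name to its current state string; a missing entry
    counts as not accepted. `now` must be a timezone-aware datetime.
    """
    now_utc = now.astimezone(timezone.utc).time()
    late = []
    for deadline, names, accepted in CHECKPOINTS:
        if now_utc < deadline:
            continue  # this checkpoint is not due yet
        late.extend(n for n in names if states.get(n) not in accepted)
    return late
```

An empty result means every checkpoint that is already due looks fine; anything returned is worth opening in the UI.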
Manually clearing and re-running a failed task
When a task fails and you want to re-run it after fixing the underlying problem:
- Open the Airflow UI.
- Go to the DAG, click on the failed DAG run.
- Click the failed task (red box).
- Click Clear and confirm. Airflow will re-queue the task.
To clear and re-run all tasks from a specific point forward:
- Click the task you want to restart from.
- Click Clear and enable the "Downstream" checkbox.
Clearing compress_results
If compress_results was interrupted mid-compression, check for .tmp orphan files before re-running:
Delete any orphans, then re-run by clearing the compress_files task.
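A minimal sketch of the orphan check, assuming the partial files carry a `.tmp` suffix and sit somewhere under the output directory. The function name is hypothetical. It only lists files and their sizes and does not delete anything.

```python
from pathlib import Path


def find_tmp_orphans(root):
    """Return (path, size_in_bytes) for every *.tmp file under `root`, sorted by path."""
    return [
        (p, p.stat().st_size)
        for p in sorted(Path(root).rglob("*.tmp"))
        if p.is_file()
    ]
```

Point it at the model output directory (for example `/mnt/md0/NRT_V2025/out`, if that is where compression runs) and review the list before deleting anything.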
Recovering from a lost submitted_job Variable
If model_lucia_postprocess cannot find the submitted_job Variable (e.g., because the Variable was accidentally deleted, or model_lucia_run did not complete successfully):
- Log into Lucia via the gateway:
- Note the Slurm job ID from the queue listing.
- In the Airflow UI, set the Variable manually: Admin > Variables > New with key submitted_job and value = the Slurm job ID as an integer.
- Clear the get_job_number task in model_lucia_postprocess to re-run from that point.
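If the queue listing is long, the job ID can be pulled out programmatically. The sketch below assumes the listing is the default `squeue` table, where the job ID is the first column. The function name is hypothetical. Array-job entries such as `123_4` are skipped because the Variable is expected to hold a plain integer.

```python
def parse_squeue_job_ids(squeue_output: str) -> list:
    """Extract plain integer job IDs from the first column of `squeue` output.

    The header line and any non-numeric IDs (e.g. array tasks "123_4")
    are ignored.
    """
    ids = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdecimal():
            ids.append(int(fields[0]))
    return ids
```

As an alternative to the UI, the Airflow CLI command `airflow variables set submitted_job <job id>` sets the same Variable. Whether the DAG accepts the value stored that way is worth confirming before relying on it.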
Scheduled restart
There is no automatic scheduled restart currently configured. If the stack is restarted manually, check the Airflow UI for any DAG runs that were in-flight at the time of the restart. Tasks that were running when the stack went down will be marked failed by Airflow after it recovers. You may need to clear and re-run them.
Backups
Airflow metadata (Postgres)
The metadata database is in the postgres-db-volume Docker volume. To back it up:
This includes all DAG run history, Variables, Connections, and XComs.
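One possible way to script such a backup is a logical dump with `pg_dump`, sketched below. The Compose service name `postgres` and the `airflow` user and database names are assumptions (they match the defaults in the reference Airflow docker-compose.yaml) and may differ in this deployment. The function names and the file naming scheme are hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def backup_command(service="postgres", user="airflow", database="airflow"):
    """Build the argv that runs pg_dump inside the Postgres service container.

    -T disables pseudo-TTY allocation so the dump written to stdout is
    passed through unmodified.
    """
    return ["docker", "compose", "exec", "-T", service,
            "pg_dump", "-U", user, database]


def backup_path(directory, now=None):
    """Return a timestamped (UTC) target path for the dump file."""
    now = now or datetime.now(timezone.utc)
    return Path(directory) / f"airflow-metadata-{now:%Y%m%dT%H%M%SZ}.sql"


def run_backup(directory):
    """Stream the pg_dump output into a new file and return its path."""
    target = backup_path(directory)
    with open(target, "wb") as fh:
        subprocess.run(backup_command(), stdout=fh, check=True)
    return target
```

`run_backup` raises `subprocess.CalledProcessError` if the dump fails, in which case the partially written file should be discarded.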
Fernet key
The Fernet key is in /home/airflow/.env. Without it, encrypted Connection passwords stored in Postgres are unreadable. Back up this file securely and separately from the database.
Model data
Model output on /mnt/md0/NRT_V2025/out/ is not automatically backed up by Airflow. The GoogleCloud_backup_run DAG handles long-term model output archiving to GCS.
Known issues
Root containers
All service containers run as user: "0:0" (root). This is a temporary workaround for a file-permission issue on the /mnt/md0 mount. The TODO comment in docker-compose.yml tracks this. The fix requires setting the correct ownership on /mnt/md0 so that the standard AIRFLOW_UID=50000 user can write to it, then removing the user: "0:0" override.
numpy.ndarray size changed RuntimeWarning
A RuntimeWarning reading "numpy.ndarray size changed, may indicate binary incompatibility" in task logs or CLI output is cosmetic and does not affect computation results. It is caused by a mismatch between the numpy ABI version an extension was compiled against and the numpy version loaded in the same environment. It can be silenced by rebuilding the image with consistent numpy and extension versions.
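Until the image is rebuilt, the warning can also be filtered in Python. This is a different technique from the rebuild described above, shown as a sketch only: it hides the message rather than fixing the version mismatch, and the filter has to be installed before the affected extension is imported. The function name is hypothetical.

```python
import warnings

# Regular expression matched against the start of the warning message.
NUMPY_ABI_MESSAGE = r"numpy\.ndarray size changed"


def silence_numpy_abi_warning():
    """Ignore only the numpy ABI-size RuntimeWarning; other warnings still show."""
    warnings.filterwarnings(
        "ignore", message=NUMPY_ABI_MESSAGE, category=RuntimeWarning
    )
```

Because the filter matches on the message text and category, unrelated RuntimeWarnings remain visible.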
Orphan .tmp files from compress_results
An interrupted compress_results run leaves .tmp partial files in the NRT_V2025 output directory. The original NetCDF file may be gone if shutil.move already ran. If the original is missing, the only recovery path is re-transferring it from Lucia's GPFS.
Cleanup command:
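A cautious cleanup could look like the sketch below. It assumes each partial file is named after its original plus a `.tmp` suffix (for example `foo.nc.tmp` for `foo.nc`), which is a guess about the naming scheme, and it defaults to a dry run so that nothing is deleted until the report has been reviewed. The function name is hypothetical.

```python
from pathlib import Path


def remove_tmp_orphans(root, dry_run=True):
    """Report, and optionally delete, *.tmp files under `root`.

    Returns (tmp_path, original_exists) pairs. With dry_run=True (the
    default) no file is removed.
    """
    results = []
    for tmp in sorted(Path(root).rglob("*.tmp")):
        if not tmp.is_file():
            continue
        # Assumed naming: strip the trailing ".tmp" to get the original.
        original = tmp.with_suffix("")
        results.append((tmp, original.exists()))
        if not dry_run:
            tmp.unlink()
    return results
```

Entries reported with `original_exists` set to `False` are the ones where the original NetCDF file may need to be re-transferred from Lucia's GPFS before the task is cleared.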