MARINES Airflow Orchestration

This website is the internal reference for the Airflow system that automates and monitors our modelling pipeline end to end. It is written for the people who built the system and for those who will extend it.

History

Until mid-2025, the entire NRT production was managed by hand:

Bash scripts were executed directly on Lucia;
There was no automatic retry when an operation failed, and failure notifications arrived as emails that required urgent action;
A cluster outage meant that the entire pipeline had to be restarted from scratch, with no automated fallback.

This worked for some time, but at larger scale, and with mandatory daily data deliveries to CMEMS, the manual approach became increasingly difficult to sustain.

What Airflow gives us

Apache Airflow is a workflow orchestrator: you describe your pipeline in Python as a directed acyclic graph (DAG) of tasks, and Airflow takes care of scheduling, executing, retrying, and monitoring every step.

Concretely, for us:

Automatic retries with our own clear policies! A download DAG that cannot reach the CMCC SFTP server at 08:00 will retry every 15 minutes for up to 14 hours before raising a flag. Previously, this required someone to notice the failure, log in, and re-run a script.

# From download_ifs_an (LUCIA/download_ifs_an.py)
from datetime import timedelta

# 56 retries x 15 minutes = 14 hours of automatic waiting
default_args = {
    'retries': 56,
    'retry_delay': timedelta(minutes=15),
}
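As a quick sanity check of the arithmetic in the comment above, here is a minimal sketch that computes the retry window from the same two values. It assumes a fixed delay between attempts (no exponential backoff) and ignores how long each attempt itself takes; the date and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the default_args shown above.
RETRIES = 56
RETRY_DELAY = timedelta(minutes=15)


def retry_window(retries: int, delay: timedelta) -> timedelta:
    """Total waiting time between attempts, assuming a fixed delay."""
    return retries * delay


def last_attempt_time(first_attempt: datetime, retries: int, delay: timedelta) -> datetime:
    """Earliest possible time of the final retry (attempt durations ignored)."""
    return first_attempt + retry_window(retries, delay)


# Hypothetical date, chosen only for illustration.
first = datetime(2025, 9, 1, 8, 0, tzinfo=timezone.utc)
window = retry_window(RETRIES, RETRY_DELAY)
final = last_attempt_time(first, RETRIES, RETRY_DELAY)

print(f"retry window: {window}, last retry no earlier than {final:%H:%M} UTC")
```

With a first attempt at 08:00 UTC, the last retry therefore happens no earlier than 22:00 UTC on the same day.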

Clearly defined dependencies! The model run task (model_lucia_run) only starts after both the IFS atmospheric forcing and the MAR spectral forecast are confirmed present on Lucia.

graph LR
    A[ifs_process] --> C[model_lucia_run]
    B[download_MAR] --> C
    C --> D[model_lucia_postprocess]
    D --> E[compress_results]
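To make the "only starts after both" rule concrete, the graph above can be replayed with the Python standard library. This is only an illustration of the dependency logic, not how Airflow implements it and not code taken from our DAGs; the task names are the ones in the graph.

```python
from graphlib import TopologicalSorter

# Maps each task to the set of tasks it waits for (edges of the graph above).
upstream = {
    "model_lucia_run": {"ifs_process", "download_MAR"},
    "model_lucia_postprocess": {"model_lucia_run"},
    "compress_results": {"model_lucia_postprocess"},
}

sorter = TopologicalSorter(upstream)
sorter.prepare()

# At the start, only tasks without upstream dependencies are ready.
first_ready = set(sorter.get_ready())

# Finishing a single upstream task does not release the model run...
sorter.done("ifs_process")
after_one = set(sorter.get_ready())

# ...but finishing the second one does.
sorter.done("download_MAR")
after_both = set(sorter.get_ready())

print(first_ready, after_one, after_both)
```

The model run only shows up as ready once both of its upstream tasks have been marked done, which is the behaviour we rely on Airflow to enforce.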

Clarity, at last! Every DAG run and every task instance is stored in the Postgres metadata database and visible on our internal web UI at https://airflow.marines-data.eu.

Failure alerts! The postprocessing DAG (model_lucia_postprocess) sends an email only if the Slurm job is still running after 17:00 UTC.
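The alert rule above boils down to a small predicate. The sketch below is a hypothetical reconstruction, not the code in model_lucia_postprocess: the function name is invented, and we assume the job state is available as a Slurm state string such as RUNNING or COMPLETED.

```python
from datetime import datetime, time, timezone

# Cutoff taken from the text above; everything else here is an assumption.
ALERT_CUTOFF_UTC = time(17, 0)


def should_send_alert(job_state: str, now: datetime) -> bool:
    """Return True only if the job is still running after the UTC cutoff.

    `now` must be timezone-aware so it can be converted to UTC safely.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    still_running = job_state == "RUNNING"
    return still_running and now_utc.time() > ALERT_CUTOFF_UTC


# Hypothetical timestamps, chosen only for illustration.
late = datetime(2025, 9, 1, 17, 30, tzinfo=timezone.utc)
early = datetime(2025, 9, 1, 16, 59, tzinfo=timezone.utc)

print(should_send_alert("RUNNING", late))    # still running after the cutoff
print(should_send_alert("RUNNING", early))   # still running, but before the cutoff
print(should_send_alert("COMPLETED", late))  # finished, so nothing to report
```

Keeping the rule as a pure function of the job state and the current time makes it easy to test without touching Slurm or the mail server.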

Lucia pipeline (primary real-time model)

Data is downloaded and processed directly on the Lucia HPC cluster. The NEMO 4.2.0 + BSFS_BIO ocean model is submitted to the Slurm queue on Lucia. Results are post-processed there and then transferred back to the Marines server. Airflow controls this pipeline from our Docker stack on the Marines server by opening SSH connections through our lucia_gateway_* connections.

Orchestration via SSH

The same trick is used throughout all of our DAGs: Airflow runs in Docker on the Marines server, but the actual work is done on Lucia, reached with our SSH keys. Our lucia_gateway_* connections act as a jump host that allows us to SSH into Lucia from the Marines server without human intervention.
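For readers who have not used a jump host before, the sketch below shows what such a connection looks like as a plain OpenSSH command. It is not taken from our DAGs, which presumably go through the Airflow connection rather than a hand-built command. The host names and the remote command are placeholders; `-J` (ProxyJump) and `BatchMode=yes` are standard OpenSSH options. The command is only built and printed, never executed.

```python
import shlex


def build_ssh_command(jump_host: str, target_host: str, remote_command: str) -> list[str]:
    """Build an ssh invocation that reaches target_host through jump_host.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is what an unattended scheduler needs (key-based login only).
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-J", jump_host,      # ProxyJump: hop through the gateway first
        target_host,
        remote_command,
    ]


# Placeholder hosts and command, for illustration only.
cmd = build_ssh_command(
    jump_host="airflow@gateway.example.org",
    target_host="marines@lucia.example.org",
    remote_command="sbatch run_model.sh",
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting mistakes when it is later handed to something like subprocess.run.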

Google Cloud pipeline (backup model)

Data is downloaded and processed locally on our Marines server. Inputs are then uploaded to Google Cloud Storage, a GCP (Google Cloud Platform) VM is started to run the model, and results are downloaded back. This pipeline also handles CMEMS product uploads.

This pipeline is run whenever there is a Lucia outage. Because we cannot miss a daily delivery to CMEMS, it serves as a backup that runs locally and on GCP.

Manual Intervention Required

For now, this pipeline is not fully automated: someone has to open the Airflow dashboard, trigger the DAG, and monitor it until completion. Making it fully automatic is a work in progress; in the meantime, it is a safety net that we can use when Lucia is down.

Daily schedule overview

All times are UTC. This is an approximate picture; the DAG reference has exact schedule_interval cron expressions.

Time (UTC)	Activity
06:00	Download CMEMS observations to Lucia (first daily run)
07:30-07:32	Download IFS analysis and forecast to Marines
08:00-08:02	Download IFS analysis and forecast to Lucia
08:15	Process IFS for NEMO (Lucia and local)
09:10	Download EFAS river data (local)
09:25-09:30	Download MAR spectral forecast (Lucia and local), download CAMS
10:30	Submit Lucia model run to Slurm
12:30	Extend CAMS forecast, copy to Lucia and local forcings
14:00	Poll Lucia Slurm job, start postprocessing
14:00	Download CMEMS observations to Lucia (second daily run)
15:45	Compress model output NetCDF files
16:00	Count files on Copernicus Marine Data Store
16:10	Download NIHWM river data (Lucia and local)
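As a reminder of how a daily UTC time maps onto a five-field cron expression of the kind passed to schedule_interval, here is a small helper. It only illustrates the format: the function name is invented, the times are the approximate ones from the table above, and the authoritative expressions remain those in the DAG reference.

```python
def daily_cron(hhmm: str) -> str:
    """Convert an 'HH:MM' time into a cron expression that fires once a day.

    Cron field order is: minute, hour, day of month, month, day of week.
    """
    hour_text, minute_text = hhmm.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"not a valid time of day: {hhmm!r}")
    return f"{minute} {hour} * * *"


# A few rows from the table above (approximate times, for illustration).
approximate_schedule = {
    "Download IFS analysis and forecast to Lucia": "08:00",
    "Process IFS for NEMO": "08:15",
    "Submit Lucia model run to Slurm": "10:30",
}

for activity, start in approximate_schedule.items():
    print(f"{daily_cron(start):<12} {activity}")
```

Airflow interprets these expressions in UTC unless a DAG is given another timezone, which matches the convention used in the table.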