Custom Marines Docker Image
We do not run the stock apache/airflow:2.10.5 image directly. Instead, the compose stack builds a custom image on top of it from the Dockerfile in /home/airflow/.
Why a custom image?
The DAG code (Python scripts in dags/) imports libraries that are not in the base Airflow image: netCDF4, copernicusmarine, cdsapi, and others. Installing them with pip on every container start (via _PIP_ADDITIONAL_REQUIREMENTS) slows down startup and can pull different package versions from one start to the next. Building a single image that already contains everything avoids both problems.
In addition, some of our postprocessing scripts are written in Julia, and the data pipeline that processes CMEMS outputs uses the Google Cloud CLI (gcloud). Both need to be installed at image build time.
Dockerfile walkthrough
FROM apache/airflow:2.10.5
USER root
RUN apt-get update && apt-get install -y \
    libhdf5-dev libnetcdf-dev libnetcdff-dev \
    nco wget bc liburi-perl ksh \
    build-essential gfortran \
    apt-transport-https ca-certificates gnupg curl ncftp
USER airflow
RUN pip install --no-cache-dir \
    "pandas>=2.1.2,<2.2" netCDF4 copernicusmarine cdsapi
USER root
RUN wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.12/julia-1.12.3-linux-x86_64.tar.gz \
    && tar xzf julia-1.12.3-linux-x86_64.tar.gz && rm julia-1.12.3-linux-x86_64.tar.gz \
    && mv julia-1.12.3 /opt/ && ln -s /opt/julia-1.12.3/bin/julia /usr/local/bin/julia \
    && julia -e 'using Pkg;Pkg.add(["ArgParse","Glob","Dates","Statistics","CSV","DataFrames","NetCDF","NCDatasets","XLSX"]);'
USER root
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
    | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
    | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
    && apt-get update && apt-get install -y google-cloud-cli \
    && rm -rf /var/lib/apt/lists/*
Layer by layer
System packages (first apt-get)
| Package | Used by |
|---|---|
| libhdf5-dev, libnetcdf-dev, libnetcdff-dev | Native HDF5/NetCDF libraries required by the netCDF4 Python package and the Julia NetCDF/NCDatasets packages |
| nco | NetCDF Operators command-line tools (used in some shell scripts) |
| wget | Used in DAG bash scripts to download files (e.g. the EFAS tarball) and in the Dockerfile to fetch Julia |
| bc | Used in bash scripts for arithmetic |
| liburi-perl, ksh | Used by legacy shell scripts |
| build-essential, gfortran | C/Fortran compilers needed to build native extensions |
| ncftp | FTP client used in some download scripts |
| curl | Used for health checks and in the Google Cloud CLI installer |
| apt-transport-https, ca-certificates, gnupg | Needed to add the Google Cloud apt repository (HTTPS transport, TLS certificates, and gpg for the signing key) |
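To check that these command-line tools actually ended up in the image, a small smoke test can be run inside a container. The sketch below is an illustration and not part of the project: check_tools is a hypothetical helper name, and it only verifies that each command is on PATH, not that it works.

```shell
#!/bin/sh
# Report every command from the argument list that is not on PATH.
# Returns 0 when all commands are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Inside a running container it could be called as `check_tools ncks wget bc ksh gfortran ncftp curl julia gcloud` (ncks is one of the NCO binaries). A non-zero exit status means at least one tool is missing. The Python side can be checked in the same spirit with `python -c "import netCDF4, copernicusmarine, cdsapi"`.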
Python packages
| Package | Version constraint | Used by |
|---|---|---|
| pandas | >=2.1.2,<2.2 | Data manipulation in DAG Python code |
| netCDF4 | latest | Reading and writing NetCDF files in compress_results and download_CAMS |
| copernicusmarine | latest | Copernicus Marine toolbox, used in count_files_MDS and CMEMS downloads |
| cdsapi | latest | Copernicus Climate and Atmosphere Data Store (CDS/ADS) API client, used in download_CAMS |
Julia 1.12.3
Julia is installed system-wide under /opt/julia-1.12.3/ with a symlink at /usr/local/bin/julia. The packages pre-installed at build time are:
| Julia package | Purpose |
|---|---|
| ArgParse | Command-line argument parsing for Julia scripts |
| Glob | File globbing |
| Dates, Statistics | Julia standard library modules (date handling and basic statistics) |
| CSV, DataFrames | Tabular data (river runoff processing) |
| NetCDF, NCDatasets | Reading and writing NetCDF files |
| XLSX | Excel file handling |
Julia is used in two places:
- The download_efas_local_preprocess DAG runs julia rnf_oper_v2025.jl to convert EFAS discharge data into a 10-day river runoff forcing file.
- The GoogleCloud_backup_run DAG runs julia cmems.jl to upload simulation outputs to the Copernicus Marine Data Store.
Google Cloud CLI
gcloud is installed from the official Google Cloud apt repository. It is used by the shell scripts in GoogleCloud_backup_run (gcp_upload_to_google_storage.sh, gcp_start_large_VM.sh, etc.) to interact with Google Cloud Storage and Compute Engine.
Rebuilding the image
When you modify the Dockerfile, rebuild without cache to avoid stale layers:
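A minimal sketch of such a rebuild, assuming Docker Compose v2 and a compose file located next to the Dockerfile in /home/airflow/. The names rebuild_image, run, COMPOSE_DIR and DRY_RUN are hypothetical helpers introduced here for illustration; only the two docker compose commands matter:

```shell
#!/bin/sh
# Assumption: the compose file sits next to the Dockerfile in /home/airflow/.
COMPOSE_DIR="${COMPOSE_DIR:-/home/airflow}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_image() {
    (
        # In dry-run mode nothing is executed, so the directory need not exist.
        [ "${DRY_RUN:-0}" = "1" ] || cd "$COMPOSE_DIR" || exit 1
        # Rebuild every layer from scratch, ignoring the build cache,
        # then recreate the containers from the freshly built image.
        run docker compose build --no-cache &&
            run docker compose up -d
    )
}
```

Calling `rebuild_image` performs the rebuild; `DRY_RUN=1 rebuild_image` only prints the two commands so they can be reviewed first.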
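The --no-cache flag matters because Docker otherwise reuses any cached layer whose instruction is unchanged. The unpinned pip and apt installs would then keep the versions from the previous build instead of picking up newer releases.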