Custom Marines Docker Image

We do not use the stock apache/airflow:2.10.5 image. Instead, the compose stack builds a custom image from the Dockerfile in /home/airflow/.


Why a custom image?

The DAG code (Python scripts in dags/) imports libraries that are not in the base Airflow image: netCDF4, copernicusmarine, cdsapi, and others. Installing them with pip on every container start (via _PIP_ADDITIONAL_REQUIREMENTS) slows down startup and makes every start depend on network access. Building a single image that already contains everything avoids both problems.

In addition, some of our postprocessing scripts are written in Julia, and the data pipeline that processes CMEMS outputs uses the Google Cloud CLI (gcloud). Both need to be installed at image build time.


Dockerfile walkthrough

Dockerfile
FROM apache/airflow:2.10.5

USER root
RUN apt-get update && apt-get install -y \
    libhdf5-dev libnetcdf-dev libnetcdff-dev \
    nco wget bc liburi-perl ksh \
    build-essential gfortran \
    apt-transport-https ca-certificates gnupg curl ncftp

USER airflow
RUN pip install --no-cache-dir \
    "pandas>=2.1.2,<2.2" netCDF4 copernicusmarine cdsapi

USER root
RUN wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.12/julia-1.12.3-linux-x86_64.tar.gz \
 && tar xzf julia-1.12.3-linux-x86_64.tar.gz && rm julia-1.12.3-linux-x86_64.tar.gz \
 && mv julia-1.12.3 /opt/ && ln -s /opt/julia-1.12.3/bin/julia /usr/local/bin/julia \
 && julia -e 'using Pkg;Pkg.add(["ArgParse","Glob","Dates","Statistics","CSV","DataFrames","NetCDF","NCDatasets","XLSX"]);'

USER root
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
        | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg \
 && echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] \
        https://packages.cloud.google.com/apt cloud-sdk main" \
        | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
 && apt-get update && apt-get install -y google-cloud-cli \
 && rm -rf /var/lib/apt/lists/*

Layer by layer

System packages (first apt-get)

| Package | Used by |
| --- | --- |
| libhdf5-dev, libnetcdf-dev, libnetcdff-dev | Native HDF5/NetCDF libraries required by the netCDF4 Python package and the Julia NetCDF/NCDatasets packages |
| nco | NetCDF Operators command-line tools (used in some shell scripts) |
| wget | Used in DAG bash scripts to download files (e.g. EFAS tarball) |
| bc | Used in bash scripts for arithmetic |
| liburi-perl, ksh | Used by legacy shell scripts |
| build-essential, gfortran | C/Fortran compilers needed to build native extensions |
| ncftp | FTP client used in some download scripts |
| curl | Used for health checks and in the Google Cloud CLI installer |

Python packages

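| Package | Version constraint | Used by |
| --- | --- | --- |
| pandas | >=2.1.2,<2.2 | Data manipulation in DAG Python code |
| netCDF4 | unpinned (latest at build time) | Reading and writing NetCDF files in compress_results and download_CAMS |
| copernicusmarine | unpinned (latest at build time) | Copernicus Marine toolbox, used in count_files_MDS and CMEMS downloads |
| cdsapi | unpinned (latest at build time) | Copernicus Climate Data Store (CDS) API client, which also works with the Atmosphere Data Store; used in download_CAMS |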
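
Because three of the four packages are unpinned, two builds made at different times can end up with different versions. The sketch below is one possible way to record what a given build actually installed; the helper name pinned_versions is hypothetical and not part of the project. It filters `pip freeze` output down to the packages of interest:

```shell
# Keep only the `pip freeze` lines for the given package names.
# Matching is case-insensitive because name casing can differ (netCDF4 vs netcdf4).
# pinned_versions is a hypothetical helper, not part of the project.
pinned_versions() {
  pattern=$(printf '%s|' "$@")      # e.g. "pandas|netCDF4|"
  grep -i -E "^(${pattern%|})=="    # strip the trailing "|" before matching
}

# Only query pip when it is available (e.g. inside the container).
if command -v pip >/dev/null 2>&1; then
  pip freeze | pinned_versions pandas netCDF4 copernicusmarine cdsapi \
    || echo "none of the listed packages are installed"
fi
```

The resulting `name==version` lines could be copied into the pip install step of the Dockerfile if reproducible rebuilds are wanted.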

Julia 1.12.3

Julia is installed system-wide under /opt/julia-1.12.3/ with a symlink at /usr/local/bin/julia. The packages pre-installed at build time are:

| Julia package | Purpose |
| --- | --- |
| ArgParse | Command-line argument parsing for Julia scripts |
| Glob | File globbing |
| Dates, Statistics | Julia standard-library modules (date handling, basic statistics) |
| CSV, DataFrames | Tabular data (river runoff processing) |
| NetCDF, NCDatasets | Reading and writing NetCDF files |
| XLSX | Excel file handling |

Julia is used in two places:

  1. The download_efas_local_preprocess DAG runs julia rnf_oper_v2025.jl to convert EFAS discharge data into a 10-day river runoff forcing file.
  2. The GoogleCloud_backup_run DAG runs julia cmems.jl to upload simulation outputs to the Copernicus Marine Data Store.
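
Since both DAGs depend on the pre-installed Julia packages, a quick load test can catch a broken build before a DAG run does. The following is a minimal sketch; the helper name julia_using_stmt is hypothetical. By default Julia keeps installed packages in a per-user depot, so it is worth running the check as the same user that executes the DAG tasks:

```shell
# Build a Julia `using` statement from a list of package names.
# The function body runs in a subshell so the IFS change stays local.
# julia_using_stmt is a hypothetical helper, not part of the project.
julia_using_stmt() (
  IFS=,
  printf 'using %s\n' "$*"   # "$*" joins the arguments with the first IFS character
)

# Try to load every package the Dockerfile installs.
if command -v julia >/dev/null 2>&1; then
  if julia -e "$(julia_using_stmt ArgParse Glob Dates Statistics CSV DataFrames NetCDF NCDatasets XLSX)"; then
    echo "all Julia packages load"
  else
    echo "some Julia packages failed to load"
  fi
else
  echo "julia not found on PATH"
fi
```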

Google Cloud CLI

gcloud is installed from the official Google Cloud apt repository. It is used by the shell scripts in GoogleCloud_backup_run (gcp_upload_to_google_storage.sh, gcp_start_large_VM.sh, etc.) to interact with Google Cloud Storage and Compute Engine.


Rebuilding the image

When you modify the Dockerfile, or want newer versions of the unpinned packages, rebuild without the layer cache:

cd /home/airflow
docker compose build --no-cache
docker compose up -d

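The --no-cache flag matters because Docker reuses a cached layer whenever a RUN instruction and all the layers before it are unchanged. A cached pip install or apt-get install layer therefore keeps the package versions from the previous build, even when newer releases exist.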

Verify the build

After rebuilding, you can enter a worker container to check that packages are installed:

docker exec -it airflow-airflow-worker-1 /bin/bash
python -c "import netCDF4; print(netCDF4.__version__)"
julia --version
gcloud --version
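
The manual checks above can also be scripted. The sketch below reports which of the expected command-line tools are missing from PATH; the helper name missing_tools and the exact tool list are assumptions chosen for illustration (ncks is one of the NCO operators installed by the nco package):

```shell
# Print every command from the arguments that is not on PATH, one per line.
# missing_tools is a hypothetical helper, not part of the project.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools the image is expected to provide (illustrative list).
missing=$(missing_tools python julia gcloud ncks wget bc ksh ncftp curl)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing
else
  echo "all expected tools found"
fi
```

Saved to a file (say check_tools.sh, a hypothetical name), it can be run inside the worker without an interactive shell: docker exec -i airflow-airflow-worker-1 bash -s < check_tools.sh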