Ishan Sheth (imsheth)

The Hows - Apache Airflow 2 using DockerOperator with node.js and Gitlab container registry on Ubuntu 20


Jul 25, 2021

I am writing this post in the hope that it saves others the time, energy, and nerve-wracking self-doubt that I faced extensively, so that they can avoid it altogether.

This post is focused on how to set up Apache Airflow 2 using DockerOperator with node.js and the Gitlab container registry on Ubuntu 20.

The relevant source code for the post can be found here

Ubuntu 20.x.x

  1. Check the machine operating system; it should be Ubuntu 20.x.x to have the best shot at a successful installation on the first attempt
  2. Our test machine operating system details (Ubuntu 20.04.2 LTS (Focal Fossa))
cat /etc/os-release
  3. It is recommended to use the LTS version, but it is not necessary

  4. This completes the Ubuntu check

python 3.6

  1. Install python 3.6 and required packages
sudo add-apt-repository ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get install software-properties-common && sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget && sudo apt install python3.6 && python3.6 --version
  2. Verify the python 3.6 installation path
whereis python
  3. Set environment/path/alias variables for python 3.6 & pip 20.2.4
echo 'export PATH=$PATH:"/usr/bin/python3.6"' >> ~/.bashrc && echo 'alias python="python3.6"' >> ~/.bashrc && echo 'export PATH=$PATH:"/home/ubuntu/.local/bin"' >> ~/.bashrc && echo 'alias pip="pip3"' >> ~/.bashrc && source ~/.bashrc && echo $PATH && python --version
  4. Install pip 20.2.4
cd /tmp && curl -o get-pip.py https://bootstrap.pypa.io/get-pip.py && python get-pip.py "pip==20.2.4" && pip --version
  5. In November 2020, a new version of pip (20.3) was released with a new resolver. This resolver might work with Apache Airflow as of 20.3.3, but it might lead to errors in installation, depending on your choice of extras. To install Airflow you might need to either downgrade pip to version 20.2.4 (pip install --upgrade pip==20.2.4) or, in case you use pip 20.3, add the option --use-deprecated legacy-resolver to your pip install command.
  6. This completes python and pip installation

Airflow 2.0.1

  1. Create directories and set environment variables
mkdir airflow && mkdir dags_root && mkdir logs_root && echo 'export AIRFLOW_HOME="/home/ubuntu/airflow"' >> ~/.bashrc && echo 'export AIRFLOW_VERSION="2.0.1"' >> ~/.bashrc && echo 'export PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"' >> ~/.bashrc && echo 'export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"' >> ~/.bashrc && source ~/.bashrc
  2. Create a virtual environment; the installation causes issues with gunicorn if you are not using one
cd airflow && pip install virtualenv && virtualenv -p python3.6 airflow_venv && source airflow_venv/bin/activate
  3. Install Apache Airflow 2.0.1
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
  4. This completes the Apache Airflow 2 installation
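As a sanity check, the constraint URL exported above can be reproduced in Python. This is only a sketch mirroring the shell variables; the URL format is the official Airflow constraint-file layout:

```python
# Rebuild the Airflow constraint-file URL from the same version values
# that the shell variables above use.
import platform

AIRFLOW_VERSION = "2.0.1"
# Equivalent of: python --version | cut -d " " -f 2 | cut -d "." -f 1-2
PYTHON_VERSION = ".".join(platform.python_version().split(".")[:2])

CONSTRAINT_URL = (
    "https://raw.githubusercontent.com/apache/airflow/"
    f"constraints-{AIRFLOW_VERSION}/constraints-{PYTHON_VERSION}.txt"
)
print(CONSTRAINT_URL)
```

Printing the URL and opening it in a browser is a quick way to confirm the constraint file exists for your python version before running pip install.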

Postgres 12

  1. Install Postgres 12
sudo apt update && sudo apt install postgresql postgresql-contrib
  2. The data directory will be /var/lib/postgresql/12/main and the log file will be /var/log/postgresql/postgresql-12-main.log

  3. Setup Postgres

sudo -i -u postgres
createdb airflow
createuser --interactive
## Enter name of role to add:
## Shall the new role be a superuser? (y/n)
## Add airflow user
sudo adduser airflow
sudo -u airflow psql
  4. List databases and tables to verify
\l
\c airflow
\dt
  5. Setup users
sudo -i -u postgres
psql
CREATE USER airflow_user WITH PASSWORD 'airflow_user';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow_user;
  6. This completes Postgres installation and setup
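The user and database created above come together in the sql_alchemy_conn setting used in the next section. A small sketch of how that connection URL is composed, with the values from the steps above:

```python
# Compose the SQLAlchemy connection URL that airflow.cfg will use.
# All values come from the Postgres setup above.
user = "airflow_user"
password = "airflow_user"
host = "localhost"
port = 5432
database = "airflow"

sql_alchemy_conn = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
print(sql_alchemy_conn)
```

If you chose a different role name or password, adjust the URL accordingly before pasting it into airflow.cfg.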

Airflow 2.0.1 config

  1. Change the config as per requirement
vim /home/ubuntu/airflow/airflow.cfg

Recommended for optimum usage

dags_folder = /home/ubuntu/dags_root
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_user@localhost:5432/airflow
base_log_folder = /home/ubuntu/logs_root
logging_level = DEBUG
auth_backend = airflow.api.auth.backend.default
enable_xcom_pickling = True
job_heartbeat_sec = 120
min_file_process_interval = 120
scheduler_zombie_task_threshold = 1800

Optional but useful

expose_config = True
base_url =
hostname_callable = socket:gethostname
worker_autoscale = 16,12
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
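airflow.cfg is a standard INI file, so the edited values can be sanity-checked from Python before booting anything. A minimal sketch using the stdlib parser (in Airflow 2.0 the first group of keys lives in [core] and the log settings in [logging]; the fragment below is hand-written to match the recommendations above):

```python
import configparser

# Hand-written fragment shaped like the recommended airflow.cfg settings;
# on the server you would cfg.read("/home/ubuntu/airflow/airflow.cfg") instead.
sample = """
[core]
dags_folder = /home/ubuntu/dags_root
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_user@localhost:5432/airflow

[logging]
base_log_folder = /home/ubuntu/logs_root
logging_level = DEBUG
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)
assert cfg.get("core", "executor") == "LocalExecutor"
print(cfg.get("core", "sql_alchemy_conn"))
```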
  2. Boot up the webserver
airflow webserver
  3. Fix `ImportError: No module named psycopg2`
sudo apt-get update && sudo apt-get install libpq-dev && pip3 install psycopg2-binary
  4. Initialize the database (do this only after you have set your config; otherwise it will pick up the settings from before you edited the config)
airflow db init
  5. Create the airflow user
## username john
## password wick
## --email is required by the CLI; replace the placeholder below with your own
airflow users create \
--username john \
--firstname John \
--lastname Wick \
--role Admin \
--password wick \
--email john@example.com
  6. Boot up the scheduler (from a new ssh session)
cd airflow && source airflow_venv/bin/activate && airflow scheduler
  7. Boot up the webserver again (from the same session used in step 2)
airflow webserver
  8. Log in at the login prompt, at <IP>:8080 or localhost:8080 (be sure that port 8080 is open)

  9. Airflow dashboard

  10. This completes Apache Airflow 2 configuration

Docker 20.10.7

  1. Install docker 20.10.7
cd /tmp && curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
  2. Get the GID for the docker group
cd ~ && getent group docker | cut -d: -f3
  3. Add the user to the docker group
sudo usermod -aG docker ubuntu
  4. Verify the user has been added to the docker group
groups ubuntu
  5. Install requirements (you can also record this dependency in your dags repo root by appending && pip freeze > requirements.txt to the command below)
pip install apache-airflow-providers-docker
  6. Enable the docker remote API on the docker host
sudo vim /lib/systemd/system/docker.service
# Comment out the original line as a backup, to revert if needed
#ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
  7. Reload and restart the daemon, then verify the docker remote API response
sudo systemctl daemon-reload
sudo service docker restart
curl http://localhost:2375/images/json
  8. This completes docker 20.10.7 installation and setup
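The curl call above returns a JSON array of image objects. A quick sketch of inspecting such a response from Python; the payload here is a trimmed, hand-written sample rather than real daemon output:

```python
import json

# Hand-written sample shaped like the docker /images/json response;
# the real payload comes from: curl http://localhost:2375/images/json
sample_response = """
[
  {"Id": "sha256:abc123",
   "RepoTags": ["registry.gitlab.com/group/project:latest"],
   "Created": 1626000000}
]
"""

images = json.loads(sample_response)
# RepoTags can be null for dangling images, hence the `or []`.
tags = [tag for image in images for tag in (image.get("RepoTags") or [])]
print(tags)
```

Seeing your pushed image tag in this list is the simplest confirmation that the remote API is up and the daemon can see the image.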

Gitlab container registry

  • About Gitlab container registry
  • This post connects to the Gitlab container registry via the docker remote API, rather than connecting from Airflow via the docker_conn_id parameter

Sample URLs where you can find container registry for your repo based on the repo being used for the post

  1. Create a sample project in Gitlab on your local machine

    Create the repo
    Add your deploy key
    Clone the repo
    Copy-paste the sample code
    Push to your repo
    git push origin master

  2. Log in to the Gitlab container registry from docker on your local machine (this step also needs to be done on the server). Basically, this step generates the required file at `~/.docker/config.json`
sudo docker login
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning.
Login Succeeded
  3. Copy the config json and update its permissions on your local machine (this step also needs to be done on the server)
mkdir ~/.docker && sudo cp /root/.docker/config.json ~/.docker/config.json && sudo cat ~/.docker/config.json && sudo chmod 777 ~/.docker/config.json
  4. Publish the container to Gitlab from the root where the Dockerfile is present (this needs the local docker daemon to be running; if it isn't already, start it)
sudo docker build -t . && sudo docker push
  5. This completes Gitlab container registry installation and setup
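The reason for the "stored unencrypted" warning is that docker login writes base64("user:password") under the registry's key in config.json. A sketch of the file's shape, using obviously fake placeholder credentials (registry.gitlab.com is the registry host for gitlab.com projects; a self-hosted Gitlab will differ):

```python
import base64
import json

# Shape of ~/.docker/config.json after `docker login`.
# "gitlab-user:gitlab-token" is a placeholder, not a real credential.
auth = base64.b64encode(b"gitlab-user:gitlab-token").decode()
config = {"auths": {"registry.gitlab.com": {"auth": auth}}}
print(json.dumps(config, indent=2))
```

Because the credentials are only encoded, not encrypted, treat the copied config.json as a secret on both your local machine and the server.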

Airflow DAGs

  1. Activate virtual environment, clone at proper path, install dependencies
cd airflow && source airflow_venv/bin/activate
cd ~/dags_root/ && git clone && cd airflow2-dockeroperator-nodejs-gitlab/dags && pip install -r requirements.txt
  2. Your DAG is now available; enable/unpause it before triggering, and then trigger it

  3. Trigger your DAG run

  4. DAG run successful

  5. DAG run details for the successful run

  6. DAG run task details graph view for the successful run

  7. Task detail from the DAG run task details graph view for the successful run

  8. Task log from the DAG run task details graph view for the successful run

  9. This completes the DAG run with DockerOperator in Airflow; keep in mind, however, that at minimum an AWS EC2 t2.medium-equivalent instance is required on the server just to run DAGs with DockerOperator

systemd service

  1. systemd units don't interpolate variables and ignore lines starting with "export" in .bashrc, which causes issues on server restart since the variables are not exported (add the following to a startup script)
env > /tmp/.magic-environment-file
  2. Service configurations
sudo vim /etc/systemd/system/airflow-webserver.service

Description=Airflow webserver daemon service template for Ubuntu 20.04.2
After=postgresql.service
ExecStart=/usr/bin/bash -c 'source /home/ubuntu/airflow/airflow_venv/bin/activate ; /home/ubuntu/airflow/airflow_venv/bin/airflow db upgrade ; /home/ubuntu/airflow/airflow_venv/bin/airflow webserver'

sudo vim /etc/systemd/system/airflow-scheduler.service

Description=Airflow scheduler daemon service template for Ubuntu 20.04.2
After=postgresql.service
ExecStart=/usr/bin/bash -c 'source /home/ubuntu/airflow/airflow_venv/bin/activate ; /home/ubuntu/airflow/airflow_venv/bin/airflow scheduler'

  3. Reload the daemon, enable and start the services
sudo systemctl daemon-reload && sudo systemctl enable airflow-webserver.service && sudo systemctl enable airflow-scheduler.service && sudo systemctl start airflow-webserver.service && sudo systemctl start airflow-scheduler.service
  4. Check the webserver service status
systemctl status airflow-webserver.service
  5. Check the scheduler service status
systemctl status airflow-scheduler.service
  6. Check systemd logs
journalctl -xe
  7. Check full systemd logs
journalctl -u airflow-webserver.service -e
  8. Check live logs
journalctl -u airflow-webserver.service -e -f

  9. This command can be added to your deployment pipeline
cd /home/ubuntu/dags_root/airflow2-dockeroperator-nodejs-gitlab/dags && source /home/ubuntu/airflow/airflow_venv/bin/activate && git pull && pip install -r /home/ubuntu/dags_root/airflow2-dockeroperator-nodejs-gitlab/dags/requirements.txt && sudo systemctl stop airflow-webserver.service && sudo systemctl start airflow-webserver.service && sudo systemctl stop airflow-scheduler.service && sudo systemctl start airflow-scheduler.service

Unsuccessful trials

  1. Trials with airflow2 inside docker
  2. Trials with airflow2 inside docker with DockerOperator (docker inside docker)
  3. Referred initially but not used fully


  4. Purging all unused or dangling images, containers, volumes, and networks; this was used in the research and development phase
docker system prune -af
docker images -a
docker container ls
docker ps -a
  5. Important parameters for DockerOperator, which helped avoid the commands in the point above

Sample at
force_pull = True
xcom_all = True
auto_remove = True
tty = True

  6. SSH for deployment
ssh-keygen -t rsa -b 2048 -C ""
  7. Add to repo for deploy keys
  8. For running DAGs with DockerOperator
  9. Gitlab registry
  10. docker remote API
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
DOCKER_GROUP_ID=`getent group docker | cut -d: -f3` sudo docker-compose up -d
docker run -d -v /var/run/docker.sock:/var/run/docker.sock -p 127.0.0.1:2375:2375 bobrik/socat TCP-LISTEN:2375,fork UNIX-CONNECT:/var/run/docker.sock
  11. macOS issues
  12. Check docker container logs
sudo docker ps
sudo docker exec -it 82f4b968cb6d sh
sudo docker-compose logs --tail="all" -f

#airflow #airflow2 #apacheairflow #apacheairflow2.0.1 #dockeroperator #docker #python #python3.6 #pip #pip20.2.4 #ubuntu #ubuntu20.04.2lts #tech
