Ishan Sheth (imsheth)

The Hows - Apache Airflow 2 using DockerOperator with node.js and Gitlab container registry on Ubuntu 20


Jul 25, 2021

The motivation for writing this post is the hope that it helps others save the time, energy, and nerve-wracking, self-doubt-inducing phases that I faced extensively and would want others to avoid altogether.

This post is focused on how to set up Apache Airflow 2 using DockerOperator with node.js and the Gitlab container registry on Ubuntu 20.

The relevant source code for the post can be found at https://github.com/imsheth/airflow2-dockeroperator-nodejs-gitlab


Ubuntu 20.x.x

  1. Check the machine operating system, it should be Ubuntu 20.x.x to have the best shot at successful installation on the first attempt
  2. Our test machine operating system details (Ubuntu 20.04.2 LTS (Focal Fossa))
cat /etc/os-release
  3. It is recommended to use the LTS version but not necessary

  4. This completes Ubuntu check


Python 3.6

  1. Install python 3.6 and required packages
sudo add-apt-repository ppa:deadsnakes/ppa && sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get install software-properties-common && sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget && sudo apt install python3.6 && python3.6 --version
  2. Verify python 3.6 installation path
whereis python
  3. Set environment/path/alias variables for python 3.6 & pip 20.2.4
echo 'export PATH=$PATH:"/usr/bin/python3.6"' >> ~/.bashrc && echo 'alias python="python3.6"' >> ~/.bashrc && echo 'export PATH=$PATH:"/home/ubuntu/.local/bin"' >> ~/.bashrc && echo 'alias pip="pip3"' >> ~/.bashrc && source ~/.bashrc && echo $PATH && python --version
  4. Install pip 20.2.4
cd /tmp && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python get-pip.py pip==20.2.4 && pip --version
  5. In November 2020, a new version of pip (20.3) was released with a new 2020 resolver. This resolver might work with Apache Airflow as of 20.3.3, but it can lead to installation errors depending on your choice of extras. To install Airflow you might need to either downgrade pip to version 20.2.4 (pip install --upgrade pip==20.2.4) or, if you use pip 20.3, add the option --use-deprecated legacy-resolver to your pip install command.
  6. This completes python and pip installation


Airflow 2.0.1

  1. Create directories and set environment variables
mkdir airflow && mkdir dags_root && mkdir logs_root && echo 'export AIRFLOW_HOME="/home/ubuntu/airflow"' >> ~/.bashrc && echo 'export AIRFLOW_VERSION="2.0.1"' >> ~/.bashrc && echo 'export PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"' >> ~/.bashrc && echo 'export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"' >> ~/.bashrc && source ~/.bashrc
  2. Create a virtual environment; the installation causes issues with gunicorn if you don't use one
cd airflow && pip install virtualenv && virtualenv -p python3.6 airflow_venv && source airflow_venv/bin/activate
  3. Install Apache Airflow 2.0.1
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
  4. This completes Apache Airflow 2 installation

Postgres 12

  1. Install Postgres 12
sudo apt update && sudo apt install postgresql postgresql-contrib
  2. Your data directory would be /var/lib/postgresql/12/main and the log file would be /var/log/postgresql/postgresql-12-main.log

  3. Set up Postgres

sudo -i -u postgres
createdb airflow
createuser --interactive
## Enter name of role to add:
airflow
## Shall the new role be a superuser? (y/n)
y
exit
## Add airflow user
sudo adduser airflow
sudo -u airflow psql
  4. List databases and tables to verify
\l
\c airflow
\dt
  5. Set up users
sudo -i -u postgres
psql
CREATE USER airflow_user WITH PASSWORD 'airflow_user';
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow_user;
exit
exit
  6. This completes Postgres installation and setup

Airflow 2.0.1 config

  1. Change the config as per requirement
vim /home/ubuntu/airflow/airflow.cfg

Recommended for optimum usage

dags_folder = /home/ubuntu/dags_root
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_user@localhost:5432/airflow
base_log_folder = /home/ubuntu/logs_root
logging_level = DEBUG
auth_backend = airflow.api.auth.backend.default
enable_xcom_pickling = True
job_heartbeat_sec = 120
min_file_process_interval = 120
scheduler_zombie_task_threshold = 1800

Optional but useful

expose_config = True
base_url = http://continental.thehightable.org:8080
hostname_callable = socket:gethostname
worker_autoscale = 16,12
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
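
Before booting anything, you can sanity-check that the values Airflow will read are the edited ones. A minimal sketch using Python's configparser, assuming the config lives at the path used in this post:

from configparser import ConfigParser

# Read the same airflow.cfg edited above
cfg = ConfigParser()
cfg.read("/home/ubuntu/airflow/airflow.cfg")

# In Airflow 2.0.x these keys live under [core] and [logging]
print(cfg.get("core", "executor"))            # LocalExecutor
print(cfg.get("core", "sql_alchemy_conn"))    # postgresql+psycopg2://...
print(cfg.get("logging", "base_log_folder"))  # /home/ubuntu/logs_root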
  2. Boot up webserver
airflow webserver
  3. Fix ImportError: No module named psycopg2
sudo apt-get update && sudo apt-get install libpq-dev && pip3 install psycopg2-binary
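With psycopg2 installed, you can also confirm that the sql_alchemy_conn credentials set up in the Postgres section actually work. A minimal sketch, assuming the airflow database and airflow_user created earlier:

import psycopg2  # provided by the psycopg2-binary package installed above

# Connect with the same credentials used in sql_alchemy_conn
conn = psycopg2.connect(
    dbname="airflow",
    user="airflow_user",
    password="airflow_user",
    host="localhost",
    port=5432,
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()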
  4. Initialize the database (do this only after you have set your config, otherwise it will pick up the settings as they were before your edits)
airflow db init
  5. Create airflow user
## username john
## password wick
airflow users create \
--username john \
--firstname John \
--lastname Wick \
--role Admin \
--email john@thehightable.org
  6. Boot up the scheduler (from a new ssh session)
cd airflow && source airflow_venv/bin/activate && airflow scheduler
  7. Boot up the webserver (from the same session as used until step 5)
airflow webserver
  8. Log in at the login prompt at IP:8080 or localhost:8080 (be sure that port 8080 is open)

  9. Airflow dashboard

  10. This completes Apache Airflow 2 configuration


Docker 20.10.7

  1. Install docker 20.10.7
cd /tmp && curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
  2. Get the GID for the docker group
cd ~ && getent group docker | cut -d: -f3
  3. Add the user to the docker group
sudo usermod -aG docker ubuntu
  4. Verify the user has been added to the docker group
groups ubuntu
  5. Install requirements (you can also record this in your dags repo root by appending && pip freeze > requirements.txt to the command below)
pip install apache-airflow-providers-docker
  6. Enable the docker remote API on the docker host
sudo vim /lib/systemd/system/docker.service
# Comment out the original line so you can revert if needed
#ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecStart=/usr/bin/dockerd -H fd:// -H 0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
  7. Reload and restart the daemon, then verify the docker remote API response
sudo systemctl daemon-reload
sudo service docker restart
curl http://localhost:2375/images/json
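The same check can be done from Python against the endpoint that DockerOperator's docker_url will point at later. A minimal sketch, assuming the requests package is installed (pip install requests):

import requests  # assumption: not installed above, pip install requests

# Equivalent of the curl check against the remote API enabled above
resp = requests.get("http://localhost:2375/images/json", timeout=5)
resp.raise_for_status()
for image in resp.json():
    print(image.get("RepoTags"), (image.get("Id") or "")[:19])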
  8. This completes docker 20.10.7 installation and setup

Gitlab container registry

  • About Gitlab container registry
  • This post connects to the Gitlab container registry via the docker remote API, without using a connection from Airflow via the docker_conn_id parameter (see the DAG sketch in the Airflow DAGs section below)

Sample URLs where you can find the container registry for your repo, based on the repo used for this post
https://gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab/container_registry
https://gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab/-/settings/repository

  1. Create sample project in Gitlab on your local machine

Create a repo like
    https://gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab
    Add your deploy key at
    https://gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab/-/settings/repository#js-deploy-keys-settings
    Clone the repo
    https://gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab
Copy-paste the sample code from
    https://github.com/imsheth/airflow2-dockeroperator-nodejs-gitlab
    Push to your repo
    git push origin master

  2. You need to log in to the Gitlab container registry from docker on your local machine (this step also needs to be done on the server). Basically, this step generates the required file at `~/.docker/config.json`
sudo docker login registry.gitlab.com
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
  3. Copy the config.json and update its access permissions on your local machine (this step also needs to be done on the server)
mkdir ~/.docker && sudo cp /root/.docker/config.json ~/.docker/config.json && sudo cat ~/.docker/config.json && sudo chmod 777 ~/.docker/config.json
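To confirm the copy worked, check that the registry entry is present in the copied file. A minimal sketch:

import json
import os

# Verify the credentials entry written by docker login survived the copy
path = os.path.expanduser("~/.docker/config.json")
with open(path) as f:
    auths = json.load(f).get("auths", {})
print("registry.gitlab.com" in auths)  # should print True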
  4. Publish the container to Gitlab from your local machine, from the root where the Dockerfile is present (this needs the local docker daemon to be running; if it isn't already, start it)
sudo docker build -t registry.gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab . && sudo docker push registry.gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab
  5. This completes the Gitlab container registry setup

Airflow DAGs

  1. Activate virtual environment, clone at proper path, install dependencies
cd airflow && source airflow_venv/bin/activate
cd ~/dags_root/ && git clone https://github.com/imsheth/airflow2-dockeroperator-nodejs-gitlab && cd airflow2-dockeroperator-nodejs-gitlab/dags && pip install -r requirements.txt
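For reference, here is a minimal sketch of what a DAG wired up this way might look like. The DAG id, task id, and schedule below are hypothetical; the actual file is dags/docker_provider.py in the sample repo:

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="nodejs_docker_example",
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_container = DockerOperator(
        task_id="run_nodejs_container",
        # Image published to the Gitlab container registry earlier
        image="registry.gitlab.com/blogs-imsheth-com/airflow2-dockeroperator-nodejs-gitlab",
        # Talk to the docker remote API enabled above; no docker_conn_id,
        # registry credentials come from ~/.docker/config.json
        docker_url="tcp://localhost:2375",
        # Parameters discussed in the Miscellaneous section below
        force_pull=True,
        xcom_all=True,
        auto_remove=True,
        tty=True,
    )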
  2. Your DAG is now available; enable/unpause it first and then trigger it

  3. Trigger your DAG run

  4. DAG run successful

  5. DAG run details for the successful run

  6. DAG run task details graph view for the successful run

  7. Task detail from DAG run task details graph view for the successful run

  8. Task log from DAG run task details graph view for the successful run

  9. This completes the DAG run with DockerOperator in Airflow. Keep in mind that at least an AWS EC2 t2.medium equivalent instance is required on the server just to run DAGs with DockerOperator in Airflow.


systemd service

  1. systemd units don't interpolate variables and ignore lines starting with "export" in .bashrc, which causes issues on server restart because the variables are never exported. Add the following to a startup script (for example, a cron @reboot entry):
env > /tmp/.magic-environment-file
  2. Service configurations
sudo vim /etc/systemd/system/airflow-webserver.service

[Unit]
Description=Airflow webserver daemon service template for Ubuntu 20.04.2
After=network.target postgresql.service
Wants=postgresql.service
[Service]
#PIDFile=/run/airflow/webserver.pid
User=ubuntu
Group=ubuntu
Type=simple
EnvironmentFile=-/tmp/.magic-environment-file
ExecStart=/usr/bin/bash -c 'source /home/ubuntu/airflow/airflow_venv/bin/activate ; /home/ubuntu/airflow/airflow_venv/bin/airflow db upgrade ; /home/ubuntu/airflow/airflow_venv/bin/airflow webserver'
Restart=on-failure
RestartSec=60s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

sudo vim /etc/systemd/system/airflow-scheduler.service

[Unit]
Description=Airflow scheduler daemon service template for Ubuntu 20.04.2
After=network.target postgresql.service
Wants=postgresql.service
[Service]
#PIDFile=/run/airflow/scheduler.pid
User=ubuntu
Group=ubuntu
Type=simple
EnvironmentFile=-/tmp/.magic-environment-file
ExecStart=/usr/bin/bash -c 'source /home/ubuntu/airflow/airflow_venv/bin/activate ; /home/ubuntu/airflow/airflow_venv/bin/airflow scheduler'
Restart=on-failure
RestartSec=60s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

  3. Reload daemon, enable and start services
sudo systemctl daemon-reload && sudo systemctl enable airflow-webserver.service && sudo systemctl enable airflow-scheduler.service && sudo systemctl start airflow-webserver.service && sudo systemctl start airflow-scheduler.service
  4. Check the webserver service status
systemctl status airflow-webserver.service
  5. Check the scheduler service status
systemctl status airflow-scheduler.service
  6. Check systemd logs
journalctl -xe
  7. Check full systemd logs
journalctl -u airflow-webserver.service -e
  8. Check live logs
journalctl -u airflow-webserver.service -e -f

Deployment

  1. This command can be added to your deployment pipeline
cd /home/ubuntu/dags_root/airflow2-dockeroperator-nodejs-gitlab/dags && source /home/ubuntu/airflow/airflow_venv/bin/activate && git pull && pip install -r /home/ubuntu/dags_root/airflow2-dockeroperator-nodejs-gitlab/dags/requirements.txt && sudo systemctl stop airflow-webserver.service && sudo systemctl start airflow-webserver.service && sudo systemctl stop airflow-scheduler.service && sudo systemctl start airflow-scheduler.service

Unsuccessful trials

  1. Trials with airflow2 inside docker
  2. Trials with airflow2 inside docker with DockerOperator (docker inside docker)
  3. Referred initially but not used fully

Miscellaneous

  1. Purge all unused or dangling images, containers, volumes, and networks; this was used in the research and development phase
docker system prune -af
docker images -a
docker container ls
docker ps -a
  2. Important parameters for DockerOperator, which helped avoid the commands in the point above; these are the parameters used in the DAG sketch in the Airflow DAGs section

Sample at https://github.com/imsheth/airflow2-dockeroperator-nodejs-gitlab/blob/master/dags/docker_provider.py
force_pull = True
xcom_all = True
auto_remove = True
tty = True

  3. SSH for deployment
ssh-keygen -t rsa -b 2048 -C "john@thehightable.org"
  4. Add the public key to the repo's deploy keys
cat id_rsa.pub
  5. For running DAGs with DockerOperator
  6. Gitlab registry
  7. docker remote API
ExecStart=/usr/bin/dockerd -H fd:// -H 0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
DOCKER_GROUP_ID=`getent group docker | cut -d: -f3` sudo docker-compose up -d
docker run -d -v /var/run/docker.sock:/var/run/docker.sock -p 127.0.0.1:2375:2375 bobrik/socat TCP-LISTEN:2375,fork UNIX-CONNECT:/var/run/docker.sock
  8. macOS issues
  9. Check docker container logs
sudo docker ps
sudo docker exec -it 82f4b968cb6d sh
sudo docker-compose logs --tail="all" -f

#airflow #airflow2 #apacheairflow #apacheairflow2.0.1 #dockeroperator #docker #python #python3.6 #pip #pip20.2.4 #ubuntu #ubuntu20.04.2lts #tech



