
Installing and Configuring Apache Airflow

By Robert Sanders - December 1, 2016

Steps to Install and Configure Apache Airflow 1.x

Apache Airflow is a platform to programmatically author, schedule and monitor workflows — it supports integration with 3rd party platforms so that you, our developer and user community, can adapt it to your needs and stack.

Additional Documentation:

Documentation: https://airflow.apache.org/

Install Documentation: https://airflow.apache.org/installation.html

GitHub Repo: https://github.com/apache/airflow

Preparing the Environment

Install all needed system dependencies

Ubuntu

1. SSH onto target machine(s) where you want to install Airflow

2. Login as Root

3. Install Required Libraries
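The exact package set depends on which Airflow extras you plan to install; a rough baseline for an Ubuntu host (assuming you will want MySQL, SASL, and Kerberos support) might look like:

apt-get update
apt-get install -y python-pip python-dev build-essential libssl-dev libffi-dev libmysqlclient-dev libsasl2-dev libkrb5-dev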

4. Check Python Version

1. Run the command:

python -V

2. If the version comes back as “Python 3.5.X” you can skip the rest of this step

3. Install Python 3.5.X
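As a rough sketch: on Ubuntu 16.04, Python 3.5 is available from the default repositories; on older releases you may need a third-party PPA or a source build.

apt-get install -y python3.5 python3.5-dev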

CentOS

1. SSH onto target machine(s) where you want to install Airflow

2. Login as Root

3. Install Required Libraries
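As with Ubuntu, the exact list depends on the extras you need; a typical baseline for CentOS (package names vary between releases, e.g. mysql-devel vs. mariadb-devel on CentOS 7) might be:

yum update -y
yum install -y gcc gcc-c++ python-devel python-pip openssl-devel libffi-devel mysql-devel cyrus-sasl-devel krb5-devel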

4. Check Python Version

Run the command:

python -V

If the version comes back as “Python 3.5.X” you can skip the rest of this step

Install Python 3.5.X
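CentOS does not ship Python 3.5 in its base repositories. One hedged option is to build it from source with altinstall so the system Python is left untouched:

yum install -y wget
wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz
tar xzf Python-3.5.2.tgz
cd Python-3.5.2
./configure && make && make altinstall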

Install Airflow

Login as Root and run:
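A hedged example, assuming Airflow 1.8.0 with the Celery, MySQL, and crypto extras (note that from 1.8.1 onward the PyPI package is named apache-airflow rather than airflow):

pip install "airflow[celery,mysql,crypto]==1.8.0"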

Airflow Versions Available: https://pypi.org/project/apache-airflow/#history

Update: Common Issue with Celery

Recently there were some updates to Airflow’s dependencies such that if you install the airflow[celery] extra for Airflow 1.7.x, pip will install celery version 4.0.2. This version of celery is incompatible with Airflow 1.7.x and results in various errors, including messages saying that the CeleryExecutor can’t be loaded or tasks not being executed as they should.

To get around this issue, install an older version of celery using pip:

pip install celery==3.1.17

Install RabbitMQ

If you intend to use RabbitMQ as a message broker you will need to install RabbitMQ. If you don’t intend to, you can skip this step. For production, it is recommended that you use the CeleryExecutor, which requires a message broker such as RabbitMQ.

Setup

Follow these steps: Install RabbitMQ
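As a minimal sketch (the linked instructions cover the details), installing the broker from distribution packages and enabling the management UI typically looks like:

# Ubuntu
apt-get install -y rabbitmq-server
# CentOS (package usually comes from EPEL)
yum install -y rabbitmq-server

# enable the management UI (port 15672) and start the broker
rabbitmq-plugins enable rabbitmq_management
service rabbitmq-server start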

Recovering from a RabbitMQ Node Failure

If you’ve opted to set up RabbitMQ to run as a cluster, and one of those cluster nodes fails, you can follow these steps to recover in Airflow:

  1. Bring the RabbitMQ node and daemon back up

  2. Navigate to the RabbitMQ Management UI

  3. Click on Queues

  4. Delete the “Default” queue

  5. Restart Airflow Scheduler service

Install MySQL Dependencies

If you intend to use MySQL as a DB repo you will need to install some MySQL dependencies. If you don’t intend to, you can skip this step.

Install MySQL Dependencies

Ubuntu
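A hedged example for Ubuntu, assuming the MySQL client headers plus the Airflow MySQL extra:

apt-get install -y python-dev libmysqlclient-dev
pip install "airflow[mysql]==1.8.0"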

Centos
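And a CentOS equivalent (mariadb-devel on CentOS 7):

yum install -y python-devel mysql-devel
pip install "airflow[mysql]==1.8.0"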

Configuring Airflow

It’s recommended to use RabbitMQ.

Apache Airflow needs a home, ~/airflow is the default, but you can lay foundation somewhere else if you prefer (OPTIONAL)

export AIRFLOW_HOME=~/airflow

Run the following as the desired user (whoever you want executing the Airflow jobs) to set up the airflow directories and default configs
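Running initdb for the first time generates {AIRFLOW_HOME}/airflow.cfg with default values and bootstraps the Metastore:

airflow initdb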

Note: When you run this the first time, it will generate a sqlite file (airflow.db) in the AIRFLOW_HOME directory for the Airflow Metastore. If you don’t intend to use sqlite as the Metastore then you can remove this file.

Make the following changes to the {AIRFLOW_HOME}/airflow.cfg file

1. Change the Executor to CeleryExecutor (Recommended for production)
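In the [core] section of airflow.cfg:

    executor = CeleryExecutor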

2. Point SQL Alchemy to MySQL (if using MySQL)
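Also in [core], using placeholders for your own credentials and host:

    sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow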

3. Set DAGs to be paused when they are first created. This is a good idea to avoid unwanted runs of the workflow. (Recommended)
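The matching [core] option:

    dags_are_paused_at_creation = True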

4. Don’t load examples
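Again in [core]:

    load_examples = False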

5. Set the Broker URL (If you’re using CeleryExecutors)

1. If you’re using RabbitMQ:
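In the [celery] section, assuming the default guest account and the standard AMQP port (adjust if you created a dedicated user or vhost):

    broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/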

2. If you’re using AWS SQS:
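Celery’s SQS transport takes the AWS credentials in the URL itself (they must be URL-encoded, and the trailing @ is intentional); treat this as a sketch and adjust to your setup:

    broker_url = sqs://{URL_ENCODED_ACCESS_KEY}:{URL_ENCODED_SECRET_KEY}@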

6. Point Celery to MySQL (if using MySQL)
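Also in the [celery] section (this option was later renamed result_backend in newer Airflow versions):

    celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow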

7. Set the default_queue name used by CeleryExecutors (Optional: primarily if you have a preference for the default queue name or plan on using the same broker for multiple Airflow instances)
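Also under [celery], with a placeholder queue name:

    default_queue = {YOUR_QUEUE_NAME}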

8. Setup MySQL (if using MySQL)

  1. Login to the MySQL machine

  2. Create the airflow database if it doesn’t exist

    CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  3. Grant access

    grant all on airflow.* TO 'USERNAME'@'%' IDENTIFIED BY '{password}';

9. Set Fernet Keys (Apache Airflow ≥1.9)

1. Generate a Fernet Key
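One way to generate a key, assuming the cryptography package is available (it is pulled in by the crypto extra):

pip install cryptography
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"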

Example: _zAgNHHWpkEdr-2gHeWSFPfkbdiHTNzGWy1DfkpGF4o=

2. Set the Fernet Key in the Configurations
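In the [core] section, substituting the key you generated:

    fernet_key = {YOUR_FERNET_KEY}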

10. Run initdb to set up the database tables
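With the Metastore connection configured above, initialize the tables:

airflow initdb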

11. Create needed directories
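A hedged example; at minimum the scheduler and webserver expect a dags folder under AIRFLOW_HOME, and the service commands later in this post write to a logs folder:

mkdir -p {AIRFLOW_HOME}/dags {AIRFLOW_HOME}/logs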

Configuring Airflow — Advanced (Optional)

Email Alerting

Enable email alerting for when a task or job fails.

1. Edit the {AIRFLOW_HOME}/airflow.cfg file

2. Set the properties

  • Properties

    * SMTP_HOST — Host of the SMTP Server
    
    * SMTP_TLS — Whether to use TLS when connecting to the SMTP Server
    
    * SMTP_USE_SSL — Whether to use SSL when connecting to the SMTP Server
    
    * SMTP_USER — Username for connecting to SMTP Server
    
    * SMTP_PORT — Port to use for SMTP Server
    
    * SMTP_PASSWORD — Password associated with the user that’s used to connect to the SMTP Server
    
    * SMTP_EMAIL_FROM — Email to send Alert Emails as
    
  • Example
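A sketch of the [smtp] section of airflow.cfg with placeholder values (the config keys are lowercase counterparts of the properties above; SMTP_TLS maps to smtp_starttls):

    [smtp]
    smtp_host = smtp.example.com
    smtp_starttls = True
    smtp_ssl = False
    smtp_user = {SMTP_USERNAME}
    smtp_port = 587
    smtp_password = {SMTP_PASSWORD}
    smtp_mail_from = airflow@example.com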

Password Authentication

To enable password authentication for the web app, follow these instructions: http://airflow.apache.org/security.html

Controlling Airflow Services

By default, you have to use the Airflow Command-line Tool to start up the services. You can use the below commands to start up the processes in the background and dump the output to log files.

Starting Services

1. Start Web Server

nohup airflow webserver $* >> ~/airflow/logs/webserver.logs &

2. Start Celery Workers

nohup airflow worker $* >> ~/airflow/logs/worker.logs &

3. Start Scheduler

nohup airflow scheduler >> ~/airflow/logs/scheduler.logs &

4. Navigate to the Airflow UI

  • http://{HOSTNAME}:8080/admin/

5. Start Flower (Optional)

  • Flower is a web UI built on top of Celery, to monitor your workers.

    nohup airflow flower >> ~/airflow/logs/flower.logs &

6. Navigate to the Flower UI (Optional)

  • http://{HOSTNAME}:5555/

Stopping Services

Search for the service and run the kill command:
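For example, to find and stop a webserver, worker, or scheduler process started with nohup:

ps -eaf | grep airflow
kill {PID}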

Setting up SystemD to Run Airflow

Deploy SystemD Scripts

AIRFLOW_VERSION Example: 1.8.0
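Setting it once as a shell variable lets the commands below reference it, for example:

export AIRFLOW_VERSION=1.8.0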

1. Login as Root

2. Get the zipped up Airflow

3. Unzip the file

4. Distribute the SystemD files
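A hedged sketch of steps 2 to 4 above, assuming the unit files come from the scripts/systemd directory of the Airflow source archive (the URL and target paths are assumptions and may need adjusting for your distribution and version):

cd /tmp
wget https://github.com/apache/incubator-airflow/archive/${AIRFLOW_VERSION}.zip
unzip ${AIRFLOW_VERSION}.zip
cd incubator-airflow-${AIRFLOW_VERSION}/scripts/systemd/
cp *.service /usr/lib/systemd/system/
cp airflow.conf /etc/tmpfiles.d/
cp airflow /etc/sysconfig/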

How to Use SystemD

Web Server

Celery Worker

Scheduler

Flower (Optional)
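Assuming the unit files keep the names used above, each of the services listed under these headings can be controlled with systemctl, for example:

systemctl start airflow-webserver
systemctl start airflow-worker
systemctl start airflow-scheduler
systemctl start airflow-flower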

Setting up Airflow Services to Run on Machine Startup

Web Server

chkconfig airflow-webserver on

Celery Worker

chkconfig airflow-worker on

Scheduler

chkconfig airflow-scheduler on

Flower (Optional)

chkconfig airflow-flower on

Troubleshooting Airflow Issues

Failure to Start Web Server

Error Message: ImportError: cannot import name EscapeFormatter

  1. Reinstall markupsafe

    sudo pip uninstall markupsafe
    sudo pip install markupsafe
  2. Retry startup

Testing Airflow

Example Dags

https://github.com/apache/incubator-airflow/tree/master/airflow/example_dags

High-Level Testing

Note: You will need to deploy the tutorial.py dag.
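Assuming the stock tutorial.py example from the Airflow distribution has been copied into {AIRFLOW_HOME}/dags, a quick smoke test might look like:

python {AIRFLOW_HOME}/dags/tutorial.py
airflow list_dags
airflow list_tasks tutorial
airflow test tutorial print_date 2016-03-30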

Running a Sample Airflow DAG

Assume the following code is in the DAG file at {AIRFLOW_HOME}/dags/sample.py
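A minimal sketch of such a DAG, assuming a single placeholder task whose task_id is dummy so it matches the commands below (owner, start_date, and schedule are illustrative values only):

from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

# assumed defaults for illustration; adjust to your environment
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 3, 29),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='sample',
    default_args=default_args,
    schedule_interval='@daily',
)

# single no-op task, referenced as "dummy" in the commands below
dummy = DummyOperator(task_id='dummy', dag=dag)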

Verify the DAG is Available
Verify that the DAG you deployed is available in the list of DAGs

airflow list_dags

The output should list the ‘sample’ DAG

Running a Test
Let’s test by running the actual task instances on a specific date. The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time:

airflow test sample dummy 2016-03-30

Run
Here’s how to run a particular task. Note: It might fail if the dependent tasks are not run successfully.

airflow run sample dummy 2016-04-22T00:00:00 --local

Trigger DAG
Trigger a DAG run

airflow trigger_dag sample

Backfill
Backfill will respect your dependencies, emit logs into files and talk to the database to record status. If you do have a webserver up, you’ll be able to track the progress. airflow webserver will start a web server if you are interested in tracking the progress visually as your backfill progresses.

airflow backfill sample -s 2016-08-21

Helpful Operations

Getting Airflow Version

airflow version

Find Airflow Site-Packages Installation Location

Sometimes it might be helpful to find the source code so you can perform some other operations to help customize the experience in Airflow. Here is how you can find where the Airflow source code is installed:

1. Start up a Python CLI

python

2. Run the following code to find where the airflow source code is installed
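A minimal snippet for the Python prompt; printing airflow.__file__ shows the package’s __init__ file, and its parent directory is the installation location:

import airflow
print(airflow.__file__)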

Usual Site Package Paths:

  • CentOS: /usr/lib/python2.7/site-packages

Change Alert Email Subject

By default, the Airflow Alert Emails are always sent with the subject like: Airflow alert: <TaskInstance: [DAG_NAME].[TASK_ID] [DATE] [failed]>. If you would like to change this to provide more information as to which Airflow cluster you’re working with you can follow these steps.

Note: It requires a very small modification of the Airflow Source Code.

1. Go to the Airflow Site-Packages Installation Location

  • Example Path: /usr/lib/python2.7/site-packages/airflow

2. Edit the models.py file

3. Search for the text “Airflow alert:”

  • Using nano:

    1. Open the file

    2. Hit CTRL+W

    3. Type in “Airflow alert” and hit enter

4. Modify this string to whatever you would like.

  1. The original value title = "Airflow alert: {self}".format(**locals()) will produce ‘Airflow alert: <TaskInstance: [DAG_NAME].[TASK_ID] [DATE] [failed]>’

  2. An updated value like title = "Test Updated Airflow alert: {self}".format(**locals()) will produce ‘Test Updated Airflow alert: <TaskInstance: [DAG_NAME].[TASK_ID] [DATE] [failed]>’

Set Logging Level

If you want to get more information in the logs (debug) or log less information (warn) you can follow these steps to set the logging level

Note: It requires a very small modification of the Airflow Source Code.

1. Go to the Airflow Site-Packages Installation Location of airflow

2. Edit the settings.py file

3. Set the LOGGING_LEVEL variable to your desired value

  • debug → logging.DEBUG

  • info → logging.INFO

  • warn → logging.WARN

4. Restart the Airflow Services

Check out our blog about how to install Apache Zeppelin on a Hadoop Cluster here. To get the best data engineering solutions for your business, reach out to us at Clairvoyant.

Author
Robert Sanders

Director of Big Data and Cloud Engineering for Clairvoyant LLC | Marathon Runner | Triathlete | Endurance Athlete
