Running Apache Airflow Workflows as ETL Processes on Hadoop

About the Talk

April 15, 2017 12:00 PM

Pune, Maharashtra, India

Workflows, and the scheduling and reliable execution of those workflows, are very important in the world of data. As engineers, we need to make sure that data is available when it needs to be so that our customers and staff can gain the insight they need from it. Not having the data available can mean missed opportunities to capitalize on trends. In the Hadoop world there are many tools that help you accomplish this, including the popular Apache Oozie service as well as tools like Azkaban and Talend. Over the course of using such tools, we've noticed aspects of these services that make them difficult to work with, such as a lack of features and flexibility. While exploring alternatives, we found Apache Airflow.

Apache Airflow is a platform to programmatically author, schedule and monitor workflows. We at Clairvoyant have used this tool on a number of projects to dynamically and reliably build workflows, which utilize many Hadoop services. This includes running Sqoop, Spark, Hive, Impala and many other jobs.
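
To give a sense of what authoring a workflow programmatically looks like, here is a minimal sketch of a daily Sqoop-to-Hive DAG written against the Airflow 1.x API. The task ids, JDBC connection string, table name, and script path are illustrative placeholders, not details from any particular project.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Defaults applied to every task in the DAG (placeholder values).
default_args = {
    "owner": "etl",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="hadoop_etl_example",
    default_args=default_args,
    start_date=datetime(2017, 4, 1),
    schedule_interval="@daily",
)

# Ingest a source table into HDFS with Sqoop (hypothetical connection string).
sqoop_import = BashOperator(
    task_id="sqoop_import",
    bash_command="sqoop import --connect jdbc:mysql://db-host/sales "
                 "--table orders --target-dir /data/raw/orders",
    dag=dag,
)

# Transform the ingested data with a Hive script (hypothetical script path).
hive_transform = BashOperator(
    task_id="hive_transform",
    bash_command="hive -f /opt/etl/transform_orders.hql",
    dag=dag,
)

# Run the Hive step only after the Sqoop import succeeds.
sqoop_import >> hive_transform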

In this talk, I'll be talking about how we've used the tool in various use cases across a number of different clients. In addition, we'll go over the feature set and talk about why such a tool is superior to some of the more traditional workflow services like Oozie. Some of these reasons include its flexibility and how well it integrates with Hadoop services.
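
As a rough illustration of the flexibility mentioned above, the sketch below generates one ingestion task per source table in an ordinary Python loop, something a declarative XML workflow definition does not express as naturally. The table names and the Sqoop command are again hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="dynamic_ingest_example",
    start_date=datetime(2017, 4, 1),
    schedule_interval="@daily",
)

# One ingestion task per source table, created programmatically.
for table in ["orders", "customers", "inventory"]:
    BashOperator(
        task_id="ingest_{}".format(table),
        bash_command="sqoop import --connect jdbc:mysql://db-host/sales "
                     "--table {0} --target-dir /data/raw/{0}".format(table),
        dag=dag,
    )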
