Data Pipelines in Hadoop – Silicon Valley Data Science

Overcoming the growing pains  |  April 18th, 2017

As data technology continues to improve, many companies are realizing that Hadoop offers them the ability to create insights that lead to better business decisions. Moving to Hadoop is not without its challenges—there are so many options, from tools to approaches, that can have a significant impact on the future success of a business’ strategy. Data management and data pipelining can be particularly difficult. Yet, at the same time, there is a need to move quickly so that the business can benefit as soon as possible.

Many companies have vast experience in traditional data management, using standard tools and designing tables in third normal form. They may be tempted, when using the Hadoop ecosystem, to fall back to what they are familiar with, losing so much of the value of the new technologies. Others embrace the concept of the new environment, but are unsure of the data management process and end up just dumping data into the ecosystem. Data availability and cleanliness is frustrating for many users who are going through the growing pains of switching to Hadoop.

If this sounds familiar, you are not alone. While there is no magic solution, in this post we’ll look at some real world examples to give you a place to start.

Hitting walls with Hadoop

During one of our projects, the client was dealing with the exact issues outlined above, particularly data availability and cleanliness. They were a reporting and analytics business team, and they had recently embraced the importance of switching to a Hadoop environment. However, they did not know how to perform the functions they were used to doing in their old Oracle and SAS environments. The engineering team supporting them Sqooped data into Hadoop, but it was in raw form and difficult for them to query. The engineering team could not keep up with the constant requests for new tables, additional fields, new calculations, additional filters, etc. Frustrations grew on both sides. The reporting and analytics team began to see Hadoop as an inflexible architecture that hurt their ability to deliver value to the business—exactly opposite of what it is designed to do.

In this case, there was a need for structured data in Hadoop. There was also a need to speed up the process of making data available for the business while keeping the data pipeline controlled. We looked at and considered some off-the-shelf ETL tools that were available, but determined that they were not yet ready for implementation in this particular environment. The final solution ended up being a custom ETL tool.

Our custom ETL tool

The core functionality of the custom tool was to provide the right data in the right format to teams so that they could produce their analysis and reporting as needed.

The key requirements identified were:

  • usability by people unfamiliar with Hadoop
  • workflow descriptions that allowed scheduling and productionalizing
  • extensibility for implementing minor changes readily
  • flexibility to pull data from alternate sources into varying locations

We built a framework that allowed the user to write a workflow to extract data, transform the data, load the data, and other functions as needed. The most common set of steps often involved saving complex SQL queries to run in a designated folder. The user could then modify a config file to execute that query into the workflow in a specific way. The solution was simple, elegant, and very user friendly.

As we developed this framework, we observed some best practices that we felt were critical to the success of the project and hope may help you.

Standard development environment

In order for the process to be repeatable, simple and easy to maintain, we readily saw the importance of standardizing the development of ETL pipelines in the Hadoop/Hive/Impala environment.

This included 4 key components: a common framework, common tools, an edge node, and JupyterHub.

Common framework. ETL engineers used a common development framework so ETL jobs can be written using common modules that allow them to enumerate the steps to be executed rather than re-implementing similar logic for the same type of steps .

As an example, in order to execute a HIVE query, an ETL engineer would only need to provide the SQL query, rather than writing a shell script containing hive credentials and hive commands, in addition to the SQL query that has to be executed.

With this approach:

  • There was no need for the same problem to be solved multiple times by different people, coming up with similar but yet different solutions.
  • Common credential storage systems were used (property files), instead of having the very same credentials placed in multiple different places.
  • Files were organized in a standardized way, so that any engineer or data scientist could easily navigate the directory hierarchy in order to find code.

Common tools. ETL engineers used standard software engineering processes and tools. This includes source code control system (such as git), unit and regression testing as well as automated deployment processes and procedures.

This helped significantly with improved collaboration, especially in comparison to what was happening at the time—an ETL engineer would write a script (shell/Python/Ruby) and deploy it by herself to the target machine.

Edge node. All ETL processes and data science applications were run on an edge node in a Hadoop cluster.

This allowed engineers and data scientists to access Hadoop data. That might sound trivial, but we have seen a lot of cases in which data scientists simply were not able to access the data, preventing them from doing their jobs.

By using an edge node, they were able to standardize libraries and tools, instead of having each of them developing in their own isolated island, using their own libraries and tools.

The edge node also enabled centralized monitoring and control. It prevented issues with people running a job on their own computers, then leaving a company and then having nobody familiar how to use or maintain such a job.

JupyterHub. JupyterHub was set up on one of the Hadoop’s edge nodes. For this company, their days of being able to analyze data only by using SQL query tools were long gone. Beyond just running SQL queries, they needed more complex data analytics and data science processing. The Jupyter Notebook is a widely accepted tool by the data science community that was able to provide those capabilities.

Jupyterhub (as a centralized version of Jupyter Notebook) allowed the data scientists to access data stored on the Hadoop cluster with a tool they were familiar with. Jupyterhub/Jupyter was also used initially to investigate and analyze the data. As mentioned above, applications were setup to run on an edge node without need for human intervention. The Notebook is really an excellent tool and having it available enabled engineers and analysts to make a big difference for their company.

Architecture of the ETL tool

The framework we developed has four major components, detailed below.

ETL workflow definition

The ETL workflow we developed consists of two types of files.

ETL Workflow configuration file—ETL workflow configuration files contain workflows defined by a list of steps that should be executed in order to run an ETL process. The files are plain text, and in this case we used INI format.

An ETL workflow configuration file contains two types of objects: workflows and workflow steps. An ETL workflow contains a list of steps that are supposed to be executed in order to execute an ETL process. Steps can be executed in sequential order, but branching functionality as well as “waiting for a condition to happen in order to continue” functionality are supported. We can also run steps in parallel (fork and join).

ETL step artifacts—ETL step artifacts are files containing SQL statements, one liner shell/Python/sed scripts, or sometimes custom written executables.

Some SQL statements are used to create tables, others are supposed to load tables from either files from a HDFS/regular filesystem, or from other SQL tables. Other SQL statements select data out of existing tables.

Custom written executables are used to in order to provide specialized processing. For example, we enabled data scientists to write an R or Python script to do data science from within the same framework.

The ETL framework is aware of some commonly used variables. Some examples of those variables include: ${TODAY}, ${YESTERDAY}, ${THIS_MONTH}, and ${PREVIOUS_MONTH}. Many of the ETL processes run once a day, and many depend on notions of TODAY, YESTERDAY, etc. An ETL developer can use those predefined ETL framework variables relying on the ETL framework to calculate the values in runtime. In addition to variables, there are user defined functions (UDF) UDF. These variable-like entities are backed by code instead of a simple value.

All ETL workflow definition files are being written by ETL engineers and data scientists. There can be as many ETL workflow definitions as it required by the business needs.

Runtime environment configuration file

This file contains variables whose values are different for different runtime environments: development, staging, or production. It also includes controlled variables such as usernames and passwords that need to be secured separately than the rest of the framework. In production, this file is maintained by a sys admin.

Execution framework file

This is an executable whose task is to read an ETL workflow configuration file. It, then, executes an ETL workflow defined in the ETL workflow configuration file, one step at the time using the runtime environment configuration file variables, as well as ETL runtime variables.

The framework supports data backfill. This means that it is possible to run an ETL workflow for a day (or a period) in the past. In that case, ETL variables ${TODAY}, ${YESTERDAY} refer to dates relative to the date for which the ETL workflow is being run.

The execution framework was written by our team and, code-wise, is the most complicated component of what we delivered.

Automated deployment

ETL scripts will need to be deployed to staging and production, which requires some kind of automated deployment process. In this case, we wrote custom shell scripts to check code out from a git repository, and then compile, package, deploy, and extract code at the target machine.

Automated deployment must support easy and fast rollbacks. Eventually, you will release code that, due to some bugs, you will have to rollback. In those moments of terror, you will appreciate ability to roll back bad code quickly and safely to the previous release.

Job monitoring and support

There is also the management portal, which allows a user (with authentication) to view job logs and run statistics. You can also modify job configurations through the portal. Data scientists have more control this way, as they don’t require support from IT.

Business impact of empowered pipelines

This framework allowed the company to take control of their business needs. Their teams were able to build their own data pipelines, quickly gaining access to important business data. As modifications were needed, they were able to quickly modify the code, test, and deploy it.

The data scientists who were previously new to Hadoop were now able to use it with languages more familiar to them through JupyterHub. Any models they built were able to be automated fairly quickly, empowering them to focus on data science and solving business problems.

Interested in learning more? Download our report on data systems.

sign up for our newsletter to stay in touch

Source link

Previous Warren Buffett Shareholder Letters: Sentiment Analysis in R
Next This Week in Data Science (April 18, 2017)