Taxonomy and Guiding Principles
A guest blog post from Kevin Petrie, VP of Research at Eckerson Group
Humans have always struggled to integrate data. The abacus 5,000 years ago, punch cards 130 years ago, and the data warehouse in the 1980s each advanced civilization by combining lots of data, quickly, to help us make decisions.
This blog, the first in a series, describes a taxonomy for data integration to support modern analytics and recommends guiding principles to make it work across on-premises and cloud environments. As is often the case, people and process matter as much as tools. Data teams must align with business goals, standardize processes where they can, and customize where they must. Even while anticipating future requirements, they must also “embrace the old” by maintaining the necessary linkages back to heritage systems that persist. Blog 2 in the series will explore how to migrate your data warehouse to the Azure cloud, and blog 3 will explore how to support machine learning (ML) use cases on Azure.
Let’s start with a definition. In the modern world of computing, data integration means ingesting and transforming data, schema, and metadata in a workflow or “pipeline” that connects sources and targets. Often the sources are operational and the targets are analytical. The goal is to consolidate previously siloed data into accurate, combined views of the business. You also manage those pipelines and control their processes for governance purposes.
Data integration includes four components: ingestion, transformation, pipeline management, and control. Microsoft’s SQL Server Integration Services (SSIS) and Azure Data Factory (ADF) help data engineers and architects perform these tasks through commands they enter via either a command-line or graphical interface. Midsized enterprises, and understaffed data teams within larger enterprises, can also use a tool such as TimeXtender to further automate, streamline, and accelerate these tasks.
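To make the four components concrete, here is a minimal Python sketch of a pipeline that ingests rows from a source, transforms them, loads them into a target, and records basic control metadata. The function and field names are hypothetical and purely illustrative; in practice, tools such as SSIS, ADF, or TimeXtender handle these steps for you.

```python
from datetime import datetime, timezone

def ingest(source_rows):
    """Ingestion: pull raw records from an operational source."""
    return list(source_rows)

def transform(rows):
    """Transformation: reshape raw records for the analytical target."""
    return [
        {"customer": r["name"].strip().title(), "revenue": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    """Loading: write the transformed records to the target table."""
    target.extend(rows)

def run_pipeline(source_rows, target):
    """Pipeline management and control: orchestrate the steps and
    capture metadata (row counts, timestamps) for governance."""
    started = datetime.now(timezone.utc)
    raw = ingest(source_rows)
    cleaned = transform(raw)
    load(cleaned, target)
    return {"rows_in": len(raw), "rows_out": len(cleaned),
            "started": started.isoformat()}

if __name__ == "__main__":
    source = [{"name": " acme corp ", "amount": "1200.50"}]
    warehouse_table = []
    print(run_pipeline(source, warehouse_table))  # control metadata
    print(warehouse_table)                        # consolidated target view
```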
This diagram illustrates the components of modern data integration.
The sequence of ingestion and transformation tasks might vary, depending on your requirements. Options include traditional data extraction, transformation, and loading (ETL), transformation of data after it is loaded into the target (ELT), or iterative variations such as ETLT and ELTLT.
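As a rough illustration of how that ordering differs, the sketch below contrasts ETL with ELT using an in-memory SQLite database as a stand-in for the analytical target; the table and column names are invented for the example.

```python
import sqlite3

# Stand-in target: an in-memory SQLite database playing the role of a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (customer TEXT, amount TEXT)")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

raw_rows = [(" acme corp ", "1200.50"), ("Globex", "87.00")]

# ETL: transform in the pipeline, then load the cleaned rows into the target.
cleaned = [(c.strip().title(), float(a)) for c, a in raw_rows]
conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)

# ELT: load the raw rows first, then transform inside the target with SQL.
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", raw_rows)
conn.execute("""
    INSERT INTO orders (customer, amount)
    SELECT trim(customer), CAST(amount AS REAL) FROM staging_orders
""")

print(conn.execute("SELECT * FROM orders").fetchall())
```

Iterative variations such as ETLT and ELTLT simply repeat these transform and load steps in additional stages.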
So how do we make this stuff work? Large and midsized enterprises struggle with a history of technical debt when it comes to data integration. Over the years, their data teams have tended to hand-craft brittle transformation scripts with little attention to compatibility or documentation. Those scripts break when they encounter new data sources, pipelines, or versions of BI tools, forcing data teams to cancel upgrades or write new code from scratch. Some data teams connect BI tools directly to data sources, which creates silos and bottlenecks.
These guiding principles keep data integration on the right track: align with business goals, standardize where you can, customize where you must, and embrace the old even as you anticipate future requirements.
We have come a long way in our struggle to integrate mountains of data. Now that we have defined our taxonomy and established guiding principles, our next blog will explore how to migrate your data warehouse to the Azure cloud.
If you'd like to learn more about data integration in Microsoft environments, please join Kevin Petrie of Eckerson Group and Joseph Treadwell of TimeXtender for a webinar, The Art and Science of Data Integration for Hybrid Azure Environments, on June 16, 2021, at 2:00 PM Eastern Time. Can't make it then? Sign up anyway and you will receive access to the recording after the event.
Or read the Eckerson Group Deep Dive Report on Hybrid Cloud Data Integration. (no form to fill out!)