Essential Steps for Data Migration
If you’re a data professional, you’re likely aware of the growing demand for data migration and modeling experts. Companies are actively seeking individuals to fill roles like Data Engineer, Analytics Engineer, and ETL Specialist, who can contribute to the development of BI infrastructure using modern data stacks. In this article, I’ll share some general insights from my experience participating in numerous data migration and modeling projects, along with practical tips for success in this field.
What should you expect to deal with when you’re hired for a migration project?
In an “ideal” data world (like those often presented in online courses or educational exercises), building a reliable data architecture is a straightforward endeavour. You often receive a neatly structured set of tables that demands only minor adjustments, such as renaming column headers for clarity or converting the data into a denormalised format optimised for analytics. The task may also require some simple transformations, like eliminating duplicate records or calculating user ages from their birthdates.
However, real-world data migration projects are far more complex. Data sources are often scattered, inconsistent, and poorly documented. Intricate transformation logic is required to cleanse, standardise, and integrate disparate data streams. And the end goal is not just to create a functional data warehouse but to establish a data-driven culture that empowers the business to make informed decisions.
It’s also likely that on your first day, you’ll find out that BI reports already exist, the company has dashboards and the infrastructure is already (somehow) working. This might prompt the question “Why would they even hire me?” Your manager will likely respond by highlighting the following reasons:
- We still run analytical queries directly on the OLTP database, which makes BI reports take weeks to create. [Read about OLTP vs OLAP here]
- The legacy code is so convoluted that no one wants to deal with it, so for each small change we sometimes have to rewrite it from scratch.
- We don’t have a single source of truth for report metrics, so the same KPIs show different numbers in different reports.
- And there are plenty of other issues: data quality, testing, handling high data volumes, and so on.
You are required to:
- Develop an OLAP system and migrate the data using the ELT methodology (as ETL is considered outdated in the modern data landscape).
- Use dbt for data transformation, with a comprehensive SQL code style, so that even newly hired data analysts can understand every line of code.
- To ensure consistency and accessibility in retrieving KPIs, establish a dedicated reporting layer that encompasses all metrics.
- Make us a cool data-driven company!
The project may appear to be challenging, but let’s explore the most suitable strategy to address its requirements and outline the steps required for a comprehensive solution.
Step 1: Choosing the right data stack for your company
In today’s data-driven world, companies are increasingly reliant on robust data infrastructure to extract insights, make informed decisions, and gain a competitive edge. A well-chosen data stack serves as the backbone of a company’s data management strategy, providing a comprehensive ecosystem for collecting, processing, storing, and analysing data.
The selection of a data stack is a critical decision that should align with a company’s specific needs and goals. Key factors to consider include:
- Data Volume and Type: Assess the volume and diversity of data to determine the storage and processing capabilities required. Consider factors like structured, semi-structured, and unstructured data.
- Scalability: The data stack should be able to handle increasing data volumes and usage demands as the company grows. Look for scalable solutions that can accommodate future growth.
- Cost: Evaluate the overall cost of the data stack, including licensing fees, maintenance, and training. Consider cloud-based solutions for flexibility and cost optimisation.
- Integration: Ensure the components of the data stack can seamlessly integrate with each other and with existing systems. Avoid vendor lock-in and opt for open-source or API-driven solutions.
- User Experience: Consider the ease of use and accessibility of the data stack for both technical and non-technical users. Prioritise tools that provide intuitive interfaces and self-service capabilities.
- Security and Compliance: Data security is paramount. Ensure the data stack meets industry standards and regulations, and has robust security features to protect sensitive information.
- Future-Proofing: Choose a data stack that can adapt to evolving data needs and emerging technologies. Consider platforms that support continuous innovation and integration with new tools.
Consult experts and run proofs of concept. Involving data engineers and other experts in the evaluation process can provide valuable insights and guidance. Consider conducting proof-of-concept (POC) trials with shortlisted data stack solutions to assess their performance on your specific use cases.
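For example, a POC might boil down to running the same representative analytical workload on each shortlisted warehouse and comparing runtime, cost, and ease of use. Below is a minimal sketch of such a workload query; the schema, table, and column names are purely illustrative:

```sql
-- Illustrative POC workload: run the same query against each candidate
-- warehouse on an identical sample dataset, then compare execution time
-- and cost. Schema, table, and column names are hypothetical.
SELECT
    o.order_date,
    c.country,
    COUNT(DISTINCT o.order_id) AS orders,
    SUM(o.order_total)         AS revenue
FROM sample_data.orders    AS o
JOIN sample_data.customers AS c
  ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2023-01-01'
GROUP BY o.order_date, c.country
ORDER BY o.order_date;
```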
Step 2: Draw the current and suggested data flows and migrate the source data
This step involves consolidating and centralising data from various sources into a single repository, using the data migration tool you chose in step 1 (Fivetran, Stitch, Singer, Airbyte, Meltano, etc.) and the data warehouse deemed most suitable for your case (Snowflake, Redshift, BigQuery, etc.). By this point you should also have chosen your data transformation tool (dbt, of course) and, possibly, a data orchestration tool (Airflow, Prefect, ADF, etc.).
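Whichever ingestion tool you pick, it pays to validate the initial load with a few simple checks before building anything on top of it. A minimal sketch, assuming the tool lands data in a raw schema and adds a sync-timestamp metadata column (the schema, table, and column names below are illustrative; Fivetran, for instance, adds _fivetran_synced):

```sql
-- Post-load sanity check: compare row counts and freshness of the raw
-- tables against the source systems to spot incomplete syncs.
-- Schema/table names and the _loaded_at metadata column are illustrative;
-- the actual metadata column depends on your ingestion tool.
SELECT
    'orders'        AS table_name,
    COUNT(*)        AS row_count,
    MAX(_loaded_at) AS last_loaded_at
FROM raw.app_orders
UNION ALL
SELECT
    'customers',
    COUNT(*),
    MAX(_loaded_at)
FROM raw.app_customers;
```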
In order to effectively manage the data flow, it is crucial to identify and group the data sources, the data collection process, the analysis and reporting steps, and potential bottlenecks. Visualising the current workflow as a diagram provides a clear overview and helps identify areas for optimisation.
Once you have a clear picture of the current data flow, you can suggest a new one and create a diagram depicting the proposed flow.
Step 3: Defining the data flow stages and modeling the data
Once you have successfully established data pipelines and consolidated the raw data into your warehouse, the next crucial step is to structure this data effectively for analysis. From a BI perspective, data modeling is the process of analysing and defining the relationships between data entities to establish a well-organised foundation for data-driven insights.
Data transformation typically unfolds in a series of stages, each serving a distinct purpose:
Raw/Bronze → Silver/Staging → Core/Golden → Reporting
Raw/Bronze Layer:
The Raw/Bronze layer is what you already have. It serves as the entry point for data from various sources, holding unprocessed data in its raw format, directly from the source systems, with all its original characteristics and imperfections. This layer acts as a historical snapshot of the original data and is not intended for direct analysis.
Silver/Staging Layer:
The Silver/Staging layer serves as a transitional repository, receiving raw data from the Raw/Bronze layer and maintaining a direct one-to-one correspondence with the source layer. This layer refrains from complex transformations or joins, focusing primarily on selecting the needed columns, applying naming conventions, and casting data types. This minimalist approach keeps the data closely connected to its original source while preparing it for further processing in the Core/Golden layer.
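In dbt, a staging model like this is usually just a thin layer over a declared source. A minimal sketch, assuming a source named app with an orders table (all source, model, and column names here are illustrative):

```sql
-- models/staging/stg_app__orders.sql
-- A minimal staging model: one-to-one with the raw source, limited to
-- column selection, renaming, and type casting. Names are illustrative.
with source as (

    select * from {{ source('app', 'orders') }}

),

renamed as (

    select
        id                            as order_id,
        user_id                       as customer_id,
        cast(total_amount as numeric) as order_total,
        cast(created_at as timestamp) as ordered_at,
        status                        as order_status
    from source

)

select * from renamed
```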
Core/Golden Layer:
The Core/Golden layer represents the heart of your data warehouse, holding the transformed, cleansed, and standardised data ready for analysis and reporting. This is where you get introduced to dimensional modeling: a technique in which you break data up into “facts” and “dimensions” to organise and describe the entities within your data warehouse.
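To make the fact/dimension split concrete, here is a minimal sketch of a dimension and a fact model built on top of the staging layer (model and column names are illustrative, not a prescription):

```sql
-- models/core/dim_customers.sql
-- A dimension: descriptive attributes, one row per customer.
select
    customer_id,
    customer_name,
    country,
    signed_up_at
from {{ ref('stg_app__customers') }}
```

```sql
-- models/core/fct_orders.sql
-- A fact: one row per order, with a foreign key to the dimension
-- and additive measures that can be safely summed.
select
    order_id,
    customer_id,   -- foreign key to dim_customers
    ordered_at,
    order_status,
    order_total    -- additive measure
from {{ ref('stg_app__orders') }}
```

Facts hold measurable events at a defined grain; dimensions hold the descriptive context you slice those facts by.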
Reporting Layer:
The reporting layer, the final stop in the data analytics journey, takes the cleansed, transformed, and integrated data from the Core/Golden layer and transforms it into meaningful reports, dashboards, and visualisations. It primarily focuses on presenting aggregated data to enable data discovery and exploration, empowering users to gain insights from large volumes of data without having to wade through the raw data itself.
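As an illustration, a reporting model might pre-aggregate the core models into exactly the shape a dashboard needs. A sketch, assuming the fct_orders and dim_customers models from the previous step and Snowflake/Postgres-style date functions:

```sql
-- models/reporting/rpt_monthly_revenue.sql
-- A reporting-layer model: pre-aggregated metrics ready to plug into a
-- BI tool. Model names, column names, and date syntax are illustrative.
with orders as (

    select * from {{ ref('fct_orders') }}

),

customers as (

    select * from {{ ref('dim_customers') }}

)

select
    date_trunc('month', orders.ordered_at) as revenue_month,
    customers.country,
    count(distinct orders.order_id)        as total_orders,
    sum(orders.order_total)                as total_revenue
from orders
left join customers
    on customers.customer_id = orders.customer_id
group by 1, 2
```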
Step 4: Naming conventions and dbt guidance
To keep your project well organised and readable, you should adopt a standardised approach to model and code creation. This is where you need your own dbt guide. It can be a Confluence page or a .md file in your project.
A comprehensive dbt guide should encompass the fundamental principles of model creation, delineate the scope of transformations within each layer, and provide guidelines for structuring .yml files to ensure a well-organised and maintainable project. But the most crucial part of that dbt guide is the naming convention. Clear and consistent naming conventions make it easier for developers to understand and maintain the code, and they also make it easier for analysts to understand the data that is being transformed.
Here are some examples of the kind of rules a naming convention might include, illustrated in the sketch below:
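The prefixes and suffixes below follow a pattern that is common in the dbt community rather than an official standard; treat them as a starting point and adapt them to your own project:

```sql
-- Illustrative naming rules (adapt to your own project):
--   * models in snake_case, prefixed by layer:
--       staging:   stg_<source>__<entity>,  e.g. stg_app__orders
--       core:      dim_<entity> / fct_<entity>
--       reporting: rpt_<subject>,           e.g. rpt_monthly_revenue
--   * primary keys named <entity>_id, timestamps suffixed with _at,
--     booleans prefixed with is_ or has_
select
    order_id,
    customer_id,
    ordered_at,
    (order_status = 'cancelled') as is_cancelled
from {{ ref('stg_app__orders') }}
```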
Finally!!!
Once you’ve completed the essential setups mentioned above, you can begin the hands-on work of transferring your data to the new system. The process requires a combination of technical skills, particularly expertise in SQL… a lot of SQL.
I could add some more steps like “implementing a CI/CD pipeline” and “establishing team collaboration principles”, but it’s crucial to recognise that each data migration journey is unique, and the specific steps may vary depending on an organisation’s specific needs and circumstances. These four steps, in my opinion, serve as a general guideline for most companies aspiring to adopt a modern data stack.