Design a centralized data model in the Harmonization Layer of D&A Platform

Aug 4, 2024
3 min read

In today’s fast-paced business environment, organizations are continuously evolving, often resulting in multiple Enterprise Resource Planning (ERP) systems coexisting. As organizations transition from legacy to current and future ERP solutions, one of the most challenging tasks is unifying data across these platforms for consistent analytics and reporting. This blog post explores how to handle unified data modeling during digital transformation and offers insights for data modelers and engineers.

Why Unified Data Modeling Matters?

Unified data modeling is crucial for:

Consistent Reporting: Ensures that business decisions are based on a single source of truth.
Data Quality: Reduces data redundancy and inconsistencies.
Agility: Enhances the organization’s ability to adapt to new business requirements and technologies.

Challenges in Unified Data Modeling

Data Redundancy: Duplicate records across systems can lead to inaccuracies.
Inconsistent Data Definitions: Different ERPs may have varied data definitions and formats.
Data Integration: Seamlessly integrating data from disparate systems is complex.
Maintaining Traceability: Keeping track of the origin and lineage of data is vital.

Example:

1. Data Landscape Assessment

Suppose a company uses the following ERP systems:

Legacy ERP: Used for historical data and some active customer accounts.
Current ERP: The primary system for order processing and inventory management.
Future ERP: Under implementation, intended to replace the current ERP.

Key Entities:

Customer: Exists in all ERPs with different IDs and attributes.
Service Cases: Recorded in both current and future ERPs.

Legacy ERP:

Customer_ID	Name	Address	Phone	Service_Case_ID	Product_ID	Creation_Date	Description
LEGACY_001	Müller GmbH	Berliner Str. 123, Berlin	030123456	L_SC_1001	PROD_A	2024-01-15	Issue with software A
LEGACY_002	Dupont S.A.	Rue de Rivoli 45, Paris	014567890	L_SC_1002	PROD_B	2024-01-20	Hardware malfunction B

Current ERP:

Customer_ID	Name	Address	Email	Service_Case_ID	Product_ID	Creation_Date	Description
CURR_001	Müller GmbH	Berliner Str. 123, Berlin	info@mueller.de	C_SC_2001	PROD_A	2024-01-16	Software error A
CURR_003	Smith Ltd.	Baker St. 221B, London	contact@smith.co.uk	C_SC_2002	PROD_C	2024-01-18	Maintenance request C

Future ERP:

Customer_ID	Name	Address	Contact	Service_Case_ID	Product_ID	Creation_Date	Description
FUTURE_001	Müller GmbH	Berliner Str. 123, Berlin	+4930123456	F_SC_3001	PROD_A	2024-01-17	Update issue A
FUTURE_004	Gonzalez Corp.	Calle Gran Via 22, Madrid	contacto@gonzalez.es	F_SC_3002	PROD_D	2024-01-19	Installation problem D

2. Design a Centralized Data Model with Hash Keys

In the centralized data model, generate a Customer_Key as a hash of the concatenated Name, Address, and Contact_Info attributes. Here, we'll use a hash function to create the keys.

Use Surrogate Keys: Employ surrogate keys to uniquely identify entities.
Define Common Attributes: Standardize attributes across ERPs to maintain consistency.
Maintain Source System References: Include fields to track the origin of each record.

Customer Dimension Table:

Customer_Key	Customer_IDs	Name	Address	Contact_Info	Source_Systems	Source_Record_IDs
9e107d9d372bb6826bd81d3542a419d6	LEGACY_001, CURR_001, FUTURE_001	Müller GmbH	Berliner Str. 123, Berlin	info@mueller.de, +4930123456	Legacy, Current, Future	L_SC_1001, C_SC_2001, F_SC_3001
e4d909c290d0fb1ca068ffaddf22cbd0	LEGACY_002	Dupont S.A.	Rue de Rivoli 45, Paris	014567890	Legacy	L_SC_1002
68b329da9893e34099c7d8ad5cb9c940	CURR_003	Smith Ltd.	Baker St. 221B, London	contact@smith.co.uk	Current	C_SC_2002
a9f0e61a137d86aa9db53465e0801612	FUTURE_004	Gonzalez Corp.	Calle Gran Via 22, Madrid	contacto@gonzalez.es	Future	F_SC_3002

Note: The Customer_Key values shown above are mock hash keys generated for illustration. In practice, you would compute these using an actual hash function (e.g., SHA-256).

Service Case Fact Table:

Service_Case_ID	Customer_Key	Product_Key	Creation_Date_Key	Resolution_Date_Key	Status_Key	Response_Time	Resolution_Time	Cost	Number_of_Interactions	Source_Systems	Source_Record_IDs
1001	9e107d9d372bb6826bd81d3542a419d6	101	20240115	20240118	1	4	72	150	2	Legacy, Current, Future	L_SC_1001, C_SC_2001, F_SC_3001
1002	e4d909c290d0fb1ca068ffaddf22cbd0	102	20240120	20240122	2	3	48	200	1	Legacy	L_SC_1002
1003	68b329da9893e34099c7d8ad5cb9c940	103	20240118	20240121	1	5	72	180	2	Current	C_SC_2002
1004	a9f0e61a137d86aa9db53465e0801612	104	20240119	20240123	3	6	96	250	3	Future	F_SC_3002

Generating Hash Keys

To generate these hash keys, we concatenate the significant attributes (e.g., Name, Address, Contact_Info) and use a hash function to produce a consistent hash value for each unique customer record. Here's a Python snippet to illustrate this process:

This script will produce unique hash keys for each customer based on their details, ensuring consistency across different systems.

3. Implement Data Harmonization Techniques

Data Matching Techniques:

Fuzzy Matching: Utilize fuzzy logic to detect similar names and addresses across systems, such as "Müller GmbH" appearing with different contact details.
Exact Matching: Leverage exact matches on fields like email addresses to identify duplicates.

Consolidation Rules:

Primary Source: Prioritize data from the Future ERP, assuming it has the most accurate and up-to-date information.
Merged Attributes: Combine comments, descriptions, and customer feedback from all systems into a single view.

4. Ensure Data Quality and Governance

Validation Rules:

Ensure that all contact information is in the correct format and that addresses conform to postal standards.
Implement consistency checks to verify that customer names and identifiers are accurate.

Data Profiling and Monitoring:

Utilize tools like SAP Data Intelligence to profile data regularly, identifying anomalies and ensuring data quality.
Set up alerts for data discrepancies or anomalies in customer records.

5. Leverage Modern Data Platforms

SAP Datasphere | Microsoft Fabric | Snowflake | Databricks Implementations:

SAP Datasphere can be used to facilitate the integration

and management of data across multiple ERP systems:

Data Integration: Seamlessly connect and integrate data from Legacy, Current, and Future ERPs into a unified platform.
Real-Time Analytics: Enable real-time insights into customer interactions and service case resolutions.
Scalability: Handle growing data volumes efficiently