Is Kimball modelling as relevant in the modern data era?

Chris Ballment, March 23, 2023.

The Kimball data modelling methodology, developed by Ralph Kimball in the early 1990s, has been a widely adopted approach to designing data warehouses and business intelligence (BI) systems for decades. However, with the emergence of modern data platforms and their unique features, the relevance of the Kimball modelling methodology is being called into question. In this article, we will discuss the reasons why the Kimball approach is losing ground in today's data-driven world.

The Kimball Methodology in a Nutshell

At the core of the Kimball methodology is the concept of the dimensional model, which organises data into facts (quantitative information) and dimensions (contextual information). This approach emphasises denormalized, star-schema designs that enable fast and efficient query performance for BI and reporting purposes. On traditional databases it gained performance benefits over its contemporaries by leveraging bitmap indexes on fact tables for multi-dimensional select queries, and it maintained history in dimensions (as slowly changing dimensions) through a unique identifier known as a surrogate key.
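
To make the terminology concrete, below is a minimal sketch of a star schema expressed with pandas: one fact table holding measures, joined to two dimension tables via surrogate keys. The table and column names are purely illustrative, not taken from any particular Kimball design.

```python
import pandas as pd

# Hypothetical star-schema sketch: one fact table (quantitative measures)
# joined to two dimension tables (contextual attributes) via surrogate keys.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],                      # surrogate key
    "customer_name": ["Acme Pty Ltd", "Globex"],
    "region": ["APAC", "EMEA"],
})
dim_date = pd.DataFrame({
    "date_key": [20230101, 20230102],            # surrogate key
    "calendar_date": ["2023-01-01", "2023-01-02"],
})
fact_sales = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "date_key": [20230101, 20230102, 20230102],
    "sales_amount": [120.0, 75.5, 300.0],        # the measure
})

# A typical BI query: total sales by region and date.
report = (
    fact_sales
    .merge(dim_customer, on="customer_key")
    .merge(dim_date, on="date_key")
    .groupby(["region", "calendar_date"], as_index=False)["sales_amount"]
    .sum()
)
print(report)
```

In a warehouse the same structure would simply be tables and a join in SQL; the point is the separation of measures from their surrounding context.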

Challenges with the Kimball Methodology in Modern Data Platforms

In this section, I will detail the challenges of implementing the Kimball methodology on modern data platforms.

Columnar Data Storage

Modern data platforms, such as Snowflake, Google BigQuery, and Amazon Redshift, utilize columnar data storage to optimize query performance. Unlike traditional row-based storage, columnar storage keeps the values of each column together, so analytical queries scan only the columns they need. It also allows for efficient compression, reducing storage costs and further improving query performance.
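
As a rough illustration of why column orientation matters, the sketch below writes a small table to Parquet (a columnar file format) and then reads back only one column. The file name, column names and the use of pyarrow are my own assumptions for the example, not specific to any one platform.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Write a small table to Parquet, a columnar format, with per-column compression.
table = pa.table({
    "order_id": list(range(1_000)),
    "customer_name": ["customer_%d" % (i % 50) for i in range(1_000)],
    "sales_amount": [float(i) for i in range(1_000)],
})
pq.write_table(table, "orders.parquet", compression="snappy")

# An analytical query rarely needs every column; a columnar engine can scan
# just the columns it was asked for ("column pruning").
amounts_only = pq.read_table("orders.parquet", columns=["sales_amount"])
print(pc.sum(amounts_only["sales_amount"]).as_py())
```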

The Kimball methodology's focus on denormalized, star-schema designs may no longer be as beneficial given columnar storage and modern query planning and execution. Modern data platforms can efficiently handle queries on normalized data, reducing the need for the complex, pre-aggregated tables that are characteristic of Kimball designs. In addition, the query performance benefits seen on traditional relational databases are not observed on columnar databases; querying a single wide table can be up to twice as fast as querying an equivalent dimensional model on Snowflake, Redshift and BigQuery.

Data Lake Integration and Semi-Structured Data

The rise of data lakes and the need to handle semi-structured data, such as JSON or XML, has posed new challenges for the Kimball methodology. Traditional dimensional models struggle to accommodate these non-tabular data formats and the schema-on-read approach often used with data lakes.

Modern data platforms offer built-in support for semi-structured data, providing more flexible schema design options that better suit the diverse data landscape organisations now face. This flexibility contrasts with the rigid, schema-on-write approach inherent in Kimball designs.
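
As a small illustration of schema-on-read, the snippet below flattens a nested JSON event only at query time, using pandas. The event shape, field names and the choice of pandas.json_normalize are assumptions for the example; modern platforms provide their own native handling of semi-structured types.

```python
import pandas as pd

# Hypothetical semi-structured event, as it might land in a data lake.
events = [
    {
        "event_id": "e-001",
        "payload": {
            "customer": {"id": 42, "region": "APAC"},
            "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
        },
    },
]

# Schema-on-read: flatten the nested structure when it is queried,
# rather than forcing it into a fixed tabular schema at load time.
flat = pd.json_normalize(
    events,
    record_path=["payload", "items"],
    meta=[
        "event_id",
        ["payload", "customer", "id"],
        ["payload", "customer", "region"],
    ],
)
print(flat)
```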

Real-Time Analytics and Data Streaming

The growing demand for real-time analytics and data streaming has exposed the limitations of the Kimball methodology. Dimensional models and star schemas, optimized for historical reporting, can struggle to handle the continuous ingestion and processing of streaming data. Modern data platforms offer native support for data streaming and real-time analytics, bypassing the need for the batch-oriented ETL (Extract, Transform, Load) processes that underpin Kimball designs.

Modern platforms can adopt certain approaches that will aid their ability to handle large and constant streams of data in a quasi-dimensional model:

  • Transitioning away from mutable processing (performing both inserts and updates) to immutable processing (only inserting new data).
  • Generating fact tables as views that incorporate the latest data, without the need for more frequent batch updates (a sketch of both approaches follows this list).
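
The sketch below illustrates both ideas under assumed table and column names: changes land as append-only (immutable) rows, and the "fact" is derived on read, much as a view would derive it in a warehouse. It uses pandas purely to keep the example self-contained.

```python
import pandas as pd

# Immutable processing: every change arrives as a new, append-only row
# rather than an update in place. Column names are illustrative only.
order_events = pd.DataFrame({
    "order_id":   [101, 102, 101, 101],
    "status":     ["created", "created", "paid", "shipped"],
    "event_time": pd.to_datetime([
        "2023-03-01 09:00", "2023-03-01 09:05",
        "2023-03-01 10:30", "2023-03-02 08:15",
    ]),
})

def current_orders(events: pd.DataFrame) -> pd.DataFrame:
    """Equivalent of a fact 'view': derive the latest state of each order
    from the immutable event log, with no batch update required."""
    return (
        events.sort_values("event_time")
              .groupby("order_id", as_index=False)
              .last()
    )

print(current_orders(order_events))
```

On a real platform this pattern is typically a streaming ingest into an append-only table, plus a SQL view (or materialised view) that surfaces the latest state.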

Kimball in Established Modern Platforms

The Kimball methodology, while facing challenges in the context of modern data platforms, still offers valuable merits for organisations with established data infrastructure. The dimensional modelling approach at its core provides an intuitive and easily understandable structure that simplifies data access for business users, enabling efficient querying and reporting. This user-friendly design encourages widespread adoption across the organisation, fostering a data-driven culture. As a result, the Kimball methodology remains a viable option for organisations seeking to optimize their data infrastructure for business intelligence and reporting purposes within established modern data platforms.

Takeaways from the Kimball Methodology

Despite the challenges with implementing the Kimball methodology in modern data platforms, there are some components that are still valuable today.

The Wider Lifecycle Methodology

On top of defining and structuring the logical and physical models, Kimball proposed a lifecycle methodology for designing data warehouses. This methodology was designed to help organisations move from a traditional operational data store (ODS) to a data warehouse, and then to a data mart [1].

It formalised a process for designing and implementing a data warehouse, and provided a framework for managing the ongoing maintenance and evolution of the system. This methodology is still relevant today, and can be applied to modern data platforms.

The Bus Matrix

The bus matrix is an essential component of the Kimball data modelling process [2]. It served as a blueprint for designing a data warehouse and provided a clear view of how the various dimensions and facts were connected. Today, the output is less valuable than the process of producing it: in that phase, each entity was defined and conformed, and any problems in the relationship between measures and their required contextual data were found early. A small illustration follows.
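
For readers who have not seen one, the sketch below shows a toy bus matrix: business processes (future fact tables) as rows, conformed dimensions as columns, and an "X" wherever a process needs a dimension. All process and dimension names are hypothetical.

```python
import pandas as pd

# A tiny, illustrative bus matrix: rows are business processes, columns are
# conformed dimensions shared across those processes.
bus_matrix = pd.DataFrame(
    {
        "Date":      ["X", "X", "X"],
        "Customer":  ["X", "X", ""],
        "Product":   ["X", "", "X"],
        "Warehouse": ["", "", "X"],
    },
    index=["Sales", "Payments", "Inventory"],
)
print(bus_matrix)
```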

Single Source of Truth

Although the concept of a single source of truth may be unattainable in most data-rich businesses, aspiring to have data produced and reported consistently is potentially even more important than it was when Kimball proposed the model. With the large volumes of data now ingested and stored, there are rarely the resources, or the operational mind share, to apply the same degree of rigour to data quality.

In many businesses, I have observed that more of the governance burden (for example, freshness and quality) is being pushed to the end user, in what seems to be the data equivalent of caveat emptor, or buyer beware. This can be particularly dangerous as inconsistent output can cause damage to the system’s reputation inside the business.

Surrogate Keys

Using surrogate keys to uniquely identify rows in a dimension table is still a best practice on modern data platforms. Surrogate keys ensure that data integrity is maintained, especially when dealing with source system changes or updates. The use of an auto-incrementing number can be problematic now, specifically where objects may be transient or virtualised (implemented as views). Using a deterministic value, such as a hash of the natural key, instead of an arbitrary number allows keys to be recreated where necessary and avoids design patterns that deviate depending on the target. A sketch of this approach follows.
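
Below is a minimal sketch of that approach: a deterministic surrogate key built by hashing the natural-key columns. The delimiter, the normalisation (trim and upper-case) and the choice of MD5 are assumptions for illustration; any stable hash would do.

```python
import hashlib

def surrogate_key(*natural_key_parts: str, delimiter: str = "||") -> str:
    """Deterministic surrogate key: hashing the natural-key columns means the
    same input always produces the same key, on any target platform."""
    raw = delimiter.join(str(p).strip().upper() for p in natural_key_parts)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# The same natural key always yields the same surrogate key, so keys can be
# recreated if a table or view is rebuilt.
print(surrogate_key("CUST-0042", "AU"))   # e.g. a customer id plus country
print(surrogate_key("CUST-0042", "AU"))   # identical output
```

Because the key depends only on the input values, a dimension implemented as a view can be dropped and recreated without its keys changing, and the same pattern works identically across targets.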

Conclusion

While the Kimball data modelling methodology has served organisations well for many years, the evolution of modern data platforms and their unique capabilities have diminished its relevance. Revisiting what I consider the best parts of the methodology has been a useful reminder that, with the large volumes and scope of data being captured and processed, it is unusual to be able to devote the attention needed to properly analyse and govern that data. Organisations must take a very measured and prioritised approach to ensure that the most critical of their data assets get the thought they deserve.

I am planning a deeper dive into the patterns and architectures for data management on modern platforms in future blog posts. Look out for those on LinkedIn and on the website.

References