The data lakehouse as your platform to the future

May 30, 2021
  • IT
  • data

One of the key elements in becoming a successful data-driven organization is setting up a modern data platform that can handle data streams from various sources and translate this raw information into actionable insights. Traditionally, such a platform was built around either a data warehouse or a data lake – e.g. as a basis for a data hub or a BI platform – and businesses had to decide which option would serve their unique situation best. With the new data lakehouse paradigm, they can now combine the capabilities of both. Here’s why this matters.

First, let’s take a closer look at what constitutes a data (management) platform. The specifics will differ in every organization, but broadly speaking, we can differentiate between five layers:

  1. Data sources: These are the internal or external sources of information that are not part of the data platform.
  2. Ingestion layer: Here, raw data is ingested and ‘unlocked’ within the data platform. This can happen in three ways: in batches (pull), via streaming (push) or through replication.
  3. Raw data layer: A copy of the raw data is then stored in a data lake or data warehouse.
  4. Centrally processed data: Inside the data warehouse or data lake, data is then processed and prepared for further usage. While a data warehouse typically contains structured data (mainly for reporting purposes), a data lake is better suited to unstructured and big data (e.g., for data science purposes).
  5. Serve & consume: In this layer, processed data is analyzed, reported on and/or distributed.

The five different layers of the data platform
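
To make the five layers concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the sensor records, the function names and the in-memory list standing in for a raw data store are hypothetical, and a real platform would use dedicated ingestion and storage tooling instead.

```python
import json
from statistics import mean

# 1. Data sources: hypothetical external records, outside the platform.
SOURCE_RECORDS = [
    '{"sensor": "a", "temp": "21.5"}',
    '{"sensor": "b", "temp": "19.0"}',
    '{"sensor": "a", "temp": "22.5"}',
]


def ingest(records):
    """2. Ingestion layer: pull one batch of raw records into the platform."""
    return list(records)


def store_raw(batch, raw_store):
    """3. Raw data layer: keep an untouched copy of what was ingested."""
    raw_store.extend(batch)
    return raw_store


def process(raw_store):
    """4. Centrally processed data: parse the raw text and apply types."""
    rows = []
    for line in raw_store:
        record = json.loads(line)
        rows.append({"sensor": record["sensor"], "temp": float(record["temp"])})
    return rows


def serve(rows):
    """5. Serve & consume: report the average temperature per sensor."""
    by_sensor = {}
    for row in rows:
        by_sensor.setdefault(row["sensor"], []).append(row["temp"])
    return {sensor: mean(temps) for sensor, temps in by_sensor.items()}


raw_store = store_raw(ingest(SOURCE_RECORDS), [])
report = serve(process(raw_store))
print(report)
```

Note that the raw layer keeps the records exactly as they arrived, while typing and aggregation only happen in the later layers. That separation is what allows the same raw copy to feed both reporting and data science.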

Combining the best of both worlds

Imagine a warehouse stocked with well-organized components in neat rows and stacks. Now, think of a lake, full to the brim with water, fish and other objects all jumbled together with no immediate order imposed on them. It’s relatively straightforward to find and access a specific object located in a warehouse – while it requires different processes to identify and extract specific content from a lake.

Like their namesakes, data lakes and data warehouses differ quite profoundly in how they store and process what fills them: information.

  • A data warehouse deals best with moderate amounts of structured data, which is used mainly in reporting and service delivery.
  • A data lake is best at handling large amounts of raw and unstructured data, which is used mainly in data science, machine-learning exploration, and similar applications.

The main problem with this either/or approach? Today’s companies need to be able to handle all types of data, and use them in all types of scenarios. In other words, having to choose between a data lake and a data warehouse is almost always a case of choosing the lesser evil. This is why many organizations now use both in tandem, leading to higher levels of complexity and duplicated data.

Enter the data lakehouse: an open architecture that combines the best features of – you guessed it – data lakes and data warehouses, with increased efficiency and flexibility as a result. Made possible by the rising trend of open and standardized system design, data lakehouses can apply the structured approach of a warehouse to the wealth of data contained in a data lake.
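
As a rough illustration of what “applying the structured approach of a warehouse to a data lake” can mean, the sketch below enforces a schema on raw records as they are read. It is a simplified assumption, not any vendor’s API: the `SCHEMA` mapping, the sample orders and the helper names are all hypothetical.

```python
# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"order_id": int, "customer": str, "amount": float}

# Raw "lake" records of mixed quality, stored exactly as they arrived.
LAKE = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": "n/a"},  # wrong type
    {"order_id": 3, "customer": "initech"},  # missing column
    {"order_id": 4, "customer": "acme", "amount": 80.0, "note": "rush"},
]


def conforms(record, schema):
    """Return True if the record has every schema column with the right type."""
    return all(
        column in record and isinstance(record[column], expected)
        for column, expected in schema.items()
    )


def to_table(lake, schema):
    """Split raw records into schema-conforming rows and rejected records."""
    table, rejects = [], []
    for record in lake:
        if conforms(record, schema):
            # Expose only the governed columns in the structured view.
            table.append({column: record[column] for column in schema})
        else:
            rejects.append(record)
    return table, rejects


table, rejects = to_table(LAKE, SCHEMA)

# A BI-style aggregate computed directly on the structured view.
total_by_customer = {}
for row in table:
    customer = row["customer"]
    total_by_customer[customer] = total_by_customer.get(customer, 0.0) + row["amount"]
print(total_by_customer, len(rejects))
```

The raw records stay in the lake untouched, so data scientists can still explore the rejected or extra fields, while reporting works from the validated view of the same repository.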

The main features of a data lakehouse

  • Handle various data types: structured, unstructured and semi-structured.
  • Enjoy simplified data governance and enforce data quality across the board.
  • Get BI support directly on the source data, which means that BI users and data scientists work from the same repository.
  • Benefit from increased scalability in terms of users and data sizes.
  • Get support for data science, machine learning, SQL, and analytics – all in one place.

Unlocking innovation

By simplifying enterprise data infrastructure, safeguarding data quality and increasing the opportunities for exploratory data science, the data lakehouse holds the key to future innovation for many companies. Software vendors seem to agree: those with roots in either data warehouses or data lakes are working hard to create their own hybrid ‘data lakehouse’ solutions. As such, there is not necessarily a need to invest in two different technologies to get yourself a data lakehouse.

While many parties are claiming the term ‘data lakehouse’, it’s important to consider each vendor’s history, whether in data warehouses or in data lakes, when making a decision. The key is to keep the big picture in mind, and find a solution that works on your terms and takes your data quality and data governance rules into account. At delaware, we combine expert knowledge of every available platform with business experience in numerous sectors, which makes us uniquely qualified to help you pick the solution that best fits your needs.
