WHITE PAPER
Cloudera, Inc. 5470 Great America Pkwy, Santa Clara, CA 95054 USA cloudera.com
© 2022 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks
of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies.
Information is subject to change without notice. 5269-001 June 22, 2022
Data Quality
In a traditional Data Warehouse, data typically passes through three distinct stages, each
yielding data of higher quality. These stages are commonly referred to as Landing, Refined
and Production, or alternatively as Bronze, Silver and Gold. In the Landing stage, data is in its
raw or natural format, e.g. CSV. As we transform and curate the data, we change its format
(e.g. to Parquet), apply Data Modelling and store the data in an Iceberg table in preparation for
efficient analytics. This transformation moves the data to the Refined stage. The final transition
from Refined to Production requires the data to be optimised for production usage. This may
include data cleansing and normalisation operations.
Figure 04 - A Simplified Systems View of the Data Lakehouse
We include the Iceberg client library in Cloudera’s Data Services. This makes it possible to
execute the transformations that move data between the three stages of quality. As Iceberg
is open source, it can also be integrated with third-party products and services to perform
data quality operations.
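
The three stages of quality described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Cloudera's or Iceberg's implementation: the sample data, the `Order` record and the functions `to_refined` and `to_production` are hypothetical names introduced here, and an in-memory list of typed records stands in for a Parquet-backed Iceberg table.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Landing: raw CSV text exactly as it arrived (hypothetical sample data).
LANDING_CSV = """order_id,country,amount
1, us ,10.50
2,GB,
3,gb,7.25
3,gb,7.25
"""


@dataclass(frozen=True)
class Order:
    """Typed record; a stand-in for a row in a modelled table."""
    order_id: int
    country: str
    amount: Optional[float]


def to_refined(raw_csv: str) -> List[Order]:
    """Landing -> Refined: parse raw text into typed, modelled records."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    return [
        Order(
            order_id=int(row["order_id"]),
            country=row["country"],
            # An empty amount field is kept as None rather than guessed.
            amount=float(row["amount"]) if row["amount"].strip() else None,
        )
        for row in rows
    ]


def to_production(refined: List[Order]) -> List[Order]:
    """Refined -> Production: cleanse and normalise the records."""
    seen_ids = set()
    production = []
    for order in refined:
        if order.amount is None:
            continue  # cleansing: drop incomplete records
        if order.order_id in seen_ids:
            continue  # cleansing: drop duplicate order ids
        seen_ids.add(order.order_id)
        # normalisation: trim whitespace and upper-case the country code
        production.append(
            Order(order.order_id, order.country.strip().upper(), order.amount)
        )
    return production


refined = to_refined(LANDING_CSV)
production = to_production(refined)
```

In this sketch the Refined stage keeps every parsed row, including the incomplete and duplicated ones, while the Production stage retains only the complete, de-duplicated orders with normalised country codes. In a real deployment the same steps would typically be expressed in an engine such as Apache Spark writing to Iceberg tables.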
Beyond The Data Lakehouse
As previously described, the definition of a Data Lakehouse has steadily evolved from originally
supporting BI on a Data Lake to supporting AI, BI, ML and Data Engineering on a single
platform today. In the earlier section “What is a Data Lakehouse Architecture” we introduced
seven qualities that all Data Lakehouses share. One quality that we believe can be extended
further is support for additional analytical services. As such, we are working to extend the
supported analytical services to include real-time analytics and operational datastores.
At Cloudera, we believe that an Open Data Lakehouse needs to extend beyond supporting a
single processing engine. Today, we support Iceberg with Apache Spark, Apache Hive and
Apache Impala. Collectively, they support the Data Lakehouse architecture across Data
Engineering, Data Warehousing and Machine Learning. Looking to the future, we will bring
support to the real-time analytics engine Apache Flink, the data flow management engine
Apache NiFi and operational data stores powered by Apache HBase. This will provide the
foundation of the next generation of Data Lakehouse, one that encompasses the entire data
lifecycle, from the edge to AI.
[Diagram labels]
Analytical services: Machine Learning, Data Flow, Data Engineering, Operational Database,
Data Visualization, Data Warehouse.
Cross-cutting services: Metadata | Security | Encryption | Control | Governance
Data Lake Layer (SDX): Supports security, governance, and management of data in open storage formats.
Metadata Layer: Supports partitioning, transactions, data versioning and snapshots.
About Cloudera
At Cloudera, we believe that data can
make what is impossible today, possible
tomorrow. We empower people to
transform complex data into clear and
actionable insights. Cloudera delivers
an enterprise data cloud for any data,
anywhere, from the Edge to AI. Powered
by the relentless innovation of the open
source community, Cloudera advances
digital transformation for the world’s
largest enterprises.
Learn more at cloudera.com