Data Lakes vs. Data Warehouses

Data lakes and data warehouses are similar concepts, but each has its own pros and cons when you're deciding where you want to house your data.


"Big data" has been the topical buzzword in the tech scene for quite some time now and has an ever-important role to play in the way that technology continues to advance. If you're not already taking advantage of the plentiful insights that your customers' day-to-day transactions and interactions with your brand convey, then we hate to say it, but you're falling behind. Off-the-shelf analytics toolsets like Google Analytics and others offer great data insights to get started, but there comes a time when you need to supercharge your data analysis capabilities.

One of the sure-fire ways to get started is by collecting your data in a data warehouse, a data lake... or both! Our customers often utilize one or the other depending on their use cases, so we've decided to share a few pointers they've learned to help you make the right decision for your business.

But first, let's dive into what data warehouses and data lakes are:

What is a Data Lake?

A data lake is a centralized repository designed to hold massive amounts of structured, semi-structured, and unstructured data in its pure, unprocessed state. Think of it as a storage space where data is pooled, ready for whatever analysis or transformation you might need in the future. It’s ideal for when you know you need to collect data but aren’t exactly sure how you’ll use it yet—leaving your options open as your data strategy takes shape.

Data lakes are generally used when data will only be analyzed infrequently, or when its eventual purpose is still uncertain. Querying this data without a data scientist is somewhat difficult, but tools such as Amazon Athena have helped make data lakes more queryable. Data lakes are also great when you know you need to collect the data but aren't exactly sure what the use case will be yet: since the data has limited structure, you can apply whatever structure you'd like later on in a data warehouse or another tool.
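To make that concrete, here's a minimal sketch of what "store it raw, decide on structure later" can look like, assuming an S3-backed lake and the boto3 library. The bucket name, key layout, and event fields below are placeholders, not a prescribed setup.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

# Hypothetical bucket and event; swap in your own names and fields.
BUCKET = "my-event-lake"

event = {
    "type": "track",
    "event": "Product Viewed",
    "userId": "user-123",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "properties": {"sku": "ABC-001", "price": 29.99},
}

s3 = boto3.client("s3")

# The raw JSON goes into the lake as-is; a date-based key prefix acts as a
# simple partition so downstream tools (Athena, Spark, etc.) can scan less data.
key = f"events/dt={datetime.now(timezone.utc):%Y-%m-%d}/{uuid.uuid4()}.json"
s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
```

Nothing about how the event will eventually be queried is decided here; that choice is deferred until you apply structure later.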

Now, let’s dive into the pros and cons of data lakes and see if they’re the right fit for your needs.

Pros and Cons of Data Lakes

A data lake stores data like water in a lake: unfiltered and untransformed. Its purpose has yet to be defined, but it is there when you need it. You can also jump in and explore, but it might take time and effort to find what you're looking for.

Pros of Data Lakes

  • Malleable data: manipulate and/or transform the data into another format once it's stored.
  • Easy to store data: as long as the data is in a queryable format (JSON, Avro, Parquet, etc.), a data lake will accept it (see the Parquet sketch after this list).
  • Inexpensive to load & store data: storing unstructured data tends to be cheaper than storing it in a tool such as a data warehouse.
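As a rough illustration of the "queryable format" point above, here's a minimal sketch that writes a few placeholder events to a Parquet file with pyarrow. Columnar formats like this tend to compress well and are cheaper and faster to scan later than raw JSON.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A handful of raw events (placeholder data).
events = [
    {"event": "Product Viewed", "userId": "user-123", "price": 29.99},
    {"event": "Order Completed", "userId": "user-456", "price": 84.50},
]

# Convert the rows to a columnar table and write a compressed Parquet file,
# which query engines can scan far more efficiently than line-delimited JSON.
table = pa.Table.from_pylist(events)
pq.write_table(table, "events.parquet", compression="snappy")
```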

Cons of Data Lakes

  • Hard for many users to query & understand data: you'll most likely need an additional tool or two to query data housed in a data lake.
  • Expensive to perform queries: the tools you layer on top can add cost, and many charge on a per-query basis.
  • Slower to return query results: generally speaking, data lakes are not optimized to run quick analyses.

What is a Data Warehouse?

A data warehouse is a centralized repository built to store and organize large amounts of structured data, making it ideal for fast, complex analysis. Data goes through a meticulous process when entering the warehouse, so by the time it’s in, it’s ready for efficient querying and deep dives into historical trends. If you’re after organized, consistent insights with minimal fuss, a data warehouse is designed to deliver exactly that.

A data warehouse stores formatted data for the purpose of performing complex queries and data analysis at regular intervals. You need to know the purpose of the data, because you'll typically need to predefine the structure (schemas, tables, columns, and data types) within your database before you can actually insert data into the warehouse. Because you formatted the data and made a conscious decision about how it is shaped, you can more easily analyze and manipulate it for your analysis needs.
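To illustrate what "predefine the structure" means in practice, here's a minimal sketch using the google-cloud-bigquery client (BigQuery comes up again later in this post). The project, dataset, table, and column names are placeholders, and this is just one way to set up a warehouse table.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured

# The structure is decided before any rows are loaded: table name, columns,
# and data types are all fixed up front. Names below are placeholders.
table_id = "my-project.analytics.page_views"
schema = [
    bigquery.SchemaField("user_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("url", "STRING"),
    bigquery.SchemaField("viewed_at", "TIMESTAMP"),
    bigquery.SchemaField("duration_seconds", "FLOAT"),
]

table = client.create_table(bigquery.Table(table_id, schema=schema))
print(f"Created {table.full_table_id} with a fixed schema")
```

Any row you load afterwards has to fit this shape, which is exactly what makes later querying fast and predictable.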

Pros and Cons of Data Warehouses

Sticking with the water analogy, pushing data into a warehouse is like collecting water from a source into barrels, transporting the barrels to a factory, packaging the water into bottles, and storing them on pallets. The water will be used for one purpose: in this case, data queries to find useful data. The way the water is packaged is very methodical, and a mess-up on the assembly line could cause production issues. Likewise, if data doesn't fit into a certain schema or has issues, it won't be loaded into the warehouse; it will simply be dropped.

Pros of Data Warehouses

  • Easy to sift through data & query it: the structured nature of a warehouse makes it easier to find what you're looking for.
  • Historical data analysis: again, because of the structured nature of the data, it's easy to visualize it and gain insights from historical trends.

Cons of Data Warehouses

  • Potential for data errors & loss: data warehouses generally require that data adheres to a predetermined schema. If data does not fit into that schema, it cannot be loaded. A user must perform some cleanup of the data before it can be stored.
  • Less flexible data: data in a warehouse has already been shaped for a specific purpose, so you'll have to transform it again to make it more universal. And, because the data was fit into a predefined structure, some data may have been omitted when it was initially loaded.

Storing Events in Lakes vs. Warehouses

Data lakes and warehouses are two of the most common types of event destinations used to store event data. Their strengths and weaknesses for event analysis generally align with the above, but once you decide to get serious about collecting consumer data, here is what you should know about sending event data to these tools:

Data Lakes

  • Scripts and queries run on data housed in a data lake are costly. The data is cheap and easy to access, but pulling insights out of it can break the bank if you run too much analysis.
  • Events and their related fields will almost always be loaded into a data lake. The only way an event will not be saved is if it lacks a key piece of data, such as the write key that tells our system where to send the event.
  • Raw data in a data lake means much more flexibility. If data is loaded directly into a destination or warehouse, that event is transformed to meet the specific spec required by the destination. This means that you no longer have access to the data in its raw form.
  • Using a data lake with MetaRouter unlocks data replay capabilities. If an event is stored in S3 in a raw format, we can run that data through any server-side destination.
  • Data lakes typically require more setup to query. AWS Athena, for example, requires you to use AWS Glue to create a schema from which you can query data stored in S3 (see the sketch after this list). You may also need to transform your data before you're able to query it. Additionally, you have to think about how you partition data, as well as the data format: Avro and Parquet, for example, are more efficient to query than JSON, which improves performance and saves on cost.
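As a rough sketch of that extra setup, here's what running an Athena query from Python with boto3 can look like, assuming a Glue database and table already exist over your S3 data. The database, table, and result-bucket names are placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

# Placeholder Glue database/table, plus an S3 location where Athena
# writes its query results.
query = """
    SELECT event, COUNT(*) AS views
    FROM events
    WHERE dt = '2020-06-01'
    GROUP BY event
"""

run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "event_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Athena is asynchronous: poll until the query finishes, then fetch rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```

Note the moving parts involved before a single result comes back: a Glue schema, an output bucket, and per-query scan costs that depend heavily on partitioning and file format.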

Data Warehouses

  • The general go-to for event analysis. Once a schema is populated with data, it is very easy to search for data within the schema.
  • It is likewise easy to run queries on this data and pull insights out of it. Your typical data analyst should be able to find value easily in data stored in a warehouse.
  • Data loaded into a warehouse must conform to a defined schema. If data is not in the format that is expected, the event will be dropped unless the service used to load data can accommodate malformed data. Even if malformed data finds its way into your warehouse, you'll still have to perform manual cleanup of the data if you want to run your queries accurately.
  • MetaRouter streams data into BigQuery (a minimal sketch follows this list), while we use a more typical batch-loading style for Redshift; this might change soon. The key is that batch loading can be less expensive for users, but it introduces another step in the data-load process. That step means longer waits before data is visible (typically an hourly or daily cadence for batch jobs) and introduces the possibility of data loss due to inconsistent data. Streaming data is generally easier to manage and troubleshoot, but it increases cost for some partners.
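For a sense of what the streaming path looks like, here's a minimal sketch of a streaming insert with the google-cloud-bigquery client. The table name and row values are placeholders, and this is an illustration of the general pattern rather than how MetaRouter's pipeline is implemented.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured

# Placeholder table; assumes it already exists with a matching schema.
table_id = "my-project.analytics.page_views"

rows = [
    {
        "user_id": "user-123",
        "url": "/pricing",
        "viewed_at": "2020-06-01T12:00:00Z",
        "duration_seconds": 14.2,
    },
]

# Streaming insert: rows become queryable almost immediately, but any row
# that doesn't match the table's schema is returned in the error list
# instead of being loaded.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Some rows were rejected:", errors)
```

This is the trade-off described above in miniature: no batch wait before the data is queryable, but schema mismatches surface as rejected rows you have to handle.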

Comparing Data Lakes vs Data Warehouses

Choosing between a data lake and a data warehouse isn’t always straightforward, as each brings its own set of strengths and challenges. Data lakes and data warehouses tackle different needs, from massive raw data storage to structured data ready for quick analysis. To make things clearer, the table below breaks down how they stack up in areas like scalability, data processing, and security. Whether you’re exploring flexible data storage or need a structured approach to business insights, this comparison will help guide you toward the right fit for your data strategy.

Category | Data Lake | Data Warehouse
Data Storage and Structure | Stores raw data in its native format (structured, semi-structured, unstructured) | Stores structured, processed data in a predefined schema
Data Processing | Uses schema-on-read, applying structure when analyzed; suited for flexible, exploratory analysis | Uses schema-on-write, requiring data to conform to a defined schema; optimized for predefined queries
Scalability and Cost | Highly scalable and cost-effective for large volumes of data | Less scalable and more costly due to structured nature and processing requirements
Agility and Flexibility | Flexible and agile, enabling quick adaptation to new data and analytic needs | Less adaptable, better suited for consistent data and reporting needs
Security and Governance | Challenging to govern due to raw data volume and variety; risk of becoming a “data swamp” | Easier to manage data governance with structured data, providing better control over data access
Use Cases | Ideal for big data processing, machine learning, and exploratory analytics | Best for business intelligence, reporting, and historical data analysis
User Types | Data scientists, ML engineers, developers requiring raw data | Business analysts, decision-makers needing quick access to structured data
Data Quality | Requires robust governance to prevent a “data swamp” | Enforces data quality through ETL processes and predefined schemas
Query Performance | Slower query performance; batch processing suited for large datasets | Optimized for fast querying; real-time analytics and dashboarding supported
Data Integration | Supports diverse data ingestion with late binding at analysis time | Structured integration for consistent data from well-defined sources
Machine Learning Support | Supports ML workflows with large datasets and various formats | Limited to BI and may require external tools for advanced ML model development
Data Retention | Suitable for long-term raw data storage at lower cost | Typically retains recent or aggregated data, with cost considerations for long-term storage
Data Democratization | Enables broad access to raw data, requiring technical skills for analysis | Provides business-ready data access for non-technical users with self-service BI tools
Ecosystem and Tool Support | Integrates with big data tools like Hadoop and Spark, requiring specialized skills | Supported by mature BI tools, easier to find skilled professionals for implementation

Data lake or data warehouse? 

It’s never quite that simple, is it? Often, our clients choose both. This allows for easy data analysis when necessary, but also unlocks the flexibility and data replay capability that we mentioned earlier. 

However, if you must choose one or the other, look to the above pros and cons to decide exactly what the needs are for you and your organization.

And, as a word of caution, the current solutions that are available can differ by quite a bit in terms of functionality and cost. Do your homework, understand what the major players in the space have to offer and their price points, and you'll be well on your way towards realizing the benefits of Big Data insights!

As a follow-up to this article, we'll analyze a few of the solutions out there that we have experience with, share thoughts on choosing the right cloud, and offer our recommendations on which tools are best to accomplish the complex data science needs of 2020 and beyond. 

Photo by Marko Blažević on Unsplash