Data Lakes in AWS

Usman Khan Niazi
6 min read · Jan 23, 2022

A data lake is a centralized, secure repository that allows you to store, govern, discover, and share all of your structured and unstructured data at any scale. Data lakes don’t require a pre-defined schema, so you can process raw data without having to know in advance which insights you might want to explore.
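To make the “no pre-defined schema” idea concrete, here is a minimal sketch of landing both structured and semi-structured data in an S3 bucket with boto3. The bucket name, file names, and key prefixes are hypothetical; the point is simply that nothing about the data’s shape has to be declared before it is stored.

```python
import json
import boto3

# Hypothetical bucket and file names, for illustration only.
BUCKET = "example-data-lake-raw"

s3 = boto3.client("s3")

# Structured data: a CSV export from an operational system, stored as-is.
s3.upload_file("orders_2022-01-23.csv", BUCKET, "raw/orders/2022/01/23/orders.csv")

# Semi-structured data: a JSON event from an IoT sensor, no schema declared up front.
event = {"device_id": "sensor-42", "temp_c": 21.7, "ts": "2022-01-23T10:15:00Z"}
s3.put_object(
    Bucket=BUCKET,
    Key="raw/iot/2022/01/23/sensor-42.json",
    Body=json.dumps(event).encode("utf-8"),
)
```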

The challenges of big data

The challenges faced with big data are: data silos, difficulty analyzing diverse datasets, data controllership, data security, and incorporating machine learning (ML). Let’s take a closer look at these challenges and see how a data lake can help solve them.

Breaking down silos

A major reason companies choose to create data lakes is to break down data silos. Having pockets of data in different places, controlled by different groups, inherently obscures data. This often happens when a company grows fast and/or acquires new businesses.

It’s also difficult to get granular details from the data, because not everybody has access to the various data repositories. For smaller queries, you could share a cut of the data in a spreadsheet. But challenges arise when data exceeds the capacity of a spreadsheet (which often happens for larger companies). In some cases, you could share a higher-level summary of the data, but then you’re really not getting the full picture.

A data lake solves this problem by uniting all the data into one central location. Teams can continue to function as nimble units, but all roads lead back to the data lake for analytics. No more silos.

Analyzing diverse datasets

Another challenge of using different systems and approaches to data management is that the data structures and information vary. For example, Amazon Prime has data for fulfillment centers and packaged goods, while Amazon Fresh has data for grocery stores and food. Even shipping programs differ internationally. For example, different countries sometimes have different box sizes and shapes. There’s also an increasing amount of unstructured data coming from Internet of Things (IoT) devices (like sensors on fulfillment center machines).

What’s more, different systems may also hold the same type of information, but labeled differently. For example, in Europe the term used is “cost per unit,” while in North America the term used is “cost per package,” and the two regions also record dates in different formats. In cases like this, a link needs to be made between the two labels so people analyzing the data know they refer to the same thing.
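As a small illustration, the sketch below maps the two regional labels onto one canonical column name before combining the datasets. The column names and values are made up for the example; in practice this kind of mapping would live in a shared catalog or transformation layer.

```python
import pandas as pd

# Hypothetical column mapping: both regional labels refer to the same measure.
COLUMN_ALIASES = {
    "cost_per_unit": "unit_cost",      # European label
    "cost_per_package": "unit_cost",   # North American label
}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename region-specific columns to a single canonical name."""
    return df.rename(columns={c: COLUMN_ALIASES.get(c, c) for c in df.columns})

eu = pd.DataFrame({"sku": ["A1"], "cost_per_unit": [2.50]})
na = pd.DataFrame({"sku": ["B7"], "cost_per_package": [3.10]})

combined = pd.concat([harmonize(eu), harmonize(na)], ignore_index=True)
print(combined)
```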

If you wanted to combine all of this data in a traditional data warehouse without a data lake, it would require a lot of data preparation and extract, transform, and load (ETL) work. You would have to make trade-offs on what to keep and what to lose, and continually change the structure of a rigid system.

Data lakes allow you to import any amount of data in any format because there is no pre-defined schema. You can even ingest data in real time. You can collect data from multiple sources and move it into the data lake in its original format. You can also build links between information that might be labeled differently but represents the same thing. Moving all your data to a data lake also improves what you can do with a traditional data warehouse. You have the flexibility to store highly structured, frequently accessed data in a data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in your data lake storage.
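For the real-time path, one common pattern is to push events through a delivery stream that writes them to the data lake in their original format. The sketch below assumes a Kinesis Data Firehose delivery stream (the stream name is a placeholder) that is already configured to deliver into the S3 bucket.

```python
import json
import boto3

firehose = boto3.client("firehose")

# A single event in its original JSON form; field names are made up for the example.
record = {"source": "fresh-stores", "item": "apples", "cost_per_unit": 0.42}

# Send the record to a (hypothetical) delivery stream that lands data in S3.
firehose.put_record(
    DeliveryStreamName="example-datalake-ingest",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```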

Managing data access

With a data lake, it’s easier to get the right data to the right people at the right time. Instead of managing access for all the different locations in which data is stored, you only have to worry about one set of credentials. Data lakes have controls that allow authorized users to see, access, process, and/or modify specific assets. Data lakes help ensure that unauthorized users are blocked from taking actions that would compromise data confidentiality and security.
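As one example of “a single place to manage access,” the sketch below attaches a bucket policy that lets a single analyst role read only the curated prefix of the lake. The bucket name, account ID, and role are placeholders; a real deployment would typically combine IAM policies, bucket policies, and services such as Lake Formation.

```python
import json
import boto3

# Placeholder names for the lake bucket and the role allowed to read curated data.
BUCKET = "example-data-lake-raw"
ANALYST_ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleAnalystRole"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalystsReadCuratedOnly",
            "Effect": "Allow",
            "Principal": {"AWS": ANALYST_ROLE_ARN},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/curated/*",
        }
    ],
}

# Apply the policy to the bucket so access is governed in one place.
boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```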

With a data lake, data is stored in an open format, which makes it easier to work with different analytic services. Open format also makes it more likely for the data to be compatible with tools that don’t even exist yet. Various roles in your organization, like data scientists, data engineers, application developers, and business analysts, can access data with their choice of analytic tools and frameworks.

You’re not locked in to a small set of tools, and a broader group of people can make sense of the data.

Accelerating machine learning

A data lake is a powerful foundation for ML and AI (artificial intelligence), because ML and AI thrive on large, diverse datasets. ML uses statistical algorithms that learn from existing data, a process called training, to make decisions about new data, a process called inference. During training, patterns and relationships in the data are identified to build a model. The model allows you to make intelligent decisions about data it hasn’t encountered before. The more data you have, the better you can train your ML models, resulting in improved accuracy.
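The training/inference split can be shown with a deliberately tiny sketch: read a curated dataset straight out of the lake, fit a model on one portion, and score it on data the model has not seen. The S3 path and column names are assumptions, and reading s3:// URLs with pandas requires the s3fs package; a production workflow would more likely use Amazon SageMaker.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical curated dataset in the lake (needs the s3fs package installed).
df = pd.read_csv("s3://example-data-lake-raw/curated/orders/orders.csv")

X = df[["unit_cost", "quantity"]]  # assumed feature columns
y = df["returned"]                 # assumed binary label column

# Hold out 20% of the data to simulate "data the model hasn't encountered before".
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)        # training
print("holdout accuracy:", model.score(X_test, y_test))   # inference on unseen data
```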

Using the right tools: Galaxy on AWS

The Galaxy data lake is built on Amazon Simple Storage Service (Amazon S3), an object storage service that offers unmatched availability, durability, and scalability. Some data is also housed on Amazon proprietary file-based data stores, Andes and Elastic Data eXchange, both of which are service layers on top of Amazon S3. Some other data sources are Amazon Redshift, a data warehouse, Amazon Relational Database Service (Amazon RDS), a relational database, and enterprise applications.

AWS Glue, a fully managed ETL service that makes it easy for you to prepare and load data for analytics, and AWS Database Migration Service (AWS DMS) are used to onboard the various data sets to Amazon S3. Galaxy combines metadata assets from multiple services, including Amazon Redshift, Amazon RDS, and the AWS Glue Data Catalog, into a unified catalog layer built on Amazon DynamoDB, a key-value and document database. Amazon Elasticsearch Service (Amazon ES) is used to enable faster search queries on the catalog.
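The Galaxy catalog layer itself is internal to Amazon, but the AWS Glue Data Catalog part of the picture can be sketched with boto3: create a crawler over a raw S3 prefix and run it so the tables it discovers become queryable. The crawler name, IAM role, database, and path below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler (placeholder names) that scans a raw S3 prefix and writes
# the discovered table definitions into the Glue Data Catalog.
glue.create_crawler(
    Name="example-orders-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
    DatabaseName="example_datalake_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake-raw/raw/orders/"}]},
)

# Run the crawler so newly landed data shows up in the catalog.
glue.start_crawler(Name="example-orders-crawler")
```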

After the data has been cataloged (onboarded), various services are used at the client layer. For example, Amazon Athena, an interactive query service, is used for ad hoc exploratory queries in standard SQL; Amazon Redshift for more structured queries and reporting; and Amazon SageMaker for machine learning.
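An ad hoc Athena query against a cataloged table can be submitted in a few lines with boto3. The database, table, and results location below are hypothetical; Athena writes query results to whatever S3 output location you specify.

```python
import boto3

athena = boto3.client("athena")

# Submit a standard-SQL query against a (hypothetical) cataloged table.
response = athena.start_query_execution(
    QueryString="SELECT sku, AVG(unit_cost) AS avg_cost FROM orders GROUP BY sku LIMIT 10",
    QueryExecutionContext={"Database": "example_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```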

AWS Lake Formation

The Amazon team created the Galaxy data lake architecture from the ground up. They had to develop many of the components manually over months, which is similar to how other companies have had to do this in the past. In August 2019, AWS released a new service called AWS Lake Formation. It allows you to streamline the data lake creation process and build a secure data lake in days instead of months. Lake Formation helps you collect and catalog data from databases and object storage, move the data into your new Amazon S3 data lake, clean and classify your data using machine learning algorithms, and secure access to your sensitive data.
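Two of the Lake Formation steps mentioned above, registering an S3 location and granting fine-grained access, look roughly like the sketch below. All ARNs, database, and table names are placeholders, and a real setup involves more configuration (ingestion via crawlers or blueprints, designating data lake administrators, and so on).

```python
import boto3

lf = boto3.client("lakeformation")

# Register the (placeholder) S3 bucket as a data lake location managed by Lake Formation.
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake-raw",
    UseServiceLinkedRole=True,
)

# Grant a (placeholder) analyst role read access to one cataloged table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/ExampleAnalystRole"},
    Resource={"Table": {"DatabaseName": "example_datalake_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```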

Summary

By storing data in a unified repository in open standards-based data formats, data lakes allow you to break down silos, use a variety of analytics services to get the most insights from your data, and cost-effectively grow your storage and data processing needs over time.
