Data lakes power some of the world’s most data-driven organizations.
In this hands-on workshop, you won’t just learn the theory – you’ll build a fully functional data lake with the same open-source tools used in production: MinIO, Trino, and the Hive Metastore, all driven through SQL.
By the end of the day, you’ll walk away with:
A working data lake you built yourself
The ability to design and deploy a data lake architecture end to end
Hands-on experience with distributed SQL, columnar storage and ACID-compliant table formats
Practical knowledge of security, cost optimization, and production best practices
Who Should Attend
Data Engineers
Developers
DBAs
Prerequisites
No prior experience with data lakes is required.
Participants should be comfortable working with SQL and using the command line.
Course Contents
Module 1 – Foundations of Data Lakes
Understand the “why” before the “how.”
What is a data lake — and what isn’t
Data lakes vs. databases vs. data warehouses
Real-world use cases and adoption patterns
Architecture deep dive: storage, compute, and metadata layers
Why Apache Parquet is the lingua franca of analytics
Module 2 – Environment Setup
Spin up your lab in minutes.
Docker essentials: containers, images, and orchestration
Launch the full workshop stack with a single command
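As a taste of what “a single command” means here, a minimal Docker Compose sketch of such a stack might look like the following. The image tags, ports, and service names are illustrative assumptions, not the workshop’s actual files:

```yaml
# Hypothetical sketch of a MinIO + Hive Metastore + Trino stack.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports: ["9000:9000", "9001:9001"]
  hive-metastore:
    image: apache/hive:4.0.0
    environment:
      SERVICE_NAME: metastore   # run the image in metastore mode
    ports: ["9083:9083"]
  trino:
    image: trinodb/trino
    ports: ["8080:8080"]
```

With a file like this in place, `docker compose up -d` brings up the whole lab.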
Module 3 – Object Storage
Build the foundation layer.
Deploy MinIO as an S3-compatible object store
Organize, load, and browse datasets
Understand buckets, prefixes, and access patterns
Module 4 – Query Engine & Metastore
Make your data queryable.
Deploy Hive Metastore for centralized schema management
Deploy Trino as a high-performance distributed SQL engine
Create external tables and run your first queries
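An external table in Trino is, roughly, a sketch like the one below. It assumes a Trino catalog named `hive` backed by the Hive Metastore and CSV files already uploaded to a MinIO bucket called `lake`; all schema, table, and path names are hypothetical. (Note that Trino’s Hive connector requires every column of a CSV-format table to be `varchar`.)

```sql
-- Assumes catalog "hive" and bucket s3a://lake/ (illustrative names).
CREATE SCHEMA IF NOT EXISTS hive.raw
WITH (location = 's3a://lake/raw/');

-- External table: Trino reads the files in place, nothing is copied.
CREATE TABLE hive.raw.trips (
    trip_id  varchar,
    duration varchar,
    city     varchar
)
WITH (
    format = 'CSV',
    external_location = 's3a://lake/raw/trips/'
);

SELECT count(*) FROM hive.raw.trips;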
Module 5 – Data Transformation
Turn raw data into analytics-ready assets.
Transform raw CSV into optimized Parquet (Tier 1 → Tier 2)
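In Trino, a Tier 1 → Tier 2 transformation of this kind can be sketched as a single CREATE TABLE AS SELECT: read the raw CSV table, cast columns to proper types, and write Parquet. The schema, table, and path names below are illustrative assumptions:

```sql
-- Tier 1 (raw CSV) -> Tier 2 (analytics-ready Parquet) via CTAS.
CREATE TABLE hive.curated.trips
WITH (
    format = 'PARQUET',
    external_location = 's3a://lake/curated/trips/'
)
AS
SELECT
    trip_id,
    CAST(duration AS integer) AS duration_seconds,
    city
FROM hive.raw.trips;
```

Downstream queries then hit the compact, typed Parquet table instead of scanning raw CSV.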