Multiply Your Data Stack Performance

Distributed Compute & Storage from the Inside Out

Main Speaker

Learning Tracks

Course ID

42936

Date

29-06-2026

Time

Daily seminar
9:00-16:30

Location

John Bryce ECO Tower, Homa Umigdal 29, Tel-Aviv

Overview

AI tools write your ETL pipelines faster than ever – functional, correct code that gets the job done. But they make design decisions without understanding your data distribution, your query patterns, or how the compute engine will actually execute the work – and that's where the real cost hides. Your Spark job runs for 40 minutes, so you start guessing: more executors, different API calls, code rewrites – but nothing moves the needle, because the bottleneck was never in the code. This seminar teaches you to stop guessing by understanding how distributed compute and storage actually work, from the inside out, so you can design a data stack that's fast and cost-efficient at every layer. Every concept is demonstrated hands-on in Spark and Athena against the same data, with real measurements.

Who Should Attend

  • Data Engineers
  • Backend Developers working with data pipelines
  • Platform Engineers managing data infrastructure
  • Analytics Engineers writing SQL against data lake / warehouse systems
  • Team Leads and Architects making data platform design decisions
 

Prerequisites

  • Working knowledge of Python and SQL.
  • Basic familiarity with Spark (PySpark) and at least one SQL query engine (Athena, Presto, Trino, or similar).
  • No deep expertise required – we build up from fundamentals.

Course Contents

  • Distributed Query Execution – How Engines Break Work Apart
    • How Spark and Athena split queries into stages and distribute work across nodes – demonstrated side-by-side on the same data
    • Reading execution plans in Spark UI and Athena EXPLAIN as a practical diagnostic skill
    • Identifying parallelism bottlenecks and understanding where the engine spends its time
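To make the stage model concrete before the seminar: the way an engine breaks a grouped aggregation into a parallel map stage and a merge stage can be sketched in a few lines of pure Python (an illustrative toy, not Spark or Athena internals – the function names and data are invented for the example):

```python
# A minimal pure-Python sketch of how a distributed engine splits
# "SELECT status, COUNT(*) FROM events GROUP BY status" into stages:
# stage 1 runs a partial aggregate per input split in parallel (map side),
# stage 2 merges the partial results (reduce side).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(split):
    """Stage 1: each task aggregates only its own split of the data."""
    return Counter(row["status"] for row in split)

def merge(partials):
    """Stage 2: combine per-split partial aggregates into the final result."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

rows = [{"status": "ok"}] * 6 + [{"status": "error"}] * 2
splits = [rows[i:i + 4] for i in range(0, len(rows), 4)]  # the engine's input splits

with ThreadPoolExecutor() as pool:  # stage-1 tasks run in parallel
    partials = list(pool.map(partial_aggregate, splits))

print(merge(partials))  # {'ok': 6, 'error': 2}
```

The point the execution-plan module builds on: stage 1 scales with the number of splits, while the hand-off between the stages is where the expensive work (covered under shuffles below) happens.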
  • Data Scanning & Skipping – Making the Engine Read Less
    • Partition pruning, predicate pushdown and column pruning – the three skipping mechanisms every engine uses and how they work mechanically
    • Inside Parquet: how the file format enables data skipping and why sort order is one of the most impactful optimizations most engineers never apply
    • Measuring real data scan reduction and cost savings in both engines
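The skipping mechanics this module covers can be modeled in a few lines of pure Python (a toy model with invented names, not Parquet's actual reader): each row group carries min/max statistics, the reader skips any group whose range cannot contain the predicate value, and sorting the column first tightens every group's range so far more groups become skippable.

```python
# Toy model of min/max row-group statistics: a reader skips any row group
# whose [min, max] range cannot contain the predicate value. Sorting the
# column first tightens each group's range, making more groups skippable.
def row_groups(values, group_size=4):
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]  # (min, max, rows) per group

def scan_equal(groups, target):
    """Read only the groups whose min/max range can contain target."""
    read = [g for lo, hi, g in groups if lo <= target <= hi]
    return len(read)  # number of row groups actually read

values = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]
unsorted_reads = scan_equal(row_groups(values), 9)        # groups read, unsorted layout
sorted_reads = scan_equal(row_groups(sorted(values)), 9)  # groups read, sorted layout
print(unsorted_reads, sorted_reads)  # 2 1
```

Same data, same query – the sorted layout reads half the row groups here, and the effect grows with data volume, which is why sort order shows up in the measured scan costs in both engines.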
  • Shuffles & Data Movement – The Expensive Operation Everyone Ignores
    • What a shuffle actually does and why it dominates execution time in distributed queries
    • Recognizing shuffles in execution plans across both engines
    • Broadcast strategies, pre-aggregation and partition skew – practical patterns to reduce data movement
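What the shuffle module describes can be sketched in pure Python (illustrative only – the routing function and tables are invented for the example): a shuffle hash-partitions every row by its join key so both sides of each key land on the same node, while a broadcast join copies the small table to every node and skips that movement entirely.

```python
# Pure-Python sketch of a shuffle for a join: every row is routed to the
# node that owns its join key's hash partition, so matching keys from both
# tables meet on the same node. A broadcast join avoids moving the big
# table by copying the small one to every node instead.
def shuffle(rows, key, num_nodes):
    """Route each row to a node by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

orders = [{"user_id": u, "amount": a} for u, a in [(1, 10), (2, 20), (1, 5)]]
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]

# Shuffle join: BOTH tables move across the network.
shuffled_orders = shuffle(orders, "user_id", num_nodes=2)
shuffled_users = shuffle(users, "user_id", num_nodes=2)

# Broadcast join: only the small table moves; each node joins locally.
small = {u["user_id"]: u["name"] for u in users}  # the broadcast copy
joined = [{**o, "name": small[o["user_id"]]} for o in orders]
print(joined)
```

Note that in the shuffle variant the large `orders` table crosses the network in full; the broadcast variant moves only `users` – which is exactly the trade-off the module's measurements make visible.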
  • Storage Patterns That Make Everything Faster
    • File sizing, partitioning strategy and sort order – how physical data layout in S3 determines performance for every engine reading from it
    • Restructuring data with Spark and measuring the downstream impact in both Spark and Athena
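Why partitioning strategy pays off for every engine reading the lake can be seen in a toy model of a Hive-style layout (the S3 paths are illustrative): because the partition value is encoded in the directory name, the engine prunes whole directories before opening a single file.

```python
# Toy model of a Hive-style partitioned layout in S3 (paths illustrative):
# the partition value lives in the directory name, so a predicate on the
# partition column prunes whole directories without opening any file.
files = [
    f"s3://lake/events/dt={day}/part-{i:03d}.parquet"
    for day in ("2026-06-01", "2026-06-02", "2026-06-03")
    for i in range(4)
]

def prune(files, dt):
    """Keep only files whose partition directory matches the predicate."""
    return [f for f in files if f"/dt={dt}/" in f]

print(len(files), len(prune(files, "2026-06-02")))  # 12 files total, 4 after pruning
```

The same layout decision also fixes the per-file count and size, which is why file sizing and partitioning are treated together in this module: too many small files means scheduling overhead, too few partitions means no pruning.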
  • End-to-End Pipeline Optimization
    • A complete, realistic scenario: Spark transformation → S3/Parquet → Athena query – diagnosing and fixing bottlenecks across the entire chain
   
