AWS Data Engineering


1. Overview

This post walks through how I set up a data engineering pipeline on AWS. The goal was to build a reliable, scalable system that ingests raw data, transforms it, and serves it up for analytics — all using managed AWS services so I’m not babysitting servers.

The pipeline follows a classic Extract → Transform → Load (ETL) pattern, with a few extras like orchestration, monitoring, and cost controls baked in.

2. Architecture

Here’s the high-level flow:

Data Sources → S3 (Raw) → Glue (Transform) → S3 (Processed) → Athena / Redshift → Dashboard
                                  ↑
                          Step Functions (Orchestration)
                                  ↑
                         EventBridge (Scheduling)
Component        AWS Service             Role
--------------   --------------------    ------------------------------------------
Storage          S3                      Landing zone for raw and processed data
Catalog          Glue Data Catalog       Schema discovery and metadata management
ETL              Glue Jobs (PySpark)     Data cleaning, transformation, aggregation
Orchestration    Step Functions          Coordinate multi-step ETL workflows
Scheduling       EventBridge             Trigger pipelines on a schedule or event
Query            Athena                  Serverless SQL queries on S3 data
Warehouse        Redshift (optional)     Heavy analytical workloads
IAM              IAM Roles & Policies    Least-privilege access across services
Monitoring       CloudWatch              Logs, metrics, and alerts

3. Data Ingestion

Raw data lands in S3 as CSV, JSON, or Parquet files from APIs, databases, or streaming sources. I organize the bucket with a partitioned structure:

s3://my-data-lake/
  raw/
    source=api_a/
      year=2026/month=02/day=21/
        data.json
  processed/
    table_name/
      year=2026/month=02/
        part-00000.parquet
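
For batch sources, landing a file in this layout is just an S3 put with a Hive-style partitioned key. A minimal sketch with boto3, where the records payload stands in for whatever the extraction step pulled from the source API:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
now = datetime.now(timezone.utc)

# Build a key that matches the raw/ layout above
key = f"raw/source=api_a/year={now:%Y}/month={now:%m}/day={now:%d}/data.json"

# Stand-in payload; in practice this comes from the upstream API or export
records = [{"user_id": 1, "amount": 42.0}]

s3.put_object(
    Bucket="my-data-lake",
    Key=key,
    Body=json.dumps(records).encode("utf-8"),
)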

For real-time ingestion, Kinesis Data Firehose can stream data directly into S3 with automatic batching and compression.
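
On the producer side that is a single API call. A minimal sketch, assuming a delivery stream named raw-events-stream that has been configured to deliver into the raw/ prefix of the bucket:

import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers, batches, and optionally compresses records before writing to S3
event = {"user_id": 1, "amount": 42.0}  # stand-in payload
firehose.put_record(
    DeliveryStreamName="raw-events-stream",  # placeholder stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)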

4. Transformation with Glue

AWS Glue handles the heavy lifting. I write PySpark jobs that:

  • Clean — Drop nulls, fix data types, deduplicate.
  • Transform — Join datasets, calculate derived fields, aggregate.
  • Partition — Write output in Parquet format, partitioned by date for fast querying.

A typical Glue job looks something like this:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import current_date
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from catalog
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_table"
)

# Transform
df_clean = df.toDF() \
    .dropDuplicates() \
    .filter("amount > 0") \
    .withColumn("processed_date", current_date())

# Write back to S3 as Parquet
df_clean.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://my-data-lake/processed/clean_table/")

job.commit()

The Glue Data Catalog keeps track of all the schemas automatically — once a Crawler runs, Athena can query the data right away.
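
Crawlers can also be triggered from code, which is handy right after a load. A minimal sketch with boto3; the crawler name is a placeholder, and the database name matches the job above:

import boto3

glue = boto3.client("glue")

# Kick off the crawler so new tables and partitions get registered
glue.start_crawler(Name="processed-data-crawler")  # placeholder crawler name

# Once it finishes, the catalog exposes each table and its S3 location
for table in glue.get_tables(DatabaseName="my_database")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])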

5. Orchestration with Step Functions

Step Functions tie the pipeline together. A typical workflow:

  1. Trigger — EventBridge fires on a schedule (e.g. daily at 2 AM UTC).
  2. Crawl raw data — Glue Crawler updates the catalog.
  3. Run ETL job — Glue job transforms and writes processed data.
  4. Crawl processed data — Update the catalog with new partitions.
  5. Notify — SNS sends a success/failure alert.

Each step has built-in retry logic and error handling, so if a Glue job fails, it retries before alerting.
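
As a rough illustration, the state machine definition (ASL) for that workflow looks something like the sketch below, written as a Python dict and registered with boto3. The crawler, job, and topic names plus the ARNs are placeholders, and a production version would also poll the crawlers for completion, since StartCrawler returns immediately:

import json

import boto3

definition = {
    "StartAt": "CrawlRawData",
    "States": {
        "CrawlRawData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "raw-data-crawler"},
            "Next": "RunEtlJob",
        },
        "RunEtlJob": {
            "Type": "Task",
            # .sync waits for the Glue job to finish before moving on
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "clean-table-job"},
            "Retry": [
                {"ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2}
            ],
            "Next": "CrawlProcessedData",
        },
        "CrawlProcessedData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "processed-data-crawler"},
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                "Message": "Daily ETL pipeline finished",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="daily-etl-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-step-functions-role",  # placeholder
)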

6. Querying with Athena

Once data is in processed S3 buckets and cataloged, Athena lets me run SQL directly on it — no loading into a database, no infrastructure to manage.

SELECT
    user_id,
    COUNT(*) AS total_events,
    SUM(amount) AS total_amount
FROM processed_db.clean_table
WHERE year = '2026' AND month = '02'
GROUP BY user_id
ORDER BY total_amount DESC
LIMIT 100;

Athena charges per TB scanned, so partitioning and Parquet compression make a real difference in cost.
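
The same queries can be run from a script instead of the console. A minimal sketch with boto3; the results bucket is a placeholder:

import time

import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes the result set to the given S3 location
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM clean_table WHERE year = '2026' AND month = '02'",
    QueryExecutionContext={"Database": "processed_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(state)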

7. Cost & Performance Tips

A few things I picked up along the way:

  • Parquet over CSV — Columnar format = faster queries and way less data scanned.
  • Partition everything — Date-based partitions cut Athena costs dramatically.
  • Right-size Glue DPUs — Start with the minimum workers and scale up only if jobs are slow.
  • Use Glue bookmarks — Process only new data instead of re-processing the full dataset every run.
  • Set S3 lifecycle rules — Move old raw data to Glacier after 90 days (see the sketch after this list).
  • Tag everything — Makes cost tracking and cleanup much easier.
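
The lifecycle rule is a one-time bucket configuration. A minimal sketch with boto3, reusing the bucket and raw/ prefix from the layout in section 3:

import boto3

s3 = boto3.client("s3")

# Move objects under raw/ to Glacier 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-after-90-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)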

8. What I Learned

  • Serverless doesn’t mean zero ops — You still need to monitor, tune, and debug. CloudWatch dashboards and alerts are essential.
  • Schema evolution is tricky — When source data changes shape, the Glue Catalog and downstream queries can break. Adding schema validation early saves headaches.
  • Start simple, then optimize — It’s tempting to over-engineer from day one. A basic S3 → Glue → Athena pipeline handles most use cases before you need Redshift or EMR.
  • IAM is the real boss — Getting permissions right across S3, Glue, Step Functions, and Athena took more time than the actual data work.
