Data Engineering Roadmap
Philosophy: If all you can do is query SQL, you can retrieve data. If you can design a pipeline that extracts from 5 different sources → transforms with business logic → loads into a warehouse → runs on an automated schedule → monitors data quality, you can build data infrastructure, and that is what interviews ask about. A recruiter will ask, "OK, the data is in the warehouse, but what happens when the source changes its schema?" If you answer, "My pipeline has schema validation at the ingestion layer, a dead letter queue for failed records, and alerting when the row count drops by more than 20%," that answer closes the question.
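The interview answer above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production ingestion layer: the `orders` schema, the `IngestResult` container, and the `ingest` function are all hypothetical names invented for this example, and the dead letter queue is just an in-memory list standing in for a real queue or table.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical contract for an "orders" feed: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


@dataclass
class IngestResult:
    accepted: list = field(default_factory=list)     # rows that passed validation
    dead_letter: list = field(default_factory=list)  # (row, reason) pairs for rejected rows
    alerts: list = field(default_factory=list)       # messages a real pipeline would page on


def validate_row(row: dict) -> Optional[str]:
    """Return None when the row matches EXPECTED_SCHEMA, otherwise a rejection reason."""
    missing = sorted(EXPECTED_SCHEMA.keys() - row.keys())
    if missing:
        return f"missing columns: {missing}"
    for column, expected_type in EXPECTED_SCHEMA.items():
        if not isinstance(row[column], expected_type):
            return f"{column}: expected {expected_type.__name__}"
    return None


def ingest(rows: list, previous_row_count: int, max_drop: float = 0.20) -> IngestResult:
    """Validate a batch, park bad rows in the dead letter list, alert on a row-count drop."""
    result = IngestResult()
    for row in rows:
        reason = validate_row(row)
        if reason is None:
            result.accepted.append(row)
        else:
            result.dead_letter.append((row, reason))

    # Compare against the previous run; a sharp drop often means the source changed.
    if previous_row_count > 0:
        drop = (previous_row_count - len(result.accepted)) / previous_row_count
        if drop > max_drop:
            result.alerts.append(f"row count dropped {drop:.0%} vs previous run")
    return result


batch = [
    {"order_id": 1, "amount": 10.0, "currency": "IDR"},
    {"order_id": 2, "amount": "12.5", "currency": "IDR"},  # amount arrived as a string
    {"order_id": 3, "currency": "USD"},                    # amount column missing
]
result = ingest(batch, previous_row_count=3)
print(len(result.accepted), len(result.dead_letter), result.alerts)
```

The point of the design is that bad rows never block good ones: they are set aside with a reason attached, and the row-count comparison catches the case where a schema change silently rejects most of a batch.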
Highest Salary Ceiling
Data Engineering is the path with the highest long-term salary ceiling in Indonesia. Senior Data Engineer at an Indonesian unicorn: IDR 30-60 million/month. Abroad (remote): $150-250k/year. Demand is still rising because every company needs data infrastructure.
🎯 Initial Checkpoint: Before You Start
Stack : Local machine (Python + Docker) → Proxmox VM for services
Path : Data Engineer → Senior Data Engineer → Data Architect
Specs : i7 Gen 7, 8GB RAM, GTX 1050
Career Target: Junior DE → Mid DE → Senior DE / Data Architect
Learning order:
Phase 1 (required foundations) : SQL Advanced + Python + Linux + Docker
Phase 2 (pipeline basics) : ETL/ELT + Airflow + dbt + PostgreSQL/DuckDB
Phase 3 (warehouse & lake) : Data Modeling + Spark/Polars + Cloud Storage
Phase 4 (production) : Streaming + Data Quality + Orchestration Advanced
Next step: Install Python and DuckDB, then set up Docker Compose for PostgreSQL + Airflow
Phase 1 - Foundations: SQL, Python & Infrastructure (Weeks 1-8)
Goal: SQL and Python MUST be at an intermediate level before you move on. There are no shortcuts. RAM Impact: Minimal (editor + terminal + database).
| Skill | What You Learn | A+B Combo That Proves It |
|---|---|---|
| SQL Advanced | Window functions, CTEs, subqueries, EXPLAIN, indexing, partitioning | SQL + complex analytical query = you can answer business questions directly |
| Python Data | pandas, polars, file I/O, API calls, error handling, typing | Python + data transformation pipeline = you can process data programmatically |
| Linux & Shell | Bash scripting, cron, ssh, file system, process management | Linux + scheduled script = you can automate without a UI |
| Docker | Containers, Compose, networking, volume mounts, multi-service setups | Docker + local dev environment = consistent setup everywhere |
| Git | Branching, PRs, versioning, .gitignore for data | Git + versioned pipeline code = you treat data pipelines like software |
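DuckDB Is a Game Changer
To learn advanced SQL, use DuckDB: an in-process analytical database that runs locally without a server. It queries CSV/Parquet files directly, and for datasets under roughly 10GB it is typically fast enough on a single machine that you do not need Spark. There is nothing to set up beyond installing it, and no background service holds RAM while idle. Perfect for a homelab.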
Phase 1 Portfolio Project:
Analytical SQL Challenge: take a public dataset (Kaggle), load it into DuckDB/PostgreSQL, and write 20+ complex queries (window functions, recursive CTEs, pivots, performance comparisons). Document each query with the business question it answers. This shows your SQL goes beyond SELECT * FROM.
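As a rough idea of what two of those queries might look like, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for DuckDB/PostgreSQL, so it runs with no extra installs. The `sales` table and its rows are invented for illustration. The first query combines a CTE with window functions (running total and month-over-month change per region); the second is a small recursive CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: monthly revenue per region.
conn.executescript("""
CREATE TABLE sales (region TEXT, month TEXT, revenue INTEGER);
INSERT INTO sales VALUES
  ('west', '2024-01', 100), ('west', '2024-02', 150), ('west', '2024-03', 120),
  ('east', '2024-01', 80),  ('east', '2024-02', 90);
""")

# Business question: how is each region trending month over month?
running_total_sql = """
WITH monthly AS (
    SELECT region, month, SUM(revenue) AS revenue
    FROM sales
    GROUP BY region, month
)
SELECT region,
       month,
       revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
       revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS change_vs_prev
FROM monthly
ORDER BY region, month
"""
trend_rows = conn.execute(running_total_sql).fetchall()
for row in trend_rows:
    print(row)

# Recursive CTE: generate a 1..5 sequence, the same pattern used to build date spines.
sequence_sql = """
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM counter WHERE n < 5
)
SELECT n FROM counter
"""
sequence = [n for (n,) in conn.execute(sequence_sql)]
print(sequence)
```

The same SQL should run on DuckDB and PostgreSQL with little or no change, which is why it is worth writing the portfolio queries in plain standard SQL first.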
Phase 2 - ETL/ELT Pipeline & Orchestration (Weeks 9-18)
Goal: Build your first pipeline that extracts → transforms → loads automatically. RAM Impact: Airflow ~1-2GB. PostgreSQL ~500MB. Shut down other services.
| Tool/Skill | RAM | What You Learn | A+B Combo That Proves It |
|---|---|---|---|
| Apache Airflow | ~1.5GB | DAG-based orchestration, scheduling, dependencies, retries, alerting | Airflow + custom DAG with error handling = you can orchestrate a production pipeline |
| dbt (data build tool) | ~200MB | SQL-based transformation, testing, documentation, lineage | dbt + tested models = you can transform and validate data in the warehouse |
| PostgreSQL | ~500MB | Simple warehouse: schema design, materialized views, partitioning | PostgreSQL + analytical schema = you understand dimensional modeling |
| API Extraction | - | REST APIs, pagination, rate limiting, error handling, incremental loads | API + idempotent extraction = you can build reliable ingestion |
| File Formats | - | CSV, JSON, Parquet, Avro: when to use which, and why | Format + performance comparison = you understand storage optimization |
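The "API + idempotent extraction" row is easier to grasp with a concrete shape. The sketch below is an assumption-heavy illustration: `fetch_page` is a fake function standing in for a paginated REST call, `raw_events` is a hypothetical landing table, and SQLite stands in for PostgreSQL. It shows two ideas together: a high-water mark for incremental loads, and an upsert keyed on `id` so re-running the job never duplicates rows.

```python
import sqlite3

# Hypothetical stand-in for the records a paginated REST API would serve.
FAKE_API_DATA = [
    {"id": i, "updated_at": f"2024-01-0{i}", "value": i * 10} for i in range(1, 6)
]


def fetch_page(updated_after: str, page: int, page_size: int = 2) -> list:
    """Pretend HTTP call: records newer than `updated_after`, one page at a time."""
    matching = [r for r in FAKE_API_DATA if r["updated_at"] > updated_after]
    start = page * page_size
    return matching[start:start + page_size]


def make_connection() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_events (id INTEGER PRIMARY KEY, updated_at TEXT, value INTEGER)"
    )
    return conn


def extract_incremental(conn: sqlite3.Connection, page_size: int = 2) -> int:
    """Load only records newer than what is already stored; return how many were loaded."""
    # High-water mark: the newest timestamp we already hold ('' on the first run).
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM raw_events"
    ).fetchone()

    loaded, page = 0, 0
    while True:
        records = fetch_page(watermark, page, page_size)
        if not records:
            break  # an empty page means we have walked past the last one
        # INSERT OR REPLACE keyed on the primary key makes replays safe (idempotent).
        conn.executemany(
            "INSERT OR REPLACE INTO raw_events (id, updated_at, value) "
            "VALUES (:id, :updated_at, :value)",
            records,
        )
        loaded += len(records)
        page += 1
    conn.commit()
    return loaded


conn = make_connection()
print(extract_incremental(conn))  # first run loads everything available
print(extract_incremental(conn))  # second run finds nothing new
```

A real extractor would also need rate limiting and retry with backoff around the HTTP call; those are left out here to keep the watermark and upsert logic visible.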
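Airflow on 8GB RAM
Use Airflow standalone mode (not CeleryExecutor). Set AIRFLOW__CORE__PARALLELISM=4 and AIRFLOW__CORE__DAG_CONCURRENCY=2 (on Airflow 2.2+, the second setting is named AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG). That is enough for learning. Stop Wazuh/Suricata if they are running at the same time.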
Phase 2 Portfolio Project:
End-to-End Data Pipeline: Extract data from 2-3 public APIs (weather, crypto, news) → Load the raw data into PostgreSQL → Transform with dbt inside the warehouse (clean, join, aggregate) → Schedule via Airflow (daily run) → dbt tests for data quality. Deploy it in the homelab and show the DAG graph and data lineage.
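Before writing a real Airflow DAG, it can help to see what an orchestrator does at its core: run tasks in dependency order and retry failures. The sketch below is not Airflow code. It uses the standard library's `graphlib` (Python 3.9+), and the task names and the `run_pipeline` helper are hypothetical, chosen to mirror the project described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_weather": set(),
    "extract_crypto": set(),
    "load_raw": {"extract_weather", "extract_crypto"},
    "dbt_run": {"load_raw"},
    "dbt_test": {"dbt_run"},
}


def run_pipeline(tasks: dict, graph: dict, max_retries: int = 2) -> list:
    """Run callables in dependency order; retry each failing task up to max_retries times.

    Returns a log of (task_name, attempt_that_succeeded) tuples.
    """
    log = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt))
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: fail the run, downstream tasks never start


    return log


# Demo: every task is a no-op except load_raw, which fails once and then succeeds.
attempts = {"load_raw": 0}


def flaky_load():
    attempts["load_raw"] += 1
    if attempts["load_raw"] == 1:
        raise RuntimeError("simulated transient database error")


demo_tasks = {name: (lambda: None) for name in PIPELINE}
demo_tasks["load_raw"] = flaky_load

for task_name, attempt in run_pipeline(demo_tasks, PIPELINE):
    print(f"{task_name}: succeeded on attempt {attempt}")
```

Airflow adds scheduling, parallel execution, a UI, and persistence on top of this idea, but the dependency graph and retry semantics are the part interviewers tend to probe.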
Phase 3 - Data Warehouse & Big Data Processing (Weeks 19-28)
Goal: Move from a single database → a proper data warehouse with a modeling methodology. RAM Impact: Spark is heavy. Use Polars in the homelab (roughly 10x lighter).
| Skill/Tool | RAM | What You Learn | A+B Combo That Proves It |
|---|---|---|---|
| Dimensional Modeling | - | Star schema, snowflake schema, SCD Type 1/2, fact vs. dimension tables | Modeling + implemented warehouse = you can design data architecture |
| Polars / PySpark | ~500MB-2GB | DataFrame API, lazy evaluation, partitioning, join strategies | Polars + large dataset processing = you can process data that does not fit in pandas |
| Parquet / Delta Lake | - | Columnar storage, schema evolution, time travel, ACID on files | Parquet + partitioned data lake = you understand modern data storage |
| MinIO (S3-compatible) | ~500MB | Object storage: data lake layer, lifecycle policies | MinIO + partitioned storage = you have a data lake in your homelab |
| Great Expectations | ~300MB | Data quality: expectations, validation, profiling, docs | GX + automated testing = you can prove data quality to stakeholders |
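The idea behind "expectations" in the last row can be shown without the library. The sketch below is plain Python, not the Great Expectations API: the function names (`expect_not_null`, `expect_unique`, `expect_between`) and the result dictionaries are hypothetical, written only to show what a declarative data quality check computes.

```python
def expect_not_null(rows: list, column: str) -> dict:
    """Fail when any row has a missing or None value in `column`."""
    failed = sum(1 for r in rows if r.get(column) is None)
    return {"check": f"not_null({column})", "passed": failed == 0, "failed_rows": failed}


def expect_unique(rows: list, column: str) -> dict:
    """Fail when a value in `column` appears more than once."""
    seen, duplicates = set(), 0
    for r in rows:
        value = r.get(column)
        if value in seen:
            duplicates += 1
        seen.add(value)
    return {"check": f"unique({column})", "passed": duplicates == 0, "failed_rows": duplicates}


def expect_between(rows: list, column: str, low: float, high: float) -> dict:
    """Fail when a non-null value falls outside [low, high]; nulls are left to not_null."""
    failed = sum(
        1 for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )
    return {"check": f"between({column})", "passed": failed == 0, "failed_rows": failed}


# Hypothetical batch with one null amount, one duplicate id, one negative amount.
orders = [
    {"order_id": 1, "amount": 50.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -5.0},
]

report = [
    expect_not_null(orders, "amount"),
    expect_unique(orders, "order_id"),
    expect_between(orders, "amount", 0, 1000),
]
for line in report:
    print(line)
```

Tools like Great Expectations and dbt tests wrap the same kind of checks in configuration, reporting, and documentation, which is what makes the results presentable to stakeholders.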
Polars vs PySpark for the Homelab
Use Polars. PySpark needs a JVM plus the Spark runtime, and even in local mode expect it to take around 3GB. Polars runs as a native library inside Python, can be 10-100x faster than pandas on analytical workloads, and has a lazy API whose concepts are similar to Spark's. If an interviewer asks about Spark, say, "I use Polars locally; the concepts of lazy evaluation and partitioning are the same, so I can switch to Spark on a cluster." Recruiters understand that.
Phase 3 Portfolio Project:
Data Warehouse with Dimensional Modeling: design a star schema for a business domain (e-commerce or sales), implement SCD Type 2 for the dimension tables, build an ELT pipeline (extract → raw layer → staging → marts), and run data quality checks via Great Expectations. Show the ERD + data lineage diagram.
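SCD Type 2 is the part of this project most likely to come up in an interview, so here is a minimal sketch of the logic in plain Python. It assumes a common but not universal layout: each dimension row carries `valid_from`, `valid_to`, and `is_current` columns. Those column names, the `apply_scd2` function, and the customer data are all hypothetical. In a real warehouse the same logic would typically live in SQL (for example a MERGE statement or a dbt snapshot).

```python
from datetime import date


def apply_scd2(dimension: list, incoming: list, key: str, tracked: list, today: date) -> list:
    """Apply SCD Type 2 changes to `dimension` in place and return it.

    dimension: existing rows, each with valid_from, valid_to, is_current.
    incoming:  the latest source state, one dict per business key.
    tracked:   columns whose changes should create a new version.
    """
    current = {row[key]: row for row in dimension if row["is_current"]}

    for record in incoming:
        existing = current.get(record[key])
        if existing is not None and all(existing[c] == record[c] for c in tracked):
            continue  # nothing changed: keep the current version as is

        if existing is not None:
            # Close the old version instead of overwriting it (that would be Type 1).
            existing["valid_to"] = today
            existing["is_current"] = False

        new_version = {key: record[key]}
        new_version.update({c: record[c] for c in tracked})
        new_version.update({"valid_from": today, "valid_to": None, "is_current": True})
        dimension.append(new_version)
        current[record[key]] = new_version

    return dimension


dim_customer = [
    {"customer_id": 1, "city": "Jakarta",
     "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True},
]
source_snapshot = [
    {"customer_id": 1, "city": "Bandung"},   # existing customer moved
    {"customer_id": 2, "city": "Surabaya"},  # brand-new customer
]

apply_scd2(dim_customer, source_snapshot, key="customer_id", tracked=["city"], today=date(2024, 6, 1))
for row in dim_customer:
    print(row)
```

After the run, customer 1 has two rows: the closed Jakarta version and the current Bandung version. That history is what lets a fact table join to the attributes that were true at the time of each transaction.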
Phase 4 - Streaming, Quality & Cloud (Weeks 29-40)
Goal: Real-time data processing and a production-grade data platform. RAM Impact: Kafka is heavy (~2GB). Learn the concepts first, and run it only during dedicated study sessions.
| Skill/Tool | RAM | What You Learn | A+B Combo That Proves It |
|---|---|---|---|
| Apache Kafka | ~2GB | Event streaming, topics/partitions, consumer groups, exactly-once semantics | Kafka + streaming pipeline = you can handle real-time data |
| Flink / Kafka Streams | ~1GB | Stream processing, windowing, stateful computation | Stream processing + real-time dashboard = you understand more than batch |
| Data Contracts | - | Schema registry, backward/forward compatibility, breaking change detection | Contracts + enforcement = you can manage schema evolution |
| Cloud Provider | Cloud | BigQuery / Redshift / Snowflake: managed warehouses | Cloud + cost optimization = you can migrate on-prem to cloud |
| Metadata / Catalog | ~500MB | DataHub / Amundsen: data discovery, lineage, ownership | Catalog + searchable data = you can scale a data org |
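To make the "Data Contracts" row concrete, here is a simplified sketch of breaking change detection. It assumes the usual schema registry meaning of backward compatibility: consumers on the new schema must still be able to read data written with the old one. Under that assumption, removing a field or adding an optional one is safe, while adding a required field without a default or changing a type is breaking. The schema dictionary format and the `backward_breaking_changes` function are hypothetical and far simpler than real Avro or Protobuf rules.

```python
def backward_breaking_changes(old: dict, new: dict) -> list:
    """List the changes that stop a new-schema reader from reading old-schema data.

    Schemas are dicts of field name -> {"type": str, "required": bool, optional "default"}.
    """
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a way to fill it in.
            if spec.get("required", False) and "default" not in spec:
                problems.append(f"added required field without default: {name}")
        elif old[name]["type"] != spec["type"]:
            problems.append(
                f"type changed for {name}: {old[name]['type']} -> {spec['type']}"
            )
    # Fields removed from `new` are fine here: the new reader simply ignores them.
    return problems


old_schema = {
    "id": {"type": "int", "required": True},
    "email": {"type": "string", "required": True},
}

safe_schema = {
    "id": {"type": "int", "required": True},
    "phone": {"type": "string", "required": False},  # optional addition, email removed
}

breaking_schema = {
    "id": {"type": "string", "required": True},       # type change
    "email": {"type": "string", "required": True},
    "country": {"type": "string", "required": True},  # required, no default
}

print(backward_breaking_changes(old_schema, safe_schema))
print(backward_breaking_changes(old_schema, breaking_schema))
```

Running a check like this in CI, before a producer deploys a schema change, is one way to enforce a contract instead of discovering the break in the warehouse.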
Kafka on 8GB RAM
Kafka needs ~2GB (broker + ZooKeeper; newer releases replace ZooKeeper with KRaft). Shut down ALL other services. Alternatively, use Redpanda: Kafka-compatible, a single binary, and workable in about 500MB. It speaks the same Kafka API and is generally lighter to run on a single node.
Phase 4 Portfolio Project:
Real-Time Data Platform: Kafka (or Redpanda) → stream processing → real-time dashboard (Grafana/Metabase) + batch processing → warehouse → dbt marts. A dual pipeline (lambda architecture). This is the project that makes a data team lead say, "you get it."
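The stream processing step usually comes down to windowed aggregation. The sketch below shows the arithmetic of a tumbling window in plain Python, with no Kafka or Flink involved: the event tuples and the `tumbling_window_counts` function are hypothetical, and the code processes a finished list, so it ignores what makes real streaming hard (late events, watermarks, state that must survive restarts).

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds: int) -> dict:
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    events: iterable of (timestamp_in_seconds, key) tuples.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window: round down to the window start.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical clickstream: (seconds since start, event type).
clickstream = [(0, "click"), (30, "click"), (59, "view"), (60, "click"), (125, "click")]

per_minute = tumbling_window_counts(clickstream, window_seconds=60)
for (window_start, key), count in sorted(per_minute.items()):
    print(f"window starting at {window_start}s, {key}: {count}")
```

In the dual pipeline described above, a result like `per_minute` is what the real-time dashboard would read, while the batch path recomputes the same metric later from the warehouse as the source of truth.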
Visual Roadmap: 10-Month Timeline
| | Months 1-2 | Months 3-5 | Months 6-7 | Months 8-10 |
|---|---|---|---|---|
| Phase | PHASE 1 | PHASE 2 | PHASE 3 | PHASE 4 |
| Skills | SQL Advanced, Python Data, Linux/Shell, Docker, Git | Airflow, dbt, PostgreSQL WH, API Extract, File Formats | Dim Modeling, Polars/Spark, Parquet/Delta, MinIO (Lake), Data Quality | Kafka, Streaming, Data Contracts, Cloud WH, Catalog |
| Project | SQL Challenge | E2E Pipeline | Data WH Star Schema | Real-Time Platform |
Recommended Certifications per Phase
| Phase | Certification | Why |
|---|---|---|
| After Phase 1 | Google Data Analytics Certificate (Coursera; subscription-based, financial aid available) | Foundation in analytical thinking |
| After Phase 2 | dbt Analytics Engineering Certification (exam fee applies; dbt's own learning courses are free) | Industry-standard transformation tool |
| After Phase 3 | Google Professional Data Engineer | Comprehensive: design + build + optimize |
| After Phase 4 | Databricks Data Engineer Associate/Professional | Modern lakehouse standard |
Stack Comparison: Which One to Choose?
| Component | Homelab Choice (Free) | Cloud Choice (Production) | Notes |
|---|---|---|---|
| Orchestrator | Airflow (Docker) | Airflow / Dagster / Prefect | Airflow = industry standard, but Dagster is more modern |
| Transformation | dbt Core (CLI) | dbt Cloud | dbt Core is free and covers the full transformation workflow |
| Warehouse | PostgreSQL / DuckDB | BigQuery / Snowflake / Redshift | PostgreSQL is enough to learn the concepts |
| Storage | MinIO (S3-compatible) | AWS S3 / GCS | MinIO exposes an S3-compatible API |
| Processing | Polars (local) | PySpark (cluster) | Same concepts, different scale |
| Streaming | Redpanda (single node) | Kafka (managed) | Redpanda = Kafka API, fewer resources |
| Quality | Great Expectations | Monte Carlo / Soda | GX is free and powerful |
See Also
- Master Index
- System Design: Database Internals & Architecture patterns
- Matematika & Algoritma: Linear Algebra for ML pipelines
- Computer Science Foundations: OS Internals that underpin distributed systems
- DevOps Roadmap: overlapping CI/CD and container orchestration
- Roadmap_Software_Engineering: Backend skills = the foundation of Data Engineering
Data Engineering Roadmap | Phase 1 (SQL/Python) → Phase 4 (Streaming) · 10 Months