Clean Water & Sanitation Big Data Analysis (Hadoop & MapReduce)
Overview
This project applies Big Data processing techniques to analyze global clean water access trends aligned with UN Sustainable Development Goal 6 (Clean Water & Sanitation).
I used Hadoop Streaming and MapReduce to process large-scale water access data and compute regional and temporal trends, then visualized and interpreted the results in Python.
Problem & Goal
The goal was to analyze large water-access datasets efficiently and answer questions such as:
- How access to basic drinking water varies across regions
- How water access evolves over time by region
- How distributed Big Data tools scale compared with traditional single-machine processing
What I Built
- Custom Mapper and Reducer scripts to process large CSV datasets using Hadoop Streaming
- MapReduce logic to compute average basic water access by region and year
- A Python notebook to visualize and interpret aggregated Big Data outputs
- A full analytical report connecting technical results to real-world sustainability outcomes
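On the notebook side, the aggregated output has to be turned into per-region time series before it can be plotted. The sketch below shows one way that step might look. It assumes the reducer writes lines of the form `region|year<TAB>average`; that format and the helper name `load_series` are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict


def load_series(lines):
    """Group 'region|year<TAB>average' lines into per-region time series."""
    series = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the job output
        key, avg = line.split("\t", 1)
        region, year = key.split("|", 1)
        series[region].append((int(year), float(avg)))
    # Sort each region's points by year so they plot left to right.
    return {region: sorted(points) for region, points in series.items()}


if __name__ == "__main__":
    # Hypothetical sample lines standing in for real job output.
    sample = ["Asia|2016\t89.5", "Asia|2015\t88.0", "Africa|2015\t65.0"]
    for region, points in load_series(sample).items():
        print(region, points)
```

Each region's list of `(year, average)` pairs can then be unpacked into x and y values for a line chart in Matplotlib or loaded into a pandas DataFrame.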
Big Data Pipeline
- Input: Raw global water access CSV data
- Mapper: Extracts region, year, and water access values
- Reducer: Aggregates values and computes averages per region-year
- Output: Clean, summarized datasets for visualization and analysis
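The stages above can be sketched as a pair of Python generators in the style of Hadoop Streaming scripts. This is a minimal sketch, not the project's actual mapper and reducer: the CSV column order (region, year, access percentage), the `region|year` key format, and the sample rows are all assumptions made for illustration.

```python
import csv
from itertools import groupby


def mapper(lines):
    """Emit 'region|year<TAB>value' for each valid CSV row."""
    # Assumed column order: region, year, basic water access (%).
    for row in csv.reader(lines):
        if len(row) < 3:
            continue
        region, year, value = row[0].strip(), row[1].strip(), row[2].strip()
        try:
            access = float(value)
        except ValueError:
            continue  # skips the header row and non-numeric or missing values
        yield f"{region}|{year}\t{access}"


def reducer(sorted_lines):
    """Average the values for each key; input must be grouped by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values):.2f}"


if __name__ == "__main__":
    # Local dry run: sorted() stands in for Hadoop's shuffle-and-sort phase.
    sample = [
        "region,year,basic_water_pct",  # hypothetical header
        "Africa,2015,60.0",
        "Africa,2015,70.0",
        "Asia,2015,88.0",
        "Africa,2016,66.5",
    ]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

In an actual Hadoop Streaming job, the mapper and reducer would typically live in separate scripts that read from `sys.stdin` and print to standard output, with Hadoop sorting the mapper output by key before it reaches the reducer. The reducer relies on that ordering: `groupby` only merges adjacent lines that share a key.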
Tech Stack
- Big Data: Hadoop, MapReduce, Hadoop Streaming
- Programming: Python
- Data Processing: CSV streaming, key-value aggregation
- Visualization: Matplotlib / Pandas
- Domain: Sustainable Development Goals (SDG 6)
Key Takeaways
- Hands-on experience with distributed data processing
- Learned how to design MapReduce jobs for aggregation problems
- Understood the tradeoffs between Big Data systems and local analysis
- Connected technical outputs to real-world global development challenges
Files
📄 Final Report: report.pdf
🧠 MapReduce – Regional Analysis:
🧠 MapReduce – Yearly Analysis:
📊 Analysis & Visuals: