Aggregation Framework

MongoDB Aggregation Framework: $match, $group, $project, and Pipeline Stages

1. Problem Statement

Case Study: MovieFlix Analytics

MovieFlix, a global movie streaming platform, wants to:

-   Find the most popular genres and top-rated movies in each country.
-   Calculate average watch times, total views per genre, and revenue by region.
-   Generate business dashboards and user recommendations, all from millions of viewing records.

The challenge:
How can MovieFlix efficiently analyze, summarize, and transform huge volumes of streaming data—without exporting it to another system or writing complex, slow code?

2. Learning Objectives

By the end of this tutorial, you will:

-   Understand what the MongoDB Aggregation Framework is and why it’s powerful.
-   Build multi-stage aggregation pipelines using `$match`, `$group`, `$project`, and more.
-   Use pipeline stages to filter, group, reshape, and analyze data.
-   Apply best practices for performance and maintainability.

3. Concept Introduction with Analogy

Analogy: The Data Conveyor Belt

Imagine MovieFlix’s analytics as a high-tech conveyor belt in a mailroom:

-   Each letter (document) passes through a series of stations (pipeline stages).
    
-   Some stations filter out junk mail (`$match`).
    
-   Others sort mail into bins by city or sender (`$group`).
    
-   Some reformat addresses or add labels (`$project`).

At the end, you have exactly the summary or report you need.

The aggregation pipeline is this conveyor belt—each stage transforms or filters the data, step by step, until you get your answer.
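
To make the analogy concrete, here is a minimal sketch in plain JavaScript that mimics three stations on a small in-memory array. It only illustrates the idea of chained stages; it is not how MongoDB executes a pipeline, and the `records` array and the station functions are made up for this example.

```javascript
// Hypothetical in-memory records standing in for a collection.
const records = [
  { genre: "Sci-Fi", country: "USA", views: 100 },
  { genre: "Drama", country: "USA", views: 40 },
  { genre: "Sci-Fi", country: "India", views: 70 },
  { genre: "Sci-Fi", country: "USA", views: 60 },
];

// Station 1 (plays the role of $match): keep only USA records.
const matchUsa = (docs) => docs.filter((d) => d.country === "USA");

// Station 2 (plays the role of $group): sum views per genre.
const groupByGenre = (docs) => {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d.genre, (totals.get(d.genre) || 0) + d.views);
  }
  return [...totals].map(([genre, totalViews]) => ({ _id: genre, totalViews }));
};

// Station 3 (plays the role of $project): rename _id to genre.
const projectGenre = (docs) =>
  docs.map((d) => ({ genre: d._id, totalViews: d.totalViews }));

// The belt: each station's output is the next station's input.
const stations = [matchUsa, groupByGenre, projectGenre];
const report = stations.reduce((docs, station) => station(docs), records);

console.log(report);
```

With this sample data, `report` holds two entries: Sci-Fi with 160 total views and Drama with 40.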

4. Technical Deep Dive

A. What is the Aggregation Framework?

The Aggregation Framework is MongoDB’s built-in tool for advanced data analysis and transformation.
It uses a pipeline: an array of stages, each performing a specific operation on the documents.
Each stage’s output becomes the next stage’s input—like a series of assembly line steps.

Pipeline Syntax:

db.collection.aggregate([
  { $match: { ... } },
  { $group: { ... } },
  { $project: { ... } },
  // ...more stages
])

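A pipeline can contain any number of stages, and most stages can appear more than once. A few stages have placement rules (for example, `$out` and `$merge` must come last), and the order you choose affects both the result and the performance.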
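
Stage order changes the result, not just the speed. As a small sketch, the two pipelines below contain the same stages in opposite order; the variable names are made up, and the `views` field follows the tutorial's sample data.

```javascript
// Sort the whole input by views, then keep the three highest.
const top3ByViews = [
  { $sort: { views: -1 } },
  { $limit: 3 },
];

// Same stages, reversed: keep whichever three documents arrive first,
// then sort only those three. This is usually NOT the top 3 overall.
const firstThreeSorted = [
  { $limit: 3 },
  { $sort: { views: -1 } },
];

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate(top3ByViews)
// db.watchHistory.aggregate(firstThreeSorted)
```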

B. Key Pipeline Stages

1. $match: Filtering Documents

Works like a query filter: only documents that satisfy the conditions pass to the next stage.
Place it as early as possible so later stages process fewer documents.

Example:

{ $match: { country: "USA", rating: { $gte: 8 } } }

Filters for US movies with rating 8 or higher.
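
Because `$match` accepts ordinary query operators, conditions can be combined in one stage. The sketch below is illustrative only: the second country value and the rating range are invented, not taken from MovieFlix's data.

```javascript
// A $match stage that combines three kinds of condition.
const matchStage = {
  $match: {
    country: { $in: ["USA", "Canada"] }, // any of the listed values
    rating: { $gte: 7, $lt: 9 },         // range: 7 <= rating < 9
    year: 2024,                          // exact match
  },
};

const pipeline = [matchStage];

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate(pipeline)
```

All conditions in a single `$match` must hold at once, so this behaves like a logical AND.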

2. $group: Aggregating Data

Groups documents by a field (or expression) and computes aggregates (sum, avg, count, etc.).
The _id field specifies the group key.

Example:

{ $group: {
    _id: "$genre",
    totalViews: { $sum: "$views" },
    avgRating: { $avg: "$rating" }
  }
}

Groups by genre, calculates total views and average rating per genre.
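
The group key does not have to be a single field. As a sketch of the "per country" questions in the case study, `_id` can be a sub-document, and `_id: null` puts every input document into one group. The variable names and the `records` output field are chosen for this example.

```javascript
// One output document per (country, genre) pair.
const perCountryGenre = [
  {
    $group: {
      _id: { country: "$country", genre: "$genre" }, // compound group key
      totalViews: { $sum: "$views" },
      avgRating: { $avg: "$rating" },
      records: { $sum: 1 }, // adds 1 for each document in the group
    },
  },
];

// A single output document summarizing the whole input.
const overall = [
  { $group: { _id: null, totalViews: { $sum: "$views" } } },
];

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate(perCountryGenre)
```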

3. $project: Shaping Output

Reshapes each document: it can include or exclude fields, rename them, and compute new fields from expressions.

Example:

{ $project: {
    _id: 0,
    genre: "$_id",
    totalViews: 1,
    avgRating: { $round: ["$avgRating", 2] }
  }
}

Outputs only genre, totalViews, and avgRating rounded to two decimal places. Note that `$round` requires MongoDB 4.2 or later.
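
`$project` can also run on raw documents, before any grouping. The following sketch computes two new fields from the sample document's `movie` and `views` fields; the output names `title` and `viewsInThousands` are invented for illustration.

```javascript
const shape = [
  {
    $project: {
      _id: 0,                                          // drop the default _id
      movie: 1,                                        // include unchanged
      title: { $toUpper: "$movie" },                   // computed: upper-cased copy
      viewsInThousands: { $divide: ["$views", 1000] }, // computed: 15000 becomes 15
    },
  },
];

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate(shape)
```

Apart from `_id: 0`, a single `$project` cannot mix inclusions (`field: 1`) with exclusions (`field: 0`).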

4. Other Useful Stages

-   `$sort`: Orders documents (e.g., by totalViews descending).
    
-   `$limit`: Restricts the number of output documents.
    
-   `$unwind`: Deconstructs arrays into separate documents.
    
-   `$addFields`: Adds computed fields.
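
Here is a rough sketch of these four stages working together. It assumes a hypothetical array field `genres` (the sample documents in this tutorial store a single `genre` string instead), and the 10,000-view threshold is arbitrary.

```javascript
const perGenreFromArray = [
  // Assumed document shape: { movie: "...", genres: ["Sci-Fi", "Action"], views: 15000 }
  { $unwind: "$genres" }, // one output document per array element
  { $group: { _id: "$genres", totalViews: { $sum: "$views" } } },
  // $addFields keeps the existing fields and adds a boolean flag.
  { $addFields: { isPopular: { $gte: ["$totalViews", 10000] } } },
  { $sort: { totalViews: -1 } }, // highest totals first
  { $limit: 10 },                // keep at most ten genres
];

// In mongosh (collection name is a placeholder):
// db.movies.aggregate(perGenreFromArray)
```

Note that after `$unwind`, a movie with two genres contributes its views to both groups.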

5. How the Pipeline Works

Documents flow through each stage in order.
Each stage can filter, group, sort, or reshape data.
The output of the last stage is your final result.

C. Aggregation Pipeline Example

Suppose MovieFlix wants to find the top 3 genres by total views in 2024:

db.watchHistory.aggregate([
  // 1. Only 2024 records
  { $match: { year: 2024 } },
  // 2. Group by genre, sum views
  { $group: { _id: "$genre", totalViews: { $sum: "$views" } } },
  // 3. Sort by totalViews descending
  { $sort: { totalViews: -1 } },
  // 4. Limit to top 3
  { $limit: 3 },
  // 5. Project clean output
  { $project: { _id: 0, genre: "$_id", totalViews: 1 } }
])

Explanation:

-   `$match`: Filters for 2024.
    
-   `$group`: Sums up views per genre.
    
-   `$sort`  and  `$limit`: Gets top 3 genres.
    
-   `$project`: Formats output for reporting.

D. Best Practices for Aggregation Pipelines

-   **Place `$match` early** to reduce data volume for later stages.
-   Index the fields used in `$match`. In general, a `$match` can use an index only when it runs at the start of the pipeline.
-   Keep documents small: drop fields you do not need with `$project`.
-   Test each stage separately to debug and optimize.
-   Limit pipeline complexity for maintainability.
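
One way to test stages separately is to run growing prefixes of the pipeline and look at a few documents from each. The helper `pipelinePrefixes` below is a hypothetical convenience written for this tutorial, not part of MongoDB; the mongosh calls are shown as comments so the snippet also runs in plain Node.js.

```javascript
// Returns every leading slice of a pipeline: [s1], [s1, s2], [s1, s2, s3], ...
function pipelinePrefixes(pipeline) {
  return pipeline.map((_, i) => pipeline.slice(0, i + 1));
}

const pipeline = [
  { $match: { year: 2024 } },
  { $group: { _id: "$genre", totalViews: { $sum: "$views" } } },
  { $sort: { totalViews: -1 } },
];

for (const prefix of pipelinePrefixes(pipeline)) {
  const lastStage = Object.keys(prefix[prefix.length - 1])[0];
  console.log(`${prefix.length} stage(s), ending in ${lastStage}`);
  // In mongosh, inspect a small sample of this prefix's output:
  // printjson(db.watchHistory.aggregate([...prefix, { $limit: 5 }]).toArray());
}

// To check whether the leading $match can use an index (mongosh):
// db.watchHistory.createIndex({ year: 1 });
// db.watchHistory.explain("executionStats").aggregate(pipeline);
```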

5. Step-by-Step Data Modeling & Code Walkthrough

Let’s build a real MovieFlix aggregation pipeline.

A. Sample Document Structure

{
  "_id": ObjectId("..."),
  "movie": "Edge of Tomorrow",
  "genre": "Sci-Fi",
  "country": "USA",
  "views": 15000,
  "rating": 8.2,
  "year": 2024
}

B. Find the Average Rating and Total Views per Genre in the USA

db.watchHistory.aggregate([
  { $match: { country: "USA" } },
  { $group: {
      _id: "$genre",
      totalViews: { $sum: "$views" },
      avgRating: { $avg: "$rating" }
    }
  },
  { $project: {
      _id: 0,
      genre: "$_id",
      totalViews: 1,
      avgRating: { $round: ["$avgRating", 2] }
    }
  }
])

Explanation:

-   Filters for USA records.
-   Groups by genre.
-   Sums views and averages ratings.
-   Projects a clean, rounded output.

C. Find Top 5 Movies by Views in 2024

db.watchHistory.aggregate([
  { $match: { year: 2024 } },
  { $sort: { views: -1 } },
  { $limit: 5 },
  { $project: { _id: 0, movie: 1, views: 1 } }
])

D. Count Movies per Genre with $group and $project

db.watchHistory.aggregate([
  { $group: { _id: "$genre", count: { $sum: 1 } } },
  { $project: { _id: 0, genre: "$_id", count: 1 } }
])

Note: `{ $sum: 1 }` adds one per document, so `count` is the number of `watchHistory` records in each genre. It equals the number of movies only if every movie appears in exactly one document.

6. Interactive Challenge / Mini-Project

Your Turn!

Write an aggregation pipeline to find the average rating for each genre in 2024, but only include genres with more than 10,000 total views.
Output should show genre, average rating (rounded to 1 decimal), and total views.
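
If you want to check your work after trying it yourself, here is one possible approach, offered as a sketch rather than the only correct answer. The key idea is a second `$match` placed after `$group`, because `totalViews` does not exist until the grouping has run. The variable name is arbitrary.

```javascript
const challengePipeline = [
  { $match: { year: 2024 } }, // filter raw records first
  {
    $group: {
      _id: "$genre",
      avgRating: { $avg: "$rating" },
      totalViews: { $sum: "$views" },
    },
  },
  { $match: { totalViews: { $gt: 10000 } } }, // filter on the computed total
  {
    $project: {
      _id: 0,
      genre: "$_id",
      avgRating: { $round: ["$avgRating", 1] }, // one decimal place
      totalViews: 1,
    },
  },
];

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate(challengePipeline)
```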

7. Common Pitfalls & Best Practices

| Pitfall | Best Practice |
| --- | --- |
| Placing `$match` late | Put `$match` early to reduce data volume |
| Returning too many fields | Use `$project` to limit output |
| Not using indexes | Index fields used in `$match` for speed |
| Complex pipelines in one go | Build and test one stage at a time |

8. Optional: Programmer’s Workflow Checklist

-   Define your output and work backwards to design stages.
-   Place `$match` as early as possible.
-   Use `$group` for aggregation, `$project` for shaping output.
-   Test each stage’s output before adding the next.
-   Use `$sort` and `$limit` for ranking and pagination.
-   Optimize with indexes and by limiting unnecessary fields.
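
For the pagination item, a common pattern adds a `$skip` stage between `$sort` and `$limit`. `$skip` is not covered earlier in this tutorial, and the helper name `pageStages` is made up; treat this as a minimal sketch under those assumptions.

```javascript
// Builds the stages for one page of results, ranked by views.
// page is 1-based; pageSize is the number of documents per page.
function pageStages(page, pageSize) {
  return [
    { $sort: { views: -1, _id: 1 } }, // _id as tiebreaker keeps the order stable
    { $skip: (page - 1) * pageSize }, // drop the documents of earlier pages
    { $limit: pageSize },             // keep one page
  ];
}

const thirdPage = pageStages(3, 10); // skips 20 documents, returns up to 10

// In mongosh (collection name assumed from this tutorial):
// db.watchHistory.aggregate([{ $match: { year: 2024 } }, ...thirdPage])
```

Sorting on a unique tiebreaker such as `_id` matters here: without it, documents with equal `views` could appear on two pages or on none.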
