A single SQL query can turn millions of messy rows into one clear sentence. A store manager sees “yesterday’s total sales,” not chaos. Today, we’ll zoom in on how COUNT, SUM, and AVG secretly do that work—and why one tiny mistake can quietly turn your results upside down.
Most dashboards you’ve seen—“top 10 products,” “revenue by region,” “average response time today”—are powered by the same quiet duo: aggregate functions plus GROUP BY. This is where SQL stops thinking row-by-row and starts thinking in crowds. GROUP BY decides which rows belong together; the aggregates summarize each crowd into a single, useful fact.
This is also where subtle mistakes become expensive. One misplaced column in GROUP BY, and “sales per day” silently turns into “sales per store per day per cashier.” Put a filter in WHERE instead of HAVING, and you might exclude rows before they’re counted, changing totals without any error message.
In this episode, we’ll zoom in on how GROUP BY shapes your results, how it interacts with COUNT, SUM, and AVG at scale, and what to watch for so your summaries actually match the story your data is trying to tell.
Some of the most “obvious” numbers in reports are quietly the easiest to get wrong. A churn dashboard might proudly show “average revenue per user,” while a finance team insists the number is off by 20%. Both can be using the same table and still disagree. The culprit is often small choices: which rows to exclude, where NULLs sneak in, or how granular the groups really are. In real systems—think ride‑sharing trips, food delivery orders, hospital visits—those details decide whether your summary guides decisions or quietly misleads them. This is where careful thinking starts to matter.
A sports analyst at a streaming company might look at one table of events and ask three very different questions:
- How many matches did we stream per league yesterday?
- What was the total watch‑time per league?
- On average, how long did people watch per match?
Each question sounds simple, but the way you write it in SQL quietly encodes your assumptions.
To count rows that meet a condition, you can either filter first:
```sql
SELECT league, COUNT(*) AS watched_matches
FROM streams
WHERE watched_seconds > 0
GROUP BY league;
```
or count only the subset inside the aggregate:
```sql
SELECT league,
       COUNT(CASE WHEN watched_seconds > 0 THEN 1 END) AS watched_matches
FROM streams
GROUP BY league;
```
The first version throws away every row that doesn’t match. The second keeps all rows in the grouping—but only tallies the ones you care about. On messy real‑world data, that difference can decide whether a league with no views yesterday shows up as “0” or disappears completely.
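If your database supports it (PostgreSQL does, and SQLite from version 3.30 on), the SQL-standard `FILTER` clause says the same thing as the `CASE` trick, but more directly:

```sql
-- FILTER is the standard spelling of "count only rows matching a condition"
-- while still keeping every group in the output.
SELECT league,
       COUNT(*) FILTER (WHERE watched_seconds > 0) AS watched_matches
FROM streams
GROUP BY league;
```

Behaviorally this matches the `CASE` version: a league whose rows all have zero watch‑time still appears, with a count of 0.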
Once you start mixing multiple aggregates, you’re really designing a tiny dataset per group. Consider an orders table:
```sql
SELECT customer_id,
       COUNT(*) AS orders,
       SUM(amount) AS revenue,
       AVG(amount) AS avg_order_value
FROM orders
GROUP BY customer_id;
```
Downstream tools might treat this as “one row per customer,” but that’s only correct if your grouping matches how the business thinks about customers. Add `store_id` to the `GROUP BY`, and suddenly “per customer” silently becomes “per customer per store,” even though the column names might not change.
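As a sketch of that trap, using the same hypothetical `orders` table: grouping by an extra column changes the grain even though the output columns look identical. Note that a grouped column does not have to appear in the SELECT list, which is exactly why the change can go unnoticed.

```sql
-- Looks like "per customer" downstream, but each customer now appears
-- once per store they ordered from: the grain silently changed.
SELECT customer_id,
       COUNT(*) AS orders,
       SUM(amount) AS revenue
FROM orders
GROUP BY customer_id, store_id;
```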
WHERE and HAVING deepen this effect. WHERE shapes which raw rows enter the crowd:
```sql
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
```
HAVING shapes which *groups* survive after summarizing:
```sql
HAVING SUM(amount) > 1000
```
Together, they let you say “show only customers with more than $1,000 in the last 30 days,” without ever touching the underlying table.
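Assembled into one query against the hypothetical `orders` table (column names assumed), that sentence reads almost literally:

```sql
-- WHERE trims raw rows first; GROUP BY forms one crowd per customer;
-- HAVING trims whole groups after the aggregates are computed.
SELECT customer_id,
       SUM(amount) AS revenue_30d
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```

The ordering matters: `WHERE` cannot see `SUM(amount)` because it runs before grouping, and `HAVING` cannot recover rows that `WHERE` already discarded.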
One subtle performance twist: databases often use hash aggregation for GROUP BY, building in‑memory buckets keyed by the grouped columns. That’s fast when groups fit in memory and keys are well chosen, but can spill to disk if you group on very detailed identifiers. Tuning isn’t only about adding indexes; sometimes it’s about grouping at the right level and pre‑aggregating in stages so your database isn’t asked to build millions of tiny groups it doesn’t really need.
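One way to aggregate in stages is a CTE that summarizes at a fine grain first, then rolls the much smaller intermediate result up. A sketch, assuming the same `orders` table; in production pipelines the first stage is often materialized as its own table rather than recomputed per query:

```sql
-- Stage 1: one row per (store, day) instead of one row per order.
WITH daily AS (
    SELECT store_id,
           order_date,
           SUM(amount) AS day_revenue
    FROM orders
    GROUP BY store_id, order_date
)
-- Stage 2: the roll-up now scans days, not individual orders.
SELECT store_id,
       AVG(day_revenue) AS avg_daily_revenue
FROM daily
GROUP BY store_id;
```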
A hospital analytics team runs into a puzzle: one report says “20 patients treated today,” another says “18.” Same system, same day. The difference? One includes people who checked in but left before seeing a doctor; the other doesn’t. Both reports use the same columns, but one filters earlier in the pipeline, the other later. Neither is “wrong”—they’re answering slightly different questions.
This is where being explicit about the *story* behind each number matters. Is “patients treated” everyone who walked through the door, or only those who received a prescription? When you add conditions directly inside an aggregate versus in a separate filter, you’re silently choosing which version of reality to measure.
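One defensive habit is to compute both definitions in the same query, so the gap between them is visible instead of hidden across two reports. A sketch, with hypothetical `visits` columns:

```sql
-- Two "realities" side by side: everyone who checked in,
-- and only those actually seen by a doctor.
SELECT visit_date,
       COUNT(*) AS checked_in,
       COUNT(CASE WHEN seen_by_doctor THEN 1 END) AS treated
FROM visits
GROUP BY visit_date;
```

When the two columns diverge, the question "which one is 'patients treated'?" becomes a conversation rather than a silent disagreement between dashboards.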
Here’s a practical twist: sometimes you need multiple “realities” side by side. A customer‑support team might want, in one row, total tickets, tickets resolved within 1 hour, and tickets resolved within 24 hours. That’s three different definitions of “success,” all grouped the same way, and each one driven by a slightly different condition inside its own aggregate.
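In SQL, that single row per team might look like this (table and column names are illustrative):

```sql
-- Three definitions of "success", all sharing one GROUP BY,
-- each driven by its own condition inside the aggregate.
SELECT team,
       COUNT(*) AS total_tickets,
       COUNT(CASE WHEN resolve_minutes <= 60 THEN 1 END) AS resolved_1h,
       COUNT(CASE WHEN resolve_minutes <= 1440 THEN 1 END) AS resolved_24h
FROM tickets
GROUP BY team;
```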
Anomalies in aggregates become early warning signs. A sudden dip in yesterday’s AVG watch‑time might hint at app crashes, not bored users. Like a doctor tracking vitals, you start to recognize “normal noise” versus a real fever. As event streams grow, you’ll layer in time windows, segments, and device types, then compare today’s patterns to historical baselines. The real skill isn’t writing the formula once, but evolving it as behavior, products, and questions keep changing.
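A first pass at that comparison can live in one query: put yesterday and a trailing baseline side by side per league. This is a sketch with assumed column names (`stream_date`, `watched_seconds`); date arithmetic syntax varies by database.

```sql
-- Yesterday's average watch-time next to the prior four weeks,
-- so a sudden dip stands out against "normal noise".
SELECT league,
       AVG(CASE WHEN stream_date = CURRENT_DATE - 1
                THEN watched_seconds END) AS yesterday_avg,
       AVG(CASE WHEN stream_date < CURRENT_DATE - 1
                THEN watched_seconds END) AS baseline_avg
FROM streams
WHERE stream_date >= CURRENT_DATE - 29
GROUP BY league;
```

Because `AVG` ignores NULLs, each conditional average only considers the rows its `CASE` lets through.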
Your challenge this week: take one real dataset—orders, tickets, or logs—and create two summaries that *disagree* on purpose. Change one assumption each time: time window, threshold, or segment. Then ask: which view better matches the story a non‑technical teammate tells? Let those tensions guide the next question you ask the data.

