Sunday, January 5, 2025

Why do we use the average (mean) of numbers instead of something else?

We all learn that the average of numbers is just their sum divided by their count, but where does that metric come from and why do we use it?
Here's a simple thought experiment to make it easy to understand:
  • People live in houses along a big road.
  • I want to open a store to sell things to them.
  • I need to decide where to build that store.
  • I will put the store in the location that minimizes the total distance from all the houses to the store.
Where should the store go?

Mathematically, with each x_i being the location of a house and m being the location we pick for the store, we want to minimize the total distance f, so we set its derivative with respect to m to zero:
\[\begin{aligned}
f&=\sum_{i=1}^n|x_{i} - m|\\
\frac{df}{dm}&=\frac{d}{dm}\sum_{i=1}^n|x_{i} - m|=0\\
|x_{i} - m| &=\begin{cases}x_{i}-m & x_{i} \geq m\\m-x_{i} & x_{i} < m\end{cases}\\
\frac{d}{dm}|x_{i} - m| &=\begin{cases}-1 & x_{i} > m\\1 & x_{i} < m\end{cases}\\
0&=\sum_{x_{i} < m}(1) + \sum_{x_{i} > m}(-1)\\
\sum_{x_{i} > m}(1)&=\sum_{x_{i} < m}(1)
\end{aligned}\]
(The absolute value has a kink at x_i = m, so its derivative is only defined when m is not sitting exactly on a house.)

That last equality translates into 'number of houses on one side of m = number of houses on the other side of m'. Surprisingly, that isn't the average. It's actually a definition of the median. It doesn't always pin down a single point, though. For example, imagine the houses are located at x = 0, 1, 9, and 10. Any m between 1 and 9 has two houses on each side and gives the same total distance (18), so any of those values is fine for the store's location. By convention and to make things symmetric, we pick the middle of those values (5 here). With an odd number of values, we pick the value with the same number of values on either side.
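To check the claim numerically, here is a minimal Python sketch using the houses at 0, 1, 9, and 10. The helper name `total_distance` and the grid of candidate locations (steps of 0.5) are illustrative choices, not part of the derivation.

```python
from statistics import median

def total_distance(houses, m):
    """Sum of absolute distances from every house to a store at m."""
    return sum(abs(x - m) for x in houses)

houses = [0, 1, 9, 10]

# Candidate store locations 0, 0.5, 1, ..., 10 (an arbitrary grid for illustration).
candidates = [i / 2 for i in range(21)]
costs = {m: total_distance(houses, m) for m in candidates}

best_cost = min(costs.values())
best_locations = [m for m, cost in costs.items() if cost == best_cost]

print(best_cost)                              # 18.0
print(best_locations[0], best_locations[-1])  # 1.0 9.0
print(median(houses))                         # 5.0
```

Every candidate from 1 to 9 ties for the lowest total distance, and the conventional median (5) is just the midpoint of that range.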

If that's not the average though, what is? Why not just always use the median?

Imagine we have houses at x = 1, 3, and 20. Using the median above, we'd build our store at 3. That's great for people living in houses x = 1 and x = 3 but awful for the person living at x = 20. Is there a way to penalize huge distances?

What if we instead squared every distance from house to store and tried to minimize that total? That punishes huge distances more since squaring a large number makes it even larger. With x as house locations and m as the target value again we get:
\[\begin{aligned}
f&=\sum_{i=1}^n(x_{i} - m)^2\\
\frac{df}{dm}&=\frac{d}{dm}\sum_{i=1}^n(x_{i} - m)^2=0\\
0&=-2\sum_{i=1}^n(x_{i} - m)\\
0&=\sum_{i=1}^n(x_{i} - m)\\
0&=\sum_{i=1}^n x_{i}-\sum_{i=1}^n m\\
0&=\sum_{i=1}^n x_{i}-nm\\
nm&=\sum_{i=1}^n x_{i}\\
m&=\frac{\sum_{i=1}^n x_{i}}{n}
\end{aligned}\]

That final equation is our definition of the average. Instead of minimizing the sum of the absolute errors between each point and our target, the average minimizes the sum of the squared errors between each point and our target. In our example above of houses at 1, 3, and 20, the store location would just be at x = (1 + 3 + 20) / 3 = 8. That's still a long trip for the house at x = 20 (a distance of 12), but not as bad as when we picked x = 3 for the store location (a distance of 17).
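As a quick sanity check, the sketch below compares the mean and the median for the houses at 1, 3, and 20. The function names and the grid of candidates (steps of 0.1 from 0 to 20) are assumptions made for illustration.

```python
from statistics import mean, median

def total_squared_distance(houses, m):
    """Sum of squared distances from every house to a store at m."""
    return sum((x - m) ** 2 for x in houses)

def farthest_trip(houses, m):
    """Distance from the store at m to the house farthest from it."""
    return max(abs(x - m) for x in houses)

houses = [1, 3, 20]
store_mean = mean(houses)      # 8
store_median = median(houses)  # 3

# Brute-force search over 0.0, 0.1, ..., 20.0 for the lowest squared total.
candidates = [i / 10 for i in range(201)]
best = min(candidates, key=lambda m: total_squared_distance(houses, m))

print(best)                                          # 8.0
print(total_squared_distance(houses, store_mean))    # 218
print(total_squared_distance(houses, store_median))  # 293
print(farthest_trip(houses, store_mean))             # 12
print(farthest_trip(houses, store_median))           # 17
```

The grid search lands on the mean, and the farthest house is 12 away instead of 17. The trade-off is that the plain total distance goes up (24 at the mean versus 19 at the median).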

We could punish larger errors even more by picking something like 4th power errors, but that doesn't yield an easy equation like the average does (setting the derivative to zero leaves a cubic equation in m), and the average has some other nice properties in statistics, so we go with it.
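To see what the 4th power version looks like in practice, here is a rough sketch that finds its minimizer numerically for the houses at 1, 3, and 20. It uses bisection on the sum of cubed errors, which is one simple approach and not the only one; the function names and the tolerance are hypothetical.

```python
def cubed_error_sum(houses, m):
    """Sum of (x - m)^3. The derivative of sum((x - m)^4) is -4 times this."""
    return sum((x - m) ** 3 for x in houses)

def fourth_power_minimizer(houses, tolerance=1e-9):
    """Bisection for the m where cubed_error_sum crosses zero."""
    lo, hi = min(houses), max(houses)
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if cubed_error_sum(houses, mid) > 0:
            lo = mid  # total is still decreasing here, so the minimum is to the right
        else:
            hi = mid  # total is flat or increasing here, so the minimum is to the left
    return (lo + hi) / 2

houses = [1, 3, 20]
m4 = fourth_power_minimizer(houses)
print(round(m4, 2))
```

The result sits between 9.8 and 10, so it is pulled even further toward the house at x = 20 than the average (8) is, and there is no one-line formula for it like there is for the average.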

Finally, imagine the houses are at x = 1, 2, 3, 4, 5, 6, 7, 7, 7, and 10. If you wanted to just build the store where the most houses are (x = 7), you would be picking the 'mode'. It minimizes the number of houses for which the answer to "is my store at the same location as this house?" is no.
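The same idea in a short Python sketch: count, for each candidate location, how many houses are not at that location, and pick the smallest count. The helper name `mismatches` is made up for this example.

```python
from statistics import mode

def mismatches(houses, m):
    """Number of houses that are not located exactly at m."""
    return sum(1 for x in houses if x != m)

houses = [1, 2, 3, 4, 5, 6, 7, 7, 7, 10]

# Only locations that actually have a house can beat "every house mismatches".
candidates = sorted(set(houses))
best = min(candidates, key=lambda m: mismatches(houses, m))

print(best)                      # 7
print(mismatches(houses, best))  # 7
print(mode(houses))              # 7
```

Building at x = 7 leaves 7 of the 10 houses at a different location, while any other spot leaves at least 9.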