Metrics and estimates

Jul 31, 2018 18:47 · 583 words · 3 minute read statistics

Mean

The most basic estimate of location. It is simply the average of a list of values, that is, the sum of the values divided by their count:

$$ \frac{\sum\limits_{i=1}^n x_i}{n} $$

Trimmed Mean

A small variation of the mean. First the values are sorted, then a fixed number of values is dropped from both the beginning and the end of the collection. The ordinary mean is then computed from the remaining values. It is very useful when we want to reduce the influence of extreme values in the data set.
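
As a sketch in the same notation as the mean, write the sorted values as $x_{(1)} \le \dots \le x_{(n)}$ and let $p$ be the number of values dropped from each end. Both symbols are notation assumed here for illustration. The trimmed mean can then be written as:

$$ \frac{\sum\limits_{i=p+1}^{n-p} x_{(i)}}{n-2p} $$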

Weighted Mean

It is calculated by multiplying every value by its weight and dividing the sum of those products by the sum of the weights:

$$ \frac{\sum\limits_{i=1}^n w_ix_i}{\sum\limits_{i=1}^n w_i} $$

It is very useful when some values should count more than others, for example when the data set contains several distinct groups and weighting them all equally would not reflect the real world.
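
The formula can be checked by hand on a tiny example. The sketch below uses invented prices and weights (they are not taken from any real data set), and `weighted_mean` is a hypothetical helper name:

```python
# Hypothetical nightly prices and weights, invented for illustration.
prices = [100.0, 50.0, 20.0]
weights = [3, 2, 1]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# (3 * 100 + 2 * 50 + 1 * 20) / (3 + 2 + 1) = 420 / 6
print(weighted_mean(prices, weights))  # 70.0
```

The heavily weighted value 100.0 pulls the result above the plain mean of the same three prices, which is 170 / 3.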

Median

The middle value of a sorted list of values. If the number of values is even, the median is the average of the two middle values.
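
A minimal sketch of both cases, assuming a plain Python list as input. The function name `median` is chosen here for illustration, and the result is compared against the standard library's `statistics.median`. The first three prices are the sample rows shown in this post; 40.0 is invented to get an even count:

```python
import statistics


def median(values):
    """Middle value of the sorted data; average of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2  # zero-based index of the (upper) middle element
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [110.0, 118.0, 72.0]
even = [110.0, 118.0, 72.0, 40.0]

print(median(odd))   # 110.0
print(median(even))  # sorted: 40, 72, 110, 118 -> (72 + 110) / 2 = 91.0

# Cross-check against the standard library
assert median(odd) == statistics.median(odd)
assert median(even) == statistics.median(even)
```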

Weighted Median

The value for which half of the total weight lies below it and half above. The data is sorted first; then, instead of picking the middle value, we pick the value at which the sum of the weights below it equals the sum of the weights above it.
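
One possible way to sketch this in Python is to walk through the sorted values while accumulating weights and stop once half of the total weight is reached. This is only one common convention (it returns the lower candidate when the split falls exactly between two values), it assumes non-negative weights, and `weighted_median` and the sample numbers are invented for illustration:

```python
def weighted_median(values, weights):
    """Smallest value at which the running weight total reaches half of all weight."""
    if len(values) != len(weights) or not values:
        raise ValueError("need equally long, non-empty values and weights")
    pairs = sorted(zip(values, weights))  # sort by value, keep each weight attached
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value


# Hypothetical prices; the last one carries most of the weight.
prices = [30.0, 40.0, 50.0, 200.0]

# Total weight 8, half is 4; the running totals are 1, 2, 3, 8.
print(weighted_median(prices, [1, 1, 1, 5]))  # 200.0

# Equal weights: total 4, half is 2; the running totals are 1, 2.
print(weighted_median(prices, [1, 1, 1, 1]))  # 40.0
```

With equal weights and an even number of values this sketch returns the lower of the two middle values (40.0) instead of their average (45.0), so it matches the ordinary median only up to that convention.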

Trimmed Mean

This value is computed by dropping a fixed number of values from the beginning and from the end of the sorted data set. The trimmed mean reduces the influence of extreme values.

Outlier

Extreme values in a data set. For example, any value that lies very far from the other values in the data set is an outlier.

Robust estimate

An estimate that is not strongly influenced by outliers. The median is a robust estimate because it depends only on the values in the middle of the sorted data, so extreme values hardly affect it.
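
A small illustration of that robustness, using the standard library's `statistics` module and a handful of invented prices (not real data), where 3000.0 plays the role of the outlier:

```python
import statistics

# Hypothetical prices, invented for illustration.
prices = [40.0, 42.0, 45.0, 50.0, 55.0]
with_outlier = prices + [3000.0]

# Without the outlier: mean 46.4, median 45.0
print(statistics.mean(prices), statistics.median(prices))

# With the outlier: the mean jumps to about 538.67, the median only moves to 47.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

A single extreme value multiplies the mean more than tenfold, while the median shifts by 2.5.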

Examples

For the examples I will use a publicly available Airbnb data set. I picked a data set that contains rooms available for rent in Warsaw and their cost (in US$) per night. This is what the data set looks like:

room_id	price
1	110.0
2	118.0
3	72.0

For the computations I will use python3, and to load the data I will use the pandas library, because a DataFrame makes it extremely easy to load and manipulate data.

Mean

    import pandas as pd

    # Load the listings; the price column holds the cost per night in US$
    data = pd.read_csv('./data/airbnb_warsaw_2017.csv')
    n = len(data)
    mean_price = sum(data.price) / n
    # 54.68495536686262

Trimmed Mean

    # Sort the prices, then drop 10% of the values from each end
    sorted_price = sorted(data.price)
    trim_size = int(0.1 * len(data))
    trimmed_price = sorted_price[trim_size:len(data) - trim_size]
    trimmed_mean = sum(trimmed_price) / len(trimmed_price)
    trimmed_mean
    # 45.97714285714286

Median

    size = len(sorted_price)
    # 4593
    # size is odd, so the median is the single middle element;
    # with zero-based indexing that is index size // 2
    median_index = size // 2
    sorted_price[median_index]
    # 42.0

Weighted Mean

This data set has three types of rooms available. Let's create a weight mapping for them:

    # A higher weight marks a more important room type
    mapping = {"Entire home/apt": 3, "Private room": 2, "Shared room": 1}
    data['weight'] = data.room_type.map(mapping)

This means that the most important type for us is Entire home/apt.

Now let's calculate the weighted mean of the price:

    weighted_mean = sum(data.weight * data.price) / sum(data.weight)
    # 57.14528152260111

Outlier

    sorted_price[size - 1]  # largest price in the data set
    # 3000
    sorted_price[0]  # smallest price in the data set
    # 3

Conclusion

We can clearly see that these values differ, although here the difference is not huge; it always depends on the data. The mean is the basic metric for location, but as the examples above show, it can be sensitive to extreme values. Other metrics such as the median or the trimmed mean are more robust.