As an analyst, you will often come across various data patterns. A variable is said to have an outlier when one of its values is quite extreme.
Now what does "quite extreme" mean? It means a value that lies far away from the mean of the variable and does not contribute to the overall understanding of the variable.
We can understand this further with the help of an example. Suppose the sample average weight of 100 working males aged 25-35 in Gurugram is 96 kg, and two of them weigh 180 kg and 49 kg. We would call these two values outliers: both are extreme and do not give a clear picture of our population.
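To make "far away from the mean" concrete, one common rule of thumb is to flag values that lie more than a chosen number of standard deviations from the mean. This rule, the threshold of 2, and the made-up weights below are all assumptions for illustration, not something the example above prescribes. A minimal sketch:

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Made-up sample: 20 typical weights (kg) plus the two extreme values from the example.
typical = [92, 94, 95, 96, 96, 97, 98, 99, 100, 93]
weights = typical * 2 + [180, 49]

print(find_outliers(weights))
```

Because extreme values inflate the standard deviation themselves, the threshold is a judgment call and may need adjusting for small samples.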
Why are outliers bad? As the example above suggests, when you use the variable to analyse a business problem, an outlier can bias your results and make statistical estimates such as the mean and variance unstable.
[Figure: comparison of the data without the outlier and with the outlier]
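The effect of a single extreme value on summary statistics can be sketched with a few lines of code. The weights below are made up for illustration; only the 180 kg value is taken from the example above.

```python
import statistics

# Made-up typical weights (kg), then the same data with one extreme value added.
typical = [92, 94, 95, 96, 96, 97, 98, 99, 100, 93]
with_outlier = typical + [180]

def summarize(values):
    """Return the sample mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

mean_without, sd_without = summarize(typical)
mean_with, sd_with = summarize(with_outlier)

print(f"without outlier: mean={mean_without:.1f}, sd={sd_without:.1f}")
print(f"with outlier:    mean={mean_with:.1f}, sd={sd_with:.1f}")
```

One added value pulls the mean upward and inflates the standard deviation several times over, which is the kind of instability described above.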
An outlier may exist in the data because of a manual data entry error, it may occur naturally, or it may arise from a miscalculation when the variable was derived. In any case, it must be treated properly before the analysis.
Ways to treat an outlier:
- Delete the outlier: simply removing the outliers from your sample data can easily fix the problem. But this should only be done if they form a very small percentage of your data, for example less than 2%.
- Capping and flooring: put a cap on the maximum value and a floor on the minimum value. For example, cap all values greater than the 80th percentile at the 80th percentile and, likewise, floor all values less than the 20th percentile at the 20th percentile.
- Transforming: a log transformation can also be a good fix for the problem of outliers. The natural log compresses large values, which reduces the variation caused by an extreme value. It applies only to positive values.
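The three treatments above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the `low`/`high` bounds used for deletion, the linear-interpolation percentile, and the sample weights are all hypothetical choices.

```python
import math

def percentile(values, pct):
    """Percentile by linear interpolation between the closest ranks (an assumed method)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def delete_outliers(values, low, high, max_share=0.02):
    """Drop values outside [low, high], but only if they are under max_share of the data."""
    kept = [v for v in values if low <= v <= high]
    removed_share = 1 - len(kept) / len(values)
    if removed_share >= max_share:
        raise ValueError("outliers are too large a share of the data to delete")
    return kept

def cap_and_floor(values, floor_pct=20, cap_pct=80):
    """Raise values below the floor percentile and lower values above the cap percentile."""
    floor_value = percentile(values, floor_pct)
    cap_value = percentile(values, cap_pct)
    return [min(max(v, floor_value), cap_value) for v in values]

def log_transform(values):
    """Natural log of each value; raises ValueError for zero or negative values."""
    return [math.log(v) for v in values]

# Made-up weights (kg) including the two extreme values from the example.
weights = [49, 92, 94, 95, 96, 96, 97, 98, 99, 100, 180]

capped = cap_and_floor(weights)
logged = log_transform(weights)
print(capped)
print(max(weights) - min(weights), max(logged) - min(logged))
```

Deletion is guarded by `max_share` to reflect the "less than 2%" guideline, so on this tiny sample it would refuse to drop two values out of eleven. Capping keeps every observation but changes the extreme ones, while the log transform keeps the ordering and only shrinks the spread.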