Detecting anomalies using a histogram: Seeking quant analytics suggestions on the bin range…

When detecting anomalies with a histogram, would you recommend setting the first and last bin edges at three standard deviations from the center? Based on the descriptive statistics below, the bin range would normally start at 550 and end at 1338 (the minimum and maximum). If I use three standard deviations on either side of the median instead, the range would start at 789 and end at 1406.

Normal Descriptive Statistics

Mean 1083.84
Standard Error 7.260163349
Median 1,098
Mode 1,065
Standard Deviation 102.6742147
Sample Variance 10541.99437
Kurtosis 6.171381149
Skewness -1.692586856
Range 788
Minimum 550
Maximum 1338
Sum 216768
Count 200

-3 SD   -2 SD   -1 SD   Median   +1 SD   +2 SD   +3 SD
789     892     995     1097.5   1200    1303    1406
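For anyone who wants to check the arithmetic, here is a minimal Python sketch that rebuilds these edges from the summary statistics posted above. The helper name `sd_edges` is made up for illustration; only the numbers come from the post. It also checks which of the reported extremes would fall outside a histogram clipped at three standard deviations.

```python
# Summary statistics copied from the post above.
median = 1097.5
sd = 102.6742147
minimum, maximum = 550, 1338

def sd_edges(center, sd, k=3):
    """Return edges at center - k*sd, ..., center, ..., center + k*sd."""
    return [center + i * sd for i in range(-k, k + 1)]

edges = [round(e, 1) for e in sd_edges(median, sd)]
lo, hi = edges[0], edges[-1]

# Reported extremes that a histogram clipped to [lo, hi] would leave out.
outside = [v for v in (minimum, maximum) if v < lo or v > hi]

print("edges:", edges)
print("outside the 3 SD window:", outside)
```

With these figures the minimum of 550 lies below the lowest edge, so a histogram clipped this way would not show it at all.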



I’m not sure why you’re binning. Your skew isn’t that extreme (although the kurtosis indicates the distribution is a bit pointy), but your minimum is an outlier at more than 5 SD below the mean.
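The "5+ SD" figure can be reproduced from the posted summary statistics alone. A quick sketch in Python (the variable names are mine, the numbers are from the question):

```python
# Summary statistics copied from the question.
mean = 1083.84
sd = 102.6742147
minimum, maximum = 550, 1338

# Distance of each extreme from the mean, in standard deviations.
z_min = (minimum - mean) / sd
z_max = (maximum - mean) / sd

print(f"z(min) = {z_min:.2f}, z(max) = {z_max:.2f}")
```

The minimum comes out a little over 5 SD below the mean, while the maximum is roughly 2.5 SD above it, which fits the negative skew.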

Perhaps you suspect that this variable has a non-linear relationship to an outcome / target variable? If that is the case, you could try using a quadratic or cubic term instead of binning.

But if you are going to bin, I recommend simply graphing the values of your outcome variable across the values of this variable to observe where the inflection points seem to be occurring. Naturally this will lead to an increased risk of overfitting, but as long as you employ a holdout sample for validation you should be fine.

Hint: if the data is especially noisy, drop in a quadratic, cubic or higher-order trendline (you can do it in Excel) to more clearly see the relationship.
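If you would rather do this outside Excel, here is a rough Python sketch of the same idea: fit polynomial trendlines of increasing degree on a training split and compare their error on a holdout sample. The data is synthetic and the outcome `y` is invented purely for illustration, since the original post does not include an outcome variable. The 150/50 split and the degrees tried are also my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a predictor shaped roughly like the
# posted variable and an invented outcome with a curved relationship.
x = rng.normal(1084, 103, size=200)
idx = rng.permutation(200)
train, hold = idx[:150], idx[150:]

# Standardize with training statistics only, to keep the fit well conditioned.
xs = (x - x[train].mean()) / x[train].std()
y = 2.0 + 0.5 * xs - 1.5 * xs**2 + rng.normal(0, 0.5, size=200)

def holdout_rmse(degree):
    """Fit a polynomial trendline on the training split, score it on the holdout."""
    coefs = np.polyfit(xs[train], y[train], degree)
    pred = np.polyval(coefs, xs[hold])
    return float(np.sqrt(np.mean((y[hold] - pred) ** 2)))

rmse = {d: holdout_rmse(d) for d in (1, 2, 3)}
for degree, err in rmse.items():
    print(f"degree {degree}: holdout RMSE = {err:.3f}")
```

If a higher degree stops improving the holdout error, the extra term is probably fitting noise rather than the relationship.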



If the purpose is to detect anomalies, why clip your histogram at 3 standard deviations? Use all of your data and pick an appropriate number of bins.
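As a rough illustration in Python, the sketch below bins the full range of a sample and flags values that land in nearly empty bins. The raw data was not posted, so the sample is synthetic with one low value planted at 550. The Freedman-Diaconis rule (`bins="fd"` in NumPy) and the "at most one observation" threshold are my choices for the example, not something prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 200 observations (assumption): 199 ordinary
# values plus one planted low value like the reported minimum.
data = np.append(rng.normal(1090, 90, size=199), 550.0)

# Bin edges over the full data range, no clipping at 3 SD.
edges = np.histogram_bin_edges(data, bins="fd")
n_bins = len(edges) - 1

# Assign each value to a bin; the last bin includes the right edge.
bin_idx = np.digitize(data, edges[1:-1])
counts = np.bincount(bin_idx, minlength=n_bins)

# Flag values sitting in bins with at most one observation (threshold is
# an assumption; tune it to the sample size).
flagged = np.sort(data[counts[bin_idx] <= 1])

print("number of bins:", n_bins)
print("flagged values:", np.round(flagged, 1))
```

Sparse bins in the ordinary tails can get flagged too, so treat the output as candidates to inspect rather than confirmed anomalies.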



