Quant development: Using rolling average for data cleaning / remove outliers. Is fix in Excel?
I have a set of data which has daily and seasonal cycle and expands over ten years. I need to remove erroneous values using the rolling average of 5 points at a time (I might end up using more points for the rolling average, maximum of 24 values which represent the whole day, … more…
I have a set of data which has daily and seasonal cycle and expands over ten years. I need to remove erroneous values using the rolling average of 5 points at a time (I might end up using more points for the rolling average, maximum of 24 values which represent the whole day, till I get something that just eliminates the wrong values. Any point which is outside the range of average + 3 standard deviations to average – 3 standard deviations will be removed (although I might change the +/- 3 standard deviations to +/- 2 standard deviations as it suites my work).
Essentially I want to remove outliers based on a rolling average principle (averaging the whole sample would dampen the cycles I have)….Can anyone suggest how to do this in SPSS/STATISTICA?EXCEL?
Ideally, the method should show me the points it identifies as outliers before just deleting them
I’ve got a suggestion that maybe could help you on your effort to remove outliners. I have tried several times the Data Mining Add-in for Excel 2010, especifically the Clean Data tool.
It works great. you can evaluate your parameters (mean, average deviation and limits) to set the threshold to tune the tool.
This data mining add-in is free, you can download it from Microsoft site.
that is an excellent idea. Particularly if it is in the initial stages of analysis i.e. data exploration or methodology proto-typing. I didn’t realize that the Data Mining Add-in would allow tuning of the sort you described. I like SPSS, and I like SAS even more, but the convenience of an Excel spreadsheet with enhanced functionality is hard to beat when one is first getting acquainted with a dataset.
Ashraf, I am curious about your motivation for removing so much outlier data, particularly everythig beyond 2 STD’s. Rolling averages is the right way to do it, of course. But that might not be something you want to disclose, as part of your work. Hmmm, it could be useful in forensic financial risk analysis!
In my field, data tables from scientific papers are very frequently provided as Supplemental Data in Adobe PDF Format. In many cases, when one extracts such tables as text in order to do further analysis, there will be some random junk mixed in. Some of the garbage I can remove with Perl Regexes, but some is less predictable and must be cleaned up by hand. This junk is not visible when viewing or printing in Acrobat, it is only visible in captured text.
One of my key tools for spotting data points in need of fixing is by computing the difference between each data point and a moving average of nearby values; then I use Excel Filtering to show me the largest and smallest values in the Delta column.
NOTE I now post my TRADING ALERTS into my personal FACEBOOK ACCOUNT and TWITTER. Don't worry as I don't post stupid cat videos or what I eat!