Do you think overfitting will occur?

(Last Updated On: June 4, 2012)


I’ve implemented the J48 algorithm for my dataset and the accuracy was 87.91%. When I identified and removed the misclassified instances, the accuracy improved to 98.79%. The original dataset had 10,000 instances; after removing the misclassified instances, 7,100 remain.
Will it overfit or not?



After using J48 (or any other algorithm that prunes the tree), you should see performance decrease relative to your original tree, and that decrease is due to your original tree overfitting the data.

The key metric you want to use to evaluate the ‘goodness of fit’ is how the model performs on a holdout sample. With n = 10,000, you should definitely have set aside a random group of at least n = 2,000 before you built your original tree.

Then, once you are confident in the tree you built on the remaining n = 8,000, you apply those rules to your holdout sample and see whether the model is robust and not overfitting the data.
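To make the split concrete, here is a minimal sketch in Python of pulling out a random holdout before any model is built. The function name `holdout_split` and the seed are illustrative assumptions, not part of Weka or J48; in practice the indices would select rows of your own dataset.

```python
import random

def holdout_split(n_rows, holdout_size, seed=0):
    """Return (train_idx, holdout_idx): disjoint, randomly chosen row indices."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices[holdout_size:], indices[:holdout_size]

# 10,000 rows: 2,000 set aside as the holdout, 8,000 left for building the tree
train_idx, holdout_idx = holdout_split(10_000, 2_000)
print(len(train_idx), len(holdout_idx))  # 8000 2000
```

The holdout rows are never touched while building or tuning the tree; they are scored only once, at the end.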

(Also, I wasn’t sure what you meant when you said you “removed misclassified instance”… it sounds odd.)



I just answered a similar question asked by your fellow Ayne on another forum. I agree with John: if you first classify the data and then remove the instances that are misclassified, you artificially increase performance, and this is equivalent to overfitting.
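A small simulation can illustrate why the accuracy jumps. This is only a sketch under stated assumptions: the data is synthetic (one feature, 12% label noise), and a one-threshold "stump" stands in for J48, so the numbers are not the asker's. The names `make_data`, `accuracy`, and `fit_stump` are hypothetical helpers.

```python
import random

def make_data(n, noise, rng):
    """Synthetic rows (x, label): label is x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows

def accuracy(threshold, rows):
    """Fraction of rows where the rule `x > threshold` matches the label."""
    return sum((x > threshold) == label for x, label in rows) / len(rows)

def fit_stump(rows):
    """Pick the grid threshold with the best accuracy on `rows` (stand-in for J48)."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda t: accuracy(t, rows))

rng = random.Random(42)
data = make_data(10_000, noise=0.12, rng=rng)
train, holdout = data[:8_000], data[8_000:]   # rows were generated in random order

threshold = fit_stump(train)
# "Remove the misclassified instances": keep only rows the model already gets right
kept = [(x, label) for x, label in train if (x > threshold) == label]

acc_train = accuracy(threshold, train)
acc_kept = accuracy(threshold, kept)          # perfect by construction
acc_holdout = accuracy(threshold, holdout)    # untouched data: the honest estimate
print(f"train={acc_train:.3f} kept={acc_kept:.3f} holdout={acc_holdout:.3f}")
```

The score on the filtered set is perfect only because every row the model got wrong was deleted; the score on the untouched holdout does not move.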

Regarding J48 in particular, I would advise paying attention to tree stability. In other words, if you randomly split your data into training and test sets multiple times and build a tree on each training set, the tree shouldn’t change much from one training set to another. If it does, then high accuracy is irrelevant, as it was attained by accident. In that case, one can’t be sure that such a tree will generalize well to unseen data.

In other words, I would prefer a stable tree with a lower accuracy to an unstable tree with a higher accuracy.
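The repeated-split check described above can be sketched as follows. Again this is an illustration under assumptions: synthetic data, and a one-threshold stump standing in for J48, so "how much the tree changes" is reduced to how much the fitted threshold moves. With a real tree you would compare which attributes and split points appear near the root. The helper names are hypothetical.

```python
import random

def make_data(n, noise, rng):
    """Synthetic rows (x, label): label is x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = (x > 0.5) != (rng.random() < noise)
        rows.append((x, label))
    return rows

def fit_stump(rows):
    """Pick the grid threshold with the best accuracy on `rows` (stand-in for J48)."""
    def acc(t):
        return sum((x > t) == label for x, label in rows) / len(rows)
    return max((i / 100 for i in range(101)), key=acc)

def split_thresholds(rows, n_repeats, train_fraction, seed=0):
    """Refit the model on several random training subsets; return each fitted threshold."""
    rng = random.Random(seed)
    n_train = int(len(rows) * train_fraction)
    thresholds = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        thresholds.append(fit_stump(shuffled[:n_train]))
    return thresholds

data = make_data(2_000, noise=0.12, rng=random.Random(1))
thresholds = split_thresholds(data, n_repeats=10, train_fraction=0.8)
spread = max(thresholds) - min(thresholds)
print("thresholds:", thresholds)
print("spread:", round(spread, 2))
```

A small spread across splits suggests the fitted rule is stable; a large one suggests the rule depends on which rows happened to land in the training set.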



There’s always overfitting. The question is, how much overfitting? You need to have a way of accurately measuring that.



By coincidence, just last night I was reading a paper that describes a method of quantifying overfitting: Harrell, Lee & Mark, “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors” (Tutorial in Biostatistics), Statistics in Medicine, Vol. 15, 361-387, 1996.

Given Dr. Harrell’s generous nature in terms of sharing, I’m thinking/hoping he’s built a function to do this in one of the R packages he’s authored or co-authored, but I haven’t looked yet.

(edited: I originally said the calc used a bootstrap method, but that is incorrect. He uses bootstrapping to look for smaller models that may have less overfitting.)

URL to a PDF of the paper Douglas mentioned:

