Quant analytics Classification Step: C5.0 vs SVM

(Last Updated On: September 22, 2011)


A new post on my blog:
The easiest way to play with a document classifier.
I hope it triggers some discussion on classification techniques.


The article suggests "feature reduction", which I can't accept. That is not how data mining should deal with complexity.
Although there is a shortage of good DM tools, it is essential that classification or clustering not be compromised.
Classification features are data mining's RESULT, not its input; that is why it is so sought after.


The result should be the text classification, not the features! I use the features (to build the vectors) to feed the classifier, and the output is the class of the documents.
Feature reduction is a step to reduce the complexity of the problem:
if you feed an SVM, for example, with vectors of size 100, feature reduction lets you feed the same SVM with vectors of size < 100 with a minimal loss of information (which you can measure via mutual information, or more easily by comparing the accuracy before and after the reduction step).
As you know, the time-space complexity is (almost always) related to the dimension of the input, which is why feature reduction is a common step in classification problems.
Of course, the final accuracy is strictly related to the features you use to build the vectors.
Regarding feature extraction, I showed in a former post the most common techniques to build a "bag of words" based on TF-IDF and on trivial frequency analysis.
In the next post I'll provide an accuracy comparison using the same algorithm fed with vectors built from different feature sets (just to show how important the features are in document classification).
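To make the bag-of-words and reduction steps above concrete, here is a minimal pure-Python sketch; the toy corpus and the keep-top-k-by-total-TF-IDF-weight criterion are my own illustrative assumptions (a simple stand-in for a mutual-information ranking), not the exact method from the post:

```python
import math
from collections import Counter

# Toy corpus (invented for illustration; any tokenized documents work).
docs = [
    "buy stock buy option",
    "sell stock sell future",
    "parse text classify text",
    "classify document text corpus",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

def tf_idf_vector(tokens):
    """One TF-IDF vector over the shared vocabulary."""
    counts = Counter(tokens)
    vec = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for d in tokenized if term in d)  # document frequency
        vec.append(tf * math.log(n_docs / df))
    return vec

vectors = [tf_idf_vector(t) for t in tokenized]

# Feature reduction: keep the k features carrying the most total TF-IDF
# weight across the corpus, shrinking every vector from |vocab| to k.
k = 5
kept_idx = sorted(range(len(vocab)),
                  key=lambda i: -sum(v[i] for v in vectors))[:k]
reduced = [[v[i] for i in kept_idx] for v in vectors]

print(len(vocab), "->", len(reduced[0]))
```

The accuracy comparison mentioned above would then be run twice, once on `vectors` and once on `reduced`, to measure how much information the reduction actually lost.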


Agreed that for document classification any enhancement is welcome, but not for the general case.
Of course one tries to simplify the problem where possible, i.e. feature reduction. Just consider the risk of throwing the baby out with the bathwater: in a vast multi-dimensional solution space, a procedure that requires human intervention is too limited and could lead to wrong conclusions.

COMPLEXITY IS ANYTHING BUT EXPECTED, so I'd be cautious about analytics techniques that otherwise work very well.


I'm totally with you!
My reasoning is valid only within the domain of text categorization and cannot be extended to different contexts.


I once did library material classification using data mining.
The input was a table with articles along the rows and, as attributes: article keywords, keywords extracted from titles and pictures, authors, date, affiliation, article length…
It worked well.
Is it relevant?


Yes it is!
In my opinion the main problem of text categorization is the low re-usability of the techniques: when the domain changes, the results are often different!
For example: in a domain of documents with many words, I've noticed that methods like PLSA (for feature extraction) work well, but if you have to classify documents with very few words (like invoices or tables of figures), other strategies work better.
So what is good for a specific corpus may not be good for another kind of corpus.
BTW, from this perspective I'm with you: the extracted features are the real discriminant for good accuracy, so I agree: feature extraction is part of the problem.

This is why, on the blog, I'm comparing the same algorithm with different features: to understand which feature set is best for the specific corpus.
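That comparison (same algorithm, pluggable feature sets) can be sketched in pure Python with a tiny nearest-centroid classifier; the corpus, the labels, and the choice of nearest-centroid with cosine similarity are my own illustrative assumptions, not the algorithm from the post:

```python
import math
from collections import Counter

# Invented toy corpus; documents and labels are illustrative only.
train = [
    ("buy stock buy option", "finance"),
    ("sell stock sell future", "finance"),
    ("parse text classify text", "nlp"),
    ("classify document text corpus", "nlp"),
]
test_docs = [("buy future stock", "finance"),
             ("classify text document", "nlp")]

vocab = sorted({t for text, _ in train for t in text.split()})
doc_sets = [set(text.split()) for text, _ in train]
n = len(doc_sets)

def count_features(tokens):
    """Feature set 1: raw term counts."""
    c = Counter(tokens)
    return [float(c[t]) for t in vocab]

def tfidf_features(tokens):
    """Feature set 2: TF-IDF weights over the same vocabulary."""
    c = Counter(tokens)
    return [(c[t] / len(tokens))
            * math.log(n / sum(1 for s in doc_sets if t in s))
            for t in vocab]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def accuracy(featurize):
    """Nearest-centroid classifier: the algorithm stays fixed,
    only the feature-building function changes."""
    by_class = {}
    for text, label in train:
        by_class.setdefault(label, []).append(featurize(text.split()))
    centroids = {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for lab, vecs in by_class.items()}
    hits = sum(max(centroids,
                   key=lambda lab: cosine(featurize(text.split()),
                                          centroids[lab])) == label
               for text, label in test_docs)
    return hits / len(test_docs)

acc_counts = accuracy(count_features)
acc_tfidf = accuracy(tfidf_features)
print("counts:", acc_counts, "tf-idf:", acc_tfidf)
```

Swapping a different `featurize` function (e.g. one built on a reduced or PLSA-derived feature set) into `accuracy` is all it takes to extend the comparison to another corpus.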


