Data Segmentation Technique – Financial Modeling
There are six financial ratios to be used to predict the probability of liquidation for a defaulted firm. Since the data set has only 239 records, I need a technique or methodology to predict the dependent variable using all six ratios. I tried logistic regression, but because the number of observations is small, the approach doesn’t work. I am thinking of segmenting the variables and then assigning a score to each segment based on the historical liquidation rate in that segment. Can anyone suggest a statistical technique for segmenting the variables, or a better approach to the problem?
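A minimal sketch of the segment-and-score idea described above, assuming equal-frequency (quantile) bins and using each bin's historical liquidation rate as its score. The function names, the three-bin choice, and the toy values are illustrative assumptions, not part of the original data.

```python
def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points that split the values into roughly equal groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Return the index of the first bin whose upper edge exceeds the value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def liquidation_rate_by_bin(values, liquidated, n_bins=3):
    """Bin one ratio and compute the observed liquidation rate in each bin."""
    edges = quantile_edges(values, n_bins)
    counts = [0] * n_bins
    events = [0] * n_bins
    for v, y in zip(values, liquidated):
        b = assign_bin(v, edges)
        counts[b] += 1
        events[b] += y
    rates = [e / c if c else None for e, c in zip(events, counts)]
    return edges, counts, rates


# Toy data (hypothetical): one ratio and a 0/1 liquidation flag per firm.
ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
flag = [1, 1, 0, 1, 0, 0]
edges, counts, rates = liquidation_rate_by_bin(ratio, flag, n_bins=3)
print(edges, counts, rates)
```

With only 31 events in total, each bin holds very few liquidations, so the per-bin rates will be noisy. Fewer, wider bins are probably safer than many narrow ones.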
Looks like you have a rough problem with only 239 observations, but logistic regression might not be a good fit for the job, as it is used with discrete dependent variables. Yours looks continuous, so I would use multiple regression analysis. Keep an eye on the ratios, as using all of them might not be the best option, especially with only 239 observations.
Good luck, I hope it helps
I’m confused by the initial question and the next comment. Your target/dependent variable is defaulted/not defaulted firm, and thus a logistic regression, not a linear regression, is more appropriate.
What I don’t understand is the comment on number of observations (239) versus 6 predictors. What’s the problem about few observations?
If you do not have enough observations, you can try reducing the number of variables via a technique such as principal components analysis. Or are your six variables truly uncorrelated? Also, how many defaulters do you have in your sample of 239?
The dependent variable is dichotomous, which is why I thought of using logistic regression. However, the results do not validate on the out-of-time sample. I could also have used discriminant analysis, but the independent variables do not meet its assumptions, such as being normally distributed.
What I am looking for is a technique that either tells me the importance of each variable or assigns a weight to each variable, so that we can calculate a Z-score and, based on cut-offs, define high/medium/low liquidation buckets. Also, I will be using all six variables, so there is no need to reduce their number.
The six variables are uncorrelated with each other. The number of defaulters in my data set is 31 (13% event rate).
No matter how you approach it, the analysis will be largely based on heuristics. If you know the debt rating of these companies or their SIC code, S&P and Fitch publish 20-year default rates for each. Poisson regression is more widely used than logistic regression by academics when faced with a sparse dependent variable. Oversampling of the DV is another widely used technique for this situation. That said, I’m with the others in wondering why logistic regression doesn’t work with 239 observations and 6 predictors. Were they all insignificant? If so, try dropping the least significant predictors one at a time until only one, two or a few that are probably significant remain. Then I would use the Wald chi-square to rank variable importance, or go to Ulrike Grömping’s website for her relaimpo R package to develop a more rigorous metric of importance. Finally, I would use the predicted likelihood of default to calculate your cutoffs.
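As a rough illustration of the last two steps, here is a sketch that ranks predictors by their Wald chi-square, computed as (coefficient / standard error) squared, and maps a predicted probability to a bucket. The coefficient values, the ratio names, and the 5% / 25% cutoffs are all made-up placeholders. In practice they would come from the fitted logistic model and from whatever cutoffs suit the portfolio.

```python
def wald_chi_square(coef, std_err):
    """Wald chi-square statistic (1 degree of freedom) for a single coefficient."""
    return (coef / std_err) ** 2


def rank_by_wald(estimates):
    """Sort {name: (coef, std_err)} by Wald chi-square, largest first."""
    scored = [(name, wald_chi_square(b, se)) for name, (b, se) in estimates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def bucket(prob, low_cut=0.05, high_cut=0.25):
    """Map a predicted liquidation probability to low/medium/high (cutoffs are assumptions)."""
    if prob >= high_cut:
        return "high"
    if prob >= low_cut:
        return "medium"
    return "low"


# Hypothetical coefficient estimates and standard errors, for illustration only.
estimates = {
    "ratio_a": (1.2, 0.4),
    "ratio_b": (-0.3, 0.3),
    "ratio_c": (0.8, 0.2),
}
ranking = rank_by_wald(estimates)
print([name for name, _ in ranking])
print(bucket(0.30), bucket(0.10), bucket(0.01))
```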
With a prior of 13% and no good model found, don’t bother with discriminant at least for now. Thomas’ suggestion of adding info from the recently maligned Fitch et al is a good one. But, if your logistic is already bad, why bother with ranking variable importance when they’re not important for what you’re doing?
If you still believe that there is a model somewhere, try trees, for instance, or their more recent variants such as boosting.
Also, while you mention that the 6 ratios are uncorrelated, that sounds a bit unlikely. Try to obtain the VIFs (in a proc reg with default as the dependent variable). It could well be that some VIFs are very high, in which case collinearity would be affecting the logistic regression. Note that the usual rule of thumb of VIF > 10 does not apply in the logistic case. Instead, you have to run a regression with a weight = sqrt(probability predicted by the logistic). The resulting VIF cutoffs are about 2 or 3.
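For readers without SAS, the following is a plain-Python sketch of ordinary (unweighted) VIFs, where VIF_j = 1 / (1 - R²_j) and R²_j comes from regressing predictor j on the remaining predictors. It does not implement the weighted variant described above, and the helper names and toy columns are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(y, xs):
    """R^2 from an OLS fit of y on the columns in xs, with an intercept."""
    rows = [[1.0] + list(r) for r in zip(*xs)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def vifs(columns):
    """Return {name: VIF} for a dict of equally long predictor columns."""
    out = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        out[name] = 1 / (1 - r_squared(col, others))
    return out


# Toy columns (hypothetical): two strongly correlated predictors.
demo = vifs({"x1": [1, 2, 3, 4, 5], "x2": [1, 2, 3, 4, 6]})
print(demo)
```

A perfectly collinear column would make R² equal to 1 and trigger a division by zero, so this sketch is only meant for checking ordinary, imperfect correlation.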
I would still expect to see at least 10 defaulters for each covariate in the equation.
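Applying that rule of thumb to the figures given earlier in the thread (31 defaulters, 6 ratios) is simple arithmetic. The function names below are illustrative, and the threshold of 10 is the one quoted above.

```python
def events_per_variable(n_events, n_predictors):
    """Number of events available per candidate predictor."""
    return n_events / n_predictors


def max_predictors(n_events, min_epv=10):
    """Largest number of predictors that still meets the minimum events-per-variable rule."""
    return n_events // min_epv


n_records, n_events, n_predictors = 239, 31, 6
# The rarer outcome class is what limits the model; here that is the 31 events.
limiting = min(n_events, n_records - n_events)

epv = events_per_variable(limiting, n_predictors)
allowed = max_predictors(limiting)
print(round(epv, 2), allowed)
```

So with all six ratios there are roughly five events per covariate, and the rule would support about three covariates.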
You can try the bootstrap. More importantly, you will benefit a lot from reading Edward Altman’s seminal paper on the Z-score from the 1960s (just google it) and the variations that followed. He has done very similar things before.
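To give a feel for what the bootstrap does with a sample this small, here is a sketch that resamples the 239 outcomes with replacement and reads off a percentile interval for the liquidation rate. Applying it to the raw event rate rather than to model coefficients is a simplification chosen for illustration, and the resample count and seed are arbitrary.

```python
import random


def bootstrap_rate_interval(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Each resample draws n outcomes with replacement and records the event rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 31 liquidations among 239 defaulted firms, as stated in the thread.
outcomes = [1] * 31 + [0] * 208
low, high = bootstrap_rate_interval(outcomes)
print(low, high)
```

The same resampling loop could wrap a model fit instead of a simple mean, which would show how stable the coefficients are across resamples.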