Quant analytics: Model for time since…
I have a dataset with age, sex, and last month of doctor visit, going back two years with a residual group (has not visited doctor in last two years). Some of the records are of persons who have since left the country but remained registered with their doctor. A recent visit is evidence that the person has not left the country. How can I use the (age-sex specific) distribution of when last visited to estimate the number of persons still in the country? Any help/pointers much appreciated.
This can be framed as either a one-class or a two-class classification problem, with the classes ‘visited a doctor’ and ‘not visited a doctor’. Which framing to use depends on how many instances of each class you have. If the data set is heavily unbalanced, with one class far outnumbering the other, it may be better to discard the minority class and train on the majority class only. If the imbalance is not large, you can train a classifier on instances of both classes. Many algorithms would serve as the classifier; a support vector machine (SVM) is one example.
In one-class classification, training produces a hypersphere enclosing the instances of the training class; in the testing phase, anything falling outside that hypersphere is assigned to the other class. In your case, you could train the algorithm (e.g., a one-class support vector machine) on 2-dimensional instances, with age and sex as the dimensions, of those persons who recently visited a doctor. In 2D the hypersphere is simply a circle, so it is easy to visualize. In the testing phase, anything falling outside it is taken to be a person who has left the country. Counting such persons among your test instances gives the estimate you seek.
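To make the enclose-then-count idea concrete, here is a minimal sketch in plain Python. It is not a one-class SVM: it swaps in a much simpler stand-in (a centre plus a radius covering a chosen share of the standardised training points), and the records, the 0/1 coding of sex, and the names `fit_hypersphere` and `is_outside` are all made up for illustration.

```python
import math

def fit_hypersphere(train, quantile=0.95):
    """Fit a centre and radius enclosing roughly `quantile` of the training points.

    A simplified stand-in for a one-class SVM, not the SVM itself.
    """
    n, dims = len(train), len(train[0])
    # Standardise each dimension so age (years) does not swamp sex (0/1).
    means = [sum(p[d] for p in train) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((p[d] - means[d]) ** 2 for p in train) / n) or 1.0
        for d in range(dims)
    ]
    scaled = [[(p[d] - means[d]) / stds[d] for d in range(dims)] for p in train]
    # After standardising, the centre of the training cloud is the origin.
    dists = sorted(math.sqrt(sum(v * v for v in p)) for p in scaled)
    radius = dists[min(n - 1, int(quantile * n))]
    return means, stds, radius

def is_outside(point, model):
    """True if the point falls outside the fitted hypersphere."""
    means, stds, radius = model
    dist = math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(point, means, stds))
    )
    return dist > radius

# Hypothetical (age, sex) records of persons who recently visited a doctor.
recent_visitors = [(30, 0), (32, 1), (35, 0), (38, 1),
                   (40, 0), (42, 1), (45, 0), (47, 1)]
model = fit_hypersphere(recent_visitors)

# Hypothetical test records; those outside the sphere are counted as "left".
test_records = [(39, 1), (90, 0), (33, 0)]
flagged = [r for r in test_records if is_outside(r, model)]
print(len(flagged), "of", len(test_records), "flagged as outside the boundary")
```

A real one-class SVM would learn a more flexible boundary, but the counting step at the end would look the same.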
Implementations of a one-class SVM are very likely available on the web in several programming languages; a search for “one-class SVM” should turn them up.
Finally, a word of warning about one-class classifiers: they tend to inflate the false positive rate (FPR), which here is the percentage of persons who still reside in the country but are classified as ‘left the country’. You should therefore estimate not only the true detection rate (TDR) but also the FPR. The ROC curve and the area under it are good summaries for this. Since a practical application needs the FPR to be as low as possible, you are only interested in the TDRs corresponding to very small FPRs, say from 0 to 5% (the upper limit should be agreed with the end customer).
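As a rough illustration of reading the TDR off the low-FPR end of the ROC curve, the sketch below sweeps a threshold over classifier scores and reports the best TDR with FPR at or below 5%. It assumes you have a validation set where the true status is known, which the original data may not provide; the scores, labels, and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TDR) pairs obtained by sweeping a threshold over the scores.

    labels: 1 = truly left the country, 0 = still resident.
    scores: higher means "more likely to have left".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def best_tdr_under(points, max_fpr=0.05):
    """Highest TDR among operating points whose FPR does not exceed max_fpr."""
    return max(tdr for fpr, tdr in points if fpr <= max_fpr)

# Hypothetical validation data: (score, true label). 4 leavers, 20 residents.
validation = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1),
              (0.70, 0), (0.60, 1)] + [(0.10, 0)] * 18
scores = [s for s, _ in validation]
labels = [l for _, l in validation]

points = roc_points(scores, labels)
print("Best TDR with FPR <= 5%:", best_tdr_under(points, max_fpr=0.05))
```

The 5% cap is passed as a parameter so it can be changed to whatever limit is agreed with the end customer.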
To mitigate the risk of a high false positive rate, you may need to periodically re-train the classifier on both old and new data. Re-training adjusts (expands) the boundary surrounding the instances of country residents. As processing time is unlikely to be critical in your case (no real-time requirements, right?), this is a quick-and-dirty solution that is still perfectly workable.
Your challenge is that you are unable to distinguish between (1) individual is not seeking medical care; and (2) individual has left the country. Thus, if you do not add any additional information to your model, you cannot reliably estimate how many of your “non-events” belong in category (1) or category (2).
If you were dealing with a U.S. sample, I would recommend using each person's address to look up U.S. Census figures on the rate at which people move out of that neighborhood, or on length of residence (not perfect, as some might be moving a short distance, especially if young or lower SES). If you are able to locate similar data for your sample, it might help estimate how many of your “non-events” are due to relocation, or category (2).
If you don’t have any other data to bring in, you could make an arbitrary cut at some time point in your data. Use the data before that cutoff as your predictors, and use ‘doctor visit’ after the cutoff (binary) as your target in a logistic regression. To the degree the model performs well, it will at least allow you to state the likelihood that an individual will visit their doctor in the future. However, if that likelihood is low, you won’t be able to tell whether it is due to non-adherence or relocation.
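A minimal sketch of the cutoff idea, in plain Python with a hand-rolled gradient-descent logistic regression. Everything specific here is an assumption: the twelve records are invented, and the predictor "visited before the cutoff" presumes you can derive some pre-cutoff visit indicator, which a last-visit-only data set may not fully support.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(age, sex, visited_before_cutoff):
    """Predictors known before the cutoff; age is centred at 50, in decades."""
    return [(age - 50) / 10.0, float(sex), float(visited_before_cutoff)]

def fit_logistic(rows, targets, lr=0.1, epochs=5000):
    """Fit logistic regression by batch gradient descent; last weight is the intercept."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, targets):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])
            for j, v in enumerate(x):
                grads[j] += (p - y) * v
            grads[-1] += p - y
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict_proba(weights, x):
    """Estimated probability of a doctor visit after the cutoff."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])

# Hypothetical records: (age, sex, visited_before_cutoff, visited_after_cutoff).
records = [
    (30, 0, 1, 1), (50, 0, 1, 1), (70, 0, 1, 1),
    (30, 1, 1, 0), (50, 1, 1, 1), (70, 1, 1, 1),
    (30, 0, 0, 0), (50, 0, 0, 0), (70, 0, 0, 1),
    (30, 1, 0, 0), (50, 1, 0, 1), (70, 1, 0, 0),
]
rows = [features(a, s, before) for a, s, before, _ in records]
targets = [after for _, _, _, after in records]

weights = fit_logistic(rows, targets)
p_with = predict_proba(weights, features(50, 0, 1))
p_without = predict_proba(weights, features(50, 0, 0))
print("P(visit after cutoff | visited before):", round(p_with, 3))
print("P(visit after cutoff | no visit before):", round(p_without, 3))
```

As noted above, a low predicted probability from a model like this says nothing about why the person is unlikely to visit.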
Your best bet would probably be more data, even if you have to extrapolate from community- or neighborhood-level published statistics.