Demographic Income Prediction

scikit-learn Machine Learning Algorithms

Project Outcomes:

We feed income data to an ML classification model to see which demographic attributes of a person are the strongest predictors of whether that person makes more or less than $50K (USD) a year.

To find which demographic traits (age, race, sex, etc.) were most significant in predicting someone's income, we trained a scikit-learn decision tree model on census income data and examined which traits the model assigned the most importance when sorting someone into the "<=50K" income group or the ">50K" income group.
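As a rough illustration of this step, the sketch below fits a scikit-learn decision tree and reads its feature_importances_ attribute. It is not the project's actual code: the data is synthetic, the column names are hypothetical, and the tree settings (max_depth, random_state) are assumptions. The real census data also contains categorical columns, which would need to be encoded as numbers before fitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the census data (NOT the real dataset): three numeric
# features, with the label driven mostly by the first one.
rng = np.random.default_rng(0)
feature_names = ["age", "hours_per_week", "noise"]  # hypothetical column names
X = rng.normal(size=(2000, 3))
# Label 1 stands in for ">50K", label 0 for "<=50K".
y = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=2000) > 1.0).astype(int)

# Hold out a test set, keeping the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

# The importances sum to 1; a larger value means the tree's splits relied
# more heavily on that feature.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

On this synthetic data the "age" column should come out on top simply because the label was constructed that way; with the real data, the ranking is whatever the fitted tree produces.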

After modeling the importance values, this is what we got.

Interestingly, traits intrinsic to a person, such as race, role in the family (father, mother, older brother, etc.), and sex, had minimal impact on whether someone was sorted into the "<=50K" or the ">50K" income group. Instead, the biggest predictors of income group were age, capital gains/losses, occupation, marital status, and hours worked per week.

These importance results suggest that "progress" in life, in terms of age, marital status, and career, is the critical deciding factor in whether someone is predicted to make less or more than 50k a year, while traits like race and sex have very little predictive impact.

Next, we look at the precision, recall, and f1-score of our model. This gives us a closer look at how reliable the model is for each income group.

Something interesting to note is that the model's precision, recall, and f1-score are much lower for ">50K" than for "<=50K". This is because "<=50K" is a much more common result in the data, so the algorithm "got more practice" at, and therefore got better at, identifying when someone made less than 50k annually than when someone made more.
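For reference, precision, recall, and f1-score are each computed per class from that class's true positives, false positives, and false negatives. The sketch below shows the arithmetic in plain Python. The confusion counts are made up purely for illustration and are not this project's results; the helper name per_class_metrics is hypothetical.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up confusion counts for illustration only (not this project's results).
#                  predicted <=50K   predicted >50K
# actual <=50K           700               60
# actual >50K            100              140
low = per_class_metrics(tp=700, fp=100, fn=60)   # "<=50K" treated as the positive class
high = per_class_metrics(tp=140, fp=60, fn=100)  # ">50K" treated as the positive class
accuracy = (700 + 140) / 1000

for label, (p, r, f) in (("<=50K", low), (">50K", high)):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

With counts shaped like these, overall accuracy can look healthy while the rarer class scores noticeably lower on all three metrics, which is the pattern described here. In scikit-learn, sklearn.metrics.classification_report produces the same per-class figures from the true and predicted labels.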

There are 3 main limitations to this project:

1 - Model reliability: While 82 percent accuracy isn't bad for a relatively simple ML model like this, the somewhat low reliability when handling entries in the ">50K" income group is certainly a hindrance. This problem could be reduced with a larger and more varied dataset, since this dataset is heavily skewed toward the "<=50K" income group.

2 - Scope: While this project was able to identify which demographic traits have high predictive value, more in-depth analysis is needed to identify which trait values correlate with which income group (e.g., does old age indicate >50K or <=50K?). This is an analytics project that we could tackle in the future.

3 - Granularity: Since this is a binary classification system, we have only 2 broad categories. Results may change if we were to introduce more possible income groups to sort into.
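
One possible starting point for the follow-up analysis mentioned under Scope is to group people by a trait value and compare the share of each group that falls in ">50K". The sketch below does this for age brackets in plain Python. The records are toy values invented for illustration (not drawn from the census data), and the age_bracket helper and the 10-year bracket width are assumptions.

```python
from collections import defaultdict

# Toy records for illustration only (hypothetical, not from the census data).
records = [
    {"age": 23, "income": "<=50K"},
    {"age": 27, "income": "<=50K"},
    {"age": 34, "income": ">50K"},
    {"age": 38, "income": "<=50K"},
    {"age": 45, "income": ">50K"},
    {"age": 49, "income": ">50K"},
    {"age": 52, "income": "<=50K"},
    {"age": 58, "income": ">50K"},
]


def age_bracket(age, width=10):
    """Map an age to a bracket label such as '30-39'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


# Count how many people fall in each bracket and how many of them are ">50K".
counts = defaultdict(lambda: {"total": 0, "high": 0})
for row in records:
    bucket = counts[age_bracket(row["age"])]
    bucket["total"] += 1
    bucket["high"] += row["income"] == ">50K"

# Share of each age bracket that falls in the ">50K" group.
high_income_share = {b: c["high"] / c["total"] for b, c in sorted(counts.items())}
for bracket, share in high_income_share.items():
    print(f"{bracket}: {share:.0%} in >50K")
```

The same grouping could be applied to any other trait (occupation, marital status, and so on) to see which values lean toward which income group.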

Conclusion:

Factors related to life, relationship, and career progression, such as age, marital status, and occupation, are the most reliable predictors of whether someone makes less than or more than 50k a year. However, innate characteristics such as race and sex contribute very little to predicting which income group someone belongs to.