Revolutionize Budget Management with the Power of Machine Learning
Classification model


Project URL in R: Clasification-model.notbeook-in-R.html

Tools used: RStudio, Microsoft Excel, Power Query, Tableau


Section 1: Project Introduction



Problem Description

IFood, a food delivery company that operates through an app, is experiencing stagnant growth and seeks to optimize the performance of its marketing activities, which are currently suffering from poor budget management.

Despite having solid revenues over the past three years, the growth prospects are not promising. Therefore, the company aims to address this issue by focusing on improving the efficiency of its marketing campaigns, particularly the upcoming campaign targeting the sale of a new device to its existing customer base.

Business Objective

"Develop a predictive model based on historical customer data that identifies those most likely to respond positively to the next marketing campaign."

This model should enable the marketing department to make quantitative and informed decisions, resulting in better use of the annual budget and, ultimately, an increase in sales and company growth.


Section 2: Data Description



Features to be provided for model development

For the project, a dataset with records for 2,206 of the company's customers was provided, containing the following attributes:

Column                   Data type    Sub data type  Ranges or categories
1. ID                    Categorical  Nominal        0-11,191
2. Age                   Numerical    Discrete       18-21
3. Income                Numerical    Continuous     1,730-666,666
4. Kidhome               Numerical    Discrete       0-2
5. Teenhome              Numerical    Discrete       0-2
6. Dt_Customer           Numerical    Discrete       2012-07-30 to 2014-06-29
7. Recency               Numerical    Discrete       0-99
8. MntWines              Numerical    Continuous     0-1,493
9. MntFruits             Numerical    Continuous     0-199
10. MntMeatProducts      Numerical    Continuous     0-1,725
11. MntFishProducts      Numerical    Continuous     0-259
12. MntSweetProducts     Numerical    Continuous     0-263
13. MntGoldProds         Numerical    Continuous     0-362
14. NumDealsPurchases    Numerical    Discrete       0-15
15. NumWebPurchases      Numerical    Discrete       0-27
16. NumCatalogPurchases  Numerical    Discrete       0-28
17. NumStorePurchases    Numerical    Discrete       0-13
18. NumWebVisitsMonth    Numerical    Discrete       0-20
19. AcceptedCpm1         Categorical  Nominal        0-1
20. AcceptedCpm2         Categorical  Nominal        0-1
21. AcceptedCpm3         Categorical  Nominal        0-1
22. AcceptedCpm4         Categorical  Nominal        0-1
23. AcceptedCpm5         Categorical  Nominal        0-1
24. Complain             Categorical  Nominal        0-1
25. Response             Categorical  Nominal        0-1
26. Education            Categorical  Nominal        Graduation, Master, PhD, 2nd Cycle, Basic
27. Marital Status       Categorical  Nominal        Married, Single, Widow, Divorced, Together, Alone, YOLO, Absurd

Section 3: Findings - Exploratory Data Analysis


Education vs Response


Observations

Response to previous campaigns vs Response to current campaign


Observations

Months of tenure vs Response to current campaign


Observations

Monthly spending vs Response to current campaign


Observations

Other attributes vs Response to current campaign


Observations

Section 4: Building the model with Logistic Regression


This section begins after statistical tests and attribute effect analysis, so the attributes to be used by the model have already been selected. If you want to know more about the entire process in detail, visit the project URL at the beginning of this web page.

Logistic Regression


Data split into training and validation sets

An 80%/20% split was chosen.
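
As a rough illustration of what an 80%/20% split does, here is a minimal Python sketch (the project itself was built in R, so this is not the author's code). The function name, the seed, and the use of plain integer IDs in place of the real customer table are assumptions made for the example.

```python
import random

def train_validation_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it at the requested fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the customer table: one integer ID per customer.
customer_ids = list(range(2206))
train, validation = train_validation_split(customer_ids)
print(len(train), len(validation))  # 1764 442
```

Fixing the seed keeps the split reproducible, so later comparisons between models are made on the same validation rows.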

Multicollinearity

No multicollinearity was detected in any term (VIF less than 5.0).
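
For readers unfamiliar with the variance inflation factor, the sketch below shows the idea in Python for the simplest case of two predictors, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. With more predictors, each VIF comes from regressing one predictor on all the others, which this sketch does not do. The function names and the spending values are made up for illustration and are not taken from the project.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 equals r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Made-up values for illustration only, not taken from the project data.
wine_spend = [10, 200, 35, 480, 90, 15]
meat_spend = [25, 150, 60, 300, 40, 30]
vif = vif_two_predictors(wine_spend, meat_spend)
print(round(vif, 2), "below 5.0" if vif < 5.0 else "5.0 or above")
```

A VIF of 1 means the predictor is uncorrelated with the other one; the value grows without bound as the correlation approaches 1 or -1.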

10-fold cross-validation results

The model shows high accuracy (above 85%).
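
The mechanics of 10-fold cross-validation can be sketched as follows in Python (not the R code used in the project). Only the index bookkeeping is shown; fitting and scoring the model on each fold is left out. The function name, the seed, and the row count of 1764 (roughly 80% of 2,206) are assumptions for the example.

```python
import random

def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(1764, k=10))
print(len(splits), len(splits[0][1]))  # 10 177
```

The reported cross-validation accuracy is then the average of the ten per-fold accuracies, each measured on rows the model did not see during fitting.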

Evaluation of precision, recall, and specificity of the model

When the model predicts a positive response, it is usually correct (71% precision), but it identifies only a minority of the customers who actually respond (30% recall, i.e., sensitivity). Sensitivity may be less important for this project.
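
To make the three metrics concrete, the Python sketch below derives them from the four cells of a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the project's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),      # share of predicted positives that are real
        "recall": tp / (tp + fn),         # share of real positives that are found
        "specificity": tn / (tn + fp),    # share of real negatives that are found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts chosen for illustration, not the project's results.
metrics = classification_metrics(tp=30, fp=10, tn=340, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, as in this made-up example, accuracy and specificity can look strong while recall stays low, which is why the three are reported separately.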

ROC and AUC visualization

The AUC is 0.815, which is considered very good.
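
One way to read the AUC is as the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder. The Python sketch below computes it that way by comparing every positive/negative pair; the labels and scores are invented for illustration, and the function name is an assumption.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented labels (1 = responded) and predicted probabilities.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.8]
print(round(auc(labels, scores), 3))  # 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic, which is fine for a sketch but slow on large samples.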

Out-of-sample performance

The model performs even better on previously unseen data, which is a positive sign of its predictive ability.

Section 5: Improving the model with Oversampling


Strategy for imbalanced data: Oversampling


We will now use the oversampling technique to try to improve the model.
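
The text does not say which oversampling variant was applied, so the Python sketch below assumes the simplest one, random oversampling: minority-class rows are duplicated at random until both classes have the same size. Representing rows as dictionaries, the function name, and the 15/85 class mix are assumptions for the example; only the Response column name comes from the data description.

```python
import random

def random_oversample(rows, label_key="Response", seed=42):
    """Duplicate minority-class rows at random until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Hypothetical training rows: 15 responders and 85 non-responders.
training_rows = [{"ID": i, "Response": 1 if i < 15 else 0} for i in range(100)]
balanced = random_oversample(training_rows)
print(len(balanced), sum(r["Response"] for r in balanced))  # 170 85
```

Oversampling should be applied to the training rows only; the validation and holdout rows keep their original class mix so that the evaluation reflects real conditions.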

Out-of-sample performance

The improvement provided by oversampling was not significant enough to justify its inclusion; therefore, we have chosen to maintain the original model without applying the oversampling technique.

Section 6: Improving the model with KNN


K-Nearest Neighbors


First, let's build a model to capture non-linear relationships that the logistic regression model could not capture.

Assessing the K-nearest neighbors model on the holdout sample

The results are not as good, which is expected: on its own, KNN is usually not competitive with more sophisticated classification techniques.

In practical model fitting, however, KNN can be used to add "local knowledge" in a staged process with other classification techniques. Therefore, we will add the KNN result as a new predictor that could improve the model's ability to classify.


Using KNN as a feature engine


Now let's add the KNN output as a feature in the logistic regression model to see whether it improves the model.
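
A minimal Python sketch of this idea follows, assuming the KNN feature is the share of responders among each customer's k nearest neighbors (the project's R code may define it differently). The function names, the toy two-column data, and k=2 are illustrative assumptions. Each row is excluded from its own neighbor search so that its own label does not leak into the new feature.

```python
import math

def knn_positive_share(train_x, train_y, query, k=3, skip_index=None):
    """Share of positive labels among the k training points nearest to query."""
    distances = [
        (math.dist(x, query), y)
        for i, (x, y) in enumerate(zip(train_x, train_y))
        if i != skip_index
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

def add_knn_feature(train_x, train_y, k=3):
    """Append the KNN score to each row, excluding the row itself from its neighbors."""
    return [
        list(x) + [knn_positive_share(train_x, train_y, x, k, skip_index=i)]
        for i, x in enumerate(train_x)
    ]

# Toy data: two numeric predictors per customer and a 0/1 response.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 1, 1, 1, 0]
augmented = add_knn_feature(train_x, train_y, k=2)
print(augmented[5])  # [1.1, 0.9, 1.0]
```

The augmented rows can then be passed to the logistic regression in place of the original predictors. Because KNN relies on distances, predictors on very different scales (income versus number of children, for example) would normally be standardized first.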

10-fold cross-validation results

Assessing the model's precision, recall, and specificity

While the model’s precision experienced a slight decline from 71% to 69%, the recall increased substantially from 30% to 41%. Additionally, the model’s accuracy improved modestly, rising from 87% to 88.5%.

Visualizing the model's ROC and AUC

These results indicate that the model has shown some improvement. However, it is crucial to evaluate the model’s performance using out-of-sample data to confirm its effectiveness.

Assessing the model on new data (holdout sample)

Regrettably, despite the promising appearance of the modified model, it did not yield any improvement on the holdout sample. Consequently, we have decided to retain the original logistic regression model for our analysis.

Visualizing observed vs. predicted values on the holdout sample


Section 7: Conclusion

The resulting model has a precision of 75%, sensitivity of 42%, specificity of 97%, accuracy of 89%, and an AUC of 0.831. This model will allow the company to focus its marketing campaigns on the customers most likely to respond positively to the new campaign. This should make better use of the department's budget and, in turn, support higher profits.

The effectiveness of the model could be improved, particularly in terms of sensitivity. It would be advisable to explore other modeling techniques, adjust parameters, or test different data preprocessing techniques to enhance the model's performance in this area.

©2023 Abraham Cedeño Levy