Data <- read_csv(file.path(PSDS_PATH, 'ml_project1_data.csv'))
Data <- arrange(Data, ID)
It would be useful to have a feature with each client's age.
Since the last date in the data is in June 2014, we assume for the age calculation that this analysis is being performed in 2014.
Data <- Data %>%
mutate(Age = 2014 - Year_Birth)
str(Data)
## tibble [2,240 × 30] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:2240] 0 1 9 13 17 20 22 24 25 35 ...
## $ Year_Birth : num [1:2240] 1985 1961 1975 1947 1971 ...
## $ Education : chr [1:2240] "Graduation" "Graduation" "Master" "PhD" ...
## $ Marital_Status : chr [1:2240] "Married" "Single" "Single" "Widow" ...
## $ Income : num [1:2240] 70951 57091 46098 25358 60491 ...
## $ Kidhome : num [1:2240] 0 0 1 0 0 0 1 1 0 1 ...
## $ Teenhome : num [1:2240] 0 0 1 1 1 1 0 1 1 0 ...
## $ Dt_Customer : Date[1:2240], format: "2013-05-04" "2014-06-15" ...
## $ Recency : num [1:2240] 66 0 86 57 81 91 99 96 9 35 ...
## $ MntWines : num [1:2240] 239 464 57 19 637 43 185 18 460 32 ...
## $ MntFruits : num [1:2240] 10 5 0 0 47 12 2 2 35 1 ...
## $ MntMeatProducts : num [1:2240] 554 64 27 5 237 23 88 19 422 64 ...
## $ MntFishProducts : num [1:2240] 254 7 0 0 12 29 15 0 33 16 ...
## $ MntSweetProducts : num [1:2240] 87 0 0 0 19 15 5 2 12 12 ...
## $ MntGoldProds : num [1:2240] 54 37 36 8 76 61 14 6 153 85 ...
## $ NumDealsPurchases : num [1:2240] 1 1 4 2 4 1 2 5 2 3 ...
## $ NumWebPurchases : num [1:2240] 3 7 3 1 6 2 6 3 6 2 ...
## $ NumCatalogPurchases: num [1:2240] 4 3 2 0 11 1 1 0 6 2 ...
## $ NumStorePurchases : num [1:2240] 9 7 2 3 7 4 5 4 7 3 ...
## $ NumWebVisitsMonth : num [1:2240] 1 5 8 6 5 4 8 7 4 6 ...
## $ AcceptedCmp3 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp4 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
## $ Complain : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : num [1:2240] 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : num [1:2240] 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : num [1:2240] 0 1 0 0 0 0 0 0 0 1 ...
## $ Age : num [1:2240] 29 53 39 67 43 49 38 54 56 27 ...
#Unique categories in each categorical column
unique(Data$Education)
## [1] "Graduation" "Master" "PhD" "2n Cycle" "Basic"
unique(Data$Marital_Status)
## [1] "Married" "Single" "Widow" "Divorced" "Together" "Alone" "YOLO"
## [8] "Absurd"
As can be observed, there are no trailing or leading spaces, misspellings, or blank entries; therefore, this portion of the data is clean.
# Replace missing Income values with 0
Data$Income[is.na(Data$Income)] <- 0
Group | Range | Data_Type |
---|---|---|
ID | 0-11191 | Categorical nominal |
Age | 18-121 | Numeric discrete |
Income | 0-666666 | Numeric continuous |
Kidhome | 0-2 | Numeric discrete |
Teenhome | 0-2 | Numeric discrete |
Dt_Customer | 2012-07-30 to 2014-06-29 | Date |
Recency | 0-99 | Numeric discrete |
MntWines | 0-1493 | Numeric continuous |
MntFruits | 0-199 | Numeric continuous |
MntMeatProducts | 0-1725 | Numeric continuous |
MntFishProducts | 0-259 | Numeric continuous |
MntSweetProducts | 0-263 | Numeric continuous |
MntGoldProds | 0-362 | Numeric continuous |
NumDealsPurchases | 0-15 | Numeric discrete |
NumWebPurchases | 0-27 | Numeric discrete |
NumCatalogPurchases | 0-28 | Numeric discrete |
NumStorePurchases | 0-13 | Numeric discrete |
NumWebVisitsMonth | 0-20 | Numeric discrete |
AcceptedCmp1 | 0-1 | Categorical nominal |
AcceptedCmp2 | 0-1 | Categorical nominal |
AcceptedCmp3 | 0-1 | Categorical nominal |
AcceptedCmp4 | 0-1 | Categorical nominal |
AcceptedCmp5 | 0-1 | Categorical nominal |
Complain | 0-1 | Categorical nominal |
Z_CostContact | 3-3 | Constant (uninformative) |
Z_Revenue | 11-11 | Constant (uninformative) |
Response | 0-1 | Categorical nominal |
Education | - | Categorical nominal |
Marital_Status | - | Categorical nominal |
It can be observed that the ranges of the numerical data are as expected; therefore, this portion of the data is clean.
# Check for duplicated customer IDs
duplicates <- duplicated(Data$ID)
num_true <- sum(duplicates)
print(num_true)
remove(duplicates, num_true)
We can conclude that there are no duplicate IDs.
write.csv(Data,file.path(PSDS_PATH, 'dataset_from_r_classificatory_model.csv'),row.names=FALSE)
rm(dir1,dir2,dir3,dir4,dir5,dir6,dir7,dir8,dir9,dir10,file_name,PSDS_PATH)
Analyze each predictor variable individually to understand its distribution and identify potential outliers or data issues (a sketch of these checks follows the list):

- Histograms: visualize the distribution of each continuous predictor variable.
- Box plots: identify the range, quartiles, and potential outliers of continuous predictor variables.
- Bar plots: visualize the frequency of each category of the categorical predictor variables.
- Summary statistics: mean, median, standard deviation, and quartiles to describe the central tendency and dispersion of each continuous predictor variable.
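A minimal sketch of these univariate checks with base R graphics; the specific columns are only examples:

# Univariate EDA: distribution, outliers, and summary statistics
hist(Data$Income, breaks=50, main='Income distribution', xlab='Income')
boxplot(Data$MntWines, main='MntWines', ylab='Amount spent on wine')
barplot(table(Data$Education), main='Education categories')
summary(Data$Income)  # quartiles, median, mean
sd(Data$Income)       # dispersion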
Analyze the relationship between each predictor variable and the target variable (see the sketch after this list):

- Scatter plots: plot continuous predictor variables against the target variable to visualize relationships and identify trends or patterns.
- Box plots or violin plots: visualize the distribution of the target variable across the different categories of a categorical predictor variable.
- Correlation matrix: calculate the correlation matrix for continuous predictor variables to identify strong linear relationships with the target variable.
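Since the target Response is binary, a hedged sketch of such bivariate checks might look like this (the column choices are illustrative):

# A continuous predictor split by the binary target
boxplot(Income ~ Response, data=Data, xlab='Response', ylab='Income')
# Response rate within each Education category
prop.table(table(Data$Education, Data$Response), margin=1)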
INC = Income
WIN = MntWines
FRU = MntFruits
MEA = MntMeatProducts
FIS = MntFishProducts
SWE = MntSweetProducts
GOL = MntGoldProds
All these variables appear to be correlated, but that is common sense: the higher a customer's income, the more they will spend monthly on each of our product categories.
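As a sketch, the correlation matrix behind this legend could be computed as follows (abbreviations as defined above):

# Correlation matrix for income and the spending variables
spend <- Data[, c('Income', 'MntWines', 'MntFruits', 'MntMeatProducts',
                  'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')]
colnames(spend) <- c('INC', 'WIN', 'FRU', 'MEA', 'FIS', 'SWE', 'GOL')
round(cor(spend), 2)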
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumDealsPurchases and Data_1$NumDealsPurchases
## W = 335931, p-value = 0.08919
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumWebPurchases and Data_1$NumWebPurchases
## W = 229882, p-value = 2.95e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumCatalogPurchases and Data_1$NumCatalogPurchases
## W = 199938, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumWebVisitsMonth and Data_1$NumWebVisitsMonth
## W = 312810, p-value = 0.6115
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumStorePurchases and Data_1$NumStorePurchases
## W = 291004, p-value = 0.01157
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Age and Data_1$Age
## W = 328994, p-value = 0.3267
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0_numeric and Data_1_numeric
## W = 418740, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Kidhome and Data_1$Kidhome
## W = 352752, p-value = 0.000247
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Teenhome and Data_1$Teenhome
## W = 390173, p-value = 4.765e-14
## alternative hypothesis: true location shift is not equal to 0
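The outputs above compare non-responders and responders; a minimal sketch of how they may have been produced, assuming Data_0 and Data_1 are the Response == 0 and Response == 1 subsets:

# Split the customers by campaign response and compare a predictor's distributions
Data_0 <- filter(Data, Response == 0)
Data_1 <- filter(Data, Response == 1)
wilcox.test(Data_0$NumDealsPurchases, Data_1$NumDealsPurchases)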
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Education
## X-squared = 23.076, df = NA, p-value = 0.0009995
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Marital_Status
## X-squared = 54.242, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp1
## X-squared = 119.64, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp2
## X-squared = 8.7505, df = NA, p-value = 0.004498
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp3
## X-squared = 144.81, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp4
## X-squared = 70.192, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp5
## X-squared = 128, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Complain
## X-squared = 0.0065267, df = NA, p-value = 1
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
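A hedged sketch of the chi-squared tests above, assuming the reaction_* objects are contingency tables of each categorical variable against Response (chisq.test simulates p-values with B = 2000 replicates by default):

# Contingency table of a categorical predictor vs. the target, simulated p-value
reaction_Education <- table(Data$Education, Data$Response)
chisq.test(reaction_Education, simulate.p.value=TRUE)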
##
## Call: glm(formula = Response ~ Education, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) EducationBasic EducationGraduation
## -2.1075 -1.1506 0.2489
## EducationMaster EducationPhD
## 0.4043 0.7693
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2235 Residual
## Null Deviance: 1887
## Residual Deviance: 1863 AIC: 1873
##
## Call:
## glm(formula = Response ~ Education, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6826 -0.5785 -0.5383 -0.4790 2.5674
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1075 0.2258 -9.334 < 2e-16 ***
## EducationBasic -1.1506 0.7548 -1.524 0.12741
## EducationGraduation 0.2489 0.2420 1.028 0.30380
## EducationMaster 0.4043 0.2678 1.510 0.13112
## EducationPhD 0.7693 0.2519 3.054 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1862.7 on 2235 degrees of freedom
## AIC: 1872.7
##
## Number of Fisher Scoring iterations: 5
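Each of the single-predictor models printed in this section can be reproduced with a pattern like the following (the object name education_lm is illustrative):

# Univariate logistic regression: print the fit, then its full summary
education_lm <- glm(Response ~ Education, family='binomial', data=Data)
education_lm           # coefficients, deviance, AIC
summary(education_lm)  # standard errors and z-tests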
##
## Call: glm(formula = Response ~ PhD_hot_encoding, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) PhD_hot_encoding
## -1.876 0.538
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1871 AIC: 1875
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6826 -0.5339 -0.5339 -0.5339 2.0093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.87608 0.07035 -26.667 < 2e-16 ***
## PhD_hot_encoding 0.53796 0.13209 4.073 4.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1871.0 on 2238 degrees of freedom
## AIC: 1875
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Marital_Status, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) Marital_StatusAlone Marital_StatusDivorced
## -7.629e-14 -6.931e-01 -1.344e+00
## Marital_StatusMarried Marital_StatusSingle Marital_StatusTogether
## -2.056e+00 -1.261e+00 -2.159e+00
## Marital_StatusWidow Marital_StatusYOLO
## -1.116e+00 3.286e-14
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2232 Residual
## Null Deviance: 1887
## Residual Deviance: 1836 AIC: 1852
##
## Call:
## glm(formula = Response ~ Marital_Status, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1774 -0.6809 -0.4907 -0.4673 2.1301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.629e-14 1.414e+00 0.000 1.000
## Marital_StatusAlone -6.931e-01 1.871e+00 -0.371 0.711
## Marital_StatusDivorced -1.344e+00 1.423e+00 -0.944 0.345
## Marital_StatusMarried -2.056e+00 1.418e+00 -1.450 0.147
## Marital_StatusSingle -1.261e+00 1.418e+00 -0.889 0.374
## Marital_StatusTogether -2.159e+00 1.421e+00 -1.520 0.129
## Marital_StatusWidow -1.116e+00 1.439e+00 -0.776 0.438
## Marital_StatusYOLO 3.286e-14 2.000e+00 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1835.7 on 2232 degrees of freedom
## AIC: 1851.7
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp1, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp11
## -1.977 2.172
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1750 AIC: 1754
##
## Call:
## glm(formula = Response ~ AcceptedCmp1, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2613 -0.5094 -0.5094 -0.5094 2.0526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.97680 0.06682 -29.59 <2e-16 ***
## AcceptedCmp11 2.17186 0.18030 12.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1750.2 on 2238 degrees of freedom
## AIC: 1754.2
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp2, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp21
## -1.798 2.491
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1845 AIC: 1849
##
## Call:
## glm(formula = Response ~ AcceptedCmp2, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4823 -0.5536 -0.5536 -0.5536 1.9755
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.79811 0.06093 -29.512 < 2e-16 ***
## AcceptedCmp21 2.49126 0.39206 6.354 2.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1844.8 on 2238 degrees of freedom
## AIC: 1848.8
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp3, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp31
## -1.958 1.847
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1780 AIC: 1784
##
## Call:
## glm(formula = Response ~ AcceptedCmp3, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.131 -0.514 -0.514 -0.514 2.044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95752 0.06664 -29.38 <2e-16 ***
## AcceptedCmp31 1.84697 0.17046 10.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1780.3 on 2238 degrees of freedom
## AIC: 1784.3
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp4, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp41
## -1.890 1.363
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1832 AIC: 1836
##
## Call:
## glm(formula = Response ~ AcceptedCmp4, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9634 -0.5304 -0.5304 -0.5304 2.0154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.89030 0.06505 -29.058 < 2e-16 ***
## AcceptedCmp41 1.36347 0.17287 7.887 3.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1831.8 on 2238 degrees of freedom
## AIC: 1835.8
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp5, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp51
## -2.026 2.285
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1718 AIC: 1722
##
## Call:
## glm(formula = Response ~ AcceptedCmp5, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2892 -0.4978 -0.4978 -0.4978 2.0735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.02586 0.06839 -29.62 <2e-16 ***
## AcceptedCmp51 2.28497 0.17214 13.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1718.4 on 2238 degrees of freedom
## AIC: 1722.4
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumWebPurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumWebPurchases
## -2.3375 0.1342
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1842 AIC: 1846
##
## Call:
## glm(formula = Response ~ NumWebPurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7495 -0.5883 -0.5195 -0.4577 2.2044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.33753 0.11306 -20.68 < 2e-16 ***
## NumWebPurchases 0.13422 0.02012 6.67 2.55e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1841.6 on 2238 degrees of freedom
## AIC: 1845.6
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumCatalogPurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumCatalogPurchases
## -2.3185 0.1819
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1792 AIC: 1796
##
## Call:
## glm(formula = Response ~ NumCatalogPurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3809 -0.5601 -0.4724 -0.4333 2.1965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31847 0.09220 -25.14 <2e-16 ***
## NumCatalogPurchases 0.18186 0.01914 9.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1792.0 on 2238 degrees of freedom
## AIC: 1796
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumStorePurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumStorePurchases
## -1.93786 0.03318
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1883 AIC: 1887
##
## Call:
## glm(formula = Response ~ NumStorePurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6328 -0.5778 -0.5518 -0.5434 2.0073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.93786 0.12297 -15.759 <2e-16 ***
## NumStorePurchases 0.03318 0.01783 1.861 0.0628 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1883.4 on 2238 degrees of freedom
## AIC: 1887.4
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Months_since_Dt_Customer, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) Months_since_Dt_Customer
## -2.85096 0.08549
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1802 AIC: 1806
##
## Call:
## glm(formula = Response ~ Months_since_Dt_Customer, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8314 -0.6396 -0.4849 -0.3795 2.4113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.850958 0.150151 -18.987 <2e-16 ***
## Months_since_Dt_Customer 0.085491 0.009666 8.845 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1801.8 on 2238 degrees of freedom
## AIC: 1805.8
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Teenhome, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Teenhome
## -1.3725 -0.8834
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1830 AIC: 1834
##
## Call:
## glm(formula = Response ~ Teenhome, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6722 -0.6722 -0.4464 -0.4464 2.5226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3725 0.0728 -18.854 < 2e-16 ***
## Teenhome -0.8834 0.1233 -7.166 7.72e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1829.9 on 2238 degrees of freedom
## AIC: 1833.9
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Kidhome, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Kidhome
## -1.5639 -0.4443
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1872 AIC: 1876
##
## Call:
## glm(formula = Response ~ Kidhome, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6165 -0.6165 -0.5019 -0.5019 2.2517
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.56389 0.07291 -21.449 < 2e-16 ***
## Kidhome -0.44431 0.11799 -3.766 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1871.9 on 2238 degrees of freedom
## AIC: 1875.9
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Recency, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Recency
## -0.86026 -0.02037
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1796 AIC: 1800
##
## Call:
## glm(formula = Response ~ Recency, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8400 -0.6371 -0.4800 -0.3690 2.4215
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.860263 0.103704 -8.295 <2e-16 ***
## Recency -0.020372 0.002233 -9.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1796.1 on 2238 degrees of freedom
## AIC: 1800.1
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Income, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Income
## -2.623e+00 1.623e-05
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1846 AIC: 1850
##
## Call:
## glm(formula = Response ~ Income, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0482 -0.6075 -0.5257 -0.4525 2.3209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.623e+00 1.647e-01 -15.922 < 2e-16 ***
## Income 1.623e-05 2.704e-06 6.001 1.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1846.3 on 2238 degrees of freedom
## AIC: 1850.3
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntWines, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntWines
## -2.405572 0.001769
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1765 AIC: 1769
##
## Call:
## glm(formula = Response ~ MntWines, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2792 -0.5714 -0.4413 -0.4181 2.2317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4055723 0.0935629 -25.71 <2e-16 ***
## MntWines 0.0017694 0.0001596 11.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1765.5 on 2238 degrees of freedom
## AIC: 1769.5
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntFruits, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntFruits
## -1.965980 0.007349
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1856 AIC: 1860
##
## Call:
## glm(formula = Response ~ MntFruits, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9724 -0.5520 -0.5190 -0.5120 2.0479
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.965980 0.074315 -26.455 < 2e-16 ***
## MntFruits 0.007349 0.001270 5.787 7.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1856.0 on 2238 degrees of freedom
## AIC: 1860
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntMeatProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntMeatProducts
## -2.222755 0.002318
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1784 AIC: 1788
##
## Call:
## glm(formula = Response ~ MntMeatProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9662 -0.5365 -0.4687 -0.4570 2.1557
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2227553 0.0822246 -27.03 <2e-16 ***
## MntMeatProducts 0.0023185 0.0002267 10.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1783.5 on 2238 degrees of freedom
## AIC: 1787.5
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntFishProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntFishProducts
## -1.950992 0.004889
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1862 AIC: 1866
##
## Call:
## glm(formula = Response ~ MntFishProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9036 -0.5558 -0.5239 -0.5155 2.0415
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9509921 0.0750546 -25.994 < 2e-16 ***
## MntFishProducts 0.0048891 0.0009445 5.176 2.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1862.0 on 2238 degrees of freedom
## AIC: 1866
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntSweetProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntSweetProducts
## -1.949835 0.006675
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1860 AIC: 1864
##
## Call:
## glm(formula = Response ~ MntSweetProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0961 -0.5541 -0.5223 -0.5158 2.0410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.949835 0.073848 -26.40 < 2e-16 ***
## MntSweetProducts 0.006675 0.001229 5.43 5.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1859.6 on 2238 degrees of freedom
## AIC: 1863.6
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntGoldProds, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntGoldProds
## -2.057133 0.006285
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1848 AIC: 1852
##
## Call:
## glm(formula = Response ~ MntGoldProds, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2713 -0.5565 -0.5127 -0.4949 2.0868
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0571330 0.0813300 -25.294 < 2e-16 ***
## MntGoldProds 0.0062849 0.0009771 6.432 1.26e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1848.4 on 2238 degrees of freedom
## AIC: 1852.4
##
## Number of Fisher Scoring iterations: 4
For this model, I first selected the most promising features, identified from the univariate regressions above by their large and significant coefficients. I then examined the chosen features for multicollinearity but found no evidence of it.
Finally, I employed stepwise regression to further refine the feature selection, ensuring that only the most impactful features were included in the model (a sketch of this workflow follows the data split below).
# Split the data into training and holdout sets using an 80-20 split
set.seed(123)
trainIndex <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData <- Data[trainIndex, ]
holdoutData <- Data[-trainIndex, ]
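A sketch of the selection workflow described above, assuming the car package for the VIF check (the object names match those used in the outputs below):

# Fit the candidate model, check multicollinearity, then refine by AIC
full_lm <- glm(Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
                 AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
                 MntWines + MntMeatProducts,
               family='binomial', data=trainData)
car::vif(full_lm)         # values near 1 indicate little multicollinearity
step_lm <- step(full_lm)  # backward elimination by AIC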
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3936 -0.5012 -0.3423 -0.2210 2.8323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2594155 0.2361324 -18.038 < 2e-16 ***
## PhD_hot_encoding 0.9439971 0.1756254 5.375 7.66e-08 ***
## AcceptedCmp11 1.7162674 0.2745354 6.252 4.06e-10 ***
## AcceptedCmp31 2.2026332 0.2235573 9.853 < 2e-16 ***
## AcceptedCmp41 0.9352136 0.2716465 3.443 0.000576 ***
## AcceptedCmp51 1.4995628 0.2781107 5.392 6.97e-08 ***
## Months_since_Dt_Customer 0.1126719 0.0130373 8.642 < 2e-16 ***
## MntWines -0.0005644 0.0002873 -1.965 0.049450 *
## MntMeatProducts 0.0014283 0.0003402 4.198 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.5 on 1792 degrees of freedom
## Residual deviance: 1139.5 on 1784 degrees of freedom
## AIC: 1157.5
##
## Number of Fisher Scoring iterations: 5
## PhD_hot_encoding AcceptedCmp1 AcceptedCmp3
## 1.085816 1.268044 1.080945
## AcceptedCmp4 AcceptedCmp5 Months_since_Dt_Customer
## 1.290124 1.483181 1.160394
## MntWines MntMeatProducts
## 1.994079 1.430336
All VIF values are close to 1, so there is no evidence of multicollinearity.
## Start: AIC=1157.48
## Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
## AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts
##
## Df Deviance AIC
## <none> 1139.5 1157.5
## - MntWines 1 1143.5 1159.5
## - AcceptedCmp4 1 1150.8 1166.8
## - MntMeatProducts 1 1156.6 1172.6
## - PhD_hot_encoding 1 1167.3 1183.3
## - AcceptedCmp5 1 1168.5 1184.5
## - AcceptedCmp1 1 1178.2 1194.2
## - Months_since_Dt_Customer 1 1224.3 1240.3
## - AcceptedCmp3 1 1230.7 1246.7
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3936 -0.5012 -0.3423 -0.2210 2.8323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2594155 0.2361324 -18.038 < 2e-16 ***
## PhD_hot_encoding 0.9439971 0.1756254 5.375 7.66e-08 ***
## AcceptedCmp11 1.7162674 0.2745354 6.252 4.06e-10 ***
## AcceptedCmp31 2.2026332 0.2235573 9.853 < 2e-16 ***
## AcceptedCmp41 0.9352136 0.2716465 3.443 0.000576 ***
## AcceptedCmp51 1.4995628 0.2781107 5.392 6.97e-08 ***
## Months_since_Dt_Customer 0.1126719 0.0130373 8.642 < 2e-16 ***
## MntWines -0.0005644 0.0002873 -1.965 0.049450 *
## MntMeatProducts 0.0014283 0.0003402 4.198 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.5 on 1792 degrees of freedom
## Residual deviance: 1139.5 on 1784 degrees of freedom
## AIC: 1157.5
##
## Number of Fisher Scoring iterations: 5
# Inspect records whose standardized residuals exceed 2.5
sresid <- rstandard(step_lm)
idx <- order(sresid, decreasing=TRUE)
sresid[idx[1]]
resid(step_lm)[idx[1]]
trainData$sresid <- rstandard(step_lm)
# We considered filtering out |sresid| > 2.5 but opted to keep these records:
#trainData <- trainData %>%
#  filter(sresid < 2.5) %>%
#  filter(sresid > -2.5)
We found 19 records whose standardized residuals exceed the 2.5 benchmark. We opted against excluding them: the vast majority of records adhere to the threshold, and removing these could compromise the integrity of the model.
## Accuracy Kappa Resample
## 1 0.9162011 0.5562717 Fold01
## 2 0.8777778 0.3659942 Fold02
## 3 0.8603352 0.1608851 Fold03
## 4 0.8491620 0.2054907 Fold04
## 5 0.8722222 0.2944785 Fold05
## 6 0.8764045 0.3873592 Fold06
## 7 0.8994413 0.5369359 Fold07
## 8 0.8888889 0.4666667 Fold08
## 9 0.8770950 0.3902137 Fold09
## 10 0.8555556 0.2507205 Fold10
## Generalized Linear Model
##
## 1793 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1614, 1613, 1614, 1614, 1613, 1615, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8773084 0.3615016
##
## Call: NULL
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 82 186
## Y = 0 33 1492
## [1] "Precision"
## [1] 0.7130435
## [1] "Recall"
## [1] 0.3059701
## [1] "Specificity"
## [1] 0.9783607
## [1] "Accuracy"
## [1] 0.8778583
The model's positive predictions are reliable (71% precision), but it lacks sensitivity, which might be important for this task (31% recall).
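For reference, a hedged sketch of how the 10-fold cross-validation above may have been produced with caret (Response is converted to a factor so caret treats the problem as classification):

# 10-fold cross-validation of the selected logistic regression
library(caret)
set.seed(123)
cv_model <- train(
  Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
    AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts,
  data=mutate(trainData, Response=factor(Response)),
  method='glm', family='binomial',
  trControl=trainControl(method='cv', number=10))
cv_model$resample  # per-fold Accuracy and Kappa
cv_model           # aggregated resampling results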
## Desired recall: 0.75
## Closest actual recall: 0.738806
## Specificity at the closest recall: 0.7731148
## [1] 0.3639706
A precision of 36.4% is not very good. Therefore, we decided to keep the default model with a 50% threshold instead of the 14% threshold that gave these results.
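A sketch of the threshold search, assuming the pROC package (consistent with the "Setting levels" messages later in this report); coords() tabulates recall, specificity, and precision at every candidate threshold:

# Trade-off between recall and precision across probability thresholds
library(pROC)
pred_prob <- predict(step_lm, type='response')
roc_obj <- roc(trainData$Response, pred_prob)
coords(roc_obj, x='all',
       ret=c('threshold', 'recall', 'specificity', 'precision'))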
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 28 38
## Y = 0 9 372
## [1] "Precision"
## [1] 0.7567568
## [1] "Recall"
## [1] 0.4242424
## [1] "Specificity"
## [1] 0.976378
## [1] "Accuracy"
## [1] 0.8948546
The model's performance on the holdout sample is encouraging, with a precision of 76% and a recall (42%) that surpasses the result obtained on the training data.
Recall is still modest, however: roughly 58% of actual responders are missed. Despite this, the model demonstrates solid predictive capability overall.
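The Confusion Matrix / Precision / Recall / Specificity / Accuracy blocks throughout this report can be generated from a small helper like the following (a hypothetical sketch; the original helper is not shown):

# Derive the reported metrics from 0/1 vectors of actual and predicted classes
report_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  print('Confusion Matrix')
  print(matrix(c(tp, fp, fn, tn), nrow=2,
               dimnames=list(c('Y = 1', 'Y = 0'), c('Yhat = 1', 'Yhat = 0'))))
  print('Precision');   print(tp / (tp + fp))
  print('Recall');      print(tp / (tp + fn))
  print('Specificity'); print(tn / (tn + fp))
  print('Accuracy');    print((tp + tn) / length(actual))
}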
### Oversampling and Up/Down Weighting
wt <- ifelse(trainData$Response == 1,
             1 / mean(trainData$Response == 1), 1)
Oversampling_model <- glm(Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
                            AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
                            MntWines + MntMeatProducts,
                          data=trainData, weights=wt, family='quasibinomial')
#pred <- predict(Oversampling_model)
#mean(pred > 0)
# Predict probabilities
pred_prob <- predict(Oversampling_model, type="response")
# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
pred_binary <- ifelse(pred_prob > 0.5, 1, 0)
# Check the mean of the binary predictions
mean(pred_binary)
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 46 20
## Y = 0 91 290
## [1] "Precision"
## [1] 0.3357664
## [1] "Recall"
## [1] 0.6969697
## [1] "Specificity"
## [1] 0.7611549
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8201
The weighted model trades precision (34%) for recall (70%); this improvement was not substantial enough to warrant its inclusion, so we have chosen to retain the original model without the up-weighting.
First, let's try to build a model to catch some non-linear relationships that could not be captured by the logistic regression model.
# Prepare the data: model.matrix one-hot encodes the factors (-1 drops the intercept)
Data_KNN <- model.matrix(~ -1 + Marital_Status + Education + NumWebPurchases +
                           NumCatalogPurchases + NumStorePurchases + Recency +
                           AcceptedCmp2 + MntGoldProds + Income, data=Data)
Data_KNN_std <- scale(Data_KNN)
# Create train and holdout indices
#trainIndex_knn <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData_knn <- Data_KNN_std[trainIndex, ,drop=FALSE]
holdoutData_knn <- Data_KNN_std[-trainIndex, ,drop=FALSE]
# Extract the corresponding 'Response' values for the train data
train_outcome <- Data[trainIndex,]$Response
# Train the KNN model (class::knn) and make predictions for the holdout data
knn_pred <- knn(train=trainData_knn, test=holdoutData_knn, cl=train_outcome, prob=FALSE, k=20)
# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)
# Add the predictions as a new column of the holdout data frame
holdoutData_knn_with_pred <- cbind(holdoutData, knn_pred_matrix)
# Set the column name for the new column
colnames(holdoutData_knn_with_pred)[ncol(holdoutData_knn_with_pred)] <- "Predicted_Outcome"
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 9 57
## Y = 0 5 376
## [1] "Precision"
## [1] 0.6428571
## [1] "Recall"
## [1] 0.1363636
## [1] "Specificity"
## [1] 0.9868766
## [1] "Accuracy"
## [1] 0.8612975
The results are not as good but, as is well known, KNN by itself is usually not competitive with more sophisticated classification techniques.
In practical model fitting, however, KNN can be used to add "local knowledge" in a staged process with other classification techniques.
Therefore, we'll add the KNN output as a new predictor that could improve the model's capacity to classify.
Data_KNN <- model.matrix(~ -1 + Marital_Status + Education + NumWebPurchases+ NumCatalogPurchases + NumStorePurchases + Recency + AcceptedCmp2 + MntGoldProds + Income, data=Data)
Data_KNN_std <- scale(Data_KNN)
knn_on_data <- knn(Data_KNN_std, test=Data_KNN_std, cl=Data$Response, prob=TRUE, k=20)
prob <- attr(knn_on_data, "prob")
# Check the class levels ('0' sorts first)
class_levels <- sort(unique(Data$Response))
print(class_levels)
# knn() returns the vote proportion for the *winning* class of each observation;
# convert it to the probability of class '0' for every record
knn_feature <- ifelse(knn_on_data == '0', prob, 1 - prob)
summary(knn_feature)
Data <- Data %>%
mutate(knn_feature = knn_feature)
trainData <- trainData %>%
mutate(knn_feature = Data[trainIndex,]$knn_feature)
holdoutData <- holdoutData %>%
mutate(knn_feature = Data[-trainIndex,]$knn_feature)
# Repeat the prediction, this time returning hard 0/1 class labels
knn_on_data <- knn(Data_KNN_std, test=Data_KNN_std, cl=Data$Response, prob=FALSE, k=20)
knn_pred_matrix <- matrix(knn_on_data, ncol = 1)
Data <- Data %>%
mutate(knn_pred_matrix = knn_pred_matrix)
trainData <- trainData %>%
mutate(knn_pred_matrix = Data[trainIndex,]$knn_pred_matrix)
holdoutData <- holdoutData %>%
mutate(knn_pred_matrix = Data[-trainIndex,]$knn_pred_matrix)
## Accuracy Kappa Resample
## 1 0.9162011 0.6332468 Fold01
## 2 0.8833333 0.4502618 Fold02
## 3 0.8659218 0.2789527 Fold03
## 4 0.8547486 0.3311296 Fold04
## 5 0.8666667 0.3600000 Fold05
## 6 0.8820225 0.4034472 Fold06
## 7 0.9050279 0.5850266 Fold07
## 8 0.8944444 0.5518868 Fold08
## 9 0.8826816 0.4498756 Fold09
## 10 0.8777778 0.4344473 Fold10
## Generalized Linear Model
##
## 1793 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1614, 1613, 1614, 1614, 1613, 1615, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8828826 0.4478274
##
## Call: NULL
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## 3.1847337 0.5451720 1.7876616
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.1027073 0.7667220 1.7801829
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1250705 -0.0015136 0.0007905
## knn_feature
## -8.5682713
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1783 Residual
## Null Deviance: 1513
## Residual Deviance: 944.2 AIC: 964.2
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 112 156
## Y = 0 50 1475
## [1] "Precision"
## [1] 0.691358
## [1] "Recall"
## [1] 0.4179104
## [1] "Specificity"
## [1] 0.9672131
## [1] "Accuracy"
## [1] 0.8851088
While the model's precision experienced a slight decline from 71% to 69%, the recall increased substantially from 31% to 42%. Additionally, the model's accuracy improved modestly, rising from 87.8% to 88.5%.
These results indicate that the model has shown some improvement. However, it is crucial to evaluate the model’s performance using out-of-sample data to confirm its effectiveness.
## PhD_hot_encoding AcceptedCmp1 AcceptedCmp3
## 1.112743 1.269955 1.079651
## AcceptedCmp4 AcceptedCmp5 Months_since_Dt_Customer
## 1.252444 1.512756 1.171474
## MntWines MntMeatProducts knn_feature
## 2.091986 1.442327 1.222207
All VIF values remain close to 1, so there is still no evidence of multicollinearity.
## Start: AIC=964.17
## Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
## AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts +
## knn_feature
##
## Df Deviance AIC
## <none> 944.17 964.17
## - MntMeatProducts 1 948.69 966.69
## - AcceptedCmp4 1 949.95 967.95
## - PhD_hot_encoding 1 951.51 969.51
## - MntWines 1 966.75 984.75
## - AcceptedCmp1 1 975.63 993.63
## - AcceptedCmp5 1 976.43 994.43
## - AcceptedCmp3 1 1012.94 1030.94
## - Months_since_Dt_Customer 1 1028.87 1046.87
## - knn_feature 1 1139.48 1157.48
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts + knn_feature, family = "binomial",
## data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## 3.1847337 0.5451720 1.7876616
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.1027073 0.7667220 1.7801829
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1250705 -0.0015136 0.0007905
## knn_feature
## -8.5682713
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1783 Residual
## Null Deviance: 1513
## Residual Deviance: 944.2 AIC: 964.2
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts + knn_feature, family = "binomial",
## data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5382 -0.4194 -0.2524 -0.1470 2.7488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1847337 0.6111346 5.211 1.88e-07 ***
## PhD_hot_encoding 0.5451720 0.1985098 2.746 0.00603 **
## AcceptedCmp11 1.7876616 0.3195181 5.595 2.21e-08 ***
## AcceptedCmp31 2.1027073 0.2478546 8.484 < 2e-16 ***
## AcceptedCmp41 0.7667220 0.3127722 2.451 0.01423 *
## AcceptedCmp51 1.7801829 0.3148743 5.654 1.57e-08 ***
## Months_since_Dt_Customer 0.1250705 0.0146380 8.544 < 2e-16 ***
## MntWines -0.0015136 0.0003325 -4.552 5.32e-06 ***
## MntMeatProducts 0.0007905 0.0003699 2.137 0.03260 *
## knn_feature -8.5682713 0.6921783 -12.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.54 on 1792 degrees of freedom
## Residual deviance: 944.17 on 1783 degrees of freedom
## AIC: 964.17
##
## Number of Fisher Scoring iterations: 6
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 28 38
## Y = 0 12 369
## [1] "Precision"
## [1] 0.7
## [1] "Recall"
## [1] 0.4242424
## [1] "Specificity"
## [1] 0.9685039
## [1] "Accuracy"
## [1] 0.8881432
Regrettably, despite the promising appearance of the modified model, it did not yield any improvement on the holdout sample (precision actually fell from 76% to 70% at the same 42% recall). Consequently, we have decided to retain the original logistic regression model for our analysis.
- PhD holders showed a higher likelihood of responding positively to the campaign.
- Customers who responded positively to previous campaigns demonstrated a higher likelihood of responding positively to the current campaign.
- Clients with a longer history since their first use of our services exhibit a greater probability of responding positively to the current campaign.
Notable differences were identified across various attributes, but none were found to be statistically significant or to have a considerable impact on the model’s predictive ability. As a result, these attributes were not considered in the development of the model.
The resulting model exhibits a precision of 76%, sensitivity of 42%, specificity of 98%, accuracy of 89%, and an AUC of 0.816. This model will enable the company to focus its marketing campaigns on customers who are more likely to respond positively to the new campaign. In doing so, the department's budget will be managed more effectively, and profits will increase.
The model’s effectiveness could be improved, particularly in terms of sensitivity. It would be advisable to explore other modeling techniques, adjust parameters, or test different data preprocessing techniques to enhance the model’s performance in this area.