Data <- read_csv(file.path(PSDS_PATH, 'ml_project1_data.csv'))
Data <- arrange(Data, ID)
It would be useful to have a feature with each client's age.
Since the last date in the data is in June 2014, we assume for the age calculation that this analysis is being performed in 2014.
Data <- Data %>%
mutate(Age = 2014 - Year_Birth)
str(Data)
## tibble [2,240 × 30] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:2240] 0 1 9 13 17 20 22 24 25 35 ...
## $ Year_Birth : num [1:2240] 1985 1961 1975 1947 1971 ...
## $ Education : chr [1:2240] "Graduation" "Graduation" "Master" "PhD" ...
## $ Marital_Status : chr [1:2240] "Married" "Single" "Single" "Widow" ...
## $ Income : num [1:2240] 70951 57091 46098 25358 60491 ...
## $ Kidhome : num [1:2240] 0 0 1 0 0 0 1 1 0 1 ...
## $ Teenhome : num [1:2240] 0 0 1 1 1 1 0 1 1 0 ...
## $ Dt_Customer : Date[1:2240], format: "2013-05-04" "2014-06-15" ...
## $ Recency : num [1:2240] 66 0 86 57 81 91 99 96 9 35 ...
## $ MntWines : num [1:2240] 239 464 57 19 637 43 185 18 460 32 ...
## $ MntFruits : num [1:2240] 10 5 0 0 47 12 2 2 35 1 ...
## $ MntMeatProducts : num [1:2240] 554 64 27 5 237 23 88 19 422 64 ...
## $ MntFishProducts : num [1:2240] 254 7 0 0 12 29 15 0 33 16 ...
## $ MntSweetProducts : num [1:2240] 87 0 0 0 19 15 5 2 12 12 ...
## $ MntGoldProds : num [1:2240] 54 37 36 8 76 61 14 6 153 85 ...
## $ NumDealsPurchases : num [1:2240] 1 1 4 2 4 1 2 5 2 3 ...
## $ NumWebPurchases : num [1:2240] 3 7 3 1 6 2 6 3 6 2 ...
## $ NumCatalogPurchases: num [1:2240] 4 3 2 0 11 1 1 0 6 2 ...
## $ NumStorePurchases : num [1:2240] 9 7 2 3 7 4 5 4 7 3 ...
## $ NumWebVisitsMonth : num [1:2240] 1 5 8 6 5 4 8 7 4 6 ...
## $ AcceptedCmp3 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp4 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
## $ Complain : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : num [1:2240] 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : num [1:2240] 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : num [1:2240] 0 1 0 0 0 0 0 0 0 1 ...
## $ Age : num [1:2240] 29 53 39 67 43 49 38 54 56 27 ...
#Unique categories in each categorical column
unique(Data$Education)
## [1] "Graduation" "Master" "PhD" "2n Cycle" "Basic"
unique(Data$Marital_Status)
## [1] "Married" "Single" "Widow" "Divorced" "Together" "Alone" "YOLO"
## [8] "Absurd"
As can be observed, there are no trailing or leading spaces, misspellings, or blank entries; therefore, this portion of the data is clean.
# Replace missing Income values with 0
Data$Income[is.na(Data$Income)] <- 0
Group | Range | Data_Type |
---|---|---|
ID | 0-11191 | Categorical nominal |
Age | 18-121 | Numeric discrete |
Income | 0-666666 | Numeric continuous |
Kidhome | 0-2 | Numeric discrete |
Teenhome | 0-2 | Numeric discrete |
Dt_Customer | 2012-07-30 to 2014-06-29 | Date |
Recency | 0-99 | Numeric discrete |
MntWines | 0-1493 | Numeric continuous |
MntFruits | 0-199 | Numeric continuous |
MntMeatProducts | 0-1725 | Numeric continuous |
MntFishProducts | 0-259 | Numeric continuous |
MntSweetProducts | 0-263 | Numeric continuous |
MntGoldProds | 0-362 | Numeric continuous |
NumDealsPurchases | 0-15 | Numeric discrete |
NumWebPurchases | 0-27 | Numeric discrete |
NumCatalogPurchases | 0-28 | Numeric discrete |
NumStorePurchases | 0-13 | Numeric discrete |
NumWebVisitsMonth | 0-20 | Numeric discrete |
AcceptedCmp1 | 0-1 | Categorical nominal |
AcceptedCmp2 | 0-1 | Categorical nominal |
AcceptedCmp3 | 0-1 | Categorical nominal |
AcceptedCmp4 | 0-1 | Categorical nominal |
AcceptedCmp5 | 0-1 | Categorical nominal |
Complain | 0-1 | Categorical nominal |
Z_CostContact | 3-3 | Constant (uninformative) |
Z_Revenue | 11-11 | Constant (uninformative) |
Response | 0-1 | Categorical nominal |
Education | - | Categorical nominal |
Marital_Status | - | Categorical nominal |
It can be observed that the ranges of the numerical data are as expected; therefore, this portion of the data is clean.
# Check for duplicated customer IDs
duplicates <- duplicated(Data$ID)
num_true <- sum(duplicates)
print(num_true)
remove(duplicates, num_true)
We can conclude that there are no duplicate IDs.
write.csv(Data,file.path(PSDS_PATH, 'dataset_from_r_classificatory_model.csv'),row.names=FALSE)
rm(dir1,dir2,dir3,dir4,dir5,dir6,dir7,dir8,dir9,dir10,file_name,PSDS_PATH)
Analyze each predictor variable individually to understand its distribution and identify potential outliers or data issues (a sketch of these checks follows the list):

- Histograms: visualize the distribution of each continuous predictor variable.
- Box plots: identify the range, quartiles, and potential outliers of continuous predictor variables.
- Bar plots: visualize the frequency of each category of the categorical predictor variables.
- Summary statistics: mean, median, standard deviation, and quartiles to describe the central tendency and dispersion of each continuous predictor variable.
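A minimal sketch of these univariate checks with base R graphics; the specific columns are only examples:

# Univariate EDA: distribution, outliers, and summary statistics
hist(Data$Income, breaks=50, main='Income distribution', xlab='Income')
boxplot(Data$MntWines, main='MntWines', ylab='Amount spent on wine')
barplot(table(Data$Education), main='Education categories')
summary(Data$Income)  # quartiles, median, mean
sd(Data$Income)       # dispersion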
Analyze the relationship between each predictor variable and the target variable (see the sketch after this list):

- Scatter plots: plot continuous predictor variables against the target variable to visualize relationships and identify trends or patterns.
- Box plots or violin plots: visualize the distribution of the target variable across the different categories of a categorical predictor variable.
- Correlation matrix: calculate the correlation matrix for continuous predictor variables to identify strong linear relationships with the target variable.
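Since the target Response is binary, a hedged sketch of such bivariate checks might look like this (the column choices are illustrative):

# A continuous predictor split by the binary target
boxplot(Income ~ Response, data=Data, xlab='Response', ylab='Income')
# Response rate within each Education category
prop.table(table(Data$Education, Data$Response), margin=1)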
INC = Income
WIN = MntWines
FRU = MntFruits
MEA = MntMeatProducts
FIS = MntFishProducts
SWE = MntSweetProducts
GOL = MntGoldProds
All these variables appear to be correlated, but that is common sense: the higher a customer's income, the more they will spend monthly on each of our product categories.
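As a sketch, the correlation matrix behind this legend could be computed as follows (abbreviations as defined above):

# Correlation matrix for income and the spending variables
spend <- Data[, c('Income', 'MntWines', 'MntFruits', 'MntMeatProducts',
                  'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')]
colnames(spend) <- c('INC', 'WIN', 'FRU', 'MEA', 'FIS', 'SWE', 'GOL')
round(cor(spend), 2)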
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumDealsPurchases and Data_1$NumDealsPurchases
## W = 335931, p-value = 0.08919
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumWebPurchases and Data_1$NumWebPurchases
## W = 229882, p-value = 2.95e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumCatalogPurchases and Data_1$NumCatalogPurchases
## W = 199938, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumWebVisitsMonth and Data_1$NumWebVisitsMonth
## W = 312810, p-value = 0.6115
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$NumStorePurchases and Data_1$NumStorePurchases
## W = 291004, p-value = 0.01157
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Age and Data_1$Age
## W = 328994, p-value = 0.3267
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0_numeric and Data_1_numeric
## W = 418740, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Kidhome and Data_1$Kidhome
## W = 352752, p-value = 0.000247
## alternative hypothesis: true location shift is not equal to 0
##
## Wilcoxon rank sum test with continuity correction
##
## data: Data_0$Teenhome and Data_1$Teenhome
## W = 390173, p-value = 4.765e-14
## alternative hypothesis: true location shift is not equal to 0
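The outputs above compare non-responders and responders; a minimal sketch of how they may have been produced, assuming Data_0 and Data_1 are the Response == 0 and Response == 1 subsets:

# Split the customers by campaign response and compare a predictor's distributions
Data_0 <- filter(Data, Response == 0)
Data_1 <- filter(Data, Response == 1)
wilcox.test(Data_0$NumDealsPurchases, Data_1$NumDealsPurchases)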
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Education
## X-squared = 23.076, df = NA, p-value = 0.0009995
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Marital_Status
## X-squared = 54.242, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp1
## X-squared = 119.64, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp2
## X-squared = 8.7505, df = NA, p-value = 0.004498
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp3
## X-squared = 144.81, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp4
## X-squared = 70.192, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_AcceptedCmp5
## X-squared = 128, df = NA, p-value = 0.0004998
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Complain
## X-squared = 0.0065267, df = NA, p-value = 1
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
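A hedged sketch of the chi-squared tests above, assuming the reaction_* objects are contingency tables of each categorical variable against Response (chisq.test simulates p-values with B = 2000 replicates by default):

# Contingency table of a categorical predictor vs. the target, simulated p-value
reaction_Education <- table(Data$Education, Data$Response)
chisq.test(reaction_Education, simulate.p.value=TRUE)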
##
## Call: glm(formula = Response ~ Education, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) EducationBasic EducationGraduation
## -2.1075 -1.1506 0.2489
## EducationMaster EducationPhD
## 0.4043 0.7693
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2235 Residual
## Null Deviance: 1887
## Residual Deviance: 1863 AIC: 1873
##
## Call:
## glm(formula = Response ~ Education, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6826 -0.5785 -0.5383 -0.4790 2.5674
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1075 0.2258 -9.334 < 2e-16 ***
## EducationBasic -1.1506 0.7548 -1.524 0.12741
## EducationGraduation 0.2489 0.2420 1.028 0.30380
## EducationMaster 0.4043 0.2678 1.510 0.13112
## EducationPhD 0.7693 0.2519 3.054 0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1862.7 on 2235 degrees of freedom
## AIC: 1872.7
##
## Number of Fisher Scoring iterations: 5
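Each of the single-predictor models printed in this section can be reproduced with a pattern like the following (the object name education_lm is illustrative):

# Univariate logistic regression: print the fit, then its full summary
education_lm <- glm(Response ~ Education, family='binomial', data=Data)
education_lm           # coefficients, deviance, AIC
summary(education_lm)  # standard errors and z-tests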
##
## Call: glm(formula = Response ~ PhD_hot_encoding, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) PhD_hot_encoding
## -1.876 0.538
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1871 AIC: 1875
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6826 -0.5339 -0.5339 -0.5339 2.0093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.87608 0.07035 -26.667 < 2e-16 ***
## PhD_hot_encoding 0.53796 0.13209 4.073 4.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1871.0 on 2238 degrees of freedom
## AIC: 1875
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Marital_Status, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) Marital_StatusAlone Marital_StatusDivorced
## -7.629e-14 -6.931e-01 -1.344e+00
## Marital_StatusMarried Marital_StatusSingle Marital_StatusTogether
## -2.056e+00 -1.261e+00 -2.159e+00
## Marital_StatusWidow Marital_StatusYOLO
## -1.116e+00 3.286e-14
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2232 Residual
## Null Deviance: 1887
## Residual Deviance: 1836 AIC: 1852
##
## Call:
## glm(formula = Response ~ Marital_Status, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1774 -0.6809 -0.4907 -0.4673 2.1301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.629e-14 1.414e+00 0.000 1.000
## Marital_StatusAlone -6.931e-01 1.871e+00 -0.371 0.711
## Marital_StatusDivorced -1.344e+00 1.423e+00 -0.944 0.345
## Marital_StatusMarried -2.056e+00 1.418e+00 -1.450 0.147
## Marital_StatusSingle -1.261e+00 1.418e+00 -0.889 0.374
## Marital_StatusTogether -2.159e+00 1.421e+00 -1.520 0.129
## Marital_StatusWidow -1.116e+00 1.439e+00 -0.776 0.438
## Marital_StatusYOLO 3.286e-14 2.000e+00 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1835.7 on 2232 degrees of freedom
## AIC: 1851.7
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp1, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp11
## -1.977 2.172
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1750 AIC: 1754
##
## Call:
## glm(formula = Response ~ AcceptedCmp1, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2613 -0.5094 -0.5094 -0.5094 2.0526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.97680 0.06682 -29.59 <2e-16 ***
## AcceptedCmp11 2.17186 0.18030 12.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1750.2 on 2238 degrees of freedom
## AIC: 1754.2
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp2, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp21
## -1.798 2.491
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1845 AIC: 1849
##
## Call:
## glm(formula = Response ~ AcceptedCmp2, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4823 -0.5536 -0.5536 -0.5536 1.9755
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.79811 0.06093 -29.512 < 2e-16 ***
## AcceptedCmp21 2.49126 0.39206 6.354 2.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1844.8 on 2238 degrees of freedom
## AIC: 1848.8
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp3, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp31
## -1.958 1.847
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1780 AIC: 1784
##
## Call:
## glm(formula = Response ~ AcceptedCmp3, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.131 -0.514 -0.514 -0.514 2.044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95752 0.06664 -29.38 <2e-16 ***
## AcceptedCmp31 1.84697 0.17046 10.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1780.3 on 2238 degrees of freedom
## AIC: 1784.3
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp4, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp41
## -1.890 1.363
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1832 AIC: 1836
##
## Call:
## glm(formula = Response ~ AcceptedCmp4, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9634 -0.5304 -0.5304 -0.5304 2.0154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.89030 0.06505 -29.058 < 2e-16 ***
## AcceptedCmp41 1.36347 0.17287 7.887 3.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1831.8 on 2238 degrees of freedom
## AIC: 1835.8
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ AcceptedCmp5, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) AcceptedCmp51
## -2.026 2.285
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1718 AIC: 1722
##
## Call:
## glm(formula = Response ~ AcceptedCmp5, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2892 -0.4978 -0.4978 -0.4978 2.0735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.02586 0.06839 -29.62 <2e-16 ***
## AcceptedCmp51 2.28497 0.17214 13.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1718.4 on 2238 degrees of freedom
## AIC: 1722.4
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumWebPurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumWebPurchases
## -2.3375 0.1342
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1842 AIC: 1846
##
## Call:
## glm(formula = Response ~ NumWebPurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7495 -0.5883 -0.5195 -0.4577 2.2044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.33753 0.11306 -20.68 < 2e-16 ***
## NumWebPurchases 0.13422 0.02012 6.67 2.55e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1841.6 on 2238 degrees of freedom
## AIC: 1845.6
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumCatalogPurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumCatalogPurchases
## -2.3185 0.1819
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1792 AIC: 1796
##
## Call:
## glm(formula = Response ~ NumCatalogPurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3809 -0.5601 -0.4724 -0.4333 2.1965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31847 0.09220 -25.14 <2e-16 ***
## NumCatalogPurchases 0.18186 0.01914 9.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1792.0 on 2238 degrees of freedom
## AIC: 1796
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ NumStorePurchases, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) NumStorePurchases
## -1.93786 0.03318
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1883 AIC: 1887
##
## Call:
## glm(formula = Response ~ NumStorePurchases, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6328 -0.5778 -0.5518 -0.5434 2.0073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.93786 0.12297 -15.759 <2e-16 ***
## NumStorePurchases 0.03318 0.01783 1.861 0.0628 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1883.4 on 2238 degrees of freedom
## AIC: 1887.4
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Months_since_Dt_Customer, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) Months_since_Dt_Customer
## -2.85096 0.08549
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1802 AIC: 1806
##
## Call:
## glm(formula = Response ~ Months_since_Dt_Customer, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8314 -0.6396 -0.4849 -0.3795 2.4113
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.850958 0.150151 -18.987 <2e-16 ***
## Months_since_Dt_Customer 0.085491 0.009666 8.845 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1801.8 on 2238 degrees of freedom
## AIC: 1805.8
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Teenhome, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Teenhome
## -1.3725 -0.8834
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1830 AIC: 1834
##
## Call:
## glm(formula = Response ~ Teenhome, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6722 -0.6722 -0.4464 -0.4464 2.5226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3725 0.0728 -18.854 < 2e-16 ***
## Teenhome -0.8834 0.1233 -7.166 7.72e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1829.9 on 2238 degrees of freedom
## AIC: 1833.9
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Kidhome, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Kidhome
## -1.5639 -0.4443
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1872 AIC: 1876
##
## Call:
## glm(formula = Response ~ Kidhome, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6165 -0.6165 -0.5019 -0.5019 2.2517
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.56389 0.07291 -21.449 < 2e-16 ***
## Kidhome -0.44431 0.11799 -3.766 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1871.9 on 2238 degrees of freedom
## AIC: 1875.9
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ Recency, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Recency
## -0.86026 -0.02037
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1796 AIC: 1800
##
## Call:
## glm(formula = Response ~ Recency, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8400 -0.6371 -0.4800 -0.3690 2.4215
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.860263 0.103704 -8.295 <2e-16 ***
## Recency -0.020372 0.002233 -9.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1796.1 on 2238 degrees of freedom
## AIC: 1800.1
##
## Number of Fisher Scoring iterations: 5
##
## Call: glm(formula = Response ~ Income, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) Income
## -2.623e+00 1.623e-05
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1846 AIC: 1850
##
## Call:
## glm(formula = Response ~ Income, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0482 -0.6075 -0.5257 -0.4525 2.3209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.623e+00 1.647e-01 -15.922 < 2e-16 ***
## Income 1.623e-05 2.704e-06 6.001 1.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1846.3 on 2238 degrees of freedom
## AIC: 1850.3
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntWines, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntWines
## -2.405572 0.001769
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1765 AIC: 1769
##
## Call:
## glm(formula = Response ~ MntWines, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2792 -0.5714 -0.4413 -0.4181 2.2317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4055723 0.0935629 -25.71 <2e-16 ***
## MntWines 0.0017694 0.0001596 11.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1765.5 on 2238 degrees of freedom
## AIC: 1769.5
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntFruits, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntFruits
## -1.965980 0.007349
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1856 AIC: 1860
##
## Call:
## glm(formula = Response ~ MntFruits, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9724 -0.5520 -0.5190 -0.5120 2.0479
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.965980 0.074315 -26.455 < 2e-16 ***
## MntFruits 0.007349 0.001270 5.787 7.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1856.0 on 2238 degrees of freedom
## AIC: 1860
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntMeatProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntMeatProducts
## -2.222755 0.002318
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1784 AIC: 1788
##
## Call:
## glm(formula = Response ~ MntMeatProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9662 -0.5365 -0.4687 -0.4570 2.1557
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2227553 0.0822246 -27.03 <2e-16 ***
## MntMeatProducts 0.0023185 0.0002267 10.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1783.5 on 2238 degrees of freedom
## AIC: 1787.5
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntFishProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntFishProducts
## -1.950992 0.004889
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1862 AIC: 1866
##
## Call:
## glm(formula = Response ~ MntFishProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9036 -0.5558 -0.5239 -0.5155 2.0415
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9509921 0.0750546 -25.994 < 2e-16 ***
## MntFishProducts 0.0048891 0.0009445 5.176 2.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1862.0 on 2238 degrees of freedom
## AIC: 1866
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntSweetProducts, family = "binomial",
## data = Data)
##
## Coefficients:
## (Intercept) MntSweetProducts
## -1.949835 0.006675
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1860 AIC: 1864
##
## Call:
## glm(formula = Response ~ MntSweetProducts, family = "binomial",
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0961 -0.5541 -0.5223 -0.5158 2.0410
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.949835 0.073848 -26.40 < 2e-16 ***
## MntSweetProducts 0.006675 0.001229 5.43 5.63e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1859.6 on 2238 degrees of freedom
## AIC: 1863.6
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = Response ~ MntGoldProds, family = "binomial", data = Data)
##
## Coefficients:
## (Intercept) MntGoldProds
## -2.057133 0.006285
##
## Degrees of Freedom: 2239 Total (i.e. Null); 2238 Residual
## Null Deviance: 1887
## Residual Deviance: 1848 AIC: 1852
##
## Call:
## glm(formula = Response ~ MntGoldProds, family = "binomial", data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2713 -0.5565 -0.5127 -0.4949 2.0868
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0571330 0.0813300 -25.294 < 2e-16 ***
## MntGoldProds 0.0062849 0.0009771 6.432 1.26e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1886.8 on 2239 degrees of freedom
## Residual deviance: 1848.4 on 2238 degrees of freedom
## AIC: 1852.4
##
## Number of Fisher Scoring iterations: 4
For this model, I first selected the most promising features, identified from the univariate regressions above by their large and significant coefficients. I then examined the chosen features for multicollinearity but found no evidence of it.
Finally, I employed stepwise regression to further refine the feature selection, ensuring that only the most impactful features were included in the model (a sketch of this workflow follows the data split below).
# Split the data into training and holdout sets using an 80-20 split
set.seed(123)
trainIndex <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData <- Data[trainIndex, ]
holdoutData <- Data[-trainIndex, ]
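A sketch of the selection workflow described above, assuming the car package for the VIF check (the object names match those used in the outputs below):

# Fit the candidate model, check multicollinearity, then refine by AIC
full_lm <- glm(Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
                 AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
                 MntWines + MntMeatProducts,
               family='binomial', data=trainData)
car::vif(full_lm)         # values near 1 indicate little multicollinearity
step_lm <- step(full_lm)  # backward elimination by AIC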
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3936 -0.5012 -0.3423 -0.2210 2.8323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2594155 0.2361324 -18.038 < 2e-16 ***
## PhD_hot_encoding 0.9439971 0.1756254 5.375 7.66e-08 ***
## AcceptedCmp11 1.7162674 0.2745354 6.252 4.06e-10 ***
## AcceptedCmp31 2.2026332 0.2235573 9.853 < 2e-16 ***
## AcceptedCmp41 0.9352136 0.2716465 3.443 0.000576 ***
## AcceptedCmp51 1.4995628 0.2781107 5.392 6.97e-08 ***
## Months_since_Dt_Customer 0.1126719 0.0130373 8.642 < 2e-16 ***
## MntWines -0.0005644 0.0002873 -1.965 0.049450 *
## MntMeatProducts 0.0014283 0.0003402 4.198 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.5 on 1792 degrees of freedom
## Residual deviance: 1139.5 on 1784 degrees of freedom
## AIC: 1157.5
##
## Number of Fisher Scoring iterations: 5
## PhD_hot_encoding AcceptedCmp1 AcceptedCmp3
## 1.085816 1.268044 1.080945
## AcceptedCmp4 AcceptedCmp5 Months_since_Dt_Customer
## 1.290124 1.483181 1.160394
## MntWines MntMeatProducts
## 1.994079 1.430336
All VIF values are close to 1, so there is no evidence of multicollinearity.
## Start: AIC=1157.48
## Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
## AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts
##
## Df Deviance AIC
## <none> 1139.5 1157.5
## - MntWines 1 1143.5 1159.5
## - AcceptedCmp4 1 1150.8 1166.8
## - MntMeatProducts 1 1156.6 1172.6
## - PhD_hot_encoding 1 1167.3 1183.3
## - AcceptedCmp5 1 1168.5 1184.5
## - AcceptedCmp1 1 1178.2 1194.2
## - Months_since_Dt_Customer 1 1224.3 1240.3
## - AcceptedCmp3 1 1230.7 1246.7
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts, family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3936 -0.5012 -0.3423 -0.2210 2.8323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2594155 0.2361324 -18.038 < 2e-16 ***
## PhD_hot_encoding 0.9439971 0.1756254 5.375 7.66e-08 ***
## AcceptedCmp11 1.7162674 0.2745354 6.252 4.06e-10 ***
## AcceptedCmp31 2.2026332 0.2235573 9.853 < 2e-16 ***
## AcceptedCmp41 0.9352136 0.2716465 3.443 0.000576 ***
## AcceptedCmp51 1.4995628 0.2781107 5.392 6.97e-08 ***
## Months_since_Dt_Customer 0.1126719 0.0130373 8.642 < 2e-16 ***
## MntWines -0.0005644 0.0002873 -1.965 0.049450 *
## MntMeatProducts 0.0014283 0.0003402 4.198 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.5 on 1792 degrees of freedom
## Residual deviance: 1139.5 on 1784 degrees of freedom
## AIC: 1157.5
##
## Number of Fisher Scoring iterations: 5
# Inspect records whose standardized residuals exceed 2.5
sresid <- rstandard(step_lm)
idx <- order(sresid, decreasing=TRUE)
sresid[idx[1]]
resid(step_lm)[idx[1]]
trainData$sresid <- rstandard(step_lm)
# We considered filtering out |sresid| > 2.5 but opted to keep these records:
#trainData <- trainData %>%
#  filter(sresid < 2.5) %>%
#  filter(sresid > -2.5)
We found 19 records whose standardized residuals exceed the 2.5 benchmark. We opted against excluding them: the vast majority of records adhere to the threshold, and removing these could compromise the integrity of the model.
## Accuracy Kappa Resample
## 1 0.9162011 0.5562717 Fold01
## 2 0.8777778 0.3659942 Fold02
## 3 0.8603352 0.1608851 Fold03
## 4 0.8491620 0.2054907 Fold04
## 5 0.8722222 0.2944785 Fold05
## 6 0.8764045 0.3873592 Fold06
## 7 0.8994413 0.5369359 Fold07
## 8 0.8888889 0.4666667 Fold08
## 9 0.8770950 0.3902137 Fold09
## 10 0.8555556 0.2507205 Fold10
## Generalized Linear Model
##
## 1793 samples
## 8 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1614, 1613, 1614, 1614, 1613, 1615, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8773084 0.3615016
##
## Call: NULL
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## -4.2594155 0.9439971 1.7162674
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.2026332 0.9352136 1.4995628
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1126719 -0.0005644 0.0014283
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1784 Residual
## Null Deviance: 1513
## Residual Deviance: 1139 AIC: 1157
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 82 186
## Y = 0 33 1492
## [1] "Precision"
## [1] 0.7130435
## [1] "Recall"
## [1] 0.3059701
## [1] "Specificity"
## [1] 0.9783607
## [1] "Accuracy"
## [1] 0.8778583
The model's positive predictions are reliable (71% precision), but it lacks sensitivity, which might be important for this task (31% recall).
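For reference, a hedged sketch of how the 10-fold cross-validation above may have been produced with caret (Response is converted to a factor so caret treats the problem as classification):

# 10-fold cross-validation of the selected logistic regression
library(caret)
set.seed(123)
cv_model <- train(
  Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
    AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts,
  data=mutate(trainData, Response=factor(Response)),
  method='glm', family='binomial',
  trControl=trainControl(method='cv', number=10))
cv_model$resample  # per-fold Accuracy and Kappa
cv_model           # aggregated resampling results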
## Desired recall: 0.75
## Closest actual recall: 0.738806
## Specificity at the closest recall: 0.7731148
## [1] 0.3639706
A precision of 36.4% is not very good. Therefore, we decided to keep the default model with a 50% threshold instead of the 14% threshold that gave these results.
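A sketch of the threshold search, assuming the pROC package (consistent with the "Setting levels" messages later in this report); coords() tabulates recall, specificity, and precision at every candidate threshold:

# Trade-off between recall and precision across probability thresholds
library(pROC)
pred_prob <- predict(step_lm, type='response')
roc_obj <- roc(trainData$Response, pred_prob)
coords(roc_obj, x='all',
       ret=c('threshold', 'recall', 'specificity', 'precision'))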
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 28 38
## Y = 0 9 372
## [1] "Precision"
## [1] 0.7567568
## [1] "Recall"
## [1] 0.4242424
## [1] "Specificity"
## [1] 0.976378
## [1] "Accuracy"
## [1] 0.8948546
The model's performance on the holdout sample is encouraging, with a precision of 76% and a recall (42%) that surpasses the result obtained on the training data.
Recall is still modest, however: roughly 58% of actual responders are missed. Despite this, the model demonstrates solid predictive capability overall.
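The Confusion Matrix / Precision / Recall / Specificity / Accuracy blocks throughout this report can be generated from a small helper like the following (a hypothetical sketch; the original helper is not shown):

# Derive the reported metrics from 0/1 vectors of actual and predicted classes
report_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  print('Confusion Matrix')
  print(matrix(c(tp, fp, fn, tn), nrow=2,
               dimnames=list(c('Y = 1', 'Y = 0'), c('Yhat = 1', 'Yhat = 0'))))
  print('Precision');   print(tp / (tp + fp))
  print('Recall');      print(tp / (tp + fn))
  print('Specificity'); print(tn / (tn + fp))
  print('Accuracy');    print((tp + tn) / length(actual))
}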
### Oversampling and Up/Down Weighting
wt <- ifelse(trainData$Response == 1,
             1 / mean(trainData$Response == 1), 1)
Oversampling_model <- glm(Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
                            AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
                            MntWines + MntMeatProducts,
                          data=trainData, weights=wt, family='quasibinomial')
#pred <- predict(Oversampling_model)
#mean(pred > 0)
# Predict probabilities
pred_prob <- predict(Oversampling_model, type="response")
# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
pred_binary <- ifelse(pred_prob > 0.5, 1, 0)
# Check the mean of the binary predictions
mean(pred_binary)
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 46 20
## Y = 0 91 290
## [1] "Precision"
## [1] 0.3357664
## [1] "Recall"
## [1] 0.6969697
## [1] "Specificity"
## [1] 0.7611549
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8201
The weighted model trades precision (34%) for recall (70%); this improvement was not substantial enough to warrant its inclusion, so we have chosen to retain the original model without the up-weighting.
First, let's try to build a model to catch some non-linear relationships that could not be captured by the logistic regression model.
# Prepare the data: model.matrix one-hot encodes the factors (-1 drops the intercept)
Data_KNN <- model.matrix(~ -1 + Marital_Status + Education + NumWebPurchases +
                           NumCatalogPurchases + NumStorePurchases + Recency +
                           AcceptedCmp2 + MntGoldProds + Income, data=Data)
Data_KNN_std <- scale(Data_KNN)
# Create train and holdout indices
#trainIndex_knn <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData_knn <- Data_KNN_std[trainIndex, ,drop=FALSE]
holdoutData_knn <- Data_KNN_std[-trainIndex, ,drop=FALSE]
# Extract the corresponding 'Response' values for the train data
train_outcome <- Data[trainIndex,]$Response
# Train the KNN model (class::knn) and make predictions for the holdout data
knn_pred <- knn(train=trainData_knn, test=holdoutData_knn, cl=train_outcome, prob=FALSE, k=20)
# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)
# Add the predictions as a new column of the holdout data frame
holdoutData_knn_with_pred <- cbind(holdoutData, knn_pred_matrix)
# Set the column name for the new column
colnames(holdoutData_knn_with_pred)[ncol(holdoutData_knn_with_pred)] <- "Predicted_Outcome"
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 9 57
## Y = 0 5 376
## [1] "Precision"
## [1] 0.6428571
## [1] "Recall"
## [1] 0.1363636
## [1] "Specificity"
## [1] 0.9868766
## [1] "Accuracy"
## [1] 0.8612975
The results are not as good but, as is well known, KNN by itself is usually not competitive with more sophisticated classification techniques.
In practical model fitting, however, KNN can be used to add "local knowledge" in a staged process with other classification techniques.
Therefore, we'll add the KNN output as a new predictor that could improve the model's capacity to classify.
Data_KNN <- model.matrix(~ -1 + Marital_Status + Education + NumWebPurchases+ NumCatalogPurchases + NumStorePurchases + Recency + AcceptedCmp2 + MntGoldProds + Income, data=Data)
Data_KNN_std <- scale(Data_KNN)
knn_on_data <- knn(Data_KNN_std, test=Data_KNN_std, cl=Data$Response, prob=TRUE, k=20)
prob <- attr(knn_on_data, "prob")
# Check the class levels ('0' sorts first)
class_levels <- sort(unique(Data$Response))
print(class_levels)
# knn() returns the vote proportion for the *winning* class of each observation;
# convert it to the probability of class '0' for every record
knn_feature <- ifelse(knn_on_data == '0', prob, 1 - prob)
summary(knn_feature)
Data <- Data %>%
mutate(knn_feature = knn_feature)
trainData <- trainData %>%
mutate(knn_feature = Data[trainIndex,]$knn_feature)
holdoutData <- holdoutData %>%
mutate(knn_feature = Data[-trainIndex,]$knn_feature)
# Repeat the prediction, this time returning hard 0/1 class labels
knn_on_data <- knn(Data_KNN_std, test=Data_KNN_std, cl=Data$Response, prob=FALSE, k=20)
knn_pred_matrix <- matrix(knn_on_data, ncol = 1)
Data <- Data %>%
mutate(knn_pred_matrix = knn_pred_matrix)
trainData <- trainData %>%
mutate(knn_pred_matrix = Data[trainIndex,]$knn_pred_matrix)
holdoutData <- holdoutData %>%
mutate(knn_pred_matrix = Data[-trainIndex,]$knn_pred_matrix)
## Accuracy Kappa Resample
## 1 0.9162011 0.6332468 Fold01
## 2 0.8833333 0.4502618 Fold02
## 3 0.8659218 0.2789527 Fold03
## 4 0.8547486 0.3311296 Fold04
## 5 0.8666667 0.3600000 Fold05
## 6 0.8820225 0.4034472 Fold06
## 7 0.9050279 0.5850266 Fold07
## 8 0.8944444 0.5518868 Fold08
## 9 0.8826816 0.4498756 Fold09
## 10 0.8777778 0.4344473 Fold10
## Generalized Linear Model
##
## 1793 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1614, 1613, 1614, 1614, 1613, 1615, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8828826 0.4478274
##
## Call: NULL
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## 3.1847337 0.5451720 1.7876616
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.1027073 0.7667220 1.7801829
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1250705 -0.0015136 0.0007905
## knn_feature
## -8.5682713
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1783 Residual
## Null Deviance: 1513
## Residual Deviance: 944.2 AIC: 964.2
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 112 156
## Y = 0 50 1475
## [1] "Precision"
## [1] 0.691358
## [1] "Recall"
## [1] 0.4179104
## [1] "Specificity"
## [1] 0.9672131
## [1] "Accuracy"
## [1] 0.8851088
While the model's precision experienced a slight decline from 71% to 69%, the recall increased substantially from 31% to 42%. Additionally, the model's accuracy improved modestly, rising from 87.8% to 88.5%.
These results indicate that the model has shown some improvement. However, it is crucial to evaluate the model’s performance using out-of-sample data to confirm its effectiveness.
## PhD_hot_encoding AcceptedCmp1 AcceptedCmp3
## 1.112743 1.269955 1.079651
## AcceptedCmp4 AcceptedCmp5 Months_since_Dt_Customer
## 1.252444 1.512756 1.171474
## MntWines MntMeatProducts knn_feature
## 2.091986 1.442327 1.222207
All VIF values remain close to 1, so there is still no evidence of multicollinearity.
## Start: AIC=964.17
## Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp4 +
## AcceptedCmp5 + Months_since_Dt_Customer + MntWines + MntMeatProducts +
## knn_feature
##
## Df Deviance AIC
## <none> 944.17 964.17
## - MntMeatProducts 1 948.69 966.69
## - AcceptedCmp4 1 949.95 967.95
## - PhD_hot_encoding 1 951.51 969.51
## - MntWines 1 966.75 984.75
## - AcceptedCmp1 1 975.63 993.63
## - AcceptedCmp5 1 976.43 994.43
## - AcceptedCmp3 1 1012.94 1030.94
## - Months_since_Dt_Customer 1 1028.87 1046.87
## - knn_feature 1 1139.48 1157.48
##
## Call: glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts + knn_feature, family = "binomial",
## data = trainData)
##
## Coefficients:
## (Intercept) PhD_hot_encoding AcceptedCmp11
## 3.1847337 0.5451720 1.7876616
## AcceptedCmp31 AcceptedCmp41 AcceptedCmp51
## 2.1027073 0.7667220 1.7801829
## Months_since_Dt_Customer MntWines MntMeatProducts
## 0.1250705 -0.0015136 0.0007905
## knn_feature
## -8.5682713
##
## Degrees of Freedom: 1792 Total (i.e. Null); 1783 Residual
## Null Deviance: 1513
## Residual Deviance: 944.2 AIC: 964.2
##
## Call:
## glm(formula = Response ~ PhD_hot_encoding + AcceptedCmp1 + AcceptedCmp3 +
## AcceptedCmp4 + AcceptedCmp5 + Months_since_Dt_Customer +
## MntWines + MntMeatProducts + knn_feature, family = "binomial",
## data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5382 -0.4194 -0.2524 -0.1470 2.7488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1847337 0.6111346 5.211 1.88e-07 ***
## PhD_hot_encoding 0.5451720 0.1985098 2.746 0.00603 **
## AcceptedCmp11 1.7876616 0.3195181 5.595 2.21e-08 ***
## AcceptedCmp31 2.1027073 0.2478546 8.484 < 2e-16 ***
## AcceptedCmp41 0.7667220 0.3127722 2.451 0.01423 *
## AcceptedCmp51 1.7801829 0.3148743 5.654 1.57e-08 ***
## Months_since_Dt_Customer 0.1250705 0.0146380 8.544 < 2e-16 ***
## MntWines -0.0015136 0.0003325 -4.552 5.32e-06 ***
## MntMeatProducts 0.0007905 0.0003699 2.137 0.03260 *
## knn_feature -8.5682713 0.6921783 -12.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1512.54 on 1792 degrees of freedom
## Residual deviance: 944.17 on 1783 degrees of freedom
## AIC: 964.17
##
## Number of Fisher Scoring iterations: 6
## [1] "Confusion Matrix"
## Yhat = 1 Yhat = 0
## Y = 1 28 38
## Y = 0 12 369
## [1] "Precision"
## [1] 0.7
## [1] "Recall"
## [1] 0.4242424
## [1] "Specificity"
## [1] 0.9685039
## [1] "Accuracy"
## [1] 0.8881432
Regrettably, despite the promising appearance of the modified model, it did not yield any improvement on the holdout sample (precision actually fell from 76% to 70% at the same 42% recall). Consequently, we have decided to retain the original logistic regression model for our analysis.
- PhD holders showed a higher likelihood of responding positively to the campaign.
- Customers who responded positively to previous campaigns demonstrated a higher likelihood of responding positively to the current campaign.
- Clients with a longer history since their first use of our services exhibit a greater probability of responding positively to the current campaign.
Notable differences were identified across various attributes, but none were found to be statistically significant or to have a considerable impact on the model’s predictive ability. As a result, these attributes were not considered in the development of the model.
The resulting model exhibits a precision of 76%, sensitivity of 42%, specificity of 98%, accuracy of 89%, and an AUC of 0.816. This model will enable the company to focus its marketing campaigns on customers who are more likely to respond positively to the new campaign. In doing so, the department's budget will be managed more effectively, and profits will increase.
The model’s effectiveness could be improved, particularly in terms of sensitivity. It would be advisable to explore other modeling techniques, adjust parameters, or test different data preprocessing techniques to enhance the model’s performance in this area.