library(car)
library(ggcorrplot)
library(tidyverse)
library(stringr)
library(ggpubr)
library(knitr)
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(splines)
library(mgcv)
library(reshape2)
library(gridExtra)
library(lmPerm)
library(caret)
Data <- read_csv(file.path(PSDS_PATH, 'dataset.csv'))
# Remove the path-construction helpers created earlier (chunk not shown)
rm(dir1,dir2,dir3,dir4,dir5,dir6,dir7,dir8,dir9,dir10,file_name,PSDS_PATH)
str(Data)
## spc_tbl_ [9,134 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Customer : chr [1:9134] "SP56326" "VQ57815" "JG36165" "LO75919" ...
## $ State : chr [1:9134] "Arizona" "Oregon" "California" "Arizona" ...
## $ Customer Lifetime Value : num [1:9134] 7453 8828 6043 6008 15245 ...
## $ Response : chr [1:9134] "No" "No" "No" "No" ...
## $ Coverage : chr [1:9134] "Premium" "Basic" "Extended" "Extended" ...
## $ Education : chr [1:9134] "High School or Below" "Doctor" "Bachelor" "Bachelor" ...
## $ Effective To Date : chr [1:9134] "01/01/11" "01/01/11" "01/01/11" "01/01/11" ...
## $ EmploymentStatus : chr [1:9134] "Medical Leave" "Employed" "Unemployed" "Employed" ...
## $ Gender : chr [1:9134] "F" "M" "M" "M" ...
## $ Income : num [1:9134] 12195 61246 0 43543 30205 ...
## $ Location Code : chr [1:9134] "Suburban" "Rural" "Suburban" "Urban" ...
## $ Marital Status : chr [1:9134] "Divorced" "Married" "Single" "Married" ...
## $ Monthly Premium Auto : num [1:9134] 101 110 83 76 195 74 66 112 74 66 ...
## $ Months Since Last Claim : num [1:9134] 29 24 12 11 24 21 8 6 3 15 ...
## $ Months Since Policy Inception: num [1:9134] 35 21 21 62 1 30 93 71 42 45 ...
## $ Number of Open Complaints : num [1:9134] 0 0 0 0 0 0 0 0 0 0 ...
## $ Number of Policies : num [1:9134] 7 3 5 8 4 2 1 9 3 1 ...
## $ Policy Type : chr [1:9134] "Personal Auto" "Personal Auto" "Personal Auto" "Personal Auto" ...
## $ Policy : chr [1:9134] "Personal L3" "Personal L2" "Personal L1" "Personal L3" ...
## $ Renew Offer Type : chr [1:9134] "Offer1" "Offer2" "Offer4" "Offer1" ...
## $ Sales Channel : chr [1:9134] "Branch" "Branch" "Web" "Call Center" ...
## $ Total Claim Amount : num [1:9134] 671 159 904 339 1330 ...
## $ Vehicle Class : chr [1:9134] "Four-Door Car" "Sports Car" "Two-Door Car" "Two-Door Car" ...
## $ Vehicle Size : chr [1:9134] "Medsize" "Medsize" "Medsize" "Medsize" ...
## - attr(*, "spec")=
## .. cols(
## .. Customer = col_character(),
## .. State = col_character(),
## .. `Customer Lifetime Value` = col_double(),
## .. Response = col_character(),
## .. Coverage = col_character(),
## .. Education = col_character(),
## .. `Effective To Date` = col_character(),
## .. EmploymentStatus = col_character(),
## .. Gender = col_character(),
## .. Income = col_double(),
## .. `Location Code` = col_character(),
## .. `Marital Status` = col_character(),
## .. `Monthly Premium Auto` = col_double(),
## .. `Months Since Last Claim` = col_double(),
## .. `Months Since Policy Inception` = col_double(),
## .. `Number of Open Complaints` = col_double(),
## .. `Number of Policies` = col_double(),
## .. `Policy Type` = col_character(),
## .. Policy = col_character(),
## .. `Renew Offer Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Total Claim Amount` = col_double(),
## .. `Vehicle Class` = col_character(),
## .. `Vehicle Size` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
duplicates <- duplicated(Data$Customer)
num_true <- sum(duplicates)
print(num_true)
## [1] 0
remove(duplicates,num_true)
We can conclude that there are no duplicates.
Data <- Data %>%
rename(Customer_Lifetime_Value=`Customer Lifetime Value`,
Effective_To_Date=`Effective To Date`,
Location_Code=`Location Code`,
Employment_Status=EmploymentStatus,
Marital_Status=`Marital Status`,
Monthly_Premium_Auto=`Monthly Premium Auto`,
Months_Since_Last_Claim=`Months Since Last Claim`,
Months_Since_Policy_Inception=`Months Since Policy Inception`,
Number_of_Open_Complaints=`Number of Open Complaints`,
Number_of_Policies=`Number of Policies`,
Policy_Type=`Policy Type`,
Renew_Offer_Type=`Renew Offer Type`,
Sales_Channel=`Sales Channel`,
Total_Claim_Amount=`Total Claim Amount`,
Vehicle_Class=`Vehicle Class`,
Vehicle_Size=`Vehicle Size`)
Data <- Data %>%
mutate(`Effective_To_Date`=mdy(`Effective_To_Date`))
Data$Effective_Days <- weekdays(Data$`Effective_To_Date`)
Data$Effective_Days <- factor(Data$Effective_Days, c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
Analyze each predictor variable individually to understand its distribution and identify any potential outliers or data issues.
Histograms: Create histograms to visualize the distribution of each continuous predictor variable.
Box plots: Use box plots to identify the range, quartiles, and potential outliers of continuous predictor variables.
Bar plots: Create bar plots for categorical predictor variables to visualize the frequency of each category.
Summary statistics: Calculate summary statistics, such as mean, median, standard deviation, and quartiles, to describe the central tendency and dispersion of each continuous predictor variable.
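A minimal sketch of these univariate checks, assuming ggplot2 defaults (the bin widths, variables shown, and themes used in the report may differ):
ggplot(Data, aes(Customer_Lifetime_Value)) + geom_histogram(bins = 50)  # histogram of a continuous predictor
ggplot(Data, aes(y = Monthly_Premium_Auto)) + geom_boxplot()            # box plot to spot outliers
ggplot(Data, aes(Coverage)) + geom_bar()                                # bar plot for a categorical predictor
summary(Data$Income)                                                    # summary statistics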
Analyze the relationship between each predictor variable and the target variable.
Scatter plots: Plot scatter plots between continuous predictor variables and the target variable to visualize the relationships and identify any trends or patterns.
Box plots or violin plots: Use box plots or violin plots to visualize the distribution of the target variable across different categories of a categorical predictor variable.
Correlation matrix: Calculate the correlation matrix for continuous predictor variables to identify any strong linear relationships with the target variable.
Abbreviations used in the correlation matrix:
CLV = Customer Lifetime Value
INC = Income
MPA = Monthly Premium Auto
TCA = Total Claim Amount
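A possible sketch of how the correlation matrix behind these abbreviations could be produced with ggcorrplot (the short names are used only as plot labels):
num_vars <- Data %>%
  dplyr::select(CLV = Customer_Lifetime_Value, INC = Income,
                MPA = Monthly_Premium_Auto, TCA = Total_Claim_Amount)
ggcorrplot(cor(num_vars), lab = TRUE)  # annotated correlation heatmap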
CLV has a high correlation with MPA. Next, we study how the categorical variables relate to CLV.
In this section, we will perform a concise statistical analysis to select the most relevant predictors for our regression model. The steps involved in this process are as follows:
Multicollinearity assessment: Check for correlations among predictor variables to ensure stability and reliability in the regression model.
Bivariate analysis: Investigate associations between each predictor and the target variable using correlation coefficients and hypothesis tests.
Feature selection: Employ techniques like stepwise regression, LASSO, or Ridge regression to systematically choose the optimal subset of predictors.
Visualizations: Leverage graphical representations to effectively communicate findings and support informed decision-making.
By following these steps, we will identify a suitable set of predictor variables that contribute meaningfully to our regression model’s performance.
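The permutation ANOVA output below was produced with lmPerm; the exact call (and any seed) is not echoed in the report, so the following is only an illustrative sketch:
summary(aovp(Customer_Lifetime_Value ~ State, data = Data, perm = "Prob"))
summary(aovp(Customer_Lifetime_Value ~ Marital_Status, data = Data, perm = "Prob"))
# For predictors that look significant, a plain lm() gives the coefficient estimates:
summary(lm(Customer_Lifetime_Value ~ Marital_Status, data = Data))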
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## State1 4 5.1550e+07 12887444 1533 0.9289
## Residuals 9129 4.3112e+11 47225235
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Response1 1 3.4380e+07 34380462 184 0.3533
## Residuals 9132 4.3114e+11 47211601
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Location_Code1 2 1.0200e+07 5099770 51 1
## Residuals 9131 4.3116e+11 47219419
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Marital_Status1 2 3.1310e+08 156547994 5000 < 2.2e-16 ***
## Residuals 9131 4.3086e+11 47186247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
##
## Coefficients:
## (Intercept) Marital_StatusMarried Marital_StatusSingle
## 8241.2 -162.3 -526.4
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6237 -3990 -2219 956 75246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8241.2 185.7 44.390 <2e-16 ***
## Marital_StatusMarried -162.3 208.3 -0.779 0.436
## Marital_StatusSingle -526.4 231.5 -2.274 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9131 degrees of freedom
## Multiple R-squared: 0.0007262, Adjusted R-squared: 0.0005073
## F-statistic: 3.318 on 2 and 9131 DF, p-value: 0.03628
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Coverage1 2 1.2265e+10 6132650242 5000 < 2.2e-16 ***
## Residuals 9131 4.1891e+11 45877277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
##
## Coefficients:
## (Intercept) CoverageExtended CoveragePremium
## 7191 1599 3705
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7723 -4300 -1971 1185 74536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7190.71 90.77 79.22 <2e-16 ***
## CoverageExtended 1598.97 158.02 10.12 <2e-16 ***
## CoveragePremium 3704.90 252.82 14.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6773 on 9131 degrees of freedom
## Multiple R-squared: 0.02845, Adjusted R-squared: 0.02823
## F-statistic: 133.7 on 2 and 9131 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Education1 4 4.5725e+08 114312631 5000 < 2.2e-16 ***
## Residuals 9129 4.3071e+11 47180794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
##
## Coefficients:
## (Intercept) EducationCollege
## 7872.7 -21.6
## EducationDoctor EducationHigh School or Below
## -352.3 424.0
## EducationMaster
## 370.8
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6356 -3992 -2187 879 75029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7872.7 131.0 60.082 <2e-16 ***
## EducationCollege -21.6 186.5 -0.116 0.9078
## EducationDoctor -352.3 393.9 -0.895 0.3711
## EducationHigh School or Below 424.1 187.5 2.261 0.0238 *
## EducationMaster 370.8 284.3 1.304 0.1922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9129 degrees of freedom
## Multiple R-squared: 0.00106, Adjusted R-squared: 0.0006228
## F-statistic: 2.423 on 4 and 9129 DF, p-value: 0.04604
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Employment_Status1 4 7.1856e+08 179640773 5000 < 2.2e-16 ***
## Residuals 9129 4.3045e+11 47152169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
##
## Coefficients:
## (Intercept) Employment_StatusEmployed
## 7847.9 371.2
## Employment_StatusMedical Leave Employment_StatusRetired
## -206.1 -360.0
## Employment_StatusUnemployed
## -211.6
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5964 -3952 -2285 895 75106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7847.9 341.2 23.000 <2e-16 ***
## Employment_StatusEmployed 371.2 353.1 1.051 0.293
## Employment_StatusMedical Leave -206.1 474.9 -0.434 0.664
## Employment_StatusRetired -360.0 532.6 -0.676 0.499
## Employment_StatusUnemployed -211.6 369.8 -0.572 0.567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6867 on 9129 degrees of freedom
## Multiple R-squared: 0.001667, Adjusted R-squared: 0.001229
## F-statistic: 3.81 on 4 and 9129 DF, p-value: 0.004251
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Gender1 1 7.9863e+07 79863478 5000 0.0036 **
## Residuals 9132 4.3109e+11 47206620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
##
## Coefficients:
## (Intercept) GenderM
## 8096.6 -187.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6198 -4001 -2218 940 75416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8096.6 100.7 80.427 <2e-16 ***
## GenderM -187.1 143.8 -1.301 0.193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 0.0001852, Adjusted R-squared: 7.574e-05
## F-statistic: 1.692 on 1 and 9132 DF, p-value: 0.1934
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy_Type1 2 2.0613e+08 103063571 5000 0.1696
## Residuals 9131 4.3096e+11 47197962
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy1 8 4.4710e+08 55887813 5000 0.5334
## Residuals 9125 4.3072e+11 47202588
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Vehicle_Class1 5 5.5043e+10 1.1009e+10 5000 < 2.2e-16 ***
## Residuals 9128 3.7613e+11 4.1206e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
##
## Coefficients:
## (Intercept) Vehicle_ClassLuxury Car
## 6631.7 10421.6
## Vehicle_ClassLuxury SUV Vehicle_ClassSports Car
## 10491.3 4119.3
## Vehicle_ClassSUV Vehicle_ClassTwo-Door Car
## 3811.8 39.3
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11167 -3769 -1511 1075 66272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6631.73 94.43 70.229 <2e-16 ***
## Vehicle_ClassLuxury Car 10421.62 511.58 20.371 <2e-16 ***
## Vehicle_ClassLuxury SUV 10491.27 482.56 21.741 <2e-16 ***
## Vehicle_ClassSports Car 4119.26 306.68 13.432 <2e-16 ***
## Vehicle_ClassSUV 3811.79 178.49 21.355 <2e-16 ***
## Vehicle_ClassTwo-Door Car 39.30 175.40 0.224 0.823
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6419 on 9128 degrees of freedom
## Multiple R-squared: 0.1277, Adjusted R-squared: 0.1272
## F-statistic: 267.2 on 5 and 9128 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Vehicle_Size1 2 2.2489e+08 112444031 5000 0.086 .
## Residuals 9131 4.3095e+11 47195907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
##
## Coefficients:
## (Intercept) Vehicle_SizeMedsize Vehicle_SizeSmall
## 7545.0 505.7 540.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6186 -3977 -2212 981 75240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7545.0 223.4 33.779 <2e-16 ***
## Vehicle_SizeMedsize 505.7 239.2 2.114 0.0346 *
## Vehicle_SizeSmall 540.1 276.8 1.951 0.0511 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6870 on 9131 degrees of freedom
## Multiple R-squared: 0.0005216, Adjusted R-squared: 0.0003027
## F-statistic: 2.382 on 2 and 9131 DF, p-value: 0.09238
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Renew_Offer_Type1 3 3.6291e+09 1209695120 5000 < 2.2e-16 ***
## Residuals 9130 4.2754e+11 46828218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
##
## Coefficients:
## (Intercept) Renew_Offer_TypeOffer2 Renew_Offer_TypeOffer3
## 8707.1 -1310.3 -709.2
## Renew_Offer_TypeOffer4
## -1527.1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6809 -4046 -2028 1055 74618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8707.1 111.7 77.938 < 2e-16 ***
## Renew_Offer_TypeOffer2 -1310.3 168.8 -7.764 9.13e-15 ***
## Renew_Offer_TypeOffer3 -709.2 212.6 -3.336 0.000852 ***
## Renew_Offer_TypeOffer4 -1527.1 241.3 -6.330 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6843 on 9130 degrees of freedom
## Multiple R-squared: 0.008417, Adjusted R-squared: 0.008091
## F-statistic: 25.83 on 3 and 9130 DF, p-value: < 2.2e-16
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Sales_Channel1 3 1.2472e+08 41572350 1959 0.6131
## Residuals 9130 4.3105e+11 47212048
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
##
## Coefficients:
## (Intercept) Sales_ChannelBranch Sales_ChannelCall Center
## 7957.7 162.0 142.4
## Sales_ChannelWeb
## -177.9
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6202 -4001 -2209 931 75225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7957.7 116.5 68.291 <2e-16 ***
## Sales_ChannelBranch 162.0 178.8 0.906 0.365
## Sales_ChannelCall Center 142.4 200.8 0.709 0.478
## Sales_ChannelWeb -177.9 221.8 -0.802 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9130 degrees of freedom
## Multiple R-squared: 0.0002893, Adjusted R-squared: -3.924e-05
## F-statistic: 0.8805 on 3 and 9130 DF, p-value: 0.4503
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Effective_Days 6 1.4290e+08 23816121 5000 0.9998
## Residuals 9127 4.3103e+11 47225575
## [1] "Settings: unique SS : numeric variables centered"
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Number_of_Open_Complaints 1 5.695e+08 569502273 5000 < 2.2e-16 ***
## Residuals 9132 4.306e+11 47153002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints,
## data = Data)
##
## Coefficients:
## (Intercept) Number_of_Open_Complaints
## 8110.4 -274.3
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6116 -4019 -2215 971 75215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8110.38 77.99 103.990 < 2e-16 ***
## Number_of_Open_Complaints -274.29 78.93 -3.475 0.000513 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6867 on 9132 degrees of freedom
## Multiple R-squared: 0.001321, Adjusted R-squared: 0.001211
## F-statistic: 12.08 on 1 and 9132 DF, p-value: 0.0005126
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception,
## data = Data)
##
## Coefficients:
## (Intercept) Months_Since_Policy_Inception
## 7893.480 2.319
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6206 -4007 -2228 952 75260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7893.480 143.192 55.12 <2e-16 ***
## Months_Since_Policy_Inception 2.319 2.576 0.90 0.368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 8.871e-05, Adjusted R-squared: -2.079e-05
## F-statistic: 0.8101 on 1 and 9132 DF, p-value: 0.3681
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
##
## Coefficients:
## (Intercept) Income
## 7.797e+03 5.511e-03
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5899 -3976 -2274 917 75203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.797e+03 1.145e+02 68.114 <2e-16 ***
## Income 5.511e-03 2.366e-03 2.329 0.0199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6869 on 9132 degrees of freedom
## Multiple R-squared: 0.0005937, Adjusted R-squared: 0.0004842
## F-statistic: 5.425 on 1 and 9132 DF, p-value: 0.01987
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim,
## data = Data)
##
## Coefficients:
## (Intercept) Months_Since_Last_Claim
## 7886.345 7.856
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6155 -3995 -2224 963 75196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7886.345 129.534 60.882 <2e-16 ***
## Months_Since_Last_Claim 7.856 7.137 1.101 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared: 0.0001326, Adjusted R-squared: 2.314e-05
## F-statistic: 1.211 on 1 and 9132 DF, p-value: 0.2711
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
##
## Coefficients:
## (Intercept) Number_of_Policies
## 7817.73 63.11
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5983 -3918 -2337 831 75381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7817.73 114.56 68.240 <2e-16 ***
## Number_of_Policies 63.11 30.07 2.099 0.0359 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6870 on 9132 degrees of freedom
## Multiple R-squared: 0.000482, Adjusted R-squared: 0.0003726
## F-statistic: 4.404 on 1 and 9132 DF, p-value: 0.03588
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
##
## Coefficients:
## (Intercept) Total_Claim_Amount
## 5679.932 5.356
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12595 -4042 -1804 1128 71707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5679.9318 125.9190 45.11 <2e-16 ***
## Total_Claim_Amount 5.3561 0.2411 22.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6693 on 9132 degrees of freedom
## Multiple R-squared: 0.05128, Adjusted R-squared: 0.05118
## F-statistic: 493.6 on 1 and 9132 DF, p-value: < 2.2e-16
# Split the data into training and holdout sets using an 80-20 split
set.seed(123)
trainIndex <- createDataPartition(Data$Customer_Lifetime_Value, p=0.8, list = FALSE)
trainData <- Data[trainIndex, ]
holdoutData <- Data[-trainIndex, ]
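The chunk that fits the first multiple regression is not echoed; a sketch of what it likely looks like (the object name full_lm is assumed):
full_lm <- lm(Customer_Lifetime_Value ~ Coverage + Vehicle_Class +
                Monthly_Premium_Auto + Total_Claim_Amount,
              data = trainData, na.action = na.omit)
summary(full_lm)
vif(full_lm)  # generalized VIFs reported after the summary below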
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage + Vehicle_Class +
## Monthly_Premium_Auto + Total_Claim_Amount, data = trainData,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11522 -3422 -946 570 56555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1957.8049 728.5766 2.687 0.00722 **
## CoverageExtended 312.8582 277.1374 1.129 0.25898
## CoveragePremium 898.0097 586.5817 1.531 0.12583
## Vehicle_ClassLuxury Car 2185.6535 1530.9625 1.428 0.15344
## Vehicle_ClassLuxury SUV 2700.4376 1525.6442 1.770 0.07676 .
## Vehicle_ClassSports Car 1304.6222 581.2719 2.244 0.02483 *
## Vehicle_ClassSUV 1276.7265 505.1394 2.527 0.01151 *
## Vehicle_ClassTwo-Door Car 153.8725 190.8422 0.806 0.42011
## Monthly_Premium_Auto 62.3692 10.9021 5.721 1.1e-08 ***
## Total_Claim_Amount -0.9269 0.3264 -2.840 0.00453 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6272 on 7300 degrees of freedom
## Multiple R-squared: 0.1554, Adjusted R-squared: 0.1544
## F-statistic: 149.3 on 9 and 7300 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## Coverage 5.923714 2 1.560086
## Vehicle_Class 20.427826 5 1.352142
## Monthly_Premium_Auto 26.347164 1 5.132949
## Total_Claim_Amount 1.674327 1 1.293958
Monthly_Premium_Auto has an adjusted GVIF (GVIF^(1/(2*Df))) above 5, which indicates some multicollinearity. To address this, we will use stepwise regression to select the appropriate features.
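A plausible reconstruction of the stepwise selection with MASS::stepAIC (the object name step_lm is assumed, consistent with its use in the outlier check further below):
step_lm <- stepAIC(full_lm, direction = "both")  # the trace of each step is printed below
step_lm           # coefficients of the selected model
summary(step_lm)  # full summary of the selected model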
## Start: AIC=127845.3
## Customer_Lifetime_Value ~ Coverage + Vehicle_Class + Monthly_Premium_Auto +
## Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## - Coverage 2 92543790 2.8727e+11 127844
## - Vehicle_Class 5 393074168 2.8757e+11 127845
## <none> 2.8717e+11 127845
## - Total_Claim_Amount 1 317241437 2.8749e+11 127851
## - Monthly_Premium_Auto 1 1287480143 2.8846e+11 127876
##
## Step: AIC=127843.6
## Customer_Lifetime_Value ~ Vehicle_Class + Monthly_Premium_Auto +
## Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## - Vehicle_Class 5 3.2126e+08 2.8759e+11 127842
## <none> 2.8727e+11 127844
## + Coverage 2 9.2544e+07 2.8717e+11 127845
## - Total_Claim_Amount 1 3.1446e+08 2.8758e+11 127850
## - Monthly_Premium_Auto 1 1.0322e+10 2.9759e+11 128100
##
## Step: AIC=127841.8
## Customer_Lifetime_Value ~ Monthly_Premium_Auto + Total_Claim_Amount
##
## Df Sum of Sq RSS AIC
## <none> 2.8759e+11 127842
## + Vehicle_Class 5 3.2126e+08 2.8727e+11 127844
## + Coverage 2 2.0727e+07 2.8757e+11 127845
## - Total_Claim_Amount 1 3.4071e+08 2.8793e+11 127848
## - Monthly_Premium_Auto 1 3.5501e+10 3.2309e+11 128691
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData, na.action = na.omit)
##
## Coefficients:
## (Intercept) Monthly_Premium_Auto Total_Claim_Amount
## 724.9909 82.3935 -0.9586
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12225 -3516 -909 253 56787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 724.9909 212.4388 3.413 0.000647 ***
## Monthly_Premium_Auto 82.3935 2.7434 30.034 < 2e-16 ***
## Total_Claim_Amount -0.9586 0.3258 -2.942 0.003269 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared: 0.1542, Adjusted R-squared: 0.154
## F-statistic: 666.1 on 2 and 7307 DF, p-value: < 2.2e-16
After stepwise selection, only two features remain: Monthly_Premium_Auto and Total_Claim_Amount.
# Inspect outliers whose standardized residuals exceed 2.5
sresid <- rstandard(step_lm)
idx <- order(sresid, decreasing=TRUE)
sresid[idx[325]]                 # standardized residual of the 325th most extreme record
resid(step_lm)[idx[1]]           # raw residual of the single most extreme record
outlier_record <- trainData[idx[1], c('Monthly_Premium_Auto', 'Total_Claim_Amount')]
trainData$sresid <- rstandard(step_lm)
# Optional filter (not applied; see the note below):
# trainData <- trainData %>%
#   filter(sresid < 2.5, sresid > -2.5)
Although a number of records exceed the 2.5 standardized-residual threshold, we opted not to exclude them, since removing that many observations could compromise the integrity of the model.
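The 10-fold cross-validation reported below was run with caret; a minimal sketch, assuming the fitted object is called model_cv and that a seed was set for the folds:
set.seed(123)  # assumed; the actual seed is not shown
model_cv <- train(Customer_Lifetime_Value ~ Monthly_Premium_Auto + Total_Claim_Amount,
                  data = trainData, method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
model_cv$resample             # per-fold RMSE, R-squared and MAE
model_cv                      # resampling summary
summary(model_cv$finalModel)  # coefficients of the final linear model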
## RMSE Rsquared MAE Resample
## 1 5553.824 0.1746101 3614.963 Fold01
## 2 6152.378 0.1148496 3913.138 Fold02
## 3 6235.549 0.1819838 3844.474 Fold03
## 4 6277.868 0.1064881 3985.924 Fold04
## 5 6336.034 0.1626300 3883.500 Fold05
## 6 5957.007 0.1503061 3744.616 Fold06
## 7 6212.078 0.2165131 3821.449 Fold07
## 8 6823.927 0.1706952 3970.905 Fold08
## 9 6372.496 0.1236572 3870.934 Fold09
## 10 6741.269 0.1529866 4062.650 Fold10
## Linear Regression
##
## 7310 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6578, 6578, 6580, 6578, 6579, 6580, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6266.243 0.155472 3871.255
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto +
## Total_Claim_Amount, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12225 -3516 -909 253 56787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 724.9909 212.4388 3.413 0.000647 ***
## Monthly_Premium_Auto 82.3935 2.7434 30.034 < 2e-16 ***
## Total_Claim_Amount -0.9586 0.3258 -2.942 0.003269 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared: 0.1542, Adjusted R-squared: 0.154
## F-statistic: 666.1 on 2 and 7307 DF, p-value: < 2.2e-16
## Monthly_Premium_Auto Total_Claim_Amount
## 1.667533 1.667533
## RMSE: 6433.889
## R-squared 0.1738726
On the holdout sample the model attains a slightly higher R-squared (0.174 vs. 0.155 in cross-validation), although the RMSE is somewhat larger (6434 vs. 6266), so there is no sign of overfitting on data it has not seen before.
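A hedged sketch of how the VIFs and holdout figures above could be computed (the exact helper code is not echoed):
vif(step_lm)
holdout_pred <- predict(step_lm, newdata = holdoutData)
sqrt(mean((holdoutData$Customer_Lifetime_Value - holdout_pred)^2))  # holdout RMSE
cor(holdoutData$Customer_Lifetime_Value, holdout_pred)^2            # holdout R-squared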
Earlier we found that the relationship between Customer_Lifetime_Value (the target) and Monthly_Premium_Auto (a feature) changes drastically only when the number of policies is 2. We suspect the model's accuracy could improve by using an algorithm that can capture this non-linear relationship.
# Load the kknn package
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
# Prepare the data
Data_KNN <- model.matrix(~ -1 + Monthly_Premium_Auto + Coverage + Vehicle_Class + Total_Claim_Amount + Number_of_Policies, data=Data)
Data_KNN_std <- scale(Data_KNN)
# Reuse the earlier trainIndex so the KNN split matches the regression split
# trainIndex_knn <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData_knn <- data.frame(Data_KNN_std[trainIndex, ,drop=FALSE])
holdoutData_knn <- data.frame(Data_KNN_std[-trainIndex, ,drop=FALSE])
# Extract the target (Customer_Lifetime_Value) for the training rows
train_outcome <- Data[trainIndex,]$Customer_Lifetime_Value
# Add the train_outcome to the trainData_knn data frame
trainData_knn$Customer_Lifetime_Value <- train_outcome
# Train the KNN regression model and make predictions for the holdout data
knn_regression <- kknn(Customer_Lifetime_Value ~ ., train = trainData_knn, test = holdoutData_knn, k = 20)
# Get the predicted values from the model
knn_pred <- knn_regression$fitted.values
# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)
# Add the KNN predictions as a new column to holdoutData
holdoutData <- cbind(holdoutData, knn_pred_matrix)
# Set the column name for the new column
colnames(holdoutData)[ncol(holdoutData)] <- "KNN_Alone_Predicted_Outcome"
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
## R-squared: 0.5573309
## RMSE: 4703.382
The KNN model is substantially better on the holdout sample (R-squared 0.56 and RMSE 4703, versus 0.17 and 6434 for the linear model).
Now let's use the KNN prediction as a feature in the multiple linear regression model and see whether that improves performance even further.
Data_KNN <- model.matrix(~ -1 + Monthly_Premium_Auto + Coverage + Vehicle_Class + Total_Claim_Amount + Number_of_Policies, data=Data)
# Standardize the predictors
Data_KNN_std <- scale(Data_KNN)
#Convert to a dataframe
Data_KNN_std <- data.frame(Data_KNN_std[,,drop=FALSE])
# Extract the target (Customer_Lifetime_Value) for the full dataset
train_outcome <- Data$Customer_Lifetime_Value
# Add the target to the standardized feature data frame
Data_KNN_std$Customer_Lifetime_Value <- train_outcome
knn_on_data <- kknn(Customer_Lifetime_Value ~ ., train = Data_KNN_std, test = Data_KNN_std, k = 20)
# Get the predicted values from the model
knn_pred <- knn_on_data$fitted.values
# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)
# Add the KNN predictions as a new column to Data
Data <- cbind(Data, knn_pred_matrix)
# Set the column name for the new column
colnames(Data)[ncol(Data)] <- "Predicted_Outcome"
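The ensemble fit below (linear regression on Monthly_Premium_Auto, Total_Claim_Amount, and the KNN prediction) is not echoed; a sketch under the assumption that the split is rebuilt from the augmented Data using the original trainIndex (the *_ens object names are hypothetical):
trainData_ens <- Data[trainIndex, ]
holdoutData_ens <- Data[-trainIndex, ]
set.seed(123)  # assumed
ensemble_cv <- train(Customer_Lifetime_Value ~ Monthly_Premium_Auto +
                       Total_Claim_Amount + Predicted_Outcome,
                     data = trainData_ens, method = "lm",
                     trControl = trainControl(method = "cv", number = 10))
ensemble_cv$resample  # per-fold metrics shown below
ensemble_cv           # resampling summary
# VIF check and holdout assessment for the ensemble model
ensemble_lm <- lm(Customer_Lifetime_Value ~ Monthly_Premium_Auto +
                    Total_Claim_Amount + Predicted_Outcome, data = trainData_ens)
vif(ensemble_lm)
ens_pred <- predict(ensemble_lm, newdata = holdoutData_ens)
rmse(holdoutData_ens$Customer_Lifetime_Value, ens_pred)   # Metrics::rmse
cor(holdoutData_ens$Customer_Lifetime_Value, ens_pred)^2  # holdout R-squared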
## RMSE Rsquared MAE Resample
## 1 3196.165 0.7310191 1583.584 Fold01
## 2 3488.560 0.7143012 1692.963 Fold02
## 3 3846.960 0.6879247 1813.171 Fold03
## 4 3451.232 0.7283497 1679.809 Fold04
## 5 3816.549 0.6963893 1791.272 Fold05
## 6 3152.503 0.7621065 1551.005 Fold06
## 7 3664.483 0.7268223 1747.670 Fold07
## 8 4520.429 0.6343727 2000.154 Fold08
## 9 4061.948 0.6425909 1826.314 Fold09
## 10 4041.724 0.6968264 1941.774 Fold10
## Linear Regression
##
## 7310 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 6578, 6578, 6580, 6578, 6579, 6580, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3724.055 0.7020703 1762.771
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
This is a huge improvement. But let's not get ahead of ourselves: it is important to check other factors that validate the model, such as the VIF, and most importantly, to assess the model on the holdout sample.
## Monthly_Premium_Auto Total_Claim_Amount Predicted_Outcome
## 2.041001 1.669735 1.339379
All VIFs are well below 5, so there is no problematic multicollinearity.
## RMSE: 3994.279
## R-squared 0.6813194
This is a very good model, with an R-squared of 0.6813 in the holdout sample.
As can be seen, the model predicts customer lifetime value quite accurately.
We can conclude that creating an ensemble model, which combines the KNN feature with our original multiple linear regression model, significantly improves our predictive performance. This improvement is attributed to the ensemble model’s ability to capture non-linear relationships that the initial linear model was unable to detect.
A strong relationship between the coverage (Basic, Extended, or Premium) and CLV exists, with higher CLV for more premium coverages.
Vehicle Class also impacts CLV, with luxury cars having higher CLV.
Monthly Premium Auto, the amount paid monthly for auto insurance, had the strongest correlation with CLV.
An inverse correlation was observed between Total Claim Amount and CLV.
The Number of Policies feature showed that customers with two policies had significantly higher CLV, affecting linearity.
Due to multicollinearity between Coverage, Vehicle Class, and Monthly Premium Auto, only the most promising feature (Monthly Premium Auto) was retained for the model.
Our model has an RMSE of 3994.279 and an R-squared of 0.6813 on the holdout sample. This is a strong result, so the model can help inform business decisions aimed at increasing CLV.
Much of this performance came from identifying the irregular behavior associated with customers holding two policies. We recommend investigating this behavior to determine its cause, such as a special offer for purchasing two policies. By understanding and addressing it, we can develop an even more accurate model to predict CLV.