Section 1: Data preparation (Import, transform, sort and filter)


Import libraries

library(car)
library(ggcorrplot)
library(tidyverse)
library(stringr)
library(ggpubr)
library(knitr)
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(splines)
library(mgcv)
library(reshape2)
library(gridExtra)
library(lmPerm)
library(caret)

Import csv files as data frames

Data <- read_csv(file.path(PSDS_PATH, 'dataset.csv'))
rm(dir1,dir2,dir3,dir4,dir5,dir6,dir7,dir8,dir9,dir10,file_name,PSDS_PATH)

Section 2: Data processing (cleaning)


Explore the data frame

str(Data)
## spc_tbl_ [9,134 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Customer                     : chr [1:9134] "SP56326" "VQ57815" "JG36165" "LO75919" ...
##  $ State                        : chr [1:9134] "Arizona" "Oregon" "California" "Arizona" ...
##  $ Customer Lifetime Value      : num [1:9134] 7453 8828 6043 6008 15245 ...
##  $ Response                     : chr [1:9134] "No" "No" "No" "No" ...
##  $ Coverage                     : chr [1:9134] "Premium" "Basic" "Extended" "Extended" ...
##  $ Education                    : chr [1:9134] "High School or Below" "Doctor" "Bachelor" "Bachelor" ...
##  $ Effective To Date            : chr [1:9134] "01/01/11" "01/01/11" "01/01/11" "01/01/11" ...
##  $ EmploymentStatus             : chr [1:9134] "Medical Leave" "Employed" "Unemployed" "Employed" ...
##  $ Gender                       : chr [1:9134] "F" "M" "M" "M" ...
##  $ Income                       : num [1:9134] 12195 61246 0 43543 30205 ...
##  $ Location Code                : chr [1:9134] "Suburban" "Rural" "Suburban" "Urban" ...
##  $ Marital Status               : chr [1:9134] "Divorced" "Married" "Single" "Married" ...
##  $ Monthly Premium Auto         : num [1:9134] 101 110 83 76 195 74 66 112 74 66 ...
##  $ Months Since Last Claim      : num [1:9134] 29 24 12 11 24 21 8 6 3 15 ...
##  $ Months Since Policy Inception: num [1:9134] 35 21 21 62 1 30 93 71 42 45 ...
##  $ Number of Open Complaints    : num [1:9134] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Number of Policies           : num [1:9134] 7 3 5 8 4 2 1 9 3 1 ...
##  $ Policy Type                  : chr [1:9134] "Personal Auto" "Personal Auto" "Personal Auto" "Personal Auto" ...
##  $ Policy                       : chr [1:9134] "Personal L3" "Personal L2" "Personal L1" "Personal L3" ...
##  $ Renew Offer Type             : chr [1:9134] "Offer1" "Offer2" "Offer4" "Offer1" ...
##  $ Sales Channel                : chr [1:9134] "Branch" "Branch" "Web" "Call Center" ...
##  $ Total Claim Amount           : num [1:9134] 671 159 904 339 1330 ...
##  $ Vehicle Class                : chr [1:9134] "Four-Door Car" "Sports Car" "Two-Door Car" "Two-Door Car" ...
##  $ Vehicle Size                 : chr [1:9134] "Medsize" "Medsize" "Medsize" "Medsize" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Customer = col_character(),
##   ..   State = col_character(),
##   ..   `Customer Lifetime Value` = col_double(),
##   ..   Response = col_character(),
##   ..   Coverage = col_character(),
##   ..   Education = col_character(),
##   ..   `Effective To Date` = col_character(),
##   ..   EmploymentStatus = col_character(),
##   ..   Gender = col_character(),
##   ..   Income = col_double(),
##   ..   `Location Code` = col_character(),
##   ..   `Marital Status` = col_character(),
##   ..   `Monthly Premium Auto` = col_double(),
##   ..   `Months Since Last Claim` = col_double(),
##   ..   `Months Since Policy Inception` = col_double(),
##   ..   `Number of Open Complaints` = col_double(),
##   ..   `Number of Policies` = col_double(),
##   ..   `Policy Type` = col_character(),
##   ..   Policy = col_character(),
##   ..   `Renew Offer Type` = col_character(),
##   ..   `Sales Channel` = col_character(),
##   ..   `Total Claim Amount` = col_double(),
##   ..   `Vehicle Class` = col_character(),
##   ..   `Vehicle Size` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Duplicates

duplicates <- duplicated(Data$Customer)
num_true <- sum(duplicates)
print(num_true)
## [1] 0
remove(duplicates,num_true)

We can conclude that there are no duplicate customer records.

Rename columns

Data <- Data %>% 
rename(Customer_Lifetime_Value=`Customer Lifetime Value`,
       Effective_To_Date=`Effective To Date`,
       Location_Code=`Location Code`,
       Employment_Status=EmploymentStatus,
       Marital_Status=`Marital Status`,
       Monthly_Premium_Auto=`Monthly Premium Auto`,
       Months_Since_Last_Claim=`Months Since Last Claim`,
       Months_Since_Policy_Inception=`Months Since Policy Inception`,
       Number_of_Open_Complaints=`Number of Open Complaints`,
       Number_of_Policies=`Number of Policies`,
       Policy_Type=`Policy Type`,
       Renew_Offer_Type=`Renew Offer Type`,
       Sales_Channel=`Sales Channel`,
       Total_Claim_Amount=`Total Claim Amount`,
       Vehicle_Class=`Vehicle Class`,
       Vehicle_Size=`Vehicle Size`)

Convert “Effective To Date” from character (chr) to Date type and derive the day of the week

Data <- Data %>%
  mutate(`Effective_To_Date`=mdy(`Effective_To_Date`))
Data$Effective_Days <- weekdays(Data$`Effective_To_Date`)
Data$Effective_Days <- factor(Data$Effective_Days, c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
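A quick sanity check (a sketch, assuming the conversion succeeded) confirms the new column types and the weekday distribution:

```r
# Effective_To_Date should now be a Date, and Effective_Days a
# factor with weekday levels ordered Monday through Sunday
class(Data$Effective_To_Date)
table(Data$Effective_Days)
```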

Section 3: Exploratory Data Analysis

Univariate analysis

Analyze each predictor variable individually to understand its distribution and identify any potential outliers or data issues.

  1. Histograms: Create histograms to visualize the distribution of each continuous predictor variable.

  2. Box plots: Use box plots to identify the range, quartiles, and potential outliers of continuous predictor variables.

  3. Bar plots: Create bar plots for categorical predictor variables to visualize the frequency of each category.

  4. Summary statistics: Calculate summary statistics, such as mean, median, standard deviation, and quartiles, to describe the central tendency and dispersion of each continuous predictor variable.
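The plots in the subsections that follow can be produced with ggplot2. A minimal sketch of each plot type (the variables chosen here are illustrative; column names follow the renamed data frame):

```r
# Bar plot: frequency of each category of a categorical variable
ggplot(Data, aes(x = Coverage)) +
  geom_bar() +
  labs(title = "Coverage frequency")

# Histogram: distribution of a continuous variable
ggplot(Data, aes(x = Customer_Lifetime_Value)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Customer Lifetime Value")

# Box plot: range, quartiles and potential outliers
ggplot(Data, aes(y = Customer_Lifetime_Value)) +
  geom_boxplot()

# Summary statistics for central tendency and dispersion
summary(Data$Customer_Lifetime_Value)
```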

Categorical Data - Bar plots

Numerical Data - Histograms

Numerical Data - Boxplots

Bivariate analysis

Analyze the relationship between each predictor variable and the target variable.

  1. Scatter plots: Plot scatter plots between continuous predictor variables and the target variable to visualize the relationships and identify any trends or patterns.

  2. Box plots or violin plots: Use box plots or violin plots to visualize the distribution of the target variable across different categories of a categorical predictor variable.

  3. Correlation matrix: Calculate the correlation matrix for continuous predictor variables to identify any strong linear relationships with the target variable.
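The bivariate plots below follow the same ggplot2 pattern; a sketch with illustrative variable choices:

```r
# Scatter plot: continuous predictor vs the target, with a linear trend line
ggplot(Data, aes(x = Monthly_Premium_Auto, y = Customer_Lifetime_Value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")

# Box plot: target distribution across categories of a categorical predictor
ggplot(Data, aes(x = Coverage, y = Customer_Lifetime_Value)) +
  geom_boxplot()
```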

Categorical Data - Box plots

Numerical Data - Scatter plots

Numerical Continuous Data - Correlation matrix

CLV = Customer Lifetime Value

INC = Income

MPA = Monthly Premium Auto

TCA = Total Claim Amount
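One way to build the abbreviated correlation matrix with ggcorrplot (a sketch; the abbreviations are applied manually via rename):

```r
# Correlation matrix of the continuous variables, using the
# abbreviations defined above
num_vars <- Data %>%
  dplyr::select(Customer_Lifetime_Value, Income,
                Monthly_Premium_Auto, Total_Claim_Amount) %>%
  rename(CLV = Customer_Lifetime_Value, INC = Income,
         MPA = Monthly_Premium_Auto, TCA = Total_Claim_Amount)
ggcorrplot(cor(num_vars), lab = TRUE)
```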

As CLV has a high correlation with MPA, we decided to study how other categorical variables affect MPA.

Monthly Premium Auto vs categorical values

Relationship for Education

Relationship for Coverage

Relationship for Employment Status

Relationship for Gender

Relationship for Location

Relationship for Marital

Relationship for Policy

Relationship for Policy Type

Relationship for Renew Offer

Relationship for Channel

Relationship for Vehicle Class

Relationship for Size

Compare Complaints

Compare Number of Policies

Section 4: Statistical Analysis

In this section, we will perform a concise statistical analysis to select the most relevant predictors for our regression model. The steps involved in this process are as follows:

  1. Multicollinearity assessment: Check for correlations among predictor variables to ensure stability and reliability in the regression model.

  2. Bivariate analysis: Investigate associations between each predictor and the target variable using correlation coefficients and hypothesis tests.

  3. Feature selection: Employ techniques like stepwise regression, LASSO, or Ridge regression to systematically choose the optimal subset of predictors.

  4. Visualizations: Leverage graphical representations to effectively communicate findings and support informed decision-making.

By following these steps, we will identify a suitable set of predictor variables that contribute meaningfully to our regression model’s performance.

Hypothesis testing - Categorical Values - ANOVA test and Simple linear regression
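The “Settings: unique SS” header in the outputs below comes from lmPerm’s permutation ANOVA. A sketch of the call pattern presumably used for each variable (shown for Marital_Status; the `perm = "Prob"` setting is an assumption inferred from the varying iteration counts):

```r
# Permutation ANOVA (lmPerm): tests whether mean CLV differs across levels
summary(aovp(Customer_Lifetime_Value ~ Marital_Status, data = Data,
             perm = "Prob"))

# For significant variables, follow up with a simple linear regression
fit <- lm(Customer_Lifetime_Value ~ Marital_Status, data = Data)
print(fit)
summary(fit)
```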

State

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## State1         4 5.1550e+07  12887444 1533   0.9289
## Residuals   9129 4.3112e+11  47225235

Response

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Response1      1 3.4380e+07  34380462  184   0.3533
## Residuals   9132 4.3114e+11  47211601

Location_Code

## [1] "Settings:  unique SS "
## Component 1 :
##                  Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Location_Code1    2 1.0200e+07   5099770   51        1
## Residuals      9131 4.3116e+11  47219419

Marital_Status

## [1] "Settings:  unique SS "
## Component 1 :
##                   Df   R Sum Sq R Mean Sq Iter  Pr(Prob)    
## Marital_Status1    2 3.1310e+08 156547994 5000 < 2.2e-16 ***
## Residuals       9131 4.3086e+11  47186247                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
## 
## Coefficients:
##           (Intercept)  Marital_StatusMarried   Marital_StatusSingle  
##                8241.2                 -162.3                 -526.4
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Marital_Status, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6237  -3990  -2219    956  75246 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8241.2      185.7  44.390   <2e-16 ***
## Marital_StatusMarried   -162.3      208.3  -0.779    0.436    
## Marital_StatusSingle    -526.4      231.5  -2.274    0.023 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6869 on 9131 degrees of freedom
## Multiple R-squared:  0.0007262,  Adjusted R-squared:  0.0005073 
## F-statistic: 3.318 on 2 and 9131 DF,  p-value: 0.03628

Coverage

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq  R Mean Sq Iter  Pr(Prob)    
## Coverage1      2 1.2265e+10 6132650242 5000 < 2.2e-16 ***
## Residuals   9131 4.1891e+11   45877277                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
## 
## Coefficients:
##      (Intercept)  CoverageExtended   CoveragePremium  
##             7191              1599              3705
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7723  -4300  -1971   1185  74536 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7190.71      90.77   79.22   <2e-16 ***
## CoverageExtended  1598.97     158.02   10.12   <2e-16 ***
## CoveragePremium   3704.90     252.82   14.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6773 on 9131 degrees of freedom
## Multiple R-squared:  0.02845,    Adjusted R-squared:  0.02823 
## F-statistic: 133.7 on 2 and 9131 DF,  p-value: < 2.2e-16

Education

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq R Mean Sq Iter  Pr(Prob)    
## Education1     4 4.5725e+08 114312631 5000 < 2.2e-16 ***
## Residuals   9129 4.3071e+11  47180794                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
## 
## Coefficients:
##                   (Intercept)               EducationCollege  
##                        7872.7                          -21.6  
##               EducationDoctor  EducationHigh School or Below  
##                        -352.3                          424.0  
##               EducationMaster  
##                         370.8
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Education, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6356  -3992  -2187    879  75029 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7872.7      131.0  60.082   <2e-16 ***
## EducationCollege                 -21.6      186.5  -0.116   0.9078    
## EducationDoctor                 -352.3      393.9  -0.895   0.3711    
## EducationHigh School or Below    424.1      187.5   2.261   0.0238 *  
## EducationMaster                  370.8      284.3   1.304   0.1922    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6869 on 9129 degrees of freedom
## Multiple R-squared:  0.00106,    Adjusted R-squared:  0.0006228 
## F-statistic: 2.423 on 4 and 9129 DF,  p-value: 0.04604

Employment Status

## [1] "Settings:  unique SS "
## Component 1 :
##                      Df   R Sum Sq R Mean Sq Iter  Pr(Prob)    
## Employment_Status1    4 7.1856e+08 179640773 5000 < 2.2e-16 ***
## Residuals          9129 4.3045e+11  47152169                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
## 
## Coefficients:
##                    (Intercept)       Employment_StatusEmployed  
##                         7847.9                           371.2  
## Employment_StatusMedical Leave        Employment_StatusRetired  
##                         -206.1                          -360.0  
##    Employment_StatusUnemployed  
##                         -211.6
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Employment_Status, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5964  -3952  -2285    895  75106 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7847.9      341.2  23.000   <2e-16 ***
## Employment_StatusEmployed         371.2      353.1   1.051    0.293    
## Employment_StatusMedical Leave   -206.1      474.9  -0.434    0.664    
## Employment_StatusRetired         -360.0      532.6  -0.676    0.499    
## Employment_StatusUnemployed      -211.6      369.8  -0.572    0.567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6867 on 9129 degrees of freedom
## Multiple R-squared:  0.001667,   Adjusted R-squared:  0.001229 
## F-statistic:  3.81 on 4 and 9129 DF,  p-value: 0.004251

Gender

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq R Mean Sq Iter Pr(Prob)   
## Gender1        1 7.9863e+07  79863478 5000   0.0036 **
## Residuals   9132 4.3109e+11  47206620                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
## 
## Coefficients:
## (Intercept)      GenderM  
##      8096.6       -187.1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Gender, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6198  -4001  -2218    940  75416 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8096.6      100.7  80.427   <2e-16 ***
## GenderM       -187.1      143.8  -1.301    0.193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared:  0.0001852,  Adjusted R-squared:  7.574e-05 
## F-statistic: 1.692 on 1 and 9132 DF,  p-value: 0.1934

Policy Type

## [1] "Settings:  unique SS "
## Component 1 :
##                Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy_Type1    2 2.0613e+08 103063571 5000   0.1696
## Residuals    9131 4.3096e+11  47197962

Policy

## [1] "Settings:  unique SS "
## Component 1 :
##               Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Policy1        8 4.4710e+08  55887813 5000   0.5334
## Residuals   9125 4.3072e+11  47202588

Vehicle Class

## [1] "Settings:  unique SS "
## Component 1 :
##                  Df   R Sum Sq  R Mean Sq Iter  Pr(Prob)    
## Vehicle_Class1    5 5.5043e+10 1.1009e+10 5000 < 2.2e-16 ***
## Residuals      9128 3.7613e+11 4.1206e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
## 
## Coefficients:
##               (Intercept)    Vehicle_ClassLuxury Car  
##                    6631.7                    10421.6  
##   Vehicle_ClassLuxury SUV    Vehicle_ClassSports Car  
##                   10491.3                     4119.3  
##          Vehicle_ClassSUV  Vehicle_ClassTwo-Door Car  
##                    3811.8                       39.3
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Class, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11167  -3769  -1511   1075  66272 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6631.73      94.43  70.229   <2e-16 ***
## Vehicle_ClassLuxury Car   10421.62     511.58  20.371   <2e-16 ***
## Vehicle_ClassLuxury SUV   10491.27     482.56  21.741   <2e-16 ***
## Vehicle_ClassSports Car    4119.26     306.68  13.432   <2e-16 ***
## Vehicle_ClassSUV           3811.79     178.49  21.355   <2e-16 ***
## Vehicle_ClassTwo-Door Car    39.30     175.40   0.224    0.823    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6419 on 9128 degrees of freedom
## Multiple R-squared:  0.1277, Adjusted R-squared:  0.1272 
## F-statistic: 267.2 on 5 and 9128 DF,  p-value: < 2.2e-16

Vehicle Size

## [1] "Settings:  unique SS "
## Component 1 :
##                 Df   R Sum Sq R Mean Sq Iter Pr(Prob)  
## Vehicle_Size1    2 2.2489e+08 112444031 5000    0.086 .
## Residuals     9131 4.3095e+11  47195907                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
## 
## Coefficients:
##         (Intercept)  Vehicle_SizeMedsize    Vehicle_SizeSmall  
##              7545.0                505.7                540.1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Vehicle_Size, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6186  -3977  -2212    981  75240 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7545.0      223.4  33.779   <2e-16 ***
## Vehicle_SizeMedsize    505.7      239.2   2.114   0.0346 *  
## Vehicle_SizeSmall      540.1      276.8   1.951   0.0511 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6870 on 9131 degrees of freedom
## Multiple R-squared:  0.0005216,  Adjusted R-squared:  0.0003027 
## F-statistic: 2.382 on 2 and 9131 DF,  p-value: 0.09238

Renew Offer Type

## [1] "Settings:  unique SS "
## Component 1 :
##                     Df   R Sum Sq  R Mean Sq Iter  Pr(Prob)    
## Renew_Offer_Type1    3 3.6291e+09 1209695120 5000 < 2.2e-16 ***
## Residuals         9130 4.2754e+11   46828218                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
## 
## Coefficients:
##            (Intercept)  Renew_Offer_TypeOffer2  Renew_Offer_TypeOffer3  
##                 8707.1                 -1310.3                  -709.2  
## Renew_Offer_TypeOffer4  
##                -1527.1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Renew_Offer_Type, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6809  -4046  -2028   1055  74618 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              8707.1      111.7  77.938  < 2e-16 ***
## Renew_Offer_TypeOffer2  -1310.3      168.8  -7.764 9.13e-15 ***
## Renew_Offer_TypeOffer3   -709.2      212.6  -3.336 0.000852 ***
## Renew_Offer_TypeOffer4  -1527.1      241.3  -6.330 2.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6843 on 9130 degrees of freedom
## Multiple R-squared:  0.008417,   Adjusted R-squared:  0.008091 
## F-statistic: 25.83 on 3 and 9130 DF,  p-value: < 2.2e-16

Sales Channel

## [1] "Settings:  unique SS "
## Component 1 :
##                  Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Sales_Channel1    3 1.2472e+08  41572350 1959   0.6131
## Residuals      9130 4.3105e+11  47212048
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
## 
## Coefficients:
##              (Intercept)       Sales_ChannelBranch  Sales_ChannelCall Center  
##                   7957.7                     162.0                     142.4  
##         Sales_ChannelWeb  
##                   -177.9
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Sales_Channel, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6202  -4001  -2209    931  75225 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7957.7      116.5  68.291   <2e-16 ***
## Sales_ChannelBranch         162.0      178.8   0.906    0.365    
## Sales_ChannelCall Center    142.4      200.8   0.709    0.478    
## Sales_ChannelWeb           -177.9      221.8  -0.802    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6871 on 9130 degrees of freedom
## Multiple R-squared:  0.0002893,  Adjusted R-squared:  -3.924e-05 
## F-statistic: 0.8805 on 3 and 9130 DF,  p-value: 0.4503

Effective Days (derived from Effective To Date)

## [1] "Settings:  unique SS "
## Component 1 :
##                  Df   R Sum Sq R Mean Sq Iter Pr(Prob)
## Effective_Days    6 1.4290e+08  23816121 5000   0.9998
## Residuals      9127 4.3103e+11  47225575

Number_of_Open_Complaints

## [1] "Settings:  unique SS : numeric variables centered"
## Component 1 :
##                             Df  R Sum Sq R Mean Sq Iter  Pr(Prob)    
## Number_of_Open_Complaints    1 5.695e+08 569502273 5000 < 2.2e-16 ***
## Residuals                 9132 4.306e+11  47153002                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints, 
##     data = Data)
## 
## Coefficients:
##               (Intercept)  Number_of_Open_Complaints  
##                    8110.4                     -274.3
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Open_Complaints, 
##     data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6116  -4019  -2215    971  75215 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8110.38      77.99 103.990  < 2e-16 ***
## Number_of_Open_Complaints  -274.29      78.93  -3.475 0.000513 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6867 on 9132 degrees of freedom
## Multiple R-squared:  0.001321,   Adjusted R-squared:  0.001211 
## F-statistic: 12.08 on 1 and 9132 DF,  p-value: 0.0005126

Simple linear regression - Numerical Values
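Each numerical predictor below is tested with the same simple-regression pattern; a sketch, shown for Income:

```r
# Simple linear regression of CLV on one numerical predictor
fit <- lm(Customer_Lifetime_Value ~ Income, data = Data)
print(fit)    # coefficients only
summary(fit)  # full table with t-tests, R-squared and F-statistic
```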

Months Since Policy Inception

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception, 
##     data = Data)
## 
## Coefficients:
##                   (Intercept)  Months_Since_Policy_Inception  
##                      7893.480                          2.319
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Policy_Inception, 
##     data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6206  -4007  -2228    952  75260 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   7893.480    143.192   55.12   <2e-16 ***
## Months_Since_Policy_Inception    2.319      2.576    0.90    0.368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared:  8.871e-05,  Adjusted R-squared:  -2.079e-05 
## F-statistic: 0.8101 on 1 and 9132 DF,  p-value: 0.3681

Income

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
## 
## Coefficients:
## (Intercept)       Income  
##   7.797e+03    5.511e-03
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Income, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5899  -3976  -2274    917  75203 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.797e+03  1.145e+02  68.114   <2e-16 ***
## Income      5.511e-03  2.366e-03   2.329   0.0199 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6869 on 9132 degrees of freedom
## Multiple R-squared:  0.0005937,  Adjusted R-squared:  0.0004842 
## F-statistic: 5.425 on 1 and 9132 DF,  p-value: 0.01987

Monthly Premium Auto

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto, 
##     data = Data)
## 
## Coefficients:
##          (Intercept)  Monthly_Premium_Auto  
##               628.50                 79.13
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto, 
##     data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13038  -3469   -853    513  64418 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           628.501    190.643   3.297 0.000982 ***
## Monthly_Premium_Auto   79.130      1.919  41.244  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6309 on 9132 degrees of freedom
## Multiple R-squared:  0.157,  Adjusted R-squared:  0.1569 
## F-statistic:  1701 on 1 and 9132 DF,  p-value: < 2.2e-16

Months Since Last Claim

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim, 
##     data = Data)
## 
## Coefficients:
##             (Intercept)  Months_Since_Last_Claim  
##                7886.345                    7.856
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Months_Since_Last_Claim, 
##     data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6155  -3995  -2224    963  75196 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7886.345    129.534  60.882   <2e-16 ***
## Months_Since_Last_Claim    7.856      7.137   1.101    0.271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6871 on 9132 degrees of freedom
## Multiple R-squared:  0.0001326,  Adjusted R-squared:  2.314e-05 
## F-statistic: 1.211 on 1 and 9132 DF,  p-value: 0.2711

Number_of_Policies

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
## 
## Coefficients:
##        (Intercept)  Number_of_Policies  
##            7817.73               63.11
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Number_of_Policies, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5983  -3918  -2337    831  75381 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7817.73     114.56  68.240   <2e-16 ***
## Number_of_Policies    63.11      30.07   2.099   0.0359 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6870 on 9132 degrees of freedom
## Multiple R-squared:  0.000482,   Adjusted R-squared:  0.0003726 
## F-statistic: 4.404 on 1 and 9132 DF,  p-value: 0.03588

Total Claim Amount

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
## 
## Coefficients:
##        (Intercept)  Total_Claim_Amount  
##           5679.932               5.356
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Total_Claim_Amount, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12595  -4042  -1804   1128  71707 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5679.9318   125.9190   45.11   <2e-16 ***
## Total_Claim_Amount    5.3561     0.2411   22.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6693 on 9132 degrees of freedom
## Multiple R-squared:  0.05128,    Adjusted R-squared:  0.05118 
## F-statistic: 493.6 on 1 and 9132 DF,  p-value: < 2.2e-16

Section 5: Predictive model - Multiple Linear Regression

Steps to build the Multiple Regression Model

1. Splitting the data into training and holdout sets

#Split the data into training and holdout sets using an 80-20 split
set.seed(123)
trainIndex <- createDataPartition(Data$Customer_Lifetime_Value, p=0.8, list = FALSE)
trainData <- Data[trainIndex, ]
holdoutData <- Data[-trainIndex, ]

2. Create the model based on training data
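The model summarized below was presumably fitted with a call along these lines (the formula matches the output; `model` is our name for the fitted object):

```r
# Fit the initial multiple regression on the training set
model <- lm(Customer_Lifetime_Value ~ Coverage + Vehicle_Class +
              Monthly_Premium_Auto + Total_Claim_Amount,
            data = trainData, na.action = na.omit)
summary(model)
```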

## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Coverage + Vehicle_Class + 
##     Monthly_Premium_Auto + Total_Claim_Amount, data = trainData, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11522  -3422   -946    570  56555 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1957.8049   728.5766   2.687  0.00722 ** 
## CoverageExtended           312.8582   277.1374   1.129  0.25898    
## CoveragePremium            898.0097   586.5817   1.531  0.12583    
## Vehicle_ClassLuxury Car   2185.6535  1530.9625   1.428  0.15344    
## Vehicle_ClassLuxury SUV   2700.4376  1525.6442   1.770  0.07676 .  
## Vehicle_ClassSports Car   1304.6222   581.2719   2.244  0.02483 *  
## Vehicle_ClassSUV          1276.7265   505.1394   2.527  0.01151 *  
## Vehicle_ClassTwo-Door Car  153.8725   190.8422   0.806  0.42011    
## Monthly_Premium_Auto        62.3692    10.9021   5.721  1.1e-08 ***
## Total_Claim_Amount          -0.9269     0.3264  -2.840  0.00453 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6272 on 7300 degrees of freedom
## Multiple R-squared:  0.1554, Adjusted R-squared:  0.1544 
## F-statistic: 149.3 on 9 and 7300 DF,  p-value: < 2.2e-16

3. Calculating the Variance Inflation Factor for each feature
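The generalized VIFs below can be obtained with car’s `vif()` on the fitted model (a sketch, assuming the model object from the previous step is named `model`):

```r
# Generalized VIF; for factor terms, compare GVIF^(1/(2*Df))
# against a threshold of about sqrt(5)
vif(model)
```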

##                           GVIF Df GVIF^(1/(2*Df))
## Coverage              5.923714  2        1.560086
## Vehicle_Class        20.427826  5        1.352142
## Monthly_Premium_Auto 26.347164  1        5.132949
## Total_Claim_Amount    1.674327  1        1.293958

Monthly Premium Auto has an adjusted VIF higher than 5, which indicates some multicollinearity. To address that, we will use stepwise regression to select the appropriate features.

4. Model selection and Stepwise regression
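The selection trace below, with both `-` (drop) and `+` (add) candidates at each step, is characteristic of MASS’s `stepAIC`; a sketch of the call, assuming `model` holds the full fit from step 2:

```r
# Stepwise selection in both directions, minimizing AIC
step_model <- stepAIC(model, direction = "both")
print(step_model)
summary(step_model)
```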

## Start:  AIC=127845.3
## Customer_Lifetime_Value ~ Coverage + Vehicle_Class + Monthly_Premium_Auto + 
##     Total_Claim_Amount
## 
##                        Df  Sum of Sq        RSS    AIC
## - Coverage              2   92543790 2.8727e+11 127844
## - Vehicle_Class         5  393074168 2.8757e+11 127845
## <none>                               2.8717e+11 127845
## - Total_Claim_Amount    1  317241437 2.8749e+11 127851
## - Monthly_Premium_Auto  1 1287480143 2.8846e+11 127876
## 
## Step:  AIC=127843.6
## Customer_Lifetime_Value ~ Vehicle_Class + Monthly_Premium_Auto + 
##     Total_Claim_Amount
## 
##                        Df  Sum of Sq        RSS    AIC
## - Vehicle_Class         5 3.2126e+08 2.8759e+11 127842
## <none>                               2.8727e+11 127844
## + Coverage              2 9.2544e+07 2.8717e+11 127845
## - Total_Claim_Amount    1 3.1446e+08 2.8758e+11 127850
## - Monthly_Premium_Auto  1 1.0322e+10 2.9759e+11 128100
## 
## Step:  AIC=127841.8
## Customer_Lifetime_Value ~ Monthly_Premium_Auto + Total_Claim_Amount
## 
##                        Df  Sum of Sq        RSS    AIC
## <none>                               2.8759e+11 127842
## + Vehicle_Class         5 3.2126e+08 2.8727e+11 127844
## + Coverage              2 2.0727e+07 2.8757e+11 127845
## - Total_Claim_Amount    1 3.4071e+08 2.8793e+11 127848
## - Monthly_Premium_Auto  1 3.5501e+10 3.2309e+11 128691
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto + 
##     Total_Claim_Amount, data = trainData, na.action = na.omit)
## 
## Coefficients:
##          (Intercept)  Monthly_Premium_Auto    Total_Claim_Amount  
##             724.9909               82.3935               -0.9586
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto + 
##     Total_Claim_Amount, data = trainData, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12225  -3516   -909    253  56787 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          724.9909   212.4388   3.413 0.000647 ***
## Monthly_Premium_Auto  82.3935     2.7434  30.034  < 2e-16 ***
## Total_Claim_Amount    -0.9586     0.3258  -2.942 0.003269 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared:  0.1542, Adjusted R-squared:  0.154 
## F-statistic: 666.1 on 2 and 7307 DF,  p-value: < 2.2e-16

Only two features remain: Monthly_Premium_Auto and Total_Claim_Amount.

5. Outlier analysis based on standardized residuals

# Inspect outliers with standardized residuals beyond 2.5
sresid <- rstandard(step_lm)
idx <- order(sresid, decreasing = TRUE)
sresid[idx[325]]        # standardized residual of the 325th-largest record
resid(step_lm)[idx[1]]  # raw residual of the most extreme record
outlier_record <- trainData[idx[1], c('Monthly_Premium_Auto', 'Total_Claim_Amount')]
trainData$sresid <- sresid  # keep standardized residuals for optional filtering

# Optional filter (not applied):
# trainData <- trainData %>%
#   filter(sresid < 2.5) %>%
#   filter(sresid > -2.5)

Although a number of records exceed the 2.5 standardized-residual threshold, we opted against excluding them; removing that many observations could compromise the integrity of the model.

6. 10-fold cross-validation model
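Per-fold results like those below are what caret's `train()` reports with a 10-fold resampling scheme. A sketch under synthetic data (variable names mirror the document's features; the coefficients used to simulate the outcome are illustrative, not fitted values):

```r
library(caret)

# Synthetic stand-in for trainData
set.seed(1)
n <- 500
d <- data.frame(
  Monthly_Premium_Auto = runif(n, 60, 200),
  Total_Claim_Amount   = runif(n, 100, 1200)
)
d$Customer_Lifetime_Value <- 725 + 82 * d$Monthly_Premium_Auto -
  0.96 * d$Total_Claim_Amount + rnorm(n, sd = 3000)

# 10-fold cross-validated linear regression
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(
  Customer_Lifetime_Value ~ Monthly_Premium_Auto + Total_Claim_Amount,
  data = d, method = "lm", trControl = ctrl
)
cv_model$resample  # per-fold RMSE, Rsquared and MAE
cv_model$results   # averaged resampling results
```

`cv_model$finalModel` holds the `lm` fit on the full training data, which is what the `summary()` output below describes.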

##        RMSE  Rsquared      MAE Resample
## 1  5553.824 0.1746101 3614.963   Fold01
## 2  6152.378 0.1148496 3913.138   Fold02
## 3  6235.549 0.1819838 3844.474   Fold03
## 4  6277.868 0.1064881 3985.924   Fold04
## 5  6336.034 0.1626300 3883.500   Fold05
## 6  5957.007 0.1503061 3744.616   Fold06
## 7  6212.078 0.2165131 3821.449   Fold07
## 8  6823.927 0.1706952 3970.905   Fold08
## 9  6372.496 0.1236572 3870.934   Fold09
## 10 6741.269 0.1529866 4062.650   Fold10
## Linear Regression 
## 
## 7310 samples
##    2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6578, 6578, 6580, 6578, 6579, 6580, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   6266.243  0.155472  3871.255
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## Call:
## lm(formula = Customer_Lifetime_Value ~ Monthly_Premium_Auto + 
##     Total_Claim_Amount, data = trainData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12225  -3516   -909    253  56787 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          724.9909   212.4388   3.413 0.000647 ***
## Monthly_Premium_Auto  82.3935     2.7434  30.034  < 2e-16 ***
## Total_Claim_Amount    -0.9586     0.3258  -2.942 0.003269 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6274 on 7307 degrees of freedom
## Multiple R-squared:  0.1542, Adjusted R-squared:  0.154 
## F-statistic: 666.1 on 2 and 7307 DF,  p-value: < 2.2e-16

7. Calculating the Variance Inflation Factor for each feature in the cross-validation model

## Monthly_Premium_Auto   Total_Claim_Amount 
##             1.667533             1.667533

8. Assessing the model on new data (holdout sample)
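Holdout metrics like the ones below can be computed from the model's predictions on the unseen sample. A self-contained sketch with synthetic stand-ins (in the document, `fit` would be `step_lm` and `hold_d` the holdout data):

```r
# Synthetic train/holdout stand-ins
set.seed(2)
train_d <- data.frame(x = rnorm(400))
train_d$y <- 3 * train_d$x + rnorm(400)
hold_d <- data.frame(x = rnorm(100))
hold_d$y <- 3 * hold_d$x + rnorm(100)

fit  <- lm(y ~ x, data = train_d)
pred <- predict(fit, newdata = hold_d)

# RMSE and R-squared on the holdout sample
rmse_holdout <- sqrt(mean((hold_d$y - pred)^2))
rsq_holdout  <- 1 - sum((hold_d$y - pred)^2) /
  sum((hold_d$y - mean(hold_d$y))^2)
cat("RMSE:", rmse_holdout, "\nR-squared:", rsq_holdout, "\n")
```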

## RMSE: 6433.889
## R-squared 0.1738726

The model performs even slightly better on data it has not seen before.

Visualizing the absolute value of the residuals vs the predicted values

Visualizing Observed vs predicted values on holdout sample

Visualizing the standard residuals

Visualizing the Partial Residual Plots

Section 6: Improving the model using K-nearest neighbors

Previously, we found that the relationship between Customer_Lifetime_Value (the target) and Monthly_Premium_Auto (a feature) changes sharply when the number of policies is 2. We suspect the model's accuracy could improve with an algorithm that can capture this non-linear relationship.

1. Building the model: KNN algorithm

# Load the kknn package
library(kknn)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
# Prepare the data
Data_KNN <- model.matrix(~ -1 + Monthly_Premium_Auto + Coverage + Vehicle_Class + Total_Claim_Amount + Number_of_Policies, data=Data)

Data_KNN_std <- scale(Data_KNN)

# Reuse the earlier trainIndex for the train/holdout split
# trainIndex_knn <- createDataPartition(Data$Response, p=0.8, list = FALSE)
trainData_knn <- data.frame(Data_KNN_std[trainIndex, ,drop=FALSE])
holdoutData_knn <- data.frame(Data_KNN_std[-trainIndex, ,drop=FALSE])

# Extract the corresponding 'Customer_Lifetime_Value' values for the train data
train_outcome <- Data[trainIndex,]$Customer_Lifetime_Value

# Add the train_outcome to the trainData_knn data frame
trainData_knn$Customer_Lifetime_Value <- train_outcome

# Train the KNN regression model and make predictions for the holdout data
knn_regression <- kknn(Customer_Lifetime_Value ~ ., train = trainData_knn, test = holdoutData_knn, k = 20)

# Get the predicted values from the model
knn_pred <- knn_regression$fitted.values

# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)

# Add the knn_pred_matrix as a new column to the holdoutData data frame
holdoutData <- cbind(holdoutData, knn_pred_matrix)

# Set the column name for the new column
colnames(holdoutData)[ncol(holdoutData)] <- "KNN_Alone_Predicted_Outcome"

2. Assess the K-nearest neighbors model on the holdout sample
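The attach messages below suggest the Metrics package was used for the evaluation. A sketch with synthetic values (in the document, `actual` would be the holdout Customer_Lifetime_Value and `predicted` the KNN fitted values):

```r
library(Metrics)

# Synthetic stand-ins for observed and predicted outcomes
set.seed(3)
actual    <- rnorm(100, mean = 8000, sd = 2000)
predicted <- actual + rnorm(100, sd = 500)

rmse(actual, predicted)  # root-mean-squared error
1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)  # R-squared
```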

## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## R-squared: 0.5573309
## RMSE: 4703.382

The KNN model performs substantially better on the holdout sample.

3. Visualizing Observed vs predicted values on holdout sample

Now let’s try using the KNN predictions as a feature in the multiple linear regression model, to see whether that improves the model even further.

4. Using KNN predictions as an engineered feature

Data_KNN <- model.matrix(~ -1 + Monthly_Premium_Auto + Coverage + Vehicle_Class + Total_Claim_Amount + Number_of_Policies, data=Data)

# Standardize
Data_KNN_std <- scale(Data_KNN)

# Convert to a data frame
Data_KNN_std <- data.frame(Data_KNN_std[, , drop = FALSE])

# Extract the corresponding 'Customer_Lifetime_Value' values
train_outcome <- Data$Customer_Lifetime_Value

# Add the outcome to the Data_KNN_std data frame
Data_KNN_std$Customer_Lifetime_Value <- train_outcome

knn_on_data <- kknn(Customer_Lifetime_Value ~ ., train = Data_KNN_std, test = Data_KNN_std, k = 20)


# Get the predicted values from the model
knn_pred <- knn_on_data$fitted.values

# Convert the knn_pred vector to a matrix
knn_pred_matrix <- matrix(knn_pred, ncol = 1)

# Add the knn_pred_matrix as a new column to the Data data frame
Data <- cbind(Data, knn_pred_matrix)

# Set the column name for the new column
colnames(Data)[ncol(Data)] <- "Predicted_Outcome"

5. 10-fold cross-validation model

##        RMSE  Rsquared      MAE Resample
## 1  3196.165 0.7310191 1583.584   Fold01
## 2  3488.560 0.7143012 1692.963   Fold02
## 3  3846.960 0.6879247 1813.171   Fold03
## 4  3451.232 0.7283497 1679.809   Fold04
## 5  3816.549 0.6963893 1791.272   Fold05
## 6  3152.503 0.7621065 1551.005   Fold06
## 7  3664.483 0.7268223 1747.670   Fold07
## 8  4520.429 0.6343727 2000.154   Fold08
## 9  4061.948 0.6425909 1826.314   Fold09
## 10 4041.724 0.6968264 1941.774   Fold10
## Linear Regression 
## 
## 7310 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 6578, 6578, 6580, 6578, 6579, 6580, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3724.055  0.7020703  1762.771
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

This is a huge improvement. Now we are talking! But let’s not get ahead of ourselves: it is important to check other factors that validate the model, such as the VIF and, most importantly, its performance on the holdout sample.

6. Calculating the Variance Inflation Factor for each feature in the cross-validation model

## Monthly_Premium_Auto   Total_Claim_Amount    Predicted_Outcome 
##             2.041001             1.669735             1.339379

There is no multicollinearity.

7. Assessing the model on new data (holdout sample)

## RMSE: 3994.279
## R-squared 0.6813194

Now this is a very good model, with a whopping 0.6813 R-squared on the holdout sample.

8. Visualizing Observed vs predicted values on holdout sample

As can be seen, the model is quite accurate in predicting customer lifetime value.

9. Visualizing the absolute value of the residuals vs the predicted values

We can conclude that creating an ensemble model, which combines the KNN feature with our original multiple linear regression model, significantly improves our predictive performance. This improvement is attributed to the ensemble model’s ability to capture non-linear relationships that the initial linear model was unable to detect.

Section 7: Observations

  1. A strong relationship between the coverage (Basic, Extended, or Premium) and CLV exists, with higher CLV for more premium coverages.

  2. Vehicle Class also impacts CLV, with luxury cars having higher CLV.

  3. Monthly Premium Auto, the amount paid monthly for auto insurance, had the strongest correlation to CLV.

  4. An inverse correlation was observed between Total Claim Amount and CLV.

  5. The Number of Policies feature showed that customers with two policies had significantly higher CLV, affecting linearity.

  6. Due to multicollinearity between Coverage, Vehicle Class, and Monthly Premium Auto, only the most promising feature (Monthly Premium Auto) was retained for the model.

Section 8: Conclusion and Recommendation

Our model achieves an RMSE of 3994.279 and an R-squared of 0.6813 on the holdout sample. This performance is strong, so the model can help inform business decisions aimed at increasing CLV.

This strong performance stems from identifying the irregular behavior associated with customers holding two policies. We recommend investigating its cause, such as a special offer for purchasing two policies. By understanding and addressing this behavior, we can develop an even more accurate model to predict CLV.