import pandas as pd # library for working with dataframes
import numpy as np # library for working with arrays
import matplotlib.pyplot as plt # low level visualization library
%matplotlib inline
import seaborn as sns # higher level visualization library compared to matplotlib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, roc_auc_score
from IPython.display import Image
import pydotplus as pydot
from sklearn import tree
from os import system
from yellowbrick.classifier import ClassificationReport, ROCAUC
plt.style.use('ggplot')
pd.options.display.float_format = '{:,.2f}'.format
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
# read the csv file into a dataframe
# (bank-full.csv: UCI bank-marketing data; semicolon-delimited variants exist —
# presumably this copy is comma-delimited, verify against the file)
df=pd.read_csv("bank-full.csv")
# Bank client data:
# Preview the first five rows of the data.
df.head()
# 'Target' is our dependent variable; the rest are the independent variables.
df.shape
# there are 45211 rows and 17 columns in the dataframe.
df.info()
# There are continuous and non-continuous variables in the dataframe:
# age, balance, day, duration, campaign, pdays, previous are continuous.
df.describe().transpose()
# Observations
# - Average age is around 40-41.
# - Average balance in account is 1362 and there are members with negative balance.
# - The five-point summary (mean, median, quartiles, ...) covers all continuous
#   variables and none of their values are strings, so these columns hold
#   clean numeric data.
df.isnull().values.any()
# There are no null values in the dataframe.
# Count the distinct values in every column.
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
# Age Box plot
sns.boxplot(x=df['age'])
# Balance Box plot
sns.boxplot(x=df['balance'])
# There are outliers in the balance data.
# Day Box plot
sns.boxplot(x=df['day'])
# duration Box plot
sns.boxplot(x=df['duration'])
# There are outliers in duration.
# campaign Box plot
sns.boxplot(x=df['campaign'])
# There are outliers in campaign data.
# pdays Box plot
sns.boxplot(x=df['pdays'])
# pdays is highly skewed.
# previous Box plot (original comment wrongly said "pdays")
sns.boxplot(x=df['previous'])
# previous is highly skewed.
df['pdays'].value_counts(normalize= True)
# pdays is highly skewed: out of 45211 values, 36954 have the value -1
# (-1 means the person has not been contacted, or the contact was beyond 900 days ago).
df['previous'].value_counts(normalize=True)
# previous is highly skewed: out of 45211 values, 36954 have the value 0.
# Both columns are dominated by a single sentinel value, so drop them.
df.drop(['pdays','previous'], axis=1,inplace = True)
df.head()
# converting variables to categorical variables
# Cast every object-dtype (string) column to pandas' categorical dtype,
# then confirm the dtypes with info().
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')
df.info()
# Examine each categorical variable and its level counts.
df['job'].value_counts()
df['marital'].value_counts()
df['education'].value_counts()
df['default'].value_counts()
df['housing'].value_counts()
df['loan'].value_counts()
df['contact'].value_counts()
df['month'].value_counts()
# Most of the people were contacted in the month of May;
# Dec, Mar, Sep, Oct were the least-contacted months.
df['poutcome'].value_counts()
df['Target'].value_counts()
# Correlation heatmap of the numeric columns.
plt.figure(figsize=(10,8))
# numeric_only=True: pandas >= 2.0 raises a TypeError on the categorical
# columns without it; older pandas silently dropped them, so the displayed
# result is unchanged.
sns.heatmap(df.corr(numeric_only=True),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False)
plt.show()
# We don't want any highly correlated variables; we would keep only one column
# from each highly correlated group. No highly correlated pairs are seen here.
sns.scatterplot(x="age", y="balance", hue='Target', data=df)
# When balance had increased the person had not taken a term deposit.
sns.scatterplot(x="day", y="duration", hue='Target', data=df)
sns.catplot(x="Target", y="balance", hue="marital", kind="bar", data=df);
sns.catplot(x="Target", y="balance", hue="job", kind="bar", data=df);
sns.catplot(x="Target", y="balance", hue="housing", kind="bar", data=df);
# NOTE(review): sns.distplot is deprecated (removed in recent seaborn);
# sns.histplot / sns.displot are the modern replacements.
sns.distplot(df['age'])
sns.distplot(df['balance'])
# balance seems to be highly skewed.
# NOTE(review): this loop repeats the object->categorical conversion already
# performed above, so it is a no-op the second time through.
# (Indentation of the loop body was lost in the notebook export.)
for feature in df.columns:
if df[feature].dtype == 'object':
df[feature] = pd.Categorical(df[feature])
df.head()
# finding out unique values per column
df.nunique()
df.dtypes
df['job'].value_counts()
df['education'].value_counts()
df['poutcome'].value_counts()
df.skew()
# balance is highly skewed, due to outliers.
# Winsorize each skewed continuous column by clamping it to its own
# inter-quartile range, then re-check the skew.
# NOTE(review): clipping at Q1/Q3 themselves (rather than the usual
# Q1 - 1.5*IQR / Q3 + 1.5*IQR fences) flattens half the data; the original
# behaviour is preserved here, just expressed with Series.clip instead of
# two np.where passes.
Q1 = df["balance"].quantile(0.25)
Q3 = df["balance"].quantile(0.75)
df["balance"] = df["balance"].clip(lower=Q1, upper=Q3)
df['balance'].skew()
Q1 = df["duration"].quantile(0.25)
Q3 = df["duration"].quantile(0.75)
df["duration"] = df["duration"].clip(lower=Q1, upper=Q3)
df['duration'].skew()
Q1 = df["campaign"].quantile(0.25)
Q3 = df["campaign"].quantile(0.75)
df["campaign"] = df["campaign"].clip(lower=Q1, upper=Q3)
df['campaign'].skew()
df.skew()
# Scale the widest-range continuous columns to [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['balance', 'duration']] = scaler.fit_transform(df[['balance', 'duration']])
df
df.head()
# One-hot encode every categorical predictor.
oneHotCols=["job","marital","education","default", "housing", "loan", "contact", "month", "poutcome"]
df=pd.get_dummies(df, columns=oneHotCols)
# Separate predictors from the target.
# NOTE: df.pop also removes 'Target' from df itself (mutates df).
X = df.drop("Target" , axis=1)
y = df.pop("Target")
# 70/30 train-test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
## function to get confusion matrix in a proper format
def draw_cm( actual, predicted ):
    """Render the confusion matrix of *actual* vs *predicted* as an annotated heatmap."""
    labels = [0, 1]
    ax = sns.heatmap(confusion_matrix(actual, predicted),
                     annot=True, fmt='.2f',
                     xticklabels=labels, yticklabels=labels)
    ax.set_ylabel('Observed')
    ax.set_xlabel('Predicted')
    plt.show()
from sklearn.linear_model import LogisticRegression
# Fit a logistic-regression baseline on the training split.
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
# Predict on the held-out test split.
pred_logit = model.predict(X_test)
model_score = model.score(X_test, y_test)  # mean accuracy on the test set
print(model_score)
# Collect the four classification metrics; 'yes' (took the term deposit)
# is the positive class.
acc_logit = accuracy_score(y_test, pred_logit)
recall_logit = recall_score(y_test, pred_logit, pos_label="yes")
precision_logit = precision_score(y_test, pred_logit,pos_label="yes" )
f1_logit = f1_score(y_test, pred_logit, pos_label="yes")
# BUG FIX: the 'recall' column previously stored f1_logit; it now stores recall_logit.
resultsDf = pd.DataFrame({'Method':['Logistic Regression'], 'accuracy': acc_logit, 'recall': recall_logit, 'precision': precision_logit , 'f1_score' : f1_logit })
resultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_logit, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_logit)
# ROC curve / AUC for the logistic model (yellowbrick fits a fresh estimator).
roc = ROCAUC(LogisticRegression(solver="liblinear"))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance (precision/recall/f1 per class) with yellowbrick.
viz = ClassificationReport(LogisticRegression(solver="liblinear"))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
# NOTE(review): dTree and dTreeR share the same configuration apart from
# random_state; dTree is kept only for its printed train/test accuracies,
# the tabulated metrics below come from dTreeR.
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=22, max_depth=4)
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
dTreeR = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
pred_dTreeR = dTreeR.predict(X_test)
acc_dTreeR = accuracy_score(y_test, pred_dTreeR)
recall_dTreeR = recall_score(y_test, pred_dTreeR, pos_label="yes")
precision_dTreeR = precision_score(y_test, pred_dTreeR, pos_label="yes")
f1_dTreeR = f1_score(y_test, pred_dTreeR, pos_label="yes")
# BUG FIX: the 'recall' column previously stored f1_dTreeR; it now stores recall_dTreeR.
tempResultsDf = pd.DataFrame({'Method':['DecisionTree'], 'accuracy': acc_dTreeR, 'recall': recall_dTreeR, 'precision': precision_dTreeR , 'f1_score' : f1_dTreeR })
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_dTreeR, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_dTreeR)
# FIX: visualize the same configuration whose metrics were tabulated above
# (dTreeR: entropy, depth 4, random_state=1); the original used criterion
# 'gini' here, so the plots described a different model.
roc = ROCAUC(DecisionTreeClassifier(criterion = "entropy", max_depth=4, random_state=1))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance with yellowbrick library
viz = ClassificationReport(DecisionTreeClassifier(criterion = "entropy", max_depth=4, random_state=1))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
from sklearn.ensemble import RandomForestClassifier
# FIX: added random_state=22 so the forest (and the tabulated metrics) are
# reproducible, consistent with the other ensemble models in this notebook.
rfcl = RandomForestClassifier(criterion = 'gini' , n_estimators = 50, random_state=22)
rfcl = rfcl.fit(X_train, y_train)
print(rfcl.score(X_train, y_train))
print(rfcl.score(X_test, y_test))
pred_rfcl = rfcl.predict(X_test)
# Classification metrics with 'yes' as the positive class.
acc_rfcl = accuracy_score(y_test, pred_rfcl)
recall_rfcl = recall_score(y_test, pred_rfcl, pos_label="yes")
precision_rfcl = precision_score(y_test, pred_rfcl, pos_label="yes")
f1_rfcl = f1_score(y_test, pred_rfcl, pos_label="yes")
tempResultsDf = pd.DataFrame({'Method':['RandomForest'], 'accuracy': acc_rfcl, 'recall':recall_rfcl, 'precision': precision_rfcl , 'f1_score' : f1_rfcl })
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_rfcl, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_rfcl)
# FIX: added random_state=22 so the visualized forest is reproducible and
# matches the configuration used elsewhere in this notebook.
roc = ROCAUC(RandomForestClassifier(criterion = 'gini' , n_estimators = 50, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance with yellowbrick library
viz = ClassificationReport(RandomForestClassifier(criterion = 'gini' , n_estimators = 50, random_state=22))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
from sklearn.ensemble import AdaBoostClassifier
# Fit an AdaBoost ensemble (50 stages, conservative learning rate, fixed seed).
abcl = AdaBoostClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22)
abcl = abcl.fit(X_train, y_train)
print(abcl.score(X_train, y_train))
print(abcl.score(X_test, y_test))
pred_abcl = abcl.predict(X_test)
# Classification metrics with 'yes' as the positive class.
acc_abcl = accuracy_score(y_test, pred_abcl)
recall_abcl = recall_score(y_test, pred_abcl, pos_label="yes")
precision_abcl = precision_score(y_test, pred_abcl, pos_label="yes")
f1_abcl = f1_score(y_test, pred_abcl, pos_label="yes")
tempResultsDf = pd.DataFrame({'Method':['AdaBoost'], 'accuracy': acc_abcl, 'recall':recall_abcl, 'precision': precision_abcl , 'f1_score' : f1_abcl })
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_abcl, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_abcl)
# FIX: use the same hyperparameters as the tabulated abcl model
# (the original omitted learning_rate=0.1 and random_state=22 here,
# so the plots described a different, non-reproducible model).
roc = ROCAUC(AdaBoostClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance with yellowbrick library
viz = ClassificationReport(AdaBoostClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
from sklearn.ensemble import BaggingClassifier
# Bagging ensemble: 50 estimators on 70% bootstrap samples; oob_score=True
# lets the out-of-bag rows serve as an internal validation set.
bgcl = BaggingClassifier( n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl = bgcl.fit(X_train, y_train)
print(bgcl.score(X_train, y_train))
print(bgcl.score(X_test, y_test))
pred_bgcl = bgcl.predict(X_test)
# Classification metrics with 'yes' as the positive class.
acc_bgcl = accuracy_score(y_test, pred_bgcl)
recall_bgcl = recall_score(y_test, pred_bgcl, pos_label="yes")
precision_bgcl = precision_score(y_test, pred_bgcl, pos_label="yes")
f1_bgcl = f1_score(y_test, pred_bgcl, pos_label="yes")
tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': acc_bgcl, 'recall':recall_bgcl, 'precision': precision_bgcl , 'f1_score' : f1_bgcl })
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_bgcl, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_bgcl)
# BUG FIX: this section evaluates the *Bagging* model, but the original code
# constructed a GradientBoostingClassifier here — which is also not imported
# until later in the file, so running top-to-bottom raised a NameError.
# Both visualizers now use the same configuration as the tabulated bgcl model.
roc = ROCAUC(BaggingClassifier(n_estimators=50, max_samples=.7, bootstrap=True, oob_score=True, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance with yellowbrick library
viz = ClassificationReport(BaggingClassifier(n_estimators=50, max_samples=.7, bootstrap=True, oob_score=True, random_state=22))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
from sklearn.ensemble import GradientBoostingClassifier
# Gradient-boosting ensemble: 50 stages, conservative learning rate, fixed seed.
gbcl = GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22)
gbcl = gbcl.fit(X_train, y_train)
print(gbcl.score(X_train, y_train))
print(gbcl.score(X_test, y_test))
pred_gbcl = gbcl.predict(X_test)
# Classification metrics with 'yes' as the positive class.
acc_gbcl = accuracy_score(y_test, pred_gbcl)
recall_gbcl = recall_score(y_test, pred_gbcl, pos_label="yes")
precision_gbcl = precision_score(y_test, pred_gbcl, pos_label="yes")
f1_gbcl = f1_score(y_test, pred_gbcl, pos_label="yes")
tempResultsDf = pd.DataFrame({'Method':['GradientBoost'], 'accuracy': acc_gbcl, 'recall':recall_gbcl, 'precision': precision_gbcl , 'f1_score' : f1_gbcl })
# Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
# Confusion matrix
pd.crosstab(y_test, pred_gbcl, rownames=['Actual'], colnames=['Predicted'])
draw_cm(y_test, pred_gbcl)
# ROC/AUC for the gradient-boosting model (same hyperparameters as gbcl).
roc = ROCAUC( GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()
# Visualize model performance with yellowbrick library
viz = ClassificationReport( GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
# Final side-by-side comparison of all models.
resultsDf.reset_index(drop=True)
# Conclusion: most models reach roughly 0.90 accuracy. To maximise reach to
# potential term-deposit customers we must minimise false negatives — customers
# who would actually subscribe but are labelled as non-potential. On that
# criterion we select the Bagging classifier, which also has the most
# favourable f1 score.