Term Deposit Sale

In [1610]:
import pandas as pd                                             # library for working with dataframes
import numpy as np                                              # library for working with arrays
import matplotlib.pyplot as plt                                 # low level visualization library
%matplotlib inline
import seaborn as sns                                           # higher level visualization library compared to matplotlib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, roc_auc_score
from IPython.display import Image  
import pydotplus as pydot
from sklearn import tree
from os import system

from yellowbrick.classifier import ClassificationReport, ROCAUC
plt.style.use('ggplot')
pd.options.display.float_format = '{:,.2f}'.format
from IPython.display import display, HTML                       # IPython.core.display is deprecated for these names
display(HTML("<style>.container { width:95% !important; }</style>"))
In [1611]:
# read the csv file into a dataframe
df=pd.read_csv("bank-full.csv")

Bank client data:

  1. age: Continuous feature
  2. job: Type of job (management, technician, entrepreneur, blue-collar, etc.)
  3. marital: marital status (married, single, divorced)
  4. education: education level (primary, secondary, tertiary)
  5. default: has credit in default?
  6. housing: has housing loan?
  7. loan: has personal loan?
  8. balance: balance in account

Related to previous contact:

  9. contact: contact communication type
  10. month: last contact month of year
  11. day: last contact day of the month
  12. duration: last contact duration, in seconds

Other attributes:

  13. campaign: number of contacts performed during this campaign and for this client
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 tells us the person has not been contacted or contact period is beyond 900 days)
  15. previous: number of contacts performed before this campaign and for this client
  16. poutcome: outcome of the previous marketing campaign

Output variable (desired target):

  17. Target: whether the client has subscribed to a term deposit (yes, no)

Deliverable - 1 Exploratory Data Quality report

1 Univariate Analysis (12 )

In [1612]:
df.head()
# Target is our dependent variable; the rest are the independent variables.
Out[1612]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

1.a Checking the number of rows and columns in the dataset

In [1613]:
df.shape
#there are 45211 rows and 17 columns in the dataframe.
Out[1613]:
(45211, 17)

1.a Finding the data types of variables

In [1614]:
df.info()
# There are continuous and non-continuous variables in the dataframe
# age, balance, day, duration, campaign, pdays, previous are continuous variables.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  Target     45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB

1.a Finding the five-point summary of continuous variables

In [1615]:
df.describe().transpose()
# Observations
# Average age is around 40-41.
# Average account balance is 1,362, and some clients have a negative balance.
# The table gives the mean, median (50%), quartiles, etc. of all continuous variables.

# There are no missing values in the continuous columns: every column has a
# count of 45,211, which equals the number of rows, and none of the values
# is a string.
Out[1615]:
count mean std min 25% 50% 75% max
age 45,211.00 40.94 10.62 18.00 33.00 39.00 48.00 95.00
balance 45,211.00 1,362.27 3,044.77 -8,019.00 72.00 448.00 1,428.00 102,127.00
day 45,211.00 15.81 8.32 1.00 8.00 16.00 21.00 31.00
duration 45,211.00 258.16 257.53 0.00 103.00 180.00 319.00 4,918.00
campaign 45,211.00 2.76 3.10 1.00 1.00 2.00 3.00 63.00
pdays 45,211.00 40.20 100.13 -1.00 -1.00 -1.00 -1.00 871.00
previous 45,211.00 0.58 2.30 0.00 0.00 0.00 0.00 275.00

1.a Finding if there are any null values

In [1616]:
df.isnull().values.any()
# There are no null values in the dataframe.
Out[1616]:
False
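
Note that `isnull()` only detects real missing values (NaN). Several categorical columns in this dataset instead use the literal string "unknown", as the `df.head()` output shows for job, education, contact and poutcome. A minimal sketch of how such placeholders could be counted, using a small made-up frame named `sample` (hypothetical, not part of the notebook):

```python
import pandas as pd

# Tiny stand-in for the bank data; in the notebook this would be `df`.
sample = pd.DataFrame({
    "job": ["management", "unknown", "technician", "unknown"],
    "education": ["tertiary", "unknown", "secondary", "primary"],
    "age": [58, 33, 44, 47],
})

# Count how often the placeholder string "unknown" appears in each column.
# Comparing a numeric column with a string simply yields False everywhere.
unknown_counts = {
    col: int((sample[col] == "unknown").sum()) for col in sample.columns
}
print(unknown_counts)
```

Whether "unknown" should be treated as its own category or as missing data is a modelling decision; the sketch only makes the placeholders visible.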
In [1617]:
# Listing all unique values
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
     Variable  DistinctCount
0         age             77
1         job             12
2     marital              3
3   education              4
4     default              2
5     balance           7168
6     housing              2
7        loan              2
8     contact              3
9         day             31
10      month             12
11   duration           1573
12   campaign             48
13      pdays            559
14   previous             41
15   poutcome              4
16     Target              2

1.a Checking characteristics of the independent variables.

In [1618]:
# Age Box plot
sns.boxplot(x=df['age'])
Out[1618]:
<matplotlib.axes._subplots.AxesSubplot at 0x24522753488>
In [1619]:
# Balance Box plot
sns.boxplot(x=df['balance'])
#There are outliers in the balance data. 
Out[1619]:
<matplotlib.axes._subplots.AxesSubplot at 0x245225e15c8>
In [1620]:
# Day Box plot
sns.boxplot(x=df['day'])
Out[1620]:
<matplotlib.axes._subplots.AxesSubplot at 0x24522561708>
In [1621]:
# duration Box plot
sns.boxplot(x=df['duration'])
#There are outliers in duration
Out[1621]:
<matplotlib.axes._subplots.AxesSubplot at 0x245224e4308>
In [1622]:
# campaign Box plot
sns.boxplot(x=df['campaign'])
#There are outliers in campaign data
Out[1622]:
<matplotlib.axes._subplots.AxesSubplot at 0x2452240ee48>
In [1623]:
# pdays Box plot
sns.boxplot(x=df['pdays'])
#pdays is highly skewed
Out[1623]:
<matplotlib.axes._subplots.AxesSubplot at 0x24522680f48>
In [1624]:
# previous Box plot
sns.boxplot(x=df['previous'])
#previous is highly skewed
Out[1624]:
<matplotlib.axes._subplots.AxesSubplot at 0x24522451908>
In [1625]:
df['pdays'].value_counts(normalize= True)
#pdays is highly skewed: out of 45211 values, 36954 have the
#value -1 (-1 tells us the person has not been contacted or the contact period is beyond 900 days)
Out[1625]:
-1     0.82
 182   0.00
 92    0.00
 183   0.00
 91    0.00
       ... 
 749   0.00
 717   0.00
 589   0.00
 493   0.00
 32    0.00
Name: pdays, Length: 559, dtype: float64
In [1626]:
df['previous'].value_counts(normalize=True)
#previous is highly skewed: out of 45211 values, 36954 have the value 0
Out[1626]:
0     0.82
1     0.06
2     0.05
3     0.03
4     0.02
5     0.01
6     0.01
7     0.00
8     0.00
9     0.00
10    0.00
11    0.00
12    0.00
13    0.00
15    0.00
14    0.00
17    0.00
16    0.00
19    0.00
23    0.00
20    0.00
22    0.00
18    0.00
24    0.00
27    0.00
29    0.00
25    0.00
21    0.00
30    0.00
28    0.00
26    0.00
37    0.00
38    0.00
55    0.00
40    0.00
35    0.00
58    0.00
51    0.00
41    0.00
32    0.00
275   0.00
Name: previous, dtype: float64
In [1627]:
# dropping pdays and previous from the dataframe

df.drop(['pdays','previous'], axis=1,inplace = True)
df.head()
Out[1627]:
age job marital education default balance housing loan contact day month duration campaign poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 unknown no
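
Dropping both columns also discards the information that a client was contacted in an earlier campaign. One possible alternative, shown only as a sketch and not what this notebook does, is to keep that information as a binary flag. The column name `contacted_before` and the small `sample` frame are assumptions for illustration.

```python
import pandas as pd

# Made-up pdays values; -1 is the sentinel for "not previously contacted".
sample = pd.DataFrame({"pdays": [-1, 182, -1, 92, -1]})

# Turn the sentinel into an explicit 0/1 flag instead of a fake day count.
sample["contacted_before"] = (sample["pdays"] != -1).astype(int)

# Share of clients that were never contacted in a previous campaign.
share_never_contacted = (sample["pdays"] == -1).mean()
print(sample)
print(share_never_contacted)
```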
In [1628]:
# converting variables to categorical variables
for feature in df.columns:
    if df[feature].dtype == 'object':
        df[feature] = pd.Categorical(df[feature])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   age        45211 non-null  int64   
 1   job        45211 non-null  category
 2   marital    45211 non-null  category
 3   education  45211 non-null  category
 4   default    45211 non-null  category
 5   balance    45211 non-null  int64   
 6   housing    45211 non-null  category
 7   loan       45211 non-null  category
 8   contact    45211 non-null  category
 9   day        45211 non-null  int64   
 10  month      45211 non-null  category
 11  duration   45211 non-null  int64   
 12  campaign   45211 non-null  int64   
 13  poutcome   45211 non-null  category
 14  Target     45211 non-null  category
dtypes: category(10), int64(5)
memory usage: 2.2 MB
In [1629]:
# examining all categorical varables and its counts
df['job'].value_counts()
Out[1629]:
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64
In [1630]:
df['marital'].value_counts()
Out[1630]:
married     27214
single      12790
divorced     5207
Name: marital, dtype: int64
In [1631]:
df['education'].value_counts()
Out[1631]:
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64
In [1632]:
df['default'].value_counts()
Out[1632]:
no     44396
yes      815
Name: default, dtype: int64
In [1633]:
df['housing'].value_counts() 
Out[1633]:
yes    25130
no     20081
Name: housing, dtype: int64
In [1634]:
df['loan'].value_counts()
Out[1634]:
no     37967
yes     7244
Name: loan, dtype: int64
In [1635]:
df['contact'].value_counts()
Out[1635]:
cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64
In [1636]:
df['month'].value_counts()
# Most clients were contacted in May.
# Dec, Mar, Sep and Oct had the fewest contacts.
Out[1636]:
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: month, dtype: int64
In [1637]:
df['poutcome'].value_counts()
Out[1637]:
unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64
In [1638]:
df['Target'].value_counts()
Out[1638]:
no     39922
yes     5289
Name: Target, dtype: int64
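
The counts above show a strongly imbalanced target. The arithmetic below uses only those two counts to compute the share of subscribers and the accuracy that a model predicting "no" for every client would reach. It is meant as a reference point for reading the accuracy scores of the models later on.

```python
# Class counts taken from the value_counts() output above.
counts = {"no": 39922, "yes": 5289}

total = sum(counts.values())
positive_rate = counts["yes"] / total            # share of clients who subscribed
baseline_accuracy = max(counts.values()) / total  # accuracy of always predicting "no"

print(f"positive rate:     {positive_rate:.3f}")      # 0.117
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.883
```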

2. Multivariate Analysis (8 )

In [1639]:
plt.figure(figsize=(10,8))

sns.heatmap(df.select_dtypes(include="number").corr(),  # correlations of the numeric columns only
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False)

plt.show()
# We do not want highly correlated variables; from a group of highly correlated columns we would keep only one. Here no pair of variables is highly correlated.
In [1640]:
sns.scatterplot(x="age", y="balance", hue='Target', data=df)

# Clients with very high balances have mostly not taken a term deposit.
Out[1640]:
<matplotlib.axes._subplots.AxesSubplot at 0x245223eab08>
In [1641]:
sns.scatterplot(x="day", y="duration", hue='Target', data=df)
Out[1641]:
<matplotlib.axes._subplots.AxesSubplot at 0x24521ce4b08>
In [1642]:
sns.catplot(x="Target",  y="balance", hue="marital", kind="bar", data=df);
In [1643]:
sns.catplot(x="Target",  y="balance", hue="job", kind="bar", data=df);
In [1644]:
sns.catplot(x="Target",  y="balance", hue="housing", kind="bar", data=df);
In [1645]:
sns.distplot(df['age'])
Out[1645]:
<matplotlib.axes._subplots.AxesSubplot at 0x2452348a948>
In [1646]:
sns.distplot(df['balance'])
#balance seems to be highly skewed.
Out[1646]:
<matplotlib.axes._subplots.AxesSubplot at 0x2452528d408>
In [1647]:
for feature in df.columns:
    if df[feature].dtype == 'object':
        df[feature] = pd.Categorical(df[feature])
df.head()
Out[1647]:
age job marital education default balance housing loan contact day month duration campaign poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 unknown no
In [1648]:
#finding out unique values
df.nunique()
Out[1648]:
age            77
job            12
marital         3
education       4
default         2
balance      7168
housing         2
loan            2
contact         3
day            31
month          12
duration     1573
campaign       48
poutcome        4
Target          2
dtype: int64
In [1649]:
df.dtypes
Out[1649]:
age             int64
job          category
marital      category
education    category
default      category
balance         int64
housing      category
loan         category
contact      category
day             int64
month        category
duration        int64
campaign        int64
poutcome     category
Target       category
dtype: object

Deliverable – 2 (Prepare the data for analytics) – (10)

In [1653]:
df.skew()
#balance is highly skewed, due to outliers
Out[1653]:
age        0.68
balance    8.36
day        0.09
duration   3.14
campaign   4.90
dtype: float64

removing skewness

In [1654]:
# Clamp balance to the range [Q1, Q3]: values below Q1 become Q1 and values above Q3 become Q3
Q1 = df["balance"].quantile(0.25)
Q3 = df["balance"].quantile(0.75)
df["balance"] = np.where(df["balance"] < Q1 , Q1 ,df["balance"])
df["balance"] = np.where(df["balance"] > Q3, Q3,df["balance"])
df['balance'].skew()
Out[1654]:
0.40156307226290167
In [1655]:
Q1 = df["duration"].quantile(0.25)
Q3 = df["duration"].quantile(0.75)
df["duration"] = np.where(df["duration"] < Q1 , Q1 ,df["duration"])
df["duration"] = np.where(df["duration"] > Q3, Q3,df["duration"])
df['duration'].skew()
Out[1655]:
0.25707393018451163
In [1656]:
Q1 = df["campaign"].quantile(0.25)
Q3 = df["campaign"].quantile(0.75)
df["campaign"] = np.where(df["campaign"] < Q1 , Q1 ,df["campaign"])
df["campaign"] = np.where(df["campaign"] > Q3, Q3,df["campaign"])
df['campaign'].skew()
Out[1656]:
0.10031025929176021
In [1657]:
df.skew()
Out[1657]:
age        0.68
balance    0.40
day        0.09
duration   0.26
campaign   0.10
dtype: float64
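
The cells above clamp each variable to its own interquartile range, which can move up to about half of the values onto Q1 or Q3. A more common and gentler convention is to clip only at the Tukey fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR. The sketch below shows that variant on made-up numbers; the function name `clip_to_iqr_fences` and the factor `k=1.5` are assumptions, and this is not the method used in this notebook.

```python
import pandas as pd

def clip_to_iqr_fences(series, k=1.5):
    """Clip values that lie more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up data with one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
clipped = clip_to_iqr_fences(values)

# Q1 = 3, Q3 = 7, IQR = 4, so the fences are -3 and 13:
# only the value 100 is pulled in, the rest stay untouched.
print(clipped.tolist())
```

With this variant the bulk of the distribution keeps its shape and only the extreme tail is limited.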
In [1658]:
from sklearn.preprocessing  import MinMaxScaler

scaler = MinMaxScaler()
df[['balance', 'duration']] = scaler.fit_transform(df[['balance', 'duration']])
df
Out[1658]:
age job marital education default balance housing loan contact day month duration campaign poutcome Target
0 58 management married tertiary no 1.00 yes no unknown 5 may 0.73 1.00 unknown no
1 44 technician single secondary no 0.00 yes no unknown 5 may 0.22 1.00 unknown no
2 33 entrepreneur married secondary no 0.00 yes yes unknown 5 may 0.00 1.00 unknown no
3 47 blue-collar married unknown no 1.00 yes no unknown 5 may 0.00 1.00 unknown no
4 33 unknown single unknown no 0.00 no no unknown 5 may 0.44 1.00 unknown no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 51 technician married tertiary no 0.56 no no cellular 17 nov 1.00 3.00 unknown yes
45207 71 retired divorced primary no 1.00 no no cellular 17 nov 1.00 2.00 unknown yes
45208 72 retired married secondary no 1.00 no no cellular 17 nov 1.00 3.00 success yes
45209 57 blue-collar married secondary no 0.44 no no telephone 17 nov 1.00 3.00 unknown no
45210 37 entrepreneur married secondary no 1.00 no no cellular 17 nov 1.00 2.00 other no

45211 rows × 15 columns
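
Here the scaler is fitted on the full dataframe before the train/test split, so the minimum and maximum of the later test rows influence the scaling. A common way to avoid this, sketched below on random data, is to fit the scaler on the training part only and reuse it on the test part. The arrays are synthetic stand-ins, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-column feature matrix standing in for balance and duration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse them; test values may fall slightly outside [0, 1]

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```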

One hot encoding for categorical variables

In [1659]:
df.head()
Out[1659]:
age job marital education default balance housing loan contact day month duration campaign poutcome Target
0 58 management married tertiary no 1.00 yes no unknown 5 may 0.73 1.00 unknown no
1 44 technician single secondary no 0.00 yes no unknown 5 may 0.22 1.00 unknown no
2 33 entrepreneur married secondary no 0.00 yes yes unknown 5 may 0.00 1.00 unknown no
3 47 blue-collar married unknown no 1.00 yes no unknown 5 may 0.00 1.00 unknown no
4 33 unknown single unknown no 0.00 no no unknown 5 may 0.44 1.00 unknown no
In [1660]:
oneHotCols=["job","marital","education","default", "housing", "loan", "contact", "month", "poutcome"]
df=pd.get_dummies(df, columns=oneHotCols)

Split data

In [1661]:
X = df.drop("Target" , axis=1)
y = df.pop("Target")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
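
Because only about 12% of the clients subscribed, a purely random split can leave the train and test sets with slightly different shares of "yes". Passing `stratify=y` to `train_test_split` keeps the class proportions the same in both parts. The sketch below illustrates this on a made-up label vector; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10 "yes" and 90 "no".
y = np.array(["yes"] * 10 + ["no"] * 90)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

train_rate = (y_train == "yes").mean()
test_rate = (y_test == "yes").mean()
print(train_rate, test_rate)
```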

Deliverable – 3 (create the ensemble model) – (30)

In [1662]:
## function to draw the confusion matrix as an annotated heatmap
def draw_cm( actual, predicted ):
    labels = ['no', 'yes']                       # Target classes, in a fixed order
    cm = confusion_matrix( actual, predicted, labels=labels)
    sns.heatmap(cm, annot=True,  fmt='d', xticklabels = labels , yticklabels = labels )   # counts are integers
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()

Logistic regression

In [1663]:
from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
#predict on test
pred_logit = model.predict(X_test)
model_score = model.score(X_test, y_test)
print(model_score)
0.9018726039516367
In [1664]:
acc_logit = accuracy_score(y_test, pred_logit)
recall_logit =  recall_score(y_test, pred_logit, pos_label="yes")
precision_logit =  precision_score(y_test, pred_logit,pos_label="yes" )
f1_logit =  f1_score(y_test, pred_logit, pos_label="yes")
resultsDf = pd.DataFrame({'Method':['Logistic Regression'], 'accuracy': acc_logit, 'recall':recall_logit, 'precision': precision_logit , 'f1_score' : f1_logit })
resultsDf.reset_index(drop=True)
Out[1664]:
Method accuracy recall precision f1_score
0 Logistic Regression 0.90 0.27 0.68 0.39
In [1665]:
# Confusion matrix
pd.crosstab(y_test, pred_logit, rownames=['Actual'], colnames=['Predicted'])
Out[1665]:
Predicted no yes
Actual
no 11813 200
yes 1131 420
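
The four counts in this confusion matrix are enough to recompute every metric by hand, which is a quick sanity check on the results table. The sketch below uses only the numbers from the crosstab above, with "yes" as the positive class.

```python
# Counts read from the crosstab above (rows = actual, columns = predicted).
tn, fp = 11813, 200   # actual "no":  predicted no / predicted yes
fn, tp = 1131, 420    # actual "yes": predicted no / predicted yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of actual subscribers that were found
precision = tp / (tp + fp)     # share of predicted subscribers that were right
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
# accuracy=0.90 recall=0.27 precision=0.68 f1=0.39
```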
In [1666]:
draw_cm(y_test, pred_logit)
In [1667]:
roc = ROCAUC(LogisticRegression(solver="liblinear"))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport(LogisticRegression(solver="liblinear"))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
  FutureWarning)
Out[1667]:
<matplotlib.axes._subplots.AxesSubplot at 0x24519ad8488>

DecisionTree Classifier

In [1668]:
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=22, max_depth=4)
dTree.fit(X_train, y_train)
Out[1668]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=22, splitter='best')
In [1669]:
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
0.8947135589471356
0.8983338248304334
In [1670]:
dTreeR = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
0.8947135589471356
0.8983338248304334
In [1671]:
pred_dTreeR = dTreeR.predict(X_test)

acc_dTreeR = accuracy_score(y_test, pred_dTreeR)
recall_dTreeR =  recall_score(y_test, pred_dTreeR, pos_label="yes")
precision_dTreeR =  precision_score(y_test, pred_dTreeR, pos_label="yes")
f1_dTreeR =  f1_score(y_test, pred_dTreeR, pos_label="yes")

tempResultsDf = pd.DataFrame({'Method':['DecisionTree'], 'accuracy': acc_dTreeR, 'recall':recall_dTreeR, 'precision': precision_dTreeR , 'f1_score' : f1_dTreeR })
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
Out[1671]:
Method accuracy recall precision f1_score
0 DecisionTree 0.90 0.18 0.73 0.28
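
The same block of metric calls is repeated for every model, which makes it easy to pass the wrong variable into a column. A small helper can build the results row in one place. The function name `metrics_row` and the toy labels are hypothetical; the metric functions are the scikit-learn ones already imported at the top of the notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_row(method, y_true, y_pred, pos_label="yes"):
    """Return a one-row DataFrame with the same columns as resultsDf."""
    return pd.DataFrame({
        "Method": [method],
        "accuracy": [accuracy_score(y_true, y_pred)],
        "recall": [recall_score(y_true, y_pred, pos_label=pos_label)],
        "precision": [precision_score(y_true, y_pred, pos_label=pos_label)],
        "f1_score": [f1_score(y_true, y_pred, pos_label=pos_label)],
    })

# Toy labels: 1 true positive, 1 false negative, 1 false positive, 2 true negatives.
y_true = ["yes", "yes", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "yes"]
row = metrics_row("Toy example", y_true, y_pred)
print(row)
```

In the notebook, a call such as `metrics_row('DecisionTree', y_test, pred_dTreeR)` could then replace the hand-written dictionary for each model.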
In [1672]:
# Confusion matrix
pd.crosstab(y_test, pred_dTreeR, rownames=['Actual'], colnames=['Predicted'])
Out[1672]:
Predicted no yes
Actual
no 11911 102
yes 1277 274
In [1673]:
draw_cm(y_test, pred_dTreeR)
In [1674]:
# use the same settings as dTreeR so the plots describe the model evaluated above
roc = ROCAUC(DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport(DecisionTreeClassifier(criterion = 'entropy', max_depth = 4, random_state=1))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Out[1674]:
<matplotlib.axes._subplots.AxesSubplot at 0x24521c1ba08>

Random Forest Classifier

In [1675]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(criterion = 'gini' , n_estimators = 50)
rfcl = rfcl.fit(X_train, y_train)
print(rfcl.score(X_train, y_train))
print(rfcl.score(X_test, y_test))
0.9993680285651089
0.8977440283102329
In [1676]:
pred_rfcl = rfcl.predict(X_test)

acc_rfcl = accuracy_score(y_test, pred_rfcl)
recall_rfcl =  recall_score(y_test, pred_rfcl, pos_label="yes")
precision_rfcl =  precision_score(y_test, pred_rfcl, pos_label="yes")
f1_rfcl =  f1_score(y_test, pred_rfcl, pos_label="yes")

tempResultsDf = pd.DataFrame({'Method':['RandomForest'], 'accuracy': acc_rfcl, 'recall':recall_rfcl, 'precision': precision_rfcl , 'f1_score' : f1_rfcl })
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
Out[1676]:
Method accuracy recall precision f1_score
0 RandomForest 0.90 0.30 0.61 0.40
In [1677]:
# Confusion matrix
pd.crosstab(y_test, pred_rfcl, rownames=['Actual'], colnames=['Predicted'])
Out[1677]:
Predicted no yes
Actual
no 11706 307
yes 1080 471
In [1678]:
draw_cm(y_test, pred_rfcl)
In [1679]:
roc = ROCAUC(RandomForestClassifier(criterion = 'gini' , n_estimators = 50))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport(RandomForestClassifier(criterion = 'gini' , n_estimators = 50))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Out[1679]:
<matplotlib.axes._subplots.AxesSubplot at 0x24525cbe808>

AdaBoost Classifier

In [1680]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22)
abcl = abcl.fit(X_train, y_train)
print(abcl.score(X_train, y_train))
print(abcl.score(X_test, y_test))
0.8893102031788164
0.8919197876732528
In [1681]:
pred_abcl = abcl.predict(X_test)

acc_abcl = accuracy_score(y_test, pred_abcl)
recall_abcl =  recall_score(y_test, pred_abcl, pos_label="yes")
precision_abcl =  precision_score(y_test, pred_abcl, pos_label="yes")
f1_abcl =  f1_score(y_test, pred_abcl, pos_label="yes")

tempResultsDf = pd.DataFrame({'Method':['AdaBoost'], 'accuracy': acc_abcl, 'recall':recall_abcl, 'precision': precision_abcl , 'f1_score' : f1_abcl })
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
Out[1681]:
Method accuracy recall precision f1_score
0 AdaBoost 0.89 0.08 0.77 0.14
In [1682]:
# Confusion matrix
pd.crosstab(y_test, pred_abcl, rownames=['Actual'], colnames=['Predicted'])
Out[1682]:
Predicted no yes
Actual
no 11977 36
yes 1430 121
In [1683]:
draw_cm(y_test, pred_abcl)
In [1684]:
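# use the same settings as abcl so the plots describe the model evaluated above
roc = ROCAUC(AdaBoostClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport(AdaBoostClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))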
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Out[1684]:
<matplotlib.axes._subplots.AxesSubplot at 0x2452604a708>

Bagging Classifier (Bootstrap Aggregation)

In [1685]:
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier( n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl = bgcl.fit(X_train, y_train)
print(bgcl.score(X_train, y_train))
print(bgcl.score(X_test, y_test))
0.9921319556356053
0.8970805072250073
In [1686]:
pred_bgcl = bgcl.predict(X_test)

acc_bgcl = accuracy_score(y_test, pred_bgcl)
recall_bgcl =  recall_score(y_test, pred_bgcl, pos_label="yes")
precision_bgcl =  precision_score(y_test, pred_bgcl, pos_label="yes")
f1_bgcl =  f1_score(y_test, pred_bgcl, pos_label="yes")

tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': acc_bgcl, 'recall':recall_bgcl, 'precision': precision_bgcl , 'f1_score' : f1_bgcl })
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
Out[1686]:
Method accuracy recall precision f1_score
0 Bagging 0.90 0.33 0.59 0.43
In [1687]:
# Confusion matrix
pd.crosstab(y_test, pred_bgcl, rownames=['Actual'], colnames=['Predicted'])
Out[1687]:
Predicted no yes
Actual
no 11649 364
yes 1032 519
In [1688]:
draw_cm(y_test, pred_bgcl)
In [1689]:
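# plot the Bagging model of this section (same settings as bgcl), not gradient boosting
roc = ROCAUC(BaggingClassifier( n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport(BaggingClassifier( n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22))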
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Out[1689]:
<matplotlib.axes._subplots.AxesSubplot at 0x245266b0d08>

Gradient boost classifier

In [1690]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22)
gbcl = gbcl.fit(X_train, y_train)
print(gbcl.score(X_train, y_train))
print(gbcl.score(X_test, y_test))
0.9020760261636174
0.9008404600412857
In [1691]:
pred_gbcl = gbcl.predict(X_test)

acc_gbcl = accuracy_score(y_test, pred_gbcl)
recall_gbcl =  recall_score(y_test, pred_gbcl, pos_label="yes")
precision_gbcl =  precision_score(y_test, pred_gbcl, pos_label="yes")
f1_gbcl =  f1_score(y_test, pred_gbcl, pos_label="yes")

tempResultsDf = pd.DataFrame({'Method':['GradientBoost'], 'accuracy': acc_gbcl, 'recall':recall_gbcl, 'precision': precision_gbcl , 'f1_score' : f1_gbcl })
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.concat([resultsDf, tempResultsDf])
tempResultsDf.reset_index(drop=True)
Out[1691]:
Method accuracy recall precision f1_score
0 GradientBoost 0.90 0.25 0.68 0.37
In [1692]:
# Confusion matrix
pd.crosstab(y_test, pred_gbcl, rownames=['Actual'], colnames=['Predicted'])
Out[1692]:
Predicted no yes
Actual
no 11828 185
yes 1160 391
In [1693]:
draw_cm(y_test, pred_gbcl)
In [1694]:
roc = ROCAUC( GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show()

# Visualize model performance with yellowbrick library
viz = ClassificationReport( GradientBoostingClassifier( n_estimators = 50, learning_rate = 0.1, random_state=22))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Out[1694]:
<matplotlib.axes._subplots.AxesSubplot at 0x24527d15188>
In [1695]:
resultsDf.reset_index(drop=True)
Out[1695]:
Method accuracy recall precision f1_score
0 Logistic Regression 0.90 0.27 0.68 0.39
1 DecisionTree 0.90 0.18 0.73 0.28
2 RandomForest 0.90 0.30 0.61 0.40
3 AdaBoost 0.89 0.08 0.77 0.14
4 Bagging 0.90 0.33 0.59 0.43
5 GradientBoost 0.90 0.25 0.68 0.37

Observations

All models reach an accuracy of about 0.90 (AdaBoost 0.89), so accuracy does not separate them. To maximise the reach among potential term deposit customers we have to minimise false negatives, that is, customers who actually subscribe but are labelled as not being potential customers. On that basis we select the Bagging classifier: it has the fewest false negatives on the test set (1,032 of 1,551 actual subscribers) and also the highest f1 score (0.43).
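
If false negatives are the main concern, another lever besides the choice of model is the decision threshold. `predict` uses a probability cut-off of 0.5; lowering it flags more clients as potential subscribers, which can only keep recall the same or raise it, usually at the cost of precision. The sketch below illustrates this on synthetic data from `make_classification`, not on the bank data, and the thresholds 0.5, 0.3 and 0.1 are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 12% positives).
X, y = make_classification(n_samples=2000, weights=[0.88, 0.12], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

model = LogisticRegression(solver="liblinear").fit(X_train, y_train)
proba_yes = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

# Recall for progressively lower decision thresholds.
recalls = {}
for threshold in (0.5, 0.3, 0.1):
    pred = (proba_yes >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)

print(recalls)
```

Which threshold is appropriate depends on how costly an unnecessary call is compared with a missed subscriber; that trade-off is a business decision outside the scope of this notebook.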