Objective:
Given a bank customer's profile, build a neural network-based classifier that can determine whether the customer will leave the bank or not.
Context:
Businesses like banks that provide services have to worry about the problem of 'churn', i.e. customers leaving to join another service provider. It is important to understand which aspects of the service influence a customer's decision in this regard. Management can then concentrate improvement efforts on the service with these priorities in mind.
Data Description:
The case study uses an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, and Balance. Link to the Kaggle project site: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling
Points Distribution:
The points distribution for this case is as follows:
Read the dataset
Drop the columns which are unique for all users, such as IDs (5 points)
Distinguish the features and the target variable (5 points)
Divide the dataset into training and test sets (5 points)
Normalize the train and test data (10 points)
Initialize and build the model. Identify the points of improvement and implement them. Note that you need to demonstrate at least two models (the original and the improved one) and highlight the differences to complete this point. You can also demonstrate more models. (20 points)
Predict the results using 0.5 as the threshold. Note that you need to first predict the probabilities and then predict the classes using the given threshold (10 points)
Print the accuracy score and the confusion matrix (5 points)
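The probability-then-threshold step described above can be sketched with plain NumPy before touching the model (a minimal illustration with made-up probabilities, not tied to the dataset):

```python
import numpy as np

# Hypothetical predicted probabilities from a sigmoid output layer
proba = np.array([0.12, 0.57, 0.50, 0.49, 0.91])

# Apply the 0.5 threshold: probabilities >= 0.5 become class 1, the rest class 0
classes = (proba >= 0.5).astype("int32")
print(classes)  # → [0 1 1 0 1]
```

The same comparison-and-cast pattern is what the helper function later in the notebook applies to the model's predicted probabilities.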
Happy Learning!!
import tensorflow as tf
print(tf.__version__)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
data = pd.read_csv("bank.csv")
data.head()
data.shape
data = data.drop(["RowNumber","CustomerId","Surname"], axis = 1)
data.head()
data.isnull().any()
data.isna().any()
Here 'Exited' is the target variable and the rest are independent variables.
data["Geography"].unique()
# Label-encode the categorical columns (Geography and Gender) as integers
encoder = LabelEncoder()
data["Geography"] = encoder.fit_transform(data["Geography"])
data["Gender"] = encoder.fit_transform(data["Gender"])
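Label encoding imposes an arbitrary ordering on Geography (e.g. France < Germany < Spain), which a neural network can partly absorb but which is not ideal. A common alternative is one-hot encoding; a sketch on toy data (the column values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the Geography column of the churn data
toy = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# One-hot encode: one binary column per country, no implied ordering
encoded = pd.get_dummies(toy, columns=["Geography"])
print(encoded.columns.tolist())
# → ['Geography_France', 'Geography_Germany', 'Geography_Spain']
```

Note that one-hot encoding changes the number of input features, so the model's `input_shape` would need to be adjusted accordingly.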
data
data.describe()
sns.histplot(data['Balance'], kde=True)  # distplot is deprecated in recent seaborn versions
data[data['Balance']==0].count()
fig,axis = plt.subplots(figsize=(16,12))
axis = sns.heatmap(data=data.corr(method='pearson',min_periods=1),annot=True,cmap="YlGnBu")
X_data = data.iloc[:, :-1]
y_data = data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.2, random_state = 50)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
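The scaler is fit on the training split only and then merely applied to the test split, so no test-set statistics leak into training. A small self-contained check of that behaviour (toy numbers, not the bank data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # train mean is 2.0
test = np.array([[2.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean/std from train only
test_scaled = sc.transform(test)        # reuses the train statistics

print(train_scaled.mean())  # ~0.0 after standardization
print(test_scaled)          # [[0.]] because 2.0 equals the train mean
```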
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
X_train
X_test
model1 = Sequential()
model1.add(Dense(20, input_shape = (10,), activation = 'relu', kernel_initializer='uniform'))
model1.add(Dense(10, activation = 'tanh', kernel_initializer='uniform'))
model1.add(Dense(1, activation = 'sigmoid',kernel_initializer='uniform'))
adam = optimizers.Adam(learning_rate = 0.001)  # 'lr' is deprecated in favour of 'learning_rate'
# Note: mean squared error is a weak loss choice for binary classification; this is a point of improvement addressed in model2
model1.compile(optimizer = adam, loss = 'mean_squared_error', metrics=['accuracy', 'mse'])
model1.summary()
history = model1.fit(X_train, y_train.values, batch_size = 100, validation_split = 0.2, epochs = 50, verbose = 1)
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
plt.plot(hist['mse'])
plt.plot(hist['val_mse'])
plt.legend(("train" , "valid") , loc =0)
results = model1.evaluate(X_test, y_test.values, verbose = 0)
accuracy = str(results[1])  # reuse the evaluation result instead of evaluating twice
print('Accuracy: ' + accuracy)
print(model1.metrics_names)
print(results)
model2 = Sequential()
model2.add(Dense(10, input_shape = (10,), activation = 'relu', kernel_initializer='normal'))
model2.add(Dense(10, activation = 'relu', kernel_initializer='normal'))
model2.add(Dense(10, activation = 'relu', kernel_initializer='normal'))
model2.add(Dense(1, activation = 'sigmoid',kernel_initializer='normal'))
adam = optimizers.Adam(learning_rate = 0.001)
# Improvement over model1: binary cross-entropy is the appropriate loss for binary classification
model2.compile(optimizer = adam, loss = 'binary_crossentropy', metrics=['accuracy', 'mse'])
model2.summary()
history = model2.fit(X_train, y_train.values, batch_size = 100, validation_split = 0.2, epochs =50, verbose = 1)
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
plt.plot(hist['mse'])
plt.plot(hist['val_mse'])
plt.legend(("train" , "valid") , loc =0)
results = model2.evaluate(X_test, y_test.values, verbose = 0)
accuracy = str(results[1])  # fixed: this previously evaluated model1 by mistake
print('Accuracy: ' + accuracy)
print(model2.metrics_names)
print(results)
def predict_with_threshold(model, x, batch_size, verbose):
    # Predict probabilities first, then convert to class labels using a 0.5 threshold
    proba = model.predict(x, batch_size=batch_size, verbose=verbose)
    return (proba >= 0.50).astype('int32')
Y_pred1 = predict_with_threshold(model1 , X_test, batch_size=1000, verbose=0)
Y_pred2 = predict_with_threshold(model2 , X_test, batch_size=1000, verbose=0)
print('Recall_score: ' + str(recall_score(y_test.values,Y_pred1)))
print('Precision_score: ' + str(precision_score(y_test.values, Y_pred1)))
print('F-score: ' + str(f1_score(y_test.values,Y_pred1)))
confusionMatrix = confusion_matrix(y_test.values,Y_pred1)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in confusionMatrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in confusionMatrix.flatten()/np.sum(confusionMatrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(confusionMatrix, annot=labels, fmt='', cmap='Blues')
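Accuracy can also be recovered directly from the confusion matrix as (TN + TP) / total, which is a useful sanity check against the value reported by `evaluate`. A toy check with scikit-learn (illustrative labels, not model output):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

# (TN + TP) / total matches accuracy_score
acc = (tn + tp) / cm.sum()
print(round(acc, 3))  # → 0.667
print(accuracy_score(y_true, y_pred))  # same value
```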
print('Recall_score: ' + str(recall_score(y_test.values,Y_pred2)))
print('Precision_score: ' + str(precision_score(y_test.values, Y_pred2)))
print('F-score: ' + str(f1_score(y_test.values,Y_pred2)))
confusionMatrix = confusion_matrix(y_test.values,Y_pred2)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in confusionMatrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in confusionMatrix.flatten()/np.sum(confusionMatrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(confusionMatrix, annot=labels, fmt='', cmap='Blues')