In this project I will use the Telco Customer Churn dataset to study customer behavior in order to develop focused customer retention programs.

This dataset has 7043 samples and 21 features. The features include demographic information about the client (gender, age range, and whether they have partners and dependents), the services they have signed up for, account information (how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges), and the variable we will try to predict, which tells which customers have left within the last month.

Some of the questions that we will try to answer during this project are:

- Which variables influence whether the client will leave?
- What are the most important variables to look at?
- Which clients have the highest chance of leaving?

First I like to load all the necessary libraries for the project:

    import pandas as pd
    import matplotlib.pyplot as plt  # used by the pair grid further down
    import seaborn as sns            # used by the pair grid and violin plot
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import classification_report

Now I will load the data:

    # Read the file
    file = "./WA_Fn-UseC_-Telco-Customer-Churn.csv"
    # Assumption: the blanks in TotalCharges are single-space strings, so " " is treated as NA
    dataset = pd.read_csv(file, na_values=[" "])

The column "TotalCharges" has some empty spaces, so I specified them as NA values.

Now I will start looking at the data that I read.

    dataset.shape  # (7043, 21)
    dataset.head().T  # Transposed for easier visualization
    dataset["SeniorCitizen"] = pd.Categorical(dataset["SeniorCitizen"])  # Changing from int to categorical
    del dataset["customerID"]  # Deleting the customerID column

So far I have checked the size of my dataset, the first 5 rows (transposed), and the type of each variable (changing SeniorCitizen to categorical), and I have deleted the customerID column since it doesn't help predict whether the client is going to leave.

Now I'm going to check for NA values and clean my data. The data has null values, but it is almost complete, so I will just drop the few rows that contain NA values:

    dataset = dataset.dropna()

Initial Analysis

I will split the dataset into numeric and object columns to facilitate the analysis:

    # Assumption: "number" selects every numeric dtype (int and float columns)
    numerics = ["number"]
    numeric_ds = dataset.select_dtypes(include=numerics)
    objects_ds = dataset.select_dtypes(exclude=numerics)
    numeric_ds.describe()

This gives us an idea of what our data looks like. Now we will dive into the graphics that will help us better understand our variables and how they relate to each other.

In this part we will look into our numerical variables: how they are distributed, how they relate to each other, and how they can help us predict the 'Churn' variable.

To see the distribution I will use box plots and histograms:

    # box plots
    numeric_ds.plot(kind='box', subplots=True, figsize=(15,5))

The box plots and histograms show us that our numerical variables are not normally distributed. I will check how they relate to the variable we are trying to predict and aggregate those variables into bins.

    numeric_ds = pd.concat([numeric_ds, dataset["Churn"]], axis=1)  # Add the 'Churn' variable to the numeric dataset
    g = sns.PairGrid(numeric_ds.sample(n=1000), hue="Churn")
    g = g.map_offdiag(plt.scatter, linewidths=1, edgecolor="w", s=40)

Both 'tenure' and 'MonthlyCharges' look like good predictors of the 'Churn' variable. I will now use the violin plot to decide which bins to use.

    sns.violinplot(x="Churn", y="tenure", data=numeric_ds)
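
Aggregating a numeric variable such as 'tenure' into bins, once its distribution per 'Churn' class has been inspected, could be sketched roughly as follows. This is only an illustration under stated assumptions: the bin edges and labels (`tenure_edges`, `tenure_labels`) are hypothetical and not taken from this analysis, and the small data frame is synthetic stand-in data rather than the Telco file.

```python
import pandas as pd

# Synthetic stand-in for the numeric dataset; the values are made up for illustration.
numeric_ds = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Hypothetical bin edges in months; in practice they would be read off the violin plot.
tenure_edges = [0, 12, 24, 48, 72]
tenure_labels = ["0-12", "13-24", "25-48", "49-72"]

# pd.cut assigns each tenure to a right-closed interval; include_lowest keeps tenure == 0.
numeric_ds["tenure_bin"] = pd.cut(
    numeric_ds["tenure"],
    bins=tenure_edges,
    labels=tenure_labels,
    include_lowest=True,
)

# Share of churned customers in each bin.
churn_rate = (
    numeric_ds.assign(churned=numeric_ds["Churn"].eq("Yes"))
    .groupby("tenure_bin", observed=False)["churned"]
    .mean()
)
print(churn_rate)
```

Comparing the churn share across bins is one way to judge whether the chosen edges separate customers who leave from those who stay; if neighboring bins show similar rates, they could be merged.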