
caso2laura

February 7, 2025

1 ANNEX 1 - PYTHON CODE


Final report on the analysis and recommendations to reduce plan cancellations
Laura Revilla
Facultad de Ingeniería, Pontificia Universidad Católica del Ecuador
Maestría en Sistemas de Información mención Data Science
Mgs. Damián Nicolalde
February 6, 2024

2 Data Loading and Exploration


[43]: import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('Datos_complexivo.csv', encoding="latin1", delimiter=";")

[44]: import warnings
warnings.filterwarnings("ignore")

[45]: # Reset the indices
df.reset_index(drop=True, inplace=True)

[46]: # Show the first rows
display(df.head())

index State Account length Area code International plan Voice mail plan \
0 0 LA 117 408 No No
1 1 IN 65 415 No No
2 2 NY 161 415 No No
3 3 SC 111 415 No No
4 4 HI 49 510 No No

Number vmail messages Total day minutes Total day calls \
0 NaN 184.5 97
1 NaN 129.1 137
2 NaN 500.0 67
3 NaN 110.4 103
4 NaN 119.3 117

Total day charge … Total eve calls Total eve charge \
0 31.37 … 80 29.89
1 21.95 … 83 19.42
2 56.59 … 97 27.01
3 18.77 … 102 11.67
4 20.28 … 109 18.28

Total night minutes Total night calls Total night charge \
0 215.8 90 9.71
1 208.8 111 9.40
2 160.6 128 7.23
3 189.6 105 8.53
4 178.7 90 8.04

Total intl minutes Total intl calls Total intl charge \
0 8.7 4.0 2.35
1 12.7 6.0 3.43
2 5.4 9.0 1.46
3 7.7 6.0 2.08
4 11.1 NaN 3.00

Customer service calls Churn
0 1 False
1 4 True
2 4 True
3 2 False
4 1 False

[5 rows x 21 columns]

[5]: !pip install ydata-profiling

Requirement already satisfied: ydata-profiling in /usr/local/lib/python3.11/dist-packages (4.12.2)
(remaining "Requirement already satisfied" lines for its dependencies omitted)
[6]: # from ydata_profiling import ProfileReport

# Generate a detailed report similar to skim() using ydata_profiling
# profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# Display the report
# profile.to_notebook_iframe()

[47]: # Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 3333 non-null int64
1 State 3333 non-null object
2 Account length 3333 non-null int64
3 Area code 3333 non-null int64
4 International plan 3333 non-null object
5 Voice mail plan 3333 non-null object
6 Number vmail messages 922 non-null float64
7 Total day minutes 3333 non-null float64
8 Total day calls 3333 non-null int64
9 Total day charge 3333 non-null float64
10 Total eve minutes 3333 non-null float64
11 Total eve calls 3333 non-null int64
12 Total eve charge 3333 non-null float64
13 Total night minutes 3333 non-null float64
14 Total night calls 3333 non-null int64
15 Total night charge 3333 non-null float64
16 Total intl minutes 3333 non-null float64
17 Total intl calls 1998 non-null float64
18 Total intl charge 3333 non-null float64
19 Customer service calls 3333 non-null int64
20 Churn 3333 non-null bool
dtypes: bool(1), float64(10), int64(7), object(3)
memory usage: 524.2+ KB

[48]: # Null values per column
df.isnull().sum()

[48]: index 0
State 0
Account length 0
Area code 0
International plan 0
Voice mail plan 0
Number vmail messages 2411
Total day minutes 0
Total day calls 0
Total day charge 0
Total eve minutes 0
Total eve calls 0
Total eve charge 0
Total night minutes 0
Total night calls 0
Total night charge 0
Total intl minutes 0
Total intl calls 1335
Total intl charge 0
Customer service calls 0
Churn 0
dtype: int64
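
For scale, these counts correspond to roughly 72% missing values in Number vmail messages and 40% in Total intl calls. A minimal sketch (assuming df is already loaded) that expresses the gaps as percentages:

# Share of missing values per column, in percent
missing_pct = df.isnull().mean().mul(100).round(1)
print(missing_pct[missing_pct > 0])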

[49]: df.duplicated().sum()

[49]: 0

[50]: # Descriptive statistics
df.describe()

[50]: index Account length Area code Number vmail messages \
count 3333.000000 3333.000000 3333.000000 922.000000
mean 1132.480048 101.064806 437.182418 29.277657
std 800.805553 39.822106 42.371290 7.559027
min 0.000000 1.000000 408.000000 4.000000
25% 416.000000 74.000000 408.000000 24.000000
50% 999.000000 101.000000 415.000000 29.000000
75% 1832.000000 127.000000 510.000000 34.000000
max 2665.000000 243.000000 510.000000 51.000000

Total day minutes Total day calls Total day charge \
count 3333.000000 3333.000000 3333.000000
mean 181.534713 100.899790 30.562307
std 61.455081 21.733147 9.259435
min 0.000000 0.000000 0.000000
25% 143.700000 87.000000 24.430000
50% 179.400000 101.000000 30.500000
75% 216.400000 114.000000 36.790000
max 500.000000 200.000000 59.640000

Total eve minutes Total eve calls Total eve charge \
count 3333.000000 3333.000000 3333.000000
mean 203.522382 100.114311 17.083540
std 62.610089 19.922625 4.310668
min 0.000000 0.000000 0.000000
25% 166.600000 87.000000 14.160000
50% 201.400000 100.000000 17.120000
75% 235.300000 114.000000 20.000000
max 600.000000 170.000000 30.910000

Total night minutes Total night calls Total night charge \
count 3333.000000 3333.000000 3333.000000
mean 200.872037 100.107711 9.039325
std 50.573847 19.568609 2.275873
min 23.200000 33.000000 1.040000
25% 167.000000 87.000000 7.520000
50% 201.200000 100.000000 9.050000
75% 235.300000 113.000000 10.590000
max 395.000000 175.000000 17.770000

Total intl minutes Total intl calls Total intl charge \
count 3333.000000 1998.000000 3333.000000
mean 10.237294 5.899900 2.764581
std 2.791840 2.167958 0.753773
min 0.000000 4.000000 0.000000
25% 8.500000 4.000000 2.300000
50% 10.300000 5.000000 2.780000
75% 12.100000 7.000000 3.270000
max 20.000000 20.000000 5.400000

Customer service calls
count 3333.000000
mean 1.562856
std 1.315491
min 0.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 9.000000

Univariate analysis
Distribution of the variables
[51]: print('The percentage of customers churning from the company is: {:.2f}%'
      .format(df['Churn'].sum() * 100 / df.shape[0]))

The percentage of customers churning from the company is: 14.49%
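
Since churners are only about 14% of the sample, the class imbalance addressed later with SMOTETomek is already visible here. The raw counts, as a one-line sketch:

# Absolute class counts for the target variable
print(df['Churn'].value_counts())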


Analysis of the categorical variables

[52]: import matplotlib.pyplot as plt
import seaborn as sns

# List of categorical variables
variables_categoricas = ["Churn", "Area code", "International plan", "Voice mail plan"]

plt.figure(figsize=(12, 8))
for i, var in enumerate(variables_categoricas, 1):
    plt.subplot(2, 2, i)
    sns.countplot(x=df[var], palette="viridis")
    plt.title(f"Distribution of {var}")
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

Analysis of the numerical variables


[53]: # Histograms of the key variables
variables_numericas = ["Account length", "Number vmail messages",
                       "Total day minutes", "Total day calls", "Total day charge",
                       "Total eve minutes", "Total eve calls", "Total eve charge",
                       "Total night minutes", "Total night calls", "Total night charge",
                       "Total intl minutes", "Total intl calls", "Total intl charge",
                       "Customer service calls"]

df[variables_numericas].hist(figsize=(18, 10), bins=30, color='teal', edgecolor='black')
plt.suptitle("Distribution of the Numerical Variables")
plt.show()

Bivariate analysis between churn and the independent variables


Churn by state
[54]: plt.figure(figsize=(20,6))
sns.set_style('whitegrid')
sns.barplot(x='State',y='Churn', data=df, palette="turbo")

[54]: <Axes: xlabel='State', ylabel='Churn'>
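
Because Churn is boolean, sns.barplot aggregates it with the mean, so each bar is that state's churn rate. The same quantity can be read off directly; a minimal sketch:

# Mean churn rate per state, i.e. the heights of the bars above
print(df.groupby('State')['Churn'].mean().sort_values(ascending=False).head())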

Relationship between churn and customer service calls
[55]: sns.barplot(x='Churn', y='Customer service calls', data=df, palette="viridis")

[55]: <Axes: xlabel='Churn', ylabel='Customer service calls'>

Relationship between churn and the international plan
[56]: churn_intl = df.groupby(['Churn', 'International plan']).size()
churn_intl.plot()
plt.show()

Relationship between churn and the voice mail plan
[57]: churn_voicem = df.groupby(['Churn', 'Voice mail plan']).size()
churn_voicem.plot()
plt.show()

Distribution of the variables with respect to churn
[59]: variables_numericas = ["Account length", "Number vmail messages",
                       "Total day minutes", "Total day calls", "Total day charge",
                       "Total eve minutes", "Total eve calls", "Total eve charge",
                       "Total night minutes", "Total night calls", "Total night charge",
                       "Total intl minutes", "Total intl calls", "Total intl charge",
                       "Customer service calls"]

filas = 4
columnas = 4
fig, axes = plt.subplots(filas, columnas, figsize=(15, 15))
colores = ["#1f77b4", "#ff7f0e"]

# Generate histograms with KDE for each numerical variable
for i, col in enumerate(variables_numericas):
    fila = i // columnas
    columna = i % columnas
    ax = axes[fila, columna]
    sns.histplot(df, x=col, hue="Churn", kde=True, bins=30, palette=colores,
                 alpha=0.8, ax=ax)
    ax.set_title(f"{col} by Churn", fontsize=14, fontweight="bold")
    ax.set_xlabel("")
    ax.set_ylabel("Frequency")
    ax.grid(True, linestyle="--", alpha=0.5)

plt.tight_layout()
plt.show()

Multivariate analysis

[60]: # Correlation matrix of the numerical variables
import matplotlib.pyplot as plt

correlation_matrix = df[variables_numericas].corr()

# Visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Correlation Matrix of the Numerical Variables')
plt.show()

[62]: filas = 4
columnas = 4
df_numeric = df.select_dtypes(include=[np.number])
fig, axes = plt.subplots(filas, columnas, figsize=(16, 12))
fig.suptitle("Box-and-whisker plots - Numerical variables", fontsize=16, fontweight="bold")
axes = axes.flatten()

# Draw a box plot for each numerical variable
for i, col in enumerate(df_numeric.columns):
    if i < len(axes):
        sns.boxplot(y=df_numeric[col], ax=axes[i], palette="cool")
        axes[i].set_title(col, fontsize=12)
        axes[i].grid(True, linestyle="--", alpha=0.7)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

3 Data Preparation
[64]: # Impute null values with the median
df.loc[:, 'Number vmail messages'] = df['Number vmail messages'].fillna(
    df['Number vmail messages'].median())
df.loc[:, 'Total intl calls'] = df['Total intl calls'].fillna(
    df['Total intl calls'].median())
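
A quick sanity check, as a minimal sketch, confirms that the imputation left no gaps behind:

# Verify that the two imputed columns no longer contain nulls
assert df[['Number vmail messages', 'Total intl calls']].isnull().sum().sum() == 0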

[33]: !pip install category_encoders

Requirement already satisfied: category_encoders in /usr/local/lib/python3.11/dist-packages (2.8.0)
(remaining "Requirement already satisfied" lines for its dependencies omitted)

[66]: from category_encoders import TargetEncoder

# Convert categorical variables to numeric
df['International plan'] = df['International plan'].map({"No": 0, "Yes": 1})
df['Voice mail plan'] = df['Voice mail plan'].map({"No": 0, "Yes": 1})

# Apply Target Encoding to "State"
target_encoder = TargetEncoder(cols=['State'])
df['State'] = target_encoder.fit_transform(df['State'], df['Churn'])
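
One caveat: fitting the target encoder on the full dataset before the train/test split lets test-set labels influence the encoding. A leakage-free variant, as a minimal sketch under the split parameters used later (X_tr and X_te are illustrative names):

# Fit the encoder on the training split only, then apply it to the test split
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(df.drop('Churn', axis=1), df['Churn'],
                                          test_size=0.3, random_state=1)
encoder = TargetEncoder(cols=['State'])
X_tr['State'] = encoder.fit_transform(X_tr['State'], y_tr)  # learns from training labels only
X_te['State'] = encoder.transform(X_te['State'])            # no test labels involved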

Handling outliers
[68]: def replace_outliers(series):
    # Compute the quartiles and the outlier bounds
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Percentiles used as replacement caps
    lower_cap = series.quantile(0.05)
    upper_cap = series.quantile(0.95)

    return series.apply(lambda x: lower_cap if x < lower_bound else
                        (upper_cap if x > upper_bound else x) if pd.notna(x) else x)

# Apply the function to several columns of the DataFrame 'df'
columns_to_fix = [
    'Number vmail messages', 'Total day minutes', 'Total day calls',
    'Total day charge', 'Total eve minutes', 'Total eve calls',
    'Total eve charge', 'Total night minutes', 'Total night calls',
    'Total night charge', 'Total intl minutes', 'Total intl calls',
    'Total intl charge', 'Customer service calls'
]

for column in columns_to_fix:
    df[column] = replace_outliers(df[column])
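
To see what the capping does, a toy example (hypothetical data) run through the same function:

# Illustrative: the extreme value exceeds Q3 + 1.5*IQR, so it is replaced
# by the series' 95th percentile (76.25 for this toy series)
s = pd.Series([1, 2, 3, 4, 5, 100])
print(replace_outliers(s).tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 76.25]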

[70]: filas = 4
columnas = 4
df_numeric = df.select_dtypes(include=[np.number])
fig, axes = plt.subplots(filas, columnas, figsize=(16, 12))
fig.suptitle("Box-and-whisker plots - Numerical variables", fontsize=16, fontweight="bold")
axes = axes.flatten()

# Draw a box plot for each numerical variable, now with the outliers capped
for i, col in enumerate(df_numeric.columns):
    if i < len(axes):
        sns.boxplot(y=df_numeric[col], ax=axes[i], palette="cool")
        axes[i].set_title(col, fontsize=12)
        axes[i].grid(True, linestyle="--", alpha=0.7)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

4 Modeling
Mitigating the imbalance of the target variable
[26]: !pip install --upgrade scikit-learn

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (1.6.1)
(remaining "Requirement already satisfied" lines for its dependencies omitted)

[72]: import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Set the seed for reproducibility
np.random.seed(1991)

# The data is assumed to be loaded already in 'df'
# df = pd.read_csv("tu_archivo.csv")

# Separate the features and the target variable
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Summary of the training set
print(X_train.describe())
print(y_train.value_counts())

# Balance the training set using SMOTE and Tomek links
smote_tomek = SMOTETomek(random_state=1)
X_train_balanced, y_train_balanced = smote_tomek.fit_resample(X_train, y_train)

# Build a DataFrame of the balanced set (optional, for inspection only)
train_balanced = pd.concat([X_train_balanced, y_train_balanced], axis=1)

# Train a Random Forest model on the balanced dataset
modelo_bagging = RandomForestClassifier(n_estimators=100, max_features=4,
                                        random_state=1, oob_score=True)
modelo_bagging.fit(X_train_balanced, y_train_balanced)

# Plot the feature importances
importances = modelo_bagging.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(12, 6))
plt.title("Feature importance")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

# Evaluate the model on the test set
y_pred = modelo_bagging.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

index State Account length Area code \
count 2333.000000 2333.000000 2333.000000 2333.000000
mean 1141.411059 0.145197 101.125161 436.856408
std 802.347918 0.054208 39.381101 42.178981
min 0.000000 0.059745 1.000000 408.000000
25% 423.000000 0.097485 74.000000 408.000000
50% 1024.000000 0.136449 101.000000 415.000000
75% 1837.000000 0.180657 127.000000 415.000000
max 2665.000000 0.263728 243.000000 510.000000

International plan Voice mail plan Number vmail messages \
count 2333.000000 2333.000000 2333.000000
mean 0.098586 0.271753 29.039006
std 0.298169 0.444959 3.547488
min 0.000000 0.000000 22.000000
25% 0.000000 0.000000 29.000000
50% 0.000000 0.000000 29.000000
75% 0.000000 1.000000 29.000000
max 1.000000 1.000000 36.000000

Total day minutes Total day calls Total day charge \
count 2333.000000 2333.000000 2333.000000
mean 179.550699 100.489498 30.570058
std 52.467265 19.365548 9.018493
min 37.700000 47.000000 6.410000
25% 144.600000 87.000000 24.580000
50% 179.100000 101.000000 30.450000
75% 216.600000 114.000000 36.820000
max 305.200000 146.000000 54.830000

Total eve minutes Total eve calls Total eve charge \
count 2333.000000 2333.000000 2333.000000
mean 200.933733 100.019288 17.107227
std 49.358279 19.589178 4.254357
min 64.300000 48.000000 5.470000
25% 167.200000 87.000000 14.210000
50% 201.000000 100.000000 17.090000
75% 236.000000 114.000000 20.060000
max 319.300000 154.000000 28.650000

Total night minutes Total night calls Total night charge \
count 2333.000000 2333.000000 2333.000000
mean 200.540823 99.919846 9.024382
std 49.508177 19.038487 2.227949
min 65.700000 49.000000 2.960000
25% 165.900000 86.000000 7.470000
50% 202.000000 100.000000 9.090000
75% 235.000000 113.000000 10.580000
max 329.300000 152.000000 14.820000

Total intl minutes Total intl calls Total intl charge \
count 2333.000000 2333.000000 2333.000000
mean 10.277925 5.441063 2.774625
std 2.630047 1.406787 0.707916
min 3.300000 4.000000 0.890000
25% 8.500000 5.000000 2.300000
50% 10.300000 5.000000 2.780000
75% 12.100000 6.000000 3.270000
max 17.500000 9.000000 4.670000

Customer service calls
count 2333.000000
mean 1.507501
std 1.186531
min 0.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 4.000000
Churn
False 1996
True 337
Name: count, dtype: int64
[[821  33]
 [ 51  95]]
              precision    recall  f1-score   support

       False       0.94      0.96      0.95       854
        True       0.74      0.65      0.69       146

    accuracy                           0.92      1000
   macro avg       0.84      0.81      0.82      1000
weighted avg       0.91      0.92      0.91      1000

[77]: import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Separate the features and the target variable
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a Random Forest model on the (unbalanced) training set
modelo_bagging = RandomForestClassifier(n_estimators=100, max_features=4,
                                        random_state=1, oob_score=True)
modelo_bagging.fit(X_train, y_train)

# Compute the OOB error for different numbers of trees
oob_errors = []
for i in range(1, len(modelo_bagging.estimators_) + 1):
    # Fit a temporary Random Forest with i estimators
    temp_rf = RandomForestClassifier(n_estimators=i, max_features=4,
                                     random_state=1, oob_score=True, warm_start=True)
    temp_rf.fit(X_train, y_train)
    oob_errors.append(1 - temp_rf.oob_score_)  # store the OOB error

# Plot the OOB error
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(modelo_bagging.estimators_) + 1), oob_errors, color='firebrick')
plt.xlabel('Number of trees')
plt.ylabel('OOB error')
plt.title('OOB error vs. number of trees in the Random Forest')
plt.show()

# Plot the feature importances of the Random Forest model
importances = modelo_bagging.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(12, 6))
plt.title('Feature importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center', color='firebrick')
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

# Evaluate the model on the test set
y_pred = modelo_bagging.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[847   7]
 [ 60  86]]
              precision    recall  f1-score   support

       False       0.93      0.99      0.96       854
        True       0.92      0.59      0.72       146

    accuracy                           0.93      1000
   macro avg       0.93      0.79      0.84      1000
weighted avg       0.93      0.93      0.93      1000
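
The OOB loop above refits a full forest from scratch at every tree count, so warm_start has no effect across iterations. A cheaper variant, as a minimal sketch with the same parameters, grows a single ensemble incrementally and reads the OOB score at each size:

# Incremental OOB curve: reuse one forest and add trees one at a time
rf = RandomForestClassifier(max_features=4, random_state=1,
                            oob_score=True, warm_start=True)
oob_errors = []
for n in range(1, 101):
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)  # with warm_start, only the new trees are fitted
    oob_errors.append(1 - rf.oob_score_)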

[79]: import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_curve, roc_auc_score, classification_report,
                             confusion_matrix, precision_recall_curve)
import matplotlib.pyplot as plt
import seaborn as sns

# Probabilistic predictions on the test set using the trained Random Forest model
proba = modelo_bagging.predict_proba(X_test)[:, 1]

# Compute the ROC curve and the AUC
fpr, tpr, _ = roc_curve(y_test, proba)
roc_auc = roc_auc_score(y_test, proba)

# Plot the ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

# Compute the lift curve: precision divided by the baseline churn rate
precision, recall, thresholds = precision_recall_curve(y_test, proba)
lift = precision / (np.sum(y_test) / len(y_test))

# Plot the lift curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, lift[:-1], color='darkorange', lw=2)
plt.xlabel('Threshold')
plt.ylabel('Lift')
plt.title('Lift Curve')
plt.show()

# Plot the sensitivity/specificity curve
plt.figure(figsize=(10, 6))
plt.plot(1 - fpr, tpr, color='darkorange', lw=2)
plt.xlabel('Specificity')
plt.ylabel('Sensitivity')
plt.title('Sensitivity/Specificity Curve')
plt.show()

