This document discusses performing linear regression on the Boston housing dataset after standardizing the data. It loads the Boston housing dataset, explores the data through plots, splits the data into training and test sets, and prepares the input (X) and output (y) variables for modeling. Standardization is applied before fitting the linear regression model in order to assess its impact on model performance.


07bRegresionLinealBostonVerdConEstandarizacion

CJURO APAZA JIMMY CRISTHIAN 19200111


In [1]:

from sklearn.datasets import load_boston


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:

# Create the boston dictionary-like object (Bunch)


boston = load_boston()

C:\Users\Jimmy\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning:
Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.

  warnings.warn(msg, category=FutureWarning)
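Since load_boston will be removed in scikit-learn 1.2, here is a minimal sketch (assuming the StatLib URL from the warning above is still reachable) that builds the same df_entrada and df_salida DataFrames used below without load_boston:

# Sketch: load the Boston data from the original StatLib source instead of load_boston.
# Assumes the data_url from the warning above is still available.
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)

# Each record is spread over two physical rows in the raw file.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

df_entrada = pd.DataFrame(data, columns=feature_names)  # equivalent to boston['data']
df_salida = pd.DataFrame(target, columns=['valor m'])   # equivalent to boston['target']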

In [3]:

# Show the keys of the boston dictionary


boston.keys()

Out[3]:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])


In [4]:

print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [5]:

# Create a DataFrame to manipulate the data


df_entrada=pd.DataFrame(data=boston['data'],columns=boston['feature_names'])

In [6]:

df_entrada.head()

Out[6]:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT

0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98

1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14

2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03

3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94

4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

In [7]:

df_salida=pd.DataFrame(data=boston['target'],columns=['valor m'])

In [8]:

df_salida.head()

Out[8]:

   valor m

0     24.0

1     21.6

2     34.7

3     33.4

4     36.2

In [9]:

df = pd.concat([df_entrada,df_salida],axis=1)


In [10]:

df.head()

Out[10]:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  valor m

0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98     24.0

1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14     21.6

2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03     34.7

3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94     33.4

4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33     36.2

Data exploration
In [11]:

sns.displot(df['valor m'],bins=30)

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x1bed6189c10>


In [12]:

sns.pairplot(df)

Out[12]:

<seaborn.axisgrid.PairGrid at 0x1bed61f4910>


In [13]:

sns.heatmap(df.corr(),annot=True)

Out[13]:

<AxesSubplot:>
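To read the heatmap numerically, the correlation of each feature with the target can also be listed directly; a minimal sketch using the df built above:

# Sketch: features ranked by their correlation with the target 'valor m'
print(df.corr()['valor m'].sort_values(ascending=False))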

Splitting into input and output sets


In [14]:

X = df.drop('valor m',axis=1)
#X

In [15]:

y = df['valor m']
#y

Splitting into training and test sets


In [16]:

from sklearn.model_selection import train_test_split


In [17]:

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=50)

In [18]:

print("Dimensión de la Entrada de entrenamiento",X_train.shape)


print("Dimensión de la Salida de entrenamiento",y_train.shape)
print("Dimensión de la Entrada de prueba",X_test.shape)
print("Dimensión de la Salida de prueba",y_test.shape)

Dimensión de la Entrada de entrenamiento (354, 13)

Dimensión de la Salida de entrenamiento (354,)

Dimensión de la Entrada de prueba (152, 13)

Dimensión de la Salida de prueba (152,)

Standardization
In [19]:

from sklearn.preprocessing import StandardScaler

In [20]:

escala = StandardScaler()

In [21]:

escala.fit(X_train)

Out[21]:

StandardScaler()

In [22]:

X_train_escala = escala.transform(X_train)

In [23]:

X_test_escala = escala.transform(X_test)
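As a sanity check, a minimal sketch of what StandardScaler does: it subtracts each column's training mean and divides by its training standard deviation, and applies those same training statistics to the test set.

# Sketch: verify that StandardScaler applies (x - mean_train) / std_train per column.
# Reuses the X_train, X_test, X_train_escala and X_test_escala objects above.
import numpy as np

media = X_train.values.mean(axis=0)   # per-feature mean of the training set
desv = X_train.values.std(axis=0)     # per-feature (population) standard deviation

manual_train = (X_train.values - media) / desv
print(np.allclose(manual_train, X_train_escala))   # expected: True

# The test set is scaled with the *training* statistics, never its own:
manual_test = (X_test.values - media) / desv
print(np.allclose(manual_test, X_test_escala))     # expected: True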


In [24]:

# First 5 training rows, before standardization


X_train[0:5]

Out[24]:

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B L

251 0.21409 22.0 5.86 0.0 0.431 6.438 8.9 7.3967 7.0 330.0 19.1 377.07

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63

257 0.61154 20.0 3.97 0.0 0.647 8.704 86.9 1.8010 5.0 264.0 13.0 389.70

35 0.06417 0.0 5.96 0.0 0.499 5.933 68.2 3.3603 5.0 279.0 19.2 396.90

339 0.05497 0.0 5.19 0.0 0.515 5.985 45.4 4.8122 5.0 224.0 20.2 396.90

In [25]:

# First 5 training rows, standardized


X_train_escala[0:5]

Out[25]:

array([[-0.37652669,  0.40303812, -0.73014076, -0.23815209, -1.03670857,
         0.22264446, -2.05938044,  1.6441492 , -0.26964969, -0.44905554,
         0.28623396,  0.20667552, -1.27644669],
       [-0.39714499, -0.50770056, -1.26620691, -0.23815209, -0.80355288,
         1.06469723, -0.75736878,  1.01955302, -0.73642714, -1.09742939,
         0.09893854,  0.40175973, -1.36951318],
       [-0.33143127,  0.32024369, -1.00545735, -0.23815209,  0.82853698,
         3.62995083,  0.69283933, -0.97484881, -0.50303841, -0.845284  ,
        -2.57002119,  0.3469895 , -1.0573825 ],
       [-0.3935369 , -0.50770056, -0.71557375, -0.23815209, -0.44950164,
        -0.53670669,  0.03301228, -0.24503778, -0.50303841, -0.75523208,
         0.33305782,  0.42697847, -0.40448531],
       [-0.39458075, -0.50770056, -0.82773976, -0.23815209, -0.3113353 ,
        -0.45851608, -0.77148273,  0.43450601, -0.50303841, -1.08542247,
         0.80129637,  0.42697847, -0.39589456]])

Training the model
In [26]:

from sklearn.linear_model import LinearRegression

In [27]:

# Model that applies the Ordinary Least Squares algorithm


rl = LinearRegression()


In [28]:

# Train the model


rl.fit(X_train_escala,y_train)

Out[28]:

LinearRegression()

In [29]:

# Intercept (Beta_0)
rl.intercept_

Out[29]:

22.358757062146893

In [30]:

# Remaining coefficients: Beta_1 through Beta_13 (one per feature)


rl.coef_

Out[30]:

array([-0.90114346,  0.75136486,  0.09864326,  0.37403423, -1.83771432,
        3.34663783, -0.02030234, -2.80647891,  2.19329497, -2.05496738,
       -1.92881505,  0.86930044, -2.95289747])

In [31]:

X.columns

Out[31]:

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')


In [32]:

df_coef = pd.DataFrame(rl.coef_,X.columns,columns=['Coeficiente (Beta)'])


df_coef

Out[32]:

         Coeficiente (Beta)
CRIM              -0.901143
ZN                 0.751365
INDUS              0.098643
CHAS               0.374034
NOX               -1.837714
RM                 3.346638
AGE               -0.020302
DIS               -2.806479
RAD                2.193295
TAX               -2.054967
PTRATIO           -1.928815
B                  0.869300
LSTAT             -2.952897
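Because this is ordinary least squares on standardized features, each prediction is simply the intercept plus the weighted sum of the scaled inputs, and each beta can be read as the expected change in 'valor m' (in $1000s) for a one-standard-deviation increase in that feature. A minimal sketch reconstructing the predictions from the coefficients above:

# Sketch: rebuild the model's predictions by hand from the intercept and betas.
import numpy as np

y_manual = rl.intercept_ + X_test_escala @ rl.coef_   # Beta_0 + sum(Beta_j * x_j)
y_sklearn = rl.predict(X_test_escala)

print(np.allclose(y_manual, y_sklearn))   # expected: True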

Evaluating the model
In [33]:

# First 5 test rows (standardized)


X_test_escala[0:5]

Out[33]:

array([[ 0.11609979, -0.50770056,  1.05286187, -0.23815209,  1.4416501 ,
        -4.10340162,  0.7281242 , -1.06274628,  1.71415448,  1.56810756,
         0.80129637, -0.04184577, -0.77102409],
       [-0.37025339, -0.50770056, -0.14163336, -0.23815209, -0.06090881,
        -0.03598603,  0.54817137, -0.29067145, -0.61973277, -0.60514554,
        -0.04153302,  0.38798385, -0.6593443 ],
       [ 0.01820942, -0.50770056,  1.05286187, -0.23815209,  1.39847312,
         0.12941719,  0.74576664, -0.61628521,  1.71415448,  1.56810756,
         0.80129637,  0.36620907,  0.30711536],
       [-0.24845661, -0.50770056,  1.26845369, -0.23815209,  0.46585035,
        -0.33671916,  1.15507111, -0.99530206, -0.50303841, -0.01080284,
        -1.77401566, -0.05084453, -0.86981774],
       [-0.39505048, -0.50770056, -0.82773976, -0.23815209, -0.3113353 ,
         0.03919725, -1.02906227,  1.20498951, -0.50303841, -1.08542247,
         0.80129637,  0.3471006 , -0.97720215]])


In [34]:

predicciones_test = rl.predict(X_test_escala)

In [ ]:

In [35]:

# First 5 predictions


predicciones_test[0:5]

Out[35]:

array([ 9.70536163, 25.25268456, 19.95637866, 27.88901718, 22.27395303])

In [36]:

# First 5 true values


y_test[0:5]

Out[36]:

365    27.5
313    21.6
461    17.7
158    24.3
333    22.2
Name: valor m, dtype: float64

In [37]:

plt.scatter(y_test,predicciones_test)
plt.xlabel('Valores verdaderos')
plt.ylabel('Valores predichos')

Out[37]:

Text(0, 0.5, 'Valores predichos')


In [38]:

# residual (error) = true value - predicted value
# in our case: residuals = y_test - predicciones_test
sns.displot((y_test-predicciones_test),bins=50,kde=True);

In [39]:

# residuals
res_test = y_test - predicciones_test


In [40]:

# Residuals vs. true values; the dashed line marks zero error
sns.scatterplot(x=y_test,y=res_test)
plt.axhline(y=0, color='r', linestyle='--')

Out[40]:

<matplotlib.lines.Line2D at 0x1bed20ad280>

In [41]:

import scipy as sp


In [42]:

# Q-Q plot of the residuals: points close to the line indicate approximately normal residuals
fig, ax = plt.subplots(figsize=(6,8),dpi=100)
_ = sp.stats.probplot(res_test,plot=ax)


Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Comparing these metrics:

MAE is the easiest to understand, because it is the average error.

MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.

RMSE is even more popular than MSE, because RMSE is interpretable in the units of "y".

All of these are loss functions, and we want to minimize them.
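Before using sklearn.metrics below, a minimal sketch computing the three metrics directly from their definitions with numpy (reusing y_test and predicciones_test):

# Sketch: compute MAE, MSE and RMSE directly from their formulas.
import numpy as np

errores = y_test.values - predicciones_test   # y_i - y_hat_i

mae = np.mean(np.abs(errores))    # mean of absolute errors
mse = np.mean(errores ** 2)       # mean of squared errors
rmse = np.sqrt(mse)               # square root of the MSE

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)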

In [43]:

from sklearn import metrics


In [44]:

print('MAE:', metrics.mean_absolute_error(y_test, predicciones_test))


print('MSE:', metrics.mean_squared_error(y_test, predicciones_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predicciones_test)))

MAE: 3.6789775344994413

MSE: 33.86803399667011

RMSE: 5.819624901715755

In [45]:

# R^2: 1 means a perfect fit, 0 means the model does no better than predicting the mean
# (on test data it can even be negative for a poor model)
r_squared = rl.score(X_test_escala, y_test)
r_squared

Out[45]:

0.6685538790447977
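The same value follows from the definition R² = 1 - SS_res / SS_tot; a minimal sketch using the arrays already defined:

# Sketch: R^2 = 1 - (sum of squared residuals) / (total sum of squares of y_test)
import numpy as np

ss_res = np.sum((y_test.values - predicciones_test) ** 2)
ss_tot = np.sum((y_test.values - y_test.values.mean()) ** 2)

print(1 - ss_res / ss_tot)   # should match rl.score(X_test_escala, y_test)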

The change in the metrics after standardizing is negligible: the differences in MAE, MSE, RMSE and R² with respect to the unstandardized model are minimal. Overall, R² decreased slightly after standardization, which indicates that the standardized model captures a marginally weaker input-output relationship than the unstandardized one. In conclusion, the unstandardized model is considered (marginally) better.
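One reason the differences are so small: for ordinary least squares with an intercept, standardizing the features is an invertible linear rescaling, so in exact arithmetic the fitted predictions do not change at all, and any gap in the metrics comes from floating-point effects. A quick sketch to check how close the two models' predictions are in practice (assuming the unstandardized model is refit here, since it lives in a separate notebook):

# Sketch: OLS predictions with and without standardization are numerically (almost) identical.
from sklearn.linear_model import LinearRegression
import numpy as np

rl_sin = LinearRegression().fit(X_train, y_train)           # unstandardized features
rl_con = LinearRegression().fit(X_train_escala, y_train)    # standardized features

pred_sin = rl_sin.predict(X_test)
pred_con = rl_con.predict(X_test_escala)

print(np.max(np.abs(pred_sin - pred_con)))   # expected: a very small number (order 1e-10 or less)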
In [ ]:
