Practice 11

The document contains code to analyze and summarize data using NumPy and Pandas. It imports data from an Excel file containing employee data, then performs statistical calculations and grouping on the data. Calculations include sums, means, percentiles, standard deviations across the full dataset and grouped by variables like gender. Frequency counts and cross tabulations are also generated to analyze relationships between variables.

Uploaded by

2marlenehh2003

import sys

print(sys.version)

3.10.11 (main, Apr 5 2023, 14:15:10) [GCC 9.4.0]

from IPython.core.interactiveshell import InteractiveShell


InteractiveShell.ast_node_interactivity="all"

Exploratory data analysis

Calculations on arrays

import numpy as np

a = np.array([4,5,9,4,6,3,2])
a

array([4, 5, 9, 4, 6, 3, 2])

np.sum(a); print()
np.mean(a); print()
np.median(a); print()

import statistics as stat


stat.mode(a)

33
4.714285714285714
4.0
4

np.percentile(a, 25); print()  # roughly 25% of the values in a lie below 3.5
np.percentile(a, 50); print()  # roughly 50% of the values in a lie below 4.0
np.percentile(a, 75)  # roughly 75% of the values in a lie below 5.5

3.5
4.0
5.5
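The quartiles above are not always values that occur in the array, because NumPy's default percentile method interpolates linearly between neighbouring sorted values. A minimal sketch, reusing the same array, of how the 25th percentile of 3.5 arises (the helper names `pos`, `lo`, `hi` are only illustrative):

```python
import numpy as np

a = np.array([4, 5, 9, 4, 6, 3, 2])
s = np.sort(a)  # [2 3 4 4 5 6 9]

# Default ("linear") method: fractional position = q/100 * (n - 1)
pos = 0.25 * (len(s) - 1)  # 1.5, halfway between s[1] and s[2]
lo, hi = int(np.floor(pos)), int(np.ceil(pos))

# Interpolate between the two neighbouring sorted values
q1_manual = s[lo] + (pos - lo) * (s[hi] - s[lo])
q1_numpy = np.percentile(a, 25)

print(q1_manual, q1_numpy)
```

Both values should agree; other interpolation rules exist, so results can differ slightly from tools that use a different convention.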

min(a); print()
max(a); print()
np.std(a)

2
9
2.1189138534559038
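Note that `np.std` divides by n by default (population formula), whereas the pandas `std` used later in this notebook divides by n - 1 (sample formula). A short sketch of the difference on the same array, assuming nothing beyond NumPy and pandas defaults:

```python
import numpy as np
import pandas as pd

a = np.array([4, 5, 9, 4, 6, 3, 2])

pop_std = np.std(a)              # ddof=0: divides by n
sample_std = np.std(a, ddof=1)   # ddof=1: divides by n - 1
pandas_std = pd.Series(a).std()  # pandas defaults to ddof=1

print(pop_std, sample_std, pandas_std)
```

This is why a standard deviation computed with NumPy on a column may not match the `std` row reported by `describe()`.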

from scipy.stats import skew  # skewness
from scipy.stats import kurtosis  # kurtosis

skew(a); print()
kurtosis(a)

0.8274271039321606
-0.1402479338842988
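The negative kurtosis above is an excess kurtosis: by default `scipy.stats.kurtosis` uses Fisher's definition, in which a normal distribution scores 0. A brief sketch comparing it with Pearson's definition (normal distribution scores 3) on the same array:

```python
import numpy as np
from scipy.stats import kurtosis

a = np.array([4, 5, 9, 4, 6, 3, 2])

excess = kurtosis(a)                 # Fisher definition (default)
pearson = kurtosis(a, fisher=False)  # Pearson definition: excess + 3

print(excess, pearson)
```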

np.random.seed(2021)
a = np.random.randint(1,9,size=(3,3))
a

array([[5, 6, 2],
       [1, 6, 7],
       [7, 5, 8]])

np.sum(a); print()
np.sum(a,0); print()  # by column
np.sum(a,1)  # by row

47
array([13, 17, 17])
array([13, 14, 20])
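The second positional argument in these calls is the `axis`. A small sketch, using the same 3x3 values written out literally, that spells the argument out by keyword: `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

# Same values as the 3x3 array above, written out literally
a = np.array([[5, 6, 2],
              [1, 6, 7],
              [7, 5, 8]])

col_sums = a.sum(axis=0)             # one result per column
row_sums = a.sum(axis=1)             # one result per row
kept = a.sum(axis=1, keepdims=True)  # shape (3, 1), convenient for broadcasting

print(col_sums, row_sums, kept.shape)
```

The same convention applies to `np.mean`, `np.std`, `np.percentile`, `min` and `max` in the cells that follow.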

np.mean(a); print()
np.mean(a,0); print()  # by column
np.mean(a,1)  # by row

5.222222222222222
array([4.33333333, 5.66666667, 5.66666667])
array([4.33333333, 4.66666667, 6.66666667])

np.std(a); print()
np.std(a,0); print()  # by column
np.std(a,1)  # by row

2.199887763691481
array([2.49443826, 0.47140452, 2.62466929])
array([1.69967317, 2.62466929, 1.24721913])

np.percentile(a, 25); print()
np.percentile(a,25,0); print()  # by column
np.percentile(a,25,1)  # by row

5.0
array([3. , 5.5, 4.5])
array([3.5, 3.5, 6. ])

a.min(); print()
a.min(0); print()  # by column
a.min(1)  # by row

1
array([1, 5, 2])
array([2, 1, 5])

a.max(); print()
a.max(0); print()  # by column
a.max(1)  # by row

8
array([7, 6, 8])
array([6, 7, 8])

Calculations on DataFrames

import pandas as pd

import os
os.chdir('/content/dataset')
os.getcwd()

'/content/dataset'

ed = pd.read_excel('EmployeeData2.xlsx')
ed.head()

   id    sexo     fechnac  educ          catlab  salario  salini  tiempemp  expprev minoria
0   1  Hombre  1952-02-03    15       Directivo    57000   27000        98    144.0      No
1   2  Hombre  1958-05-23    16  Administrativo    40200   18750        98     36.0      No
2   3   Mujer  1929-07-26    12  Administrativo    21450   12000        98    381.0      No
3   4   Mujer  1947-04-15     8  Administrativo    21900   13200        98    190.0      No
4   5  Hombre  1955-02-09    15  Administrativo    45000   21000        98    138.0      No

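ed.info  # no parentheses, so the notebook shows the bound method's repr (which embeds the frame); ed.info() would print the per-column dtype and non-null summary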

<bound method DataFrame.info of id sexo fechnac educ catlab salario salini tiempemp \
0 1 Hombre 1952-02-03 15 Directivo 57000 27000 98
1 2 Hombre 1958-05-23 16 Administrativo 40200 18750 98
2 3 Mujer 1929-07-26 12 Administrativo 21450 12000 98
3 4 Mujer 1947-04-15 8 Administrativo 21900 13200 98
4 5 Hombre 1955-02-09 15 Administrativo 45000 21000 98
.. ... ... ... ... ... ... ... ...
469 470 Hombre 1964-01-22 12 Administrativo 26250 15750 64
470 471 Hombre 1966-08-03 15 Administrativo 26400 15750 64
471 472 Hombre 1966-02-21 15 Administrativo 39150 15750 63
472 473 Mujer 1937-11-25 12 Administrativo 21450 12750 63
473 474 Mujer 1968-11-05 12 Administrativo 29400 14250 63

expprev minoria
0 144.0 No
1 36.0 No
2 381.0 No
3 190.0 No
4 138.0 No
.. ... ...
469 69.0 Sí
470 32.0 Sí
471 46.0 No
472 139.0 No
473 9.0 No

[474 rows x 10 columns]>

ed.describe()
#ed.describe(include = [np.number])

               id        educ        salario        salini    tiempemp     expprev
count  474.000000  474.000000     474.000000    474.000000  474.000000  450.000000
mean   237.500000   13.491561   34419.567511  17016.086498   81.109705  100.973333
std    136.976275    2.884846   17075.661465   7870.638154   10.060945  104.907443
min      1.000000    8.000000   15750.000000   9000.000000   63.000000    2.000000
25%    119.250000   12.000000   24000.000000  12487.500000   72.000000   24.000000
50%    237.500000   12.000000   28875.000000  15000.000000   81.000000   59.000000
75%    355.750000   15.000000   36937.500000  17490.000000   90.000000  144.000000
max    474.000000   21.000000  135000.000000  79980.000000   98.000000  476.000000

ed.describe(include = ['O'])
# Includes only the columns with object dtype (character strings)

          sexo          catlab minoria
count      474             474     474
unique       2               3       2
top     Hombre  Administrativo      No
freq       258             363     370

ed.isnull().sum()

id 0
sexo 0
fechnac 1
educ 0
catlab 0
salario 0
salini 0
tiempemp 0
expprev 24
minoria 0
dtype: int64
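The 24 missing values in `expprev` explain why `describe()` reports a count of 450 rather than 474 for that column: pandas statistics skip NaN by default. A minimal sketch on a hypothetical Series (the values are made up for illustration, not taken from EmployeeData2.xlsx):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one gap
s = pd.Series([144.0, 36.0, np.nan, 190.0])

n_missing = s.isnull().sum()        # number of NaN entries
n_valid = s.count()                 # count() ignores NaN
mean_skip = s.mean()                # NaN skipped by default (skipna=True)
mean_strict = s.mean(skipna=False)  # NaN propagates into the result

print(n_missing, n_valid, mean_skip, mean_strict)
```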

ed['sexo'].value_counts(); print('\n')
ed['catlab'].value_counts()

Hombre 258
Mujer 216
Name: sexo, dtype: int64

Administrativo 363
Directivo 84
Seguridad 27
Name: catlab, dtype: int64

ed['sexo'].value_counts(normalize=True); print('\n')
ed['catlab'].value_counts(normalize=True)

Hombre 0.544304
Mujer 0.455696
Name: sexo, dtype: float64

Administrativo 0.765823
Directivo 0.177215
Seguridad 0.056962
Name: catlab, dtype: float64

pd.crosstab(ed['sexo'], ed['catlab'])

catlab  Administrativo  Directivo  Seguridad
sexo
Hombre             157         74         27
Mujer              206         10          0

pd.crosstab(ed['sexo'], ed['catlab'], normalize = 'index')

catlab  Administrativo  Directivo  Seguridad
sexo
Hombre        0.608527   0.286822   0.104651
Mujer         0.953704   0.046296   0.000000

pd.crosstab(ed['sexo'], ed['catlab'], normalize = 'columns')

catlab  Administrativo  Directivo  Seguridad
sexo
Hombre        0.432507   0.880952        1.0
Mujer         0.567493   0.119048        0.0

pd.crosstab(ed['sexo'], ed['catlab'], normalize = 'all')

catlab  Administrativo  Directivo  Seguridad
sexo
Hombre        0.331224   0.156118   0.056962
Mujer         0.434599   0.021097   0.000000
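The three `normalize` options differ only in which total each cell is divided by: the row total (`'index'`), the column total (`'columns'`) or the grand total (`'all'`). A small sketch on hypothetical toy data (not the employee file) that checks which sums come out as 1:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the normalize options
toy = pd.DataFrame({
    'sexo':   ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer'],
    'catlab': ['Administrativo', 'Directivo', 'Directivo',
               'Administrativo', 'Administrativo'],
})

by_row = pd.crosstab(toy['sexo'], toy['catlab'], normalize='index')
by_col = pd.crosstab(toy['sexo'], toy['catlab'], normalize='columns')
overall = pd.crosstab(toy['sexo'], toy['catlab'], normalize='all')

print(by_row.sum(axis=1))        # each row sums to 1
print(by_col.sum(axis=0))        # each column sums to 1
print(overall.to_numpy().sum())  # the whole table sums to 1
```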

Groupby

ed.groupby(by='sexo')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f23685c9e40>

ed.groupby(by='sexo').describe()

        id                                                                educ              ...  tiempemp    expprev
        count        mean         std  min     25%    50%     75%    max  count       mean  ...   75%   max  count        mean         std  min    25%   50%
sexo
Hombre  258.0  227.550388  141.168440  1.0  103.25  216.5  342.50  472.0  258.0  14.430233  ...  91.0  98.0  258.0  111.620155  109.692296  3.0  37.25  67.5
Mujer   216.0  249.384259  131.130737  3.0  141.75  247.5  360.25  474.0  216.0  12.370370  ...  88.0  98.0  192.0   86.666667   96.553993  2.0  11.00  48.0

2 rows × 48 columns

ed.groupby(by='sexo').describe().stack()
# compute descriptive statistics for each group

                      id        educ        salario        salini    tiempemp     expprev
sexo
Hombre count  258.000000  258.000000     258.000000    258.000000  258.000000  258.000000
       mean   227.550388   14.430233   41441.782946  20301.395349   81.720930  111.620155
       std    141.168440    2.979335   19499.213736   9111.780867   10.351020  109.692296
       min      1.000000    8.000000   19650.000000   9000.000000   63.000000    3.000000
       25%    103.250000   12.000000   28050.000000  15000.000000   73.250000   37.250000
       50%    216.500000   15.000000   32850.000000  15750.000000   82.000000   67.500000
       75%    342.500000   16.000000   50412.500000  22372.500000   91.000000  149.750000
       max    472.000000   21.000000  135000.000000  79980.000000   98.000000  476.000000
Mujer  count  216.000000  216.000000     216.000000    216.000000  216.000000  192.000000
       mean   249.384259   12.370370   26031.921296  13091.967593   80.379630   86.666667
       std    131.130737    2.319152    7558.021452   2935.599213    9.676361   96.553993
       min      3.000000    8.000000   15750.000000   9000.000000   63.000000    2.000000
       25%    141.750000   12.000000   21562.500000  11193.750000   72.000000   11.000000
       50%    247.500000   12.000000   24300.000000  12375.000000   81.000000   48.000000
       75%    360.250000   15.000000   28500.000000  14250.000000   88.000000  137.500000
       max    474.000000   17.000000   58125.000000  30000.000000   98.000000  412.000000

ed.groupby(by='sexo').describe(include=['O']).stack()

                       catlab minoria
sexo
Hombre count              258     258
       unique               3       2
       top     Administrativo      No
       freq               157     194
Mujer  count              216     216
       unique               2       2
       top     Administrativo      No
       freq               206     176

# mean of the numeric variables, by sex
# (numeric_only=True is passed explicitly; relying on the default raised a FutureWarning in the original run)

ed.groupby(by='sexo').mean(numeric_only=True)

                id       educ       salario        salini  tiempemp     expprev
sexo
Hombre  227.550388  14.430233  41441.782946  20301.395349  81.72093  111.620155
Mujer   249.384259  12.370370  26031.921296  13091.967593  80.37963   86.666667

# mean of the salario variable, by sex


ed.groupby(by='sexo')['salario'].mean()

sexo
Hombre 41441.782946
Mujer 26031.921296
Name: salario, dtype: float64

pd.DataFrame(ed.groupby(by='sexo')['salario'].mean())

             salario
sexo
Hombre  41441.782946
Mujer   26031.921296

pd.DataFrame(ed.groupby(by='sexo', as_index = False)['salario'].mean())

     sexo       salario
0  Hombre  41441.782946
1   Mujer  26031.921296

ed.groupby(by=['sexo', 'catlab'])['salario'].mean()

sexo    catlab
Hombre  Administrativo    31558.152866
        Directivo         66243.243243
        Seguridad         30938.888889
Mujer   Administrativo    25003.689320
        Directivo         47213.500000
Name: salario, dtype: float64

pd.DataFrame(ed.groupby(by=['sexo', 'catlab'])['salario'].mean())

                            salario
sexo   catlab
Hombre Administrativo  31558.152866
       Directivo       66243.243243
       Seguridad       30938.888889
Mujer  Administrativo  25003.689320
       Directivo       47213.500000

ed.groupby(by=['sexo', 'catlab'])['salario'].mean().unstack()
# ['salario'].mean() computes the mean of the salario column for each group
# unstack() reshapes the result into a pivot-table layout,
# in which the values of the 'catlab' column become separate columns

catlab  Administrativo     Directivo     Seguridad
sexo
Hombre    31558.152866  66243.243243  30938.888889
Mujer     25003.689320  47213.500000           NaN
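The NaN in the Mujer/Seguridad cell marks a group combination with no rows, which is consistent with the zero count in the earlier crosstab. A minimal sketch on hypothetical toy data showing the same effect, together with `pivot_table` as one alternative way to reach the same layout:

```python
import pandas as pd

# Hypothetical toy data: there is no (Mujer, Seguridad) row
toy = pd.DataFrame({
    'sexo':    ['Hombre', 'Hombre', 'Mujer'],
    'catlab':  ['Administrativo', 'Seguridad', 'Administrativo'],
    'salario': [30000, 31000, 25000],
})

means = toy.groupby(['sexo', 'catlab'])['salario'].mean()
wide = means.unstack()  # inner index level 'catlab' becomes the columns

# One alternative that produces the same wide layout in a single call
pivot = toy.pivot_table(index='sexo', columns='catlab',
                        values='salario', aggfunc='mean')

print(wide)
print(pivot)
```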

ed[['sexo', 'catlab', 'salario', 'tiempemp']].groupby(
    by=['sexo', 'catlab']).aggregate(['min', 'max', np.mean, np.std])

                      salario                                      tiempemp
                          min     max          mean           std       min max       mean        std
sexo   catlab
Hombre Administrativo   19650   80000  31558.152866   7997.977675        63  98  81.726115  10.670239
       Directivo        38700  135000  66243.243243  18051.569628        64  98  81.770270  10.403565
       Seguridad        24300   35250  30938.888889   2114.616411        67  95  81.555556   8.486792
Mujer  Administrativo   15750   54000  25003.689320   5812.838103        63  98  80.563107   9.658228
       Directivo        34410   58125  47213.500000   8501.252538        64  90  76.600000   9.766155

ed[['sexo','catlab','salario','tiempemp']].groupby(by=['sexo',
'catlab']).aggregate({'salario':'min', 'tiempemp':np.mean})

                       salario   tiempemp
sexo   catlab
Hombre Administrativo    19650  81.726115
       Directivo         38700  81.770270
       Seguridad         24300  81.555556
Mujer  Administrativo    15750  80.563107
       Directivo         34410  76.600000
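When each column gets its own statistic, as in the dictionary form above, pandas also offers named aggregation, which lets the output columns carry descriptive names. A short sketch on hypothetical toy data; the output names `salario_min` and `tiempemp_mean` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not the employee file
toy = pd.DataFrame({
    'sexo':     ['Hombre', 'Hombre', 'Mujer', 'Mujer'],
    'salario':  [30000, 40000, 25000, 27000],
    'tiempemp': [90, 70, 80, 60],
})

# Named aggregation: output_name=(input_column, function_name)
summary = toy.groupby('sexo').agg(
    salario_min=('salario', 'min'),
    tiempemp_mean=('tiempemp', 'mean'),
)

print(summary)
```

Passing function names as strings such as `'mean'` also avoids depending on how pandas handles NumPy callables in a given version.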

