NumPy & pandas
NumPy is the core library for scientific computing in Python.
It provides a high-performance multidimensional array object and tools
for working with these arrays.
NumPy (also written Numpy) is Python's linear algebra library. It is so
important for data science in Python because almost all libraries in
the PyData ecosystem rely on NumPy as one of their building blocks.
Using NumPy:-
import numpy as np
NumPy has a lot of built-in functions and capabilities. We will not cover them all;
instead we will focus on some of the most important ones.
NUMPY ARRAYS:-
NumPy arrays are the main way we will use NumPy throughout the lesson.
NumPy arrays come in two flavors: vectors and matrices. Vectors
are strictly 1-d arrays and matrices are 2-d.
my_list = [1,2,3,5]
my_list
np.array(my_list)
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix
np.array(my_matrix)
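As a quick check on the two flavors described above, here is a small sketch that converts a flat list and a nested list into arrays and inspects their dimensions. The names `vector` and `matrix` are illustrative only and are not part of the original notes.

```python
import numpy as np

my_list = [1, 2, 3, 5]
my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

vector = np.array(my_list)    # 1-d array built from a flat list
matrix = np.array(my_matrix)  # 2-d array built from a list of lists

print(vector.ndim, vector.shape)  # number of dimensions and shape of the vector
print(matrix.ndim, matrix.shape)  # number of dimensions and shape of the matrix
```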
Built-in Methods
There are lots of built-in ways to generate Arrays
arange
np.arange(0,10)
np.arange(0,11,2)
zeros and ones
np.zeros((5,5))
np.ones(3)
np.ones((3,3))
linspace
np.linspace(0,10,3)
np.linspace(0,10,50)
eye
np.eye(4)
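To see what each of these generators returns, the sketch below collects the calls in one place. The variable names are illustrative, not from the original notes.

```python
import numpy as np

evens = np.arange(0, 11, 2)      # start at 0, stop before 11, step 2
zeros = np.zeros((5, 5))         # 5x5 array filled with 0.0
ones = np.ones(3)                # length-3 vector filled with 1.0
points = np.linspace(0, 10, 3)   # 3 evenly spaced values, both endpoints included
identity = np.eye(4)             # 4x4 identity matrix

print(evens)
print(points)
print(identity)
```

Note the difference between the third arguments: for arange it is a step size, while for linspace it is the number of points.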
Random
Numpy also has lots of ways to create random number arrays:
rand
Create an array of the given shape and populate it with random samples from a uniform
distribution over [0, 1), that is, including 0 but excluding 1.
np.random.rand(2)
np.random.rand(5,5)
randn
Return a sample (or samples) from the "standard normal" distribution, unlike rand, which is
uniform:
np.random.randn(2)
np.random.randn(5,5)
randint
Return random integers from low (inclusive) to high (exclusive).
np.random.randint(1,100)
np.random.randint(1,100,10)
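Because these functions return different numbers on every run, it can help to check shapes and ranges rather than exact values. The sketch below does that; the fixed seed is an assumption added here for repeatability and is not part of the original notes.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only so repeated runs give the same numbers

uniform = np.random.rand(5, 5)           # uniform samples in [0, 1)
normal = np.random.randn(5, 5)           # standard normal samples
single = np.random.randint(1, 100)       # one integer with 1 <= value < 100
several = np.random.randint(1, 100, 10)  # ten integers, each with 1 <= value < 100

print(uniform.shape, normal.shape)
print(single)
print(several)
```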
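arr = np.arange(25)  # assumed definition: 25 elements, so it reshapes to 5x5 below
ranarr = np.random.randint(0,50,10)  # assumed definition: ten random integers
arr
ranarr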
Reshape
Returns an array containing the same data with a new shape.
arr.reshape(5,5)
max,min,argmax,argmin
These are useful methods for finding max or min values, or for finding their index locations
using argmin or argmax.
ranarr
ranarr.max()
ranarr.argmax()
ranarr.min()
ranarr.argmin()
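With random data the results change on every run, so here is the same idea on a fixed array. The values in `values` are made up for illustration.

```python
import numpy as np

values = np.array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])  # hypothetical sample data

print(values.max(), values.argmax())  # largest value and the index where it sits
print(values.min(), values.argmin())  # smallest value and the index where it sits
```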
Shape
Shape is an attribute that arrays have (not a method):
# Vector
arr.shape
arr.reshape(1,25).shape
arr.reshape(25,1)
arr.reshape(25,1).shape
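The reshape and shape calls above can be tried together in one short sketch. It assumes `arr` is a 25-element vector, which is what the reshape calls in this section imply; the name `grid` is illustrative.

```python
import numpy as np

arr = np.arange(25)          # assumed 25-element vector
grid = arr.reshape(5, 5)     # same data arranged as 5 rows by 5 columns

print(arr.shape)                  # (25,)
print(grid.shape)                 # (5, 5)
print(arr.reshape(1, 25).shape)   # (1, 25): one row, 25 columns
print(arr.reshape(25, 1).shape)   # (25, 1): 25 rows, one column

# reshape only works when the total number of elements matches
try:
    arr.reshape(4, 5)
except ValueError as err:
    print("reshape failed:", err)
```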
dtype
You can also grab the data type of the object in the array:
arr.dtype
import numpy as np
#Show
arr
Broadcasting
Numpy arrays differ from a normal Python list because of their ability to broadcast:
#Show
arr
#Slice the array (range assumed; the defining line is missing from the source)
slice_of_arr = arr[0:6]
#Show slice
slice_of_arr
#Change slice
slice_of_arr[:]=99
#Data is not copied, it's a view of the original array! This avoids memory problems!
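Here is a self-contained sketch of broadcasting and of the view behaviour just described. The sample array and the slice ranges are assumptions chosen for illustration.

```python
import numpy as np

arr = np.arange(0, 11)   # assumed sample array: 0 through 10

arr[0:5] = 100           # broadcasting: one scalar assigned to a whole slice
print(arr)

arr = np.arange(0, 11)   # reset the array
slice_of_arr = arr[0:6]  # a slice is a view onto the same data
slice_of_arr[:] = 99     # changing the view...
print(arr)               # ...also changes the original array

arr_copy = arr.copy()    # an explicit copy owns its own data
arr_copy[:] = 0
print(arr)               # unchanged by edits to the copy
```

If you need to modify a slice without touching the original, call .copy() first.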
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
#Show
arr_2d
#Indexing row
arr_2d[1]
# Format is arr_2d[row][col] or arr_2d[row,col]
# 2D array slicing
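A short sketch of both index formats and of 2D slicing, using the same arr_2d values as above. The result names (`top_right`, `bottom_row`, and so on) are illustrative.

```python
import numpy as np

arr_2d = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

row = arr_2d[1]            # second row
same_a = arr_2d[1][0]      # row 1, column 0 via chained brackets
same_b = arr_2d[1, 0]      # the same element via a single [row, col] index

top_right = arr_2d[:2, 1:]  # rows 0-1, columns 1-2
bottom_row = arr_2d[2, :]   # every column of the last row

print(row, same_a, same_b)
print(top_right)
print(bottom_row)
```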
Fancy Indexing
Fancy indexing allows you to select entire rows or columns in any order. To show this, let's quickly
build out a numpy array:
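#Set up array (10x10 shape is assumed; the original definition is missing)
arr2d = np.zeros((10,10))
#Length of array
arr_length = arr2d.shape[1]
#Fill each row with its row index
for i in range(arr_length):
    arr2d[i] = i
arr2d
#Select rows 2, 4, 6 and 8
arr2d[[2,4,6,8]]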
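The following sketch shows fancy indexing end to end, including rows picked out of order. The 10x10 shape is an assumption, and `picked` and `reordered` are illustrative names.

```python
import numpy as np

arr2d = np.zeros((10, 10))        # assumed 10x10 array
for i in range(arr2d.shape[0]):
    arr2d[i] = i                  # row i is filled with the value i

picked = arr2d[[2, 4, 6, 8]]      # rows 2, 4, 6, 8 in that order
reordered = arr2d[[6, 4, 2, 7]]   # any order is allowed

print(picked[:, 0])      # first column shows which rows were selected
print(reordered[:, 0])
```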
Selection
Let's briefly go over how to use brackets for selection based off of comparison operators.
arr = np.arange(1,11)
arr
arr > 4
bool_arr = arr>4
bool_arr
arr[bool_arr]
arr[arr>2]
x = 2
arr[arr>x]
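Here is the selection pattern as one runnable sketch, plus a combined condition. The combined example (`middle`) goes slightly beyond the lines above and is offered only as an illustration.

```python
import numpy as np

arr = np.arange(1, 11)       # 1 through 10
bool_arr = arr > 4           # element-wise comparison gives a boolean array
selected = arr[bool_arr]     # keep only the positions where the mask is True

# conditions combine with & (and) and | (or); each one needs its own parentheses
middle = arr[(arr > 2) & (arr < 8)]

print(bool_arr)
print(selected)
print(middle)
```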
NumPy Operations
Arithmetic
You can easily perform array with array arithmetic, or scalar with array arithmetic. Let's see some
examples:
arr + arr
arr * arr
arr - arr
# Warning on division by zero, but not an error!
# Just replaced with nan (only happens when arr contains a 0, e.g. np.arange(0,11))
arr/arr
arr**3
np.sin(arr)
np.log(arr)
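To see the arithmetic and the division-by-zero behaviour together, the sketch below uses an array that includes 0. That array is an assumption made for the demonstration, and np.errstate is used only to silence the warnings here.

```python
import numpy as np

arr = np.arange(0, 11)   # assumed sample array that includes 0

print(arr + arr)   # element-wise addition
print(arr * arr)   # element-wise multiplication
print(arr ** 3)    # scalar exponent applied to every element

# 0/0 yields nan with a RuntimeWarning rather than an exception;
# errstate silences the warnings for this demonstration only
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = arr / arr
    logs = np.log(arr)   # log(0) yields -inf

print(ratio)
print(logs)
```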
PANDAS:-
Pandas has so many uses that it might make sense to list the things it
can't do instead of what it can do.
This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing it.
For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame
(a table, basically) and then let you inspect and manipulate it from Python.
Not only is the pandas library a central component of the data science
toolkit but it is used in conjunction with other libraries in that collection.
import pandas as pd
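The two primary components of pandas are the Series and the DataFrame.
A Series is essentially a single column, and a DataFrame is a table made
up of a collection of Series.
DataFrames and Series are quite similar in that many operations that
you can do with one you can do with the other, such as filling in null
values and calculating the mean.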
You'll see how these components work when we start working with data
below.
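As a small illustration of that similarity, the sketch below calls mean() and fillna() on both a Series and a DataFrame. The data is hypothetical and the names `frame` and `column` are illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing value per column
frame = pd.DataFrame({"apples": [3.0, np.nan, 1.0], "oranges": [0.0, 4.0, np.nan]})
column = frame["apples"]       # selecting one column gives a Series

print(type(frame).__name__, type(column).__name__)

print(column.mean())           # mean of a Series is a single number
print(frame.mean())            # mean of a DataFrame is one number per column

print(column.fillna(0))        # fillna works on a Series...
print(frame.fillna(0))         # ...and on a whole DataFrame
```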
Let's say we have a fruit stand that sells apples and oranges. We want to
have a column for each fruit and a row for each customer purchase. To
organize this as a dictionary for pandas we could do something like:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
purchases
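# Use customer names as the index (the names other than 'June' are assumed for illustration)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
purchases.loc['June']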
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
df = pd.read_json('purchases.json')
df
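If you do not have a purchases.csv on disk, you can try read_csv on an in-memory string instead. The CSV text below is a hypothetical stand-in for the file, and index_col=0 is one reasonable way to tell pandas that the first column holds the row labels.

```python
import io
import pandas as pd

# stand-in for the contents of purchases.csv (hypothetical data)
csv_text = """,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index
print(df)
print(df.loc["June"])
```

Without index_col, pandas would number the rows 0 to 3 and keep the names as an ordinary column.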
Data frame
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
movies_df.head()
movies_df.tail(2)
movies_df.shape
movies_df.info()
Handling duplicates
This dataset does not have duplicate rows, but it is always important to
verify you aren't aggregating duplicate rows.
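# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
temp_df = temp_df.drop_duplicates()
temp_df.shape
temp_df.drop_duplicates(inplace=True)
temp_df = pd.concat([movies_df, movies_df])  # make a new copy with duplicates
temp_df.drop_duplicates(inplace=True, keep=False)  # keep=False drops every duplicated row
temp_df.shape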
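The effect of the keep argument is easier to see on a tiny table. The sketch below uses a hypothetical three-row stand-in for movies_df; the titles and values are made up.

```python
import pandas as pd

# hypothetical stand-in for movies_df
movies = pd.DataFrame(
    {"rank": [1, 2, 3], "rating": [8.1, 7.0, 7.3]},
    index=["Film A", "Film B", "Film C"],
)

doubled = pd.concat([movies, movies])            # every row now appears twice
print(doubled.shape)                              # (6, 2)

first_kept = doubled.drop_duplicates()            # default keep='first'
print(first_kept.shape)                           # (3, 2)

none_kept = doubled.drop_duplicates(keep=False)   # drop every row that has a duplicate
print(none_kept.shape)                            # (0, 2)
```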
Column cleanup
Many times datasets will have verbose column names with symbols,
upper and lowercase words, spaces, and typos. To make selecting data
by column name easier we can spend a little time cleaning up their
names.
movies_df.columns
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors',
                     'year', 'runtime',
                     'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
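Here is one way the cleanup could look on a small frame. The messy column names and the values are hypothetical, and rename() with a mapping is shown as an alternative to reassigning the whole list.

```python
import pandas as pd

# hypothetical frame with messy column names
df = pd.DataFrame({"Runtime (Minutes)": [121], "Revenue (Millions)": [333.13], "Rating": [8.1]})

# rename selected columns with a mapping
df = df.rename(columns={"Runtime (Minutes)": "runtime", "Revenue (Millions)": "revenue_millions"})

# lowercase whatever is left
df.columns = [col.lower() for col in df.columns]
print(list(df.columns))
```

A mapping only touches the columns you name, which is handy when most names are already fine.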
When exploring data, you'll most likely encounter missing or null values,
which are essentially placeholders for non-existent values. Most
commonly you'll see Python's None or NumPy's np.nan, each of which is
handled differently in some situations.
movies_df.isnull()
movies_df.isnull().sum()
movies_df.dropna()
revenue = movies_df['revenue_millions']
revenue.head()
revenue_mean = revenue.mean()
revenue_mean
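# assign the filled column back rather than relying on inplace=True on an extracted column
movies_df['revenue_millions'] = revenue.fillna(revenue_mean)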
movies_df.isnull().sum()
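The same null-handling steps can be tried on a small frame where the result is easy to check by hand. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical data with two missing revenue values
df = pd.DataFrame({"rating": [8.1, 7.0, 7.3, 6.2],
                   "revenue_millions": [300.0, np.nan, 100.0, np.nan]})

print(df.isnull().sum())           # missing values per column
print(df.dropna().shape)           # shape after removing rows that contain a null

revenue_mean = df["revenue_millions"].mean()   # mean ignores NaN by default
df["revenue_millions"] = df["revenue_millions"].fillna(revenue_mean)
print(df)
```

Dropping rows loses the ratings that were present, while filling with the mean keeps every row.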
movies_df.describe()
movies_df['genre'].describe()
movies_df['genre'].value_counts().head(10)
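movies_df.corr(numeric_only=True)  # restrict to numeric columns; recent pandas versions raise on text columns otherwise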
It's important to note that, although many methods are the same,
DataFrames and Series have different attributes, so you'll need to be sure
which type you are working with or else you will receive
attribute errors.
You already saw how to extract a column using square brackets like this:
genre_col = movies_df['genre']
type(genre_col)
genre_col = movies_df[['genre']]
type(genre_col)
prom = movies_df.loc["Prometheus"]
prom
prom = movies_df.iloc[1]
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset
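One detail worth checking yourself is how the two slices treat their end point. The sketch below uses a hypothetical five-row stand-in for movies_df with made-up titles.

```python
import pandas as pd

# hypothetical stand-in for movies_df, indexed by title
movies = pd.DataFrame(
    {"year": [2014, 2012, 2016, 2016, 2016]},
    index=["Film A", "Film B", "Film C", "Film D", "Film E"],
)

by_label = movies.loc["Film B":"Film D"]   # label slice: the end label is included
by_position = movies.iloc[1:4]             # position slice: the end position is excluded

print(by_label)
print(by_position)
print(by_label.equals(by_position))
```

Both slices return the same three rows, which is why the .loc and .iloc examples above select the same movies.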
condition = (movies_df['director'] == "Ridley Scott")
condition.head()
movies_df[(movies_df['director'] == 'Christopher Nolan') |
          (movies_df['director'] == 'Ridley Scott')].head()
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]
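The conditional selections above can be tested on a small frame where you can predict the answer. The data is hypothetical, and isin() is shown as a shorter alternative to chaining | conditions.

```python
import pandas as pd

# hypothetical stand-in for movies_df
movies = pd.DataFrame(
    {
        "director": ["Ridley Scott", "Christopher Nolan", "Other Director", "Ridley Scott"],
        "year": [2012, 2008, 2009, 2015],
        "rating": [7.0, 9.0, 8.2, 8.0],
    },
    index=["Film A", "Film B", "Film C", "Film D"],
)

# OR: rows matching either director
either = movies[(movies["director"] == "Christopher Nolan") | (movies["director"] == "Ridley Scott")]
same = movies[movies["director"].isin(["Christopher Nolan", "Ridley Scott"])]  # shorter equivalent

# AND: every condition must hold for a row to be kept
recent_and_good = movies[(movies["year"] >= 2005) & (movies["year"] <= 2010) & (movies["rating"] > 8.0)]

print(either)
print(recent_and_good)
```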
Applying functions
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"
Now we want to send the entire rating column through this function,
which is what apply() does:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)
movies_df["rating_category"] = movies_df["rating"].apply(
    lambda x: 'good' if x >= 8.0 else 'bad')
movies_df.head(2)
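Here is apply() on a standalone Series so the output can be checked by eye. The ratings are hypothetical, and the function is the same idea as rating_function above written as a one-line conditional.

```python
import pandas as pd

def rating_function(x):
    # label a single rating value
    return "good" if x >= 8.0 else "bad"

ratings = pd.Series([8.1, 7.0, 8.0, 6.2], name="rating")  # hypothetical ratings

categories = ratings.apply(rating_function)   # the function runs once per element
print(categories.tolist())
print(categories.value_counts())
```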
Let's plot the ratings and revenue data. All we need to
do is call .plot() on movies_df, or on one of its columns, with some info about how to construct the
plot:
movies_df['rating'].plot(kind='hist', title='Rating');
movies_df['rating'].describe()
movies_df['rating'].plot(kind="box");
movies_df.boxplot(column='revenue_millions', by='rating_category');
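To look at the relationship between ratings and revenue directly, a scatter plot is one option. The sketch below is an assumption about how that could be done: it uses hypothetical data, requires matplotlib to be installed, and selects the non-interactive Agg backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend so the sketch runs without a display
import pandas as pd

# hypothetical stand-in for movies_df
movies = pd.DataFrame({"rating": [8.1, 7.0, 7.3, 6.2, 8.5],
                       "revenue_millions": [333.1, 126.5, 138.1, 45.1, 532.2]})

# one point per movie: rating on the x axis, revenue on the y axis
ax = movies.plot(kind="scatter", x="rating", y="revenue_millions",
                 title="Revenue (millions) vs Rating")
print(ax.get_title())
```

In a notebook you would normally skip the backend line and let the plot render inline.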