Pandas 1
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning
and transformation operations that are critical in working with data in Python.
pandas builds upon numpy and scipy, providing easy-to-use data structures and data
manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. This notebook
gives a brief introduction to these two data structures and to data ingestion, and then
covers the key features of pandas.
pandas Series
A pandas Series is a one-dimensional labeled array.
In [2]: ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
print(ser)
tom 100
bob foo
nancy 300
dan bar
eric 500
dtype: object
In [3]: ser.index
Out[6]: 300
Out[7]: True
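The cells that produced the two outputs above are not shown. As a minimal sketch, assuming they were a label lookup and a membership test on the same Series, the following reproduces values of that kind; the variable names are only illustrative.

```python
import pandas as pd

# Same Series as in the cell above, with the labels passed as `index`.
ser = pd.Series([100, 'foo', 300, 'bar', 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

labels = list(ser.index)        # the index holds the labels
by_label = ser['nancy']         # label-based lookup
by_loc = ser.loc['nancy']       # same lookup, explicit label indexer
by_position = ser.iloc[2]       # position-based lookup (third element)
has_bob = 'bob' in ser          # membership checks the index, not the values
subset = ser[['nancy', 'bob']]  # a list of labels returns a smaller Series

print(by_label, has_bob)
print(subset)
```

Note that `in` on a Series tests the labels; to test the values, use `ser.values` or `ser.isin([...])`.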
In [8]: print(ser)
print(ser*2)
tom 100
bob foo
nancy 300
dan bar
eric 500
dtype: object
tom 200
bob foofoo
nancy 600
dan barbar
eric 1000
dtype: object
Out[14]: two five
dancy 4444.0 NaN
ball 222.0 NaN
apple 111.0 NaN
Create DataFrame from list of Python dictionaries
In [15]: data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [16]: pd.DataFrame(data)
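The output of this cell is not shown here. As a small sketch of what to expect: each dictionary becomes one row, the columns are the union of all keys, and keys missing from a row are filled with NaN. The row labels `'orange'` and `'red'` below are hypothetical.

```python
import pandas as pd

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

# One row per dictionary; missing keys become NaN.
df = pd.DataFrame(data)
print(df)

# Optional: give the rows explicit labels (hypothetical names).
df_labeled = pd.DataFrame(data, index=['orange', 'red'])

# Optional: keep only some of the keys as columns.
df_subset = pd.DataFrame(data, columns=['joe', 'dora', 'alice'])
print(df_subset)
```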
Out[26]: one flag
apple 100.0 False
ball 200.0 False
cerill NaN False
clock 300.0 True
dancy NaN False
In [27]: df.insert(2, 'copy_of_one', df['one'])
df
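The result of the insert is not shown. The sketch below rebuilds a DataFrame with the values from Out[26] (how the original `df` was constructed is an assumption) and shows where `insert` places the new column.

```python
import pandas as pd

nan = float('nan')

# Rebuilt by hand from the values displayed in Out[26].
df = pd.DataFrame(
    {'one': [100.0, 200.0, nan, 300.0, nan],
     'flag': [False, False, False, True, False]},
    index=['apple', 'ball', 'cerill', 'clock', 'dancy'],
)

# insert(loc, column, value) modifies df in place; loc is a 0-based
# column position, so 2 places the copy after 'one' and 'flag'.
df.insert(2, 'copy_of_one', df['one'])
print(df)
```

Unlike `df['new'] = ...`, which always appends at the end, `insert` lets you choose the position. It raises a `ValueError` if the column name already exists, unless `allow_duplicates=True` is passed.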
Let us look at the files in this dataset using the UNIX command ls.
In [30]: %%bash
ls movielens/Large/
README.txt
genome-scores.csv
genome-tags.csv
links.csv
movies.csv
ratings.csv
tags.csv
In [31]: %%bash
cat movielens/Large/movies.csv | wc -l
27279
In [32]: %%bash
cat movielens/Large/ratings.csv | wc -l
20000264
In [33]: %%bash
head -5 ./movielens/Large/ratings.csv
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
Use Pandas to Read the Dataset
In this notebook, we will be using three CSV files: movies.csv, ratings.csv and tags.csv.
Using the read_csv function in pandas, we will ingest these three files.
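To try read_csv without the full dataset on disk, the sketch below parses the first rows of ratings.csv shown by the `head -5` command, using an in-memory buffer in place of a file path. The name `ratings_sample` is only illustrative.

```python
import io
import pandas as pd

# First lines of ratings.csv, as printed by `head -5` above.
csv_text = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
"""

# read_csv accepts a path or any file-like object.
ratings_sample = pd.read_csv(io.StringIO(csv_text), sep=',')
print(ratings_sample.dtypes)

# For a 20-million-row file it can help to read only part of it:
# nrows limits the rows, usecols limits the columns.
partial = pd.read_csv(io.StringIO(csv_text), nrows=2,
                      usecols=['movieId', 'rating'])
print(partial)
```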
In [34]: movies = pd.read_csv('./movielens/Large/movies.csv', sep=',')
print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
userId 18
movieId 4141
tag Mark Waters
Name: 0, dtype: object
In [39]: row_0.index
Out[40]: 18
Out[41]: False
In [42]: row_0.name
Out[42]: 0
Out[43]: 'first_row'
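Several of the cells behind these outputs are missing. The sketch below shows one way to obtain outputs of this kind, assuming `row_0` is the first row of the tags DataFrame taken with `iloc`. The first row uses the values printed above; the second row is made up for illustration.

```python
import pandas as pd

tags = pd.DataFrame({
    'userId': [18, 65],
    'movieId': [4141, 208],
    'tag': ['Mark Waters', 'hypothetical tag'],  # second tag is invented
})

# A single row of a DataFrame is returned as a Series whose index
# holds the column names and whose name is the row label.
row_0 = tags.iloc[0]
print(type(row_0))
print(row_0.index)

user = row_0['userId']            # access a field by column name
has_rating = 'rating' in row_0    # False: tags has no 'rating' column
original_name = row_0.name        # 0, the row label

# rename returns a new Series with a different name.
renamed = row_0.rename('first_row')
print(renamed.name)
```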
Descriptive Statistics
Let's look at how the ratings are distributed!
In [44]: ratings.describe()
In [46]: ratings.corr()
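The outputs of describe() and corr() are not reproduced here. As a small sketch on made-up ratings (the numbers below are not from the dataset), this is what the two calls return:

```python
import pandas as pd

# Tiny, invented ratings table with the same column names.
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [2, 29, 2, 32, 29],
    'rating':  [3.5, 4.0, 2.0, 5.0, 4.5],
})

# describe() on a column gives count, mean, std, min, quartiles and max.
summary = ratings['rating'].describe()
print(summary)

# The same statistics are available individually.
print(ratings['rating'].mean(), ratings['rating'].min(), ratings['rating'].max())

# corr() gives the pairwise (Pearson) correlation between numeric columns.
corr = ratings.corr()
print(corr)
```

In recent pandas versions corr() raises an error if the DataFrame has non-numeric columns, so select the numeric columns first or pass `numeric_only=True`.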
Out[49]: (27278, 3)
Does any row contain a null value?
In [50]: movies.isnull().any()
Out[51]: (20000263, 3)
In [52]: ratings.isnull().any()
Out[53]: (465564, 3)
In [54]: tags.isnull().any()
Out[57]: (465548, 3)
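The tags shape drops from 465564 to 465548 rows between the two outputs above, which is consistent with removing rows that contain nulls, although that cell is not shown. A minimal sketch of the check and the cleanup, on invented data:

```python
import pandas as pd

# Invented rows; one tag is missing.
tags = pd.DataFrame({
    'userId': [18, 65, 65],
    'movieId': [4141, 208, 353],
    'tag': ['Mark Waters', None, 'hypothetical tag'],
})

# isnull() marks each cell; any() reduces per column.
has_null = tags.isnull().any()
print(has_null)

# any(axis=1) reduces per row instead, flagging the affected rows.
rows_with_null = tags.isnull().any(axis=1)

# dropna() returns a new DataFrame without those rows.
cleaned = tags.dropna()
print(tags.shape, cleaned.shape)
```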
Data Visualization
In [58]: import matplotlib.pyplot as plt
In [59]: ratings.hist(column='rating', figsize=(15,10),bins=10)
plt.show()
Getting information from columns
In [60]: tags['tag'].head()
In [61]: movies[['title','genres']].head()
Out[69]: movieId
rating
0.5 239125
1.0 680732
1.5 279252
2.0 1430997
2.5 883398
3.0 4291193
3.5 2200156
4.0 5561926
4.5 1534824
5.0 2898660
Group By and Aggregate
In [70]: average_rating = ratings[['movieId','rating']].groupby('movieId').mean() # We are not interested in the user that voted for it
average_rating.head()
Out[70]: rating
movieId
1 3.921240
2 3.211977
3 3.151040
4 2.861393
5 3.064592
Task:
Get the movies with the highest average rating
Option 1:
Sort the list in descending order and get the first rows
In [71]: sorted_average_rating=average_rating.sort_values(by="rating",ascending=False)
sorted_average_rating.head()
Out[71]: rating
movieId
95517 5.0
105846 5.0
89133 5.0
105187 5.0
105191 5.0
Option 2:
Do not sort the list, but instead select the rows where the average rating is 5.0
In [72]: average_rating.loc[average_rating.rating==5.0].head()
Out[72]: rating
movieId
26718 5.0
27914 5.0
32230 5.0
40404 5.0
54326 5.0
The movie Id alone does not tell us which movie it refers to, so we would like to see
the name of the movie instead. To do that, we need to look it up in the movies DataFrame.
In [73]: id_movie=average_rating.loc[average_rating.rating==5.0].index
In [74]: movies.loc[movies.movieId.isin(id_movie)].head()
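The whole lookup can be tried end to end on a few invented rows. The sketch below follows the same steps (group, filter on 5.0, look up with isin) and adds two variations that are not in the original notebook: a rating count per movie, and a merge that attaches titles to every average.

```python
import pandas as pd

# Invented data with the same column names as the dataset.
ratings = pd.DataFrame({
    'userId':  [1, 2, 1, 2, 3],
    'movieId': [10, 10, 20, 20, 30],
    'rating':  [5.0, 5.0, 3.0, 4.0, 5.0],
})
movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# Average rating per movie.
average_rating = ratings[['movieId', 'rating']].groupby('movieId').mean()

# Ids whose average is exactly 5.0, then the matching titles.
top_ids = average_rating.loc[average_rating.rating == 5.0].index
top_movies = movies.loc[movies.movieId.isin(top_ids)]
print(top_movies)

# Variation: a perfect average may rest on a single vote, so it can
# help to look at the number of ratings as well.
stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
print(stats)

# Variation: merge attaches the title to every average in one step.
with_titles = movies.merge(average_rating, left_on='movieId', right_index=True)
print(with_titles.sort_values(by='rating', ascending=False))
```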
box_office[is_highly_rated].tail()
box_office[is_comedy].head()
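The definitions of `box_office`, `is_highly_rated` and `is_comedy` are not shown. A minimal sketch, assuming both are boolean Series used as row filters; the data, the 4.0 threshold and the way the masks are built are all assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the box_office DataFrame.
box_office = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genres': ['Comedy|Romance', 'Action', 'Comedy', 'Drama'],
    'rating': [4.5, 3.0, 2.5, 4.0],
})

# A comparison on a column yields a boolean Series (a mask).
is_highly_rated = box_office['rating'] >= 4.0           # assumed threshold
is_comedy = box_office['genres'].str.contains('Comedy')

# Indexing with a mask keeps only the rows where it is True.
print(box_office[is_highly_rated].tail())
print(box_office[is_comedy].head())

# Masks combine with & (and), | (or) and ~ (not).
highly_rated_comedies = box_office[is_highly_rated & is_comedy]
print(highly_rated_comedies)
```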
In [86]: movie_genres.head(10)
Out[86]: 0 1 2 3 4 5 6 7 8 9
0 Adventure Animation Children Comedy Fantasy None None None None None
1 Adventure Children Fantasy None None None None None None None
2 Comedy Romance None None None None None None None None
3 Comedy Drama Romance None None None None None None None
4 Comedy None None None None None None None None None
5 Action Crime Thriller None None None None None None None
6 Comedy Romance None None None None None None None None
7 Adventure Children None None None None None None None None
8 Action None None None None None None None None None
9 Action Adventure Thriller None None None None None None None
Add a new column for comedy genre flag
In [88]: movie_genres.head()
Out[88]: 0 1 2 3 4 5 6 7 8 9 IsComedy
0 Adventure Animation Children Comedy Fantasy None None None None None True
1 Adventure Children Fantasy None None None None None None None False
2 Comedy Romance None None None None None None None None True
3 Comedy Drama Romance None None None None None None None True
4 Comedy None None None None None None None None None True
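The cells that built `movie_genres` are missing. One way to get a table of this shape, assuming the genres column of movies.csv is a single pipe-separated string, is sketched below with three of the genre strings shown above.

```python
import pandas as pd

# Genre strings matching rows 0, 1 and 5 of the output above.
movies = pd.DataFrame({
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Adventure|Children|Fantasy',
        'Action|Crime|Thriller',
    ],
})

# expand=True spreads the pieces over columns 0, 1, 2, ...;
# shorter lists are padded with missing values.
movie_genres = movies['genres'].str.split('|', expand=True)
print(movie_genres)

# Flag column: True where the original string mentions Comedy.
movie_genres['IsComedy'] = movies['genres'].str.contains('Comedy')
print(movie_genres)
```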
Extract year from title e.g. (1995)
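The extraction cell itself is not shown. A minimal sketch using `str.extract` with a regular expression; the pattern below (four digits in parentheses at the end of the title) is an assumption and may be stricter than the one used in the original notebook. The last title is invented to show the no-match case.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled Example'],
})

# The capture group (\d{4}) is what str.extract returns;
# expand=False gives a Series instead of a one-column DataFrame.
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
print(movies)

# The extracted year is a string; convert it if numbers are needed.
# Titles without a match stay missing (NaN).
year_num = pd.to_numeric(movies['year'])
print(year_num)
```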
In [91]: tags.head(5)
In [93]: tags['parsed_time'].dtype
Out[93]: dtype('<M8[ns]')
In [94]: tags.head(2)
(465564, 5) (12130, 5)
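The cells between these outputs are missing. The sketch below shows one way to arrive at a `parsed_time` column of datetime type and a date-filtered subset, assuming the timestamp column holds Unix time in seconds. The rows and the cutoff date are made up.

```python
import pandas as pd

# Invented rows; timestamps are seconds since 1970-01-01 UTC.
tags = pd.DataFrame({
    'userId': [18, 65, 65],
    'movieId': [4141, 208, 353],
    'tag': ['tag a', 'tag b', 'tag c'],
    'timestamp': [1240597180, 1368150078, 1425000000],
})

# unit='s' tells pandas the integers are seconds, not nanoseconds.
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
print(tags['parsed_time'].dtype)

# Datetime columns can be compared against a Timestamp to build a mask.
greater_than_t = tags['parsed_time'] > pd.Timestamp('2015-02-01')
selected_rows = tags[greater_than_t]
print(tags.shape, selected_rows.shape)
```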
Sorting the table using the timestamps
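The sorting cell is not shown. A minimal sketch with `sort_values` on a parsed datetime column, using invented rows:

```python
import pandas as pd

# Invented rows, deliberately out of chronological order.
tags = pd.DataFrame({
    'tag': ['tag a', 'tag b', 'tag c'],
    'timestamp': [1368150078, 1240597180, 1425000000],
})
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

# sort_values returns a sorted copy; the original order is kept in tags.
oldest_first = tags.sort_values(by='parsed_time', ascending=True)
newest_first = tags.sort_values(by='parsed_time', ascending=False)

print(oldest_first[:10])
```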
In [99]: joined.corr()