RAW Data
Raw data comes in many forms and sizes. There is a lot of information that can be extracted from
this raw data.
To give an example, Amazon collects clickstream data that records every click a user makes
on the website. This data can be used to understand whether a user is a price-sensitive customer or prefers
more popularly rated products. You must have noticed recommended products on Amazon; they are
derived using such data. The first step towards such an analysis is to parse the raw data.
Parsing the data involves the following steps:
• Extracting data from the source: Data can come in many forms, such as Excel, CSV, JSON,
databases, and so on. Python makes it very easy to read data from these sources with the help of
some useful packages.
• Cleaning the data: Once a sanity check has been done, the data needs to be cleaned appropriately
so that it can be used for analysis. You may have a dataset about the students of a class with
details about their height, weight, and marks. There may also be certain rows with the height or
weight missing. Depending on the analysis being performed, these rows with missing values can
either be ignored or have the missing values replaced with the average height or weight.
• Manipulating data
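The cleaning step described above can be sketched with pandas. This is only a minimal illustration: the students table, its column names, and its values are hypothetical and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical class data; NaN marks a missing measurement.
students = pd.DataFrame({
    "height_cm": [150.0, np.nan, 160.0, 155.0],
    "weight_kg": [45.0, 50.0, np.nan, 55.0],
    "marks": [78, 85, 62, 90],
})

# Option 1: ignore (drop) every row that has a missing value.
dropped = students.dropna()

# Option 2: replace each missing value with the mean of its column.
filled = students.fillna(students.mean())

print(dropped)
print(filled)
```

Which option is appropriate depends on the analysis: dropping rows loses the students' other measurements, while filling with the mean keeps the rows but reduces the variance of the filled column.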
NumPy
Python comes with built-in data structures, such as the list, that can be used for array
operations, but a Python list on its own is not suited to heavy mathematical operations, as it
is not optimized for them. NumPy is a Python package, created by Travis Oliphant, that is
designed fundamentally for scientific computing. It handles large multidimensional
arrays and matrices, and provides a large library of high-level mathematical functions that operate on
these arrays. A NumPy array requires much less memory to store the same amount of data
than a Python list, which also makes reading from and writing to the array faster.
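The memory claim can be checked with a rough sketch. It assumes 1,000 integers and counts the list's memory as the list object plus the integer objects it points to; the exact list figure depends on the Python build, so no output is shown.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores the raw 8-byte values in one contiguous buffer.
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
```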
Creating an array
A list of numbers can be passed to the array function to create a NumPy array object:
>>> import numpy as np
>>> n_array = np.array([[0, 1, 2, 3],
...                     [4, 5, 6, 7],
...                     [8, 9, 10, 11]])
A NumPy array object has a number of attributes, which help in giving information about the array.
Here are its important attributes:
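• ndim: This gives the number of dimensions of the array. The following shows that the array that we
defined has two dimensions:
>>> n_array.ndim
2
• shape: This gives the size of each dimension of the array:
>>> n_array.shape
(3, 4)
The first dimension of n_array has a size of 3 and the second dimension has a size of 4. This can
also be visualized as three rows and four columns.
• size: This gives the total number of elements in the array:
>>> n_array.size
12
• dtype: This gives the datatype of the elements in the array:
>>> n_array.dtype.name
'int64'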
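To see how these attributes relate to one another, here is a small sketch with three hypothetical arrays of one, two, and three dimensions. In each case, size equals the product of the entries in shape.

```python
import numpy as np

# Hypothetical arrays of one, two, and three dimensions.
one_d = np.array([1, 2, 3])
two_d = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
three_d = np.zeros((2, 3, 4))

for arr in (one_d, two_d, three_d):
    # size is always the product of the entries in shape
    assert arr.size == int(np.prod(arr.shape))
    print(arr.ndim, arr.shape, arr.size, arr.dtype.name)
```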
Mathematical operations
When you have an array of data, you would like to perform certain mathematical operations on it. We
will now discuss a few of the important ones in the following sections.
Array subtraction
The following command subtracts the b array from the a array to get the resultant c array. The
subtraction happens element by element:
>>> c = a - b
Note that when you subtract two arrays, they should have the same shape (or shapes that NumPy
can broadcast to a common shape).
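As a minimal sketch of element-wise subtraction, assume two small arrays a and b with the hypothetical values below. The last part shows what happens when the shapes are incompatible.

```python
import numpy as np

a = np.array([11, 12, 13, 14])  # assumed values
b = np.array([1, 2, 3, 4])      # assumed values

c = a - b   # each element of b is subtracted from the matching element of a
print(c)    # [10 10 10 10]

# Shapes that cannot be matched up raise a ValueError.
try:
    a - np.array([1, 2, 3])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```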
Squaring an array
The following command raises each element to the power of 2 to obtain this result:
>>> b**2
[1 4 9 16]
Applying a trigonometric function
The following command applies the cosine function to each of the values in the b array to obtain
the following result:
>>> np.cos(b)
[ 0.54030231 -0.41614684 -0.9899925 -0.65364362]
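Both operations can be tried together in a short sketch. It assumes that b holds the values 1 to 4, which is consistent with the squared output shown earlier; each operation returns a new array and leaves b unchanged.

```python
import numpy as np

b = np.array([1, 2, 3, 4])  # assumed values

squared = b ** 2      # element-wise power
cosines = np.cos(b)   # element-wise cosine, with the values read as radians

print(squared)
print(cosines)
```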
Conditional operations
The following command will apply a conditional operation to each of the elements of the b array, in
order to generate the respective Boolean values:
>>> b
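A minimal sketch of such a conditional operation follows. The values of b and the threshold of 2 are assumptions chosen for illustration.

```python
import numpy as np

b = np.array([1, 2, 3, 4])  # assumed values

mask = b < 2   # element-wise comparison; the threshold 2 is arbitrary
print(mask)    # [ True False False False]

# The Boolean array can then be used to pick out the matching elements.
selected = b[mask]
print(selected)  # [1]
```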
Matrix multiplication
Two matrices can be multiplied element by element or as a dot product. The following command
performs the element-by-element multiplication:
>>> A1 * A2
[[2 0]
[0 4]]
The dot product can be computed with the following command:
>>> np.dot(A1, A2)
[[5 4]
[3 4]]
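The difference between the two kinds of multiplication can be sketched as follows. The values of A1 and A2 are assumptions, chosen so that they reproduce the two results shown above.

```python
import numpy as np

# Assumed matrices, chosen to be consistent with the outputs above.
A1 = np.array([[1, 1],
               [0, 1]])
A2 = np.array([[2, 0],
               [3, 4]])

elementwise = A1 * A2         # multiplies matching entries
dot_product = np.dot(A1, A2)  # rows of A1 times columns of A2; same as A1 @ A2

print(elementwise)
print(dot_product)
```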
Indexing and slicing
If you want to select a particular element of an array, it can be achieved using indexes:
>>> n_array[0,1]
1
The preceding command selects the first row and then the second value in that row. It can
also be seen as the intersection of the first row and the second column of the matrix.
If a range of values has to be selected from a row, then we can use the following command:
>>> n_array[ 0 , 0:3 ]
[0 1 2]
The 0:3 value selects the first three values of the first row.
The whole row of values can be selected with the following command:
>>> n_array[ 0 , : ]
[0 1 2 3]
An entire column of values can be selected with the following command:
>>> n_array[ : , 1 ]
[1 5 9]
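The selections above can be collected into one runnable sketch. It assumes that n_array is the 3 x 4 array holding the values 0 to 11, and adds one extra selection, block, to show that rows and columns can be sliced at the same time.

```python
import numpy as np

n_array = np.arange(12).reshape(3, 4)  # rows: 0-3, 4-7, 8-11

element = n_array[0, 1]        # first row, second column
first_three = n_array[0, 0:3]  # first three values of the first row
row = n_array[0, :]            # the whole first row
column = n_array[:, 1]         # the whole second column
block = n_array[1:, 2:]        # last two rows, last two columns

print(element, first_three, row, column)
print(block)
```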
Shape manipulation
Once the array has been created, we can change the shape of it too. The following command flattens
the array:
>>> n_array.ravel()
[ 0 1 2 3 4 5 6 7 8 9 10 11]
The following command reshapes the array into six rows and two columns. Note that
when reshaping, the new shape must have the same number of elements as the previous one:
>>> n_array.shape = (6, 2)
>>> n_array
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]]
The array can also be transposed:
>>> n_array.transpose()
[[ 0 2 4 6 8 10]
[ 1 3 5 7 9 11]]
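The same manipulations can be sketched with the reshape method, which returns the reshaped array instead of changing n_array itself. The sketch assumes the 3 x 4 array of values 0 to 11 and also shows that a shape with the wrong number of elements is rejected.

```python
import numpy as np

n_array = np.arange(12).reshape(3, 4)

flat = n_array.ravel()             # one dimension, 12 elements
reshaped = n_array.reshape(6, 2)   # six rows, two columns; n_array keeps its shape
transposed = reshaped.transpose()  # two rows, six columns; same as reshaped.T

# 5 x 2 would need 10 elements, but the array has 12.
try:
    n_array.reshape(5, 2)
    reshape_error = None
except ValueError as exc:
    reshape_error = exc

print(flat.shape, reshaped.shape, transposed.shape)
print(reshape_error)
```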
pandas
The pandas library was developed by Wes McKinney when he was working at AQR Capital
Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial
data. Later, Chang She joined him and helped develop the package further. The pandas library is an
open source Python library, specially designed for data analysis. It is built on NumPy and
makes it easy to handle data. NumPy is a fairly low-level tool that handles matrices really well. The
pandas library brings the richness of R into the world of Python to handle data. It has efficient data
structures to process data, perform fast joins, and read data from various sources, to name a few.
The pandas library essentially has three data structures:
1. Series
2. DataFrame
3. Panel
Series
A Series is a one-dimensional array that can hold any type of data, such as integers, floats, strings,
and Python objects. A Series can be created by calling the following:
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))
0 0.733810
1 -1.274658
2 -1.602298
3 0.460944
4 -0.632756
dtype: float64
The random.randn function is part of the NumPy package and generates random numbers drawn from
the standard normal distribution. The Series constructor creates a pandas Series that consists of an
index, shown in the first column, and the values, shown in the second column. At the bottom of the
output is the datatype of the Series.
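The index of the Series can be customized by calling the following:
>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a -0.929494
b -0.571423
c -1.197866
d 0.081107
e -0.035091
dtype: float64
A Series can also be created from a Python dict:
>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)
A 10
B 20
C 30
dtype: int64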
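The three ways of building a Series can be sketched side by side. The sketch uses a seeded NumPy generator instead of random.randn so that repeated runs give the same values; the labels and dict contents are only illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so that runs are repeatable
values = rng.standard_normal(5)

s_default = pd.Series(values)                                    # index 0 to 4
s_labeled = pd.Series(values, index=["a", "b", "c", "d", "e"])   # custom index
s_from_dict = pd.Series({"A": 10, "B": 20, "C": 30})             # keys become the index

# Values are looked up by their index label.
print(s_labeled["c"])
print(s_from_dict["B"])
```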
DataFrame
DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a
table. A DataFrame can be formed from the following data structures:
• A NumPy array
• Lists
• Dicts
• Series
• A 2D NumPy array
A DataFrame can be created from a dict of series by calling the following commands:
>>> d = {'c1': pd.Series(['A', 'B', 'C']), 'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df
c1 c2
0 A 1
1 B 2
2 C 3
3 NaN 4
A DataFrame can also be created from a dict of lists:
>>> d = {'c1': ['A', 'B', 'C', 'D'], 'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
c1 c2
0 A 1
1 B 2
2 C 3
3 D 4
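The difference between the two constructions is worth a short sketch. With a dict of Series, pandas aligns the columns on their index and fills the gaps with NaN; with a dict of lists, every list must have the same length. The variable names below are illustrative.

```python
import pandas as pd

# Series of unequal length are aligned on their index; gaps become NaN.
from_series = pd.DataFrame({
    "c1": pd.Series(["A", "B", "C"]),
    "c2": pd.Series([1, 2.0, 3.0, 4.0]),
})

# Lists must all have the same length.
from_lists = pd.DataFrame({
    "c1": ["A", "B", "C", "D"],
    "c2": [1, 2.0, 3.0, 4.0],
})

# Lists of unequal length cannot be aligned, so pandas raises a ValueError.
try:
    pd.DataFrame({"c1": ["A", "B", "C"], "c2": [1, 2.0, 3.0, 4.0]})
    length_error = None
except ValueError as exc:
    length_error = exc

print(from_series)
print(from_lists)
print(length_error)
```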
Panel
A Panel is a data structure that handles 3D data. Panel was deprecated in pandas 0.20 and removed in
pandas 0.25, so the following example of panel data runs only on older releases:
>>> pd.Panel(d)
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
The preceding output describes a Panel that holds two DataFrames, represented by two items. There are
four rows, represented by the major axis, and three columns, represented by the minor axis.
Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library
makes it convenient to read data from these formats or to export to these formats. We'll use a dataset
that contains the weight statistics of school students from the U.S.