RAW Data

START FROM TODAY - 1

Raw data comes in many forms and sizes. There is a lot of information that can be extracted from
this raw data.

To give an example, Amazon collects clickstream data that records every click a user makes on the
website. This data can be used to understand whether a user is a price-sensitive customer or prefers
more popularly rated products. You must have noticed the recommended products on Amazon; they are
derived using such data. The first step towards such an analysis is to parse the raw data. Parsing
the data involves the following steps:

1. Extracting data from the source:

Data can come in many forms, such as Excel, CSV, JSON, databases, and so on. Python makes it very
easy to read data from these sources with the help of some useful packages.

2. Cleaning the data:

Once a sanity check has been done, one needs to clean the data appropriately so that it can be utilized
for analysis. You may have a dataset about students of a class and details about their height, weight,
and marks. There may also be certain rows with the height or weight missing. Depending on the
analysis being performed, these rows with missing values can either be ignored or replaced with the
average height or weight.
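The two cleaning options just described can be sketched with pandas as follows. This is a minimal illustration, not a prescribed method: the students table, its column names, and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical student records; NaN marks a missing measurement.
students = pd.DataFrame({
    "height_cm": [150.0, np.nan, 160.0, 170.0],
    "weight_kg": [45.0, 50.0, np.nan, 55.0],
    "marks": [82, 91, 77, 68],
})

# Option 1: ignore every row that has at least one missing value.
dropped = students.dropna()

# Option 2: replace each missing value with the average of its column.
filled = students.fillna(students.mean())

print(dropped)
print(filled)
```

Which option is appropriate depends on the analysis: dropping rows loses the marks of those students, while filling with the average keeps them at the cost of slightly flattening the height and weight distributions.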

This chapter covers the following topics:

• Exploring arrays with NumPy

• Handling data with pandas

• Reading and writing data from various formats

• Handling missing data

• Manipulating data

Arrays with NumPy

Python comes with built-in data structures, such as the list, that can be used for array
operations, but a Python list on its own is not suitable for heavy mathematical operations, as it
is not optimized for them. NumPy is a Python package created by Travis Oliphant for scientific
computing. It provides large multidimensional arrays and matrices, along with a large library of
high-level mathematical functions that operate on these arrays. A NumPy array requires much less
memory than a Python list to store the same amount of data, which also makes reading from and
writing to the array faster.
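The memory claim can be checked roughly with the following sketch. It assumes CPython, where a list holds references to separate integer objects; the exact byte counts vary by Python version, so only the comparison matters.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list stores references to separate int objects, so count the list
# container plus every int object it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A NumPy array stores the raw 8-byte integers in one contiguous buffer.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```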

Creating an array

A list of numbers can be passed to the following array function to create a NumPy array object:
>>> import numpy as np

>>> n_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])

A NumPy array object has a number of attributes, which help in giving information about the array.
Here are its important attributes:

• ndim: This gives the number of dimensions of the array. The following shows that the array that we
defined had two dimensions:

>>> n_array.ndim

2

n_array has a rank of 2, which means it is a 2D array.

• shape: This gives the size of each dimension of the array:

>>> n_array.shape

(3, 4)

The first dimension of n_array has a size of 3 and the second dimension has a size of 4. This can
also be visualized as three rows and four columns.

• size: This gives the number of elements:

>>> n_array.size

12

The total number of elements in n_array is 12.

• dtype: This gives the datatype of the elements in the array:

>>> n_array.dtype.name

int64

The numbers are stored as int64 in n_array. The default integer type depends on the platform, so
some systems report int32 instead.
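The attributes above can be collected in one place, as in the following sketch. Passing dtype explicitly is an addition made here so that the result does not depend on the platform's default integer type.

```python
import numpy as np

# Request int64 explicitly instead of relying on the platform default.
n_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], dtype=np.int64)

info = (n_array.ndim, n_array.shape, n_array.size, n_array.dtype.name)
print(info)

# astype returns a converted copy; n_array itself keeps its dtype.
as_float = n_array.astype(np.float32)
print(as_float.dtype.name, as_float.itemsize)
```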

Mathematical operations

When you have an array of data, you would like to perform certain mathematical operations on it. We
will now discuss a few of the important ones in the following sections.

Array subtraction

The following commands subtract the b array from the a array to get the resultant c array. The
subtraction happens element by element:

>>> a = np.array([11, 12, 13, 14])

>>> b = np.array([1, 2, 3, 4])

>>> c = a - b

>>> c

array([10, 10, 10, 10])

Do note that when you subtract two arrays, they should have the same shape, or shapes that NumPy
can broadcast to a common shape.
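The shape requirement can be illustrated with a short sketch. It shows a same-shape subtraction, a scalar that NumPy stretches across every element, and the ValueError that NumPy raises when the shapes cannot be matched.

```python
import numpy as np

a = np.array([11, 12, 13, 14])
b = np.array([1, 2, 3, 4])

# Same shape: the subtraction happens element by element.
same_shape = a - b

# A scalar is applied to every element of the array.
minus_ten = a - 10

# Shapes (4,) and (3,) cannot be matched, so NumPy raises ValueError.
try:
    a - np.array([1, 2, 3])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(same_shape, minus_ten, mismatch_raised)
```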

Squaring an array

The following command raises each element to the power of 2 to obtain this result:

>>> b**2

[1 4 9 16]

A trigonometric function performed on the array

The following command applies cosine to each of the values in the b array to obtain the following
result:

>>> np.cos(b)

[ 0.54030231 -0.41614684 -0.9899925 -0.65364362]

Conditional operations

The following command will apply a conditional operation to each of the elements of the b array, in
order to generate the respective Boolean values:

>>> b < 2

[ True False False False]
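A Boolean array like this is most often used as a mask. The following sketch uses a different condition (b > 2, chosen only for illustration) to select and count the matching elements.

```python
import numpy as np

b = np.array([1, 2, 3, 4])

# The comparison is evaluated for every element.
mask = b > 2

# Indexing with the mask keeps only the elements where it is True.
selected = b[mask]

# True counts as 1, so the sum is the number of matching elements.
count = int(mask.sum())

print(mask, selected, count)
```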

Matrix multiplication

Two matrices can be multiplied element by element or as a matrix (dot) product. The following
commands will perform the element-by-element multiplication:

>>> A1 = np.array([[1, 1], [0, 1]])

>>> A2 = np.array([[2, 0], [3, 4]])

>>> A1 * A2

[[2 0]

[0 4]]

The dot product can be performed with the following command:

>>> np.dot(A1, A2)

[[5 4]

[3 4]]
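For 2D arrays, the @ operator (available since Python 3.5) gives the same matrix product as np.dot. The sketch below places it next to the element-by-element product and also shows that the order of the operands matters.

```python
import numpy as np

A1 = np.array([[1, 1], [0, 1]])
A2 = np.array([[2, 0], [3, 4]])

# Element by element: each entry is multiplied by its counterpart.
elementwise = A1 * A2

# Matrix product: rows of A1 combined with columns of A2.
matrix_product = A1 @ A2

# Matrix multiplication is not commutative.
reversed_product = A2 @ A1

print(elementwise)
print(matrix_product)
print(reversed_product)
```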
Indexing and slicing

If you want to select a particular element of an array, it can be achieved using indexes:

>>> n_array[0, 1]

1

The preceding command selects the first inner array and then the second value in it. It can also be
seen as the intersection of the first row and the second column of the matrix.

If a range of values has to be selected on a row, then we can use the following command:

>>> n_array[ 0 , 0:3 ]

[0 1 2]

The 0:3 value selects the first three values of the first row.

The whole row of values can be selected with the following command:

>>> n_array[ 0 , : ]

[0 1 2 3]

An entire column of values can be selected with the following command:

>>> n_array[ : , 1 ]

[1 5 9]
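Row and column slices can be combined, as the following sketch shows. It also illustrates a point the commands above do not cover: a basic slice is a view of the original data, so writing through it changes the array it was taken from. The scratch copy is used here only to keep n_array intact.

```python
import numpy as np

n_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])

# Rows 0-1 and columns 1-2: a 2x2 block.
block = n_array[0:2, 1:3]

# Negative indexes count from the end: this is the last row.
last_row = n_array[-1, :]

# A step of 2 picks every other column.
every_other_col = n_array[:, ::2]

# A basic slice is a view, so assigning through it modifies the source.
scratch = n_array.copy()
first_col = scratch[:, 0]
first_col[:] = -1

print(block)
print(last_row)
print(every_other_col)
print(scratch)
```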

Shape manipulation

Once the array has been created, we can change its shape too. The following command flattens
the array:

>>> n_array.ravel()

[ 0 1 2 3 4 5 6 7 8 9 10 11]

The following command reshapes the array into six rows and two columns. Note that when reshaping,
the new shape must have the same number of elements as the previous one:

>>> n_array.shape = (6,2)

>>> n_array

[[ 0 1]

[ 2 3]

[ 4 5]

[ 6 7]

[ 8 9]

[10 11]]

The array can be transposed too:

>>> n_array.transpose()

[[ 0 2 4 6 8 10]

[ 1 3 5 7 9 11]]
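Assigning to shape changes the array in place. As an alternative sketch, the reshape method returns a reshaped result and leaves the original array untouched. The example rebuilds the 3x4 array with np.arange so that it runs on its own.

```python
import numpy as np

# Same values as the original 3x4 n_array.
n_array = np.arange(12).reshape(3, 4)

# reshape returns a new array object; n_array keeps its (3, 4) shape.
reshaped = n_array.reshape(6, 2)

# -1 asks NumPy to infer that dimension from the number of elements.
inferred = n_array.reshape(-1, 3)

# .T is shorthand for transpose().
transposed = reshaped.T

# A shape with a different number of elements is rejected.
try:
    n_array.reshape(5, 2)
    bad_shape_raised = False
except ValueError:
    bad_shape_raised = True

print(n_array.shape, reshaped.shape, inferred.shape, transposed.shape)
```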

Empowering data analysis with pandas

The pandas library was developed by Wes McKinney when he was working at AQR Capital
Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial
data. Later, Chang She joined him and helped develop the package further. The pandas library is an
open source Python library, specially designed for data analysis. It is built on NumPy and
makes it easy to handle data. NumPy is a fairly low-level tool that handles matrices really well. The
pandas library brings the richness of R to the world of Python for handling data. It has efficient
data structures to process data, perform fast joins, and read data from various sources, to name a few.

The data structure of pandas

The pandas library essentially has three data structures:

1. Series

2. DataFrame

3. Panel

Series

Series is a one-dimensional array, which can hold any type of data, such as integers, floats, strings,
and Python objects too. A series can be created by calling the following:

>>> import pandas as pd

>>> pd.Series(np.random.randn(5))

0 0.733810

1 -1.274658

2 -1.602298

3 0.460944

4 -0.632756

dtype: float64

The random.randn function is part of the NumPy package and generates random numbers drawn from the
standard normal distribution. The Series constructor creates a pandas series whose first column is
the index and whose second column holds the random values. At the bottom of the output is the
datatype of the series. The index of the series can be customized by calling the following:

>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

a -0.929494

b -0.571423

c -1.197866

d 0.081107

e -0.035091

dtype: float64

A series can be derived from a Python dict too:

>>> d = {'A': 10, 'B': 20, 'C': 30}

>>> pd.Series(d)

A 10

B 20

C 30

dtype: int64
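Once a series has an index, its values can be reached by label, and arithmetic and comparisons apply to every element, much like a NumPy array. The following is a small sketch using the same dict as above.

```python
import pandas as pd

s = pd.Series({'A': 10, 'B': 20, 'C': 30})

# Access a single value by its index label.
by_label = s['B']

# A list of labels returns a smaller series.
subset = s[['A', 'C']]

# Arithmetic is applied element by element and keeps the index.
doubled = s * 2

# A Boolean condition filters the series.
above_15 = s[s > 15]

print(by_label)
print(subset)
print(doubled)
print(above_15)
```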

DataFrame

DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a
table. A DataFrame can be formed from the following data structures:

• A 1D NumPy array

• Lists

• Dicts

• Series

• A 2D NumPy array

A DataFrame can be created from a dict of series by calling the following commands:

>>> d = {'c1': pd.Series(['A', 'B', 'C']), 'c2': pd.Series([1, 2., 3., 4.])}

>>> df = pd.DataFrame(d)

>>> df

    c1   c2

0    A  1.0

1    B  2.0

2    C  3.0

3  NaN  4.0

The DataFrame can be created using a dict of lists too:

>>> d = {'c1': ['A', 'B', 'C', 'D'], 'c2': [1, 2.0, 3.0, 4.0]}

>>> df = pd.DataFrame(d)

>>> print(df)

  c1   c2

0  A  1.0

1  B  2.0

2  C  3.0

3  D  4.0
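A 2D NumPy array is also listed above as a source for a DataFrame. The following sketch shows that case; the row and column labels are made up for the example.

```python
import numpy as np
import pandas as pd

data = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])

# Each row of the array becomes a row of the table; the labels are illustrative.
df = pd.DataFrame(data, columns=['c1', 'c2', 'c3', 'c4'], index=['r1', 'r2', 'r3'])

# Selecting one column returns a Series.
col = df['c2']

# .loc selects by row label and column label.
cell = df.loc['r2', 'c3']

print(df)
print(col)
print(cell)
```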

Panel

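A Panel is a data structure that handles 3D data. Panel was deprecated in pandas 0.20 and removed in
pandas 0.25, so the commands in this section run only on older versions. The following command is an
example of panel data:

>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)), 'Item2': pd.DataFrame(np.random.randn(4, 2))}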

>>> pd.Panel(d)

Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)

Items axis: Item1 to Item2

Major_axis axis: 0 to 3

Minor_axis axis: 0 to 2

The preceding output shows two items, one for each DataFrame, with four rows along the major axis
and three columns along the minor axis. Item2 has only two columns, so its third column is filled
with NaN.
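On pandas versions that no longer include Panel, similar 3D data can be held in a DataFrame with a two-level index. The following is one possible sketch, not the approach used in this chapter; the level names item and row are chosen here for illustration.

```python
import numpy as np
import pandas as pd

items = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2)),
}

# Stack the DataFrames; the dict keys become the outer index level.
stacked = pd.concat(items, names=['item', 'row'])

# Selecting an outer label returns that item's rows as a DataFrame.
item2 = stacked.loc['Item2']

print(stacked.shape)
print(item2)
```

As with the Panel, the columns are aligned across items, so the third column of Item2 contains only NaN.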

Inserting and exporting data

Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library
makes it convenient to read data from these formats or to export to these formats. We'll use a dataset
that contains the weight statistics of school students from the U.S.
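As a general sketch of reading and exporting, the following reads CSV text with read_csv and writes it back as TSV with to_csv. The rows and column names are invented stand-ins and are not the dataset used in this chapter; io.StringIO takes the place of files on disk so the example runs without any downloads.

```python
import io
import pandas as pd

# Invented sample standing in for a CSV file on disk.
csv_text = "name,grade,weight_kg\nAsha,6,38.5\nBen,7,42.0\nCarla,6,40.2\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Export as tab-separated values, leaving out the row index.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()

print(df)
print(tsv_text)
```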

We'll be using a file with the following structure:
