data science

The document provides a comprehensive guide on using Python for data science, including installation commands for essential packages like ipykernel, numpy, pandas, and matplotlib. It covers topics such as data acquisition, exploration, and ethical considerations in data science, alongside practical activities like personality prediction and dataset creation. Additionally, it explains key concepts in data science, data formats, and the functionalities of various Python libraries for data manipulation and visualization.

Uploaded by

Kiran Jasdev

Advanced Python

Commands
Command to install ipykernel with Jupyter Notebook
>>> conda install ipykernel nb_conda jupyter

To launch Jupyter Notebook
>>> jupyter notebook

Creating a virtual environment
>>> conda create -n env python=3.7

Activating the virtual environment
>>> conda activate env

Installing packages
>>> conda install numpy
>>> conda install pandas
>>> conda install matplotlib
Data Sciences
Topics to be covered
Introduction to Data Science
Applications of Data Science
Python for Data Sciences
Hands-on: Statistical Learning & Data Visualisation
Activity: Personality Prediction
Understanding K-nearest neighbour model
Ethical issues around Data Science
What is Data?
 Data can be defined as a representation of facts or instructions about some entity (students, school, sports, business, animals, etc.) that can be processed or communicated by humans or machines.
 Data is a collection of facts, such as numbers, words, pictures, audio clips, videos, maps, measurements, observations or even just descriptions of things. Data may be represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).
Humans are social animals. We tend to organise and participate in various kinds of social gatherings all the time. We love eating out with friends and family, which is why restaurants can be found almost everywhere, and many of them arrange buffets to offer a variety of food items to their customers. Be it a small shop or a big outlet, every restaurant prepares food in bulk, expecting a good crowd to come and enjoy it. But in most cases, a lot of food is left over at the end of the day, and it becomes unusable for the restaurant, as they do not wish to serve stale food to their customers the next day. So, every day, they prepare food in large quantities, keeping in mind the probable number of customers walking into their outlet. If those expectations are not met, a good amount of food gets wasted, which eventually becomes a loss for the restaurant, as they either have to dump it or give it away for free. Taken over a year, this daily loss adds up to quite a big amount.
Problem Scoping
Problem Statement Template
Data Acquisition
Data Exploration
Data Science
 It is a concept that unifies statistics, data analysis, machine learning and their related methods in order to understand and analyse actual phenomena with data.
 It works around analysing data, and when it comes to AI, this analysis helps in making the machine intelligent enough to perform tasks by itself.
Explain how each of these fields uses Data Science.
Data Sciences
Activity
Let us create a students’ dataset for your class (the one given below is a sample; you can create one of your own).
 Does this dataset tell you a story?
 Do you think it mirrors an association between marks obtained and attendance?
 Can you extract 5 observations from this dataset?
Data Collection
While accessing data from any of the data sources, following points should be
kept in mind:
1. Data which is available for public usage only should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone’s privacy to collect data.
4. Data should only be taken from reliable sources as the data collected from
random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of data which helps in proper
training of the AI model.
Sources of Data
Types of Data Formats

1. CSV: CSV stands for Comma Separated Values.
2. Spreadsheet
3. SQL: SQL is a programming language, also known as Structured Query Language.
Python Packages
Introduction to Lists
Practical and demonstration
NumPy
NumPy stands for ‘Numerical Python’. It is a package for data analysis and scientific computing with Python.
It is a commonly used package for working with numbers.
NumPy provides a wide range of arithmetic operations on numbers, giving us an easier approach to working with them.
NumPy also works with arrays (homogeneous collections of data).
In NumPy, the arrays used are known as ND-arrays (N-dimensional arrays), as NumPy comes with a feature for creating n-dimensional arrays in Python.
Creation of NumPy Arrays from a List
import numpy as np
# NumPy's array() function converts a given list into an array.
# For example, create an array called array1 from the given list.
array1 = np.array([10, 20, 30])
# Display the contents of the array
print(array1)
# Output: [10 20 30]
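Building on the array above, the following sketch (with made-up example values) shows the kind of element-wise arithmetic NumPy arrays support, and how a 2-dimensional ND-array is created from a nested list:

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

# Arithmetic on arrays is element-wise
print(a + b)   # [11 22 33]
print(a * b)   # [10 40 90]
print(a * 2)   # a scalar is applied to every element: [20 40 60]

# A 2-dimensional (N-dimensional) array from a nested list
matrix = np.array([[1, 2], [3, 4]])
print(matrix.shape)  # (2, 2)
print(matrix.ndim)   # 2
```

Note that `+` and `*` operate element by element, unlike Python lists, where `+` concatenates.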
Pandas (panel data)
Pandas is a software library written for data manipulation and analysis.
It offers data structures and operations for manipulating numerical tables and time series.
Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational/statistical data set. The data actually need not be labelled at all to be placed into a Pandas data structure.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN)
• Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a
set of labels, or the user can simply ignore the labels and let Series,
DataFrame, etc. automatically align the data for you in computations
• Intelligent label-based slicing, fancy indexing, and subsetting of large data
sets
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
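Two of the features listed above, NaN handling and automatic data alignment, can be seen in this small sketch (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# A DataFrame with heterogeneously-typed columns and one missing value
df = pd.DataFrame({
    "Name": ["Amit", "Bina", "Chen"],
    "Marks": [85, np.nan, 72],
})
print(df["Marks"].isna().sum())  # 1 missing value, represented as NaN

# Automatic alignment: Series arithmetic matches labels, not positions
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
print(s1 + s2)  # "b" aligns to 12.0; "a" and "c" have no partner, so NaN
```

Alignment on labels means you rarely have to reorder data by hand before combining two Series or DataFrames.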
Matplotlib
Matplotlib is an amazing visualisation library in Python for 2D plots of arrays.
Matplotlib is a multi-platform data visualisation library built on NumPy arrays.
Matplotlib comes with a wide variety of plots.
Plots help to understand trends and patterns, and to make correlations.
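A minimal plotting sketch, using made-up attendance and marks values (in the spirit of the students’ dataset activity above) and saving the figure to a file:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Made-up sample values, for illustration only
attendance = [60, 70, 80, 90, 100]
marks = [45, 55, 62, 75, 88]

plt.plot(attendance, marks, marker="o")  # line plot with point markers
plt.xlabel("Attendance (%)")
plt.ylabel("Marks")
plt.title("Marks vs Attendance")
plt.savefig("marks_vs_attendance.png")   # write the plot to an image file
```

In a Jupyter notebook you would typically call `plt.show()` instead of `savefig()` to display the plot inline.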
Package Installation
conda install numpy / pip install numpy
conda install pandas / pip install pandas
conda install matplotlib / pip install matplotlib
Functions performed on a NumPy array
arr = np.array([1, 2, 3, 4, 5])
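The original slide lists only the array itself; as a sketch, here are some commonly used functions one might apply to such an array (the exact set covered in the course may differ):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr.sum())    # sum of all elements: 15
print(arr.mean())   # arithmetic mean: 3.0
print(arr.min())    # smallest element: 1
print(arr.max())    # largest element: 5
print(np.sort(np.array([3, 1, 2])))  # sorted copy: [1 2 3]
print(arr.reshape(5, 1).shape)       # reshaped to 5 rows, 1 column: (5, 1)
```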
DataFrame: Reading a CSV file
import pandas as pd
df = pd.read_csv(r"path/filename.csv")
print(df)
print(df.head())     # display top five rows
print(df.head(10))   # display top ten rows
print(df.tail(10))   # display bottom ten rows
print(df.dtypes)     # display the data type of each column
dataframe
# To remove all rows with null or empty values
newdf1 = df.dropna()
# To make the changes in the original dataframe
df.dropna(inplace=True)
# To fill null values with some value
newdf2 = df.fillna(newValue)
# To fill nulls only in a specified column (returns that column as a Series)
newdf3 = df["ColumnName"].fillna(newValue)
dataframe
# Calculating the mean
mn = df["ColumnName"].mean()
# To fill the null values with the mean value
newdf4 = df["ColumnName"].fillna(mn)
# Calculating the median
med = df["ColumnName"].median()
# Calculating the mode
md = df["ColumnName"].mode()
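The mean/median/fillna steps above can be sketched end to end on a small, made-up column (`Marks` is a hypothetical column name):

```python
import pandas as pd
import numpy as np

# A hypothetical marks column with one missing entry
df = pd.DataFrame({"Marks": [80.0, np.nan, 60.0, 100.0]})

mn = df["Marks"].mean()          # NaN is skipped: (80 + 60 + 100) / 3 = 80.0
filled = df["Marks"].fillna(mn)  # the missing entry becomes 80.0

print(mn)                    # 80.0
print(filled.tolist())       # [80.0, 80.0, 60.0, 100.0]
print(df["Marks"].median())  # middle value of 60, 80, 100: 80.0
```

Filling with the mean (or median, for skewed data) is a common simple strategy for handling missing values before training a model.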
dataframe
# Calculating the sum
s = df.sum(axis=0, skipna=True)  # column-wise
df.sum(axis=1, skipna=True)      # row-wise
# To find the minimum value
df.min(axis=0)
df.min(axis=1)
# Calculating the maximum
df.max(axis=0)
df.max(axis=1)
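To make the `axis` argument concrete, here is a small sketch on a made-up two-column DataFrame showing how `axis=0` works down the columns and `axis=1` works across the rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

print(df.sum(axis=0))  # column-wise totals: A=6, B=15
print(df.sum(axis=1))  # row-wise totals: 5, 7, 9
print(df.min(axis=0))  # smallest value in each column: A=1, B=4
print(df.max(axis=1))  # largest value in each row: 4, 5, 6
```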
