The document provides a comprehensive guide on using Python for data science, including installation commands for essential packages like ipykernel, numpy, pandas, and matplotlib. It covers topics such as data acquisition, exploration, and ethical considerations in data science, alongside practical activities like personality prediction and dataset creation. Additionally, it explains key concepts in data science, data formats, and the functionalities of various Python libraries for data manipulation and visualization.
Commands

Command to install ipykernel with Jupyter Notebook:
>>> conda install ipykernel nb_conda jupyter

To launch Jupyter Notebook:
>>> jupyter notebook

Creating a virtual environment:
>>> conda create -n env python=3.7

Activating the virtual environment:
>>> conda activate env

Installing packages:
>>> conda install numpy
>>> conda install pandas
>>> conda install matplotlib

Data Sciences

Topics to be covered:
• Introduction to Data Science
• Applications of Data Science
• Python for Data Sciences
• Hands-on: Statistical Learning & Data Visualisation
• Activity: Personality Prediction
• Understanding the K-nearest neighbour model
• Ethical issues around Data Science

What is Data?
Data can be defined as a representation of facts or instructions about some entity (students, school, sports, business, animals, etc.) that can be processed or communicated by humans or machines. Data is a collection of facts, such as numbers, words, pictures, audio clips, videos, maps, measurements, observations or even just descriptions of things. Data may be represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).

Humans are social animals. We tend to organise and/or participate in various kinds of social gatherings all the time. We love eating out with friends and family, which is why we can find restaurants almost everywhere, and many of these restaurants arrange buffets to offer a variety of food items to their customers. Be it small shops or big outlets, every restaurant prepares food in bulk as they expect a good crowd to come and enjoy their food. But in most cases, after the day ends, a lot of food is left over, which becomes unusable for the restaurant as they do not wish to serve stale food to their customers the next day. So, every day, they prepare food in large quantities keeping in mind the probable number of customers walking into their outlet. But if the expectations are not met, a good amount of food gets wasted, which eventually becomes a loss for the restaurant as they either have to dump it or give it to hungry people for free. And if this daily loss is taken into account for a year, it becomes quite a big amount.
Problem Scoping
Problem Statement Template
Data Acquisition
Data Exploration

Data Science
Data Science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyse actual phenomena with data. It works around analysing the data, and when it comes to AI, the analysis helps in making the machine intelligent enough to perform tasks by itself. Explain how each of these fields uses Data Science.

Activity
Let us create a students' dataset for your class (the one given below is a sample; you can create one of your own).
• Does this dataset tell you a story?
• Do you think it mirrors an association between marks obtained and attendance?
• Can you extract 5 observations from this dataset?

Data Collection
While accessing data from any of the data sources, the following points should be kept in mind:
1. Only data which is available for public usage should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone's privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of the data, which helps in proper training of the AI model.

Sources of Data

Types of Data Formats
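As a rough sketch of the dataset activity above, a small students' dataset can be built with pandas. The column names and all the values here are invented purely for illustration; your class dataset will look different:

```python
import pandas as pd

# Hypothetical students' dataset: names, attendance and marks are made up
data = {
    "Name": ["Asha", "Ravi", "Meena", "Karan", "Divya"],
    "Attendance": [95, 60, 88, 45, 76],   # percentage of classes attended
    "Marks": [89, 55, 80, 40, 70],        # marks obtained out of 100
}
df = pd.DataFrame(data)
print(df)

# One quick observation: does attendance move together with marks?
print("Correlation:", df["Attendance"].corr(df["Marks"]))
```

A correlation close to 1 would suggest the dataset mirrors an association between marks obtained and attendance, which is exactly the kind of story the activity asks you to look for.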
1. CSV: CSV stands for Comma Separated Values. It is a simple text file format in which each line is a data record and the values within a record are separated by commas.
2. Spreadsheet
3. SQL: SQL, also known as Structured Query Language, is a programming language used to store, manipulate and retrieve data in relational databases.

Python Packages
Introduction to Lists
Practical and demonstration

NumPy
NumPy stands for 'Numerical Python'. It is a package for data analysis and scientific computing with Python, and is commonly used for working with numbers. NumPy provides a wide range of arithmetic operations on numbers, giving us an easier approach to working with them. NumPy also works with arrays (homogeneous collections of data). In NumPy, the arrays used are known as ND-arrays (N-Dimensional Arrays), as NumPy comes with the feature of creating n-dimensional arrays in Python.

Creation of a NumPy Array from a List
import numpy as np
# The NumPy array() function converts a given list into an array.
# For example, create an array called array1 from the given list.
array1 = np.array([10, 20, 30])
# Display the contents of the array
print(array1)
# output: [10 20 30]

Pandas (Panel Data)
Pandas is a software library written for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series. Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets; the data actually need not be labelled at all to be placed into a pandas data structure

Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN)
• Size mutability: columns can be inserted and deleted from DataFrame and higher-dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc.
automatically align the data for you in computations
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining of data sets
• Flexible reshaping and pivoting of data sets

Matplotlib
Matplotlib is a visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays. Matplotlib comes with a wide variety of plots. Plots help us to understand trends and patterns, and to make correlations.

Package Installation
conda install numpy / pip install numpy
conda install pandas / pip install pandas
conda install matplotlib / pip install matplotlib
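As a minimal sketch of how Matplotlib plots NumPy arrays, the example below draws a simple 2D line plot. The monthly sales figures are invented for illustration, and the plot is saved to a file rather than shown in a window:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without needing a display window
import matplotlib.pyplot as plt

# Invented monthly sales figures, just to illustrate a line plot of an array
months = np.arange(1, 7)
sales = np.array([12, 15, 11, 18, 21, 19])

plt.plot(months, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales trend")
plt.savefig("sales_trend.png")  # save the plot as an image file
```

In a Jupyter Notebook you would typically call plt.show() instead of plt.savefig() to display the plot inline.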
Functions performed on a NumPy array:
ARR = numpy.array([1,2,3,4,5])

DataFrame: Reading a CSV file
import pandas as pd
df = pd.read_csv(r"path/filename.csv")
print(df)
print(df.head())      # display top five rows
print(df.head(10))    # display top ten rows
print(df.tail(10))    # display bottom ten rows
print(df.dtypes)

DataFrame: Handling missing values
# to remove all rows with null or empty values
newdf1 = df.dropna()
# to make the changes in the original dataframe
df.dropna(inplace = True)
# to fill null values with some value
newdf2 = df.fillna(newValue)
# to fill nulls only in a specified column (returns a Series)
newdf3 = df["ColumnName"].fillna(newValue)

DataFrame: Statistical functions
# calculating mean
mn = df["ColumnName"].mean()
# to fill the null values with the mean value
newdf4 = df["ColumnName"].fillna(mn)
# calculating median
med = df["ColumnName"].median()
# calculating mode
md = df["ColumnName"].mode()

DataFrame: Aggregation
# calculating sum
s = df.sum(axis = 0, skipna = True)   # column-wise
df.sum(axis = 1, skipna = True)       # row-wise
# to find the minimum value
df.min(axis = 0)
df.min(axis = 1)
# calculating the maximum
df.max(axis = 0)
df.max(axis = 1)
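The ARR array above can be explored with a few common NumPy functions. This is a minimal sketch; the functions chosen are typical examples, not an exhaustive list:

```python
import numpy as np

ARR = np.array([1, 2, 3, 4, 5])

print(ARR.sum())    # sum of all elements: 15
print(ARR.mean())   # average value: 3.0
print(ARR.min())    # smallest element: 1
print(ARR.max())    # largest element: 5
print(ARR.shape)    # dimensions of the array: (5,)
print(ARR * 2)      # element-wise arithmetic: [ 2  4  6  8 10]
```

Note that arithmetic like ARR * 2 applies to every element at once, which is the main convenience NumPy arrays offer over plain Python lists.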