
UNIT 1 DATA ACQUISITION

Data Acquisition – Sources of acquiring the data – Internal Systems and External Systems,
Web APIs, Data Preprocessing – Exploratory Data Analysis (EDA) – Basic tools (plots,
graphs and summary statistics) of EDA, Open Data Sources, Data APIs, Web Scraping –
Relational Database access (queries) to process/access data

Introduction

Data

• Raw facts, figures, or numbers collected to be examined or analysed in order to make decisions.

• Should be in a formalized manner suitable for communication, interpretation and processing.

Information

• The result of analysing data

Data versus Information

Types of Data

• Structured – Data which is organized and formatted in a specific way that forms a
well-defined schema or shape, i.e., a proper structure.

• Unstructured – Data in an unorganized form whose context is specific or varying, e.g., e-mail.

• Natural language – A special type of unstructured data; it is challenging to process
because it requires knowledge of specific data science techniques and linguistics.

• Machine-generated – Data that is automatically created by a computer, process,
application, or other machine without human intervention.

• Graph-based – Data that focuses on the relationship or adjacency of objects.

• Audio, video, and images – Data captured and recognized through sound, pictures and
videos.

• Streaming – The data flows into the system when an event happens instead of being
loaded into a data store in a batch.

Data file formats

• Tabular (e.g., .csv, .tsv, .xlsx)

• Non-tabular (e.g., .txt, .rtf, .xml)

• Image (e.g., .png, .jpg, .tif)

• Agnostic (e.g., .dat)

➢ Some file formats are proprietary and can only be opened by software developed by
a particular company.

➢ There are also file formats that store metadata, such as SPSS and STATA files
that contain information on data labels.

1. Data Acquisition

Data Acquisition :

• The process of gathering various data from different relevant sources is referred to
as Data Acquisition

• It translates into the collection of data and ingesting it into a system for further use.

Importance of Data Acquisition :

• It is easier for businesses to analyze the acquired data and formulate corresponding strategies around it.

• Having data in one place makes it easier to detect any discrepancy and solve it faster.

• It also decreases human error and improves data security.


• In the longer run, it proves to be cost-efficient.

• It helps in building Recommendation System

Things to consider when acquiring data are:

• What data is needed to achieve the goal?


• How much data is needed?
• Where and how can this data be found?
• What legal and privacy concerns should be considered?

Data acquisition comprises two steps – Data Harvest and Data Ingestion

Data Harvest :

Data harvest is the process by which a source generates data; it is concerned with what data is
acquired.

Data Ingestion :

• Focuses on bringing the produced data into a given system.

• Data ingestion consists of three stages – discover, connect and synchronize.

Data Acquisition Methods

Data can be obtained from many different sources, such as websites, apps, IoT protocols or
even physical notes, and new data sources pop up literally every day.

Methods of acquiring data :

• Collecting new data.

• Converting and/or transforming legacy data.

• Sharing or exchanging data.

• Purchasing data.

Challenges and Characteristics to be considered for Data Acquisition

Before using these methods for data acquisition, USGS suggests considering certain business
goals and data characteristics.

First, think about the business goal (why is this data required and what will it bring?).

Next, consider the costs, time restrictions and format.

For a specific domain, a heavily regulated industry like banking, or a government-controlled
entity, additional restrictions may also apply – for instance, data standard thresholds or
business rule limitations.
Every data acquisition method comes with additional challenges and characteristics to be
considered. For example, when it comes to transforming legacy data, first assess the legacy
quality. And for purchasing data, all the licensing issues need to be analysed.

Data Acquisition in Machine Learning

“ Data acquisition is the process of sampling signals that measure real-world physical
conditions and converting the resulting samples into digital numeric values that a computer
can manipulate.”

• Collection and Integration of the data: The data is extracted from various sources, and
data from multiple sources may need to be combined based upon the requirement.

• Formatting: Prepare or organize the datasets as per the analysis requirements.

• Labeling: After gathering the data, it may need to be labeled or named.

Data Acquisition Process

The process of data acquisition involves searching for the datasets that can be used to train
the Machine Learning models.

The main segments are :

1. Data Discovery

2. Data Augmentation

3. Data Generation
Data Discovery :

The first approach for acquiring data is Data discovery. It is a key step when indexing,
sharing, and searching for new datasets available on the web and incorporating data
lakes. It can be broken into two steps: Searching and Sharing. Firstly, the data must
be labeled or indexed and published for sharing using many available collaborative
systems for this purpose.

Data Augmentation :

The next approach for data acquisition is Data augmentation. Augment means to
make something greater by adding to it, so here in the context of data acquisition, we
are essentially enriching the existing data by adding more external data. In Deep and
Machine learning, using pre-trained models and embeddings is common to increase
the features to train on.

Data Generation :

As the name suggests, the data is generated. If we do not have enough and any
external data is not available, the option is to generate the datasets manually or
automatically. Crowdsourcing is the standard technique for manual construction of
the data where people are assigned tasks to collect the required data to form the
generated dataset. There are also automatic techniques available to generate
synthetic datasets. The data generation method can also be seen as data augmentation
when data is available but has missing values that need to be imputed.
Sources of acquiring data

Data collection is the process of acquiring, collecting, extracting, and storing voluminous
amounts of data, which may be in structured or unstructured form such as text, video, audio,
XML files, records, or other image files, to be used in later stages of data analysis. In the
process of big data analysis, data collection is the initial step, carried out before starting to
analyze the patterns or useful information in the data. The data to be analyzed must be
collected from different valid sources.
There are four methods of acquiring data:
1. collecting new data;
2. converting/transforming legacy data;
3. sharing/exchanging data;
4. purchasing data.

This includes automated collection (e.g., of sensor-derived data), the manual recording of
empirical observations, and obtaining existing data from other sources.

Data Acquisition Techniques and Tools :

The major tools and techniques for data acquisition are:


1. Data Warehouses and ETL
2. Data Lakes and ELT
3. Cloud Data Warehouse providers
1. Data Warehouses and ETL :
The first option to acquire data is via a data warehouse. Data warehousing is the
process of constructing and using a data warehouse to offer meaningful business insights.

A data warehouse is a centralized repository, which is constructed by combining data from
various heterogeneous sources. It is primarily created and used for data reporting and analysis
rather than transaction processing. The focus of the data warehouse is on the business
processes.

A data warehouse is typically constructed to store structured records having tabular formats.
Employees’ data, sales records, payrolls, student records, and CRM all come under this
bucket. In a data warehouse, usually, we transform the data before loading, and hence, it falls
under the approach of ETL (Extract, Transform and Load).It is defined as the extraction of
data, the transformation of data, and the loading of the data.

The data acquisition is performed by two kinds of ETL (Extract, Transform and Load) tools, these
are:
a) Code-based ETL: These ETL applications are developed using programming languages
such as SQL and PL/SQL (which is a combination of SQL and procedural features of
programming languages). Examples: BASE SAS, SAS ACCESS.
b) Graphical User Interface (GUI)-based ETL: These ETL applications are developed
using a graphical user interface and point-and-click techniques. Examples are
DataStage, Data Manager, Ab Initio, Informatica, ODI (Oracle Data Integrator),
Data Services, and SSIS (SQL Server Integration Services).
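To make the ETL idea concrete, here is a minimal, hypothetical sketch in Python that extracts
records from a CSV file, transforms them with pandas, and loads the result into a SQLite table.
The file name, column names, and table name are illustrative assumptions, not part of any of the
commercial tools named above.

# Minimal ETL sketch: Extract from CSV, Transform with pandas, Load into SQLite
# (file, column, and table names are hypothetical)
import sqlite3
import pandas as pd

# Extract: read raw sales records from a CSV file
sales = pd.read_csv("sales_records.csv")

# Transform: clean column names and add a derived column
sales.columns = [c.strip().lower() for c in sales.columns]
sales["total_price"] = sales["unit_price"] * sales["quantity"]

# Load: write the transformed data into a warehouse-style table
with sqlite3.connect("warehouse.db") as conn:
    sales.to_sql("fact_sales", conn, if_exists="replace", index=False)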
2. Data Lakes and ELT :
A data lake is a storage repository having the capacity to store large amounts of data,
including structured, semi-structured, and unstructured data. It can store images, videos,
audio, sound records, and PDF files, and it enables faster ingestion of new data.

Unlike data warehouses, data lakes store everything, are more flexible, and follow the
Extract, Load, and Transform (ELT) approach. The data is loaded first and not transformed
until the transformation is required; therefore the data is processed later as per the requirements.

Data lakes provide an “unrefined view of data” to data scientists. Open-source tools such as
Hadoop and MapReduce are commonly used with data lakes.

3. Cloud Data Warehouse providers :


A cloud data warehouse is another service that collects, organizes, and stores data. Unlike the
traditional data warehouse, cloud data warehouses are quicker and cheaper to set up as no
physical hardware needs to be procured.

Additionally, these architectures use massively parallel processing (MPP), i.e., they employ a
large number of computer processors (up to 200 or more) to perform a set of coordinated
computations in parallel and can therefore run complex analytical queries much faster.

Some of the prominent cloud data warehouse services are:


• Amazon Redshift

• Snowflake
• Google BigQuery
• IBM Db2 Warehouse
• Microsoft Azure Synapse
• Oracle Autonomous Data Warehouse
• SAP Data Warehouse Cloud
• Yellowbrick Data
• Teradata Integrated Data Warehouse
Primary data : Primary data refers to data originated by the researcher for the first time.

• Conducting research and experiments, surveys and simulations are common methods
for acquiring primary data.

• Web scraping is also a special case of primary data collection, in which data is extracted
or copied directly from a website (see the sketch below).

• Examples : Survey, Interview, Questionnaire, Observations etc.
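As an illustration of web scraping as a primary data collection method, the following is a
minimal sketch using the requests and BeautifulSoup libraries. The URL and the tags being
extracted are placeholders and would need to be adapted to the actual site being scraped (and
to its terms of use).

# Minimal web scraping sketch (URL and tags are placeholders)
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"          # hypothetical page
html = requests.get(url, timeout=10).text     # download the page
soup = BeautifulSoup(html, "html.parser")     # parse the HTML

# Extract the text of every <h2> heading on the page
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)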

Secondary data : Secondary data is already existing data collected earlier by other
investigators, agencies, and organizations.

It can be obtained from many different websites and also through Application Programming
Interfaces (APIs). Some of the most popular repositories include:

– GitHub

– Kaggle

– KDnuggets

– UCI Machine Learning Repository

– US Government’s Open Data

– Five Thirty Eight

– Amazon Web Services

– BuzzFeed
Internal Systems and External Systems

Internal Systems :

• Internal data is generated and used within a company or organization.

• This data is usually produced by the company's operations, such as sales, customer
service, or production.

• Internal data occurs in various formats, such as spreadsheets, databases, or customer
relationship management (CRM) systems.

• Examples of internal data:

Users data – logs, messages, mails etc.,

Internal documents – invoices, contracts, notes etc.,

IoT devices – cameras, sensors etc.,

Logs – website/platform logs

External Systems :

• Data collected from external sources, including customers, partners, competitors, and
industry reports.

• This data can be purchased from third-party providers or gathered from publicly
available sources including market research reports, social media data, government
data etc.,

• Examples of external data :

Web – e-commerce, real estate etc.,

Geo – maps, location, GPS etc.,

Files – invoices, documents, sheets etc.,

Third parties – weather, credit card, telecom etc.,


Web APIs

An Application Programming Interface (API) is a set of defined rules that enable different
applications to communicate with each other.

Basic elements of an API:

• Access: permission to use the data or services

• Request: the actual data or service being asked for

• Response: the data or service returned as a result of the request

Web APIs :

• A Web API is an interface to either a web server or a web browser.

• These APIs are used extensively for the development of web applications.

• These APIs work at either the server end or the client end.

• Companies like Google, Amazon, eBay all provide web-based API.

• Some popular examples of web based API are Twitter REST API, Facebook Graph
API, Amazon S3 REST API, etc.

REST API :

REST stands for REpresentational State Transfer. It is a web architecture with a set
of constraints applied to web service applications. REST APIs provide data in the form of
resources which can be related to objects.
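As a small illustration of how a REST API is consumed in Python, the sketch below sends a GET
request with the requests library and reads the JSON response. The public GitHub REST API is
used here purely as an example endpoint; any public REST API could be substituted.

# Calling a REST API with the requests library
import requests

# GET a resource; the public GitHub REST API is used as an example
response = requests.get("https://api.github.com/users/octocat", timeout=10)

print(response.status_code)      # 200 indicates success
data = response.json()           # parse the JSON body into a Python dict
print(data["name"], data["public_repos"])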

Some of the popular APIs in ML and Data Science

• Google Map API

The Google Maps API is one of the most commonly used APIs. Its applications vary from
integration in a cab service application to the popular Pokemon Go. It supports retrieving
information such as location coordinates, distances between locations, routes, etc.

• Amazon Machine Learning API

This API is used for developers to build applications like fraud detection, targeted
marketing, and demand forecasting. These APIs create machine learning models
by finding the trends in provided data with great accuracy.

• Facebook API

The Facebook API provides an interface to the large amount of data generated every
day. The innumerable posts, comments and shares in various groups and pages
produce massive data. This massive public data provides a large number of
opportunities for analyzing the crowd. It is also incredibly convenient to use
the Facebook Graph API with both R and Python to extract data.

• Twitter API

Just like the Facebook Graph API, Twitter data can be accessed using the Twitter API
as well. We can access data such as the tweets made by any user, tweets
containing a particular term or combination of terms, tweets on a
topic in a particular date range, etc. Twitter data is a great resource for
performing tasks like opinion mining and sentiment analysis.

• IBM Watson API

IBM Watson offers a set of APIs for performing a host of complex tasks such as
tone analysis, document conversion, personality insights, visual recognition, text
to speech, speech to text, etc. using just a few lines of code. This set of APIs
differs from the other APIs discussed so far, as it provides services for
manipulating and deriving insights from the data.

• US Census Bureau API

This API allows making custom queries to the data and embedding them into mobile
or web applications. A census usually contains data on the population and economy
of a nation. This data can be used to make major demographic and economic
statistics more accessible to users.

• Quandl API

Quandl helps to retrieve the time series information of a large number of stocks for
a specified date range. The Quandl API is very easy to set up and provides a
great resource for projects like stock price prediction, stock profiling, etc.
Data Preprocessing

Data preprocessing is an important process of data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis. The motive is
to improve data quality and make it up to mark for specific tasks.

Purpose of data preprocessing

After properly gathering the data, it needs to be explored, or assessed, to spot key trends and
inconsistencies. The main goals of Data preprocessing are:

• Get data overview: understand the data formats and overall structure in which the data
is stored. Also, find the properties of the data such as the mean, median, quartiles and
standard deviation. These details can help identify irregularities in the data (see the
pandas sketch after this list).

• Identify missing data: missing data is common in most real-world datasets. It can
disrupt true data patterns, and even lead to more data loss when entire rows and
columns are removed because of a few missing cells in the dataset.

• Identify outliers or anomalous data: some data points fall far out of the predominant
data patterns. These points are outliers, and might need to be discarded to get
predictions with higher accuracies, unless the primary purpose of the algorithm is to
detect anomalies.

• Remove Inconsistencies: just like missing values, real-world data also has multiple
inconsistencies like incorrect spellings, incorrectly populated columns and rows (eg.:
salary populated in gender column), duplicated data and much more. Sometimes,
these inconsistencies can be treated through automation, but most often they need a
manual check-up.
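A quick way to carry out the first two of these goals with pandas is sketched below; the file
name is a placeholder for whatever dataset is being explored.

# Quick data overview with pandas (file name is a placeholder)
import pandas as pd

df = pd.read_csv("Dataset.csv")

print(df.shape)            # number of rows and columns
print(df.dtypes)           # data format of each column
print(df.describe())       # mean, std, quartiles for numeric columns
print(df.isnull().sum())   # count of missing values per column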

Tasks in Data Preprocessing


Data Cleaning :

Data cleaning helps us remove inaccurate, incomplete and incorrect data from the
dataset. Some techniques used in data cleaning are −

a) Handling missing values : This scenario arises when some data is missing.
Standard values can be used to fill in the missing values manually, but only for
a small dataset. The attribute's mean and median values can be used to replace missing
values for normally and non-normally distributed data respectively. Tuples can be
ignored if the dataset is quite large and many values are missing within a tuple.
The most appropriate value can be used when using regression or decision tree algorithms
(see the pandas sketch after this list).

b) Noisy Data : Noisy data is data that cannot be interpreted by machines and
contains unnecessary, faulty values. Some ways to handle it are −
• Binning − This method handles noisy data to make it smooth. The data is divided
equally and stored in the form of bins, and then smoothing methods are applied within
each bin: smoothing by bin mean (bin values are replaced by the bin mean),
smoothing by bin median (bin values are replaced by the bin median) and
smoothing by bin boundary (the minimum/maximum bin values are taken and each
value is replaced by the closest boundary value).
• Regression − Regression functions are used to smoothen the data. Regression can
be linear (one independent variable) or multiple (multiple independent variables).
• Clustering − It is used for grouping similar data into clusters and for
finding outliers.
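A short pandas sketch of two of the cleaning techniques above, mean imputation of missing
values and smoothing by bin means, is given below; the column name and values are hypothetical.

# Data cleaning sketch: mean imputation and smoothing by bin means (hypothetical data)
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 25, np.nan, 31, 35, 38, np.nan, 47, 52, 60]})

# Handling missing values: replace NaN with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Binning: divide the values into 3 equal-width bins, then smooth by bin mean
df["bin"] = pd.cut(df["age"], bins=3)
df["age_smoothed"] = df.groupby("bin")["age"].transform("mean")
print(df)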

Data Integration :

The process of combining data from multiple sources (databases, spreadsheets, text
files) into a single dataset. A single, consistent view of the data is created in this process. The major
problems during data integration are

Schema integration (integrating the set of data collected from various sources),

Entity identification (identifying the same entities across different databases) and

Detecting and resolving data value conflicts.

Data Transformation :

The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements. There are
some methods in data transformation.

a) Smoothing: With the help of algorithms, we can remove noise from the dataset, which
helps in identifying the important features of the dataset. Through smoothing we can detect
even a small change that helps in prediction.
b) Aggregation: In this method, the data is stored and presented in the form of a
summary; data from multiple sources is integrated and described together for data
analysis. This is an important step since the accuracy of the analysis depends
on the quantity and quality of the data. When the quality and the quantity of the data
are good, the results are more relevant.
c) Discretization: The continuous data is split into intervals. Discretization reduces
the data size. For example, rather than specifying the exact class time, we can set an interval
like (3 pm-5 pm, 6 pm-8 pm).
d) Normalization: It is the method of scaling the data so that it can be represented in a
smaller range, for example from -1.0 to 1.0 (see the sketch after this list).
e) Attribute Selection: To help the mining process, new attributes are derived from the
given attributes.
f) Concept Hierarchy Generation: In this, attributes are changed from a lower level
to a higher level in the hierarchy.
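To illustrate the normalization step above, here is a small sketch of min-max scaling to the
range -1.0 to 1.0; the salary values are hypothetical.

# Min-max normalization of a column to the range [-1.0, 1.0] (hypothetical values)
import numpy as np

salary = np.array([45000.0, 54000.0, 61000.0, 72000.0, 88000.0])

min_v, max_v = salary.min(), salary.max()
scaled = -1.0 + (salary - min_v) * 2.0 / (max_v - min_v)
print(scaled)   # values now lie between -1.0 and 1.0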

Data reduction :

It helps in increasing storage efficiency and reducing data storage to make the analysis easier
by producing almost the same results. Analysis becomes harder while working with huge
amounts of data, so reduction is used to get rid of that.

Steps of data reduction are −

i. Data Compression:
The compressed form of data is called data compression. Data is compressed to make
analysis efficient. This compression can be lossless or lossy. Lossless compression is when
there is no loss of data during compression; lossy compression is when unnecessary
information is removed during compression.

ii. Numerosity Reduction:

There is a reduction in the volume of data, i.e., only a model of the data is stored instead of
the whole data, which provides a smaller representation of the data without any loss of information.

iii. Dimensionality reduction:

This process is necessary for real-world applications as the data size is big. In this
process, the number of random variables or attributes is reduced so that the dimensionality of
the data set comes down, by combining and merging attributes of the data without losing
their original characteristics. This also reduces the storage space and computation
time required.
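As one concrete and commonly used example of dimensionality reduction (not the only one), the
sketch below applies Principal Component Analysis from scikit-learn to a small randomly
generated dataset.

# Dimensionality reduction sketch using PCA from scikit-learn (random demo data)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples with 10 attributes

pca = PCA(n_components=3)             # keep 3 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)    # (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_)     # variance kept by each component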

Data preprocessing in Machine Learning : A practical approach

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

Real-world data generally contains noise and missing values, and may be in an
unusable format which cannot be directly used for machine learning models. Data
preprocessing is the required task of cleaning the data and making it suitable for a machine
learning model, which also increases the accuracy and efficiency of the machine learning
model.
It involves below steps:

▪ Getting the dataset


▪ Importing libraries
▪ Importing datasets
▪ Finding Missing Data
▪ Encoding Categorical Data
▪ Splitting dataset into training and test set
▪ Feature scaling

1. Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The data collected for a particular problem in a
proper format is known as the dataset.

Datasets may be in different formats for different purposes; for example, a dataset for a
business problem will be different from the dataset required for a liver-patient problem. So
each dataset is different from another dataset. To use the dataset in our code, we usually
put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values" files; it is a file format which allows us to
save the tabular data, such as spreadsheets. It is useful for huge datasets and can use these
datasets in programs.

Here we will use a demo dataset for data preprocessing, and for practice, it can be
downloaded from kaggle open source. We can also create our dataset by gathering data
using various API with Python and put that data into a .csv file.

2. Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation
in the code. It is the fundamental package for scientific calculation in Python. It also
supports large, multidimensional arrays and matrices. In Python, we can import
it as:

1. import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import a sub-library pyplot. This library is used to plot any type of
charts in Python for the code. It will be imported as below:

1. import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing the datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library. Consider the below image:

3. Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a working
directory. To set a working directory in Spyder IDE, we need to follow the below steps:

• Save your Python file in the directory which contains dataset.


• Go to File explorer option in Spyder IDE, and select the required directory.
• Click on F5 button or run option to execute the file.

Here, in the below image, we can see the Python file along with required dataset. Now, the
current folder is set as a working directory.
read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is used
to read a csv file and perform various operations on it. Using this function, we can read a
csv file locally as well as through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable used to store our dataset, and inside the
function, we have passed the name of our dataset file. Once we execute the above line
of code, it will successfully import the dataset into our code. We can also check the
imported dataset by clicking on the variable explorer section, and then double
clicking on data_set. Consider the below image:

As in the above image, indexing is started from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.
Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) and the dependent variable in the dataset. In our dataset, there are three
independent variables, Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[ ] method of the Pandas library.
It is used to extract the required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:)
is for all the columns. Here we have used :-1, because we don't want to take the last
column as it contains the dependent variable. So by doing this, we will get the matrix of
features.

By executing the above code, we will get output as:

1. [['India' 38.0 68000.0]


2. ['France' 43.0 45000.0]
3. ['Germany' 30.0 54000.0]
4. ['France' 48.0 65000.0]
5. ['Germany' 40.0 nan]
6. ['India' 35.0 58000.0]
7. ['Germany' nan 53000.0]
8. ['France' 49.0 79000.0]
9. ['India' 50.0 88000.0]
10. ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of
dependent variables.

By executing the above code, we will get output as:


Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

4. Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets. If our
dataset contains some missing data, then it may create a huge problem for our machine
learning model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data: There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is used to commonly deal with null
values. In this way, we just delete the specific row or column which consists of null
values. But this way is not so efficient and removing data may lead to loss of
information which will not give the accurate output.

By calculating the mean: In this way, we will calculate the mean of that column or
row which contains any missing value and will put it on the place of missing value.
This strategy is useful for the features which have numeric data such as age, salary,
year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains
various utilities for building machine learning models. Here we will use the Imputer
class of the sklearn.preprocessing library (in recent scikit-learn versions this class has been
replaced by SimpleImputer in sklearn.impute). Below is the code for it:

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],


['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object
As we can see in the above output, the missing values have been replaced with
the mean of the remaining values in each column.

5. Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model. So it is necessary
to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country values into categorical codes. To
do this, we will use the LabelEncoder() class from the preprocessing library.

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has
successfully encoded the variables into digits.

In our case, the Country variable has three categories, and as we can see in the above output,
they are encoded into 0, 1, and 2. From these values, the machine learning model may
assume that there is some ordering or correlation between these categories, which can produce
wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:

Dummy variables are those variables which have values 0 or 1. The 1 value gives the
presence of that variable in a particular column, and rest variables become 0. With dummy
encoding, we will have a number of columns equal to the number of categories.

In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values.
For Dummy Encoding, we will use OneHotEncoder class of preprocessing library.

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#(note: the categorical_features argument was removed in recent scikit-learn
#versions, where ColumnTransformer is used to select the column instead)
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01, 6.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01, 4.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01, 5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01, 6.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01, 6.52222222e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01, 5.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01, 5.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01, 7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01, 8.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01, 7.70000000e+04]])

As we can see in the above output, the country values are encoded into numbers 0 and 1 and
divided into three columns. It can be seen more clearly in the variable explorer section,
by clicking on the x option as:
For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use the labelencoder_y
object of the LabelEncoder class. Here we are not using the OneHotEncoder class because
the Purchased variable has only two categories, yes or no, which are automatically
encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen as:


6. Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and test
set. This is one of the crucial steps of data preprocessing as by doing this, we can enhance
the performance of our machine learning model.

Suppose we train our machine learning model on one dataset and then test
it on a completely different dataset. This will create difficulties for our model in
understanding the correlations between the variables.

If we train our model very well and its training accuracy is also very high, but we provide
a new dataset to it, then it will decrease the performance. So we always try to make a
machine learning model which performs well with the training set and also with the test
dataset. Here, we can define these datasets as:

Training Set: A subset of dataset to train the machine learning model, and we already
know the output.

Test set: A subset of dataset to test the machine learning model, and by using the
test set, model predicts the output.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2,
random_state=0)

Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for our output that are
o x_train: features for the training data
o x_test: features for testing data
o y_train: Dependent variable for the training data
o y_test: Dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set. The
test_size may be .5, .3, or .2, which gives the dividing ratio of the training and testing
sets.
o The last parameter random_state is used to set a seed for a random generator so
that you always get the same result, and the most used value for this is 42.

Output:

By executing the above code, we will get 4 different variables, which can be seen under
the variable explorer section.
As we can see in the above image, the x and y variables are divided into 4 different
variables with corresponding values.

7. Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to standardize the independent variables of the dataset to a specific range. In
feature scaling, we put our variables in the same range and on the same scale so that no
variable dominates the others.

Consider the below dataset:

As we can see, the age and salary column values are not on the same scale. Many machine
learning models are based on Euclidean distance, and if we do not scale the variables,
this will cause problems in our machine learning model.

The Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as:

d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

If we compute the distance using any two values from age and salary, the salary values will
dominate the age values, and this will produce an incorrect result. So to remove
this issue, we need to perform feature scaling for machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization

Here, we will use the standardization method for our dataset.

For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing
library as:

from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features.
And then we will fit and transform the training dataset.

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

For the test dataset, we will directly apply the transform() function instead of fit_transform(),
because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test as:

x_train:
x_test:

As we can see in the above output, all the variables are scaled to a similar range (here roughly between -1 and 1).
Exploratory Data Analysis

Exploratory Data Analysis is an approach to analyzing data sets to summarize their
main characteristics, often using statistical graphics and other data visualization
methods. EDA assists data science professionals in various ways:

1. Getting a better understanding of data

2. Identifying various data patterns

3. Getting a better understanding of the problem statement

Types of exploratory data analysis


There are four primary types of EDA:

• Univariate non-graphical. This is the simplest form of data analysis, where the data
being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide a full picture of the
data. Graphical methods are therefore required. Common types of univariate
graphics include:
o Stem-and-leaf plots, which show all data values and the
shape of the distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
• Multivariate nongraphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables of the data through cross-tabulation or statistics.
• Multivariate graphical: Multivariate data uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.

Other common types of multivariate graphics include:

• Scatter plot, which is used to plot data points on a horizontal and a vertical axis
to show how much one variable is affected by another.
• Multivariate chart, which is a graphical representation of the relationships
between factors and a response.
• Run chart, which is a line graph of data plotted over time.
• Bubble chart, which is a data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
• Heat map, which is a graphical representation of data where values are depicted
by color.
Exploratory Data Analysis Tools

Some of the most common data science tools used to create an EDA include:

• Python: An interpreted, object-oriented programming language with dynamic


semantics. Its high-level, built-in data structures, combined with dynamic typing
and dynamic binding, make it very attractive for rapid application development, as well
as for use as a scripting or glue language to connect existing components together.
Python and EDA can be used together to identify missing values in a data set, which is
important so you can decide how to handle missing values for machine learning.
• R: An open-source programming language and free software environment for statistical
computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians in data science in developing
statistical observations and data analysis.

Line Plot: a type of plot which displays information as a series of data points called
“markers” connected by straight lines. In this type of plot, we need the measurement
points to be ordered (typically by their x-axis values). This type of plot is often used to
visualize a trend in data over intervals of time – a time series. To make a line plot with
Matplotlib, we call plt.plot(). The first argument is used for the data on the horizontal axis,
and the second is used for the data on the vertical axis. This function generates the plot,
but it doesn’t display it. To display the plot, we need to call the plt.show()
function. This is useful because we might want to add some additional customizations to our
plot before we display it. For example, we might want to add labels to the axes and a title for
the plot.

Simple Line Plot
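A minimal sketch of a line plot with Matplotlib, using hypothetical monthly sales values, is
shown below.

# Simple line plot with Matplotlib (hypothetical monthly sales data)
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 145, 170, 190]

plt.plot(months, sales, marker="o")   # markers connected by straight lines
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Simple Line Plot")
plt.show()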

Scatter plot: This type of plot shows all individual data points. Here, they aren’t
connected with lines. Each data point has an x-axis value and a y-axis value. This type
of plot can be used to display trends or correlations. In data
science, it shows how 2 variables compare. To make a scatter plot with Matplotlib, we
can use the plt.scatter() function. Again, the first argument is used for the data on the
horizontal axis, and the second for the vertical axis.

Simple Scatter Plot
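A small scatter plot sketch with hypothetical data is given below.

# Simple scatter plot with Matplotlib (hypothetical data)
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [35, 45, 50, 58, 62, 70, 74, 83]

plt.scatter(hours_studied, exam_score)   # individual, unconnected points
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Simple Scatter Plot")
plt.show()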
Histogram: an accurate representation of the distribution of numeric data. To create a
histogram, first, we divide the entire range of values into a series of intervals, and second,
we count how many values fall into each interval. The intervals are also called bins. The
bins are consecutive and non-overlapping intervals of a variable. They must be adjacent
and are often of equal size. To make a histogram with Matplotlib, we can use the plt.hist()
function. The first argument is the numeric data, the second argument is the number
of bins. The default value for the bins argument is 10.

Simple Histogram
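The sketch below builds a histogram from a small hypothetical dataset chosen so that the bin
counts roughly match the description that follows (5, 3 and 2 values per bin).

# Simple histogram with Matplotlib (hypothetical data, 3 bins over the range 0-9)
import matplotlib.pyplot as plt

values = [1, 1.5, 2, 2.5, 2.8, 4, 5, 5.5, 7, 8]

plt.hist(values, bins=3, range=(0, 9), edgecolor="black")
plt.title("Simple Histogram")
plt.show()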

We can see from the histogram above that there are:

• 5 values between 0 and 3


• 3 values between 3 and 6 (including)
• 2 values between 6 (excluding) and 9
Box plot, also called the box-and-whisker plot: a way to show the distribution of values
based on the five-number summary: minimum, first quartile, median, third quartile,
and maximum.

• The minimum and the maximum are just the min and max values from our data.

• The median is the value that separates the higher half of the data from the lower half. It’s
calculated by the following steps: order your values, and find the middle one. In the
case when our count of values is even, we actually have 2 middle numbers, so the
median here is calculated by summing these 2 numbers and dividing the sum by 2. For
example, if we have the numbers 1, 2, 5, 6, 8, 9, the median will be (5 + 6) / 2 = 5.5.

• The first quartile is the median of the data values to the left of the median in our
ordered values. For example, if we have the numbers 1, 3, 4, 7, 8, 8, 9, the first quartile
is the median of the 1, 3, 4 values, so it’s 3.

• The third quartile is the median of the data values to the right of the median in our
ordered values. For example, if we use these numbers 1, 3, 4, 7, 8, 8, 9 again, the
third quartile is the median of the 8, 8, 9 values, so it’s 8.

• I also want to mention one more statistic here: the IQR (Interquartile
Range). The IQR approximates the amount of spread in the middle 50% of the data.
The formula is the third quartile minus the first quartile.

• This type of plot can also show outliers. An outlier is a data value that lies outside the
overall pattern. They are visualized as circles. When we have
outliers, the minimum and the maximum are visualized as the min and the max values
from the values which aren’t outliers. There are many ways to identify what is an
outlier. A commonly used rule says that a value is an outlier if it’s less than the first
quartile - 1.5 * IQR or higher than the third quartile + 1.5 * IQR.

That’s a powerful plot, isn’t it? Let’s see an example to understand the plot better.
Now, when we have some knowledge about the box plots, we can see that it’s super
easy to create this plot with Matplotlib. All we need is the function plt.boxplot(). The
first argument is the data points.

Simple Box Plot
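Below is a small box plot sketch; the data values are an assumption chosen so that the plot
behaves like the example described next (1, 2 and 21 flagged as outliers, whiskers at 5 and 10,
median 7).

# Simple box plot with Matplotlib (hypothetical data)
import matplotlib.pyplot as plt

values = [1, 2, 5, 7, 7, 7, 7, 8, 8, 9, 10, 21]

plt.boxplot(values)   # points beyond 1.5 * IQR are drawn as circles (outliers)
plt.title("Simple Box Plot")
plt.show()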

In this example the values 1, 2, and 21 are outliers, the maximum (upper whisker) is 10, the
minimum (lower whisker) is 5, and the median is 7.

Bar chart: represents categorical data with rectangular bars. Each bar has a height that
corresponds to the value it represents. It’s useful when we want to compare a given numeric
value across different categories. It can also be used with 2 data series.
To make a bar chart with Matplotlib, we’ll need the plt.bar() function. Simple Bar Chart

Simple Bar Chart Output
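A minimal bar chart sketch with hypothetical category totals is shown below.

# Simple bar chart with Matplotlib (hypothetical category totals)
import matplotlib.pyplot as plt

categories = ["Technology", "Furniture", "Office Supplies"]
sales = [840, 740, 720]

plt.bar(categories, sales)
plt.ylabel("Sales")
plt.title("Simple Bar Chart")
plt.show()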

Pie chart: a circular plot, divided into slices to show numerical proportion. They are widely
used in the business world. However, many experts recommend avoiding them. The main
reason is that it’s difficult to compare the sections of a given pie chart. Also, it’s difficult to
compare data across multiple pie charts. In many cases, they can be replaced by a bar chart.

Now, we can see the same data displayed on a bar chart. This time the percentage revenues
can be easily compared.

To make a pie chart with Matplotlib, we can use the plt.pie() function. The
autopct parameter allows us to display the percentage value using Python
string formatting.

Simple Pie Chart
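A small pie chart sketch with hypothetical revenue shares follows.

# Simple pie chart with Matplotlib (hypothetical revenue shares)
import matplotlib.pyplot as plt

labels = ["Product A", "Product B", "Product C"]
revenue = [45, 30, 25]

plt.pie(revenue, labels=labels, autopct="%1.1f%%")   # autopct shows percentages
plt.title("Simple Pie Chart")
plt.show()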

Bar and Column Charts

It’s one of the simplest graphs for understanding how a quantitative field performs across
various categories. It is used for comparison.
From the above column chart, we can see that sales of technology are the highest and office
supplies are the lowest.

The above-shown graph is a bar chart that shows which categories perform better.

Scatter Plot and Bubble Chart

Scatter and bubble plots help us to understand how variables are spread across the
range considered. They can be used to identify patterns, the presence of outliers
and the relationship between two variables.
We can see that as the discount increases, profits decrease.
The above-shown graph is a bubble graph.

Line Graph

It is preferred when time-dependent data has to be presented. It is best
suited to analysing a trend.

From the above graph, we can see that sales are increasing over
the months, but there is a sudden dip in the month of July, and
sales are highest in November.


Histogram
A histogram is a frequency chart that records the number of occurrences of an entry in a
data set. It is useful when you want to understand the distribution of a series.

Box Plot

Box plots are effective in summarizing the spread of large data. They use percentiles
to divide the data range. This helps us to understand the data points which fall below or above a
chosen value. It also helps us to identify outliers in the data.
A box plot divides the entire data into three parts:
* Median value – it divides the data into two equal halves
* IQR – it ranges between the 25th and 75th percentile values
* Outliers – data points that differ significantly and lie outside the whiskers

The circles in the above plot show the presence of outliers.

Subplots

Sometimes it’s better to plot different plots in the same grid to understand and
compare the data better.
Here you can see that in the single graph we were able to understand sales
over a period of time in different regions.

Donut, Pie charts and stacked column charts

When we want to find the composition of data, the above-mentioned charts are the best.

The above doughnut chart shows the sales composition of different product categories.

The above pie chart shows the percentage of sales in different years.
The above-stacked column chart shows the sale of two products over different quarters.

Heatmaps

It is the most preferred chart when we want to check whether there is any correlation between
variables. Here a positive value shows a positive correlation and a negative value shows a
negative correlation. The colour indicates the intensity of the correlation: the darker the colour,
the higher the positive correlation, and the lighter the colour, the stronger the negative correlation.
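A minimal sketch of a correlation heat map, built with pandas and Matplotlib on hypothetical
columns, is shown below (the seaborn library offers a similar heatmap() function).

# Correlation heat map with pandas and Matplotlib (hypothetical data)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sales": rng.normal(100, 20, 50),
    "profit": rng.normal(20, 5, 50),
    "discount": rng.uniform(0, 0.5, 50),
})

corr = df.corr()                              # correlation matrix
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation Heat Map")
plt.show()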

Open Data

Open Data means the kind of data which is open for anyone and everyone for access,
modification, reuse, and sharing.
Open Data derives its base from various “open movements” such as open source, open
hardware, open government, open science etc.

Governments, independent organizations, and agencies have come forward to open the
floodgates of data to create more and more open data for free and easy access.

Why Is Open Data Important?


Open data is important because the world has grown increasingly data-driven. But if
there are restrictions on the access and use of data, the idea of data-driven business and
governance will not materialize. Therefore, open data has its own unique place. It can
allow a fuller understanding of global problems and universal issues. It can give a big
boost to businesses. It can be a great impetus for machine learning. It can help fight global
problems such as disease, crime or famine. Open data can empower citizens and hence can
strengthen democracy. It can streamline the processes and systems that society and
governments have built. It can help transform the way we understand and engage with the
world.

So, here is a list of 15 awesome Open Data sources:


1. World Bank Open Data

As a repository of the world’s most comprehensive data regarding what’s happening in
different countries across the world, World Bank Open Data is a vital source of Open
Data. It also provides access to other datasets, which are mentioned in the data
catalog.

World Bank Open Data is massive because it has got 3000 datasets and 14000
indicators encompassing microdata, time series statistics, and geospatial data.

Accessing and discovering the data you want is also quite easy. All you need to do is to
specify the indicator names, countries or topics and it will open up the treasure-
house of Open Data for you. It also allows you to download data in different formats
such as CSV, Excel, and XML.
If you are a journalist or academic, you will be enthralled by the array of tools available
to you. You can get access to analysis and visualization tools that can bolster your
research. It can facilitate a deeper and better understanding of global problems.
You can get access to the API, which can help you create the data visualizations you
need, live combinations with other data sources and many more such features.

Therefore, it’s no surprise that World Bank Open Data tops any list of Open Data
sources!
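As an illustration of programmatic access, the sketch below queries the World Bank's public v2
API for a single indicator with the requests library. The exact endpoint layout, indicator code,
and response shape are assumptions based on the public API documentation and should be verified
before use.

# Fetching an indicator from the World Bank open data API
# (endpoint and response layout assumed from the public v2 API)
import requests

url = "https://api.worldbank.org/v2/country/IN/indicator/SP.POP.TOTL"
response = requests.get(url, params={"format": "json", "per_page": 5}, timeout=10)

metadata, records = response.json()    # the API is expected to return [metadata, data]
for record in records:
    print(record["date"], record["value"])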

2. WHO (World Health Organization) — Open data repository

WHO’s Open Data repository is how WHO keeps track of health-specific statistics of
its 194 Member States.

The repository keeps the data systematically organized. It can be accessed as per
different needs. For instance, whether it is mortality or burden of diseases, one can
access data classified under 100 or more categories such as the Millennium
Development Goals (child nutrition, child health, maternal and reproductive health,
immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and
sanitation), non communicable diseases and risk factors, epidemic-prone diseases,
health systems, environmental health, violence and injuries, equity etc.

For your specific needs, you can go through the datasets according to themes, category,
indicator, and country.

The good thing is that it is possible to download whatever data you need in Excel
Format. You can also monitor and analyze data by making use of its data portal.

The API to the World Health Organization’s data and statistics content is also available.

3. Google Public Data Explorer

Launched in 2010, Google Public Data Explorer can help you explore vast
amounts of public-interest datasets. You can visualize and communicate the data for
your respective uses.

It makes data from different agencies and sources available. For instance, you can
access data from the World Bank, the U.S. Bureau of Labor Statistics, the U.S. Census Bureau,
the OECD, the IMF, and others.

Different stakeholders access this data for a variety of purposes. Whether you are a
student or a journalist, whether you are a policy maker or an academic, you can
leverage this tool in order to create visualizations of public data.

You can deploy various ways of representing the data such as line graphs, bar graphs,
maps and bubble charts with the help of Data Explorer.

The best part is that you would find these visualizations quite dynamic. It means that
you will see them change over time. You can change topics, focus on different entries
and modify the scale.
It is easily shareable too. As soon as you get the chart ready, you can embed it on your
website or blog or simply share a link with your friends.

4. Registry of Open Data on AWS (RODA)

This is a repository containing public datasets. It is data which is available from AWS
resources.

As far as RODA is concerned, you can discover and share the data which is publicly
available.

In RODA, you can use keywords and tags for common types of data such as genomic,
satellite imagery and transportation in order to search whatever data that you are
looking for. All of this is possible on a simple web interface.

For every dataset, you will find a detail page, usage examples, license information
and tutorials or applications that use this data.

By making use of a broad range of compute and data analytics products, you can
analyze the open data and build whatever services you want.

While the data you access is available through AWS resources, you need to bear in
mind that it is not provided by AWS. This data belongs to different agencies,
government organizations, researchers, businesses and individuals.

5. European Union Open Data Portal

You can access whatever open data EU institutions, agencies and other organizations
publish on a single platform namely European Union Open Data Portal.

The EU Open Data Portal is home to vital open data pertaining to EU policy domains.
These policy domains include economy, employment, science, environment, and
education.

Around 70 EU institutions, organizations, or departments, such as Eurostat, the European Environment Agency, the Joint Research Centre, and other European Commission Directorates-General and EU agencies, have made their datasets public and allowed access. More than 11,700 such datasets have been published to date.

The portal enables easy access. You can easily search, explore, link, download, and reuse the data through a catalog of common metadata, for commercial or non-commercial purposes.

You can search the metadata catalog through an interactive search engine (Data tab)
and SPARQL queries (Linked data tab).
By making use of this catalog, you can gain access to the data stored on the different
websites of the EU institutions, agencies and organizations.

6. FiveThirtyEight

It is a great site for data-driven journalism and story-telling.

It provides its various sources of data for a variety of sectors such as politics, sports,
science, economics etc. You can download the data as well.

When you access the data, you will come across a brief explanation regarding each
dataset with respect to its source. You will also get to know what it stands for and how
to use it.

To keep this data user-friendly, it provides datasets in simple, non-proprietary formats such as CSV files wherever possible. Needless to say, these formats can be easily accessed and processed by humans as well as machines.

With the help of these datasets, you can create stories and visualizations as per your
own requirements and preference.

7. U.S. Census Bureau

The U.S. Census Bureau is the largest statistical agency of the federal government. It stores and provides reliable facts and data regarding the people, places, and economy of America.

The Census Bureau sees its mission as being the most reliable provider of quality data.

Whether it is a federal, state, local or tribal government, all of them make use of census
data for a variety of purposes. These governments use this data to determine the
location of new housing and public facilities. They also make use of it at the time of
examining the demographic characteristics of communities, states, and the USA.

This data is also used in planning transportation systems and roadways. When it comes to deciding quotas and creating police and fire precincts, this data comes in handy. When governments create localized areas for elections, schools, utilities, etc., they make use of this data. Population information is compiled once a decade, and this data is quite useful in accomplishing that task.

There are various tools such as American FactFinder, Census Data Explorer, and QuickFacts which are useful if you want to search, customize, and visualize data.

For instance, QuickFacts alone contains statistics for all the states, counties, cities, and even towns with a population of 5,000 or more.
Likewise, American FactFinder can help you discover popular facts such as population, income, etc. It provides information that is frequently requested.

The good thing is that you can search, interact with the data, learn about popular statistics, and see related charts through Census Data Explorer. Moreover, you can also use its visual tools to customize data in an interactive maps experience.

8. Data.gov

Data.gov is the treasure-house of the US government's open data. It was only recently that the decision was made to make all government data available for free.

When it was launched, there were only 47 datasets. There are now 180,000 datasets.

Data.gov is a great resource because you can find data, tools, and resources that you can deploy for a variety of purposes. You can conduct your research, develop your
web and mobile applications and even design data visualizations.

All you need to do is enter keywords in the search box and browse through types, tags,
formats, groups, organization types, organizations, and categories. This will facilitate
easy access to data or datasets that you need.

Data.gov follows the Project Open Data Schema — a set of requisite fields (Title,
Description, Tags, Last Update, Publisher, Contact Name, etc.) for every data set
displayed on Data.gov.

9. DBpedia

As you know, Wikipedia is a great source of information. DBpedia aims to extract structured content from the valuable information that Wikipedia contributors have created.

With DBpedia, you can semantically search and explore relationships and properties of Wikipedia resources. This includes links to other related datasets as well.

There are around 4.58 million entities in the DBpedia dataset. Of these, 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.

There are labels and abstracts for these entities in around 125 languages. There are 25.2
million links to images. There are 29.8 million links to external web pages.

To use DBpedia, all you need to do is write SPARQL queries against its public endpoint or download its data dumps.
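
As a small illustration, the third-party SPARQLWrapper package for Python can send a query to the public endpoint (assumed here to be https://dbpedia.org/sparql) and print a few film titles from the DBpedia ontology; treat the query and endpoint as a sketch rather than a definitive recipe.

from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?film ?name WHERE {
        ?film a dbo:Film ;
              rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 5
""")

# Each binding maps the SELECT variables to their values
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])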

DBpedia has benefitted several enterprises, such as Apple (via Siri), Google (via
Freebase and Google Knowledge Graph), and IBM (via Watson), and particularly their
respective prestigious projects associated with artificial intelligence.

10. freeCodeCamp Open Data

It is an open-source community. It matters because it enables you to learn to code, build pro bono projects for nonprofits, and land a job as a developer.

To make this happen, the freeCodeCamp.org community makes enormous amounts of data available every month. They have turned it into open data.

You will find a variety of things in this repository. You can find datasets, analysis of
the same and even demos of projects based on the freeCodeCamp data. You can also
find links to external projects involving the freeCodeCamp data.

It can help you with a diversity of projects and tasks that you may have in mind. Whether it is web analytics, social media analytics, social network analysis, education analysis, data visualization, data-driven web development or bots, the data offered by this community can be extremely useful and effective.

11. Yelp Open Datasets

The Yelp dataset is a subset of Yelp's own businesses, reviews, and user data, made available for use in personal, educational, and academic pursuits.

There are 5,996,996 reviews, 188,593 businesses, 280,991 pictures and 10 metropolitan
areas included in Yelp Open Datasets.

You can use them for different purposes. Since they are available as JSON files, you can use them to teach students about databases, to learn NLP, or as sample production data while you learn how to design mobile apps.

In this dataset, each file is composed of a single object type, with one JSON object per line.

12. UNICEF Dataset

Since UNICEF concerns itself with a wide variety of critical issues, it has
compiled relevant data on education, child labor, child disability, child mortality,
maternal mortality, water and sanitation, low birth-weight, antenatal care, pneumonia,
malaria, iodine deficiency disorder, female genital mutilation/cutting, and adolescents.

UNICEF’s open datasets published on the IATI Registry (http://www.iatiregistry.org/publisher/unicef) have been extracted directly from UNICEF’s operating system (VISION) and other data systems, and they reflect inputs made by individual UNICEF offices.

The good thing is that these datasets are updated regularly. Every month, the data is refreshed in order to make it more comprehensive, reliable and accurate.

You can freely and easily access this data. In order to do so, you can download this data
in CSV format. You can also preview sample data prior to downloading it.

While anybody can explore and visualize UNICEF’s datasets, there are three principal
publishers:

UNICEF’s AID TRANSPARENCY PORTAL : You can far more easily access the
datasets if you use this portal. It also includes details for each country that UNICEF
works in.

Publisher d-portal : It is, at the moment, in beta. With this portal, you can explore IATI data. You can search information related to development activities, budgets, etc., and you can explore this information country-wise.

Publisher’s data platform : On this platform, you can easily access statistics, charts, and
metrics on data accessed via the IATI Registry. If you click on the headers, you can
also sort many of the tables that you see on the platform. You will also find many of the
datasets in the platforms in machine-readable JSON format.

13. Kaggle

Kaggle is great because it promotes the use of different dataset publication formats. Even better, it strongly recommends that dataset publishers share their data in an accessible, non-proprietary format.

The platform supports open and accessible data formats. This is important not just for access but also for whatever you want to do with the data. Therefore, Kaggle Datasets clearly defines the file formats that are recommended when sharing data.

The unique thing about Kaggle datasets is that it is not just a data repository. Each
dataset stands for a community that enables you to discuss data, find out public codes
and techniques, and conceptualize your own projects in Kernels.

CSV, JSON, SQLite, archive, and BigQuery files are among the file types that Kaggle supports. You can find a variety of resources in order to start working on your open data project.
The best part is that Kaggle allows you to publish and share datasets privately or
publicly.

14. LODUM

It is the Open Data initiative of the University of Münster. Under this initiative, anyone can access public information about the university in machine-readable formats. You can easily access and reuse it as per your needs.

Open data about scientific artifacts, encoded as Linked Data, is made available under this project.

With the help of Linked Data, it is possible to share and use data, ontologies and
various metadata standards. It is, in fact, envisaged that it will be the accepted standard
for providing metadata, and the data itself on the Web.

You can use a SPARQL editor or the SPARQL package for R to analyze the data. The SPARQL package enables you to connect to a SPARQL endpoint over HTTP and pose a SELECT query or an update query (LOAD, INSERT, DELETE).

15. UCI Machine Learning Repository

It serves as a comprehensive repository of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

In this repository, there are, at present, 463 datasets as a service to the machine learning
community.

The Center for Machine Learning and Intelligent Systems at the University of
California, Irvine hosts and maintains it. David Aha had originally created it as a
graduate student at UC Irvine.

Since then, students, educators, and researchers all over the world make use of it as a
reliable source of machine learning datasets.

Each dataset has its own webpage that lists all the known details, including any relevant publications that investigate it. You can download these datasets as ASCII files, often in the convenient CSV format.

The details of the datasets are summarized by aspects like attribute types, number of instances, number of attributes, and year published, which can be sorted and searched.

Data APIs

A Data API provides API access to data stored in a Data management system. APIs
provide granular, per record access to datasets and their component data files.
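
To make the idea of per-record access concrete, here is a rough sketch against a CKAN-style portal, whose DataStore exposes a datastore_search action; the portal URL and resource_id below are placeholders, and other data management systems expose similar but not identical endpoints.

import requests

# Ask the portal's Data API for the first five records of one resource
portal = "https://demo.ckan.org"          # placeholder portal URL
resource_id = "<resource-id>"             # placeholder resource identifier
response = requests.get(
    f"{portal}/api/3/action/datastore_search",
    params={"resource_id": resource_id, "limit": 5},
)
response.raise_for_status()

for record in response.json()["result"]["records"]:
    print(record)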

Limitations of APIs

Whilst Data APIs are in many ways more flexible than direct download, they have disadvantages:

1. APIs are much more costly and complex to create and maintain than direct download.
2. API queries are slow and limited in size because they run in real time in memory. Thus, for bulk access, e.g. of an entire dataset, direct download is much faster and more efficient (downloading a 1 GB CSV directly is easy and takes seconds, but attempting to do so via the API may be very slow and can crash the server).

Why Data APIs?

1. Data (pre)viewing: reliably and richly (e.g. with querying, mapping, etc.). This makes the data much more accessible to non-technical users.
2. Visualization and analytics: rich visualization and analytics may need a data API (because they need to easily query and aggregate parts of a dataset).
3. Rich data exploration: when exploring data, you will want to move through a dataset quickly, pulling only parts of the data and drilling down further as needed.
4. (Thin) client applications: with a data API, third-party users of the portal can build apps on top of the portal data easily and quickly (and without having to host the data themselves).

On the other hand, Data APIs provide support for:

• Rapidly updating data, e.g. time series: if you are updating a dataset every minute or every second, you want an append operation and don't want to store the whole file on every update just to add a single record
• Datasets stored as structured data by default, which can therefore be updated in part, a few records at a time, rather than all at once (as with blob storage)

Domain Model

The functionality associated with Data APIs can be divided into 6 areas:


1. Descriptor: metadata describing and specifying the API, e.g. general metadata such as name, title, description, schema, and permissions.
2. Manager: for creating and editing APIs.
   • API: for creating and editing a Data API's descriptor (which triggers creation of the storage and service endpoint)
   • UI: for doing this manually
3. Service (read): a web API for accessing structured data (i.e. per record) with querying etc. When we simply say "Data API", this is usually what we are talking about.
   • Custom APIs and complex functions: e.g. aggregations, joins
   • Tracking and analytics: rate limiting etc.
   • Write API: usually secondary because of its limited performance versus bulk loading
   • Bulk export of query results, especially large ones (or even export of the whole dataset where the data is stored directly in the DataStore rather than the FileStore). This is an increasingly important feature; although a lower priority, it is a substantive feature to implement if required.
4. Data Loader: bulk loading data into the system that powers the data API.
   • Bulk load: bulk import of individual data files
   • May include some ETL, which moves this component toward a data factory
5. Storage (structured): the underlying structured store for the data (and its layout), for example Postgres and its table structure. This could be considered a separate component that the Data API uses or as part of the Data API – in some cases the store and API are completely wrapped together, e.g. Elasticsearch is both a store and a rich web API.

Web Scraping

Web scraping is an automatic method to obtain large amounts of data from websites.
Most of this data is unstructured data in an HTML format which is then converted into
structured data in a spreadsheet or a database so that it can be used in various applications.
There are many different ways to perform web scraping to obtain data from websites. These include using online services, particular APIs, or even writing your own web scraping code from scratch. Many large websites, like Google, Twitter, Facebook, StackOverflow, etc., have APIs that allow you to access their data in a structured format. This is the best option, but other sites either don't allow users to access large amounts of data in a structured form or are simply not that technologically advanced. In that situation, it's best to use web scraping to scrape the website for data.
Web scraping requires two parts, namely the crawler and the scraper. The crawler is an automated program (often called a spider) that browses the web to search for the particular data required by following links across the internet. The scraper, on the other hand, is a specific tool
created to extract data from the website. The design of the scraper can vary greatly according
to the complexity and scope of the project so that it can quickly and accurately extract the
data.

How do Web Scrapers Work?

Web Scrapers can extract all the data on particular sites or the specific data that a user wants.
Ideally, it’s best if you specify the data you want so that the web scraper only extracts that
data quickly. For example, you might want to scrape an Amazon page for the types of juicers
available, but you might only want the data about the models of different juicers and not the
customer reviews.
So, when a web scraper needs to scrape a site, first the URLs are provided. Then it loads all the HTML code for those sites, and a more advanced scraper might even extract all the CSS and JavaScript elements as well. Then the scraper obtains the required data from this HTML code and outputs it in the format specified by the user. Mostly, this is in the form of an Excel spreadsheet or a CSV file, but the data can also be saved in other formats, such as a JSON file.
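
A minimal sketch of that workflow in Python is shown below; the URL, tag names, and CSS classes are purely illustrative placeholders, and a real scraper has to match the structure of the target page and respect the site's terms of use and robots.txt.

import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"          # placeholder URL
html = requests.get(url, timeout=10).text     # 1. load the page's HTML

soup = BeautifulSoup(html, "html.parser")     # 2. parse it into a tree
rows = []
for item in soup.find_all("div", class_="product"):   # placeholder selector
    name = item.find("h2").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    rows.append({"name": name, "price": price})

# 3. output the extracted data in the format the user wants (here, CSV)
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)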

Different Types of Web Scrapers

Web Scrapers can be divided on the basis of many different criteria, including Self-built or
Pre-built Web Scrapers, Browser extension or Software Web Scrapers, and Cloud or Local
Web Scrapers.
You can have self-built Web Scrapers, but that requires advanced knowledge of programming, and if you want more features in your Web Scraper, then you need even more knowledge. On the other hand, pre-built Web Scrapers are previously created scrapers that you can download and run easily. These also have more advanced options that you can customize.

Browser extensions Web Scrapers are extensions that can be added to your browser. These
are easy to run as they are integrated with your browser, but at the same time, they are also
limited because of this. Any advanced features that are outside the scope of your browser are
impossible to run on Browser extension Web Scrapers. But Software Web Scrapers don’t
have these limitations as they can be downloaded and installed on your computer. These are
more complex than Browser web scrapers, but they also have advanced features that are not
limited by the scope of your browser.

Cloud Web Scrapers run on the cloud, which is an off-site server mostly provided by the
company that you buy the scraper from. These allow your computer to focus on other tasks as
the computer resources are not required to scrape data from websites. Local Web Scrapers,
on the other hand, run on your computer using local resources. So, if the Web scrapers
require more CPU or RAM, then your computer will become slow and not be able to perform
other tasks.

Why is Python a popular programming language for Web Scraping?

Python seems to be in fashion these days! It is the most popular language for web scraping as
it can handle most of the processes easily. It also has a variety of libraries that were created
specifically for Web Scraping. Scrapy is a very popular open-source web crawling
framework that is written in Python. It is ideal for web scraping as well as extracting data
using APIs. Beautiful Soup is another Python library that is highly suitable for Web Scraping. It creates a parse tree that can be used to extract data from HTML on a website. Beautiful Soup also has multiple features for navigating, searching, and modifying these parse trees.
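
For comparison, a minimal Scrapy spider might look like the sketch below. It is written against quotes.toscrape.com, a site published specifically for scraping practice, and it could be run with a command such as: scrapy runspider quotes_spider.py -o quotes.json

import scrapy

class QuotesSpider(scrapy.Spider):
    # The spider both crawls (starting from start_urls) and scrapes (in parse)
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
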
What is Web Scraping used for?

Web Scraping has multiple applications across various industries. Let’s check out some of
these now!
1. Price Monitoring
Web Scraping can be used by companies to scrape the product data for their own products and competing products as well, to see how it impacts their pricing strategies. Companies can use this data to fix the optimal pricing for their products so that they can obtain maximum revenue.
2. Market Research
Web scraping can be used for market research by companies. High-quality web
scraped data obtained in large volumes can be very helpful for companies in
analyzing consumer trends and understanding which direction the company
should move in the future.
3. News Monitoring
Web scraping news sites can provide detailed reports on the current news to a
company. This is even more essential for companies that are frequently in the
news or that depend on daily news for their day-to-day functioning. After all,
news reports can make or break a company in a single day!
4. Sentiment Analysis
If companies want to understand the general sentiment for their products among
their consumers, then Sentiment Analysis is a must. Companies can use web
scraping to collect data from social media websites such as Facebook and Twitter
as to what the general sentiment about their products is. This will help them in
creating products that people desire and moving ahead of their competition.
5. Email Marketing
Companies can also use web scraping for email marketing. They can collect email IDs from various sites using web scraping and then send bulk promotional and marketing emails to all the people owning these email IDs.

Relational Database access (queries) to process/access data.

Using a query makes it easier to view, add, delete, or change data in your
Access database. Some other reasons for using queries:

• Find specific data quickly by filtering on specific criteria (conditions)


• Calculate or summarize data
• Automate data management tasks, such as reviewing the most current data on a
recurring basis.
• Queries help you find and work with your data
• Create a select query
• Create a parameter query
• Create a totals query
• Create a crosstab query
• Create a make table query
• Create an append query
• Create an update query
• Create a delete query

Queries help you find and work with your data

In a well-designed database, the data that you want to present through a form or report is
usually located in multiple tables. A query can pull the information from various tables and
assemble it for display in the form or report. A query can either be a request for data results
from your database or for action on the data, or for both. A query can give you an answer to a
simple question, perform calculations, combine data from different tables, add, change, or
delete data from a database. Since queries are so versatile, there are many types of queries
and you would create a type of query based on the task.

Major query types and their use:

• Select – To retrieve data from a table or make calculations.
• Action – Add, change, or delete data. Each task has a specific type of action query. (Action queries are not available in Access web apps.)

Create a select query

If you want to review data from only certain fields in a table, or review data from
multiple tables simultaneously or maybe just see the data based on certain criteria, a
select query type would be your choice.

Review data from select fields

For example, if your database has a table with a lot of information about products and
you want to review a list of products and their prices, here’s how you’d create a select
query to return just the product names and the respective price:

1. Open the database and on the Create tab, click Query Design.
2. On the Tables tab, double-click the Products table.
3. In the Products table, let's say that you have Product Name and List Price fields. Double-click Product Name and List Price to add these fields to the query design grid.
4. On the Design tab, click Run. The query runs, and displays a list of products
and their prices.
Review data from multiple related tables simultaneously

For example, suppose you have a database for a store that sells food items and you want to review orders for customers who live in a particular city. Say that the data about customers and the data about orders are stored in two tables named Customers and Orders respectively. Each table has a Customer ID field, which forms the basis of a one-to-many relationship between the two tables. You can create a query that returns orders for customers in a particular city, for example Las Vegas, by using the following procedure:

1. Open the database. On the Create tab, in the Query group, click Query
Design.
2. On the Tables tab, double-click Customers and Orders.

Note the line (called a join) that connects the ID field in the
Customers table and the Customer ID field in the Orders table.
This line shows the relationship between the two tables.

3. In the Customers table, double-click Company and City to add these fields to the query design grid.
4. In the query design grid, in the City column, clear the check box in the Show
row.
5. In the Criteria row of the City column, type Las Vegas.

Clearing the Show check box prevents the query from displaying the city in its
results, and typing Las Vegas in the Criteria row specifies that you want to
see only records where the value of the City field is Las Vegas. In this case,
the query returns only the customers that are located in Las Vegas. You don’t
need to display a field to use it with a criterion.
6. In the Orders table, double-click Order ID and Order Date to add these
fields to the next two columns of the query design grid.
7. On the Design tab, in the Results group, click Run. The query runs, and then
displays a list of orders for customers in Las Vegas.
8. Press CTRL+S to save the query.
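
Behind the design grid, Access builds an SQL statement (you can see it in SQL View). For readers working outside Access, a rough equivalent of this select query is sketched below with Python's built-in sqlite3 module; the store.db file and the table and field names are hypothetical, mirroring the example above.

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
query = """
    SELECT Customers.Company, Orders.[Order ID], Orders.[Order Date]
    FROM Customers
    JOIN Orders ON Orders.[Customer ID] = Customers.ID
    WHERE Customers.City = 'Las Vegas';
"""
for company, order_id, order_date in conn.execute(query):
    print(company, order_id, order_date)
conn.close()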

Create a parameter query

If you frequently want to run variations of a particular query, consider using a parameter query. When you run a parameter query, the query prompts you for field values, and then uses the values that you supply to create criteria for your query.

Continuing from the previous example where you learnt to create a select query that
returns orders for customers located in Las Vegas, you can modify the select query to
prompt you to specify the city each time that you run the query. To follow along, open
the database that you created in the previous example:

1. In the Navigation Pane, right-click the query named Orders by City (that you
created in the previous section), and then click Design View on the shortcut
menu.
2. In the query design grid, in the Criteria row of the City column, delete Las
Vegas, and then type [For what city?].
The string [For what city?] is your parameter prompt. The square brackets
indicate that you want the query to ask for input, and the text (in this case, For
what city?) is the question that the parameter prompt displays.

3. Select the check box in the Show row of the City column, so that the
query results will display the city.
4. On the Design tab, in the Results group, click Run. The query prompts you to
enter a value for City.
5. Type New York, and then press ENTER to see orders for customers in New
York.

What if you don't know what values you can specify? You can use wildcard
characters as part of the prompt:

6. On the Home tab, in the Views group, click View, and then click Design
View.
7. In the query design grid, in the Criteria row of the City column, type
Like [For what city?]&"*".

In this parameter prompt, the Like keyword, the ampersand (&), and the
asterisk (*) enclosed in quotation marks allow the user to type a combination
of characters, including wildcard characters, to return a variety of results. For
example, if the user types *, the query returns all cities; if the user types L, the
query returns all cities that start with the letter "L;" and if the user types *s*,
the query returns all cities that contain the letter "s."

8. On the Design tab, in the Results group, click Run, and at the query prompt,
type New, and press ENTER.

The query runs, and then displays orders for customers in New York.
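
In code, the same idea is usually expressed with query parameters (placeholders) whose values are supplied at run time. A rough sqlite3 sketch, again assuming the hypothetical store.db and table names used earlier, mirrors the wildcard prompt above:

import sqlite3

conn = sqlite3.connect("store.db")     # hypothetical database file
city = input("For what city? ")        # prompt the user, as the Access query does

# The ? placeholder is bound safely at run time; '%' acts like Access's * wildcard
query = """
    SELECT Customers.Company, Orders.[Order ID], Orders.[Order Date]
    FROM Customers
    JOIN Orders ON Orders.[Customer ID] = Customers.ID
    WHERE Customers.City LIKE ? || '%';
"""
for row in conn.execute(query, (city,)):
    print(row)
conn.close()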

Specify parameter data types

You can also specify what type of data a parameter should accept. You can set the data
type for any parameter, but it is especially important to set the data type for numeric,
currency, or date/time data. When you specify the data type that a parameter should
accept, users see a more helpful error message if they enter the wrong type of data, such
as entering text when currency is expected.

If a parameter is set to accept text data, any input is interpreted as text, and no error
message is displayed.

To specify the data type for parameters in a query, use the following procedure:

1. With the query open in Design view, on the Design tab, in the
Show/Hide group, click Parameters.
2. In the Query Parameters dialog box, in the Parameter column, type the
prompt for each parameter for which you want to specify the data type. Make
sure that each parameter matches the prompt that you use in the Criteria row
of the query design grid.
3. In the Data Type column, select the data type for each parameter.

Create a totals query

The Total row in a datasheet is very useful, but for more complex questions, you use a
totals query. A totals query is a select query that allows you to group and summarize
data, like when you want to see total sales per product. In a totals query, you can use
the Sum function (an aggregate function), to see total sales per product.

Use the following procedure to modify the Product Subtotals query (created in the "Make calculations based on your data" example below) so that it summarizes product subtotals by product.

1. On the Home tab, click View > Design View.

The Product Subtotals query opens in Design view.

2. On the Design tab, in the Show/Hide group, click Totals.

The Totals row is displayed in the query design grid.

▪ You can group by field values by using the Totals row in the design grid.
▪ You can add a datasheet Total row to the results of a totals query.
▪ When you use the Totals row in the design grid, you must choose an aggregate function for each field. If you do not want to perform a calculation on a field, you can group by the field.

3. In the second column of the design grid, in the Total row, select Sum from the drop-down list.
4. On the Design tab, in the Results group, click Run. The query runs, and then displays a list of products with subtotals.
5. Press CTRL+S to save the query. Leave the query open.
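
In SQL terms, a totals query is a GROUP BY combined with an aggregate function. A rough sqlite3 sketch, assuming a hypothetical Order Details table like the one used in the next section, would total the quantity sold per product as follows:

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
query = """
    SELECT [Product ID], SUM([Quantity]) AS [Total Quantity]
    FROM [Order Details]
    GROUP BY [Product ID];
"""
for product_id, total_quantity in conn.execute(query):
    print(product_id, total_quantity)
conn.close()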

Make calculations based on your data

You usually would not use tables to store calculated values, like subtotals, even if they are based on data in the same database, because calculated values can become outdated if the values that they are based on change. For example, you would not store someone's age in a table, because every year you would have to update the value; instead, you store the person's date of birth and then use a query to calculate the person's age.

For example, suppose you have a database for some products you'd like to sell. This database has a table called Order Details that holds information about the products in fields such as the price of each product and the quantities ordered. You can calculate the subtotal by using a query that multiplies the quantity of each product by the unit price for that product, multiplies the quantity of each product by the unit price and discount for that product, and then subtracts the total discount from the total unit price. If you created the sample database in the previous example, open it and follow along:

1. On the Create tab, click Query Design.


2. On the Tables tab, double-click Order Details.
3. In the Order Details table, double-click Product ID to add this field to the first
column of the query design grid.
4. In the second column of the grid, right-click the Field row, and then click
Zoom on the shortcut menu.
5. In the Zoom box, type or paste the following: Subtotal: ([Quantity]*[Unit Price])-([Quantity]*[Unit Price]*[Discount])
6. Click OK.
7. On the Design tab, click Run. The query runs, and then displays a list of
products and subtotals, per order.
8. Press CTRL+S to save the query, and then name the query Product Subtotals.
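
The same calculated column can be written directly in SQL. A sketch with sqlite3, assuming the hypothetical store.db and the field names used in the steps above:

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
query = """
    SELECT [Order ID], [Product ID],
           ([Quantity] * [Unit Price])
           - ([Quantity] * [Unit Price] * [Discount]) AS Subtotal
    FROM [Order Details];
"""
for order_id, product_id, subtotal in conn.execute(query):
    print(order_id, product_id, round(subtotal, 2))
conn.close()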

Display summarized or aggregate data

When you use tables to record transactions or store regularly occurring numeric data, it
is useful to be able to review that data in aggregate, such as sums or averages. In
Access, you can add a Totals row to a datasheet. The Total row is a row at the bottom of the datasheet that can display a running total or other aggregate value.

1. Run the Product Subtotals query you created earlier, and leave the results open
in Datasheet view.
2. On the Home tab, click Totals. A new row appears at the bottom of the
datasheet, with the word Total in the first column.
3. Click the cell in the last row of the datasheet named Total.
4. Click the arrow to view the available aggregate functions. Because the column
contains text data, there are only two choices: None and Count.
5. Select Count. The content of the cell changes from Total to a count of the
column values.
6. Click the adjoining cell (the second column). Note that an arrow appears in the
cell.
7. Click the arrow, and then click Sum. The field displays a sum of the column
values.
8. Leave the query open in Datasheet view.

Create a crosstab query

Now suppose that you want to review product subtotals, but you also want to aggregate
by month, so that each row shows subtotals for a product, and each column shows
product subtotals for a month. To show subtotals for a product and to show product
subtotals for a month, use a crosstab query.

You can modify the Product Subtotals query again so that the query returns rows of
product subtotals and columns of monthly subtotals.

1. On the Home tab, in the Views group, click View, and then click Design
View.
2. In the Query Setup group, click Add Tables (or Show Table in Access
2013).
3. Double-click Orders, and then click Close.
4. On the Design tab, in the Query Type group, click Crosstab. In the design grid, the Show row is hidden, and the Crosstab row is displayed.
5. In the third column of the design grid, right-click the Field row, and then click
Zoom on the shortcut menu. The Zoom box opens.
6. In the Zoom box, type or paste the following: Month: "Month " &
DatePart("m", [Order Date])
7. Click OK.
8. In the Crosstab row, select the following values from the drop-down list:
Row Heading for the first column, Value for the second column, and Column
Heading for the third column.
9. On the Design tab, in the Results group, click Run. The query runs, and
then displays product subtotals, aggregated by month.
10. Press CTRL+S to save the query.
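
Outside Access, the same rows-by-columns summary is usually produced with a pivot table. The sketch below uses pandas together with sqlite3, again assuming the hypothetical store.db, field names, and date format from the earlier examples:

import sqlite3
import pandas as pd

conn = sqlite3.connect("store.db")   # hypothetical database file
df = pd.read_sql_query(
    """
    SELECT od.[Product ID] AS product,
           strftime('%m', o.[Order Date]) AS month,
           od.[Quantity] * od.[Unit Price]
           - od.[Quantity] * od.[Unit Price] * od.[Discount] AS subtotal
    FROM [Order Details] AS od
    JOIN Orders AS o ON o.[Order ID] = od.[Order ID]
    """,
    conn,
)
conn.close()

# Rows = products, columns = months, cell values = summed subtotals (the crosstab)
crosstab = df.pivot_table(index="product", columns="month",
                          values="subtotal", aggfunc="sum")
print(crosstab)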

Create a make table query

You can use a make-table query to create a new table from data that is stored in other
tables.

For example, suppose that you want to send data for Chicago orders to a Chicago
business partner who uses Access to prepare reports. Instead of sending all your order
data, you want to restrict the data that you send to data specific to Chicago orders.

You can build a select query that contains Chicago order data, and then use the select
query to create the new table by using the following procedure:

1. Open the example database from the previous example.

To run a make-table query, you may need to enable the database content.

2. On the Create tab, in the Query group, click Query Design.


3. Double-click Order Details and Orders.
4. In the Orders table, double-click Customer ID and Ship City to add these
fields to the design grid.
5. In the Order Details table, double-click Order ID, Product ID, Quantity,
Unit Price, and Discount to add these fields to the design grid.
6. In the Ship City column of the design grid, clear the box in the Show
row. In the Criteria row, type 'Chicago' (include the single quotation marks).
Verify the query results before you use them to create the table.
7. On the Design tab, in the Results group, click Run.
8. Press Ctrl + S to save the query.
9. In the Query Name box, type Chicago Orders Query, and then click OK.
10. On the Home tab, in the Views group, click View, and then click Design View.
11. On the Design tab, in the Query Type group, click Make Table.
12. In the Make Table dialog box, in the Table Name box, type Chicago Orders,
and then click OK.
13. On the Design tab, in the Results group, click Run.
14. In the confirmation dialog box, click Yes, and see the new table displayed in
the Navigation Pane.
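
In SQL View, Access expresses a make-table query as a SELECT ... INTO statement; most other database engines use CREATE TABLE ... AS SELECT instead. A rough sqlite3 sketch of the same idea, using the hypothetical store.db and table names from above:

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
conn.execute("""
    CREATE TABLE [Chicago Orders] AS
    SELECT o.[Customer ID], od.[Order ID], od.[Product ID],
           od.[Quantity], od.[Unit Price], od.[Discount]
    FROM Orders AS o
    JOIN [Order Details] AS od ON od.[Order ID] = o.[Order ID]
    WHERE o.[Ship City] = 'Chicago';
""")
conn.commit()
conn.close()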

Create an append query

You can use an append query to retrieve data from one or more tables and add that data
to another table.

For example, suppose that you created a table to share with a Chicago business
associate, but you realize that the associate also works with clients in the Milwaukee
area. You want to add rows that contain Milwaukee area data to the table before you
share the table with your associate. You can add Milwaukee area data to the Chicago
Orders table by using the following procedure:

1. Open the query named "Chicago Orders Query" you created earlier in Design
view.
2. On the Design tab, in the Query Type group, click Append. The Append dialog
box opens.
3. In the Append dialog box, click the arrow in the Table Name box,
select Chicago Orders from the drop-down list, and then click OK.
4. In the design grid, in the Criteria row of the Ship City column, delete
'Chicago', and then type 'Milwaukee'.
5. In the Append To row, select the appropriate field for each column.

In this example, the Append To row values should match the Field row values,
but that is not required for append queries to work.

6. On the Design tab, in the Results group, click Run.
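
The SQL counterpart of an append query is INSERT INTO ... SELECT. A rough sqlite3 sketch that adds the Milwaukee rows to the hypothetical Chicago Orders table created above:

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
conn.execute("""
    INSERT INTO [Chicago Orders]
        ([Customer ID], [Order ID], [Product ID],
         [Quantity], [Unit Price], [Discount])
    SELECT o.[Customer ID], od.[Order ID], od.[Product ID],
           od.[Quantity], od.[Unit Price], od.[Discount]
    FROM Orders AS o
    JOIN [Order Details] AS od ON od.[Order ID] = o.[Order ID]
    WHERE o.[Ship City] = 'Milwaukee';
""")
conn.commit()
conn.close()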

Create an update query

You can use an update query to change the data in your tables, and you can use an
update query to enter criteria to specify which rows should be updated. An update
query provides you an opportunity to review the updated data before you perform the
update.
Important: An action query cannot be undone. You should consider making a backup of
any tables that you will update by using an update query. An update query is not available in
Access web apps.

In the previous example, you appended rows to the Chicago Orders table. In the Chicago
Orders table, the Product ID field shows the numeric Product ID. To make the data more
useful in reports, you can replace the product IDs with product names, use the following
procedure:

1. Open the Chicago Orders table in Design view.


2. In the Product ID row, change the Data Type from Number to Text.
3. Save and close the Chicago Orders table.
4. On the Create tab, in the Query group, click Query Design.
5. Double-click Chicago Orders and Products.
6. On the Design tab, in the Query Type group, click Update.
7. In the design grid, the Sort and Show rows disappear, and the Update To row
appears.
8. In the Chicago Orders table, double-click Product ID to add this field to the
design grid.
9. In the design grid, in the Update To row of the Product ID column, type
or paste the following: [Products].[Product Name]

Tip: You can use an update query to delete field values by using an empty string ("") or
NULL in the Update To row.

10. In the Criteria row, type or paste the following: [Product ID] Like
([Products].[ID])
11. You can review which values will be changed by an update query by
viewing the query in Datasheet view.
12. On the Design tab, click View > Datasheet View. The query returns a list of
Product IDs that will be updated.
13. On the Design tab, click Run.

When you open the Chicago Orders table, you will see that the numeric values in the
Product ID field have been replaced by the product names from the Products table.
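
The equivalent SQL is an UPDATE statement, here sketched with a correlated subquery that looks up each product name. This is a rough sqlite3 equivalent of the procedure above, against the hypothetical tables already used; back up the table first, since the change cannot be undone:

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file
conn.execute("""
    UPDATE [Chicago Orders]
    SET [Product ID] = (SELECT [Product Name]
                        FROM Products
                        WHERE Products.ID = [Chicago Orders].[Product ID]);
""")
conn.commit()
conn.close()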

Create a delete query

You can use a delete query to delete data from your tables, and you can use a delete
query to enter criteria to specify which rows should be deleted. A delete query provides
you an opportunity to review the rows that will be deleted before you perform the
deletion.

For example, say that while you were preparing to send the Chicago Orders table from
the previous example, to your Chicago business associate, you notice that some of the
rows contain a number of empty fields. You decided to remove these rows before you
send the table. You could just open the table and delete the rows manually, but if you
have many rows to delete and you have clear criteria for which rows should be deleted,
you might find it helpful to use a delete query.

You can use a query to delete rows in the Chicago Orders table that do not have a value
for Order ID by using the following procedure:
1. On the Create tab, click Query Design.
2. Double-click Chicago Orders.
3. On the Design tab, in the Query Type group, click Delete. In the design grid, the Sort and Show rows disappear, and the Delete row appears.
4. In the Chicago Orders table, double-click Order ID to add it to the grid.
5. In the design grid, in the Criteria row of the Order ID column, type Is Null.
6. On the Design tab, in the Results group, click Run.
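
The SQL counterpart is a DELETE statement with a WHERE clause. A rough sqlite3 sketch against the hypothetical Chicago Orders table, which previews the matching rows before removing them (mirroring the advice above to review a delete query first):

import sqlite3

conn = sqlite3.connect("store.db")   # hypothetical database file

# Preview which rows match the criterion before deleting them
rows = conn.execute(
    "SELECT * FROM [Chicago Orders] WHERE [Order ID] IS NULL;").fetchall()
print(len(rows), "rows will be deleted")

conn.execute("DELETE FROM [Chicago Orders] WHERE [Order ID] IS NULL;")
conn.commit()
conn.close()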
