Astma Lab Manual
EXPERIMENT-1
Text mining, also known as text data mining, is the process of transforming unstructured text
into a structured format to identify meaningful patterns and new insights. By applying
advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and
deep learning algorithms, companies are able to explore and discover hidden
relationships within their unstructured data.
Text is one of the most common data types within databases. Depending on the database,
this data can be organized as:
Structured data: This data is standardized into a tabular format with numerous rows
and columns, making it easier to store and process for analysis and machine learning
algorithms. Structured data can include inputs such as names, addresses, and phone
numbers.
Unstructured data: This data does not have a predefined data format. It can include
text from sources like social media or product reviews, or rich media formats like
video and audio files.
Semi-structured data: As the name suggests, this data is a blend between structured
and unstructured data formats. While it has some organization, it doesn’t have enough
structure to meet the requirements of a relational database. Examples of semi-
structured data include XML, JSON and HTML files.
Text mining techniques
The process of text mining comprises several activities that enable you to deduce information
from unstructured text data. Before you can apply different text mining techniques, you must
start with text preprocessing, which is the practice of cleaning and transforming text data into
a usable format. This practice is a core aspect of natural language processing (NLP) and it
usually involves the use of techniques such as language identification, tokenization, part-of-
speech tagging, chunking, and syntax parsing to format data appropriately for analysis. When
text preprocessing is complete, you can apply text mining algorithms to derive insights from
the data. Some common text mining techniques include:
Information retrieval
Stemming: This refers to the process of separating the prefixes and suffixes from
words to derive the root word form and meaning. This technique improves
information retrieval by reducing the size of indexing files.
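As a minimal sketch, the snippet below stems a few words with NLTK's PorterStemmer (one
common stemming implementation; the word list is invented for illustration):
# Stemming sketch with NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connection", "connected", "connecting", "retrieval", "retrieved"]

# Reducing each word to its root form shrinks the vocabulary an index must store.
for word in words:
    print(word, "->", stemmer.stem(word))
# connection -> connect, connected -> connect, connecting -> connect, ...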
Natural language processing (NLP)
Natural language processing, which evolved from computational linguistics, uses methods
from various disciplines, such as computer science, artificial intelligence, linguistics, and data
science, to enable computers to understand human language in both written and verbal forms.
By analyzing sentence structure and grammar, NLP sub-tasks allow computers to “read”.
Common sub-tasks include:
Sentiment analysis: This task detects positive or negative sentiment from internal or
external data sources, allowing you to track changes in customer attitudes over time.
It is commonly used to provide information about perceptions of brands, products,
and services. These insights can propel businesses to connect with customers and
improve processes and user experiences.
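As a minimal sketch, the snippet below scores sentiment with NLTK's VADER analyzer (one
of many possible tools; the example reviews are invented):
# Sentiment-analysis sketch with NLTK's VADER lexicon.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

reviews = ["The product is excellent and support was great.",
           "Terrible experience, the app keeps crashing."]
for review in reviews:
    # compound ranges from -1 (most negative) to +1 (most positive)
    print(sia.polarity_scores(review)["compound"], review)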
Information extraction
Information extraction (IE) surfaces the relevant pieces of data when searching various
documents. It also focuses on extracting structured information from free text and storing
these entities, attributes, and relationship information in a database. Common information
extraction sub-tasks include named-entity recognition (finding mentions of people,
organizations, places, and other entities) and relation extraction.
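As a minimal illustration of named-entity recognition, the sketch below uses spaCy (one
common choice; the manual does not prescribe a library, and the example sentence is
invented). It assumes the small English model has been installed with
python -m spacy download en_core_web_sm.
# Named-entity recognition sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, installed separately
doc = nlp("Apple acquired the startup in London for $2 billion.")

# Each entity carries its text span and a type label (ORG, GPE, MONEY, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)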
EXPERIMENT-2
Explain various social media platforms and social media analytics tools.
Social media analytics is the ability to gather and find meaning in data collected from social
channels to support business decisions, and to measure the performance of actions based on
those decisions through social media.
Social media analytics is broader than metrics such as likes, follows, retweets, previews,
clicks, and impressions gathered from individual channels. It also differs from reporting
offered by services that support marketing campaigns such as LinkedIn or Google Analytics.
Social media analytics uses specifically designed software platforms that work similarly to
web search tools. Data about keywords or topics is retrieved through search queries or web
‘crawlers’ that span channels. Fragments of text are returned, loaded into a database,
categorized and analyzed to derive meaningful insights.
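As a toy illustration of that categorize-and-analyze step (the text fragments and keyword
lists below are invented; a real pipeline would pull fragments from crawlers or search
APIs and a database):
# Toy sketch: categorize retrieved text fragments by keyword and count hits.
from collections import Counter

fragments = [
    "Love the new phone, battery life is great",
    "Support never answered my ticket",
    "Great price, fast shipping",
]
categories = {"product": ["phone", "battery"],
              "service": ["support", "ticket", "shipping"]}

counts = Counter()
for text in fragments:
    for category, keywords in categories.items():
        if any(word in text.lower() for word in keywords):
            counts[category] += 1
print(counts)  # Counter({'service': 2, 'product': 1})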
Social media analytics includes the concept of social listening. Listening is monitoring social
channels for problems and opportunities. Social media analytics tools typically incorporate
listening into more comprehensive reporting that involves listening and performance analysis.
Social media analytics tools are essential for running a successful social media campaign. They
allow social media experts to track and measure the performance of various portions of the
social marketing campaign, such as sales, customer service, and sentiment.
In terms of sales, these tools show how well a social media marketing campaign is performing
by showing the revenue or purchases that come directly from social media sources, such as
Facebook and Twitter. These sites are useful for disseminating purchase or signup links, and
the resulting traffic can be picked up by a social media analytics tool.
For brand recognition and sentiment analysis, some tools are able to mine the data from
social networking sites in order to find or discover the sentiment of people towards a brand or
business through methods such as natural language processing and pattern recognition.
Typical steps in the social media analytics process include:
Data mining
Transformation
Natural language processing
Data pre-processing
Data visualization
Hidden pattern evaluation
Traffic analysis
Examples of social media analytics tools or platforms:
Google Analytics
Twitter Analytics
Facebook Insights
Hootsuite
Experiment 4
Exploratory Data Analysis (EDA) is usually the first step when you
have data in hand and want to analyze it. In EDA, there is no
hypothesis and no model. You are finding patterns and truth from
the data.
EDA is crucial for data science projects because it can:
1. Help you gain intuition about the data;
2. Make comparisons between distributions;
3. Check if the data is on the scale you expect;
4. Find out where data is missing or if there are outliers;
5. Summarize data: calculate the mean, min, max, and variance.
The basic tools of EDA are plots, graphs, and summary statistics. Exploratory data analysis is
a way to better understand your data, which helps with further data preprocessing. Data
visualization is key here: it streamlines the exploratory data analysis process and lets you
analyze the data easily through plots and charts.
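For example, a few pandas one-liners cover most of that summary step (a minimal sketch,
assuming the Titanic CSV that is loaded later in this experiment):
# Quick EDA summary with pandas.
import pandas as pd

data = pd.read_csv("titanic_train.csv")
print(data.shape)           # rows and columns: is the data the size you expect?
print(data.describe())      # mean, min, max, std for numerical columns
print(data.isnull().sum())  # missing values per column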
Data Visualization
Data visualization represents text or numerical data in a visual format, which makes it easy
to grasp the information the data expresses. We humans remember pictures more easily than
readable text, so Python provides various libraries for data visualization, such as matplotlib,
seaborn, and plotly. In this tutorial, we will use Matplotlib and seaborn to perform various
techniques for exploring data with plots.
We will use the very popular Titanic dataset, with which everyone is familiar, and you can
download it from here.
Now let us start exploring the data and study different data visualization plots for different
types of data. To demonstrate some of the techniques we will also use seaborn's built-in tips
dataset, which records the tips a waiter receives from different customers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset
#titanic dataset
data = pd.read_csv("titanic_train.csv")
#tips dataset
tips = load_dataset("tips")
Categorical Data
A variable that holds text-based information is referred to as a categorical variable.
Let's look at various plots we can use for visualizing categorical data.
1) CountPlot
Countplot is basically a frequency plot in the form of a bar graph: it plots the count of each
category as a separate bar. It is the visual equivalent of calling the pandas value_counts
function on a column. In our data the target variable, Survived, is categorical, so let us plot
a countplot of it, as in the sketch below.
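A minimal sketch, continuing with the data frame loaded above (the target column is named
Survived in the standard Kaggle file):
# Count of passengers per category of the target variable.
sns.countplot(x='Survived', data=data)
plt.show()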
2) Pie Chart
The pie chart conveys the same information as the countplot but additionally shows the
percentage share of each category, i.e., how much weight each category carries in the data.
Let us check the Sex column to see what percentage of male and female members were
traveling.
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
Numerical Data
Analyzing numerical data is important because understanding the distribution of a variable
helps with further processing of the data. Numerical data often contains inconsistencies, so
do explore the numerical variables.
1) Histogram
A histogram is a value-distribution plot for numerical columns. It creates bins over ranges of
values and plots the counts, so we can visualize how values are distributed and where most
values lie: toward the low end, the high end, or at the center (mean). Let's have a look at
the Age column.
plt.hist(data['Age'], bins=5)
plt.show()
2) Distplot
Distplot is also known as the second histogram because it is a slight improvement on the
histogram. Distplot overlays a KDE (Kernel Density Estimation) curve on the histogram,
which estimates the PDF (Probability Density Function), i.e., the probability of each value
occurring in this column. If you have studied statistics before, you will certainly know the
PDF. Note that distplot is deprecated in recent seaborn releases; sns.histplot(data['Age'],
kde=True) is the modern equivalent.
sns.distplot(data['Age'])  # deprecated in newer seaborn versions; see note above
plt.show()
3) Boxplot
Boxplot is a very interesting plot that visualizes the 5-number summary: minimum, first
quartile (Q1, the 25th percentile), median, third quartile (Q3, the 75th percentile), and
maximum. To draw the whiskers, a few terms need to be defined:
IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_boundary = Q3 + 1.5 * IQR
Points beyond these boundaries are drawn as outliers, as in the sketch below.
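A minimal sketch, continuing with the Titanic data frame loaded above:
# Boxplot of Age; points beyond the whisker boundaries are drawn as outliers.
sns.boxplot(x=data['Age'])
plt.show()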
First, let's explore the plots when both variables are numerical.
1) Scatter Plot
A scatter plot is the simplest way to plot the relationship between two numerical variables.
Let us see the relationship between the total bill and the tip using a scatter plot.
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()
If one variable is numerical and one is categorical then there are various
plots that we can use for Bivariate and Multivariate analysis.
1) Bar Plot
A bar plot is a simple plot in which we place a categorical variable on the x-axis and a
numerical variable on the y-axis and explore the relationship between the two. The black
line on top of each bar shows the confidence interval. Let us explore Pclass against Age.
sns.barplot(x='Pclass', y='Age', data=data)
plt.show()
Multivariate analysis using Bar plot
The hue argument is very useful: it lets us analyze more than two variables at once. Along
with the above relationship, we now also want to split by gender.
sns.barplot(x='Pclass', y='Fare', hue='Sex', data=data)
plt.show()
Experiment 5
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
1. Data Cleaning:
During cleaning, noisy data can be smoothed. Two common smoothing techniques are
regression and clustering (a regression sketch follows this list).
Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
Clustering: This approach groups similar data into clusters. Outliers may go undetected,
or they will fall outside the clusters.
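A minimal sketch of regression-based smoothing with scikit-learn (the toy x and y arrays
are invented; in practice they come from the dataset being cleaned):
# Smooth noisy observations by replacing them with fitted regression values.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)                 # independent variable
y = 2 * x.ravel() + np.random.normal(0, 1, 10)   # noisy observations

model = LinearRegression().fit(x, y)
y_smooth = model.predict(x)   # fitted values replace the noisy ones
print(np.round(y_smooth, 2))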
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. It involves the following techniques:
1. Normalization:
This is done in order to scale the data values into a specified range, such as -1.0 to 1.0
or 0.0 to 1.0 (see the sketch after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual labels.
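A minimal sketch of normalization and discretization (column names follow the Titanic
example from the previous experiment; any numeric columns work the same way):
# Normalization and discretization with scikit-learn and pandas.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("titanic_train.csv")

# Normalization: scale Fare into the range 0.0 to 1.0.
data["Fare_scaled"] = MinMaxScaler().fit_transform(data[["Fare"]]).ravel()

# Discretization: replace raw Age values with interval labels.
data["Age_group"] = pd.cut(data["Age"], bins=[0, 18, 60, 100],
                           labels=["child", "adult", "senior"])
print(data[["Fare_scaled", "Age_group"]].head())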
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid overfitting
of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from
the dataset. Feature selection is often performed to remove irrelevant or
redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component
analysis (PCA).
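A minimal sketch of feature selection and PCA with scikit-learn (the toy X and y are
invented; in practice they come from the dataset being mined):
# Keep the 2 most informative features, and separately project to 2 PCA components.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 samples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # label depends on the first two features

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)
X_pca = PCA(n_components=2).fit_transform(X)
print(X_selected.shape, X_pca.shape)      # (100, 2) (100, 2)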
Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar
data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
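A minimal sketch of clustering-based reduction with scikit-learn's k-means (toy 2-D data;
in practice X comes from the dataset being reduced):
# Replace similar points with their cluster centroid to shrink the dataset.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
X_reduced = kmeans.cluster_centers_[kmeans.labels_]   # each row becomes its centroid
print(kmeans.cluster_centers_)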