
ASTMA LAB MANUAL

LIST OF EXPERIMENTS

1. Study the Text mining concept in social media.
2. Describe various social media platforms and social media analytics tools.
3. Perform Exploratory Data Analysis on social media data.
4. Explore social media analytics techniques.
5. Perform pre-processing of collected data and store it.
6. Analyse and visualise the collected social media data.
7. Perform text analytics.
8. Perform action analytics (analyse likes, mentions of a product).
9. Perform an online survey of any social media platform.
10. Perform sentiment analysis of any social media data.
EXPERIMENT-1

Study the Text mining concept in social media.

Text mining, also known as text data mining, is the process of transforming unstructured text
into a structured format to identify meaningful patterns and new insights. By applying
advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and
other deep learning algorithms, companies are able to explore and discover hidden
relationships within their unstructured data.

Text is one of the most common data types within databases. Depending on the database,
this data can be organized as:

 Structured data: This data is standardized into a tabular format with numerous rows
and columns, making it easier to store and process for analysis and machine learning
algorithms. Structured data can include inputs such as names, addresses, and phone
numbers.

 Unstructured data: This data does not have a predefined data format. It can include
text from sources such as social media or product reviews, or rich media formats such as
video and audio files.

 Semi-structured data: As the name suggests, this data is a blend between structured
and unstructured data formats. While it has some organization, it doesn’t have enough
structure to meet the requirements of a relational database. Examples of semi-
structured data include XML, JSON and HTML files.
Text mining techniques

The process of text mining comprises several activities that enable you to deduce information
from unstructured text data. Before you can apply different text mining techniques, you must
start with text preprocessing, which is the practice of cleaning and transforming text data into
a usable format. This practice is a core aspect of natural language processing (NLP) and it
usually involves the use of techniques such as language identification, tokenization, part-of-
speech tagging, chunking, and syntax parsing to format data appropriately for analysis. When
text preprocessing is complete, you can apply text mining algorithms to derive insights from
the data. Some of these common text mining techniques include:

Information retrieval

Information retrieval (IR) returns relevant information or documents based on a pre-defined
set of queries or phrases. IR systems utilize algorithms to track user behaviors and identify
relevant data. Information retrieval is commonly used in library catalogue systems and
popular search engines, like Google. Some common IR sub-tasks include:
 Tokenization: This is the process of breaking long-form text into sentences and
words called “tokens”. These tokens are then used in models, such as bag-of-words, for text
clustering and document matching tasks.

 Stemming: This refers to the process of separating the prefixes and suffixes from
words to derive the root word form and meaning. This technique improves
information retrieval by reducing the size of indexing files.
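
As a minimal illustration of these two sub-tasks, the following sketch uses the NLTK library (an assumption: nltk is installed and the 'punkt' tokenizer data has been downloaded; the example sentence is invented for illustration).

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt")  # one-time download of the tokenizer models

text = "Customers are loving the new update, but several reviews mention crashes."
tokens = word_tokenize(text)                        # split the text into word tokens
stems = [PorterStemmer().stem(t) for t in tokens]   # reduce each token to its root form
print(tokens)
print(stems)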
Natural language processing (NLP)

Natural language processing, which evolved from computational linguistics, uses methods
from various disciplines, such as computer science, artificial intelligence, linguistics, and data
science, to enable computers to understand human language in both written and verbal forms.
By analyzing sentence structure and grammar, NLP sub-tasks allow computers to “read”.
Common sub-tasks include:

 Summarization: This technique provides a synopsis of long pieces of text to create a
concise, coherent summary of a document’s main points.

 Part-of-Speech (PoS) tagging: This technique assigns a tag to every token in a
document based on its part of speech, i.e. denoting nouns, verbs, adjectives, etc. This
step enables semantic analysis on unstructured text.

 Text categorization: This task, which is also known as text classification, is
responsible for analyzing text documents and classifying them based on predefined
topics or categories. This sub-task is particularly helpful when categorizing synonyms
and abbreviations.

 Sentiment analysis: This task detects positive or negative sentiment from internal or
external data sources, allowing you to track changes in customer attitudes over time.
It is commonly used to provide information about perceptions of brands, products,
and services. These insights can propel businesses to connect with customers and
improve processes and user experiences.
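
A minimal sketch of two of these sub-tasks, part-of-speech tagging and sentiment analysis, using NLTK (assumptions: nltk is installed and the listed resources have been downloaded; the example tweet is invented).

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("vader_lexicon")

tweet = "The new phone is amazing, but the battery life is disappointing."

# Part-of-speech tagging: label each token as a noun, verb, adjective, etc.
print(nltk.pos_tag(nltk.word_tokenize(tweet)))

# Sentiment analysis: VADER returns negative/neutral/positive/compound scores
print(SentimentIntensityAnalyzer().polarity_scores(tweet))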
Information extraction

Information extraction (IE) surfaces the relevant pieces of data when searching various
documents. It also focuses on extracting structured information from free text and storing
these entities, attributes, and relationship information in a database. Common information
extraction sub-tasks include:

 Feature selection, or attribute selection, is the process of selecting the important
features (dimensions) that contribute the most to the output of a predictive analytics model.

 Feature extraction is the process of transforming the original features into a smaller
set of derived features that improve the accuracy of a classification task. This is
particularly important for dimensionality reduction.

 Named-entity recognition (NER), also known as entity identification or entity
extraction, aims to find and categorize specific entities in text, such as names or
locations. For example, NER identifies “California” as a location and “Mary” as a
person’s name.
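
A minimal NER sketch using spaCy (an assumption: spaCy and its small English model are installed via pip install spacy and python -m spacy download en_core_web_sm; the sentence is invented).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mary moved from California to New York to work at Google in 2021.")

for ent in doc.ents:
    # ent.label_ gives the entity type, e.g. PERSON, GPE (location), ORG, DATE
    print(ent.text, ent.label_)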
Data mining
Data mining is the process of identifying patterns and extracting useful insights from big data
sets. This practice evaluates both structured and unstructured data to identify new
information, and it is commonly utilized to analyze consumer behaviors within marketing and
sales. Text mining is essentially a sub-field of data mining as it focuses on bringing structure
to unstructured data and analyzing it to generate novel insights. The techniques mentioned
above are forms of data mining but fall under the scope of textual data analysis.

EXPERIMENT-2

Describe various social media platforms and social media analytics tools.

Social media analytics is the ability to gather data from social channels, find meaning in it
to support business decisions, and measure the performance of actions taken based on those
decisions through social media.

Social media analytics is broader than metrics such as likes, follows, retweets, previews,
clicks, and impressions gathered from individual channels. It also differs from reporting
offered by services that support marketing campaigns such as LinkedIn or Google Analytics.

Social media analytics uses specifically designed software platforms that work similarly to
web search tools. Data about keywords or topics is retrieved through search queries or web
‘crawlers’ that span channels. Fragments of text are returned, loaded into a database,
categorized and analyzed to derive meaningful insights.

Social media analytics includes the concept of social listening. Listening is monitoring social
channels for problems and opportunities. Social media analytics tools typically incorporate
listening into more comprehensive reporting that involves listening and performance analysis.

Why you need social media analytics tools


Social media analytics tools help you create performance reports to share with your team,
stakeholders, and boss — to figure out what’s working and what’s not. They should also
provide the historical data you need to assess your social media marketing strategy on both
macro and micro levels.

Social media analytics tools are essential for running a successful social media campaign.
They allow social media experts to track and determine the performance of various portions of
the social marketing campaign, such as sales, customer service, and sentiment analysis.

In terms of sales, these tools show how well a social media marketing campaign is going by
showing all positive turnovers or purchases that come directly from social media sources,
such as Facebook and Twitter.
These sites are useful for disseminating purchase or signup links and correlate directly to
traffic, which can be picked up by a specific social media analytics tool.

For brand recognition and sentiment analysis, some tools are able to mine the data from
social networking sites in order to find or discover the sentiment of people towards a brand or
business through methods such as natural language processing and pattern recognition.

Common methods of analysis used:

 Data mining
 Transformation
 Natural language processing
 Data pre-processing
 Data visualization
 Hidden pattern evaluation
 Traffic analysis
Examples of social media analytics tools or platforms:

 Google Analytics
 Twitter Analytics
 Facebook Insight
 Hootsuite

EXPERIMENT-4

Explore social media analytics techniques.

Exploratory Data Analysis (EDA) is usually the first step when you
have data in hand and want to analyze it. In EDA, there is no
hypothesis and no model. You are finding patterns and truth from
the data.

EDA is crucial for data science projects because it can:
1. Help you gain intuition about the data;
2. Make comparisons between distributions;
3. Check if the data is on the scale you expect;
4. Find out where data is missing or if there are outliers;
5. Summarize the data by calculating the mean, min, max, and variance.

The basic tools of EDA are plots, graphs, and summary statistics.

Creating hypotheses and testing business assumptions while dealing with any machine
learning problem statement is very important, and this is what EDA helps to accomplish.
There are various tools and techniques to understand your data; the basic requirement is a
working knowledge of NumPy for mathematical operations and Pandas for data manipulation.

Exploratory data analysis is a way to better understand your data, which helps in further
data preprocessing. Data visualization is key here: it streamlines the EDA process and
makes it easy to analyze the data using plots and charts.

Data Visualization
Data visualization represents text or numerical data in a visual format, which makes it
easy to grasp the information the data expresses. Humans remember pictures more easily
than text, and Python provides various libraries for data visualization such as Matplotlib,
Seaborn, and Plotly. In this tutorial, we will use Matplotlib and Seaborn to explore the
data using various plots.

Exploratory Data Analysis



We will use the popular Titanic dataset, with which most people are familiar.

Now let us start exploring the data and study different data visualization plots with
different types of data. For demonstrating some of the techniques we will also use
Seaborn's built-in tips dataset, which records the tips each waiter receives from
different customers.

Let's get started by importing the libraries and loading the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Titanic dataset (local CSV)
data = pd.read_csv("titanic_train.csv")
# tips dataset (bundled with seaborn)
tips = sns.load_dataset("tips")
Categorical Data
A variable that carries text-based information is referred to as a categorical variable.
Let's look at various plots which we can use for visualizing categorical data.

1) CountPlot
A countplot is basically a frequency count plot in the form of a bar graph: it plots the
count of each category as a separate bar. It is the visual equivalent of calling the pandas
value_counts function on a column. In our data the target variable is Survived, and it is
categorical, so let us plot a countplot of it.
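
A minimal sketch of the countplot, assuming the Titanic CSV loaded earlier contains a Survived column:

# Count of each category of the Survived column
sns.countplot(x='Survived', data=data)
plt.show()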

2) Pie Chart
A pie chart conveys the same information as the countplot but additionally shows the
percentage share of each category in the data, i.e. how much weight each category carries.
Let us check the Sex column to see what percentage of the passengers travelling are male
and female.
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()

Numerical Data
Analyzing numerical data is important because understanding the distribution of
variables helps in further processing the data. Most of the time you will find
inconsistencies in numerical data, so do explore the numerical variables.

1) Histogram
A histogram is a value-distribution plot for numerical columns. It creates bins over
ranges of values and plots the counts, so we can visualize how the values are distributed
and where most of them lie, for example towards the low end, the high end, or around the
center (mean). Let's have a look at the Age column.
plt.hist(data['Age'], bins=5)
plt.show()

2) Distplot
A distplot can be thought of as an enhanced histogram: it overlays a KDE (Kernel Density
Estimation) curve on the histogram, which approximates the PDF (Probability Density
Function), i.e. the probability of each value occurring in the column. If you have studied
statistics before, you will recognize the PDF. (Note that recent versions of Seaborn
deprecate distplot in favour of histplot and displot.)
sns.distplot(data['Age'])
plt.show()

3) Boxplot
A boxplot is a very useful plot that displays the 5-number summary of a column. To read
it, a few terms need to be defined:

 Median – the middle value of the series after sorting.

 Percentile – the value below which a given percentage of the observations fall; for
example, 25% of the values lie below the 25th percentile.

 Minimum and Maximum – these are not the raw minimum and maximum values; rather they
describe the lower and upper whisker boundaries, which are calculated using the
interquartile range (IQR).

IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_boundary = Q3 + 1.5 * IQR

Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th
percentile).
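
A minimal boxplot sketch for the Age column of the Titanic data loaded earlier:

# Box spans Q1 to Q3, the line marks the median, whiskers extend up to 1.5 * IQR,
# and points beyond the whiskers are plotted as outliers
sns.boxplot(x=data['Age'])
plt.show()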

Numerical and Numerical

First, let's explore the plots used when both variables are numerical.
1) Scatter Plot
A scatter plot is the simplest way to show the relationship between two numerical
variables. Let us see the relationship between the total bill and the tip provided using a
scatter plot.
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()

Multivariate analysis with scatter plot

We can also plot 3- or 4-variable relationships with a scatter plot. Suppose we want to
see how the total bill and tip relate for male and female customers separately.
sns.scatterplot(x="total_bill", y="tip", hue="sex", data=tips)
plt.show()
We can also perform a 4-variable multivariate analysis with scatter plots using the style
argument. Suppose that, along with gender, we also want to know whether the customer was a
smoker.
sns.scatterplot(x="total_bill", y="tip", hue="sex", style="smoker", data=tips)
plt.show()

Numerical and Categorical

If one variable is numerical and one is categorical, there are various plots that we can
use for bivariate and multivariate analysis.

1) Bar Plot
A bar plot is a simple plot in which we place a categorical variable on the x-axis and a
numerical variable on the y-axis and explore the relationship between them. The black tick
on top of each bar shows the confidence interval. Let us explore Pclass against Age.
sns.barplot(x='Pclass', y='Age', data=data)
plt.show()
Multivariate analysis using Bar plot
The hue argument is very useful for analyzing more than two variables. Along with the
relationship above, we now also want to split it by gender.
sns.barplot(x='Pclass', y='Fare', hue='Sex', data=data)
plt.show()
EXPERIMENT-5

Perform pre-processing of collected data and store it.

Data preprocessing is an important step in the data mining process. It refers to cleaning,
transforming, and integrating data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task.

Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.

Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can
be used for data integration.

Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.

Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms
that require categorical data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
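
As a small, self-contained sketch (the column name and values are invented for illustration), min-max and z-score normalization can be computed directly with pandas:

import pandas as pd

df = pd.DataFrame({"likes": [12, 340, 58, 990, 7]})

# Min-max normalization: rescale values to the [0, 1] range
df["likes_minmax"] = (df["likes"] - df["likes"].min()) / (df["likes"].max() - df["likes"].min())

# Z-score normalization: zero mean and unit variance
df["likes_zscore"] = (df["likes"] - df["likes"].mean()) / df["likes"].std()

print(df)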
Data preprocessing plays a crucial role in ensuring the quality of data and
the accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into
a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:
This situation arises when some values are missing from the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by the attribute mean, or by the most probable value (a small sketch appears
after this list).

 (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in the
following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and each segment is handled separately. One can replace all data
in a segment by its mean, or boundary values can be used (see the sketch after this
list).

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will
fall outside the clusters.
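
A minimal sketch of two of the cleaning steps above, mean imputation of missing values and equal-width binning, using pandas (the column name and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({"age": [22, None, 35, 41, None, 58, 63]})

# Fill missing values with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Equal-width binning: split the value range into 3 equal-size intervals
df["age_bin"] = pd.cut(df["age"], bins=3)

# Smooth by replacing each value with the mean of its bin
df["age_smooth"] = df.groupby("age_bin", observed=True)["age"].transform("mean")

print(df)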
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable
for mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0
or 0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be converted to “country”.

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid overfitting
of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from
the dataset. Feature selection is often performed to remove irrelevant or
redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component
analysis (PCA).

Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
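
A minimal feature extraction sketch with PCA from scikit-learn (an assumption: scikit-learn is installed; the data is synthetic for illustration):

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 10 numerical features
X = np.random.default_rng(0).normal(size=(100, 10))

# Project the data onto its 2 leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance explained by each component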
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving the
important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.

Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar
data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.

Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
