SMA_Expt_4

The lab manual outlines an experiment focused on Exploratory Data Analysis (EDA) and visualization of social media data for business purposes. Students will learn to collect, monitor, and analyze social media data using various visualization techniques such as univariate plots, histograms, and heat maps. The document emphasizes the importance of EDA in understanding data patterns, detecting anomalies, and informing data cleaning processes.

Uploaded by

Laukik Pawar
Copyright
© All Rights Reserved

LAB MANUAL

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No-04

A.1 Aim:
Exploratory Data Analysis and visualization of Social Media Data for business.

Lab Objective: To understand the fundamental concepts of social media networks

Lab Outcome: Collect, monitor, store and track social media data

A.2 Prerequisite:
Data Mining, Data Analytics

A.3 Outcome:
Students will be able to perform exploratory data analysis and visualization on the chosen social
media data.

A.4 Theory:

What is Exploratory Data Analysis?


Exploratory data analysis is the essential data investigation carried out before the formal
analysis. Its purpose is to spot patterns and anomalies, discover trends, and check assumptions
using summary statistics and visualizations. It gives an idea of the data we will be digging into
during the analysis, and it helps decide how to handle that data, for example when choosing models,
handling outliers, or selecting evaluation metrics. Visualization makes it easier to infer insights
from massive datasets.

Need for visualizing data:

● Understand the trends and patterns of data
● Analyze the frequency and other such characteristics of data
● Know the distribution of the variables in the data
● Visualize the relationship that may exist between different variables

The number of variables of interest in the data classifies it as univariate, bivariate, or
multivariate. For example, if the data features only one variable of interest, it is univariate
data. Further, based on the characteristics of the data, it can be classified as
categorical/discrete or continuous data.
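
The classification above can be sketched in pandas. This is a minimal illustration, assuming a small made-up DataFrame (the values are hypothetical and `classify_columns` is an illustrative helper, not part of the lab dataset or of pandas). It treats any numeric column as continuous, which is a simplification, since integer counts are strictly discrete.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
sample = pd.DataFrame({
    "VIEWS": [1200.0, 5400.5, 300.25, 9800.0],
    "CHANNEL": ["A", "B", "A", "C"],
})


def classify_columns(frame):
    """Label each column 'continuous' (numeric dtype) or 'categorical' (anything else)."""
    return {
        name: "continuous" if pd.api.types.is_numeric_dtype(frame[name]) else "categorical"
        for name in frame.columns
    }


column_types = classify_columns(sample)
print(column_types)
```

A column labelled continuous is a candidate for histograms, density plots and box plots, while a categorical one suits bar graphs.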

Types of Exploratory Data Analysis

1. Univariate Plots
Univariate plots show the frequency or the distribution shape of a variable.

2. Swarm Plot

The swarm-plot, similar to a strip-plot, provides a visualization technique for univariate data to
view the spread of values in a continuous variable. The swarm-plot spreads out the data points of
the variable automatically to avoid overlap and hence provides a better visual overview of the
data.

3. Histograms
Histograms are two-dimensional plots in which the x-axis is divided into a range of numerical bins
or time intervals. The y-axis shows the frequency values, which are counts of occurrences of values
for each bin. Bar graphs have gaps between the bars to indicate that they compare distinct groups,
but there are no gaps in histograms. Histograms tell us whether the distribution is left/negatively
skewed (long tail on the left; most of the data falls to the right side), right/positively skewed
(long tail on the right; most of the data falls to the left side), bi-modal (two distinct peaks),
normal (symmetrical and bell-shaped without skew), or uniform (almost all the bins have similar
frequency).
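
The direction of skew can also be checked numerically rather than read off the plot. The sketch below uses the sample skewness from pandas; the `describe_skew` helper and its 0.5 tolerance are assumptions chosen for illustration, not a standard threshold.

```python
import pandas as pd


def describe_skew(values, tolerance=0.5):
    """Return a rough label for the skew direction of a numeric sample.

    The tolerance is an arbitrary cut-off for this sketch.
    """
    skew = pd.Series(values, dtype=float).skew()
    if skew > tolerance:
        return "right/positively skewed"
    if skew < -tolerance:
        return "left/negatively skewed"
    return "roughly symmetric"


right_tailed = [1, 1, 2, 2, 3, 3, 4, 20]   # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]   # mirror image, so the tail is on the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(name, "->", describe_skew(data))
```

Positive skewness means the long tail is on the right, so most of the data sits on the left side of the histogram.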

Density Plots:
A density plot is like a smoother version of a histogram. Generally, the kernel density estimate is
used in density plots to approximate the probability density function of the variable. A small
continuous curve (the kernel) is centred on each data point, and these curves are added together to
generate a smooth density estimate for the whole data.
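
To make the idea concrete, here is a minimal hand-written kernel density estimate using a Gaussian kernel in NumPy. It is a sketch, not how any plotting library implements it: the function name, the sample values and the fixed bandwidth of 0.5 are all assumptions (plotting libraries normally pick the bandwidth automatically).

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, evaluated on the grid.
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Sum the bumps and normalize so the estimate integrates to 1.
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


samples = [1.0, 2.0, 2.5, 4.0]            # hypothetical observations
grid = np.linspace(-5.0, 11.0, 1601)      # evaluation points, step 0.01
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 (simple Riemann sum check).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that can hide peaks; a smaller one follows individual points more closely.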

Bar Graphs
Bar charts can be used to compare nominal or ordinal data. They are helpful for recognizing trends.

Violin Plots:
The violin plot is very similar to a box plot, with the addition of a rotated kernel density
plot on each side. It shows the distribution of quantitative data across several levels of one (or
more) categorical variables so that those distributions can be compared.

Box Plots
These charts show the distribution of values along an axis. A rectangular box spans the first
quartile (Q1) to the third quartile (Q3), so it contains the middle half of the data, and a line
inside the box marks the median. Whiskers extend from the box towards the smallest and largest
values that are not treated as outliers. Quartiles are the cut points that divide the data set into
four equal parts. Boxes can be drawn vertically or horizontally. Box plots are suitable for
identifying outliers. The figure below shows the structure of a box plot.
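
The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers (the default whisker setting in matplotlib and seaborn); the helper name and the sample values are hypothetical.

```python
import numpy as np


def box_plot_summary(values):
    """Return quartiles, IQR fences and outliers using the 1.5 * IQR rule."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points outside the fences are the ones drawn individually on a box plot.
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers.tolist(),
    }


summary = box_plot_summary([10, 12, 13, 15, 16, 18, 20, 95])
print(summary)
```

Here the value 95 lies above the upper fence, so it would appear as a separate point beyond the whisker.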

Heat Maps
Heat maps represent data values as colours. For instance, a correlation heat map shows the
interrelationship between variables, with each cell shaded according to the correlation value.
Colour differences make it easy to spot similar and different values and to make sense of the
variation in the data. Heat maps are usually helpful when you have a large amount of data. They are
also used in web analytics, for example during A/B testing, to see which parts of a web page users
interact with.

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge faculty
at the end of the practical if Blackboard access is not available)

Roll. No.: A17 Name: Laukik Pawar


Class: BE_A Batch: A1
Date of Experiment: Date of Submission:
Grade:
B.1.Study the fundamentals of social media platform and implement data cleaning, pre-
processing, filtering and storing social media data for business:
(Paste your Search material completed during the 2 hours of practical in the lab here)

● Students need to use the previous social media dataset to perform exploratory data analysis and
visualization.

B.2 Input and Output:


(Command and its output)
# prompt: code to visualize the dataset, i.e. Univariate Plots, Swarm Plot,
# Histograms, Density Plots, Bar Graphs, Violin Plots, Box Plots, Heat Maps
# Assumes 'df' is the DataFrame prepared in the previous experiment.

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate Plots
# Histograms
plt.figure(figsize=(10, 6))
df['VIEWS'].hist(bins=20)
plt.title('Distribution of Views')
plt.xlabel('Views')
plt.ylabel('Frequency')
plt.show()

# Density Plots
plt.figure(figsize=(10, 6))
sns.kdeplot(df['VIEWS'])
plt.title('Density Plot of Views')
plt.xlabel('Views')
plt.ylabel('Density')
plt.show()

# Box Plots
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['VIEWS'])
plt.title('Box Plot of Views')
plt.show()

# Violin Plots
plt.figure(figsize=(10, 6))
sns.violinplot(y=df['VIEWS'])
plt.title('Violin Plot of Views')
plt.show()

# Swarm Plots (for smaller datasets, can be slow for large ones)
plt.figure(figsize=(10, 6))
sns.swarmplot(y=df['VIEWS']) # Consider sampling for large datasets
plt.title('Swarm Plot of Views')
plt.show()

# Bar Graphs (for categorical data - example using 'CHANNEL' if it is
# categorical)
if 'CHANNEL' in df.columns:  # Check that the column exists before plotting
    plt.figure(figsize=(12, 6))
    df['CHANNEL'].value_counts().plot(kind='bar')
    plt.title('Number of videos per channel')
    plt.xlabel('Channel Name')
    plt.ylabel('Number of Videos')
    plt.show()

# Additional plots you can explore


#Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['VIEWS'], df['DURATION'])
plt.title('Views vs Duration')
plt.xlabel('Views')
plt.ylabel('Duration')
plt.show()

# prompt: code to plot heat map for numeric data only

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame and it's already preprocessed

# Select numeric columns for the heatmap
numeric_cols = df.select_dtypes(include=['number']).columns

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numeric Features')
plt.show()

B.3 Observations and learning:


(Students are expected to comment on the output obtained with clear observations and
learning for each task/ sub part assigned)
We performed Exploratory Data Analysis and visualization of Social Media Data for business.

B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
We performed Exploratory Data Analysis and visualization of Social Media Data for business.

B.5 Question of Curiosity


(To be answered by student based on the practical performed and learning/observations)
Q1. What is EDA? Explain Its Importance.
Exploratory Data Analysis (EDA) is an analytical process used to summarize the main
characteristics of a dataset, often using visual methods. EDA allows data scientists and analysts to
explore data without having a specific hypothesis in mind, helping to uncover patterns, spot
anomalies, test assumptions, and gain insights that inform future analyses.

Importance of EDA:

Data Understanding: EDA provides in-depth insight into the data's structure, including
distributions, relationships, and trends, enabling analysts to understand what the data represents.

Identifying Patterns and Trends: It helps in uncovering underlying trends or patterns that may not
be immediately obvious, which can guide further data exploration and analysis.

Detecting Anomalies: Through visualization and summary statistics, EDA can identify outliers or
anomalies that could affect subsequent analysis or modeling.

Hypothesis Generation: EDA can help generate hypotheses that can be tested in further analysis
by revealing insights that might not have been considered originally.

Feature Selection: Understanding the relationships between different variables can aid in
identifying which features are most relevant for predictive modeling.

Informing Data Cleaning and Pre-processing: EDA highlights issues in the data, such as missing
values, skewed distributions, or irrelevant features, influencing necessary data cleaning steps.

Q2. What is the Importance of Visualization?


Data Visualization is the graphical representation of information and data. Using visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data.

Importance of Visualization:
Improved Comprehension: Visualizations make complex data more understandable, summarizing
large amounts of information quickly and effectively.

Better Communication: Visual representations make it easier to convey findings to others,
especially non-technical stakeholders, facilitating discussions and decision-making.

Enhanced Pattern Recognition: Humans are generally good at recognizing patterns visually, so
visualizations can highlight correlations and trends that might not be discernible through raw data
alone.

Time Efficiency: Visual tools help analysts quickly grasp the significance of the data without
digging deeply into the numbers, thus saving time.

Exploration in EDA: In the context of EDA, visual tools provide immediate feedback and support
iterative analysis, allowing analysts and stakeholders to explore data more freely and flexibly.

Q3. Explain the Steps Involved in EDA.


The EDA process typically involves several cohesive steps that provide a thorough understanding
of the dataset:

Data Collection: Gather data from various sources to create a comprehensive dataset for analysis.

Data Cleaning: Address issues such as missing values, duplicates, and inconsistencies to ensure
the quality of the dataset.

Descriptive Statistics: Compute summary statistics (mean, median, mode, variance, etc.) to gain
insights into the central tendency and dispersion of data.

Data Visualization: Create visual representations (histograms, scatter plots, box plots, etc.) to
explore distributions and relationships between variables.

Variable Analysis:

Univariate Analysis: Analyze each variable individually to understand its distribution and
characteristics.

Bivariate/Multivariate Analysis: Explore the relationships between two or more variables to
identify correlations and dependencies.

Outlier Detection: Identify and analyze outliers or anomalies in the data that could impact the
analysis.

Correlation Analysis: Examine the correlation between features to understand relationships and
dependencies, using correlation matrices or heatmaps.

Hypothesis Generation: Use insights derived from EDA to formulate hypotheses for further testing
in subsequent analyses.

Documentation: Document findings, visualizations, and initial impressions to provide context and
reference for future analysis or stakeholders.

By following these steps, analysts can approach their data in a structured manner, ensuring that
they derive maximum insight while also preparing the data for further predictive modeling or
analysis as needed.
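
Several of the steps above (cleaning, descriptive statistics, univariate analysis and correlation analysis) can be sketched in a few lines of pandas. This is only an illustration on made-up records: the `LIKES` column and all values are hypothetical, and filling missing views with the median is one possible cleaning choice, not the only one.

```python
import pandas as pd

# Hypothetical records; values are made up for illustration only.
raw = pd.DataFrame({
    "VIEWS": [100, 250, 250, None, 900],
    "LIKES": [10, 30, 30, 45, 80],
    "CHANNEL": ["A", "B", "B", "A", "C"],
})

# Data cleaning: drop exact duplicate rows, then fill missing views with the median.
clean = raw.drop_duplicates().copy()
clean["VIEWS"] = clean["VIEWS"].fillna(clean["VIEWS"].median())

# Descriptive statistics: count, mean, spread and quartiles for the numeric columns.
summary = clean[["VIEWS", "LIKES"]].describe()

# Univariate analysis of a categorical variable: frequency of each channel.
channel_counts = clean["CHANNEL"].value_counts()

# Correlation analysis: pairwise Pearson correlation of the numeric columns.
correlation = clean[["VIEWS", "LIKES"]].corr()

print(summary)
print(channel_counts)
print(correlation)
```

The correlation table produced in the last step is the same kind of matrix that a correlation heatmap displays as colours.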
