SMA_Expt_4
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No-04
A.1 Aim:
Exploratory Data Analysis and visualization of Social Media Data for business.
Lab Outcome: Collect, monitor, store and track social media data.
A.2 Prerequisite:
Data Mining, Data Analytics
A.3 Outcome:
Students will be able to perform exploratory data analysis and visualization on the chosen social
media data.
A.4 Theory:
The number of variables of interest featured by the data classifies it as univariate, bivariate, or
multivariate. For example, if the data features only one variable of interest then it is univariate
data; two variables make it bivariate, and more than two make it multivariate. Further, based on
the characteristics of the data, it can be classified as categorical/discrete or continuous data.
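As a rough illustration of the categorical versus continuous distinction, the sketch below splits the columns of a small made-up DataFrame by dtype. The column names (CHANNEL, VIEWS, LIKES) and values are hypothetical and may not match the lab dataset. The dtype is only a heuristic: a category stored as integer codes would be reported as numeric.

```python
import pandas as pd

# Hypothetical sample; the real lab dataset may use different columns.
df_demo = pd.DataFrame({
    "CHANNEL": ["A", "B", "A", "C"],
    "VIEWS": [1200.0, 350.0, 9800.0, 410.0],
    "LIKES": [100, 20, 750, 35],
})

# Numeric dtypes are treated as discrete/continuous measurements;
# every other dtype is treated as categorical.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```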
1. Univariate Plots
Univariate plots show the frequency or the distribution shape of a variable.
2. Swarm Plot
The swarm-plot, similar to a strip-plot, provides a visualization technique for univariate data to
view the spread of values in a continuous variable. The swarm-plot spreads out the data points of
the variable automatically to avoid overlap and hence provides a better visual overview of the
data.
3. Histograms
Histograms are two-dimensional plots in which the x-axis is divided into a range of numerical bins
or time intervals. The y-axis shows the frequency values, which are counts of occurrences of values
for each bin. Bar graphs have gaps between the bars to indicate that they compare distinct groups,
but there are no gaps in histograms. Histograms tell us whether the distribution is left/negatively
skewed (most of the data falls to the right side, with a tail to the left), right/positively skewed
(most of the data falls to the left side, with a tail to the right), bi-modal (two distinct peaks),
normal (symmetrical and bell-shaped, without skew), or uniform (almost all the bins have similar
frequency).
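The direction of skew can also be checked numerically rather than read off the histogram. The sketch below is a minimal illustration on synthetic data (not the lab dataset): it computes the third standardized moment, which is positive when the long tail is on the right and negative when it is on the left. The helper name sample_skewness is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential data: most values are small, with a long tail to the right.
right_skewed = rng.exponential(scale=1.0, size=5000)
# Mirroring the data moves the long tail to the left.
left_skewed = -right_skewed


def sample_skewness(x):
    """Third standardized moment (population form) of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print("right tail:", sample_skewness(right_skewed))  # positive value
print("left tail:", sample_skewness(left_skewed))    # negative value
```

On a real column, pandas offers the same kind of check through Series.skew(), which uses a bias-corrected variant of this formula.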
Density Plots:
A density plot is like a smoothed version of a histogram. Generally, the kernel density estimate is
used in density plots to show the probability density function of the variable. A continuous curve,
the kernel, is placed at each data point, and these curves are added together to generate a smooth
density estimate for the whole data.
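To make the idea of a kernel concrete, here is a minimal sketch of a one-dimensional kernel density estimate built by hand with NumPy. It assumes a Gaussian kernel and a fixed, hand-picked bandwidth of 0.8; plotting libraries such as seaborn choose the bandwidth automatically, so their curves will differ. The function name gaussian_kde_1d and the sample values are made up for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth


samples = np.array([1.0, 2.0, 2.5, 4.0, 7.0])  # made-up observations
grid = np.linspace(-5.0, 13.0, 1801)           # evaluation points, step 0.01
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# A valid density integrates to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("area under the curve:", area)
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that may hide real peaks.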
Bar Graphs
Bar charts can be used to compare nominal or ordinal data. They are helpful for recognizing trends.
Violin Plots:
The violin plot is very similar to a box plot, with the addition of a rotated kernel density
plot on each side. It shows the distribution of quantitative data across several levels of one (or
more) categorical variables such that those distributions can be compared.
Box Plots
These charts show the distribution of values along an axis. A rectangular box is drawn from the
first quartile (Q1) to the third quartile (Q3), with a line inside it at the median, giving us an
idea of how the data points are spread out. The quartiles divide the data set into four parts, each
holding a quarter of the observations, so the box covers the middle half. Whiskers extend from the
box towards the smallest and largest values. Boxes can be drawn vertically or horizontally.
Box plots are suitable for identifying outliers. The figure below shows the structure of a box plot.
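The numbers behind a box plot can be computed directly. The sketch below assumes the common 1.5 x IQR convention for flagging outliers (the default whisker rule in matplotlib and seaborn); other conventions exist. The function name box_plot_stats and the view counts are hypothetical.

```python
import numpy as np


def box_plot_stats(values):
    """Quartiles, IQR fences and outliers under the 1.5 * IQR convention."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are drawn as individual outlier markers.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers.tolist(),
    }


# Made-up view counts with one unusually popular video.
views = [120, 150, 160, 170, 180, 200, 210, 230, 5000]
stats = box_plot_stats(views)
print(stats)
```

Here the box runs from 160 to 210, the median line sits at 180, and the value 5000 lies above the upper fence of 285, so it would be drawn as an outlier.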
Heat Maps
Heat maps shade each cell of a grid according to the value it represents. For instance, correlation
heat maps show the interrelationship between variables, with each cell coloured by the strength of
the correlation. Colour differences make it easy to spot similar and different values and to make
sense of the variation in the data. They are usually helpful when you have a large amount of data.
Heat maps are also used during A/B testing to see which parts of a web page users interact with.
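A correlation heat map can be sketched as follows. The data is synthetic and the column names (VIEWS, LIKES, COMMENTS) are assumptions, not the lab dataset; LIKES is deliberately generated from VIEWS so that one strong correlation is visible. The figure is saved to a file whose name is also just an example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
views = rng.integers(100, 10_000, size=200)

# Synthetic data: LIKES depends on VIEWS, COMMENTS is independent noise.
demo = pd.DataFrame({
    "VIEWS": views,
    "LIKES": views * 0.05 + rng.normal(0, 20, size=200),
    "COMMENTS": rng.integers(0, 50, size=200),
})

# Pairwise Pearson correlation between the numeric columns.
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so colours are comparable across plots.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heat Map (synthetic data)")
fig.savefig("correlation_heatmap.png")
plt.close(fig)
```

In a notebook, plt.show() can replace the savefig and close calls. For a real dataset, passing numeric_only=True to corr() avoids errors from text columns.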
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge
faculty at the end of the practical in case there is no Blackboard access available.)
● Students need to use the previous social media dataset to perform exploratory data analysis and
visualization.
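# Univariate Plots
# Assumes df is the pandas DataFrame loaded from the previous experiment's dataset.
import matplotlib.pyplot as plt
import seaborn as sns
# Histograms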
plt.figure(figsize=(10, 6))
df['VIEWS'].hist(bins=20)
plt.title('Distribution of Views')
plt.xlabel('Views')
plt.ylabel('Frequency')
plt.show()
# Density Plots
plt.figure(figsize=(10, 6))
sns.kdeplot(df['VIEWS'])
plt.title('Density Plot of Views')
plt.xlabel('Views')
plt.ylabel('Density')
plt.show()
# Box Plots
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['VIEWS'])
plt.title('Box Plot of Views')
plt.show()
# Violin Plots
plt.figure(figsize=(10, 6))
sns.violinplot(y=df['VIEWS'])
plt.title('Violin Plot of Views')
plt.show()
# Swarm Plots (for smaller datasets, can be slow for large ones)
plt.figure(figsize=(10, 6))
sns.swarmplot(y=df['VIEWS']) # Consider sampling for large datasets
plt.title('Swarm Plot of Views')
plt.show()
# Bar Graphs (for categorical data; example using 'CHANNEL' if it is categorical)
if 'CHANNEL' in df.columns:  # Check that the column exists
    plt.figure(figsize=(12, 6))
    df['CHANNEL'].value_counts().plot(kind='bar')
    plt.title('Number of videos per channel')
    plt.xlabel('Channel Name')
    plt.ylabel('Number of Videos')
    plt.show()
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual outcome listed above
and learning/observation noted in section B.3)
We performed Exploratory Data Analysis and visualization of Social Media Data for business.
Importance of EDA:
Data Understanding: EDA provides in-depth insight into the data's structure, including
distributions, relationships, and trends, enabling analysts to understand what the data represents.
Identifying Patterns and Trends: It helps in uncovering underlying trends or patterns that may not
be immediately obvious, which can guide further data exploration and analysis.
Detecting Anomalies: Through visualization and summary statistics, EDA can identify outliers or
anomalies that could affect subsequent analysis or modeling.
Hypothesis Generation: EDA can help generate hypotheses that can be tested in further analysis
by revealing insights that might not have been considered originally.
Feature Selection: Understanding the relationships between different variables can aid in
identifying which features are most relevant for predictive modeling.
Informing Data Cleaning and Pre-processing: EDA highlights issues in the data, such as missing
values, skewed distributions, or irrelevant features, influencing necessary data cleaning steps.
Importance of Visualization:
Improved Comprehension: Visualizations make complex data more understandable, summarizing
large amounts of information quickly and effectively.
Enhanced Pattern Recognition: Humans are generally good at recognizing patterns visually, so
visualizations can highlight correlations and trends that might not be discernible through raw data
alone.
Time Efficiency: Visual tools help analysts quickly grasp the significance of the data without
digging deeply into the numbers, thus saving time.
Exploration in EDA: In the context of EDA, visual tools provide immediate feedback and support
iterative analysis, allowing stakeholders to explore data more freely and flexibly.
Steps in EDA:
Data Collection: Gather data from various sources to create a comprehensive dataset for analysis.
Data Cleaning: Address issues such as missing values, duplicates, and inconsistencies to ensure
the quality of the dataset.
Descriptive Statistics: Compute summary statistics (mean, median, mode, variance, etc.) to gain
insights into the central tendency and dispersion of data.
Data Visualization: Create visual representations (histograms, scatter plots, box plots, etc.) to
explore distributions and relationships between variables.
Variable Analysis:
Univariate Analysis: Analyze each variable individually to understand its distribution and
characteristics.
Bivariate/Multivariate Analysis: Explore the relationships between two or more variables to
identify correlations and dependencies.
Outlier Detection: Identify and analyze outliers or anomalies in the data that could impact the
analysis.
Correlation Analysis: Examine the correlation between features to understand relationships and
dependencies, using correlation matrices or heatmaps.
Hypothesis Generation: Use insights derived from EDA to formulate hypotheses for further testing
in subsequent analyses.
Documentation: Document findings, visualizations, and initial impressions to provide context and
reference for future analysis or stakeholders.
By following these steps, analysts can approach their data in a structured manner, ensuring that
they derive maximum insights while also preparing the data for further predictive modeling or
analysis as needed.
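The cleaning and descriptive-statistics steps listed above can be sketched on a tiny made-up table. The column names and values are hypothetical, and filling missing values with the median is only one of several reasonable strategies; dropping the rows is another.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value.
raw = pd.DataFrame({
    "CHANNEL": ["A", "B", "B", "C", "A"],
    "VIEWS": [1200.0, 350.0, 350.0, np.nan, 9800.0],
})

# Data cleaning: drop exact duplicate rows, then count the missing values.
clean = raw.drop_duplicates().copy()
missing_before = int(clean["VIEWS"].isna().sum())

# Fill the missing views with the median of the remaining values.
clean["VIEWS"] = clean["VIEWS"].fillna(clean["VIEWS"].median())

# Descriptive statistics: count, mean, std, min, quartiles and max.
summary = clean["VIEWS"].describe()
print(summary)

# Univariate view of the categorical column.
print(clean["CHANNEL"].value_counts())
```

Running describe() before and after cleaning is a quick way to see how much the chosen strategy shifts the mean and spread.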