Experiment 1
Aim:
Installation of Hadoop and experiments with HDFS commands.
Theory:
1. Print the Hadoop version
• To check the Hadoop version, you can use the hadoop version command.
• This command provides information about the installed Hadoop software,
including the version number and build details.
3. Report the amount of space used and available on currently mounted
filesystem
• The hadoop fs -df hdfs:/ command reports the disk usage and available
space on the HDFS file system.
• It provides information about the storage capacity and usage for the
specified HDFS location.
4. Count the number of directories, files, and bytes under the paths that
match the specified file pattern
• The hadoop fs -count hdfs:/ command is used to count the number of
directories, files, and bytes under the specified path.
• It helps in understanding the structure and size of data in HDFS.
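• For example, a command such as hadoop fs -count hdfs:/ prints one summary line per matched path with the columns:
DIR_COUNT   FILE_COUNT   CONTENT_SIZE   PATHNAME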
6. Run a cluster balancing utility
• The hadoop balancer command is used to initiate the Hadoop cluster
balancing process.
• It redistributes data blocks across DataNodes to ensure data balance
within the Hadoop cluster.
8. Add a sample text file from the local directory named "data" to the new
directory you created in HDFS
• The hadoop fs -put data/sample.txt /user/training/hadoop command
uploads a text file from the local "data" directory to the "hadoop"
directory in HDFS.
• It is a common operation to transfer data from the local file system to
HDFS.
10. Add the entire local directory called "retail" to the /user/training
directory in HDFS
• The hadoop fs -put data/retail /user/training/hadoop command copies
the entire local directory "retail" to the "hadoop" directory in HDFS.
• This is useful for transferring a complete directory to HDFS.
12. See how much space the "retail" directory occupies in HDFS
• The hadoop fs -du -s -h hadoop/retail command shows the disk usage of
the "retail" directory within HDFS.
• It displays the total size occupied by the specified directory in a human-
readable format.
13. Delete a file 'customers' from the "retail" directory
• The hadoop fs -rm hadoop/retail/customers command deletes the file
named 'customers' from the "retail" directory in HDFS.
• This operation removes the specified file from HDFS.
15. Delete all files from the "retail" directory using a wildcard
• The hadoop fs -rm hadoop/retail/* command deletes all files within the
"retail" directory in HDFS using a wildcard '*'.
• It is a convenient way to remove multiple files in one operation.
17. Remove the entire retail directory and its contents in HDFS
• The hadoop fs -rm -r hadoop/retail command deletes the "retail"
directory and all its contents recursively.
• This operation effectively removes an entire directory and its
subdirectories.
• It helps confirm the status of the directory after performing various
operations.
19. Add the purchases.txt file from the local directory to the hadoop
directory you created in HDFS
• The hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
command copies the local file "purchases.txt" to the "hadoop" directory
in HDFS.
• This is a way to transfer a single file from the local file system to HDFS.
20. View the contents of the text file purchases.txt in your hadoop directory
• The hadoop fs -cat hadoop/purchases.txt command displays the
contents of the "purchases.txt" file within the "hadoop" directory in
HDFS.
• It allows you to view the content of the specified file.
21. Add the purchases.txt file from the "hadoop" directory in HDFS to the
"data" directory in your local directory
• The hadoop fs -copyToLocal hadoop/purchases.txt
/home/training/data command copies the "purchases.txt" file from the
"hadoop" directory in HDFS to the "data" directory in your local
filesystem.
• This is a way to transfer a file from HDFS to the local file system.
• It provides flexibility in copying files from HDFS to the local filesystem.
27. Change group name in HDFS
• The default name of the group is usually "training." The hadoop fs -chgrp
command is used to change the group name for a file or directory in HDFS.
• In this example, it changes the group of the "purchases.txt" file to
"training."
32. List all Hadoop file system shell commands
• The hadoop fs command, when used without specifying a subcommand,
lists all available Hadoop file system shell commands.
• This helps you see a comprehensive list of commands available for
managing HDFS.
33. Always ask for help
• The hadoop fs -help command provides detailed help and usage
information for the Hadoop file system shell commands.
• It's a valuable resource to understand command syntax, options, and how
to use these commands effectively.
Experiment 2
Aim:
Use of Sqoop tool to transfer data between Hadoop and relational database
servers.
Theory:
Sqoop in the Hadoop Ecosystem: Bridging Relational Databases and Big
Data
The growth of data in today's digital landscape is nothing short of explosive.
Enterprises are inundated with vast volumes of data, and much of this data
resides in relational databases managed by traditional application systems.
Meanwhile, Hadoop, with its distributed storage (HDFS) and powerful data
processing tools, has emerged as a leading platform for managing and analyzing
big data. The challenge, however, is how to seamlessly integrate and leverage
the data stored in relational databases with Hadoop's capabilities.
The Crucial Role of Sqoop:
Sqoop comes to the rescue by acting as the bridge between these two disparate
data worlds. It plays an indispensable role in facilitating data transfer between
Hadoop and relational database systems, making it an integral component of the
Hadoop ecosystem. Let's explore Sqoop's role and significance in greater depth:
Purpose and Role of Sqoop:
At its core, Sqoop is designed to accomplish the seamless transfer of data
between Hadoop and relational database servers. It has a twofold purpose:
importing data from relational databases into Hadoop HDFS and exporting data
from Hadoop HDFS back to relational databases. This dual functionality is
encapsulated in its name: "SQL to Hadoop and Hadoop to SQL."
1. Importing Data into HDFS: The import functionality of Sqoop is tasked
with retrieving data from relational databases such as MySQL, Oracle, or
SQL Server and bringing it into Hadoop's distributed file system, HDFS.
Sqoop operates at the table level, allowing users to import individual
tables or select subsets of data.
2. Data Representation in HDFS: Once data arrives in HDFS, Sqoop
organizes it into records. In this context, each row from the source
RDBMS table is treated as a record in HDFS. These records can be stored
as text data within text files or in binary formats like Avro and Sequence
files. This versatility allows Sqoop to adapt to various data storage and
processing requirements, making it a flexible and practical solution for
managing diverse data sources.
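As an illustration of the import described above, a typical command might look like the following (the JDBC URL, credentials, and table name here are assumptions for a training VM, not values taken from the screenshots below):
sqoop import \
  --connect jdbc:mysql://localhost/training_db \
  --username training --password training \
  --table customers \
  --target-dir /user/training/customers \
  -m 1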
Sqoop Export:
1. Exporting Data to RDBMS: The export tool within Sqoop complements
the import functionality by allowing users to move data from HDFS to a
target relational database system. This means that data generated,
processed, or stored in Hadoop can be effectively transported back to a
structured RDBMS environment.
2. Record Transformation and Delimiting: To make this transition
seamless, Sqoop reads and parses data files in HDFS. These data files
contain records, which align with the structure of the destination RDBMS
table. Sqoop offers users the option to specify a delimiter to separate
fields within the records. This transformation ensures that the data being
exported from Hadoop is formatted in a manner that is compatible with
the target RDBMS, thereby maintaining data integrity and consistency.
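A matching export sketch (again with an assumed JDBC URL and table name; the delimiter must match how the HDFS files were written):
sqoop export \
  --connect jdbc:mysql://localhost/training_db \
  --username training --password training \
  --table customers_export \
  --export-dir /user/training/customers \
  --input-fields-terminated-by ','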
The Significance of Sqoop:
Sqoop's significance lies in its ability to empower organizations to fully harness
the power of Hadoop for big data processing, analytics, and storage, while still
maintaining a connection with the structured data in traditional relational
databases. This is particularly crucial in scenarios where historical or
operational data is essential for comprehensive analysis, reporting, or regulatory
compliance.
Output:
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
Step 8:
Step 9:
Experiment 3
Aim:
Programming Exercises in HBase
Theory:
HBase Overview
• HBase is a distributed, column-oriented, non-relational database
management system that is designed to run on top of the Hadoop
Distributed File System (HDFS). It is part of the Hadoop ecosystem and
is ideal for handling large volumes of data, making it suitable for big data
use cases.
• HBase excels in storing sparse data sets where most of the data is empty
or not present, which is common in big data applications.
• Unlike traditional relational database systems, HBase does not use a
structured query language (SQL). Instead, it is designed for handling
unstructured and semi-structured data.
• HBase applications are primarily written in Java, following the
MapReduce programming model. However, HBase can also be accessed
from other languages through interfaces such as Apache Avro, REST,
and Thrift.
• An HBase system is designed to scale linearly, meaning you can easily
add more machines to your cluster to handle increasing data volumes and
workloads. It's an excellent choice for real-time data processing and
random read/write access to large data sets.
• HBase relies on Apache ZooKeeper for high-performance coordination.
While ZooKeeper is integrated into HBase, in a production cluster, it's
recommended to have a dedicated ZooKeeper cluster that works in
tandem with your HBase cluster.
Step 1: HBase Shell
• HBase provides a shell interface that allows users to interact with the
database. The HBase shell is a command-line tool for executing various
HBase operations.
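For instance, the shell is typically started from a terminal and then accepts commands interactively:
hbase shell
status
list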
Step 5: Enabling a Table
• After disabling a table, you can re-enable it using the enable command.
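A small sketch, assuming a table named 'emp' (the table name is an assumption used throughout these examples):
disable 'emp'
enable 'emp'
is_enabled 'emp'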
Step 8: Checking Table Existence
• You can verify the existence of a table using the exists command in the
HBase shell.
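For example (again assuming a table named 'emp'):
exists 'emp'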
Step 10: Creating Data (Put Command)
• In HBase, you can insert data into a table using the put command in the
HBase shell. This command allows you to specify the table name, row
key, column family, column name, and the value you want to store in a
particular cell.
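A hedged example, assuming a table 'emp' with a column family 'personal':
put 'emp', 'row1', 'personal:name', 'Alice'
put 'emp', 'row1', 'personal:city', 'Mumbai'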
Step 12: Updating Data (Put Command)
• To update an existing cell's value in an HBase table, you can use the put
command. The new value you provide will replace the existing value in
the specified cell.
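For instance, re-issuing put with the same row key and column overwrites the stored value (the names below are assumptions):
put 'emp', 'row1', 'personal:city', 'Pune'
get 'emp', 'row1', 'personal:city'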
Step 16: Truncating a Table
• The truncate command in HBase is used to disable, drop, and recreate a
table.
• It is a quick way to empty a table: the table schema is preserved, but all
data stored in the table is removed.
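For example (continuing the assumed 'emp' table):
truncate 'emp'
scan 'emp'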
Experiment No 4
Aim:
Experiment for Word Counting using Hadoop Map-Reduce
Theory:
WordCount Program Overview:
The WordCount program is a classic example in the Hadoop ecosystem, illustrating the
MapReduce paradigm for processing and analyzing large datasets. It is commonly used
as a beginner's exercise to understand the fundamental concepts of MapReduce.
1. Mapper (WordMapper.java):
• Responsible for reading input data and emitting key-value pairs.
• Takes a line of text and splits it into individual words.
• Associates each word with the value 1 to indicate its occurrence.
2. Reducer (WordReducer.java):
• Aggregates the intermediate results from the Mapper.
• Takes key-value pairs where the key is a word and the value is the count of
occurrences.
• Summarizes the counts to get the total occurrences of each word.
2. Add Classes:
• Create three classes: WordCount, WordMapper, and WordReducer.
• Copy the respective program contents into each class file.
3. Build Path:
• Add the Hadoop library (hadoop-core.jar) to the project's build path.
• This allows the project to use Hadoop classes and functionalities.
5. Terminal Commands:
• Open a terminal and navigate to the project directory.
9. Check Output:
• Use hadoop fs -ls to check the output directory and see the results.
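For instance, assuming the job wrote its results to an HDFS directory named wordcount_output (the directory name is an assumption for illustration):
hadoop fs -ls wordcount_output
hadoop fs -cat wordcount_output/part-r-00000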
Steps:
Step 1:
Create a new Java project in Eclipse. Name the project WordCountJob and click Finish.
Step 2:
Right-click on the project -> New -> Class, and create class files named:
• WordCount
• WordMapper
• WordReducer
Add the respective program contents to each file (given below), then run the program as a Java application.
Step 3:
Right-click the project -> Build Path -> Add External Archives -> File System ->
usr -> lib -> hadoop-0.20 -> hadoop-core.jar, and click Add.
Program 1 : WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
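The body of the driver class is not reproduced in this compilation; a minimal sketch consistent with the imports above (the job name and argument order are assumptions) might look like:
public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        Job job = new Job();                      // Hadoop 0.20 "new" mapreduce API
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}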
Program 2 : WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
String line = value.toString();
for (String word : line.split("\\W+"))
{
if (word.length() > 0)
context.write(new Text(word), new IntWritable(1));
}
}
}
Program 3 : WordReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException
{
int wordCount = 0;
for (IntWritable value : values)
wordCount += value.get();
context.write(key, new IntWritable(wordCount));
}
}
Output:
Experiment 5
Aim:
Experiment on Pig.
Theory:
1. Overview of Pig:
• Apache Pig is a high-level platform for processing and analysing large
datasets on top of Hadoop. It provides a simple scripting language called
Pig Latin for expressing data transformations.
• Pig is designed for parallel processing and can work with structured and
semi-structured data. It's particularly valuable for ETL (Extract,
Transform, Load) operations and data processing tasks in the Hadoop
ecosystem.
• Pig abstracts the complexities of writing MapReduce programs, making it
more accessible for data engineers, analysts, and data scientists to work
with big data.
2. Pig Architecture:
• Pig operates in two modes: local mode and MapReduce mode. In local
mode, Pig runs on a single machine, which is useful for testing and
debugging. In MapReduce mode, it takes advantage of Hadoop's
distributed processing capabilities to process data across a cluster.
• Pig consists of two main components:
• Pig Latin Interpreter: The Pig Latin interpreter parses and
executes Pig Latin scripts, which are written by users to define data
transformations.
• Execution Engine: Pig employs Hadoop's MapReduce engine or
Apache Tez for executing data processing tasks across a cluster.
3. Basics of Pig Latin:
• Pig Latin is the scripting language used in Pig for expressing data
transformations. It has a simple and expressive syntax for working with
data.
• Pig Latin scripts consist of a series of operations, where each operation
represents a transformation on the data. These transformations can
include loading data, filtering, grouping, joining, and more.
• Pig Latin scripts are translated into a series of MapReduce jobs when
executed in MapReduce mode.
4. Basics of Pig Operators:
• Load: The LOAD operator is used to load data into Pig from various data
sources, including HDFS, local files, and HBase. It specifies the location
and format of the data.
• Dump: The DUMP operator is used for debugging and displaying the
contents of relations in Pig. It's often used during script development to
inspect intermediate data.
• Foreach: The FOREACH operator is used to apply a projection
operation on a relation. It selects specific fields or generates new fields
based on the existing data.
• Filter: The FILTER operator is used to filter rows from a relation based
on specific conditions. It is useful for data selection.
• Join: The JOIN operator is used to combine data from multiple relations
based on a common field or key. It can perform various types of joins,
including inner, outer, and cross joins.
• Distinct: The DISTINCT operator is used to remove duplicate tuples
from a relation, preserving only unique values.
• Cogroup: The COGROUP operator groups data from multiple relations
based on a common field or key, allowing you to perform operations on
grouped data.
• Nested Projection: Pig allows you to project fields from nested data
structures, such as tuples, bags, and maps, using the FOREACH
operator. This is valuable for working with complex, nested data.
Step 2: Listing Files
• The fs -ls command within the Pig Grunt shell is used to list files and
directories in HDFS. It provides a view of the available data sources.
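For example, from the Grunt prompt:
fs -ls /user/training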
Step 4: Dumping Data
• The dump command is used to display the contents of a relation created
from loaded data. It's a way to verify the data's correctness and structure.
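A minimal sketch, assuming a comma-delimited employee file (the path and schema are assumptions):
emp = LOAD '/user/training/emp.txt' USING PigStorage(',') AS (id:int, name:chararray, sal:double);
DUMP emp;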
Step 5: Projection
• Projection involves selecting specific fields from a relation using the
FOREACH operator. It helps reduce the data to only the columns of
interest.
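Continuing the assumed emp relation from the previous step:
names = FOREACH emp GENERATE id, name;
DUMP names;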
Step 6: Joins
• The JOIN operator combines data from different relations based on
specified keys. It is essential for merging data from multiple sources.
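For example, joining emp with a hypothetical dept relation on a shared id field:
dept = LOAD '/user/training/dept.txt' USING PigStorage(',') AS (id:int, dname:chararray);
joined = JOIN emp BY id, dept BY id;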
Step 7: Relational Operators
a. CROSS Operator:
• The CROSS operator is used to calculate the cross product of two or
more relations. It generates all possible combinations of records from
the input relations.
• In the context of Pig, it takes two or more relations as input and
produces a new relation containing all possible pairs of records from
the input relations.
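For example, with the assumed emp and dept relations:
pairs = CROSS emp, dept;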
b. DISTINCT Operator:
• The DISTINCT operator is used to remove duplicate tuples from a
relation. It retains only unique values while discarding redundant
records.
• It's a valuable operator when you want to ensure that the data contains
no duplicate entries.
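For example:
uniq = DISTINCT names;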
c. FILTER Operator:
• The FILTER operator is used to select data from a relation based on
specific conditions or predicates. It allows you to filter rows that meet
specific criteria.
• Conditions are defined using comparison and logical operators, and
only rows that satisfy the conditions are retained.
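For example, keeping only the higher-salary rows of the assumed emp relation:
high_paid = FILTER emp BY sal > 50000.0;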
d. FOREACH Operator:
• The FOREACH operator is essential for generating data
transformations based on column data. It allows you to apply
expressions to fields within a relation, generating new fields or
modifying existing ones.
• With FOREACH, you can create calculated fields, perform
mathematical operations, and manipulate data as needed.
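For example, deriving a new field from an existing column:
raised = FOREACH emp GENERATE name, sal * 1.10 AS new_sal;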
e. COGROUP Operator:
• The COGROUP operator groups data from multiple relations based
on a common field or key. It is particularly useful when you need to
perform operations on data that share a common attribute.
• After co-grouping, you can apply operations on groups of data,
making it possible to analyze or process data in a more structured way.
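For example:
grouped = COGROUP emp BY id, dept BY id;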
EXPERIMENT NO. 6
Theory:
What is HIVE?
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It lets users define tables over data stored in HDFS and query them with an SQL-like language, which Hive translates into MapReduce jobs that run on the cluster.
What is HQL?
Hive Query Language (HQL) is a SQL-like query language for managing large
datasets. It simplifies complex MapReduce programming for data analysis. Users
familiar with SQL can leverage HQL to write custom MapReduce frameworks for
advanced analysis.
Uses of Hive:
Components of HIVE:
HIVE Organization:
Data in Hive is organized into Tables, Partitions, and Buckets. Tables are mapped to
directories in HDFS, partitions are subdirectories, and buckets are divisions within
partitions. Hive also has a Metastore storing metadata.
Limitations of HIVE:
• Not designed for Online Transaction Processing (OLTP); used for Online
Analytical Processing (OLAP).
• Supports overwriting or appending data, not updates and deletes.
• Subqueries are not supported.
Steps:
1. To start Hive:
• Command: hive
• Explanation:
• Initiates the Hive command-line interface.
• Provides an interactive shell for Hive operations.
2. To check databases:
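• Command: show databases;
• Lists the databases available in the Hive metastore.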
3. To check tables:
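• Command: show tables;
• Lists the tables in the current database.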
5. To create a database:
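• Command: create database retail;
• A following use retail; statement switches to the new database (the database name "retail" is taken from the next step).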
6. To create a table "emp" in the "retail" database:
• Command: create table emp(id INT, name STRING, sal DOUBLE) row
format delimited fields terminated by ',' stored as textfile;
• Explanation:
• Changes to "retail" database context.
• Creates a table "emp" with specified columns and storage format.
7. To create a file "hive-emp.txt" and load data into the "emp" table:
• Command:
• cat /home/training/hive-emp.txt
• load data local inpath '/home/training/hive-emp.txt' into table emp;
• Explanation:
• Displays content of "hive-emp.txt."
• Loads data into "emp" table for data ingestion.
11. To select data from "emp_data" where id is 1:
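• Command (a typical form): select * from emp_data where id = 1;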
13. Max commands using HQL:
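• Example (assuming the emp table defined earlier): select max(sal) from emp;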
Experiment – 7
Aim:
Implement Bloom Filter using Python/R Programming
Theory:
Introduction to Bloom Filters:
A Bloom Filter is a space-efficient probabilistic data structure used for
membership testing. It's designed to answer the question: "Is this element in the
set?" with some probability of false positives. Bloom Filters are particularly
useful when you want to quickly check if an element is part of a large dataset
without actually storing the entire dataset. They have applications in various
fields, including databases, networking, distributed systems, and more.
Basic Structure:
A Bloom Filter consists of the following components:
1. Bit Array (Bitmap): The core of the Bloom Filter is a fixed-size array of
bits. This array is typically initialized with all bits set to 0.
2. Hash Functions: A set of hash functions that map elements to positions
in the bit array. The number of hash functions used is a key parameter of
the Bloom Filter, often denoted as 'k.'
Key Operations:
1. Insertion: To add an element to the Bloom Filter, the element is hashed
by each of the 'k' hash functions, and the corresponding bit positions in
the bit array are set to 1.
2. Membership Test: To check if an element is in the set, you hash the
element using the same 'k' hash functions and check if all corresponding
bits are set to 1. If any of them is 0, you can be certain that the element is
not in the set. If all bits are 1, it's likely that the element is in the set, but
there's a chance of a false positive.
Probability of False Positives:
The probability of a false positive in a Bloom Filter depends on three main
factors:
1. Bit Array Size (m): The size of the bit array, which affects the filter's
capacity to store elements.
2. Number of Hash Functions (k): The number of hash functions used for
mapping elements to bits.
3. Number of Elements Inserted (n): The total number of elements added
to the Bloom Filter.
The formula to calculate the probability of a false positive (P_FP) is given by:
P_FP = (1 - e^(-kn/m))^k
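For example, with n = 1,000,000 expected elements and a target false-positive rate of 0.001 (the same values used in the code below), the optimal bit-array size works out to m = -n * ln(0.001) / (ln 2)^2 ≈ 14.4 million bits (about 1.7 MB), and the optimal number of hash functions is k = (m/n) * ln 2 ≈ 10.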
Characteristics:
1. Space Efficiency: Bloom Filters are memory-efficient because they only
require a fixed-size bit array, regardless of the number of elements stored.
2. False Positives: Bloom Filters can return false positives but never false
negatives. This means if the filter indicates that an element is not in the
set, it's correct. However, if it indicates that an element is in the set,
there's a small chance it might not be.
3. Parallelism: Bloom Filters allow for parallel insertion and membership
tests, as each hash function can be computed independently.
Use Cases:
Bloom Filters find applications in various scenarios, including:
• Caching: To quickly determine if a requested resource is in a cache
before accessing a slower storage system.
• Spell Checking: To efficiently check if a word exists in a dictionary.
• Distributed Systems: To reduce network requests by checking if a record
exists in a remote database.
• De-duplication: To eliminate duplicate entries in large datasets.
• Malware Detection: For quickly checking if a file is known to be
malware.
• Web Search Engines: For checking if a web page has already been
indexed.
Trade-offs:
• Bloom Filters have a fixed false positive rate, which may or may not be
suitable for your use case.
• As elements are inserted, the false positive rate may increase over time.
• Removing elements from a Bloom Filter is not straightforward because
bits once set to 1 cannot be safely reverted.
• Bloom Filters do not provide information about the stored elements
themselves; they only indicate presence or likely presence.
pip install mmh3
Collecting mmh3
Downloading mmh3-4.0.1-cp310-cp310-win_amd64.whl (36 kB)
Installing collected packages: mmh3
Successfully installed mmh3-4.0.1
Note: you may need to restart the kernel to use updated packages.
import math
import mmh3  # You may need to install this library using: pip install mmh3

class BloomFilter:
    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.error_rate = error_rate
        self.bit_array_size = self.calculate_bit_array_size()
        self.hash_functions_count = self.calculate_hash_functions_count()
        self.bit_array = [False] * self.bit_array_size

    def calculate_bit_array_size(self):
        # Calculate the optimal size of the bit array: m = -n * ln(p) / (ln 2)^2
        return int(-self.capacity * math.log(self.error_rate) / (math.log(2) ** 2))

    def calculate_hash_functions_count(self):
        # Calculate the optimal number of hash functions: k = (m / n) * ln 2
        return int(self.bit_array_size * math.log(2) / self.capacity)

    # The two methods below are reconstructed for completeness; the original
    # notebook's exact implementation is not shown in this compilation.
    def add(self, item):
        # Set the k bit positions selected by k differently-seeded hash functions
        for seed in range(self.hash_functions_count):
            index = mmh3.hash(item, seed) % self.bit_array_size
            self.bit_array[index] = True

    def contains(self, item):
        # If any selected bit is 0 the item is definitely absent;
        # if all are 1 the item is present with probability ~(1 - error_rate)
        for seed in range(self.hash_functions_count):
            index = mmh3.hash(item, seed) % self.bit_array_size
            if not self.bit_array[index]:
                return False
        return True

if __name__ == "__main__":
    capacity = 1000000   # Set the expected capacity of your data
    error_rate = 0.001   # Set the desired error rate (false positive probability)
    bloom = BloomFilter(capacity, error_rate)
    bloom.add("hadoop")
    print(bloom.contains("hadoop"))   # True
    print(bloom.contains("spark"))    # False (or, rarely, a false positive)
Experiment - 8
Aim:-
Implementation of a Factorization Machine classifier using Python.
Theory:-
4. Mathematical Formulation:
5. Model Parameters:
6. Predictions:
7. Learning:
8. Advantages:
10. Use Cases:
11. Challenges:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class FactorizationMachine:
    def __init__(self, n_factors, learning_rate, n_iterations):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    def predict(self, X):
        # FM model: w0 + <w, x> + 0.5 * sum_f [ (X v_f)^2 - (X^2)(v_f^2) ]
        pairwise = 0.5 * np.sum(X.dot(self.V) ** 2 - (X ** 2).dot(self.V ** 2), axis=1)
        return self.w0 + X.dot(self.w) + pairwise
    def fit(self, X_train, y_train):
        # NOTE: the update rules below reconstruct cells elided from this compilation
        n_samples, n_features = X_train.shape
        self.w0, self.w = 0.0, np.zeros(n_features)
        self.V = np.random.normal(0, 0.01, (n_features, self.n_factors))
        for _ in range(self.n_iterations):
            y_pred = self.sigmoid(self.predict(X_train))
            error = y_train - y_pred
            # Full-batch gradient step on the logistic loss
            self.w0 += self.learning_rate * error.mean()
            self.w += self.learning_rate * X_train.T.dot(error) / n_samples
            grad_V = (X_train.T.dot(error[:, None] * X_train.dot(self.V))
                      - (X_train ** 2).T.dot(error)[:, None] * self.V)
            self.V += self.learning_rate * grad_V / n_samples

# Example usage
if __name__ == "__main__":
    # Generate some random data for demonstration
    np.random.seed(0)
    X = np.random.randint(2, size=(1000, 10))  # Feature matrix (binary features)
    y = np.random.randint(2, size=1000)        # Binary labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    fm = FactorizationMachine(n_factors=5, learning_rate=0.01, n_iterations=100)
    fm.fit(X_train, y_train)
    y_pred = (fm.sigmoid(fm.predict(X_test)) >= 0.5).astype(int)
    print("Accuracy: %.2f" % (accuracy_score(y_test, y_pred) * 100))
Output:
Accuracy: 48.50
/var/folders/c5/05rtc57s0vj8p6whh_qf51pc0000gn/T/ipykernel_501/3572402935.py:24: RuntimeWarning: overflow encountered in exp
  return 1 / (1 + np.exp(-x))
EXPERIMENT NO. 9
Histogram:
Scatter Plot:
Scatter plots emerge as invaluable aids in unraveling the intricate relationship between two
continuous variables. The code snippet adeptly uses the plot function to depict the interplay
between test scores and salary. Beyond individual data points, the scatter plot allows for the
discernment of trends, clusters, or outliers, furnishing analysts with a visual narrative of how
test scores might impact salary outcomes. This graphical exploration deepens the
understanding of the nuanced connections within the dataset, enabling data-driven decision-
making and strategic planning.
Boxplot:
The boxplot of HSC percentages condenses the distribution into its quartiles and
aids in pinpointing variations, asymmetries, or patterns that might influence academic
outcomes. The boxplot's elegance lies in its ability to distill complex distributional
information into a visually accessible format, empowering analysts to glean insights
effortlessly.
Heatmap:
Heatmaps, as harnessed in this code snippet, are versatile instruments for unraveling intricate
patterns within a matrix. The selected subset of columns, transformed into a matrix, is
ingeniously visualized as a heatmap with column scaling. This technique not only facilitates
the identification of correlations but also brings out nuanced variations in the dataset. By
utilizing color gradients, the heatmap offers a nuanced representation of the relationships
between different variables, laying bare underlying structures that might be overlooked in raw
numerical data.
Barplot:
The barplot depicted in the code extends its utility beyond mere enumeration, becoming a
narrative of categorical distributions. By graphically representing the count of males and
females across placement status categories, this visualization allows for an at-a-glance
understanding of demographic proportions within different outcomes. The judicious use of
color and legend placement enhances interpretability, making it a powerful tool for conveying
gender distribution nuances within the context of placement outcomes. This visual aid
expedites decision-making processes related to gender diversity initiatives or targeted
interventions.
- The analysis aims to explore various aspects of the dataset, including the distribution of
MBA percentages, relationships between test scores and salary, the distribution of HSC
percentages, patterns in selected variables via a heatmap, and the distribution of
placement status by gender.
- Each visualization technique provides a unique perspective on the data, helping analysts
and stakeholders gain insights into different facets of the dataset. The combination of
histograms, scatter plots, boxplots, heatmaps, and barplots allows for a comprehensive
exploratory data analysis (EDA). This EDA is crucial for understanding data patterns,
making informed decisions, and potentially identifying areas for further investigation or
modeling.
Code:
data <- read.csv("Placement_Data_Full_Class.csv")
# Histogram of MBA percentage
hist(data$mba_p, main = "MBA Percentage", xlim = c(40, 90), freq = TRUE)

data <- read.csv("Placement_Data_Full_Class.csv")
# Scatter plot of test scores against salary
plot(x = data$etest_p, y = data$salary,
     xlab = "Test Scores", ylab = "Salary",
     xlim = c(45, 100), ylim = c(200000, 800000)
)

data <- read.csv("Placement_Data_Full_Class.csv")
# Notched boxplot of HSC percentage
boxplot(data$hsc_p,
        xlab = "Box Plot", ylab = "HSC Percentage", notch = TRUE
)

data <- read.csv("Placement_Data_Full_Class.csv")
# Heatmap of selected numeric columns, scaled column-wise
data3 <- data[, c(3, 5, 8, 11)]
data2 <- as.matrix(data3)
heatmap(data2, scale = "column")

data <- read.csv("Placement_Data_Full_Class.csv")
# Stacked barplot of placement status counts by gender
grouped <- table(data$gender, data$status)
barplot(as.matrix(grouped), col = c("grey", "black"))
legend("topright", legend = rownames(grouped), fill = c("grey", "black"))
Data Set:
Output:
(Plot outputs: histogram of MBA Percentage (Frequency vs. data$mba_p), scatter plot of Salary vs. Test Scores, notched boxplot of HSC Percentage, column-scaled heatmap of ssc_p/hsc_p/degree_p/etest_p with student row numbers as labels, and a barplot of placement status counts by gender (legend: F, M).)
Experiment 10
1. Data Ingestion:
- Acquire property data from various sources, such as public records, real
estate websites, or APIs. Ensure the data is in a suitable format, like CSV,
and load it into the Hadoop Distributed File System (HDFS).
2. Table Creation:
- Define a Hive table whose schema matches the property data, for example:
create table house (price INT, area INT, bedrooms INT, bathrooms INT,
stories INT, mainroad STRING, guestroom STRING, basement STRING,
hotwaterheating STRING);
3. Data Loading:
- Load the data into the Hive table you've created. This can be done
using Hive's `LOAD DATA` or `INSERT INTO` statements.
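For instance (the local path and file name are assumptions, and this presumes the house table was declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','):
LOAD DATA LOCAL INPATH '/home/training/Housing.csv' INTO TABLE house;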
Creating and loading table house
4. Data Exploration:
- Use Hive SQL to explore the data. You can write queries to get an initial
sense of the data, such as finding the average price, distribution of
bedrooms, or the properties built after a certain year.
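A couple of hedged example queries against the house table defined above:
SELECT AVG(price) FROM house;
SELECT bedrooms, COUNT(*) FROM house GROUP BY bedrooms;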
5. Data Transformation:
- Clean and transform the data as needed. This might involve handling
missing values, outliers, or converting data types. Hive provides various
built-in functions for these operations.
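For example, a simple cleaning-plus-aggregation query over the house table might look like:
SELECT mainroad, AVG(price) AS avg_price, MAX(area) AS max_area
FROM house
WHERE price IS NOT NULL
GROUP BY mainroad;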
Aggregation of data