Facebook Analysis

Uploaded by

mijal16558
Facebook Data Analysis

1. Sanity check (using Spark 2):


Code:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions

def parseInput(line, i):
    # Return the i-th comma-separated field of the line as a one-column Row
    fields = line.split(',')
    return Row(value = str(fields[i]))

if __name__ == "__main__":
    # Create a SparkSession
    spark = SparkSession.builder.appName("SanityCheck").getOrCreate()

    # Get the raw data
    lines = spark.sparkContext.textFile("hdfs:///tmp/facebook_data/pseudo_facebook.csv")

    a=["userid","age","dob_day","dob_year","dob_month","gender","tenure","friend_count","friendships_initiated","likes","likes_received","mobile_likes","mobile_likes_received","www_likes",$
    for i in range(15):
        # Convert column i to an RDD of Row objects with (value)
        x = lines.map(lambda line, i=i: parseInput(line, i))
        # Convert that to a DataFrame
        xDF = spark.createDataFrame(x)

        # Count the rows where this column holds the missing-value marker "NA"
        counts = xDF.filter(xDF["value"]=="NA").count()

        # Print the column name and its count
        print ("%s : %d"%(a[i],counts))

    # Stop the session
    spark.stop()
Command:
export SPARK_MAJOR_VERSION=2
spark-submit SanityCheck.py

Output:

Observation: Gender has null (NA) values. We should not delete these rows, as users
might have left the field blank.
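
The loop above scans the file once for each of the 15 columns. As a minimal sketch of the same check done in a single pass, the plain-Python snippet below counts "NA" entries per column with csv and collections.Counter. The sample rows, the shortened column list and the helper name count_na are made up for illustration; they are not taken from pseudo_facebook.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical three-column sample; the real file has 15 columns.
SAMPLE = """userid,age,gender
1,14,male
2,21,NA
3,35,female
4,NA,NA
"""

def count_na(text, marker="NA"):
    """Count, per column, how many rows hold the missing-value marker."""
    reader = csv.DictReader(io.StringIO(text))
    # Start every column at zero so clean columns are still reported.
    counts = Counter({name: 0 for name in reader.fieldnames})
    for row in reader:
        for name, value in row.items():
            if value == marker:
                counts[name] += 1
    return dict(counts)

na_counts = count_na(SAMPLE)
for column, n in na_counts.items():
    print("%s : %d" % (column, n))
```

Reading each row once and updating all column counters together avoids re-reading the input per column, which matters more as the file grows.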

2. Facebook popularity by age (using MapReduce, Python language)
Code:
from mrjob.job import MRJob
from mrjob.step import MRStep

class WhatAgeUsesFacebook(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ages,
                   reducer=self.reducer_count_ages),
            MRStep(reducer=self.reducer_sorted_output)
        ]

    def mapper_get_ages(self, _, line):
        # Split the CSV line into its 15 columns and emit (age, 1)
        (userid, age, dob_day, dob_year, dob_month, gender, tenure, friend_count,
         friendships_initiated, likes, likes_received, mobile_likes, mobile_likes_received,
         www_likes, www_likes_received) = line.split(',')
        yield age, 1

    def reducer_count_ages(self, age, ones):
        # Emit the zero-padded count as the key so the shuffle sorts by count
        yield str(sum(ones)).zfill(5), age

    def reducer_sorted_output(self, count, ages):
        # Keys arrive sorted by count; emit (age, count) for each age
        for age in ages:
            yield age, count

if __name__ == '__main__':
    WhatAgeUsesFacebook.run()

Command:
python map_reduce1.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs:///tmp/facebook_data/pseudo_facebook.csv
Output: age-wise distribution of users (Age, Count)

Observation: Facebook is most popular among users aged 16 to 26.
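
The second step of the job relies on the shuffle sorting keys as strings, which is why the count is zero-padded with zfill(5) before it is used as a key. The sketch below imitates both steps locally in plain Python to show the effect. The ages are invented for illustration and no Hadoop or mrjob is involved.

```python
from collections import Counter

# Hypothetical ages standing in for the age column of the CSV.
ages = ["18"] * 12 + ["25"] * 7 + ["40"] * 103 + ["63"] * 2

# Step 1: count users per age (the job of the mapper and first reducer).
counts = Counter(ages)

# Without padding, string sorting does not follow numeric order.
unpadded = sorted(str(n) for n in counts.values())

# With zfill(5), string order matches numeric order.
padded = sorted((str(n).zfill(5), age) for age, n in counts.items())

print(unpadded)
for count, age in padded:
    print(age, count)
```

Padding to five digits only keeps the order correct while every count stays below 100000; a larger count would need a wider pad.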


3. Likes Given (Using Drill)
CMD: apache-drill-1.12.0/bin/drillbit.sh start -Ddrill.exec.http.port=8765
Query 1: SELECT gender,avg(likes) AS AVG_Likes_Given
FROM hive.facebook_db.facebook
GROUP BY gender
ORDER BY AVG_Likes_Given DESC

Output: gender vs. average likes given

Query 2: SELECT userid, gender, likes AS Total_Likes_Given
FROM hive.facebook_db.facebook
ORDER BY Total_Likes_Given DESC LIMIT 10
Output: Top 10 users with most likes given
Analysis Result: Females give more likes than males on average.

4. Likes Received (Using Drill)


CMD: apache-drill-1.12.0/bin/drillbit.sh start -Ddrill.exec.http.port=8765
Query 1: SELECT gender,avg(likes_received) AS AVG_Likes_Received
FROM hive.facebook_db.facebook
GROUP BY gender
ORDER BY AVG_Likes_Received DESC
Output: gender vs. average likes received

Query 2: SELECT userid, gender, likes_received AS Total_Likes_Received
FROM hive.facebook_db.facebook
ORDER BY likes_received DESC
LIMIT 10
Output: Top 10 users with most likes received
Analysis Result: Females receive more likes than males on average.
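
To try the shape of these two queries without a Drill or Hive setup, the sketch below runs them against an in-memory SQLite table from Python. The table name facebook mirrors the queries above, but the four rows are invented and SQLite stands in for Drill, so only the query pattern carries over, not the results.

```python
import sqlite3

# In-memory database with a cut-down, hypothetical version of the table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facebook (userid INTEGER, gender TEXT, likes_received INTEGER)"
)
conn.executemany(
    "INSERT INTO facebook VALUES (?, ?, ?)",
    [(1, "male", 10), (2, "female", 40), (3, "male", 30), (4, "female", 80)],
)

# Pattern of Query 1: average per gender, highest first.
avg_by_gender = conn.execute(
    "SELECT gender, AVG(likes_received) AS avg_likes_received "
    "FROM facebook GROUP BY gender ORDER BY avg_likes_received DESC"
).fetchall()

# Pattern of Query 2: top users by likes received (top 2 here, not 10).
top_users = conn.execute(
    "SELECT userid, gender, likes_received FROM facebook "
    "ORDER BY likes_received DESC LIMIT 2"
).fetchall()
conn.close()

print(avg_by_gender)
print(top_users)
```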

5. Gender Count (using Zeppelin, Spark code):


val x = fbDF.groupBy("gender").count().orderBy(desc("count")).cache()
x.show()
Output:
+------+-----+
|gender|count|
+------+-----+
| male |58574|
|female|40254|
| NA | 175 |
+------+-----+
Analysis: There are more male users than female users.

6. Likes Split Up (using Zeppelin, SQL code)


Query 1:
SELECT gender,avg(mobile_likes) AS mobile_likes_given,
avg(mobile_likes_received) AS mobile_likes_received, avg(www_likes) AS
www_likes_given, avg(www_likes_received) AS www_likes_received
FROM fb
WHERE gender <> "NA"
GROUP BY gender
Output:
Query 2:
%sql
SELECT gender,avg(likes) AS likes_given ,avg(likes_received) AS likes_received
FROM fb
WHERE gender <> "NA"
GROUP BY gender

Output (likes given vs. likes received by gender):

Analysis: An interesting observation about gender-specific interaction with Facebook: women give and
receive far more likes than men (nearly 2.5 times as many).

7. Friend Counts & Friendships Initiated (using Zeppelin, SQL code)
Query:
SELECT gender,avg(friend_count) AS friend_count ,avg(friendships_initiated) AS
friendships_initiated
FROM fb
WHERE gender <> "NA"
GROUP BY gender

Output : (Friends Count vs Friendships Initiated)

Analysis: Women have more friends than men on Facebook, but the friendships
initiated in proportion to friend count are higher for men than for women.
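
The proportion mentioned in this analysis is not returned by the query itself; it has to be derived from the two averages. A minimal sketch of that arithmetic follows. The numbers are placeholders, not the query output, and the helper name initiated_share is hypothetical.

```python
# Hypothetical per-gender averages; the real values come from the query above.
averages = {
    "female": {"friend_count": 240.0, "friendships_initiated": 120.0},
    "male": {"friend_count": 160.0, "friendships_initiated": 100.0},
}

def initiated_share(row):
    """Average friendships initiated divided by average friend count."""
    return row["friendships_initiated"] / row["friend_count"]

shares = {gender: initiated_share(row) for gender, row in averages.items()}
for gender, share in shares.items():
    print("%s : %.3f" % (gender, share))
```

This is a ratio of averages. Averaging each user's own ratio would be a different measure and could give a different ranking.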

8. Users by birth year (using Zeppelin, SQL code)

Query: SELECT dob_year,count(userid) AS users_count
FROM fb
GROUP BY dob_year

Output:

Analysis:
We see bumps between 1940 and 1980. After 1980 the number of users rockets. The
data runs only until 2000, where we see a minuscule value.
