Facebook Analysis
from pyspark.sql import SparkSession, Row

def parseInput(line):
    fields = line.split(',')
    # i is the column index set by the loop below
    return Row(value = str(fields[i]))

if __name__ == "__main__":
    # Create a SparkSession (the config bit is only for Windows!)
    spark = SparkSession.builder.appName("SanityCheck").getOrCreate()
    a=["userid","age","dob_day","dob_year","dob_month","gender","tenure","friend_count","friendships_initiated","likes","likes_received","mobile_likes","mobile_likes_received","www_likes",$
    # lines is the RDD of raw CSV lines; its definition is not shown in this excerpt
    for i in range(15):
        # Convert it to a RDD of Row objects with (value)
        x = lines.map(parseInput)
        # Convert that to a DataFrame
        xDF = spark.createDataFrame(x)
Output:
Observation: Gender has null values. We should not delete these rows, as users
may have left the field blank.
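The per-column sanity check can also be sketched without Spark. The snippet below is a minimal illustration, not the code used for this report: the sample rows are made up, only three of the columns are included, and treating an empty string or "NA" as missing is an assumption. Missing values are counted per column and the rows themselves are kept.

```python
import csv
import io

# Hypothetical sample rows (not taken from the real dataset);
# only three of the columns are included for brevity.
SAMPLE = """userid,age,gender
1,14,male
2,15,NA
3,20,female
4,31,
"""

def count_missing(text, missing=("", "NA")):
    """Return {column: number of rows whose value counts as missing}.

    Assumption: an empty string or the literal "NA" marks a missing value.
    """
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            if value is None or value.strip() in missing:
                counts[name] += 1
    return counts

if __name__ == "__main__":
    # Report the count per column; no rows are dropped.
    for column, n in count_missing(SAMPLE).items():
        print(column, n)
```

Counting instead of filtering matches the observation above: the blank gender rows stay in the data and can be excluded later, per query, where needed.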
from mrjob.job import MRJob
from mrjob.step import MRStep

class WhatAgeUsesFacebook(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ages,
                   reducer=self.reducer_count_ages),
            MRStep(reducer=self.reducer_sorted_output)
        ]

    # reducer_count_ages and reducer_sorted_output are not shown in this excerpt
    def mapper_get_ages(self, _, line):
        # Each input line holds the 15 comma-separated fields of one user
        (userid, age, dob_day, dob_year, dob_month, gender, tenure, friend_count,
         friendships_initiated, likes, likes_received, mobile_likes, mobile_likes_received, www_likes,
         www_likes_received) = line.split(',')
        yield age, 1

if __name__ == '__main__':
    WhatAgeUsesFacebook.run()
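The two reducers named in steps() are not shown, so the following plain-Python sketch only mimics the likely flow: emit (age, 1) per line, sum the ones per age, then sort by count. The function names, the ascending sort order, and the input lines are all assumptions made for illustration. It does not use mrjob and is not the job's actual reducer code.

```python
from collections import Counter

# Hypothetical input lines with the 15 comma-separated fields the mapper expects.
LINES = [
    "1,14,19,1999,11,male,266,0,0,0,0,0,0,0,0",
    "2,14,2,1999,11,female,6,0,0,0,0,0,0,0,0",
    "3,15,16,1998,11,male,13,0,0,0,0,0,0,0,0",
    "4,14,25,1999,12,female,93,0,0,0,0,0,0,0,0",
    "5,15,4,1998,12,male,82,0,0,0,0,0,0,0,0",
    "6,20,1,1993,1,male,15,0,0,0,0,0,0,0,0",
]

def map_ages(lines):
    """Mimic mapper_get_ages: emit (age, 1) for every input line."""
    for line in lines:
        fields = line.split(',')
        yield fields[1], 1  # age is the second field

def count_ages(pairs):
    """Mimic a counting reducer: sum the 1s emitted for each age."""
    totals = Counter()
    for age, one in pairs:
        totals[age] += one
    return totals

def sort_by_count(totals):
    """Mimic a sorting reducer: order (age, count) pairs by ascending count."""
    return sorted(totals.items(), key=lambda item: item[1])

if __name__ == "__main__":
    for age, count in sort_by_count(count_ages(map_ages(LINES))):
        print(age, count)
```

Splitting the work into a counting stage and a sorting stage mirrors the two MRStep entries: the sort can only run once every count is final.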
Analysis: An interesting observation on gender-specific interaction with Facebook: women both
give and receive far more likes than men (nearly 2.5 times as many).
7. Friend Counts & Friendships Initiated (using Zeppelin SQL code)
Query:
SELECT gender,
       avg(friend_count) AS friend_count,
       avg(friendships_initiated) AS friendships_initiated
FROM fb
WHERE gender <> "NA"
GROUP BY gender
Output:
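The query above can be tried outside Zeppelin with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the rows are invented, the table has just the three columns the query touches, 'NA' is written with single quotes as SQLite expects for string literals, and an ORDER BY is added so the result order is stable. The numbers it prints are not the report's results.

```python
import sqlite3

# Hypothetical rows: (gender, friend_count, friendships_initiated).
ROWS = [
    ("male", 100, 60),
    ("male", 200, 80),
    ("female", 300, 100),
    ("female", 100, 50),
    ("NA", 500, 400),  # excluded by the WHERE clause
]

QUERY = """
SELECT gender,
       avg(friend_count) AS friend_count,
       avg(friendships_initiated) AS friendships_initiated
FROM fb
WHERE gender <> 'NA'
GROUP BY gender
ORDER BY gender
"""

def average_by_gender(rows):
    """Load rows into an in-memory table named fb and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fb (gender TEXT, friend_count INTEGER, "
        "friendships_initiated INTEGER)"
    )
    conn.executemany("INSERT INTO fb VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    for gender, friends, initiated in average_by_gender(ROWS):
        print(gender, friends, initiated)
```

Filtering on gender in the query, instead of deleting the "NA" rows up front, keeps those users available for analyses that do not depend on gender.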
Analysis:
We see bumps between 1940 and 1980. After 1980 the number of users rockets. The
data runs only up to 2000, where we see a minuscule value.