DA3 SQL Portfolio Project

The project aims to analyze Stack Overflow post history using SQL, focusing on user activity and content evolution. It involves querying a dataset with tables related to badges, comments, post history, and more, covering tasks such as data exploration, filtering, joins, subqueries, and advanced queries. Deliverables include SQL scripts, insights from the analysis, and optional visualizations.

Uploaded by

touseefahmed70
Copyright
© All Rights Reserved

"Stack Overflow Post Analysis: A SQL Portfolio Project"

Total marks: 100 + 10 bonus

Dataset:
https://www.kaggle.com/datasets/stackoverflow/stackoverflow/data?select=post_history

Project Objective

The objective is to write queries that analyze the history of Stack Overflow posts, including
edits, comments, and other changes, to gain insight into user activity and content evolution
while mastering SQL skills. Because the original dataset is very large, with millions of rows per
table, you have been given 10 rows of data per table. Run your project_table_load.sql file, which
creates your schema and tables and inserts the data. Write your queries against these tables.

Tables in the Dataset

1. badges
   ○ Tracks badges earned by users.
   ○ Key Fields:
      ■ id, user_id, name (badge name), date (earned date).
2. comments
   ○ Contains comments on posts.
   ○ Key Fields:
      ■ id, post_id, user_id, creation_date, text.
3. post_history
   ○ Tracks the history of edits, comments, and other changes made to posts.
   ○ Key Fields:
      ■ id, post_history_type_id, post_id, user_id, text, creation_date.
4. post_links
   ○ Links between related posts.
   ○ Key Fields:
      ■ id, post_id, related_post_id, link_type_id.
5. posts_answers
   ○ Contains questions and answers.
   ○ Key Fields:
      ■ id, post_type_id (question or answer), creation_date, score, view_count, owner_user_id.
6. tags
   ○ Information about tags associated with posts.
   ○ Key Fields:
      ■ id, tag_name.
7. users
   ○ Details about Stack Overflow users.
   ○ Key Fields:
      ■ id, display_name, reputation, creation_date.
8. votes
   ○ Tracks voting activity on posts.
   ○ Key Fields:
      ■ id, post_id, vote_type_id, creation_date.
9. posts
   ○ Information about posts.
   ○ Key Fields:
      ■ id, title, post_type_id, creation_date, score, view_count, owner_user_id.

Tasks and Concepts

Part 1: Basics

1. Loading and Exploring Data (2 marks each)
   ○ Explore the structure and first 5 rows of each table.
   ○ Identify the total number of records in each table.
2. Filtering and Sorting (2 marks each)
   ○ Find all posts with a view_count greater than 100.
   ○ Display comments made in 2005, sorted by creation_date (comments table).
3. Simple Aggregations (2 marks each)
   ○ Count the total number of badges (badges table).
   ○ Calculate the average score of posts grouped by post_type_id (posts_answers table).
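
The Part 1 tasks can be sketched as follows. This is only an illustration, assuming SQLite run through Python's standard sqlite3 module rather than the database you loaded with project_table_load.sql. The rows are made up for the example, and the posts table stands in for posts_answers in the aggregation. PRAGMA table_info is SQLite-specific; other databases expose table structure differently.

```python
import sqlite3

# In-memory SQLite database. All rows are hypothetical sample data,
# not values from the real Stack Overflow dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, post_type_id INTEGER,
                    creation_date TEXT, score INTEGER, view_count INTEGER,
                    owner_user_id INTEGER);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, user_id INTEGER,
                       creation_date TEXT, text TEXT);
INSERT INTO posts VALUES
  (1, 'How to join tables?', 1, '2005-01-10', 12, 150, 101),
  (2, 'What is a CTE?',      1, '2005-03-02', 30,  80, 102),
  (3, 'Window functions',    2, '2006-07-15',  8, 300, 101);
INSERT INTO comments VALUES
  (1, 1, 102, '2005-05-01', 'Nice question'),
  (2, 1, 101, '2005-02-11', 'Thanks'),
  (3, 3, 102, '2006-08-01', 'Helpful');
""")

# Exploring: column names (SQLite-specific), first 5 rows, total row count.
columns = [row[1] for row in conn.execute("PRAGMA table_info(posts)")]
first_rows = conn.execute("SELECT * FROM posts LIMIT 5").fetchall()
total_posts = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

# Filtering: posts with more than 100 views.
popular = conn.execute(
    "SELECT id, title, view_count FROM posts WHERE view_count > 100 ORDER BY id"
).fetchall()

# Filtering and sorting: comments made in 2005, oldest first.
comments_2005 = conn.execute("""
    SELECT id, creation_date, text
    FROM comments
    WHERE creation_date >= '2005-01-01' AND creation_date < '2006-01-01'
    ORDER BY creation_date
""").fetchall()

# Aggregation: average score per post_type_id.
avg_by_type = conn.execute("""
    SELECT post_type_id, AVG(score) AS avg_score
    FROM posts
    GROUP BY post_type_id
    ORDER BY post_type_id
""").fetchall()

print(columns, total_posts, popular, comments_2005, avg_by_type, sep="\n")
```

Filtering the year with a half-open date range keeps the condition usable on an indexed date column, which a function call on the column would not.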

Part 2: Joins

1. Basic Joins (4 marks each)
   ○ Combine the post_history and posts tables to display the title of posts and the corresponding changes made in the post history.
   ○ Join the users table with badges to find the total badges earned by each user.
2. Multi-Table Joins (5 marks each)
   ○ Fetch the titles of posts (posts), their comments (comments), and the users who made those comments (users).
   ○ Combine post_links with posts to list related questions.
   ○ Join the users, badges, and comments tables to find the users who have earned badges and made comments.
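
Some of the join tasks above might look like the sketch below. It again assumes SQLite through Python's sqlite3 module, with cut-down tables and invented rows (the display names and badge names are hypothetical). Only the columns needed for the joins are created.

```python
import sqlite3

# Hypothetical sample data; only the columns the joins need are included.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT);
CREATE TABLE badges (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER,
                       user_id INTEGER, text TEXT);
INSERT INTO users VALUES (101, 'alice'), (102, 'bob'), (103, 'carol');
INSERT INTO badges VALUES (1, 101, 'Teacher'), (2, 101, 'Student'),
                          (3, 102, 'Teacher');
INSERT INTO posts VALUES (1, 'How to join tables?'), (2, 'What is a CTE?');
INSERT INTO comments VALUES (1, 1, 102, 'Nice question'), (2, 2, 103, 'Thanks');
""")

# Basic join: total badges per user. LEFT JOIN keeps users with no badges,
# and COUNT(b.id) counts only matched badge rows, so they show 0.
badge_counts = conn.execute("""
    SELECT u.display_name, COUNT(b.id) AS badge_count
    FROM users AS u
    LEFT JOIN badges AS b ON b.user_id = u.id
    GROUP BY u.id, u.display_name
    ORDER BY u.id
""").fetchall()

# Multi-table join: post title, comment text, and the commenting user.
post_comments = conn.execute("""
    SELECT p.title, c.text, u.display_name
    FROM comments AS c
    JOIN posts AS p ON p.id = c.post_id
    JOIN users AS u ON u.id = c.user_id
    ORDER BY c.id
""").fetchall()

# Users who have both earned a badge and made a comment. DISTINCT removes
# the duplicates produced when a user has several badges or comments.
badged_commenters = conn.execute("""
    SELECT DISTINCT u.display_name
    FROM users AS u
    JOIN badges AS b ON b.user_id = u.id
    JOIN comments AS c ON c.user_id = u.id
    ORDER BY u.display_name
""").fetchall()

print(badge_counts, post_comments, badged_commenters, sep="\n")
```

Whether to use LEFT JOIN or an inner join for the badge totals depends on whether users without badges should appear in the result; the task wording allows either reading.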

Part 3: Subqueries

1. Single-Row Subqueries (5 marks each)
   ○ Find the user with the highest reputation (users table).
   ○ Retrieve posts with the highest score in each post_type_id (posts table).
2. Correlated Subqueries (5 marks)
   ○ For each post, fetch the number of related posts from post_links.
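
One possible shape for the subquery tasks above is sketched here, assuming SQLite through Python's sqlite3 module and a handful of invented rows. The per-type query is written as a correlated subquery, which is one option among several (a join against a grouped subquery would also work).

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT, reputation INTEGER);
CREATE TABLE posts (id INTEGER PRIMARY KEY, post_type_id INTEGER, score INTEGER);
CREATE TABLE post_links (id INTEGER PRIMARY KEY, post_id INTEGER,
                         related_post_id INTEGER, link_type_id INTEGER);
INSERT INTO users VALUES (101, 'alice', 500), (102, 'bob', 1200), (103, 'carol', 900);
INSERT INTO posts VALUES (1, 1, 12), (2, 1, 30), (3, 2, 8), (4, 2, 5);
INSERT INTO post_links VALUES (1, 1, 2, 1), (2, 1, 3, 1), (3, 2, 1, 3);
""")

# Single-row subquery: the inner query returns one value (the maximum).
top_user = conn.execute("""
    SELECT display_name
    FROM users
    WHERE reputation = (SELECT MAX(reputation) FROM users)
""").fetchall()

# Highest-scoring post in each post_type_id: the inner query is re-evaluated
# for the post_type_id of each outer row.
top_per_type = conn.execute("""
    SELECT p.id, p.post_type_id, p.score
    FROM posts AS p
    WHERE p.score = (SELECT MAX(p2.score)
                     FROM posts AS p2
                     WHERE p2.post_type_id = p.post_type_id)
    ORDER BY p.post_type_id
""").fetchall()

# Correlated subquery: number of related posts for every post,
# including posts that have no links at all.
related_counts = conn.execute("""
    SELECT p.id,
           (SELECT COUNT(*) FROM post_links AS pl WHERE pl.post_id = p.id)
               AS related_count
    FROM posts AS p
    ORDER BY p.id
""").fetchall()

print(top_user, top_per_type, related_counts, sep="\n")
```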

Part 4: Advanced Queries

1. Window Functions (5 marks each)
   ○ Rank posts based on their score within each year (posts table).
   ○ Calculate the running total of badges earned by users (badges table).
2. Common Table Expressions (CTEs) (10 marks)
   ○ Create a CTE to calculate the average score of posts by each user and use it to:
      ■ List users with an average score above 50.
      ■ Rank users based on their average post score.
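
The window-function and CTE tasks above could be approached as in this sketch. It assumes SQLite 3.25 or newer (needed for window functions) through Python's sqlite3 module, with made-up rows. The running total is interpreted here as a per-user count ordered by date, which is one reading of the task. strftime('%Y', ...) is SQLite's way of extracting the year; other databases typically use YEAR() or EXTRACT.

```python
import sqlite3

# Hypothetical sample data; window functions require SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, creation_date TEXT,
                    score INTEGER, owner_user_id INTEGER);
CREATE TABLE badges (id INTEGER PRIMARY KEY, user_id INTEGER,
                     name TEXT, date TEXT);
INSERT INTO posts VALUES
  (1, '2005-01-10', 80, 101), (2, '2005-03-02', 40, 102),
  (3, '2006-07-15', 60, 101), (4, '2006-09-01', 90, 103),
  (5, '2005-11-20', 40, 103);
INSERT INTO badges VALUES
  (1, 101, 'Teacher', '2005-02-01'), (2, 101, 'Student', '2005-06-01'),
  (3, 102, 'Teacher', '2005-03-01'), (4, 101, 'Editor',  '2006-01-01');
""")

# Rank posts by score within each year; tied scores share a rank.
ranked_posts = conn.execute("""
    SELECT id,
           strftime('%Y', creation_date) AS yr,
           score,
           RANK() OVER (PARTITION BY strftime('%Y', creation_date)
                        ORDER BY score DESC) AS rnk
    FROM posts
    ORDER BY yr, rnk, id
""").fetchall()

# Running count of badges per user, in the order they were earned.
running_badges = conn.execute("""
    SELECT b.user_id, b.date,
           COUNT(*) OVER (PARTITION BY b.user_id
                          ORDER BY b.date, b.id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running_total
    FROM badges AS b
    ORDER BY b.user_id, b.date
""").fetchall()

# CTE with each user's average post score, reused by two queries.
USER_AVG_CTE = """
    WITH user_avg AS (
        SELECT owner_user_id, AVG(score) AS avg_score
        FROM posts
        GROUP BY owner_user_id
    )
"""
above_50 = conn.execute(USER_AVG_CTE + """
    SELECT owner_user_id, avg_score
    FROM user_avg
    WHERE avg_score > 50
    ORDER BY avg_score DESC
""").fetchall()
ranked_users = conn.execute(USER_AVG_CTE + """
    SELECT owner_user_id, avg_score,
           RANK() OVER (ORDER BY avg_score DESC) AS rnk
    FROM user_avg
    ORDER BY rnk
""").fetchall()

print(ranked_posts, running_badges, above_50, ranked_users, sep="\n")
```

RANK leaves gaps after ties while DENSE_RANK does not; which one suits the report is a judgment call.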

New Insights and Questions

● Which users have contributed the most in terms of comments, edits, and votes?
● What types of badges are most commonly earned, and which users are the top earners?
● Which tags are associated with the highest-scoring posts?
● How often are related questions linked, and what does this say about knowledge sharing?
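
As an illustration of how one of these questions might be turned into a query, the sketch below addresses the badge question. It assumes SQLite through Python's sqlite3 module; the rows, display names, and the cutoff of 3 in LIMIT are invented for the example.

```python
import sqlite3

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT);
CREATE TABLE badges (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
INSERT INTO users VALUES (101, 'alice'), (102, 'bob'), (103, 'carol');
INSERT INTO badges VALUES
  (1, 101, 'Teacher'), (2, 102, 'Teacher'), (3, 103, 'Teacher'),
  (4, 101, 'Student'), (5, 101, 'Editor'),  (6, 102, 'Student');
""")

# Most commonly earned badge types, most frequent first.
# The badge name is a tie-breaker so the order is deterministic.
common_badges = conn.execute("""
    SELECT name, COUNT(*) AS times_earned
    FROM badges
    GROUP BY name
    ORDER BY times_earned DESC, name
    LIMIT 3
""").fetchall()

# Top badge earners, by number of badges held.
top_earners = conn.execute("""
    SELECT u.display_name, COUNT(*) AS badge_count
    FROM badges AS b
    JOIN users AS u ON u.id = b.user_id
    GROUP BY u.id, u.display_name
    ORDER BY badge_count DESC, u.display_name
    LIMIT 3
""").fetchall()

print(common_badges, top_earners, sep="\n")
```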

Deliverables

● A report containing:
   ○ SQL scripts for each task. (70 marks)
   ○ Key insights derived from the queries. (30 marks)
   ○ Visualizations (if using tools like Tableau or Power BI). (Optional) (+10 bonus)
