Home Assignment Data Engineer

The document outlines an assignment to analyze a significant breach at TheGoodCorp (TGC) involving compromised Microsoft 365 accounts. The task includes creating automated logic to identify affected users, detailing attack timelines, and providing insights on attacker locations and indicators of compromise (IOCs). Additionally, it emphasizes the importance of code correctness, readability, and documentation, with a submission deadline of seven days and strict confidentiality requirements.

Uploaded by

nileshnv123

- Confidential and proprietary to Proofpoint – do not share with anyone -

Assignment description
One of our major customers, TheGoodCorp (TGC), had a massive breach.
Attackers from unknown locations have penetrated some of their Microsoft 365
accounts.
We sampled a few users from TGC and found that some of them were compromised
(HINT: more than 10 users).
You have been assigned the important task of analyzing the breach and creating an
automated logic to identify compromised users for TGC.
A raw data set is attached as an Excel file. Column descriptions are in the appendix of
this document.
Please provide Python code that generates the following data, along with any code used to
determine it (no need for any graphics or fancy formatting):
- A list of compromised users; for each user, specify the attack start and end times.
- A distribution of the attackers' country locations.
- IOCs (indicators of compromise) of the attack.
- Optional: any other information you think is relevant for TGC to know.
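
As an illustration of the first two deliverables, here is a minimal sketch that uses only the Python standard library. It assumes the events have already been loaded as dictionaries keyed by the Appendix A column names, with `creation_date` parsed into `datetime` objects. The `is_suspicious` predicate, the sample rows, and the country codes are hypothetical, and treating the first and last flagged event per user as the attack start and end is an assumption, not a requirement of the assignment.

```python
from collections import Counter
from datetime import datetime


def summarize_attack(events, is_suspicious):
    """Return (windows, country_counts) for the events flagged by is_suspicious.

    events: iterable of dicts keyed by the Appendix A column names.
    is_suspicious: hypothetical predicate, event dict -> bool.
    """
    windows = {}                # user_id -> (first flagged time, last flagged time)
    country_counts = Counter()  # country -> number of flagged events

    for event in events:        # single pass, O(n) in the number of events
        if not is_suspicious(event):
            continue
        user = event["user_id"]
        ts = event["creation_date"]
        start, end = windows.get(user, (ts, ts))
        windows[user] = (min(start, ts), max(end, ts))
        country_counts[event["country"]] += 1

    return windows, country_counts


# Made-up rows, only to show the shape of the result.
sample_events = [
    {"user_id": "u1", "country": "AA", "proxy_type": "None",
     "creation_date": datetime(2024, 1, 1, 9, 0)},
    {"user_id": "u1", "country": "BB", "proxy_type": "TOR",
     "creation_date": datetime(2024, 1, 2, 3, 0)},
    {"user_id": "u1", "country": "BB", "proxy_type": "TOR",
     "creation_date": datetime(2024, 1, 2, 5, 30)},
    {"user_id": "u2", "country": "CC", "proxy_type": "VPN",
     "creation_date": datetime(2024, 1, 3, 11, 0)},
]

# Placeholder rule: anything arriving through TOR or a VPN counts as flagged.
windows, country_counts = summarize_attack(
    sample_events, lambda e: e["proxy_type"] in {"TOR", "VPN"}
)
for user, (start, end) in sorted(windows.items()):
    print(user, start.isoformat(), end.isoformat())
print(country_counts.most_common())
```

Keeping the detection rule as a parameter leaves the aggregation generic: the same single pass works whatever logic is eventually used to decide which events belong to the attack.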

*Try researching data you don't understand (Google any unfamiliar terms) so that you
go beyond analyzing the data purely statistically or by model.

Submission Notes:
- Please do not share this task with anyone.
- You have 7 days to submit.
- Include your thought process during the task, document as much as you see fit.
- Make sure that the code is as generic as possible and written in Python.
- Focus on:
  - Correctness of the data.
  - Time complexity of the code.
  - Clean, readable, and well-documented code.

Appendix A

event_id - unique event ID
user_id - user ID (unique per user)
country - source IP country
client_ip - source IP address of the event
status - Boolean: successful/unsuccessful login
creation_date - event creation time (connection to Microsoft servers)
user_agent_app - breakdown of the user agent to app
user_agent_app_version - breakdown of the user agent to version
user_agent_device - breakdown of the user agent to device
user_agent_brand - breakdown of the user agent to brand
user_agent_model - breakdown of the user agent to model
user_agent_os - breakdown of the user agent to os
user_agent_os_version - breakdown of the user agent to os version
proxy_type - one of DCH, PUB (public proxy), TOR (Tor node), VPN, or None
event_rare - calculated rarity based on user agent, ISP and country
ua_rare - user agent rarity for the specific user, calculated using our secret sauce (0 - common, 1 - very rare)
isp_rare - ISP rarity for the specific user, calculated using our secret sauce (0 - common, 1 - very rare)
country_rare - country rarity for the specific user, calculated using our secret sauce (0 - common, 1 - very rare)
