DSBA Curriculum Guide
DATA SCIENCE AND BUSINESS ANALYTICS
CURRICULUM GUIDE
PROGRAM HIGHLIGHTS
6 Months Online
LEARNING OUTCOMES
CURRICULUM
Module 1: Python Foundations
Master data storytelling with Python. Learn to read, manipulate, and visualize data, driving insights
for impactful business solutions through exploratory data analysis. Transform raw information into
compelling narratives.
Concepts Used:
Variables and Datatypes
Data Structures
Conditional and Looping Statements
Functions
Learning Outcomes: Learn about the fundamentals of Python programming (variables, data
structures, conditional and looping statements, functions).
Case Study: CRED Pay- CRED Pay is a consulting firm that partners with banks and checks
whether their customers are eligible for a credit card. It is a startup in the early stages of
building its business. It has partnered with a few banks and is currently collecting data for
credit card applications. You have been hired as a Data Scientist to handle and organize the
data so that it will be easily accessible and to help the company predict whether an
application for a credit card can be accepted or not.
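The fundamentals listed above can be sketched in a few lines. This is an illustrative example only, not part of the course materials; the applicant records and the eligibility rule are invented for demonstration.

```python
# Variables and datatypes
applicant_name = "A. Kumar"      # str
annual_income = 54000.0          # float
has_existing_card = False        # bool

# Data structures: a list of dicts acting as a tiny "dataset"
applications = [
    {"name": "A. Kumar", "income": 54000.0, "age": 29},
    {"name": "B. Singh", "income": 23000.0, "age": 41},
    {"name": "C. Rao",   "income": 87000.0, "age": 35},
]

# Function combining a conditional to encode a toy eligibility rule
def is_eligible(applicant, income_cutoff=30000.0):
    """Return True if the applicant clears a simple income threshold."""
    return applicant["income"] >= income_cutoff

# Looping statement: filter the eligible applicants
eligible = [a["name"] for a in applications if is_eligible(a)]
print(eligible)  # ['A. Kumar', 'C. Rao']
```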
04
Topic 2- Python for Data Science
Python for Data Science: NumPy is a Python package for mathematical and scientific
computing and involves working with arrays and matrices. Pandas is a fast, powerful,
flexible, and simple-to-use open-source library in Python to manipulate and analyze data.
This module will cover these important libraries and provide a deep understanding of how
to use them to explore data.
Concepts Used:
NumPy Arrays and Functions
Accessing and Modifying NumPy Arrays
Saving and Loading NumPy Arrays
Pandas Series (Creating, Accessing, and Modifying Series)
Pandas DataFrames (Creating, Accessing, Modifying, and Combining DataFrames)
Pandas Functions
Saving and Loading Datasets using Pandas
Learning Outcomes: Learn about NumPy and Pandas, two of the most commonly used
libraries in Data Science for reading and manipulating data.
Case Study: MovieLens- MovieLens is a company in the internet and entertainment domain
providing an online database of information related to films, television series, and online
streaming content, including cast, production crew, trivia, ratings, and fan and critical reviews.
Every year, in collaboration with a guest curator, MovieLens publishes an annual edition based
on a theme, providing a comprehensive view of a topic. The company is planning to bring out
the ‘Movie Talkies: Classic’ edition this year. The idea is to explore movies that are a
decade old and deliver a detailed analysis.
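A minimal sketch of the NumPy and Pandas operations named above, using a made-up film table in the spirit of the MovieLens case (titles and numbers are invented):

```python
import numpy as np
import pandas as pd

# NumPy: create, access, and modify an array
arr = np.array([[1, 2, 3], [4, 5, 6]])
col_means = arr.mean(axis=0)          # column-wise means
arr[0, 0] = 10                        # in-place modification

# Pandas Series: creating and accessing
ratings = pd.Series([8.1, 7.4, 9.0], index=["film_a", "film_b", "film_c"])

# Pandas DataFrame: creating, accessing, and filtering
df = pd.DataFrame({"title": ["film_a", "film_b", "film_c"],
                   "year": [2001, 1999, 2005],
                   "rating": [8.1, 7.4, 9.0]})
old_films = df[df["year"] < 2002]     # boolean filtering
top = df.sort_values("rating", ascending=False).iloc[0]["title"]
```

Saving and loading work the same way via `np.save`/`np.load` and `df.to_csv`/`pd.read_csv`.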
Topic 3- Data Visualization
Learning Outcomes: Learn about different visual tools that help summarize data better
and how to create them using Seaborn, a popular Python library.
Case Study: Chef’s Kitchen- Chef's Kitchen is one of the most popular restaurants in the
city of San Diego and acts as a one-stop destination for food lovers. The polite and
efficient service provided by the restaurant staff often gets them tips from the customer.
As a Data Analyst for the restaurant, you have been asked to analyze the data provided to
identify the patterns and trends in the revenue and tips received from customers across
different demographics and come up with informative visualizations to convey the insights
obtained from the analysis.
Topic 4- Exploratory Data Analysis (Deep Dive)
Exploratory Data Analysis (Deep Dive): Exploratory Data Analysis, or EDA, is the process of
examining and visualizing data to uncover patterns, extract meaningful insights, and
facilitate storytelling. This module provides a deep insight into how to conduct EDA
using Python and utilize the insights extracted to drive business decisions.
Concepts Used:
Data Overview
Univariate Analysis
Bivariate/Multivariate Analysis
Missing Value Treatment
Outlier Detection and Treatment
Learning Outcomes: Learn how to perform Exploratory Data Analysis (EDA) to extract
insights from data.
Case Study: Zoom Ads- Zoom Ads is an advertising agency that wants to perform an
analysis on the data of the Google Play Store. They need to understand the trend of
applications available on the Play Store so that they can decide to focus on promoting
advertisements on particular applications which are trending in the market and can lead
to maximum profit. As a Data Scientist, you are required to gather and analyze detailed
information on apps in the Google Play Store in order to provide insights on app features
and the current state of the Android app market.
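The EDA steps listed above can be sketched with pandas on a toy app table in the spirit of the Play Store case. The data is invented for illustration, with one missing rating and one install-count outlier planted deliberately:

```python
import numpy as np
import pandas as pd

# Toy app data: one missing value and one outlier, for demonstration
df = pd.DataFrame({
    "installs": [100, 120, 90, 110, 5000],   # 5000 is an outlier
    "rating":   [4.1, 3.9, np.nan, 4.3, 4.0],
})

# Data overview
summary = df.describe()

# Missing value treatment: impute with the median
df["rating"] = df["rating"].fillna(df["rating"].median())

# Outlier detection with the IQR rule
q1, q3 = df["installs"].quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = df[df["installs"] > upper]

# Outlier treatment: cap values at the upper whisker
df["installs"] = df["installs"].clip(upper=upper)
```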
Module 2: Business Statistics
Utilize Python for statistical analysis. Validate business estimates through confidence intervals,
ensuring reliability. Test assumptions with hypothesis testing, guiding informed resource
allocation and strategic decision-making based on data distribution analysis.
Topic 1- Inferential Statistics Foundations
Concepts Used:
Experiments, Events, and Definition of Probability
Introduction to Inferential Statistics
Introduction to Probability Distributions (Random Variable, Discrete and Continuous
Random Variables, Probability Distributions)
Binomial Distribution
Normal Distribution
Z-Score
Learning Outcomes: Learn about the fundamentals of probability distributions and the
foundations of Inferential Statistics
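These distribution concepts can be sketched with `scipy.stats` (the numbers below are arbitrary, chosen only to illustrate the calls):

```python
from scipy import stats

# Binomial: probability that at most 2 of 10 trials "fail" when each fails with p = 0.1
p_at_most_2 = stats.binom.cdf(2, n=10, p=0.1)

# Normal: probability that a standard normal value falls below z = 1
p_below = stats.norm.cdf(1)            # about 0.8413

# Z-score: how many standard deviations an observation is from the mean
x, mu, sigma = 14.0, 10.0, 2.0
z = (x - mu) / sigma                   # (14 - 10) / 2 = 2.0
```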
Case Study: Medicon Drug Testing- Pharmaceutical company Medicon has manufactured the
sixth batch (40,000 units) of COVID-19 vaccine doses. This vaccine was clinically tested last
quarter, and around 200,000 doses have already been administered to people in five batches.
Now, this sixth batch needs to be tested for its time of effect (measured as the time taken for
a dose to completely cure COVID) as well as for quality assurance (which tells you whether a
dose will be able to do a satisfactory job or not).
Topic 2- Estimation and Hypothesis Testing
Estimation and Hypothesis Testing: Estimation involves determining likely values for
population parameters from sample data, while hypothesis testing provides a framework for
drawing conclusions from sample data to the broader population. This module covers the
important concepts of central limit theorem and estimation theory that are vital for statistical
analysis, and the framework for conducting hypothesis tests.
Concepts Used:
Sampling
Central Limit Theorem
Estimation
Introduction to Hypothesis Testing (Null and Alternative Hypothesis, Type-I and Type-II
errors, Alpha, Critical Region, P-Value)
Hypothesis Formulation and Performing a Hypothesis Test
One-Tailed and Two-Tailed Tests
Confidence Intervals and Hypothesis Testing
Learning Outcomes: Learn about the Central Limit Theorem, estimation, and the key
concepts of Hypothesis Testing.
Case Study: Talent Hunt Examination- A research institute conducts a Talent Hunt Examination
every year to hire people who can work on various research projects in the field of Mathematics
and Computer Science. A2Z institute provides a preparatory program to help the aspirants
prepare for the Talent Hunt Exam. The institute has a good record of helping many students clear
the exam. Before the application for the next batch starts, the institute wants to attract more
aspirants to their program. For this, the institute wants to assure the aspiring students of the
quality of results obtained by students enrolled in their program in recent years.
The institute wants to provide an estimate of the average score obtained by aspirants who enroll
in their program. Keeping in mind the variation in scores every year, the institute wants to provide
a more reliable estimate of the average score using a range of scores instead of a single estimate.
A recent social media post from A2Z institute received feedback from a reputed critic, mentioning
that the students from A2Z institute score less than last year's cut-off on average. The institute
wants to test if the claim by the critic is valid.
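The two analyses the institute needs, an interval estimate of the mean score and a one-tailed test of the critic's claim, can be sketched with `scipy.stats`. The scores here are simulated, not real exam data, and the cutoff of 68 is an assumed value:

```python
import numpy as np
from scipy import stats

# Simulated exam scores standing in for the institute's data (illustrative)
rng = np.random.default_rng(0)
scores = rng.normal(loc=72, scale=8, size=40)

# 95% confidence interval for the mean score (t-distribution, sigma unknown)
mean = scores.mean()
sem = stats.sem(scores)                       # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)

# One-sample test of H0: mu = 68 against H1: mu > 68 (one-tailed)
t_stat, p_two_tailed = stats.ttest_1samp(scores, popmean=68)
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2
reject_h0 = p_one_tailed < 0.05               # reject at the 5% level?
```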
Topic 3- Common Statistical Tests
Concepts Used:
Common Statistical Tests
Test for One Mean
Test for Equality of Means (Known Standard Deviation)
Test for Equality of Means (Equal and Unknown Std Dev)
Test for Equality of Means (Unequal and Unknown Std Dev)
Test of Independence
One-Way ANOVA
Learning Outcomes: Learn about various commonly used statistical tests and their
implementation in Python with business examples.
Case Study: Diet- The Health Company, which provides various diet plans for weight loss,
conducted a market test experiment to test three different kinds of diets (A, B, C). Each of
the volunteers was given one of the three diet plans and asked to follow the diet for 6
weeks. In order to understand the effectiveness of each of the different diets for weight loss,
the executives of the company reached out to you, a data scientist at the company. The
weights before starting the diet and the weights 6 weeks after following the diet were
recorded for 78 volunteers, each of whom was provided with one of the three diet plans. You have
been asked to perform a statistical analysis to find evidence of whether the mean weight
losses with respect to the three diet plans are significantly different. Consider a 5%
significance level for the analysis.
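A diet comparison like the one above can be sketched with a one-way ANOVA in `scipy.stats`. The weight-loss figures below are invented for illustration, not the case-study data:

```python
from scipy import stats

# Hypothetical weight-loss figures (kg) for three diet plans after 6 weeks
diet_a = [3.8, 6.0, 0.7, 2.9, 2.8, 2.0]
diet_b = [3.5, 4.1, 2.5, 3.0, 3.9, 2.7]
diet_c = [5.3, 4.9, 6.2, 5.6, 4.5, 5.8]

# One-way ANOVA: H0 says the three mean weight losses are all equal
f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)
significant = p_value < 0.05   # 5% significance level, as in the case study
```

If the test is significant, a post-hoc procedure (e.g. Tukey's HSD) would identify which pairs of diets differ.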
Module 3: Supervised Learning - Foundations
Delve into linear models for uncovering relationships between variables and continuous outcomes.
Validate models for statistical soundness, drawing inferences to extract crucial business insights
into decision-making factors.
Topic 1- Intro to Supervised Learning - Linear Regression
Intro to Supervised Learning - Linear Regression: Machine Learning (ML) is a subset of Artificial
Intelligence (AI) that focuses on developing algorithms capable of learning patterns in data
and making predictions without being explicitly programmed to do so. Linear Regression is one
of the most popular supervised ML algorithms that identifies the degree of linear relationship in
data. This module introduces participants to ML and explores how linear regression can be used
for predictive analysis.
Concepts Used:
Introduction to Learning from Data
Simple and Multiple Linear Regression
Evaluating a Regression Model
Pros and Cons of Linear Regression
Learning Outcomes: Understand the concept of learning from data, how the linear regression
algorithm works, and how to build and assess the performance of a regression model in Python.
Case Study: Anime Rating- Streamist is a streaming company that streams web series and
movies to a worldwide audience. Every content on their portal is rated by the viewers, and the
portal also provides other information for the content like the number of people who have
watched it, the number of people who want to watch it, the number of episodes, duration of an
episode, etc. Streamist is currently focusing on the anime available in their portal and wants to
identify the most important factors involved in rating an anime. As a data scientist at Streamist,
you are tasked with analyzing the portal's anime data and identifying the important factors by
building a predictive model to predict the rating of an anime.
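Building and assessing a regression model like the anime-rating predictor can be sketched with scikit-learn. The data here is synthetic: ratings are generated from two invented features (episode count and duration) plus noise, so the fitted coefficients are recoverable by construction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: rating driven by two features plus noise (illustrative only)
rng = np.random.default_rng(42)
n = 200
episodes = rng.uniform(10, 60, n)
duration = rng.uniform(15, 45, n)
rating = 5.0 + 0.03 * episodes + 0.02 * duration + rng.normal(0, 0.2, n)

# Multiple linear regression on the two features
X = np.column_stack([episodes, duration])
model = LinearRegression().fit(X, rating)

# Evaluate: R^2 is the fraction of variance in ratings explained by the model
preds = model.predict(X)
r2 = r2_score(rating, preds)
```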
Topic 2- Linear Regression Assumptions and Statistical Inference
Linear Regression Assumptions and Statistical Inference: The linear regression algorithm has a
set of assumptions that need to be satisfied for the model to be statistically validated and to be
able to draw inferences from it. This module walks participants through these assumptions, how
to check them, what to do in case they are violated, and the statistical inferences that can be
drawn based on the model's output.
Concepts Used:
Statistician vs ML Practitioner
Linear Regression Assumptions
Statistical Inferences from a Linear Regression Model
Learning Outcomes: Understand the underlying assumptions of a linear regression model, how
to check and ensure they are satisfied, and how to make statistical inferences from the model.
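A few of the standard assumption checks can be sketched on simulated data. This is one possible set of diagnostics, not the module's exact procedure: Shapiro-Wilk for residual normality, the zero-mean property of OLS residuals, and a rough spread comparison as a stand-in for a formal homoscedasticity test:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Simulated data that satisfies the assumptions by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 150)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 150)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Normality of errors: Shapiro-Wilk test (H0: residuals are normal)
_, p_normal = stats.shapiro(residuals)

# Zero-mean errors: OLS residuals average to ~0 when an intercept is fitted
mean_ok = abs(residuals.mean()) < 1e-8

# Homoscedasticity (rough check): residual spread in lower vs upper half of x
low, high = residuals[x < 5].std(), residuals[x >= 5].std()
spread_ratio = max(low, high) / min(low, high)   # near 1 if variance is constant
```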
Module 4: Supervised Learning - Classification
Unlock the power of classification models to discern relationships between variables and
categorical outcomes. Extract business insights by identifying pivotal factors shaping
decision-making processes.
Topic 1- Logistic Regression
Logistic Regression: Logistic regression is a statistical modeling technique primarily used for
modeling the probability of binary outcomes. It finds applications in various fields such as
medicine, finance, and manufacturing. This module covers the theory behind the logistic
regression model, how to assess its performance, and how to draw statistical inferences from it.
Concepts Used:
Introduction to Logistic Regression
Interpretation from a Logistic Regression Model
Changing the Threshold of a Logistic Regression Model
Evaluation of a Classification Model
Pros and Cons
Learning Outcomes: Understand the foundations of the Logistic Regression Model, how to make
interpretations from it, how to evaluate the performance of classification models, and how
changing the threshold of a Logistic Regression Model can help in improving predictions.
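Fitting a logistic regression and moving its decision threshold can be sketched with scikit-learn on synthetic data (the dataset and the 0.3 threshold are illustrative choices, not values from the course):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Synthetic binary-classification data
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Predicted probabilities for the positive class
proba = model.predict_proba(X)[:, 1]

# Default decision threshold of 0.5
pred_default = (proba >= 0.5).astype(int)

# Lowering the threshold flags more positives: recall rises, precision falls
pred_low = (proba >= 0.3).astype(int)

acc = accuracy_score(y, pred_default)
recall_default = recall_score(y, pred_default)
recall_low = recall_score(y, pred_low)
```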
Case Study: Income group classification (WHO data)- DeltaSquare is an NGO that works with
the Government on matters of social policy to bring about a change in the lives of underprivileged
sections of society. They are tasked with coming up with a policy framework by looking at the
data the government got from WHO. The objective is to analyze the data provided to identify the
different factors that influence the income of an individual, to build a good predictive model for
income, assess its performance, and help in shaping a proposal for the government.
Topic 2- Decision Tree
Decision Tree: Decision Trees are supervised ML algorithms that utilize a hierarchical
structure for decision making and can be used for both classification and regression
problems. This module dives into how a decision tree can be used to model complex,
non-linear data and how to improve the performance of Decision Trees using pruning
techniques.
Concepts Used:
Introduction to Decision Tree
How a Decision Tree is Built
Methods of Pruning a Decision Tree
Different impurity measures
Regression Trees
Pros and Cons
Learning Outcomes: Understand the Decision Tree algorithm, how it’s built, the different
pruning techniques that can be used to improve performance, and learn about the different
impurity measures used to make decisions.
Case Study: Machine Predictive Maintenance- Analyze the data of an auto component
manufacturing company and develop a predictive model to detect potential machine
failures, determine the most influencing factors on machine health, and provide
recommendations for cost optimization to the management.
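A decision tree and the effect of pre-pruning can be sketched with scikit-learn. A bundled dataset stands in for the machine-failure records here, purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary-classification dataset for the maintenance records
X, y = load_breast_cancer(return_X_y=True)

# An unpruned tree grows until leaves are pure and tends to overfit
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: cap the depth; splits are chosen by Gini impurity here
# (entropy is the other common impurity measure)
pruned = DecisionTreeClassifier(max_depth=3, criterion="gini",
                                random_state=0).fit(X, y)

depth_full, depth_pruned = full_tree.get_depth(), pruned.get_depth()
```

Comparing the two on held-out data (not shown) is what reveals the overfitting; on training data the deeper tree always scores at least as high.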
Module 5: Ensemble Techniques
Combine the decisions from multiple models using ensemble techniques to arrive at more robust
models that can make better predictions.
Topic 1- Bagging and Random Forest
Bagging and Random Forest: Random forest is a popular ensemble learning technique that
comprises several decision trees, each using a subset of the data to learn patterns. The
outputs of the trees are then aggregated to produce the final prediction. This module will
explore how to train a random forest model to solve complex business problems.
Concepts Used:
Introduction to Ensemble Techniques
Introduction to Bagging
Sampling with Replacement
Introduction to Random Forest
Learning Outcomes: Understand how ensemble techniques work, learn about sampling with
replacement and the concept of bagging, and build Random Forest models to make better
predictions.
Case Study: HR Attrition- McCurr Consultancy is an MNC that has thousands of employees
spread across the globe. The company believes in hiring the best talent available and retaining
them for as long as possible. A huge amount of resources is spent on retaining existing
employees through various initiatives. The Head of People Operations wants to bring down the
cost of retaining employees. For this, he proposes limiting the incentives to only those
employees who are at risk of attrition. The objective is to identify patterns in the characteristics
of employees who leave the organization and use that information to predict whether an
employee is at risk of attrition using an ML model. This information will be used to target them
with incentives.
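Training a random forest for a risk-prediction task like this can be sketched with scikit-learn. Synthetic data stands in for the employee records; feature names and all numbers are invented:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Bagging + random feature subsets: each of the 200 trees is trained on a
# bootstrap sample of rows and considers a random subset of columns per split
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

test_acc = rf.score(X_te, y_te)
importances = rf.feature_importances_   # which features drive the risk score
```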
Topic 2- Boosting
Boosting: Boosting models are robust ensemble models comprising several sub-models, each
of which is developed sequentially to improve upon the errors made by the previous one.
This module will cover essential boosting algorithms like AdaBoost and XGBoost that are
widely used in the industry for accurate and robust predictions.
Concepts Used:
Introduction to Boosting
Boosting Algorithms like Adaboost, Gradient Boost, and XGBoost
Stacking
Learning Outcomes: Understand the concept of boosting, the difference between bagging and
boosting, learn various boosting algorithms, and understand the concept of stacking.
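The sequential idea behind boosting can be sketched with scikit-learn's built-in implementations. Note this uses AdaBoost and gradient boosting from scikit-learn; XGBoost is a separate library with its own API, and the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# AdaBoost: each new weak learner upweights the samples the previous one missed
ada = AdaBoostClassifier(n_estimators=100, random_state=2).fit(X_tr, y_tr)

# Gradient boosting: each new tree fits the residual errors of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=100, random_state=2).fit(X_tr, y_tr)

ada_acc = ada.score(X_te, y_te)
gb_acc = gb.score(X_te, y_te)
```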
Case Study: Bike Sharing- Bike-sharing systems are a new generation of traditional bike rentals
in which the entire process, from membership to rental and return, is automated. Through these
systems, a user can easily rent a bike at one location and return it at another. 'Travel Along' is a
new bike-sharing company that wants to expand its customer count and provide better services
at a reasonable cost. It has conducted several surveys and collated data about weather,
weekends, holidays, etc. from the past 2 years. The objective is to analyze the patterns in the
data and figure out the key areas that can help the organization grow and manage customer
demand. Further, you need to use this information to predict the count of bikes shared so that
the company can plan ahead for surge hours.
Module 6: Model Tuning
Employ feature engineering techniques and hyperparameter tuning to improve model performance
and optimize associated business costs.
Topic 1- Feature Engineering and Cross-Validation
Feature Engineering and Cross-Validation: Feature engineering involves creating new input
features or modifying existing ones to improve a machine learning model's performance, and
cross-validation is used to get a better assessment of model performance. This module
covers these two concepts along with regularization to tune the performance of ML models and
correctly assess their performance.
Concepts Used:
Feature Engineering
Cross-Validation
Oversampling and Undersampling
Regularization
Learning Outcomes: Learn how to handle imbalanced data, how to use the cross-validation
technique to get a better picture of model performance, and understand the concept of
regularization.
Case Study: Job change prediction- An ed-tech company wants to hire data scientists from
among people who have successfully passed some of its courses and then signed up for
training. The company wants to know which of these people are genuinely looking for a job
change and would prefer working with them after completing the training, because this helps
reduce the cost and time of categorizing candidates. Information related to demographics,
education, and experience is available from candidate signup and enrollment. The objective is
to identify the factors affecting whether a person is looking for a job change, build a predictive
model to predict whether a person is looking for a job change, and check whether imbalance
in the data affects model predictions.
Topic 2- ML Pipeline and Hyperparameter Tuning
Concepts Used:
Machine Learning Pipeline
Model Tuning and Performance
Hyperparameter Tuning
Grid Search
Random Search
Learning Outcomes: Learn how to optimize model performance using hyperparameter tuning
and how to automate standard workflows in a machine learning process using pipelines.
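A pipeline combined with grid search can be sketched with scikit-learn. The dataset is synthetic and the parameter grid is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=4)

# Pipeline: scaling and modeling run as one unit, so preprocessing is
# re-fit inside each CV fold and never leaks validation data
pipe = Pipeline([("scale", StandardScaler()),
                 ("tree", DecisionTreeClassifier(random_state=4))])

# Grid search: exhaustively try hyperparameter combinations with 5-fold CV
grid = GridSearchCV(pipe,
                    param_grid={"tree__max_depth": [2, 4, 6],
                                "tree__min_samples_leaf": [1, 5]},
                    cv=5)
grid.fit(X, y)
best_params = grid.best_params_
best_score = grid.best_score_
```

Random search (`RandomizedSearchCV`) follows the same pattern but samples a fixed number of combinations instead of trying all of them.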
Case Study: Supermarket marketing campaign- ‘All You Need' Supermarket is planning for the
year-end sale. It wants to launch a new offer, a gold membership for only $499 (normally $999)
that gives a 20% discount on all purchases, for existing customers only. To promote it, the
supermarket needs to run a campaign through phone calls, and the best way to reduce the
cost of the campaign is to build a predictive model that classifies customers who might
purchase the offer, using data gathered during last year's campaign.
The objective is to build a model for classifying whether customers will respond positively or
not, identify the different factors that affect the kind of response, and improve the
performance of an initially built model using hyperparameter tuning.
Module 7: Unsupervised Learning
Unlock the power of clustering algorithms to group data based on similarity, unveiling hidden
patterns and intrinsic structures. Explore dimensionality reduction techniques to grasp the
significance of streamlined data analysis.
Topic 1- K-Means Clustering
K-Means Clustering: K-means clustering is a popular unsupervised ML algorithm that is used for
identifying patterns in unlabeled data and grouping it. This module dives into the working of the
algorithm and the important points to keep in mind when implementing it in practical scenarios.
Concepts Used:
Introduction to Clustering
Types of Clustering
K-Means Clustering
Importance of Scaling
Silhouette Score
Visual Analysis of Clustering
Learning Outcomes: Learn about the different types of clustering algorithms, how K-means
clustering works, how to determine the optimal number of clusters by comparing different
metrics, and the importance of scaling data.
Case Study: Engineering Colleges Case Study- Education is fast becoming a very competitive
sector with hundreds of institutions to choose from. It is a life-transforming experience for any
student and it has to be a thoughtful decision. There are ranking agencies that do a survey of all
the colleges to provide more insights to students. Agency ‘RankForYou’ wants to leverage this
year's survey to roll out an editorial article in leading newspapers, on the state of engineering
education in the country. The objective is to cluster the colleges into groups based on the data
provided and come up with evidence-based insights for that article.
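Choosing the number of clusters with the silhouette score, as described above, can be sketched with scikit-learn. Well-separated synthetic blobs stand in for the colleges data, and the candidate range of k is an arbitrary choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated blobs as a stand-in for the survey data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=5)

# Scaling matters: K-means uses Euclidean distance, so features measured on
# large scales would otherwise dominate the clustering
X_scaled = StandardScaler().fit_transform(X)

# Compare silhouette scores across k (closer to 1 means tighter, better-separated clusters)
sil = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=5).fit_predict(X_scaled)
    sil[k] = silhouette_score(X_scaled, labels)

best_k = max(sil, key=sil.get)   # expected to recover the 3 planted blobs
```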
Topic 2- Hierarchical Clustering and PCA
Hierarchical Clustering and PCA: Hierarchical clustering organizes data into a tree-like structure
of nested clusters, while dimensionality reduction techniques are used to transform data into a
lower-dimensional space while retaining the most important information in it. This module
covers the business applications of hierarchical clustering and how to reduce the dimension of
data using PCA to aid in visualization and feature selection of multivariate datasets.
Concepts Used:
Hierarchical Clustering
Cophenetic Correlation
Introduction to Dimensionality Reduction
Principal Component Analysis
Learning Outcomes: Learn how to apply the hierarchical clustering technique to group similar
data points together and discover underlying patterns, understand the need for reducing
dimensions of the data, and understand the working of the PCA and how to transform data into
fewer dimensions using PCA.
Case Study: Tourism Services- Tourism is now recognized as a directly measurable activity,
enabling more accurate analysis and more effective tourism policies. Whereas previously the
sector relied mostly on approximations from related areas of measurement (e.g. Balance of
Payments statistics), tourism nowadays is a productive activity that can be analyzed using
factors like economic indicators, social indicators, environmental and infrastructure indicators,
etc. The task is to analyze several of these factors and group countries based on them to help
understand the key locations where the company can invest to promote tourism services.
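Hierarchical clustering with the cophenetic correlation, plus PCA for dimensionality reduction, can be sketched as follows. The "countries" and their six indicators are simulated from two latent factors, so roughly two principal components should capture most of the variance by construction:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# 40 "countries" described by 6 correlated indicators (simulated)
base = rng.normal(size=(40, 2))                       # 2 latent factors
X = np.hstack([base, base @ rng.normal(size=(2, 4))]) + 0.1 * rng.normal(size=(40, 6))

X_scaled = StandardScaler().fit_transform(X)

# Hierarchical clustering; the cophenetic correlation measures how faithfully
# the dendrogram preserves the original pairwise distances
Z = linkage(X_scaled, method="average")
coph_corr, _ = cophenet(Z, pdist(X_scaled))

# PCA: project the 6 indicators onto 2 components and check variance retained
pca = PCA(n_components=2).fit(X_scaled)
explained = pca.explained_variance_ratio_.sum()
```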
ENHANCE KNOWLEDGE WITH
SELF-PACED MODULES
The self-paced modules cater to skills that are complementary to those learnt in the guided
modules. Since not all learners need or want to learn them, they have been kept as self-paced
modules. All these modules have similar high-quality recorded video lectures by UT Austin
faculty, global academicians, and industry experts, but do not have mentorship sessions. You
can learn them at your own pace and schedule, based on your interests and the current and
future demands of your role.
Pre-Work
Gain a fundamental understanding of the basics of Python programming and build a strong
foundation of coding to build Data Science applications.
Generative AI
Get an overview of Generative AI, what ChatGPT is and how it works, delve into the business
applications of ChatGPT, and get an overview of other generative AI models/tools via
demonstrations.
BUILD INDUSTRY-RELEVANT SKILLS WITH
HANDS-ON PROJECTS
7 hands-on projects that will help you with:
Practical Learning
Skill Development
Portfolio Enhancement
READY TO ADVANCE YOUR CAREER?
APPLY NOW
CONTACT US
+1 512 793 9938
https://onlineexeced.mccombs.utexas.edu/online-data-science-business-analytics-course