AI Capstone Project - Notes-Part2

1. The document describes the stages of the data science methodology, including business understanding, data collection, data preparation, model training, evaluation, and deployment.
2. It explains the train-test split evaluation technique, in which the dataset is divided into training and test sets. Common split percentages are 80-20, 67-33, and 50-50.
3. Cross-validation is described as a resampling technique for evaluation in which the dataset is divided into k folds and each fold is used once as the validation set. It is more reliable than the train-test split but takes longer.

RAJAGIRI PUBLIC SCHOOL

DOHA, QATAR

Grade-12

843- Artificial Intelligence

Ch:1 Capstone Project -Part2

1. Draw the diagram of the Data Science Methodology and explain each stage.

Reference : Data Science Methodology 101. How can a Data Scientist organize his… | by Nunzio Logallo
| Towards Data Science

1. Business understanding
• What problem are you trying to solve?
• Every project, whatever its size, begins with an understanding of the
business.
• Business partners who need the analytics solution play a critical role
in this phase by defining the problem, the project objectives, and the
solution requirements from a business perspective.
2. Analytic approach
• How can you use the data to answer the question?
• The problem must be expressed in the context of statistical learning to
identify the appropriate machine learning techniques to achieve the
desired result.
3. Data requirements
• What data do you need to answer the question?
• The analytic approach determines the data requirements: the specific
content, formats, and data representations, based on domain
knowledge.
4. Data collection
• Where is the data coming from (identify all sources), and how
will you get it?
• The data scientist identifies and collects data
resources (structured, unstructured, and semi-structured) that
are relevant to the problem area.
• If the data scientist finds gaps in the data collection, they may need
to revise the data requirements and collect more data.
5.Data understanding
• Is the data that you collected representative of the problem to be
solved?
• Descriptive statistics and visualization techniques can help a data
scientist understand the content of the data, assess its quality, and
obtain initial information about the data.
6. Data preparation
• What additional work is required to manipulate and work with the
data?
• The Data preparation step includes all the activities used to create
the data set used during the modeling phase.
• This includes cleansing data, combining data from multiple
sources, and transforming data into more useful variables.
• In addition, feature engineering and text analysis can be used to
derive new structured variables to enrich all predictors and improve
model accuracy.
7. Model training
• In what way can the data be visualized to get the answer that is
required?
• From the first version of the prepared data set, data scientists use a
training dataset (historical data in which the desired result is
known) to develop predictive or descriptive models.
• The modeling process is very iterative.
8.Model Evaluation
• Does the model used really answer the initial question or does it
need to be adjusted?
• The Data Scientist evaluates the quality of the model and verifies that
the business problem is handled in a complete and adequate manner.
9.Deployment
• Can you put the model into practice?
• Once a satisfactory model has been developed and approved by the
business sponsors, it is deployed in the production
environment or in a comparable test environment.
10.Feedback
• Can you get constructive feedback into answering the question?
• By collecting the results of the implemented model, the
organization receives feedback on the performance of the model
and its impact on the implementation environment.

2. Explain Train-Test Split Evaluation?

• The train-test split is a technique for evaluating the performance of a
machine learning algorithm.
• It can be used for classification or regression problems, and for any
supervised learning algorithm.
• The procedure involves taking a dataset and dividing it into two subsets.
• The first subset is used to fit the model and is referred to as the training
dataset.
• The second subset is not used to train the model; it is used to evaluate the
fitted machine learning model. It is referred to as the test dataset.
3. How will you configure the train-test split procedure?

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

OR

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)

• The procedure has one main configuration parameter, which is the size of
the train and test sets.
• This is most commonly expressed as a fraction between 0 and 1 for
either the train or test dataset.
• For example, a training set size of 0.67 (67 percent) means that
the remaining 0.33 (33 percent) is assigned to the test set.
• There is no optimal split percentage.
Nevertheless, common split percentages include:
• Train: 80%, Test: 20%
• Train: 67%, Test: 33%
• Train: 50%, Test: 50%
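As a hedged sketch, the configuration above can be run end to end (the dataset below is synthetic, invented purely for illustration):

```python
# Sketch: configuring the train-test split with scikit-learn
# on a synthetic dataset of 100 samples and 2 features.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 samples, 2 features
y = np.arange(100)                  # 100 target values

# test_size=0.33 asks for roughly 33% of the samples in the test set;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

print(X_train.shape, X_test.shape)
```

Every sample lands in exactly one of the two subsets, which is why the train and test sizes always add up to the full dataset.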
4. What are the considerations to choose split percentage in train-test-split
procedure?
• Computational cost in training the model.
• Computational cost in evaluating the model.
• Training set representativeness.
• Test set representativeness.
5. Explain cross validation?
• It is a resampling technique for evaluating machine learning models on a
sample of data.
• The procedure includes a parameter k, which specifies the number of groups
into which a given data sample should be divided.
• The procedure is referred to as k-fold cross-validation.
• It is more reliable than the train-test split, though it takes longer to run.
• For example, we could have 5 folds or experiments (k=5). We divide the data
into 5 pieces, each being 20% of the full dataset.
• During the first iteration (Experiment 1), the first fold (piece) is used as
the holdout set (test data/validation data) and everything else as training
data.
• During the second iteration (Experiment 2), the second fold (piece) is
used as the holdout set (test data/validation data) and everything else
as training data.
• We repeat this process, using every fold once as the holdout. Putting
this together, 100% of the data is used as a holdout at some point.
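The k=5 procedure above can be sketched with scikit-learn's KFold (the 20-sample dataset is an invented example; with k=5 each fold holds 4 samples):

```python
# Sketch: 5-fold cross-validation, where each fold serves as the
# holdout (test) set exactly once.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)        # 20 synthetic samples
kf = KFold(n_splits=5)   # k = 5 folds of 4 samples each

holdout_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    holdout_indices.extend(test_idx)  # collect each experiment's holdout
    print(f"Experiment {fold}: holdout = {test_idx}")
```

After all five experiments, the collected holdout indices cover the full dataset, which is the "100% of the data is used as a holdout at some point" property.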
6. Explain the difference between cross validation and train-test split?
• On small datasets, the extra computational burden of running
cross-validation isn't a big deal, so if your dataset is smaller, you
should run cross-validation.
• If your dataset is larger, you can use the train-test split method.

7. What are hyperparameters?

Hyperparameters are parameters whose values govern the learning
process. They also influence the model parameters that a
learning algorithm ends up learning.
Eg: the ratio of the train-test split, the number of hidden layers in a neural
network, the number of clusters in a clustering task.
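As a concrete sketch of the distinction (scikit-learn's LinearRegression is an assumption here, chosen only for illustration): `fit_intercept` is a hyperparameter fixed before training, while `coef_` and `intercept_` are model parameters learned from the data.

```python
# Sketch: hyperparameters vs. learned model parameters.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0  # synthetic data generated from y = 2x + 1

# fit_intercept is a hyperparameter: it is chosen before learning starts
# and governs how the model is trained.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are model parameters: their values are
# learned from the data by the algorithm.
print(model.coef_, model.intercept_)
```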
8. How are MSE and RMSE related? What is their range? Are they sensitive
to outliers?
MSE: One of the most used regression loss functions is MSE. Mean
Squared Error, also known as L2 loss, is calculated by squaring the
difference between the predicted and actual values and averaging the
result over the dataset: MSE = (1/n) * Σ (y_i − ŷ_i)^2.

• Squaring the error gives outliers more weight, while still producing a
smooth gradient for minor errors.
• Because the errors are squared, MSE can never be negative. The
error value ranges from 0 to infinity.
• The MSE grows quadratically as the error grows. An MSE value close
to zero indicates a good model.
• Because substantial errors receive additional weight, MSE is
especially sensitive to outliers in the data.

RMSE: RMSE is calculated as the square root of MSE: RMSE = √MSE. The Root
Mean Square Error is also known as the Root Mean Square Deviation (RMSD).
• An RMSE value of 0 implies that the model is perfectly fitted. The
model and its predictions perform better when the RMSE is low. A
greater RMSE indicates a substantial discrepancy between the
predictions and the ground truth.
• RMSE is expressed in the same units as the target variable, so what
counts as a good RMSE depends on the scale of the data rather than
on any fixed threshold.
9. What is loss function? What are the different categories of loss function?
• All the algorithms in machine learning rely on minimizing or
maximizing a function, which we call “objective function”.
• The group of functions that are minimized are called “loss
functions”.
• A loss function is a measure of how well a prediction model is able
to predict the expected outcome.
• Loss functions can be broadly categorized into 2 types: Classification
and Regression Loss.
Regression functions predict a quantity, and classification functions
predict a label.
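The two categories can be illustrated with a short sketch (the functions and sample values below are invented for illustration): a regression loss scores predicted quantities, while a classification loss scores predicted label probabilities.

```python
# Sketch: a regression loss (MSE) and a classification loss
# (binary cross-entropy), in plain Python.
import math

def mse(y_true, y_pred):
    # Regression loss: average squared difference between
    # actual and predicted quantities.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    # Classification loss: penalizes confident wrong
    # predicted label probabilities.
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)

print(mse([3.0, 5.0], [2.0, 5.0]))               # quantities
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # label probabilities
```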

10. Consider the following data:

x    y
40   42
42   45
44   47
46   44
48   50
50   48
52   49
54   50
58   55
60   58

Regression line equation: Y = 0.681x + 15.142. Calculate MSE and RMSE from
the above information.
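The requested values can be checked with a short script (a sketch; the rounded results follow from plugging each x into the given regression line and averaging the squared errors):

```python
# Sketch: MSE and RMSE for the regression line Y = 0.681x + 15.142
# on the table above.
import math

x = [40, 42, 44, 46, 48, 50, 52, 54, 58, 60]
y = [42, 45, 47, 44, 50, 48, 49, 50, 55, 58]

# Predicted values from the given regression line.
y_pred = [0.681 * xi + 15.142 for xi in x]

# MSE: mean of squared differences; RMSE: its square root.
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y)
rmse = math.sqrt(mse)

print(round(mse, 4), round(rmse, 4))  # approximately 2.7739 and 1.6655
```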
