Class 12 AI Capstone Project Notes

1. The document describes the stages of the data science methodology, including business understanding, data collection, data preparation, model training, evaluation and deployment.
2. It explains the train-test split evaluation technique, where the dataset is divided into training and test sets. Common split percentages are 80-20, 67-33 and 50-50.
3. Cross-validation is described as a resampling technique for evaluation where the dataset is divided into k folds and each fold is used once as the validation set. It is more reliable than train-test split but takes longer.
RAJAGIRI PUBLIC SCHOOL

DOHA, QATAR

Grade-12

843- Artificial Intelligence

Ch: 1 Capstone Project - Part 2

1. Draw the diagram of the Analytic Approach and explain each stage.

Reference: Data Science Methodology 101. How can a Data Scientist organize his… | by Nunzio Logallo | Towards Data Science

1. Business understanding
• What problem are you trying to solve?
• Every project, whatever its size, begins with an understanding of the business.
• Business partners who need the analytics solution play a critical role in this phase by defining the problem, the project objectives, and the solution requirements from a business perspective.
2. Analytic approach
• How can you use the data to answer the question?
• The problem must be expressed in the context of statistical learning to
identify the appropriate machine learning techniques to achieve the
desired result.
3. Data Requirements
• What data do you need to answer the question?
• The chosen analytic approach determines the data requirements - the specific content, formats, and data representations - based on domain knowledge.
4. Data collection
• Where is the data coming from (identify all sources) and how will you get it?
• The Data Scientist identifies and collects data resources (structured, unstructured and semi-structured) that are relevant to the problem area.
• If the Data Scientist finds gaps in the data collection, they may need to revise the data requirements and collect more data.
5. Data understanding
• Is the data that you collected representative of the problem to be solved?
• Descriptive statistics and visualization techniques can help a Data Scientist understand the content of the data, assess its quality, and obtain initial insights about the data.
6. Data preparation
• What additional work is required to manipulate and work with the
data?
• The Data preparation step includes all the activities used to create
the data set used during the modeling phase.
• This includes cleansing data, combining data from multiple
sources, and transforming data into more useful variables.
• In addition, feature engineering and text analysis can be used to derive new structured variables, enriching the set of predictors and improving model accuracy.
7. Model Training
• In what way can the data be visualized to get the answer that is required?
• Starting from the first version of the prepared data set, Data Scientists use a training dataset (historical data in which the desired outcome is known) to develop predictive or descriptive models.
• The modeling process is highly iterative.
8. Model Evaluation
• Does the model used really answer the initial question, or does it need to be adjusted?
• The Data Scientist evaluates the quality of the model and verifies that the business problem is addressed in a complete and adequate manner.
9. Deployment
• Can you put the model into practice?
• Once a satisfactory model has been developed and approved by the business sponsors, it is deployed in the production environment or in a comparable test environment.
10. Feedback
• Can you get constructive feedback towards answering the question?
• By collecting the results of the deployed model, the organization receives feedback on the performance of the model and its impact on the environment in which it was deployed.

2. Explain Train-Test Split Evaluation?

• The train-test split is a technique for evaluating the performance of a machine learning algorithm.
• It can be used for classification or regression problems, and for any supervised learning algorithm.
• The procedure involves taking a dataset and dividing it into two subsets.
• The first subset is used to fit the model and is referred to as the training dataset.
• The second subset is not used to train the model; instead, it is used to evaluate the fitted machine learning model. It is referred to as the testing dataset.
3. How will you configure the train-test split procedure?

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

OR

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)
• The procedure has one main configuration parameter, which is the size of
the train and test sets.
• This is most commonly expressed as a fraction between 0 and 1 for either the train or the test dataset.
• For example, a train size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
• There is no optimal split percentage.
Nevertheless, common split percentages include:
• Train: 80%, Test: 20%
• Train: 67%, Test: 33%
• Train: 50%, Test: 50%
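In practice the split is done with scikit-learn's train_test_split, as shown above. The idea itself can be sketched in plain Python; the function and variable names below (split_train_test, seed) are illustrative, not part of any library:

```python
import random

def split_train_test(rows, train_size=0.67, seed=42):
    """Shuffle the rows, then slice off the first train_size fraction
    as the training set and keep the remainder as the test set."""
    shuffled = rows[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))                     # 100 toy samples
train, test = split_train_test(data, train_size=0.67)
print(len(train), len(test))                # 67 33
```

With 100 samples and train_size=0.67, this reproduces the 67-33 split listed above.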
4. What are the considerations for choosing the split percentage in the train-test split procedure?
• Computational cost in training the model.
• Computational cost in evaluating the model.
• Training set representativeness.
• Test set representativeness.
5. Explain cross-validation?
• It is a resampling technique for evaluating machine learning models on a sample of data.
• The procedure has a parameter k, which specifies the number of groups into which a given data sample should be divided.
• The process is referred to as k-fold cross-validation.
• It is more reliable than a single train-test split, though it takes longer to run.
• For example, we could have 5 folds or experiments (k=5). We divide the data into 5 pieces, each being 20% of the full dataset.
• During the first iteration (Experiment 1), the first fold (piece) is used as the holdout set (test data/validation data) and everything else as training data.
• During the second iteration (Experiment 2), the second fold (piece) is used as the holdout set (test data/validation data) and everything else as training data.
• We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
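The fold rotation described above can be sketched as follows. This is a minimal illustration (assuming, for simplicity, that the number of samples divides evenly by k); scikit-learn's KFold handles the general case:

```python
def kfold_indices(n, k):
    """Yield (holdout, training) index lists for k-fold cross-validation:
    each fold serves exactly once as the holdout set."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        holdout = indices[i * fold_size:(i + 1) * fold_size]
        training = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield holdout, training

# 20 samples, k=5 -> five experiments, each holding out 4 samples (20%)
folds = list(kfold_indices(20, 5))
print(len(folds))       # 5 experiments
print(folds[0][0])      # [0, 1, 2, 3] is the first holdout
```

Collecting the holdout lists across all five experiments covers every sample exactly once, which is why 100% of the data serves as a holdout at some point.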
6. Explain the difference between cross-validation and train-test split?
• On small datasets, the extra computational burden of running cross-validation isn't a big deal, so if your dataset is smaller, you should run cross-validation.
• If your dataset is larger, you can use the train-test split method.

7. What are hyperparameters?

Hyperparameters are parameters whose values govern the learning process. They influence the model parameters that the learning algorithm ends up learning.
Eg: the ratio of the train-test split, the number of hidden layers in a neural network, the number of clusters in a clustering task.
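To make the distinction concrete: hyperparameters are chosen by hand before learning starts, while model parameters are found by the algorithm from the data. A minimal sketch, assuming an ordinary least-squares line fit (fit_line and TRAIN_SIZE are illustrative names):

```python
def fit_line(xs, ys):
    """Learn slope and intercept by ordinary least squares; these are
    model PARAMETERS - the algorithm computes them from the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

TRAIN_SIZE = 0.67          # a HYPERPARAMETER: set before learning begins
xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)    # 2.0 0.0
```

Changing TRAIN_SIZE changes how the experiment is run; changing the data changes what slope and intercept the algorithm learns.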
8. How are MSE and RMSE related? What is their range? Are they sensitive to outliers?
MSE: One of the most widely used regression loss functions is MSE. Mean Squared Error, also known as L2 loss, is computed by squaring the difference between the predicted and actual values and averaging it over the dataset.

• Squaring the error gives outliers more weight, while still yielding a smooth gradient for small errors.
• Because the errors are squared, MSE can never be negative. The error value ranges from 0 to infinity.
• The MSE grows quadratically as the error grows. An MSE value close to zero indicates a good model.
• Because it gives outliers additional weight, MSE is especially useful when large errors must be penalized heavily.

RMSE: RMSE is calculated as the square root of MSE. Root Mean Square Deviation (RMSD) is another name for the Root Mean Square Error.
• An RMSE value of 0 implies that the model is perfectly fitted. The lower the RMSE, the better the model and its predictions. A greater RMSE indicates a substantial discrepancy between the predictions and the ground truth.
• RMSE is expressed in the same units as the target variable, so what counts as a "good" RMSE depends on the scale of the data.
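The two definitions above translate directly into code. A minimal sketch, assuming paired lists of actual and predicted values:

```python
import math

def mse(actual, predicted):
    """Mean Squared Error: average of the squared differences (L2 loss)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of MSE, in the target's units."""
    return math.sqrt(mse(actual, predicted))

print(mse([3, 5, 7], [2, 5, 9]))   # (1 + 0 + 4) / 3 = 1.666...
print(rmse([3, 5, 7], [2, 5, 9]))  # sqrt(5/3), about 1.29
```

Note how the single error of 2 contributes four times as much to the MSE as the error of 1 - this is the outlier sensitivity described above.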
9. What is a loss function? What are the different categories of loss functions?
• All algorithms in machine learning rely on minimizing or maximizing a function, which we call the "objective function".
• The group of functions that are minimized are called "loss functions".
• A loss function is a measure of how well a prediction model is able to predict the expected outcome.
• Loss functions can be broadly categorized into 2 types: classification loss and regression loss. Regression functions predict a quantity, and classification functions predict a label.

10. Consider the following data:

x    y
40   42
42   45
44   47
46   44
48   50
50   48
52   49
54   50
58   55
60   58

Regression line equation: Y = 0.681x + 15.142. Calculate MSE and RMSE from the above information.
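The calculation for Question 10 can be checked with a short script, using the regression line Y = 0.681x + 15.142 given above:

```python
import math

x = [40, 42, 44, 46, 48, 50, 52, 54, 58, 60]
y = [42, 45, 47, 44, 50, 48, 49, 50, 55, 58]

predicted = [0.681 * xi + 15.142 for xi in x]            # Y = 0.681x + 15.142
squared_errors = [(yi - pi) ** 2 for yi, pi in zip(y, predicted)]

mse = sum(squared_errors) / len(x)
rmse = math.sqrt(mse)
print(round(mse, 3), round(rmse, 3))   # 2.774 1.666
```

So MSE is approximately 2.774 and RMSE approximately 1.666 for this data.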
