AI Capstone Project - Notes-Part2
DOHA, QATAR
Grade-12
Reference: Data Science Methodology 101. How can a Data Scientist organize his… | by Nunzio Logallo | Towards Data Science
1. Business understanding
• What problem are you trying to solve?
• Every project, whatever its size, begins with the understanding of the
business.
• Business partners who need the analytics solution play a critical role
in this phase by defining the problem, the project objectives, and the
solution requirements from a business perspective.
2. Analytic approach
• How can you use the data to answer the question?
• The problem must be expressed in the context of statistical learning to
identify the appropriate machine learning techniques to achieve the
desired result.
3. Data Requirements
• What data do you need to answer the question?
• The analytic approach determines the data requirements: specific content, formats, and data representations, based on domain knowledge.
4. Data collection
• Where is the data coming from (identify all sources) and how
will you get it?
• The Data Scientist identifies and collects data
resources (structured, unstructured and semi-structured) that
are relevant to the problem area.
• If the data scientist finds gaps in the data collection, he may need
to review the data requirements and collect more data.
5. Data understanding
• Is the data that you collected representative of the problem to be
solved?
• Descriptive statistics and visualization techniques can help a data scientist understand the content of the data, assess its quality, and obtain initial insights about the data (see the sketch below).
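As an illustration only, a minimal data understanding sketch in pandas is given below; the CSV file name and its columns are hypothetical placeholders, not part of these notes.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: the file name and columns are placeholders.
df = pd.read_csv("customer_data.csv")

# Descriptive statistics for every numeric column (count, mean, std, min, quartiles, max).
print(df.describe())

# Quality checks: column types, non-null counts, and missing values per column.
df.info()
print(df.isna().sum())

# Quick visual overview of the distributions to obtain initial insights.
df.hist(figsize=(10, 6))
plt.show()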
6. Data preparation
• What additional work is required to manipulate and work with the
data?
• The Data preparation step includes all the activities used to create
the data set used during the modeling phase.
• This includes cleansing data, combining data from multiple
sources, and transforming data into more useful variables.
• In addition, feature engineering and text analysis can be used to derive new structured variables that enrich the set of predictors and improve model accuracy (a small data preparation sketch follows this list).
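A minimal sketch of the data preparation activities listed above, using pandas; the file names and column names are hypothetical examples, not taken from these notes.

import pandas as pd

# Hypothetical files and columns, used only to illustrate the activities above.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Cleansing: remove duplicate rows and fill missing values.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Combining data from multiple sources.
data = orders.merge(customers, on="customer_id", how="left")

# Transforming data into more useful variables (simple feature engineering).
data["order_month"] = pd.to_datetime(data["order_date"]).dt.month
data["is_large_order"] = (data["amount"] > data["amount"].quantile(0.90)).astype(int)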
7. Model Training
• In what way can the data be visualized to get the answer that is required?
• From the first version of the prepared data set, data scientists use a training dataset (historical data in which the desired result is known) to develop predictive or descriptive models.
• The modeling process is very iterative.
8. Model Evaluation
• Does the model used really answer the initial question or does it
need to be adjusted?
• The Data Scientist evaluates the quality of the model and verifies that
the business problem is handled in a complete and adequate manner.
9. Deployment
• Can you put the model into practice?
• Once a satisfactory model has been developed and approved by the business sponsors, it is deployed in the production environment or in a comparable test environment.
10. Feedback
• Can you get constructive feedback into answering the question?
• By collecting the results of the implemented model, the
organization receives feedback on the performance of the model
and its impact on the implementation environment.
• The procedure involves taking a dataset and dividing it into two subsets.
• The first subset is used to fit the model and is referred to as the training
dataset.
• The second subset is not used to train the model; instead, it is used to evaluate the fitted machine learning model. It is referred to as the test dataset.
3. How will you configure the train-test split procedure?
• The procedure has one main configuration parameter, which is the size of the train and test sets.
• This is most commonly expressed as a fraction between 0 and 1 for either the train or the test set.
• For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set (see the sketch after this list).
• There is no optimal split percentage.
Nevertheless, common split percentages include:
• Train: 80%, Test: 20%
• Train: 67%, Test: 33%
• Train: 50%, Test: 50%
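A minimal sketch of the procedure with scikit-learn's train_test_split; the synthetic dataset and the 67/33 split are illustrative assumptions, not part of these notes.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the example is self-contained.
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)

# test_size=0.33 assigns 33% of the rows to the test set, leaving 67% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit on the training set only, then evaluate on the unseen test set.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))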
4. What are the considerations for choosing the split percentage in the train-test split procedure?
• Computational cost in training the model.
• Computational cost in evaluating the model.
• Training set representativeness.
• Test set representativeness.
5. Explain cross-validation.
• It is a resampling technique for evaluating machine learning models on a sample of data.
• The process includes a parameter k, which specifies the number of groups into which a given data sample should be divided.
• The process is referred to as k-fold cross-validation.
• More reliable, though it takes longer to run.
• For example, we could have 5 folds or experiments (k=5). We divide the data into 5 pieces, each being 20% of the full dataset.
• During the first iteration (Experiment 1), the first fold (piece) is used as the holdout set (test data/validation data) and everything else as training data.
• We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point (a k-fold sketch follows this list).
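A minimal k-fold sketch with scikit-learn, assuming k=5 and a synthetic dataset for illustration.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only so the example is self-contained.
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# k=5: the data is divided into 5 folds; each fold serves exactly once as the holdout set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print("Score per fold (5 experiments):", scores)
print("Mean score across folds:", scores.mean())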
6. Explain the difference between cross-validation and train-test split.
• On small datasets, the extra computational burden of running cross-validation isn't a big deal, so if your dataset is smaller, you should run cross-validation.
• If your dataset is larger, you can use the train-test split method.
RMSE: RMSE is calculated as the square root of the MSE. Root Mean Square Deviation (RMSD) is another name for the Root Mean Square Error.
• An RMSE value of 0 implies that the model is perfectly fitted. The model and its predictions perform better when the RMSE is low, while a higher RMSE indicates a larger discrepancy between the predictions and the ground truth.
• A commonly quoted rule of thumb is that the RMSE of a good model should be less than 180; note that RMSE is on the same scale as the target variable, so such a threshold is only meaningful for a particular dataset (the formulas are given below).
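For reference, the standard formulas, written here for n observations with actual values y_i and predictions \hat{y}_i (standard notation, not taken from these notes):

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}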
9. What is a loss function? What are the different categories of loss functions?
• All the algorithms in machine learning rely on minimizing or
maximizing a function, which we call “objective function”.
• The functions that are minimized are called “loss functions”.
• A loss function is a measure of how good a prediction model does in
terms of being able to predict the expected outcome.
• Loss functions can be broadly categorized into 2 types: classification loss and regression loss.
• Regression functions predict a quantity, and classification functions predict a label (see the sketch below).
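A minimal sketch of one loss from each category, using small made-up arrays (the values are illustrative only):

import numpy as np

# Regression loss: mean squared error between true and predicted quantities.
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
mse = np.mean((y_true - y_pred) ** 2)
print("Regression loss (MSE):", mse)

# Classification loss: binary cross-entropy between true labels and predicted probabilities.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print("Classification loss (binary cross-entropy):", bce)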
x     Y
40    42
42    45
44    47
46    44
48    50
50    48
52    49
54    50
58    55
60    58
Regression line equation: Y = 0.681x + 15.142. Calculate MSE and RMSE from the above information (a worked sketch follows).
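A worked sketch of the calculation in Python, using the table and regression line above; the printed results follow directly from those values.

import numpy as np

# Data from the table above.
x = np.array([40, 42, 44, 46, 48, 50, 52, 54, 58, 60])
y = np.array([42, 45, 47, 44, 50, 48, 49, 50, 55, 58])

# Predicted values from the given regression line Y = 0.681x + 15.142.
y_pred = 0.681 * x + 15.142

# MSE is the mean of the squared residuals; RMSE is its square root.
mse = np.mean((y - y_pred) ** 2)
rmse = np.sqrt(mse)

print("MSE:", round(mse, 3))    # approx. 2.774
print("RMSE:", round(rmse, 3))  # approx. 1.666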