# Exercise: Underfitting and Overfitting
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":
{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-
python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","vers
ion":"3.6.5"},"kaggle":{"accelerator":"none","dataSources":
[{"sourceId":10211,"databundleVersionId":111096,"sourceType":"competition"},
{"sourceId":15520,"sourceType":"datasetVersion","datasetId":11167},
{"sourceId":38454,"sourceType":"datasetVersion","datasetId":2709}],"isInternetEnabled":f
alse,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_mino
r":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"**This notebook is an exercise
in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-
learning) course. You can reference the tutorial at [this
link](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).**\n\n---\
n","metadata":{}},{"cell_type":"markdown","source":"## Recap\nYou've built your first
model, and now it's time to optimize the size of the tree to make better predictions. Run
this cell to set up your coding environment where the previous step left off.","metadata":
{}},{"cell_type":"code","source":"# Code you have previously used to load data\nimport
pandas as pd\nfrom sklearn.metrics import mean_absolute_error\nfrom
sklearn.model_selection import train_test_split\nfrom sklearn.tree import
DecisionTreeRegressor\n\n\n# Path of the file to read\niowa_file_path = '../input/home-
data-for-ml-course/train.csv'\n\nhome_data = pd.read_csv(iowa_file_path)\n# Create target
object and call it y\ny = home_data.SalePrice\n# Create X\nfeatures = ['LotArea',
'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\nX =
home_data[features]\n\n# Split into validation and training data\ntrain_X, val_X, train_y,
val_y = train_test_split(X, y, random_state=1)\n\n# Specify Model\niowa_model =
DecisionTreeRegressor(random_state=1)\n# Fit Model\niowa_model.fit(train_X, train_y)\n\
n# Make validation predictions and calculate mean absolute error\nval_predictions =
iowa_model.predict(val_X)\nval_mae = mean_absolute_error(val_predictions, val_y)\
nprint(\"Validation MAE: {:,.0f}\".format(val_mae))\n\n# Set up code checking\nfrom
learntools.core import binder\nbinder.bind(globals())\nfrom
learntools.machine_learning.ex5 import *\nprint(\"\\nSetup complete\")","metadata":
{"collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","source":"# Exercises\nYou could write the function `get_mae`
yourself. For now, we'll supply it. This is the same function you read about in the previous
lesson. Just run the cell below.","metadata":{}},{"cell_type":"code","source":"def
get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):\n model =
DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)\n
model.fit(train_X, train_y)\n preds_val = model.predict(val_X)\n mae =
mean_absolute_error(val_y, preds_val)\n return(mae)","metadata":
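If you want to sanity-check the helper before starting, you can call it once with an arbitrary tree size (a minimal sketch; the exact MAE depends on your train/validation split):

```python
# Smoke test: one call with a single candidate size. The printed value
# varies with the data split, so treat it only as a sanity check.
print(get_mae(100, train_X, val_X, train_y, val_y))
```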
{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Step 1:
Compare Different Tree Sizes\nWrite a loop that tries the following values for
*max_leaf_nodes* from a set of possible values.\n\nCall the *get_mae* function on each
value of max_leaf_nodes. Store the output in some way that allows you to select the value of
`max_leaf_nodes` that gives the most accurate model on your data.","metadata":{}},
{"cell_type":"code","source":" candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]\n#
Write loop to find the ideal tree size from candidate_max_leaf_nodes\nvalues=[]\n\n# Store
the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)\n\nfor l in
candidate_max_leaf_nodes:\n values.append(get_mae(l,train_X,val_X,train_y,val_y))\n if
get_mae(l,train_X,val_X,train_y,val_y)==min(values):\n best_tree_size=l\n \
nprint(best_tree_size)\n \n \n# Check your answer\nstep_1.check()","metadata":
{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# The lines below will
show you a hint or the solution.\n# step_1.hint() \n# step_1.solution()","metadata":
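To see the underfitting/overfitting tradeoff that gives this lesson its name, you can print the validation MAE for each candidate size (a small sketch reusing the `scores` dict built above):

```python
# MAE typically falls and then rises as trees grow: small trees underfit,
# large trees overfit, and the best size sits somewhere in between.
for leaf_size, mae in scores.items():
    print(f"max_leaf_nodes: {leaf_size:>4}\tValidation MAE: {mae:,.0f}")
```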
{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Step 2: Fit
Model Using All Data\nYou know the best tree size. If you were going to deploy this model
in practice, you would make it even more accurate by using all of the data and keeping that
tree size. That is, you don't need to hold out the validation data now that you've made all
your modeling decisions.","metadata":{}},{"cell_type":"code","source":"# Fill in argument
to make optimal size and uncomment\nfinal_model =
DecisionTreeRegressor(max_leaf_nodes=best_tree_size,random_state=1)\n\n# fit the final
model and uncomment the next two lines\nfinal_model.fit(X,y)\n\n# Check your answer\
nstep_2.check()","metadata":{},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":"# step_2.hint()\n# step_2.solution()","metadata":
{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"You've tuned
this model and improved your results. But we are still using Decision Tree models, which
are not very sophisticated by modern machine learning standards. In the next step you will
learn to use Random Forests to improve your models even more.\n\n# Keep Going\n\nYou
are ready for **[Random Forests](https://www.kaggle.com/dansbecker/random-forests).**\
n","metadata":{}},{"cell_type":"markdown","source":"---\n\n\n\n\n*Have questions or
comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-
machine-learning/discussion) to chat with other learners.*","metadata":{}}]}