References:
- https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial
- https://realpython.com/python-data-version-control/
- https://github.com/realpython/data-version-control
- https://github.com/PeterFogh/dvc_dask_use_case
Use this template to start new DVC workspaces/experiments. We are using Data Version Control (https://dvc.org) to:
- track and version our datasets and models
- share development server resources
- create reproducible machine learning experiments
To create a new DVC workspace/experiment, click the green "Use this template" button at the top of this repository's GitHub page. Give the new repo a name, then git clone a local copy.
Folder structure:
$ tree expt01
bacteria
├── LICENSE
├── README.md
├── data
│   ├── prepared
│   └── raw
├── metrics
├── model
└── src
    ├── evaluate.py
    ├── prepare.py
    └── train.py
Breakdown:
- src/ for source code
- data/ for all versions of the dataset
- data/raw/ for data obtained from external sources
- data/prepared/ for data modified internally
- model/ for machine learning models
- metrics/ for tracking performance metrics of models
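As a quick sanity check, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the template: the helper name `missing_dirs` and the `EXPECTED_DIRS` tuple are hypothetical, and the metrics folder is assumed to sit at the top level, as in the tree output.

```python
from pathlib import Path

# Folders taken from the tree listing above (assumption: metrics/ is top-level).
# Adjust this tuple if your workspace layout differs.
EXPECTED_DIRS = ("data/raw", "data/prepared", "metrics", "model", "src")


def missing_dirs(root):
    """Return the expected folders that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the root of the cloned workspace.
    for d in missing_dirs("."):
        print(f"missing: {d}")
```

An empty result means every expected folder is present. Git does not track empty directories, so some of them may need to be created after cloning.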
The src/ folder has three sample Python files:
- prepare.py to prepare data for training
- train.py to train a machine learning model
- evaluate.py to evaluate the results of a model
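The three scripts are meant to run in sequence: prepare, then train, then evaluate. A minimal sketch of a runner follows, assuming each script takes no command-line arguments and is launched from the workspace root. The `stage_commands` and `run_pipeline` helpers are hypothetical and not part of the template.

```python
import subprocess
import sys

# Order inferred from the script descriptions: prepare, then train, then evaluate.
STAGES = ("src/prepare.py", "src/train.py", "src/evaluate.py")


def stage_commands(python=sys.executable):
    """Build one command per stage, using the current interpreter by default."""
    return [[python, script] for script in STAGES]


def run_pipeline():
    """Run each stage in order and stop at the first failure."""
    for cmd in stage_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the workspace root runs the three stages in order. If your scripts expect arguments, extend the command lists accordingly.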
Download the raw datasets and place them in data/raw/ (for an example, see https://realpython.com/python-data-version-control/#set-up-your-working-environment).
If the datasets are already version controlled with DVC, we can pull them in too (TODO).
Forked from https://github.com/realpython/data-version-control
Original README:
Example repository for the Data Version Control With Python and DVC article on Real Python.
To use this repo as part of the tutorial, you first need to get your own copy. Click the Fork button in the top-right corner of the screen, and select your private account in the window that pops up. GitHub will create a forked copy of the repository under your account.
Clone the forked repository to your computer with the git clone command:
git clone git@github.com:YourUsername/data-version-control.git
Make sure to replace YourUsername in the above command with your actual GitHub username.
Happy coding!