There are rough edges in the code as well as in the report, but for a weekend project I had to be realistic.
The assignment PDF is in the docs folder.
The report is in output/Report.pdf.
The code is in a private repo: https://github.com/homoludens/dialectica_assignment
The notebooks folder contains notebooks for exploring the data and code.
The src folder contains productionized code created from the notebooks. As always, it could be much better.
The output folder will contain the results of running the code, including the report PDF.
There are two ways to run the project: in Docker or in a Python virtual environment.
A Makefile is provided as a helper for running commands; you can also read it as documentation. Run
make help
to see available make commands.
One option is Docker. To have CUDA available in Docker you will need nvidia-container-toolkit installed on the host machine.
make nvidia
might help with that. It will ask for sudo permissions. Please check the command yourself!
Clear previous runs (the output directory needs to be empty) with:
make clean
Build and run the Docker container:
make docker_build
make docker_run
The first run will run training and generate reports in the output folder.
Jupyter Lab will be accessible on port 8899: http://localhost:8899
Another option is to run the project in a Python virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python ./src/model_development/main.py
If you want to explore the notebooks, start Jupyter Lab:
jupyter-lab
After starting Docker with make docker_start, go into its bash shell with make docker_bash, then use nbconvert:
jupyter nbconvert --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}' --to pdf ./notebooks/Report.ipynb --output-dir './output/' --no-input
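To clarify what that flag does: nbconvert's TagRemovePreprocessor drops every cell carrying one of the listed tags from the exported PDF. The sketch below (hypothetical notebook content, stdlib only) shows the filtering idea on a minimal notebook structure:

```python
# A minimal notebook dict: one cell tagged "exclude_output", one untagged.
# Cells tagged "exclude_output" are the ones that
# --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}' removes.
notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["exclude_output"]},
         "source": ["df.head()"], "outputs": []},
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Results"]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

def kept_cells(nb, remove_tags=frozenset({"exclude_output"})):
    """Keep only cells whose tags do not intersect remove_tags."""
    return [c for c in nb["cells"]
            if not remove_tags & set(c["metadata"].get("tags", []))]

print(len(kept_cells(notebook)))  # 1 (only the markdown cell survives)
```

Tag cells in Jupyter Lab via the property inspector (cell metadata), and the exported report will omit them.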
For reference, these are (roughly) the commands behind make nvidia; verify them before running:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
And roughly the underlying commands of the Docker make targets:
docker rm companyclass
docker build -t companyclass-notebook docker
docker run --name companyclass -it -v "${PWD}":/workspace companyclass-notebook
This was made with a GCP T4 in mind: it is available for free in Google Colab and is the cheapest GPU on GCP, with 16 GB of RAM and 16 GB of GPU memory.
In load_and_split_data I am using test_size=0.02, train_size=0.2 to reduce memory and computing time requirements.
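To illustrate what those fractions mean: only 20% of the dataset is used for training and 2% for evaluation. A stdlib-only sketch (the project itself may use sklearn's train_test_split; this just shows the arithmetic):

```python
import random

def subsample_split(records, train_size=0.2, test_size=0.02, seed=42):
    """Shuffle, then take train_size and test_size fractions of the full
    dataset as disjoint train/test subsets. Sketch of the idea behind
    load_and_split_data's parameters, not the actual implementation."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(records) * train_size)
    n_test = int(len(records) * test_size)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

train, test = subsample_split(list(range(10_000)))
print(len(train), len(test))  # 2000 200
```

Raising train_size toward 1.0 will improve the model at the cost of memory and training time.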
Set adequate per_device_train_batch_size and per_device_eval_batch_size in the train_bert_model function to accommodate your GPU RAM.
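One hypothetical way to pick those values (the constants below are illustrative, not measured; profile your own setup):

```python
def pick_batch_size(gpu_mem_gb, per_sample_gb=0.9):
    """Hypothetical heuristic: largest power-of-two batch that fits in GPU
    memory, given a rough per-example memory footprint for BERT fine-tuning.
    Measure per_sample_gb on your own model and sequence length."""
    bs = 1
    while bs * 2 * per_sample_gb <= gpu_mem_gb:
        bs *= 2
    return bs

# On a 16 GB T4 with the assumed footprint this suggests a batch size of 16;
# pass the value to transformers.TrainingArguments(
#     per_device_train_batch_size=..., per_device_eval_batch_size=...).
print(pick_batch_size(16))  # 16
```

If you hit CUDA out-of-memory errors, halve the batch size (and optionally compensate with gradient accumulation).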
Generating the report is done via Jupyter and LaTeX, so both need to be installed. This is set up in Docker; for a local machine:
jupyter nbconvert --template article --PDFExporter.latex_command="['pdflatex', '{input}', '-interaction=nonstopmode', '-geometry=landscape']" --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}' --to pdf Report.ipynb --no-input