Company Classifier

The code and the report both have rough edges. This was a weekend project, so I had to be realistic about scope.

Dialectica - Data Science Assignment

The assignment PDF is in the docs folder.

The report is in output/Report.pdf.

The code is in a private repo: https://github.com/homoludens/dialectica_assignment

Folder structure

The notebooks folder contains notebooks for exploring the data and the code.

The src folder contains productionized code derived from the notebooks. As always, it could be much better.

The output folder will contain the results produced by the code, including the report PDF.

How to run project

There are two ways to run the project: in Docker or in a Python virtual environment.

A Makefile serves as a helper for running commands, and it also works as documentation. Run

make help

to see available make commands.

Docker

One option is Docker. To have CUDA available inside the container, you need nvidia-container-toolkit installed on the host machine.

make nvidia

might help with that. It will ask for sudo permissions, so please check the command yourself before running it!

Clear previous runs (the output directory needs to be empty) with:

make clean
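
Since a run expects an empty output directory, a quick pre-flight check can catch leftovers before starting. This is a minimal Python sketch, not part of the repository: the helper name output_is_empty is hypothetical, and treating a missing directory as empty is an assumption.

```python
from pathlib import Path


def output_is_empty(output_dir: str = "output") -> bool:
    """Return True if the directory is missing or contains no entries.

    Assumption: a missing directory counts as empty.
    """
    path = Path(output_dir)
    if not path.exists():
        return True
    # any() stops at the first entry, so large directories are cheap to check
    return not any(path.iterdir())


if __name__ == "__main__":
    if output_is_empty():
        print("output is empty, ready to run")
    else:
        print("output is not empty, run `make clean` first")
```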

Build and run the Docker container:

make docker_build
make docker_run

The first run trains the models and generates the reports in the output folder.

Jupyter Lab will be accessible on port 8899: http://localhost:8899

virtualenv

Another option is to run the project in a Python virtual environment.

python -m venv .venv
source ./.venv/bin/activate
pip install -r requirements.txt
python ./src/model_development/main.py

If you want to explore the notebooks, start Jupyter Lab:

jupyter-lab

Generate Report

After starting Docker with make docker_start, open its bash shell with make docker_bash, then use nbconvert:

jupyter nbconvert --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}'  --to pdf ./notebooks/Report.ipynb --output-dir './output/'  --no-input
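
The same command can also be driven from Python, which avoids shell quoting issues around the tag set. This is a sketch under assumptions: the function names nbconvert_command and run_nbconvert are hypothetical, and the flags are copied from the command above rather than taken from the repository code.

```python
import subprocess


def nbconvert_command(notebook="./notebooks/Report.ipynb", output_dir="./output/"):
    """Build the nbconvert argument list; list form needs no shell quoting."""
    return [
        "jupyter", "nbconvert",
        # Same value the shell passes after stripping the single quotes
        '--TagRemovePreprocessor.remove_cell_tags={"exclude_output"}',
        "--to", "pdf",
        notebook,
        "--output-dir", output_dir,
        "--no-input",
    ]


def run_nbconvert():
    """Run nbconvert; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(nbconvert_command(), check=True)
```

Calling run_nbconvert() requires jupyter and a LaTeX installation to be available, just like the shell command.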

Additional

Install `nvidia-container-toolkit` for Docker to have CUDA

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
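
After installing, it can help to confirm that the relevant tools are on PATH. The sketch below only checks for executables. It does not prove that a container can actually see the GPU. The function name find_gpu_tools is hypothetical, and the assumption that the toolkit installs an nvidia-ctk executable should be verified against your installed version.

```python
import shutil


def find_gpu_tools(tools=("docker", "nvidia-smi", "nvidia-ctk")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, location in find_gpu_tools().items():
        status = location if location else "NOT FOUND"
        print(f"{name:12s} {status}")
```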

Basic docker commands needed for testing

docker rm companyclass
docker build -t companyclass-notebook docker
docker run --name companyclass  -it -v "${PWD}":/workspace companyclass-notebook

Speed and memory

This was developed with the GCP T4 in mind: it is available for free in Google Colab and is the cheapest GPU on GCP. The target machine has 16GB of RAM and 16GB of GPU memory.

In load_and_split_data I use test_size=0.02 and train_size=0.2 to reduce memory and computing time requirements. Set per_device_train_batch_size and per_device_eval_batch_size in the function train_bert_model to values that fit your GPU memory.
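
To get a feel for what these fractions mean in practice, here is a standalone sketch. It is not the repository's load_and_split_data, which may well rely on a library splitter. The names subsample_split and steps_per_epoch, the seed, and the batch size of 16 in the usage note are all assumptions for illustration.

```python
import random


def subsample_split(rows, train_size=0.2, test_size=0.02, seed=42):
    """Shuffle rows and return (train, test) as fractions of the full set."""
    if train_size + test_size > 1.0:
        raise ValueError("train_size + test_size must not exceed 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_train = round(len(shuffled) * train_size)
    n_test = round(len(shuffled) * test_size)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]  # disjoint from train
    return train, test


def steps_per_epoch(n_train, per_device_train_batch_size, n_devices=1):
    """Batches per epoch (ceiling division), ignoring gradient accumulation."""
    per_step = per_device_train_batch_size * n_devices
    return -(-n_train // per_step)
```

With 1000 rows, the defaults keep 200 rows for training and 20 for testing, and the remaining rows are unused. At a batch size of 16 on one GPU, steps_per_epoch(200, 16) gives 13 batches per epoch. Halving the batch size doubles the step count but lowers peak GPU memory.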

Generating reports

Report generation is done via Jupyter and LaTeX, so both need to be installed. This is already set up in Docker; on a local machine, run:

jupyter nbconvert --template article --PDFExporter.latex_command="['pdflatex', '{input}', '-interaction=nonstopmode', '-geometry=landscape']"   --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}'  --to pdf Report.ipynb  --no-input
