homoludens/Company-Classifier

Company Classifier

There are rough edges in the code as well as in the report, but for a weekend project I had to be realistic.

Dialectica - Data Science Assignment

The assignment PDF is in the docs folder.

The report is in output/Report.pdf.

The code is in a private repo: https://github.com/homoludens/dialectica_assignment

Folder structure

The notebooks folder contains notebooks for exploring the data and the code.

The src folder contains productionized code created from the notebooks. As always, it could be much better.

The output folder will contain the results produced by the code, including the report PDF.

How to run project

There are two ways to run the project: in Docker or in a Python virtual environment.

A Makefile is provided as a helper for running commands; it can also be read as documentation. Run

make help

to see available make commands.

Docker

One option is Docker. To have CUDA available inside the container, you will need nvidia-container-toolkit installed on the host machine.

make nvidia

might help with that. It will ask for sudo permissions, so please check the command yourself before running it!
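To confirm that CUDA is actually visible from inside the container (or from a virtual environment), a quick check like the one below can help. This is only a sketch: it assumes the project's BERT training runs on PyTorch, which this README does not state, and the function name cuda_status is made up for illustration.

```python
import importlib.util


def cuda_status():
    """Report whether PyTorch (an assumed dependency) can see a CUDA device."""
    # Avoid an ImportError when torch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    return "available" if torch.cuda.is_available() else "unavailable"


status = cuda_status()
print(f"CUDA status: {status}")
```

If this reports "unavailable" inside Docker, the nvidia-container-toolkit setup on the host is the first thing to re-check.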

Clear previous runs (the output directory needs to be empty) with:

make clean
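If you want to verify that the output directory really is empty before starting a run, a small pre-flight check could look like this. It is a hypothetical helper, not part of the repository; the name output_dir_is_empty and the default path "output" are assumptions.

```python
import tempfile
from pathlib import Path


def output_dir_is_empty(path="output"):
    """Return True if the directory does not exist or has no entries."""
    directory = Path(path)
    return not directory.exists() or not any(directory.iterdir())


# Demonstration on a throwaway directory instead of the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    empty_before = output_dir_is_empty(tmp)
    (Path(tmp) / "Report.pdf").touch()  # simulate a leftover report
    empty_after = output_dir_is_empty(tmp)

print(empty_before, empty_after)
```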

Build and run the Docker container:

make docker_build
make docker_run

The first run will run the training and generate reports in the output folder.

Jupyter Lab will be accessible on port 8899: http://localhost:8899

virtualenv

Another option is to run the project in a Python virtual environment.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python ./src/model_development/main.py

If you want to explore the notebooks, start Jupyter Lab:

jupyter-lab

Generate Report

After starting Docker with make docker_start, open its bash shell with make docker_bash, then use nbconvert:

jupyter nbconvert --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}'  --to pdf ./notebooks/Report.ipynb --output-dir './output/'  --no-input

Additional

Install nvidia-container-toolkit so that Docker has access to CUDA:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Basic docker commands needed for testing

docker rm companyclass
docker build -t companyclass-notebook docker
docker run --name companyclass  -it -v "${PWD}":/workspace companyclass-notebook

Speed and memory

This was made with a GCP T4 in mind, which is available for free in Google Colab and is the cheapest GPU on GCP: 16 GB of RAM and 16 GB of GPU memory.

In load_and_split_data I am using test_size=0.02, train_size=0.2 to reduce memory and computing time requirements. Set per_device_train_batch_size and per_device_eval_batch_size in the function train_bert_model to values that fit your GPU memory.
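To illustrate what test_size=0.02 and train_size=0.2 mean in practice, here is a minimal standard-library sketch of fractional subsampling. It is not the repository's load_and_split_data, whose implementation is not shown here; the function subsample_split and the toy data are assumptions made for illustration. The two parameter names match those accepted by scikit-learn's train_test_split, but whether the project uses that function is also an assumption.

```python
import random


def subsample_split(rows, train_size=0.2, test_size=0.02, seed=42):
    """Shuffle rows, then keep only the requested fractions for train and test."""
    if train_size + test_size > 1.0:
        raise ValueError("train_size + test_size must not exceed 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_size)
    n_test = round(len(shuffled) * test_size)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]  # disjoint from train
    return train, test


rows = list(range(1000))  # stand-in for the real dataset
train, test = subsample_split(rows)
print(len(train), len(test))  # 200 20
```

With these fractions only 22% of the rows are used at all, which is what keeps memory use and training time down; raise both values if your hardware allows it.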

Generating reports

Generating the report is done via Jupyter and LaTeX, so both need to be installed. This is already set up in Docker; on a local machine, run:

jupyter nbconvert --template article --PDFExporter.latex_command="['pdflatex', '{input}', '-interaction=nonstopmode', '-geometry=landscape']"   --TagRemovePreprocessor.remove_cell_tags='{"exclude_output"}'  --to pdf Report.ipynb  --no-input 

About

Using BERT for Company Classification
