This repository was used in the Text Classification using Machine Learning session at the Lancaster Summer Schools in Corpus Linguistics and other Digital Methods (#LancsSS16 and #LancsSS17), held at Lancaster University, UK, from 12th to 15th July 2016 and from 27th to 30th June 2017. http://ucrel.lancs.ac.uk/summerschool/nlp.php
Instructor: Dr. Mahmoud El-Haj http://www.lancaster.ac.uk/staff/elhaj
Slides are available online here:
Workspace Setup: https://lancaster.box.com/s/j78l0b4197il98oze2gfqlidlsvg7jlt
The code trains classifiers for the chairman's statement, governance and remuneration sections of 1,000 annual financial reports. Using the WEKA Java API, the code does the following:
- Creates an ARFF file
- Trains a model using different algorithms
- Extracts n-gram features using WEKA's StringToWordVector filter
- Reduces the number of features
- Classifies unseen documents using the trained models
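
As an illustration of the first step, here is a minimal sketch of how labelled report sections could be written out as ARFF text using only the Java standard library. It is not the repository's own code: the class name ArffSketch, the helper names, the relation name "reports" and the label names (chairman, governance, remuneration) are all assumptions made for this example. The quoting rules follow the usual ARFF convention of single-quoted strings with backslash escapes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the repository or of WEKA.
class ArffSketch {

    /** One labelled document: the raw section text plus its class label. */
    static final class Doc {
        final String text;
        final String label;

        Doc(String text, String label) {
            this.text = text;
            this.label = label;
        }
    }

    /** Wraps a value in single quotes and backslash-escapes special characters. */
    static String quote(String s) {
        String escaped = s.replace("\\", "\\\\")
                          .replace("'", "\\'")
                          .replace("\n", "\\n")
                          .replace("\r", "\\r")
                          .replace("\t", "\\t");
        return "'" + escaped + "'";
    }

    /** Builds ARFF text with one string attribute and one nominal class attribute. */
    static String toArff(String relation, List<String> labels, List<Doc> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {").append(String.join(",", labels)).append("}\n\n");
        sb.append("@data\n");
        for (Doc d : docs) {
            // Reject labels that were not declared in the nominal attribute.
            if (!labels.contains(d.label)) {
                throw new IllegalArgumentException("Unknown label: " + d.label);
            }
            sb.append(quote(d.text)).append(',').append(d.label).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed label names for the three report sections.
        List<String> labels = Arrays.asList("chairman", "governance", "remuneration");
        List<Doc> docs = Arrays.asList(
            new Doc("Dear shareholders, this was a strong year.", "chairman"),
            new Doc("The board met six times during the year.", "governance"));
        System.out.print(toArff("reports", labels, docs));
    }
}
```

A file produced this way keeps each document as a single string attribute, which is the input form a filter such as StringToWordVector expects before it is converted into word or n-gram features.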