Assignment 3: Named Entity Recognition: Training Dataset
Motivation: The motivation of this assignment is to get practice with sequence labeling tasks such as
Named Entity Recognition. More precisely, you will experiment with the HMM and/or CRF models and
various features on a subset of a medical corpus, using a natural language processing package called
MALLET.
Problem Statement: The goal of the assignment is to build an NER system for diseases and treatments.
The input to your code will be a set of tokenized sentences, and the output will be a label for each token
in the sentence. Labels can be D, T, or O, signifying disease, treatment, or other, respectively.
Training Data: We are sharing a training dataset of labeled sentences. The format of each line in the
training dataset is “token label”: there is one token per line, followed by a space and its label. Blank
lines indicate the end of a sentence. The dataset has a total of 3,655 sentences.
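For example, a labeled sentence in this format might look like the following (the tokens and labels here
are purely illustrative, not taken from the dataset):

    Cisplatin T
    is O
    used O
    to O
    treat O
    testicular D
    cancer D
    . O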
The Task: You need to write a sequence tagger that labels the given sentences in a tokenized test file.
The tokenized test file follows the same format as the training file, except that it does not include the
final label. Your output should label the test file in the same format as the training data.
To accomplish this, first download and install MALLET. Get familiar with the "sequence tagging" part of
MALLET by reading about the command-line interface for sequence tagging and the developer's guide for
sequence tagging. This documentation is very short and incomplete, but that's all there is.
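For reference, the basic SimpleTagger invocation from the MALLET documentation looks roughly like
this; the jar paths, model file name, and data file name below are placeholders that you should adapt to
your own setup:

    java -cp "$MALLET_HOME/class:$MALLET_HOME/lib/mallet-deps.jar" \
        cc.mallet.fst.SimpleTagger --train true --test lab \
        --threads 2 --model-file ner.crf train.txt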
The above command is meant for Linux, so you may need to adapt the syntax for other operating
systems. Also, make sure to correct the paths. MALLET seems to be buggy if you use a single thread, so
make sure to set the --threads option to at least 2. MALLET will optimize a CRF model on half of the data
and test it on the other half. If MALLET takes too long, increase the number of threads based on the
number of cores that your computer has. Also, use the --iterations option to reduce the number of
iterations from the default of 500 to something smaller, like 50.
In order to improve the tagging accuracy, create additional features that might be useful for the task. For
example, if you add a capitalization feature and a feature marking whether the current token is a body
word, the disease mention “chronic coronary Edwards complex” might look like this:
chronic D
coronary BODYWORD D
Edwards CAPITAL D
complex D
You may have multiple space-separated features before the final label of the token. Insert only the
features that are on (i.e., that apply to the token) before the label. The word itself is treated as a feature.
The order of the features does not matter.
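As an illustration, here is a minimal sketch of a script that adds the CAPITAL feature to a labeled file;
the file names and the feature name are just examples, and you will likely want a richer feature set:

    import sys

    # Rewrite a "token [features...] label" file, adding a CAPITAL
    # feature to every token that starts with an uppercase letter.
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if not line.strip():        # blank line: sentence boundary
                fout.write("\n")
                continue
            parts = line.split(" ")
            token, label = parts[0], parts[-1]
            features = parts[1:-1]
            if token[0].isupper():
                features.append("CAPITAL")
            fout.write(" ".join([token] + features + [label]) + "\n")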
1. Try features from lower-level syntactic processing, like POS tagging or shallow chunking.
2. Try features that assign a semantic label to the current token. A well-known generic
ontology is WordNet. Another famous medical ontology is MeSH. Features based on these
ontologies are likely to help.
3. You may use existing word embeddings as features. One possibility is to use word2vec
embeddings trained on a general corpus.
4. You may also train your own word embeddings using unlabeled data. I have collected some
sample in-domain unlabeled data here.
5. You may define word shape features (see the sketch after this list).
6. Your idea here…
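For item 5, one common notion of word shape collapses each character into a class; a minimal sketch
(the function name is just an example):

    def word_shape(token):
        """Map each character to X (uppercase), x (lowercase), or d (digit)."""
        shape = []
        for ch in token:
            if ch.isupper():
                shape.append("X")
            elif ch.islower():
                shape.append("x")
            elif ch.isdigit():
                shape.append("d")
            else:
                shape.append(ch)    # keep punctuation as-is
        return "".join(shape)

    # word_shape("IL-2") == "XX-d"; word_shape("Edwards") == "Xxxxxxx"

The resulting shape string can then be emitted as a feature token such as SHAPE=Xxxxxxx before the
label.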
You may also experiment with the order of the Markov chain in the CRF model by using the --orders option.
You may also modify the source code of SimpleTagger to experiment with an HMM instead of a CRF.
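For example, --orders takes a comma-separated list of Markov orders (the default is 1), so a run that
also uses zeroth- and second-order label dependencies might look roughly like this (flag values are
illustrative):

    java -cp "$MALLET_HOME/class:$MALLET_HOME/lib/mallet-deps.jar" \
        cc.mallet.fst.SimpleTagger --train true --test lab \
        --orders 0,1,2 --threads 2 --model-file ner2.crf train.txt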
Option 1: Randomly hold out 20% of the data as a test set and 10% as a development set. Train on the
remaining 70% of the data. If needed, do parameter fitting on the dev set and finally test on the test set
(a sketch of this split follows Option 2).
Option 2: Do 10-fold cross-validation. Train on 8 folds, use the 9th as the dev set and the 10th as the
test set. Repeat this 10 times with different folds as the test set. This is the more robust option.
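A minimal sketch of the random split in Option 1, assuming sentences are separated by blank lines (the
file names are placeholders):

    import random

    # Read blank-line-separated sentences and split them 70/10/20.
    with open("train.txt") as f:
        sentences = [s for s in f.read().split("\n\n") if s.strip()]

    random.seed(0)               # fix the seed so the split is reproducible
    random.shuffle(sentences)

    n = len(sentences)
    n_test, n_dev = int(0.2 * n), int(0.1 * n)
    splits = {
        "test.txt": sentences[:n_test],
        "dev.txt": sentences[n_test:n_test + n_dev],
        "train70.txt": sentences[n_test + n_dev:],
    }
    for name, sents in splits.items():
        with open(name, "w") as out:
            out.write("\n\n".join(sents) + "\n")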
As you work on improving your baseline system, document its performance. Perform error analysis on a
subset of the data and think about what additional knowledge could help the classifier the most. That will
guide you in picking the next feature to add.
Perform an ablation study by switching off subsets of your features and observing the resulting
degradation in performance. Alternatively, you can incrementally add sets of features. Either way, the
goal is to identify the most useful features (and the value of each feature) for this task. In addition to
quantitative results, also look at specific examples and try to qualitatively understand the value of each
feature by noticing which examples each feature helps with.
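Comparing feature sets quantitatively requires a scoring script; a minimal sketch that computes
per-label precision, recall, and F1 from two aligned “token label” files (the file names are placeholders):

    from collections import Counter

    def read_labels(path):
        """Return the final (label) column of a file, skipping blank lines."""
        with open(path) as f:
            return [line.split()[-1] for line in f if line.strip()]

    gold = read_labels("gold.txt")
    pred = read_labels("pred.txt")
    assert len(gold) == len(pred), "files must align token for token"

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    for label in ("D", "T", "O"):
        p_den, r_den = tp[label] + fp[label], tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        print(f"{label}: P={prec:.3f} R={rec:.3f} F1={f1:.3f}")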
What to submit?
1. Submit your best code (ideally trained on all the training data and not just on a subset) by Tuesday,
4th November 2014, 11:55 PM. The code should not need to train again: you should submit only
the testing code, after the models have been trained. That is, you should not need to access the
training data anymore.
Submit your code in a .zip file named in the format <EntryNo>.zip. Make sure that when we
run “unzip yourfile.zip”, the following files are produced in the present working directory:
compile.sh
run.sh
Writeup.pdf (and not writeup.pdf, Writeup.doc, etc)
You will be penalized if your submission does not conform to this requirement.
Your code will be run as ./run.sh inputfile.txt outputfile.txt. The outputfile.txt should have the
same number of lines as inputfile.txt, and it should have two additional characters per token
line (a space and the label). Here is a format checker. Make sure your code passes the format
checker before final submission (a sketch of such a check appears after item 2 below).
2. Your writeup (at most 2 pages, 10 pt font) should describe how you created your best NER
system (about 1 page). Explain any interesting ideas that you used. Describe your ablation study,
detailing the value of each feature quantitatively. Give specific examples to describe the value of
each feature qualitatively. Also mention the names of the people you discussed the assignment with.
The writeup will be judged on clarity and innovation as well as experimental results.
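In case the linked checker is unavailable, a minimal sketch of the format check described in item 1
(invoked as python check.py inputfile.txt outputfile.txt; the script name is just an example):

    import sys

    # Check that every non-blank output line is the corresponding input
    # line plus exactly two characters: a space and a D/T/O label.
    with open(sys.argv[1]) as f_in, open(sys.argv[2]) as f_out:
        in_lines, out_lines = f_in.readlines(), f_out.readlines()

    assert len(in_lines) == len(out_lines), "line counts differ"
    for i, (src, out) in enumerate(zip(in_lines, out_lines), 1):
        src, out = src.rstrip("\n"), out.rstrip("\n")
        if not src.strip():
            assert not out.strip(), f"line {i}: expected a blank line"
        else:
            assert out[:-2] == src and out[-2:] in (" D", " T", " O"), \
                f"line {i}: expected '{src}' plus a space and a D/T/O label"
    print("format OK")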
Evaluation Criteria
(1) 30 points for the description of the system. Anything innovative may yield bonus points.
(2) 60 points for the performance of your code.
(3) 10 bonus points for outstanding write-ups and the best-performing systems.