Methodology
Gathering data is the most important step in solving any supervised
machine learning problem. Your text classifier can only be as good as the
dataset it is built from.
Here are some important things to remember when collecting data:
• If you are collecting data through an API, understand its limitations
before using it. For example, some APIs set a limit on the rate at which
you can make queries.
• The more training examples (referred to as samples in the rest of this
guide) you have, the better. This will help your model generalize better.
• Make sure the number of samples for every class or topic is not
overly imbalanced. That is, you should have a comparable number of
samples in each class.
• Make sure that your samples adequately cover the space of possible
inputs, not only the common cases.
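The class balance point above can be checked with a few lines of code. The following is a minimal sketch, not a prescribed procedure: the function name class_balance and the example labels are hypothetical, and what counts as "overly imbalanced" is left to the reader.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class sample counts and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Hypothetical labels for a three-topic dataset (illustration only).
labels = ["sports"] * 50 + ["politics"] * 45 + ["tech"] * 5

counts, ratio = class_balance(labels)
print(dict(counts))
print(f"largest / smallest class ratio: {ratio:.1f}")
```

A ratio close to 1 means the classes are comparable in size; a large ratio suggests collecting more samples for the smallest class.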
Methodology
CNN training:
The gathered data is processed and used to train a convolutional neural network (CNN). A CNN
learns its features directly from the data, which reduces manual feature engineering and helps
keep training time manageable. It involves three types of layer:
• Convolution layer: Feature extraction takes place here. Learned filters slide over the input
and respond strongly to useful local patterns, while irrelevant detail produces weak responses
that contribute little to later layers.
• Pooling layer: The size of the data or image representation is reduced, giving a compressed
feature map that keeps the most important features and lowers the computation needed later.
• Fully connected layer: The compressed features from the previous layer are flattened into a
vector and fed to one or more fully connected layers, which combine them to produce the final
output, such as a score for each class.
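The three layers above can be illustrated with a forward pass on a tiny one-dimensional input. This is a sketch under stated assumptions: the filter, weights and input values are hand-picked for illustration rather than learned, training (backpropagation) is not shown, and the helper names conv1d, max_pool1d and dense are hypothetical. A real project would normally use a deep learning library instead.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation, as in CNNs) followed by ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        s = sum(signal[i + j] * kernel[j] for j in range(k))
        out.append(max(0.0, s))  # ReLU keeps only positive responses
    return out


def max_pool1d(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


def dense(vector, weights, bias):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(v * w for v, w in zip(vector, row)) + b
            for row, b in zip(weights, bias)]


# Hand-picked values for illustration only; in practice these are learned.
signal = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
kernel = [1.0, -1.0]                    # responds to a 1 followed by a 0
weights = [[0.5, 0.5, 0.5, 0.5],        # output unit for class 0
           [-0.5, -0.5, -0.5, -0.5]]    # output unit for class 1
bias = [0.0, 1.0]

feature_map = conv1d(signal, kernel)    # convolution layer
pooled = max_pool1d(feature_map, 2)     # pooling layer
scores = dense(pooled, weights, bias)   # fully connected layer
predicted = scores.index(max(scores))   # class with the highest score

print(feature_map)
print(pooled)
print(scores, predicted)
```

The pooled vector is half the length of the feature map but still records where the filter fired, which is the compression the pooling layer is meant to provide.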
Methodology
• Filter methods are generally used as a preprocessing
step. The selection of features is independent of any
machine learning algorithm. Instead, features are
selected on the basis of their scores in statistical
tests of their correlation with the outcome variable.
• Data filtering is the process of choosing a smaller part of
your data set and using that subset for viewing or
analysis. Filtering is generally (but not always) temporary
– the complete data set is kept, but only part of it is used
for the calculation.
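A filter method of the kind described above can be sketched as follows. This example assumes the Pearson correlation coefficient as the statistical score, which is only one of several possible tests; the feature names, the data and the helpers pearson and select_top_k are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_top_k(features, target, k):
    """Rank features by absolute correlation with the target and keep the top k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical feature columns and binary outcome (illustration only).
features = {
    "word_count": [1, 2, 3, 4, 5, 6],
    "has_link": [0, 0, 0, 1, 1, 1],
    "noise": [3, 1, 3, 1, 3, 1],
    "constant": [1, 1, 1, 1, 1, 1],
}
target = [0, 0, 0, 1, 1, 1]

selected, scores = select_top_k(features, target, k=2)
print(selected)
```

No model is trained at any point: the features are scored against the outcome variable alone, which is what makes this a filter method rather than a wrapper method.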
Methodology
Classification is a supervised learning problem:
define a set of target classes and train a model
to recognize them from labeled examples. Once
trained, the model can classify new, unseen inputs.
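The train-then-classify workflow can be sketched with a nearest-centroid classifier. This is an assumption chosen for brevity, not the model this guide uses; the two-dimensional samples, the class names and the helpers train_centroids and predict are hypothetical.

```python
def train_centroids(samples, labels):
    """Training step: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Classification step: return the class whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


# Hypothetical labeled training data with two target classes.
samples = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
labels = ["sports", "sports", "tech", "tech"]

centroids = train_centroids(samples, labels)
prediction = predict(centroids, [0.8, 0.1])  # a new, unseen sample
print(prediction)
```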