Machine Learning and Pattern Recognition Week 8 Neural Net Architectures
[If you are short on time this week, this is the note and videos to skip for now.]
The previous note introduced neural networks using a standard layer, an affine transformation, and an element-wise non-linearity:
\[
    h^{(l)} = g^{(l)}\big(W^{(l)} h^{(l-1)} + b^{(l)}\big). \tag{1}
\]
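As a concrete reference, here is a minimal NumPy sketch of this standard layer; the tanh non-linearity and the layer sizes are illustrative assumptions, not choices fixed by this note.

```python
import numpy as np

def standard_layer(h_prev, W, b, g=np.tanh):
    """One standard layer: an affine transformation followed by an
    element-wise non-linearity, h = g(W h_prev + b)."""
    return g(W @ h_prev + b)

# Illustrative sizes: 5 units in the previous layer, 3 in this one.
rng = np.random.default_rng(0)
h0 = rng.standard_normal(5)              # previous layer's activations
W1 = 0.1 * rng.standard_normal((3, 5))   # weights
b1 = np.zeros(3)                         # biases
h1 = standard_layer(h0, W1, b1)          # h1 has shape (3,)
```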
However, any composition of differentiable functions could be used, and its free
parameters learned (fitted to data) with a gradient-based optimizer. For example, we could
construct a function using the radial basis functions (RBFs) we discussed earlier, and then
train the centers and bandwidths of these functions.
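A sketch of such an RBF layer is below; the Gaussian basis and the parameterization of the bandwidths by their logarithms are assumptions made for this illustration.

```python
import numpy as np

def rbf_layer(x, centers, log_bandwidths):
    """Gaussian radial basis function features, one per center.
    centers: (K, D) array; log_bandwidths: (K,) array.
    Parameterizing the bandwidths by their logs keeps them positive
    under unconstrained gradient-based optimization."""
    sq_dists = np.sum((x[None, :] - centers) ** 2, axis=1)   # (K,)
    bandwidths = np.exp(log_bandwidths)
    return np.exp(-sq_dists / (2 * bandwidths ** 2))

# Illustrative sizes: 2-D inputs, K = 4 basis functions. The centers,
# log-bandwidths, and output weights would all be fitted by following
# gradients of a training loss with respect to them.
rng = np.random.default_rng(1)
x = rng.standard_normal(2)
centers = rng.standard_normal((4, 2))
log_bandwidths = np.zeros(4)
w = rng.standard_normal(4)
f = w @ rbf_layer(x, centers, log_bandwidths)   # scalar prediction
```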
This course doesn’t attempt a comprehensive review of the layers used for particular
applications or use-cases. However, it’s worth knowing that there are specialized transformations
for representing functions of different types of input, such as sequences or images. The MLP
course (2019 archive) and Goodfellow et al.’s deep learning textbook are good starting points
for the details of recent practice.
The neural network architectures that are in standard use are still evolving. If in the future
you’re ever applying neural networks to data with some particular structure, it will be worth
reviewing how neural nets have been applied to similar data before. Working out what works
will then be an empirical exercise, backed up by training, validation, and testing.
As one example for sequence data, we can turn a length-$T$ sequence of words $x^{(1)}, \dots, x^{(T)}$ into a single fixed-length vector by pooling their embeddings:
\[
    x_{\text{pooled}} = \sum_{t=1}^{T} a\big(e^{(t)}\big)\, e^{(t)}, \qquad e^{(t)} = \text{embedding}\big(x^{(t)};\, V\big), \tag{6}
\]
where $a(e^{(t)})$ is a scalar weight; the weights are often chosen to be positive and to sum
to one. If we place the embeddings for all the words in our sequence into a $T \times K$ matrix $E$,
the simplest way to get weights for our average is probably:
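As a hedged illustration of this kind of pooling, the sketch below computes the weights with a softmax of a learned score for each embedding; the scoring vector `w_a` and the softmax are this sketch's assumptions rather than necessarily the note's own choice.

```python
import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)   # shift for numerical stability
    e = np.exp(scores)
    return e / np.sum(e)

def attention_pool(E, w_a):
    """Weighted average of the rows of the T x K embedding matrix E.
    Each embedding gets a scalar score E @ w_a; a softmax turns the
    scores into weights a(e^(t)) that are positive and sum to one."""
    a = softmax(E @ w_a)   # (T,) weights
    return a @ E           # x_pooled, shape (K,)

# Illustrative sizes: T = 6 words, K = 4 embedding dimensions.
rng = np.random.default_rng(2)
E = rng.standard_normal((6, 4))
w_a = rng.standard_normal(4)
x_pooled = attention_pool(E, w_a)
```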