
Deep Learning for Natural Language

Bloomington-Normal
July 18th, 2023

💬🧠🤖 Myles Harrison


https://www.linkedin.com/in/mylesharrison/
Who am I?
I'm Myles, a data scientist.

Most recently, I was the head of data science at a Global Tech Bootcamp (~4 years).

Previously, I was a consultant, working in data science at large organizations such as Accenture, PwC, and Sapient.

Currently, I am teaching Conversational AI at Georgian College, north of Toronto.

I live in Ontario's Lake Country and love the outdoors (fishing and hiking) as well as creative writing.
Fundamentals

🏛🔠
Okay, but what's the deal with ChatGPT?
ChatGPT is an example of a large language model (LLM), a type of deep learning model trained with hundreds of millions or billions of parameters on very large bodies of text. Large language models currently represent the state of the art in NLP.

While we're here: ChatGPT is not sentient, nor is it an example of an Artificial General Intelligence (AGI).

Let's take a step back…

Image credit: Leon Neal/Getty Images


What is Natural Language Processing (NLP)?
Natural language processing lies at the intersection of the domains of linguistics, computer
science, and artificial intelligence.

We are primarily concerned with NLP as it pertains to the field of data science and AI, where it refers to teaching computers to process - and perhaps even "understand" - text written in ordinary language and perform associated tasks.

Though the term processing usually refers specifically to altering and preparing data, in the
domain of AI, NLP is often used to refer more generally to any language problem, including
those of applying machine learning (ML) to language, since these still require processing text
data beforehand.

🔡🛠💡
A Brief History of NLP (according to Wikipedia)

Symbolic (1950s-1970s): rules-based methods for language tasks such as translation and conversation.

Statistical / ML (1980s-2000s): the advent of statistical techniques and the application of machine learning.

Neural (2000s-present): breakthroughs in deep learning leading to rapid advances in the field up to today.
What is Machine Learning?
Machine learning (ML) is a relatively new field and sits at the intersection of software
engineering, computational mathematics, and statistics.

Whereas traditional software development is deterministic and requires the coding of specific
logic, machine learning models can learn from training data and infer relationships or make
predictions based upon patterns in a given data set, without being given explicit instructions.

Much of the mathematical backing for machine learning techniques has existed for quite some
time; it is only fairly recent advances in computing power, scale, and availability that have
enabled their application computationally, giving rise to the field of ML.

🤖🎓
Types of Machine Learning
Supervised Learning: make predictions from a dataset and data labels - associated categorical or numeric values. In NLP, it can be used to classify documents based upon their content, or to predict the next character in generative text applications.

Unsupervised Learning: uses statistical techniques to uncover patterns in a dataset. In NLP, major applications are topic modeling and embeddings - finding representations of language in a vector space that capture their statistical properties.

Reinforcement Learning: teaches an agent a behavior by optimizing against a target objective with a reward function. It is an important aspect of some large language models (LLMs), which use Reinforcement Learning from Human Feedback (RLHF) as part of their training.
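As a concrete illustration of the supervised case above, here is a minimal, hypothetical sketch of document classification with scikit-learn; the tiny dataset, labels, and model choice (bag-of-words features with a Naive Bayes classifier) are invented for the example and are not from the talk.

# Supervised learning for NLP: classify documents from labeled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the game went to overtime", "stocks fell sharply today",
         "the striker scored twice", "markets rallied on earnings"]
labels = ["sports", "finance", "sports", "finance"]

# Turn text into bag-of-words counts, then fit a simple classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["the team won the match"]))  # predicts a label for the new document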
What is Deep Learning?
• Deep Learning is a specialized type of ML that takes inspiration from the structure of the human brain

• Unlike other machine learning, deep learning models - or artificial neural networks - are composed of many nodes which can be viewed as individual "sub-models"

• The theoretical foundations for deep learning have existed since the 1960s (or even earlier), but they have only recently been realized with the rise of cheap, powerful computing

• In NLP, deep learning models represent the state-of-the-art (SOTA) and can be used for supervised, unsupervised, and semi-supervised problems
A Neuron in the Brain
The human brain is composed of billions of neurons: electrically excitable cells composed of a cell body, dendrites, an axon, and a terminal.

Neurons receive input through their dendrites, and when firing, an electrical impulse travels down the axon to the terminal and releases neurotransmitters to the next cell.
An Artificial Neuron (Perceptron)
The structure of an artificial neuron, or perceptron, follows that of those in the human brain.

Inputs are multiplied by weights to form a weighted sum, which is then passed through an activation function to produce the model activation - analogous to the electrical impulse of a neuron firing.

Image from: https://deepai.org/machine-learning-glossary-and-terms/perceptron
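To make the weighted sum and activation concrete, here is a minimal numerical sketch of a single perceptron in Python with NumPy; the input, weight, and bias values are arbitrary examples.

import numpy as np

# A single artificial neuron: weighted sum of inputs plus bias,
# passed through an activation function (sigmoid here).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs (features)
w = np.array([0.8,  0.1, -0.4])  # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum
activation = sigmoid(z)          # "firing" strength of the neuron
print(activation)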


Structure of a Neural Network
• Multiple perceptrons are put together into layers composed of nodes (each perceptron) to create a neural network.

• The outputs of previous layers become the inputs of the following layer.

• The number of layers and number of nodes - known as the network architecture - is arbitrary and up to the choice of the modeler. There are also specific architectures that are well suited to particular types of problems.
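A minimal sketch, assuming NumPy and arbitrary random weights, of how stacked layers pass their outputs forward to become the next layer's inputs; the layer sizes here (3 inputs, 4 hidden nodes, 2 outputs) are invented for illustration.

import numpy as np

def relu(z):
    return np.maximum(0, z)

# Toy fully-connected network: 3 input features -> 4 hidden nodes -> 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.2, 3.0])   # the input layer just passes the features through

h = relu(W1 @ x + b1)            # hidden layer: weighted sum plus activation
y = W2 @ h + b2                  # output layer (raw scores)
print(y)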
Input Layer
• The input layer is not a “true” layer but just passes the data through to the following layers – no activation, linear passthrough

• Each neuron in the input layer represents a feature of the data

• Number of nodes in the input layer = number of features
Output Layer
• Final layer of a neural network that produces the network's predictions (output)

• Number of nodes dependent on problem: a single node for binary classification or regression; for multi-class, number of nodes = number of classes

• Activation function depends on problem: e.g. sigmoid for binary classification (probability 0-1), softmax for multiclass (multiple probabilities 0-1 that sum to 1)
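A small NumPy sketch of the two output activations mentioned above; the input scores are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Binary classification: one output node, sigmoid squashes the score to a probability in (0, 1)
print(sigmoid(1.3))

# Multi-class: one node per class, softmax gives probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, 0.1])))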
Hidden Layers
• Intermediate layers between the input and output – so called as they are "hidden" between the two

• Perform computations on the outputs of the previous layer (linear combination of outputs and weights with an activation function applied)

• Size of each layer is arbitrary (part of the network architecture)

• Speaking here only of fully-connected (feed-forward) networks
Activation Functions

• Applied to the linear combination of inputs and weights for each layer

• Each layer may have a different activation function

• There are families of well-known functions that perform well and have desirable mathematical properties

• It is from these that the power of deep learning - and its ability to learn highly complex relationships - arises

Image source: https://machine-learning.paperspace.com/wiki/activation-function
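A short sketch, assuming NumPy, of three common activation functions applied element-wise to a range of pre-activation values.

import numpy as np

# Common activation functions applied element-wise to a layer's pre-activations
z = np.linspace(-3, 3, 7)

sigmoid = 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)
tanh    = np.tanh(z)                 # squashes to (-1, 1)
relu    = np.maximum(0, z)           # zero for negative inputs, identity otherwise

print(sigmoid, tanh, relu, sep="\n")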


Loss Functions

• A measurement of error

• Not unique to deep learning – also used in traditional machine learning

• How “wrong” are the predictions?

• Used to optimize the weights

• Examples: Mean Squared Error (regression), Cross-Entropy Loss (classification)

How do Neural Networks Learn?
• Calculate the direction of change in which the slope of the loss with respect to the weights is negative (the gradients)

• Find the global minimum of the error

• A large number of weights (millions?) = a highly complex optimization problem
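A toy sketch of one gradient descent step on a single-weight model with a squared-error loss; the values and learning rate are arbitrary and only meant to show the weight moving against the gradient.

# One step of gradient descent on a single-weight model y = w * x,
# minimizing squared error. The gradient gives the direction of steepest
# increase of the loss, so we step the weight in the opposite direction.
x, y_true = 2.0, 8.0
w, lr = 1.0, 0.1

y_pred = w * x
loss = (y_pred - y_true) ** 2
grad = 2 * (y_pred - y_true) * x     # d(loss)/d(w)
w = w - lr * grad                    # move against the gradient

print(loss, w)                       # the updated weight now predicts closer to y_true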
Forward Pass and Backpropagation
• In the forward pass, training data is run through the network to compute the predictions, and the error is calculated from the loss function

• Backpropagation (“backprop”) applies changes to the weights in the network as determined from the gradients (the direction of greatest decrease of the error)

(Diagram: Training Data → Forward Pass (perform calculations) → Loss & Gradients → Backprop (update weights) → back to the Forward Pass)
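A minimal sketch of the forward pass, loss, backprop, and weight update using PyTorch's automatic differentiation; the network shape and random data are placeholders.

import torch

# Forward pass, loss, and backpropagation on a tiny fully-connected network.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1)
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3)        # a batch of 8 training examples
y = torch.randn(8, 1)        # their targets

pred = model(x)              # forward pass: compute predictions
loss = loss_fn(pred, y)      # how "wrong" are the predictions?
loss.backward()              # backprop: compute gradients of the loss w.r.t. the weights
optimizer.step()             # update weights in the direction that reduces the loss
optimizer.zero_grad()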


Epochs and Batches

• Deep learning differs from other machine learning in that neural networks are trained with batches (subsets of fixed size) of the training data

• Once all the data has gone through the network once, this is referred to as a single epoch of training

• One epoch = many batches; networks are trained for many epochs and see the whole training dataset multiple times

(Diagram: the training data split into Batches 1-4, passed through the network in Epoch 1, Epoch 2, and so on)
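A plain-Python sketch of how training iterates over the dataset in batches within epochs; the dataset here is just a list of integers standing in for training examples.

# Training in epochs and batches: each epoch iterates over the whole dataset
# in fixed-size chunks (batches).
data = list(range(100))      # stand-in for 100 training examples
batch_size = 25
num_epochs = 3

for epoch in range(num_epochs):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # forward pass, loss, backprop, and weight update happen here per batch
    print(f"epoch {epoch + 1}: saw {len(data)} examples in {len(data) // batch_size} batches")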
(Python) Deep Learning Frameworks

TensorFlow
• Google product
• Graph-based computation, GPU training
• Other deployment options (TensorFlow Lite, TF.js)
• Easy to use with the integration of Keras into TF 2.x

PyTorch
• Facebook product
• Graph-based computation, GPU training
• PyTorch Mobile for embedded, no web deployment (ONNX?)
• OOP dev focus (ML engineering); Lightning as the equivalent to Keras
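To show the difference in developer experience, here is the same tiny classifier sketched in both frameworks; the layer sizes are arbitrary and this is illustrative rather than a recommended setup.

# TensorFlow / Keras: declarative, layers stacked in a Sequential model
import tensorflow as tf

keras_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
keras_model.compile(optimizer="adam", loss="binary_crossentropy")

# PyTorch: object-oriented, you subclass nn.Module and write forward() yourself
import torch

class TorchModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(10, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 1), torch.nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

torch_model = TorchModel()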
Language Models

🔠🧪💻
Sequence-to-Sequence Models
• Sequence-to-sequence (Seq2Seq) neural networks take a sequence as input and return a sequence as output

• Applications in language (generative models, translation, text-to-speech, summarization), time series, and audio / video (captioning, transcription)

• e.g. RNNs and Transformers

Image source: https://google.github.io/seq2seq/


Recurrent Neural Networks (RNNs)
• Simplest type of sequence-to-sequence model

• Uses the outputs of previous nodes to affect the inputs of following ones (memory)

• Can be one-to-one, one-to-many, many-to-one, or many-to-many

• Computationally expensive to train, as sequential in nature

• Suffered from "forgetting" due to vanishing gradients

(Diagram: one-to-one, one-to-many, many-to-one, and many-to-many configurations)
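A minimal many-to-one sketch using PyTorch's built-in RNN layer; the feature, hidden, and class sizes are arbitrary.

import torch

# Many-to-one RNN: read a sequence of vectors and use the final hidden state
# to make a single prediction (e.g. classify the whole sequence).
rnn = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 2)    # 2 output classes

x = torch.randn(4, 10, 8)        # batch of 4 sequences, 10 steps, 8 features each
outputs, h_n = rnn(x)            # h_n: final hidden state, shape (1, 4, 16)
logits = head(h_n[-1])           # one prediction per sequence
print(logits.shape)              # torch.Size([4, 2])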
Long-Short Term Memory Networks (LSTMs)
• Special type of RNN that captures long-term dependencies

• Deals with the problem of vanishing gradients (forgetting)

• Additional components for remembering and forgetting

• Still difficult to train due to lack of parallelization, and did not solve the forgetting problem entirely

Image from: https://d2l.ai/chapter_recurrent-modern/lstm.html
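The same sketch with PyTorch's LSTM layer, whose extra gating components also maintain a cell state for remembering and forgetting; again the sizes are arbitrary.

import torch

# Drop-in LSTM replacement for the RNN above.
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)
outputs, (h_n, c_n) = lstm(x)    # LSTM also returns a cell state c_n
print(h_n.shape, c_n.shape)      # both torch.Size([1, 4, 16])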
The Transformer Architecture
• The groundbreaking paper "Attention is All You Need" from Google researchers (2017) introduced the Transformer architecture

• Attention discards the notion of recurrence: it deals with forgetting by having the decoder look at all previous states of the encoder (a weighted sum)

• Now represents the state of the art for LLMs, and is also applied in domains outside of language (e.g. image generation)
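A compact sketch of scaled dot-product attention, the weighted sum over all states described above; implemented here with PyTorch for illustration, with arbitrary tensor sizes.

import math
import torch

# Scaled dot-product attention: every position attends to (takes a weighted
# sum over) all positions, so nothing has to be "forgotten".
def attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # query/key similarity
    weights = torch.softmax(scores, dim=-1)                   # attention weights sum to 1
    return weights @ V                                        # weighted sum of values

Q = torch.randn(1, 5, 64)   # 5 positions, 64-dimensional queries
K = torch.randn(1, 5, 64)
V = torch.randn(1, 5, 64)
print(attention(Q, K, V).shape)   # torch.Size([1, 5, 64])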
RNNs vs. Transformers
RNNs
● Recurrent structure: outputs from previous inputs are used to make future predictions
● Suffered from "vanishing gradients", making it difficult to learn long-term dependencies (i.e. your model has ADHD)
● LSTM networks addressed the vanishing gradient problem somewhat, but not entirely
● Do not parallelize well due to their recurrent nature

Transformers
● Non-sequential in nature (no recurrence) - make predictions based on the whole input sequence
● Do not require information to be in order; keep track of position (e.g. of words in a sentence) using positional encoding instead
● Have a much more complex architecture, implementing the notion of self-attention
● Parallelize well in training as there are no sequential dependencies
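A short sketch of the sinusoidal positional encoding from the original Transformer paper, which injects word-order information in place of recurrence; the sequence length and model dimension are arbitrary.

import torch

# Sinusoidal positional encoding: added to token embeddings so the model
# knows word order even though it processes all positions in parallel.
def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe

print(positional_encoding(seq_len=10, d_model=16).shape)  # torch.Size([10, 16])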
LLM Model Development History

Image Source: https://huggingface.co/learn/nlp-course/chapter1/4


To the Moon?

GPT-2 (2019): 1.5B parameters

GPT-3 (2020): 175B parameters

GPT-4 (March 14, 2023): 1.76T (?) parameters

(Chart: parameter counts of GPT-1, GPT-2, and GPT-3)
Source: https://research.aimultiple.com/gpt/
Evaluating Large Language Models
(Figure: Massive Multitask Language Understanding (MMLU) performance over time)

As LLMs have become more sophisticated and begun excelling at "few-shot" and "zero-shot" learning tasks, general evaluation has become more challenging.

As such, a series of benchmarks has arisen which are closer to the knowledge and reasoning tasks that would be given to a human.

Some of these benchmarks are composites encompassing existing benchmarks as a suite (e.g. HELM).

Source: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
Chinchilla: Bigger is better?
The Chinchilla model was presented by Google DeepMind in March 2022 and followed the development of the earlier Gopher model.

Though smaller in size than previous LLMs (70B parameters vs. Gopher's 280B), it was trained on a larger dataset. It outperformed other, larger models on standard benchmarks, with the authors claiming it even outperforms GPT-3 (175B parameters).

Figure from: https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training


Breaking News
Meta released Llama 2 today, a 70B parameter model optimized for dialog. Following the Chinchilla path, Llama is a smaller model compared with some more recent ones, but uses a very large training set (2 trillion tokens).

The Llama models are "semi-open": available for research and commercial use under a license; however, the details of the specific training data and methods have not been entirely released.
Limitations

https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
🤯
🙃
🙄
LLMs: Player Pianos or beginnings of AGI?
The Shameless Plug

NLP4Free 🔠⚡🤖🧠😃
https://mylesharrison.com/nlp4free/

A Free Natural Language Processing (NLP) microcourse, from basics to deep learning
Let's keep learning together!
Feel free to connect with me and
continue the conversation:

www.mylesharrison.com

linkedin.com/in/mylesharrison/

calendly.com/mylesmharrison
Thanks for listening!

😀
Image Attribution
