Google - Machine Learning Glossary
Google - Machine Learning Glossary
This glossary defines general machine learning terms as well as terms specific to TensorFlow.
A/B testing
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing
aims to determine not only which technique performs better but also to understand whether the difference is
statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be
applied to any finite number of techniques and measures.
accuracy
The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined
as follows:
Correct Predictions
Accuracy =
Total Number Of Examples
activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous
layer and then generates and passes an output value (typically nonlinear) to the next layer.
active learning
A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly
valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of
labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for
learning.
AdaGrad
A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each
parameter an independent learning rate. For a full explanation, see this paper.
agglomerative clustering
See hierarchical clustering.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen
positive example is actually positive than that a randomly chosen negative example is positive.
automation bias
#fairness
When a human decision maker favors recommendations made by an automated decision-making system over
information made without automation, even when the automated decision-making system makes errors.
backpropagation
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node
are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each
parameter is calculated in a backward pass through the graph.
bag of words
A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents
the following three phrases identically:
• the dog jumps
• jumps the dog
• dog jumps the
Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the
vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the
three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:
baseline
A model used as a reference point for comparing how well another model (typically, a more complex one) is
performing. For example, a logistic regression model might serve as a good baseline for a deep model.
For a particular problem, the baseline helps model developers quantify the minimal expected performance that a
new model must achieve for the new model to be useful.
batch
The set of examples used in one iteration (that is, one gradient update) of model training.
batch normalization
Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide
the following benefits:
batch size
The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch
is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow
does permit dynamic batch sizes.
bias (ethics/fairness)
#fairness
1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect
collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this
type of bias include:
• automation bias
• confirmation bias
• experimenter’s bias
• group attribution bias
• implicit bias
• in-group bias
• out-group homogeneity bias
2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:
• coverage bias
• non-response bias
• participation bias
• reporting bias
• sampling bias
• selection bias
Not to be confused with the bias term in machine learning models or prediction bias
bias (math)
An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine
learning models. For example, bias is the b in the following formula:
y ′ = b + w1x1 + w2x2 + … wnxn
bigram
An N-gram in which N=2.
binary classification
A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning
model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.
binning
See bucketing.
boosting
A ML technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak"
classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model
is currently misclassfying.
broadcasting
Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For
instance, linear algebra requires that the two operands in a matrix addition operation must have the same
dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this
operation by virtually expanding the vector of length n to a matrix of shape (m,n) by replicating the same values
down each column.
For example, given the following definitions, linear algebra prohibits A+B because A and B have different
dimensions:
[[2, 2, 2],
[2, 2, 2]]
bucketing
Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on
value range. For example, instead of representing temperature as a single continuous floating-point feature, you
could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all
temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin,
and 30.1 to 50.0 degrees could be a third bin.
calibration layer
A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities
should match the distribution of an observed set of labels.
candidate generation
The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that
offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular
user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases
of a recommendation system (such as scoring and re-ranking) whittle down those 500 to a much smaller, more
useful set of recommendations.
candidate sampling
A training-time optimization in which a probability is calculated for all the positive labels, using, for example,
softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and
dog candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog
class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the
negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper
positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a
computational efficiency win from not computing predictions for all negatives.
categorical data
Features having a discrete set of possible values. For example, consider a categorical feature named house
style, which has a discrete set of three possible values: Tudor, ranch, colonial. By representing
house style as categorical data, the model can learn the separate impacts of Tudor, ranch, and colonial
on house price.
Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example.
For example, a car maker categorical feature would probably permit only a single value (Toyota) per
example. Other times, more than one value may be applicable. A single car could be painted more than one
different color, so a car color categorical feature would likely permit a single example to have multiple values
(for example, red and white).
centroid
The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-
means or k-median algorithm finds 3 centroids.
centroid-based clustering
A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely
used centroid-based clustering algorithm.
checkpoint
Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model
weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past
errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.
class
One of a set of enumerated target values for a label. For example, in a binary classification model that detects
spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the
classes would be poodle, beagle, pug, and so on.
classification model
A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural
language processing classification model could determine whether an input sentence was in French, Spanish, or
Italian. Compare with regression model.
classification threshold
A scalar-value criterion that is applied to a model's predicted score in order to separate the positive class from the
negative class. Used when mapping logistic regression results to binary classification. For example, consider a
logistic regression model that determines the probability of a given email message being spam. If the classification
threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified
as not spam.
class-imbalanced dataset
A binary classification problem in which the labels for the two classes have significantly different frequencies.
For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is
a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and
0.49 label the other team winning is not a class-imbalanced problem.
clipping
A technique for handling outliers. Specifically, reducing feature values that are greater than a set maximum value
down to that maximum value. Also, increasing feature values that are less than a specific minimum value up to that
minimum value.
For example, suppose that only a few feature values fall outside the range 40–60. In this case, you could do the
following:
Cloud TPU
#TensorFlow
#GoogleCloud
clustering
Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a
human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity
to a centroid, as in the following diagram:
tree
height
centroid
cluster 1
cluster 2
tree width
A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2
as "full-size trees."
As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated
as follows:
cluster 1
cluster 2
cluster 3
co-adaptation
When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons
instead of relying on the network's behavior as a whole. When the patterns that cause co-adaption are not present in
validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because
dropout ensures neurons cannot rely solely on specific other neurons.
collaborative filtering
Making predictions about the interests of one user based on the interests of many other users. Collaborative
filtering is often used in recommendation systems.
confirmation bias
#fairness
The tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs
or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an
outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.
Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a
preexisting hypothesis is confirmed.
confusion matrix
An NxN table that summarizes how successful a classification model's predictions were; that is, the correlation
between the label and the model's classification. One axis of a confusion matrix is the label that the model
predicted, and the other axis is the actual label. N represents the number of classes. In a binary classification
problem, N=2. For example, here is a sample confusion matrix for a binary classification problem:
The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model correctly
classified 18 as having tumors (18 true positives), and incorrectly classified 1 as not having a tumor (1 false
negative). Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true
negatives) and 6 were incorrectly classified (6 false positives).
The confusion matrix for a multi-class classification problem can help you determine mistake patterns. For
example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly
predict 9 instead of 4, or 1 instead of 7.
Confusion matrices contain sufficient information to calculate a variety of performance metrics, including
precision and recall.
continuous feature
A floating-point feature with an infinite range of possible values. Contrast with discrete feature.
convenience sampling
Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a
scientifically gathered dataset.
convergence
Informally, often refers to a state reached during training in which training loss and validation loss change very
little or not at all with each iteration after a certain number of iterations. In other words, a model reaches
convergence when additional training on the current data will not improve the model. In deep learning, loss values
sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false
sense of convergence.
By contrast, the following function is not convex. Notice how the region above the graph is not a convex set:
local
minimum
global local
minimum minimum
A strictly convex function has exactly one local minimum point, which is also the global minimum point. The
classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight
lines) are not U-shaped.
A lot of the common loss functions, including the following, are convex functions:
• L2 loss
• Log Loss
• L1 regularization
• L2 regularization
Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex
function. Similarly, many variations of stochastic gradient descent have a high probability (though, not a
guarantee) of finding a point close to the minimum of a strictly convex function.
The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.
Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find
reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global
minimum.
convex optimization
The process of using mathematical techniques such as gradient descent to find the minimum of a convex
function. A great deal of research in machine learning has focused on formulating various problems as convex
optimization problems and in solving those problems more efficiently.
convex set
A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within
the subset. For instance, the following two shapes are convex sets:
convolution
In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the
convolutional filter and the input matrix in order to train weights.
The term "convolution" in machine learning is often a shorthand way of referring to either convolutional
operation or convolutional layer.
Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large
tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M
separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in
the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional
filter is applied, it is simply replicated across cells such that each is multiplied by the filter.
convolutional filter
One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional
filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input
matrix, the filter could be any 2D matrix smaller than 28x28.
In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones
and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the
network trains the ideal values.
convolutional layer
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example,
consider the following 3x3 convolutional filter:
The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5
input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The
resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:
convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network
consists of some combination of the following layers:
• convolutional layers
• pooling layers
• dense layers
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
convolutional operation
The following two-step mathematical operation:
1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the
input matrix has the same rank and size as the convolutional filter.)
2. Summation of all the values in the resulting product matrix.
A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input
matrix.
cost
Synonym for loss.
coverage bias
See selection bias.
crash blossom
A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural
language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an
NLU model could interpret the headline literally or figuratively.
cross-entropy
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference
between two probability distributions. See also perplexity.
cross-validation
A mechanism for estimating how well a model will generalize to new data by testing the model against one or
more non-overlapping data subsets withheld from the training set.
custom Estimator
#TensorFlow
data analysis
Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be
particularly useful when a dataset is first received, before one builds the first model. It is also crucial in
understanding experiments and debugging problems with the system.
data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create
additional examples. For example, suppose images are one of your features, but your dataset doesn't contain
enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to
your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and
reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable
excellent training.
DataFrame
A popular datatype for representing datasets in pandas. A DataFrame is analogous to a table. Each column of the
DataFrame has a name (a header), and each row is identified by a number.
A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm
requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or
more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
For details about the Dataset API, see Importing Data in the TensorFlow Programmer's Guide.
decision boundary
The separator between classes learned by a model in a binary class or multi-class classification problems. For
example, in the following image representing a binary classification problem, the decision boundary is the frontier
between the orange class and the blue class:
decision threshold
Synonym for classification threshold.
decision tree
A model represented as a sequence of branching statements. For example, the following over-simplified decision
tree branches a few times to predict the price of a house (in thousands of USD). According to this decision tree, a
house larger than 160 square meters, having more than three bedrooms, and built less than 10 years ago would
have a predicted price of 510 thousand USD.
Is house > 160
square meters?
No Yes
No Yes No Yes
Price = 217 Price = 285 Price = 258 Price = 341 Price = 312 Price = 335 Price = 382 Price = 510
deep model
A type of neural network containing multiple hidden layers.
dense feature
A feature in which most values are non-zero, typically a Tensor of floating-point values. Contrast with sparse
feature.
dense layer
Synonym for fully connected layer.
depth
The number of layers (including any embedding layers) in a neural network that learn weights. For example, a
neural network with 5 hidden layers and 1 output layer has a depth of 6.
A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution
into two separate convolution operations that are more computationally efficient: first, a depthwise convolution,
with a depth of 1 (n ? n ? 1), and then second, a pointwise convolution, with length and width of 1 (1 ? 1 ? n).
To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.
device
#TensorFlow
A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.
dimension reduction
Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by
converting to an embedding.
dimensions
Overloaded term having any of the following definitions:
You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two
coordinates to uniquely specify a particular cell in a two-dimensional matrix.
• The number of entries in a feature vector.
discrete feature
A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable,
or mineral is a discrete (or categorical) feature. Contrast with continuous feature.
discriminative model
A model that predicts labels from a set of one or more features. More formally, discriminative models define the
conditional probability of an output given the features and weights; that is:
For example, a model that predicts whether an email is spam from features and weights is a discriminative model.
The vast majority of supervised learning models, including classification and regression models, are discriminative
models.
discriminator
A system that determines whether examples are real or fake.
The subsystem within a generative adversarial network that determines whether the examples created by the
generator are real or fake.
divisive clustering
See hierarchical clustering.
downsampling
Overloaded term that can mean either of the following:
• Reducing the amount of information in a feature in order to train a model more efficiently. For example,
before training an image recognition model, downsampling high-resolution images to a lower-resolution
format.
• Training on a disproportionately low percentage of over-represented class examples in order to improve
model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to
learn a lot about the majority class and not enough about the minority class. Downsampling helps
balance the amount of training on the majority and minority classes.
dropout regularization
A form of regularization useful in training neural networks. Dropout regularization works by removing a random
selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out,
the stronger the regularization. This is analogous to training the network to emulate an exponentially large
ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from
Overfitting.
dynamic model
A model that is trained online in a continuously updating fashion. That is, data is continuously entering the model.
eager execution
#TensorFlow
A TensorFlow programming environment in which operations run immediately. By contrast, operations called in
graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like
the code in most programming languages. Eager execution programs are generally far easier to debug than graph
execution programs.
early stopping
A method for regularization that involves ending model training before training loss finishes decreasing. In early
stopping, you end model training when the loss on a validation dataset starts to increase, that is, when
generalization performance worsens.
embeddings
A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a
high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English
sentence in either of the following two ways:
• As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the
vector represents a separate English word; the value in a cell represents the number of times that word
appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly
every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1)
representing the number of times that word appeared in the sentence.
• As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-
point value between 0 and 1. This is an embedding.
In TensorFlow, embeddings are trained by backpropagating loss just like any other parameter in a neural
network.
embedding space
The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the
embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal
embedding space, addition and subtraction of embeddings can solve word analogy tasks.
ensemble
A merger of the predictions of multiple models. You can create an ensemble via one or more of the following:
• different initializations
• different hyperparameters
• different overall structure
epoch
A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents
N/batch size training iterations, where N is the total number of examples.
Estimator
#TensorFlow
An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a
TensorFlow session. You may create your own custom Estimators (as described here) or instantiate premade
Estimators created by others.
example
One row of a dataset. An example contains one or more features and possibly a label. See also labeled example
and unlabeled example.
experimenter's bias
#fairness
Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping
can mitigate this problem.
False Positives
False Positive Rate =
False Positives+True Negatives
feature
An input variable used in making predictions.
A function that specifies how a model should interpret a particular feature. A list that collects the output returned
by calls to such functions is a required parameter to all Estimators constructors.
The tf.feature_column functions enable models to easily experiment with different representations of input
features. For details, see the Feature Columns chapter in the TensorFlow Programmers Guide.
feature cross
A synthetic feature formed by crossing (taking a Cartesian product of) individual binary features obtained from
categorical data or from continuous features via bucketing. Feature crosses help represent nonlinear
relationships.
feature engineering
The process of determining which features might be useful in training a model, and then converting raw data from
log files and other sources into said features. In TensorFlow, feature engineering often means converting raw log
file entries to tf.Example protocol buffers. See also tf.Transform.
feature extraction
Overloaded term having either of the following definitions:
feature set
The group of features your machine learning model trains on. For example, postal code, property size, and
property condition might comprise a simple feature set for a model that predicts housing prices.
feature spec
#TensorFlow
Describes the information required to extract features data from the tf.Example protocol buffer. Because the
tf.Example protocol buffer is just a container for data, you must specify the following:
• the data to extract (that is, the keys for the features)
• the data type (for example, float or int)
• The length (fixed or variable)
The Estimator API provides facilities for producing a feature spec from a list of FeatureColumns.
feature vector
The list of feature values representing an example passed into a model.
feedforward neural network (FFN)
A neural network without cyclic or recursive connections. For example, traditional deep neural networks are
feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.
few-shot learning
A machine learning approach, often used for object classification, designed to learn effective classifiers from only
a small number of training examples.
fine tuning
Perform a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine
tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.
forget gate
The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget
gates maintain context by deciding which information to discard from the cell state.
full softmax
See softmax. Contrast with candidate sampling.
G
GAN
Abbreviation for generative adversarial network.
generalization
Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data
used to train the model.
generalization curve
A loss curve showing both the training set and the validation set. A generalization curve can help you detect
possible overfitting. For example, the following generalization curve suggests overfitting because loss for the
validation set ultimately becomes significantly higher than for the training set.
loss
validation set
training set
iterations
generalized linear model
A generalization of least squares regression models, which are based on Gaussian noise, to other types of models
based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models
include:
• logistic regression
• multi-class regression
• least squares regression
The parameters of a generalized linear model can be found through convex optimization.
• The average prediction of the optimal least squares regression model is equal to the average label on the
training data.
• The average probability predicted by the optimal logistic regression model is equal to the average label on
the training data.
The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model
cannot "learn new features."
generative model
Practically speaking, a model that does either of the following:
• Creates (generates) new examples from the training dataset. For example, a generative model could create
poetry after training on a dataset of poems. The generator part of a generative adversarial network falls
into this category.
• Determines the probability that a new example comes from the training set, or was created from the same
mechanism that created the training set. For example, after training on a dataset consisting of English
sentences, a generative model could determine the probability that new input is a valid English sentence.
A generative model can theoretically discern the distribution of examples or particular features in a dataset. That is:
p(examples)
gradient
The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient
is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.
gradient clipping
A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the
maximum value of gradients when using gradient descent to train a model.
gradient descent
A technique to minimize loss by computing the gradients of loss with respect to the model's parameters,
conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best
combination of weights and bias to minimize loss.
graph
#TensorFlow
In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and
represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to
visualize a graph.
graph execution
#TensorFlow
A TensorFlow programming environment in which the program first constructs a graph and then executes all or
part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.
Contrast with eager execution.
ground truth
The correct answer. Reality. Since reality is often subjective, expert raters typically are the proxy for ground truth.
Assuming that what is true for an individual is also true for everyone in that group. The effects of group attribution
bias can be exacerbated if a convenience sampling is used for data collection. In a non-representative sample,
attributions may be made that do not reflect reality.
hashing
In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is
large, but the number of categories actually appearing in the dataset is comparatively small.
For example, Earth is home to about 60,000 tree species. You could represent each of the 60,000 tree species in
60,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset,
you could use hashing to divide tree species into perhaps 500 buckets.
A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple—two
genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large
categorical sets into the desired number of buckets. Hashing turns a categorical feature having a large number of
possible values into a much smaller number of values by grouping values in a deterministic way.
For more information on hashing, see the Feature Columns chapter in the TensorFlow Programmers Guide.
heuristic
A quick solution to a problem, which may or may not be the best solution. For example, "With a heuristic, we
achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%."
hidden layer
A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the
prediction). Hidden layers typically contain an activation function (such as ReLU) for training. A deep neural
network contains more than one hidden layer.
hierarchical clustering
A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to
hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:
• Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest
clusters to create a hierarchical tree.
• Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a
hierarchical tree.
hinge loss
A family of loss functions for classification designed to find the decision boundary as distant as possible from
each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss
(or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as
follows:
hinge
loss
2
-2 -1 0 1 2 3
( y * y' )
holdout data
Examples intentionally not used ("held out") during training. The validation dataset and test dataset are
examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data
it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does
the loss on the training set.
hyperparameter
The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a
hyperparameter.
i.i.d.
Abbreviation for independently and identically distributed.
image recognition
A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image
classification.
imbalanced dataset
Synonym for class-imbalanced dataset.
implicit bias
#fairness
Automatically making an association or assumption based on one’s mental models and memories. Implicit bias can
affect the following:
For example, when building a classifier to identify wedding photos, an engineer may use the presence of a white
dress in a photo as a feature. However, white dresses have been customary only during certain eras and in certain
cultures.
inference
In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled
examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on
some observed data. (See the Wikipedia article on statistical inference.)
in-group bias
#fairness
Showing partiality to one's own group or own characteristics. If testers or raters consist of the machine learning
developer's friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.
In-group bias is a form of group attribution bias. See also out-group homogeneity bias.
input function
#TensorFlow
In TensorFlow, a function that returns input data to the training, evaluation, or prediction method of an Estimator.
For example, the training input function returns a batch of features and labels from the training set.
input layer
The first layer (the one that receives the input data) in a neural network.
instance
Synonym for example.
interpretability
The degree to which a model's predictions can be readily explained. Deep models are often non-interpretable; that
is, a deep model's different layers can be hard to decipher. By contrast, linear regression models and wide models
are typically far more interpretable.
inter-rater agreement
A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may
need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also
Cohen's kappa, which is one of the most popular inter-rater agreement measurements.
item matrix
In recommendation systems, a matrix of embeddings generated by matrix factorization that holds latent signals
about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example,
consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent
signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among
genre, stars, movie age, or other factors.
The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a
movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.
items
In a recommendation system, the entities that a system recommends. For example, videos are the items that a
video store recommends, while books are the items that a bookstore recommends.
iteration
A single update of a model's weights during training. An iteration consists of computing the gradients of the
parameters with respect to the loss on a single batch of data.
K
Keras
A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow,
where it is made available as tf.keras.
k-means
A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically
does the following:
The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each
example to its closest centroid.
For example, consider the following plot of dog height to dog width:
height
width
If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid,
yielding three groups:
Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The
three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should
probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example
in the cluster.
The preceding illustrations shows k-means for examples with only two features (height and width). Note that k-
means can group examples across many features.
k-median
A clustering algorithm closely related to k-means. The practical difference between the two is as follows:
• In k-means, centroids are determined by minimizing the sum of the squares of the distance between a
centroid candidate and each of its examples.
• In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate
and each of its examples.
• k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the
absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:
Manhattan distance = | 2 − 5 | + | 2 − − 2 | = 7
L1 loss
Loss function based on the absolute value of the difference between the values that a model is predicting and the
actual values of the labels. L1 loss is less sensitive to outliers than L2 loss.
L1 regularization
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In
models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant
features to exactly 0, which removes those features from the model. Contrast with L2 regularization.
L2 loss
See squared loss.
L2 regularization
A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2
regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite
to 0. (Contrast with L1 regularization.) L2 regularization always improves generalization in linear models.
label
In supervised learning, the "answer" or "result" portion of an example. Each example in a labeled dataset consists
of one or more features and a label. For instance, in a housing dataset, the features might include the number of
bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam
detection dataset, the features might include the subject line, the sender, and the email message itself, while the
label would probably be either "spam" or "not spam."
labeled example
An example that contains features and a label. In supervised training, models learn from labeled examples.
lambda
Synonym for regularization rate.
(This is an overloaded term. Here we're focusing on the term's definition within regularization.)
layer
A set of neurons in a neural network that process a set of input features, or the output of those neurons.
Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as
input and produce other tensors as output. Once the necessary Tensors have been composed, the user can convert
the result into an Estimator via a model function.
A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API enables you
to build different types of layers, such as:
When writing a custom Estimator, you compose Layers objects to define the characteristics of all the hidden
layers.
The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in
the Layers API have the same names and signatures as their counterparts in the Keras layers API.
learning rate
A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm
multiplies the learning rate by the gradient. The resulting product is called the gradient step.
linear regression
A type of regression model that outputs a continuous value from a linear combination of input features.
logistic regression
A model that generates a probability for each possible discrete label value in classification problems by applying a
sigmoid function to a linear prediction. Although logistic regression is often used in binary classification
problems, it can also be used in multi-class classification problems (where it becomes called multi-class logistic
regression or multinomial regression).
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then
passed to a normalization function. If the model is solving a multi-class classification problem, logits typically
become an input to the softmax function. The softmax function then generates a vector of (normalized)
probabilities with one value for each possible class.
In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see
tf.nn.sigmoid_cross_entropy_with_logits.
Log Loss
The loss function used in binary logistic regression.
log-odds
The logarithm of the odds of some event.
If the event refers to a binary probability, then odds refers to the ratio of the probability of success (p) to the
probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10%
probability of failure. In this case, odds is calculated as follows:
p .9
odds = = =9
(1-p) .1
The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm, but
logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is
therefore:
loss
A measure of how far a model's predictions are from its label. Or, to phrase it more pessimistically, a measure of
how bad the model is. To determine this value, a model must define a loss function. For example, linear regression
models typically use mean squared error for a loss function, while logistic regression models use Log Loss.
loss curve
A graph of loss as a function of training iterations. For example:
loss
iterations
The loss curve can help you determine when your model is converging, overfitting, or underfitting.
loss surface
A graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local
minimum.
LSTM
Abbreviation for Long Short-Term Memory.
machine learning
A program or system that builds (trains) a predictive model from input data. The system uses the learned model to
make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to
train the model. Machine learning also refers to the field of study concerned with these programs or systems.
majority class
The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% non-spam
labels and 1% spam labels, the non-spam labels are the majority class.
matplotlib
An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning.
matrix factorization
In math, a mechanism for finding the matrices whose dot product approximates a target matrix.
In recommendation systems, the target matrix often holds users' ratings on items. For example, the target matrix
for a movie recommendation system might look something like the following, where the positive integers are user
ratings and 0 means that the user didn't rate the movie:
Casablanca The Philadelphia Story Black Panther Wonder Woman Pulp Fiction
User 1 5.0 3.0 0.0 2.0 0.0
User 2 4.0 0.0 0.0 1.0 5.0
User 3 3.0 1.0 4.0 5.0 0.0
The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like
Black Panther?
One approach for recommendation systems is to use matrix factorization to generate the following two matrices:
• A user matrix, shaped as the number of users X the number of embedding dimensions.
• An item matrix, shaped as the number of embedding dimensions X the number of users.
For example, using matrix factorization on our three users and five items could yield the following user matrix and
item matrix:
The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the
original user ratings but also predictions for the movies that each user hasn't seen. For example, consider User 1's
rating of Casablanca, which was 5.0. The dot product corresponding to that cell in the recommendation matrix
should hopefully be around 5.0, and it is:
More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first row and the
third column yields a predicted rating of 4.3:
metric
#TensorFlow
A number that you care about. May or may not be directly optimized in a machine-learning system. A metric that
your system tries to optimize is called an objective.
mini-batch
A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or
inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate
the loss on a mini-batch than on the full training data.
minimax loss
A loss function for generative adversarial networks, based on the cross-entropy between the distribution of
generated data and real data.
Minimax loss is used in the first paper to describe generative adversarial networks.
minority class
The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% non-spam
labels and 1% spam labels, the spam labels are the minority class.
ML
Abbreviation for machine learning.
MNIST
A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing
how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where
each integer is a grayscale value between 0 and 255, inclusive.
MNIST is a canonical dataset for machine learning, often used to test new ML approaches. For details, see The
MNIST Database of Handwritten Digits.
model
The representation of what an ML system has learned from the training data. Within TensorFlow, model is an
overloaded term, which can have either of the following two related meanings:
• The TensorFlow graph that expresses the structure of how a prediction will be computed.
• The particular weights and biases of that TensorFlow graph, which are determined by training.
model capacity
The complexity of problems that a model can learn. The more complex the problems that a model can learn, the
higher the model’s capacity. A model’s capacity typically increases with the number of model parameters. For a
formal definition of classifier capacity, see VC dimension.
model function
#TensorFlow
The function within an Estimator that implements ML training, evaluation, and inference. For example, the
training portion of a model function might handle tasks such as defining the topology of a deep neural network and
identifying its optimizer function. When using premade Estimators, someone has already written the model
function for you. When using custom Estimators, you must write the model function yourself.
For details about writing a model function, see the Creating Custom Estimators chapter in the TensorFlow
Programmers Guide.
model training
The process of determining the best model.
Momentum
A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the
current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing
an exponentially weighted moving average of the gradients over time, analogous to momentum in physics.
Momentum sometimes prevents learning from getting stuck in local minima.
multi-class classification
Classification problems that distinguish among more than two classes. For example, there are approximately 128
species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model
that divided emails into only two categories (spam and not spam) would be a binary classification model.
multinomial classification
Synonym for multi-class classification.
NaN trap
When one number in your model becomes a NaN during training, which causes many or all other numbers in your
model to eventually become a NaN.
negative class
In binary classification, one class is termed positive and the other is termed negative. The positive class is the
thing we're looking for and the negative class is the other possibility. For example, the negative class in a medical
test might be "not tumor." The negative class in an email classifier might be "not spam." See also positive class.
neural network
A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting
of simple connected units or neurons followed by nonlinearities.
neuron
A node in a neural network, typically taking in multiple input values and generating one output value. The neuron
calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of
input values.
N-gram
An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a
different 2-gram than truly madly.
Many natural language understanding models rely on N-grams to predict the next word that the user will type or
say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the
user will next type mice.
Contrast N-grams with bag of words, which are unordered sets of words.
NLU
Abbreviation for natural language understanding.
noise
Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of
ways. For example:
non-response bias
#fairness
normalization
The process of converting an actual range of values into a standard range of values, typically -1 to +1 or 0 to 1. For
example, suppose the natural range of a certain feature is 800 to 6,000. Through subtraction and division, you can
normalize those values into the range -1 to +1.
numerical data
Features represented as integers or real-valued numbers. For example, in a real estate model, you would probably
represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as
numerical data indicates that the feature's values have a mathematical relationship to each other and possibly to the
label. For example, representing the size of a house as numerical data indicates that a 200 square-meter house is
twice as large as a 100 square-meter house. Furthermore, the number of square meters in a house probably has
some mathematical relationship to the price of the house.
Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world
are integers; however, integer postal codes should not be represented as numerical data in models. That's because a
postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different
postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000
are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical
data instead.
NumPy
An open-source math library that provides efficient array operations in Python. pandas is built on NumPy.
objective
A metric that your algorithm is trying to optimize.
objective function
The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear
regression is usually squared loss. Therefore, when training a linear regression model, the goal is to minimize
squared loss.
In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy,
the goal is to maximize accuracy.
offline inference
Generating a group of predictions, storing those predictions, and then retrieving those predictions on demand.
Contrast with online inference.
one-hot encoding
A sparse vector in which:
One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For
example, suppose a given botany dataset chronicles 15,000 different species, each denoted with a unique string
identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which
the vector has a size of 15,000.
one-shot learning
A machine learning approach, often used for object classification, designed to learn effective classifiers from a
single training example.
one-vs.-all
Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary
classifiers—one binary classifier for each possible outcome. For example, given a model that classifies examples
as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary
classifiers:
Operation (op)
#TensorFlow
A node in the TensorFlow graph. In TensorFlow, any procedure that creates, manipulates, or destroys a Tensor is
an operation. For example, a matrix multiply is an operation that takes two Tensors as input and generates one
Tensor as output.
optimizer
A specific implementation of the gradient descent algorithm. TensorFlow's base class for optimizers is
tf.train.Optimizer. Different optimizers may leverage one or more of the following concepts to enhance the
effectiveness of gradient descent on a given training set:
• momentum (Momentum)
• update frequency (AdaGrad = ADAptive GRADient descent; (Adam = ADAptive with Momentum;
RMSProp)
• sparsity/regularization (Ftrl)
• more complex math (Proximal, and others)
The tendency to see out-group members as more alike than in-group members when comparing attitudes, values,
personality traits, and other characteristics. In-group refers to people you interact with regularly; out-group refers
to people you do not interact with regularly. If you create a dataset by asking people to provide attributes about
out-groups, those attributes may be less nuanced and more stereotyped than attributes that participants list for
people in their in-group.
For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small differences in
architectural styles, windows, doors, and sizes. However, the same Lilliputians might simply declare that
Brobdingnagians all live in identical houses.
Outliers often cause problems in model training. Clipping is one way of managing outliers.
output layer
The "final" layer of a neural network. The layer containing the answer(s).
overfitting
Creating a model that matches the training data so closely that the model fails to make correct predictions on new
data.
pandas
A column-oriented data analysis API. Many ML frameworks, including TensorFlow, support pandas data
structures as input. See the pandas documentation for details.
parameter
A variable of a model that the ML system trains on its own. For example, weights are parameters whose values the
ML system gradually learns through successive training iterations. Contrast with hyperparameter.
See the TensorFlow Architecture chapter in the TensorFlow Programmers Guide for details.
parameter update
The operation of adjusting a model's parameters during training, typically within a single iteration of gradient
descent.
partial derivative
A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of
f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The
partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the
equation.
participation bias
#fairness
partitioning strategy
The algorithm by which variables are divided across parameter servers.
perceptron
A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum
of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as
ReLU, sigmoid, or tanh. For example, the following perceptron relies on the sigmoid function to process three
input values:
In the following illustration, the perceptron takes three inputs, each of which is itself modified by a weight before
entering the perceptron:
input 1
weight 1
output
input 2 weight 2 perceptron
weight 3
input 3
Perceptrons are the (nodes) in deep neural networks. That is, a deep neural network consists of multiple
connected perceptrons, plus a backpropagation algorithm to introduce feedback.
performance
Overloaded term with the following meanings:
• The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of
software run?
• The meaning within ML. Here, performance answers the following question: How correct is this model?
That is, how good are the model's predictions?
perplexity
One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few
letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words.
Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain
the actual word the user is trying to type.
P = 2 −cross entropy
pipeline
The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the
data into training data files, training one or more models, and exporting the models to production.
pooling
Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually
involves taking either the maximum or average value across the pooled area. For example, suppose we have the
following 3x3 matrix:
A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that
convolutional operation by strides. For example, suppose the pooling operation divides the convolutional matrix
into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine
that each pooling operation picks the maximum value of the four in that slice:
Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer
to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.
positive class
In binary classification, the two possible classes are labeled as positive and negative. The positive outcome is the
thing we're testing for. (Admittedly, we're simultaneously testing for both outcomes, but play along.) For example,
the positive class in a medical test might be "tumor." The positive class in an email classifier might be "spam."
precision
A metric for classification models. Precision identifies the frequency with which a model was correct when
predicting the positive class. That is:
True Positives
Precision =
True Positives+False Positives
prediction
A model's output when provided with an input example.
prediction bias
#fairness
A value indicating how far apart the average of predictions is from the average of labels in the dataset.
Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.
premade Estimator
#TensorFlow
An Estimator that someone has already built. TensorFlow provides several premade Estimators, including
DNNClassifier, DNNRegressor, and LinearClassifier. To learn more about premade Estimators, see
the Premade Estimators chapter in the TensorFlow Programmers Guide.
Contrast with custom estimators.
pre-trained model
Models or model components (such as embeddings) that have been already been trained. Sometimes, you'll feed
pre-trained embeddings into a neural network. Other times, your model will train the embeddings itself rather
than rely on the pre-trained embeddings.
prior belief
What you believe about the data before you begin training on it. For example, L2 regularization relies on a prior
belief that weights should be small and normally distributed around zero.
proxy labels
Data used to approximate labels not directly available in a dataset.
For example, suppose you want is it raining? to be a Boolean label for your dataset, but the dataset doesn't contain
rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label
for is it raining? However, proxy labels may distort results. For example, in some places, it may be more common
to carry umbrellas to protect against sun than the rain.
quantile
Each bucket in quantile bucketing.
quantile bucketing
Distributing a feature's values into buckets so that each bucket contains the same (or almost the same) number of
examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In
order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-
values.
quantization
An algorithm that implements quantile bucketing on a particular feature in a dataset.
queue
#TensorFlow
A TensorFlow Operation that implements a queue data structure. Typically used in I/O.
random forest
An ensemble approach to finding the decision tree that best fits the training data by creating many decision trees
and then determining the "average" one. The "random" part of the term refers to building each of the decision trees
from a random selection of features; the "forest" refers to the set of decision trees.
rank (ordinality)
The ordinal position of a class in an ML problem that categorizes classes from highest to lowest. For example, a
behavior ranking system could rank a dog's rewards from highest (a steak) to lowest (wilted kale).
rank (Tensor)
#TensorFlow
The number of dimensions in a Tensor. For instance, a scalar has rank 0, a vector has rank 1, and a matrix has rank
2.
rater
A human who provides labels in examples. Sometimes called an "annotator."
recall
A metric for classification models that answers the following question: Out of all the possible positive labels, how
many did the model correctly identify? That is:
True Positives
Recall =
True Positives+False Negatives
recommendation system
A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a
video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting
Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video
recommendation system might base its recommendations on factors such as:
For example, the following figure shows a recurrent neural network that runs four times. Notice that the values
learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run.
Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden
layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the
entire sequence rather than just the meaning of individual words.
User probably
wants the
Queen song.
output layer
hidden layer
hidden layer
input layer
regression model
A type of model that outputs continuous (typically, floating-point) values. Compare with classification models,
which output discrete values, such as "day lily" or "tiger lily."
regularization
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization
include:
• L1 regularization
• L2 regularization
• dropout regularization
• early stopping (this is not a formal regularization method, but can effectively limit overfitting)
regularization rate
A scalar value, represented as lambda, specifying the relative importance of the regularization function. The
following simplified loss equation shows the regularization rate's influence:
Raising the regularization rate reduces overfitting but may make the model less accurate.
reinforcement learning
A machine learning approach to maximize an ultimate reward through feedback (rewards and punishments) after a
sequence of actions. For example, the ultimate reward of most games is victory. Reinforcement learning systems
can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led
to wins and sequences that ultimately led to losses.
reporting bias
#fairness
The fact that the frequency with which people write about actions, outcomes, or properties is not a reflection of
their real-world frequencies or the degree to which a property is characteristic of a class of individuals. Reporting
bias can influence the composition of data that ML systems learn from.
For example, in books, the word laughed is more prevalent than breathed. An ML model that estimates the relative
frequency of laughing and breathing from a book corpus would probably determine that laughing is more common
than breathing.
representation
The process of mapping data to useful features.
re-ranking
The final stage of a recommendation system, during which scored items may be re-graded according to some
other (typically, non-ML) algorithm. Re-ranking evaluates the list of items generated by the scoring phase, taking
actions such as:
ridge regularization
Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics contexts,
whereas L2 regularization is used more often in machine learning.
RNN
Abbreviation for recurrent neural networks.
root directory
#TensorFlow
The directory you specify for hosting subdirectories of the TensorFlow checkpoint and events files of multiple
models.
sampling bias
#fairness
SavedModel
#TensorFlow
The recommended format for saving and recovering TensorFlow models. SavedModel is a language-neutral,
recoverable serialization format, which enables higher-level systems and tools to produce, consume, and transform
TensorFlow models.
See the Saving and Restoring chapter in the TensorFlow Programmer's Guide for complete details.
Saver
#TensorFlow
scalar
A single number or a single string that can be designated a tensor of rank 0. For example, the following lines of
code each create one scalar in TensorFlow:
scaling
A commonly used practice in feature engineering to tame a feature's range of values to match the range of other
features in the dataset. For example, suppose that you want all floating-point features in the dataset to have a range
of 0 to 1. Given a particular feature's range of 0 to 500, you could scale that feature by dividing each value by 500.
scikit-learn
A popular open-source ML platform. See www.scikit-learn.org.
scoring
The part of a recommendation system that provides a value or ranking for each item produced by the candidate
generation phase.
selection bias
#fairness
Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences
between samples observed in the data and those not observed. The following forms of selection bias exist:
• coverage bias: The population represented in the dataset does not match the population that the ML model
is making predictions about.
• sampling bias: Data is not collected randomly from the target group.
• non-response bias (also called participation bias): Users from certain groups opt-out of surveys at
different rates than users from other groups.
For example, suppose you are creating an ML model that predicts people's enjoyment of a movie. To collect
training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may
sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following
forms of selection bias:
• coverage bias: By sampling from a population who chose to see the movie, your model's predictions may
not generalize to people who did not already express that level of interest in the movie.
• sampling bias: Rather than randomly sampling from the intended population (all the people at the movie),
you sampled only the people in the front row. It is possible that the people sitting in the front row were
more interested in the movie than those in other rows.
• non-response bias: In general, people with strong opinions tend to respond to optional surveys more
frequently than people with mild opinions. Since the movie survey is optional, the responses are more
likely to form a bimodal distribution than a normal (bell-shaped) distribution.
semi-supervised learning
Training a model on data where some of the training examples have labels but others don’t. One technique for
semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to
create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled
examples are plentiful.
sentiment analysis
Using statistical or machine learning algorithms to determine a group's overall attitude—positive or
negative—toward a service, product, organization, or topic. For example, using natural language understanding,
an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the
degree to which students generally liked or disliked the course.
sequence model
A model whose inputs have a sequential dependence. For example, predicting the next video watched from a
sequence of previously watched videos.
serving
A synonym for inferring.
session (tf.session)
#TensorFlow
An object that encapsulates the state of the TensorFlow runtime and runs all or part of a graph. When using the
low-level TensorFlow APIs, you instantiate and manage one or more tf.session objects directly. When using
the Estimators API, Estimators instantiate session objects for you.
shape (Tensor)
The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example,
the following two-dimensional tensor has a shape of [3,4]:
[[5, 7, 6, 4],
[2, 9, 4, 8],
[3, 6, 5, 1]]
TensorFlow uses row-major (C-style) format to represent the order of dimensions, which is why the shape in
TensorFlow is [3,4] rather than [4,3]. In other words, in a two-dimensional TensorFlow Tensor, the shape is
[number of rows, number of columns].
sigmoid function
A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value
between 0 and 1. The sigmoid function has the following formula:
1
y=
1+e−σ
In other words, the sigmoid function converts σ into a probability between 0 and 1.
In some neural networks, the sigmoid function acts as the activation function.
similarity measure
In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.
size invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the size of the
image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels.
Note that even the best image classification algorithms still have practical limits on size invariance. For example,
an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.
Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating
similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within
each bucket.
softmax
A function that provides probabilities for each possible class in a multi-class classification model. The
probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image
being a dog at 0.9, a cat at 0.08, and a horse at 0.02. (Also called full softmax.)
sparse feature
Feature vector whose values are predominately zero or empty. For example, a vector containing a single 1 value
and a million 0 values is sparse. As another example, words in a search query could also be a sparse feature—there
are many possible words in a given language, but only a few of them occur in a given query.
sparse representation
A representation of a tensor that only stores nonzero elements.
For example, the English language consists of about a million words. Consider two ways to represent a count of the
words used in one English sentence:
• A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of
them, and a low integer into a few of them.
• A sparse representation of this sentence stores only those cells symbolizing a word actually in the
sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the
sentence would store an integer in only 20 cells.
For example, consider two ways to represent the sentence, "Dogs wag tails." As the following tables show, the
dense representation consumes about a million cells; the sparse representation consumes only 3 cells:
Dense Representation
Cell Number Word Occurrence
0 a 0
1 aardvark 0
2 aargh 0
3 aarti 0
… 140,391 more words with an occurrence of 0
140395 dogs 1
… 633,062 words with an occurrence of 0
773458 tails 1
… 189,136 words with an occurrence of 0
962594 wag 1
… many more words with an occurrence of 0
Sparse Representation
Cell Number Word Occurrence
140395 dogs 1
773458 tails 1
962594 wag 1
sparse vector
A vector whose values are mostly zeroes. See also sparse feature.
sparsity
The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that
vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity
is as follows:
98
sparsity = = 0.98
100
Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model
weights.
spatial pooling
See pooling.
squared hinge loss
The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.
squared loss
The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the squares of the
difference between a model's predicted value for a labeled example and the actual value of the label. Due to
squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to
outliers than L1 loss.
static model
A model that is trained offline.
stationarity
A property of data in a dataset, in which the data distribution stays constant across one or more dimensions. Most
commonly, that dimension is time, meaning that data exhibiting stationarity doesn't change over time. For example,
data that exhibits stationarity doesn't change from September to December.
step
A forward and backward evaluation of one batch.
step size
Synonym for learning rate.
The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride
would also be three-dimensional.
• The desire to build the most predictive model (for example, lowest loss).
• The desire to keep the model as simple as possible (for example, strong regularization).
For example, a function that minimizes loss+regularization on the training set is a structural risk minimization
algorithm.
subsampling
See pooling.
summary
#TensorFlow
In TensorFlow, a value or set of values calculated at a particular step, usually used for tracking model metrics
during training.
supervised machine learning
Training a model from input data and its corresponding labels. Supervised machine learning is analogous to a
student learning a subject by studying a set of questions and their corresponding answers. After mastering the
mapping between questions and answers, the student can then provide answers to new (never-before-seen)
questions on the same topic. Compare with unsupervised machine learning.
synthetic feature
A feature not present among the input features, but created from one or more of them. Kinds of synthetic features
include:
Features created by normalizing or scaling alone are not considered synthetic features.
target
Synonym for label.
temporal data
Data recorded at different points in time. For example, winter coat sales recorded for each day of the year would be
temporal data.
Tensor
#TensorFlow
The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large)
data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-
point, or string values.
TensorBoard
#TensorFlow
The dashboard that displays the summaries saved during the execution of one or more TensorFlow programs.
TensorFlow
#TensorFlow
A large-scale, distributed, machine learning platform. The term also refers to the base API layer in the TensorFlow
stack, which supports general computation on dataflow graphs.
Although TensorFlow is primarily used for machine learning, you may also use TensorFlow for non-ML tasks that
require numerical computation using dataflow graphs.
TensorFlow Playground
#TensorFlow
A program that visualizes how different hyperparameters influence model (primarily neural network) training.
Go to http://playground.tensorflow.org to experiment with TensorFlow Playground.
TensorFlow Serving
#TensorFlow
An ASIC (application-specific integrated circuit) that optimizes the performance of TensorFlow programs.
Tensor rank
#TensorFlow
See rank (Tensor).
Tensor shape
#TensorFlow
The number of elements a Tensor contains in various dimensions. For example, a [5, 10] Tensor has a shape of 5
in one dimension and 10 in another.
Tensor size
#TensorFlow
The total number of scalars a Tensor contains. For example, a [5, 10] Tensor has a size of 50.
test set
The subset of the dataset that you use to test your model after the model has gone through initial vetting by the
validation set.
tf.Example
#TensorFlow
A standard protocol buffer for describing input data for machine learning model training or inference.
tf.keras
#TensorFlow
timestep
One "unrolled" cell within a recurrent neural network. For example, the following figure shows three timesteps
(labeled with the subscripts t-1, t, and t+1):
tower
A component of a deep neural network that is itself a deep neural network without an output layer. Typically,
each tower reads from an independent data source. Towers are independent until their output is combined in a final
layer.
TPU
#TensorFlow
#GoogleCloud
TPU chip
#TensorFlow
#GoogleCloud
A programmable linear algebra accelerator whose performance is optimized for machine learning workloads,
specifically the training phase. Also known as a shard.
TPU device
#TensorFlow
#GoogleCloud
A board with 4 TPU chips, where each chip has two cores for a total of 8 cores of ML compute. Current TPU
device versions are referred to as either TPU v2-8 or TPU v3-8. Each board is used independently, connected to
Google Cloud through normal networking infrastructure. In Cloud, access to the TPUs is instantiated through
Cloud TPU APIs.
TPU master
#TensorFlow
#GoogleCloud
The central coordination process running on a host machine that sends and receives data and results, programs, and
performance and health data to the TPU workers. It also manages the setup and shutdown of devices in a TPU
configuration.
TPU node
#TensorFlow
#GoogleCloud
TPU Pod
#TensorFlow
#GoogleCloud
A set quantity of TPU devices, connected to one another through a high-speed network. For example, a TPU
version 2 Pod contains 64 networked TPU devices. A full TPU version 2 Pod is referred to as v2-512.
TPU resource
#TensorFlow
#GoogleCloud
TPU slice
#TensorFlow
#GoogleCloud
A supported subset of a TPU Pod that a user can specify. For example a TPU v2 Pod slice could be v2-32 (TPU v2
containing 32 cores) up to a full pod of v2-512 (TPU v2 containing 512 cores).
TPU worker
#TensorFlow
#GoogleCloud
A process running on a host machine connected to a TPU that executes TensorFlow programs on the TPU node.
training
The process of determining the ideal parameters comprising a model.
training set
The subset of the dataset used to train a model.
transfer learning
Transferring information from one machine learning task to another. For example, in multi-task learning, a single
model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer
learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or
involve transferring knowledge from a task where there is more data to one where there is less data.
Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in
which a single program can solve multiple tasks.
translational invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the position of
objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of
the frame or at the left end of the frame.
trigram
An N-gram in which N=3.
True Positives
True Positive Rate =
True Positives+False Negatives
U
underfitting
Producing a model with poor predictive ability because the model hasn't captured the complexity of the training
data. Many problems can cause underfitting, including:
unlabeled example
An example that contains features but no label. Unlabeled examples are the input to inference. In semi-
supervised and unsupervised learning, unlabeled examples are used during training.
The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For
example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the
music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music
recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example,
in domains such as anti-abuse and fraud, clusters can help humans better understand the data.
Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying
PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing
lemons frequently also contain antacids.
upweighting
Applying a weight to the downsampled class equal to the factor by which you downsampled.
user matrix
In recommendation systems, an embedding generated by matrix factorization that holds latent signals about
user preferences. Each row of the user matrix holds information about the relative strength of various latent signals
for a single user. For example, consider a movie recommendation system. In this system, the latent signals in the
user matrix might represent each user's interest in particular genres, or might be harder-to-interpret signals that
involve complex interactions across multiple factors.
The user matrix has a column for each latent feature and a row for each user. That is, the user matrix has the same
number of rows as the target matrix that is being factorized. For example, given a movie recommendation system
for 1,000,000 users, the user matrix will have 1,000,000 rows.
validation
A process used, as part of training, to evaluate the quality of a machine learning model using the validation set.
Because the validation set is disjoint from the training set, validation helps ensure that the model’s performance
generalizes beyond the training set.
validation set
A subset of the dataset—disjoint from the training set—used in validation.
Wasserstein loss
One of the loss functions commonly used in generative adversarial networks, based on the earth-mover's
distance between the distribution of generated data and real data.
wide model
A linear model that typically has many sparse input features. We refer to it as "wide" since such a model is a
special type of neural network with a large number of inputs that connect directly to the output node. Wide
models are often easier to debug and inspect than deep models. Although wide models cannot express
nonlinearities through hidden layers, they can use transformations such as feature crossing and bucketization to
model nonlinearities in different ways.
width
The number of neurons in a particular layer of a neural network.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0
License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a
registered trademark of Oracle and/or its affiliates.
• Connect
◦ Blog
◦ Facebook
◦ Medium
◦ Twitter
◦ YouTube
• Programs
◦ Women Techmakers
◦ Agency Program
◦ GDG
◦ Google Developers Experts
◦ Startup Launchpad
◦ User Studies
• Developer Consoles
•
◦ Android
◦ Chrome
◦ Firebase
◦ Google Cloud Platform
◦ All Products
Bahasa Indonesia
Terms Privacy
Sign up for the Google Developers newsletter
Subscribe
This page
Documentation feedback