classification and prediction informs successive tweaks

Network Structure
●Multiple layers
●Input layer (raw observations)
●Hidden layers
●Output layer
●Nodes
●Weights (like coefficients, subject to iterative adjustment)
●Bias values (also subject to iterative adjustment)

Schematic Diagram

Tiny Example
Predict consumer opinion of a cheese product based on fat and salt content
Obs.  Fat Score  Salt Score  Opinion
1     0.2        0.9         like
2     0.1        0.1         dislike
3     0.2        0.4         dislike
4     0.2        0.5         dislike
5     0.4        0.5         like
6     0.3        0.8         like

Example – Using fat & salt content to predict consumer acceptance of cheese
Rectangles are nodes, the wij on arrows are weights, and the θj are node bias values

Moving Through the Network

The Input Layer
●For input layer, input = output
●E.g., for record #1: Fat input = output = 0.2 Salt input = output = 0.9
●Output of input layer = input into hidden layer

The Hidden Layer
●In this example, it has 3 nodes
●Each node receives as input the output of all input nodes
●Output of each hidden node is some function of the weighted sum of inputs

The Weights
●The weights θ (theta) and w are typically initialized to random values in the range -0.05 to +0.05
●Equivalent to a model with random prediction (in other words, no predictive value)
●These initial weights are used in the first round of training

Output of Node 3 if g is a Logistic Function
output3 = g(θ3 + w13·(fat) + w23·(salt)), where g(s) = 1 / (1 + e^(-s))

Initial Pass of the Network
Node outputs (shown at right within each node) using the first record in the tiny example, and a logistic function
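This first-pass calculation can be sketched in Python. The weight and bias values below (w13, w23, θ3) are illustrative, not an actual random initialization:

```python
import math

def logistic(s):
    """Logistic activation: g(s) = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, bias):
    """Node output: g(bias + weighted sum of inputs)."""
    return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

# Record #1 from the tiny example: fat = 0.2, salt = 0.9
fat, salt = 0.2, 0.9

# Illustrative weights/bias for hidden node 3 (in practice these are
# initialized to small random values before the first pass)
w13, w23, theta3 = 0.05, 0.01, -0.3

out3 = node_output([fat, salt], [w13, w23], theta3)
print(round(out3, 2))  # g(-0.3 + 0.05*0.2 + 0.01*0.9) = g(-0.281) ≈ 0.43
```

Because g squashes any weighted sum into (0, 1), every node's output stays in the 0-1 range regardless of the weights.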
Calculations at hidden node 3:

Output Layer
The output of the last hidden layer becomes input for the output layer, which has one node per class.

Mapping the output to a classification
Output for “like” = 0.506, just slightly greater than that for “dislike,” so the classification, at this early stage, is “like”

Relation to Linear Regression
A net with a single output node and no hidden layers, where g is the identity function, takes the same form as a linear regression model

Training the Model
●Categorical variables
●If equidistant categories, map to equidistant interval points in the 0-1 range
●Otherwise, create dummy variables
●Transform (e.g., log) skewed variables

Initial Pass Through Network
●Goal: Find weights that yield best predictions
●The process described above is repeated for all records
●At each record, compare prediction to actual
●The difference is the error for the output node
●The error is propagated back and distributed to all the hidden nodes and used to update their weights

Back Propagation (“back-prop”)
●Output from output node k:
  outputk = g(θk + Σj wjk·outputj)
●Error associated with that node:
  errk = outputk (1 − outputk)(actualk − outputk)
Note: this is like the ordinary error (actualk − outputk), multiplied by a correction factor, outputk (1 − outputk)

Error is Used to Update Weights
new weight = old weight + l × err
●l = a constant between 0 and 1, reflecting the “learning rate” or “weight decay parameter”

Why It Works
●Big errors lead to big changes in weights
●Small errors leave weights relatively unchanged
●Over thousands of updates, a given weight keeps changing until the error associated with that weight is negligible, at which point weights change little

RapidMiner Process

Tiny Example - Final Weights

Tiny Example - Final Propensities and Classifications
And Confusion Matrix
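The error-correction step for an output node can be sketched as follows. This is a generic back-prop sketch, not the book's exact procedure: each weight's correction is additionally scaled by that weight's input (the bias by 1), and all numeric values below are made up for illustration:

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(weights, bias, hidden_outputs, actual, l=0.5):
    """One update for an output node.

    err = output * (1 - output) * (actual - output); each weight is
    nudged by l * err * (its input), and the bias by l * err.
    """
    output = logistic(bias + sum(w * h for w, h in zip(weights, hidden_outputs)))
    err = output * (1 - output) * (actual - output)
    new_weights = [w + l * err * h for w, h in zip(weights, hidden_outputs)]
    new_bias = bias + l * err
    return new_weights, new_bias, err

# Illustrative values: 3 hidden-node outputs feeding the "like" node;
# actual = 1 ("like"), so err is positive and the weights move up
w, b, err = backprop_step([0.01, 0.05, 0.015], -0.015, [0.43, 0.51, 0.52], actual=1)
```

A big gap between actual and output produces a big err and a big weight change; a small gap leaves the weights nearly unchanged, which is exactly the "why it works" behavior described above.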
Common Criteria to Stop the Updating
●When weights change very little from one iteration to the next
●When the misclassification rate reaches a required threshold
●When a limit on runs is reached
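These three criteria can be combined in a training loop. The sketch below trains a single logistic node on the tiny example; the learning rate, tolerance, and run limit are illustrative settings, not recommended values:

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

# Tiny example: ([fat, salt], opinion) with like = 1, dislike = 0
data = [([0.2, 0.9], 1), ([0.1, 0.1], 0), ([0.2, 0.4], 0),
        ([0.2, 0.5], 0), ([0.4, 0.5], 1), ([0.3, 0.8], 1)]

w, b = [0.0, 0.0], 0.0               # start from neutral weights
l, tol, max_runs = 0.5, 1e-6, 2000   # illustrative settings

for run in range(max_runs):          # criterion 3: limit on runs
    biggest_change = 0.0
    for x, actual in data:
        out = logistic(b + sum(wi * xi for wi, xi in zip(w, x)))
        err = out * (1 - out) * (actual - out)
        for i, xi in enumerate(x):
            w[i] += l * err * xi
            biggest_change = max(biggest_change, abs(l * err * xi))
        b += l * err
        biggest_change = max(biggest_change, abs(l * err))
    misclassified = sum(
        (logistic(b + sum(wi * xi for wi, xi in zip(w, x))) > 0.5) != (actual == 1)
        for x, actual in data)
    if biggest_change < tol:         # criterion 1: weights barely changing
        break
    if misclassified == 0:           # criterion 2: misclassification threshold met
        break
```

Here the misclassification threshold is set to zero on the training data purely for illustration; in practice the threshold (and the error tracking) should be based on validation data, as discussed next.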
Avoiding Overfitting
With sufficient iterations, a neural net can easily overfit the data

To avoid overfitting:
●Track error in validation data or via cross-validation
●Limit iterations
●Limit complexity of network

User Inputs

Specify Network Architecture
Number of hidden layers
●Most popular – one hidden layer
Size (number of nodes) of hidden layer(s)
●More nodes capture complexity, but increase the chance of overfitting
Learning Rate
●Low values “downweight” the new information from errors at each iteration
●This slows learning, but reduces the tendency to overfit to local structure
Momentum
●Helps avoid getting stuck in a local max or min

Advantages
●Good predictive ability
●Can capture complex relationships
●No need to specify a model
●Complex networks are good with large numbers of “low-level” features, like pixel values in an image, or words in a text (see “deep learning”)

Disadvantages
●Considered a “black box” prediction machine, with no insight into relationships between predictors and outcome
●No variable-selection mechanism, so you have to exercise care in selecting variables
●Heavy computational requirements if there are many variables (additional variables dramatically increase the number of weights to calculate)

Deep Learning
●The statistical and machine learning models in this book - including standard neural nets - work where you have informative predictors (purchase information, bank account information, # of rooms in a house, etc.)
●In rapidly-growing applications such as voice and image recognition, you have high numbers of “low-level” granular predictors - pixel values, wave amplitudes - that are uninformative at this low level

Deep Learning
The most active application area for neural nets

RapidMiner extensions

Image Handling and Deep Learning
• In image recognition, pixel values are the predictors, and there might be 100,000+ predictors – big data! (voice recognition is similar)
• Deep neural nets with many layers (“neural nets on steroids”) have facilitated revolutionary breakthroughs in image/voice recognition, and in artificial intelligence (AI)
• Key is the ability to self-learn features (“unsupervised”)
• For example, clustering could separate the pixels in this 1” by 1” football field image into the “green field” and “yard marker” areas without knowing that those concepts exist
• From there, the concept of a boundary, or “edge,” emerges
• Successive stages move from identification of local, simple features to more global & complex features

Convolutional Neural Net example in image recognition
●A popular deep learning implementation is a convolutional neural net (CNN)
●Need to aggregate predictors (pixels)
●Rather than have weights for each pixel, group pixels together and apply the same operation: “convolution”
●A common aggregation is a 3 x 3 pixel area, for example the small area around this man’s lower chin
Enlargement of area - pixel values (higher number = darker):
25 200 25
25 225 25
25 225 25

Filter matrix that is good at identifying vertical lines (we will see why shortly):
0 1 0
0 1 0
0 1 0

Apply the Convolution
The convolution operation is “multiply the pixel matrix by the filter matrix,” then sum:
0*25 + 1*200 + 0*25 +
0*25 + 1*225 + 0*25 +
0*25 + 1*225 + 0*25 = 650

Sum = 650; this is higher than for any other arrangement of the filter matrix, because the pixel values are highest in the central column

Continue the Convolution
●The filter matrix moves across the image, storing its result, yielding a smaller matrix whose values indicate the presence or absence of a vertical line
●Similar filters can detect horizontal lines, curves, borders - hyper-local features
●Further convolutions can be applied to these local features
●Result: a multi-dimensional matrix, or tensor, of higher-level features

The Learning Process
How does the net learn which convolutions to do?
●In supervised learning, the net retains those convolutions and features which are successful in labeling (tagging) images
●Note that the feature-learning process yields a reduced (simpler) set of features than the original set of pixel values
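The convolution and the sliding step can be sketched in Python. The 3 x 3 patch and filter are the ones from the example; the small 4 x 4 test image is made up for illustration:

```python
# 3 x 3 enlargement of the image area: pixel values (higher number = darker)
patch = [[25, 200, 25],
         [25, 225, 25],
         [25, 225, 25]]

# Filter matrix that is good at identifying vertical lines
vertical_filter = [[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]]

def convolve(patch, filt):
    """Multiply the pixel matrix by the filter matrix element-wise, then sum."""
    return sum(p * f
               for prow, frow in zip(patch, filt)
               for p, f in zip(prow, frow))

print(convolve(patch, vertical_filter))  # 0*25 + 1*200 + ... = 650

def feature_map(image, filt):
    """Slide the filter across the image, storing one result per position."""
    k = len(filt)
    rows, cols = len(image), len(image[0])
    return [[convolve([row[c:c + k] for row in image[r:r + k]], filt)
             for c in range(cols - k + 1)]
            for r in range(rows - k + 1)]

# A made-up 4 x 4 image containing a dark vertical stripe in column 1
image = [[25, 200, 25, 25],
         [25, 225, 25, 25],
         [25, 225, 25, 25],
         [25, 200, 25, 25]]
print(feature_map(image, vertical_filter))  # high values mark the stripe
```

The resulting smaller matrix has high values exactly where the vertical stripe sits, which is what "indicate the presence or absence of a vertical line" means in practice.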
(In supervised learning, the training data has known labels.)

Unsupervised Learning: Autoencoding
●Deep learning nets can learn higher-level features even when there are no labels to guide the process
●The net adds a process to take the high-level features and generate an image
●The generated image is compared to the original image, and the net retains the architecture that produces the best matches

Deep learning networks have many settings

Summary
●Neural nets can capture flexible/complicated relationships between outcome and predictors
●The network “learns” and updates its model iteratively as more data are fed into it
●Major danger: overfitting
●Requires large amounts of data
●Good predictive performance, yet it’s a “black box”
●Deep learning (very complex neural nets) is effective in learning higher-level features from a multitude of lower-level ones
●Deep learning is the key to image recognition and many AI applications