
IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 27, NO. 4, OCTOBER 2012

Statistical Machine Learning and Dissolved Gas Analysis: A Review

Piotr Mirowski, Member, IEEE, and Yann LeCun

Abstract—Dissolved gas analysis (DGA) of the insulation oil of power transformers is an investigative tool to monitor their health and to detect impending failures by recognizing anomalous patterns of DGA concentrations. We handle the failure prediction problem as a simple data-mining task on DGA samples, optionally exploiting the transformer’s age, nominal power and voltage, and consider two approaches: 1) binary classification and 2) regression of the time to failure. We propose a simple logarithmic transform to preprocess DGA data in order to deal with long-tail distributions of concentrations. We have reviewed and evaluated 15 standard statistical machine-learning algorithms on that task, and reported quantitative results on a small but published set of power transformers and on proprietary data from thousands of network transformers of a utility company. Our results confirm that nonlinear decision functions, such as neural networks, support vector machines with Gaussian kernels, or local linear regression can theoretically provide slightly better performance than linear classifiers or regressors. Software and part of the data are available at http://www.mirowski.info/pub/dga.

Index Terms—Artificial intelligence, neural networks, power transformer insulation, prediction methods, statistics, transformers.

Manuscript received March 19, 2011; revised September 28, 2011, December 20, 2011, and February 12, 2012; accepted April 14, 2012. Date of current version September 19, 2012. This work was supported by the NYU-Poly Seed Grant. Paper no. TPWRD-00234-2011.
P. Mirowski is with the Statistics and Learning Research Department, Alcatel-Lucent Bell Laboratories, Murray Hill, NJ 07974 USA.
Y. LeCun is with the Courant Institute of Mathematical Sciences, New York University, New York City, NY 10003 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TPWRD.2012.2197868

I. INTRODUCTION

DISSOLVED GAS analysis (DGA) has been used for more than 30 years [1]–[3] for the condition assessment of functioning electrical transformers. DGA measures the concentrations of hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), acetylene (C2H2), carbon monoxide (CO), and carbon dioxide (CO2) dissolved in transformer oil. CO and CO2 are generally associated with the decomposition of cellulosic insulation; usually, small amounts of H2 and CH4 would be expected as well. H2, CH4, and larger amounts of C2H6, C2H4, and C2H2 are generally associated with the decomposition of oil. All transformers generate some gas during normal operation, but it has become generally accepted that gas generation, above and beyond that observed in normally operating transformers, is due to faults that lead to local overheating or to points of excessive electrical stress that result in discharges or arcing.

A. About the Difficulty of Interpreting DGA Measurements

Despite the fact that DGA has been used for several decades and is a common diagnostic technique for transformers, there are no universally accepted means for interpreting DGA results. IEEE C57-104 [3] and IEC 60599 [4] use threshold values for gas levels. Other methods make use of ratios of gas concentrations [2], [5] and are based on observations that relative gas amounts show some correlation with the type, location, and severity of the fault. Gas ratio methods allow for some level of problem diagnosis, whereas threshold methods focus more on discriminating between normal and abnormal behavior.

The amount of any gas produced in a transformer is expected to be influenced by age, loading, and thermal history, the presence of one or more faults, the duration of any faults, and external factors such as voltage surges. The complex relationship between these is, in large part, the reason why there are no universally acceptable means for interpreting DGA results. It is also worth pointing out that the bulk of the work, to date, on DGA has been done on large power transformers. It is not at all clear how gas thresholds, or even gas ratios, would apply to much smaller transformers, such as network transformers, which contain less oil to dilute the gas.

B. Supervised Classification of DGA-Based Features

Due to the complex interplay between various factors that lead to gas generation, numerous data-centric machine-learning techniques have been introduced for the prediction of transformer failures from DGA data [6]–[17]. These methods rely on DGA samples that are labelled as being taken either on a “healthy” or on a “faulty” (alternatively, failure-prone) transformer. As we review them in Section II, we will see that it is not obvious, from their description, how each algorithm contributed to good classification performance, and why one should be specifically chosen over any other. Neither are we aware of comprehensive comparative studies that would benchmark those techniques on a common dataset.

In a departure from previous work, we propose not to add a novel algorithm to the library, but instead to review in Section IV common, well-known statistical learning tools that are readily available to electrical engineers. An extensive computational evaluation of all those techniques is conducted on two different datasets: one (small) public dataset of large-size power transformers (Section V-B), and one large proprietary dataset of thousands of network transformers (Section V-C).

In addition, the novel contributions of our work lie in the use of a logarithmic transform to handle long-tail distributions of DGA concentrations (Section III-B), in approaching the problem




by regressing the time to failure, and in considering semi-supervised learning approaches (Section IV-C).

All of the techniques previously introduced, as well as those presented in this paper, have the following steps in common: 1) the constitution of a dataset of DGA samples (Section III-A) and 2) the extraction of mathematical features from DGA data (Section III-B), followed by 3) the construction of a classification tool that is trained in a supervised way on the labelled features (Section IV).

II. REVIEW OF THE RELATED WORK

A. Collection of AI Techniques Employed

Here, we briefly review previous techniques for transformer failure prediction from DGA. All of them follow the methodology enunciated in Section I-B, consisting of feature extraction from DGA, followed by a classification algorithm.

The majority of them are techniques [6], [7], [9]–[13], [15], [16] built around a feedforward neural-network classifier, also called a multilayer perceptron (MLP), which we explain in Section IV. Some of these papers introduce further enhancements to the MLP: in particular, neural networks that are run in parallel to an expert system in [10]; wavelet networks (i.e., neural nets with a wavelet-based feature extraction) in [16]; self-organizing polynomial networks in [9]; and fuzzy networks in [6], [12], [13], and [15].

Several studies [6], [8], [12], [13], [15], [16] resort to fuzzy logic [18] when modeling the decision functions. Fuzzy logic enables logical reasoning with continuously valued predicates (between 0 and 1) instead of binary ones, but this inclusion of uncertainty within the decision function is redundant with the probability theory behind Bayesian reasoning and statistics.

Stochastic optimization techniques, such as genetic programming, are also used as an additional tool to select features for the classifier in [8], [12], [14], [16], and [17].

Finally, Shintemirov et al. [17] conduct a comprehensive comparison between k-nearest neighbors, neural networks, and support vector machines (three techniques that we explain in Section IV), each of them combined with genetic programming-based feature selection.

B. Limitations of Previous Methods

1) Insufficient Test Data: Some earlier methods that we reviewed used a test dataset of only a few transformers, from which no statistically significant conclusions could be drawn. For instance, [6] evaluate their method on a test set of three transformers, and [7] on 10 transformers. Later publications were based on larger test sets of tens or hundreds of DGA samples; however, only [12] and [17] employed cross-validation on test data to ensure that their high performance was stable for different splits of train/test data.

2) No Comparative Evaluation With the State of the Art: Most of the studies conducted in the aforementioned articles [8]–[10], [12]–[14] compare their algorithms to standard multilayer neural networks. But only [17] compares itself to two additional techniques (support vector machines (SVMs) and k-nearest neighbors), and solely [13] and [15] make numerical comparisons to previous DGA predictive techniques.

3) About the Complexity of Hybrid Techniques: Much of the previous work introduces techniques that are a combination of two different learning algorithms. For instance, [17] uses genetic-programming (GP) optimization on top of neural networks or SVM, while [16] uses GP in combination with wavelet networks; similarly, [15] builds a self-organizing map followed by a neural-fuzzy model. And yet, the DGA datasets generally consist of a few hundred samples of a few (typically seven) noisy gas measurements. Employing complex and highly parametric models on small training sets increases the risk of overfitting the training data and thereby of worse “generalization” performance on the out-of-sample test set. This empirical observation has been formalized in terms of minimizing the structural (i.e., model-specific) risk [19], and is often referred to as the Occam’s razor principle.1

1The Occam’s razor principle could be paraphrased as “all things being considered equal, the simplest explanation is to be preferred.”

The additional burden of hybrid learning methods is that one needs to test for the individual contributions of each learning module.

4) Lack of Publicly Available Data: To our knowledge, only [1] provides a dataset of labeled DGA samples and only [15] evaluates their technique on that public dataset. Combined with the complexity of the learning algorithms, the research work documented in other publications becomes more difficult to reproduce.

Capitalizing upon the lessons learned from analyzing the state-of-the-art transformer failure prediction methods, we propose in our paper to evaluate our method on two different datasets (one of them being publicly available), using large test sets as much as possible and establishing comparisons among 15 well-known, simple, and representative statistical learning algorithms described in Section IV.

III. DGA DATA

Although DGA measurements of transformer oil provide the concentrations of numerous gases, such as nitrogen (N2), we restrict ourselves to the key gases suggested in [3] (i.e., to hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), acetylene (C2H2), carbon monoxide (CO), and carbon dioxide (CO2)).

A. (Un)Balanced Transformer Datasets

Transformer failures are, by definition, rare events. Therefore, and similar to other anomaly detection problems, transformer failure prediction suffers from the lack of data points acquired during (or preceding) failures, relative to the number of data points acquired in normal operating mode. This data imbalance may impede some statistical learning algorithms: for example, if only 5% of the data points in the dataset correspond to faulty transformers, a trivial classifier could obtain an accuracy of 95% simply by ignoring its inputs and by classifying all data points as normal.

Two strategies are proposed in this paper to balance the faulty and normal data. The first one consists in data resampling for one of the two classes, and may consist in generating new data points for the smaller class: for instance, during experiments on the public Duval dataset, the ranges of DGA measures for normally operating transformers were known, and

we randomly generated new data points within those ranges (see Section V-B). The second strategy consists in selecting a subset of existing data, as we did, for instance, in our second series of experiments (Section V-C).

Fig. 1. Histogram of the log-concentration of methane (CH4) among samples taken from faulty (black) and normal operating (gray) network transformers (utility data from Section V-C).

B. Preprocessing DGA Data

1) Logarithmic Transform of DGA Concentrations: Dissolved gas concentrations typically present highly skewed distributions, where the majority of the transformers have low concentrations of a few ppm (parts per million), but where faulty transformers can often attain thousands or tens of thousands of ppm [1]–[3]. This fat-tailed distribution is, at the same time, difficult to visualize, and the extreme values can be a source of numerical imprecision and overflows in a statistical learning algorithm.

For this reason, we assert that the most informative feature of DGA data is the order of magnitude of the DGA concentrations, rather than their absolute values. A natural way to account for these changes of magnitude is to rescale DGA data using the logarithmic transform. For ease of interpretation, we used the log10. We assumed that the DGA measuring device might not discriminate between an absence of gas (0 ppm) and a negligible quantity (say 1 ppm), and for this reason, we lower-thresholded all of the concentrations at 1 ppm (conveniently, this also prevented us from dealing with negative log feature values). We illustrate in Fig. 1 how the log-transform can ease the visualization of key gas distributions and even highlight the log-normal distribution of some gases.

2) Relationship to Key Gas Ratios: Conventional methods of DGA interpretation rely on gas ratios [1]–[3]. We notice that log-transforming the DGA concentrations enables us to express the ratios as subtractions, e.g., log10(C2H2/C2H4) = log10(C2H2) − log10(C2H4). Because most of the parametric algorithms explained in the next section perform at some point linear combinations of their inputs (which are log-transformed concentrations), they may learn to evaluate ratio-like relationships between the raw gas concentrations.

3) Standardizing the DGA Data for Learning Algorithms: In order to keep the numerical operations stable, the values taken by the input features should be close to zero and have a small range of the order of a few units. This requirement stems from the statistical learning algorithms described in the next section, some of which rely on the assumption that the input data are normally distributed, with a zero mean and unit diagonal covariance matrix. For some other algorithms, such as neural networks, a considerable speedup in the convergence can be obtained when the mean value of each input variable is close to zero, and the covariance matrix is diagonal and unitary [20]. Therefore, and although we will not decorrelate the DGA measurements, we propose at least to standardize all of the features to zero mean and unit variance over the entire dataset. Data standardization simply consists here, for each gas variable x, in subtracting its mean value μ over all examples and then dividing the result by the standard deviation σ of the variable, to obtain (x − μ)/σ. The result of a logarithmic transformation of DGA values, followed by their standardization, is exemplified in Fig. 2, where we plot 167 datapoints (marked as crosses and circles) from a DGA dataset in a 2-D space (one log-gas concentration versus another). The ranges of the log-transformed and standardized DGA values in Fig. 2 go from about −2.5 to 2.5 for both gases, with mean values at 0.

C. Additional Features

1) Total Gas: In addition to the concentrations of individual gases, it might be useful to know the total concentration of inflammable carbon-containing gases, that is, the sum of the carbon-containing gas concentrations. As with the other concentrations, we suggest taking the log10 of that sum. We immediately see that including this total gas concentration as a feature enables us to express Duval Triangle-like ratios [1], [2]: the percentage of one gas relative to the total has a logarithm equal to the difference of the two log-concentrations (plus a constant).

2) Transformer-Specific Features: The age of the transformer (in years), its nominal power (in kilovolt-amperes), and its voltage (in volts) are three potential causes for the large variability among transformers’ gas production, and could be taken into account for the failure classification task. Since these features are positive and may have a large scale, we also propose normalizing them by taking their log10.

D. Summary: Inputs to the Classifier

At this point, let us note by x a vector containing the input features associated with a specific DGA measurement. These features consist of seven gas concentrations, optionally enriched by such features as total gas, the transformer’s age, its nominal power, and voltage. We propose to log-normalize and to standardize all of the features. The next section explains how we find the “label” y and, most important, how we build a classifier that predicts y from x.

IV. METHODS FOR CLASSIFYING DGA MEASUREMENTS

This section focuses on our statistical machine-learning methodology for transformer failure prediction. We begin by formulating the problem from two possible viewpoints: classification or regression (Section IV-A). Then, we recapitulate the most important concepts of predictive learning in Section IV-B before enumerating selected classification and regression algorithms, as well as their semi-supervised versions that can exploit unlabeled DGA data points, in Section IV-C. These algorithms are described in more depth in the online Appendix to this paper and are implemented as Matlab code libraries: both are available at http://www.mirowski.info/pub/dga.
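The preprocessing steps of Section III-B (1-ppm floor, log10 transform, dataset-wide standardization) can be sketched as follows. This is our own NumPy illustration, not the authors' Matlab library; `preprocess_dga` is a hypothetical name:

```python
import numpy as np

def preprocess_dga(concentrations_ppm):
    """Preprocess a (n_samples, n_gases) array of DGA concentrations:
    floor at 1 ppm (0 ppm and 1 ppm are treated alike, avoiding
    negative logs), take log10 to capture orders of magnitude, then
    standardize each gas to zero mean and unit variance over the
    whole dataset."""
    X = np.log10(np.maximum(np.asarray(concentrations_ppm, float), 1.0))
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # guard against constant columns (zero standard deviation)
    return (X - mu) / np.where(sigma > 0.0, sigma, 1.0)
```

Additional features such as the total-gas sum or the transformer's age, power, and voltage could be appended as extra columns before the standardization step.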

Fig. 2. Comparison of six regression or classification techniques on a simplified 2-D version of the Duval dataset, consisting of log-transformed and standardized values of DGA measures for two of the gases. There are 167 datapoints: 117 “faulty” DGA measures (marked as red or magenta crosses) and 50 “normal” ones (blue or cyan circles). Since the training datapoints are not easily separable in 2-D, the accuracy and area under the curve (see text) on the training set are generally not 100%. The test data points consist in the entire space of DGA values. The output of the six decision functions goes from white (y = 1, meaning “no impending failure predicted”) to black (y = 0, meaning “failure is deemed imminent”); for most classification algorithms, we plot the continuously valued probability of having y = 1 instead of the actual binary decision (y = 0 or y = 1). The decision boundary (at y = 0.5) is marked in green. Note that we do not know the actual labels for the test data; this figure provides instead an intuition of how the classification and regression algorithms operate. k-Nearest Neighbors (KNN, top left) partitions the space in a binary way, according to the Euclidean distances to the training datapoints. Weighted kernel regression (WKR, bottom middle) is a smoothed version of KNN, and local linear regression (LLR, top middle) performs linear regression on small neighborhoods, with overall nonlinear behavior. Neural networks (bottom left) cut the space into multiple regions. Support vector machines (SVMs, right) use only a subset of the datapoints (the so-called support vectors, in cyan and magenta) to define the decision boundary. Linear-kernel SVMs (top right) behave like logistic regression and perform linear classification, while Gaussian-kernel SVMs (bottom right) behave like WKR.
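To make the figure's description of k-NN (top-left panel) concrete, a minimal decision function might look like the sketch below. This is our own illustration with a hypothetical name (`knn_predict`), not code from the paper:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a point x by majority vote among the k training points
    closest in Euclidean distance, as k-NN does in Fig. 2 (top left).
    Labels follow the paper's convention: 0 = 'faulty', 1 = 'normal'."""
    d = np.linalg.norm(np.asarray(X_train, float) - np.asarray(x, float), axis=1)
    nearest = np.argsort(d)[:k]           # indices of the k closest points
    votes = np.asarray(y_train)[nearest]
    return int(round(votes.mean()))       # majority vote (use odd k)
```

Scanning this function over a grid of standardized two-gas values would reproduce the kind of binary partition of the plane shown in the figure.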

A. Classification or Regression Problem

1) Formulation as a Binary Classification Problem: Although DGA can diagnose multiple reasons for transformer failures [1]–[3] (e.g., high-energy arcing, hot spots above 400 °C, or corona discharges), the primordial task can be expressed as binary classification: “is the transformer at risk of failure?” From a dataset of DGA measures collected on the pool of transformers, one can identify DGA readings recorded shortly before failures, and separate them from historical DGA readings from transformers that kept on operating for an extended period of time. We use the convention that a measurement is labeled y = 0 in the “faulty” case and y = 1 in the “normal” case. In the experiments described in this paper, we arbitrarily labeled a DGA measurement as “normal” if it was taken at least five years prior to a failure, and “faulty” otherwise.

2) Classifying Measurements Instead of Transformers: As a transformer ages, its risk of failure should increase and the DGA measurements are expected to evolve. Our predictive task therefore shifts from “transformer classification” to “DGA measurement classification”, and we associate to each measurement taken at time t a label that characterizes the short-term or middle-term risk of failure relative to time t. In the experiments described in this paper, some transformers had more than a single DGA measurement taken across their lifetime, but we considered the datapoints separately.

3) Formulation as a Regression Problem: The second dataset investigated in this paper also contained the time stamps of DGA measurements, along with information about the time of failure. We used this information to obtain more informative labels y between 0 and 1, where y = 0 would mean “bound to fail,” y = 1 would mean “should not fail in the foreseeable future,” and values between those two extremes would quantify the risk of failure. A predictor trained on such a dataset could have a real-valued output that would help prioritize the intervention by the utility company.2

2Note that many classification algorithms, although trained on binary classes, can provide probabilities.
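The five-year labeling rule above can be sketched as follows; the function and argument names are our own illustration, but the rule itself is the one stated in the text:

```python
from datetime import date

def label_measurement(sample_date, failure_date, horizon_years=5):
    """Binary label for a DGA measurement, per the paper's convention:
    y = 1 ('normal') if the sample was taken at least `horizon_years`
    before the transformer's failure, or if the transformer never
    failed; y = 0 ('faulty') otherwise."""
    if failure_date is None:          # transformer still in service
        return 1
    years_to_failure = (failure_date - sample_date).days / 365.25
    return 1 if years_to_failure >= horizon_years else 0
```

For example, a sample taken ten years before a failure is labeled "normal", while one taken two years before the same failure is labeled "faulty".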

4) Labeled Data for the Regression Problem: We obtained the labels for the regression task in the following way. First, we gathered, for each DGA measurement, both the date at which the DGA measurement was taken and the date at which the corresponding transformer failed, and computed the difference in time, expressed in years. Transformers that had their DGA samples done at the time of or after the failure were given a value of zero, while transformers that did not fail were associated with an arbitrarily high value. These values corresponded to the time to failure (TTF) in years. Then, we considered only the DGA samples from transformers that (ultimately) failed, and sorted the TTF in order to compute their empirical cumulative distribution function (CDF). TTFs of zero would correspond to a CDF of zero, while very long TTFs would asymptotically converge to a CDF of one. The CDF can be simply implemented by using a sorting algorithm; on a finite set of TTF values, the CDF value itself corresponds to the rank of the sorted value, divided by the number of elements. Our proposed approach to obtain labels for the regression task of the TTF is to employ the values of the CDF as the labels. Under that scheme, all “normal” DGA samples from transformers that did not fail (yet) are simply labeled as “1.”

B. Commonalities of the Learning Algorithms

1) Supervised Learning of the Predictor: Supervised learning consists in fitting a predictive model f to a training dataset (which consists here in pairs of DGA measurements x and associated risk-of-failure labels y). The objective is merely to optimize a “black-box” function f so that for each data point x, the prediction ŷ = f(x) is as close as possible to the ground-truth target y.

2) Training, Validation, and Test Sets: Good statistical machine-learning algorithms are capable of extrapolating knowledge and of generalizing it on unseen data points. For this reason, we separate the known data points into a training (in-sample) set, used to define the model f, and a test (out-of-sample) set, used exclusively to quantify the predictive power of f.

3) Selection of Hyper-Parameters by Cross-Validation: Most models, including the nonparametric ones, need the specification of a few hyperparameters (e.g., the number of nearest neighbors, or the number of hidden units in a neural network); to this effect, a subset of the training data (called the validation set) can be set apart during learning, in order to evaluate the quality of fit of the model for various values of the hyperparameters. In our research, we resorted to cross-validation (i.e., multiple (here 5-fold) validation on five nonoverlapping sets). More specifically, for each choice of hyperparameters, we performed five cross-validations on five sets that each contained 20% of the available training data, while the remaining 80% would be used to fit the model.

C. Machine-Learning Algorithms

1) Classification Techniques: We considered k-Nearest Neighbors (k-NN) [21], C4.5 Decision Trees [22], neural networks with one hidden layer [23] trained by stochastic gradient descent [20], [24], as well as support vector machines (SVMs) [25] with three different types of kernels: linear, polynomial, and Gaussian.

Some algorithms strive at defining boundaries that would cut the input space of multivariate DGA measurements into “faulty” or “normal” regions. It is the case of decision trees, neural networks, and linear classifiers, such as an SVM with a linear or polynomial kernel, which can all be likened to the tables of limit concentrations used in [3] to quantify whether a transformer has dissolved gas-in-oil concentrations below safe limits. Instead of predetermined key gas concentrations or concentration ratios, all of these rules are, however, automatically learned from the supplied training data.

The intuition for using k-NN and SVMs with Gaussian kernels can be described as “reasoning by analogy”: to assess the risk of a given DGA measurement, we compare it to the most similar DGA samples in the database.

2) Regression of the Time to Failure: The algorithms that we considered were essentially the regression counterparts to the classification algorithms: linear regression and regularized LASSO regression [26] (with linear dependencies between the log-concentrations of gases and the risk of failure), weighted kernel regression [27] (a continuously valued equivalent of k-NN), local linear regression (LLR) [28], neural network regression, and support vector regression (SVR) [29] with linear, polynomial, and Gaussian kernels.

3) Semi-Supervised Algorithms: In the presence of large amounts of unlabeled data (as was the case for the utility company’s dataset explained in this paper), it can be helpful to include them along the labeled data when training the predictor. The intuition behind semisupervised learning (SSL) is that the learner could get better prepared for the test set “exam” if it knew the distribution of the test data points (aka “questions”). Note that the test set labels (aka “answers”) would still not be supplied at training time.

We tested two SSL algorithms that obtained state-of-the-art results on various real-world datasets: low-density separation (LDS) [30], [31] (for classification) and local linear semi-supervised regression (LLSSR) [32]. Their common point is that they try to place the decision boundary between “faulty” and “normal” DGA samples in regions of the DGA space where there are few (unlabeled, test) DGA samples. This follows the intuition that the decision between a “normal” and “faulty” transformer should not change drastically with small DGA value changes.

4) Illustration on a 2-D Toy Dataset: We illustrate in Fig. 2 how a few classification and regression techniques behave on 2-D data. We trained six different classifiers or regressors on a 2-D, two-gas training set of real DGA data (which we extracted from the seven-gas Duval public dataset), and we plot in Fig. 2 the failure prediction results of each algorithm on the entire two-gas DGA subspace. Some algorithms have a linear decision boundary at y = 0.5, while other ones are nonlinear, some smoother than others. For each of the six algorithms, we also report the accuracy on the training set. Not all algorithms fit the training data perfectly; as can be seen on these plots, some algorithms obtain very high accuracy on the training set (e.g., 100% for k-NN), whereas their behavior on the entire

two-gas DGA space is incorrect; for instance, very low concentrations of both DGA gases, with standardized values below −1.5, are classified as “faulty” (in black) by k-NN. The explanation is very simple: real DGA data are very noisy, and two DGA gases are not enough to discriminate well between “faulty” and “normal” transformers. For this reason, we see in Fig. 2 “faulty” datapoints (red crosses) that have very low concentrations of both gases, lower than “normal” datapoints (blue circles): those faulty datapoints may have other gases at much higher concentrations, and we most likely need to consider all seven DGA gases (and perhaps additional information about the transformer) to discriminate well. This figure should also serve as a cautionary tale about the risk of a statistical learning algorithm that overfits the training data but generalizes poorly on additional test data.

V. RESULTS AND DISCUSSION

We compared the classification and regression algorithms on two distinct datasets. One dataset was small but publicly available (see Section V-B), while the second one was large, had time-stamped data, but was proprietary (see Section V-C).

A. Evaluation Metrics

Three different metrics were considered: accuracy, correlation, and area under the receiver operating characteristic (ROC) curve; each metric had different advantages and limitations.

1) Accuracy (Acc): Let us assume that we have a collection of binary (0- or 1-valued) target labels y, as well as corresponding predictions ŷ. When ŷ is not binary but real-valued, we make the predictions binary by thresholding. Then, the accuracy of a classifier is simply the percentage of correct predictions over the total number of predictions: 50% means random and 100% is perfect.

2) Correlation: For regression tasks, that is, when the targets (signal) y and predictions ŷ are real-valued (e.g., between 0 and 1), the correlation R² (equal to 1 − var(y − ŷ)/var(y)) quantifies how “aligned” the predictions are with the targets. When the magnitude of the errors y − ŷ is comparable to the standard deviation of the signal, then R² ≈ 0. R² = 1 means perfect predictions. Note that we can still apply this metric when the target is binary.

3) Area Under the ROC Curve: In a binary classifier, the true positive rate (TPR) is the number of correctly detected failures over the total number of faulty transformers, while the false positive rate (FPR) is the number of false alarms over the total number of “normal” transformers. The area under the curve (AUC) of the ROC can be approximately measured by numerical integration. A random predictor (e.g., an unbiased coin toss) has TPR = FPR, and we have AUC = 0.5, while a perfect predictor first finds all of the true positives (e.g., the TPR climbs to 1) before making any false alarms and, thus, AUC = 1.

Because of the technicalities involved in maintaining a pool of power or network transformers based on periodic DGA samples (namely, because a utility company cannot suddenly replace all of the risky transformers, but needs to prioritize these replacements based on transformer-specific risks), a real-valued prediction is more advantageous than a mere binary classification, since it introduces an order (ranking) of the most risky transformers. The AUC, which evaluates the decision function at different sensitivities (i.e., “thresholds”), is therefore the most appropriate metric.

B. Public “Duval” Dataset of Power Transformers

In a first series of experiments, we compared 15 well-known classification and regression algorithms on a small-size dataset of power transformers [1]. These public data contain log-transformed DGA values of seven gas concentrations (see Section III) from 117 faulty and 50 functional transformers. Note that because DGA samples in this dataset have no time-stamp information, the labels are binary (i.e., y = 0 “faulty” versus y = 1 “normal”), even for regression-based predictors. In summary, the input data consisted of 167 pairs of measurements and labels, where each measurement was a 7-D vector of log-transformed and standardized DGA values from seven gases (see Section III-B).

Reference [1] also provides ranges of gas concentrations for the normal operating mode, which we used to randomly generate 67 additional “normal” data points (beyond the 167 data points from the original dataset) uniformly sampled within those ranges. This way, we obtained a new, balanced dataset with 117 “normal” and 117 “faulty” DGA samples. We evaluated the 15 methods on those new DGA data to investigate the impact of the label imbalance on the prediction performance. For a given dataset (either the original or the balanced one) and a given algorithm algo, we ran the following learning evaluation:
Algorithm I Learn
ultimate decision (0 or 1) is often the function of a threshold
where one can vary the value of to obtain more or fewer
Randomly split (80%, 20%) into train/test sets
“positives” (alarms) or, inversely, “negatives” .
Other binary classifiers, such as SVM or logistic regression, can 5-fold cross-validate hyper-parameters of algo on
predict the probability which is then thresholded
Train algorithm on
for the binary choice. Similarly, one can threshold the output of
a regressor’s prediction . Test algorithm on
The ROC [33] is a graphical plot of the true positive rate
Obtain predictions from where
(TPR) as a function of the false positive rate (FPR) as the crite-
rion of the binary classification (the aforementioned threshold) Compute Area Under ROC Curve (AUC) of given
changes. In the case of DGA-based transformer failure predic-
if classification algo then Compute accuracy acc
tion, the true positive rate is the number of data samples pre-
dicted as “faulty” and that were indeed faulty, over the total else Compute correlation .
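The Learn evaluation of Algorithm I can be sketched with scikit-learn-style estimators (an illustrative sketch on synthetic data, not the authors' original Matlab code; the toy dataset and the SVM hyper-parameter grid below are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.svm import SVC

def learn(X, y, algo, param_grid):
    """One run of Algorithm I: split, cross-validate, train, test."""
    # Randomly split (80%, 20%) into train/test sets.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
    # 5-fold cross-validate the hyper-parameters of algo on the train set;
    # GridSearchCV then refits the best model on the full training set.
    cv = GridSearchCV(algo, param_grid, cv=5)
    cv.fit(X_tr, y_tr)
    # Obtain real-valued predictions on the test set; compute AUC and acc.
    scores = cv.predict_proba(X_te)[:, 1]
    auc = roc_auc_score(y_te, scores)
    acc = accuracy_score(y_te, cv.predict(X_te))
    return auc, acc

# Toy data standing in for 7-D log-transformed DGA vectors (hypothetical).
rng = np.random.RandomState(0)
X = rng.randn(200, 7)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)
auc, acc = learn(X, y, SVC(kernel="rbf", probability=True),
                 {"C": [0.1, 1.0, 10.0]})
```

Repeating this procedure (50 times on the Duval data, 25 times on the utility data, as described below) and averaging the resulting AUC values reproduces the evaluation protocol of the paper.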
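The class-balancing step — drawing synthetic "normal" samples uniformly within published normal operating ranges — can be sketched as follows. The per-gas ranges below are placeholders, not the actual values from [1], and log(1 + x) stands in for the paper's logarithmic preprocessing:

```python
import numpy as np

# Hypothetical per-gas normal operating ranges in ppm (placeholders only,
# not the actual IEC TC 10 ranges given by Duval and dePablo [1]).
NORMAL_RANGES = {
    "H2": (0.0, 100.0), "CH4": (0.0, 120.0), "C2H6": (0.0, 65.0),
    "C2H4": (0.0, 50.0), "C2H2": (0.0, 35.0), "CO": (0.0, 350.0),
    "CO2": (0.0, 2500.0),
}

def sample_normal_points(n, seed=0):
    """Draw n synthetic 'normal' DGA samples uniformly within the ranges,
    then apply a log-transform to tame the long-tail distributions."""
    rng = np.random.RandomState(seed)
    lows = np.array([lo for lo, hi in NORMAL_RANGES.values()])
    highs = np.array([hi for lo, hi in NORMAL_RANGES.values()])
    raw = rng.uniform(lows, highs, size=(n, len(NORMAL_RANGES)))
    return np.log1p(raw)

# 50 original 'normal' samples + 67 synthetic ones = 117, matching
# the 117 'faulty' samples of the balanced Duval dataset.
extra_normals = sample_normal_points(67)
```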

TABLE I
PERFORMANCE OF THE CLASSIFICATION AND REGRESSION ALGORITHMS ON THE DUVAL DATASET, MEASURED IN TERMS OF AVERAGE AUC

For each algorithm algo, we repeated the learning experiment 50 times and computed the mean values of the AUC, as well as of the accuracy for classification algorithms and of the correlation ρ for regression algorithms. These results are summarized in Table I using the AUC metric, for the original Duval data only (117 "faulty" and 50 "normal" transformers) and after balancing the dataset with 67 additional "normal" DGA data points sampled within the normal operating ranges.

From this extensive evaluation, it appears that the top-performing classification algorithms on the Duval dataset are: 1) SVM with Gaussian kernels; 2) one-hidden-layer neural networks with logistic outputs; 3) k-nearest neighbors (albeit they do not provide probability estimates, which prevents us from evaluating their AUC); and 4) semi-supervised low-density separation (LDS). These four nonlinear classifiers dominate linear classifiers (here, an SVM with linear kernels) by three points of accuracy, suggesting both that the manifold that can separate Duval DGA data is nonlinear and that nonlinear methods are better adapted. These results are unsurprising, since Gaussian-kernel SVMs and neural networks have proved their applicability and superior performance in many domains.

Similarly, the top regression algorithms in terms of correlation are: 1) nonparametric LLR; 2) single-hidden-layer neural networks with linear outputs; 3) SVR with Gaussian kernels; and 4) weighted kernel regression. Again, these four algorithms are nonlinear. All of them exploit a notion of local smoothness, but they express a complex decision function in terms of DGA gas concentrations, contrary to linear or Lasso regression.

Finally, we evaluate the impact of an increased fraction of "normal" data points over the total number of data points. We notice that while the correlation and the accuracy markedly increase when we balance the data (e.g., from 90% accuracy with unbalanced data to more than 96% accuracy with balanced data for Gaussian SVM), the AUC does not change as drastically, with the exception of Lasso regression and SVR with quadratic kernels: notably, the AUC of SVM with linear or quadratic kernels, and of most regression algorithms, does not show an upward trend. We can find an obvious explanation for the linear algorithms: the more points that are added to the dataset, the less linear the decision boundary, hence the worse the performance of linear classifiers and regressors. We nevertheless advocate for richer (larger) datasets, and conservatively recommend sticking to the data-mining rule of thumb of balanced datasets.

C. Large Proprietary Dataset of Network Transformers

1) Large Dataset of Network Transformers: The second dataset on which we evaluated the algorithms was given by an electrical power company that manages several thousand network transformers.

To constitute our dataset, we gathered time-stamped DGA measures and information about transformers (age, power, voltage; see Section III-C) from two disjoint lists. One list contained 1796 DGA measures from all transformers that failed or that were under careful monitoring, and the other contained about 30 500 DGA measures from the operating ones. There were about 32 300 DGA measures in total, most conducted within the past 10 years, and some transformers had multiple DGA measures across time.

In the failed-transformers list, we qualified 1167 DGA measures from transformers that failed because of gas- or pressure-related issues as "positives," and we discarded the 629 remaining DGA measures from corroded transformers whose faults were not DGA-related. Then, using the difference between the date of the DGA test and the date of failure, we computed a time to failure (TTF) in years; we further removed 26 transformers that failed more than five years after the last DGA test and qualified them as "negatives." Finally, we converted these TTF to numbers between 0 and 1 using the cumulated distribution function (CDF) of the TTF, with values close to 0 corresponding to "immediate failure" and values close to 1 corresponding to "failure in 5 or more years."

By definition, transformers in the "normal" transformer list were not labeled, since they did not fail. We, however, assumed that DGA samples taken more than 5 years ago could be considered as "negatives": this represented an additional 1480 data points. The remaining 29 000 measurements collected within the last 5 years could not be directly exploited as labeled data.

Similar to the public Duval dataset, the input data consisted of pairs (x, y), where each x was an 11-D vector of log-transformed and standardized DGA measurements from seven gases, concatenated with standardized transformer-specific attributes such as the age, nominal power, and voltage (see Sections III-B and III-C); our dataset consisted of 2647 data points, plotted in Fig. 3.

2) Comparative Analysis of 12 Predictive Algorithms: We performed the analysis on the proprietary utility data similarly to the way we did on the Duval dataset, with the exception that we did not add or remove data points.

We investigated only 12 out of the 15 algorithms previously used, discarding k-nearest neighbors and C4.5 classification trees (for which one cannot evaluate the AUC) as well as SVR with quadratic kernels (because of a computational cost that was not justified by its mediocre performance on the Duval dataset). For each algorithm algo, we repeated the learning experiment (see Algorithm I) 25 times.
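The conversion of TTF values into [0, 1] labels through the empirical CDF, described above, can be sketched as follows (the TTF values are made up for illustration; the paper does not publish its exact CDF):

```python
import numpy as np

def ttf_to_labels(ttf_years):
    """Map times to failure (in years) to [0, 1] labels via the empirical
    cumulative distribution function (CDF): labels near 0 mean impending
    failure, labels near 1 mean failure far in the future."""
    ttf = np.asarray(ttf_years, dtype=float)
    # Empirical CDF value of each sample: its rank (1-based) over the count.
    ranks = np.argsort(np.argsort(ttf))
    return (ranks + 1) / len(ttf)

# Hypothetical TTF values (in years) for five failed transformers.
labels = ttf_to_labels([0.1, 4.8, 2.0, 0.5, 3.1])
```

Because the CDF is monotonic, this preserves the ranking of transformers by urgency while normalizing the regression targets to [0, 1].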

We plotted the 25-run average ROC curves on held-out 20% test sets in Fig. 4, along with the average AUC values.

Fig. 3. The 3-D plots of DGA samples from the utility dataset, showing log concentrations of acetylene versus ethylene and ethane. The color code of the data point labels goes from green/light (failure at a later date, y ≈ 1) to red/dark (impending failure, y ≈ 0).

Fig. 4. Comparison of classification and regression techniques on the proprietary utility dataset. The faulty-transformer prediction problem is considered as a retrieval problem, and the ROC is computed for each algorithm, as well as its associated AUC. The learning experiments were repeated 25 times and we show the average ROC curves over all experiments.

Overall, the classification algorithms performed slightly better than the regression algorithms, despite not having access to the subtle information about the time to failure. The best (classification) algorithms were indeed SVM with Gaussian kernels (AUC ≈ 0.94), LDS (AUC ≈ 0.93), and neural networks with logistic outputs (AUC ≈ 0.93). Linear classifiers or regressors did almost as well as nonlinear algorithms.

On one hand, one could deplore the slightly disappointing performance of statistical learning algorithms compared to the Duval results, where the best algorithms reached a very high AUC ≈ 0.97. On the other hand, this might highlight some crucial differences between the maintenance of small, numerous network transformers and large, scarce power transformers. We conjecture that the dataset may have some imprecisions in the labeling, or that we missed some transformer-related discriminative features.

Nevertheless, we demonstrated the applicability of simple, out-of-the-box machine-learning algorithms for DGA of network transformers, which can achieve promising numerical performance on a large dataset. Indeed, and as visible in Fig. 4, at a 1% false-alarm rate, between 30% and 50% of faulty DGA samples were detected (using SVM with Gaussian kernels, neural-network classifiers, or LDS); for the same classifiers and at 10% of false positives, 80% to 85% of faulty DGA samples were detected. This performance still needs to be validated over an extended period of time on real-life transformer maintenance.

3) Applicability of Semi-Supervised Algorithms to DGA: In a last, inconclusive series of experiments, we incorporated knowledge about the distribution of the 29 000 recent DGA measurements. Those were discarded from the labeled dataset because they were not labeled (but they should mostly come from "healthy" transformers). We relied on two semi-supervised algorithms (see Section IV-C): 1) LDS classification and 2) LLSSR, where unlabeled test data were supplied at learning time. The AUC of the semi-supervised algorithms dropped, which can be explained by the fact that the unlabeled test set was probably heavily biased toward "normal" transformers, whereas these algorithms are designed for balanced datasets.

VI. CONCLUSION

We addressed the problem of DGA for the failure prediction of power and network transformers from a statistical machine-learning angle. Our predictive tools take as input log-transformed DGA measurements from a transformer and provide, as an output, a quantification of the risk of an impending failure.

To that effect, we conducted an extensive study on a small but public set of published DGA data samples, and on a very large set of thousands of network transformers belonging to a utility company. We evaluated 15 straightforward algorithms, considering linear and nonlinear algorithms for classification and regression. Nonlinear algorithms performed better than linear ones, hinting at a nonlinear boundary between DGA samples from "failure-prone" transformers and those from "normal" ones. It was hard to choose among a subset of high-performing algorithms, including support vector machines (SVM) with Gaussian kernels, neural networks, and LLR, as their performances were comparable. There seemed to be no specific advantage in trying to regress the time to failure rather than performing a binary classification, but there was a need to balance the dataset in terms of "faulty" and "normal" DGA samples. Finally, as shown through repeated experiments, a robust classifier such as SVM with a Gaussian kernel could achieve an area under the ROC curve of AUC ≈ 0.97 on the Duval dataset and of AUC ≈ 0.94 on the utility dataset, making this DGA-based tool applicable to prioritizing repairs and replacements of network transformers.
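As a sketch of how such real-valued risk scores translate into a repair-prioritization list (an illustration on made-up scores, not the authors' tooling):

```python
import numpy as np

def prioritize(transformer_ids, risk_scores, budget):
    """Rank transformers by predicted failure risk (highest first) and
    return the top candidates for repair or replacement, following the
    ranking interpretation of real-valued predictors discussed above."""
    order = np.argsort(risk_scores)[::-1]  # most risky first
    return [transformer_ids[i] for i in order[:budget]]

# Hypothetical predicted probabilities of failure for five units.
ids = ["T-101", "T-102", "T-103", "T-104", "T-105"]
risks = np.array([0.12, 0.91, 0.55, 0.08, 0.77])
todo = prioritize(ids, risks, budget=2)  # -> ["T-102", "T-105"]
```

This is exactly why the AUC, which is invariant to any monotonic rescaling of the scores, is the natural metric for this ranking use case.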

We have made our Matlab code and part of the dataset available at https://www.mirowski.info/pub/dga in order to ensure reproducibility and to help advance the field.

ACKNOWLEDGMENT

The authors would like to express their gratitude to Profs. W. Zurawsky and D. Czarkowski for their valuable input and help in the elaboration of this manuscript. They would also like to thank the utility company that provided them with DGA data, as well as three anonymous reviewers for their feedback.

REFERENCES

[1] M. Duval and A. dePablo, "Interpretation of gas-in-oil analysis using new IEC publication 60599 and IEC TC 10 databases," IEEE Elect. Insul. Mag., vol. 17, no. 2, pp. 31–41, Mar./Apr. 2001.
[2] M. Duval, "Dissolved gas analysis: It can save your transformer," IEEE Elect. Insul. Mag., vol. 5, no. 6, pp. 22–27, Nov./Dec. 1989.
[3] IEEE Guide for the Interpretation of Gases Generated in Oil-Immersed Transformers, IEEE Standard C57.104-2008, 2009.
[4] Mineral Oil-Impregnated Equipment in Service—Guide to the Interpretation of Dissolved and Free Gases Analysis, IEC Standard Publ. 60599, 1999.
[5] R. R. Rogers, "IEEE and IEC codes to interpret incipient faults in transformers, using gas in oil analysis," IEEE Trans. Elect. Insul., vol. EI-13, no. 5, pp. 349–354, Oct. 1978.
[6] J. J. Dukarm, "Transformer oil diagnosis using fuzzy logic and neural networks," in Proc. CCECE/CCGEI, 1993, pp. 329–332.
[7] Y. Zhang, X. Ding, Y. Liu, and P. Griffin, "An artificial neural network approach to transformer fault diagnosis," IEEE Trans. Power Del., vol. 11, no. 4, pp. 1836–1841, Oct. 1996.
[8] Y.-C. Huang, H.-T. Yang, and C.-L. Huang, "Developing a new transformer fault diagnosis system through evolutionary fuzzy logic," IEEE Trans. Power Del., vol. 12, no. 2, pp. 761–767, Apr. 1997.
[9] H.-T. Yang and Y.-C. Huang, "Intelligent decision support for diagnosis of incipient transformer faults using self-organizing polynomial networks," IEEE Trans. Power Syst., vol. 13, no. 3, pp. 946–952, Aug. 1998.
[10] Z. Wang, Y. Liu, and P. J. Griffin, "A combined ANN and expert system tool for transformer fault diagnosis," IEEE Trans. Power Del., vol. 13, no. 4, pp. 1224–1229, Oct. 1998.
[11] J. Guardado, J. Naredo, P. Moreno, and C. Fuerte, "A comparative study of neural network efficiency in power transformers diagnosis using dissolved gas analysis," IEEE Trans. Power Del., vol. 16, no. 4, pp. 643–647, Oct. 2001.
[12] Y.-C. Huang, "Evolving neural nets for fault diagnosis of power transformers," IEEE Trans. Power Del., vol. 18, no. 3, pp. 843–848, Jul. 2003.
[13] V. Miranda and A. R. G. Castro, "Improving the IEC table for transformer failure diagnosis with knowledge extraction from neural networks," IEEE Trans. Power Del., vol. 20, no. 4, pp. 2509–2516, Oct. 2005.
[14] X. Hao and S. Cai-Xin, "Artificial immune network classification algorithm for fault diagnosis of power transformer," IEEE Trans. Power Del., vol. 22, no. 2, pp. 930–935, Apr. 2007.
[15] R. Naresh, V. Sharma, and M. Vashisth, "An integrated neural fuzzy approach for fault diagnosis of transformers," IEEE Trans. Power Del., vol. 23, no. 4, pp. 2017–2024, Oct. 2008.
[16] W. Chen, C. Pan, Y. Yun, and Y. Liu, "Wavelet networks in power transformers diagnosis using dissolved gas analysis," IEEE Trans. Power Del., vol. 24, no. 1, pp. 187–194, Jan. 2009.
[17] A. Shintemirov, W. Tang, and Q. Wu, "Power transformer fault classification based on dissolved gas analysis by implementing bootstrap and genetic programming," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 39, no. 1, pp. 69–79, Jan. 2009.
[18] L. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338–353, 1965.
[19] V. Vapnik, The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995.
[20] Y. LeCun, L. Bottou, G. Orr, and K. Müller, "Efficient backprop," in Lecture Notes in Computer Science. New York: Springer, 1998.
[21] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[22] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[24] L. Bottou, "Stochastic learning," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds. Berlin, Germany: Springer-Verlag, 2004, pp. 146–168.
[25] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[26] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, vol. 58, pp. 267–288, 1996.
[27] E. Nadaraya, "On estimating regression," Theory Probab. Appl., vol. 9, pp. 141–142, 1964.
[28] C. Stone, "Consistent nonparametric regression," Ann. Stat., vol. 5, pp. 595–620, 1977.
[29] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Stat. Comput., vol. 14, pp. 199–222, 2004.
[30] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[31] A. N. Erkan and Y. Altun, "Semi-supervised learning via generalized maximum entropy," in Proc. Conf. Artif. Intell. Stat., 2010, pp. 209–216.
[32] M. R. Rwebangira and J. Lafferty, "Local linear semi-supervised regression," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-09-106, Feb. 2009.
[33] D. Green and J. Swets, Signal Detection Theory and Psychophysics. New York: Wiley, 1966.

Piotr Mirowski (M'11) received the Dipl.-Ing. degree from ENSEEIHT, Toulouse, France, in 2002 and the Ph.D. degree in computer science from the Courant Institute, New York University, New York, in 2011.
He joined Alcatel-Lucent Bell Labs as a Research Scientist in 2011. He was also with Schlumberger Research from 2002 to 2005, and interned at Google, S&P, and AT&T Labs. He filed five patents and published papers on the applications of machine learning to geology, epileptic seizure prediction, statistical language modeling, robotics, and localization.

Yann LeCun received the Ph.D. degree from Université P. & M. Curie, Paris, France, in 1987.
Currently, he is Silver Professor of Computer Science and Neural Science at the Courant Institute and at the Center for Neural Science of New York University (NYU), New York. He became a Postdoctoral Fellow at the University of Toronto, Toronto, ON, Canada. He joined AT&T Bell Laboratories in 1988, and became head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a Professor in 2003, after a brief period at the NEC Research Institute, Princeton. He has published more than 150 papers on these topics as well as on neural networks, handwriting recognition, image processing and compression, and very-large-scale integrated design. His handwriting recognition technology is used by several banks around the world to read checks. His image compression technology, called DjVu, is used by hundreds of websites and publishers and millions of users to access scanned documents on the web, and his image-recognition methods are used in deployed systems by companies such as Google, Microsoft, NEC, and France Telecom for document recognition, human–computer interaction, image indexing, and video analytics. He is the cofounder of MuseAmi, a music technology company. His research interests include machine learning, computer perception and vision, robotics, and computational neuroscience.
Dr. LeCun has been on the editorial board of IJCV, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, and IEEE TRANSACTIONS ON NEURAL NETWORKS, was Program Chair of CVPR'06, and is Chair of the annual Learning Workshop. He is on the science advisory board of IPAM.