Statistical Machine Learning and Dissolved Gas Analysis: A Review - Mirowski - LeCun
Abstract—Dissolved gas analysis (DGA) of the insulation oil of power transformers is an investigative tool to monitor their health and to detect impending failures by recognizing anomalous patterns of DGA concentrations. We handle the failure prediction problem as a simple data-mining task on DGA samples, optionally exploiting the transformer's age, nominal power, and voltage, and consider two approaches: 1) binary classification and 2) regression of the time to failure. We propose a simple logarithmic transform to preprocess DGA data in order to deal with long-tail distributions of concentrations. We have reviewed and evaluated 15 standard statistical machine-learning algorithms on that task, and reported quantitative results on a small but published set of power transformers and on proprietary data from thousands of network transformers of a utility company. Our results confirm that nonlinear decision functions, such as neural networks, support vector machines with Gaussian kernels, or local linear regression, can theoretically provide slightly better performance than linear classifiers or regressors. Software and part of the data are available at http://www.mirowski.info/pub/dga.

Index Terms—Artificial intelligence, neural networks, power transformer insulation, prediction methods, statistics, transformers.

I. INTRODUCTION

A. About the Difficulty of Interpreting DGA Measurements

Despite the fact that DGA has been used for several decades and is a common diagnostic technique for transformers, there are no universally accepted means for interpreting DGA results. IEEE C57-104 [3] and IEC 60599 [4] use threshold values for gas levels. Other methods make use of ratios of gas concentrations [2], [5] and are based on observations that relative gas amounts show some correlation with the type, location, and severity of the fault. Gas ratio methods allow for some level of problem diagnosis, whereas threshold methods focus more on discriminating between normal and abnormal behavior.

The amount of any gas produced in a transformer is expected to be influenced by age, loading, and thermal history, the presence of one or more faults, the duration of any faults, and external factors such as voltage surges. The complex relationship between these factors is, in large part, the reason why there are no universally accepted means for interpreting DGA results. It is also worth pointing out that the bulk of the work on DGA, to date, has been done on large power transformers. It is not at all clear how gas thresholds, or even gas ratios, would apply to much smaller transformers, such as network transformers, which contain less oil to dilute the gas.
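For illustration, the ratio-based methods can be sketched in a few lines of Python. The function below computes three classic gas ratios (those used by Rogers [5] and IEC 60599 [4]); the screening thresholds that the standards attach to such ratios are deliberately not reproduced here, and the sample values are made up:

    # Minimal sketch of ratio-based DGA screening (illustrative only).
    def dga_ratios(ppm):
        """Three classic gas ratios from a dict of concentrations in ppm."""
        eps = 1e-9  # guard against division by zero
        return {
            "C2H2/C2H4": ppm["C2H2"] / (ppm["C2H4"] + eps),
            "CH4/H2":    ppm["CH4"]  / (ppm["H2"]   + eps),
            "C2H4/C2H6": ppm["C2H4"] / (ppm["C2H6"] + eps),
        }

    sample = {"H2": 200.0, "CH4": 50.0, "C2H6": 15.0, "C2H4": 60.0,
              "C2H2": 5.0, "CO": 400.0, "CO2": 3000.0}  # made-up sample
    print(dga_ratios(sample))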
by regressing the time to failure, and in considering semi-supervised learning approaches (Section IV-C).

All of the techniques previously introduced, as well as those presented in this paper, have the following steps in common: 1) the constitution of a dataset of DGA samples (Section III-A); 2) the extraction of mathematical features from DGA data (Section III-B); followed by 3) the construction of a classification tool that is trained in a supervised way on the labeled features (Section IV).

II. REVIEW OF THE RELATED WORK

A. Collection of AI Techniques Employed

Here, we briefly review previous techniques for transformer failure prediction from DGA. All of them follow the methodology enunciated in Section I-B, consisting of feature extraction from DGA, followed by a classification algorithm.

The majority of them [6], [7], [9]–[13], [15], [16] are built around a feedforward neural-network classifier, also called the multilayer perceptron (MLP), which we explain in Section IV. Some of these papers introduce further enhancements to the MLP: in particular, neural networks that are run in parallel to an expert system in [10]; wavelet networks (i.e., neural nets with a wavelet-based feature extraction) in [16]; self-organizing polynomial networks in [9]; and fuzzy networks in [6], [12], [13], and [15].

Several studies [6], [8], [12], [13], [15], [16] resort to fuzzy logic [18] when modeling the decision functions. Fuzzy logic enables logical reasoning with continuously valued predicates (between 0 and 1) instead of binary ones, but this inclusion of uncertainty within the decision function is redundant with the probability theory behind Bayesian reasoning and statistics.

Stochastic optimization techniques, such as genetic programming, are also used as an additional tool to select features for the classifier in [8], [12], [14], [16], and [17].

Finally, Shintemirov et al. [17] conduct a comprehensive comparison between k-nearest neighbors, neural networks, and support vector machines (three techniques that we explain in Section IV), each of them combined with genetic-programming-based feature selection.

B. Limitations of Previous Methods

1) Insufficient Test Data: Some earlier methods that we reviewed used a test dataset as small as a few transformers, on which no statistically significant conclusions could be drawn. For instance, [6] evaluate their method on a test set of three transformers, and [7] on 10 transformers. Later publications were based on larger test sets of tens or hundreds of DGA samples; however, only [12] and [17] employed cross-validation on test data to ensure that their high performance was stable for different splits of train/test data.

2) No Comparative Evaluation With the State of the Art: Most of the studies conducted in the aforementioned articles [8]–[10], [12]–[14] compare their algorithms to standard multilayer neural networks. But only [17] compares itself to two additional techniques (support vector machines (SVMs) and k-nearest neighbors), and solely [13] and [15] make numerical comparisons to previous DGA predictive techniques.

3) About the Complexity of Hybrid Techniques: Much of the previous work introduces techniques that are a combination of two different learning algorithms. For instance, [17] uses genetic-programming (GP) optimization on top of neural networks or SVM, while [16] uses GP in combination with wavelet networks; similarly, [15] builds a self-organizing map followed by a neural-fuzzy model. And yet, the DGA datasets generally consist of a few hundred samples of a few (typically seven) noisy gas measurements. Employing complex and highly parametric models on small training sets increases the risk of overfitting the training data and thereby of worse "generalization" performance on the out-of-sample test set. This empirical observation has been formalized in terms of minimizing the structural (i.e., model-specific) risk [19], and is often referred to as the Occam's razor principle.¹ The additional burden of hybrid learning methods is that one needs to test for the individual contributions of each learning module.

4) Lack of Publicly Available Data: To our knowledge, only [1] provides a dataset of labeled DGA samples, and only [15] evaluates their technique on that public dataset. Combined with the complexity of the learning algorithms, the research work documented in other publications becomes more difficult to reproduce.

Capitalizing upon the lessons learned from analyzing the state-of-the-art transformer failure prediction methods, we propose in this paper to evaluate our method on two different datasets (one of them being publicly available), using test sets as large as possible and establishing comparisons among 15 well-known, simple, and representative statistical learning algorithms described in Section IV.

III. DGA DATA

Although DGA measurements of transformer oil provide the concentrations of numerous gases, such as nitrogen (N2), we restrict ourselves to the key gases suggested in [3] (i.e., to hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), acetylene (C2H2), carbon monoxide (CO), and carbon dioxide (CO2)).

A. (Un)Balanced Transformer Datasets

Transformer failures are, by definition, rare events. Therefore, and similarly to other anomaly detection problems, transformer failure prediction suffers from the lack of data points acquired during (or preceding) failures, relative to the number of data points acquired in normal operating mode. This data imbalance may impede some statistical learning algorithms: for example, if only 5% of the data points in the dataset correspond to faulty transformers, a trivial classifier could obtain an accuracy of 95% simply by ignoring its inputs and by classifying all data points as normal.

Two strategies are proposed in this paper to balance the faulty and normal data. The first one consists in resampling the data of one of the two classes, and may consist in generating new data points for the smaller class: for instance, during experiments on the public Duval dataset, the ranges of DGA measures for normally operating transformers were known, and

¹The Occam's razor principle could be paraphrased as "all things being considered equal, the simplest explanation is to be preferred."
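The first strategy, generating synthetic points for the smaller class within known per-gas concentration ranges, can be sketched as follows; the ranges below are hypothetical placeholders, not the published ranges of [1]:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical (min ppm, max ppm) ranges for normal operation;
    # placeholders standing in for the ranges published in [1].
    normal_ranges = {"H2": (0, 150), "CH4": (0, 110), "C2H6": (0, 90),
                     "C2H4": (0, 280), "C2H2": (0, 20),
                     "CO": (0, 1100), "CO2": (0, 12000)}

    def sample_normal_points(n):
        """Draw n synthetic 'normal' DGA samples, one gas per column."""
        lows = np.array([lo for lo, hi in normal_ranges.values()], float)
        highs = np.array([hi for lo, hi in normal_ranges.values()], float)
        return rng.uniform(lows, highs, size=(n, len(normal_ranges)))

    extra_normals = sample_normal_points(67)  # cf. Section V-B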
Fig. 2. Comparison of six regression or classification techniques on a simplified 2-D version of the Duval dataset, consisting of log-transformed and standardized values of DGA measures for two of the seven gases. There are 167 datapoints: 117 "faulty" DGA measures (marked as red or magenta crosses) and 50 "normal" ones (blue or cyan circles). Since the training datapoints are not easily separable in 2-D, the accuracy and area under the curve (see paper) on the training set are generally not 100%. The test data points span the entire space of DGA values. The output of the six decision functions goes from white (y = 1, meaning "no impending failure predicted") to black (y = 0, meaning "failure is deemed imminent"); for most classification algorithms, we plot the continuously valued probability of having y = 1 instead of the actual binary decision (y = 0 or y = 1). The decision boundary (at y = 0.5) is marked in green. Note that we do not know the actual labels for the test data; this figure instead provides an intuition of how the classification and regression algorithms operate. k-Nearest Neighbors (KNN, top left) partitions the space in a binary way, according to the Euclidean distances to the training datapoints. Weighted kernel regression (WKR, bottom middle) is a smoothed version of KNN, and local linear regression (LLR, top middle) performs linear regression on small neighborhoods, with overall nonlinear behavior. Neural networks (bottom left) cut the space into multiple regions. Support vector machines (SVMs, right) use only a subset of the datapoints (the so-called support vectors, in cyan and magenta) to define the decision boundary. Linear-kernel SVMs (top right) behave like logistic regression and perform linear classification, while Gaussian-kernel SVMs (bottom right) behave like WKR.
4) Labeled Data for the Regression Problem: We obtained the labels for the regression task in the following way. First, we gathered, for each DGA measurement, both the date at which the DGA measurement was taken and the date at which the corresponding transformer failed, and computed the difference in time, expressed in years. Transformers that had their DGA samples taken at the time of or after the failure were given a value of zero, while transformers that did not fail were associated with an arbitrarily high value. These values corresponded to the time to failure (TTF) in years. Then, we considered only the DGA samples from transformers that (ultimately) failed, and sorted the TTFs in order to compute their empirical cumulative distribution function (CDF). TTFs of zero would correspond to a CDF of zero, while very long TTFs would asymptotically converge to a CDF of one. The CDF can be simply implemented by using a sorting algorithm: on a finite set of TTF values, the CDF value itself corresponds to the rank of the sorted value, divided by the number of elements. Our proposed approach to obtain labels for the regression task of the TTF is to employ the values of the CDF as the labels. Under that scheme, all "normal" DGA samples from transformers that did not fail (yet) are simply labeled as "1."
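This label construction is short enough to state directly. A minimal numpy sketch (function and variable names are ours):

    import numpy as np

    def ttf_to_cdf_labels(ttf):
        """Empirical CDF of the times to failure of failed transformers:
        the rank of each sorted value divided by the number of elements."""
        ttf = np.asarray(ttf, dtype=float)
        ranks = np.argsort(np.argsort(ttf))  # 0 for the smallest TTF
        return ranks / float(len(ttf))

    # DGA samples taken 0, 0.5, 2, and 10 years before failure:
    print(ttf_to_cdf_labels([0.0, 0.5, 2.0, 10.0]))  # [0. 0.25 0.5 0.75]
    # Samples from transformers that never failed are simply labeled 1.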
B. Commonalities of the Learning Algorithms

1) Supervised Learning of the Predictor: Supervised learning consists in fitting a predictive model f to a training dataset (which consists here of pairs of DGA measurements x_i and associated risk-of-failure labels y_i). The objective is merely to optimize a "black-box" function f so that, for each data point x_i, the prediction f(x_i) is as close as possible to the ground-truth target y_i.

2) Training, Validation, and Test Sets: Good statistical machine-learning algorithms are capable of extrapolating knowledge and of generalizing it to unseen data points. For this reason, we separate the known data points into a training (in-sample) set, used to define the model f, and a test (out-of-sample) set, used exclusively to quantify the predictive power of f.

3) Selection of Hyper-Parameters by Cross-Validation: Most models, including the nonparametric ones, need the specification of a few hyperparameters (e.g., the number of nearest neighbors, or the number of hidden units in a neural network); to this effect, a subset of the training data (called the validation set) can be set apart during learning, in order to evaluate the quality of fit of the model for various values of the hyperparameters. In our research, we resorted to cross-validation (i.e., multiple, here 5-fold, validation on five nonoverlapping sets). More specifically, for each choice of hyperparameters, we performed five cross-validations on five sets that each contained 20% of the available training data, while the remaining 80% was used to fit the model.
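A minimal sketch of this hyperparameter selection, here choosing the number of neighbors k for a k-NN classifier with plain numpy (synthetic data; this is not the authors' released code):

    import numpy as np

    rng = np.random.default_rng(0)

    def knn_accuracy(Xtr, ytr, Xva, yva, k):
        """Accuracy of a plain k-NN classifier on one validation split."""
        d = np.linalg.norm(Xva[:, None, :] - Xtr[None, :, :], axis=2)
        votes = ytr[np.argsort(d, axis=1)[:, :k]]
        pred = (votes.mean(axis=1) > 0.5).astype(int)  # majority vote
        return float((pred == yva).mean())

    def cross_validate_k(X, y, ks=(1, 3, 5, 7, 9), n_folds=5):
        """Pick k by 5-fold CV: five nonoverlapping 20% validation sets."""
        folds = np.array_split(rng.permutation(len(X)), n_folds)
        scores = {k: np.mean([knn_accuracy(np.delete(X, f, 0),
                                           np.delete(y, f), X[f], y[f], k)
                              for f in folds])
                  for k in ks}
        return max(scores, key=scores.get), scores

    X = rng.normal(size=(100, 7))            # 100 synthetic 7-gas samples
    y = (X[:, 0] + X[:, 4] > 0).astype(int)  # synthetic binary labels
    best_k, _ = cross_validate_k(X, y)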
C. Machine-Learning Algorithms

1) Classification Techniques: We considered k-nearest neighbors (k-NN) [21], C4.5 decision trees [22], neural networks with one hidden layer [23] trained by stochastic gradient descent [20], [24], as well as support vector machines [25] with three different types of kernels: linear, polynomial, and Gaussian.

Some algorithms strive to define boundaries that cut the input space of multivariate DGA measurements into "faulty" or "normal" regions. This is the case of decision trees, neural networks, and linear classifiers, such as an SVM with a linear or polynomial kernel, which can all be likened to the tables of limit concentrations used in [3] to quantify whether a transformer has dissolved gas-in-oil concentrations below safe limits. Instead of relying on predetermined key gas concentrations or concentration ratios, however, all of these rules are automatically learned from the supplied training data.

The intuition for using k-NN and SVM with Gaussian kernels can be described as "reasoning by analogy": to assess the risk of a given DGA measurement, we compare it to the most similar DGA samples in the database.

2) Regression of the Time to Failure: The algorithms that we considered were essentially the regression counterparts of the classification algorithms: linear regression and regularized LASSO regression [26] (with linear dependencies between the log-concentrations of gases and the risk of failure), weighted kernel regression [27] (a continuously valued equivalent of k-NN), local linear regression (LLR) [28], neural-network regression, and support vector regression (SVR) [29] with linear, polynomial, and Gaussian kernels.
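Weighted kernel regression is compact enough to write out. A sketch of the Nadaraya-Watson estimator [27] with a Gaussian kernel (the bandwidth h is a hyperparameter; the default below is arbitrary):

    import numpy as np

    def kernel_regression(X_train, y_train, X_query, h=0.5):
        """Nadaraya-Watson: each prediction is a Gaussian-kernel-weighted
        average of the training labels, a continuous analogue of k-NN."""
        d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2.0 * h ** 2))      # (n_query, n_train) weights
        return (w @ y_train) / w.sum(axis=1)  # weighted average per query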
3) Semi-Supervised Algorithms: In the presence of large amounts of unlabeled data (as was the case for the utility company's dataset studied in this paper), it can be helpful to include them along with the labeled data when training the predictor. The intuition behind semi-supervised learning (SSL) is that the learner could get better prepared for the test-set "exam" if it knew the distribution of the test data points (aka the "questions"). Note that the test-set labels (aka the "answers") would still not be supplied at training time.

We tested two SSL algorithms that obtained state-of-the-art results on various real-world datasets: low-density separation (LDS) [30], [31] (for classification) and local linear semi-supervised regression (LLSSR) [32]. Their common point is that they try to place the decision boundary between "faulty" and "normal" DGA samples in regions of the DGA space where there are few (unlabeled, test) DGA samples. This follows the intuition that the decision between a "normal" and a "faulty" transformer should not change drastically with small DGA value changes.

4) Illustration on a 2-D Toy Dataset: We illustrate in Fig. 2 how a few classification and regression techniques behave on 2-D data. We trained six different classifiers or regressors on a 2-D, two-gas training set of real DGA data (which we extracted from the seven-gas Duval public dataset), and we plot in Fig. 2 the failure prediction results of each algorithm on the entire two-gas DGA subspace. Some algorithms have a linear decision boundary at y = 0.5, while others are nonlinear, some smoother than others. For each of the six algorithms, we also report the accuracy (Acc) on the training set. Not all algorithms fit the training data perfectly; as can be seen on these plots, some algorithms obtain very high accuracy on the training set (e.g., 100% for k-NN), whereas their behavior on the entire two-gas DGA space is incorrect; for instance, very low concentrations of both DGA gases (here standardized, with values below −1.5) are classified as "faulty" (in black) by k-NN. The explanation is very simple: real DGA data are very noisy, and two DGA gases are not enough to discriminate well between "faulty" and "normal" transformers. For this reason, we see in Fig. 2 "faulty" datapoints (red crosses) that have very low concentrations of the two gases, lower than those of "normal" datapoints (blue circles): those faulty datapoints may have other gases at much higher concentrations, and we most likely need to consider all seven DGA gases (and perhaps additional information about the transformer) to discriminate well. This figure should also serve as a cautionary tale about the risk of a statistical learning algorithm that overfits the training data but generalizes poorly on additional test data.

V. RESULTS AND DISCUSSION

We compared the classification and regression algorithms on two distinct datasets. One dataset was small but publicly available (see Section V-B), while the second one was large and had time-stamped data, but was proprietary (see Section V-C).
A. Evaluation Metrics

Three different metrics were considered: accuracy, correlation, and area under the receiver operating characteristic (ROC) curve; each metric had different advantages and limitations.

1) Accuracy (Acc): Let us assume that we have a collection of binary (0- or 1-valued) target labels y_i, as well as corresponding predictions ŷ_i. When the ŷ_i are not binary but real-valued, we make them binary by thresholding. Then, the accuracy of a classifier is simply the percentage of correct predictions over the total number of predictions: 50% means random and 100% is perfect.

2) Correlation: For regression tasks, that is, when the targets y (signal) and predictions ŷ are real-valued (e.g., between 0 and 1), the correlation ρ (equal to the covariance of y and ŷ divided by the product of their standard deviations) quantifies how "aligned" the predictions are with the targets. When the magnitude of the errors ŷ − y is comparable to the standard deviation of the signal, then ρ ≈ 0; ρ = 1 means perfect predictions. Note that we can still apply this metric when the target is binary.

3) Area Under the ROC Curve: In a binary classifier, the ultimate decision (0 or 1) is often the function of a threshold θ, where one can vary the value of θ to obtain more or fewer "positives" (alarms) or, inversely, "negatives." Other binary classifiers, such as SVM or logistic regression, can predict a probability, which is then thresholded for the binary choice. Similarly, one can threshold the output ŷ of a regressor's prediction.

The ROC [33] is a graphical plot of the true positive rate (TPR) as a function of the false positive rate (FPR) as the criterion of the binary classification (the aforementioned threshold) changes. In the case of DGA-based transformer failure prediction, the true positive rate is the number of data samples predicted as "faulty" that were indeed faulty, over the total number of faulty transformers, while the false positive rate is the number of false alarms over the total number of "normal" transformers. The area under the curve (AUC) of the ROC can be approximately measured by numerical integration. A random predictor (e.g., an unbiased coin toss) has TPR = FPR, and hence AUC = 0.5, while a perfect predictor first finds all of the true positives (i.e., the TPR climbs to 1) before making any false alarms and, thus, has AUC = 1.

Because of the technicalities involved in maintaining a pool of power or network transformers based on periodic DGA samples (namely, because a utility company cannot suddenly replace all of the risky transformers, but needs to prioritize these replacements based on transformer-specific risks), a real-valued prediction is more advantageous than a mere binary classification, since it introduces an order (ranking) of the most risky transformers. The AUC, which evaluates the decision function at different sensitivities (i.e., "thresholds"), is therefore the most appropriate metric.
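The numerical integration of the ROC can be done in a few lines. A sketch (note that, unlike the labeling convention used elsewhere in the paper, this helper encodes the positive, i.e., faulty, class as 1 so that higher scores mean higher risk):

    import numpy as np

    def roc_auc(y_true, scores):
        """ROC curve and AUC by trapezoidal numerical integration."""
        order = np.argsort(-np.asarray(scores, dtype=float))  # descending risk
        y = np.asarray(y_true)[order]                         # 1 = faulty
        tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
        fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
        return np.trapz(tpr, fpr)

    print(roc_auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2]))  # ~0.83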
B. Public "Duval" Dataset of Power Transformers

In a first series of experiments, we compared 15 well-known classification and regression algorithms on a small-size dataset of power transformers [1]. These public data contain log-transformed DGA values of seven gas concentrations (see Section III) from 117 faulty and 50 functional transformers. Note that, because the DGA samples in this dataset have no time-stamp information, the labels are binary (i.e., y = 0 for "faulty" versus y = 1 for "normal"), even for regression-based predictors. In summary, the input data consisted of 167 pairs (x_i, y_i), where each x_i was a 7-D vector of log-transformed and standardized DGA measurements from seven gases (see Section III-B).

Reference [1] also provides ranges of gas concentrations for the normal operating mode, which we used to randomly generate 67 additional "normal" data points (beyond the 167 data points from the original dataset), uniformly sampled within those ranges. This way, we obtained a new, balanced dataset with 117 "normal" and 117 "faulty" DGA samples. We evaluated the 15 methods on those new DGA data to investigate the impact of the label imbalance on the prediction performance. For a given dataset D (either the original or the balanced one) and a given algorithm algo, we ran the following learning evaluation:

Algorithm 1 Learn(D, algo)

    Randomly split D (80%, 20%) into train/test sets D_train and D_test
    5-fold cross-validate hyper-parameters of algo on D_train
    Train algorithm algo on D_train
    Test algorithm on D_test
    Obtain predictions ŷ = f(x) for each x in D_test
    Compute the area under the ROC curve (AUC) of ŷ given y
    if algo is a classification algorithm then compute the accuracy acc
    else compute the correlation ρ
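A possible rendering of this evaluation loop in Python with scikit-learn (an assumption on our part: the authors' released code is in Matlab, and the hyperparameter grid below is arbitrary), shown for one classification algorithm:

    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    def learn(X, y, seed=0):
        """One run of Algorithm 1 for a Gaussian-kernel SVM classifier;
        assumes the positive (faulty) class is encoded as 1."""
        # Randomly split D (80%, 20%) into train/test sets.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        # 5-fold cross-validate the hyperparameters on the training set.
        grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                            {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                            cv=5)
        grid.fit(X_tr, y_tr)                     # train on D_train
        scores = grid.predict_proba(X_te)[:, 1]  # test on D_test
        return (roc_auc_score(y_te, scores),
                accuracy_score(y_te, grid.predict(X_te)))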
Fig. 4. Comparison of classification and regression techniques on the proprietary, utility dataset. The faulty transformer prediction problem is considered as a retrieval problem, and the ROC is computed for each algorithm, as well as its associated AUC. The learning experiments were repeated 25 times and we show the average ROC curves over all experiments.

We show the 25-run average ROC curve on held-out 20% test sets in Fig. 4, along with the average AUC values.

Overall, the classification algorithms performed slightly better than the regression algorithms, despite not having access to subtle information about the time to failure. The best (classification) algorithms were indeed SVM with Gaussian kernels (AUC = 0.94), LDS (AUC = 0.93), and neural networks with logistic outputs (AUC = 0.93). Linear classifiers or regressors did almost as well as nonlinear algorithms.

On one hand, one could deplore the slightly disappointing performance of statistical learning algorithms, compared to the Duval results, where the best algorithms reached a very high

VI. CONCLUSION

We addressed the problem of DGA for the failure prediction of power and network transformers from a statistical machine-learning angle. Our predictive tools would take as input log-transformed DGA measurements from a transformer and provide, as output, a quantification of the risk of an impending failure.

To that effect, we conducted an extensive study on a small but public set of published DGA data samples, and on a very large set of thousands of network transformers belonging to a utility company. We evaluated 15 straightforward algorithms, considering linear and nonlinear algorithms for classification and regression. Nonlinear algorithms performed better than linear ones, hinting at a nonlinear boundary between DGA samples from "failure-prone" transformers and those from "normal" ones. It was hard to choose among a subset of high-performing algorithms, including support vector machines (SVM) with Gaussian kernels, neural networks, and LLR, as their performances were comparable. There seemed to be no specific advantage in trying to regress the time to failure rather than performing a binary classification; but there was a need to balance the dataset in terms of "faulty" and "normal" DGA samples. Finally, as shown through repeated experiments, a robust classifier such as SVM with a Gaussian kernel could achieve an area under the ROC curve of AUC = 0.97 on the Duval dataset, and of AUC = 0.94 on the utility dataset, making this DGA-based tool applicable to prioritizing repairs and replacements of network transformers. We have made our Matlab code and part of the
dataset available at http://www.mirowski.info/pub/dga in order to ensure reproducibility and to help advance the field.

ACKNOWLEDGMENT

The authors would like to express their gratitude to Profs. W. Zurawsky and D. Czarkowski for their valuable input and help in the elaboration of this manuscript. They would also like to thank the utility company that provided them with DGA data, as well as the three anonymous reviewers for their feedback.

REFERENCES

[1] M. Duval and A. dePablo, "Interpretation of gas-in-oil analysis using new IEC publication 60599 and IEC TC 10 databases," IEEE Elect. Insul. Mag., vol. 17, no. 2, pp. 31–41, Mar./Apr. 2001.
[2] M. Duval, "Dissolved gas analysis: It can save your transformer," IEEE Elect. Insul. Mag., vol. 5, no. 6, pp. 22–27, Nov./Dec. 1989.
[3] IEEE Guide for the Interpretation of Gases Generated in Oil-Immersed Transformers, IEEE Standard C57.104-2008, 2009.
[4] Mineral Oil-Impregnated Equipment in Service Guide to the Interpretation of Dissolved and Free Gases Analysis, IEC Standard Publ. 60599, 1999.
[5] R. R. Rogers, "IEEE and IEC codes to interpret incipient faults in transformers, using gas in oil analysis," IEEE Trans. Elect. Insul., vol. EI-13, no. 5, pp. 349–354, Oct. 1978.
[6] J. J. Dukarm, "Transformer oil diagnosis using fuzzy logic and neural networks," in Proc. CCECE/CCGEI, 1993, pp. 329–332.
[7] Y. Zhang, X. Ding, Y. Liu, and P. Griffin, "An artificial neural network approach to transformer fault diagnosis," IEEE Trans. Power Del., vol. 11, no. 4, pp. 1836–1841, Oct. 1996.
[8] Y.-C. Huang, H.-T. Yang, and C.-L. Huang, "Developing a new transformer fault diagnosis system through evolutionary fuzzy logic," IEEE Trans. Power Del., vol. 12, no. 2, pp. 761–767, Apr. 1997.
[9] H.-T. Yang and Y.-C. Huang, "Intelligent decision support for diagnosis of incipient transformer faults using self-organizing polynomial networks," IEEE Trans. Power Syst., vol. 13, no. 3, pp. 946–952, Aug. 1998.
[10] Z. Wang, Y. Liu, and P. J. Griffin, "A combined ANN and expert system tool for transformer fault diagnosis," IEEE Trans. Power Del., vol. 13, no. 4, pp. 1224–1229, Oct. 1998.
[11] J. Guardado, J. Naredo, P. Moreno, and C. Fuerte, "A comparative study of neural network efficiency in power transformers diagnosis using dissolved gas analysis," IEEE Trans. Power Del., vol. 16, no. 4, pp. 643–647, Oct. 2001.
[12] Y.-C. Huang, "Evolving neural nets for fault diagnosis of power transformers," IEEE Trans. Power Del., vol. 18, no. 3, pp. 843–848, Jul. 2003.
[13] V. Miranda and A. R. G. Castro, "Improving the IEC table for transformer failure diagnosis with knowledge extraction from neural networks," IEEE Trans. Power Del., vol. 20, no. 4, pp. 2509–2516, Oct. 2005.
[14] X. Hao and S. Cai-Xin, "Artificial immune network classification algorithm for fault diagnosis of power transformer," IEEE Trans. Power Del., vol. 22, no. 2, pp. 930–935, Apr. 2007.
[15] R. Naresh, V. Sharma, and M. Vashisth, "An integrated neural fuzzy approach for fault diagnosis of transformers," IEEE Trans. Power Del., vol. 23, no. 4, pp. 2017–2024, Oct. 2008.
[16] W. Chen, C. Pan, Y. Yun, and Y. Liu, "Wavelet networks in power transformers diagnosis using dissolved gas analysis," IEEE Trans. Power Del., vol. 24, no. 1, pp. 187–194, Jan. 2009.
[17] A. Shintemirov, W. Tang, and Q. Wu, "Power transformer fault classification based on dissolved gas analysis by implementing bootstrap and genetic programming," IEEE Trans. Syst., Man Cybern. C, Appl. Rev., vol. 39, no. 1, pp. 69–79, Jan. 2009.
[18] L. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338–353, 1965.
[19] V. Vapnik, The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995.
[20] Y. LeCun, L. Bottou, G. Orr, and K. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science. New York: Springer, 1998.
[21] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[22] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[24] L. Bottou, "Stochastic learning," in Advanced Lectures on Machine Learning, O. Bousquet, U. von Luxburg, and G. Rätsch, Eds. Berlin, Germany: Springer-Verlag, 2004, pp. 146–168.
[25] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[26] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, vol. 58, pp. 267–288, 1996.
[27] E. Nadaraya, "On estimating regression," Theory Probab. Appl., vol. 9, pp. 141–142, 1964.
[28] C. Stone, "Consistent nonparametric regression," Ann. Stat., vol. 5, pp. 595–620, 1977.
[29] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Stat. Comput., vol. 14, pp. 199–222, 2004.
[30] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[31] A. N. Erkan and Y. Altun, "Semi-supervised learning via generalized maximum entropy," in Proc. Conf. Artif. Intell. Stat., 2010, pp. 209–216.
[32] M. R. Rwebangira and J. Lafferty, "Local linear semi-supervised regression," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-09-106, Feb. 2009.
[33] D. Green and J. Swets, Signal Detection Theory and Psychophysics. New York: Wiley, 1966.

Piotr Mirowski (M'11) received the Dipl.Ing. degree from ENSEEIHT, Toulouse, France, in 2002 and the Ph.D. degree in computer science from the Courant Institute, New York University, New York, in 2011. He joined Alcatel-Lucent Bell Labs as a Research Scientist in 2011. He was also with Schlumberger Research from 2002 to 2005, and interned at Google, S&P, and AT&T Labs. He filed five patents and published papers on the applications of machine learning to geology, epileptic seizure prediction, statistical language modeling, robotics, and localization.

Yann LeCun received the Ph.D. degree from Université P. & M. Curie, Paris, France, in 1987. Currently, he is Silver Professor of Computer Science and Neural Science at the Courant Institute and at the Center for Neural Science of New York University (NYU), New York. He became a Postdoctoral Fellow at the University of Toronto, Toronto, ON, Canada. He joined AT&T Bell Laboratories in 1988, and became head of the Image Processing Research Department at AT&T Labs-Research in 1996. He joined NYU as a Professor in 2003, after a brief period at the NEC Research Institute, Princeton. He has published more than 150 papers on these topics as well as on neural networks, handwriting recognition, image processing and compression, and very-large-scale integrated design. His handwriting recognition technology is used by several banks around the world to read checks. His image compression technology, called DjVu, is used by hundreds of websites and publishers and millions of users to access scanned documents on the web, and his image-recognition methods are used in deployed systems by companies such as Google, Microsoft, NEC, and France Telecom for document recognition, human-computer interaction, image indexing, and video analytics. He is the cofounder of MuseAmi, a music technology company. His research interests include machine learning, computer perception and vision, robotics, and computational neuroscience. Dr. LeCun has been on the editorial board of IJCV, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, and IEEE TRANSACTIONS ON NEURAL NETWORKS, was Program Chair of CVPR06, and is Chair of the annual Learning Workshop. He is on the science advisory board of IPAM.
Lecture Notes Comput. Sci.. New York: Springer, 1998. Learning Workshop. He is on the science advisory board of IPAM.