Data Mining and Knowledge Discovery for Process Monitoring and Control
Xue Z. Wang
Springer
Xue Z. Wang, BEng, MSc, PhD
Department of Chemical Engineering, University of Leeds, Leeds LS2 9JT, UK
ISBN 978-1-4471-1137-5
British Library Cataloguing in Publication Data
Wang, Xue Z.
Data mining and knowledge discovery for process monitoring
and control. - (Advances in industrial control)
1. Process control - Data processing 2. Data mining
I. Title II. McGreavy, C.
629.8'9
ISBN 978-1-4471-1137-5
Library of Congress Cataloging-in-Publication Data
Wang, X. Z. (Xue Zhang), 1963-
Data mining and knowledge discovery for process monitoring and
control / Xue Z. Wang.
p. cm. -- (Advances in industrial control)
Includes bibliographical references and index.
ISBN 978-1-4471-1137-5 ISBN 978-1-4471-0421-6 (eBook)
DOI 10.1007/978-1-4471-0421-6
1. Process control -- Data processing. 2. Data mining. 3. Knowledge
acquisition (Expert systems) I. Title. II. Series.
TS156.8.W37 1999
670.42'7563--dc21
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the
publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by
the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent
to the publishers.
© Springer-Verlag London 1999
Originally published by Springer-Verlag London Berlin Heidelberg in 1999
Softcover reprint of the hardcover 1st edition 1999
MATLAB® is the registered trademark of The MathWorks, Inc., http://www.mathworks.com
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free
for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the
information contained in this book and cannot accept any legal responsibility or liability for any errors or
omissions that may be made.
Typesetting: Camera ready by author
Series Editors
Dr D.C. McFarlane
Department of Engineering
University of Cambridge
Cambridge CB2 1QJ
United Kingdom
Professor B. Wittenmark
Department of Automatic Control
Lund Institute of Technology
PO Box 118
S-221 00 Lund
Sweden
Professor H. Kimura
Department of Mathematical Engineering and Information Physics
Faculty of Engineering
The University of Tokyo
7-3-1 Hongo
Bunkyo Ku
Tokyo 113
Japan
Dr M.K. Masten
Texas Instruments
2309 Northcrest
Plano
TX 75075
United States of America
PREFACE

The above roles of plant operators and supervisors imply that they are an integral
part of the overall control system. The current approach to designing control
systems has not adequately addressed this point. It is done mainly in terms of
identifying the process dynamics, selecting measurements, defining control
structures and selecting algorithms.
This book introduces development in automatic analysis and interpretation of
process operational data both in real-time and over the operational history, and
describes new concepts and methodologies for developing intelligent, state space
based systems for process monitoring, control and diagnosis. It is known that
processes can have multiple steady and also abnormal states. State space based
monitoring and diagnosis can project multivariate real-time measurements onto a
point in the operational state plane and monitor the trajectory of the point, which can
identify previously unknown states and the influence of individual variables. It is
now possible to apply data mining and knowledge discovery technologies to the
analysis, representation, and feature extraction of real-time and historical
operational data to give deeper insight into the system's behaviour. The emphasis is
on addressing challenges facing interpretation of process plant operational data,
including the multivariate dependencies which determine process dynamics, noise
and uncertainty, diversity of data types, changing conditions, unknown but feasible
conditions, undetected sensor failures and uncalibrated and misplaced sensors,
without being overwhelmed by the volume of data.
To address the above themes, it is necessary to cover the following topics:
• new ways of approaching process monitoring, control and diagnosis
• specification of a framework for developing intelligent, state space based
monitoring systems
• introduction to data mining and knowledge discovery
• data pre-processing for feature extraction, dimension reduction, noise
removal and concept formation
• multivariate statistical analysis for process monitoring and control
• supervised and unsupervised methods for operational state identification
• variable causal relationship discovery
• software sensor design
The methodologies and concepts are illustrated by examples
and industrial case studies.
Xue Z. Wang
Leeds, England, 1999
ACKNOWLEDGEMENTS
The author wishes to acknowledge the invaluable contribution of the late Professor
Colin McGreavy in the preparation and editing of this text.
The author would like to thank his present and former colleagues and students at
the University of Leeds, Drs F.Z. Chen, B.H. Chen, S.H. Yang, Y.C. Huang, Mr.
R.F. Li, Ms B. Yuan, Dr. S.A. Yang, Mr. R. Garcia-Flores, Mr. D. Sampson, and
Mr. A. Watson who have provided much help with this book. He acknowledges
gratefully Dr. M.L. Lu at Aigis Systems USA who has given useful input
continuously. Thanks are also due to Dr. G.G. Goltz for his friendship and help in
many ways.
Particular thanks go to Professors M.A. Johnson and M.J. Grimble at the
Industrial Control Centre, University of Strathclyde for their encouragement.
Professor M.A. Johnson has read the manuscript and given very useful input. The
author would like to extend his thanks to Mr. O. Jackson, Ms E. Fuoco and Mr. N.
Pinfield at Springer-Verlag London for their successful cooperation and help.
The author is grateful to the University of Leeds for providing the supporting
facilities in the preparation of the book. The book has also benefited from an
EPSRC grant (GR/L61774).
CONTENTS
1 Introduction
4.1 PCA for State Identification and Monitoring
4.2 Partial Least Squares (PLS)
4.3 Variable Contribution Plots
4.4 Multiblock PCA and PLS
4.5 Batch Process Monitoring Using Multiway PCA
4.6 Nonlinear PCA
4.7 Operational Strategy Development and Product Design - an Industrial Case Study
4.8 General Observations
CHAPTER 1
INTRODUCTION

Over the last twenty years, it has become increasingly obvious that the performance
of process control systems depends not only on the control algorithms but also on
how these integrate into the operational policy in terms of safety, environmental
protection, equipment protection, as well as general monitoring to identify poor
performance and detect faults. This calls for creating a virtual environment which is
able to provide a comprehensive assessment of performance and can identify the
factors which determine it. The important agents which are responsible for bringing
these together are the plant operators and supervisors. This implies that it is
important to consider them as part of the overall solutions, and if they are part of the
system they must be provided with the means of carrying out the role effectively.
While attention has been given to improving the interface of control systems for
operators and supervisors through the design of information display and alarm
systems, most of this is concerned with awareness, with little concern as to
processing functionality in assessing the large volume of multivariate data more
effectively.
This book addresses these issues by seeking to make use of emerging data mining
and knowledge discovery technology (KDD) to develop approaches for designing
state space based process monitoring and control systems so as to integrate plant
operators and supervisors into the operational strategy. Modern computer control
and automatic data logging systems create large volumes of data, which contain
valuable information about normal and abnormal operations, significant
disturbances and changes in operational and control strategies. The data
unquestionably provides a useful source for supervisors and engineers to monitor
the performance of the plant and identify opportunities for improvement and causes
of poor performance. The volume of data is generally so large and the data structure too
complex for it to be used for characterisation of behaviour by manual analysis. Data
mining and KDD offers the capability of extracting knowledge from such data
which can be used for developing state space based process monitoring systems.
The first chapter reviews the current approaches to process monitoring with
particular reference to distributed control systems (DCS) displays, monitoring charts
for statistical quality control, and the use of a concept of the operating window.
This leads naturally to the concept of state space based process monitoring and
control which is well suited to defining operators' and supervisors' requirements.
Based on this, it is possible to develop a conceptual architecture for system design.
To meet the goals of process control requires monitoring and control of process
variables such as temperature, pressure, flow, composition and the levels in vessels.
In modern distributed control systems, to assist in process monitoring and fault
diagnosis, the measurements are displayed as groups of variables, dynamic trends
and alarms. To do this effectively requires careful consideration as to the design of
the displays.
A first requirement is to collect and display as much relevant information as
possible. In fact, being able to collect and provide access to a large amount
of information on measured variables is regarded as one of the most important
advances of DCS when compared with earlier systems [1]. In the case of a typical
olefin plant, for example, over 5000 measurements need to be monitored [2].
Secondly, the display should be arranged in a way that makes it easy for operators
to assimilate the measured values, obtain associated information about key variables
provided by independent sensors and diagnose incipient problems, preferably before
causing major upsets. This requires that the interface display to operators should be
properly designed. The display to operators in a modern DCS system has a
hierarchical structure as shown in Figure 1.1 [1]. This arrangement allows engineers
and operators having varied responsibilities to access information at different levels
in a convenient way. For example, supervisors might be mainly interested in the
general performance of the plant in terms of operating capacity, energy efficiency,
overall mass balances and normal and abnormal status of plant areas or units. On the
other hand, operators are more concerned with one section of a plant and monitoring
the associated variables.
Other features of DCS displays make it possible for operators to quickly
assimilate data using colours to indicate different operational states. For example,
green is used for normal and red for abnormal. Care is also needed in the location of
displays, grouping of variables and grouping and sequencing of alarms.
Figure 1.1 The hierarchical structure of working displays in a DCS: station mimics,
trend displays, control graphics, batch sequences, operator guides, X-Y plots and
tuning displays.
This emphasis again draws attention to the fact that supervisors and operators are
part of the overall control system. In fact, not only are they responsible for many
feedback control tasks which are not automated, such as switching feedbacks, but
also they must undertake supervision of the general strategy and seek to develop an
understanding of the process performance both in the short- and long- term. This
understanding can be used to identify [3]:
Such tasks require operators to be able to not only access the data at the right time
but more importantly to assimilate and assess the data quickly and correctly,
especially when abnormal conditions arise. This is a very challenging task because
the volume of data is generally very large: large scale plants have as many as
20,000 variables which are measured continuously [4, 5, 6]. Moreover, the data are
multivariate and are interrelated so there is a need to make evaluations
simultaneously. Humans are not able to simultaneously analyse problems involving
more than three variables very effectively and this becomes more difficult when the
data are corrupted by noise and uncertainty. The need to provide computer
assistance in assimilating data has now become a major concern and it is important
that automatic data analysis systems should be developed and integrated with DCS
control.
Automatic process control compensates for the effects of disturbances and maintains
the controlled variable at the desired value. However, it does not eliminate the
cause of poor operation: since the sources of disturbances have not been
eliminated, the process remains susceptible to future disturbances from the same
source. Statistical process control (SPC) has the goal of detecting and eliminating
disturbance. SPC works in conjunction with automatic control and monitors the
performance of a process over time in order to verify that the process meets the
quality requirements [22]. SPC charts such as Shewhart [7, 8, 9] , cumulative sum
(CUSUM) [10] or exponentially weighted moving average (EWMA) charts [11, 12]
are used to monitor the key product variables in order to detect the occurrence of
any event having a special or assignable cause. In the case of assignable causes,
long-term improvements can be made in process operations which enhance product
quality.
Figure 1.2 is an example of a Shewhart chart. It shows the variation in a single
variable against the statistical mean and the upper and lower limits defining an
acceptable quality band. The mean and the upper and lower limits are obtained by
statistical analysis of historical data. If the measured value falls outside either of the
limits, a diagnostic analysis is carried out to determine the assignable cause so that
appropriate corrective action can be taken.
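To make this concrete, the following minimal Python sketch (not from the original text; the data are hypothetical and the conventional band of the mean plus or minus three standard deviations is assumed) estimates the centre line and control limits from historical data and flags new measurements that fall outside the limits:

import numpy as np

def shewhart_limits(history, k=3.0):
    # Centre line and control limits estimated from in-control history.
    mean = history.mean()
    std = history.std(ddof=1)
    return mean, mean - k * std, mean + k * std

# Hypothetical data: an in-control history plus new on-line measurements.
rng = np.random.default_rng(0)
history = rng.normal(loc=350.0, scale=2.0, size=200)
new = np.array([349.5, 351.2, 356.9, 348.8])

centre, lcl, ucl = shewhart_limits(history)
for i, x in enumerate(new):
    if not lcl <= x <= ucl:
        print(f"sample {i}: {x:.1f} outside [{lcl:.1f}, {ucl:.1f}] - assignable cause?")

A measurement flagged in this way would then trigger the diagnostic analysis described above.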
Most SPC methods record only a small number of variables, usually the final
product quality variables which are examined sequentially. Such an approach is now
not acceptable for modern processes. Modern computer based monitoring collects
massive amounts of data continuously. Variables such as temperatures, pressures,
flowrates etc. are typically measured every few seconds, although only a few of the
underlying events drive the process at any time: in fact, all measurements are simply
aspects of the same underlying events. Examining one variable at a time is of
limited value. There is clearly the need to develop methods to examine a number of
variables simultaneously.
Figure 1.2 An example of a Shewhart chart: a single variable plotted against time
with the centre line (mean) and the upper and lower control limits; a point outside
the limits indicates a special cause.
operating point can move outside of the steady-state region. Operating windows
have not been adopted in DCS display design as widely as they ought to have been, and
there is still the problem of dimension limitation. Systems aiming to replace or help
operators in process status assessment require capabilities for dealing with much
higher dimensionality.
Figure 1.3 An example of an operating window, plotted as temperature against
reactant concentration.
In practice, operators are usually more concerned with the current operational status
and evolving patterns of behaviour, rather than the instantaneous values of specific
variables. The judgement of the operational state requires the simultaneous
assimilation of the multivariate process variables. In doing this, operators are
actually projecting the process to a single point of a state space. It is clear that the
above described monitoring charts and operating windows are still far less powerful
than what is required. The concept of state space based monitoring is helpful in
clarifying the ideas behind this book. Figure 1.4 illustrates how a process can be
operated in various normal and abnormal operational states or modes that have
different characteristics in terms of operability and controllability, safety and flow
patterns, among other things.

Figure 1.4 A process operating in various normal and abnormal states or modes in
the state plane.

The accumulation of know-how derived from previous
experience and from computer simulation makes it possible to gain better insight
into the operational behaviour of equipment. A state space based monitoring system
can therefore be identified to have the following functions.
It should be able to automatically assimilate the real-time measurements, project
the process to a specific state and indicate the path of the operating point in the state
space in real-time. It is now known that a process or a unit operation can operate at
abnormal and multiple steady states. The best known example is the exothermic
continuous stirred tank reactor (CSTR) with a cooling water jacket [8, 13]. Multiple
steady-state behaviour is also possible in distillation columns [14], reactive
distillation processes [15], and refinery fluid catalytic cracking processes [16].
Changes of product specifications and feedstock properties, which are very common
today [17], as well as large disturbances may also drive a process to a different
operating state. These operational states are not obvious without careful analysis.
Abnormal operations can take various forms and are more difficult to predict.
Nevertheless, with the gradual accumulation of knowledge, more insight is being
generated about operational behaviour patterns. It is also important to be able to
identify various unfamiliar operational states, whether normal or abnormal.
8 Data Mining and Knowledge Discovery for Process Monitoring and Control
Moreover it should be possible to identify the most important variables which are
responsible for changes in operational states and so provide guidance for operators
in adjusting the process. The schematic in Figure 1.4 is yet another facet of
system characteristics; it will not replace the traditional DCS display but provides a
different perspective of system behaviour.
The major challenge in developing the kind of state space based system described
above arises from the characteristics of operational data, which are summarised as
follows:
• Large volume. A DCS automatic data logging system continuously stores data.
The large volume makes manual probing almost impossible. Large volumes of
data also demand large computer memory and high speed.
• High dimensionality. The behaviour of a process is usually defined by a large
number of correlated variables. As a result it is difficult to visualise the
behaviour without dimension reduction.
• Process uncertainty and noise. Uncertainty and noise emphasise the need for
good data pre-processing techniques.
• Dynamics. In operational status identification, it is very important to take
account of the dynamic trends. In other words, the values of variables are
dynamic trends. Many data mining and knowledge discovery tools, such as the
well-known inductive machine learning system C5.0 [17, 18], are mainly
designed to handle categorical values such as a colour being red or green. They
are not effective in dealing with continuous-valued variables. These tools are not
able to handle variables that take values as dynamic trends.
• Difference in the sampling time of variables. On-line measurements and
laboratory analyses have variable sampling periods.
• Incomplete data. Some important data may not be recorded.
• Small patterns in large databases. Sometimes, data analysis is used to identify abnormal
operations. The data corresponding to abnormal operations might be buried in a
huge database. Some tools are not effective in identifying small patterns in a
large database.
• Complex interactions between process variables. Many techniques require that
attributes be independent. However, many process variables are interrelated.
Current methods only address some of these issues, certainly not all and the
following observations can be made:
(1). Data pre-processing is critical for various reasons including noise removal,
data reconciliation, dimension reduction and concept formation.
(2). Effective integration of the tools is needed. This means combining various tools,
so that some prepare data for other tools or validate their results.
(3). Validation of discoveries from the data and presentation of the result is
essential. Many times, because of lack of knowledge about the data, interpretation
becomes a major issue.
(4). Windowing and sampling from a large database is needed, particularly
for the analysis of historical operational data.
To develop the kind of state space based monitoring and control environment
described above calls for an integrated data mining and KDD system. The system
should have the following functions: (1) identification of operational states; (2)
projection of the operation to a single point of the operational plane; and (3) giving
explanations on the major variables that are responsible for the projection and
providing guidance on adjustment. A conceptual system architecture and associated
components are shown in Figure 1.5. It is important for a system to be able to
provide some basic functions and be flexible enough to be tailored to meet special
purposes [21]. The basic functions include:
• Pattern discovery. Grouping data records into clusters and then analysing the
similarities and dissimilarities of data between clusters is an important starting
point for analysis. An obvious example is identifying abnormal conditions.
• Trend and deviation analysis. Various technologies for trend and deviation
analysis are available including statistics and calculation of mean and standard
deviation.
• Link and dependency analysis. The link and dependency between performance
metrics is important in understanding process behaviour and improving
performance. Some existing data mining methods, such as the inductive learning
approach C5.0 [18, 19, 20], as well as many graphical tools, cannot be
applied because of the real-valued dynamic trends and interactions between
variables.
• Summarising. This provides a compact description of a subset of data, such as
the mean and standard deviation of all fields. More sophisticated techniques use
summary rules, multivariate visualisation techniques, and functional
relationships between variables.
• Sequence analysis. Analysing models of sequential patterns (e.g., in data with
time dependence, such as time series) aims at generating the sequence
or extracting and reporting deviations and trends over time. A typical example is in
batch process operations.
• Regression. This is required for predictive model development, as in the case of
software sensor models.
Figure 1.5 A conceptual architecture for an integrated data mining and KDD system:
a historical database and experience feed data pre-processing (wavelets, statistical
methods, the episode approach, PCA), supervised classification tools (FFNN, fuzzy
set covering, fuzzy signed digraphs, rule extraction, C5.0), unsupervised
classification tools (ART2, AutoClass, PCA), dependency modelling (dependency
discovery, Bayesian graphs) and other tools (visualisation, regression, summarising),
leading to qualitative models.
The rest of the book is organised as follows. In Chapter 2 the methodologies and
tools for data mining and KDD are briefly reviewed. Chapter 3 focuses on data pre-
processing for the purpose of feature extraction, dimension reduction, noise removal
and concept formation. The emphasis is on processing of dynamic trend signals
which are considered as the most important information in operational state
identification. Three approaches are introduced, namely principal component
analysis, wavelet analysis, and episode representation.
Multivariate statistical analysis methods are introduced in Chapter 4 for analysis
of process operational data and developing multivariate statistical control charts.
These include linear and nonlinear principal component analysis (PCA), partial least
squares (PLS), and multiblock PCA. Examples are used to illustrate the approaches
and an industrial case study is described which uses PCA to discover knowledge
from data for operational strategy development and product design.
Supervised machine learning approaches are discussed in Chapter 5 for
identification of process operational states. While the focus is put on feedforward
neural networks (FFNNs), other methods are also introduced and compared with
FFNNs, including fuzzy FFNNs, the single layer perceptron, the fuzzy set covering method
and fuzzy signed digraphs.
Supervised machine learning requires data with known classifications as training
data, and therefore learns from the known to predict the unknown, while unsupervised
approaches do not require training data and are therefore able to learn from the unknown.
Chapter 6 is devoted to unsupervised approaches for identification of operational
states. An integrated framework (ARTnet) combining unsupervised neural network
ART2 with wavelet feature extraction is developed, which uses wavelets in place
of the data pre-processing part of ART2. It is shown that ARTnet is
superior to ART2 in avoiding the adverse effect of noise and is faster and more
robust. A Bayesian automatic classification approach (AutoClass) is also described.
The advantage of the approach is that it does not require the users to give any input:
the system determines the classification scheme automatically. A refinery fluid
catalytic cracking process is used to illustrate the approaches.
Most approaches for operational state identification, whether supervised or
unsupervised, are based on calculating a distance or similarity measure. They give
the classification but not causal explanations. Conceptual clustering which is
introduced in Chapter 7, on the other hand is able to create a conceptual clustering
CHAPTER 2
DATA MINING AND KNOWLEDGE DISCOVERY - AN OVERVIEW
Figure 2.1 The KDD process: from large volumes of data, through target data,
preprocessed data and transformed data, to patterns; the value of the content
increases as the volume decreases.
include statistics [26, 27], data warehousing [28, 29, 30], pattern recognition,
artificial intelligence and computer visualisation. Data mining and KDD draws upon
methods, algorithms and technologies from these diverse fields, and the unifying
goal is extracting knowledge from data.
Figure 2.3 The typical distribution of effort in a KDD project across scoping the
problem, exploring the data, cleaning the data, preparing the data, and interpreting
and evaluating the results.
Over the last ten years data mining and KDD has been developing at a dramatic
speed. In Information Week's 1996 survey of the 500 leading information
technology user organisations in the US, data mining came second only to the
Internet and intranets as having greatest potential for innovation in information
technology. The rapid progress is reflected not only by the establishment of research
groups on data mining and KDD in many international companies, but also by the
investment from banking, telecommunication and marketing sectors.
Figure 2.2 provides an overview of the activities involved in KDD and Figure 2.3
shows the typical distribution of effort.
Data mining and KDD is a very complex process, typically involving the following
procedures (Figure 2.4) [23, 34]:
account for noise, deciding on strategies for handling missing data fields, and
accounting for time sequence information and known changes.
Figure 2.4 The procedures of data mining and KDD: organising data by function,
eliminating noisy data, transforming values and transforming the data to different
representations.
(4). Data reduction and projection: finding useful features to represent the data
depending on the goal of the task. Using dimensionality reduction or transformation
methods to reduce the effective number of variables under consideration or to find
invariant representations for the data.
(5). Choosing the data-mining task: deciding whether the goal of the KDD process
is logical, summarising, classification, regression, prediction, and clustering etc.
(6). Choosing the data analysis algorithm(s): selecting method(s) to be used for
searching for patterns in the data. This includes deciding which models and
parameters may be appropriate (e.g., models for categorical data are different from
models on vectors over the real domain) and matching a particular data mining
method with the overall criteria of the KDD process.
Data mining and KDD is potentially valuable in virtually any industrial and
business sectors where database and information technology is used. The following
are some reported applications:
Data mining methods and tools can be categorised in different ways [25, 33, 28, 37].
According to functions and application purposes, data mining methods can be
classified as clustering, classification, summarisation, dependency modelling, link
analysis and sequence analysis. Some methods are traditional and established and
some are relatively new. In the following a very brief review of the techniques is
given.
2.3.1 Clustering
Table 2.1 An example of data structure: m attributes (columns 1, 2, ..., j, ..., m)
recorded for a set of instances (rows).
As a branch of statistics, clustering analysis has been studied extensively for many
years, mainly focused on distance-based clustering analysis, such as using the
Euclidean distance. There are many text books on this topic [44, 45, 46]. Notable
progress in clustering has been made with unsupervised neural networks, including the self-
organising Kohonen neural network [36, 38] and the adaptive resonance theory
(ART) [47, 48, 43, 49, 50, 51]. There have been many reports on their application to
operational state identification and fault diagnosis in the process industries.
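As a minimal illustration of distance-based clustering (a hypothetical sketch, not an algorithm taken from the references above; the data and the choice of two clusters are assumptions of the example), the k-means procedure below groups data patterns by Euclidean distance to cluster centres:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Minimal k-means: assign points to the nearest centre, then update centres.
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Two hypothetical operating regions in a two-variable space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2))])
labels, centres = kmeans(X, k=2)
print(centres)   # one centre near (0, 0), the other near (3, 3)

Unlike k-means, the unsupervised neural approaches cited above do not need the number of clusters to be fixed in advance.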
2.3.2 Classification
For a given set of data patterns such as those shown in Table 2.1, if the number
and descriptions of the classes as well as the assignments of individual data patterns
are known, then the task of assigning unknown data patterns to the established
classes belongs to classification. The most widely used classification
approach is based on feedforward neural networks (FFNNs). Classification is also
called supervised machine learning because it always requires data patterns with
known class assignments to train a model which is then used for predicting the class
assignment of new data patterns.
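A hedged sketch of this supervised scheme follows, using scikit-learn's MLPClassifier to stand in for a generic FFNN (the library, the data and the class labels are assumptions of the example): patterns with known class assignments train the model, which then assigns classes to new patterns.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
# Hypothetical training patterns from two known operational states.
X_train = np.vstack([rng.normal(0.0, 0.3, (40, 3)),     # state 0
                     rng.normal(1.0, 0.3, (40, 3))])    # state 1
y_train = np.array([0] * 40 + [1] * 40)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)                # supervised training on known classes

X_new = np.array([[0.1, -0.2, 0.05], [0.9, 1.1, 1.0]])
print(clf.predict(X_new))                # predicted operational states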
Table 2.2 A database example.
case   x1   x2   x3
1      1    0    0
2      1    1    1
3      0    0    1
4      1    1    1
5      0    0    0
6      0    1    1
7      1    1    1
8      0    0    0
9      1    1    1
10     0    0    0
it is not possible to know the most probable dependencies directly. Figure 2.5 (a)
and (b) illustrate two such possible dependencies. Theoretically, for a given
database there is a unique structure which has the highest joint probability and can
be found by some algorithm such as those developed by Cooper and Herskovits [54]
and Bouckaert [55, 56]. When a structure is identified, the next step is to find such a
probabilistic table as shown in Table 2.3.
Figure 2.5 Two possible dependency structures, (a) and (b), for the variables of Table 2.2.
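As an illustration of how candidate structures can be compared, the sketch below (the chain structure x1 -> x2 -> x3 is a hypothetical candidate, not necessarily one of those in Figure 2.5) scores a structure by the log joint probability of the Table 2.2 data under empirical conditional probabilities; structure-search algorithms such as that of Cooper and Herskovits [54] rank candidate structures with scores of this kind:

import numpy as np

# The ten binary cases of Table 2.2 (columns x1, x2, x3).
D = np.array([[1,0,0],[1,1,1],[0,0,1],[1,1,1],[0,0,0],
              [0,1,1],[1,1,1],[0,0,0],[1,1,1],[0,0,0]])

def log_lik_chain(D):
    # Log joint probability under the chain x1 -> x2 -> x3, i.e.
    # P(x1) P(x2|x1) P(x3|x2), with probabilities estimated from the data.
    ll = 0.0
    for x1, x2, x3 in D:
        p1 = np.mean(D[:, 0] == x1)
        p2 = np.mean(D[D[:, 0] == x1, 1] == x2)
        p3 = np.mean(D[D[:, 1] == x2, 2] == x3)
        ll += np.log(p1) + np.log(p2) + np.log(p3)
    return ll

print(log_lik_chain(D))   # compare against the score of alternative structures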
2.3.5 Summarisation
Figure 2.6 An illustration of the Apriori algorithm (N.S.D denotes the number of
supporting data cases).
Step 1: scan the database D to count the candidate 1-itemsets C1: {A} 2, {B} 3, {C} 3, {D} 1, {E} 3.
Step 2: discard itemsets below the minimum support, giving L1: {A} 2, {B} 3, {C} 3, {E} 3.
Step 3: generate the candidate 2-itemsets C2: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}.
Step 4: scan D to count them: {A B} 1, {A C} 2, {A E} 1, {B C} 2, {B E} 3, {C E} 2.
Step 5: discard those below the minimum support, giving L2: {A C} 2, {B C} 2, {B E} 3, {C E} 2.
Steps 6-7: generate the candidate 3-itemsets C3 ({A B C}, {A B E}, {B C E}) and scan D: {A B C} 1, {A B E} 1, {B C E} 2.
Step 8: the final frequent itemset is L3: {B C E} 2.
A notable technique for summarisation is mining association rules [61, 62]. Given
a relational database, mining association rules finds all associations of the form X => Y.
A rule is valid given two parameters Tc and Ts, such that the rule holds with
certainty > Tc and the rule is supported by at least Ts cases. Some commercial
systems have been developed using this approach [61, 62].
A very simple example can be used to illustrate the approach. Table 2.4 shows an
example of part of a transaction database. The purpose of mining the database is to
find all itemsets with sufficient support and the association rules they imply.
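The following Python sketch reproduces the level-wise search illustrated in the Apriori example above (the four transactions and the thresholds Ts = 2 and Tc = 0.7 are assumptions of the example, and the candidate generation is a simplified variant of the Apriori join step):

from itertools import combinations

# Hypothetical transactions matching the Apriori illustration above.
transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
Ts, Tc = 2, 0.7   # minimum support (cases) and minimum certainty (confidence)

def frequent_itemsets(transactions, Ts):
    # Level-wise search: count candidates, keep those supported by >= Ts cases.
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= Ts}
        frequent.update(level)
        k += 1
        keys = sorted({i for c in level for i in c})
        candidates = [frozenset(c) for c in combinations(keys, k)]
    return frequent

freq = frequent_itemsets(transactions, Ts)
for itemset, n in freq.items():
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            if n / freq[lhs] >= Tc:   # certainty of the rule lhs => rest
                print(set(lhs), '=>', set(itemset - lhs), f'certainty={n / freq[lhs]:.2f}')

Run on these transactions, the search recovers L1, L2 and L3 exactly as in the figure, and then emits the rules whose certainty exceeds Tc.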
2.3.6 Regression
Linear and non-linear regression is one of the commonest approaches for correlating
data. Statistical regression methods often require the user to specify the function
over which the data is to be fitted. In order to specify the function, it is necessary to
know the forms of the equations governing the correlation for the data. The
advantage of such methods is that from the equation it is possible to gain some
qualitative knowledge about the input - output relationships. However, if prior
knowledge is not available, it is necessary to find out the most probable function by
trial-and-error, which may require very time-consuming effort. Feedforward neural
networks (FFNNs) do not need the function to be fixed in order to learn and have
shown remarkable results in representing non-linear functions. However, the
resulting function using a FFNN is not easy to understand and is virtually a black
box without any explanation.
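The contrast can be illustrated with a short sketch (the data are hypothetical, and scikit-learn's MLPRegressor stands in for a generic FFNN): a polynomial of user-specified form is fitted by least squares, while the network learns the relationship without any functional form being specified.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical data: a nonlinear input-output relationship with noise.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 50)
y = np.exp(-2 * x) + 0.02 * rng.standard_normal(50)

# Statistical regression: the user must specify the functional form,
# here a cubic polynomial fitted by least squares.
coeffs = np.polyfit(x, y, deg=3)
rms_poly = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))

# FFNN regression: no functional form is given, but the fitted
# model is essentially a black box.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(x.reshape(-1, 1), y)
rms_net = np.sqrt(np.mean((y - net.predict(x.reshape(-1, 1))) ** 2))

print(f"polynomial RMS error: {rms_poly:.4f}, FFNN RMS error: {rms_net:.4f}")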
internal structures. Different from other data mining techniques, it does not require
a large number of historical data patterns. There are only a few reports of
the application of case-based reasoning in process industries such as case-based
learning for historical equipment failure databases [65, 66] and equipment design
[67].
Many industrial and business areas deal with time-series or dynamic data. It is
apparent that all statistical and real-time control data in process monitoring and
control are essentially time-series. Figure 2.7 shows the dynamic trend signals of a
variable under two different operational conditions. It is very easy for humans to
visually capture features of each trend and identify their difference. However for
computers to perform the same task is difficult. Most KDD techniques cannot
account for the time-series nature of data. The usual way to deal with time series data
is to pre-process the data so that a minimum number of data points captures the
features while noise is removed. These techniques include filters, e.g., Kalman filters,
Fourier and wavelet transforms, statistical approaches, neural networks as well as
various qualitative signal interpretation methods. Chapter 3 will introduce some of
these techniques for pre-processing dynamic data.
Figure 2.7 The dynamic trend signals of a variable under two different operational
conditions.
There are no well-developed techniques for selecting data mining and KDD methods.
It is still an art [23]. Apart from general considerations such as cost and
support, there are some technical dimensions to the method selection. These include
[23]:
(1) univariate vs. multi-variate data. Most approaches assume independence of
variables or simply consider a single variable at a time.
(2) numerical vs. categorical or mixed data. Some methods are only suitable for
numerical data, others only for categorical data. There are only a few cases which
allow mixed data.
(3) explanation requirements or comprehensibility. Some tools give results which
are implicit to users (black box), while others can give causal and explicit
representations.
(4) fuzzy or precise patterns. There are methods such as decision trees which only
work with clear cut definitions.
(5) sample independence assumptions. Most methods assume independence of data
patterns. If there are dependencies among the data patterns, it is necessary to remove
them or model them explicitly.
(6) availability of prior knowledge. Some tools require prior knowledge which
might not be available. On the other hand, some others do not allow the input of prior
knowledge, so that available prior knowledge is wasted.
It is important to be aware of the complexity of data, which tends to contain noise
and erroneous components and has missing values. Other challenges come from
lack of understanding of the domain problem and the assumptions associated with
individual techniques. Therefore, data mining is rarely done in one step. It often
requires a number of approaches, using some tools to prepare data for other
methods or for validation purposes. As a result, multifunctional and integrated
systems are required.
Data pre-processing may be more time consuming and presents more challenges
than data mining. Process data often contains noise and erroneous components and
has missing values. There is also the possibility that redundant or irrelevant
variables are recorded, while important features are missing. Data pre-processing
includes provision for correcting inaccuracies, removing anomalies and eliminating
duplicate records, and filling holes in the data and checking entries for consistency.
It also requires making the necessary transformations of the original data to put it in a
format suitable for data mining tools.
The other important requirement of the KDD process is feature selection. KDD is a
complicated task and often depends on the proper selection of features. Feature
selection is the process of choosing features which are necessary and sufficient to
represent the data. There are several issues influencing feature selection, such as
masking variables, the number of variables employed in the analysis and the relevancy
of the variables [68].
Masking variables hide or disguise patterns in data. Numerous studies have shown
that inclusion of irrelevant variables can hide real clustering of the data so only
those variables which help discriminate the clustering should be included in the
analysis [68, 69, 70].
The number of variables used in data mining is also an important consideration.
There is generally a tendency to use more variables. However, increased
dimensionality has an adverse effect because, for a fixed number of data patterns,
increased dimensionality makes the multidimensional data space sparse.
However failing to include relevant variables also causes failure in identifying the
clusters. A practical difficulty in mining some industrial data is to know if all
important variables have been included in the data records.
Prior knowledge should be used if it is available. Otherwise, mathematical
approaches need to be employed. Feature extraction shares many approaches with
data mining. For example, principal component analysis (PCA), which is a useful
tool in data mining, is also very useful for reducing the dimension (PCA and its
applications are introduced in Chapters 3 and 4). However, PCA is only suitable for
dealing with real-valued attributes. Mining of association rules is also an
effective approach for identifying the links between variables which take only
categoric values [68]. Sensitivity studies using feedforward neural networks
(FFNNs) are also an effective way of identifying important and less important
variables (sensitivity studies using FFNNs are introduced in Chapter 9).
Gnanadesikan et al. [69] reviewed a number of clustering techniques which
identify discriminating variables in data.
This Chapter has provided an overview of data mining and KDD. Data mining and
KDD makes use of various technologies including statistics, neural networks,
machine learning, artificial intelligence, pattern recognition and databases. The
unifying goal is to extract useful information and knowledge from massive data. It is
a complex and iterative process, starting with data access, continuing with data
cleaning and pre-processing as well as data mining and knowledge discovery, and
finally culminating with interpretation and validation of results. It is important to be aware
of the complexity of industrial data and that there are always some assumptions
related to specific KDD techniques. Integration of various methods is necessary, so
that some tools can be used in preparing data for other methods and results obtained
using different methods can be compared.
Despite the rapid growth, the success achieved and a huge predicted market, KDD
is still considered to be in its infancy. There are still many challenges to overcome.
An obvious issue in process monitoring and control is how to deal with dynamics
associated with the data. Another issue is that most data mining tools assume that
the variables are independent of each other, but process variables are often
connected. The challenges posed by operational data have already been summarised
in Section 1.5.
Apart from the need to develop more reliable data mining and KDD tools, there is
also the need to gain more experience in applying them to industrial and business
problems.
For introduction and review of data mining and KDD, readers are referred to
Fayyad et al. [71], Wu [72], Chen et al. [28], Simoudis et al. [73], Wu et al. [74],
and Pyle [75]. There are also some useful web sites which are summarised in Table
2.5 and provide gateways to other resources.
CHAPTER 3
DATA PRE-PROCESSING FOR FEATURE EXTRACTION,
DIMENSION REDUCTION AND CONCEPT FORMATION
first of all to minimise the dependencies between attributes and secondly to reduce
dimensionality.
(3) Deal with the problem of variable sampling periods for data, such as on-line
real time signals and laboratory analytical data.
(4) Develop concept formation because some data mining and KDD tools have
been developed only for dealing with discrete-valued attributes and are not effective
in dealing with continuous-valued variables. It is not possible to use variables
represented by a trend without preprocessing the data.
It is worth noting that data pre-processing has many features in common with data
mining, such as principal component analysis, supervised and unsupervised
classification using statistical and neural network algorithms.
The method of principal component analysis (PCA) was originally developed in the
early 1900s [84, 85], and has re-emerged as an important technique in data analysis.
The central idea is to reduce the dimensionality of a data set consisting of a large
number of interrelated variables, while retaining as much as possible of the variation
present in the data set. Multiple regression and discrimination analysis use variable
selection procedures to reduce the dimension but result in the loss of one or more
important dimensions. The PCA approach uses all of the original variables to obtain
a smaller set of new variables (principal components - PCs) that can be used to
approximate the original variables. The greater the degree of correlation between
the original variables, the fewer the number of new variables required. PCs are
uncorrelated and are ordered so that the first few retain most of the variation present
in the original set.
It is convenient at this point to give a brief summary of the basic points. In the
univariate case the mean and variance are used to summarise a data set. The mean,
or mean value, of a discrete distribution is denoted by \mu and is defined by

\mu = \sum_i X_i f(X_i)     (3.1)
where f(X_i) is the probability function of the random variable X considered. The
mean is also known as the mathematical expectation of X and is sometimes denoted
by E(X).
The variance of a distribution is denoted by \sigma^2 and is defined by

\sigma^2 = \sum_i (X_i - \mu)^2 f(X_i) = E[(X - \mu)^2]     (3.2)

It is an index reflecting the deviation of the X_i from the mean \mu: the bigger the
variance, the more widely the values are spread about the mean.
To summarise multivariate data sets, it is necessary to find the mean and variance
of each of the p variables, together with a measure of the way each pair of variables
is related. For the latter, the covariance or correlation of each pair of variables is
used.
The population mean vector is given by \mu = [\mu_1, \mu_2, ..., \mu_p], where

\mu_i = E(x_i)     (3.3)

An estimate of \mu based on n p-dimensional observations is
\bar{x} = [\bar{x}_1, \bar{x}_2, ..., \bar{x}_p], where \bar{x}_i is the sample mean of variable x_i.
The vector of population variances is \sigma^2 = [\sigma_1^2, ..., \sigma_p^2], where
\sigma_i^2 = E(x_i^2) - \mu_i^2. An estimate of \sigma^2 based on n p-dimensional
observations is s^2 = [s_1^2, s_2^2, ..., s_p^2], where s_i^2 is the sample variance of
variable x_i.
The covariance of two variables x_i and x_j is defined by

Cov(x_i, x_j) = E(x_i x_j) - \mu_i \mu_j     (3.4)
With p variables, x_1, x_2, ..., x_p, there are p variances and p(p-1)/2 covariances. In
general these quantities are arranged in the p x p symmetric matrix, called the
covariance matrix,

\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}     (3.5)

where \sigma_{ij} = \sigma_{ji}. The sample version of \Sigma, usually denoted S, is
generally estimated as

S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'     (3.6)
The correlation coefficient lies between -1 and 1 and gives a measure of the linear
relationship between variables x_i and x_j.
Given the p variables x_1, x_2, ..., x_p, the purpose of principal component analysis is
to determine a new variable y_1 that can be used to account for the variation in the p
variables. The first principal component is given by a linear combination of the p
variables as

y_1 = w_{11} x_1 + w_{12} x_2 + ... + w_{1p} x_p     (3.8)

where the coefficients (also called weights) w_{11}, w_{12}, ..., w_{1p}, conveniently
written as a vector w_1, are chosen so that the sample variance of y_1 is greatest,
subject to the constraint that the sum of squares of the coefficients, i.e. w_1' w_1,
should be unity.
The second principal component, y_2, is given by the linear combination of the p
variables in the form

y_2 = w_{21} x_1 + w_{22} x_2 + ... + w_{2p} x_p,   or   y_2 = w_2' x     (3.9)

which has the greatest variance subject to the two conditions

w_2' w_2 = 1     (3.10)

and

w_2' w_1 = 0   (so that y_1 and y_2 are uncorrelated)

Similarly the jth principal component is a linear combination

y_j = w_j' x     (3.11)

To find the coefficients defining the first principal component, the elements of
w_1 should be chosen so as to maximise the variance of y_1 subject to the constraint
w_1' w_1 = 1. The variance of y_1 is then given by

Var(y_1) = Var(w_1' x) = w_1' S w_1     (3.12)
where S is the variance-covariance matrix of the original variables (see Section
3.2.1). The solution for w_1 = (w_{11}, w_{12}, ..., w_{1p}) which maximises the
variance of y_1 is the eigenvector of S corresponding to the largest eigenvalue. The
eigenvalues of S are roots of the equation

|S - \lambda I| = 0     (3.13)

If the eigenvalues are \lambda_1, \lambda_2, ..., \lambda_p, they can be arranged from
the largest to the smallest. The first few eigenvectors are the principal components
that capture most of the variance of the original data, while the remaining PCs
mainly represent noise in the data.
PCA is scale dependent, and so the data must be scaled in some meaningful way
before PCA is applied. The most usual way is to scale each variable to unit
variance.
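A minimal numerical sketch of the above (the data are hypothetical; numpy's eigendecomposition is used to solve Equation 3.13) scales each variable to unit variance, forms the covariance matrix of Equation 3.6 and projects the data onto the leading eigenvectors:

import numpy as np

def pca(X):
    # Scale to unit variance, form the covariance matrix (Equation 3.6)
    # and solve the eigenvalue problem |S - lambda I| = 0 (Equation 3.13).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    S = np.cov(Z, rowvar=False)
    eigvals, W = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    return eigvals[order], W[:, order], Z

# Hypothetical data: 100 observations of five variables, four of them correlated.
rng = np.random.default_rng(4)
base = rng.standard_normal(100)
X = np.column_stack([base + 0.1 * rng.standard_normal(100) for _ in range(4)]
                    + [rng.standard_normal(100)])

eigvals, W, Z = pca(X)
scores = Z @ W[:, :2]                          # y_j = w_j' x for the first two PCs
print(eigvals[:2].sum() / eigvals.sum())       # fraction of variance captured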
In computer control systems such as DCS, nearly all important process variables are
recorded as dynamic trends. Dynamic trends can be more important than the actual
real time values in evaluating the current operational status of the process and in
anticipating possible future developments. Appendix C describes a data set of one
hundred cases corresponding to various operational modes such as faults,
disturbances and normal operation of a refmery reactive distillation process for
manufacture of methyl tertiary butyl ether (MTBE), a lead-free gasoline additive.
This can be used to illustrate the dimension compression capability of PCA. For
each data case, twenty one variables are recorded as dynamic responses after a
disturbance or fault occurs. Each trend consists of 256 sample points. Figure 3.l
shows the trends of a variable for two different cases. The eigenvalues of the first 20
principal components are summarised in Figure 3.2. It is apparent that the
eigenvalues of the first few principal components can be· used as a concise
representation of the original dynamic trend, and so are used to replace the original
responses for use in pattern recognition.
Figure 3.1 The dynamic trends of a variable for two data cases.
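The idea can be sketched as follows (the trends are synthetic stand-ins for the 256-point responses of Appendix C, and the singular value decomposition is used as an equivalent route to the principal components); each trend is summarised by its projection onto the first two PCs:

import numpy as np

rng = np.random.default_rng(5)
n_cases, n_points = 100, 256
base = np.sin(np.linspace(0, 3 * np.pi, n_points))
# Hypothetical trends: each case is a scaled response plus noise.
trends = np.array([rng.uniform(0.5, 2.0) * base
                   + 0.05 * rng.standard_normal(n_points)
                   for _ in range(n_cases)])

Xc = trends - trends.mean(axis=0)
# The SVD yields the principal directions without forming a 256 x 256 covariance matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T     # each 256-point trend summarised by two numbers
print(scores[:3])          # coordinates of the first three cases in the PC plane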
Figure 3.2 The eigenvalues of the first 20 principal components.
Since the first two principal components can capture the main feature of a dynamic
trend, this can be displayed graphically by plotting the eigenvalues on a two-
dimensional plane. Figure 3.3 shows such a plot of the eigenvalues of the first two
principal components of a variable Fo. A point in the two-dimensional plane
represents the feature of the variable response trend for one data case. Data points in
region B have response trends which are similar, and unlike those in region D.
The fact that a two-dimensional plot is able to capture the features can be seen
from Figures 3.4 and 3.5. Figure 3.4 shows the dynamic responses of the variable
T_MTBE for seven data cases. After being processed using PCA (actually the seven
data cases are processed using PCA together with another 93 data cases, but here
only the seven are shown for illustrative purposes), the results are shown on the two-
dimensional PCA plane in Figure 3.5. It is clear that the dynamic trends of data
cases 1 and 2 are more alike than the others in Figure 3.4, and they are grouped
closer together in Figure 3.5. Similar observations can be made for data cases 40 and
80, as well as 14 and 15.
Figure 3.3 The PCA two-dimensional plane of the variable Fo, with clusters of cases
having similar response trends marked as regions A to D.
Figure 3.4 The dynamic trends of the temperature T_MTBE for the case study
described in Appendix C.
Figure 3.5 The projection of the dynamic trends of Figure 3.4 on the two-
dimensional PCA plane.
Figure 3.6 The PCA two-dimensional plane of the variable TR.
Figure 3.7 The PCA two-dimensional plane of the operational states, showing
clusters of normal (NOR1, NOR2, NOR3) and abnormal (ABN1, ABN2) operations.
Studies have found that the presence of redundant and irrelevant variables may
deteriorate pattern recognition or hide the real patterns in the data [68], and so some
data mining and KDD tools require the inputs to be independent. Sometimes it is not
possible to directly identify the dependencies between variables. PCA can be used
to pre-process the data, and the first few principal components are then used by
other data mining and KDD tools.
Recently, wavelet analysis has emerged as a promising new approach for signal and
image analysis and has been extended to process monitoring and control. In this
section, it is convenient to start with the well-established approach based on Fourier
transform for signal processing and how it relates to wavelet transforms. Wavelet
transforms are then introduced, followed by their application to feature extraction.
Fourier transforms are well known as a useful technique for frequency analysis of a
signal which breaks down a signal into constituent sinusoids of different
frequencies. The transform is defined as
F(\omega) = \int_{-\infty}^{+\infty} f(t) e^{-i\omega t} dt     (3.14)

and the inverse is

f(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(\omega) e^{i\omega t} d\omega     (3.15)
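In discrete form, Equation 3.14 is computed with the fast Fourier transform. The sketch below (the signal and sampling rate are hypothetical) recovers the two constituent frequencies of a composite signal:

import numpy as np

# Hypothetical signal: two sinusoids, sampled at 256 Hz for one second.
fs, n = 256, 256
t = np.arange(n) / fs
f = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

F = np.fft.rfft(f)                        # discrete analogue of Equation 3.14
freqs = np.fft.rfftfreq(n, d=1 / fs)
print(freqs[np.argsort(np.abs(F))[-2:]])  # the constituent frequencies, 40 and 5 Hz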
The short-time Fourier transform analyses the signal through a window function
g(t) which is nonzero only over a finite interval, for example the rectangular window

g(t) = \begin{cases} 1, & t' \le t \le t' + T \\ 0, & \text{otherwise} \end{cases}     (3.17)
Figure 3.8 A signal s(t) broken down into constituent sinusoids of different
frequencies.

Figure 3.9 The windowing operation: the windowed signal is given by
\bar{s}(t) = s(t) g(t).
3.10. The scaling factor works in exactly the same way with wavelets, as can be seen
in Figure 3.11: the smaller the scale factor, the more compressed the wavelet is.
Figure 3.10 The effect of the scale factor on sinusoids: f(t) = sin(t) with a = 1,
f(t) = sin(2t) with a = 1/2, and f(t) = sin(4t) with a = 1/4.

Figure 3.11 The scaling operation applied to a wavelet: f(t) = \psi(t) with a = 1.
The parameter b in Equation 3.19 is a translation along the time axis and simply
shifts a wavelet and so delays or advances the time at which it is activated.
Mathematically, delaying a function f(t) by t_d is represented by f(t - t_d). The
factor 1/\sqrt{a} is used to ensure that the energy of the scaled and translated
versions is the same as that of the mother wavelet. Figure 3.12 shows an example of
a particular mother wavelet \psi(t), known as the Mexican hat function [98],

\psi(t) = \frac{2}{\sqrt{3}} \pi^{-1/4} (1 - t^2) \exp(-\frac{t^2}{2})
The stretched and compressed wavelets produced by the scaling operation are used
to capture the different frequency components of the function being analysed. The
compressed version in Figure 3.12(b) is used to fit the high frequency needs, and the
stretched version in Figure 3.12(c) is for low frequencies. The translation operation,
on the other hand, involves shifting of the mother wavelet along the time axis to
capture the time information of the function to be analysed at a different position, as
shown in Figure 3.12(d).
Figure 3.12 The Mexican hat wavelet: (a) the mother wavelet; (b) a compressed
version; (c) a stretched version (dilation, a = 5, b = 0); (d) a translated version
(translation, a = 1, b = 8).
In this way, a family of scaled and translated wavelets can be created using the scaling and translation parameters a and b. This allows signals that occur at different times and have different frequencies to be analysed.
In contrast to the short-time Fourier transform, which uses a single analysis
window function, the wavelet transform can use short windows at high frequencies
or long windows at low frequencies. Thus the wavelet transform is capable of
zooming-in on short-lived high frequency phenomena, and zooming-out for
sustained low frequency phenomena. This is the main advantage of the wavelet over
the short-time Fourier transform.
Given a mother wavelet function \psi(t), the continuous wavelet transform CWT_f of a function f(t) is defined by

CWT_f(a, b) = \langle f, \psi_{a,b} \rangle = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t) \psi\left(\frac{t-b}{a}\right) dt, \quad a, b \in R, \; a \neq 0   (3.20)

where \psi((t-b)/a)/\sqrt{a} is sometimes called the baby wavelet. Here the time t and the scaling and translation parameters a and b can be changed continuously. In this case, CWT_f(a, b) is called the wavelet transform coefficient. If a, b and t change continuously, the values of CWT_f(a, b) can be represented by a three dimensional diagram.
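A direct, if inefficient, numerical evaluation of Equation 3.20 can be sketched as follows (Python/NumPy, using the Mexican hat above as the mother wavelet; all names are our own):

import numpy as np

def mexican_hat(t):
    # The Mexican hat mother wavelet psi(t).
    return (2 / np.sqrt(3)) * np.pi ** -0.25 * (1 - t ** 2) * np.exp(-t ** 2 / 2)

def cwt(f, scales, dt=1.0):
    # Evaluate CWT_f(a, b) of Equation 3.20 on a sampled signal.
    n = f.size
    t = np.arange(n) * dt
    coeffs = np.zeros((len(scales), n))
    for i, a in enumerate(scales):
        for j, b in enumerate(t):
            baby = mexican_hat((t - b) / a) / np.sqrt(a)   # the baby wavelet
            coeffs[i, j] = np.sum(f * baby) * dt           # inner product <f, psi_ab>
    return coeffs

signal = np.sin(np.linspace(0, 8 * np.pi, 256))
C = cwt(signal, scales=[1, 2, 4, 8])    # one row of coefficients per scale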
The application of a continuous wavelet transform consists of the following steps.

(1) Take a wavelet and compare it to a section at the start of the original signal.

[Figure 3.13(a): the wavelet compared with the first section of the signal; C = 0.0102.]

(2) Calculate the wavelet coefficient CWT_f, representing how closely the wavelet is related to this section of the signal, as shown in Figure 3.13(a). The higher CWT_f is, the greater the similarity. The result obviously depends on the shape of the wavelet chosen.

(3) Shift the wavelet to the right and repeat steps 1 and 2 until all of the signal has been examined, as shown in Figure 3.13(b).

[Figure 3.13(b): the wavelet shifted along the signal. Figure 3.13(c): the stretched wavelet compared with the signal; C = 0.2247.]

(4) Scale (stretch) the wavelet and repeat steps 1 through 3 (Figure 3.13(c)).

(5) Repeat steps 1 to 4 for all scales.
Plotting the wavelet coefficients against time and scale generates a three
dimensional diagram, as shown in Figure 3.14. An alternative would be to use the
two dimensional diagram in Figure 3.15, where brightness reflects the magnitude of
the wavelet coefficients.
[Figure 3.15: the two dimensional time-scale plot of the wavelet coefficients; brightness reflects the magnitude of the coefficients, from small to large.]
Continuous in the context of the wavelet transform implies that the scaling and translation parameters a and b change continuously. In practice, it is necessary to select a number of scales, which is determined by the acceptable computational effort. A similar argument applies to the translation (shifting) parameter. In both cases, however, the data of the signal processed by the continuous wavelet transform on a computer is discrete.
The wavelet transform coefficients which are generated can be used to reconstruct the original function,

f(t) = C_\psi^{-1} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} CWT_f(a, b) \psi_{a,b}(t) \frac{da \, db}{a^2}   (3.21)

where C_\psi is called the admissibility constant, defined by

C_\psi = \int_{-\infty}^{+\infty} \frac{|\Psi(\omega)|^2}{|\omega|} d\omega < +\infty   (3.22)

where \Psi(\omega) is the Fourier transform of \psi(t).
Singularities often carry the most important information in signals [82]. For example, Bakshi and Stephanopoulos [90, 91] used inflexion points as the connection points of episode segments of a signal, as shown in Figure 3.17 (episode segments of a signal are discussed in Section 3.4). The singularities of a signal can be used as a compact representation of the original signal and used as inputs to pattern recognition systems.
[Figure 3.17: two trends segmented into episodes, with connection points labelled a to e.]

[Figure: multiresolution decomposition of a signal x into the details D^1 x, D^2 x, D^3 x and D^4 x and the approximation A^4 x, where D^i x is the detail and A^i x the approximation on the ith decomposition.]
where {h(k)} are the low-pass filter coefficients and {g(k)} the band-pass filter coefficients.
Daubechies wavelets have the maximum number of vanishing moments for a given support. Wavelets with different numbers of vanishing moments also have different numbers of filter coefficients. Using wavelets with more vanishing moments has the advantage of being able to measure the Lipschitz regularity up to a higher order, which is helpful in filtering noise, but it also increases the number of maxima lines. The number of maxima for a given scale often increases linearly with the number of moments of the wavelet. In order to minimise computational effort, it is necessary to have the minimum number of maxima that can still detect the significant irregular behaviour of a signal. This means choosing a wavelet with as few vanishing moments as possible, but with enough moments to detect the Lipschitz exponents of the highest order components of interest.
For the cases considered here, an eight-coefficient "least-asymmetric" Daubechies wavelet is used as a filter. The scaling and wavelet functions for this filter are illustrated in Figure 3.19.
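If the PyWavelets package is available, the corresponding multiresolution decomposition can be sketched as below; in that library the eight-coefficient least-asymmetric Daubechies wavelet is named 'sym4', and the test signal is an illustrative assumption:

import numpy as np
import pywt

t = np.linspace(0, 4 * np.pi, 512)
signal = np.sin(t) + 0.1 * np.random.randn(t.size)   # noisy sine wave (assumed)

# Four-level decomposition into A4x and the details D4x, D3x, D2x, D1x.
coeffs = pywt.wavedec(signal, 'sym4', level=4)
A4, D4, D3, D2, D1 = coeffs

# Suppressing the finest detail D1x gives a simple noise filter.
coeffs[-1] = np.zeros_like(coeffs[-1])
smoothed = pywt.waverec(coeffs, 'sym4')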
[Figure 3.19: the scaling function and the wavelet function of the eight-coefficient least-asymmetric Daubechies wavelet, plotted against time.]
A signal f(t) = sin(t) and its extrema from the wavelet analysis, using a non-subsampled filter bank with the Daubechies eight-coefficient least-asymmetric wavelet, are illustrated in Figure 3.20. It shows that the extrema of the wavelet analysis correspond to the singularities of the signal. For the same signal singularity, the corresponding extremum of the wavelet analysis can be a maximum or a minimum, depending on the wavelet used. In Figure 3.20(b), the wavelet is used as a filter, and the first singularity of the signal in Figure 3.20(a) corresponds to a minimum in the wavelet analysis. In Figure 3.21 it is a maximum because a different wavelet is employed. The former is used here.
Figure 3.20 Signal (a) and the extrema (b) of the wavelet analysis based on the Daubechies eight-coefficient wavelet.
Figure 3.21 Extrema of the wavelet analysis with the Daubechies ten-coefficient wavelet.
Figure 3.22 A noise signal, its wavelet transform at scales 1 to 3, and the extrema of the wavelet transform.
[Figure 3.23: extrema of the wavelet multiresolution analysis at scales 2 to 5.]
Two observations can be made from the above discussion. Firstly, extrema analysis using wavelet multiresolution analysis remains stable as the scale increases. For example, in Figure 3.23, when the scale is increased from 4 to 5, the four extrema remain. Secondly, the locations of the extrema may shift slightly in time as the scale increases. In Figure 3.23, the extrema representation at scale 4 is a vector of dimension 70.

It is clear that after piece-wise processing the dimension is reduced, and scale-4' and scale-5' are consistent. Therefore, using a piece-wise processing technique, it is possible to achieve consistent feature extraction and reduction in dimension.
This section describes the qualitative interpretation of dynamic trends using the episode approach. It was developed earlier than the PCA and wavelet approaches and is straightforward. However, it normally suffers from being weak in dealing with noise.
The episode representation approach was originally developed by William [96]. Janusz and Venkatasubramanian [77] adapted it and used nine primitives to represent any plot of a function, as shown in Figure 3.24. Each primitive consists of the signs of the function and of its first and second derivatives. This means each primitive carries information about whether the function is positive or negative, increasing, decreasing or not changing, and about its concavity. An episode is an interval described by only one primitive, together with the time interval the episode spans. A trend is a series of episodes that, when grouped together, can completely describe the qualitative states of the system. C and D in Figure 3.24 are actually not primitives, because they can be regarded as the combinations of A, F and B, E. Therefore they can be reduced to the seven primitives shown in Figure 3.25.
[Figure 3.24: the nine primitives, each labelled by the signs of the function and its first and second derivatives, including A(+,+,-), B(+,-,+), C(+,0,-), D(+,0,+), E(+,+,+) and F(+,-,-).]
[Figure 3.25: the seven primitives:
(a) concave upward, monotonic increase: [∂x] > 0, [∂∂x] > 0;
(b) concave upward, monotonic decrease: [∂x] < 0, [∂∂x] > 0;
(c) concave downward, monotonic increase: [∂x] > 0, [∂∂x] < 0;
(d) concave downward, monotonic decrease: [∂x] < 0, [∂∂x] < 0;
(e) linear increase: [∂x] > 0, [∂∂x] = 0;
(f) linear decrease: [∂x] < 0, [∂∂x] = 0;
(g) constant: [∂x] = 0, [∂∂x] = 0.]
This is also illustrated in Figure 3.17, where trend 1 consists of the primitives c-d-b-a-c-d-b. The connection points c-d and b-a are extrema (maximum and minimum points respectively), and inflexion points lie between d and b, and between a and c. Trend 2 in Figure 3.17 illustrates another case.
The task of identifying the episodes in a signal is simply to identify the inflexions and/or extrema, i.e., the singularities in the signal, since they correspond to the distinct points of the episode segments. This means that the singularities of a signal contain the most important information about the trend. Using singularities for feature representation therefore completely defines the episode characteristics of a signal.
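A minimal sketch of this idea: assign one of the seven primitives of Figure 3.25 from the signs of the first and second derivatives, and start a new episode wherever the primitive changes (the tolerance, the letter codes and the use of numerical gradients are our own assumptions; as discussed below, a real signal would have to be filtered for noise first):

import numpy as np

def primitive(dx, ddx, tol=1e-6):
    # Map the signs of the first and second derivatives to a primitive (Figure 3.25).
    s1 = 0 if abs(dx) < tol else (1 if dx > 0 else -1)
    s2 = 0 if abs(ddx) < tol else (1 if ddx > 0 else -1)
    table = {(1, 1): 'a', (-1, 1): 'b', (1, -1): 'c', (-1, -1): 'd',
             (1, 0): 'e', (-1, 0): 'f'}
    return table.get((s1, s2), 'g')        # anything else is treated as constant

def episodes(x, tol=1e-6):
    # Label every sample, then record the points where the primitive changes.
    dx = np.gradient(x)
    ddx = np.gradient(dx)
    labels = [primitive(a, b, tol) for a, b in zip(dx, ddx)]
    eps = [(labels[0], 0)]
    for i, lab in enumerate(labels):
        if lab != eps[-1][0]:
            eps.append((lab, i))           # a new episode starts here
    return eps                             # [(primitive, start index), ...]

segments = episodes(np.sin(np.linspace(0, 4 * np.pi, 200)))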
However, the singularities are strongly influenced by noise and this is the major
weakness of this approach. Noise components must be identified and filtered from
the features, otherwise the representation will be misleading. Bakshi and
Stephanopoulos [81, 90] used the wavelet approach developed by Mallat [97],
Mallat and Zhong [93] and Mallat and Hwang [92] for detecting the inflection
points, as described in sections 3.3.5 and 3.3.6.
3.5 Summary
For a specific variable, its dynamic responses under various disturbances or faults can be effectively discriminated by inspecting the location of a single data point on a PCA two dimensional plane. Concept formation is an essential step for conceptual clustering, an approach, to be introduced in Chapter 7, that can develop a descriptive language for clustering operational states.
Episode-based approaches are able to convert signal information into qualitative descriptions; however, they often suffer from the adverse effect of noise components. Available episode approaches have not addressed how to deal with these effects.
CHAPTER 4
MULTIVARIATE STATISTICAL ANALYSIS FOR DATA
ANALYSIS AND STATISTICAL CONTROL
Process monitoring and diagnosis is conducted at two levels [8]: the immediate safety and operation of the plant, usually monitored by plant operators, and the long-term performance analysis, monitored by supervisors and engineers. Long term
[Figure: recovery (%) of the process plotted against time in days (0 to 500); the normal level of about 92% is marked.]
The first step involves PCA analysis of the 442 x 498 data matrix (process variables x number of data patterns). It was found that the first 7 PCs can explain 93% of the variation in purity and 93% in recovery. Projection of the first two PCs onto a two dimensional plane indicated that for the last three months (data points 401 to 490), where the recovery was low, the process operation had changed to a new operational state. This clearly explains the reason for the drop in recovery over the last three months, which would have been very difficult to find out without this multivariate analysis.
[Figure 4.2: projection of the data onto the plane of the first two PCs, with the normal operating region enclosed by a circle.]
If we regard the area inside the circle of Figure 4.2 as normal operation, then this diagram can also be used for multivariate statistical monitoring. If the operating point goes outside the region then the operation can be regarded as abnormal. Kresta et al. [104] gave other examples using the same approach to design statistical monitoring systems for a fluidised bed reactor and a binary distillation column.
Traditionally, univariate statistical process control charts, e.g., the Shewhart chart shown in Figure 1.2, have been used in industry to separately monitor either a few process variables or key measurements on the final product which in some way define the quality of the product. The difficulty with this approach is that these quality variables are not independent of one another, nor does any one of them adequately define product quality by itself. The difficulties with using independent univariate control charts can be illustrated by reference to Figure 4.3. Here only two quality variables (y1, y2) are considered for ease of illustration. Suppose that, when the process is in a state of statistical control where only common cause variation is present, y1 and y2 follow a multivariate normal distribution and are correlated (\rho_{y1,y2} = 0.8), as illustrated in the joint plot of y1 vs. y2 in Figure 4.3.
Figure 4.3 Quality control of two variables illustrating the misleading nature of univariate charts.
The ellipse represents a contour for the in-control process, and the dots represent a set of observations from this distribution. The same observations are also plotted in Figure 4.3 as univariate Shewhart charts of y1 and y2 vs. time, with their corresponding upper and lower limits. Note that by inspection of each of the individual Shewhart charts the process appears to be clearly in a state of statistical control. The only indication of any difficulty is that a customer has complained about the performance of the product corresponding to the point marked ⊕ in Figure 4.3. The true situation is only revealed in the multivariate y1 vs. y2 plot, where it is seen that the product indicated by the ⊕ is clearly outside the joint confidence region, and is clearly different from the normal "in-control" population of product.
In multivariate data analysis, the ellipse in Figure 4.3 is determined by calculating the Mahalanobis distance [106]. In the discussion of bivariate samples the quantity

m^2 = \frac{1}{1 - r^2} \left[ \frac{(y_1 - \bar{y}_1)^2}{s_1^2} - \frac{2 r (y_1 - \bar{y}_1)(y_2 - \bar{y}_2)}{s_1 s_2} + \frac{(y_2 - \bar{y}_2)^2}{s_2^2} \right]   (4.1)

is often used to describe the locus of an ellipse in two-dimensional space with centre (\bar{y}_1, \bar{y}_2). This quantity also measures the square of the Mahalanobis distance between the point (y_1, y_2) and the centre (\bar{y}_1, \bar{y}_2). All points on this ellipse have the same distance m^2 from (\bar{y}_1, \bar{y}_2).
This can be extended to the multivariate situation and used for designing multivariate Shewhart charts, called \chi^2 and T^2 charts, for statistical process control. This was originated by Hotelling [107], and several references discuss the charts in more detail [103, 105, 108, 109, 110].
Given a (k x 1) vector y of k normally distributed variables with covariance matrix \Sigma, the Mahalanobis distance from the centre, equivalent to Equation 4.1, is

\chi^2 = (y - \bar{y})^T \Sigma^{-1} (y - \bar{y})   (4.2)
\Sigma is also called the in-control covariance matrix. If \Sigma is not known, it must be estimated from a sample of n past multivariate observations as

S = (n - 1)^{-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T   (4.3)
When a new multivariate observation y is obtained, Hotelling's T^2 statistic is given by

T^2 = (y - \bar{y})^T S^{-1} (y - \bar{y})   (4.4)
T^2 can be plotted against time. An upper control limit (UCL) on this chart is given by [105],

T^2_{UCL} = \frac{(n-1)(n+1)k}{n(n-k)} F_\alpha(k, n-k)   (4.5)

where F_\alpha(k, n-k) is the upper 100\alpha% critical point of the F distribution with k and n-k degrees of freedom [111].
Figure 4.4 The Hotelling's T^2 chart for the recovery process (with the 95% limit marked), indicating a deviation from normal operation around the 400th day.
Since many of the process variables are correlated, only a few underlying events are driving the process at any time, and all the measurements are simply different reflections of these same underlying events. Principal component analysis can therefore be applied to process the data first, before the multivariate Shewhart charts are used. This means the latent variables, i.e., the first few PCs, are used rather than the original variables. If the first A PCs are used, then Hotelling's T^2 can be calculated by [105],

T^2_A = \sum_{i=1}^{A} \frac{t_i^2}{s_{t_i}^2}   (4.6)
where s_{t_i}^2 is the estimated variance of t_i. If A = 2, a joint t_1 vs. t_2 plot can be used. Note that the traditional Hotelling T^2 in Equation 4.4 is equivalent to

T^2 = \sum_{i=1}^{A} \frac{t_i^2}{s_{t_i}^2} + \sum_{i=A+1}^{k} \frac{t_i^2}{s_{t_i}^2}   (4.7)
For the recovery process, the T^2 chart is shown in Figure 4.4. The 95% confidence limit was determined from periods of good operation where the recovery is around 92%. Had the chart been on-line, the deviation from normal could have been detected from process data alone, immediately when it occurred around observation 400.
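A sketch of such a T^2 chart on the first A PC scores, using Equations 4.5 and 4.6 (Python with NumPy and SciPy; the reference scores here are random stand-ins for real in-control data):

import numpy as np
from scipy.stats import f

def t2_chart(scores, alpha=0.05):
    # scores: (n x A) matrix of mean-centred PC scores from in-control data.
    n, k = scores.shape
    s2 = scores.var(axis=0, ddof=1)            # estimated score variances
    t2 = np.sum(scores ** 2 / s2, axis=1)      # Equation 4.6
    # Upper control limit from Equation 4.5.
    ucl = (n - 1) * (n + 1) * k / (n * (n - k)) * f.ppf(1 - alpha, k, n - k)
    return t2, ucl

scores = np.random.randn(200, 2)               # stand-in for (t1, t2)
t2, ucl = t2_chart(scores)                     # alpha = 0.05 gives the 95% limit
alarms = np.where(t2 > ucl)[0]                 # points outside the limit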
[Figure 4.5: the SPE_y chart for the recovery process, with the 99% confidence limit marked.]
However, monitoring product quality via T^2 based on the first A PCs is not sufficient. This will only detect whether or not the variation in the quality variables in the plane of the first A PCs is greater than can be explained by common cause. If a totally new type of special event occurs which was not present in the reference data used to develop the in-control PCA model, new PCs will appear and the new observation y_new will move off the plane. Such new events can be detected by computing the squared prediction error (SPE_y) of the residual of a new observation,

SPE_y = \sum_{i=1}^{k} (y_{new,i} - \hat{y}_{new,i})^2   (4.8)
where \hat{y}_{new} is computed from the reference PCA model. SPE_y is also referred to as the Q statistic or the distance to the model. It represents the squared perpendicular distance of a new multivariate observation from the projection space. Figure 4.5 shows the SPE_y chart for the recovery process. The 99% confidence limit was determined from periods of good operation when the recovery is around 92%.

When the process is "in-control", the value of SPE_y should be small. Therefore a very effective set of multivariate statistical control charts is a T^2 chart on the A dominant orthogonal PCs (t_1, t_2, ..., t_A) plus an SPE_y chart.
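A sketch of the SPE_y computation of Equation 4.8 against a reference PCA model (NumPy; P holds the first A loading vectors, and all names are our own):

import numpy as np

def spe(Y_new, mean, P):
    # Project onto the A-dimensional PC plane, reconstruct, and take the
    # squared perpendicular distance of each observation (Equation 4.8).
    Yc = Y_new - mean
    Y_hat = (Yc @ P) @ P.T
    return np.sum((Yc - Y_hat) ** 2, axis=1)

# Reference model built from in-control data Y (random stand-in here).
Y = np.random.randn(300, 10)
mean = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
P = Vt[:3].T                                   # first A = 3 loadings
q = spe(Y, mean, P)                            # in-control SPE values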
combination of the x variables that maximises the covariance between it and the Y space. The first PLS loading vector w_1 is the first eigenvector of the sample covariance matrix X^T Y Y^T X. Once the scores t_1 = X w_1 for the first component have been computed, the columns of X are regressed on t_1 to give the regression vector p_1 = X^T t_1 / (t_1^T t_1), and the X matrix is deflated to give the residuals X_2 = X - t_1 p_1^T. The second latent variable is then computed as t_2 = w_2^T x, where w_2 is the first eigenvector of X_2^T Y Y^T X_2, and so on. As in PCA, the new latent vectors or scores (t_1, t_2, ...) and the loading vectors (w_1, w_2, ...) are orthogonal.
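The deflation procedure just described can be sketched in a few lines (NumPy; the dominant eigenvector is taken from a symmetric eigendecomposition, and all names are our own):

import numpy as np

def pls_components(X, Y, n_comp=2):
    Xk = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    T, W, P = [], [], []
    for _ in range(n_comp):
        # w: first eigenvector of Xk' Y Y' Xk.
        M = Xk.T @ Yc @ Yc.T @ Xk
        w = np.linalg.eigh(M)[1][:, -1]
        t = Xk @ w                        # scores t = X w
        p = Xk.T @ t / (t @ t)            # regression vector p = X't / (t't)
        Xk = Xk - np.outer(t, p)          # deflate: X_next = X - t p'
        T.append(t); W.append(w); P.append(p)
    return np.column_stack(T), np.column_stack(W), np.column_stack(P)

X = np.random.randn(55, 14)
Y = np.random.randn(55, 2)
T, W, P = pls_components(X, Y)            # scores, weights and loadings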
Although the T^2 and SPE_y charts are powerful tools for detecting deviations from normal operation, they do not indicate the reasons for such deviations. This can be achieved by plotting variable contribution plots. There are two alternative ways of doing this, which will be illustrated by reference to a case study described by MacGregor et al. [113], a low-density polyethylene tubular and autoclave reactor. A database of dimension 55 x 14 (number of observations x number of process variables) was analysed using PCA and PLS. The PC two dimensional plane of the PLS analysis as well as the SPE_y plot are shown in Figures 4.6 and 4.7. From Figure 4.6 it can be seen that from data point 53 the operation deviates from normal operation. The deviation is detected at point 54 on the SPE_y plot. This difference is not significant, and our focus here is on how to find out which variable is the main factor contributing to the deviation.
Figure 4.6 PLS t_1-t_2 plane for the low-density polyethylene reactor, indicating deviation from data point 54.
Figure 4.7 PLS SPE_y chart for the low-density polyethylene reactor, indicating deviation from data point 53.
Figure 4.8 PLS prediction errors in the individual process variables contributing to SPE_y at time point 54.
One way is to plot the SPE prediction error for the deviation point, say point 54, against the process variables, as shown in Figure 4.8. To diagnose the event, one can examine the contributions of the individual variables to this larger than normal value of SPE_y at point 54, that is,

SPE_{x,54} = \sum_{j=1}^{k} (x_{54,j} - \hat{x}_{54,j})^2

Here the predictions \hat{x}_{54,j} are made from the PLS model developed for the "in-control" operating data as

\hat{x}_{54,j} = \sum_{a=1}^{A} t_{a,54} p_{a,j}

where the new latent variable projections for the 54th observation are given by

t_{a,54} = \sum_{j=1}^{k} w_{a,j} x_{54,j}

From Figure 4.8 it is clear that the major contributing variable is z2.
Figure 4.9 PLS variable contributions to the change in t_1 from point 51 to 54.
An alternative way of diagnosing the event is to note that, in the latent variable plane of Figure 4.6, the deviation of point 54 is mainly due to a large decrease in t_1. Therefore we can analyse the importance of each process variable to this latent variable by examining the loading vector w_1. The contribution of each variable x_j to this large movement in t_1 can be computed as {(w_{1,j} \Delta x_j); j = 1, 2, ..., k}, where \Delta x_j = (x_{54,j} - x_{51,j}). These contributions are shown in Figure 4.9. It confirms that the main contributing variable is z2.
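Both contribution calculations are one-liners once the in-control PLS model is available; a sketch with stand-in data (w1, the predictions and the observation indices are assumptions for illustration):

import numpy as np

def spe_contributions(x_obs, x_hat):
    # Per-variable squared prediction errors at the deviating point (cf. SPE_x,54).
    return (x_obs - x_hat) ** 2

def t1_contributions(w1, x_new, x_ref):
    # Contribution of each variable to the movement in t1: w_1j * delta x_j.
    return w1 * (x_new - x_ref)

k = 14                                     # number of process variables
w1 = np.random.randn(k)                    # first PLS loading vector (stand-in)
x51, x54 = np.random.randn(k), np.random.randn(k)
x54_hat = np.random.randn(k)               # PLS prediction of observation 54
spe_c = spe_contributions(x54, x54_hat)
t1_c = t1_contributions(w1, x54, x51)
main_variable = np.argmax(np.abs(t1_c))    # index of the main contributor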
If the number of variables is not large, generally the first two or three PCs are sufficient to capture most of the variance in the data [105, 114]. However, if the number of variables is large, it is necessary to consider more PCs. An increased number of PCs makes the interpretation of results more difficult. First of all, the two or three dimensional PC plane, such as Figure 4.2, cannot be used. Secondly, the variable contribution analysis becomes difficult. A variation of multiway PCA and PLS [115] was proposed by MacGregor et al. [113]. The idea is to divide the process variables (X) into a number of blocks and then perform PCA or PLS analysis for each block as well as for the entire process. The blocks can be arranged based on functional and structural analysis of the process to be analysed, and the interactions between blocks should be minimised. MacGregor et al. [113] discussed the procedures by reference to a low-density polyethylene process.
The case studies discussed above are two-way arrays: the variables and the observations. In some cases the data take the form of three-way arrays. Data about batch process operations is such a case [116]. Batch production consists of batch runs, one after another. For a typical batch run, j = 1, 2, ..., J variables are measured at k = 1, 2, ..., K time intervals throughout the batch. Suppose the data consist of i = 1, 2, ..., I such batch runs; then the database will be a three-way array X(I x J x K), as illustrated in Figure 4.10, where the different batch runs are organised along the vertical side, the measurement variables along the horizontal side, and their time evolution occupies the third dimension. Each horizontal slice through this array is a (J x K) data matrix representing the time histories or trajectories of all the variables for a single batch, the ith batch. Each vertical slice is an (I x J) matrix representing the values of all the variables for all the batches at a common time interval (k).

Figure 4.10 Arrangement and decomposition of a three-way array by multi-way PCA.
The multi-way PCA (MPCA) approach proposed by Wold et al. [115] is statistically and algorithmically consistent with PCA and has the same goals and benefits. The relation between MPCA and PCA is that MPCA is equivalent to performing PCA on a large two-dimensional matrix formed by unfolding the three-way array X in one of the three possible ways. For the analysis of batch process operational data, Nomikos and MacGregor [116] unfolded the data in such a way as to put each of its vertical slices (I x J) side by side to the right, starting with the one corresponding to the first time interval. The resulting two-dimensional matrix has dimensions (I x JK). This unfolding allows us to analyse the variability among the batches in X by summarising the information in the data with respect both to the variables and to their time variation.
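The unfolding itself is a single array operation; a sketch with NumPy (the array sizes are illustrative):

import numpy as np

I, J, K = 30, 5, 100                  # batches, variables, time intervals (assumed)
X = np.random.randn(I, J, K)          # stand-in for a three-way batch data set

# Place the K vertical (I x J) slices side by side: the result is (I x JK).
X_unfolded = X.transpose(0, 2, 1).reshape(I, J * K)

# MPCA is then ordinary PCA on the unfolded, mean-centred matrix.
Xc = X_unfolded - X_unfolded.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
batch_scores = Xc @ Vt[:2].T          # one point per batch in the PC plane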
Though PCA has been widely studied and applied successfully in many application areas, including chemistry, biology, meteorology and process engineering, the approach has been criticised for being a linear operation. Some researchers have pointed out that PCA can be inadequate for solving some nonlinear problems [117, 118]. Xu et al. [118] have given an example showing that when PCA is applied to a nonlinear problem, the minor PCs do not always consist of noise or unimportant variance. If they are discarded, important information is lost; if they are kept, the large number of PCs makes the interpretation of results difficult. As a result, nonlinear PCA has attracted the interest of some researchers, as summarised by Dong and McAvoy [119].
One way of introducing nonlinearity into PCA is the "generalised PCA" of Gnanadesikan [120]. The basic idea of this approach is to extend an m-D variable X to include nonlinear functions of its elements. For example, for two dimensions X = (x_1, x_2), three variables can be added: x_3 = x_1^2, x_4 = x_2^2 and x_5 = x_1 x_2. Then one can do the same calculations as linear PCA on this 5-D data. Another way of introducing nonlinearity into PCA is "nonlinear factor analysis" [121]. In this method, l-D polynomials are used to approximate m-D data with l < m. A linear least squares method is used to find the coefficients of the polynomials. Dong and McAvoy [119] commented that for high-dimensional data the method becomes tedious.
Kramer [122] presented a nonlinear principal component analysis method based on autoassociative neural networks. The architecture of the network is shown in Figure 4.11. It has five layers, i.e., the input, mapping, bottleneck, de-mapping and output layers, and its input is used as the desired output. The network is therefore supervised in nature and can be trained with a backpropagation learning algorithm. As no assumptions are needed about the nature of the nonlinearity between the variables, the network can be used in situations where common transformations (e.g., logarithm, square root) cannot be used. The nonlinearity is introduced into the network by sigmoidal transfer functions in the mapping and de-mapping layers.

The bottleneck layer performs the dimension reduction: because the number of neurons in this layer is smaller than that in the input and output layers, the network is forced to develop a compact representation of the input data. The goal of the network is to minimise the error function

E = \sum_{i} \sum_{j} (y_{ij} - \hat{y}_{ij})^2   (4.9)

where the target outputs equal the inputs.
Figure 4.11 An autoassociative nonlinear PCA network with five layers. Transfer functions f are nonlinear and transfer functions l are linear.
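A bare-bones sketch of such an autoassociative network, trained by plain gradient descent on the reconstruction error of Equation 4.9 (NumPy; the layer sizes, learning rate and data are illustrative assumptions, and biases are omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))            # data to be compressed (assumed)
sizes = [5, 8, 2, 8, 5]                      # input, mapping, bottleneck, de-mapping, output
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    a1 = sigmoid(X @ Ws[0])                  # mapping layer (nonlinear)
    a2 = a1 @ Ws[1]                          # bottleneck (linear): the nonlinear scores
    a3 = sigmoid(a2 @ Ws[2])                 # de-mapping layer (nonlinear)
    return a1, a2, a3, a3 @ Ws[3]            # linear output reconstructs the input

eta = 0.01
for _ in range(500):                         # gradient descent on E = sum (x - x_hat)^2
    a1, a2, a3, out = forward(X)
    d_out = 2 * (out - X)
    d3 = (d_out @ Ws[3].T) * a3 * (1 - a3)
    d2 = d3 @ Ws[2].T
    d1 = (d2 @ Ws[1].T) * a1 * (1 - a1)
    for W, inp, d in zip(Ws, (X, a1, a2, a3), (d1, d2, d3, d_out)):
        W -= eta * (inp.T @ d) / X.shape[0]

scores = forward(X)[1]                       # two-dimensional nonlinear "PCs"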
This section describes an industrial case study of analysing historical data using
PCA for operational strategy development and product design [135].
The fluid catalytic cracking (FCC) process of the refinery converts a mixture of heavy oils into more valuable products. The relevant section of the process is shown in Figure 4.12, where the oil-gas mixture leaving the reactor goes into the main fractionator to be separated into various products. The individual side draw products are further processed by downstream units before being sent to blending units.
One of the products is light diesel, whose quality is typically characterised by the temperature of condensation. Traditionally the temperature of condensation has been monitored by off-line laboratory analysis, which causes time delays because the interval between two samples is between four and six hours. As a result, a software sensor has been developed, using 303 data patterns spanning nearly a year, for predicting the condensation point from fourteen process variables which are measured on-line (a detailed discussion of the software sensor development is given in Chapter 9). The fourteen variables are listed in Table 4.1.
An interesting problem with the process is that it is required to produce three product grades according to season and market demand, namely -10#, 0# and 5#, defined by ranges of the condensation temperature. Because there is more than one process variable, the operators use their experience, through trial and error, to adjust the process variables so as to move the operation from producing one product grade to another. There is a clear need to minimise the changeover time, because off-specification product may be produced during the transition.
Table 4.1 The fourteen variables used as input to the FFNN model.

TI-11 - the temperature on tray 22, where the light diesel is withdrawn
TI-12 - the temperature on tray 20, where the light diesel is withdrawn
TI-33 - the temperature on tray 19
TI-42 - the temperature on tray 16, i.e., the initial temperature of the pumparound
TI-20 - the return temperature of the pumparound
F215 - the flowrate of the pumparound
TI-09 - column top temperature
TI-00 - reaction temperature
F205 - fresh feed flowrate to the reactor
F204 - flowrate of the recycle oil
F101 - steam flowrate
FR-1 - steam flowrate
FIQ22 - flowrate of the over-heated steam
F207 - flowrate of the rich-absorbent oil
The difficulty of the problem comes from the fact that there are fourteen process variables to consider. Application of PCA to the database of size 303 x 14 (number of data patterns x number of process variables) found that the first seven PCs account for about 93% of the variance (Table 4.2). The PC1-PC2 two dimensional plot is shown in Figure 4.13. It was found that the 303 data patterns are grouped into four clusters. Three clusters correspond to the three products -10#, 5# and 0#, and the cluster at the bottom-right corner is found to be a cluster with a high probability of off-specification product. Before we analyse how this can be used to develop operational strategies, it is necessary to validate the clustering result, since
the first two PCs only account for 53% of the variance. For this purpose, the first three PCs are plotted in a three dimensional diagram (Figure 4.14). It is found that the cluster at the centre of Figure 4.13 is further divided into two clusters. Using the first seven PCs, ART2 gives a similar result, as indicated by the dotted curve in Figure 4.13. This demonstrates that for problems of large dimension, clusters may overlap in a two dimensional PC display. Nevertheless, for the current problem, it is found that the two clusters at the centre of Figure 4.14 both correspond to product 0#. As a result, in the following discussion we still use the result of Figure 4.13.
Therefore the strategy for operation and product design should be to operate the process in the bottom-left region if the desired product is -10#, in the region at the top if the desired product is 5#, or in the region at the middle if the desired product is 0#, and to try to avoid the region at the bottom-right corner. Another point is that to move from producing -10# to 0#, adjusting PC1 is more important than changing PC2, while to switch from producing 0# to 5#, PC2 is more important than PC1. Both PC1 and PC2 are important in avoiding the region at the bottom-right corner, which produces off-specification product.
However, PC1 and PC2 are latent variables. To link PC1 and PC2 to the original variables, contribution plots are used. The contribution plot of PC1 is shown in Figure 4.15, from which it is found that the most important variables are TI-12 (the temperature on tray 20 where the product is withdrawn) and TI-42 (the temperature on tray 16 close to the flashing zone). Some other variables, such as FR-1, are not important. The above discovery is confirmed by looking at the change of TI-12 over the 303 data patterns (Figure 4.16). It clearly shows that TI-12 can distinguish product -10# from 0# and 5#, but cannot distinguish 0# from 5#.

The contribution plot of PC2 is shown in Figure 4.17, which indicates that FR-1 is the most important variable. The changing profile of FR-1 over the 303 data patterns is shown in Figure 4.18. It clearly shows that FR-1 can distinguish product 5# from 0# and -10#, but not 0# from -10#. The figure also confirms that FR-1 is not important to PC1.
Therefore the operational strategy for product design should be that if we want to change from producing -10# to 5#, we should increase TI-12 and TI-42 and then increase FR-1. In order to avoid off-specification product, we should carefully monitor TI-12, TI-42 and FR-1 to avoid the region at the bottom-right corner. Of course it is important to be aware that fine tuning of all the variables is still necessary, but this guidance can help operators to move the process quickly from producing one product to another. The PC1-PC2 two dimensional plane can also be used by operators as a monitoring screen, as demonstrated by Figure 4.19.
Figure 4.13 The PC1 and PC2 two dimensional plot. Product 5# corresponds to data patterns 192-210 and 272-277, product 0# to data patterns 1-116 and 213-242, and product -10# to data patterns 117-124, 211, 212, 243, 244 and 278-288.
Figure 4.14 The PC1, PC2 and PC3 three dimensional plot.
[Figure 4.15: the contribution plot of PC1 against the original variables.]
Figure 4.16 The changing profile of TI-12 (°C) over the 303 data patterns.
[Figure 4.17: the contribution plot of PC2 against the original variables.]
Figure 4.18 The changing profile of FR-1 over the 303 data patterns.
[Figure 4.19: the PC1-PC2 plane used as an operator monitoring screen; a new sample (e.g., TI-12, TI-42, ...) is projected through the transfer matrix onto the plane, where the operating point can be compared with historical cases.]
PCA and PLS have proved to be powerful tools for operational data analysis and statistical process control. However, they still have limitations. PCA and PLS based data analysis for statistical process control rests on the assumption that the first few PCs can capture most of the variation in a multivariate database. This assumption may be violated in some cases, e.g., when the dimension of the original variables is very large. Multiblock PCA and PLS can tackle this problem for some applications; however, dividing variables into blocks may not always be possible. Sammon [123] gave an example where data generated to contain five groups in four dimensions are projected into the space of two principal eigenvectors. Visual examination of this projection shows only four groups, since two of the clusters overlap completely in the two dimensional space. In such cases some alternative approaches may have to be used, such as the unsupervised machine learning approaches to be introduced in Chapter 6, including neural network and Bayesian automatic classification methods. It has been reported that one of the Bayesian automatic classification systems, AutoClass, has successfully clustered data with 1204 attributes [39]. However, PCA and PLS may still be useful approaches for pre-processing the data to eliminate the linear dependencies in the data. PCA is also a useful approach for pre-processing the data for dimension reduction for neural networks [124].
The variable contribution plots such as Figure 4.9 may not be applicable in cases where the contributions of the original variables to the PCs are not evenly distributed. Using other approaches to compensate for this limitation of PCA can be a good alternative. For example, neural network models can be developed and used as sensitivity study tools to identify the contributions of variables.
In the above applications, PCA and PLS are used mainly for statistical process control for long-term performance monitoring, and the data dealt with are averaged over hours or days. PCA and PLS are also potentially useful for on-line real time data analysis. In Chapter 3 we introduced the application of PCA for feature extraction and concept formation from dynamic trend signals. Bakshi [125] combined wavelet multiscale analysis with PCA for developing on-line monitoring systems. Tabe et al. [126] combined Fourier and wavelet analysis with PCA and developed an approach called dynamic PCA.
PCA can also be categorised as an unsupervised learning approach. However, its learning is not recursive or incremental. For on-line real time use, it would be useful for PCA to be able to learn incrementally, i.e., to learn from a single example when it is presented. There has been a report on such on-line learning for principal component analysis [127].
Only a few case studies were mentioned above, mainly for the purpose of illustrating the methods. There are many successful applications of PCA and PLS approaches to analysing databases of continuous and batch operations. These include analysis of data from emulsion batch processes [128], product design [129], inferential process model development [130], reactor analysis [131], fault diagnosis [132], sensor fault identification [133, 134], normal operational region identification [136] and monitoring [137]. These applications have not only explored the potential applications, but also provide valuable experience in overcoming some of the limitations of PCA and PLS in solving practical problems.
CHAPTER 5
SUPERVISED LEARNING FOR OPERATIONAL SUPPORT
Studies on machine learning have mainly been concerned with automatic learning from examples to develop knowledge describing those examples. This is clearly different from the kind of learning involved in, say, learning to ride a bicycle. In supervised learning, each example used is typically described by a number of attributes. The attributes are divided into inputs and outputs, and the learning process develops a model mapping the multiple inputs to the outputs. The model is gradually refined during learning to minimise the errors between the predictions and the real values of the outputs, hence the name supervised learning. The most widely studied supervised learning approach is the feedforward neural network (FFNN). The FFNN model and its application to process operational support are introduced in this chapter. The discussion of FFNNs will focus on many of the practical issues that have to be considered in applying them. While the focus will be on the FFNN, other supervised models will also be described and compared with it. These include the fuzzy FFNN, the fuzzy set covering approach and the fuzzy signed digraph.
There are already a large number of textbooks on FFNNs, so here they are introduced less technically. Simply speaking, an FFNN is an algorithm or piece of computer software that can learn to identify the complex nonlinear relationship between multiple inputs and outputs. The learning process has a number of characteristics.
Firstly, an FFNN does not need a fundamental model of the problem domain and is easy to set up and train. This is different from conventional statistical methods, which usually require the user to specify the functions over which the data is to be regressed. In order to specify the function, the user has to know the forms of the equations governing the correlations between the data. If these functions are incorrectly specified, the data will not be satisfactorily regressed. Furthermore, considerable mathematical and numerical experience is required to obtain convergence if these equations are highly nonlinear. An FFNN requires neither the forms of the correlations to be specified nor any mathematical and numerical expertise. Secondly, the data examples used for training are allowed to be imprecise or noisy, and in some cases even incomplete. Thirdly, it mimics the human learning process: learning from examples through repeatedly updating the performance.
An FFNN consists of a number of processing elements called neurons. These neurons are divided into layers. Figure 5.1 shows a three-layered FFNN architecture comprising an input, a hidden and an output layer. Typically the input layer nodes correspond to input variables and the output layer nodes to output variables. Hidden neurons do not have physical meanings. Neurons in two adjacent layers are fully connected by branches.

[Figure 5.1: a three-layered FFNN with an output layer, a hidden layer and an input layer.]
Each neuron in the hidden and output layers is described by a transfer function (or activation function). Usually a sigmoidal function is used,

f(z) = \frac{1}{1 + e^{-az}}   (5.1)

f(z) transforms an input z to the neuron into the range [0.0, 1.0], as shown in Figure 5.2(a). The parameter a in Equation 5.1 is used to change the shape of the sigmoidal function. Some other activation functions can also be used, as shown in Figures 5.2(b) and (c). However, for a specific FFNN structure, the neurons in the hidden and output layers are usually fixed to the same transfer function.
[Figure 5.2: activation functions: (a) the sigmoidal function; (b) and (c) alternative transfer functions.]
Given some arbitrary values for all the connection weights, for a specific data pattern the FFNN makes use of the weights and input values to predict the outputs. The training is intended to gradually update the connection weights to minimise the mean square error E,

E = \sum_{m=1}^{M} \sum_{i=1}^{N} (t_i^{(m)} - y_i^{(m)})^2   (5.2)

where t_i^{(m)} and y_i^{(m)} are the target and predicted values of the ith output for the mth data pattern.
The process involves a forward path calculation to predict the outputs and a backward path calculation to update the weights. For a neuron in the input layer, the output is equal to the input, so there is in fact no activation function for an input neuron. A neuron in the hidden or output layers receives the values of the outputs of the nodes in the layer in front of it and takes their weighted sum as its input. The weighted sum is then transformed by the activation function to give an output. The outputs of the output layer neurons are compared with the target values using Equation 5.2 to calculate an error. The error is then propagated backwards to update the weights.
Given the mth data pattern, the weight updating in a supervised learning algorithm follows the formulation

w_{ji}^{(m)} = w_{ji}^{(m-1)} + \Delta w_{ji}^{(m)}   (5.3)

where
w_{ji}^{(m)} - the weight of the connection between the jth neuron of the upper layer and the ith neuron of the lower layer, in the mth learning iteration;
w_{ji}^{(m-1)} - the same weight in the (m-1)th learning iteration;
\Delta w_{ji}^{(m)} - the weight change.
In the backpropagation learning approach, the weight change is calculated by

\Delta w_{ji}^{(m)} = \eta \delta_j^{(m)} o_i^{(m)} + \alpha \Delta w_{ji}^{(m-1)}   (5.4)

where
\eta - the learning rate, providing the step size during gradient descent. Generally, to assure rapid convergence, the largest step size which does not lead to oscillation is used;
\alpha - the coefficient of the momentum term, 0 < \alpha < 1;
o_i^{(m)} - the output value of the ith neuron of the previous layer, in the mth iteration;
\delta_j^{(m)} - the error signal of the jth neuron in the mth learning iteration.

For a neuron in a hidden layer, the error signal is

\delta_j^{(m)} = f'\left( \sum_i w_{ji}^{(m)} o_i^{(m)} + w_{j0}^{(m)} \right) \sum_k \delta_k^{(m)} w_{kj}^{(m)}   (5.6)

where w_{j0} is the bias weight and the sum over k runs over the neurons of the layer above.
From the above discussion, it is clear that FFNN training requires the following initial decisions: the network topology, i.e., the number of hidden layers and hidden neurons; the learning rate; the momentum factor; the error tolerance or number of iterations; and the initial values of the weights. The learning rate \eta and momentum factor \alpha are not very difficult to set. We can start with some reasonable values, e.g., \eta = 0.35 and \alpha = 0.7, and then find the most appropriate values during training. The error tolerance clearly depends on the problem to be solved. Understanding how these parameters affect the learning performance, which is discussed next, is useful in setting the right values. Initial values of the weights can be generated by a random number generator. Therefore our discussion here will focus on the network topology and then on some other important issues in learning.
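The update of Equations 5.3 and 5.4 for one layer can be sketched as follows (NumPy; the starting values eta = 0.35 and alpha = 0.7 are those suggested above, the rest is illustrative):

import numpy as np

eta, alpha = 0.35, 0.7                      # learning rate and momentum factor

def update_weights(W, dW_prev, delta, o):
    # Equation 5.4: dW = eta * delta_j * o_i + alpha * previous change.
    dW = eta * np.outer(o, delta) + alpha * dW_prev
    return W + dW, dW                       # Equation 5.3: new weight = old + change

# Example: a layer with 3 lower-layer neurons and 2 upper-layer neurons.
W = np.random.randn(3, 2)
dW_prev = np.zeros_like(W)
o = np.array([0.2, 0.8, 0.5])               # outputs of the lower layer
delta = np.array([0.1, -0.05])              # error signals of the upper layer
W, dW_prev = update_weights(W, dW_prev, delta, o)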
It is generally accepted that only one hidden layer is necessary in a network using a sigmoid activation function, and that no more than two are necessary if a step activation function is used.

There are no generally applicable methods for deciding how many hidden neurons are required in a three-layered network. The number of hidden neurons depends on the nonlinearity of the problem and the error tolerance. Empirically, the number of neurons in the hidden layer is of the same order as the number of neurons in the input and output layers. The number of hidden neurons must be large enough to form a decision region as complex as required by the given problem; too few hidden neurons hinder the learning process and may not be able to achieve the required accuracy. However, the number of hidden neurons must not be so large that the many weights required cannot be reliably estimated from the available training data patterns. An unnecessarily large hidden layer can lead to poor generality. A practical method is to start with a small number of neurons and gradually increase the number.
Chemical process models are multidimensional, with peaks and valleys [145], which can trap the gradient descent process before it reaches the global minimum. There are several methods of combating the problem of local minima [146, 147]. The momentum factor \alpha, which tends to keep the weight changes moving in the same direction, allows the algorithm to slip over small minima. Another approach is to start the learning again with a different set of initial weights if it is found that the network keeps oscillating around a set of weights due to lack of improvement in the error. Sometimes adjusting the shape of the activation function (e.g., through adjusting the constant a in Equation 5.1) can have an effect on the network's susceptibility to local minima. Some new optimisation approaches, such as simulated annealing [147], have been applied to multilayer neural networks and prove able to address the local minima problem significantly. In contrast to the comments made by Crowe and Vassiliadis [145] and Chitra [147], Knight [146] thought that an FFNN rarely slips into local minima. Nevertheless, the problem should be treated with care.
Overfitting occurs when the network learns the classification of the specific training points but fails to capture the relative probability densities of the classes [148]. This can be caused by two situations: (1) an oversized network, e.g., due to the inclusion of irrelevant inputs in the network structure or too many hidden layers or neurons; and (2) an insufficient number of training data patterns.
5.1.3.4 Generality

Overfitting in training a neural network deteriorates the generality of the network. Sensitivity studies can be used to develop a simpler network structure. Since the inputs are possibly correlated, sensitivity studies should be carried out with care.
An alternative approach is to analyse the weights of a trained FFNN to find out the relative importance of the inputs. Several approaches have been proposed, which are discussed in Chapter 9. Due to the complex internal structure of an FFNN, these approaches should be used with care.

The difficulty in analysing the relative importance of the inputs to an output is that there are hidden layers and neurons, and the neurons in adjacent layers are fully connected. An interesting idea is to develop a model without a hidden layer and hidden neurons, traditionally called a single layer perceptron (PCT). The order of magnitude of the input-output linkage weights is then clear. A PCT model may not be accurate enough, but it can provide some useful information for analysis. The analysis result can be used to further develop an FFNN model with hidden neurons. In fact, as will be demonstrated, for a problem whose nonlinearity is not high a PCT can give equally good performance; in this case a PCT clearly should be used rather than an FFNN.
For on-line applications, feature extraction also means dimension reduction of dynamic transients, using the minimum amount of data to capture the useful information, as well as noise removal. This has been discussed in detail in Chapter 3.
The advantage of an FFNN not requiring fundamental domain knowledge also brings the drawback of it being a black box with poor extrapolation capability. Therefore, FFNN application always involves training and test procedures: using some data for training and the rest for testing. However, due to the multivariate nature of the data, when a new data pattern is given the confidence of the prediction is still unknown, because the data pattern represents a point in a multidimensional space. The mean squared error that quantifies the accuracy of training does not give this kind of information.

A detailed procedure is introduced in Chapter 9 to address this issue in the course of developing a software sensor. It uses an automatic clustering approach to group the multivariate data into clusters, and training and test data are then selected from each cluster. When new data patterns become available, the clustering approach is also used to test whether they lie within the region of the previously used training data, providing clues about the prediction accuracy and an indication of whether the FFNN model needs to be retrained.
Techniques used in statistics for error or residual analysis can readily be used. A simple plot of the error residuals can provide useful information. In developing a software analyser for predicting a toxicity measure, Microtox, it was found that nine of the 180 data cases used for training and testing had abnormally large errors [124]. It was suspected that these nine data patterns might contain noise components, and the plot of the error distribution supports this: as shown in Figure 5.3, the error almost follows a normal distribution, but there are irregular structures at the two ends which correspond to the nine data patterns.
[Figure 5.3: probability distribution of the relative error, approximately normal with irregular structures at the two ends.]
... X_N) and multiple outputs (Y_1, Y_2, ..., Y_M), all the sensitivities form a Jacobian matrix,

J = \left[ \frac{\partial Y_i}{\partial X_j} \right], \quad i = 1, ..., M; \; j = 1, ..., N   (5.7)
(3) the density distribution of the data sample in the multidimensional space of the training data.

Shao et al. [149] and Zhang et al. [150] developed a complicated procedure for calculating the confidence bounds when using an FFNN model to predict a new data pattern. Figure 5.4 shows such an example [180].
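For any trained model, the Jacobian of Equation 5.7 can be approximated by finite differences; a sketch ('model' stands for any function from an input vector to an output vector, and the step size is an assumption):

import numpy as np

def jacobian(model, x, h=1e-5):
    # Numerical sensitivities dY_i/dX_j around the operating point x (Equation 5.7).
    y0 = np.asarray(model(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (np.asarray(model(xp)) - y0) / h
    return J

# Toy two-output "model" used purely for illustration.
model = lambda x: np.array([x[0] ** 2 + x[1], np.sin(x[2])])
J = jacobian(model, np.array([1.0, 2.0, 0.5]))   # a 2 x 3 Jacobian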
[Figure 5.4: model predictions, observations and the confidence bounds of the confidence interval, plotted against the independent variable.]
There have been a large number of publications on applying FFNNs to process fault diagnosis, and many have used continuous stirred tank reactors as case studies. Here an earlier case study on a CSTR reactor [153] is described in order to analyse the kinds of issues that need to be considered. In the CSTR reactor there is a first-order exothermic reaction A → B. Heat generated by the reaction is taken away by cooling water. A three-layered neural network was developed to use symptom variables to predict faults, as shown in Figure 5.5.
Figure 5.5 A three-layered neural network for fault diagnosis of a CSTR reactor.
[Figure: a temperature transient of the CSTR reactor plotted against time.]
(2) Noise has to be dealt with carefully. Under a large noise to signal ratio, the noise may bury the real trend of a dynamic transient. Methods for noise removal have also been discussed in Chapter 3.

(3) This CSTR example has six inputs and six outputs. For a larger process there may be hundreds of variables to be monitored. It may not be suitable to use a single neural network; several neural networks or stacked networks may have to be used, as described by Zhang et al. [154].

(4) Multiple faults also pose a challenge. This has been addressed by many researchers. However, some researchers have found that networks trained with single fault data can be used to detect multiple faults. It is rare to have more than two unrelated faults occurring at the same time, but it is not unusual for one fault to cause other faults to occur.

(5) Most discussions on fault diagnosis using multilayered neural networks are concerned with failures of equipment or sensors, or the sudden change of a variable. There is another type of fault or abnormal operation which is concerned with the gradual degradation of product quality or other performance measures. Though there have been some discussions on using expert systems to deal with this kind of problem, little work has been done on how neural networks can be used to deal with it.
(6) A necessary step before fault diagnosis is fault recognition or identification. This clearly depends on effective assimilation of all the measurements. A neural network may be a useful tool for diagnosis, i.e., mapping symptoms to faults, but it is not very effective in fault recognition. Some researchers have used neural networks trained with normal operational data to identify faults: if the output is not in the normal region then a fault is expected.

(7) The biggest problem with FFNNs is the availability of training data. It is unthinkable that a real plant would initiate faults just to provide training data. Though dynamic training simulators have been used to generate training data [151, 155], methods for fault diagnosis should not rely on the assumption that a highly flexible customised simulator of high fidelity is available.

The last two points make FFNNs, which adopt a supervised learning mechanism, less attractive for fault identification and diagnosis than the multivariate analysis approaches introduced in Chapter 4 and the unsupervised machine learning methods to be introduced in Chapters 6 and 7.
Figure 5.7 The fuzzy value space of a process variable: three fuzzy values (high, normal and low) defined over the normalised value of the variable, with membership degrees between 0.0 and 1.0.
Conventional neural networks have real number inputs and weights. There are three main types of fuzzy neural networks (FNNs) [156]: FNNs with fuzzy input signals but real number weights, FNNs with real number input signals but fuzzy weights, and FNNs with both fuzzy input signals and fuzzy weights. It is the first type, i.e., fuzzy input signals and real number weights, that has been studied for process fault diagnosis [151, 155, 154]. In the work of [154] the fuzzy membership functions take part in the learning. In our work, the fuzzy membership functions do not participate in the learning process but are only used for pre-processing the data. The procedure can be illustrated using Figure 5.8, which shows a fuzzy neural network for two input variables, X1 and X2, and one output y.
Figure 5.8 A fuzzy neural network in which each input variable is represented by three fuzzy nodes; the dashed box encloses a normal feedforward neural network.
Inside the dashed box of Figure 5.8 is a normal feedforward neural network. Outside the dashed box represents fuzzy processing of the data before it is used for training. However, such a FNN increases the size of the network dramatically. Each input variable needs to be represented with three nodes, and if each input variable is to be expressed by five fuzzy values, i.e., high, medium high, normal, medium low and low, it will require five nodes. The FNN used in [154] has a similar problem. An alternative is to use the structure of Figure 5.9. In this structure each input variable is always split into a pair of nodes: one describes the fuzzy value, being either L, M or H, and the other represents the membership value; a sketch of this two-node encoding follows. Case studies of applying FNNs for fault diagnosis will be given in Section 5.8.
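The following is a minimal sketch of the two-node encoding of Figure 5.9, assuming simple triangular membership functions; the membership shapes and breakpoints are illustrative assumptions.

```python
# Encode each variable as (dominant fuzzy value, membership degree), the
# pair of nodes of Figure 5.9. Triangular memberships are assumed.

def memberships(v):
    """Triangular memberships for low/medium/high on a [-1, 1] variable."""
    low = max(0.0, min(1.0, -v))    # peaks at v = -1
    med = max(0.0, 1.0 - abs(v))    # peaks at v = 0
    high = max(0.0, min(1.0, v))    # peaks at v = +1
    return {"L": low, "M": med, "H": high}

def encode(v):
    """Return the (fuzzy value, membership) pair for one input variable."""
    m = memberships(v)
    label = max(m, key=m.get)
    return label, m[label]

for v in (-0.9, -0.2, 0.7):
    print(v, "->", encode(v))
```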
Figure 5.9 An alternative structure of fuzzy neural network.
Model 1  M(∧, ∨):  b_j = ∨_{i=1}^{N} (a_i ∧ r_{i,j}),   j = 1, 2, ..., T

Model 2  M(·, +):  b_j = Σ_{i=1}^{N} a_i r_{i,j},   j = 1, 2, ..., T

Model 3  M(·, ∨):  b_j = ∨_{i=1}^{N} a_i r_{i,j},   j = 1, 2, ..., T
In recent years a notable development in fault diagnosis has been the signed directed graph (SDG). Since it was first proposed for fault diagnosis by Iri et al. [159] it has attracted much attention [160 - 171, 194]. It is attractive because it provides an elegant and straightforward tool for qualitatively analysing the cause-effect relationships between variables. Neural networks are clearly not capable in this respect. However, there are common limitations in all the SDG models. Firstly, the expressive capability is very limited, since they are crisp graphs: a node or a branch can only take three values, i.e., -, 0 and +. As a result they give ambiguous solutions in complicated fault diagnosis. The application of fuzzy concepts by Han et al. [172] and Shih and Lee [173] only enables the input nodes to convert numerical data to qualitative expressions; the graph as a whole is still a crisp one. Secondly, the reasoning methods in a SDG model are often dependent on over-simplified assumptions, so large errors can be expected when reasoning in a complex structure. Thirdly, the development of the SDG has been application driven. For example, the algorithms have been developed specifically for fault diagnosis and can't be applied directly to qualitative simulation.
Furthermore, the SDG models are not able to deal with uncertainty in data and reasoning simultaneously. Finally, they are not able to learn from data.
A fuzzy-SDG method [59, 60, 174] was developed which overcomes many of the limitations of SDG models; it was later further improved by Huang and Wang (1999).
A fuzzy graph is a natural generalisation of the crisp graph using fuzzy set concepts. A crisp graph is defined by the pair G = (X, E), where X is a finite set of nodes and E a non-fuzzy relation on X × X. A fuzzy graph [179] is a pair (X̃, Ẽ), where X̃ is a fuzzy set on X and Ẽ is a fuzzy relation on X × X such that μE(x, x') ≤ min(μX(x), μX(x')). Here μE is the membership function of the binary effect of two adjacent nodes x and x' over a branch, and μX the membership function of a node. In some situations it may be desirable to relax this inequality [178]. Algorithms for fuzzy graphs can be found in [178]. Obviously, if μE and μX only take the values -1, 0 or 1, then a fuzzy graph becomes crisp. A fuzzy-SDG is defined by nodes representing variables, branches representing the effects between two variables, and reasoning propagation algorithms associated with the graph.
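As an illustration of these definitions, the sketch below implements a minimal fuzzy-SDG data structure with nodes and directed, weighted branches. The node names, weights and the deliberately simplified propagation rule (scaling the membership by the branch weight) are assumptions for illustration only and do not reproduce the reasoning algorithms of [59].

```python
# A minimal fuzzy-SDG: nodes carry a fuzzy value and a membership degree;
# branches carry a direction and an effect strength (the weight e of
# Equation 5.15). The propagation rule here is a simplification.

class Node:
    def __init__(self, name):
        self.name = name
        self.value = None       # e.g. "high", "normal", "low"
        self.membership = 0.0   # degree in [0, 1]

class FuzzySDG:
    def __init__(self):
        self.nodes = {}
        self.branches = {}      # (source, target) -> effect strength

    def add_node(self, name):
        self.nodes[name] = Node(name)

    def add_branch(self, source, target, strength):
        self.branches[(source, target)] = strength

    def propagate(self, source):
        """Push the source node's state to its direct successors."""
        for (s, t), e in self.branches.items():
            if s == source:
                node = self.nodes[t]
                node.value = self.nodes[source].value
                node.membership = min(1.0, self.nodes[source].membership * abs(e))

# A serial connection F_in -> level -> P_out with hypothetical weights.
g = FuzzySDG()
for n in ("F_in", "level", "P_out"):
    g.add_node(n)
g.add_branch("F_in", "level", 0.8)
g.add_branch("level", "P_out", 0.6)
g.nodes["F_in"].value, g.nodes["F_in"].membership = "high", 1.0
g.propagate("F_in")
g.propagate("level")
print(g.nodes["P_out"].value, g.nodes["P_out"].membership)
```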
5.7.1 Nodes
Each node in the fuzzy-SDG is represented by a variable which can take a number of values from the fuzzy value space. An example of the value space of a node is shown in Figure 5.7, in which the process variable takes three fuzzy values: high, normal and low. Another example is shown in Figure 5.10, in which L is the liquid level in a tank changing from 0 to 2 meters, v is the normalised value of L in the range -1.0 to 1.0, and the fuzzy membership value changes from 0 to 1. Each fuzzy value in Figure 5.10, such as medium low or low, is a fuzzy set M defined by Equation 5.13.
M = {x, μ},   μ ∈ [0, 1]   (5.13)

M is therefore represented by its membership function, μ, such that the value of μ indicates the degree of membership of the element x in M. The membership function can have many shapes, such as triangular and trapezoidal. The fuzzy value medium low in Figure 5.10 has a half-declined trapezoidal form which can be represented by Equation 5.14.
m = 1,                      0.65 ≤ y < a1
m = (a2 − y)/(a2 − a1),     a1 ≤ y ≤ a2     (5.14)
m = 0,                      y > a2 or y < 0.65
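A minimal sketch of Equation 5.14 follows, assuming illustrative breakpoints a1 = 1.0 and a2 = 1.3 (the breakpoint values are not taken from the text).

```python
# Half-declined trapezoidal membership of Equation 5.14: full membership
# up to a1, linear decline between a1 and a2, zero outside.

def medium_low(y, lower=0.65, a1=1.0, a2=1.3):
    if lower <= y < a1:
        return 1.0
    if a1 <= y <= a2:
        return (a2 - y) / (a2 - a1)
    return 0.0   # y > a2 or y < lower

for y in (0.5, 0.8, 1.1, 1.4):
    print(y, medium_low(y))
```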
Figure 5.10 The fuzzy value space of the tank level L: five fuzzy values (low, medium low, normal, medium high, high) defined over L from 0 to 2.0 meters (breakpoints at approximately 0.0, 0.4, 0.7, 1.0, 1.3 and 1.6 m), with the normalised value v in the range -1.0 to 1.0 and fuzzy membership values between 0 and 1.
5.7.2 Branches
Attached to each branch connecting two nodes is an arrow representing the effect direction and an effect strength. The effect strength is measured by a weight. Suppose that x_{j+1} and x_j are two nodes linked by a branch directed from x_{j+1} to x_j; then the effect strength of x_{j+1} on x_j is determined by Equation 5.15,

e(x_{j+1} → x_j) = S_{j,j+1} · R_{x_{j+1}} / R_{x_j}   (5.15)

in which R_{x_{j+1}} and R_{x_j} are the value ranges (i.e., maximum − minimum) of nodes x_{j+1} and x_j respectively, and S_{j,j+1} is the sensitivity of x_j to x_{j+1}, determined by Equation 5.16,

S_{j,j+1} = ∂x_j / ∂x_{j+1}   (5.16)
The value range for a node consists of positive and negative ranges, corresponding to fuzzy values with v in the ranges [0, 1] and [-1, 0] respectively. Obviously, the larger the value of e(x_{j+1} → x_j), the stronger the effect of x_{j+1} on x_j. If the relationship uses a time derivative to account for the dynamics, this can be approximated using a backward difference. The sensitivity is then the partial derivative of the rate of change of x_j with respect to x_{j+1}, as shown by Equation 5.17.
S_{j,j+1} = ∂(dx_j/dt) / ∂x_{j+1}   (5.17)
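The sketch below illustrates Equations 5.15 and 5.16, estimating the sensitivity from sampled data with a least-squares slope as a stand-in for the partial derivative; the sample data is synthetic.

```python
# Effect strength of x_{j+1} on x_j (Equation 5.15): the sensitivity
# (Equation 5.16, estimated here from data) scaled by the range ratio.
import numpy as np

def effect_strength(x_next, x_j):
    """e(x_{j+1} -> x_j) = S_{j,j+1} * R_{x_{j+1}} / R_{x_j}."""
    S = np.polyfit(x_next, x_j, 1)[0]        # slope ~ dx_j / dx_{j+1}
    R_next = x_next.max() - x_next.min()     # value range of x_{j+1}
    R_j = x_j.max() - x_j.min()              # value range of x_j
    return S * R_next / R_j

x_next = np.linspace(0.0, 1.0, 50)
x_j = 0.5 * x_next + 0.1 * np.random.default_rng(1).normal(size=50)
print(effect_strength(x_next, x_j))
```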
There are three basic connections in a fuzzy-SDG, i.e., serial (Figure 5.11(a)), divergent (Figure 5.11(b)) and convergent (Figure 5.11(c)) connections. Combinations of the three can form arbitrarily complicated networks, such as that in Figure 5.11(d). Wang et al. [59] described the reasoning strategies for the basic connections.
In many situations, using a single weight (defined by Equation 5.15) is not sufficient because it implies a linear relationship between two nodes. The extended fuzzy-SDG of Huang and Wang [175] therefore uses a more sophisticated method to describe the connection relationship.
Two adjacent layers of a fuzzy-SDG are shown in Figure 5.12(a), which depicts the cause-effect relationships between variables [X1, X2, X3] and [Z1, Z2, Z3], indicating that Z1 is dependent on X1 and X3 but independent of X2. Figure 5.12(b) shows the detail for training this substructure. The substructure involving X1, X3 and
Z1 can be trained independently. The node X1 is converted into two types of nodes: X1[L, M, H] and μ. The first type of node takes only discrete values such as H (high), M (medium) and L (low). The second type takes continuous values μ between 0 and 1, representing the fuzzy membership value when the first node takes the value H, M or L. The outside of the dashed box of Figure 5.12(b) represents fuzzy processing of the original data. The arrangement is different from that of the fuzzy neural network previously studied [151], which requires three nodes to represent a variable if the variable takes three fuzzy values. The present method always uses two nodes, so the size of the network does not increase with the number of values in the fuzzy space. Inside the box of Figure 5.12(b) is a single layer perceptron with no hidden layers, though one hidden layer can also be used.
Figure 5.12 (a) Two adjacent layers of a fuzzy-SDG; (b) the detail for training the substructure involving X1, X3 and Z1.
Figure 5.13 compares a fuzzy-SDG with a fuzzy neural network. Suppose all the variables in Figure 5.13, X1 to X3, Z1 to Z11 and Y1 to Y3, represent variables in a process. The non-linearity between the input variables [X1, X2, X3] and the output variables [Y1, Y2, Y3] is expected to be high because of the distance between them. A neural network (or a fuzzy neural network) with only one hidden layer, as shown in Figure 5.13(b), can usually deal with the high non-linearity between [X1, X2, X3] and [Y1, Y2, Y3].
If we have knowledge of the cause-effect relationships between [X1, X2, X3] and [Y1, Y2, Y3] via a number of intermediate variables, e.g., Z1-Z11, we can make use of that knowledge to develop a cause-effect diagram like Figure 5.13(a). The procedure for training the convergent connection in a fuzzy-SDG has been discussed above (refer to Figure 5.12) using the first layer of Figure 5.13(a). A single layer perceptron (Figure 5.12(b)) can normally give good performance for a substructure in a fuzzy-SDG. This is because the non-linearity between any directly connected layers is normally not high. If we view the fuzzy-SDG network in the horizontal direction, it amounts to a different way of linearly summing a number of small non-linear (e.g., sigmoid) functions.
There is no doubt that the relationship between two connected nodes becomes more complex compared with the original fuzzy-SDG, because the single weight is replaced by a more complicated relationship. However, the relative magnitudes of the weights are not significant, because they depend on the choice of the maximum and minimum boundary values of variables in normalisation. In fact, during reasoning we are mainly concerned with the values of individual nodes and the propagation of reasoning through the whole network, not the weight of a branch. Similar observations can be made for Bayesian networks, in which a branch only indicates a link between two nodes; reasoning in a Bayesian network is based on conditional probability calculations, which require a complex conditional probability table.
Figure 5.13 (a) A fuzzy-SDG relating [X1, X2, X3] to [Y1, Y2, Y3] through the intermediate variables Z1-Z11; (b) a fuzzy neural network with one hidden layer for the same variables.
In this section we describe two case studies. The first compares fuzzy neural network (FFNN), fuzzy single layer perceptron (PCT) and fuzzy set covering (FSC) approaches to fault diagnosis of a FCC process. It demonstrates that three-layered fuzzy neural networks normally give more accurate results than the single layer perceptron and fuzzy set covering approaches; however, all three can identify significant disturbances or faults, and the difference lies in identifying small changes. In terms of qualitative interpretation of connection weights, FSC and PCT are advantageous compared with FFNN. The second case study shows that, for a problem whose non-linearity is not high, a fuzzy-SDG gives equally good results as a FFNN while having a much simpler structure and a clearer cause-effect picture.
The refinery residual fluid catalytic cracking (R-FCC) process described in Appendix B is used as the case study. The data used for this study is summarised in Table B2 of Appendix B. To avoid confusion in what follows, we use data patterns to refer to the 67 data patterns, and fault types to refer to the 13 types of faults.
For a process variable that changes rapidly over a short time, violations of
prescribed high and low limits are indications of possible faults. They are described
by two variable pairs: variable_name (high, x) and variable_name (low, x), where x
is the membership value representing the degree of high and low measures.
Model equations impose constraints on the process variables which are derived
from the material and energy balances, equilibrium and rate equations. The
equations are written so that they are equal to zero when satisfied. Associated with
each model equation are tolerance limits which are the expected positive and
negative values of the residual for which the constraint equation is to be satisfied.
Here the residual is defined as the deviation of the constraint equations from zero.
They are also described by two variable pairs: constraint_name(positive, x) and constraint_name(negative, x). In this case study, no model equations are considered.
trend_section_3(divergence, x)
trend_section_3(oscillation, x)
trend_section_3(stable, x)
trend_section_3(mean_value_high, x)
trend_section_3(mean_value_low, x).
In the first three pairs x takes the value 0 or 1, while in the last two, which represent the deviation of the mean value of this stage from that for normal operation, x takes values from 0 to 1. This means that a dynamic trend can be represented by nine inputs of a diagnostic network for fault diagnosis, as shown in Figure 5.15 (the part inside the dashed box).
Figure 5.15 shows a fuzzy neural network for diagnosing the FCC process. The output layer represents faults. The first output node, fresh feed F (high, x), refers to the fault of high fresh feed flowrate; its x is in the range [0, 1], measuring the degree of high. The x in node 12, compressor (failure, x), only takes 0 or 1, with 1 meaning failure and 0 normal.
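The following sketch shows how the nine inputs of one dynamic trend might be assembled; the descriptor names and the helper itself are hypothetical, for illustration only.

```python
# Nine-input trend encoding as described above: sections 1 and 2 contribute
# increasing/decreasing flags, section 3 contributes converging/diverging/
# oscillating flags, and two memberships describe the deviation of the mean
# value from normal operation.

def trend_to_inputs(sec1, sec2, sec3, mean_high, mean_low):
    """sec1, sec2 in {'inc', 'dec', 'flat'}; sec3 in {'con', 'div', 'osc'};
    mean_high and mean_low are membership values in [0, 1]."""
    return [
        1.0 if sec1 == "inc" else 0.0, 1.0 if sec1 == "dec" else 0.0,
        1.0 if sec2 == "inc" else 0.0, 1.0 if sec2 == "dec" else 0.0,
        1.0 if sec3 == "con" else 0.0, 1.0 if sec3 == "div" else 0.0,
        1.0 if sec3 == "osc" else 0.0, mean_high, mean_low,
    ]

print(trend_to_inputs("inc", "dec", "osc", 0.8, 0.0))
```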
[Figure: regeneration temperature transients (°C), approximate range 610-650.]
The input layer nodes in Figure 5.15 refer to symptoms. Some are described by dynamic trends represented by nine nodes, and some are process variables described by two nodes, high and low. By independently changing the parameters of the outputs and recording the responses of the inputs, data corresponding to Table B2 was obtained.
Figure 5.15 shows the fuzzy neural network structure, which has one hidden layer. A fuzzy single layer perceptron, which has the same input-output structure but no hidden layer, and a fuzzy set covering model were also applied to the same problem in order to make comparisons. It might be difficult to make a completely fair comparison, considering the differences in structure and training details; the comparison is based on the assumption that all three are at their optimum structure and training parameters.
The broad nature of results is summarised in Table 5.2, which shows that the
Fuzzy FFNN model is able to identify 65 of the 67 faults in the samples, compared
with 55 and 53 out of 67 for fuzzy PCT and FSC respectively. The result might be
anticipated since FFNN is the best technique for dealing with non-linear data
followed by PCT and FSC. However there are more interesting observations.
Figure 5.15 The fuzzy neural network structure for fault diagnosis of FCC. Input nodes include regeneration temperature trend descriptors for sections 1-3 (inc, dec, con, div, osc), regeneration T (high/low, x), reaction temperature and pressure trend descriptors, regeneration pressure trend descriptors and O2 content in flue gas (high/low, x); output nodes include fresh feed F (high/low, x), mixed feed preheated T (high/low, x), residue oil F (high/low, x), external heat removal valve opening (high/low, x), external heat removal water pump (failure, x) and main air F (high, x).
Close inspection of the fault data patterns in Table 5.2 shows that the faults which are not identified by any of the procedures correspond to smaller disturbances. For instance, data pattern 24 represents an increase of 10% in the preheated temperature of the mixed feed. This is a relatively small change compared with other data patterns, and the system controllers are able to bring the system back to the designed operating condition fairly quickly. In fact, all the significant disturbances and faults can be recognised by all three models. It is apparent that faults not identified by FFNN remain unidentified by PCT and FSC as well; in the same way, those not identified by PCT are missed by FSC. So the order of effectiveness in accounting for small disturbances is FFNN, PCT and then FSC.
Figure 5.16 The effect weights of the three approaches ((a) FSC, (b) PCT, (c) BP) for a heat exchanger (inputs FC, TCi, FHi, THi; outputs TCo, THo) and the cause-effect explanation.
Table 5.3 Comparison of the target fault values with the BP (FFNN), PCT and FSC outputs for selected data patterns a, b.

No 32  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
BP     0.0006 0.0024 0.0000 0.0001 0.0001 0.0000 0.8491 0.0001 0.0000 0.0000 0.0035 0.0000 0.0000
PCT    0.1224 0.0476 0.0200 0.0295 0.0305 0.0293 0.7479 0.0085 0.0002 0.0264 0.0061 0.0009 0.0067
FSC    0.2161 0.0393 -0.0762 -0.0184 0.0309 0.0079 0.6593 -0.0189 -0.0003 0.0171 0.0049 -0.0028 0.0132

No 38  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000
BP     0.0000 0.0022 0.0000 0.0000 0.0000 0.0000 0.0000 0.0327 0.9871 0.0008 0.0000 0.0000 0.0106
PCT    0.0322 0.0229 0.0199 0.0167 0.0134 0.0161 0.0017 0.3529 0.9018 0.0648 0.0030 0.0675 0.0156
FSC    -0.0429 0.0317 -0.0662 0.0316 0.0007 -0.0205 0.0020 0.2796 0.7681 0.0282 -0.0864 -0.1079 0.0460

No 43  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000
BP     0.0000 0.0128 0.0005 0.0000 0.0000 0.0001 0.0000 0.0000 0.0002 0.9557 0.0006 0.0004 0.0035
PCT    0.0022 0.1458 0.0270 0.0064 0.0024 0.0171 0.0112 0.0030 0.0078 0.8170 0.0041 0.0075 0.0885
FSC    -0.195 0.0313 0.0427 -0.0124 -0.0432 -0.0211 0.0401 -0.1144 0.0695 0.7781 0.0196 0.0316 0.0190

No 60  0.7222 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.8454 0.0000 0.0000 0.0000 0.0000 0.0000
BP     0.7347 0.0000 0.0000 0.0002 0.0019 0.0000 0.0013 0.8203 0.0019 0.0000 0.0000 0.0007 0.0000
PCT    0.6892 0.0050 0.0305 0.0536 0.1205 0.0292 0.0208 0.7579 0.0669 0.0013 0.0058 0.0246 0.0027
FSC    0.5345 0.0153 -0.0145 -0.0559 0.0241 -0.1010 0.1218 0.7848 0.1320 -0.2026 0.0058 0.1114 -0.0328

No 61  0.7778 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000
BP     0.7763 0.0006 0.0000 0.0000 0.0000 0.0000 0.0000 0.0075 0.9908 0.0000 0.0000 0.0033 0.0000
PCT    0.6969 0.0039 0.0082 0.0263 0.0456 0.0181 0.0025 0.1887 0.9362 0.0050 0.0042 0.0586 0.0045
FSC    0.7340 -0.0615 -0.0813 0.0339 0.1096 -0.0377 -0.0388 -0.0442 1.1698 0.0665 0.0399 0.0896 -0.0729

No 65  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 1.0000 0.0000
BP     0.0000 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0087 0.9812 0.0015 0.0000 0.9784 0.0037
PCT    0.0022 0.0104 0.0049 0.0097 0.0119 0.0098 0.0010 0.1603 0.8441 0.0433 0.0018 0.8598 0.0679
FSC    -0.0223 0.0473 -0.0160 -0.0128 -0.0527 0.0374 -0.0055 0.2691 0.6280 0.0932 -0.0255 0.7562 0.0926

No 67  0.0000 0.8333 0.7500 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
BP     0.0001 0.8200 0.7447 0.0000 0.0000 0.0012 0.0000 0.0000 0.0002 0.0020 0.0021 0.0000 0.0117
PCT    0.0300 0.7592 0.7144 0.0040 0.0013 0.0077 0.0146 0.0015 0.0100 0.0630 0.0251 0.0003 0.0787
FSC    0.0390 0.7977 0.7038 0.0143 0.0211 -0.0044 -0.0135 0.0201 -0.0176 0.1017 -0.030 -0.0273 0.0188

a Columns 1 to 13 are fault types corresponding to the 13 output nodes of Figure 5.15.
b 'No n' denotes the nth data pattern; its row gives the target values.
Despite the three having significantly different capabilities when monitoring small
disturbances, they are almost equal in isolating major disturbances and faults. Table
5.3 gives the comparison of three methods for a number of fault situations. All three
can identify the faults, even in double fault situations, though in most cases FFNN
predictions are slightly closer to 1.0 than the other two (1 means fault). However,
PCT and FSC are much faster in convergence. A typical example is that FFNN
requires several hours, PCT one hour and FSC 30 minutes on a personal computer.
There is also interest in how FFNN, PCT and FSC can help develop a better understanding of the cause-effect behaviour of the process. It is easiest to illustrate this with the simple counter-current flow heat exchanger shown in Figure 5.16. The interest is in how the cold stream flowrate (FC) and its inlet temperature (TCi), and the hot stream flowrate (FH) and its inlet temperature (THi), affect the outlet temperatures TCo and THo. The weights obtained from the same set of data for FFNN, PCT and FSC are shown in Figure 5.16. For both PCT and FSC, it is very clear from the weights how an input affects an output; however, it is not so clear from the weights of a FFNN, due to the existence of a hidden layer.
The extended fuzzy-SDG was applied to a waste water treatment plant and compared with feedforward neural networks by Huang and Wang [175]. The plant has three treatment units in series. Through a sensitivity study, a fuzzy-SDG was developed as shown in Figure 5.17. The variables are described in Table 5.4. The advantage of this fuzzy-SDG network is that, compared with neural networks, it is more intuitive to engineers and supervisors. Neural networks are fully connected and can give predictions given the inputs, but it is not straightforward to qualitatively know the weighted contributions of the inputs to the outputs. In contrast, the causal network is no longer a black box, because it is a partially connected graph: engineers can trace forwards and backwards through the network to analyse problems. For example, if the output suspended solids SS-S is observed as (High, 0.10), tracing back through the causal graph gives the nodes SS-D (High, 0.68), SS-P (High, 0.28), RD-SSP (Medium, 1.00), SS-E (High, 0.88) and SSV-E (Medium, 0.87); so the main cause of SS-S (High, 0.10) is SS-E (High, 0.88).
Figure 5.17 The fuzzy-SDG developed for the waste water treatment plant (nodes include PH-E, PH-P and SS-E; variants Fuzzy-SDG-0 and Fuzzy-SDG-2), and a comparison of the Fuzzy-SDG-0 and Fuzzy-SDG-2 predictions with the BP prediction.
since training data is not available. In these cases unsupervised learning approaches are needed, and the goal is to group data into clusters such that intraclass similarity is high and interclass similarity is low. In other words, supervised approaches learn from the known to predict the unknown, while unsupervised approaches learn from the unknown in order to predict the unknown. Supervised learning can generally give more accurate predictions, but cannot be extrapolated: when new data is not in the range of the training data, predictions will not generally be reliable. For process operational state identification and diagnosis, supervised learning needs both symptoms and faults. Therefore the routine data collected by computer control systems cannot be used directly for training: faults are unlikely to be deliberately introduced to an industrial process in order to generate training data.
Grouping of data patterns using unsupervised learning is often based on a similarity or distance measure, which is then compared with a threshold value. The degree of autonomy depends on whether the threshold value is given by the users or determined automatically by the system. In this chapter three approaches are studied: the adaptive resonance theory (ART2), a modified version of it named ARTnet, and Bayesian automatic classification (AutoClass). ART2 and ARTnet, though they require a pre-defined threshold value, are able to deal with both the third and fourth types of data. AutoClass is a completely automatic clustering approach that does not need a pre-defined threshold value or the number and descriptions of classes, so it is able to deal with the fourth type of data.
The adaptive resonance theory (ART) was developed by Grossberg [47, 48] as a clustering-based, autonomous learning model. ART has been instantiated in a series of separate neural network models referred to as ART1, ART2 and ART3 [43, 49, 50, 51]. ART1 is designed for clustering binary vectors and ART2 for continuous-valued vectors.
A general architecture of the ART neural network is shown in Figure 6.1. The network consists of three groups of neurons: the input processing layer (F1), the cluster layer (F2), and a mechanism which determines the degree of similarity of patterns placed in the same cluster (a reset mechanism). The input layer can be considered as consisting of two parts, the input and an interface, and some processing may occur in both (especially in ART2). The interface combines signals from the input with the weight vector of the cluster unit that has been selected as a candidate for learning. The input and interface are designated F1(a) and F1(b) in Figure 6.1.
Figure 6.1 A general architecture of the ART neural network: the F1(a) layer (input), the F1(b) layer (interface) and the F2 layer (cluster layer), with a reset mechanism.
Two sets of weighted connections link the input layer and the cluster units. The F1(b) layer is connected to the F2 layer by bottom-up weights; the bottom-up weight on the connection from the ith F1 unit to the jth F2 unit is designated bij. The F2 layer is connected to the F1(b) layer by top-down weights; the top-down weight on the connection from the jth F2 unit to the ith F1 unit is designated tji.
[Figure 6.2: the units of the F1 layer of ART2 and their connections, including the activation function f(x) and the fixed weights a and b.]
The symbols on the connection paths between the various units in the F1 layer in Figure 6.2 indicate the transformation that occurs to the signal as it passes from one type of unit to the next; they do not indicate multiplication by the given quantity. However, the connections between units Pi (of the F1 layer) and Yj (of the F2 layer) do show the weights that multiply the signal transmitted over those paths. The activation of the winning F2 unit is d, where 0 < d < 1. The arrow symbol indicates normalisation; i.e., the vector q of activations of the Q units is just the vector p of activations of the P units, normalised to approximately unit length.
The U units perform the role of an input phase of the F1 layer. The P units play the role of the interface of the F1 layer. Units Xi and Qi apply an activation function to their net input; this function suppresses any components of the vectors of activations at those levels that fall below the user-selected value θ. The connection paths from W to U and from Q to V have fixed weights a and b, respectively.
The most complete description of the ART2 network can be found in [43]. Although written for ART1, the discussions by Caudill [181] and Wasserman [182] provide a useful but less technical description of many of the principles employed in ART2.
ART2 has shown great potential for the analysis of process operational data and identification of operational states, due to a number of properties [183, 184, 195]. First, it is an unsupervised machine learning system that does not require training data; it is well known that training data is difficult to find for the purpose of process fault identification and diagnosis. Second, the approach is recursive or, in the terms used in ART2, plastic: it is able to acquire new knowledge while remaining stable, in the sense that existing knowledge is not corrupted. This property is apparently very useful for on-line monitoring, where information is received continuously.
Wang et al. [185] and Chen et al. [82] developed an integrated framework named ARTnet which combines wavelets, for feature extraction from dynamic transient signals, with adaptive resonance theory. In ARTnet the data pre-processing part uses wavelets for feature extraction. In order to introduce ARTnet it is helpful to first examine the mechanism of ART2 for noise removal. ART2 has a data pre-processing unit which is very complicated, but the mechanism for removing noise uses a simple activation function A(x),

A(x) = x if x > θ;  A(x) = 0 if x ≤ θ   (6.1)
A mechanism was proposed by Pao [186] to replace the data pre-processing part of ART2 with more efficient noise removal and dimension reduction methods. This has been followed in this study by using wavelets to replace the data pre-processing unit of ART2. The integrated framework is called ARTnet to distinguish it from ART2. The conceptual architecture of ARTnet is shown in Figure 6.4.
In this new architecture, wavelets are used to pre-process the dynamic trend signals. The extracted features are used as inputs to the kernel of ARTnet for clustering. A pattern feature vector (x1, x2, ..., xN) is fed to the input layer of the ARTnet kernel and weighted by the bottom-up weights bij. As discussed in Chapter 3, the extrema of wavelet multiscale analysis are regarded as the features of dynamic transient signals.
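A minimal sketch of this wavelet pre-processing step follows, assuming the PyWavelets package is available; the wavelet ('db4'), the four decomposition levels and the simple extremum-picking rule are illustrative choices, not the exact procedure of [185].

```python
# Wavelet feature extraction for ARTnet: decompose the trend into four
# scales and keep the extrema of each scale's detail coefficients.
import numpy as np
import pywt

def extract_features(signal, wavelet="db4", levels=4):
    coeffs = pywt.wavedec(signal, wavelet, level=levels)
    features = []
    for detail in coeffs[1:]:           # detail coefficients, scales 1..4
        features.append(detail.max())   # keep the extrema of each scale
        features.append(detail.min())
    return np.array(features)

# Synthetic transient resembling a step response, with added noise.
t = np.linspace(0, 1, 128)
trend = -16 * (1 - np.exp(-5 * t)) + 0.3 * np.random.default_rng(2).normal(size=128)
print(extract_features(trend))
```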
The weighted input vector is then compared with the existing clusters in the top layer by calculating the distance between the input and each existing cluster. The existing cluster prototype with the smallest distance to the input is called the winner. By considering this input, the description or knowledge of the winning cluster is updated. Whether or not a winning cluster prototype is allowed to learn from an input data pattern depends on how similar the input is to the cluster. If the similarity measure exceeds a predetermined value, called the vigilance parameter, learning is enabled. If the similarity measure is less than the required vigilance parameter, a new cluster unit is created which reflects the input. Clearly this is an unsupervised and recursive learning process.
Figure 6.4 The conceptual architecture of ARTnet: dynamic trend signals are processed by wavelet feature extraction, and the extracted features are input to the ARTnet kernel, which updates the clusters.
It is apparent that the learning process is concerned with how similar two vectors are. There are several ways to measure the distance between two observations, such as the Hamming or Euclidean distance. For continuous data, the Euclidean distance is the most commonly used [187]. Formally, the Euclidean distance between two vectors x and y is defined as the root sum-squared error,
d(x, y) = [ Σ_{i=1}^{N} (x_i − y_i)² ]^{1/2}   (6.2)
Suppose there are K existing cluster prototypes. The kth cluster prototype consists of a number of data patterns and is also described by a vector, denoted as z(k), which summarises all data patterns belonging to it. Clearly, if there is only one data pattern in the cluster, z(k) is equal to that data pattern. When a new input data pattern x is received, the distance between x and z(k) is calculated according to the expression,

d(x, z(k)) = [ Σ_{i=1}^{N} (x_i − z_i(k))² ]^{1/2}   (6.3)
The distances between x and all existing cluster prototypes are calculated, and the cluster prototype with the smallest distance is the winner. If the distance measure for the winner is smaller than a pre-set distance threshold, p, then the input x is assigned to the winning cluster and the description of the cluster is updated,

z_i(k) = z_i(k) + (1/NF) x_i b_ij,   i = 1...N, j = 1...K   (6.4)

where z_i(k) refers to the ith attribute of the vector z for the cluster k, b_ij is the weight between the ith attribute of the input and the jth existing cluster prototype, and NF is the number of features.
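A minimal sketch of the ARTnet kernel loop described by Equations 6.2-6.4 follows; the simple running-mean prototype update stands in for Equation 6.4 (in effect taking unit bottom-up weights), and the data is synthetic.

```python
# ARTnet-style clustering: find the nearest prototype by Euclidean
# distance; update it if within the threshold p, else create a new cluster.
import numpy as np

def artnet_cluster(patterns, p):
    prototypes, counts, labels = [], [], []
    for x in patterns:
        if prototypes:
            d = [np.linalg.norm(x - z) for z in prototypes]
            k = int(np.argmin(d))                  # the winner
            if d[k] < p:
                counts[k] += 1                     # assign x to the winner
                prototypes[k] += (x - prototypes[k]) / counts[k]
                labels.append(k)
                continue
        prototypes.append(x.astype(float).copy())  # new cluster for x
        counts.append(1)
        labels.append(len(prototypes) - 1)
    return prototypes, labels

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 0.5, (10, 4)), rng.normal(5, 0.5, (10, 4))])
protos, labels = artnet_cluster(data, p=4.5)
print(len(protos), labels)
```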
The FCC process and the data used are described in Appendix B. To demonstrate the procedure, 64 data patterns are used; the discussion is limited to 64 data patterns in order to keep it manageable and to assist in the presentation of the results. The data sets include the following faults or disturbances:
• compressor fails
• double faults occur
The sixty-four data patterns were obtained from a customised dynamic training simulator, to which random noise was added using a zero-mean noise generator (MATLAB®). In the following discussions, the term "data patterns" refers to these sixty-four data patterns and "identified patterns" to the patterns estimated by ARTnet.
Figure 6.5(a) shows a reactor temperature transient when the fresh feed flowrate
increases by 70%. Figure 6.5(b) is the same transient with random noise. The
corresponding four scales from multiresolution analysis for this transient are shown
in Figure 6.6, together with the corresponding extrema on the right hand side of
Figure 6.6.
Figure 6.5 A signal from the simulator (a) and the signal with random noise (b).
Figure 6.6 Four scales (Scale 1 to Scale 4) from multiresolution analysis of the transient (left) and the corresponding extrema (right).
With the threshold p increased to 1.0, data patterns 56 and 57, which represent cases where the opening of valve 401-ST is decreased from 100% by 80% and 90%, are grouped together. When p is 2.0, the further groupings are [5, 7], representing the fresh feed flowrate increasing by 50% and 70%, [25, 26], the recycle oil flowrate increasing by 70% and 90%, and [27, 28], the recycle oil flowrate decreasing by 70% and 90%. These are all clearly reasonable groupings.
When the threshold value is 4.5, the groupings are [3 4 5 6 7 8 9], [19 20 21 22 23 24], [25 26], [27 28], [35 36] and [56 57]. The pairing of identified patterns and original data patterns is shown in Table 6.2. The clustering is justified by inspecting the results in detail. Figure 6.9 shows the trends of three measurements for data pattern 5: the regenerator temperature and the concentration of oxygen in the regenerator flue gas drop sharply while the catalyst hold-up in the reactor increases dramatically, all of which indicate abnormal operation. Very similar scenarios are found for data patterns 3, 4, 5, 6, 7, 8 and 9, so regarding them as a single pattern is acceptable. The grouping [35, 36] can also be justified by inspecting the dynamic responses (Figure 6.10): in both cases, the dynamic responses of the catalyst recycle rate lead to a steady state with the process remaining under control.
Table 6.2 ARTnet identified clusters when the distance threshold is 4.5 and the corresponding data patterns a.

Identified  Corresponding         Identified  Corresponding  Identified  Corresponding
clusters    data patterns         clusters    data patterns  clusters    data patterns
1           1                     19          32             37          51
2           2                     20          33             38          52
3           [3 4 5 6 7 8 9]       21          34             39          53
4           10                    22          [35 36]        40          54
5           11                    23          37             41          55
6           12                    24          38             42          [56 57]
7           13                    25          39             43          58
8           14                    26          40             44          59
9           15                    27          41             45          60
10          16                    28          42             46          61
11          17                    29          43             47          62
12          18                    30          44             48          63
13          [19 20 21 22 23 24]   31          45             49          64
14          [25 26]               32          46
15          [27 28]               33          47
16          29                    34          48
17          30                    35          49
18          31                    36          50

a - [3 4 5 6 7 8 9] means data patterns 3 to 9 are identified in the same cluster
Figure 6.7 Extrema after removing noise (left) and piecewise analysis (right), for Scales 1 to 4. Di - detail of multiscale wavelet analysis; Ai - approximation.
However, any further increase in threshold is not useful because some data
patterns that are significantly different are grouped in the same cluster. For instance,
when the threshold value is 5, data pattern 29 (opening ratio of the hand-valve V20
increasing by 5%) is merged with the clusters representing increase and decrease in
the preheat temperature of the mixed feed. Therefore, the threshold p = 4.5 is
considered as the most appropriate value for this case.
Figure 6.8 Comparison of the result after noise removal (b) with the multiresolution analysis of the original simulation signal (a).
Figure 6.9 Trends of the regenerator temperature, the oxygen concentration at the regenerator outlet and the catalyst hold-up for data pattern 5.

Figure 6.10 Dynamic trends of catalyst recycle rate for data patterns 35 and 36.
In this case, only the first 57 data patterns are used, to compare the distance threshold of ARTnet with the vigilance value of ART2 using noise-free data. For noise-free data, ARTnet and ART2 give the same results if the ARTnet distance threshold and the ART2 vigilance are appropriately adjusted, as shown in Table 6.3. To understand the table, consider the last row, which shows that when the distance threshold of ARTnet is 4.5 it gives the same grouping result as ART2 with a vigilance value of 0.9985. From Table 6.3, for the same groupings, the ARTnet distance threshold changes from 0.8 to 4.5 while the vigilance of ART2 varies only from 0.9998 down to 0.9985. So the distance threshold of ARTnet is less sensitive than the vigilance of ART2; the ART2 clustering is too sensitive to the vigilance value, making it difficult to set.
Table 6.3 Comparison of the value ranges of the distance threshold of ARTnet and the vigilance value of ART2, for the same grouping schemes a, b, c.

ARTnet distance threshold  ART2 vigilance value  Grouping of data samples
0.8                        0.9998                (no multi-pattern groupings)
1.0                        0.9996                [56 57]
2.0                        0.9992                [5 7] [25 26] [27 28] [56 57]
3.0                        0.9990                [5 7] [19 20 23 24] [25 26] [27 28] [56 57]
4.0                        0.9987                [5 6 7 8] [19 20 21 23 24] [25 26] [27 28] [56 57]
4.5                        0.9985                [3 4 5 6 7 8 9] [19 20 21 22 23 24] [25 26] [27 28] [35 36] [56 57]

a [56 57] means that data patterns 56 and 57 are grouped in the same cluster. b Only the first 57 data patterns are considered and the data is noise free. c The ARTnet distance threshold changes over a wide range while the ART2 vigilance is too sensitive, making it difficult to set a value.
Table 6.4 Clusters predicted by ARTnet when the distance threshold is 4.5 and Cnoise varies over a wide range, from 0.001 to 100 a.

Identified  Corresponding         Identified  Corresponding  Identified  Corresponding
patterns    data patterns         patterns    data patterns  patterns    data patterns
1           1                     15          [27 28]        29          43
2           2                     16          29             30          44
3           [3 4 5 6 7 8 9]       17          30             31          45
4           10                    18          31             32          46
5           11                    19          32             33          47
6           12                    20          33             34          48
7           13                    21          34             35          49
8           14                    22          [35 36]        36          50
9           15                    23          37             37          51
10          16                    24          38             38          52
11          17                    25          39             39          53
12          18                    26          40             40          54
13          [19 20 21 22 23 24]   27          41             41          55
14          [25 26]               28          42             42          [56 57]

a [3 4 5 6 7 8 9] means that data patterns 3 to 9 are grouped in the same cluster.
In Equation 6.5, values of Cnoise ranging from 0.001 to 100 are examined in what follows; the smaller the Cnoise, the larger the noise-to-signal ratio.
The best clustering results are obtained when the distance threshold of ARTnet is 4.5. This result is not affected by changing Cnoise from 0.001 to 100, as can be seen in Table 6.4. For ART2 with Cnoise = 100, the best vigilance value is 0.9985, which gives the same result as ARTnet (Table 6.4). However, as Cnoise decreases to 10, i.e., a larger noise-to-signal ratio, ART2 splits the cluster [3 4 5 6 7 8 9] into two, [3 4 5 6 7] and [8 9]. As Cnoise decreases to 0.001, i.e., a much larger noise-to-signal ratio, there are further new groupings, [20 42] and [29 51]. These new groupings cannot be satisfactorily explained, and although the inappropriate groupings [20 42] and [29 51] can be avoided by changing the vigilance value, other unreasonable groupings are then generated.
ARTnet is also found to be faster than ART2: after optimum values of the distance threshold of ARTnet and the vigilance of ART2 are found, for the same data, ARTnet is typically two times faster than ART2.
threshold value which is set by users based on trial and error. The Kohonen network requires the number of classes to be determined beforehand. The Bayesian solution to the problem is based on the use of prior knowledge. It assumes that simpler class hypotheses (e.g., those with fewer classes) are more likely than complex ones, in advance of acquiring any data, and the prior probability of the hypothesis reflects this preference. The prior probability term prefers fewer classes, while the likelihood of the data prefers more, so the two effects balance at the most probable number of classes. Because of this, AutoClass finds only one class in random data.
• Objects are not assigned to a class absolutely. AutoClass calculates the probability of membership of an object in each class, providing a more intuitive classification than absolute partitioning techniques. An object described equally well by two class descriptions should not be assigned to either class with certainty, because the evidence cannot support such an assertion.
• All attributes are potentially significant. Classification can be based on any or all attributes simultaneously, not just the most important one. This represents an advantage of the Bayesian method over human classification. In many applications, classes are distinguished not by one or even by several attributes, but by many small differences. Humans often have difficulty in taking more than a few attributes into account. The Bayesian approach utilises all attributes simultaneously, permitting uniform consideration of all the data. At the end of learning, AutoClass gives the contributing factors to class formation.
• Data can be real or discrete. Many methods have difficulty in analysing mixed data. Some methods insist on real valued data, while others accept only discrete data. The Bayesian approach can utilise the data exactly as they are given.
• It allows missing attribute values.
The fundamental model of AutoClass is the classical finite mixture model of Everitt and Hand [188] and Titterington et al. [189], made up of two parts. The first is the probability λs of an instance being drawn from a class Cs (s = 1, ..., k). Each class Cs is then modelled by a class distribution function, P(Xi | Xi ∈ Cs, θs), giving the probability distribution of the attributes conditional on the assumption that instance Xi belongs to class Cs. These class distributions are described by a class parameter vector, θs, which for a single-attribute normal distribution consists of the class mean, μs, and variance σs².
Thus, the probability of a given datum coming from a set of classes is the sum of the probabilities that it came from each class separately, weighted by the class probabilities,
P(Xi | θ, λ, k) = Σ_{s=1}^{k} λs P(Xi | Xi ∈ Cs, θs)   (6.7)
It is assumed that the data is unordered and independent, given the model. Thus the likelihood of measuring an entire database is the product of the probabilities of measuring each object,

P(X | θ, λ, k) = Π_{i=1}^{n} P(Xi | θ, λ, k)   (6.8)
For a given value of the class parameters, the probability that instance i belongs to class Cs is calculated using Bayes's theorem,

P(Xi ∈ Cs | Xi, θ, λ, k) = λs P(Xi | Xi ∈ Cs, θs) / Σ_{r=1}^{k} λr P(Xi | Xi ∈ Cr, θr)
To solve the second half of the classification problem (i.e., determining the number of classes k), the posterior distribution of the number of classes k has to be calculated. This is proportional to the product of the prior distribution p(k) and the pseudo-likelihood function p(X | k),

p(k | X) = p(k) p(X | k) / p(X)   (6.12)

In principle, the most probable number of classes is determined by evaluating p(k | X) over the range of k for which the prior p(k) is significant. In practice, the multi-dimensional integrals of Equation 6.6 are computationally intractable, and the maximum of the function has to be found so that it can be approximated at about that point.
In AutoClass, it is assumed that attributes are independent for each class. This permits an extremely simple form for the class distributions used in Equation 6.2,

P(Xi | Xi ∈ Cs, θs) = Π_{j=1}^{m} P(Xij | Xi ∈ Cs, θsj)

where θsj is the parameter vector describing the jth attribute in the sth class Cs. AutoClass models for real valued attributes are Gaussian normal distributions parameterised by a mean and a standard deviation, and thus θsj takes the form (θsj) = (μsj, σsj). The class distribution is thus

P(Xij | Xi ∈ Cs, θsj) = (1 / (√(2π) σsj)) exp( −(Xij − μsj)² / (2σsj²) )   (6.14)
As mentioned earlier, AutoClass breaks the classification problem into two parts: determining the number of classes and determining the parameters defining them. It uses a Bayesian variant of Dempster and Laird's EM (expectation and maximisation) algorithm [190] to find the best class parameters for a given number of classes. To derive the algorithm, the posterior distribution is differentiated with respect to the class parameters and equated to zero. This yields a system of non-linear equations which hold at the maximum of the posterior:

λs = (Ws + w' − 1) / (n + k(w' − 1)),   s = 1...k   (6.15)

∂/∂θs ln p(θs) + Σ_{i=1}^{n} Wis ∂/∂θs ln p(Xi | θs) = 0   (6.16)

where Wis is the probability that the datum Xi was drawn from class s (given by Equation 6.4) and Ws is the total weight for class Cs:

Wis = P(Xi ∈ Cs | Xi, θ, λ)
Ws = Σ_{i=1}^{n} Wis
Thus far, the discussion of the search algorithm has related to a general class model with an arbitrary θsj. Equation 6.12 is now applied to the specific AutoClass model of Equations 6.8 through 6.9.
For real valued attributes, the equations for the updated μsj and σsj are a function of the prior information and the empirical mean X̄sj and variance σ̄²sj of the jth attribute in class s:

X̄sj = ( Σ_{i=1}^{n} Wis Xij ) / Ws

σ̄²sj = ( Σ_{i=1}^{n} Wis X²ij ) / Ws − X̄²sj

σ²sj = ( W'(σ'j)² + Ws σ̄²sj ) / (W' + Ws + 1) + ( W' Ws / ((W' + Ws)(W' + Ws + 1)) ) (σ'j − X̄sj)²,   s = 1...k   (6.19)
Equations 6.10, 6.13 and 6.14 do not, of course, give the estimators explicitly; instead they must be solved using an iterative procedure [191]. The simplest way of estimating the parameters by the maximum likelihood method is that suggested by Wolfe [192], which is essentially an application of the EM algorithm [190]. For Bayesian parameter estimation, AutoClass uses a Bayesian variant of the Dempster and Laird EM algorithm. Initial estimates of the λs, μs and σs are obtained by one of a variety of methods [188], and these are then used to obtain first estimates of the p(s | Xi), i.e., the weights Wis and hence Ws (the E-step); these are then inserted into Equations 6.10, 6.13 and 6.14 to give revised parameter estimates (the M-step). The process is continued until some convergence criterion is satisfied [188].
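The E-step/M-step cycle can be sketched for a one-dimensional, two-class Gaussian mixture as below. This is plain maximum-likelihood EM rather than AutoClass's Bayesian variant (no prior terms W', σ'j), and the data is synthetic.

```python
# EM for a two-class Gaussian mixture: alternate between computing
# membership weights (E-step) and revising the parameters (M-step).
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

k = 2
lam = np.full(k, 1.0 / k)        # class probabilities lambda_s
mu = rng.choice(x, k)            # class means
sig = np.full(k, x.std())        # class standard deviations

for _ in range(50):
    # E-step: weights w_is = P(x_i in C_s | x_i, theta, lambda)
    pdf = np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (np.sqrt(2 * np.pi) * sig)
    w = lam * pdf
    w /= w.sum(axis=1, keepdims=True)
    # M-step: revised parameter estimates from the weighted data
    Ws = w.sum(axis=0)
    lam = Ws / len(x)
    mu = (w * x[:, None]).sum(axis=0) / Ws
    sig = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / Ws)

print(lam.round(3), mu.round(3), sig.round(3))
```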
Using the same FCC process as the case study, 42 data patterns are studied, which are summarised in Table 6.5. For every variable in each of the data patterns, 60 sampling points are recorded. For example, the dynamic trend of the reaction temperature for case 15 in Figure 6.11 is composed of 60 data points, recorded when the valve opening on the top of the distillation column changes from 100% to 90%. Six process parameters are recorded, including the reaction and regeneration temperatures (TRA and TRG), the reactor and regenerator pressures (PRA and PRG), and the oxygen and carbon monoxide volumetric contents in the flue gas from the regenerator (PT02 and PTCO). These are known to be the major variables for the FCC process, although more precise characterisation would be expected if more parameters were included.
Figure 6.11 Closed-loop dynamic responses of the reaction temperature for cases 15, 17 and 19. Figure 6.12 Closed-loop dynamic responses of the regeneration temperature for cases 15, 17 and 19.
Clearly, prior knowledge about the data makes it possible to test the automatic classification capabilities. Since each data instance involves six variables and each variable is represented by 60 data points, the database is a 360 × 42 matrix. In Chapter 3, various methods were introduced for reducing the dimensionality of a dynamic trend without the signal losing its important features, including wavelets, principal components and episode representations. Here the raw data are used, since the concern is the identification of operational states.
Table 6.6 The AutoClass clustering results of the 42 cases in Table 6.5.
Classes  Cases
1        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42
2        21, 22, 23, 31, 32, 13, 29
3        15, 16, 17, 18, 19, 20
4        24, 25, 26, 27, 28
5        12, 14, 30
The classified results are analysed by comparing Table 6.6 and Table 6.5. Class I includes cases 1-11 and 33-42, which correspond to normal operations, as can be seen from Table 6.5, so it is reasonable to assign them to a single class. The significance of being able to automatically distinguish normal operational data from abnormal data, which represent moderate to significant disturbances as well as faults, is that process upsets can be detected.
Class III includes cases 15-20, which correspond to decreases in the opening of valve 401-ST from 100% to 90%, 80%, 60%, 40%, 30% and 20%, which cause the differential pressure (PRG − PRA) between the regenerator and reactor to decrease. Consequently, the regenerated catalyst circulation rate falls and the reaction and regeneration temperatures are influenced. The closed-loop dynamic responses of the reaction and regeneration temperatures for cases 15, 17 and 19 are shown in Figures 6.11 and 6.12. It can be seen that although they are not identical, they show greater similarity than those in class I, i.e., normal operation.
Class IV includes cases 24-28, which correspond to changes in the opening of the hand-operated valve V20; these first cause the regeneration temperature to change, due to changes in heat transfer, and then affect all other parameters. Cases 24-28 refer to a reduction in the opening of V20 from its normal operating value (75%) to 60%, 50%, 40%, 35% and 30%. All these operations cause the regeneration temperature to increase, so it is quite reasonable that they should be grouped into one class. The changes in the regeneration temperature TRG and the oxygen volumetric percentage in the flue gas PT02 are shown in Figures 6.13 and 6.14. It is significant that cases 21, 22 and 23 are grouped in a different class (class II), although they also represent V20 opening changes, because they represent increases in opening, which have different effects on the associated variables. Figure 6.15 shows the PT02 changes for cases 21, 22 and 23; it is clear that they are different from cases 24, 26 and 28 shown in Figure 6.14.
Figure 6.13 Regeneration temperature TRG changes for cases 24, 26 and 28. Figure 6.14 Oxygen content PT02 changes for cases 24, 26 and 28. Figure 6.15 PT02 changes for cases 21, 22 and 23. Figure 6.16 Early-stage reaction temperature responses, including cases 31 and 32.
Class V includes 12, 14 and 30, which are clearly in one class: they represent feed pump P1329 failure and feed flowrate decreases of 50% and 90%, all of which mean a sharp decrease in feed flowrate.
Class II includes 21, 22, 23, 31, 32, 13 and 29. As discussed above, it is reasonable that 21, 22 and 23 should be grouped together, since they represent increases in the opening of V20, which cause decreases in the regeneration temperature. Both 13 and 32 represent slight increases (15% and 9%) in fresh feed flowrate, while 29 represents a threefold increase in sludge oil flowrate and 31 a sludge oil pump failure. Since the sludge oil flow is very small compared with the fresh feed at normal operation (8,000 kg/hr sludge oil vs 150,000 kg/hr fresh feed), it is not surprising that a threefold increase in sludge oil has a similar effect on process operation as slight fresh feed increases of 15% and 9%. The major difference between the four cases 13, 32, 29 and 31 is that 13, 29 and 32 represent increases in feed while 31 indicates a decrease, as indicated in the early stage of the reaction temperature responses (Figure 6.16). Because the process is under closed
Inductive learning is probably the most widely studied method based on symbolic learning [202-204]. It attempts to acquire a conceptual language for describing an object by drawing inductive inferences from observations. The focus is on deriving rules or decision trees from unordered sets of examples, especially attribute-based induction, a formalism where examples are described in terms of a fixed collection of attributes. The discussion excludes learning methods such as feedforward neural networks, which develop implicit rather than explicit and transparent rules or decision trees. An obvious motivation for inductive learning is that it provides a method for overcoming the knowledge acquisition bottleneck in developing expert systems: it is relatively easy for human experts to document cases, but hard for them to articulate their expertise explicitly and clearly. Several approaches to inductive learning have been proposed, such as AQ11 [196], VersionSpaces [197] and C5.0 [18, 19, 20]. Here the focus is on C5.0, a system that is designed to learn to develop rules and decision trees from examples.
The conceptual clustering approach used in C5.0 was developed by Quinlan [18, 19, 20]. Given a database of objects (in other words, data sets) described in terms of a collection of attributes, each of which measures some important feature of an object, and given that each object belongs to one of a set of mutually exclusive classes, the task is to develop a classification rule that can determine the class of any object from the values of its attributes. The decision tree generated can be used for conceptual clustering. The procedure is iterative and can be summarised as follows [18, 19]:
(1) Select a random subset of the given training examples (called the window)
(2) Repeat (a) to (c)
(a) Develop a decision tree which correctly classifies all objects in the
window
(b) Find exceptions to this decision tree in the remaining examples
(c) Form a new window by adding incorrectly classified objects to the
window
The crux of the problem is how to develop a decision tree for an arbitrary
collection of objects in the window. Forming a decision tree requires selecting the
root attribute. To do this, assume that there are only two classes representing all the
data, P and N (extension to any number of classes is not difficult). The method for
finding the root attribute is an information-based one that depends on two
assumptions. Suppose the window C contains p objects of class P and n objects of
class N. The assumptions are:
(1) Any correct decision tree for the window C will classify objects in the same
proportion as their representation in C. An arbitrary object will be determined as
belonging to class P with probability p/(p+n) and to class N with probability
n/(p+n).
(2) When a decision tree is used to classify an object, it returns a class. A decision
tree can then be regarded as the source of a message 'P' or 'N', with the expected
information needed to generate this message given by
I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}    (7.1)
If an attribute A having values \{A_1, A_2, \ldots, A_v\} is used as the root of the
decision tree, it will partition the window C into \{C_1, C_2, \ldots, C_v\}, where C_i
contains those objects in C that have value A_i of A. Suppose C_i contains p_i objects
of class P and n_i of class N. The expected information required for the subtree for
C_i is I(p_i, n_i), and for the tree with A as root it is the weighted average

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)    (7.2)
where the weight for the ith branch is the proportion of the objects in C that belong
to C_i. The information gained by branching on A is therefore

gain(A) = I(p, n) - E(A)    (7.3)
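As a concrete illustration of Equations 7.1 to 7.3, the short Python sketch below computes the information gain of a candidate root attribute for a two-class window; the attribute values and class labels are invented for the example.

import math

def info(p, n):
    # I(p, n) of Equation 7.1; a term vanishes when its class is empty
    total = p + n
    return sum(-(k / total) * math.log2(k / total) for k in (p, n) if k)

def gain(values, labels):
    # gain(A) = I(p, n) - E(A), Equations 7.2 and 7.3
    p, n = labels.count('P'), labels.count('N')
    e = 0.0
    for v in set(values):
        sub = [lab for val, lab in zip(values, labels) if val == v]
        pi, ni = sub.count('P'), sub.count('N')
        e += (pi + ni) / (p + n) * info(pi, ni)  # weighted average, Equation 7.2
    return info(p, n) - e

# hypothetical window: attribute A takes values A1/A2, classes are P/N
print(gain(['A1', 'A1', 'A2', 'A2', 'A2'], ['P', 'P', 'P', 'N', 'N']))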
The approach has been used in the commercial software C5.0 [20], which evolved
from the earlier versions C4.5 [18] and ID3 [19]. A major limitation of ID3 was
that it assumed the values of all attributes to be discrete, for instance a colour
being red or green. C4.5 claims to be able to deal with continuous-valued
attributes, but its treatment of them is still weak compared with the way it deals
with discrete-valued attributes, as noted by Quinlan [198]. Though Quinlan [198]
made a further effort to improve the method so that it could deal with
continuous-valued attributes, the outcome is still not very satisfactory.
Nevertheless, C5.0 has become one of the most well known tools for use in data
mining and knowledge discovery, especially in domains involving only discrete values.
Like most of the available inductive learning methods, C5.0 was developed for
problem domains where attributes take only discrete values. Such methods have
proved to perform remarkably well with discrete-valued attributes. However, when
the problem domain contains real numbers, the performance usually decreases in
terms of accuracy. Using inductive learning with continuous-valued attributes
requires discretisation of the values into a number of intervals, and a number of
approaches to this have been proposed.
The class-separating method [199] is based on the assumption that if the class
assignments of individual examples are known, then this knowledge can be used to
discretise the attributes; an example is using the boiling point ranges of petroleum
fractions to define product grades. This approach, however, sometimes produces
too many intervals which are not very informative. Moreover, if the assignment of
individual cases is not known, the approach cannot be used.
It is possible to use an equal-distance approach, dividing the values of a continuous
variable between the minimum and maximum into a fixed number of intervals.
Alternatively, equal numbers can be used, making each interval contain an equal
number of examples.
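Both schemes are straightforward to sketch in Python with NumPy; the data values and the number of intervals below are purely illustrative.

import numpy as np

x = np.array([1.2, 3.4, 2.2, 9.8, 5.5, 7.1, 4.0, 6.3])
k = 4  # number of intervals

# Equal distance: k intervals of equal width between the minimum and maximum
width_edges = np.linspace(x.min(), x.max(), k + 1)
width_bins = np.digitize(x, width_edges[1:-1])

# Equal numbers: boundaries at the quantiles, so each interval contains
# (roughly) the same number of examples
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(x, freq_edges[1:-1])

print(width_bins, freq_bins)  # interval index (0..k-1) for each value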
Another approach is the k-nearest neighbours approach, which tries to estimate the
class to which the value of a specified attribute most likely belongs. It places a
border between two values x_i and x_{i+1} if the estimates for them differ. The
estimates are based on the assumption that the most probable class is the most
common class among the k nearest examples, where k is normally a user-specified
parameter.
Wu [199] developed a Bayesian discretiser. The approach is shown to be more
accurate than some other approaches, but it can only be used for integer values and
requires the class assignments of the examples to be known. The method used by
C4.5 [18] is based on information gain, but it only provides binary discretisation.
Saraiva [5] and Saraiva and Stephanopoulos [4] divided process plant data and
monitoring and control strategies into layers, as shown in Figure 7.1, and
investigated the application of inductive learning to the highlighted layer in the
figure, based on analysis of daily/weekly averaged data as a basis for continuous
process improvement. The goal is to develop a conceptual language describing the
contributions of various operational parameters to one or more product-quality-
related performance metrics.
Figure 7.1 Levels, time scales, and application scopes of decision-making activities.
Saraiva [5] presented four industrial case studies using this approach. A simple
example is used here to illustrate the methodology. The case study concerns
records of operating data from a refinery unit [200], shown in Table 7.1, which has
five variables, x1, x2, x3, x4 and y. The latter is the octane number of the gasoline
product and is discretised as very low if the value is less than or equal to 91, low if
it is between 91 and 92, and good if greater than 92. x1, x2 and x3 are different
measures of the feed composition and x4 is the log of a combination of process
conditions.
Figure 7.2 shows the induced tree as well as the partition of the (x1, x4) plane
defined by the leaves, together with a projection of all the available (x, y) pairs
onto the same plane. These two decision variables clearly influence the current
performance of the refinery unit, and the decision tree leaves give a reasonable
partition of the plane. To achieve better performance, operating zones that will
result in obtaining mostly y = 3 values need to be found. Terminal nodes 2 and 7
identify two such zones. The corresponding solutions effectively say that one
should expect to get almost only "good" y values while operating inside these
zones of the decision space, as opposed to the current operating conditions, which
lead to just 40% "good" y values.
Saraiva [5] presented more complex case studies and methods for defining
performance metrics. He used the binary discretisation mechanism for continuous-
valued variables embedded in C4.5. In addition, the approach is only suitable for
analysing daily/weekly averaged data, not on-line data from computer control
systems sampled in seconds.
Figure 7.2 (a) Induced decision tree; (b) partition of the plane defined by its leaves.
This section illustrates the identification of operational states using a conceptual
clustering approach. This will then be extended to a more complicated case study
of a refinery MTBE process in Section 7.4.
The approach basically comprises the following procedures: (1) concept extraction
from dynamic trend signals using PCA; (2) identification of operational states
using an unsupervised machine learning approach; and (3) application of the
inductive machine learning system to develop decision trees and rules for process
monitoring.
This has been discussed in detail in Chapter 3, so only a brief review is presented
here. For a specific set of data, the value of a variable is a dynamic trend consisting
of tens to hundreds of sampled points. In inductive learning it is the shape of the
trend that matters, so for a specific variable, when the trends of all the data sets are
considered and processed using PCA, the first two principal components (PCs) can
be plotted in a two-dimensional plane, as shown in Figure 7.3; PC-1-TR and
PC-2-TR there correspond to the first two PCs of the reaction temperature TR. The
data sets are grouped into clusters in this two-dimensional plane. This permits a
dynamic trend to be abstracted as a concept, typified by PC-1-TR falling in region
A. The following sections will show how this process can be used for conceptual
clustering using inductive learning.
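The trend-conceptualisation step can be sketched as follows: every data set contributes one dynamic trend (a row of sampled points) for a given variable, PCA projects the trends onto their first two principal components, and clusters in the resulting plane become the qualitative concepts (regions A, B, C, ...). The synthetic trends below merely stand in for plant data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 128)
# 20 data sets: ten rising trends and ten falling trends, plus noise
trends = np.vstack([np.outer(rng.uniform(0.5, 1.5, 10), t),
                    np.outer(rng.uniform(0.5, 1.5, 10), -t)])
trends += 0.02 * rng.standard_normal(trends.shape)

scores = PCA(n_components=2).fit_transform(trends)  # (PC-1, PC-2) per data set
print(scores[:3])  # each row is one point in the two-dimensional plane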
The next step is identification of operational states. In this case, this can be done
using PCA because there are only eight variables. For more complex processes,
more sophisticated approaches need to be used, which will be described later in the
case study of MTBE. The first two PCs of the eight variables (TR, Fo, Fw, Fi, Ti,
Ci, Twi, L) are plotted in Figure 7.7. The five groups which are identified represent
the 85 data cases as five clusters corresponding to five distinct operational modes.
Detailed examination of the clusters shows that these groups are reasonable.
Figure 7.3 PCA two-dimensional plot of TR.
Figure 7.4 PCA two-dimensional plot of Fo.
Figure 7.5 PCA two-dimensional plot of Fw.
Figure 7.6(a) PCA two-dimensional plot of Fi.
"
I'
10 Cases 14,39,59
Cases 1,25.27,47,49 IC
Cases 4,28,51 C
5 B D 5
Figure 7.6(b) PCA two dimensional plot ofTi. Figure 7.6(c) PCA two dimensional plot ofC.
Figure 7.6(d) PCA two-dimensional plot of Twi.
Figure 7.6(e) PCA two-dimensional plot of L.
Figure 7.7 PCA two-dimensional plot of the CSTR operational states (NOR1: cases 1-24; NOR2: cases 47-70; NOR3: cases 25-46; ABN1: cases 71-80; ABN2: cases 81-85).
Having characterised the dynamic trend signals and identified the operational states,
it is necessary to find out how to generate knowledge correlating the variables and
the operational states. This requires generating a file of the form shown in Table 7.2.
In fact, each data set in Table 7.2 can be interpreted as a production rule; the first
case, for example, is equivalent to one such rule.
Table 7.2 The data structure used by C5.0 for conceptual clustering.
Obviously this is simply a restatement of the database, and the decision tree
developed directly from it would be very complex. C5.0 makes it possible to
develop a simpler tree. A simple tree is preferable because it usually performs
better than a complex tree on data cases outside the training set.
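A sketch of this conceptual clustering step is given below: each variable takes a symbolic region label (A, B, C, ...) from its PCA plane, and a tree is induced that maps the region labels to operational states. scikit-learn's DecisionTreeClassifier is used here only as a stand-in for C5.0, and the tiny data set is invented for illustration.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    'TR':    ['A', 'B', 'C', 'C', 'D'],
    'Fo':    ['A', 'B', 'B', 'D', 'B'],
    'state': ['ABN2', 'NOR2', 'NOR3', 'ABN1', 'NOR1'],
})
X = pd.get_dummies(data[['TR', 'Fo']])  # one-hot encode the region labels
tree = DecisionTreeClassifier().fit(X, data['state'])
print(export_text(tree, feature_names=list(X.columns)))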
The decision tree developed for the CSTR case study is shown in Figure 7.8 and
can be converted to production rules, as shown in Table 7.3. C5.0 identifies the
reactor temperature as the root node and states that if TR is in region A, B or D of
Figure 7.3, then the operation will be in region ABN2 (abnormal mode 2), NOR2
(normal operation mode 2) or NOR1 (normal operation mode 1) of Figure 7.7. If
TR is in region C of Figure 7.3, then there are three possible situations depending
on Fo: if Fo is in region D of Figure 7.4, the operation will correspond to ABN1
(abnormal operation 1); if Fo is in A or B of Figure 7.4, the operation will be NOR3
(normal operation 3). The result effectively states that it is sufficient to focus on
monitoring TR in Figure 7.3, and only if TR is in region C does Fo in Figure 7.4
need to be examined. It also shows which variables are responsible for placing the
operation in a specific region of Figure 7.7.
Figure 7.8 The decision tree developed for the CSTR after conceptualisation of the variable trends.
Figure 7.9 The decision tree developed for the CSTR using the eigenvalues of the first two principal components of each variable.
Table 7.4 The numerical values of the first two principal components of each variable are used by C5.0 to develop the tree in Figure 7.9.

        Fi            Ti            Ci            Twi           Fw            Fo            TR            L           State
  PC-1   PC-2   PC-1   PC-2   PC-1   PC-2   PC-1   PC-2   PC-1   PC-2   PC-1   PC-2   PC-1   PC-2   PC-1   PC-2
 -1.07   0.47   5.66  -0.45  -4.47   0.19  -0.10  -0.08  -41.9  -5.57  -11.5  -0.92  47.83  -2.48   1.19   0.04   NOR1
 -0.92   0.47  -0.29   0.01  -4.72   0.28  -0.11  -0.06   86.7   0.64  -11.7  -0.98  -88.9  -6.19  11.96   0.30   ABN2
The decision tree shown in Figure 7.8 and the rules in Table 7.3 provide transparent
guidance for operational clustering.
In the above discussion, the dynamic trends in the two-dimensional PCA planes
have been abstracted. An alternative way is to use the numerical values of the first
two PCs directly. The data structure to be processed by C5.0 is then put in the
format of Table 7.4. A decision tree thus developed is shown in Figure 7.9 and the
rules shown in Table 7.5. Inspection of Figure 7.9 reveals that it is very similar to
the tree in Figure 7.8. For example, the rule leading to ABN2 in Figure 7.9 has two
preconditions on the PCs of TR which, from Figure 7.3, are equivalent to IF TR =
region A, and so it gives the same result as Figure 7.8.
The only difference is the last branch leading to NOR3 (the dashed-line box of
Figure 7.9). In Figure 7.9 the rule is

IF PC-1-TR > -33.8, THEN NOR3.
By reference to Figures 7.3 and 7.4, this rule condition is equivalent to IF TR = C
in Figure 7.3 and Fo = A, B or C in Figure 7.4, whereas in Figure 7.8 the
corresponding rule condition is TR = C in Figure 7.3 and Fo = A or B. This slight
difference does not indicate which approach is better. However, in the next section,
involving a larger case study, conceptualising the dynamic trends first gives better
results.
as operating normally. The study is restricted to 100 data sets, since increasing the
size of the data set by including more normal operational data makes no difference
to the result, and the restriction keeps the analysis and the presentation of the
results easy to follow. Each data set consists of twenty-one variables, which are
listed in Appendix C. For each variable, a dynamic trend consisting of 256 points is
used to record the response to changes. The size of the data to be analysed is
therefore 100 × 21 × 256.
Table 7.6 The clusters of operational states using ART2.
The projections of the dynamic trends of some variables onto the PCA plane are
shown in Figures 7.10(a) to (l). Only those variables that appear later in the
decision trees are shown in Figure 7.10. In Figure 7.10(a), the regions A, C and D
are clearly distinguished. However, region B is fuzzy. This simply means that the
cases in region B (4-13, 17, 18 and 23-100) cannot be distinguished in Figure
7.10(a); their differences can only be identified by reference to other variables. A
similar explanation applies to Figure 7.10(j), which classifies the responses of the
variable F_D202_in1 into only two classes for all data sets, so other variables are
needed to discriminate the data sets. When all the variables are considered
together, some variables may be more important than others to the classification.
For most of the twenty-one variables, grouping based on visual examination of the
two-dimensional PCA plane is straightforward. Even though for a few variables
the groupings may not be very clear, this does not affect the final result
significantly: when the grouping for one variable is unclear, other variables play
more important roles in the operational state clustering.
Figure 7.10 (a)-(l) PCA two-dimensional plots of variables for the MTBE process.
Figure 7.11 PCA two-dimensional plot of operational states for the MTBE process.
The CSTR case study showed that PCA can classify the operational states
satisfactorily. Here, both PCA and adaptive resonance theory (ART2, introduced in
Chapter 6) are used. The PCA analysis shown in Figure 7.11 results in five
clusters, while the ART2 result, shown in Table 7.6, predicts twelve clusters. Both
results are reasonable, but ART2 gives a more detailed picture and appears to be
more accurate. Sammon [123] indicated that since PCA is a linear method, it may
not give an adequate representation in two or three dimensions when the original
number of attributes is large, so visual examination may not be possible. He also
gave an example in which data generated as five groups in four dimensions are
projected onto the space of the first two principal eigenvectors: visual examination
of this projection shows only four groups, since two of the clusters overlap
completely in the two-dimensional space. In the analysis of process operational
data, similar observations have been made by other researchers [122, 119, 201]. In
the following discussion, only the ART2 clustering result will be used.
It is apparent that the clusters with only one data set, clusters three, four and eight,
are correct. Data sets 1, 2 and 3 in cluster one all cause the flowrate of the C4
hydrocarbons feed to fall to zero and so should be in the same class. The data sets
in cluster two, comprising 4, 5, 7, 8 and 11, cause the methanol flow to the mixer
M201 to be either completely cut off or greatly reduced, so it is reasonable for
them to be in the same class. The data sets in class five, cases 10, 35 and 36,
correspond to changes of the output of the controller FC202D from 33% to 59%,
33% to 50% and 33% to 55% respectively. Cluster six has cases 12, 13, 17 and 18,
representing a reduction or cut in methanol flow to the tank D211 and column
C201. The two cases in class seven, 14 and 15, represent changes in the opening of
the valve HC211D from 18% to 40% and 60% respectively. Cases 19 to 22 in
cluster nine refer to changes of the output of the controller TC201R from 39% to
30%, 20%, 10% and 0%. Cases 23 to 27 correspond to changes in the output of the
controller FC203E from 37% to 33%, 29%, 16%, 6% and 0%, and are classified as
cluster ten. Cluster eleven has four cases, 29-32, corresponding to changes of the
output of the controller FC201D from 40% to 30%, 20%, 10% and 0%. The last
cluster, cluster twelve, has the normal operational data sets 37 to 100. The
assignment of cases 28, 33 and 34 to this cluster is not obvious but is nevertheless
not unreasonable, given that they represent insignificant changes.
Conceptual clustering not only predicts the operational states but also interprets
the prediction using causal knowledge in terms of decision trees or production
rules. Here, a variable takes discrete values from the regions of its two-dimensional
PCA plane. For example, the liquid level at the bottom of column C201, L_C201,
takes values from its PCA plane in Figure 7.10(b), including A, B, C, D and E; for
data case number 24, L_C201 takes the value D. For each data set, the operational
state takes values from Table 7.6; for example, data set 24 has the value ABN10.
The decision tree developed is shown in Figure 7.12 and can easily be translated
into rules, for example the rules that lead to ABN4 and ABN11. The root node is
T_MTBE, the bottom temperature of the reactive distillation column C201,
indicating that it is the most important variable for distinguishing the operational
modes representing the one hundred data sets. Detailed examination of the
decision tree and all the dynamic responses, in conjunction with the MTBE
process flowsheet, revealed that the tree is reasonably good. An example
illustrating this is the rules leading to ABN11. From Table 7.6, it is known that
ABN11 covers data cases 29, 30, 31 and 32, corresponding to changes of the
output of the controller FC201C (reflux flowrate control) in manual mode from
40% to 30%, 40% to 20%, 40% to 10% and 40% to 0%. Clearly the most
important variable discriminating these data sets from the others is the column top
temperature T_C201_top. This is confirmed by Figure 7.12, in which the nearest
node to ABN11 is T_C201_top.
[Figure 7.12 The decision tree developed for the MTBE process; the numbers beneath the leaf nodes (ABN8, ABN7, ABN5, ABN2, ABN11, ABN10 and NORMAL) indicate the data sets assigned to each state.]
In Figure 7.12, the numbers at the bottom of the nodes indicate the data sets; for
instance, the node ABN8 has only one data set, namely 16. Comparing Figure 7.12
and Table 7.6, it is found that the decision tree gives correct predictions except for
the nodes ABN10 and NORMAL.
Data sets 23 and 24 (underlined in Figure 7.12) are assigned to ABN10 by ART2,
as shown in Table 7.6, but to the node NORMAL in Figure 7.12 by C5.0. Data sets
23 and 24 represent the cases where the output of the steam flowrate controller
FC203E at the bottom of the reactive distillation column C201 is changed from
37% to 33% and 37% to 29%. In fact these are insignificant changes, so assigning
them to the NORMAL operational state is acceptable. This inconsistency with
ART2 can be attributed to two factors. Firstly, ART2 is based on numerical values,
which is more precise than the conceptual clustering using C5.0. Secondly, the
conceptualisation of variables by visual examination of the PCA two-dimensional
plane may give rise to some inaccuracies.
Data sets 12, 13, 17 and 18 (in italics in Figure 7.12) are clustered in a separate
class by ART2, as shown in Table 7.6, but are mixed with other cases in the node
labelled NORMAL in Figure 7.12. Referring to Table C1 and Figure C1, all four
cases cause flowrate changes on the methanol stream to tank D211 and then to the
column C201. In reality, this flowrate is very small compared to the total methanol
and C4 hydrocarbons flows to the mixer M201 (about 1/12 and 1/100 respectively).
As a result, the changes of the methanol flow to D211 are insignificant, and it is
reasonable to regard cases 12, 13, 17 and 18 as belonging to the NORMAL
operation class.
In the above discussion, the dynamic trend signals of a variable are converted to
qualitative concepts in a PCA two-dimensional plane before the inductive learning
approach is applied. An alternative is to use the eigenvalues of the first two PCs of
a variable directly. The resulting decision tree derived in this way is shown in
Figure 7.13. The tree is found to be completely unreasonable because most clusters
overlap, so it cannot be used for predictions. Figure 7.14 shows the discretisation
of the two variables T_MTBE and F_D220_out2; the dashed lines are the
boundary values for the discretisation. Since the discretisation of each variable is
always binary, this is obviously not satisfactory. For example, Figure 7.10(i) shows
that the variable L_D201 can clearly take three values; a two-valued discretisation
is not able to capture all the features, which causes the inaccuracy of the tree.
[Figure 7.14 Discretisation of the variables T_MTBE (a) and F_D220_out2 (b); the dashed lines are the discretisation boundaries.]
Inductive learning has been introduced both as a method for analysing data records
averaged over days or weeks and as a conceptual clustering tool for developing
on-line operational monitoring systems. It can learn from a large number of
examples to develop explicit and transparent knowledge in the form of decision
trees and production rules. It is also able to identify the variables that contribute
most to the clustering, which is clearly valuable for analysing process operational
data and for process monitoring. There are, however, several issues that need to be
addressed. Like most inductive learning systems, C5.0 is not recursive: it can only
deal with data as a batch and cannot learn as each example is presented. In
addition, though PCA has proved to be an effective way of extracting concepts
from dynamic trend signals, it is expected that the combination of PCA and
wavelets will deliver more effective pre-processing methods. Compared with
similarity- or distance-based methods, which have been widely studied, conceptual
clustering clearly needs more research attention.
CHAPTER 8
AUTOMATIC EXTRACTION OF KNOWLEDGE RULES
FROM PROCESS OPERATIONAL DATA
Generating fuzzy rules based on the fuzzy set method developed by Wang and
Mendel [205-208] is concerned with the following context. Suppose we are given a
set of input-output data pairs,
(x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)};\; y_1^{(i)}, y_2^{(i)}, \ldots, y_r^{(i)}), \quad i = 1, 2, \ldots, M    (8.1)

where x_1, x_2, ..., x_N are inputs, y_1, y_2, ..., y_r are outputs, and M is the total
number of given data patterns. The procedure for generating fuzzy rules is divided
into five steps, explained here for a problem with two inputs and one output, a
special case of Expression 8.1:
(x_1^{(1)}, x_2^{(1)};\; y^{(1)}),\;\; (x_1^{(2)}, x_2^{(2)};\; y^{(2)}),\; \ldots    (8.2)
Step 1: Divide the input and output variables into fuzzy regions of fuzzy concepts.
Each fuzzy region is represented by a membership function. Figure 8.1(a) shows
the input variable x1 divided into five fuzzy regions, i.e., CE (Centre), B1 (Big 1),
B2 (Big 2), S1 (Small 1) and S2 (Small 2); the shape of the membership function
m(x1) is triangular. x2 is divided into seven regions and y into five regions.
Step 2: Generate fuzzy rules from the given data pairs. For given x_1^{(i)}, x_2^{(i)}
and y^{(i)}, find the corresponding membership values. For example, in Figure
8.1(a), since m(x_1^{(1)}) in B1 = 0.8, m(x_1^{(1)}) in B2 = 0.2 and 0.8 > 0.2, we
take m(x_1^{(1)}) = 0.8 and consider x_1^{(1)} as belonging to B1. By the same
reasoning, from Figure 8.1(b) we have x_2^{(1)} = 0.7 in S1. We can then obtain
one rule from one pair of desired input-output data, e.g.,

(x_1^{(1)}, x_2^{(1)}; y^{(1)})  =>  [x_1^{(1)} (0.8 in B1), x_2^{(1)} (0.7 in S1)], [y^{(1)} (0.9 in CE)]

so we generate Rule 1 as

IF x1 is B1 and x2 is S1, THEN y is CE.

Similarly,

(x_1^{(2)}, x_2^{(2)}; y^{(2)})  =>  [x_1^{(2)} (0.6 in B1), x_2^{(2)} (1 in CE)], [y^{(2)} (0.7 in B1)]

so we generate Rule 2 as

IF x1 is B1 and x2 is CE, THEN y is B1.
The rules generated in this way are only "and" rules. However, if two generated
rules have the same THEN part, they represent an "or" relation.
Figure 8.1 Divisions of the input and output spaces into fuzzy regions and the corresponding membership functions. (a) m(x1), (b) m(x2), (c) m(y).
Step 3: Assign a degree to each rule. Since there are normally many data pairs and
each data pair generates one rule, it is highly probable that there are conflicting
rules, i.e., rules with the same IF part but a different THEN part. The conflict is
resolved by assigning a degree to each rule and accepting only the rule from a
conflict group that has the maximum degree. The degree is calculated as, for
example, for Rule 1,

Degree of Rule 1 = m_{B1}(x_1)\, m_{S1}(x_2)\, m_{CE}(y)\, m^{(1)} = 0.8 \times 0.7 \times 0.9 \times m^{(1)} = 0.504\, m^{(1)}    (8.3)

where m^{(1)} is a degree given by a human expert reflecting the importance and
trustworthiness of the data pair.
Step 4: Create the combined fuzzy rule base. The combined fuzzy rule base is
created by filling the boxes of Figure 8.2 with the rules produced in the above
steps. If there is more than one rule in a box, use the rule with the maximum degree.
[Figure 8.2 The combined fuzzy rule base: a grid whose axes are the fuzzy regions of x1 (S2, S1, CE, B1, B2) and x2 (S3, S2, S1, CE, B1, B2, B3).]
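Steps 1 to 4 can be condensed into a short Python sketch: triangular memberships over assumed region centres, one rule per data pair, a degree per rule (Equation 8.3, with the expert degree taken as 1), and conflict resolution by keeping the highest-degree rule for each IF part. All numerical values here are invented.

def tri(x, centre, half_width=0.5):
    # triangular membership of x in the fuzzy region centred at `centre`
    return max(0.0, 1.0 - abs(x - centre) / half_width)

regions = {'S2': -1.0, 'S1': -0.5, 'CE': 0.0, 'B1': 0.5, 'B2': 1.0}

def best_region(x):
    name = max(regions, key=lambda r: tri(x, regions[r]))
    return name, tri(x, regions[name])

rule_base = {}  # maps an IF part to (THEN part, degree)
data = [(0.4, -0.45, 0.05), (0.45, 0.0, 0.55)]  # (x1, x2, y) pairs
for x1, x2, y in data:
    (r1, m1), (r2, m2), (ry, my) = best_region(x1), best_region(x2), best_region(y)
    degree = m1 * m2 * my  # Equation 8.3 with expert degree m = 1
    key = (r1, r2)
    if key not in rule_base or degree > rule_base[key][1]:
        rule_base[key] = (ry, degree)  # resolve conflicts by maximum degree

for (r1, r2), (ry, d) in rule_base.items():
    print(f"IF x1 is {r1} and x2 is {r2}, THEN y is {ry}  (degree {d:.3f})")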
The approach uses fuzzy set algorithms and is simple and straightforward. The
rules generated by this method may be repetitive and the set of rules may be large
[209]. Srinivasan et al. [210] improved this aspect of the method by using
inductive learning to determine the fuzzy rules. Other methods of generating fuzzy
rules from numerical data using fuzzy sets have slightly different procedures and
are not discussed here [211-216].
The rule generation procedure from neural networks by Fu [217] can be illustrated
by reference to Figure 8.3. A generated rule has the form

IF the premise, THEN the action (conclusion).

Specifically, for the case in Figure 8.3,

IF A_i, ..., A_i, ..., ¬A_j, ..., ¬A_j, ..., THEN C (or ¬C)

where A_i is a positive antecedent (an attribute in the positive form), ¬A_j is a
negative antecedent (an attribute in the negative form), C is the concept
(conclusion), and ¬ reads "not". Each node in the hidden or output layer is
designated by a symbol which represents a concept to be confirmed or
disconfirmed, such as node C in Figure 8.3. Confirmation (or disconfirmation) of a
node concept is measured by the activation of the node, which is calculated by
O_j = F\left(\sum_i w_{ji} x_i - \theta_j\right)    (8.4)
where
O_j is the activation of node j (a hidden or output node);
w_{ji} is the weight on the connection from unit i to unit j;
θ_j is the threshold on unit j, which is adjustable;
F is the activation function representing the nonlinearity of the hidden and output nodes, defined in the sigmoid form

F(a) = \frac{1}{1 + e^{-a}}    (8.5)
Figure 8.3 The pos-atts and neg-atts relative to a concept. An attribute can be either a pos-att or a neg-att, depending on the concept.
Learning production rules by the rough set approach has been studied by Chan
[219], Srinivasan and Chan [210], Chmielewski et al. [220], Quafafou and Chan
[209] and Ziarko [226]. The approach makes use of the rough set theory introduced
by Pawlak [221, 222, 223]. Let U be a nonempty set called the universe, and R be
an equivalence relation on U called the indiscernibility relation. The ordered pair
A = (U, R) is called an approximation space. Each subset X of U is characterised
by a pair of sets, the lower and upper approximations of X in A. The lower
approximation of X in A is defined as

\underline{R}X = \{x \in U \mid [x]_R \subseteq X\}

and the upper approximation of X in A is defined as

\overline{R}X = \{x \in U \mid [x]_R \cap X \neq \emptyset\}

in which [x]_R denotes the equivalence class of R containing x. A subset X of U is
said to be definable in A iff (if and only if) \underline{R}X = \overline{R}X.
The rule generation procedure based on the above concepts is best illustrated by
reference to the example shown in Table 8.1.
The universe U is the set of example objects,

U = {e1, e2, ..., e8}

and each object is characterised by the attributes Density, Colour, Boiling range
and Class. The attributes are divided into condition attributes and decision
attributes. (The production rule for a single example object is IF condition
attributes THEN decision attributes; for e1, for example, IF Density = Low,
Colour = Light and Boiling range = Narrow, THEN Class = 1.) The automatic rule
generation method depends on the calculation of the lower and upper
approximations of classes 1 and 2 (denoted as X1 and X2) as well as on definability.
The concept "class = I" is the subset Xl = {el, e2, e4, e5, e7} and the concept
"class = 2" is the subsetX2 = {e3, e6, e8}.
These define the relations on U.
The relations on U can also be defined on the basis of single attributes, then we
have the following attributes on U induced by those equivalence relations on U.
{Density} * = {{el, e5, e8}, {e2, e3, e4, e6, e7}},
{Colour} * = {{el, e3, e7, e8}, {e2, e4, e5}, {e6}},
{Boiling range}*={ {el, e2, e7}, {e3, e4, e5, e6, e8}}, and
{Class}*= {{el, e2, e4, e5, e7}, {e3, e6, e8}}
Table 8.1 A training set for the rough set approach represented by a decision table.
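The lower and upper approximations are simple to compute once the equivalence classes are known. The sketch below uses hypothetical attribute values, chosen only to reproduce the partition {Density}* quoted above; it illustrates the definitions rather than any particular rough set software.

from collections import defaultdict

density = {'e1': 'Low', 'e5': 'Low', 'e8': 'Low',
           'e2': 'High', 'e3': 'High', 'e4': 'High', 'e6': 'High', 'e7': 'High'}

def partition(attr_values):
    # the equivalence classes [x]_R induced by a single attribute
    classes = defaultdict(set)
    for obj, val in attr_values.items():
        classes[val].add(obj)
    return list(classes.values())

def approximations(blocks, X):
    lower = set().union(*[b for b in blocks if b <= X])  # [x]_R contained in X
    upper = set().union(*[b for b in blocks if b & X])   # [x]_R meets X
    return lower, upper

X1 = {'e1', 'e2', 'e4', 'e5', 'e7'}  # the concept "class = 1"
low, up = approximations(partition(density), X1)
print(low, up)  # X1 is definable with respect to Density iff low == up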
We have developed a fuzzy neural network method for generating production rules
from data [174]. A conventional NN has real-number inputs and weights. There are
three main types of fuzzy neural networks (fuzzy NNs) [156]: fuzzy NNs with
fuzzy input signals but real-number weights, fuzzy NNs with real-number input
signals but fuzzy weights, and fuzzy NNs with both fuzzy input signals and fuzzy
weights. The first type is used here. The method involves the following steps:
(1) The input and output variables are divided into fuzzy regions. Figure 8.4
shows examples of such regions, such as (Normal, High and Low), or extended to
five values (Normal, Medium High, High, Medium Low and Low).
(2) All input and output data patterns are processed using the membership
functions from step (1). For example, a value of 600°C for a temperature may be
regarded as High (μ_H = 0.7) when expressed as a fuzzy variable.
(3) A fuzzy NN is constructed. To illustrate the fuzzy NN structure, consider a
fuzzy NN with two input variables (TC1 and FC1) and a single output variable
(TH11), as shown in Figure 8.5. The box contains a normal NN which has one
input and one output layer together with a single hidden layer. The activation
functions of the neurons in both the hidden and output layers are sigmoidal. Thus,
if the normalised value for TC1 is 0.7 and it has a fuzzy membership function of
the type shown in Figure 8.4(c), then the outputs of the three μ neurons
corresponding to TC1 would be
μ_{L,TC1} = 0,   μ_{N,TC1} = 0,   μ_{H,TC1} = 1.

If the normalised value for TC1 is 0.7 and it instead has the fuzzy membership function of Figure 8.4(d), then the outputs of the three μ neurons corresponding to TC1 would take graded values between 0 and 1 rather than being crisp.
"
.~
u
: : lL-_L_---L__ M_L_.L-_N_---L_M_H_...l-_H_-L-_~
-=:"
(a)
c. 1.0
~~
"O,O~
..0
E (b)
"
E 1.0
»
N
N
::>
u...
0.0 1 I I
L N H
(c)
t ~
1.0
L H
0.0
0,7 (d)
Normalized values of process variables
Figure 8.4 Divisions of input and output variables into fuzzy regions and the
corresponding membership functions.
(4) Use the data produced in steps (2) and (3) to train the fuzzy neural network
shown in Figure 8.5. The parts outside the dashed-line box are the fuzzy
representation of the input and output data; the part inside the dashed-line box is a
normal three-layer back-propagation neural network. The training procedure uses
error backpropagation, which adjusts the weights between the input and output layers.
The fuzzy rules obtained by training the network are of two types. One is of the
following form:

IF TC1 is High AND FC1 is Low, THEN TH11 is High.    (8.7)
This means that if the value of a variable belongs to a fuzzy class, it belongs to it
unambiguously. The other type of rule obtained from training the network is of the
following form:

IF TC1 is High (μ_{H,TC1}) AND FC1 is Low (μ_{L,FC1}), THEN TH11 is High (μ_{H,TH11}).    (8.8)
If the membership functions of the variables of a fuzzy NN have the form of Figure
8.4(a) or (c), then the rules generated take the form of Expression 8.7, i.e., rules
without fuzzy membership values. The maximum number of rules is determined by
the network topology. If there are N_OV input variables and each variable takes
N_FV fuzzy values (i.e., each input variable has N_FV corresponding nodes,
assuming all input variables take the same number of fuzzy values), then the
maximum number of rules will be N_FV^{N_OV}. This is only the maximum; in
practice the number is likely to be smaller, since it depends on the range of the data
used to generate the rules.
For the fuzzy NN structure shown in Figure 8.5, having two input variables and
one output variable with each variable taking three fuzzy values, the maximum
number of rules will be 3^2 = 9. If each variable takes five values (High, Medium
High, Normal, Medium Low and Low), then the maximum number of rules is
5^2 = 25.
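A minimal numerical sketch of such a fuzzy NN is given below: the two inputs are fuzzified into Low/Normal/High membership values which feed a small back-propagation network predicting the memberships of the output. scikit-learn's MLPRegressor stands in for the three-layer network of Figure 8.5, and the membership boundaries and "plant" data are invented.

import numpy as np
from sklearn.neural_network import MLPRegressor

def fuzzify(x):
    # triangular Low/Normal/High memberships on normalised [0, 1] values
    centres = np.array([0.0, 0.5, 1.0])
    return np.clip(1.0 - np.abs(x - centres) / 0.5, 0.0, 1.0)

rng = np.random.default_rng(0)
fc1 = rng.uniform(0, 1, 87)
tc1 = rng.uniform(0, 1, 87)
th11 = np.clip(0.5 - 0.4 * (fc1 - 0.5) + 0.4 * (tc1 - 0.5), 0, 1)  # toy plant

X = np.hstack([np.array([fuzzify(v) for v in fc1]),
               np.array([fuzzify(v) for v in tc1])])  # 6 input neurons
Y = np.array([fuzzify(v) for v in th11])              # 3 output neurons

net = MLPRegressor(hidden_layer_sizes=(6,), activation='logistic',
                   max_iter=5000).fit(X, Y)
print(net.predict(X[:1]))  # predicted (L, N, H) memberships for TH11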
Consider a shell-and-tube heat exchanger of the type illustrated in Figure 8.6(a)
[224]; the steady-state values are also shown in the figure. Assuming no leaks, the
cause-effect diagram for the variables is shown in Figure 8.6(b). For simplicity,
attention is restricted to fixed values of TH1 and FH1, so only FC1 and TC1
change. If each input variable takes three fuzzy values, i.e., Normal, High and
Low, the situation is as depicted by Figure 8.4(c). A fuzzy NN structure for the
heat exchanger is shown in Figure 8.5, and there will be a maximum of 3^2 = 9
rules.
The boundary values of the variables for Low, Normal and High are indicated in
Table 8.2. 87 input-output patterns are obtained by changing FC1 and/or TC1
based on Table 8.3. Details of the simulation results are not given, to save space,
but the resulting data patterns (shown in Table 8.4) are given; they are obtained by
fuzzification of the simulation results using the fuzzy membership function of
Figure 8.4(c) and the boundary values of Table 8.2. Each pattern in Table 8.4 is a
different form of a production rule; data pattern 1, for example, corresponds to one
such rule. If there are no conflicts in any of the data patterns, as in the case of data
patterns 1, 2 and 3, which give the same rule, then Table 8.4 is the final result.
More often than not, however, there are conflicts in the data patterns, i.e., rules
with the same IF part but different THEN parts; data patterns 7 and 7* are
examples of this form. Since 7 and 7* have the same IF part but different THEN
parts, the conflict has to be resolved. The approach proposed here is to use the
neural network to train on the data patterns in Table 8.4. Since neural networks are
capable of dealing with noise, the noisy data patterns can be identified and
neglected. To demonstrate this, Table 8.4 contains 10 data patterns having noise,
which illustrate the conflicts; these are marked with a "*", e.g., 7*. The result is an
increase in the number of data patterns from 87 to 97.
The 97 data patterns are fed to a back-propagation neural network (Figure 8.5)
having 6 inputs (FC1 Low, FC1 Normal, FC1 High; TC1 Low, TC1 Normal, TC1
High) and 3 outputs (TH11 Low, TH11 Normal, TH11 High), with 6 hidden
neurons. The activation functions of the output and hidden layers are sigmoidal.
The trained results after 5000 iterations are shown in Table 8.5.
Table 8.5 gives the targets and predictions of TH11 for all 97 data patterns of
Table 8.4. For example, for data pattern 1 the target values for TH11 are L = 0,
N = 1 and H = 0, and the predictions are L = 0.1, N = 0.85 and H = 0.06. It was
found that for all 87 data patterns obtained from simulation very good predictions
are obtained, but for all 10 of the noisy data patterns the errors are very large. The
conflicts can therefore be resolved by comparing the errors between the target
values and predictions, keeping the rules with small errors and abandoning those
with large errors.
The conflicting data patterns 7 and 7* can be used as an example. Comparing the
results in Table 8.5 shows that the error for 7* is very large while that for 7 is
small: the square roots of the sums of squared errors for data patterns 7 and 7* are
0.26 and 1.15 respectively, so 7* should be rejected. The same conclusion is
reached by examining the other 9 noisy data patterns. The rules generated are
summarised in Table 8.6.
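The error comparison itself is elementary; a sketch follows, with hypothetical target and predicted membership triples standing in for the entries of Table 8.5.

import math

def rss_error(target, predicted):
    # square root of the sum of squared errors, as used in the text
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)))

prediction = (0.10, 0.85, 0.20)  # hypothetical network output (L, N, H) for TH11
err_7 = rss_error((0, 1, 0), prediction)      # small error: keep pattern 7
err_7star = rss_error((1, 0, 0), prediction)  # large error: reject pattern 7*
print(round(err_7, 2), round(err_7star, 2))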
Table 8.2 The boundary values of low, normal and high for TH11, FC1 and TC1.

                L (Low)                N (Normal)                                 H (High)
FC1 (lb/sec)    < 0.627 FC1,steady     [0.627 FC1,steady, 1.42 FC1,steady]        > 1.42 FC1,steady
TC1 (°F)        < 0.55 TC1,steady      [0.55 TC1,steady, 1.6 TC1,steady]          > 1.6 TC1,steady
Figure 8.6 The shell-and-tube heat exchanger (a) and its cause-effect diagram (b). (Steady-state values: TC1 = 100 °F, FC1 = 20000 lb/sec for stream Cold-I; TH1 = 400 °F, FH1 = 20000 lb/sec for stream Hot-I; TC11 = 200 °F for stream Cold-II.)
Table 8.4 Fuzzification results of the data patterns obtained through simulation using the membership function of Figure 8.4(c).
Table 8.5 Target (T) and predicted (P) fuzzy membership values for TH11.
If the fuzzy membership functions take the form shown in Figure 8.4(b) or (d),
then rules of the form of Expression 8.8 are obtained. Translating all the data into
rules of this form gives as many rules as there are data patterns, so an important
step is to reduce the number of rules. Two approaches are proposed and described
next.
One way to reduce the number of rules is to reject those that are less reliable.
There are two situations which make rules less reliable. One is that, in Expression
8.8, μ_{H,TC1}, μ_{L,FC1} and μ_{H,TH11} are small; small membership values
mean ambiguity in identifying which fuzzy values should be used. The other is
that the error ER for the data pattern is large when training is completed. It is
therefore useful to define a confidence factor CF representing reliability:

CF = \mu_{H,TC1}\, \mu_{L,FC1}\, \mu_{H,TH11} \,/\, ER    (8.9)

A λ-cut value can then be defined as the threshold for a rule to be worth keeping:
if the confidence factor of a rule is smaller than the λ-cut value, the rule is
abandoned. Adjusting the λ-cut value changes the number of rules generated.
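The confidence factor and λ-cut filtering of Equation 8.9 amount to a few lines of code; the rule list below is invented for illustration.

rules = [
    # each rule carries its IF/THEN membership values and training error ER
    {'name': 'rule 1', 'mus': (0.8, 0.7, 0.9), 'er': 0.26},
    {'name': 'rule 2', 'mus': (0.3, 0.4, 0.5), 'er': 1.15},
]

def confidence(rule):
    m1, m2, my = rule['mus']
    return m1 * m2 * my / rule['er']  # Equation 8.9

lam = 0.6  # the lambda-cut threshold
kept = [r['name'] for r in rules if confidence(r) >= lam]
print(kept)  # raising lam keeps fewer, more reliable rules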
Returning to the heat exchanger, the simulation results shown in Table 8.3 are first
converted to fuzzy patterns using the corresponding fuzzy membership functions.
The function used in this case is represented in Figure 8.4(b) and (d) and described
by the expression

y = e^{-|x|^{l}}    (8.10)

where x varies in the range [-2, 2] and l from 0.51 to 5.0; in this example, l = 1.0.
The boundary values for Low, Normal and High for the three variables (FC1, TC1
and TH11) are shown in Table 8.7. They differ slightly from those in Table 8.2,
but this does not affect the principle being demonstrated.
By applying the fuzzy membership functions of Equation 8.10 to the 87 simulation
data patterns corresponding to Table 8.3, the data patterns can be converted to
fuzzy rules of the following form:

IF FC1 is Normal (0.61) and TC1 is Normal (1.0), THEN TH11 is Normal (0.67).
All 87 fuzzified data patterns are input to the neural network with 6 hidden
neurons shown in Figure 8.5. The network has six input neurons (FC1 Low,
Normal and High, as well as TC1 Low, Normal and High) and three outputs
(TH11 Low, Normal and High).
After training, the confidence factors CF are as shown in Table 8.8. A large value
of CF means a more reliable rule, so a λ-cut value is used to select the desired
number of rules. By changing the λ-cut value it is possible to change the number
of rules, as demonstrated in Table 8.9. For example, with a λ-cut value (×10) of
6.0 there are 17 rules, and with 5.0 there are 21 rules. The rules for the former
case are shown in Table 8.10. The λ-cut approach can effectively select the
desired number of rules, which can then be used by a fuzzy expert system shell to
develop an expert system. In applications, when the input data do not exactly
match the IF parts, interpolation and extrapolation are required.
Table 8.7 Boundary values of the three variables for Low, Normal and High.

        L         N                H
TH11    <280      [280, 320]       >320
FC1     <12000    [12000, 28000]   >28000
TC1     <40       [40, 160]        >160
Table 8.8 The confidence factors (×10) of the rules after training (excerpt: 24.01, 32.23, 3.07, 0.89, 0.41, ..., 0.21, 0.35, 0.79).
Table 8.9 The number of rules generated changes with the λ-cut value.

λ-cut value (×10)   1.0   1.5   2.0   3.0   5.0   6.0   8.0   9.0   10.0
Number of rules      48    42    32    25    21    17    11     8      6
Although the λ-cut can effectively select the desired number of rules, the rules
obtained do not retain all the information inherent in the original data; some
information is lost. The neuro-expert system approach is different in that it can
make use of all the information in the data. It requires writing only the basic
number of rules (i.e., N_FV^{N_OV}). For two inputs, where each variable takes
three values, the basic number of rules is 3^2 = 9. These rules are of the form of
Expression 8.11, in which μ_{H,x1}, μ_{L,x2} and μ_{MH,y} are variables that can
take any values in the range [0, 1]:

IF x1 is High (μ_{H,x1}) AND x2 is Low (μ_{L,x2}), THEN y is Medium High (μ_{MH,y}).    (8.11)
When the expert system is used, it receives the fuzzy values of x1 and x2 and the
corresponding μ values. The expert system gives the fuzzy values for y and uses
the trained neural network algorithm (based on the weights) to calculate the μ for
y. It is thus essentially the same as the neural network, but with the important
additional feature that it is able to give explanations of how its conclusions, such
as those for the heat exchanger, are reached.
8.5 Discussion
Both neural networks and expert systems are now starting to realise their potential
in developing intelligent operational decision support systems. Knowledge used by
an expert system to reach conclusions is transparent and can be displayed through
the HOW and WHY explanation facilities. However, a critical problem with expert
systems is knowledge acquisition. The methods introduced in this chapter are able
to extract production rules automatically from data and are therefore potentially
very useful.
The fuzzy neural network method is similar to the method of rule generation by
fuzzy set operations [205, 207] in the first two steps. The major difference is in
dealing with conflicts. The fuzzy set operation method calculates the degree of a
rule using a factor derived from human experts, which is not practical in many
applications, and efficiency can be a serious problem when large amounts of data
are involved. The fuzzy NN described in this chapter is not an incremental
approach; Frayman and Wang [225] developed a recurrent fuzzy NN which is able
to learn incrementally.
The rough set method [219] was initially not able to generate fuzzy rules.
Quafafou and Chan [209] improved the method so that it can generate fuzzy rules,
but there are still limitations, as indicated by Quafafou and Chan [209]. The neural
network method of Fu [217] depends on the explanation of a properly trained
structured neural network: the nodes represent concepts and the branches describe
cause-effect relations. However, since hidden neurons are used, the knowledge
obtained includes concepts which have no physical significance. Another problem
is that the number of rules increases dramatically with the size of the problem.
Chan [219] classified the various methods into two categories: incremental and
non-incremental learning. The former works on one example at a time and the
latter on a mass of training examples. The fuzzy set operation and rough set
methods belong to the incremental group. The neural network method of Fu [217]
is non-incremental in generating the trained neural network but becomes
incremental when the weights are explained. The method proposed in this chapter
is incremental. A disadvantage is that some information contained in the numerical
data may be lost during conflict filtering, which is an essential step in incremental
methods; this is unlike the NN, which takes account of all the training data in
obtaining the weights. However, incremental learning has the advantage of being a
recursive method: as new data become available, it updates the system and changes
the existing knowledge base. This is particularly helpful in on-line applications,
where new data are continuously made available for updating.
Another very effective technique for the automatic generation of rules from data
is inductive learning, which was introduced in Chapter 7 and is therefore not
repeated here.
CHAPTER 9
INFERENTIAL MODELS AND SOFTWARE SENSORS
A limitation of FFNN models is that they are data intensive, because they are
trained rather than programmed [145]. Therefore, when such a model is developed,
it is important to divide the data into training and test sets so that the model can be
validated. In addition, the trained model needs to be periodically updated to
maintain its accuracy. The traditional way of dividing a database into training and
test sets is random sampling, which has major shortcomings. If more of the
sampled test patterns belong to sparse regions of the high-dimensional parameter
space, the results will not be reliable, because the training patterns from those
regions are inadequate. On the other hand, if more of the sampled test patterns
come from dense regions, the model will give over-optimistic estimates. Another
problem is deciding when the model needs to be
retrained. It is generally assumed that if the values of all input variables are within
the boundaries of the training data, the model is reliable, but this assumption is
questionable. Given a new data pattern with input attributes within the boundaries
of the training data, it is still difficult to tell whether it is covered by the training
data, since the new data represents a point in a very high-dimensional, non-linear
parameter space.
To address this issue, we have proposed a method which combines an
unsupervised clustering approach with an FFNN, using unsupervised clustering to
automatically group the data into classes so that data patterns within a class can be
distinguished from those in other classes, thereby forming a basis for sampling test
data patterns [227]. Application of the method to the development of a software
sensor for the main fractionator of a refinery fluid catalytic cracking process is
described to illustrate the approach.
Figure 9.1 schematically depicts the method, which is divided into two stages,
model development and model maintenance. In model development, the data are
first grouped into classes using the unsupervised clustering system AutoClass, a
Bayesian clustering approach introduced in Chapter 6 which can automatically
group multivariate data patterns into clusters. Test patterns for the FFNN are
sampled from each class in proportion to the size of the class. During model
maintenance, it is necessary to decide when the model needs to be retrained.
Because of the high dimensionality and large volume of data, it is difficult to carry
out this analysis manually. The approach used here mixes new data with older data
and then analyses them using AutoClass. If any of the following three situations
arises, the model needs to be retrained.
• New classes are formed that cover mainly new data. This means that the new
data are located in new parameter spaces.
• Some new data are assigned to small-sized existing classes. This implies that the
parameter spaces covered by the small-sized classes are insufficiently trained. When
new data are available, the model should be retrained to improve its performance.
• New data are assigned to large classes and the degree of confidence estimated
using the old model for the new patterns is low.
Other unsupervised machine learning approaches introduced in Chapter 6 could
also be used. AutoClass is used here because its degree of autonomy is high. For
example, users are not required to give a threshold for the similarity measure. The
algorithm was discussed in detail in Chapter 6 so will not be repeated here.
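The class-proportional sampling of test patterns can be sketched as follows. A Gaussian mixture model is used here purely as a stand-in for AutoClass (both perform Bayesian clustering of real-valued attributes), and the data are synthetic; the class count and test fraction are assumptions for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 14)), rng.normal(4, 1, (46, 14))])

labels = GaussianMixture(n_components=7, random_state=0).fit_predict(X)

test_idx = []
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    k = max(1, round(0.2 * len(members)))  # sample ~20% of each class for testing
    test_idx.extend(rng.choice(members, size=k, replace=False).tolist())

train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
print(len(train_idx), len(test_idx))  # e.g. 116 training and 30 test patterns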
[Figure 9.1 The two stages of the method: model development and model maintenance.]
This section describes the application of the above method to an industrial process
for developing a software sensor to predict product quality.
The fluid catalytic cracking process of the refinery has been described in Chapter
4 (Figure 4.12). One of the products is light diesel, which is typically characterised
by its temperature of condensation. Previously the temperature of condensation
was monitored by off-line laboratory analysis. The sampling interval is between
four and six hours, and it is not practical to sample more frequently since the
procedure is time-consuming. This deficiency of off-line analysis is obvious, and
the time delay is a cause for concern because control action is delayed. Moreover,
laboratory analysis requires sophisticated equipment and analytical technicians.
Clearly there is great scope for a software sensor to carry out such monitoring
on-line.
Fourteen variables are used as inputs to the FFNN model, as shown in Figure
9.2; these variables are also summarised in Table 4.1.
Initially, 146 data patterns (Dataset-1) from the refinery were used to develop the model (Model-1). Later, three further sets of data became available, comprising 84 (Dataset-2), 48 (Dataset-3) and 25 (Dataset-4) patterns respectively. Model-1 has therefore been modified three times, the successive versions being denoted Model-2, Model-3 and Model-4. There are interesting and different results for each model improvement, as described below. In all cases, the accuracy of the model is required to be within ±2°C.
The data set (Dataset-1) used for model development comprises 146 patterns. As illustrated in Figure 9.1, the first step is to process the data using AutoClass. This requires selecting a density distribution model for all attributes (i.e., all the inputs to the FFNN, Figure 9.2). All variables are real-valued and no data are missing, so a Gaussian model is used.
AutoClass predicts seven classes (numbered 0, 1, ..., 6), as shown in Table 9.1. For example, class 0 has 32 members (i.e., 32 data patterns). Test data are sampled from each class and are indicated in bold and underlined in Table 9.1: more data patterns are sampled from larger classes and fewer from smaller classes.
Altogether there are 30 data patterns used for testing and 116 for training. The
procedure is summarised in Table 9.2.
The assignment of a data pattern to a class is fuzzy, in the sense that there is a membership probability. Examples of membership probabilities are shown in Table 9.3. For instance, data pattern 17 (the third row in Table 9.3) has a membership probability of 0.914 to class 0, 0.055 to class 1 and 0.030 to class 4; it is therefore assigned to class 0.
The FFNN software sensor model (Model-1) is obtained when the training error reaches 0.424. With normalised [0, 1] training data, the error is calculated as $\frac{1}{2}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the prediction by the model for the $i$th training pattern and $y_i$ the target value. There are three training patterns and two test patterns with absolute errors exceeding the required ±2°C. Therefore the degree of confidence for the training data is 100% - (3/116) × 100% = 97.4%, and that for the test data is 100% - (2/30) × 100% = 93.3%. A degree of confidence of 90% is considered acceptable by the refinery.
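For reference, a minimal sketch of the two quantities used above (names are illustrative):

import numpy as np

def training_error(y_pred, y_true):
    # 0.5 * sum of squared prediction errors over the normalised patterns
    return 0.5 * np.sum((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def degree_of_confidence(y_pred, y_true, tol=2.0):
    # percentage of patterns whose absolute error is within the +/-2 C band,
    # e.g. 113 of 116 training patterns -> 97.4%
    within = np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= tol
    return 100.0 * np.mean(within)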
Table 9.1 Classification of Dataset-1 and test data selection for FFNNa.

Class 0 (weight 32): 5 6 16 17 31 32 34 35 36 37-39 40 41-43 81 103 104 105 108 109 110 112 114 115 116 118 136 137 138 139
Class 1 (weight 31): 1 10 25 26 27 44 45 46 47-50 51 52 76 79 80 82 120 121 122-124 125 126 128 129 132 133 134 135
Class 2 (weight 29): 2 3 4 15 18 19 64 65 69 70 72 73 74 75 83 84 85 86-88 89 90 91 93 94 95 96 97 146
Class 3 (weight 26): 11 12 13 14 53 54 55 56-59 60 61-63 66 67 68 71 77 78 92 98 99 100 143
Class 4 (weight 11): 7 9 28 29 33 111 117 127 130 131 145
Class 5 (weight 9): 20 21 30 101 102 106 140 141 142
Class 6 (weight 8): 8 22 23 24 107 113 119 144

a Data in bold and underlined are test data; weight = number of members.
Table 9.2 Test and training data from the 146 patterns in Dataset-1.

Classes                           0   1   2   3   4   5   6
Class weight                     32  31  29  26  11   9   8
Number of patterns for training  25  24  23  21   9   7   7
Number of patterns for test       7   7   6   5   2   2   1
The 84 new data patterns were combined with Dataset-1 and processed by AutoClass. It was found that the 230 data patterns (84 + 146) were classified into 15 classes. Interestingly, all 84 new data patterns in Dataset-2 were classified into seven new classes (classes 2, 5, 7, 8, 9, 13, 14), while all 146 patterns in Dataset-1 fell into eight classes (classes 0, 1, 3, 4, 6, 10, 11, 12). This is consistent with the poor predictions for Dataset-2 using Model-1. A summary of the classification is given in Table 9.4, where the data in italic are from Dataset-2. The data patterns in bold and underlined are chosen for testing, and the rest for retraining to generate Model-2. Altogether 49 data patterns are chosen for testing, of which 19 are from Dataset-2 and 30 from Dataset-1. The degree of confidence for the training data is 100% - (13/181) × 100% = 92.8% and for the test data 100% - (3/49) × 100% = 93.9%.
Later, a further 48 data patterns (Dataset-3) were provided. Model-2 was applied to Dataset-3 and the degree of confidence was found to be 100% - (27/48) × 100% = 43.8%. This is low, but better than the prediction for Dataset-2 using Model-1 (25.0%). The 48 new data patterns were then mixed with Dataset-1 and Dataset-2 and processed by AutoClass to give sixteen classes (a detailed classification is given in Table 9.5). Twenty-five of the forty-eight data patterns form a new class (class 1 in Table 9.5, data patterns 251 - 275), while the rest are assigned to classes shared with data from Dataset-1 and Dataset-2. The degrees of confidence using Model-2 to predict classes 1, 2, 3, 8 and 11 are summarised in Table 9.6. It can be seen that class 1 contains only new data patterns and has the lowest degree of confidence for prediction using Model-2: nineteen of its twenty-five patterns have deviations bigger than ±2°C. Classes with fewer new data patterns show a higher degree of confidence in the predictions. This result further demonstrates the advantage of using AutoClass to cluster the data before training a FFNN model. It is also found that the confidence of a model in predicting new data is lower when more of the data are grouped into new classes: all patterns in Dataset-2 were grouped into new classes, and the confidence of predicting Dataset-2 using Model-1 is 25.0%, while some patterns in Dataset-3 were classified into old classes, and the confidence of predicting Dataset-3 using Model-2 is 43.8%.
The 48 new data patterns in Dataset-3 were then combined with Datasets-1 and -2 to develop Model-3. The sampling of test data patterns follows the same procedure as before. A total of 68 data patterns was used for testing, with the remaining 210 patterns for training. Model-3 has a degree of confidence of 92.6% (= 100% - (5/68) × 100%) for the test data and 93.8% (= 100% - (13/210) × 100%) for the training data.
Dataset-4 was subsequently obtained from the refinery and has 25 new data patterns. The confidence of applying Model-3 to Dataset-4 is 100% - (8/25) × 100% = 68.0%. Dataset-4 was then combined with the previous data sets to give a database of 303 patterns, which were classified by AutoClass into fifteen classes. 77 data patterns were selected as test data and the rest for training to develop the FFNN Model-4. The model has a confidence of 92.5% (= 100% - (17/226) × 100%) for the training data and 90.9% (= 100% - (7/77) × 100%) for the test data. In order to make a comparison with the conventional sampling approach, 77 data patterns were also selected as test patterns using random sampling. The training was terminated using the same criterion, i.e., a training error of 5.75e-1. On this basis, 11 test data patterns have errors exceeding ±2°C, so the confidence for the test data is 100% - (11/77) × 100% = 85.7%. This is lower than that of the test data sampled by clustering, which is 90.9%. The confidence of the training data is the same for both approaches, 92.5%.
Figure 9.3 Comparison between predictions using Model-4 and the target values for
the first 151 data patterns.
Figure 9.4 Comparison between predictions using Model-4 and the target values for
the last 152 data patterns.
Model-4 (i.e., the model developed using all the data sets) covers all the operational seasons and is being used very satisfactorily. It has proved to be robust and reliable over a wide range of operating conditions. Figures 9.3 and 9.4 show the differences between the predictions of Model-4 and the target values for the 303 patterns. Variations in the input and output variables are summarised in Table 9.7. The plant has reduced the sampling frequency to twice a day, compared with the four to six times a day used previously, and the intention is to reduce it to once a day in the future.
In applying neural networks to software sensor design, researchers tend to include as many variables as possible to ensure that no relevant variables are omitted. This can result in a very large input dimension, which may have adverse effects. For a fixed number of training data patterns, as the number of input variables increases the data become sparser in the multi-dimensional space, which degrades the learning performance. The generality of the learned model may also be reduced by the inclusion of irrelevant or unimportant input variables. An increased number of inputs also means more connection weights to be determined, and therefore longer training times. Apart from irrelevant and unimportant variables inflating the input dimension, there may also be correlations between the input variables. Correlated inputs make the model more sensitive to the statistical peculiarities of the particular sample; they accentuate the overfitting problem and limit generalisation, a common occurrence in linear regression.
Use of prior knowledge, if available, to identify irrelevant or correlated variables is clearly useful. Otherwise, mathematical techniques are required to solve the problem. Several such techniques are examined in the following.
Since the first few PCs are able to capture most of the variance of the original explanatory variables, it is possible to use only the first few PCs, rather than the explanatory variables themselves, as the inputs to a FFNN model. Since normally not all PCs are required, the dimension of the inputs is reduced. In addition, the PCs are linearly uncorrelated and orthogonal.
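As an illustration of the idea, the sketch below (with scikit-learn's MLPRegressor standing in for a FFNN; all names and settings are assumptions of the example) keeps just enough PCs to capture a chosen fraction of the variance and trains a small network on the scores.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

def pca_reduced_ffnn(X, y, variance_to_keep=0.99):
    # keep just enough PCs to capture the requested fraction of the variance
    pca = PCA(n_components=variance_to_keep).fit(X)
    scores = pca.transform(X)   # uncorrelated, lower-dimensional inputs
    print("PCs kept:", pca.n_components_,
          "variance captured:", pca.explained_variance_ratio_.sum())
    net = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000).fit(scores, y)
    return pca, net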
McAvoy et al. [228] applied FFNNs to fluorescent spectral analysis data for predicting the composition of a binary mixture. Thirty data points were sampled from each fluorescent spectrum and used as the inputs of a FFNN model whose output is the composition of one of the components. Baughman and Liu [229] used the same data and developed a 30-20-1 FFNN for this purpose. Because only 46 data patterns are available, the size of the network is relatively large.
Table 9.8 Principal components for the fluorescent spectral analysis data.

Original variable   PC1        PC2        PC3        ...  PC30
X1                  8.74E-03   0.975      0.223           -2.42E-05
X2                  6.20E-02   0.995      7.37E-02         1.13E-04
X3                  9.50E-02   0.995     -7.16E-03        -1.42E-04
X4                  0.123      0.991     -4.39E-02        -8.74E-05
X5                  0.172      0.984     -4.90E-02         1.06E-04
X6                  0.28       0.958     -5.70E-02        -9.44E-05
X7                  0.472      0.88      -5.77E-02         2.92E-04
X8                  0.715      0.697     -5.06E-02        -5.28E-05
X9                  0.898      0.437     -3.94E-02        -2.27E-04
X10                 0.975      0.219     -2.81E-02        -1.30E-05
X11                 0.996      7.98E-02  -1.98E-02         1.26E-04
X12                 1.000     -6.04E-03  -1.32E-02        -3.17E-04
X13                 0.998     -5.80E-02  -7.49E-03         7.07E-04
X14                 0.996     -8.84E-02  -5.41E-03        -2.87E-04
X15                 0.994     -0.109     -9.89E-04         7.49E-05
X16                 0.993     -0.122      1.34E-04        -1.82E-04
X17                 0.991     -0.131      2.36E-03         1.01E-05
X18                 0.991     -0.137      3.75E-03         1.65E-05
X19                 0.99      -0.141      6.07E-03         5.93E-05
X20                 0.989     -0.144      7.51E-03        -1.72E-05
X21                 0.989     -0.145      8.06E-03         5.70E-05
X22                 0.989     -0.146      1.01E-02        -8.84E-05
X23                 0.989     -0.145      1.38E-02         7.60E-05
X24                 0.989     -0.145      1.62E-02         5.66E-06
X25                 0.989     -0.145      1.87E-02        -9.03E-05
X26                 0.99      -0.142      1.81E-02         9.29E-06
X27                 0.99      -0.139      1.69E-02         2.64E-05
X28                 0.99      -0.137      2.21E-02        -6.22E-05
X29                 0.99      -0.135      2.72E-02        -1.62E-05
X30                 0.991     -0.13       2.79E-02         3.38E-05
We have carried out a PCA of the data and the result is shown in Table 9.8. Two observations can be made. Firstly, the first two PCs capture 99.729% of the variance. Secondly, x1 - x7 are mainly associated with PC2 and x9 - x30 with PC1, while x8 is nearly equally associated with both PCs.
Because the first two PCs capture 99.729% of all the variance, it is possible to use these two PCs rather than the 30 x variables as the inputs to develop a FFNN model. As a result we have developed a 2-4-1 model. A comparison was made between the 30-20-1 model developed by Baughman and Liu [229] and the 2-4-1 model. For both models, 32 data patterns were used for training and 9 for testing. The training and test errors for the 30-20-1 model are 3.02% and 3.87%, while those for the 2-4-1 model are 2.29% and 3.55%. Because of the difference in configuration of the two networks, it is not appropriate to draw a definite conclusion on which model is more accurate from these errors alone. Nevertheless, it is clear that the 2-4-1 model gives results that are at least as accurate.
An alternative way of using PCA to reduce the input dimension is to analyse the contribution of individual explanatory variables to the first few PCs; those which are not important can then be excluded from the FFNN model.
Sensitivity analysis is a simple method of finding the effect of an input on the output of the network. A FFNN can be developed first using all possible input variables and then used to carry out sensitivity studies; irrelevant variables can be excluded when a new FFNN is developed. Suppose there are a number of inputs $x_i$ and only one output $z$; the relationship between an input variable $x_i$ and the output $z$ is found by determining the impact of a small change in $x_i$ on $z$. If a large change occurs in $z$, then $x_i$ is considered to be one of the key factors producing the current activation value of $z$. The sensitivity can be deduced for a FFNN structure with multiple inputs [231], a number of hidden neurons and a single output. The input to a hidden neuron is
$$u_j = a_{0j} + \sum_i a_{ij} x_i \qquad (9.1)$$

where $a_{0j}$ is the bias weight and $x_i$ denotes an input with associated weight $a_{ij}$. The output of a hidden neuron is

$$y_j = \frac{1}{1 + e^{-u_j}} \qquad (9.2)$$

and the input to the output neuron is

$$v = b_0 + \sum_j b_j y_j \qquad (9.3)$$

where $b_0$ is the bias weight and $b_j$ are the respective weights for the hidden nodes. Clearly the output is $z = 1/(1 + e^{-v})$.

The sensitivity of the output $z$ to the input variable $x_i$ is $dz/dx_i$. Using the chain rule, the sensitivity may be related to the output $z$, the outputs from the hidden nodes and the two sets of weights $a_{ij}$ and $b_j$ as follows:

$$\frac{dz}{dx_i} = \frac{dz}{dv}\sum_j \frac{dv}{dy_j}\,\frac{dy_j}{du_j}\,\frac{du_j}{dx_i} = z(1-z)\sum_j b_j\, y_j (1-y_j)\, a_{ij} \qquad (9.4)$$
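A minimal sketch of this calculation (array shapes and names are assumptions of the example): with input-to-hidden weights a and biases a0, and hidden-to-output weights b and bias b0,

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sensitivities(x, a0, a, b0, b):
    # Eqs. (9.1)-(9.4): forward pass, then
    # dz/dx_i = z(1-z) * sum_j b_j y_j (1-y_j) a_ij
    u = a0 + x @ a            # (9.1) input to each hidden neuron
    yh = sigmoid(u)           # (9.2) hidden neuron outputs
    v = b0 + yh @ b           # (9.3) input to the output neuron
    z = sigmoid(v)
    return z * (1.0 - z) * (a * (b * yh * (1.0 - yh))).sum(axis=1)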
For a FFNN with multiple outputs, the same procedure can be applied to study the impact of each input on each output. Table 9.9 shows an example of a sensitivity study for a waste water treatment process [230]. It gives the sensitivity of the standard deviation of the suspended solids to a number of operational variables, indicating that the clarifier interface level is the most important variable, that inlet oil concentration and inlet cyanide have positive impacts, and that an increase in ammonia flow has a positive effect in reducing the standard deviation of the suspended solids.
Sensitivity analysis can also be carried out by plotting the output variable against an input variable, as shown in Figure 9.5 [175]. Figure 9.5 shows the relationships between variables of a waste water processing unit: pH-P is the measured pH value of the effluent from the unit, and pH-E, SS-E and Q-E are the pH value, suspended solids and flowrate of the influent to the unit.
Figure 9.5 The sensitivity of pH-P to the influent SS-E, Q-E and pH-E, plotted against the normalised values of pH-E, SS-E and Q-E.
However, the above sensitivity studies should be conducted with great care. If the inputs are strongly correlated, then the sensitivity of the output to an input variable may depend on the values at which the other input variables are fixed. If PCs rather than the explanatory variables are used as inputs, this problem is resolved, since all PCs are linearly uncorrelated. However, although it may be possible to estimate the impact of each PC on the output, relating this impact back to the explanatory variables is not always possible. For example, Table 9.10 shows the coefficients of the first four PCs on the six explanatory variables x1 - x6. Since PC1 is nearly equally weighted on four (i.e., x1, x2, x4 and x5) of the six variables, and PC1 accounts for only about half of the total variance, even if we can find the impact of PC1 on the output it is difficult to determine the impacts of x1 - x6 on the output [231]. A satisfactory situation would be one in which the first PC accounts for most of the total variance and/or is heavily weighted on a few explanatory variables.
Another approach to analysing the impact of the inputs on an output is to analyse the weight matrix. Since the nodes between two adjacent layers are fully connected, such an analysis is not straightforward. Garson [232] proposed the following measure for the proportional contribution of input $i$ to output $k$,

$$Q_{ik} = \frac{\sum_{j=1}^{n_h} \dfrac{|w_{ij}|\,|v_{jk}|}{\sum_{p=1}^{n_i} |w_{pj}|}}{\sum_{q=1}^{n_i} \sum_{j=1}^{n_h} \dfrac{|w_{qj}|\,|v_{jk}|}{\sum_{p=1}^{n_i} |w_{pj}|}} \qquad (9.5)$$

In the above equation, $w$ denotes the input-to-hidden weights and $v$ the hidden-to-output weights; $p$ and $q$ refer to the input layer ($p = 1, \ldots, n_i$), $j$ refers to the hidden layer ($j = 1, \ldots, n_h$), and $i$, $j$, $k$ index the $i$th input, $j$th hidden and $k$th output node. A disadvantage of the approach is that during the summation, positive and negative weights can cancel their contributions, which leads to inconsistent results. Milne [233] commented that the sign of the contribution is lost, and proposed the following measure,

(9.6)
Wong et al. [234] and Gedeon [235] defined measures for the contribution of an input to a neuron in the hidden layer, and of a hidden layer neuron to an output layer neuron,

$$P_{ij} = \frac{|w_{ij}|}{\sum_{r=1}^{n_i} |w_{rj}|} \qquad (9.7)$$

$$P_{jk} = \frac{|w_{jk}|}{\sum_{r=1}^{n_h} |w_{rk}|} \qquad (9.8)$$

which combine to give the contribution of input $i$ to output $k$,

$$Q_{ik} = \sum_{r=1}^{n_h} \left( P_{ir} \times P_{rk} \right) \qquad (9.9)$$
These approaches, which use the weight matrix to examine the impact of input nodes on output nodes, are interesting, but they have not been widely tested.
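For illustration, the measures of Garson (Eq. 9.5) and Wong/Gedeon (Eqs. 9.7-9.9) might be coded as follows, with w the input-to-hidden weight matrix (n_i × n_h) and v the hidden-to-output matrix (n_h × n_o); biases are ignored and the names are assumptions of this sketch.

import numpy as np

def garson(w, v, k):
    # Eq. (9.5): split |v_jk| among the inputs in proportion to |w_ij|,
    # then normalise so the input shares for output k sum to one
    share = (np.abs(w) / np.abs(w).sum(axis=0)) * np.abs(v[:, k])
    contrib = share.sum(axis=1)
    return contrib / contrib.sum()

def wong_gedeon(w, v, k):
    # Eqs. (9.7)-(9.9): P_ij, P_jk and Q_ik = sum_r P_ir * P_rk
    P_in = np.abs(w) / np.abs(w).sum(axis=0)          # (9.7)
    P_out = np.abs(v[:, k]) / np.abs(v[:, k]).sum()   # (9.8)
    return P_in @ P_out                               # (9.9)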
The data used in the industrial case study described in Section 9.3 were collected on a steady-state basis. Consequently, the model developed will only be accurate if the operation is around steady state. A feedforward neural network is static in nature and is not designed to deal with dynamics, e.g., time delays. Recently, several approaches have been proposed to introduce dynamics into FFNNs, in order to capture the dynamic nonlinear behaviour of processes. Such neural networks can essentially be classified, according to the implementation of the dynamic character, as: (1) NNs with lumped dynamics; (2) recurrent NNs; and (3) the dynamic multilayer perceptron [236].
NNs with lumped dynamics are static networks fed with delayed measurements of the process inputs and outputs. Accordingly, the network approximates the current process output $y(k)$ as a function of the delayed measurements:

$$y(k) = f\big(y(k-1), \ldots, y(k-n),\, u(k-1), \ldots, u(k-m)\big)$$
It is clear that NNs with lumped dynamics do not change the internal structure of a static NN: the dynamics are captured by sampling data from input-output dynamic trends. Consequently, the dimension of the NN input space increases with the number of delayed measurements used.
Bhat and McAvoy [237] described such a case study using a stirred tank reactor, as shown in Figure 9.6. The dynamics of the system can be described by two linear differential equations and three nonlinear algebraic equations. The training data generated from the system are shown in Figure 9.7.

Figure 9.7 Data used for training the CSTR neural network.
A moving window approach is used. At the beginning of the training process the window is placed at the start of the database. The first five pH and ten F2 values are used as inputs and the next five pH values are used as the outputs. After the first presentation of the data, the window moves one step down the database. Again, the first five pH and ten F2 values in the window are the inputs to the net and the next five pH values in the window are the outputs. This process continues until the end of the database is reached, and it is then repeated from the beginning of the database until training is terminated, either on reaching the required error or the maximum number of iterations.
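A sketch of this windowing scheme (array and parameter names are illustrative):

import numpy as np

def moving_windows(ph, f2, n_ph=5, n_f2=10, n_out=5):
    # Slide a window one step at a time: the first five pH and ten F2 values
    # are the inputs, the next five pH values the target outputs
    X, Y = [], []
    last = min(len(ph) - n_ph - n_out, len(f2) - n_f2) + 1
    for k in range(last):
        X.append(np.concatenate([ph[k:k + n_ph], f2[k:k + n_f2]]))
        Y.append(ph[k + n_ph:k + n_ph + n_out])
    return np.array(X), np.array(Y)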
Recurrent neural networks (RNNs) (e.g., Shaw et al. [238]) dispense with the explicit use of lagged measurements at the network input. Instead, RNNs possess a long-term memory through recurrent connections within the network. However, according to Ayoubi [236], the application of RNNs has largely been confined to sequence processing, and there have been no significant applications to model identification of real-world systems.
An alternative method is the dynamic multilayer perceptron [236, 239], which modifies the neuron processing and interconnections in order to incorporate dynamics inherently within the network.
9.6 Summary
Due to inadequate knowledge of the domain problem, developers tend to use many input variables. The inclusion of irrelevant and redundant variables may degrade model performance: it increases the size of the net, leading to longer training times, requires more training data, and reduces the generalisation capability of the model. The approaches introduced in this chapter, including PCA, sensitivity analysis and weight matrix analysis, can be used to combat this problem. However, further investigation of these approaches is still needed, especially in solving practical problems.
Only a very brief introduction has been given to dynamic FFNNs for software sensor design. Because of their great application potential, there has been growing interest in dynamic neural networks. Apart from the need to develop the most appropriate network structures and learning approaches, there is a clear practical difficulty in applying dynamic neural networks: how to obtain the large number of data patterns needed. Experimenting on industrial processes is time consuming and expensive because it may cause the process to deviate from desired conditions. A further problem associated with experiments on industrial processes is that the processes are often under closed-loop control using linear models; Kim et al. [240] addressed the need to take into account the reality that most processes operate under linearised control schemes.
There are many other issues associated with the determination of FFNN topology and training. A detailed discussion of these has been given in Chapter 5 and they are therefore not repeated in this chapter.
CHAPTER 10
CONCLUDING REMARKS
Examples and industrial case studies of varying degrees of complexity have been used to illustrate these methods. They include a continuous stirred tank reactor, the reaction regeneration system of a fluid catalytic cracking process, a methyl tertiary butyl ether unit, the reaction fractionation system of fluid catalytic cracking and a waste water treatment plant. Many more examples could have been added, but we feel these should be sufficient to demonstrate the effectiveness of the methods and aid in clarifying the ideas and concepts.
We have intended to provide a new way of thinking in the approach to designing and operating process control systems and to introduce practical methodologies for implementation. We also hope that the book will provide a stimulus to researchers, since the field is still in its infancy. Many more challenges than those addressed in this book need to be considered in order to develop effective, robust and practical systems for knowledge discovery and for designing intelligent and state space based monitoring and control environments.
Suggestions for future work have been summarised at the end of each chapter and are therefore not repeated here. Most of the case studies in the book concern continuous processes, save for a case study described in Chapter 4. Batch processes are expected to provide more challenges due to the distinctive features of the batch operation mode: batch processes do not have steady state operations and are more flexible. They also provide great opportunities for applying the techniques discussed in this book. For example, wavelets are potentially a powerful technique for dealing with the signals of batch operations, because these often change at higher frequencies than in continuous processes. In addition, feedstock disturbances occur frequently and on-line measurements of product quality variables are not available. As a result, most batch processes have not been able to achieve tight quality control. Knowledge discovery approaches are attractive for dealing with these kinds of problems because of the difficulties associated with developing accurate process models from first principles [241]. At present, industrialised countries are shifting their focus from large-scale commodity production to smaller scale fine chemicals, pharmaceutical and food production; there is therefore a clear need for advances in this field.
APPENDIX A
THE CONTINUOUS STIRRED TANK REACTOR (CSTR)
Figure A1 The continuous stirred tank reactor (CSTR) with its feed and cooling water streams.
A single reaction A → B takes place in the reactor. It is assumed that the tank is very well mixed, so that the concentration Co of component A and the temperature of the product stream leaving the tank are equal to those in the tank. The reaction is exothermic, and the reaction temperature TR is automatically controlled by a PID controller through manipulation of the cooling water flowrate. There are also feed flowrate and liquid level controllers, as shown in Figure A1. Apart from the three controllers, the dynamic behaviour of the process is described by the following equations:
Component mass balance:

$$V \frac{dC_o}{dt} = F_i C_i - F_o C_o - V K_0 e^{-E/RT_R} C_o \qquad (A1)$$

Energy balance:

$$V \rho C_p \frac{dT_R}{dt} = \rho C_p F_i T_i - \rho C_p F_o T_R + (-\Delta H) V K_0 e^{-E/RT_R} C_o - \frac{a F_w^{b+1}}{F_w + a F_w^b / (2 \rho_w C_{pw})} (T_R - T_{wi}) \qquad (A2)$$

where $b$ is a constant, $b = 0.5$.
A more detailed description of the process can be found in [8]. A dynamic simulator was developed for the CSTR, incorporating the three controllers shown in Figure A1. The simulator has an MS Windows based graphical interface which allows users to make changes, so that faults and disturbances can easily be introduced. To generate a data set (data case), the simulator is run at steady state, a disturbance or fault is introduced, and recording of the dynamic responses is started at the same time. Eighty-five data sets were generated; they are summarised in Table A1. For each data set, eight variables were recorded: Fi, Ti, Ci, Twi, Fw, TR, Co and L. In each data set, each variable was recorded as a dynamic trend consisting of 150 sampling points. Therefore, for each variable the data form a matrix of 85 (the number of data sets) × 150 (the number of data points representing a dynamic trend).
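The procedure can be illustrated with the component balance (A1); the parameter values below are invented for the sketch and are not those of the simulator, and the temperature is simply held at its controlled set point.

import numpy as np
from scipy.integrate import solve_ivp

V, K0, E_R, TR = 2.1, 1.0e10, 8330.0, 405.0   # assumed values for illustration
Fi = Fo = 1.0                                 # feed/outlet flowrate, m3/min

def mass_balance(t, c, ci):
    # Eq. (A1) with TR held at its controlled set point
    return (Fi * ci - Fo * c[0] - V * K0 * np.exp(-E_R / TR) * c[0]) / V

# run at steady state, then step the feed concentration Ci from 2.0 to 1.6
# and record 150 sampling points of the response
sol = solve_ivp(mass_balance, (0.0, 150.0), [0.5], args=(1.6,),
                t_eval=np.linspace(0.0, 150.0, 150))
trend = sol.y[0]   # one dynamic trend of Co, as stored in each data set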
37-39  All control loops are in AUTO mode and S.P. of TR = 405 K. Change Ci (kmol/m3) from 2.0 → 1.6, 1.6 → 2.0, 2.0 → 1.2
40-43  All control loops are in AUTO mode and S.P. of TR = 405 K. Change Fi (m3/min) from 1.00 → 1.06, 1.06 → 0.98, 0.98 → 0.86, 0.86 → 1.02
44-46  All control loops are in AUTO mode and S.P. of TR = 405 K. Change L (%) from 50.0 → 60.0, 60.0 → 50.0, 50.0 → 40.0
47-52  All control loops are in AUTO mode and S.P. of TR = 380 K. Change Ti (K) from 343 → 333, 333 → 323, 323 → 333, 333 → 343, 343 → 353, 353 → 343
53-56  All control loops are in AUTO mode and S.P. of TR = 380 K. Change Twi (K) from 310 → 303, 303 → 313, 310 → 323, 323 → 313
57-60  All control loops are in AUTO mode and S.P. of TR = 380 K. Change Ci (kmol/m3) from 2.0 → 1.6, 1.6 → 2.0, 2.0 → 1.2, 1.2 → 2.0
61-66  All control loops are in AUTO mode and S.P. of TR = 380 K. Change Fi (m3/min) from 1.00 → 1.06, 1.06 → 1.00, 1.00 → 1.10, 1.10 → 1.00, 1.00 → 0.90, 0.90 → 1.00
67-70  All control loops are in AUTO mode and S.P. of TR = 380 K. Change the S.P. of L (%) from 50.0 → 40.0, 40.0 → 50.0, 50.0 → 60.0, 60.0 → 50.0
71-80  All control loops are in AUTO mode and S.P. of TR = 380 K. Change the output of the CSTR level controller (%) from 50.0 → 10.0, 50.0 → 8.0, 50.0 → 6.0, 50.0 → 12.0, 50.0 → 14.0, 50.0 → 15.0, 50.0 → 5.0, 50.0 → 2.0, 50.0 → 3.0, 50.0 → 0.0
81-85  All control loops are in AUTO mode. Change the output of the TR controller (%) from 71.4 → 20.0, 71.4 → 21.0, 71.4 → 22.0, 71.4 → 19.0, 71.4 → 18.0
APPENDIX B
THE RESIDUE FLUID CATALYTIC CRACKING (R-FCC)
PROCESS
The sixty-four data patterns were first generated by the customised dynamic training simulator; random noise was then added to the data using a noise generator in MATLAB®, before the data were fed to the system.
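A minimal sketch of the noise step (numpy is used here for illustration in place of the MATLAB® generator; the noise level is an assumption):

import numpy as np

def add_noise(trends, rel_std=0.01, seed=0):
    # add zero-mean Gaussian noise scaled to each variable's mean magnitude
    rng = np.random.default_rng(seed)
    scale = rel_std * np.abs(trends).mean(axis=-1, keepdims=True)
    return trends + rng.normal(size=trends.shape) * scale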
Figure B1 The simplified flow sheet of the R-FCC process.
Table B1 The major control loops and safety guard systems of the R-FCC process.

Major control loops
Reaction temperature    TC302    Temperature control at the exit of the riser tube reactor. Set point, 513 °C. The manipulated variable is the regenerated catalyst flow to the riser tube reactor.
External heat exchanging system    When the low main air supply safety guard is activated, the external heat exchanging system also needs to be cut off.
APPENDIX C
THE METHYL TERTIARY BUTYL ETHER (MTBE) PROCESS
The refinery methyl tertiary butyl ether (MTBE) process, shown in Figure C1, is used in Chapter 7 as a case study in applying conceptual clustering to the development of process monitoring systems. MTBE is an important industrial chemical because large quantities are now required as an octane booster in gasoline, replacing tetraethyl lead. MTBE is produced from the reaction between isobutene and methanol in a reactor and a catalytic distillation column packed with catalysts. The reaction takes place at a temperature of 35-75 °C and a pressure of 0.7-0.9 MPa. The main chemical reaction is:

methanol + isobutene ⇌ methyl tertiary butyl ether

CH3OH + (CH3)2C=CH2 ⇌ (CH3)3COCH3

The reaction occurs in the liquid phase and is reversible and exothermic. Proper selection of the reaction pressure allows part of the reaction product to be vaporised, absorbing part of the heat of reaction; in this way the temperature in the reactor can be controlled.
An unavoidable side reaction is:

isobutene + isobutene ⇌ diisobutene

(CH3)2C=CH2 + (CH3)2C=CH2 ⇌ [(CH3)2C=CH2]2

However, the selectivity of the primary reaction is about 98-99%. Since the amount of side product is very limited, it can be ignored.
MTBE is mostly produced in the reactor, and a small amount is produced in the catalytic distillation column, which combines reaction and distillation. The remaining isobutene and methanol from the reactor pass to the catalytic distillation
column. For each variable, 256 data points are recorded following a disturbance or fault, so the data have a dimension of 21 × 256. Random noise was added to the data; Figure C2 shows an example.
Only the feed, reactor and reactive distillation column sections of the process have been considered, which means that columns C202 and C203 have not been included in the study, apart from the fact that a small recycle stream of recovered methanol affects the feed tank D202. This recycle stream was fixed at the design flowrate.
Figure C1 The simplified flowsheet of the MTBE process.
Figure C2 An example of the dynamic trend data recorded following a disturbance (256 data points).
REFERENCES

1 Lukas MP. Distributed control systems: their evaluation and design. Van Nostrand Reinhold Company, New York and Wokingham, 1986.
2 Yamanaka F, Nishiya T. Application of the intelligent alarm system for the plant operation. Comput. Chem. Engng 1997; 21: s625-s630.
3 Howat CS. Analysis of plant performance. In: Perry RH, Green DW (eds.) Perry's chemical engineers' handbook, Section 30. McGraw-Hill, 1997, 30-1 - 30-36.
4 Saraiva PM, Stephanopoulos G. Continuous process improvement through inductive and analogical learning. AIChE J. 1992; 38: 161-183.
5 Saraiva PM. Inductive and analogical learning: data-driven improvement of process operations. In: Stephanopoulos G, Han C (eds.) Intelligent Systems in Process Engineering: Paradigms from Design and Operations. Academic Press, San Diego, California, 1996, 377-435.
6 Taylor W. What every engineer should know about artificial intelligence. MIT Press, Cambridge, MA, 1989.
7 Shewhart WA. Economic control of quality of manufactured product. Van Nostrand, Princeton, NJ, 1931.
8 Marlin TE. Process control - designing processes and control systems for dynamic performance. McGraw-Hill, 1995.
9 Oakland JS. Statistical process control: a practical guide. Heinemann, London, 1986.
10 Woodward RH, Goldsmith PL. Cumulative sum techniques. Oliver and Boyd, London, 1964.
11 Roberts SW. Control charts tests based on geometric moving averages. Technometrics 1959; 1: 239-250.
12 Hunter JS. The exponentially weighted moving average. J. of Quality Technology 1986; 18: 203-210.
17 Hangos KM, Perkins JD. Structural stability of chemical process plants. AIChE J. 1997; 43: 1511-1518.
18 Quinlan JR. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
21 Chen FZ, Wang XZ. An integrated data mining system and its application to process operational data analysis. Comput. Chem. Engng 1999; 23(suppl.): s777-s780.
23 Fayyad UM, Simoudis E. Data mining and knowledge discovery. Tutorial Notes at PADD'97 - 1st Int. Conf. Prac. App. KDD & Data Mining, London, 1997.
26 Glymour C, Madigan D, Pregibon D, Smyth P. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery 1997; 1: 11-28.
28 Chen MS, Han JW, Yu PS. Data mining: an overview from a database perspective. IEEE Trans. Knowl. Data Eng. 1996; 8: 866-883.
30 Inmon WH. The data warehouse and data mining. Comm. of the ACM 1996; 39(11): 49-50.
35 Fayyad U, Haussler D, Stolorz P. Mining scientific data. Comm. of the ACM 1996; 39(11): 51-57.
37 Apte C. Data mining: an industrial research perspective. IEEE Computational Sci. & Eng. 1997; 4(2): 6-9.
39 Cheeseman P, Stutz J. Bayesian classification (AutoClass): theory and results. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996, 153-180. http://fi-www.arc.nasa.gov/fi/projects/bayes-group/group/autoclass/autoclass-c-program.html
45 Everitt BS, Dunn G. Applied multivariate data analysis. Edward Arnold, 1991.
46 Everitt BS. Cluster analysis. 3rd edition, Edward Arnold, London, 1993.
53 Buntine W. Graphical models for discovering knowledge. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996, 59-82.
54 Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 1992; 9: 309-347.
55 Bouckaert RB. Properties of Bayesian network learning algorithms. In: Lopez de Mantaras R, Poole D (eds.) Uncertainty in Artificial Intelligence: Proc. Ninth Conf., 1993, 102.
56 Bouckaert RB. Belief network construction using the minimum description length principle. Proc. ECSQARU 1993; 41.
57 Wang XZ, Chen BH, McGreavy C. Data mining for failure diagnosis of process units by learning probabilistic networks. Trans. IChemE 1997; 75B: 210-216.
58 Jensen FV. An introduction to Bayesian networks. UCL Press, 1996.
59 Wang XZ, Yang SA, Veloso E, Lu ML, McGreavy C. Qualitative process modelling - a fuzzy signed directed graph method. Comput. Chem. Engng 1995; 19(suppl.): s735-s740.
60 Wang XZ, Yang SA, Yang SH, McGreavy C. Fuzzy qualitative simulation in safety and operability studies of process plants. Comput. Chem. Engng 1996; 20: s671-s676.
61 Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proc. ACM SIGMOD Conf. on Management of Data, Washington, D.C., 1993, 207-216.
descriptions of process trends for fault detection and diagnosis. Engng Applic. Artif. Intell. 1991; 4: 329-339.
82 Chen BH, Wang XZ, Yang SH, McGreavy C. Application of wavelets and neural networks to diagnostic system development - 1. feature extraction. Comput. Chem. Engng 1999; 23: 899-906.
83 Wang XZ, Li RF. Combining conceptual clustering and principal component analysis for state space based process monitoring. Internal report, University of Leeds, 1999.
84 Pearson K. On lines and planes of closest fit to systems of points in space. Phil. Mag. 1901; 2: 559-572.
87 Chan YT. Wavelet basics. Kluwer Academic Publishers, Boston, London, 1995.
92 Mallat S, Hwang WL. Singularity detection and processing with wavelets. IEEE Trans. on Inf. Theory 1992; 38: 617-643.
108 Alt FB, Smith ND. Multivariate process control. In: Krishnaiah PR, Rao CR (eds.) Handbook of statistics, Vol. 7, North-Holland, Amsterdam, 1988, 333-351.
109 Ryan TP. Statistical methods for quality improvement. Wiley, New York, 1989.
110 Jackson JE. A user's guide to principal components. John Wiley and Sons, New York, 1991.
111 Tracy ND, Young JC, Mason RL. Multivariate control charts for individual observations. J. Quality Tech. 1992; 24: 88-95.
112 MacGregor JF, Kourti T. Statistical process control of multivariable processes. Control Engng Practice 1995; 3: 403-414.
114 Martin EB, Morris J, Papazoglou MC. Confidence bounds for multivariate process performance monitoring charts. Preprints of the IFAC workshop on on-line fault detection and supervision in the chemical process industries, Newcastle, June 1995; 33-42.
115 Wold S, Geladi P, Esbensen K, Ohman J. Multi-way principal components- and PLS-analysis. J. Chemometrics 1987; 1: 41-56.
116 Nomikos P, MacGregor JF. Monitoring batch processes using multiway principal component analysis. AIChE J. 1994; 40: 1361-1375.
118 Xu L, Oja E, Suen CY. Modified Hebbian learning for curve and surface fitting. Neural Networks 1992; 5: 441-457.
119 Dong D, McAvoy TJ. Nonlinear principal component analysis - based on principal curves and neural networks. Comput. Chem. Engng 1996; 20: 65-78.
124 Yuan B, Wang XZ, Chen FZ, Morris T. Software analyser design using data mining technology for toxicity prediction of aqueous effluents. Proc. 2nd Conf. on Process Integration, Modelling and Optimisation for Energy Saving and Pollution Reduction
125 Bakshi BR. Multiscale PCA with application to multivariate statistical process monitoring. AIChE J. 1998; 44: 1596-1610.
126 Tabe H, Chow KC, Tan KJ, Zhang J, Thornhill N. Dynamic principal component analysis using integral transforms. AIChE Annual Meeting, November 1998.
127 Biehl M, Schlosser E. The dynamics of on-line principal component analysis. J. Phys. A: Math. Gen. 1998; 31: L97-L103.
128 Neogi D, Schlags CE. Multivariate statistical analysis of an emulsion batch process. Ind. Eng. Chem. Res. 1998; 37: 3971-3979.
129 Jaeckle CM, MacGregor JF. Product design through multivariate statistical analysis of process data. AIChE J. 1998; 44: 1105-1118.
130 Kresta JV, Marlin TE, MacGregor JF. Development of inferential process models using PLS. Comput. Chem. Engng 1994; 18: 597-611.
131 Santen A, Koot GLM, Zullo LC. Statistical data analysis of a chemical plant. Comput. Chem. Engng 1997; 21(suppl.): s1123-s1129.
132 Zhang J, Martin EB, Morris AJ. Fault detection and diagnosis using multivariate statistical techniques. Trans. IChemE 1996; 74A: 89-96.
133 Dunia R, Qin SJ, Edgar TF, McAvoy TJ. Identification of faulty sensors using principal component analysis. AIChE J. 1996; 42: 2797-2812.
134 Qin SJ, Yue HY, Dunia R. Self-validating inferential sensors with application to air emission monitoring. Ind. Eng. Chem. Res. 1997; 36: 1675-1685.
135 Chen FZ, Wang XZ. Knowledge discovery using PCA for operational strategy development and product design. Submitted to Trans. IChemE Part A, 1999.
136 Chen JG, Bandoni JA, Romagnoli JA. Robust PCA and normal region in multivariate statistical process monitoring. AIChE J. 1996; 42: 3563-3566.
137 Negiz A, Cinar A. Statistical monitoring of multivariable dynamic processes with state space models. AIChE J. 1997; 43: 2002-2020.
138 Leonard J, Kramer MA. Improvement of the backpropagation algorithm for training neural networks. Comput. Chem. Engng 1990; 14: 337-341.
139 Brent RP. Fast training algorithms for multilayer neural nets. IEEE Trans. Neural Nets 1991; 2: 346-354.
140 Chen S, Billings SA. Neural networks for nonlinear dynamic system modelling and identification. Int. J. Control 1992; 56: 319-346.
141 Peel C, Willis MJ, Tham MT. A fast procedure for the training of neural networks. J. Proc. Cont. 1992; 2: 205-211.
142 Powell MJD. Some global convergence properties of a variable metric algorithm for minimisation without exact line searches. In: Cottle R, Lemke CE (eds.) SIAM-AMS Proc. Symposium on Non-linear Programming, 1975, IX: 53-72.
143 Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Numerical recipes: the art of scientific computing (Fortran version). Cambridge University Press, Cambridge, 1989.
144 Lorentz GG. The 13th problem of Hilbert. In: Browder FE (ed.) Mathematical developments from Hilbert problems. American Mathematical Society, Providence, RI, 1976.
145 Crowe ER, Vassiliadis CA. Artificial intelligence: starting to realise its practical promise. Chem. Eng. Progr. 1995; 91(1): 22-31.
146 Knight K. Connectionist ideas and algorithms. Comm. of the ACM 1990; 33(11): 59-74.
147 Chitra SP. Use of neural networks for problem solving. Chem. Eng. Progr. 1993; 89(4): 44-52.
148 Kramer MA, Leonard JA. Diagnosis using backpropagation neural networks - analysis and criticism. Comput. Chem. Engng 1990; 14: 1323-1338.
149 Shao R, Martin EB, Zhang J, Morris AJ. Confidence bounds for neural network representations. Comput. Chem. Engng 1997; 21: s1173-s1178.
150 Zhang J, Martin EB, Morris AJ, Kiparissides C. Inferential estimation of polymer quality using stacked neural networks. Comput. Chem. Engng 1997; 21: s1025-s1030.
151 Wang XZ, Lu ML, McGreavy C. Learning dynamic fault models based on a fuzzy set covering method. Comput. Chem. Engng 1997; 21: 621-630.
152 Kramer MA. Malfunction diagnosis using quantitative models with non-Boolean reasoning in expert systems. AIChE J. 1987; 33: 130-140.
154 Zhang J, Morris J. Process modelling and fault diagnosis using fuzzy neural networks. Fuzzy Sets and Systems 1996; 79: 127-140.
155 Wang XZ, Chen BH, Yang SH, McGreavy C. Neural nets, fuzzy sets and digraphs in safety and operability studies of refinery reaction processes. Chem. Engng Sci. 1996; 51: 2169-2178.
156 Buckley JJ, Hayashi Y. Fuzzy neural networks: a survey. Fuzzy Sets and Systems 1994; 66: 1-13.
157 Reggia JA, Nau DS, Wang PY. Diagnostic expert system based on a set covering
158 Penalva JM, Coudouneau L, Leyval L, Montmain J. A supervision support system for industrial processes. IEEE Expert 1993; 8(5): 57-65.
160 Oyeleye OO, Kramer MA. Qualitative simulation of chemical process systems: steady-state analysis. AIChE J. 1988; 34: 1441-1454.
163 Mohindra S, Clark PA. A distributed fault diagnosis method based on graph models: steady-state analysis. Comput. Chem. Engng 1993; 17: 193-209.
164 Wilcox NA, Himmelblau DM. The possible cause-effect graph (PCEG) model for fault diagnosis. Comput. Chem. Engng 1994; 18: 103-127.
165 Ouassir M, Melin C. Causal graphs and rule generation: application to fault diagnosis of dynamic processes. Proceedings of the 3rd IFAC Symposium on Fault Detection, Supervision and Safety for Technical Processes (SAFEPROCESS 97), Kingston upon Hull, England, Aug. 1997, 1087-1092.
167 Vedam H, Venkatasubramanian V. Signed digraph based multiple fault diagnosis. Comput. Chem. Engng 1997; 21: s655-s660.
168 Umeda T, Kuriyama T, O'Shima E, Matsuyama H. A graphical approach to cause and effect analysis of chemical processing systems. Chem. Eng. Sci. 1980; 35: 2379-2388.
169 Vaidhyanathan R, Venkatasubramanian V. Digraph-based models for automated HAZOP analysis. Reliability Engineering and System Safety 1995; 50: 33-49.
170 Srinivasan R, Venkatasubramanian V. Petri net-digraph models for automating HAZOP analysis of batch process plants. Comput. Chem. Engng 1996; 20: s719-s725.
171 Kuo DH, Hsu DS, Chang CT, Chen DH. Prototype for integrated hazard analysis. AIChE J. 1997; 43: 1494-1510.
172 Han CC, Shih RF, Lee LS. Quantifying signed directed graphs with the fuzzy set for fault diagnosis resolution improvement. Ind. Eng. Chem. Res. 1994; 33: 1943-1954.
173 Shih RF, Lee LS. Use of fuzzy cause-effect digraph for resolution fault diagnosis for process plants. Ind. Eng. Chem. Res. 1995; 34: 1688-1717.
174 Wang XZ, Chen BH, Yang SH, McGreavy C. Fuzzy rule generation from data for process operational decision support. Comput. Chem. Engng 1997; 21: s661-s666.
175 Huang YC, Wang XZ. Application of fuzzy causal networks to waste water treatment plants. Chem. Eng. Sci. 1999; 54: 2731-2738.
176 Chen CL, Jong MJ. Fuzzy predictive control for the time-delay system. Proc. 2nd IEEE Int. Conf. on Fuzzy Systems, 1993, 236-240.
177 Becraft WR, Lee PL. An integrated neural network/expert system approach for fault diagnosis. Comput. Chem. Engng 1993; 17: 1001-1014.
178 Dubois D, Prade H. Fuzzy sets and systems: theory and applications. Academic Press, 1980.
179 Rosenfeld A. Fuzzy graphs. In: Zadeh LA, Fu KS, Tanaka K, Shimura M (eds.) Fuzzy sets and their applications to cognitive and decision processes. Academic Press, New York, 1975, 77-95.
180 Zhang J, Morris AJ, Martin EB, Kiparissides C. Prediction of polymer quality in batch polymerisation reactors using robust neural networks. Chem. Eng. J. 1998; 69: 135-143.
182 Wasserman PD. Neural computing: theory and practice. Van Nostrand Reinhold, New York, 1989.
183 Whiteley JR, Davis JF. A similarity-based approach to interpretation of sensor data using adaptive resonance theory. Comput. Chem. Engng 1994; 18: 637-661.
184 Whiteley JR, Davis JF, Mehrotra A, Ahalt SC. Observations and problems applying ART2 for dynamic sensor pattern interpretation. IEEE Trans. on Sys. Man Cybernetics Part A: Sys. and Humans 1996; 26: 423-437.
185 Wang XZ, Chen BH, Yang SH, McGreavy C. Application of wavelets and neural networks to diagnostic system development - 2. an integrated framework and its application. Comput. Chem. Engng 1999; 23: 945-954.
186 Pao YH. Adaptive pattern recognition and neural networks. Addison-Wesley, 1989.
187 Looney CG. Pattern recognition using neural networks: theory and algorithms for engineers and scientists. Oxford University Press, 1997.
188 Everitt BS, Hand DJ. Finite mixture distributions. Chapman & Hall, London, 1981.
189 Titterington DM, Smith AFM, Makov UE. Statistical analysis of finite mixture distributions. John Wiley & Sons, New York, 1985.
190 Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977; 39(1): 1-38.
191 Ayoubi M, Leonhardt S. Methods of fault diagnosis. Control Eng. Practice 1997; 5: 683.
193 Wang XZ, McGreavy C. Automatic classification for mining process operational data. Ind. Eng. Chem. Res. 1998; 37: 2215-2222.
194 Nam DS, Han C, Jeong CW, Yoon ES. Automatic construction of extended symptom-fault associations from the signed digraph. Comput. Chem. Engng 1996; 20(suppl.): s605-s610.
195 Wang XZ, Chen BH. Clustering of infrared spectra of lubricating base oils using adaptive resonance theory. J. Chem. Inf. Comput. Sci. 1998; 38: 457-462.
196 Michalski RS, Larson JB. Selection of most representative training examples and incremental generation of VL1 hypotheses: the underlying method and description of programs ESEL and AQ11. Report 867, Computer Science Department, University of Illinois, 1978.
197 Mitchell TM. Version spaces: a candidate elimination approach to rule learning. In: Proc. of IJCAI-77, Cambridge, Mass., 1977.
198 Quinlan JR. Improved use of continuous attributes in C4.5. J. Artif. Intell. Res. 1996; 4: 77-90.
199 Wu X. A Bayesian discretiser for real-valued attributes. The Comput. J. 1996; 39: 688-691.
200 Daniel C, Wood FS. Fitting equations to data, 2nd edition. John Wiley & Sons, 1980.
202 Shi ZZ. Principles of machine learning. International Academic Publishers, Beijing, 1992.
203 Dutton DM, Conroy GV. A review of machine learning. The Knowl. Engng Review 1996; 12: 341-367.
204 Kocabas S. A review of learning. The Knowl. Engng Review 1991; 6: 195-222.
205 Wang LX, Mendel JM. Generating fuzzy rules by learning from examples. IEEE Trans. on Systems, Man, and Cybernetics 1992; 22: 1414-1427.
206 Wang LX, Mendel JM. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. on Neural Networks 1992; 3: 807-814.
207 Wang LX. Adaptive fuzzy systems and control: design and stability analysis.
208 Wang LX. Fuzzy systems are universal approximators. Proc. IEEE Int. Conf. on Fuzzy Systems, San Diego, 1992, 1163-1170.
209 Quafafou M, Chan CC. An incremental approach for learning fuzzy rules from examples. EUFIT '95, Aachen, Germany, 1995, 520-523.
210 Srinivasan ACB, Chan CC. Using inductive learning to determine fuzzy rules for dynamic systems. Engng Applic. of Artif. Intell. 1993; 6: 257-264.
211 Abe S, Lan MS. A method for fuzzy rules extraction directly from numerical data and its application to pattern classification. IEEE Trans. on Fuzzy Systems 1995; 3: 18-28.
214 Levrat E, Rondeau L, Ruelas R, Lamotte M. Fuzzy rules learning method. EUFIT '95, Aachen, Germany, 1995, 515-519.
215 Sudkamp T, Hammell RJ. Interpolation, completion, and learning fuzzy rules. IEEE Trans. on Systems, Man, and Cybernetics 1994; 24: 332-342.
216 Kosko B. Neural networks and fuzzy systems: a dynamic systems approach to machine intelligence. Prentice Hall, Englewood Cliffs, NJ, 1992.
217 Fu LM. Rule generation from neural networks. IEEE Trans. on Systems, Man, and Cybernetics 1994; 24: 1114-1124.
218 Gallant SI. Connectionist expert systems. Comm. of the ACM 1988; 31(2): 152-169.
219 Chan CC. Incremental learning of production rules from examples under uncertainty: a rough set approach. Int. J. of Software Engng and Knowl. Engng 1991; 1: 439-461.
220 Chmielewski MR, Grzymala-Busse JW, Peterson NW, Than S. The rule induction system LERS - a version for personal computers. Foundations of Computing and Decision Sciences 1993; 18: 181-212.
221 Pawlak Z. Rough sets. Int. J. Computer and Information Sci. 1982; 11: 341-356.
222 Pawlak Z. Rough classification. Int. J. Man-Machine Studies 1984; 20: 469-483.
223 Pawlak Z. Rough sets and fuzzy sets. Fuzzy Sets and Systems 1985; 17: 99-102.
224 Fusillo RH, Powers GJ. Operating procedure synthesis using local models and distributed goals. Comput. Chem. Engng 1988; 12: 1023-1034.
225 Frayman Y, Wang L. Data mining using dynamically constructed recurrent fuzzy neural networks. In: Wu X, Kotagiri R, Korb KB (eds.) Research and development in
226 Ziarko WP (ed.) Rough sets, fuzzy sets and knowledge discovery. Springer-Verlag, 1994.
227 Chen FZ, Wang XZ. Software sensor design using Bayesian automatic classification and backpropagation neural networks. Ind. Eng. Chem. Res. 1998; 37: 3985-3991.
229 Baughman DR, Liu YA. Neural networks in bioprocessing and chemical engineering. Academic Press, 1995.
230 Miller RM, Itoyama K, Uda A, Takada H, Bhat N. Modeling and control of a chemical waste water treatment plant. Comput. Chem. Engng 1997; 21(suppl.): s947-s952.
231 Gillespie ES, Wilson RN. Application of sensitivity analysis to neural network determination of financial variable relationships. Applied Stochastic Models and Data Analysis 1998; 13: 409-414.
232 Garson GD. Interpreting neural-network connection weights. AI Expert 1991; April: 47-51.
233 Milne LK, Gedeon TD, Skidmore AK. Classifying dry sclerophyll forest from augmented satellite data: comparing neural networks, decision tree & maximum likelihood. Proc. Australian Conf. Neural Networks, Sydney, 1995, 160-163.
234 Wong PM, Gedeon TD, Taggart IJ. An improved technique in porosity prediction: a neural network approach. IEEE Trans. Geoscience and Remote Sensing 1995; 33: 971-980.
235 Gedeon TD. Data mining of inputs: analysing magnitude and functional measures. Int. J. Neural Systems 1997; 8: 209-218.
236 Ayoubi M. Comparison between the dynamic multi-layered perceptron and the generalised Hammerstein model for experimental identification of the loading process in diesel engines. Control Engng Practice 1998; 6: 271-279.
237 Bhat N, McAvoy TJ. Use of neural nets for dynamic modelling and control of chemical process systems. Comput. Chem. Engng 1990; 14: 573-583.
238 Shaw AM, Doyle FJ, Schwaber JS. A dynamic neural network approach to nonlinear process modelling. Comput. Chem. Engng 1997; 21: 371-385.
239 Morris AJ, Montague GA, Willis MJ. Artificial neural networks: studies in process modelling and control. Trans. IChemE 1994; 72A: 3-19.
240 Kim SJ, Lee MH, Park SW, Lee SY, Park CH. A neural linearizing control scheme for nonlinear chemical processes. Comput. Chem. Engng 1997; 21: 187-200.
241 Martinez EC, Pulley RA, Wilson JA. Learning to control the performance of batch processes. Trans. IChemE 1998; 76A: 711-722.
INDEX