Tharwat 2018
ARTICLE INFO

Article history:
Received 24 May 2018
Revised 26 August 2018
Accepted 29 August 2018
Available online xxxx

Keywords:
Independent component analysis (ICA)
Blind source separation (BSS)
Cocktail party problem
Principal component analysis (PCA)

ABSTRACT

Independent component analysis (ICA) is a widely-used blind source separation technique. ICA has been applied to many applications. ICA is usually utilized as a black box, without understanding its internal details. Therefore, in this paper, the basics of ICA are provided to show how it works, to serve as a comprehensive source for researchers who are interested in this field. This paper starts by introducing the definition and underlying principles of ICA. Additionally, different numerical examples in a step-by-step approach are demonstrated to explain the preprocessing steps of ICA and the mixing and unmixing processes in ICA. Moreover, different ICA algorithms, challenges, and applications are presented.

© 2018 The Author. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer review under responsibility of King Saud University.
https://doi.org/10.1016/j.aci.2018.08.006
2210-8327/© 2018 The Author. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.006

1. Introduction

Measurements cannot be isolated from noise, which has a great impact on measured signals. For example, the recorded sound of a person in a street also contains the sounds of footsteps, pedestrians, etc. Hence, it is difficult to record a clean measurement; this is because (1) source signals are always corrupted with noise, and (2) other independent signals (e.g. car sounds) are generated from different sources [31]. Thus, the measurements can be defined as a combination of many independent sources. The topic of separating these mixed signals is called blind source separation (BSS). The term blind indicates that the source signals can be separated even if little information is known about them.

One of the most widely-used examples of BSS is separating the voice signals of people speaking at the same time; this is called the cocktail party problem [31]. The independent component analysis (ICA) technique is one of the most well-known algorithms used for solving this problem [23]. The goal of this problem is to detect or extract the sound of a single object even though different sounds in the environment are superimposed on one another [31]. Fig. 1 shows an example of the cocktail party problem. In this example, two voice signals are recorded from two different individuals, i.e., two independent source signals. Moreover, two sensors, i.e., microphones, are used for recording the two signals, and the outputs from these sensors are two mixtures. The goal is to extract the original signals from the mixtures of signals. This problem can be solved using the independent component analysis (ICA) technique [23].

ICA was first introduced in the 80s by J. Hérault, C. Jutten and B. Ans, who proposed an iterative real-time algorithm [15]. However, in that paper, no theoretical explanation was presented, and the proposed algorithm was not applicable in a number of cases. The ICA technique remained mostly unknown until 1994, when the name ICA appeared and was introduced as a new concept [9]. The aim of ICA is to extract useful information or source signals from data (a set of measured mixture signals). These data can be in the form of images, stock markets, or sounds. Hence, ICA has been used for extracting source signals in many applications such as medical signals [7,34], biological assays [3], and audio signals [2]. ICA can also be considered a dimensionality reduction algorithm when it is used to delete or retain a single source. This is also called a filtering operation, where some signals can be filtered out or removed [31].

ICA is considered an extension of the principal component analysis (PCA) technique [9,33]. However, PCA optimizes the covariance matrix of the data, which represents second-order statistics, while ICA optimizes higher-order statistics such as kurtosis. Hence, PCA finds uncorrelated components while ICA finds independent components [21,33]. As a consequence, PCA can
extract independent sources when the higher-order correlations of the mixture data are small or insignificant [21].

Fig. 1. Example of the cocktail party problem. Two source signals (e.g. sound signals) are generated from two individuals and then recorded by two sensors, e.g., microphones. The two microphones mix the two source signals linearly. The goal of this problem is to recover the original signals from the mixed signals.

ICA has many algorithms such as FastICA [18], projection pursuit [21], and Infomax [21]. The main goal of these algorithms is to extract independent components by (1) maximizing the non-Gaussianity, (2) minimizing the mutual information, or (3) using the maximum likelihood (ML) estimation method [20]. However, ICA suffers from a number of problems such as over-complete ICA and under-complete ICA.

Many studies treat the ICA technique as a black box without understanding its internal details. In this paper, the basic definitions of ICA and how to use ICA for extracting independent signals are introduced in a step-by-step approach. This paper is divided into eight sections. In Section 2, an overview of the definition of the main idea of ICA and its background are introduced. This section begins by explaining, with illustrative numerical examples, how signals are mixed to form mixture signals, and then the unmixing process is presented. Section 3 introduces, with visualized steps and numerical examples, the two preprocessing steps of ICA, which greatly help with extracting source signals. Section 4 presents the principles of how ICA extracts independent signals using different approaches such as maximizing the likelihood, maximizing the non-Gaussianity, or minimizing the mutual information. This section explains mathematically the steps of each approach. Different ICA algorithms are highlighted in Section 5. Section 6 lists some applications that use ICA for recovering independent sources from a set of sensed signals that result from mixing a set of source signals. In Section 7, the most common problems of ICA are explained. Finally, concluding remarks are given in Section 8.

2. ICA background

2.1. Mixing signals

Each signal varies over time, and a signal is represented as follows, s_i = {s_{i1}, s_{i2}, ..., s_{iN}}, where N is the number of time steps and s_{ij} represents the amplitude of the signal s_i at the jth time.² Given two independent source signals³ s_1 = {s_{11}, s_{12}, ..., s_{1N}} and s_2 = {s_{21}, s_{22}, ..., s_{2N}} (see Fig. 2), both signals can be represented as follows:

S = \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \end{pmatrix}   (1)

where S ∈ R^{p×N} represents the space that is defined by the source signals and p indicates the number of source signals.⁴

² In this paper, source and mixture signals are represented as random variables instead of time series or time signals, i.e., the time index is dropped.
³ Two signals s_1 and s_2 are independent if the amplitude of s_1 is independent of the amplitude of s_2.
⁴ In this paper, all bold lowercase letters denote vectors and bold uppercase letters indicate matrices.

The source signals (s_1 and s_2) can be mixed as follows, x_1 = a s_1 + b s_2, where a and b are the mixing coefficients and x_1 is the first mixture signal. Thus, the mixture x_1 is the weighted sum of the two source signals (s_1 and s_2). Similarly, another mixture (x_2) can be measured by changing the distance between the source signals and the sensing device, e.g. a microphone, and it is calculated as follows, x_2 = c s_1 + d s_2, where c and d are mixing coefficients. The two mixing coefficients a and b are different from the coefficients c and d because the two sensing devices which are used for sensing these signals are in different locations, so each sensor measures a different mixture of the source signals. As a consequence, each source signal has a different impact on the output signals. The two mixtures can be represented as follows:

X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a s_1 + b s_2 \\ c s_1 + d s_2 \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} = AS   (2)

where X ∈ R^{n×N} is the space that is defined by the mixture signals and n is the number of mixtures. Therefore, simply, the mixing coefficients (a, b, c, and d) are utilized for linearly transforming the source signals from the S space to mixed signals in the X space as follows, S → X : X = AS, where A ∈ R^{n×p} is the mixing coefficients matrix (see Fig. 2), and it is defined as:

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}   (3)

2.1.1. Illustrative example

The goal of this example is to show the properties of source and mixture signals. Given two source signals s_1 = sin(a) and s_2 = r - 0.5, where a is in the range [1, 30] with time step 0.05 and r indicates a random number in the range [0, 1]. Fig. 3 shows the source signals, histograms, and scatter diagram of both signals. As shown, the two source signals are independent and their histograms are not Gaussian. The scatter diagram in Fig. 3(e) shows how the two source signals are independent, where each point represents the amplitude of both source signals. Fig. 4 shows the mixture signals with their histograms and scatter diagram. As shown, the histograms of both mixture signals are approximately Gaussian, and the mixtures are not independent. Moreover, the mixture signals are more complex than the source signals. From this example, it is remarked that the mixed signals have the following properties:
1. Independence: if the source signals are independent (as in Fig. 3(a and b)), their mixture signals are not (see Fig. 4(a and b)). This is because the source signals are shared between both mixtures.

2. Gaussianity: the histograms of the mixed signals are bell-shaped (see Fig. 4(c and d)), i.e., Gaussian or normal. This property can be used for searching for non-Gaussian signals within the mixture signals in order to extract the source or independent signals. In other words, the source signals must be non-Gaussian, and this assumption is a fundamental restriction in ICA. Hence, the ICA model cannot estimate Gaussian independent components.

3. Complexity: it is clear from the previous example that the mixed signals are more complex than the source signals.

From these properties, we can conclude that if the signals extracted from the mixture signals are independent, have non-Gaussian histograms, or have lower complexity than the mixture signals, then these signals represent source signals.

2.1.2. Numerical example: Mixing signals

The goal of this example⁵ is to explain how source signals are mixed to form mixture signals. Fig. 5 shows two source signals s_1 and s_2 which form the space S. The two axes of the S space (s_1 and s_2) represent the x-axis and y-axis, respectively. Additionally, the vector with coordinates (1 0)^T lies on the axis s_1 in S; hence, simply, the symbol s_1 refers to this vector and, similarly, s_2 refers to the vector with coordinates (0 1)^T. During the mixing process, the matrix A transforms s_1 and s_2 in the S space to s'_1 and s'_2, respectively, in the X space (see Eqs. (4) and (5)).

⁵ In all numerical examples, the numbers are rounded to the nearest hundredth (two digits after the decimal point).

s'_1 = A s_1 = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} a \\ c \end{pmatrix}   (4)

s'_2 = A s_2 = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} b \\ d \end{pmatrix}   (5)

In our example, assume that the mixing matrix is as follows, A = \begin{pmatrix} 1 & 2 \\ 1 & -1 \end{pmatrix}. Given two source signals represented by the four points (1, 1)^T, (2, 1)^T, (1, 2)^T, and (2, 2)^T in the S space (the black points in Fig. 5), the corresponding mixture signals are given in Eq. (9).

Fig. 2. An illustrative example of the process of mixing signals. Two source signals are mixed linearly by the mixing matrix (A) to form two new mixture signals.

Fig. 3. An illustrative example of two source signals. (a) and (b) The first and second source signals (s_1 and s_2); (c) and (d) the histograms of s_1 and s_2, respectively; (e) the scatter diagram of the source signals, where s_1 and s_2 represent the x-axis and y-axis, respectively.

Fig. 4. An illustrative example of two mixture signals. (a) and (b) The first and second mixture signals x_1 and x_2, respectively; (c) and (d) the histograms of x_1 and x_2, respectively; (e) the scatter diagram of both mixture signals, where x_1 and x_2 represent the x-axis and y-axis, respectively.

Fig. 5. An example of the mixing process. The mixing matrix A transforms the two source signals (s_1 and s_2) in the S space to (s'_1 and s'_2) in the mixture space X. The two source signals can be represented by four points (in black) in the S space. These points are transformed by the mixing matrix A into four different points (in red) in the X space. Additionally, the vectors w_1 and w_2, which are used to extract the source signals s_1 and s_2, are plotted as dotted red and blue lines, respectively. w_1 and w_2 are orthogonal to s'_2 and s'_1, respectively.
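The illustrative example of Section 2.1.1 can be reproduced in a few lines of NumPy. This is a sketch rather than code from the paper: the random seed and the numerical values of the mixing coefficients in A are assumed, and the excess-kurtosis helper follows the normalized-kurtosis definition used later in Section 4.1.1.

```python
import numpy as np

a = np.arange(1, 30, 0.05)             # time index in [1, 30] with step 0.05
rng = np.random.default_rng(42)        # assumed seed
s1 = np.sin(a)                         # first source: s1 = sin(a)
s2 = rng.uniform(0, 1, a.size) - 0.5   # second source: s2 = r - 0.5

S = np.vstack([s1, s2])                # source matrix S (p x N)
A = np.array([[1.0, 2.0],              # assumed mixing coefficients a, b, c, d
              [1.0, -1.0]])
X = A @ S                              # mixing process: X = AS, Eq. (2)

def excess_kurtosis(x):
    """Fourth central moment over the squared variance, minus 3 (cf. Eq. (22))."""
    d = x - x.mean()
    return (d ** 4).mean() / ((d ** 2).mean() ** 2) - 3

# The mixtures are closer to Gaussian (kurtosis nearer zero) than the sine source
print(excess_kurtosis(s1), excess_kurtosis(X[0]))
```

Plotting histograms of `s1`, `s2`, `X[0]`, and `X[1]` reproduces the qualitative behaviour of Figs. 3 and 4: the sources have clearly non-Gaussian histograms, while the mixtures look roughly bell-shaped.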
X = AS = \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix}   (9)

Assume the second source s_2 is silent/OFF; hence, the sensors record only the signal that is generated from s_1 (see Fig. 6(a)). The mixed signals lie along s'_1 = (a c)^T, and the distribution of the projected samples onto s'_1 is depicted in Fig. 6(a). Similarly, Fig. 6(b) shows the projection onto s'_2 = (b d)^T when the first source is silent; this projection represents the mixed data. It is worth mentioning that the new axes s'_1 and s'_2 need not be orthogonal to s_1 and s_2, respectively. Fig. 5 is the combination of Fig. 6(a) and (b) when both source signals are played together and the sensors measure the two signals simultaneously.

A related point to consider is that the number of red points in Fig. 6(a), which represent the projected points onto s'_1, is three, while the number of original points was four. This can be interpreted mathematically by calculating the coordinates of the projected points onto s'_1. For example, the projection of the first point (1 1)^T is calculated as follows, s'^T_1 (1 1)^T = (1 1)(1 1)^T = 2. Similarly, the projections of the second, third, and fourth points are 3, 3, and 4, respectively. Therefore, the second and third samples were projected onto the same position on s'_1. This is the reason why the number of projected points is three.

Fig. 6. An example of the mixing process. The mixing matrix A transforms the source signals as follows: (a) s_1 is transformed from the S space to s'_1 = (a, c)^T (solid red line), which is one of the axes of the mixture space X. The red stars represent the projection of the data points onto s'_1; these red stars represent all samples that are generated from the first source s_1. (b) s_2 is transformed from the S space to s'_2 = (b, d)^T (solid blue line), which is one of the axes of the mixture space X. The blue stars represent the projection of the data points onto s'_2; these blue stars represent all samples that are generated from the second source s_2.

2.2. Unmixing signals

The unmixing process estimates the source signals from the mixtures as follows:

y_1 = a x_1 + b x_2
y_2 = c x_1 + d x_2   (10)

where a, b, c, and d represent unmixing coefficients, which are used for transforming the mixture signals into a set of independent signals as follows, X → Y : Y = W^T X, where W ∈ R^{n×p} is the unmixing coefficients matrix, as shown in Fig. 7. Simply, we can say that the first source signal, y_1, can be extracted from the mixtures (x_1 and x_2) using two unmixing coefficients (a and b). This pair of unmixing coefficients defines a point with coordinates (a, b), where w_1 = (a b)^T is a weight vector (see Eq. (11)). Similarly, y_2 can be extracted using the two unmixing coefficients c and d, which define the weight vector w_2 = (c d)^T (see Eq. (11)).

y_1 = a x_1 + b x_2 = w_1^T X
y_2 = c x_1 + d x_2 = w_2^T X   (11)

W = (w_1 w_2)^T is the unmixing matrix, and it represents the inverse of A. The unmixing process can be achieved by rotating the rows of W. This rotation continues until each row in W (w_1 or w_2) finds the orientation which is orthogonal to the other transformed signals. For example, in our example, w_1 is orthogonal to s'_2 (see Fig. 5). The source signals are then extracted by projecting the mixture signals onto that orientation.

In practice, changing the length or orientation of the weight vectors has a great influence on the extracted signals (Y). This is the reason why the extracted signals may not be identical to the original source signals. The consequences of changing the length or orientation of the weight vectors are as follows:

• Orientation: As mentioned before, the source signals s_1 and s_2 in the S space are transformed to s'_1 and s'_2 (see Eqs. (4) and (5)), respectively, where s'_1 and s'_2 form the mixture space X. The signal s_1 is extracted only if w_1
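The numbers in this example can be verified with a short script. This is a sketch under the same assumptions as the text above: the mixing matrix A = [[1, 2], [1, -1]] and the four source points (1,1), (2,1), (1,2), (2,2).

```python
import numpy as np

A = np.array([[1, 2], [1, -1]])
S = np.array([[1, 2, 1, 2],      # four source points (columns): (1,1), (2,1), (1,2), (2,2)
              [1, 1, 2, 2]])

X = A @ S                        # mixtures; rows are [3 4 5 6] and [0 1 -1 0], Eq. (9)
print(X)

s1p = A @ np.array([1, 0])       # s'1 = (a, c)^T = (1, 1)^T, Eq. (4)
proj = s1p @ S                   # projections of the source points onto s'1
print(proj)                      # [2 3 3 4]
```

The four projections take only three distinct values (2, 3, 4), which is why Fig. 6(a) shows three red points for four samples.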
is orthogonal to s'_2, and hence at different orientations different signals are extracted. This is because the inner product of any two orthogonal vectors is zero, as follows: y_1 = w_1^T X = w_1^T AS = w_1^T (s'_1 s'_2), where w_1^T s'_2 = 0 because w_1 is orthogonal to s'_2, and the inner product of w_1 and s'_1 is as follows, w_1^T s'_1 = |w_1||s'_1| cos θ = |w_1||A s_1| cos θ = k s_1, where θ is the angle between w_1 and s'_1, as shown in Fig. 5, and k is a constant. The value of k depends on the lengths of w_1 and s'_1 and the angle θ. The extracted signal will be as follows, y_1 = w_1^T (s'_1 s'_2) = (w_1^T s'_1 + w_1^T s'_2) = k s_1. The extracted signal (k s_1) is a scaled version of the source signal (s_1), and k s_1 is extracted from X by taking the inner product of all mixture signals with w_1, which is orthogonal to s'_2. Thus, it is difficult to recover the amplitude of the source signals.

Fig. 7. An illustrative example of the process of extracting signals. Two source signals (y_1 and y_2) are extracted from two mixture signals (x_1 and x_2) using the unmixing matrix W.

Fig. 8 displays the mixing and unmixing steps of ICA. As shown, the first mixture signal x_1 is observed using only the first row in the A matrix, where the first element in x_1 is calculated as follows, x_{11} = a_{11} s_{11} + a_{12} s_{21} + ... + a_{1p} s_{p1}. Moreover, the number of mixture signals and the number of source signals are not always the same. This is because the number of mixture signals depends on the number of sensors. Additionally, the dimension of W does not agree with X; hence, W is transposed, and the first element in the first extracted signal (y_1) is estimated as follows, y_{11} = w_{11} x_{11} + w_{21} x_{21} + ... + w_{n1} x_{n1}. Similarly, all the other elements of all extracted signals can be estimated.

2.2.1. Numerical example: Unmixing signals

The goal of this example is to extract the source signals which were mixed in the numerical example in Section 2.1.2. The matrix W is the inverse of A, and the value of W is

W = \begin{pmatrix} 1/3 & 2/3 \\ 1/3 & -1/3 \end{pmatrix}

where the vector w_1 in W is orthogonal to s'_2, i.e., the inner product (1/3 2/3)(2 -1)^T = 0, and similarly, the vector w_2 is orthogonal to s'_1 (see Fig. 5). Moreover, the source signal s_1 is extracted as follows, s_1 = w_1^T X = (1/3 2/3) \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix} = (1 2 1 2), and similarly, s_2 is extracted as follows, s_2 = w_2^T X = (1/3 -1/3) \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix} = (1 1 2 2).

Hence, the original source signals are extracted perfectly. This is because k ≈ 1, and hence, according to Eq. (12), the extracted signal is identical to the source signal. As mentioned before, the value of k is calculated as follows, k = |w_1||s'_1| cos θ, where |w_1| = sqrt((1/3)^2 + (2/3)^2) = sqrt(5)/3, and the value of |s'_1| = sqrt(1^2 + 1^2) = sqrt(2).

Fig. 8. Block diagram of the ICA mixing and unmixing steps. a_{ij} is the mixing coefficient for the ith mixture signal and jth source signal, and w_{ij} is the unmixing coefficient for the ith extracted signal and jth mixture signal.
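The unmixing computation of Section 2.2.1 can be reproduced directly. A sketch, under the same assumptions as the example: A = [[1, 2], [1, -1]] and the mixtures of Eq. (9).

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, -1.0]])   # mixing matrix of the numerical example
X = np.array([[3.0, 4.0, 5.0, 6.0],
              [0.0, 1.0, -1.0, 0.0]])     # mixtures, Eq. (9)

W = np.linalg.inv(A)                      # unmixing matrix; rows are w1 and w2
print(np.round(W, 2))                     # rows (0.33, 0.67) and (0.33, -0.33)

# w1 is orthogonal to s'2 = A @ (0, 1)^T, so the second source is cancelled
assert abs(W[0] @ (A @ np.array([0.0, 1.0]))) < 1e-12

Y = W @ X                                 # recovered sources (1 2 1 2) and (1 1 2 2)
print(np.round(Y))
```

Because W is exactly the inverse of A here (k ≈ 1), the recovered signals match the sources; a rescaled or rotated w_1 would instead return a scaled or different signal.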
The angle between s'_1 and the s_1 axis is 45° because s'_1 = (1 1)^T; similarly, the angle between w_1 and the s_1 axis is cos^{-1}((1/3)/(sqrt(5)/3)) = cos^{-1}(1/sqrt(5)) ≈ 63° (see Fig. 5, top left corner). Therefore, θ ≈ 63° - 45° = 18°, and hence k = (sqrt(5)/3)(sqrt(2)) cos 18° ≈ 1. Hence, changing the orientation of w_1 leads to a different extracted signal.

2.3. Ambiguities of ICA

ICA has some ambiguities, such as:

• The order of independent components: In ICA, the weight vector (w_i) is initialized randomly and then rotated to find one independent component. During the rotation, the value of w_i is updated iteratively. Thus, w_i extracts the source signals, but not in a specific order.

• The sign of independent components: Changing the sign of the independent components has no influence on the ICA model. In other words, we can multiply the weight vectors in W by -1 without affecting the extracted signal. In our example in Section 2.2.1, the value of w_1 was (1/3 2/3). Multiplying w_1 by -1, i.e., w_1 = (-1/3 -2/3), has no influence because w_1 is still in the same direction with the same magnitude, and hence the value of k will not be changed; the extracted signal s_1 will have the same values but with a different sign, i.e., s_1 = w_1^T X = -(1 2 1 2). As a result, the matrix W in n-dimensional space has 2^n local maxima, i.e., two local maxima for each independent component, corresponding to s_i and -s_i [21]. This problem is insignificant in many applications [16,19].

3. ICA: Preprocessing phase

This section explains the preprocessing steps of the ICA technique. This phase has two main steps: centering and whitening.

3.1. The centering step

The goal of this step is to center the data by subtracting the mean from all signals. Given n mixture signals (X), the mean is μ, and the centering step can be calculated as follows:

D = X - μ = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix} = \begin{pmatrix} x_1 - μ \\ x_2 - μ \\ \vdots \\ x_n - μ \end{pmatrix}   (13)

where D is the mixture signals after the centering step, as in Fig. 9(A), and μ ∈ R^{1×N} is the mean of all mixture signals. The mean vector can be added back to the independent components after applying ICA.

3.2. The whitening data step

This step aims to whiten the data, which means transforming the signals into uncorrelated signals and then rescaling each signal to have unit variance. It includes two main steps, as follows.

1. Decorrelation: The goal of this step is to decorrelate the signals; in other words, make each signal uncorrelated with the others. Two random variables are considered uncorrelated if their covariance is zero. In ICA, the PCA technique can be used for decorrelating the signals. In PCA, the eigenvectors which form the new PCA space are calculated. First, the covariance matrix is calculated. The covariance matrix of any two variables (x_i, x_j) is defined as follows, Σ_{ij} = E{x_i x_j} - E{x_i}E{x_j} = E[(x_i - μ_i)(x_j - μ_j)]. With many variables, the covariance matrix is calculated as follows, Σ = E[DD^T], where D is the centered data (see Fig. 9(B)). The covariance matrix is solved by calculating the eigenvalues (λ) and eigenvectors (V) as follows, ΣV = λV, where the eigenvectors represent the principal components, which represent the directions of the PCA space, and the eigenvalues are scalar values which represent the magnitude of the eigenvectors. The eigenvector which has the maximum eigenvalue is the first principal component (PC_1), and it has the maximum variance [33]. For decorrelating the mixture signals, they are projected onto the calculated PCA space as follows, U = VD.

2. Scaling: The goal here is to scale each decorrelated signal to have unit variance. Hence, each decorrelated signal in U is rescaled to unit variance as follows, Z = λ^{-1/2} U = λ^{-1/2} VD, where Z is the whitened or sphered data, and λ^{-1/2} is calculated by a simple component-wise operation as follows, λ^{-1/2} = {λ_1^{-1/2}, λ_2^{-1/2}, ..., λ_n^{-1/2}}. After the scaling step, the data become rotationally symmetric like a sphere; therefore, the whitening step is also called sphering [32].

3.3. Numerical example

Given eight mixture signals X = {x_1, x_2, ..., x_8}, each mixture signal is represented by one row in X, as in Eq. (14).⁶ The mean (μ) was then calculated, and its value was μ = (2.63 3.63)^T.

X^T = \begin{pmatrix} 1.00 & 1.00 & 2.00 & 0.00 & 5.00 & 4.00 & 5.00 & 3.00 \\ 3.00 & 2.00 & 3.00 & 3.00 & 4.00 & 5.00 & 5.00 & 4.00 \end{pmatrix}   (14)

In the centering step, the data are centered by subtracting the mean from each signal, and the value of D will be as follows:

D^T = \begin{pmatrix} -1.63 & -1.63 & -0.63 & -2.63 & 2.38 & 1.38 & 2.38 & 0.38 \\ -0.63 & -1.63 & -0.63 & -0.63 & 0.38 & 1.38 & 1.38 & 0.38 \end{pmatrix}   (15)

The covariance matrix (Σ) and its eigenvalues (λ) and eigenvectors (V) are then calculated as follows:

Σ = \begin{pmatrix} 3.70 & 1.70 \\ 1.70 & 1.13 \end{pmatrix}, λ = \begin{pmatrix} 0.28 & 0.00 \\ 0.00 & 4.54 \end{pmatrix}, and V = \begin{pmatrix} -0.45 & 0.90 \\ 0.90 & 0.45 \end{pmatrix}   (16)

From Eq. (16) it can be remarked that the two eigenvectors are orthogonal, as shown in Fig. 10, i.e., v_1^T v_2 = (-0.45 0.90)(0.90 0.45)^T = 0, where v_1 and v_2 represent the first and second eigenvectors, respectively. Moreover, the value of the second eigenvalue (λ_2) is larger than the first one (λ_1), and λ_2 represents 4.54/(0.28 + 4.54) ≈ 94.19% of the total of the eigenvalues; thus, v_2 and v_1 represent the first and second principal components of the PCA space, respectively, and v_2 points in the direction of the maximum variance (see Fig. 10).

The two signals are decorrelated by projecting the centered data onto the PCA space as follows, U = VD. The values of U are

U^T = \begin{pmatrix} 0.16 & -0.73 & -0.28 & 0.61 & -0.72 & 0.62 & 0.18 & 0.17 \\ -1.73 & -2.18 & -0.84 & -2.63 & 2.29 & 1.84 & 2.74 & 0.50 \end{pmatrix}   (17)

⁶ Due to the paper size, Eq. (14) indicates X^T instead of X; hence, each column represents one signal/sample. Similarly, D in Eq. (15), U in Eq. (17), and Z in Eq. (19).
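The centering and decorrelation steps of this numerical example can be checked with NumPy. A sketch: `np.cov` uses the sample covariance with N - 1 in the denominator, which is the convention that matches the values in Eq. (16).

```python
import numpy as np

# Eight two-dimensional samples (one per column), as in Eq. (14)
X = np.array([[1, 1, 2, 0, 5, 4, 5, 3],
              [3, 2, 3, 3, 4, 5, 5, 4]], dtype=float)

mu = X.mean(axis=1, keepdims=True)   # mean of each mixture signal
D = X - mu                           # centering step, Eq. (13)

Sigma = np.cov(D)                    # covariance of the centered data
lam, V = np.linalg.eigh(Sigma)       # eigenvalues (ascending) and eigenvectors (columns)
print(mu.ravel())                    # [2.625 3.625]
print(np.round(lam, 2))              # [0.28 4.54]

U = V.T @ D                          # decorrelation: project onto the PCA space
print(np.round(np.cov(U), 2))        # approximately diag(0.28, 4.54)
```

Note that `np.linalg.eigh` returns the eigenvectors as columns, so `V.T @ D` plays the role of the projection U = VD in the text, where the rows of V are the eigenvectors.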
Fig. 9. Visualization of the preprocessing steps in ICA. (A) The centering step. (B) The whitening step.

The matrix U is already centered; thus, the covariance matrix for U is given by

E(UU^T) = \begin{pmatrix} 0.28 & 0 \\ 0 & 4.54 \end{pmatrix}   (18)

From Eq. (18) it is remarked that the two mixture signals are decorrelated by projecting them onto the PCA space. Thus, the covariance matrix is diagonal, and the off-diagonal elements, which represent the covariance between the two mixture signals, are zeros. Fig. 10 displays the contour of the two mixtures as an ellipsoid centered at the mean. The projection of the mixture signals onto the PCA space rotates the ellipse so that the principal components are aligned with the x_1 and x_2 axes. After the decorrelation step, the signals are then rescaled to have unit variance (see Fig. 10). The whitening can be calculated as follows, Z = λ^{-1/2} VD, and the values of the mixture signals after the scaling step are

Z^T = \begin{pmatrix} 0.31 & -1.38 & -0.53 & 1.15 & -1.36 & 1.17 & 0.33 & 0.32 \\ -0.81 & -1.02 & -0.39 & -1.23 & 1.08 & 0.87 & 1.29 & 0.24 \end{pmatrix}   (19)

The covariance matrix for the whitened data is E[ZZ^T] = E[(λ^{-1/2} VD)(λ^{-1/2} VD)^T] = E[(λ^{-1/2} VD)(D^T V^T λ^{-1/2})]. λ^{-1/2} is diagonal; thus, λ^{-1/2} = (λ^{-1/2})^T, and [DD^T] is the covariance matrix (Σ), which is equal to V^T λ V. Hence, E[ZZ^T] = E[λ^{-1/2} V V^T λ V V^T λ^{-1/2}] = I, where VV^T = I because V is orthonormal.⁷ This means that the covariance matrix of the whitened data is the identity matrix (see Eq. (20)), which means that the data are decorrelated and have unit variance.
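Completing the whitening on the same data: the scaling step divides each decorrelated component by the square root of its eigenvalue, after which the covariance of the result is the identity matrix, as in Eq. (20). A minimal sketch:

```python
import numpy as np

X = np.array([[1, 1, 2, 0, 5, 4, 5, 3],
              [3, 2, 3, 3, 4, 5, 5, 4]], dtype=float)
D = X - X.mean(axis=1, keepdims=True)   # centering

lam, V = np.linalg.eigh(np.cov(D))      # eigendecomposition of the covariance
U = V.T @ D                             # decorrelation
Z = np.diag(lam ** -0.5) @ U            # scaling: unit variance per component

print(np.round(np.cov(Z), 2))           # identity matrix, Eq. (20)
```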
Fig. 10. Visualization of our whitening example. In the left panel, the mixture signals are plotted as red stars. This panel also shows the principal components (PC_1 and PC_2). In the top right panel, the data in blue represent the projected data onto the PCA space. The data are then normalized to have unit variance (bottom right panel). In this panel, the data in green represent the whitened data.

Fig. 11. Visualization of mixture signals during the whitening step. (a) Scatter plot of two mixture signals x_1 and x_2; (b) the projection of the mixture signals onto the PCA space, i.e., decorrelation; (c) the mixture signals after the whitening step, scaled to have unit variance.

E(ZZ^T) = \begin{pmatrix} 1.00 & 0 \\ 0 & 1.00 \end{pmatrix}   (20)

Fig. 11 displays the scatter plot of two mixtures, where each mixture signal is represented by 500 time steps. As shown in Fig. 11(a), the scatter of the original mixtures forms an ellipse centered at the origin. Projecting the mixture signals onto the PCA space rotates the principal components to be aligned with the x_1 and x_2 axes, and hence the ellipse is also rotated, as shown in Fig. 11(b). After the whitening step, the contour of the mixture signals forms a circle. This is because the signals have unit variance.

4. Principles of ICA estimation

In ICA, the goal is to find the unmixing matrix (W) and then project the whitened data onto that matrix to extract the independent signals. This matrix can be estimated using three main approaches of independence, which result in slightly different unmixing matrices. The first is based on non-Gaussianity. This can be measured by some measures such as negentropy and kurtosis, and the goal of this approach is to find independent components which maximize the non-Gaussianity [25,30]. In the second approach, the ICA goal can be obtained by minimizing the mutual information [22,14]. Independent components can also be estimated by using the maximum likelihood (ML) estimation method [28]. All approaches simply search for a rotation or unmixing matrix W. Projecting the whitened data onto that rotation matrix extracts independent signals. The preprocessing steps are calculated from the data, but the rotation matrix is approximated numerically through an optimization procedure. Searching for the optimal solution is difficult due to the local minima that exist in the objective function. In this section, different approaches are introduced for extracting independent components.
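The point that all approaches search for a rotation can be illustrated numerically: any rotation of whitened data is still white, so second-order statistics cannot identify the unmixing rotation, while a higher-order statistic such as kurtosis does vary with the rotation angle. A sketch with assumed Laplace (super-Gaussian) sources, an assumed mixing matrix, and an assumed seed:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 20000))              # assumed super-Gaussian sources
A = np.array([[1.0, 2.0], [1.0, -1.0]])       # assumed mixing matrix
X = A @ S

D = X - X.mean(axis=1, keepdims=True)         # centering
lam, V = np.linalg.eigh(np.cov(D))
Z = np.diag(lam ** -0.5) @ V.T @ D            # whitened mixtures

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Any rotation of the whitened data still has identity covariance ...
for theta in (0.3, 1.0, 2.0):
    assert np.allclose(np.cov(rotation(theta) @ Z), np.eye(2), atol=1e-10)

# ... but the kurtosis of a component changes with the rotation angle
def kurt(y):
    d = y - y.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2 - 3

ks = [kurt((rotation(t) @ Z)[0]) for t in np.linspace(0, np.pi, 60)]
print(round(min(ks), 2), round(max(ks), 2))
```

The kurtosis peaks when the rotated component aligns with one of the original sources, which is exactly the criterion exploited in Section 4.1.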
10 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx
4.1. Measures of non-Gaussianity

Searching for independent components can be achieved by maximizing the non-Gaussianity of the extracted signals [23]. Two measures are used for measuring the non-Gaussianity, namely, kurtosis and negative entropy.

4.1.1. Kurtosis
Kurtosis can be used as a measure of non-Gaussianity, and the extracted signal can be obtained by finding the unmixing vector which maximizes the kurtosis of the extracted signal [4]. In other words, the source signals can be extracted by finding the orientation of the weight vectors which maximizes the kurtosis.

Kurtosis is simple to calculate; however, it is sensitive to outliers. Thus, it is not robust enough for measuring the non-Gaussianity [21]. The kurtosis (K) of any probability density function (pdf) is defined as follows,

K(x) = E[x^4] − 3(E[x^2])^2  (21)

where the normalized kurtosis (K̂) is the ratio between the fourth and second central moments, and it is given by

K̂(x) = E[x^4] / (E[x^2])^2 − 3 = [(1/N) Σ_{i=1}^{N} (x_i − μ)^4] / [(1/N) Σ_{i=1}^{N} (x_i − μ)^2]^2 − 3  (22)

For whitened data (Z), E[Z^2] = 1 because Z has unit variance. Therefore, the kurtosis will be

K(Z) = K̂(Z) = E[Z^4] − 3  (23)

As reported in [20], the fourth moment for Gaussian signals is 3(E[Z^2])^2, and hence

K̂(Z) = E[Z^4] − 3 = 3(E[Z^2])^2 − 3 = 3(1)^2 − 3 = 0,

where E[Z^2] = 1. As a consequence, Gaussian pdfs have zero kurtosis.

Kurtosis has an additivity property as follows:

K(x_1 + x_2) = K(x_1) + K(x_2),  (24)

and for any scalar parameter a,

K(a x_1) = a^4 K(x_1)  (25)

where a is a scalar.

These properties can be used for interpreting one of the ambiguities of ICA that are mentioned in Section 2.3, which is the sign of the independent components. Given two source signals s_1 and s_2, and the matrix Q = A^T W = A^{−1} W. Hence,

Y = W^T X = W^T A S = Q S = q_1 s_1 + q_2 s_2  (26)

Using the kurtosis properties in Eqs. (24) and (25), we have

K(Y) = K(q_1 s_1) + K(q_2 s_2) = q_1^4 K(s_1) + q_2^4 K(s_2)  (27)

Assume that s_1, s_2, and Y have unit variance. This implies that E[Y^2] = q_1^2 E[s_1^2] + q_2^2 E[s_2^2] = q_1^2 + q_2^2 = 1. Geometrically, this means that Q is constrained to the unit circle in the two-dimensional space. The aim of ICA is to maximize the kurtosis K(Y) = q_1^4 K(s_1) + q_2^4 K(s_2) on the unit circle. The optimal solutions, i.e., maxima, are the points where one element of Q is zero and the other is nonzero; this is due to the unit circle constraint, and the nonzero element must be 1 or −1 [11]. These optimal solutions are the ones which are used to extract s_i. Generally, Q = A^T W = I means that each vector in the matrix Q extracts only one source signal.

The ICs can be obtained by finding the weight vectors which maximize the kurtosis of the extracted signals Y = W^T Z. The kurtosis of Y is then calculated as in Eq. (23), where the term (E[y_i^2])^2 in Eq. (22) is equal to one because W and Z have unit length: W has unit length because it is scaled to unit length, and Z is the whitened data, so it has unit variance. Thus, the kurtosis can be expressed as:

K(Y) = E[(W^T Z)^4] − 3  (28)

The gradient of the kurtosis of Y is given by ∂K(W^T Z)/∂W = c E[Z (W^T Z)^3], where c is a constant, which we set to unity for convenience. The weight vector is updated in each iteration as follows, w_new = w_old + η E[Z (w_old^T Z)^3], where η is the step size for the gradient ascent. Since we are optimizing the kurtosis on the unit circle ||w|| = 1, the gradient method must be complemented by projecting w onto the unit circle after every step. This can be done by normalizing the weight vector w_new, i.e., dividing it by its norm as follows: w_new = w_new / ||w_new||. The value of w_new is updated in each iteration.

4.1.2. Negative entropy
Negative entropy is termed negentropy, and it is defined as follows, J(y) = H(y_Gaussian) − H(y), where H(y_Gaussian) is the entropy of a Gaussian random variable whose covariance matrix is equal to the covariance matrix of y. The entropy of a random variable Q which has N possible outcomes is

H(Q) = −E[log p_q(q)] = −(1/N) Σ_{t=1}^{N} log p_q(q_t)  (29)

where p_q(q_t) is the probability of the event q_t, t = 1, 2, …, N.

The negentropy is zero when all variables are Gaussian, i.e., H(y_Gaussian) = H(y). Negentropy is always nonnegative because the entropy of a Gaussian variable is the maximum among all other random variables with the same variance. Moreover, it is invariant under invertible linear transformations and it is scale-invariant [21]. However, calculating the entropy from finite data is computationally difficult. Hence, different approximations have been introduced for calculating the negentropy [21]. For example,

J(y) ≈ (1/12) E[y^3]^2 + (1/48) K(y)^2  (30)

where y is assumed to have zero mean. This approximation suffers from the sensitivity of kurtosis; therefore, Hyvärinen proposed another approximation based on the maximum entropy principle as follows [23]:

J(y) ≈ Σ_{i=1}^{p} k_i (E[G_i(y)] − E[G_i(v)])^2,  (31)

where the k_i are some positive constants, v indicates a Gaussian variable with zero mean and unit variance, and the G_i represent some nonquadratic functions [23,20]. The function G has different choices such as

G_1(y) = (1/a_1) log cosh(a_1 y) and G_2(y) = −exp(−y^2/2)  (32)

where 1 ≤ a_1 ≤ 2. These two functions are widely used, and these approximations give a very good compromise between the properties of kurtosis and negentropy, which are the two classical non-Gaussianity measures.
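The gradient-ascent update w_new = w_old + η E[Z (w_old^T Z)^3], followed by renormalization onto the unit circle, can be sketched as follows. This is a minimal Python sketch with two hypothetical super-Gaussian (Laplace) sources and an arbitrary mixing matrix; the extracted signal should match one of the sources up to sign and scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Two independent super-Gaussian (Laplace) sources; hypothetical demo data.
S = rng.laplace(size=(2, n))
A = np.array([[1.5, 0.7], [0.6, 0.2]])   # arbitrary example mixing matrix
X = A @ S

# Center and whiten the mixtures (Z has identity covariance).
Xc = X - X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(np.cov(Xc))
Z = np.diag(d ** -0.5) @ V.T @ Xc

# Gradient ascent on the kurtosis: w <- w + eta * E[Z (w^T Z)^3], then renormalize.
w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.1
for _ in range(200):
    y = w @ Z
    w = w + eta * (Z * y**3).mean(axis=1)
    w /= np.linalg.norm(w)              # project back onto the unit circle

y = w @ Z
# The extracted signal should be highly correlated with exactly one source.
corr = [abs(np.corrcoef(y, s)[0, 1]) for s in S]
print(max(corr))  # close to 1
```

Which of the two sources is recovered depends on the random initialization; both directions are maxima of the kurtosis on the unit circle.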
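The approximation in Eq. (30) can also be checked numerically. In the following minimal sketch (sample size and distributions are arbitrary choices), the estimate is close to zero for Gaussian data and clearly positive for super-Gaussian (Laplace) data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def negentropy_approx(y):
    """J(y) ~ E[y^3]^2 / 12 + K(y)^2 / 48 for standardized y (Eq. 30)."""
    y = (y - y.mean()) / y.std()
    kurt = (y**4).mean() - 3.0              # sample excess kurtosis
    skew_term = (y**3).mean() ** 2 / 12.0
    return skew_term + kurt**2 / 48.0

j_gauss = negentropy_approx(rng.normal(size=n))     # ~ 0 for Gaussian data
j_laplace = negentropy_approx(rng.laplace(size=n))  # clearly positive
print(j_gauss, j_laplace)
```

For a Laplace variable the excess kurtosis is 3, so the approximation is roughly 9/48 ≈ 0.19, while for Gaussian data both terms vanish up to sampling noise.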
The pdf of the linear transformation Y = W^T X is p_Y(Y) = p_X(X)/|W|, where |W| represents |∂Y/∂X|. Similarly, p_Ŷ(Ŷ) = p_Y(Y)/(dŶ/dY) = p_Y(Y)/p_S(Y), where dŶ/dY is equal to g′(y), which represents the pdf of the source signals (p_S).

This can be substituted in Eq. (29) and the entropy will be

H(Ŷ) = −(1/N) Σ_{t=1}^{N} log p_Ŷ(Ŷ_t) = −(1/N) Σ_{t=1}^{N} log [p_Y(y_t)/p_S(y_t)]
     = −(1/N) Σ_{t=1}^{N} log [p_X(x_t)/(|W| p_S(y_t))]
     = (1/N) Σ_{t=1}^{N} log p_S(y_t) + log |W| − (1/N) Σ_{t=1}^{N} log p_X(x_t)  (33)

In Eq. (33), as the matching between the extracted and source signals increases, the ratio p_Y(Y)/p_S(Y) approaches one. As a consequence, p_Ŷ(Ŷ) = p_Y(Y)/p_S(Y) becomes uniform, which maximizes the entropy of p_Ŷ(Ŷ). Moreover, the term −(1/N) Σ_{t=1}^{N} log p_X(x_t) represents the entropy of X; hence, Eq. (33) is given by

H(Ŷ) = (1/N) Σ_{t=1}^{N} log p_S(y_t) + log |W| + H(X)  (34)

Hence, from Eq. (34), H(Y) = H(X) + log |W|. This means that in the linear transformation Y = W^T X, the entropy is changed (increased or decreased) by log |W|. As mentioned before, the entropy H(X) is not affected by W, and W maximizes only the entropy H(Ŷ); hence, H(X) can be removed from Eq. (34). The mutual information (I) between two random variables x and y is given by

I(x; y) = Σ_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]  (35)
        = H(x) − H(x|y) = H(y) − H(y|x)  (36)
        = H(x) + H(y) − H(x, y)
        = H(x, y) − H(x|y) − H(y|x)

where H(x) and H(y) represent the marginal entropies, H(x|y) and H(y|x) are conditional entropies, and H(x, y) is the joint entropy of x and y. The value of I is zero if and only if the variables are independent; otherwise, I is non-negative. The mutual information between m random variables (y_i, i = 1, 2, …, m) is given by

I(y_1, y_2, …, y_m) = Σ_{i=1}^{m} H(y_i) − H(y)  (37)

In ICA, where Y = W^T X and H(Y) = H(X) + log |W|, Eq. (37) can be written as

I(y_1, y_2, …, y_m) = Σ_{i=1}^{m} H(y_i) − H(Y) = Σ_{i=1}^{m} H(y_i) − H(X) − log |det W|  (38)

where det W denotes the determinant of the matrix W. When Y is whitened, E[Y Y^T] = W E[X X^T] W^T = I ⇒ det(W E[X X^T] W^T) = (det W)(det E[X X^T])(det W^T) = det I = 1. As a consequence, det W is a constant, and the definition of mutual information becomes

I(y_1, y_2, …, y_m) = C − Σ_i J(y_i)  (39)

where C is a constant.

From Eq. (39), it is clear that maximizing negentropy is related to minimizing mutual information, and they differ only by a sign and a constant C. Moreover, non-Gaussianity measures enable the deflationary (one-by-one) estimation of the ICs, which is not possible with mutual information or likelihood approaches.^8 Further, with the non-Gaussianity approach, all signals are forced to be uncorrelated, while this constraint is not necessary using the mutual information approach.

4.3. Maximum likelihood (ML)

Maximum likelihood (ML) estimation is used for estimating the parameters of statistical models given a set of observations. In ICA, this method is used for estimating the unmixing matrix (W) which provides the best fit for the extracted signals Y.

The likelihood is formulated in the noise-free ICA model, X = AS, and this model can be estimated using the ML method [6]. Hence, p_X(X) = p_S(S)/|det A| = |det W| p_S(S). For independent source signals (i.e., p_S(S) = p_1(s_1) p_2(s_2) … p_p(s_p) = Π_i p_i(s_i)), p_X(X) is given by

p_X(X) = |det W| Π_i p_i(s_i) = |det W| Π_i p_i(w_i^T X)  (40)

Taking the logarithm of the likelihood over T observations gives

log L(W) = Σ_{t=1}^{T} Σ_{i=1}^{p} log p_i(w_i^T x(t)) + T log |det W|  (42)

The mean of any random variable x can be calculated as E[x] = (1/T) Σ_{t=1}^{T} x_t ⇒ Σ_{t=1}^{T} x_t = T E[x]. Hence, Eq. (42) can be simplified to

(1/T) log L(W) = E[Σ_{i=1}^{p} log p_i(w_i^T X)] + log |det W|  (43)

The first term E[Σ_{i=1}^{p} log p_i(w_i^T X)] = −Σ_{i=1}^{p} H(w_i^T X); therefore, the likelihood and mutual information are approximately equal, and they differ only by a sign and an additive constant. It is worth mentioning that maximum likelihood estimation will give wrong results if the prior information about the ICs is not correct; with the non-Gaussianity approach, in contrast, no prior information is needed [23].

5. ICA algorithms

In this section, different ICA algorithms are introduced.

^8 The maximum likelihood approach is introduced in the next section.
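Before turning to concrete algorithms, the log-likelihood objective of Eq. (43) can be evaluated numerically. The following is a minimal sketch assuming Laplace source priors, p_i(s) = 0.5 exp(−|s|) (an arbitrary modeling choice for illustration); the objective should be larger at the true unmixing matrix than at a random one:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical demo data: Laplace sources with density p(s) = 0.5 * exp(-|s|).
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.5], [0.2, 0.8]])   # arbitrary example mixing matrix
X = A @ S

def avg_loglik(W, X):
    """(1/T) log L(W) = E[sum_i log p_i(w_i^T X)] + log|det W|  (Eq. 43)."""
    Y = W @ X
    return (np.log(0.5) - np.abs(Y)).sum(axis=0).mean() \
        + np.log(abs(np.linalg.det(W)))

W_true = np.linalg.inv(A)                # the ideal unmixing matrix
W_rand = rng.normal(size=(2, 2))         # a random competitor
print(avg_loglik(W_true, X), avg_loglik(W_rand, X))
```

Because the assumed prior matches the true source density here, W = A^{-1} maximizes the expected log-likelihood, illustrating the sensitivity of ML estimation to the assumed source densities noted above.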
5.1. Projection pursuit

Projection pursuit (PP) is a statistical technique for finding possible projections of multi-dimensional data [13]. In the basic one-dimensional projection pursuit, the aim is to find the directions where the projections of the data onto these directions have distributions which deviate from the Gaussian distribution, and this is exactly the same goal as ICA [13]. Hence, ICA is considered a variant of projection pursuit.

In PP, one source signal is extracted from each projection, which is different from ICA algorithms that extract p signals simultaneously from n mixtures. Simply, in PP, after finding the first projection which maximizes the non-Gaussianity, the same process is repeated to find new projections for extracting the next source signal(s) from the reduced set of mixture signals, and this sequential process is called deflation [17].

Given n mixture signals which represent the axes of the n-dimensional space (X), the nth source signal can be extracted using the vector w_n which is orthogonal to the other n − 1 axes. The mixture signals in the n-dimensional space are projected onto the (n − 1)-dimensional space, which has n − 1 transformed axes. For example, assume n = 3; the third source signal can be extracted by finding w_3 which is orthogonal to the plane that is defined by the other two transformed axes s′_1 and s′_2; this plane is denoted by π′_{1,2}. Hence, the data points in the three-dimensional space are projected onto the plane π′_{1,2}, which is a two-dimensional space. This process is continued until all source signals are extracted [20,32].

Given three source signals, each with 10,000 time steps, as shown in Fig. 12. These signals represent sound signals that were collected from Matlab, where the first signal is called chirp, the second signal is called gong, and the third is called train. Fig. 12(d, e, and f) shows the histogram of each signal. As shown, the histograms are non-Gaussian. These three signals were mixed, and the mixing matrix was as follows:

A = ( 1.5  0.7  0.2
      0.6  0.2  0.9
      0.1  1.0  0.6 )  (44)

Fig. 13 shows the mixed signals and the histograms of these mixture signals. As shown in the figure, the mixture signals follow all the properties that were mentioned in Section 2.1.1, where (1) the source signals are more independent than the mixture signals, (2) the histograms of the mixture signals in Fig. 13 are much more Gaussian than the histograms of the source signals in Fig. 12, and (3) the mixture signals (see Fig. 13) are more complex than the source signals (see Fig. 12).

In the projection pursuit algorithm, the mixture signals are first whitened, and then the values of the first weight vector (w_1) are initialized randomly. The value of w_1 is listed in Table 1. This weight vector is then normalized, and it is used for extracting one source signal (y_1). The kurtosis of the extracted signal is then calculated, and the weight vector is updated to maximize the kurtosis iteratively. Table 1 shows the kurtosis of the extracted signal during some iterations of the projection pursuit algorithm. It is remarked that the kurtosis increases during the iterations, as shown in Fig. 14(a). Moreover, in this example, the correlations between the extracted signal (y_1) and all source signals (s_1, s_2, and s_3) were calculated. This may help to understand how the extracted signal is correlated with one source signal and not correlated with the other signals. From the table, it can be remarked that the correlations between y_1 and the source signals change iteratively, and the correlation between y_1 and s_1 was 1 at the end of the iterations.

Fig. 15 shows the histogram of the extracted signal during the iterations. As shown in Fig. 15(a), the extracted signal is Gaussian; hence, its kurtosis value, which represents the measure of non-Gaussianity in the projection pursuit algorithm, is small (0.18). The kurtosis value of the extracted signal increased to 0.21, 3.92, and 4.06 after the 10th, 100th, and 1000th iterations, respectively. This reflects that the non-Gaussianity of y_1 increased during the iterations of the projection pursuit algorithm. Additionally, Fig. 14(b) shows the angle between the optimal vector and the gradient vector (α). As shown, the value of the angle decreases dramatically and reaches zero, which means that the optimal and gradient vectors have the same direction.

5.2. FastICA

The FastICA algorithm extracts independent components by maximizing the non-Gaussianity, i.e., by maximizing the negentropy of the extracted signals, using a fixed-point iteration scheme [18]. FastICA has a cubic, or at least quadratic, convergence speed and hence is much faster than gradient-based algorithms, which have linear convergence. Additionally, FastICA has no learning rate or other adjustable parameters, which makes it easy to use.

FastICA can be used for extracting one IC; this is called the one-unit version, where FastICA finds the weight vector (w) that extracts one independent component. The values of w are updated by a learning rule that searches for a direction which maximizes the non-Gaussianity.

The derivative of the function G in Eq. (31) is denoted by g, and the derivatives of G_1 and G_2 in Eq. (32) are:

g_1(y) = tanh(a_1 y) and g_2(y) = y exp(−y^2/2)  (45)
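These derivatives can be verified with a central-difference check; the following quick numerical sketch assumes a_1 = 1:

```python
import numpy as np

a1 = 1.0

def G1(y):
    return np.log(np.cosh(a1 * y)) / a1

def g1(y):
    return np.tanh(a1 * y)

def G2(y):
    return -np.exp(-y**2 / 2)

def g2(y):
    return y * np.exp(-y**2 / 2)

# Compare the closed-form derivatives against central differences on a grid.
y = np.linspace(-3.0, 3.0, 601)
h = 1e-6
num_g1 = (G1(y + h) - G1(y - h)) / (2 * h)
num_g2 = (G2(y + h) - G2(y - h)) / (2 * h)
err1 = np.max(np.abs(num_g1 - g1(y)))
err2 = np.max(np.abs(num_g2 - g2(y)))
print(err1, err2)  # both tiny
```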
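The one-unit fixed-point update of Eq. (47), applied to whitened data and followed by normalization, can be sketched as follows. This is a hypothetical two-source Python example with g = g_1 = tanh; the data and mixing matrix are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical demo data: two Laplace sources, mixed and then whitened.
S = rng.laplace(size=(2, n))
A = np.array([[0.8, 0.4], [0.3, 0.9]])
X = A @ S
Xc = X - X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(np.cov(Xc))
Z = np.diag(d ** -0.5) @ V.T @ Xc

def g(u):
    return np.tanh(u)          # g_1 with a_1 = 1

def g_prime(u):
    return 1.0 - np.tanh(u)**2 # derivative of tanh

w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(100):
    w_old = w
    u = w @ Z
    # Fixed-point update (Eq. 47): w+ = E[Z g(w^T Z)] - E[g'(w^T Z)] w
    w = (Z * g(u)).mean(axis=1) - g_prime(u).mean() * w
    w /= np.linalg.norm(w)
    if abs(w @ w_old) > 1 - 1e-9:  # converged: old and new directions align
        break

y = w @ Z
corr = [abs(np.corrcoef(y, s)[0, 1]) for s in S]
print(max(corr))  # close to 1
```

In practice this iteration converges in a handful of steps, consistent with the fast convergence claimed for FastICA above.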
Table 1
Results of the projection pursuit algorithm in terms of the correlation between the extracted signal (y_1) and the source signals, the values of the weight vector (w_1), the kurtosis of y_1, and the angle between the optimal vector and the gradient vector (α) during the iterations of the projection pursuit algorithm.

Fig. 14. Results of the projection pursuit algorithm. (a) Kurtosis of the extracted signal (y_1) during some iterations of the projection pursuit algorithm, (b) the angle between the optimal vector and gradient vector (α) during some iterations of the projection pursuit algorithm.

where 1 ≤ a_1 ≤ 2 is a constant, and often a_1 = 1. In FastICA, convergence means that the dot product between the current and old weight vectors is almost equal to one, and hence the new and old weight vectors point in the same direction. The maxima of the approximation of the negentropy of w^T X are calculated at certain optima of E[G(w^T X)], where E[(w^T X)^2] = ||w||^2 = 1. The optimal solution is obtained where E[X g(w^T X)] − βw = 0, and this equation can be solved using Newton's method.^9 Let F(w) = E[X g(w^T X)] − βw; hence, the Jacobian matrix is given by JF(w) = ∂F/∂w = E[X X^T g′(w^T X)] − βI. Since the data are whitened, E[X X^T g′(w^T X)] ≈ E[X X^T] E[g′(w^T X)] = E[g′(w^T X)] I, and hence the Jacobian matrix becomes diagonal, which is easily inverted. The value of w can be updated according to Newton's method as follows:

w+ = w − F(w)/F′(w) = w − (E[X g(w^T X)] − βw) / (E[g′(w^T X)] − β)  (46)

Eq. (46) can be further simplified by multiplying both sides by β − E[g′(w^T X)] as follows:

w+ = E[X g(w^T X)] − E[g′(w^T X)] w  (47)

Several units of FastICA can be used for extracting several independent components; the output w_i^T X is decorrelated iteratively with the other outputs which were calculated in the previous iterations (w_1^T X, w_2^T X, …, w_{i−1}^T X). This decorrelation step prevents different vectors from converging to the same optima. The deflation orthogonalization method is similar to projection pursuit, where the independent components are estimated one by one. For each iteration, the projections of the previously estimated

^9 Assume f(x) = 0; using Newton's method, the solution is calculated as follows, x_{i+1} = x_i − f(x_i)/f′(x_i).
Fig. 15. Histogram of the extracted signal (y_1): (a) after the first iteration, (b) after the tenth iteration, (c) after the 100th iteration, and (d) after the 1000th iteration.
weight vectors, (w_p^T w_j) w_j, are subtracted from w_p, where j = 1, 2, …, p − 1, and then w_p is normalized as in Eq. (48). In this method, estimation errors in the first vectors are accumulated over the next ones by the orthogonalization. The symmetric orthogonalization method can be used when a symmetric correlation is required, i.e., when no vectors are privileged over others [18]. Hence, the vectors w_i can be estimated in parallel, which enables parallel computation. This method calculates all w_i vectors using the one-unit algorithm in parallel, and then the orthogonalization step is applied for all vectors using the symmetric method as follows, W = (W W^T)^{−1/2} W, where (W W^T)^{−1/2} is calculated from the eigenvalue decomposition: with (W W^T) V = V Λ, we have (W W^T)^{−1/2} = V Λ^{−1/2} V^T.

1: w_p = w_p − Σ_{j=1}^{p−1} (w_p^T w_j) w_j
2: w_p = w_p / √(w_p^T w_p)  (48)

6. Applications

Biomedical applications: ICA was used for removing artifacts which are mixed with different biomedical signals such as electroencephalogram (EEG), functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG) signals [5]. Also, ICA was used for removing the electrocardiogram (ECG) interference from EEG signals, or for differentiating between the brain signals and the other signals that are generated from different activities, as in [29].

Audio signal processing: ICA has been widely used in audio signals for removing noise [36]. Additionally, ICA was used as a feature extraction method to design robust automatic speech recognition models [8].

Biometrics: ICA is used for extracting discriminative features in different biometrics such as face recognition [10], ear recognition [35], and fingerprint recognition [27].

Image processing: ICA is used in image segmentation to extract different layers from the original image [12]. Moreover, ICA is widely used for removing noise from raw images which represent the original signals [24].

7. Challenges of ICA

ICA is used for estimating the unknown matrix W = A^{−1}. When the number of sources (p) and the number of mixture signals (n) are equal, the matrix A is invertible. When the number of mixtures is less than the number of source signals (n < p), this is called the over-complete problem; thus, A is not square and not invertible [26]. This representation is sometimes advantageous as it uses as few "basis" elements as possible; this is called sparse coding. On the other hand, n > p means that the number of mixtures is higher than the number of source signals, and this is called the under-complete problem. This problem can be solved by deleting some mixtures using dimensionality reduction techniques such as PCA to decrease the number of mixtures [1].

8. Conclusions

ICA is a widely-used statistical technique for estimating independent components (ICs) through maximizing the non-Gaussianity of the ICs, maximizing the likelihood of the ICs, or minimizing the mutual information between the ICs. These approaches are approximately equivalent; however, each approach has its own limitations.

This paper followed the approach of not only explaining the steps for estimating ICs, but also presenting illustrative visualizations of the ICA steps to make them easy to understand. Moreover, a number of numerical examples were introduced and graphically illustrated to explain (1) how signals are mixed to form mixture signals, (2) how to estimate source signals, and (3) the preprocessing steps of ICA. Different ICA algorithms were introduced with detailed explanations. Moreover, common ICA challenges and applications were briefly highlighted.
References

[1] S.-I. Amari, Natural gradient learning for over- and under-complete bases in ICA, Neural Comput. 11 (8) (1999) 1875–1883.
[2] A. Asaei, H. Bourlard, M.J. Taghizadeh, V. Cevher, Computational methods for underdetermined convolutive speech localization and separation via model-based sparse component analysis, Speech Commun. 76 (2016) 201–217.
[3] R. Aziz, C. Verma, N. Srivastava, A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data 8 (2016) 4–15.
[4] E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for independent component analysis of complex valued signals, Int. J. Neural Syst. 10 (1) (2000) 1–8.
[5] V.D. Calhoun, J. Liu, T. Adalı, A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data, Neuroimage 45 (1) (2009) S163–S172.
[6] J.-F. Cardoso, Infomax and maximum likelihood for blind source separation, IEEE Signal Process. Lett. 4 (4) (1997) 112–114.
[7] R. Chai, G.R. Naik, T.N. Nguyen, S.H. Ling, Y. Tran, A. Craig, H.T. Nguyen, Driver fatigue classification with independent component by entropy rate bound minimization analysis in an EEG-based system, IEEE J. Biomed. Health Inf. 21 (3) (2017) 715–724.
[8] J.-W. Cho, H.-M. Park, Independent vector analysis followed by HMM-based feature enhancement for robust speech recognition, Signal Process. 120 (2016) 200–208.
[9] P. Comon, Independent component analysis, a new concept?, Signal Process. 36 (3) (1994) 287–314.
[10] I. Dagher, R. Nachar, Face recognition using IPCA-ICA algorithm, IEEE Trans. Pattern Anal. Mach. Intell. 28 (6) (2006) 996–1000.
[11] N. Delfosse, P. Loubaton, Adaptive blind separation of independent sources: a deflation approach, Signal Process. 45 (1) (1995) 59–83.
[12] S. Derrode, G. Mercier, W. Pieczynski, Unsupervised multicomponent image segmentation combining a vectorial HMC model and ICA, in: Proceedings of the International Conference on Image Processing (ICIP), Vol. 2, IEEE, 2003, pp. II-407.
[13] J.H. Friedman, J.W. Tukey, A projection pursuit algorithm for exploratory data analysis, IEEE Trans. Comput. C-23 (9) (1974) 881–890.
[14] S.S. Haykin, Neural Networks and Learning Machines, Vol. 3, Pearson, Upper Saddle River, NJ, USA, 2009.
[15] J. Hérault, C. Jutten, B. Ans, Détection de grandeurs primitives dans un message composite par une architecture de calcul neuromimétique en apprentissage non supervisé, in: 10e Colloque sur le traitement du signal et des images, GRETSI, Groupe d'Etudes du Traitement du Signal et des Images, 1985.
[16] A. Hyvärinen, Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood, Neurocomputing 22 (1) (1998) 49–67.
[17] A. Hyvärinen, New approximations of differential entropy for independent component analysis and projection pursuit, in: Advances in Neural Information Processing Systems, 1998, pp. 273–279.
[18] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Trans. Neural Networks 10 (3) (1999) 626–634.
[19] A. Hyvärinen, Gaussian moments for noisy independent component analysis, IEEE Signal Process. Lett. 6 (6) (1999) 145–147.
[20] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, Vol. 46, John Wiley & Sons, 2004.
[21] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (4) (2000) 411–430.
[22] D. Langlois, S. Chartier, D. Gosselin, An introduction to independent component analysis: InfoMax and FastICA algorithms, Tutorials Quant. Methods Psychol. 6 (1) (2010) 31–38.
[23] T.-W. Lee, Independent component analysis, in: Independent Component Analysis, Springer, 1998, pp. 27–66.
[24] T.-W. Lee, M.S. Lewicki, Unsupervised image classification, segmentation, and enhancement using ICA mixture models, IEEE Trans. Image Process. 11 (3) (2002) 270–279.
[25] T.-W. Lee, M.S. Lewicki, T.J. Sejnowski, ICA mixture models for unsupervised classification of non-Gaussian classes and automatic context switching in blind signal separation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (2000) 1078–1089.
[26] M.S. Lewicki, T.J. Sejnowski, Learning overcomplete representations, Neural Comput. 12 (2) (2000) 337–365.
[27] F. Long, B. Kong, Independent component analysis and its application in the fingerprint image preprocessing, in: Proceedings of the International Conference on Information Acquisition, IEEE, 2004, pp. 365–368.
[28] B.A. Pearlmutter, L.C. Parra, Maximum likelihood blind source separation: a context-sensitive generalization of ICA, in: Advances in Neural Information Processing Systems, 1997, pp. 613–619.
[29] M.B. Pontifex, K.L. Gwizdala, A.C. Parks, M. Billinger, C. Brunner, Variability of ICA decomposition may impact EEG signals when used to remove eyeblink artifacts, Psychophysiology 54 (3) (2017) 386–398.
[30] S. Shimizu, P.O. Hoyer, A. Hyvärinen, A. Kerminen, A linear non-Gaussian acyclic model for causal discovery, J. Mach. Learn. Res. 7 (2006) 2003–2030.
[31] J. Shlens, A tutorial on independent component analysis, arXiv preprint arXiv:1404.2986, 2014.
[32] J.V. Stone, Independent Component Analysis: A Tutorial Introduction, A Bradford Book, MIT Press, 2004.
[33] A. Tharwat, Principal component analysis - a tutorial, Int. J. Appl. Pattern Recognit. 3 (3) (2016) 197–240.
[34] J. Xie, P.K. Douglas, Y.N. Wu, A.L. Brody, A.E. Anderson, Decoding the encoding of functional brain networks: an fMRI classification comparison of non-negative matrix factorization (NMF), independent component analysis (ICA), and sparse coding algorithms, J. Neurosci. Methods 282 (2017) 81–94.
[35] H.-J. Zhang, Z.-C. Mu, W. Qu, L.-M. Liu, C.-Y. Zhang, A novel approach for ear recognition based on ICA and RBF network, in: Proceedings of the International Conference on Machine Learning and Cybernetics, Vol. 7, IEEE, 2005, pp. 4511–4515.
[36] M. Zibulevsky, B.A. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Comput. 13 (4) (2001) 863–882.