
Multi-task Learning For Detecting and Segmenting Manipulated Facial Images and Videos

Huy H. Nguyen*, Fuming Fang†, Junichi Yamagishi*†‡, and Isao Echizen*†

* SOKENDAI (The Graduate University for Advanced Studies), Kanagawa, Japan
† National Institute of Informatics, Tokyo, Japan
‡ The University of Edinburgh, Edinburgh, UK

arXiv:1906.06876v1 [cs.CV] 17 Jun 2019

Email: {nhhuy, fang, jyamagis, iechizen}@nii.ac.jp

Abstract

Detecting manipulated images and videos is an important topic in digital media forensics. Most detection methods use binary classification to determine the probability of a query being manipulated. Another important topic is locating manipulated regions (i.e., performing segmentation), which are mostly created by three commonly used attacks: removal, copy-move, and splicing. We have designed a convolutional neural network that uses the multi-task learning approach to simultaneously detect manipulated images and videos and locate the manipulated regions for each query. Information gained by performing one task is shared with the other task, thereby enhancing the performance of both tasks. A semi-supervised learning approach is used to improve the network's generalizability. The network includes an encoder and a Y-shaped decoder. Activation of the encoded features is used for the binary classification. The output of one branch of the decoder is used for segmenting the manipulated regions while that of the other branch is used for reconstructing the input, which helps improve overall performance. Experiments using the FaceForensics and FaceForensics++ databases demonstrated the network's effectiveness against facial reenactment attacks and face swapping attacks as well as its ability to deal with the mismatch condition for previously seen attacks. Moreover, fine-tuning using just a small amount of data enables the network to deal with unseen attacks.

Figure 1. Original video frame (top left), video frame modified using the Face2Face method [30] (top right, smooth mask almost completely covers the skin area), using the Deepfakes method [1] (bottom left, rectangular mask), and using the FaceSwap method [27] (bottom right, polygon-like mask).

1. Introduction

A major concern in digital image forensics is the deepfake phenomenon [1], a worrisome example of the societal threat posed by computer-generated spoofing videos. Anyone who shares video clips or pictures of him or herself on the Internet may become a victim of a spoof-video attack. Several available methods can be used to translate head and facial movements in real time [30, 14] or create videos from photographs [4, 9]. Moreover, thanks to advances in speech synthesis and voice conversion [19], an attacker can also clone a person's voice (only a few minutes of speech are needed) and synchronize it with the visual component to create an audiovisual spoof [29, 9]. These methods may become widely available in the near future, enabling anyone to produce deepfake material.

Several countermeasures have been proposed for the visual domain. Most of them were evaluated using only one or a few databases, including the CGvsPhoto database [25], the Deepfakes databases [2, 16, 17], and the FaceForensics/FaceForensics++ databases [26, 27]. Cozzolino et al. addressed the transferability problem of several state-of-the-art spoofing detectors [11] and developed an autoencoder-like architecture that supports generalization and can be easily adapted to a new domain with simple fine-tuning.

Another major concern in digital image forensics is locating manipulated regions. The shapes of the segmentation masks for manipulated facial images and videos could reveal hints about the type of manipulation used, as illustrated in Figure 1. Most existing forensic segmentation methods focus on three commonly used means of tampering: removal, copy-move, and splicing [6, 32, 7]. As in other image segmentation tasks, these methods need to process full-scale images. Rahmouni et al. [25] used a sliding window to deal with high-resolution images, as subsequently used by Nguyen et al. [21] and Rössler et al. [26]. This sliding window approach effectively segments manipulated regions in spoofed images [26] created using the Face2Face method [30]. However, these methods need to score many overlapped windows by using a spoofing detection method, which takes a lot of computation power.

We have developed a multi-task learning approach for simultaneously performing classification and segmentation of manipulated facial images. Our autoencoder comprises an encoder and a Y-shaped decoder and is trained in a semi-supervised manner. The activation of the encoded features is used for classification. The output of one branch of the decoder is used for segmentation, and the output of the other branch is used to reconstruct the input data. The information gained from these tasks (classification, segmentation, and reconstruction) is shared among them, thereby improving the overall performance of the network.

2. Related Work

2.1. Generating Manipulated Videos

Creating a photo-realistic digital actor is a dream of many people working in computer graphics. One initial success is the Digital Emily Project [3], in which sophisticated devices were used to capture the appearance of an actress and her motions to synthesize a digital version of her. At that time, this ability was unavailable to attackers, so it was impossible to create a digital version of a victim. This changed in 2016 when Thies et al. demonstrated facial reenactment in real time [30]. Subsequent work led to the ability to translate head poses [14] with simple requirements that are met by any normal person. The Xpression mobile app¹, providing the same function, was subsequently released. Instead of using RGB videos as was done in previous work [30, 14], Averbuch-Elor et al. and Chung et al. used ID-type photos [4, 9], which are easily obtained on social networks. Combining this capability with speech synthesis or voice conversion techniques [19], attackers are now able to make spoof videos with voices [29, 9], which are more convincingly authentic.

2.2. Detecting Manipulated Images and Videos

Several countermeasures have been introduced for detecting manipulated videos. A typical approach is to treat a video as a sequence of image frames and work on the images as input. The noise-based method proposed by Fridrich and Kodovsky [12] is considered one of the best handcrafted detectors. Its improved version using a convolutional neural network (CNN) [10] demonstrated the effectiveness of using automatic feature extraction for detection. Among deep learning approaches to detection, fine-tuning and transfer learning take advantage of high-performing pre-trained models [24, 26]. Using part of a pre-trained CNN as the feature extractor is an effective way to improve the performance of a CNN [21, 22]. Other approaches to detection include using a constrained convolutional layer [8], using a statistical pooling layer [25], using a two-stream network [31], using a lightweight CNN [2], and using two cascaded convolutional layers at the bottom of a CNN [23]. Cozzolino et al. created a benchmark for determining the transferability of state-of-the-art detectors for use in detecting unseen attacks [11]. They also proposed an autoencoder-like architecture with which adaptation ability was greatly increased. Li et al. proposed using a temporal approach and developed a network for detecting eye blinking, which is not well reproduced in fake videos [17]. Our proposed method, besides performing classification, provides segmentation maps of manipulated areas. This additional information could be used as a reference for judging the authenticity of images and videos, especially when the classification task fails to detect spoofed inputs.

2.3. Locating Manipulated Regions in Images

There are two commonly used approaches to locating manipulated regions in images: segmenting the entire input image and repeatedly performing binary classification using a sliding window. The segmentation approach is commonly used to detect removal, copy-move, and splicing attacks [6, 7]. Semantic segmentation methods [18, 5] can also be used for forgery segmentation [7]. A slightly different segmentation approach is to return the boxes that represent the boundaries of the manipulated regions instead of returning segmentation masks [32]. The sliding window approach is used more for detecting spoofing regions generated by a computer to create spoof images or videos from bona fide ones [25, 21, 26]. In this approach, binary classifiers for classifying images as spoof or bona fide are called at each position of the sliding window. The stride of the sliding window may equal the length of the window (non-overlapped) [25] or be less than the length (overlapped) [21, 26]. Our proposed method takes the first approach but with one major difference: only the facial areas are considered instead of the entire image. This overcomes the computation expense problem when dealing with large inputs.

1 https://xpression.jp/
Figure 2. Overview of proposed network (pre-processing, classifying and segmenting, and aggregating probabilities).

3. Proposed Method

3.1. Overview

Unlike other single-target methods [22, 11, 7], our proposed method outputs both the probability of an input being spoofed and segmentation maps of the manipulated regions in each frame of the input, as diagrammed in Figure 2. Video inputs are treated as a set of frames. We focused on facial images in this work, so the face areas are extracted in the pre-processing phase. In theory, the proposed method can deal with various sizes of input images. However, to maintain simplicity in training, we resize cropped images to 256 × 256 pixels before feeding them into the autoencoder. The autoencoder outputs the reconstructed version of the input image (which is used only in training), the probability of the input image having been spoofed, and the segmentation map corresponding to this input image. For video inputs, we average the probabilities of all frames before drawing a conclusion on the probability of the input being real or fake.
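A minimal sketch of this video-level decision rule is shown below. It assumes a trained model exposed as a callable `spoof_prob` that maps one pre-processed 3 × 256 × 256 face tensor to a spoof probability in [0, 1]; both the callable and the face tensors are placeholders for illustration, not the released code.

```python
import torch

def video_spoof_probability(face_tensors, spoof_prob):
    """face_tensors: list of 3 x 256 x 256 tensors, one cropped face per frame."""
    probs = []
    with torch.no_grad():
        for x in face_tensors:
            probs.append(float(spoof_prob(x.unsqueeze(0))))  # per-frame probability
    # Average the per-frame probabilities before deciding real vs. fake.
    return sum(probs) / len(probs)

# Example decision: flag the video as fake when the average exceeds 0.5.
# is_fake = video_spoof_probability(faces, spoof_prob) > 0.5
```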
3.2. Y-shaped Autoencoder

The partitioning of the latent features (motivated by Cozzolino et al.'s work [11]) and the Y-shaped design of the decoder enable the autoencoder to share valuable information between the classification, segmentation, and reconstruction tasks and thereby improve overall performance by reducing loss. There are three types of loss: activation loss L_act, segmentation loss L_seg, and reconstruction loss L_rec. Given label y_i ∈ {0, 1}, the activation loss measures the accuracy of partitioning in the latent space on the basis of the activation of the two halves of the encoded features:

L_{act} = \frac{1}{N} \sum_i \left( |a_{i,1} - y_i| + |a_{i,0} - (1 - y_i)| \right),    (1)

where N is the number of samples and a_{i,0} and a_{i,1} are the activation values, defined as the L1 norms of the corresponding halves of the latent features, h_{i,0} and h_{i,1} (given K is the number of features of {h_{i,0} | h_{i,1}}):

a_{i,c} = \frac{1}{2K} \| h_{i,c} \|_1, \quad c \in \{0, 1\}.    (2)

This ensures that, given an input x_i of class c, the corresponding half of the latent features h_{i,c} is activated (a_{i,c} > 0). The other half, h_{i,1-c}, remains quiesced (a_{i,1-c} = 0). To force the two decoders, D_seg and D_rec, to learn the right decoding schemes, we set the off-class part to zero before feeding it to the decoders (a_{i,1-c} := 0).

Figure 3. Proposed autoencoder with Y-shaped decoder for detecting and segmenting manipulated facial images.
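To make the latent partitioning concrete, the sketch below computes the activations of Eq. (2), the activation loss of Eq. (1), and the zeroing of the off-class half before decoding. Splitting the latent tensor into two halves along the channel dimension is an assumption made for illustration, not something taken from the authors' code.

```python
import torch

def activation_loss(latent, labels):
    """latent: N x (2C) x H x W encoder output; labels: N tensor of 0/1 (1 = fake)."""
    h0, h1 = torch.chunk(latent, 2, dim=1)            # h_{i,0}, h_{i,1}
    K = h0[0].numel()                                  # number of features per half
    a0 = h0.abs().flatten(1).sum(dim=1) / (2 * K)      # a_{i,0} = ||h_{i,0}||_1 / 2K
    a1 = h1.abs().flatten(1).sum(dim=1) / (2 * K)      # a_{i,1} = ||h_{i,1}||_1 / 2K
    y = labels.float()
    # Eq. (1): the half matching the label should be active, the other quiesced.
    return ((a1 - y).abs() + (a0 - (1 - y)).abs()).mean()

def select_true_half(latent, labels):
    """Zero the off-class half before feeding the decoders (a_{i,1-c} := 0)."""
    h0, h1 = torch.chunk(latent, 2, dim=1)
    y = labels.float().view(-1, 1, 1, 1)
    return torch.cat([h0 * (1 - y), h1 * y], dim=1)
```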
We utilize cross-entropy loss as the segmentation loss to measure the agreement between the segmentation mask (s_i = D_seg({h_{i,0} | h_{i,1}})) and the ground-truth mask (m_i) corresponding to input x_i:

L_{seg} = \frac{1}{N} \sum_i \| m_i \log(s_i) + (1 - m_i) \log(1 - s_i) \|_1.    (3)

The reconstruction loss uses the L2 distance to measure the difference between the reconstructed image (\hat{x}_i = D_rec({h_{i,0} | h_{i,1}})) and the original one (x_i). For N samples, the reconstruction loss is

L_{rec} = \frac{1}{N} \sum_i \| x_i - \hat{x}_i \|_2.    (4)

The total loss is the weighted sum of the three losses:

L = \gamma_{act} L_{act} + \gamma_{seg} L_{seg} + \gamma_{rec} L_{rec}.    (5)

Unlike Cozzolino et al. [11], we set the three weights equal to each other (equal to 1). This is because the classification task and the segmentation task are equally important, and the reconstruction task plays an important role in the segmentation task. We experimentally compared the effects of the different settings (described below).
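Written out in code, the three terms might be combined as follows. This is a minimal sketch under the equal-weight setting; the small epsilon added inside the logarithms is purely for numerical stability and is not part of Eq. (3).

```python
import torch

def segmentation_loss(s, m, eps=1e-7):
    """Eq. (3): s, m are predicted and ground-truth masks in [0, 1], same shape."""
    ce = m * torch.log(s + eps) + (1 - m) * torch.log(1 - s + eps)
    return ce.abs().flatten(1).sum(dim=1).mean()       # per-sample L1 norm, then mean

def reconstruction_loss(x, x_hat):
    """Eq. (4): per-sample L2 distance between input x and reconstruction x_hat."""
    return (x - x_hat).flatten(1).norm(p=2, dim=1).mean()

def total_loss(l_act, l_seg, l_rec, w_act=1.0, w_seg=1.0, w_rec=1.0):
    """Eq. (5): weighted sum of activation, segmentation, and reconstruction losses."""
    return w_act * l_act + w_seg * l_seg + w_rec * l_rec
```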
3.3. Implementation

The Y-shaped autoencoder was implemented as shown in Figure 3. It is a fully convolutional CNN using 3 × 3 convolutional windows (for the encoder) and 3 × 3 deconvolutional windows (for the decoder) with a stride of 1 interspersed with a stride of 2. Following each convolutional layer is a batch normalization layer [13] and a rectified linear unit (ReLU) [20]. The selection block allows only the true half of the latent features (h_{i,y_i}) to pass and zeros out the other half (h_{i,1-y_i}). Therefore, the decoders (D_seg, D_rec) are forced to decode only the true half of the latent features. The dimension of the embedding is 128, which has been shown to be optimal [11]. For the segmentation branch (D_seg), a softmax activation function at the end is used to output segmentation maps. For the reconstruction branch (D_rec), a hyperbolic tangent function (tanh) is used to shape the output into the range [−1, 1]. For simplicity, we directly feed normalized images into the autoencoder without converting them into residual images [11]. Further work will focus on investigating the benefits of using residual images in the classification and segmentation tasks.
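For illustration, the repeating Conv-BatchNorm-ReLU pattern and the two transposed-convolution branches described above could be assembled as in the sketch below. The layer widths, the number of scales, and the two-channel softmax head are assumptions chosen for brevity, not the authors' exact configuration.

```python
import torch.nn as nn

def conv_block(c_in, c_out, stride, transpose=False):
    Conv = nn.ConvTranspose2d if transpose else nn.Conv2d
    extra = {"output_padding": 1} if transpose and stride == 2 else {}
    return nn.Sequential(
        Conv(c_in, c_out, kernel_size=3, stride=stride, padding=1, **extra),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def encoder(widths=(8, 16, 32, 64)):
    layers, c_in = [], 3
    for c in widths:                       # stride-1 then stride-2 pair per scale
        layers += [conv_block(c_in, c, 1), conv_block(c, c, 2)]
        c_in = c
    return nn.Sequential(*layers)

def decoder_branch(widths=(64, 32, 16, 8), out_channels=3, head=None):
    layers, c_in = [], widths[0]
    for c in widths:                       # mirror the encoder with transposed convs
        layers += [conv_block(c_in, c, 1, transpose=True),
                   conv_block(c, c, 2, transpose=True)]
        c_in = c
    layers += [nn.ConvTranspose2d(c_in, out_channels, 3, stride=1, padding=1),
               head if head is not None else nn.Tanh()]
    return nn.Sequential(*layers)

# Segmentation branch: decoder_branch(out_channels=2, head=nn.Softmax(dim=1))
# Reconstruction branch: decoder_branch(out_channels=3, head=nn.Tanh())
```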
Following Cozzolino et al.'s work [11], we trained the network using the Adam optimizer [15] with a learning rate of 0.001, a batch size of 64, betas of 0.9 and 0.999, and epsilon equal to 10^-8.
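These hyper-parameters translate directly into a PyTorch optimizer and data loader; `model` and `train_set` below are placeholders for the autoencoder and the prepared face crops.

```python
import torch
from torch.utils.data import DataLoader

def make_training_objects(model, train_set):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                 betas=(0.9, 0.999), eps=1e-8)
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    return optimizer, loader
```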

4. Experiments

4.1. Databases

We evaluated our proposed network using two databases: FaceForensics [26] and FaceForensics++ [27]. The FaceForensics database contains 1004 real videos collected from YouTube and their corresponding manipulated versions, which are divided into two sub-datasets:

• Source-to-Target Reenactment dataset containing 1004 fake videos created using the Face2Face method [30]; in each input pair for reenactment, the source video (the attacker) and the target video (the victim) are different.

• Self-Reenactment dataset containing another 1004 fake videos created again using the Face2Face method; in each input pair for reenactment, the source and target videos are the same. Although this dataset is not meaningful from the attacker's perspective, it does present a more challenging benchmark than does the Source-to-Target Reenactment dataset.

Each dataset was split into 704 videos for training, 150 for validation, and 150 for testing. The database also provided segmentation masks corresponding to the manipulated videos. Three levels of compression based on the H.264 codec² were used: no compression, light compression (quantization = 23), and strong compression (quantization = 40).

The FaceForensics++ database is an enhanced version of the FaceForensics database and includes the Face2Face dataset plus the FaceSwap³ dataset (graphics-based manipulation) and the DeepFakes⁴ dataset (deep-learning-based manipulation) [27]. It contains 1,000 real videos and 3,000 manipulated videos (1,000 in each dataset). Each dataset was split into 720 videos for training, 140 for validation, and 140 for testing. The same three levels of compression based on the H.264 codec were used with the same quantization values.

2 http://www.h264encoder.com/
3 https://github.com/MarekKowalski/FaceSwap/
4 https://github.com/deepfakes/faceswap/

For simplicity, we used only videos with light compression (quantization = 23). Images were extracted from videos using Cozzolino et al.'s settings [11]: 200 frames of each training video were used for training, and 10 frames of each validation and testing video were used for validation and testing, respectively. There is no detailed description of the rules for frame selection, so we selected the first (200 or 10) frames of each video and cropped the facial areas. For all databases, we applied normalization with mean = (0.485, 0.456, 0.406) and standard deviation = (0.229, 0.224, 0.225); these values have been widely used in the ImageNet Large Scale Visual Recognition Challenge [28]. We did not apply any data augmentation to the training datasets.
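The frame preparation described above (resizing cropped faces to 256 × 256 and normalizing with the ImageNet statistics) can be expressed as a torchvision transform pipeline. The sketch below assumes each face crop is provided as an H × W × 3 uint8 array and leaves the face detection itself to the pre-processing stage.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),                   # accepts an H x W x 3 uint8 face crop
    transforms.Resize((256, 256)),
    transforms.ToTensor(),                     # scales pixel values to [0, 1]
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
```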


Table 1. Design of training and testing datasets.

Name     | Source dataset                 | Manipulation method | Number of videos | Description
Training | FaceForensics Source-to-Target | Face2Face           | 704 ×2           | Training set used for all tests
Test 1   | FaceForensics Source-to-Target | Face2Face           | 150 ×2           | Match condition for seen attack
Test 2   | FaceForensics Self-Reenactment | Face2Face           | 150 ×2           | Mismatch condition for seen attack
Test 3   | FaceForensics++ Deepfakes      | Deepfakes           | 140 ×2           | Unseen attack (deep-learning-based)
Test 4   | FaceForensics++ FaceSwap       | FaceSwap            | 140 ×2           | Unseen attack (computer-graphics-based)

Table 2. Settings for autoencoder.

No. | Method       | Depth     | Seg. weight | Rec. weight | Rec. loss | Comments
1   | FT Res       | Shallower | 0.1         | 0.1         | L1        | Re-implementation of ForensicsTransfer [11] using residual images as input
2   | FT           | Shallower | 0.1         | 0.1         | L1        | Re-implementation of ForensicsTransfer [11] using normal images as input
3   | Deeper FT    | Deeper    | 0.1         | 0.1         | L1        | Proposed deeper version of FT (Proposed Old method without segmentation branch)
4   | Proposed Old | Deeper    | 0.1         | 0.1         | L1        | Proposed method using ForensicsTransfer settings
5   | No Recon     | Deeper    | 1           | 1           | L2        | Proposed method without reconstruction branch
6   | Proposed New | Deeper    | 1           | 1           | L2        | Complete proposed method with new settings

The training and testing datasets were designed as shown in Table 1. For the Training, Test 1, and Test 2 datasets, the Face2Face method [26] was used to create manipulated videos. Images in Test 2 were harder to detect than those in Test 1 since the source and target videos used for reenactment were the same, meaning that the reenacted video frames had better quality. Therefore, we call Test 1 and Test 2 the match and mismatch conditions for a seen attack. Test 3 used the Deepfakes attack method while Test 4 used the FaceSwap attack method, both presented in the FaceForensics++ database [27]. Neither of these attack methods was used to create the training set, so they were considered unseen attacks. For the classification task, we calculated the accuracy and equal error rate (EER) of each method. For the segmentation task, we used pixel-wise accuracy between ground-truth masks and segmentation masks. The FT Res, FT, and Deeper FT methods could not perform the segmentation task. All results were at the image level.
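For reference, a simple way to compute these metrics is sketched below. The EER approximation (picking the threshold where the false positive and false negative rates are closest) and the 0.5 decision thresholds are simplifications, not the authors' exact evaluation code.

```python
import numpy as np

def classification_accuracy(probs, labels, threshold=0.5):
    """probs: spoof probabilities; labels: 0 for real, 1 for fake."""
    return np.mean((np.asarray(probs) >= threshold) == np.asarray(labels))

def equal_error_rate(probs, labels):
    """Approximate EER: threshold where false positive and false negative rates meet."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    thresholds = np.unique(probs)
    fpr = np.array([np.mean(probs[labels == 0] >= t) for t in thresholds])
    fnr = np.array([np.mean(probs[labels == 1] < t) for t in thresholds])
    i = int(np.argmin(np.abs(fpr - fnr)))
    return (fpr[i] + fnr[i]) / 2

def pixelwise_accuracy(pred_masks, gt_masks, threshold=0.5):
    """Fraction of pixels where the binarized prediction matches the ground truth."""
    pred = np.asarray(pred_masks) >= threshold
    return np.mean(pred == (np.asarray(gt_masks) >= threshold))
```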
4.2. Training Y-shaped Autoencoder

To evaluate the contributions of each component in the Y-shaped autoencoder, we designed the settings shown in Table 2. The FT Res and FT methods are re-implementations of Cozzolino et al.'s method with and without using residual images [11]. They can also be understood as the Y-shaped autoencoder without the segmentation branch. The Deeper FT method is a deeper version of FT, which has the same depth as the proposed method. The Proposed Old method is the proposed method using weighting settings from Cozzolino et al.'s work [11], the No Recon method is the version of the proposed method without the reconstruction branch, and the Proposed New method is the complete proposed method with the Y-shaped autoencoder using equal losses for the three tasks and mean squared error for the reconstruction loss.

Since shallower networks take longer to converge than deeper ones, we trained the shallower ones for 100 epochs and the deeper ones for 50 epochs. For each method, the training stage with the highest accuracy for the classification task and a reasonable segmentation loss (if available) was used to perform all the tests described in this section.
4.3. Dealing with Seen Attacks

The results for the match and mismatch conditions for seen attacks are shown in Tables 3 (Test 1) and 4 (Test 2), respectively. The deeper networks (the last four) had substantially better classification performance than the shallower ones (the first two) proposed by Cozzolino et al. [11]. Among the four deeper networks, there were no substantial differences in their performances on the classification task. For the segmentation task, the No Recon and Proposed New methods, which used the new weighting settings, had higher accuracy than the Proposed Old method, which used the old weighting settings.

Table 3. Results for Test 1 (image level).

Method       | Classification Acc (%) | Classification EER (%) | Segmentation Acc (%)
FT Res       | 82.30 | 14.53 | –
FT           | 88.43 | 11.60 | –
Deeper FT    | 93.63 | 7.20  | –
Proposed Old | 92.60 | 7.40  | 81.40
No Recon     | 93.40 | 7.07  | 89.21
Proposed New | 92.77 | 8.18  | 90.27

Table 4. Results for Test 2 (image level).

Method       | Classification Acc (%) | Classification EER (%) | Segmentation Acc (%)
FT Res       | 82.33 | 15.07 | –
FT           | 87.33 | 12.03 | –
Deeper FT    | 92.70 | 7.80  | –
Proposed Old | 91.83 | 8.53  | 81.40
No Recon     | 92.83 | 8.29  | 89.10
Proposed New | 92.50 | 8.07  | 90.20

The performances of all methods were slightly degraded when dealing with the mismatch condition for seen attacks. The FT Res and Proposed New methods had the best adaptation ability, as indicated by the lower degradation in their scores. This indicates the importance of using residual images (for the FT Res method) and of using the reconstruction branch (for the Y-shaped autoencoder with the new weighting settings, i.e., the Proposed New method). The reconstruction branch also helped the Proposed New method achieve the highest score on the segmentation task.

4.4. Dealing with Unseen Attacks

4.4.1 Evaluation using pre-trained model

When encountering unseen attacks, all six methods had substantially lower accuracies and higher EERs, as shown in Tables 5 (Test 3) and 6 (Test 4). In Test 3, the shallower methods had better adaptation ability, especially the FT Res method, which uses residual images. The deeper methods, which had a greater chance of being over-fitted, had nearly random classification results. In Test 4, although all methods suffered from nearly random classification accuracies, their better EERs indicated that the decision thresholds had been moved.

A particularly interesting finding was in the segmentation results. Although degraded, the segmentation accuracies were still high, especially in Test 4, in which FaceSwap copied the facial area from the source faces to the target ones using a computer-graphics method. When dealing with unseen attacks, this segmentation information could thus be an important clue in addition to the classification results for judging the authenticity of the queried images and videos.

Table 5. Results for Test 3 (image level).

Method       | Classification Acc (%) | Classification EER (%) | Segmentation Acc (%)
FT Res       | 64.75 | 30.71 | –
FT           | 62.61 | 37.43 | –
Deeper FT    | 51.21 | 42.71 | –
Proposed Old | 53.75 | 42.00 | 70.18
No Recon     | 51.96 | 42.45 | 70.43
Proposed New | 52.32 | 42.24 | 70.37

Table 6. Results for Test 4 without fine-tuning (image level).

Method       | Classification Acc (%) | Classification EER (%) | Segmentation Acc (%)
FT Res       | 53.50 | 43.10 | –
FT           | 52.29 | 41.79 | –
Deeper FT    | 53.39 | 37.00 | –
Proposed Old | 56.82 | 36.29 | 84.23
No Recon     | 54.86 | 35.86 | 84.86
Proposed New | 54.07 | 34.04 | 84.67

4.4.2 Fine-tuning using a small amount of data

We used the validation set (a small set normally used for selecting hyper-parameters in training and that differs from the test set) of the FaceForensics++ FaceSwap dataset [27] for fine-tuning all the methods. To ensure that the amount of data was small, we used only ten frames from each video. We divided the dataset into two parts: 100 videos of each class for training and 40 of each class for evaluation. We trained the methods for 50 epochs and selected the best models on the basis of their performance on the evaluation set, as sketched below.
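A minimal sketch of that split and loop follows, assuming the videos are grouped by class and that `load_frames` and `finetune_one_epoch` are placeholder helpers supplied by the surrounding training code.

```python
import random

def split_finetune_set(videos_by_class, load_frames,
                       n_train=100, n_eval=40, frames_per_video=10):
    """videos_by_class: e.g. {0: list_of_real_videos, 1: list_of_faceswap_videos}."""
    train, evaluation = [], []
    for label, videos in videos_by_class.items():
        chosen = random.sample(videos, n_train + n_eval)
        for i, video in enumerate(chosen):
            frames = load_frames(video, frames_per_video)   # ten frames per video
            target = train if i < n_train else evaluation
            target.extend((frame, label) for frame in frames)
        # 100 videos per class go to fine-tuning, 40 per class to evaluation.
    return train, evaluation

# Fine-tune for 50 epochs and keep the checkpoint that scores best on `evaluation`:
# for epoch in range(50):
#     finetune_one_epoch(model, train)
```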
The results after fine-tuning for Test 4 are shown in Table 7. The classification and segmentation accuracies increased by around 25% and 8%, respectively, which is remarkable given the small amount of data used. The one exception was the Proposed Old method: its segmentation accuracy did not improve.
The FT Res method had better adaptation than the FT one, which supports Cozzolino et al.'s claim [11]. The Proposed New method had the highest transferability against unseen attacks, as evidenced by the results in Table 7.

Table 7. Results for Test 4 after fine-tuning (image level).

Method       | Classification Acc (%) | Classification EER (%) | Segmentation Acc (%)
FT Res       | 80.04 (↑ 26.54) | 17.57 (↓ 25.53) | –
FT           | 70.89 (↑ 18.60) | 25.56 (↓ 16.23) | –
Deeper FT    | 82.00 (↑ 28.61) | 17.33 (↓ 19.67) | –
Proposed Old | 78.57 (↑ 21.75) | 20.79 (↓ 15.50) | 84.39 (↑ 0.16)
No Recon     | 82.93 (↑ 28.07) | 16.93 (↓ 18.93) | 92.60 (↑ 7.74)
Proposed New | 83.71 (↑ 29.64) | 15.07 (↓ 18.97) | 93.01 (↑ 8.34)

5. Conclusion

The proposed convolutional neural network with a Y-shaped autoencoder demonstrated its effectiveness for both classification and segmentation tasks without using a sliding window, as is commonly used by classifiers. Information sharing among the classification, segmentation, and reconstruction tasks improved the network's overall performance, especially for the mismatch condition for seen attacks. Moreover, the autoencoder can quickly adapt to deal with unseen attacks by using only a few samples for fine-tuning. Future work will mainly focus on investigating the effect of using residual images [11] on the autoencoder's performance, processing high-resolution images without resizing, improving its ability to deal with unseen attacks, and extending it to the audiovisual domain.

Acknowledgement

This research was supported by JSPS KAKENHI Grant Numbers JP16H06302 and JP18H04120 and by JST CREST Grant Number JPMJCR18A6, Japan.

References

[1] Terrifying high-tech porn: Creepy 'deepfake' videos are on the rise. https://www.foxnews.com/tech/terrifying-high-tech-porn-creepy-deepfake-videos-are-on-the-rise. Accessed: 2019-03-12. 1
[2] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. MesoNet: a compact facial video forgery detection network. In WIFS. IEEE, 2018. 1, 2
[3] O. Alexander, M. Rogers, W. Lambeth, J.-Y. Chiang, W.-C. Ma, C.-C. Wang, and P. Debevec. The Digital Emily Project: Achieving a photorealistic digital actor. IEEE Computer Graphics and Applications, 30(4):20–31, 2010. 2
[4] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. ACM Transactions on Graphics, 2017. 1, 2
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017. 2
[6] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. Manjunath. Exploiting spatial structure for localizing manipulated image regions. In ICCV, pages 4970–4979, 2017. 2
[7] J. H. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. K. Roy-Chowdhury. Hybrid LSTM and encoder-decoder architecture for detection of image forgeries. IEEE Transactions on Image Processing, 2019. 2, 3
[8] B. Bayar and M. C. Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In IH&MMSec. ACM, 2016. 2
[9] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? In BMVC, 2017. 1, 2
[10] D. Cozzolino, G. Poggi, and L. Verdoliva. Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In IH&MMSec. ACM, 2017. 2
[11] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018. 1, 2, 3, 4, 5, 6, 7
[12] J. Fridrich and J. Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 2012. 2
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 4
[14] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. In SIGGRAPH. ACM, 2018. 1, 2
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICML, 2015. 4
[16] P. Korshunov and S. Marcel. Deepfakes: a new threat to face recognition? Assessment and detection. Idiap-RR-18-2018, Idiap, 2018. 1
[17] Y. Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In WIFS, pages 1–7. IEEE, 2018. 1, 2
[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 2
[19] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. In Odyssey 2018 The Speaker and Language Recognition Workshop, pages 195–202, 2018. 1, 2
[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010. 4
[21] H. H. Nguyen, T. Tieu, H.-Q. Nguyen-Son, V. Nozick, J. Yamagishi, and I. Echizen. Modular convolutional neural network for discriminating between computer-generated images and photographic images. In ARES, page 1. ACM, 2018. 2
[22] H. H. Nguyen, J. Yamagishi, and I. Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. In ICASSP. IEEE, 2019. 2, 3
[23] W. Quan, K. Wang, D.-M. Yan, and X. Zhang. Distinguishing between natural and computer-generated images using convolutional neural networks. IEEE Transactions on Information Forensics and Security, 2018. 2
[24] R. Raghavendra, K. B. Raja, S. Venkatesh, and C. Busch. Transferable deep-CNN features for detecting digital and print-scanned morphed face images. In ICCV Workshop. IEEE, 2017. 2
[25] N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen. Distinguishing computer graphics from natural images using convolution neural networks. In WIFS. IEEE, 2017. 1, 2
[26] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018. 1, 2, 4, 5
[27] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019. 1, 4, 5, 6
[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 5
[29] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 2017. 1, 2
[30] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR. IEEE, 2016. 1, 2, 4
[31] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Two-stream neural networks for tampered face detection. In ICCV Workshop. IEEE, 2017. 2
[32] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection. In CVPR, pages 1053–1061, 2018. 2
