Multi-Task Learning For Detecting and Segmenting Manipulated Facial Images and Videos
3. Proposed Method
3.1. Overview

[Figure: Y-shaped autoencoder. An encoder of Conv(3×3) + Batch Norm + ReLU blocks feeds decoder branches of TransConv(3×3) + Batch Norm + ReLU blocks; the outputs are the classification label (0 or 1), the segmentation map, and the reconstructed image trained with an L2 loss.]

…a segmentation map corresponding to this input image. For video inputs, we average the probabilities of all frames before drawing a conclusion on the probability of the input being real or fake.
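As a concrete illustration of this video-level rule, here is a minimal sketch that averages per-frame fake probabilities; the `model` interface and the 0.5 decision threshold are assumptions for illustration, not values specified in the text.

```python
import torch

def classify_video(model, frames, threshold=0.5):
    """Average per-frame fake probabilities into one video-level decision.

    `model` is assumed to map a batch of face crops (N, 3, H, W) to an
    (N,) tensor of probabilities of being manipulated; the 0.5 threshold
    is an illustrative choice."""
    model.eval()
    with torch.no_grad():
        probs = model(frames)             # one fake probability per frame
    video_prob = probs.mean().item()      # average over all frames
    return video_prob, video_prob >= threshold
```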
3.2. Y-shaped Autoencoder

The partitioning of the latent features (motivated by Cozzolino et al.'s work [11]) and the Y-shaped design of the decoder enable the autoencoder to share valuable information…
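Based on the figure's building blocks, the following is a minimal PyTorch sketch of such a Y-shaped autoencoder: a shared encoder, a latent space partitioned into "real" and "fake" halves in the spirit of Cozzolino et al. [11], and a decoder trunk that forks into segmentation and reconstruction branches. The channel widths, strides, and the activation-energy classifier are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class YShapedAutoencoder(nn.Module):
    """Sketch of a Y-shaped autoencoder: one shared encoder, a latent space
    split into a 'real' half and a 'fake' half (following Cozzolino et al.
    [11]), and a decoder that forks into a segmentation branch and a
    reconstruction branch. Layer sizes are illustrative only."""

    def __init__(self, latent_ch=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1), nn.BatchNorm2d(latent_ch), nn.ReLU(),
        )
        # Shared decoder trunk; its output feeds both branches of the Y.
        self.trunk = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.seg_head = nn.Sequential(  # segmentation branch -> 1-channel mask
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )
        self.rec_head = nn.Sequential(  # reconstruction branch -> RGB image
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        # Partition the latent channels: the first half encodes "real"
        # evidence, the second half "fake" evidence; classify by comparing
        # their activation energies.
        half = z.shape[1] // 2
        a_real = z[:, :half].abs().mean(dim=(1, 2, 3))
        a_fake = z[:, half:].abs().mean(dim=(1, 2, 3))
        prob_fake = a_fake / (a_real + a_fake + 1e-8)
        h = self.trunk(z)
        return prob_fake, self.seg_head(h), self.rec_head(h)
```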
…in the ImageNet Large Scale Visual Recognition Challenge [28]. We did not apply any data augmentation to the training data.

The training and testing datasets were designed as shown in Table 1. For the Training, Test 1, and Test 2 datasets, the Face2Face method [26] was used to create manipulated videos. Images in Test 2 were harder to detect than those in Test 1 since the source and target videos used for reenactment were the same, meaning that the reenacted video frames had better quality. We therefore call Test 1 and Test 2 the match and mismatch conditions for a seen attack. Test 3 used the Deepfake attack method and Test 4 the FaceSwap attack method, both from the FaceForensics++ database [27]. Neither attack method was used to create the training set, so both were considered unseen attacks. For the classification task, we calculated the accuracy and equal error rate (EER) of each method. For the segmentation task, we used the pixel-wise accuracy between the ground-truth masks and the predicted segmentation masks. The FT Res, FT, and Deeper FT methods could not perform the segmentation task. All results were at the image level.

4.2. Training Y-shaped Autoencoder

To evaluate the contribution of each component of the Y-shaped autoencoder, we designed the settings shown in Table 2. The FT Res and FT methods are re-implementations of Cozzolino et al.'s method with and without residual images [11]; they can also be understood as the Y-shaped autoencoder without the segmentation branch. The Deeper FT method is a deeper version of FT with the same depth as the proposed method. The Proposed Old method is the proposed method using the weighting settings from Cozzolino et al.'s work [11], the No Recon method is the proposed method without the reconstruction branch, and the Proposed New method is the complete proposed method with the Y-shaped autoencoder, using equal weights for the three task losses and the mean squared error for the reconstruction loss (a sketch of this objective is given at the end of this subsection).

Since shallower networks take longer to converge than deeper ones, we trained the shallower networks for 100 epochs and the deeper ones for 50. For each method, the training stage with the highest classification accuracy and a reasonable segmentation loss (if available) was used to perform all the tests described in this section.
4.3. Dealing with Seen Attacks

The results for the match and mismatch conditions for seen attacks are shown in Tables 3 (Test 1) and 4 (Test 2), respectively. The deeper networks (the last four) had substantially better classification performance than the shallower ones (the first two) proposed by Cozzolino et al. [11]. Among the four deeper networks, there were no substantial differences in their performance on the classification task. For the segmentation task, the No Recon and Proposed New methods, which used the new weighting settings, had higher accuracy than the Proposed Old method, which used the old weighting settings.

…which had a greater chance of being over-fitted, had nearly random classification results. In Test 4, although all methods suffered from nearly random classification accuracies, their better EERs indicated that the decision thresholds had been moved.

A particularly interesting finding was in the segmentation results. Although degraded, the segmentation accuracies were still high, especially in Test 4, in which FaceSwap copied the facial area from the source faces to the target ones using a computer-graphics method. When dealing with unseen attacks, this segmentation information could thus be an important clue, in addition to the classification results, for judging the authenticity of queried images and videos.
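The EER comparison above is informative precisely because the EER is computed at the operating point where the false-positive rate equals the false-negative rate, so it is insensitive to a decision threshold that has drifted. A minimal NumPy sketch of this metric (assuming scores are fake probabilities and labels use 1 = fake):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal error rate of fake-probability scores against binary labels
    (1 = fake): the error rate at the threshold where the false-positive
    rate (real flagged as fake) equals the false-negative rate (fake
    passed as real), so a shifted decision threshold does not affect it."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(scores):
        fpr = np.mean(scores[labels == 0] >= thr)  # real wrongly flagged as fake
        fnr = np.mean(scores[labels == 1] < thr)   # fake wrongly passed as real
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2.0
    return eer
```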
Table 3. Results for Test 1 (image level).

                  Classification          Segmentation
  Method          Acc (%)    EER (%)      Acc (%)
  FT Res          82.30      14.53        —
  FT              88.43      11.60        —
  Deeper FT       93.63       7.20        —
  Proposed Old    92.60       7.40        81.40
  No Recon        93.40       7.07        89.21
  Proposed New    92.77       8.18        90.27

Table 5. Results for Test 3 (image level).

                  Classification          Segmentation
  Method          Acc (%)    EER (%)      Acc (%)
  FT Res          64.75      30.71        —
  FT              62.61      37.43        —
  Deeper FT       51.21      42.71        —
  Proposed Old    53.75      42.00        70.18
  No Recon        51.96      42.45        70.43
  Proposed New    52.32      42.24        70.37