MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY
PROJECT
SUBJECT: DATA SCIENCE
Topic: Analyzing TikTok's reliability in requesting removal of videos
that violate national government standards
Instructor: Thai Kim Phung
Group of students:
Students Student ID
Nguyen Hoang Huy 31221021995
Pham Quynh Nga (Leader) 31221024877
Vu Tuong Vy 31221021239
Vu Thi Hong Tuoi 31221023878
Vo Tran Bao Tran 31221024798
Class of subject: 23D1INF50905929
Ho Chi Minh City, March 20, 2023
Table of contents
1. Project introduction
1.1 Reasons for choosing the topic
1.2 Research goals
1.3 Methods of implementation
1.4 Research object
2. Theoretical basis
2.1 Data mining
2.2 Machine learning
2.3 Data visualization
3. Proposed research model
3.1 Description of data
3.2 Processing of data
3.3 Data visualization
4. Performance results
4.1 Analysis of results based on software
4.2 Prediction data results
4.3 Evaluation of results and models
5. Conclusions and General Comments
6. Thanks
7. References
1. Project introduction
1.1 Reasons for choosing the topic
Science and technology are developing rapidly and influence most areas of life. The
many social networking sites that have appeared along the way have had a great impact
on their users, especially young people. Each advance in technology sets the stage for
a new form of communication. The benefits of social networks are undeniable: a huge,
diverse, and constantly updated supply of information, many utilities for
entertainment and learning, and above all a powerful new form of communication that
connects individuals, groups, and nations worldwide. At the same time, social networks
also bring many negative effects, such as the spread of false information and of
content that does not meet community standards.
Currently, TikTok is one of the most visited and most used social networking
platforms in the world in general and in Vietnam in particular. Although it only began
to expand beyond the Chinese market in 2017, TikTok soon grew in the US as well, after
the musical.ly app was merged into it in August 2018. By the end of 2019, TikTok had
reached more than 1.5 billion downloads on Google Play and the App Store globally,
according to Sensor Tower. This rapid growth has turned TikTok into a platform that
leads the latest trends and significantly influences users' lives and perceptions.
Alongside these benefits, young people are the most likely to be drawn to uncensored
videos, challenges, and trends that can harm users' health and psychology. Strict
management and censorship of user content and activity in the community is therefore
necessary. For the same reason, it is worthwhile to analyze the reliability of the
censorship and removal of videos that violate community standards on TikTok at the
request of the governments of different countries, and to apply knowledge of
technology and innovation to the community. Observing, thinking about, and keeping up
with social networking trends in domestic and foreign markets is essential for
students majoring in Technology. The knowledge from the Data Science module makes the
application of technology and of data analysis and processing methods easier to
understand.
As for the reliability of countries' requests to remove videos that violate community
standards on TikTok, the following factors are influential: total number of requests,
content actioned for violating the Community Guidelines, content actioned for
violating (local) law, content not actioned, reports requesting removal, and so on.
These factors make it easier to access information and collect data for the
reliability analysis.
1.2 Research goals
Solve the research problem through analysis of the theoretical basis.
Study data processing and classification methods (classification methods make
predictions by assigning objects to classes). The study introduces several data
classification methods and then selects the most suitable and dependable one for
forecasting the data.
Analyze the credibility of national governments' handling of community
standards violations on the TikTok platform.
Use the results of the data analysis, together with the reliability indicators, to
draw conclusions, state the limitations of the research paper, and propose the best
solution to the research problem.
1.3 Methods of implementation
Collect information and data from TikTok's official website, through its report on
government removal requests.
Use Excel and the Orange data mining tool to process the data, present the data, and
compare models.
1.4 Research object
Reports of total content, takedown requests and accounts reported in violation of
the TikTok Community Standards in 43 countries.
2. Theoretical basis
2.1 Data mining
Data mining is the process of finding and analyzing patterns in data to extract useful
and potentially valuable information. It is a part of data science and is commonly
used in fields such as business, health, and the social sciences.
Data mining involves using tools and techniques to analyze large, complex, and
unstructured data. These techniques may include cluster analysis, correlation analysis,
time series analysis, machine learning, and text mining. The results of this process can
be used to create predictions, discover relationships, find new knowledge, and support
decisions in business and other areas.
Data mining process:
Orange is a tool that integrates data mining and machine learning. It is written in
Python and offers users an interactive, visual interface. Orange provides tools for
data analysis algorithms such as principal component analysis (PCA), cluster analysis,
independent component analysis (ICA), regression, support vector machines (SVM), and
many others.
2.2 Machine learning
Machine learning is a method in data science that allows computers to learn from data
automatically and to generate predictive or classification models.
Instead of programming explicit rules to solve a problem, machine learning
allows computers to learn from data and create models based on the patterns and
latent information in the data.
It is an important tool in data science and is widely used in fields such as
economics, finance, healthcare, and marketing to predict trends, make decisions, and
analyze data.
Some algorithms in machine learning:
Linear Regression: A supervised learning method for predicting a continuous value
from independent variables.
Logistic Regression: A supervised learning method for the binary classification
problem.
Decision Tree: A supervised learning method for classification or regression that
builds a tree of decision rules.
Random Forest: A supervised learning method for classification or regression that
combines many randomized decision trees.
Support Vector Machine (SVM): A supervised learning method for classification or
regression based on finding the best hyperplane to separate the data points.
Clustering: An unsupervised learning method based on finding groups of similar items
in the data.
Principal Component Analysis (PCA): An unsupervised learning method for data
dimensionality reduction based on finding the principal components of the data.
Neural Networks: A supervised or unsupervised learning method for classification or
regression based on a network built from multiple layers of connected units.
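To make the Decision Tree entry above more concrete, the following is a minimal sketch of how a single split of a tree can be chosen by minimizing Gini impurity. It is an illustration only, not the procedure Orange runs internally; the function names `gini` and `best_split` and the toy numbers are assumptions made up for this example and do not come from the TikTok report.

```python
from typing import List, Tuple


def gini(labels: List[str]) -> float:
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values: List[float], labels: List[str]) -> Tuple[float, float]:
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    n = len(values)
    best = (values[0], gini(labels))  # fallback when no split improves purity
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up toy data (NOT from the report): a removal rate in percent and a label.
toy_rates = [5.0, 12.0, 20.0, 55.0, 70.0, 90.0]
toy_labels = ["Unreliable", "Unreliable", "Unreliable",
              "Reliable", "Reliable", "Reliable"]

threshold, impurity = best_split(toy_rates, toy_labels)
print(threshold, impurity)
```

A full decision tree repeats this search on each resulting subset and over every feature until the nodes are pure or a stopping rule is met.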
Figure 1. Classification of machine learning methods
2.3 Data visualization
Data visualization is the process of presenting data and information using graphics or
charts so that users can easily understand and analyze the data.
It gives visual shape to the relationships between different data and attributes and
helps detect trends and patterns in the data.
The purpose of data visualization is to make it easier for users to understand and
extract information from data. It can help analyze data and find solutions to problems
in areas such as business, science, politics, and health.
Figure 2. Examples of some types of charts used to visualize data
3. Proposed research model
3.1 Description of data
Across all spreadsheets, the column “Reliability” is the target of the study. It
indicates TikTok's reliability toward national governments in handling violating
content in the host country. The group used two separate data tables, both treated
with 100% confidence. The table TikTok published on November 29, 2022 is used as
testing data, and the table published on May 17, 2022 is used as training data.
Other variables include:
Variable | Description
Country | Names of countries that submitted reports to TikTok starting January 1, 2019.
Total requests received | Government requests to remove or restrict content or accounts, including requests with inaccurate URLs.
Total content received | Valid content URLs requested, excluding any acceptable form of an account (i.e. URLs, UIDs, and usernames). We review all content requests for inaccurate URLs, duplicate requests, requests routed to different channels, and requests submitted with insufficient information to determine validity.
Content actioned due to Community Guidelines violations | Valid content URLs reviewed and actioned upon for violating our Community Guidelines.
Content actioned due to (local) law violations | All valid content URLs reviewed and actioned upon due to a violation of local law.
Content not actioned | Valid content URLs reviewed and deemed not to violate TikTok’s Community Guidelines, Terms of Service, and/or local law.
Total accounts received | Valid account URLs and any other acceptable form of an account (i.e. UIDs, usernames, etc.) requested. We review all content requests for inaccurate URLs, duplicate requests, requests routed to different channels, and requests submitted with insufficient information to determine validity.
Accounts actioned due to Community Guidelines violations | Valid account URLs reviewed and actioned due to Community Guidelines violation.
Accounts actioned due to (local) law violations | Valid account URLs reviewed and actioned due to a violation of local law.
Accounts not actioned | Valid account URLs reviewed and deemed not to violate TikTok’s Community Guidelines and/or local law.
Removal rate | Rate at which TikTok removed or restricted content or accounts in response to government demands.
Date | TikTok’s periodic reporting period.
Total Government Requests | All requests from governments, aggregated by TikTok over time.
Date range | The period from January 1, 2022 to June 30, 2022 for test data (July 1, 2021 to December 31, 2021 for training data).
Total copyright removal requests | Requests sent to TikTok about copyright issues, by country.
Successful copyright removal requests | Copyright removal requests approved by TikTok.
Percentage of successful copyright requests | Number of successful requests / total number of requests.
Total trademark removal requests | Requests sent to TikTok about trademark issues, by country.
Successful trademark removal requests | Number of approved trademark removal requests.
Percentage of successful trademark requests | Number of trademark claims approved / total claims.
Reliability | Indicates the level of confidence in the requests of the governments of the countries (Reliable or Unreliable).
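The two percentage variables in the table are simple ratios of successful requests to total requests. A minimal sketch of that calculation follows; the function name `success_percentage` and the sample numbers are hypothetical, and returning `None` for a country with no requests is an assumption of this sketch rather than something the report specifies.

```python
from typing import Optional


def success_percentage(successful: int, total: int) -> Optional[float]:
    """Successful requests as a percentage of all requests.

    Returns None when there were no requests, because the ratio is undefined.
    """
    if total < 0 or successful < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        return None
    return 100.0 * successful / total


# Illustrative numbers only (not taken from the TikTok report).
print(success_percentage(30, 40))  # 75.0
print(success_percentage(0, 0))    # None
```

Handling the zero-request case explicitly avoids a division error for countries that submitted no copyright or trademark requests in a period.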
3.2 Processing of data
Using the statistical analysis tool, the team cleaned the data as follows:
On inspection, the Date range is constant because all data in the table were reported
for January 1, 2022 to June 30, 2022. The group therefore removed the Date range
column from the analysis and evaluation.
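The cleaning step described above, dropping a column that holds the same value in every row, can be sketched in plain Python as follows. The team did this in their own tools; the function `drop_constant_columns` and the placeholder rows below are assumptions for illustration and are not values from the report.

```python
from typing import Dict, List


def drop_constant_columns(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Remove every column whose value is identical in all rows."""
    if not rows:
        return []
    constant = {
        key for key in rows[0]
        if all(row.get(key) == rows[0][key] for row in rows)
    }
    return [{k: v for k, v in row.items() if k not in constant} for row in rows]


# Hypothetical rows: names and numbers are placeholders, not report values.
rows = [
    {"Country": "A", "Date range": "Jan 1, 2022 - Jun 30, 2022",
     "Total requests received": 10},
    {"Country": "B", "Date range": "Jan 1, 2022 - Jun 30, 2022",
     "Total requests received": 25},
]
cleaned = drop_constant_columns(rows)
print(list(cleaned[0].keys()))  # ['Country', 'Total requests received']
```

A constant column carries no information that could separate “Reliable” from “Unreliable” rows, which is why removing it does not change what a classifier can learn.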
3.3 Data visualization
The team used Excel and Python, along with its libraries, to visualize the data and
obtained the following results:
Number of requests over the years:
Over the years, TikTok has received a large number of requests from national
governments, which shows the enormous amount of information this popular platform has
to deal with. Because the number of requests rises sharply year by year and shows no
sign of stopping, assessing the credibility of these governments' requests is also a
top priority.
Removal requests for copyright reasons:
The failure rate of government requests is lower than the success rate, which
indicates high reliability.
Removal requests for trademark reasons:
The number of approved requests is higher than the number of failed requests, so
reliability from TikTok's side is high.
Share of requests sent to TikTok by country:
The number of requests differs greatly between governments, so each country accounts
for a different share. The graph shows large disparities across countries, but these
do not determine reliability. The differences reflect each country's need to handle
harmful information, independent of how reliably TikTok responds to that need.
Content received by TikTok:
The percentage of content removed for violating community standards is higher
than the percentage of content removed for violating the laws of the host country. The
percentage of content that is not removed is approximately 10% (low), which shows the
level of trust TikTok places in the content requests of countries.
Number of accounts received:
The percentage of accounts actioned under local law is lower than the percentage of
accounts actioned for violating community guidelines. The rate of accounts not
actioned is still quite low, so reliability is relatively high.
Reliability based on removal rate:
The removal rate for each country reflects how credible TikTok finds that government's
requests. The higher a country's removal rate, the more valid and credible that
government's requests are.
Heatmap obtained from the analysis:
4. Performance results
4.1 Analysis of results based on software
In the first step of the training process, the students loaded the report data on
government requests to remove standards-violating videos from July to December 2021,
collected from TikTok, into the Orange software and declared the attributes of the
variables.
Figure 3: Declare attributes for variables in the training dataset
The dependent variable “Reliability” takes two values, “Unreliable” and “Reliable”,
and is declared under the target attribute. The variable “Nation” is declared under
the meta attribute and does not affect the data classification process. The remaining
variables are independent variables declared as feature.
The students selected four algorithms for the training process: Decision Tree, Neural
Network, SVM, and Logistic Regression. The four models are tested, their evaluation
criteria are compared, and the most suitable model for the study is chosen according
to the following steps:
Figure 4: Overview of the training and forecasting process
Here, the study evaluates the classification models with k-fold cross-validation,
with k = 5. The data are split into five parts; each model is trained on four parts
and tested on the remaining one, and this is repeated until every part has served as
the test set once. Because the test sets never overlap, every sample is used for
testing exactly once, which gives a more dependable estimate of model accuracy than a
single split.
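The five-fold procedure described above can be sketched as follows. This is a plain, unstratified split written for illustration; Orange's own cross-validation may shuffle and stratify differently, and the function name `k_fold_indices` and the `seed` parameter are assumptions of this sketch.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs.

    Every sample index appears in exactly one test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # avoid following file order
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 43 is the number of rows in the training table used in this study.
splits = k_fold_indices(43, k=5)
for train, test in splits:
    print(len(train), len(test))
```

With 43 samples and k = 5, three folds hold 9 test samples and two hold 8, and each model is trained on the remaining 34 or 35 samples in each round.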
After training the models, the students obtained the following results:
Figure 5: Results after training data
Based on the CA, F1, Precision, Recall, and AUC scores, the Decision Tree model has
the best results among the models. The F1 score, which is commonly used to evaluate
models, is highest for the Decision Tree at 0.977 (97.7%). Although the AUC of this
model is not the highest (96.7%, slightly below the Neural Network's 96.9%), the gap
is small and does not affect the overall assessment of the model.
In particular, the suitability of the Decision Tree algorithm for this study is
also supported by the confusion matrix:
Figure 6: Confusion matrix of the Decision Tree model
The confusion matrix above shows that, of the 43 samples in the training dataset:
15 samples belong to the “Unreliable” class, of which 14 are classified correctly and
1 is misclassified.
28 samples belong to the “Reliable” class, all of which are classified correctly with
no misclassification.
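The counts in the confusion matrix described above are enough to recompute the headline scores by hand. The sketch below does so in plain Python; averaging the per-class F1 scores weighted by class size is an assumption about how the reported figure was aggregated, not something the report states.

```python
# Confusion matrix counts as described in the text
# (outer key = actual class, inner key = predicted class).
matrix = {
    "Unreliable": {"Unreliable": 14, "Reliable": 1},
    "Reliable": {"Unreliable": 0, "Reliable": 28},
}
classes = list(matrix)
total = sum(sum(row.values()) for row in matrix.values())

# Classification accuracy: correctly classified samples over all samples.
accuracy = sum(matrix[c][c] for c in classes) / total


def f1_for(cls: str) -> float:
    """F1 score when `cls` is treated as the positive class."""
    tp = matrix[cls][cls]
    fn = sum(matrix[cls].values()) - tp
    fp = sum(matrix[other][cls] for other in classes) - tp
    return 2 * tp / (2 * tp + fp + fn)


# Class-size-weighted average of per-class F1 (assumed averaging rule).
weighted_f1 = sum(sum(matrix[c].values()) * f1_for(c) for c in classes) / total

print(round(accuracy, 3), round(weighted_f1, 3))
```

Under that assumption, both the accuracy (42 of 43 correct) and the weighted F1 round to 0.977, which agrees with the F1 value quoted for the Decision Tree.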
In conclusion, the Decision Tree model fits the dataset of this study very well and
is suitable for predicting TikTok's reliability in requesting removal of videos that
violate government standards. The countries in the forecast dataset are presented in
the following section.
4.2 Prediction data results
After choosing the Decision Tree algorithm, the students loaded the forecast dataset
into Orange and used the model learned from the training dataset to predict the
reliability of requests to remove videos that violate national government standards
from January to June 2022.
Figure 7: Properties of the forecast dataset
As in the training dataset, the dependent variable “Reliability” is declared as target.
The “Continents” variable is not important, so it is declared as skip. The variables
“Nation” and “Country code” are declared under the meta attribute and do not affect
the data classification process. The remaining variables in the forecast dataset are
independent variables declared as feature.
We then feed the forecast data into Predictions to predict TikTok's reliability in
requesting removal of videos that violate national government standards in 2022,
using the Decision Tree model:
Figure 8: Results of forecasting using Decision Tree
The forecast of TikTok's reliability in requesting the removal of videos that violate
the standards of the governments of 54 countries from January to June 2022 shows
that:
34 samples are classified as “Reliable” and the remaining 20 samples are predicted to
be “Unreliable” in requesting removal of videos that violate national government
standards.
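The split of the forecast above can be expressed as shares of the 54 countries with a short tally. The list below is rebuilt only from the two counts reported in the text; the per-country labels themselves are not reproduced here.

```python
from collections import Counter

# Predicted labels rebuilt from the reported counts (34 and 20);
# which country received which label is not encoded in this sketch.
predictions = ["Reliable"] * 34 + ["Unreliable"] * 20

counts = Counter(predictions)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} of {total} ({100 * count / total:.1f}%)")
```

Roughly 63% of the forecast countries fall into the “Reliable” class and roughly 37% into the “Unreliable” class.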
4.3 Evaluation of results and models
Of the four models that were run, the Decision Tree gives better results than the
other three, so it is the model applied to the dataset to be predicted. The students
believe that this model should also be used for reliability assessments in the future.
5. Conclusions and General Comments
In general, the reliability of TikTok in requesting the removal of videos that
violate standards across countries is not especially high and cannot be considered
absolutely accurate. What counts as a violation of community standards depends on the
law of each country and region and on how strictly each government exercises control.
For example, in some developed countries the removal rate for standards-violating
videos is relatively high, which implicitly suggests that the work of ensuring network
security in those areas is very good, and vice versa.
Limitations of the topic:
* As students, we do not have enough knowledge and experience to delve deeply
into the cybersecurity laws of each country and draw clear and absolute conclusions
about the security and accuracy of TikTok's system in removing videos that violate
standards in each country and region around the world.
* The survey is not very wide, covering only 43 countries and territories, so it
cannot capture all the factors that lead to high reliability for TikTok.
6. Thanks
The team would like to express our sincere thanks to Dr. Thai Kim Phung, lecturer in
the Data Science Department, for the enthusiastic support that allowed us to complete
this project successfully. If the team has made any mistakes in carrying out the
project, we would welcome your sincere feedback so that we can improve in future
projects.
7. References
Data source:
https://www.tiktok.com/transparency/en-us/government-removal-requests-2022-1/
Related knowledge (logistic regression, decision tree, neural network) from the
website:
https://goeco.link/VNZtF