Knowledge Graph Implementation On The Wikipedia Page Using A Deep Learning Algorithm
I. Introduction
Analysis of social networks or online communities can be very difficult in large
networks, as many measurements are computationally expensive. For example, identifying the
community structure of a network is a very computationally expensive task. Graph embeddings are a
way to represent graphs with vectors, so that further analysis becomes easier [1].
Data structured as graphs appear in many domains, such as biology, chemistry, images,
decision-making systems, and social media networks. Using these data in machine learning models has proved
difficult because of the high-dimensional nature of graph data. The graph neural network is a recent
technique for knowledge graphs, allowing machine learning models to be created that
simultaneously learn the structure of the graph data and fit predictive models
on it [2].
Graph neural networks learn graph representations by embedding the graph's nodes
in a lower-dimensional space. This representation is trained end to end to reflect the
structural properties of the graph that are of interest for the problems at hand. Node representations
are created iteratively by combining information from each node's neighborhood [3].
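The iterative neighborhood aggregation described above can be sketched in a few lines. This is a minimal illustration, not the model used in this research; the graph, embedding values, and function name are invented for the example:

```python
def aggregate(adj, emb, rounds=2):
    """Message-passing sketch: each node's embedding becomes the average of
    its own embedding and its neighbours' embeddings, repeated `rounds` times."""
    for _ in range(rounds):
        emb = {
            v: [
                (vec[i] + sum(emb[n][i] for n in adj[v])) / (1 + len(adj[v]))
                for i in range(len(vec))
            ]
            for v, vec in emb.items()
        }
    return emb

# Toy graph: two connected nodes with 1-dimensional embeddings.
adj = {"a": ["b"], "b": ["a"]}
emb = aggregate(adj, {"a": [0.0], "b": [1.0]}, rounds=1)
```

After one round, each node's value is pulled toward its neighbour's, which is the basic mechanism that lets structurally close nodes end up with similar embeddings.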
The purpose of this research is to test and evaluate the results of a deep learning approach
to representing knowledge graph data from Wikipedia articles.
A. Data Mining
Many people treat data mining as a synonym for another popular term, knowledge discovery
from data, or KDD, while others see data mining as just an important step in the knowledge discovery
process [4]. The knowledge discovery process is shown in Figure 1.
D. DeepWalk
Fig. 3. DeepWalk
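DeepWalk builds node embeddings by generating truncated random walks over the graph and treating them as sentences for Word2vec [10]. The random walk step can be sketched as follows; the toy adjacency list of page titles, the walk parameters, and the function name are illustrative assumptions, not the paper's implementation:

```python
import random

def generate_walks(adj, walks_per_node=2, walk_length=4, seed=0):
    """DeepWalk step 1: truncated random walks, later treated as 'sentences'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy undirected graph as an adjacency list (hypothetical page titles).
graph = {
    "cryptocurrency": ["bitcoin", "crypto-anarchism"],
    "bitcoin": ["cryptocurrency"],
    "crypto-anarchism": ["cryptocurrency"],
}
walks = generate_walks(graph)
```

Feeding these walks to a skip-gram model then places frequently co-visited nodes close together in the embedding space.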
E. Word2vec
As seen in Figure 4 below, w(t) is the given target word. One hidden layer performs the
calculation between the weight matrix and the input vector for w(t). The result in the
hidden layer is passed to the output layer, which calculates the product of the hidden layer's
output vector and the output layer's weight matrix. The softmax activation
function is then applied to calculate the probability of each word appearing in the context
of w(t) [11].
Fig. 4. Word2vec
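The forward pass described above (input vector times hidden-layer weights, hidden output times output-layer weights, then softmax) can be illustrated with toy numbers; the vocabulary, dimensions, and weight values below are invented for the sketch:

```python
import math

def softmax(scores):
    """Turn raw output scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# Toy vocabulary of 3 words with a 2-dimensional hidden layer.
vocab = ["graph", "node", "edge"]
W_in = [[0.1, 0.2], [0.4, 0.3], [0.2, 0.9]]   # input -> hidden, one row per word
W_out = [[0.5, 0.1], [0.3, 0.8], [0.7, 0.2]]  # hidden -> output, one row per word

# A one-hot input for target word w(t) = "node" simply selects its row of W_in.
hidden = W_in[vocab.index("node")]
probs = softmax(matvec(W_out, hidden))
```

The resulting `probs` sums to 1 and gives one probability per vocabulary word, matching the softmax output layer in Figure 4.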
F. Kmeans
The K-means algorithm is a partitioning algorithm: K-means relies on a preliminary
estimate obtained by defining group centroids [12]. The K-means algorithm uses an iterative process to
cluster a database. It takes the desired number of clusters as input and produces the
final centroids as output. The K-means clustering method chooses k patterns as starting points, i.e.
random centroids. The number of iterations needed to reach stable cluster centroids is influenced by the
randomly chosen candidate centroids, so one way to improve the algorithm is to determine the initial
cluster centroids from regions of high data density in order to obtain higher performance
[13].
On completion, the K-means algorithm produces the centroid points that are its goal.
The K-means algorithm groups the data items in a dataset into clusters based
on the closest distance [14]. The distance from the randomly chosen initial centroids, which serve as
the initial center points, to all data is calculated using the Euclidean distance formula. Data that
have a short distance to a centroid form a cluster. This process continues until there is no
change in any group [15].
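The loop described above (assign each point to its nearest centroid by Euclidean distance, recompute centroids, repeat until nothing changes) can be sketched in plain Python; the 2-D points and initial centroids are illustrative, not the paper's data:

```python
import math

def kmeans(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and repeat until assignments stabilise."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no change in any group: converged
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of 2-D points (invented data).
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, groups = kmeans(data, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With well-separated data like this, the assignments stabilise after a couple of iterations and the centroids converge to the cluster means.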
G. Girvan Newman
The Girvan-Newman algorithm detects communities by progressively removing edges from the
original network; the connected components of the remaining network are the communities. Instead of
trying to establish a measure of which edges are most central to a community, the
Girvan-Newman algorithm focuses on edges that are most likely to lie "between" communities. The
vertex betweenness is an indicator of a highly central node in the network: for each node, the vertex
betweenness is defined as the fraction of shortest paths between pairs of nodes that pass through it.
This is relevant for networks modeling the transfer of goods between known starting
and ending points, assuming the transfer takes the shortest available route [16].
H. Javascript
JavaScript is an object-based scripting language that allows users to control many user
interactions in an HTML document, where such objects may include windows, frames, URLs,
documents, forms, buttons, and others [17].
I. Wikipedia
Wikipedia is a free and open networked multilingual encyclopedia project [18]. Since its official
launch on January 15, 2001, the English Wikipedia has experienced tremendous growth in the
number of articles. Overall, Wikipedia in all languages reached 1 million articles in September
2004 and has continued to climb, passing 55 million articles in 2020 [19].
Yudistira Bagus Pratama et.al (Knowledge Graph Implementation on Wikipedia page using Deep Learning)
International Journal Of Artificial Intelegence Research
J. Method
This research uses the Cross Industry Standard Process for Data Mining (CRISP-DM) method.
CRISP-DM is the most representative method for planning overall data extraction, experimental
design, and evaluation. Since this study has exploratory and research objectives, the first and last
steps of CRISP-DM, business understanding and deployment, were not implemented; only the data
understanding, data preparation, modeling, and evaluation steps were carried out.
Web scraping is the process of retrieving information from existing websites. Web scraping
applies indexing by browsing the HTML documents of a website from which information will be
retrieved [20]. The Wikipedia website was the object of this research.
Fig. 6. Crawler Application
When this web scraping application is run, it displays a page containing the functions of the
scraper. The value of "distance" determines the number of iterations: with a distance of 1, the
application fetches the original page and the pages in its "see also" section, and as the value
increases, the application performs the same operation on every page that is fetched. The "stop
words" field makes it possible to specify which pages should be discarded: the application
searches for every stop word in the article title, and if there is a match, the article is discarded.
Results are cached in local browser storage for 24 hours, making it possible to quickly recreate previous
scrapes or restart cancelled ones; the "clear cache" button resets it. "Source" contains the
analyzed article, "Target" contains the referenced article, and "Level" is the distance from the
original node.
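The "stop words" behaviour described above (discard an article when any stop word occurs in its title) can be sketched as follows. The titles and stop-word list are invented examples, and the paper's actual crawler is a JavaScript web application:

```python
def filter_articles(titles, stop_words):
    """Discard any article whose title contains a stop word (case-insensitive),
    mirroring the crawler's 'stop words' field described above."""
    stops = [s.lower() for s in stop_words]
    return [t for t in titles if not any(s in t.lower() for s in stops)]

# Hypothetical crawled titles and a stop-word list.
titles = ["Cryptocurrency", "List of cryptocurrencies", "Bitcoin"]
kept = filter_articles(titles, stop_words=["list of"])
```

Here "List of cryptocurrencies" is discarded because its title matches the stop word, while the other pages are kept.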
B. Preprocessing analysis
At this stage the data goes through several preprocessing steps, namely case folding and
stopword removal. This process transforms the data so that it does not contain a lot of noise
that could affect the accuracy of the classification. The flowchart for data preprocessing can
be seen in Figure 7.
• Read data: the data stored as a document is read and converted into a pandas dataframe for preprocessing.
• Case folding: the data is filtered by removing numbers from sentences, trimming spaces at the beginning and end, and converting all sentences to lower case.
• Stopword removal: words that are not used in the classification process, such as "at", "list of", and "with", are removed. This step is implemented using a JavaScript NPM library.
• Save TSV: the data that has gone through the entire process is saved as a document in TSV format, and the TSV data is then analyzed.
In the case folding process, the data is transformed into lowercase and everything other than
letters is removed from the sentences. The stopword removal process is applied to the source and
target attributes. The results of the preprocessing can be seen in Figure 8.
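The preprocessing steps above (case folding, stopword removal, saving as TSV) can be sketched with the Python standard library. The example rows, stop words, and the crude substring-based stopword removal are illustrative assumptions; the paper's own pipeline uses a JavaScript NPM library:

```python
import csv
import io
import re

def case_fold(text):
    """Case folding: strip digits, lowercase, trim leading/trailing spaces."""
    return re.sub(r"\d+", "", text).lower().strip()

def remove_stopwords(text, stop_words):
    """Naive substring removal of stop phrases (illustrative only)."""
    for s in stop_words:
        text = text.replace(s, "")
    return text.strip()

# Hypothetical (source, target) rows from the crawler.
rows = [("List of Cryptocurrencies 2020", "Bitcoin"),
        ("Crypto-Anarchism", "Cryptocurrency")]
stop_words = ["list of", "at", "with"]

cleaned = []
for source, target in rows:
    cleaned.append((remove_stopwords(case_fold(source), stop_words),
                    remove_stopwords(case_fold(target), stop_words)))

# Save as TSV (here to an in-memory buffer; a file path works the same way).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["source", "target"])
writer.writerows(cleaned)
tsv_text = buf.getvalue()
```

The resulting TSV has lowercase, digit-free titles with the stop phrases removed, ready for the graph analysis steps.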
C. Kmeans Clustering
Before running the K-means analysis, the number of clusters to be used must first be determined.
There are various methods for determining the optimal number of clusters, among them the average
silhouette and elbow methods.
Using average silhouette scoring, the calculated silhouette score is 0.74. A silhouette
analysis was performed on the plot with the aim of selecting the optimal value of n_clusters,
as can be seen in Figure 9. Values of n_clusters of 3, 4, and 5 appear suboptimal for the data
provided, because some clusters have a silhouette score below average and the silhouette widths
fluctuate in the plot. The value n_clusters = 2 looks optimal: the silhouette score of each
cluster is above the average silhouette score, the size fluctuations are nearly similar, and the
thickness is more uniform than for the other values. The optimal number of clusters is
therefore 2.
Fig. 9. Silhouette Scoring
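The silhouette coefficient used here can be computed directly from its definition, s(i) = (b - a) / max(a, b), where a is the mean intra-cluster distance of point i and b is its mean distance to the nearest other cluster. The two toy clusters below are invented, so the resulting score is not the paper's 0.74:

```python
import math

def silhouette_score(clusters):
    """Mean silhouette coefficient s(i) = (b - a) / max(a, b) over all points."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            others_same = [q for q in cluster if q is not p]
            if not others_same:
                scores.append(0.0)  # singleton cluster: s(i) defined as 0
                continue
            a = sum(math.dist(p, q) for q in others_same) / len(others_same)
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci and other
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters: the score should be close to 1.
score = silhouette_score([
    [(1.0, 1.0), (1.1, 0.9), (0.9, 1.0)],
    [(8.0, 8.0), (8.1, 8.2), (7.9, 8.0)],
])
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate overlapping or misassigned points, which is the basis of the n_clusters comparison above.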
In the elbow method, K-means clustering is run for a range of k values and, for each value,
the sum of squared distances from each point to its assigned cluster center (the distortion) is
calculated. When the distortions are plotted and the plot looks like an arm, the "elbow" (the
inflection point of the curve) is the best value of k. It can be seen that the elbow indicates 2
as the optimal number for this research, so K-means can be run with n_clusters = 2.
After the optimal number of clusters was obtained through the average silhouette scoring and
elbow method calculations, the K-means algorithm was applied; its results can be seen in
Figure 11.
The graph above shows a plot of the data colored by the cluster it belongs to: cluster 1 is
blue, while cluster 2 is green. K-means clustering groups similar items into clusters by
finding similarities between items. The number of clusters was obtained from the preceding
silhouette scoring and elbow method processes. Each cluster also has its own centroid, shown
in red. The 2 clusters obtained are consistent with the data collected during the crawling
process.
D. DeepWalk analysis
After applying the DeepWalk algorithm to the knowledge graph data, several keywords from
Wikipedia pages were entered as test examples. As a result, similar Wikipedia entities are
grouped together. For example, "crypto-anarchism", "digital gold currency", "2018 crypto
crash", and "cryptocurrency and crime" are all pages directly related to the cryptocurrency
page, while pages that do not reference each other lie at some distance from one
another.
E. Community Detection
Applying Girvan-Newman community detection to the crawled data yields 2 communities.
From this data, a community structure is built that shows the interconnected nodes in the
graph, with the 2 main node colors indicating the communities. Similar Wikipedia pages
are linked to each other and grouped with the same color, which means the crawler
application has performed its function properly, namely displaying similar Wikipedia pages.
From the figure above, it can be seen that the Wikipedia page community graph is formed
by successfully linking Wikipedia pages that reference each other.
F. Betweenness Centrality
Betweenness centrality measures the extent to which a node lies on the paths between other
nodes. Specifically, in this study betweenness centrality measures the extent to which a
Wikipedia page lies on the shortest paths between other Wikipedia pages in the network. The more
other pages depend on a page, the higher its betweenness centrality value.
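The definition above (for each node, the fraction of pairwise shortest paths that pass through it) can be computed naively on a small graph. The star graph below is an invented example, not the paper's Wikipedia network:

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Every shortest path from s to t (BFS predecessor tracking)."""
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w], preds[w] = dist[v] + 1, [v]
                queue.append(w)
            elif dist[w] == dist[v] + 1:
                preds[w].append(v)
    def build(node):
        return [[s]] if node == s else [p + [node] for pred in preds[node] for p in build(pred)]
    return build(t) if t in dist else []

def betweenness(adj):
    """For each node v: fraction of s-t shortest paths passing through v,
    summed over all pairs (unnormalised)."""
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for path in paths:
            for v in path[1:-1]:  # interior nodes only
                score[v] += 1.0 / len(paths)
    return score

# Star graph: the hub lies on every shortest path between the leaves.
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = betweenness(adj)
```

The hub accumulates one unit of score per leaf pair while the leaves score zero, which is the sense in which the highest-scoring Wikipedia page in Table 1 is the one other pages depend on most.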
From the figure above, it can be seen that the betweenness centrality graph of the Wikipedia
pages is formed by successfully ranking the pages with the highest influence, indicated by the
heatmap colors.
Table 1. Betweenness Centrality

Rank  Wikipedia Page                              Centrality Measure
1     alternative currency                        0.8130618720253833
2     cryptocurrency                              0.4456196016217168
3     crypto-anarchism                            0.18975850520007054
4     bitcoin                                     0.16596157236030318
5     virtual currency law in the united states   0.16014454433280453
6     blockchain-based remittances company        0.09518773135906927
7     cryptographic protocol                      0.06398730830248546
Table 1 shows the results of the betweenness centrality calculation for the list of nodes.
From these results, it can be seen that the page with the highest betweenness centrality
value is the alternative currency page, at 0.8130618720253833.
IV. Conclusion
This research was successfully carried out using a deep learning approach with an evaluation
process. The conclusions that can be drawn from this research are:
• The crawler application successfully displays similar Wikipedia pages.
• Analysis using the K-means algorithm groups the training data into 2 clusters, consistent
with the training data input.
• Analysis using the Girvan-Newman algorithm finds 2 communities in the training data
and depicts them in graph form, consistent with the training data input.
• Analysis using the DeepWalk algorithm produces groups of Wikipedia pages as
low-dimensional word representations, consistent with the training data input.
• Analysis using betweenness centrality finds the most referenced page in the Wikipedia
dataset.
The suggestions the authors can give for further research are:
• Use language-specific stopword lists and stemming in the preprocessing process, because
other languages differ from English.
• Try other algorithms to enrich the research literature related to knowledge graphs.
Acknowledgment
We are very grateful to the Data Science Interdisciplinary Research Center of Bina Darma
University for providing facilities for completing this research.
References
[1] K. Ollivier, “Embedding Directed Graphs using Random Walks (Master Thesis),” 2017. doi:
10.13140/RG.2.2.24150.88648.
[2] P. Rodriguez Esmerats, “Graph Neural Networks and its applications,” 2019.
[3] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The Graph
Neural Network Model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80,
2009, doi: 10.1109/TNN.2008.2005605.
[4] J. Han, M. Kamber, and J. Pei, “1 - Introduction,” in Data Mining (Third Edition), Third
Edition., J. Han, M. Kamber, and J. Pei, Eds. Boston: Morgan Kaufmann, 2012, pp. 1–38. doi:
https://doi.org/10.1016/B978-0-12-381479-1.00001-0.
[5] J. Pujoseno, “Implementasi Deep Learning Menggunakan Convolutional Neural Network Untuk
Klasifikasi Alat Tulis,” 2018.
[6] K. P. Danukusumo, “Implementasi deep learning menggunakan convolutional neural
network untuk klasifikasi citra candi berbasis GPU,” 2017.
[7] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in
Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017,
pp. 1025–1035.
[8] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on
graph neural networks,” IEEE transactions on neural networks and learning systems, vol. 32, no.
1, pp. 4–24, 2020.
[9] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,”
arXiv preprint arXiv:1609.02907, 2016.
[10] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,”
in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining, 2014, pp. 701–710.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in
vector space,” arXiv preprint arXiv:1301.3781, 2013.
[12] T. S. Madhulatha, “An overview on clustering methods,” arXiv preprint arXiv:1205.1117, 2012.
[13] M.-C. Hung, J. Wu, J.-H. Chang, and D.-L. Yang, “An Efficient k-Means Clustering Algorithm
Using Simple Partitioning.,” Journal of information science and engineering, vol. 21, no. 6, pp.
1157–1177, 2005.
[14] B. Bangoria, N. Mankad, and V. Pambhar, “A survey on efficient enhanced K-Means
clustering algorithm,” International journal for scientific research & development, i (9), pp.
1698–1700, 2013.
[15] A. Agrawal and H. Gupta, “Global K-means (GKM) clustering algorithm: a survey,” International
journal of computer applications, vol. 79, no. 2, 2013.
[16] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,”
Proceedings of the national academy of sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
[17] T. S. Koesheryatin, Aplikasi Internet Menggunakan HTML, CSS, dan JavaScript. Elex Media
Komputindo, 2014.
[18] M. Miliard, “Wikipediots: who are these devoted, even obsessive contributors to Wikipedia,” Salt
Lake City Weekly, vol. 1, no. 03, 2008.
[19] “Wikipedia milestones,” https://meta.wikimedia.org/wiki/Wikipedia_milestones, accessed
21 November 2021.
[20] A. A. Maulana, A. Susanto, and D. P. Kusumaningrum, “Rancang Bangun Web Scraping Pada
Marketplace di Indonesia,” JOINS (Journal of Information System), vol. 4, no. 1, pp. 41–53, 2019.