Quality & Quantity (2020) 54:1655–1669

https://doi.org/10.1007/s11135-019-00927-0

Mining big data in tourism

Carmela Iorio1 · Giuseppe Pandolfo1 · Antonio D’Ambrosio2 · Roberta Siciliano1

Published online: 17 August 2019


© Springer Nature B.V. 2019

Abstract
Knowledge discovery from various sources of information based on different data types
for decision-making and accurate prediction can be rather complex and costly without a
statistical information system. In the Big Data Era, the Statistical Tourism Observatory
needs to be revised. This paper introduces a conceptual model of a Digital Tourism System
(DTS) in which various types of standard and non-standard data can be processed by actors
and spectators in the tourism sector. In particular, big data can be very useful, and the
figure of the Data Scientist within the tourism industry becomes prominent. The DTS
highlights four knowledge areas of interest for different purposes, namely destination
management, research and innovation, market analysis, and the labor market, in order to
improve tourism management and research. Key steps of the knowledge discovery pyramid are
exploited to provide added value in decision-making on the basis of statistical learning
methods. Two examples are shown, mining online textual data and online photo data, respectively.

Keywords Big data · Data mining · Tourism research · Statistical tourism observatory ·
Statistical learning

1 Introduction

Nowadays data can be easy and cheap to obtain and information may be available in real time,
yet knowledge discovery from different sources of information for decision-making and
accurate prediction can be rather complex and costly without a statistical information
system. This paper provides a conceptual model of a Digital Tourism System for monitoring,
with an overview of the different sources of big data as well as of the data mining
processes in the tourism sector.

* Roberta Siciliano
Carmela Iorio · Giuseppe Pandolfo · Antonio D’Ambrosio
1 Department of Industrial Engineering, University of Naples Federico II, Naples, Italy
2 Department of Economics and Statistics, University of Naples Federico II, Naples, Italy


Fundamental issues relate to both the amount of data and its quality, which significantly
affect the results of a knowledge discovery process. Specifically, the volume of data is
tied to the scalability of data mining techniques, while the data are continuously updated
and modified because of their changing nature. Data quality plays a prominent role:
precision, completeness and redundancy are all aspects of it. All of these issues must be
taken into account before and during the analysis of the data, otherwise the outcome of the
knowledge discovery process may be seriously compromised. Interpreting unstructured data
and extracting relevant insights is a major challenge facing the tourism industry.
Big data can be a very useful source of information for the tourism sector. However, modern
statistics and technology need to be combined to empower the tourism economy, also taking
into account how statistical learning has been changing in the big data era, in which the
figure of the Data Scientist within the tourism industry becomes prominent. Through the
proposed Digital Tourism System it is possible to highlight four knowledge areas of interest
for different purposes, namely destination management, research and innovation, market
analysis, and the labor market, in order to improve tourism management and tourism research.
The paper is structured as follows. The evolution of tourism toward Tourism 4.0 is sketched
out in Sect. 2. Section 3 discusses the new professions and some definitions of big data in
tourism. Section 4 introduces the conceptual model for a Digital Tourism System, which can
be considered an alternative to the traditional Observatory for Statistical Information.
Section 5 presents the key steps of the knowledge discovery pyramid that provide added value
in decision-making, emphasizing the contribution of statistical learning methods.
Concluding remarks end the paper.

2 Toward Tourism 4.0

In recent years the tourism sector has been undergoing a profound evolution in response to
the required adoption of Information and Communication Technologies (ICT). The main aims
are to operate competitively in the global market, to meet dimensional growth targets in
order to keep pace with the competitive balance of large groups, to seize the opportunity
for generational change, especially in family businesses, to gain deep knowledge of the
reference markets so as to adapt supply to demand, and to acquire a global vision of the
integrated tourism product that meets environmental sustainability requirements, taking
into account the relationship between resources, environment, capital and market.
All these goals need to be matched with usability in tourism, defined as the ability of
services to meet criteria of ease and simplicity of use, efficiency, compliance with the
user’s needs, and pleasantness and satisfaction in using the product. The key points are
therefore: reachability of the destination, adequate infrastructure, an integrated tourist
offer system (additional services), a wide and diversified accommodation offer, qualitative
and quantitative competitiveness, and comparable tourism products in different places.
Tourism development depends on good transport systems, globalization of tourism supply
processes, an increase in the number of tourist destinations, reachability of tourist
attractions at the local level, new technologies, internet and online booking systems
(do-it-yourself), optimization of choice and decision processes, statistical analysis,
monitoring of tourist flows in the main destinations (incoming), tourist satisfaction
analysis (to design the attributes of the product/service), and market analysis
(penetration, loyalty, etc.).

It is not just a matter of setting up portals dedicated to tourism, or optimizing
decision-making and choice processes, or diverting promotional materials onto the network,
or training qualified personnel in charge of managing qualitative and quantitative
information on the web.
Today, dynamism and speed are the keywords. Reliable acquisition processes and timely
sharing can mark the end of the old ways of observing the tourism phenomenon. Technology
together with statistical expertise offers the possibility of taking advantage of new types
of data: field data or data streams, dynamically updated and collected through sensors
and/or RFID technologies; tracking data that identify user behavior when browsing web
portals; panel data on loyal tourists; traceability of tourist choices and profiling
(one-to-one logic); monitoring of tourist flows and tourist satisfaction; and monitoring of
the ex-ante, in-itinere and ex-post choices of potential and current tourists for market
analysis in its various forms, namely penetration, loyalty and forecasting.
Monitoring, declined from the Latin monitoris (a derivative of monere, meaning to admonish,
to advise, to inform), has to refer primarily to the territory in which the data are
collected and, for the level of analysis and decision, to the actors and spectators
committed to giving useful and meaningful answers. Monitoring at the regional level means
being able to structure useful and significant information for local tourism planning
strategies; monitoring at the level of the Country System means being able to evaluate and
compare local dynamics with respect to international competitors, that is, to analyze and
assess the impact of local choices on the positioning of the Country System in the world.
Tourists may arrive in a territory by chance, but tourism does not happen by chance. Given
that the amount of potentially and already available data requires accurate processing and
dissemination of results in forms that are understandable to non-experts, it must be taken
into account that the speed of information (internet, do-it-yourself, word of mouth, last
minute, etc.) makes tourist demand highly elastic. For these reasons, the contribution of
statistical methods must not be underestimated.
Tourism 4.0 springs from the fourth industrial revolution, the process that will lead to
fully automated and interconnected industrial production. It is well known that the new
digital technologies will have a profound impact along four lines of development. The first
concerns the use of data, computing power and connectivity, and takes the form of big data,
open data, the Internet of Things, machine-to-machine communication and cloud computing for
the centralization and storage of information. The second is analytics: once the data have
been collected, value must be extracted from them. Today only a small percentage of the
collected data is used by companies, which could instead gain advantages from machine
learning, i.e., machines that improve their performance by learning from the data gradually
collected and analyzed. The third line of development is the interaction between man and
machine, which involves increasingly widespread touch interfaces and augmented reality.
Finally, there is the whole sector that deals with the transition from digital to real,
which includes additive manufacturing, 3D printing, robotics, communications,
machine-to-machine interactions and new technologies for storing and using energy in a
targeted way, rationalizing costs and optimizing performance.

3 New professions and big data in Tourism 4.0

What is the mission of Tourism 4.0, and how can the sector innovate? New professional
profiles are needed to empower a business, namely:

• Data Scientist: thanks to multidisciplinary skills, is able to run predictive and
scenario analyses, explore data coming from different sources and structured in complex
ways, and develop new methods of analysis and customized models for specific purposes.
• Big Data Professionals: experts who identify bottlenecks in organization and management,
build systems for data collection and storage that handle different data types and various
sources, and develop multi-scaled systems.
• Data Analyzers: collect and process data, perform data analysis to extract useful
information, and develop reports that disseminate the results of the analysis so as to
provide added value in terms of knowledge for decision-making and prediction.

The goal is to combine various types of skills: Statistics, Information Technology (IT),
data analysis, business management (for private firms) and destination management systems
(for policy-making institutions). Research in Statistics and IT moves very fast, and all
actors must be ready to harvest its fruits. Reducing the gap between theory and practice is
the challenge for innovation (Fig. 1).
According to John Akred (Founder and CTO, Silicon Valley Data Science), Big Data refers to a
combination of an approach to informing decision making with analytical insight derived
from data, and a set of enabling technologies that allow that insight to be economically
derived from at times very large, diverse sources of data. Sean Patrick Murphy (Consulting
Data Scientist) says that, while Big Data is often large in size relative to the available
tool set, “big” actually refers to being important. Thus, Big Data is at the intersection of
collecting, organizing, storing, and turning all of that raw data into truly meaningful
information.
Data can be massive, messy and sparse, and thus require capture and storage methods that
allow them to be kept in memory. That is not all. While Big Data is often large in size
relative to the available tool set, “big” actually refers to being important for knowledge
extraction oriented to assessment and scenario analyses. All data are big, but some are
structured in a complex way: non-standard data require non-standard methods. Some examples
are symbolic and/or imprecise data, web data, preference rankings, time course data,
incomplete data, etc. These are big data learning problems (see, for example, Iorio et al.
2019; D’Ambrosio et al. 2019; Siciliano et al. 2016, 2017; Iorio et al. 2016; Amodio et al.
2016) within the Statistical Learning Paradigm (Hastie et al. 2009; Siciliano and
D’Ambrosio 2012).

Fig. 1  New professions in big data era

The term big data refers to massive data, of both structured and unstructured types, that
are generated, recorded and stored, but also to large and complex data sets that are
difficult to process with traditional software applications and statistical methods within a
reasonable period of time (Snijders et al. 2012; Hassani and Silva 2015).
In fact, there is no unified definition of big data, and researchers have given many
different ones. Laney (2001) defined big data through a “3V’s” model comprising volume,
variety and velocity. In this definition, volume refers to the amount of data generated and
collected (on the order of gigabytes, terabytes, up to zettabytes) and to the variables
observed; velocity stands for the timeliness of the analysis needed to maximize the
commercial value of big data; variety covers the different types of data (e.g. audio, video,
text) in addition to traditional structured data. Indeed, the internet is not the only data
source: other data come from communication systems (such as smart mobile device data and
social media posts), business processes (e.g. flight booking systems) and sensors (such as
satellite images, scanner data and geographic information system data). Volume, velocity
and variety reflect a point of view oriented towards Information Technology (IT) aspects.
On the other hand, to highlight the importance and usefulness of big data, Beyer et al.
(2012) added one more “V” to the model, standing for veracity. It refers to the quality,
reliability and usefulness of big data. As a matter of fact, the exploding quantity of data
and observations does not guarantee quality. For instance, even if volume and velocity can
be kept under control thanks to technical innovations in IT infrastructure, specific
techniques must be considered to deal with precision, missing values, completeness and
redundancy of the data. Chen et al. (2014) added a
“V” to identify the “value” of big data for statisticians as well as for business and
policy makers in an era of data-driven decisions. More recently, De Mauro et al. (2015) gave
a consensual definition of big data that “represents the information assets characterized
by such a high volume, velocity and variety to require specific technology and analytical
methods for its transformation into value”. It is not surprising that big data are
identified as the leading edge for innovation, competition and productivity (Manyika et al.
2011). As a matter of fact, big data are employed by researchers, managers and policy makers
to make decisions as well as to discover new opportunities for their business (Irudeen and
Samaraweera 2013). In this era, the tourism industry has also changed, as big data provide
a new way to understand tourist behavior, tourist satisfaction and other tourism issues
(Siciliano et al. 2011; Yang et al. 2015; Li et al. 2017). Big data have modified
traditional tourism research, which was based on traditional data. Travelers produce large
amounts of data on every trip (before, during and after it), especially in the form of
social networking feeds (Perdana 2014). These amounts of unstructured and repositioned data
are stored in the cloud and need data analytics to extract information, and thus knowledge.
For instance, big data analysts can capture information on tourists’ interests by looking
at photos posted on a social network. This huge amount of unforced data covers every aspect
of human life; it does not suffer from the selection biases that can be present in
traditional surveys, it increases the sample size, and it avoids the information loss of
sample data (Meeker and Hong 2014). The reliability also
ensures more confident decision making (Fuchs et al. 2014). Through the use of new
information flows, companies are able to track and analyze shopping patterns,
recommendations, purchasing behavior and all the other drivers influencing the tourism
sector. In this way, the tourism industry becomes more efficient since it is always
connected with potential tourists
at every stage of a trip. By using big data, both tourists and tourism product providers can
take advantage of better, targeted and profitable services. Nowadays, the use of the web and
social media has effectively changed the way of traveling. Compared to traditional data,
big data are much more informative and structurally complex, exhibit different data
characteristics, address various research issues, and require different analytic techniques
(Li et al. 2018). Big data provide two kinds of information: structured and unstructured
data. The main characteristics of tourism-related big data are volume and variety. The
first does not concern only the traditional channels of distribution; it also refers to the
amount of data created in real time. Tourism big data are always varied since they
originate from all accessible technologies.
Figure 2 describes Big Data in Tourism, considering data sources and their category
with some examples of big data types.
Due to the development of the Internet of Things, various sensor devices have been developed
and employed to track the movements of tourists and to record meteorological data (to serve
travel decision making). Thus, tourism-related big data also include spatio-temporal data,
such as Global Positioning System (GPS) data, mobile roaming data, Bluetooth data, WiFi data
and weather data. Another important type of tourism-related big data originates from
transactions and activities such as web searching, web-page visiting, online booking and
purchasing. These data are known as transaction data and are employed to support tourism
prediction, tourism marketing and the analysis of tourist behavior. Usually, three main
sources of tourism-related big data are considered: device data, transaction data generated
by operations, and user-generated content (UGC) (Shoval and Ahas 2016; Xiang et al. 2017).
Among the variety of big data applied to tourism research, UGC data are the major category,
helpful for tourist sentiment analysis, tourism marketing and tourist recommendation.

Fig. 2  Big data sources and their category along with their types


On the other hand, device data and transaction data contribute less to tourism research,
both because of the limitations due to privacy concerns and because of the high cost of
buying sensor devices and recruiting volunteers. UGC data are online textual data and online
photo data (e.g. reviews and blogs released on social media) through which tourists, sharing
travel reviews and experiences, express their satisfaction and dissatisfaction. Various data
mining techniques have been applied to measure tourist satisfaction through UGC data.
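
As a minimal illustration of how tourist satisfaction might be measured from UGC text, the sketch below scores short reviews with a lexicon-based sentiment rule. It is not the method used in the studies cited here: the word lists, the function names (`tokenize`, `sentiment_score`, `satisfaction_label`) and the sample reviews are all hypothetical, and a real analysis would rely on a validated lexicon or a trained model.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only. A real study would rely on
# a validated sentiment lexicon or a trained classifier.
POSITIVE = {"clean", "friendly", "great", "comfortable", "helpful"}
NEGATIVE = {"dirty", "noisy", "rude", "expensive", "crowded"}


def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text):
    """Return (positive - negative) / (positive + negative), in [-1, 1].

    Reviews that contain no lexicon word get a score of 0.0.
    """
    counts = Counter(tokenize(text))
    pos = sum(counts[word] for word in POSITIVE)
    neg = sum(counts[word] for word in NEGATIVE)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0


def satisfaction_label(score):
    """Map a score to a coarse satisfaction label."""
    if score > 0:
        return "satisfied"
    if score < 0:
        return "dissatisfied"
    return "neutral"


if __name__ == "__main__":
    # Made-up reviews standing in for crawled UGC data.
    reviews = [
        "The room was clean and the staff friendly, but the street was noisy.",
        "Dirty room, rude staff.",
        "We arrived on Monday.",
    ]
    for review in reviews:
        score = sentiment_score(review)
        print(f"{score:+.2f}  {satisfaction_label(score)}  {review}")
```

The score is normalized by the number of matched words so that long and short reviews are comparable; this is one possible design choice, and simple counts or weighted lexicons would be equally plausible starting points.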

4 Digital tourism system: the conceptual model

In this Era of Smart Technologies, or Big Data Era, research and scientific added value make
it possible to increase revenues, reduce operational costs, improve customer experience and
strengthen decision making. Business Intelligence rests on three key factors: Statistics =
“Stato” = observe the facts; Technology = “techne” and “logìa” = how to learn; Analysis of
Data = “análysis” and “dato” = outcome. Within this framework, we introduce a conceptual
model to design mining processes involving the actors and spectators of today’s tourism
sector. Understanding a real problem, translating it into a statistical one, deciding how to
collect the data and choosing the statistical method to process them are not obvious tasks.
Moreover, no statistical information system can be built unless it is clear to whom it is
directed and what the relationships are among the players and stakeholders of the economic
system. The actors who play an important role in the knowledge discovery process are the
analysts, namely the beneficiaries of the quantitative analysis (e.g. the policy maker, the
stakeholder, etc.), and the data scientists, namely researchers in statistical methodology
and computer science with experience in specific or general application contexts. The
figure of the data scientist is increasingly popular in the private and public sectors. As a
matter of fact, a real data scientist can be a statistical advisor in possession of three
main skills: (1) data management (i.e., collection, access, pre-processing), (2) information
management (i.e., data selection and application of statistical methods) and (3) knowledge
discovery (i.e., interpretation and analysis, storytelling, exploitation). This mix of
skills can be formed only by applying statistical methodology in various contexts. The
added value of Information and Communication Technology (ICT) in combination with the
statistical learning process allows us to define a new conceptual model for collecting data
and accessing statistical information that provides knowledge for decision-making and
prediction.
To provide a statistical methodological paradigm for tourism monitoring in a territory,
looking both inside and outside this industry, we introduce the conceptual model called
Digital Tourism System (DTS), depicted in Fig. 3. Within the framework of the Big Data Era,
this model can be understood as a revised and extended version of the Digital Accessible
Statistical Information System for Monitoring of Tourism (DASIS-MT) (Siciliano and
D’Ambrosio 2012). In the era of globalization, this model is a necessary tool for all
players and stakeholders (inside and outside the production chain) to keep up with
international competition and to increase the value of a territory as an Integrated
Touristic Destination System. Figure 3 offers a view of the data mining processes involving
all players belonging to any hypothetical Tourist Integrated System within a given
territory. The kernel of the model is a web-based platform called DTS, empowered by
statistics and technologies, that supports all players, whether they play the role of
actors or of spectators. Ideally, this replaces an old-fashioned Observatory of Tourism for
any territory of interest at any scale, i.e., a city, a region, a country, a community of
countries and so on. Actors are distinguished from spectators by the role they play in the
Tourist Integrated System. They are all related to each other, and the links are marked by
crosses in Fig. 3. There cannot be a tourism economy or tourist destination products without
taking care of tourists’ needs. Thus, in order to have a strong impact on the economic and
social development of a territory, the analysis of changes in tourism becomes crucial in
policy making. However, the performance of companies in the tourism sector is not captured
by official statistics, although such data are necessary for both marketing planning
strategies and revenue management.

Fig. 3  Digital tourism for monitoring

As can be seen in Fig. 3, the production chain of the tourism economic system is formed by
the actors of tourism supply:

• Policy makers (Region, Province, City Hall, Local Institutions) are responsible for local
government and for the definition of tourist districts (or local touristic systems) with
their touristic destination products.
• Among the producers, it is possible to distinguish between main operators and additional
services operators. The former provide for fundamental needs: accommodation (hotels, bed
and breakfasts, residences, etc.), travel (airlines, shipping companies, etc.) and public
transport, and attractions (museums, archeological sites, etc.). The latter provide
differentiating services (event organizers, food and beverage, etc.) that give
distinctiveness to the territory as a competitive tourist destination.
• Market operators or distributors identify touristic packages to satisfy the local demand,
the potential demand and, in general, the market demand, at different levels of interest
(there can be tour organizers, travel agencies, tour operators). All of these can also play
the role of Web Operators.

Outside the production chain there are various suppliers of other types of data and
information, playing the role of spectators:

• Education includes Professional Schools, Universities and Research Centers, which empower
human resources skills in the touristic industry.
• Official statistics includes all public institutions producing official statistics at
multiple scales, i.e., the World Tourism Organization, the European Travel Commission,
EUROSTAT, the Ministry of Cultural Heritage, the National Statistical Institute, and Local
Institutions such as Tourism Offices or Bureaus, at different levels of interest, namely
district, city hall, province, region.
• Digital library includes all public catalogues of photos, videos and media, freely
available, again at multiple scales.
• Data science includes all methodologies and technologies to automate decision-making
processes and forecasting.

The DTS in a given territory aims at providing knowledge for decision-makers in both the
private and public sectors. For its application, data and information should be accessible
and available in digital form, for the purpose of timely data sharing and knowledge
dissemination. The statistical information and knowledge resulting from the system are
relevant to the needs of specific groups of users. Figure 3 shows four elliptical shapes
corresponding to four knowledge areas of interest where decision-making processes occur,
involving various actors:

• Knowledge of labor market: the largest area, involving all actors, matches the demand and
supply of the labor market, exploiting all opportunities to empower the skills of human
resources, which are fundamental in the tourism sector.
• Knowledge of destination management: takes care of the co-definition of touristic
destination products with the cooperation of policy makers, producers and distributors.
• Knowledge of research and innovation: through advances in methods and technologies,
supports the touristic chain of production and distribution and involves the institutions
of official statistics, academics, educational agencies and the private sector in the ICT
area.
• Knowledge of market analysis: takes care of touristic packages, web-marketing, online
processes and do-it-yourself procedures.

5 Knowledge discovery processes in tourism

The role played by tourism statistics and tourism-related economic information is
invaluable. To highlight the importance of tourism, it is necessary to develop policies at
both national and local levels that encourage business decision-making in the tourism
industry and establish linkages with the rest of the economy. Government policy requires
official data collection, integrated systems of tourism statistics and socio-economic
indicators to obtain a global picture of the tourism sector. Cooperation among the actors
involved, to share information and experiences along with statistical methodology, is
crucial for developing harmonized tourism statistics. These, in combination with the
support of new technologies for data analysis, draw the path from data to information and
from information to knowledge, which can be structured through the process of Knowledge
Discovery in Databases (Fayyad et al. 1996). It consists of a large-scale search for
patterns that exist in databases but are hidden among the great amount of data. The
discovery of patterns provides valuable knowledge for any decision support. Knowledge
extraction is very useful when a data analysis problem involves many potential variables
and multidimensional relations. Knowledge Discovery in Databases (KDD) is a data analysis
process that works on various types of information when deriving knowledge from data.
To describe the quality levels of the information that plays a role in the KDD process,
Fig. 4 shows the knowledge discovery pyramid. It highlights the key steps that provide added
value in decision-making, starting from the definition of the real problem.
Mining different patterns is the central task within the KDD process. Data mining methods
are associated with the lower levels, whereas knowledge discovery also refers to the top
level, in order to generate information that qualifies as knowledge. The main steps from
data to information and from information to knowledge are described by the key stages that
provide added value in decision-making, starting from the definition of the real problem.
These stages of Knowledge Discovery can be viewed as a set of functional actions whose main
aim is to move up the pyramid of knowledge. At the start of the KDD process, the
beneficiaries of the statistical analysis must provide a careful definition of the problem,
followed by a brainstorming session involving the actors to transform the real problem into
a statistical one. As a result, sharing data allows the statisticians to capitalize on the
information to be processed. The effort of collecting data directly (via field analysis) and
indirectly (via official statistics or other statistical procedures) belongs to the stage of
data accessibility. This, together with filtering (the selection, within the data sets, of
the features and statistical units to be processed), is a preliminary step to the selection
of the statistical method in the research phase. The data are then processed in accordance
with the statistical methodology that addresses the specific research question. The results
of the data processing step need to be analyzed and interpreted in cooperation with the
beneficiary of the statistics.
Obviously, the results exploitation stage consists of the dissemination of statistical results

Fig. 4  Knowledge discovery


pyramid for decision-makin

13
Mining big data in tourism 1665

that are oriented to provide solutions to the question marks. The final result of the overall
discovery process is knowledge, which can be fruitfully used to provide added value in
terms of utility, revenues, etc. Any KDD process distinguishes two main steps: from data to
information and from information to knowledge. The first step is EDA (Exploratory Data
Analysis), in the “context of discovery”, which also involves the fundamental passages of
data management and organization (extraction of data from one or more databases, data
cleaning, treatment of missing data, pre-processing). Exploratory analysis involves both
Data Analysis and Data Mining tools, especially in the presence of large amounts of data.
The second step is CDA (Confirmatory Data Analysis), in the “context of justification”,
where both the a priori information and the statistical information deriving from the first
step are fully exploited in order to build statistical models for decisions and forecasts,
offering possible generalizations. Thus, the KDD process draws on a variety of analysis
methods such as pattern recognition, data mining, natural language processing, sentiment
analysis, and statistical and visual analysis. Statistical analysis is concerned with
summarizing massive data sets, understanding data and defining models for prediction. Data
mining is concerned with discovering useful models in massive data sets, whereas machine
learning combines data mining and statistical methods to enable machines to understand data
sets. Nowadays, the field of visual analysis is also developing tools that present large data
sets to final users so that they can understand relationships among data.
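The two steps described above, from data to information (EDA) and from information to knowledge (CDA), can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the records are invented, the variable names are hypothetical, and a simple least squares line stands in for whatever statistical model a real study would select.

```python
from statistics import mean

# Hypothetical records: (web searches in thousands, tourist arrivals in thousands).
# None marks a missing value; all figures are invented for illustration.
raw = [(12.0, 30.5), (15.0, 36.0), (None, 33.0), (20.0, 47.5),
       (22.0, None), (25.0, 57.0), (30.0, 68.5), (35.0, 79.0)]


def eda(records):
    """From data to information: drop incomplete rows and summarize the rest."""
    clean = [(x, y) for x, y in records if x is not None and y is not None]
    summary = {
        "n_raw": len(records),
        "n_clean": len(clean),
        "mean_x": mean(x for x, _ in clean),
        "mean_y": mean(y for _, y in clean),
    }
    return clean, summary


def fit_line(pairs):
    """From information to knowledge: least squares fit of y = a + b * x."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    a = my - b * mx
    return a, b


clean, summary = eda(raw)
# Keep the last complete record aside to check the forecast.
train, holdout = clean[:-1], clean[-1]
a, b = fit_line(train)
predicted = a + b * holdout[0]
print(summary)
print(f"forecast {predicted:.1f} vs observed {holdout[1]:.1f}")
```

The exploratory function only cleans and describes the data, while the confirmatory function commits to a model whose forecast can be checked against a record that was held out.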
Figures 5 and 6 show two examples of the KDD process for mining online textual data and
photo data, respectively.
To analyze the information contained in online textual data, three steps are performed.
The first is the implementation of a web crawling technology to automatically collect the
data from the relevant social media sites, such as Tripadvisor, Expedia and Booking (for
review data) and Twitter (for blog data). The second step is the pre-processing of the data,
required to retain only relevant tourism information. It comprises data cleaning,
tokenization, word stemming and part-of-speech tagging. The third step is pattern discovery,
which aims at exploring the information in the relevant online textual tourism data.
Typical data mining techniques used in tourism studies are latent Dirichlet allocation
(Guo et al. 2017), sentiment analysis (Philander and Zhong 2016), correspondence analysis
(Költringer and Dickinger 2015), cluster analysis (Bordogna et al. 2016; Zhang et al.
2017), linear regression models (Xie et al. 2014), network analysis (D’Agata et al. 2013),
the Bayesian ordered logit model (Zhang et al. 2016) and the Tobit regression model (Fang
et al. 2016). Moreover, useful software packages have also been developed for text
processing (Hall et al. 2009; Aria and Cuccurullo 2017). These steps are helpful in
analyzing tourist satisfaction with tourism products and destinations.

Other types of UGC data are photos posted and shared on social media such as Flickr and
Instagram. Because they carry useful information in their metadata, online photos are
employed for tourism marketing purposes. In exploring the information hidden in online
photos by data mining, the process starts with the pre-processing of the raw data collected
(in terms of data cleaning, data formation and text mining). This step aims to explore the
interests and motivations of tourists in taking photos. Cluster analysis is then performed
on the extracted metadata. k-means clustering and Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) have different applications in the geospatial context.
k-means is found to be the most popular algorithm in tourism research (Salas-Olmedo et al.
2018; Kurashima et al. 2013; Chen et al. 2009). Since DBSCAN is better suited for finding
geospatial aggregations in the presence of noise points, it has been extensively applied to
tourism photo clustering by Memon et al. (2015), Lee et al. (2014) and Kisilevich et al.
(2010). Furthermore, Markov chains are also employed for travel pattern mining (Vu et al.
2015).

After a classification step, trajectory discovery is performed to present appropriate
recommendations for traveling. Through these three steps, a promotional strategy for
tourism marketing can be formulated.

Fig. 5  KDD for mining online textual data in tourism
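The pre-processing and pattern discovery steps described for online textual data can be illustrated with a small sketch. It is not the pipeline of any of the cited studies: the reviews are invented, the crawling step is skipped, the naive suffix stripper only stands in for a real stemmer, part-of-speech tagging is omitted, and a term count with a tiny hand-made lexicon stands in for techniques such as latent Dirichlet allocation or a trained sentiment model.

```python
import re
from collections import Counter

# Hypothetical reviews standing in for crawled text; no real platform data is used.
REVIEWS = [
    "Great location!! The rooms were clean and the staff was friendly. <br>",
    "Terrible breakfast, noisy rooms... would not recommend.",
    "Friendly staff, clean room and a great view of the bay.",
]

# Illustrative word lists (assumptions); the sentiment lexicon is already stemmed.
STOPWORDS = {"the", "and", "a", "of", "was", "were", "would"}
POSITIVE = {"great", "clean", "friend"}
NEGATIVE = {"terrible", "noisy"}


def stem(token):
    """Naive suffix stripping, a stand-in for a proper stemming algorithm."""
    for suffix in ("ing", "ly", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Data cleaning, tokenization, stop-word removal and stemming."""
    text = re.sub(r"<[^>]+>", " ", text).lower()   # strip markup left by the crawler
    tokens = re.findall(r"[a-z]+", text)           # keep alphabetic tokens only
    return [stem(t) for t in tokens if t not in STOPWORDS]


def sentiment(tokens):
    """Lexicon score: +1 for each positive stem, -1 for each negative stem."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


docs = [preprocess(review) for review in REVIEWS]
term_counts = Counter(t for doc in docs for t in doc)   # simple pattern discovery
scores = [sentiment(doc) for doc in docs]

print(term_counts.most_common(3))
print(scores)
```

In a real application each stand-in would be replaced by a dedicated tool, but the order of operations (clean, tokenize, normalize, then mine) stays the same.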
Finally, the data mining process in tourism makes it possible to transform the information
extracted from tourism-related big data into useful knowledge for understanding tourist
behavior, tourist satisfaction and other tourism issues, which helps to improve tourism
management and tourism research.
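As one concrete instance of turning photo metadata into knowledge about tourist behavior, the density-based clustering step mentioned for geotagged photos can be sketched as follows. This is a plain textbook DBSCAN with great-circle distances, not the specific variants used in the cited studies; the coordinates are invented, and the radius of 100 m and the minimum of 3 photos per cluster are assumptions chosen only for the example.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical photo locations as (latitude, longitude) in degrees.
PHOTOS = [
    (40.8518, 14.2681), (40.8520, 14.2683), (40.8517, 14.2679), (40.8521, 14.2680),
    (40.8279, 14.2475), (40.8281, 14.2477), (40.8280, 14.2473),
    (40.9000, 14.3500),  # an isolated photo far from the others
]


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))


def dbscan(points, eps_m, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)

    def neighbours(i):
        # Brute-force search; the point itself counts as its own neighbour.
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


labels = dbscan(PHOTOS, eps_m=100, min_pts=3)
print(labels)
```

Unlike k-means, this procedure does not need the number of clusters in advance and leaves the isolated photo unassigned, which is the property that makes it attractive for noisy geotagged data.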

Fig. 6  KDD for mining online photo data in tourism

6 Concluding remarks

Big data are certainly a new and challenging data source for statistics. The challenge is not
only to collect and manage a vast volume and different types of data, but also to extract
meaningful value from them. By using big data in tourism research, important issues such as
tourist sentiment, tourist satisfaction and other tourism issues are addressed. Big data are
characterized by three main components: variety, velocity and volume. In recent years,
veracity and value have also been used to describe big data. They are generated from
online transactions, emails, videos, audio, images, blogs, search queries, social networking
sites, sensors and mobile phones and their applications. Big data sources, along with their
categories and types, have been discussed in this paper. The figure of the Data Scientist,
who has become prominent also within the tourism industry, has been emphasized. Big data
contain unstructured information, and their analysis requires a revolutionary step forward
from traditional data analysis. Interpreting unstructured data and acquiring relevant insights
is a big challenge facing the tourism industry, one that can be better explored and understood
by both academia and the tourism sector. In this work, a conceptual model of a Digital Tourism
System (DTS) for monitoring has been introduced. The main aim is to emphasize the main data
mining processes involving all players belonging to any hypothetical Tourist Integrated System
within a given territory. The traditional Statistical Observatory of Tourism can thus be
replaced by a Digital Tourism System mining all types of big data. DTS emphasizes four
knowledge areas of interest for different purposes, specifically destination management,
research and innovation, market analysis and labor market, in order to improve tourism
management and tourism research. Cooperation between the involved actors to share information
and experiences, along with methodological statistical techniques, is crucial for developing
harmonized tourism statistics. These techniques, in combination with the support of new
technologies for data analysis, draw the path from data to information and from information to
knowledge that can be structured through the Knowledge Discovery Pyramid. Any decision-making
and scenario analysis can be understood in terms of such a Knowledge Discovery Pyramid. In
this sense, two examples have been shown.

References
Amodio, S., D’Ambrosio, A., Siciliano, R.: Accurate algorithms for identifying the median ranking when
dealing with weak and partial rankings under the Kemeny axiomatic approach. Eur. J. Oper. Res.
249(2), 667–676 (2016)

Aria, M., Cuccurullo, C.: Bibliometrix: an R-tool for comprehensive science mapping analysis. J. Informetr.
11(4), 959–975 (2017)
Beyer, M.A., Laney, D.: The Importance of ‘Big Data’: A Definition, pp. 2014–2018. Gartner, Stamford
(2012)
Bordogna, G., Frigerio, L., Cuzzocrea, A., Psaila, G.: Clustering geo-tagged tweets for advanced big
data analytics. In: 2016 IEEE International Congress on Big Data (BigData Congress), pp. 42–51
(2016)
Chen, W.C., Battestini, A., Gelfand, N., Setlur, V.: Visual summaries of popular landmarks from
community photo collections. In: 2009 Conference Record of the Forty-Third Asilomar Conference on
Signals, Systems and Computers, pp. 1248–1255 (2009)
Chen, M., Mao, S., Zhang, Y., Leung, V.C.: Big data: related technologies, challenges and future
prospects. Springer, Berlin (2014)
D’Agata, R., Gozzo, S., Tomaselli, V.: Network analysis approach to map tourism mobility. Qual. Quant.
47(6), 3167–3184 (2013)
D’Ambrosio, A., Iorio, C., Staiano, M., Siciliano, R.: Median constrained bucket order rank aggregation.
Comput. Stat. 34(2), 787–802 (2019)
De Mauro, A., Greco, M., Grimaldi, M.: What is big data? A consensual definition and a review of key
research topics. In: AIP Conference Proceedings, vol. 1644, pp. 97–104. AIP (2015)
Fang, B., Ye, Q., Kucukusta, D., Law, R.: Analysis of the perceived value of online tourism reviews:
influence of readability and reviewer characteristics. Tour. Manag. 52, 498–506 (2016)
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI
Mag. 17(3), 37 (1996)
Fuchs, M., Höpken, W., Lexhagen, M.: Big data analytics for knowledge generation in tourism
destinations – a case from Sweden. J. Destin. Mark. Manag. 3(4), 198–209 (2014)
Guo, Y., Barnes, S.J., Jia, Q.: Mining meaning from online ratings and reviews: tourist satisfaction
analysis using latent Dirichlet allocation. Tour. Manag. 59, 467–483 (2017)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining
software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Hassani, H., Silva, E.S.: Forecasting with big data: a review. Ann. Data Sci. 2(1), 5–19 (2015)
Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer Series in
Statistics. Springer, New York (2009)
Iorio, C., Frasso, G., D’Ambrosio, A., Siciliano, R.: Parsimonious time series clustering using P-splines.
Expert Syst. Appl. 52, 26–38 (2016)
Iorio, C., Aria, M., D’Ambrosio, A., Siciliano, R.: Informative trees by visual pruning. Expert Syst.
Appl. 127, 228–240 (2019)
Irudeen, R., Samaraweera, S.: Big data solution for Sri Lankan development: a case study from travel and
tourism. In: 2013 International Conference on Advances in ICT for Emerging Regions (ICTer), pp.
207–216 (2013)
Kisilevich, S., Mansmann, F., Keim, D.: P-DBSCAN: a density based clustering algorithm for exploration
and analysis of attractive areas using collections of geo-tagged photos. In: Proceedings of the 1st
International Conference and Exhibition on Computing for Geospatial Research and Application,
ACM (2010)
Költringer, C., Dickinger, A.: Analyzing destination branding and image from online sources: a web
content mining approach. J. Bus. Res. 68(9), 1836–1843 (2015)
Kurashima, T., Iwata, T., Irie, G., Fujimura, K.: Travel route recommendation using geotagged photos.
Knowl. Inf. Syst. 37(1), 37–60 (2013)
Laney, D.: 3D data management: controlling data volume, velocity and variety. META Group Res. Note
6(70), 1 (2001)
Lee, I., Cai, G., Lee, K.: Exploration of geo-tagged photos through data mining approaches. Expert Syst.
Appl. 41(2), 397–405 (2014)
Li, X., Pan, B., Law, R., Huang, X.: Forecasting tourism demand with composite search index. Tour.
Manag. 59, 57–66 (2017)
Li, J., Xu, L., Tang, L., Wang, S., Li, L.: Big data in tourism research: a literature review. Tour. Manag.
68, 301–323 (2018)
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H.: Big data: the next
frontier for innovation, competition, and productivity (2011).
https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Big%20data%20The%20next%20frontier%20for%20innovation/MGI_big_data_exec_summary.ashx
Meeker, W.Q., Hong, Y.: Reliability meets big data: opportunities and challenges. Qual. Eng. 26(1),
102–116 (2014)

Memon, I., Chen, L., Majid, A., Lv, M., Hussain, I., Chen, G.: Travel recommendation using geo-tagged
photos in social media for tourist. Wirel. Pers. Commun. 80(4), 1347–1362 (2015).
https://doi.org/10.1007/s11277-014-2082-7
Perdana, D.H.F.: Trip guidance: a linked data based mobile tourists guide. Adv. Sci. Lett. 20(1), 75–79
(2014)
Philander, K., Zhong, Y.: Twitter sentiment analysis: capturing sentiment from integrated resort
tweets. Int. J. Hosp. Manag. 55, 16–24 (2016)
Salas-Olmedo, M.H., Moya-Gómez, B., García-Palomares, J.C., Gutiérrez, J.: Tourists’ digital footprint in
cities: comparing big data sources. Tour. Manag. 66, 13–25 (2018)
Shoval, N., Ahas, R.: The use of tracking technologies in tourism research: the first decade. Tour. Geogr.
18(5), 587–606 (2016)
Siciliano, R., D’Ambrosio, A.: Statistical monitoring of tourism in the knowledge era. In: Morvillo, A.
(ed.) Advances in Tourism Studies, pp. 231–258. McGraw-Hill, New York (2012)
Siciliano, R., Aria, M., D’Ambrosio, A., Tutore, V.A.: Indagine statistica sulle aspettative e priorità per
soddisfare il turista a Napoli. In: Becheri, E., Maggiore, G. (eds.) Rapporto Sul Turismo Italiano, vol.
XVII, pp. 449–470. Franco Angeli, Milano (2011)
Siciliano, R., D’Ambrosio, A., Aria, M., Amodio, S.: Analysis of web visit histories part I: distance-based
visualization of sequence rules. J. Classif. 33(2), 298–324 (2016)
Siciliano, R., D’Ambrosio, A., Aria, M., Amodio, S.: Analysis of web visit histories part II: predicting
navigation by nested stump regression trees. J. Classif. 34, 473–493 (2017)
Snijders, C., Matzat, U., Reips, U.D.: Big data: big gaps of knowledge in the field of internet science. Int. J.
Internet Sci. 7(1), 1–5 (2012)
Vu, H.Q., Li, G., Law, R., Ye, B.H.: Exploring the travel behaviors of inbound tourists to Hong Kong using
geotagged photos. Tour. Manag. 46, 222–232 (2015)
Xiang, Z., Du, Q., Ma, Y., Fan, W.: A comparative analysis of major online review platforms: implications
for social media analytics in hospitality and tourism. Tour. Manag. 58, 51–65 (2017)
Xie, K.L., Zhang, Z., Zhang, Z.: The business value of online consumer reviews and management response
to hotel performance. Int. J. Hosp. Manag. 43, 1–12 (2014)
Yang, X., Pan, B., Evans, J.A., Lv, B.: Forecasting Chinese tourist volume with search engine data. Tour.
Manag. 46, 386–397 (2015)
Zhang, Z., Zhang, Z., Yang, Y.: The power of expert identity: how website-recognized expert reviews
influence travelers’ online rating behavior. Tour. Manag. 55, 15–24 (2016)
Zhang, L., Lan, C., Qi, F., Wu, P.: Development pattern classification and evaluation of the tourism
academic community in China in the last ten years: from the perspective of big data of articles of tourism
academic journals. Tour. Manag. 58, 235–244 (2017)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
