Data analysis using GIS and data mining.

F.-Y. Leu, T.-H. Wang

To cite this version:


F.-Y. Leu, T.-H. Wang. Data analysis using GIS and data mining. In International Conference of
Territorial Intelligence, Sep 2006, Alba Iulia, Romania. p. 231-237. ⟨halshs-00516476⟩

HAL Id: halshs-00516476


https://halshs.archives-ouvertes.fr/halshs-00516476
Submitted on 10 Jun 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Data Analysis Using GIS and Data Mining

Fang-Yie Leu and Tai-Shiang Wang


Dept. of Computer Science and Information Engineering, Tunghai University, Taiwan
{leufy, g932810}@thu.edu.tw

Abstract
Recently, many commercial Geographical Information Systems (GISs) have been developed, and their functions are
growing quickly. Researchers and policymakers can input environmental data into a GIS to obtain spatial analysis
results that show how the data are geographically distributed. In addition, data mining and data warehouse
technologies can automatically mine hidden knowledge and analyze/extract knowledge from raw data, respectively. If we
use them together with GIS, the hidden meanings or rules embedded in the environmental data can be uncovered more
deeply and precisely. In this paper, we discuss how to use the two data analysis tools, GIS and data mining, to
analyze the data collected for the Situn district so that researchers can discover facts that cannot be obtained
superficially from raw data.

Keywords: GIS, data mining, data analysis.


1. Introduction

Nowadays, a huge amount of geographic information is being produced and collected, especially from satellite remote sensing and map digitization. Part of it has been transformed from traditional formats into digital form so that it can be stored in a computer system. Geographical Information Systems (GISs) are now widely used, particularly for designing and displaying a city’s road networks, underground pipes, power lines, and so on. Users can search for roads or landmarks on an electronic map, or on the Internet if the map provides a web version, to find the locations they are interested in.

In addition, expert systems and machine learning are well-known intelligent techniques/models. Most researchers and decision makers rely on computers to analyze their data in depth, and the data are usually stored in computer databases or files. However, databases and files are passive facilities. We can only query or manipulate them. They never actively tell us the knowledge deeply embedded or hidden in them.

In the social or geographic domain, few applications deploy GIS and data mining at the same time. In this paper, we use both to analyze social and geographic phenomena, and then explain the phenomena according to the mining results.

The rest of this article is organized as follows. Section 2 reviews the application domains that have been developed. Section 3 introduces the mining techniques. Section 4 describes GIS systems. A case study and examples are presented in Section 5. Section 6 concludes this article.

2. Related work

To date, many application domains have employed data mining or GIS techniques, but not both, to promote their business.

In the health care domain, Mitchell [1] described several prototypical uses of data mining, including an expert system able to predict women at high risk of requiring an emergency C-section. Merck-Medco Managed Care, a pharmaceutical insurance and prescription mail-order unit of Merck, used data mining to help uncover less expensive but equally effective drug treatments for certain types of diseases or patients [2].

In the finance domain, Bank of America deployed data mining to detect which customers were using which Bank of America products so that it could offer the right mix of products and services to better meet customer needs [2].

In the sports domain, Brian James, assistant coach of the Toronto Raptors professional basketball team, used Advanced Scout, a data mining/warehousing tool developed by IBM especially for the NBA, to create favorable player matchups and help call the best plays [3].

In addition, many commercial GIS products have been released, such as ArcGIS [4], TomTom Navigator [5], Google Maps [6], and Yahoo Maps [7]. Some of the products are for single-client use, and others provide web-based services. For analysis purposes, ArcGIS is much more mature than the others since it can perform almost every
type of geographical analysis. For mobile or navigation purposes, Garmin and TomTom have released many products in this domain.

3. The “Mining” Techniques

Data mining is the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data collected in a large database. Its purpose is to identify trends and patterns in data so that users can extract hidden predictive information from the database. It is a powerful technology with great potential to help researchers focus on the most important information in their raw data.

Machine learning is a complex process. Computers are sometimes good at learning concepts. A concept is a set of objects, symbols, or events grouped together because they share certain characteristics. Concepts can be well designed and structured for future retrieval and management. Common concept structures include trees, rules, networks, and mathematical equations.

3.1. Types of Learning

Many types of data mining techniques adopt induction-based learning [8], which is the process of forming concepts and definitions by observing concept examples and concept objects to be learned, as the core algorithm for mining knowledge. Learning can be classified into two types: supervised and unsupervised.

Supervised learning is a learning model that takes instances of concepts representing animals, plants, and the like, or labels given to individual instances, and then chooses what we believe to be the defining concept features. We can use supervised learning to build classification or prediction models from sets of data containing examples and non-examples of the concepts to be learned. The model (e.g., a decision tree) is then used to determine the classification or predict the outcomes of newly presented instances of unknown origin.

Unsupervised learning is a learning model that builds models from data without predefined classes. Data instances are grouped together based on specific features defined by the clustering system. Users have to interpret the meaning of the formed clusters, with the help of evaluation techniques, to determine whether the grouping meets their requirements.

3.2. Data Mining and Data Query

Databases collect and store passive data in predefined-format storage or data structures, from which users can retrieve and aggregate the data. Data mining can mine the hidden rules or knowledge embedded in the raw data. Before deploying data mining as a problem-solving technique, we need to consider three questions.

(1). How can the problem be clearly defined? That is, what do we want to mine? The answer gives us a mining direction.

(2). Does potentially meaningful hidden data truly exist? If not, the mining process is in vain.

(3). Is the mining cost less than the profit gained from the mining process? If not, we will lose more than we gain during/after the process.
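As a rough illustration of the difference between querying data and mining it, the following minimal Python sketch contrasts a count query, where the user already knows what to ask, with a simple supervised rule learner that searches for the rule itself. The learner is a 1R-style single-attribute learner, chosen here only for brevity; it is not the tool used in this paper. The records, attribute names and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labeled examples (not the survey data): one row per client.
RECORDS = [
    {"payment": "free", "lives_alone": "yes", "satisfied": "yes"},
    {"payment": "free", "lives_alone": "no", "satisfied": "yes"},
    {"payment": "free", "lives_alone": "yes", "satisfied": "yes"},
    {"payment": "partial", "lives_alone": "yes", "satisfied": "no"},
    {"payment": "partial", "lives_alone": "no", "satisfied": "no"},
    {"payment": "partial", "lives_alone": "no", "satisfied": "yes"},
]


def query_count(records, **conditions):
    """Data query: count rows matching conditions the user already chose."""
    return sum(all(r[k] == v for k, v in conditions.items()) for r in records)


def mine_one_rule(records, target):
    """Supervised learning (1R-style): pick the single attribute whose
    'if attribute == value then target == majority label' rules make the
    fewest errors on the labeled examples.

    Returns (attribute, {value: predicted label}, number of errors).
    """
    best = None
    attributes = [k for k in records[0] if k != target]
    for attr in attributes:
        # Count the target labels seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[attr]][r[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


if __name__ == "__main__":
    print(query_count(RECORDS, payment="free", satisfied="yes"))
    print(mine_one_rule(RECORDS, "satisfied"))
```

The query only answers the question it is given, whereas the learner proposes which attribute is worth asking about. On this toy data it selects the payment attribute, but that outcome is a property of the made-up records and says nothing about the real survey.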


Without consideration of these three issues, data mining is meaningless. There are four general types of knowledge that can help us determine whether data mining or data query is suitable for us.

(1). Data: sometimes data is also called shallow knowledge, which can be easily stored in a database and manipulated by a DBMS. A data query, for example using SQL, is enough. No data mining is required.

(2). Multidimensional data: data of this type is often used to represent a multidimensional object in a multidimensional format. On-Line Analytical Processing (OLAP) [9] is an appropriate tool to manipulate this type of data.

(3). Hidden knowledge: patterns or regularities hidden in data that cannot be easily found using database query languages. Data mining algorithms are suitable for this type of knowledge.

(4). Deep knowledge: defined as knowledge that can only be found if we are given some hints or directions about what we are looking for. No current data mining tools or DBMSs are able to locate knowledge of this type.

Existing database query languages, such as SQL and QUEL, and OLAP are good enough to process data of the first two types [10]. Data mining leads us one step further to explore data of the third type. But no one dares to say that current mining techniques are sufficient to uncover all hidden knowledge, so computer scientists have to keep working on them.

3.3. Expert Systems

An expert system often comprises a knowledge base and an inference engine [11,12], as shown in Fig. 1. The former holds the knowledge of the system, whereas the latter is the mechanism that infers new facts from existing facts. From the application viewpoint, an expert system is a computer program that gathers expertise from human experts to construct its knowledge base so as to emulate the problem-solving skills of human experts in specific problem domains. That means the program must solve problems using methods similar to those employed by the experts. The knowledge base is often implemented with a rule-based approach. A rule, formatted as if x then y, where x is the antecedent (or condition) and y is the action (or conclusion), can be created by data mining or extracted from human experts by knowledge engineers, who are people trained to interact with experts to capture their knowledge. To operate an expert system, the inference engine tries to match known facts with the “if” part (i.e., the antecedent) of a rule to see whether the rule can be fired. If so, the “then” part (action) of the rule is executed. If not, the inference engine continues to match other rules and facts.

Fig.1 The framework of an expert system (components: Knowledge Base, Inference Engine)

4. Geographical Information System (GIS)

A GIS system (or GIS for short) is an application system for creating, storing, analyzing and managing spatial data and associated attributes [13]. In a more generic sense, a GIS is a software tool that enables users to create interactive queries, analyze spatial information,
edit data and display geographically-referenced information.

GIS is often used for scientific investigations, resource management, asset management, environmental impact assessment, city development planning, cartography, and route planning, for example, to identify a polluted area that needs to be isolated from others.

4.1. Data Creation

Modern GIS technologies rely on digital information, for which there are a number of collection methods. The most common one is digitization, where a hardcopy map or survey plan is transferred into a digital medium through the use of a digitization tool, which is a computer-aided drafting (CAD) program with geo-referencing capabilities.

4.2. Data Representation

GIS represents real-world objects (roads, wetlands, buildings) with digital data. Raster and vector are two common methods used to store data in a GIS for discrete objects and continuous fields. Raster images consist of rows and columns of cells, where each cell stores a single value. The value recorded for each cell may be a discrete value, a continuous value, or a null value (if no data is available).

Vector data uses geometries such as points, lines (series of point coordinates), or polygons (shapes bounded by lines) to represent objects. Examples include property boundaries for gardens represented as polygons and pond locations represented as points. Vector features can be made to respect spatial integrity constraints through the application of topology rules such as 'polygons must not overlap'. Vector data can also be used to represent continuously varying phenomena to show us the continuous change of objects, e.g., the annual development over the last 20 years.

Raster datasets record a value for each point in the area covered, which may consume more storage than representing the data in a vector format that stores data only as needed. Vector data can be displayed as vector graphics used on traditional maps, whereas raster data will appear as an image that may have a blocky appearance at object boundaries.

Additional non-spatial data can also be stored alongside the spatial data, e.g., ages and genders collected through questionnaires or interviews. In vector data, attributes of objects are required. For example, a city inventory polygon may also have an identifier value and information about its population. In raster data, the cell value can be attribute information, or an identifier relating to records in another table.

4.3. Data Capture

Entering information into a GIS system consumes much of the time of its users/creators. There are a variety of methods used to enter data in a digital format into a GIS. Existing data printed on paper or film maps can be digitized or scanned to produce digital data. A digitizer produces vector data as an operator traces points, lines, and polygon boundaries from a map. Raster data produced by scanning a map can be further processed to generate vector data.

Positions from a Global Positioning System (GPS), a survey tool, can also be directly entered into a GIS.
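The raster and vector representations described in Section 4.2 can be sketched in a few lines of Python. This is only an illustrative sketch using plain lists and dictionaries. The layer contents, coordinates, cell size and function names are hypothetical, and a real GIS would use dedicated spatial data formats instead.

```python
# Vector layer: each feature is a geometry plus non-spatial attributes.
# All identifiers, coordinates and attribute values are hypothetical.
VECTOR_FEATURES = [
    {"id": 1, "geometry": ("point", [(2.5, 1.5)]),
     "attrs": {"name": "pond"}},
    {"id": 2, "geometry": ("polygon", [(0, 0), (4, 0), (4, 3), (0, 3)]),
     "attrs": {"name": "garden"}},
]

# Raster layer: rows and columns of cells, one value per cell.
# None marks a cell for which no data is available.
CELL_SIZE = 1.0        # ground units covered by one cell side
ORIGIN = (0.0, 3.0)    # x, y of the raster's top-left corner
RASTER = [
    [1, 1, 2, None],
    [1, 2, 2, 2],
    [3, 3, 2, None],
]


def raster_value_at(x, y):
    """Return the value of the cell containing (x, y), or None when the
    point is outside the raster or the cell holds no data."""
    col = int((x - ORIGIN[0]) // CELL_SIZE)
    row = int((ORIGIN[1] - y) // CELL_SIZE)  # rows count downward from the top
    if 0 <= row < len(RASTER) and 0 <= col < len(RASTER[0]):
        return RASTER[row][col]
    return None


def polygon_area(vertices):
    """Area of a simple polygon from its vertex list (shoelace formula)."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


if __name__ == "__main__":
    pond_x, pond_y = VECTOR_FEATURES[0]["geometry"][1][0]
    print(raster_value_at(pond_x, pond_y))
    print(polygon_area(VECTOR_FEATURES[1]["geometry"][1]))
```

In this sketch the garden polygon is described by four vertices, while the raster covering the same extent stores twelve cell values, which illustrates why the raster form may need more storage for the same area.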
Remotely sensed data also plays an important role in data collection. A sensing system consists of sensors attached to a collection mechanism. Sensors include cameras, digital scanners and so on, while collection mechanisms are often aircraft or satellites.

The majority of digital data currently comes from photo interpretation of aerial photographs. After data is entered into a GIS, it usually requires editing, error removal, or further processing. Vector data must be made "topologically correct" before it can be used for some advanced analysis. For example, in a city map, a polygon should be a closed area, and two adjacent lines of the object must connect at an intersection. Otherwise, the GIS will treat them as two disconnected line segments. That is, errors such as undershoots and overshoots must be removed or corrected. For scanned maps, blemishes on the source map need to be removed from the resulting raster. Otherwise two disconnected lines, for example, may become connected by a dirty spot located between them.

4.4. Coordinate Systems

Two different maps might show data at different scales. Map information in a GIS must be modified or adjusted so that it fits with information gathered from other maps. The modification or adjustment includes projection and coordinate conversions.

The earth is represented by various models, each of which may provide a different set of coordinates (e.g., latitude, longitude, elevation) for any given point on the earth's surface. As more measurements of the earth have been accumulated, the models of the earth have become more sophisticated and more accurate. In fact, there are models that apply to different areas of the earth to provide increased accuracy (e.g., the North American Datum of 1983, NAD83, works well in North America, but not in Europe). Therefore, coordinate conversions are required.

A projection is the process of transferring information from a model of a three-dimensional curved surface to a two-dimensional medium, e.g., paper or a computer screen. Different projections are used for different types of maps because each projection particularly suits certain uses. For example, a projection that accurately represents the shapes of the oceans will distort their relative sizes.

Since much of the information in a GIS comes from existing maps, a GIS should use the processing power of computer systems to accurately transform digital information, gathered from sources with different projections and/or different coordinate systems, to a common projection and coordinate system before we can correctly put the information from different sources together and then manipulate the integrated information precisely.

4.5. Current Systems

There are three common types of GIS hardware platforms: single PC, Web-based (or Net-based) and mobile devices.

4.5.1 Single PC

We call platforms of this type resource-rich platforms since a PC, as compared with a mobile device (e.g.,
pocket PC, smart-phone), often provides many more hardware and software resources. A GIS that operates on a desktop or laptop has its own databases, on which we can easily perform complex analysis or manipulation, such as overlapping, routing and 3D modeling. The major parameters that affect system performance include CPU capacity, memory capacity and so on.

4.5.2 Web-based

In a Web-based GIS system, the data is generally stored on network servers. The client-side applications are just operational interfaces. Apart from temporary results, they store nothing for the map currently being manipulated. Platforms of this type are suitable for research teams or programmers in schools in which most data are managed centrally.

Furthermore, interactive web GIS, such as Google Maps, is most popular nowadays. Google Maps exposes an API, based on Asynchronous JavaScript and XML, enabling users to associate attributes with interactive maps.

4.5.3 Mobile Devices

GIS systems developed for running on mobile devices (such as cellphones and PDAs) are rare. Their main applications focus on car navigation and disaster rescue. Due to limited device resources, vendors often reduce the sizes of their digital geographic databases and limit their systems’ analytical capabilities. So, most mobile systems are not able to analyze geographic information as deeply as systems running on desktops.

5. Case Study

We had a research project concerning GIS and data mining, which was supported by the Taichung City Government, Taiwan. More than 650 clients, who were served by seven social service agencies for in-home services, made up the list of investigation for this project. These seven social service agencies have had contracting relations with the Taichung City Government in delivering in-home services to the elderly. A survey questionnaire was designed by our research team to be used as the main source for obtaining information regarding important variables of elderly needs and the satisfaction of clients with the current service delivery system that carried out in-home services. GIS was used to enhance data storage and spatial analytical capacity, and to develop an in-home service information management system.

5.1. GIS Operations

The three main concepts of the project that use GIS to analyze social and in-home service resources are:

(1). Characteristics and satisfaction of clients. To understand the characteristics of elderly subjects who received in-home services, and to evaluate the satisfaction of the clients with the current service delivery system for in-home services.

(2). How to use GIS to learn more about our services. To describe the use of GIS, combined with other visualized statistical tools such as correspondence analysis, and data mining in developing an in-home service management system to enhance our understanding of the service satisfaction of the elderly and the issues of the elderly with in-home services.
(3). How to use GIS to improve local government decisions. To explore the potential uses of information techniques for constructing decision support systems for local governments that administer human services.

We analyzed the service satisfaction data and showed them on digital maps, so that every recipient’s satisfaction status can be easily understood. Furthermore, we used the “buffer zone” and “overlapping” functions to analyze the locations of public facilities and in-home service centers. Thus, we can learn which sections lack service centers and/or public facilities. Decision makers can then refer to these results to make more accurate and worthwhile decisions.

5.2. Data Mining Application

We have analyzed the survey data gathered through questionnaires with a data mining tool. The following gives examples.

A. Completely free or partially paid service fee against service satisfaction

(1). If (completely free) then the answer is “satisfied”
:rule accuracy 77.26%
:rule coverage 87.86%

The result indicates that 77.26% of recipients whose in-home service payment was totally paid by the government were satisfied with their in-home servants’ services. Also, 87.86% of recipients who were satisfied with their in-home servants’ services fitted this rule.

(2). If (partially paid) then the answer is “satisfied”
:rule accuracy 64.71%
:rule coverage 11.61%

The result indicates that 64.71% of recipients whose in-home service payment was partially paid by the government were satisfied with their in-home servants’ services. Also, 11.61% of recipients who were satisfied with their services fitted this rule.

We can conclude that most recipients enjoyed their in-home servants’ services if the service payment was completely free or partially paid by the government, no matter whether the services were truly what they wanted. That is, a free lunch makes one feel happy and satisfied.

B. Participating in home parties against service satisfaction

(1). If (the recipients have never taken part in home parties) then the answer is “satisfied”
:rule accuracy 76.05%
:rule coverage 62.01%

The result indicates that 76.05% of recipients who have never participated in home parties were satisfied with their in-home servants’ services. Also, 62.01% of recipients who were satisfied with their in-home servants’ services fitted this rule.

(2). If (the recipients have ever taken part in home parties) then the answer is “satisfied”
:rule accuracy 75.00%
:rule coverage 37.20%

The result indicates that 75.00% of recipients who have ever participated in home parties were satisfied with their in-home servants’ services. Also, 37.20% of recipients who were satisfied with their services fitted this rule.

We can conclude that most recipients enjoyed their in-home servants’ services no matter whether they had ever participated in home parties. The deeper meaning is that most of the recipients feel lonely. They feel happy and satisfied with the in-home services because they have the chance to talk with someone, even if that someone is their in-home servant.
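The rule accuracy and rule coverage figures reported in this section follow the interpretation given in the text: accuracy is the share of recipients matching the “if” part who also match the “then” part, and coverage is the share of recipients matching the “then” part who also match the “if” part. A minimal Python sketch of that calculation is given below. The records are hypothetical toy data rather than the survey data, and the function and field names are assumptions made for illustration.

```python
def rule_accuracy_and_coverage(records, antecedent, consequent):
    """Return (accuracy, coverage) of the rule 'if antecedent then consequent'.

    accuracy = share of records matching the antecedent that also match
               the consequent
    coverage = share of records matching the consequent that also match
               the antecedent
    """
    n_if = sum(1 for r in records if antecedent(r))
    n_then = sum(1 for r in records if consequent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    accuracy = n_both / n_if if n_if else 0.0
    coverage = n_both / n_then if n_then else 0.0
    return accuracy, coverage


# Hypothetical toy data (NOT the survey data): 12 recipients.
RECIPIENTS = (
    [{"payment": "free", "answer": "satisfied"}] * 6
    + [{"payment": "free", "answer": "unsatisfied"}] * 2
    + [{"payment": "partial", "answer": "satisfied"}] * 2
    + [{"payment": "partial", "answer": "unsatisfied"}] * 2
)


def is_satisfied(record):
    return record["answer"] == "satisfied"


free_rule = rule_accuracy_and_coverage(
    RECIPIENTS, lambda r: r["payment"] == "free", is_satisfied)
partial_rule = rule_accuracy_and_coverage(
    RECIPIENTS, lambda r: r["payment"] == "partial", is_satisfied)

if __name__ == "__main__":
    print("free    -> satisfied:", free_rule)
    print("partial -> satisfied:", partial_rule)
```

Under these definitions, the coverages of rules whose “if” parts split the recipients into non-overlapping groups add up to 1 when every recipient falls into one of the groups. This is consistent with the reported coverages of each rule pair above summing to slightly less than 100%, although the source does not state how unanswered questionnaire items were handled.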

6. Conclusion and Future Work

In the past, we have deployed GIS and data mining to analyze data concerning social work, and obtained a series of results. In the future, we will apply this experience to analyze the data collected from Situn district regarding the development of this area during the past twenty or thirty years, and to uncover how the development of the Central Taiwan Science Park affects the development of Situn district. We expect to explore and learn what changes or advancement/regression have happened and/or will happen.

In GIS, we expect to:
(1). Input, edit, store and manage the spatial data and attribute data collected from Situn district.
(2). Display data (maps, charts, and tables).
(3). Explore data (data query, geographic visualization).
(4). Analyze data (buffering, overlay, distance measurement, map manipulation, spatial interpolation, regions-based analysis, network analysis, etc.).

In Data Mining, we expect to:
(1). Code the questionnaires’ results into databases.
(2). Use supervised learning to mine the hidden knowledge embedded in the database.
(3). Display the mining results with GIS, and manually or automatically explain why they happen.

References

[1] T.M. Mitchell, “Does Machine Learning Really Work?” AI Magazine, vol.18, no.3, 1997, pp.11-20.
[2] V. McCarthy, “Strike It Rich,” Datamation, vol.43, no.2, 1997, pp.44-50.
[3] H. Baltazar, “NBA Coaches’ Latest Weapon: Data Mining,” PC Week, March 2000, p.69.
[4] ESRI - The GIS Software Leader, http://www.esri.com/.
[5] Systèmes de navigation routière GPS portables de TomTom, http://www.tomtom.com/index.php.
[6] Google Maps, http://maps.google.com/.
[7] Yahoo! Maps, Driving Directions, and Traffic, http://maps.yahoo.com.
[8] R.J. Roiger and M.W. Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley, 2003.
[9] H. Garcia-Molina, J.D. Ullman and J. Widom, Database System Implementation, Prentice Hall, 2000.
[10] P. Adriaans and D. Zantinge, Data Mining, Addison Wesley, 1996.
[11] V.S. Moustakis, M. Lehto and G. Salvendy, “Survey of expert opinion: which machine learning method may be used for which task?” Special issue on machine learning of International Journal of HCI, 1996.
[12] M. Lavrac and S.K. Wrobel, Machine Learning: ECML-95, New York: Springer Verlag, 1995.
[13] Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/.
