Ch6 Ais6 Reviewer
This diagram illustrates the relationship between the entities
SUPPLIER, PART, LINE_ITEM, and ORDER.
o The boxes represent entities.
o The lines connecting the boxes represent relationships.
A line connecting two entities that ends in two short marks
designates a one-to-one relationship.
A line connecting two entities that ends with a crow’s foot topped
by a short mark indicates a one-to-many relationship.
If the business doesn’t get its data model right, the system won’t
be able to serve the business well. The company’s systems will not
be as effective as they could be because they’ll have to work with
data that may be inaccurate, incomplete, or difficult to retrieve.
Non-relational Databases and Databases in the Cloud
Cloud computing, unprecedented data volumes, massive workloads for
web services, and the need to store new types of data require
database alternatives to the traditional relational model of
organizing data in the form of tables, columns, and rows. Companies
are turning to “NoSQL” non-relational database technologies for
this purpose.
Non-relational Database Management Systems
o use a more flexible data model and are designed for managing
large data sets across many distributed machines and for easily
scaling up or down.
o useful for accelerating simple queries against large volumes of
structured and unstructured data, including web, social media,
graphics, and other forms of data that are difficult to analyze
with traditional SQL-based tools.
There are several different kinds of NoSQL databases, each with its
own technical features and behavior.
a. Oracle NoSQL Database
b. Amazon’s SimpleDB – one of the Amazon Web Services that run in
the cloud.
c. SimpleDB – provides a simple web services interface to create
and store multiple data sets, query data easily, and return the
results. There is no need to predefine a formal database structure
or change that definition if new data are added later.
d. MongoDB Open-source NoSQL Database – used to quickly integrate
disparate data on more than 100 million customers and deliver a
consolidated view of each.
The NoSQL database is able to use structured, semi-structured, and
unstructured data.
B. Cloud Databases
Amazon and other cloud computing vendors provide relational
database services as well.
a. Amazon Relational Database Service (Amazon RDS) offers MySQL,
SQL Server, Oracle Database, PostgreSQL, MariaDB, or Amazon Aurora
DB (compatible with MySQL) as database engines. Pricing is based on
usage.
b. Oracle has its own Database Cloud Services using its relational
Oracle Database.
c. Microsoft Windows SQL Azure Database is a cloud-based relational
database service based on Microsoft’s SQL Server DBMS.
Cloud-based data management services have special appeal for
web-focused start-ups or small to medium-sized businesses seeking
database capabilities at a lower price than in-house database
products.
6-3 What are the principal tools and technologies for accessing
information from databases to improve business performance and
decision making?
A. Big Data
o these data may be unstructured or semi-structured and thus not
suitable for relational database products that organize data in the
form of columns and rows.
o describes these data sets with volumes so huge that they are
beyond the ability of typical DBMS to capture, store, and analyze.
o doesn’t refer to any specific quantity but usually refers to data
in the petabyte and exabyte range (in other words, billions to
trillions of records), all from different sources.
o produced in much larger quantities and much more rapidly than
traditional data.
The Challenge of Big Data
o Businesses are interested in big data because they can reveal
more patterns and interesting relationships than smaller data sets,
with the potential to provide new insights into customer behavior,
weather patterns, financial market activity, or other phenomena.
o Big data is also finding many uses in the public sector.
o However, to derive business value from these data, organizations
need new technologies and tools capable of managing and analyzing
nontraditional data along with their traditional enterprise data.
They also need to know what questions to ask of the data and the
limitations of big data. Capturing, storing, and analyzing big data
can be expensive, and information from big data may not necessarily
help decision makers. It’s important to have a clear understanding
of the problem big data will solve for the business.
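To make the idea of semi-structured data concrete, here is a minimal
Python sketch. The records and field names are invented for
illustration. It shows why data whose fields vary from record to
record fit poorly into a fixed set of columns and rows, which is the
problem that schema-free NoSQL stores are meant to avoid.

```python
# Hypothetical customer records: each one carries a different set of fields.
records = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "Ben", "twitter": "@ben", "tags": ["vip", "beta"]},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "address": {"city": "Cebu"}},
]


def columns_needed(rows):
    """Return every field name seen, in first-seen order.

    A fixed relational table would have to declare all of these columns
    up front, and change its definition whenever a new field appears.
    """
    cols = []
    for row in rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols


def empty_cells(rows):
    """Count the cells that would be left empty in such a fixed table."""
    cols = columns_needed(rows)
    return sum(1 for row in rows for col in cols if col not in row)


print(columns_needed(records))
# ['id', 'name', 'email', 'twitter', 'tags', 'address']
print(empty_cells(records))  # 7 of the 18 cells would be empty
```

A document-style store keeps each record as it is, so no column has
to be declared in advance and no empty cells are stored.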
B. Business Intelligence Infrastructure
A contemporary infrastructure for business intelligence has an
array of tools for obtaining useful information from all the
different types of data used by businesses today, including
semi-structured and unstructured big data in vast quantities. These
capabilities include:
1. Data Warehouses and Data Marts
Data Warehouse
o a database that stores current and historical data of potential
interest to decision makers throughout the company. The data
originate in many core operational transaction systems, such as
systems for sales, customer accounts, and manufacturing, and may
include data from website transactions.
o extracts current and historical data from multiple operational
systems inside the organization. These data are combined with data
from external sources and transformed by correcting inaccurate and
incomplete data and restructuring the data for management reporting
and analysis before being loaded into the data warehouse.
o makes the data available for anyone to access as needed, but the
data cannot be altered.
o provides a range of ad hoc and standardized query tools,
analytical tools, and graphical reporting facilities.
Enterprise-wide Data Warehouses
o companies either build a central data warehouse that serves the
entire organization, or they create smaller, decentralized
warehouses called data marts.
Data Mart
o a subset of a data warehouse in which a summarized or highly
focused portion of the organization’s data is placed in a separate
database for a specific population of users.
2. Hadoop
o Relational DBMS and data warehouse products are not well suited
for organizing and analyzing big data or data that do not easily
fit into columns and rows used in their data models.
o For handling unstructured and semi-structured data in vast
quantities, as well as structured data, organizations are using
Hadoop.
o Hadoop is an open-source software framework managed by the Apache
Software Foundation that enables distributed parallel processing of
huge amounts of data across inexpensive computers.
o It breaks a big data problem down into sub-problems, distributes
them among up to thousands of inexpensive computer processing
nodes, and then combines the result into a smaller data set that is
easier to analyze.
o Hadoop consists of several key services:
a. Hadoop Distributed File System (HDFS) for data storage and
MapReduce for high-performance parallel data processing. HDFS links
together the file systems on the numerous nodes in a Hadoop cluster
to turn them into one big file system.
b. Hadoop’s MapReduce was inspired by Google’s MapReduce system for
breaking down processing of huge data sets and assigning work to
the various nodes in a cluster.
c. HBase, Hadoop’s non-relational database, provides rapid access
to the data stored on HDFS and a transactional platform for running
high-scale real-time applications.
o Hadoop can process large quantities of any kind of data,
including structured transactional data, loosely structured data
such as Facebook and Twitter feeds, complex data such as web server
log files, and unstructured audio and video data.
o Hadoop runs on a cluster of inexpensive servers, and processors
can be added or removed as needed.
o Companies use Hadoop for analyzing very large volumes of data as
well as for a staging area for unstructured and semi-structured
data before they are loaded into a data warehouse.
o Yahoo uses Hadoop to track users’ behavior so it can modify its
home page to fit their interests.
o Life sciences research firm NextBio uses Hadoop and HBase to
process data for pharmaceutical companies conducting genomic
research.
3. In-Memory Computing
o It relies primarily on a computer’s main memory (RAM) for data
storage. Conventional DBMS use disk storage systems.
o Users access data stored in system primary memory, thereby
eliminating bottlenecks from retrieving and reading data in a
traditional, disk-based database and dramatically shortening query
response times.
o In-memory processing makes it possible for very large sets of
data, amounting to the size of a data mart or small data warehouse,
to reside entirely in memory.
o Complex business calculations that used to take hours or days are
able to be completed within seconds, and this can even be
accomplished using handheld devices.
o Leading commercial products for in-memory computing include SAP
HANA and Oracle Exalytics. Each provides a set of integrated
software components, including in-memory database software and
specialized analytics software, that run on hardware optimized for
in-memory computing work.
4. Analytic Platforms
o Commercial database vendors have developed specialized high-speed
analytic platforms using both relational and non-relational
technology that are optimized for analyzing large data sets.
o Analytic platforms such as IBM PureData System for Analytics
feature preconfigured hardware-software systems that are
specifically designed for query processing and analytics. IBM
PureData System for Analytics features tightly integrated database,
server, and storage components that handle complex analytic queries
10 to 100 times faster than traditional systems.
o Analytic platforms also include in-memory systems and NoSQL
non-relational database management systems.
o Analytic platforms are now available as cloud services.
C. Analytical Tools: Relationships, Patterns, Trends
1. Online Analytical Processing (OLAP)
o OLAP supports multidimensional data analysis, enabling users to
view the same data in different ways using multiple dimensions.
o Each aspect of information (product, pricing, cost, region, or
time period) represents a different dimension.
o OLAP represents relationships among data as a multidimensional
structure, which can be visualized as cubes of data and cubes
within cubes of data, enabling more sophisticated data analysis.
o OLAP enables users to obtain online answers to ad hoc questions
fairly rapidly, even when the data are stored in very large
databases.
2. Data Mining
o Data mining is more discovery-driven.
o Data mining provides insights into corporate data that cannot be
obtained with OLAP by finding hidden patterns and relationships in
large databases and inferring rules from them to predict future
behavior. The patterns and rules are used to guide decision making
and forecast the effect of those decisions.
o Data mining analyzes large pools of data, including the contents
of data warehouses, to find patterns and rules that can be used to
predict future behavior and guide decision making.
o The types of information obtainable from data mining include:
a. Associations: occurrences linked to a single event.
b. Sequences: events are linked over time.
c. Classification: recognizes patterns that describe the group to
which an item belongs by examining existing items that have been
classified and by inferring a set of rules. It helps discover the
characteristics of customers who are likely to leave and can
provide a model to help managers predict who those customers are so
that the managers can devise special campaigns to retain such
customers.
d. Clustering: works in a manner similar to classification when no
groups have yet been defined. A data mining tool can discover
different groupings within data, such as finding affinity groups
for bank cards or partitioning a database into groups of customers
based on demographics and types of personal investments.
e. Forecasting: uses predictions in a different way. It uses a
series of existing values to forecast what other values will be.
o These systems perform high-level analyses of patterns or trends,
but they can also drill down to provide more detail when needed.
There are data mining applications for all the functional areas of
business and for government and scientific work.
o One popular use for data mining is to provide detailed analyses
of patterns in customer data for one-to-one marketing campaigns or
for identifying profitable customers.
3. Text Mining and Web Mining
Unstructured Data
o most in the form of text files
o is believed to account for more than 80 percent of useful
organizational information.
o one of the major sources of big data that firms want to analyze.
o e-mail, memos, call center transcripts, survey responses, legal
cases, patent descriptions, and service reports are all valuable
for finding patterns and trends that will help employees make
better business decisions.
Text Mining
o available to help businesses analyze large unstructured data sets
consisting of text.
o able to extract key elements from unstructured big data sets,
discover patterns and relationships, and summarize the information.
o can analyze transcripts of calls to customer service centers to
identify major service and repair issues or to measure customer
sentiment about the company.
Sentiment Analysis
o able to mine text comments in an e-mail message, blog, social
media conversation, or survey form to detect favorable and
unfavorable opinions about specific subjects.
Analytic Software
o analyzes customer service notes, e-mails, survey responses, and
online discussions to discover signs of dissatisfaction that might
cause a customer to stop using the company’s services.
o able to automatically identify the various “voices” customers use
to express their feedback (such as a positive, negative, or
conditional voice) to pinpoint a person’s intent to buy, intent to
leave, or reaction to a specific product or marketing message.
Web
o another rich source of unstructured big data for revealing
patterns, trends, and insights into customer behavior.
Web Mining
o discovery and analysis of useful patterns and information from
the World Wide Web (or web)
o examines the structure of websites and activities of website
users as well as the contents of webpages.
o helps businesses understand customer behavior, evaluate the
effectiveness of a particular website, or quantify the success of a
marketing campaign.
o looks for patterns in data through content mining, structure
mining, and usage mining.
o Content mining: the process of extracting knowledge from the
content of webpages, which may include text, image, audio, and
video data.
o Structure mining: examines data related to the structure of a
particular website.
o Usage mining: examines user interaction data recorded by a web
server whenever requests for a website’s resources are received.
Databases and the Web
Because many back-end databases cannot interpret commands written
in HTML, the web server passes these requests for data to software
that translates HTML commands into SQL so the commands can be
processed by the DBMS working with the database.
Conventional databases can be linked via middleware to the web or a
web interface to facilitate user access to an organization’s
internal data.
In a client/server environment, the DBMS resides on a dedicated
computer called a database server. The DBMS receives the SQL
requests and provides the required data. Middleware transfers
information from the organization’s internal database back to the
web server for delivery in the form of a web page to the user.
The middleware working between the web server and the DBMS is an
application server running on its own dedicated computer.
The application server software handles all application operations,
including transaction processing and data access, between
browser-based computers and a company’s back-end business
applications or databases. The application server takes requests
from the web server, runs the business logic to process
transactions based on those requests, and provides connectivity to
the organization’s back-end systems or databases.
Alternatively, the software for handling these operations could be
a custom program or a CGI script. A CGI script is a compact program
using the Common Gateway Interface (CGI) specification for
processing data on a web server.
There are a number of advantages to using the web to access an
organization’s internal databases:
o the web browser software is much easier to use than proprietary
query tools.
o the web interface requires few or no changes to the internal
database. It costs much less to add a web interface in front of a
legacy system than to redesign and rebuild the system to improve
user access.
o accessing corporate databases through the web is creating new
efficiencies, opportunities, and business models.
ThomasNet.com (formerly Thomas Register)
o provides an up-to-date online directory of more than 700,000
suppliers of industrial products.
o used to send out huge paper catalogs with this information; now
it provides this information to users online via its website and
has become a smaller, leaner company.
Facebook (social networking service)
o helps users stay connected with each other and meet new people.
o features “profiles” for 1.6 billion active users with information
about themselves, including interests, friends, photos, and groups
with which they are affiliated.
o maintains a very large database to house and manage all of this
content.
6-4 Why are information policy, data administration, and data
quality assurance essential for managing the firm’s data resources?
Establishing an Information Policy
Every business, large and small, needs an information policy. Your
firm’s data are an important resource, and you don’t want people
doing whatever they want with them. You need to have rules on how
the data are to be organized and maintained and who is allowed to
view the data or change them.
Information Policy
o specifies the organization’s rules for sharing, disseminating,
acquiring, standardizing, classifying, and inventorying
information.
o lays out specific procedures and accountabilities, identifying
which users and organizational units can share information, where
information can be distributed, and who is responsible for updating
and maintaining the information.
o governs the maintenance, distribution, and use of information in
the organization.
If you are in a small business, the information policy would be
established and implemented by the owners or managers. In a large
organization, managing and planning for information as a corporate
resource often require a formal data administration function.
Data Administration
o responsible for the specific policies and procedures through
which data can be managed as an organizational resource.
o these responsibilities include developing an information policy,
planning for data, overseeing logical database design and data
dictionary development, and monitoring how information systems
specialists and end user groups use data.
Data Governance
o used to describe many of these activities.
o promoted by IBM, data governance deals with the policies and
processes for managing the availability, usability, integrity, and
security of the data employed in an enterprise with special
emphasis on promoting privacy, security, data quality, and
compliance with government regulations.
In close cooperation with users, the organization’s
database design and management group establishes
the physical database, the logical relations among
elements, and the access rules and security
procedures. The functions it performs are called
database administration.
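The access rules mentioned above can be pictured as a simple lookup
table. The Python sketch below is only an illustration: the roles,
the table name, and the is_allowed helper are hypothetical, and a
real DBMS would enforce such rules through its own security
features.

```python
# Hypothetical access rules: (role, table) -> actions that role may perform.
ACCESS_RULES = {
    ("sales_rep", "customer"): {"view"},
    ("account_manager", "customer"): {"view", "change"},
    ("dba", "customer"): {"view", "change"},
}


def is_allowed(role, table, action):
    """Return True only if the rules explicitly grant this action.

    Anything not listed is denied, so a new role or table has no
    access until someone adds a rule for it.
    """
    return action in ACCESS_RULES.get((role, table), set())


print(is_allowed("sales_rep", "customer", "view"))    # True
print(is_allowed("sales_rep", "customer", "change"))  # False
print(is_allowed("intern", "customer", "view"))       # False
```

Denying by default reflects the point of an information policy:
who may view or change data is decided deliberately, not left open.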
Ensuring Data Quality
If a database is properly designed and
enterprise-wide data standards established,
duplicate or inconsistent data elements should be
minimal. Most data quality problems, however, such
as misspelled names, transposed numbers, or
incorrect or missing codes, stem from errors during
data input. The incidence of such errors is rising as
companies move their businesses to the web and
allow customers and suppliers to enter data into
their websites that directly update internal systems.
Before a new database is in place,
organizations need to identify and correct their
faulty data and establish better routines for editing
data once their database is in operation.
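An editing routine of the kind described above might look like the
following Python sketch. The field names, the region codes, and the
rules themselves are assumptions made up for this example; each
organization would define its own checks.

```python
import re

# Hypothetical set of valid codes for the "region" field.
VALID_REGION_CODES = {"N", "S", "E", "W"}


def edit_check(record):
    """Return a list of problems found in one input record (empty if clean)."""
    problems = []

    # Names: must be present and contain only letters and common punctuation.
    name = (record.get("name") or "").strip()
    if not name:
        problems.append("name is missing")
    elif not re.fullmatch(r"[A-Za-z][A-Za-z .'-]*", name):
        problems.append("name contains unexpected characters")

    # Codes: must be one of the allowed values.
    if record.get("region") not in VALID_REGION_CODES:
        problems.append("region code is missing or invalid")

    # Numbers: must be a positive whole number.
    quantity = record.get("quantity")
    if not isinstance(quantity, int) or quantity <= 0:
        problems.append("quantity must be a positive whole number")

    return problems


print(edit_check({"name": "Ana Cruz", "region": "N", "quantity": 3}))  # []
print(edit_check({"name": "", "region": "X", "quantity": 0}))
```

Checks like these catch missing or invalid codes at the point of
entry. They cannot tell whether a well-formed name is spelled
correctly, which is one reason audits are still needed.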
Data Quality Audit
o analysis of data quality often begins with a
data quality audit.
o a structured survey of the accuracy and
level of completeness of the data in an
information system.
o can be performed by surveying entire data
files, surveying samples from data files, or
surveying end users for their perceptions of
data quality.
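One way to picture an audit over a sample of a data file is the
short Python sketch below. It measures completeness only (the share
of sampled records that have a value in each required field);
checking accuracy would also need a trusted source to compare
against. The function name, field names, and records are
hypothetical.

```python
import random


def completeness_audit(records, required_fields, sample_size, seed=0):
    """Return, per field, the fraction of sampled records that have a value."""
    if not records:
        raise ValueError("cannot audit an empty data file")

    # Survey a random sample rather than the entire file.
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))

    report = {}
    for field in required_fields:
        filled = sum(1 for rec in sample if rec.get(field) not in (None, ""))
        report[field] = filled / len(sample)
    return report


# Hypothetical data file with two incomplete records.
sample_file = [
    {"name": "Ana", "zip": "1000"},
    {"name": "", "zip": "2000"},
    {"name": "Cy", "zip": None},
    {"name": "Dee", "zip": "4000"},
]

report = completeness_audit(sample_file, ["name", "zip"], sample_size=4)
print(report)  # {'name': 0.75, 'zip': 0.75}
```

With a sample size smaller than the file, the result is an estimate
and will vary with the records drawn.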