Data mining M1
Data mining is the process used by organizations and individuals to extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
In other words, data mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information. That information is collected and assembled in particular areas such as data warehouses, where efficient analysis and data mining algorithms support decision-making and other data requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of data to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that can be simple or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are tons of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or support company development. There are many powerful instruments and techniques available to mine data and find better insights from it.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of data items organized as a set of formally described tables (relations) of rows and columns, from which data can be accessed and reassembled in many different ways using SQL, without having to reorganize the tables themselves.
Data warehouses:
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than for transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called
an object-relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented modeling practices frequently utilized in many programming languages, for example, C++, Java, and C#.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo (roll back) a database transaction if it is not performed appropriately. Even though this was a unique capability a long time ago, today most relational database systems support transactional database activities.
Data Mining Applications
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics to gain better insights and to identify best practices that will enhance healthcare services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. The data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. It can also be used for an analytical comparison of results between different stores and between customers in different demographic groups.
Data Mining in Education:
Educational data mining is a newly emerging field concerned with developing techniques that explore knowledge from the data generated by educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An institution can use data mining to make precise decisions and also to predict students' results. With these results, the institution can concentrate on what to teach and how to teach it.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial for finding patterns in complex manufacturing processes. Data mining can be used in system-level design to discover the relationships between product architecture, product portfolio, and customer data needs. It can also be used to forecast product development time, cost, and expectations, among other tasks.
Data Mining in CRM:
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud Detection:
Billions of dollars are lost to fraud. Traditional methods of fraud detection are time consuming and sophisticated. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed using this data, and the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not difficult, but bringing out the truth from him is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. These techniques include text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining in Banking and Finance:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, casualties, and correlations in business information and market prices that are not instantly evident to managers or executives, because the data volume is too large or is produced too rapidly for experts to screen. Managers may use these findings for better targeting, acquiring, retaining, and segmenting profitable customers.
Challenges of Implementation in Data Mining
Incomplete and noisy data:
The process of extracting useful data from large volumes of data is data mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to errors in data-measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. The person may make a digit mistake when entering the phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could even get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. Practically, it is quite a tough task to move all the data to a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information is a tough task. Most of the time, new technologies, new tools, and new methodologies have to be developed to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals data about the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express. However, it is often difficult to represent the information to the end user in a precise and easy way. Because the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.
There are many more challenges in data mining in addition to the problems mentioned above. More problems are revealed as the actual data mining process begins, and the success of data mining relies on overcoming all these difficulties.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as neural
networks or decision trees. Thus, data mining incorporates analysis and prediction.
Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes (a brief classifier sketch appears at the end of this subsection).
Data mining techniques can be classified by different criteria, as follows:
i. Classification of data mining frameworks according to the type of data sources mined:
This classification is based on the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks according to the database involved:
This classification is based on the data model involved, for example, object-oriented databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks according to the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks are extensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is based on the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, and data warehouse-oriented or database-oriented approaches.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
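As a concrete illustration of the classification technique (not tied to any particular framework above), the following minimal sketch assumes scikit-learn is available; the customer features and labels are invented for illustration.

```python
# A minimal classification sketch using scikit-learn (an assumption -- any
# comparable library would do). Features and labels are illustrative only.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, annual_income_in_thousands]; labels: 1 = responded to an offer, 0 = did not
X = [[25, 30], [40, 60], [35, 45], [50, 80], [23, 25], [48, 75]]
y = [0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Classify a new, unseen customer into one of the two classes
print(model.predict([[37, 50]]))
```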
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details but achieves simplification. From a historical point of view, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.
In other words, clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities, as sketched below.
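A minimal clustering sketch, assuming scikit-learn's k-means implementation; the two-dimensional points are invented so that two natural groups exist.

```python
# K-means clustering on a tiny, made-up data set (scikit-learn assumed).
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0],     # one natural group
          [10, 2], [10, 4], [10, 0]]  # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                    # cluster id assigned to each point
print(kmeans.cluster_centers_)           # centre of each discovered cluster
print(kmeans.predict([[0, 0], [12, 3]])) # assign new points to the nearest cluster
```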
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.
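A brief regression sketch, again assuming scikit-learn; the demand/price/cost figures are made up purely to illustrate fitting and using a relationship between variables.

```python
# Fit a linear relationship between two explanatory variables and a cost figure.
from sklearn.linear_model import LinearRegression

# X: [consumer_demand, competitor_price]; y: observed cost (all values invented)
X = [[100, 9.5], [150, 9.0], [200, 8.5], [250, 8.0]]
y = [120, 160, 210, 255]

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # the learned relationship between the variables
print(reg.predict([[180, 8.8]]))   # projected cost under new conditions
```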
4. Association Rules:
This data mining technique helps to discover links between two or more items. It finds hidden patterns in the data set.
Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you start with some data, for example, a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.
• Lift:
This measure tells how much better the rule predicts item B than simply guessing how often item B is purchased overall.
Lift = Confidence / (Transactions containing item B / Entire dataset)
• Support:
This measure tells how often item A and item B are purchased together, relative to the overall dataset.
Support = (Transactions containing both item A and item B) / (Entire dataset)
• Confidence:
This measure tells how often item B is purchased when item A is purchased as well.
Confidence = (Transactions containing both item A and item B) / (Transactions containing item A)
A worked computation of these three measures follows below.
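The following plain-Python sketch shows how these three measures could be computed for a hypothetical rule "bread -> butter" over a handful of invented grocery transactions.

```python
# Support, confidence, and lift for the rule A -> B over toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

a, b = "bread", "butter"
n = len(transactions)
count_a = sum(1 for t in transactions if a in t)
count_b = sum(1 for t in transactions if b in t)
count_ab = sum(1 for t in transactions if a in t and b in t)

support = count_ab / n            # how often A and B occur together overall
confidence = count_ab / count_a   # how often B occurs when A occurs
lift = confidence / (count_b / n) # confidence adjusted for B's overall popularity

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```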
5. Outlier Detection:
This type of data mining technique relates to observing data items in the data set that do not match an expected pattern or expected behavior. It can be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset, and the majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, detecting outlying values in wireless sensor network data, etc.
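A minimal outlier detection sketch using a simple standard-deviation rule; the values and the two-standard-deviation threshold are illustrative choices, not a prescribed method.

```python
# Flag points that lie far from the mean of a small, made-up 1-D data set.
import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 55.0]  # 55.0 diverges from the rest

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Two standard deviations is a common rule of thumb; the threshold is a modelling choice.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)   # -> [55.0]
```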
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this data mining technique helps to discover or recognize similar patterns in transaction data over a period of time, as in the sketch below.
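The sketch below counts, in plain Python, how many invented customer sequences contain one item before another; keeping only pairs that reach a chosen minimum support illustrates the idea of a frequent sequential pattern.

```python
# Count ordered item pairs (a before b) across toy purchase sequences.
from itertools import combinations
from collections import Counter

sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["phone", "case"],
]

pair_counts = Counter()
for seq in sequences:
    seen = set()
    for a, b in combinations(seq, 2):   # pairs (a, b) with a occurring before b
        if (a, b) not in seen:
            seen.add((a, b))
            pair_counts[(a, b)] += 1

min_support = 3  # keep patterns occurring in at least 3 of the 4 sequences
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)   # -> {('phone', 'charger'): 3}
```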
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Data Mining Implementation Process
Data mining is described as a process of finding hidden, precious data by evaluating the huge quantities of information stored in data warehouses, using multiple data mining techniques such as Artificial Intelligence (AI), machine learning, and statistics.
Let's examine the implementation process for data mining in detail:
1. Business Understanding:
This phase focuses on understanding the project goals and requirements from a business point of view, then converting this knowledge into a data mining problem definition and a preliminary plan designed to accomplish the target.
Tasks:
• Understand the project targets and prerequisites from a business point of view.
• Thoroughly understand what the customer wants to achieve.
• Reveal, at the start, the significant factors that can impact the outcome of the project.
Assess situation:
• This requires a more detailed analysis of facts about all the resources, constraints, assumptions, and other factors that ought to be considered.
Determine data mining goals:
• A business goal states the target in business terminology, for example, increase catalog sales to existing customers.
• A data mining goal describes the project objectives in technical terms, for example, predicting how many items a customer will buy, given their demographic details (age, salary, and city) and the price of the item over the past three years.
Produce project plan:
• It states the intended plan to accomplish the business and data mining goals.
• The project plan should define the expected set of steps to be performed during the rest of the project, including the selection of techniques and tools.
2. Data Understanding:
Data understanding starts with initial data collection and proceeds with activities to get familiar with the data, to identify data quality issues, to discover first insights into the data, or to detect interesting subsets that suggest hypotheses about hidden information.
Tasks:
Describe data:
• Examine the gross properties of the acquired data, such as its format, quantity, and the identities and meanings of the fields, and report on them.
Explore data:
• Address data mining questions that can be answered by querying, visualizing, and reporting, including:
o Distribution of key attributes and results of simple aggregations.
o Relationships between small numbers of attributes.
o Properties of significant sub-populations and simple statistical analyses.
• These analyses may refine the data mining objectives.
• They may contribute to or refine the data description and quality reports.
• They may feed into the transformation and other data preparation steps.
Verify data quality:
• Examine whether the data is complete and correct, and document any quality problems found.
3. Data Preparation:
Tasks:
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
Select data:
• Decide on the data to be used for the analysis, based on its relevance to the data mining goals, its quality, and technical constraints such as limits on data volume or data types.
Clean data:
• Raise the data quality to the level required by the selected analysis techniques. This may involve the selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious methods such as estimating missing data by modeling.
Construct data:
• Derive new attributes or records from the existing data, for example, computed fields built from existing columns.
Integrate data:
• Integrating data refers to the methods whereby data is combined from multiple tables or records to create new records or values.
Format data:
• Formatting data refers mainly to syntactic changes made to the data that do not alter its meaning but may be required by the modeling tool. (A pandas sketch of these preparation tasks follows below.)
4. Modeling:
In modeling, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have particular requirements on the form of the data, so stepping back to the data preparation phase may be necessary.
Tasks:
Select modeling technique:
• Select the actual modeling technique to be used, for example, a decision tree or a neural network.
• If multiple techniques are applied, perform this task separately for each technique.
Generate test design:
• Generate a procedure or mechanism for testing the model's validity and quality before building it. For example, in classification, error rates are commonly used as quality measures for data mining models. Therefore, the data set is typically separated into a train set and a test set; the model is built on the train set and its quality is assessed on the separate test set (as illustrated in the sketch below).
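A small sketch of that train/test procedure, assuming scikit-learn and its bundled iris data set; the split ratio and the decision tree are illustrative choices.

```python
# Hold out a test set, build the model on the training portion, report the error rate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
error_rate = 1 - accuracy_score(y_test, model.predict(X_test))
print(f"error rate on the held-out test set: {error_rate:.2%}")
```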
Build model:
• To create one or more models, we need to run the modeling tool on the prepared data set.
Assess model:
• Interpret the models according to domain expertise, the data mining success criteria, and the desired test design.
• Assess the success of the application of the modeling and discovery techniques technically.
• Contact business analysts and domain specialists later to discuss the data mining outcomes in the business context.
5. Evaluation:
• This phase evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved.
• The main objective of the evaluation is to determine whether any significant business issue has not been adequately considered.
• At the end of this phase, a decision on the use of the data mining results should be reached.
Tasks:
• Evaluate results
• Review process
• Determine next steps
Evaluate results:
• Assess the degree to which the model meets the organization's business objectives.
• Test the model on real applications, if time and budget constraints permit, and also assess the other data mining results generated.
• Unveil additional challenges, suggestions, or information for future directions.
Review process:
• The review process does a more thorough evaluation of the data mining engagement to determine whether any significant factor or task has somehow been overlooked.
• It reviews quality assurance issues.
6. Deployment:
Tasks:
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project
Plan deployment:
• To deploy the data mining results into the business, this task takes the evaluation results and determines a strategy for deployment.
• It refers to documenting the procedure for later deployment.
Plan monitoring and maintenance:
• Monitoring and maintenance are important when the data mining results become part of the day-to-day business and its environment.
• A careful maintenance plan helps to avoid unnecessarily long periods of incorrect use of data mining results.
• This task needs a detailed analysis of the monitoring process.
Produce final report:
• At the end of the project, a final report is drawn up by the project leader and the team.
• It may be only a summary of the project and its experiences.
• It may be a final and comprehensive presentation of the data mining results.
Review project:
• The project review assesses what went right and what went wrong, what was done well, and what needs to be improved.
Data Mining Architecture
Data mining is a significant method whereby previously unknown and potentially useful information is extracted from vast amounts of data. The data mining process involves several components, and these components constitute a data mining system architecture.
Data Sources:
The actual sources of data are the database, the data warehouse, the World Wide Web (WWW), text files, and other documents. You need a huge amount of historical data for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure, because the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization,
classification, clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It might use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is generally suggested to push the evaluation of pattern interestingness as deep as possible into the mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps the user use the system easily and efficiently without knowing the complexity of the process. It cooperates with the data mining system when the user specifies a query or a task, and it displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It might be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module interacts with the knowledge base regularly to get inputs and also to update it.
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level applications of
specific Data Mining techniques. It is a field of interest to researchers in various fields,
including artificial intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of
large databases. It does this by using Data Mining algorithms to identify what is deemed
knowledge.
The availability and abundance of data today make knowledge discovery and Data Mining a matter of considerable significance and need. Given the recent development of the field, it is not surprising that a wide variety of techniques is now available to specialists and experts.
The process begins with determining the KDD objectives and ends with the implementation of the discovered knowledge. At that point, the loop is closed, and Active Data Mining starts. Subsequently, changes would need to be made in the application domain, for example, offering different features to cell phone users in order to reduce churn. This closes the loop: the impacts are then measured on the new data repositories, and the KDD process is started again.
Following is a concise description of the nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain
This is the initial preparatory step. It sets the scene for understanding what should be done with the many decisions to follow (transformation, algorithms, representation, etc.). The people in charge of a KDD project need to understand and define the goals of the end user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge).
2. Choosing and creating the data set on which discovery will be performed
Once the objectives are defined, the data that will be used for the knowledge discovery process should be determined. This includes finding out what data is available, obtaining additional important data, and then integrating all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This step is important because Data Mining learns and discovers from the available data; this is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; in this respect, the more attributes are considered, the better. On the other hand, organizing, collecting, and operating complex data repositories is expensive, so there is a trade-off with the opportunity for best understanding the phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD comes into play: one begins with the best available data set and later expands it, observing the effect in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It includes data cleaning, for example, handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a Data Mining algorithm in this context. For example, when one suspects that a specific attribute is not reliable enough or has many missing values, this attribute could become the target of a supervised Data Mining algorithm: a prediction model for the attribute is created, and the missing values can then be predicted. The extent to which one pays attention to this level depends on many factors. Regardless, studying these aspects is important and is often revealing in itself with regard to enterprise data systems.
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio of attributes may often be the most significant factor rather than each attribute by itself. In business, we may need to consider effects beyond our control as well as efforts and transient issues, for example, studying the effect of advertising accumulation. However, even if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformation required.
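To make the two transformation families above concrete, the hedged sketch below discretizes a numerical attribute and applies PCA for dimension reduction; scikit-learn is assumed and the data is synthetic.

```python
# Attribute transformation (discretization) and dimension reduction (PCA) on synthetic data.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ages = rng.integers(18, 70, size=(100, 1))   # a numerical attribute
features = rng.normal(size=(100, 10))        # ten attributes to be reduced

# Discretize age into three ordinal bins (quantile-based cut points)
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(ages)

# Project the ten attributes onto three principal components
reduced = PCA(n_components=3).fit_transform(features)

print(age_bins[:5].ravel())   # the first few discretized ages, e.g. [0. 2. 1. ...]
print(reduced.shape)          # (100, 3)
```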
5. Choosing the appropriate Data Mining task
We are now ready to decide which kind of Data Mining to use, for example, classification, regression, or clustering. This mainly depends on the KDD objectives and also on the previous steps. There are two major goals in Data Mining: the first is prediction, and the second is description. Prediction is often referred to as supervised Data Mining, while descriptive Data Mining includes the unsupervised and visualization aspects of Data Mining. Most Data Mining techniques are based on inductive learning, where a model is built explicitly or implicitly by generalizing from a sufficient number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The strategy also takes into account the level of meta-learning for the specific set of available data.
6. Choosing the Data Mining algorithm
Having chosen the strategy, we now decide on the tactics. This stage includes selecting a specific technique to be used for searching patterns, possibly involving multiple inducers. For example, when considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities for how it can be realized. Meta-learning focuses on explaining what causes a Data Mining algorithm to be successful or not on a specific problem; thus, this methodology attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or a different division into training and testing sets.
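A brief sketch of the ten-fold cross-validation mentioned above, comparing a decision tree with a neural network on scikit-learn's bundled iris data; the learners and data set are illustrative assumptions.

```python
# Compare two learners with 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

for name, learner in [("decision tree", DecisionTreeClassifier(random_state=0)),
                      ("neural network", MLPClassifier(max_iter=2000, random_state=0))]:
    scores = cross_val_score(learner, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f} over 10 folds")
```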
7. Employing the Data Mining algorithm
At last, the implementation of the Data Mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example, by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. Here we also consider the preprocessing steps and their impact on the Data Mining algorithm results; for example, we might add a feature in step 4 and repeat from there. This step focuses on the comprehensibility and usefulness of the induced model. The identified knowledge is also documented for further use. The last step is the usage of, and overall feedback on, the discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now we are ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we can make changes to the system and measure the effects. The success of this step determines the effectiveness of the entire KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a data set), but now the data becomes dynamic. Data structures may change (certain attributes become unavailable), and the data domain may be modified, for example, an attribute may take a value that was not expected previously.
History of Data Mining
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of regression (1800s). The generation and growing power of computer science have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining's origins are traced back to three family lines: classical statistics, artificial intelligence, and machine learning.
Classical statistics:
Statistics is the basis of most technologies on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and data relationships.
Artificial Intelligence:
Artificial intelligence is built on heuristics rather than statistics and attempts to apply human-thought-like processing to statistical problems.
Machine Learning:
Machine learning can be seen as the union of statistics and artificial intelligence; it lets programs learn from the data they study and make decisions based on what they have learned.
Data Mining Tools
Data mining tools have the objective of discovering patterns, trends, and groupings among large sets of data and transforming data into more refined information.
A data mining tool is a framework, such as RStudio or Tableau, that allows you to perform different types of data mining analysis. We can run various algorithms, such as clustering or classification, on a data set and visualize the results. Such a framework provides better insights into the data and the phenomenon that the data represents.
The market for data mining tools is shining: a recent report from ReportLinker noted that the market would top $1 billion in sales by 2023, up from $591 million in 2018.
1. Orange:
Orange is an open-source machine learning and data mining software suite. It supports visualization and is component-based software written in the Python language, developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Besides, Orange provides a more interactive and enjoyable atmosphere than dull analytical tools. It is quite exciting to operate.
Why Orange?
Data that comes into Orange is quickly formatted to the desired pattern, and the widgets can easily be moved where needed. Orange is quite interesting to users: it allows them to make smarter decisions in a short time by rapidly comparing and analyzing data. It is a good open-source data visualization and evaluation tool that suits beginners as well as professionals. Data mining can be performed via visual programming or Python scripting. Many analyses are feasible through its visual programming interface (drag-and-drop connections between widgets), and many visual tools are supported, such as bar charts, scatter plots, trees, dendrograms, and heat maps. A substantial number of widgets (more than 100) are supported.
The tool has machine learning components, add-ons for bioinformatics and text mining, and it is packed with features for data analytics. It can also be used as a Python library. Python scripts can run in a terminal window, in an integrated environment like PyCharm or PythonWin, or in shells like IPython. Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow. The widgets provide fundamental operations, for example, reading the data, showing a data table, selecting features, training predictors, comparing learning algorithms, and visualizing data elements. Orange runs on Windows, macOS, and a variety of Linux operating systems, and it comes with multiple regression and classification algorithms.
Orange can read documents in its native format and in other data formats. Orange is dedicated to machine learning techniques for classification, or supervised data mining. Two types of objects are used in classification: learners and classifiers. Learners take class-labeled data and return a classifier. Regression methods are very similar to classification in Orange; both are designed for supervised data mining and require class-labeled data. Ensemble learning combines the predictions of individual models for a gain in precision. The models can either come from different training data or use different learners on the same data set. Learners can also be diversified by altering their parameter sets. In Orange, ensembles are simply wrappers around learners; they act like any other learner and, based on the data, return models that can predict the outcome for any data instance.
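A minimal Orange scripting sketch along these lines, assuming Orange3 is installed and its bundled iris data set is available; exact API details can differ between Orange versions.

```python
# Learner/classifier usage in Orange: a learner takes class-labelled data and
# returns a model, which can then classify new instances.
import Orange

data = Orange.data.Table("iris")                 # class-labelled data bundled with Orange
learner = Orange.classification.TreeLearner()    # a learner ...
model = learner(data)                            # ... applied to data yields a classifier

print(model(data[:3]))                           # predicted classes for the first rows
```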
2. SAS Data Mining:
SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data management. SAS can mine data, alter it, manage information from various sources, and perform statistical analysis. It offers a graphical UI for non-technical users.
The SAS data miner allows users to analyze big data and provides accurate insights for timely decision-making. SAS has a distributed memory processing architecture that is highly scalable. It is suitable for data mining, optimization, and text mining purposes.
3. DataMelt (DMelt):
DMelt is a multi-platform utility written in Java. It can run on any operating system that is compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries:
• Scientific libraries:
Scientific libraries are used for drawing 2D/3D plots.
• Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting, etc.
DMelt can be used for the analysis of large volumes of data, data mining, and statistical analysis. It is extensively used in the natural sciences, financial markets, and engineering.
4. Rattle:
Rattle is a GUI-based data mining tool that uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log code tab that reproduces the R code for any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle also gives the facility to review the code, use it for many purposes, and extend the code without any restriction.
5. Rapid Miner:
Rapid Miner is one of the most popular predictive analytics systems, created by the company of the same name. It is written in the Java programming language and offers an integrated environment for text mining, deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business applications, commercial applications, research, education, training, application development, and machine learning.
Rapid Miner provides its server on-premises as well as in public or private cloud infrastructure. It has a client/server model as its base. Rapid Miner comes with template-based frameworks that enable fast delivery with few errors (which are commonly expected in manual code writing).
Data Mining vs. Machine Learning
Data mining and machine learning are areas that have influenced each other; although they have many things in common, they have different ends.
Data mining is performed on certain data sets by humans to find interesting patterns between the items in the data set. Data mining uses techniques created by machine learning to predict results, while machine learning is the capability of a computer to learn from a mined data set.
Machine learning algorithms take the information that represents the relationships between items in data sets and create models in order to predict future outcomes. These models are nothing more than actions that will be taken by the machine to achieve a result.
Machine learning is a technique that creates complex algorithms for processing large amounts of data and provides outcomes to its users. It utilizes complex programs that can learn through experience and make predictions.
The algorithms improve themselves through frequent input of training data. The aim of machine learning is to understand information and build models from data that can be understood and used by humans.
Machine learning falls into two broad categories:
• Unsupervised Learning
• Supervised Learning
Unsupervised learning does not depend on trained data sets to predict results; instead, it uses direct techniques such as clustering and association to discover results. Trained data sets are defined as input for which the output is known.
As the name implies, supervised learning refers to the presence of a supervisor acting as a teacher. Supervised learning is a learning process in which we teach or train the machine using data that is well labeled, meaning some data is already tagged with the correct responses. After that, the machine is provided with new sets of data so that the supervised learning algorithm analyzes the training data and produces an accurate result from the labeled data.
2. Data mining uses more data to obtain helpful information, and that specific data will help to predict future results. For example, a marketing company may use last year's data to predict this year's sales. Machine learning, in contrast, does not depend so much on data; it uses algorithms. For example, many transportation companies such as OLA and UBER use machine learning techniques to calculate the ETA (Estimated Time of Arrival) for rides.
3. Data mining is not capable of self-learning; it follows predefined guidelines and provides the answer to a specific problem. Machine learning algorithms, however, are self-defined and can alter their rules according to the situation; they find the solution to a specific problem and resolve it in their own way.
4. The main and most important difference between data mining and machine learning is that data mining cannot work without the involvement of humans, whereas in machine learning human effort is involved only when the algorithm is defined; after that, the machine works everything out on its own. Once implemented, it can be used indefinitely, which is not possible in the case of data mining.
5. As machine learning is an automated process, the results produced by machine learning will be more precise compared to data mining.
6. Data mining uses the database or data warehouse server, the data mining engine, and pattern evaluation techniques to obtain useful information, whereas machine learning uses neural networks, predictive models, and automated algorithms to make decisions.
History: In 1930, data mining was known as knowledge discovery in databases (KDD), whereas machine learning's first program, Samuel's checker-playing program, was established in 1950.
Responsibility: Data mining is used to obtain the rules from the existing data, whereas machine learning teaches the computer how to learn and comprehend those rules.
Abstraction: Data mining abstracts patterns from the data warehouse, whereas machine learning reads data from the machine.
Techniques involved: Data mining is more of a research activity using techniques like machine learning, whereas machine learning is a self-learned, trained system that does the task precisely.
Data Mining on Facebook
In this digital era, social platforms have become inevitable. Whether we like these platforms or not, there is no escape. Facebook allows us to interact with friends and family and to stay up to date about the latest things happening around the world. Facebook has made the world seem much smaller. It is one of the most important channels of online business communication, and business owners make the most of this platform. One important reason the platform is so heavily used is that it is one of the oldest photo- and video-sharing social media tools.
A Facebook page helps people become aware of a brand through the media content shared. The platform supports businesses in reaching out to their audience and then establishing their business presence through Facebook usage itself.
The platform is useful not only for users with business accounts but also for accounts with personal blogs. Bloggers and influencers who post content that attracts customers give users yet another reason to access Facebook.
As far as usage by ordinary users is concerned, many people nowadays cannot live without Facebook. This has become a habit to such an extent that people find themselves going through the site every half an hour.
Facebook is one of the most popular social media platforms. Created in 2004, it now has almost two billion monthly active users, with five new profiles created every second. Anyone over the age of 13 can use the site. Users create a free account, which is a profile in which they share as much or as little information about themselves as they wish.
• Headquarters: California, US
• Established: February 2004
• Founded by: Mark Zuckerberg
• There are approximately 52 percent Female users and 48 percent Male users on Facebook.
• Facebook stories are viewed by 0.6 Billion viewers on a daily basis.
• In 2019, in 60 seconds on the internet, 1 million people Log In to Facebook.
• More than 5 billion messages are posted on Facebook pages collectively, on a monthly basis.
On a Facebook page, a user can include many different kinds of personal data, including date of birth, hobbies and interests, education, sexual preferences, political and religious affiliations, and current employment. Users can also post photos of themselves as well as of other people, and they can offer other Facebook users the opportunity to search for and communicate with them via the website. Researchers have realized that the plenty of personal data on Facebook, as well as on other social networking platforms, can easily be collected, or mined, to search for patterns in people's behavior. For example, social researchers at various universities around the world have collected data from Facebook pages to become familiar with the lives and social networks of college students. They have also mined data on MySpace to find out how people express feelings on the web and to assess, based on data posted on MySpace, what youths think about appropriate internet conduct.
Because academic specialists, particularly those in the social sciences, are collecting data from Facebook and other internet websites and publishing their findings, numerous university Institutional Review Boards (IRBs), councils charged by government guidelines with reviewing research involving human subjects, have built up policies and procedures that govern research on the internet. Some have made policies specifically relating to data mining on social media platforms like Facebook. These policies serve as institution-specific supplements to the Department of Health and Human Services (HHS) guidelines governing the conduct of research with human subjects. The formation of these institution-specific policies shows that at least some university IRBs view data mining on Facebook as research with human subjects. Thus, at the universities where this is the case, research involving data mining on Facebook must undergo IRB review before the research may start.
According to the HHS guidelines, all research with human subjects must undergo IRB review and receive IRB approval before the research may start. The regulatory requirement seeks to ensure that human-subjects research is conducted as ethically as possible, specifically requiring that subjects' participation in research is voluntary, that the risks to subjects are proportionate to the benefits, and that no subject population is unfairly excluded from or included in the research.
Social Media Data Mining Methods
Applying data mining techniques to social media is relatively new compared to other fields of research related to social network analysis, especially when we acknowledge that research in social network analysis dates back to the 1930s. Applications that use data mining techniques developed by industry and academia are already being used commercially. For example, a "social media analytics" organization offers services that scan and track social media to provide customers with data about how goods and services are recognized and discussed across social media networks. Analysts in such organizations have applied text mining algorithms and propagation-detection models to blogs to create techniques for better understanding how data moves through the blogosphere.
Data mining techniques can be applied to social media sites to understand the data better and to make use of the data for analytics, research, and business purposes. Representative areas include community or group detection, information diffusion, influence propagation, topic detection and tracking, individual behavior analysis, group behavior analysis, and market research for organizations.
Representation of Data
As with other social media data, it is common to use a graph representation to study social media data sets. A graph comprises a set of vertices (nodes) and edges (links). Users are usually represented as the nodes of the graph, and relationships or collaborations between individuals (nodes) are represented as the links of the graph.
The graph depiction is natural for information extracted from social networking sites, where people interact with friends, family, and business associates; it directly yields a social network of friends, family, or business associates. Less apparent is how the graph structure applies to blogs, wikis, opinion mining, and similar types of online social media platforms. If we consider blogs, one graph representation has blogs as the nodes and can be regarded as a "blog network," while another graph representation has blog posts as the nodes and can be regarded as a "post network." Edges are created in a post network when one blog post references another blog post. Other techniques used to represent blog networks account for individuals, relationships, content, and time simultaneously, called Internet Online Analytical Processing (iOLAP). Wikis can be considered in the context of depicting authors as nodes, with edges created when authors contribute to the same object. The graphical representation allows the application of classic mathematical graph theory, traditional techniques for analyzing social networks, and work on mining graph data.
The potentially large size of the graph used to depict a social media platform can present difficulties for automated processing, as limits on computer memory and processing speed are easily exceeded when trying to cope with huge social media data sets. Other challenges to implementing automated procedures for social media data mining include identifying and dealing with spam, the variety of formats used within the same subcategory of social media, and continuously changing content and structure.
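As a small illustration of the graph representation, the sketch below builds a toy user graph with networkx (an assumption; the user names are invented) and computes a classic centrality measure and the connected components.

```python
# Users as nodes, friendships/interactions as edges.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("alice", "bob"), ("alice", "carol"),
                  ("bob", "carol"), ("carol", "dave")])

print(nx.degree_centrality(g))              # a classic centrality measure per user
print(list(nx.connected_components(g)))     # coarse groups/communities in the graph
```

For a post network, where edges are directed references from one post to another, a directed graph (nx.DiGraph) would be the natural choice instead.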
The problem itself will determine the best approach. There is no substitute for understanding the data as much as possible before applying data mining techniques, as well as understanding the various data mining tools that are available. A subject matter expert may be required to help better understand the data set. To better understand the various tools available for data mining, there are a host of data mining and machine learning texts and other resources that provide detailed information about a variety of specific data mining techniques and algorithms.
Once you understand the issues and select an appropriate data mining approach, consider any preprocessing that needs to be done. A systematic process may also be required to develop an adequate data set that allows reasonable processing times. Preprocessing should include suitable privacy protection mechanisms. Although social media platforms contain huge amounts of openly accessible data, it is important to guarantee that individual rights and social media platform copyrights are protected. The effect of spam should be considered, along with the temporal representation.
In addition to preprocessing, it is essential to think about the effect of time. Depending on the inquiry and the research, we may get different outcomes at one time compared to another. While the time component is an obvious consideration for specific areas, for example, topic detection, influence propagation, and network evolution, its effect is less evident for network identification, group behavior, and marketing. What defines a network at one point in time can be significantly different at another point in time. Group behavior and interests will change over time, and what appealed to individuals or groups today may not be trendy tomorrow.
With the data depicted as a graph, many tasks start with a selected set of nodes known as seeds. The graph is traversed starting from the seed set, and as the link structure from the seed nodes is followed, data is collected and the structure itself is also recorded. Utilizing the link structure to stretch out from the seed set and gather new information is known as crawling the network. The applications and algorithms that implement a crawler must effectively manage the challenges present in dynamic social media platforms, such as restricted sites, format changes, and structural errors (invalid links). As the crawler finds new data, it stores it in a repository for further analysis, and as link data is found, the crawler updates its picture of the network structure.
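To make the crawling idea concrete, here is a minimal Python sketch of a breadth-first crawl from a set of seed nodes; fetch_neighbors is a hypothetical placeholder for whatever platform API call or page scrape a real crawler would use.
```python
from collections import deque

def fetch_neighbors(node):
    # Hypothetical placeholder: a real crawler would call a platform API
    # or parse a page here and return the nodes linked from `node`.
    return []

def crawl(seeds, max_nodes=1000):
    """Breadth-first crawl starting from a set of seed nodes."""
    visited = set(seeds)
    queue = deque(seeds)
    edges = []                                  # discovered link structure
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbor in fetch_neighbors(node):
            edges.append((node, neighbor))      # record each link as it is found
            if neighbor not in visited:         # store new nodes for later expansion
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, edges
```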
Some social media platforms such as Facebook, Twitter, and Technorati provide Application Programmer Interfaces (APIs) that allow crawler applications to interact with the data sources directly. However, these platforms usually restrict the number of API transactions per day, depending on the affiliation the API user has with the platform. For some platforms, it is possible to collect data (crawl) without utilizing APIs. Given the huge size of the social media data available, it might be necessary to restrict the amount of data that the crawler collects. Once the crawler has collected the data, some postprocessing may be needed to validate and clean it up. Traditional social network analysis methods can then be applied, for example centrality measures and group structure studies. In many cases, additional data will be related to a node or a link, opening opportunities for more complex methods that consider the deeper semantics exposed with text and data mining techniques.
We now focus on two particular types of social media data to further illustrate how data mining techniques are applied to social media sites. The two major areas are social networking platforms and blogs; both are powerful and rich data sources, and both offer potential value to the broader scientific community as well as to business organizations.
The figure illustrates a hypothetical graph structure for a typical social media platform; arrows indicate links to a larger part of the graph.
It is important to protect personal identity when working with social media data. Recent reports highlight the need to protect privacy, since it has been demonstrated that even anonymized data of this sort can still reveal personal information when advanced data analysis strategies are used. Privacy settings can also restrict the ability of data mining applications to consider every piece of data on a social media platform; however, malicious techniques can be used to circumvent those settings.
Clustering in Data Mining
Clustering is an unsupervised machine learning-based algorithm that groups data points into clusters so that objects in the same group are similar to one another.
Clustering splits data into several subsets. Each subset contains data similar to each other, and these subsets are called clusters.
Let's understand this with an example. Suppose we are a marketing manager, and we have a tempting new product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base? Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Clustering, falling under the category of unsupervised machine learning, is one of the problems that machine learning algorithms solve. It uses only the input data to determine patterns, anomalies, or similarities.
• The intra-cluster similarity is high, which means that the data present inside a cluster is similar to one another.
• The inter-cluster similarity is low, which means each cluster holds data that is not similar to the data in other clusters.
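As a minimal sketch of how such clusters could be produced in practice, the following example uses scikit-learn's KMeans (assuming scikit-learn is installed); the customer features are invented purely for illustration.
```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [age, annual spend in thousands]
customers = np.array([
    [23, 1.5], [25, 1.7], [41, 6.2],
    [43, 5.9], [60, 3.1], [62, 2.8],
])

# Partition the customers into 3 clusters of similar behaviour.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" customer of each cluster
```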
What is a Cluster?
• A cluster is a subset of similar objects
• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
• A connected region of a multidimensional space with a comparatively high density of
objects.
Important points:
• The data objects of a cluster can be considered as one group.
• While doing cluster analysis, we first partition the data set into groups based on data similarities and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out important characteristics that differentiate distinct groups.
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow approximately in line with the complexity order of the algorithm. For example, K-means clustering is O(n), where n is the number of objects in the data, so if we raise the number of data objects tenfold, the time taken to cluster them should also increase approximately ten times; there should be a linear relationship. If that is not the case, then there is some error in our implementation. If an algorithm does not scale, we cannot get an appropriate result on large data; the figure illustrates a graphical example where poor scalability may lead to the wrong result.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only small spherical clusters.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional data spaces.
Keyword-based Association Analysis:
It collects sets of keywords or terms that often occur together and then discovers the association relationships among them. First, it preprocesses the text data by parsing, stemming, removing stop words, etc. Once the data has been pre-processed, it applies association mining algorithms. Here, human effort is not required, so the number of unwanted results and the execution time are reduced.
Document Classification Analysis:
This analysis is used for the automatic classification of huge numbers of online text documents, such as web pages and emails. Text document classification differs from the classification of relational data, since document databases are not organized according to attribute-value pairs.
Numericizing text:
• Stemming algorithms
A significant pre-processing step before the indexing of input documents begins with the stemming of words. The term "stemming" can be defined as a reduction of words to their root forms. For example, different grammatical forms of a word, such as "ordered," "ordering," and "order," are reduced to the same root. The primary purpose of stemming is to ensure that similar words are recognized as one word by the text mining program.
• Support for different languages:
There are some highly language-dependent operations such as stemming, synonyms, the
letters that are allowed in words. Therefore, support for various languages is important.
• Exclude certain characters:
Excluding numbers, specific characters, sequences of characters, or words that are shorter or longer than a specific number of letters can be done before the indexing of the input documents.
• Include lists, exclude lists (stop-words):
A specific list of words to be indexed can be defined, which is useful when we want to search for particular words and classify the input documents based on the frequencies with which those words occur. Additionally, "stop words," i.e., terms that are to be excluded from the indexing, can be defined. Typically, a default list of English stop words includes "the," "a," "since," and so on. These words are used very often in the respective language but communicate very little information in the document.
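A minimal preprocessing sketch along these lines, assuming the NLTK library is installed and its "stopwords" corpus has been downloaded (nltk.download('stopwords')):
```python
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

text = "The ordered items were ordering errors since orders arrived late"

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))   # "the", "a", "since", ...

tokens = text.lower().split()
# Drop stop words, then reduce each remaining word to its root form.
processed = [stemmer.stem(t) for t in tokens if t not in stop_words]
print(processed)   # e.g. ['order', 'item', 'order', 'error', 'order', 'arriv', 'late']
```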
Bagging Vs Boosting
We all use the decision tree technique in day-to-day life to make decisions. Organizations use supervised machine learning techniques like decision trees to make better decisions and to generate more surplus and profit.
Ensemble methods combine several decision trees to deliver better predictive results than a single decision tree. The primary principle behind an ensemble model is that a group of weak learners come together to form a stronger learner.
There are two techniques, given below, that are used to build an ensemble of decision trees.
Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. Here the concept is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is then used to train its own decision tree, so we end up with an ensemble of various models. The average of all the predictions from the numerous trees is used, which is more robust than a single decision tree.
Random Forest is an extension of bagging. It takes one additional step: in addition to drawing a random subset of the data, it also makes a random selection of features rather than using all features to grow the trees. When we have numerous such random trees, the result is called a Random Forest.
These are the steps taken to implement a Random Forest:
• Let us consider X observations and Y features in the training data set. First, a sample from the training data set is taken randomly with replacement.
• A tree is grown to its largest extent.
• The given steps are repeated, and the final prediction is based on the collection of predictions from the n trees.
Since the final prediction is based on the mean of the predictions from the subset trees, it gives an averaged estimate rather than a single precise value for the regression model.
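The steps above can be tried with scikit-learn's random forest implementation; the toy regression data below is invented purely for illustration.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data: 20 observations with 3 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(0, 0.05, 20)

# Each of the 100 trees is grown on a bootstrap sample (random draw with
# replacement) and considers a random subset of features at each split.
forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

# The final prediction is the mean of the predictions of the individual trees.
print(forest.predict(X[:2]))
```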
Boosting:
Boosting is another ensemble procedure for building a collection of predictors. In boosting, we fit consecutive trees, usually on random samples, and at each step the objective is to correct the net error from the prior trees.
If a given input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better performing model.
Gradient boosting utilizes a gradient descent algorithm that can optimize any differentiable loss function. An ensemble of trees is constructed one by one, and the individual trees are summed sequentially. Each new tree tries to recover the loss (the difference between the actual and predicted values).
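A corresponding boosting sketch, using scikit-learn's gradient boosting regressor on invented toy data:
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
X = rng.rand(50, 3)
y = 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 50)

# Trees are added one after another; each new tree is fitted to the residual
# (the difference between the actual and the currently predicted values),
# and the ensemble is updated by gradient descent on the loss function.
booster = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=2, random_state=1
)
booster.fit(X, y)
print(booster.predict(X[:2]))
```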
The two techniques can be contrasted as follows:
• In bagging, various training data subsets are randomly drawn with replacement from the whole training dataset. In boosting, each new subset contains the components that were misclassified by previous models.
• Bagging attempts to tackle the over-fitting issue. Boosting tries to reduce bias.
• If the classifier is unstable (high variance), we apply bagging. If the classifier is stable and simple (high bias), we apply boosting.
• In bagging, every model receives an equal weight. In boosting, models are weighted by their performance.
• The objective of bagging is to decrease variance, not bias. The objective of boosting is to decrease bias, not variance.
• Bagging is the easiest way of combining predictions that belong to the same type. Boosting is a way of combining predictions that belong to different types.
Data Warehouse:
A data warehouse refers to a place where data can be stored for useful mining. It is like a fast computer system with an exceptionally large data storage capacity. Data from the organization's various systems is copied to the warehouse, where it can be fetched and cleaned to remove errors. Advanced queries can then be run against the warehouse's store of data.
A data warehouse combines data from numerous sources, ensuring data quality, accuracy, and consistency. A data warehouse also boosts system performance by separating analytics processing from transactional databases. Data flows into a data warehouse from different databases. A data warehouse works by organizing data into a schema that describes the format and types of data, and query tools examine the data tables using that schema.
Data warehouses and databases are both relational data systems, but they are built to serve different purposes. A data warehouse is built to store a huge amount of historical data and empowers fast queries across all the data, typically using Online Analytical Processing (OLAP). A database is made to store current transactions and allow quick access to specific transactions for ongoing business processes, commonly known as Online Transaction Processing (OLTP).
1. Subject Oriented
A data warehouse is subject-oriented. It provides useful data about a subject rather than the company's ongoing operations; these subjects can be customers, suppliers, marketing, product, promotion, etc. A data warehouse usually focuses on the modeling and analysis of data that helps the business organization make data-driven decisions.
2. Time-Variant:
The data present in the data warehouse provides information for a specific period.
3. Integrated
A data warehouse is built by integrating data from heterogeneous sources, such as relational databases, flat files, etc.
4. Non-Volatile
Non-volatile means that once data is entered into the data warehouse, it is not erased or changed; previous data is retained when new data is added.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge sets of data that have either been compiled by computer systems or downloaded into the computer. In the data mining process, the computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among data sets.
Data mining aims to enable business organizations to view business behaviors, trends, and relationships that allow them to make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and machine learning systems to discover the relationships between data. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
i. Market Analysis:
Data mining can predict the market, which helps the business make decisions. For example, it predicts who is keen to purchase what type of products.
ii. Fraud Detection:
Data mining methods can help to find which cellular phone calls, insurance claims, and credit or debit card purchases are likely to be fraudulent.
iii. Financial Market Analysis:
Data mining techniques are widely used to help model financial markets.
iv. Trend Analysis:
Analyzing the current trends in the marketplace is a strategic benefit because it helps in cost reduction and in adapting the manufacturing process to market demand.
Data Mining vs Data Warehouse:
• Data mining is the process of determining data patterns. A data warehouse is a database system designed for analytics.
• Business entrepreneurs carry out data mining with the help of engineers. Data warehousing is entirely carried out by engineers.
• In data mining, data is analyzed repeatedly. In data warehousing, data is stored periodically.
• Data mining uses pattern recognition techniques to identify patterns. Data warehousing is the process of extracting and storing data to allow easier reporting.
• One of the most valuable data mining techniques is the detection and identification of unwanted errors that occur in the system. One of the advantages of the data warehouse is its ability to be updated frequently, which is why it is ideal for business entrepreneurs who want to stay up to date.
• Companies can benefit from this analytical tool by being equipped with suitable and accessible knowledge-based data. A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
Social Media Mining
Social media is a great source of information and a perfect platform for communication. Businesses and individuals can make the most of it instead of only sharing their photos and videos on the platform. The platform gives its users the freedom to connect with their target group easily. Whether a small group or an established business, both face difficulties in standing out in the competitive social media industry, but through social media platforms users can market and develop their brand or content.
Social media mining combines social media platforms, social network analysis, and data mining to provide a convenient and consistent platform for learners, professionals, scientists, and project managers to understand the fundamentals and potential of social media mining. It addresses various problems arising from social media data and presents fundamental concepts, emerging issues, and effective algorithms for data mining and network analysis. It includes multiple degrees of difficulty that enhance knowledge and help in applying ideas, principles, and techniques in distinct social media mining situations.
As per the "Global Digital Report," the total number of active users on social media
platforms worldwide in 2019 is 2.41 billion and increases up to 9 % year-on-year. With the
universal use of Social media platforms via the internet, a huge amount of data is accessible.
Social media platforms include many fields of study, such as sociology, business,
psychology, entertainment, politics, news, and other cultural aspects of societies. Applying
data mining to social media can provide exciting views on human behavior and human
interaction. Data mining can be used in combination with social media to understand user's
opinions about a subject, identifying a group of individuals among the masses of a
population, to study group modifications over time, find influential people, or even suggest a
product or activity to an individual.
For example, the 2008 presidential election in the United States marked an unprecedented use of social media platforms. Social media platforms, including Facebook and YouTube, played a vital role in raising funds and getting candidates' messages to voters. Researchers extracted blog data to demonstrate correlations between the amount of social media used by candidates and the winner of the 2008 presidential campaign.
This example emphasizes the potential of mining social media data to forecast results at the national level. Data mining of social media can also produce personal and corporate benefits.
Social media mining is closely related to social computing. Social computing is defined as "any computing application where software is used as an intermediary or a focus for a social relationship." Social computing involves applications used for interpersonal communication as well as applications and research activities related to "computational social studies" or social behavior.
Social media platforms refer to various kinds of information services used collaboratively by many people and can be placed into a number of subcategories, such as blogs, wikis, and social networking sites.
With popular traditional media such as radio, newspapers, and television, communication is entirely one-way, flowing from the media source or advertiser to the mass of media consumers. Web 2.0 technologies and modern social media platforms have changed the scene, moving from one-way communication driven by media providers to an environment where almost anyone can publish written, audio, video, or image content to the masses.
This media environment is significantly changing the way businesses communicate with their clients. It provides unprecedented opportunities for individuals to interact with huge numbers of people at very low cost. The relationships formed online and expressed through social media platforms produce digitized data sets of social relationships at scale. The resulting data offers rich opportunities for sociology and for insights into consumer behavior and marketing, among a host of applications in related fields.
The growth in the number of users on social media platforms is incredible. For example, consider the most prominent social networking site, Facebook. Facebook reached over 400 million active users during its first six years of operation and has been growing exponentially; the figure illustrates the exponential growth of Facebook over its first six years. As per the report, Facebook is ranked 2nd in the world for websites based on the daily traffic engagement of users on the site.
The broad use of social media platforms is not limited to one geographical region of the world. Orkut, a popular social networking platform operated by Google, has most of its users outside the United States, and the use of social media among Internet users is now mainstream in many parts of the globe, including countries in Asia, Africa, Europe, South America, and the Middle East. Social media also drives significant changes in companies, and businesses need to decide on their policies to keep pace with this new media.
Data mining techniques can assist effectively in dealing with the three primary challenges of social media data. First, social media data sets are large; consider the example of the most popular social media platform, Facebook, with 2.41 billion active users. Without automated data processing, analyzing social media data becomes infeasible in any reasonable time frame.
Second, social media data sets can be noisy. For example, spam blogs are abundant in the blogosphere, as are unimportant tweets on Twitter.
Third, data from online social media platforms are dynamic; frequent modifications and updates over short periods are not only common but also a significant aspect to consider when dealing with social media data.
Applying data mining methods to these huge data sets can improve search results for everyday search engines, enable targeted marketing for businesses, help psychologists study behavior, personalize consumer web services, provide new insights into social structure for sociologists, and help identify and prevent spam for all of us.
Moreover, open access to data offers an unprecedented amount of material for researchers to improve efficiency and optimize data mining techniques. The progress of data mining depends on huge data sets, and social media is an optimal data source at the cutting edge of data mining for developing and testing new data mining techniques for academic and industry data mining analysts.
Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian classifiers are statistical classifiers built on the Bayesian understanding of probability. The theorem expresses how a degree of belief, expressed as a probability, should change to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:
P(X/Y) = [P(Y/X) × P(X)] / P(Y)
where
P(X/Y) is a conditional probability: the probability of event X occurring given that Y is true.
P(Y/X) is a conditional probability: the probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are known as the marginal probabilities.
Bayesian interpretation:
In the Bayesian interpretation, the theorem can also be written in terms of the joint probability, where P(X⋂Y) is the joint probability of both X and Y being true, because P(X⋂Y) = P(Y/X) P(X) = P(X/Y) P(Y).
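A tiny worked example of the rule above in Python; the spam/"offer" probabilities are invented purely for illustration.
```python
def bayes(p_y_given_x, p_x, p_y):
    """P(X/Y) = P(Y/X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# Suppose 1% of emails are spam (X), 90% of spam contains the word "offer" (Y),
# and 5% of all emails contain the word "offer".
p_spam_given_offer = bayes(p_y_given_x=0.9, p_x=0.01, p_y=0.05)
print(p_spam_given_offer)   # 0.18, i.e. an 18% posterior probability of spam
```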
Bayesian network:
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.
The nodes here represent random variables, and the edges define the relationship between
these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
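As a minimal sketch of a CPT in plain Python, the two-node network Rain → WetGrass and its probabilities below are invented for illustration.
```python
# Prior distribution of the parent node Rain.
p_rain = {True: 0.2, False: 0.8}

# Conditional Probability Table for WetGrass given Rain:
# each row of the table is indexed by the state of the parent.
cpt_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},   # P(WetGrass | Rain=True)
    False: {True: 0.2, False: 0.8},   # P(WetGrass | Rain=False)
}

# Marginal probability of wet grass, obtained by summing over the parent states.
p_wet = sum(p_rain[r] * cpt_wet_given_rain[r][True] for r in (True, False))
print(p_wet)   # 0.2*0.9 + 0.8*0.2 = 0.34
```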
Over the last few years, the World Wide Web has become a significant source of information and simultaneously a popular platform for business. Web mining can be defined as the method of utilizing data mining techniques and algorithms to extract useful information directly from the web, such as web documents and services, hyperlinks, web content, and server logs. The World Wide Web contains a large amount of data that provides a rich source for data mining. The objective of web mining is to look for patterns in web data by collecting and examining data in order to gain insights.
Web content mining can be used to extract useful data, information, and knowledge from web page content. In web content mining, each web page is considered as an individual document. One can take advantage of the semi-structured nature of web pages, as HTML provides information that concerns not only the layout but also the logical structure. The primary task of content mining is data extraction, where structured data is extracted from unstructured websites. The objective is to facilitate data aggregation over various web sites by using the extracted structured data. Web content mining can also be utilized to distinguish topics on the web; for example, if a user searches for a specific topic on a search engine, the user will get a list of suggestions.
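A minimal content-extraction sketch, assuming the requests and BeautifulSoup (bs4) libraries are installed; the URL is a placeholder.
```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and exploit its semi-structured HTML to pull out structured data.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string if soup.title else ""                   # logical structure: <title>
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(title, headings[:5], links[:5])
```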
Web structure mining can be used to discover the link structure of hyperlinks. It is used to identify the relationships between web pages, whether through direct links or the wider link network. In web structure mining, one considers the web as a directed graph, with the web pages being the vertices that are connected by hyperlinks. The most important application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm. It characterizes a page as highly relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined; for example, web structure mining can help organizations examine the network between two commercial sites.
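A minimal power-iteration sketch of the PageRank idea on a tiny hypothetical link graph (a simplified illustration, not Google's production algorithm):
```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share   # a page gains rank when linked by ranked pages
        rank = new_rank
    return rank

# Hypothetical three-page web: A and C link to B, B links back to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))
```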
Web usage mining is used to extract useful data, information, and knowledge from weblog records and assists in recognizing user access patterns for web pages. In mining the usage of web resources, one considers the records of requests made by visitors to a website, which are often collected in web server logs. While the content and structure of the collection of web pages follow the intentions of the pages' authors, the individual requests demonstrate how consumers actually use these pages. Web usage mining may therefore disclose relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze web usage patterns are given below:
• Session and visitor analysis: After this analysis, a report is created containing the details of repeatedly visited web pages and of common entry and exit points.
• OLAP (Online Analytical Processing): OLAP can be performed on various parts of the log-related data over a specific period.
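A small sketch of how repeatedly visited pages might be counted from a simplified web server log; the log format here is an assumption, as real server logs vary.
```python
from collections import Counter

def top_pages(log_lines, n=10):
    """Count requests per page from simplified log lines: 'visitor_ip page timestamp'."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1]] += 1          # the requested page
    return counts.most_common(n)           # the most repeatedly visited pages

sample_log = [
    "10.0.0.1 /home 2019-07-01T10:00",
    "10.0.0.1 /products 2019-07-01T10:01",
    "10.0.0.2 /home 2019-07-01T10:02",
]
print(top_pages(sample_log))   # [('/home', 2), ('/products', 1)]
```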
• Complexity of web pages:
The site pages do not have a unifying structure. They are extremely complicated compared to traditional text documents. There are enormous numbers of documents in the digital library of the web, and these libraries are not organized according to any specific order.
• The web is dynamic:
The data on the internet is quickly updated, for example news, climate, shopping, financial news, sports, and so on.
• Diversity of client communities:
The client community on the web is quickly expanding. These clients have different interests, backgrounds, and usage purposes. There are over a hundred million workstations connected to the internet, and the number is still increasing tremendously.
• Relevancy of data:
It is considered that a specific person is generally interested in only a small portion of the web, while the rest of the web contains data that is not relevant to the user and may lead to unwanted results.
• The web is too broad:
The size of the web is tremendous and rapidly increasing. It appears that the web is too huge
for data warehousing and data mining.