Data mining M1

Data mining, also known as Knowledge Discovery in Databases (KDD), is a process that extracts valuable information from large datasets to identify patterns and trends for data-driven decision-making. It involves various techniques such as classification, clustering, regression, and association rules, and is widely applied in fields like healthcare, marketing, and finance. While data mining offers numerous advantages, including cost efficiency and improved decision-making, it also faces challenges such as data privacy concerns, incomplete data, and the complexity of data management.


Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

What is Data Mining?


Data Mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful insights that allow a business to make data-driven decisions.

In other words, Data Mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information. This information is collected and assembled in particular areas such as data warehouses, analyzed with efficient data mining algorithms, and used to support decision-making and other data requirements, eventually cutting costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that can be simple or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There is an enormous amount of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data and extract the important information that can be used to solve a problem or support company development. Many powerful instruments and techniques are available to mine data and derive better insights from it.
Types of Data Mining
Data mining can be performed on the following types of data:

Relational Database:

A relational database is a collection of multiple data sets formally organized in tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.

Data warehouses:

A Data Warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.

Data Repositories:

The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more precisely to refer to a specific kind of setup within an IT structure, for example, a group of databases in which an organization has kept various kinds of information.

Object-Relational Database:
A combination of an object-oriented database model and relational database model is called
an object-relational model. It supports Classes, Objects, Inheritance, etc.

One of the primary objectives of the Object-relational data model is to close the gap between
the Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.

Transactional Database:

A transactional database refers to a database management system (DBMS) that has the ability to roll back a database transaction if it is not completed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.

Advantages of Data Mining


• The Data Mining technique enables organizations to obtain knowledge-based data.
• Data mining enables organizations to make lucrative modifications in operation and production.
• Compared with other statistical data applications, data mining is cost-efficient.
• Data Mining helps the decision-making process of an organization.
• It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
• It can be introduced into new systems as well as existing platforms.
• It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

Disadvantages of Data Mining


• There is a probability that organizations may sell useful data about their customers to other organizations for money. According to reports, American Express has sold credit card purchase data of its customers to other organizations.
• Many data mining analytics software packages are difficult to operate and require advanced training to work with.
• Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very challenging task.
• Data mining techniques are not always precise, and so they may lead to severe consequences in certain conditions.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands, such as retail, communications, financial, and marketing companies, to determine price, consumer preferences, product positioning, and impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.
These are the following areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics to gain better insights and to identify best practices that will enhance healthcare services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the volume of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. This data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Using differential analysis, a comparison of results between various stores and between customers in different demographic groups can also be carried out.

Data mining in Education:

Educational data mining is a newly emerging field concerned with developing techniques that extract knowledge from data generated in educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An institution can use data mining to make precise decisions and to predict students' results. With these results, the institution can concentrate on what to teach and how to teach it.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company. Data mining tools can
be beneficial to find patterns in a complex manufacturing process. Data mining can be used in
system-level designing to obtain the relationships between product architecture, product
portfolio, and data needs of the customers. It can also be used to forecast the product
development period, cost, and expectations among the other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are somewhat time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the technique is then used to identify whether a new record is fraudulent or not.

Data Mining in Lie Detection:

Apprehending a criminal is not a big deal, but bringing out the truth from him is a very
challenging task. Law enforcement may use data mining techniques to investigate offenses,
monitor suspected terrorist communications, etc. This technique includes text mining also,
and it seeks meaningful patterns in data, which is usually unstructured text. The information
collected from the previous investigations is compared, and a model for lie detection is
constructed.

Data Mining Financial Banking:

The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives, because the volume of data is too large or it is generated too quickly for experts to screen. Managers may use these findings to better target, acquire, retain, and segment profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc. The
process of data mining becomes effective when the challenges or problems are correctly
recognized and adequately resolved.

Incomplete and noisy data:

The process of extracting useful data from large volumes of data is data mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to faults in data-measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees enter this information into their system. A person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could even be changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.

Data Distribution:
Real-world data is usually stored on various platforms in distributed computing environments. It might be in databases, individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it reveals
data about buying habits and preferences of the customers without their permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary method of presenting the output to the user in an understandable way. The extracted data should convey the exact meaning of what it intends to express. But many times, representing the information to the end user in a precise and simple way is difficult. Because both the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.

There are many more challenges in data mining in addition to the above-mentioned problems. More problems are revealed as the actual data mining process begins, and the success of data mining relies on overcoming all these difficulties.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as neural
networks or decision trees. Thus, data mining incorporates analysis and prediction.

Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what methods do they use to make it happen?

In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.

1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data in different classes.
Data mining techniques can be classified by different criteria, as follows (an illustrative classification sketch follows this list):

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is based on the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks are comprehensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is based on the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
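
The sketch below is not from the original text; it is a minimal illustration of classification using scikit-learn (assumed to be installed) and its bundled iris data set. A decision tree learns to assign each sample to one of several known classes and is then scored on held-out data.

    # Minimal classification sketch (illustrative only, scikit-learn assumed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                     # class-labeled sample data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("Test accuracy:", clf.score(X_test, y_test))    # fraction of correctly classified samples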

2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details, but it achieves simplification: the data is modeled by its clusters. From a historical point of view, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.

In other words, clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
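
As a minimal sketch (not part of the original text), the following uses k-means from scikit-learn on made-up customer measurements to group similar records together without any class labels.

    # Minimal clustering sketch (illustrative data, scikit-learn assumed).
    import numpy as np
    from sklearn.cluster import KMeans

    # Each row is one customer: [annual spend, number of visits] (made-up values).
    customers = np.array([[200, 4], [220, 5], [800, 30], [790, 28], [210, 6], [805, 31]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # group membership for each customer
    print(kmeans.cluster_centers_)  # the "profile" of each discovered group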

3. Regression:
Regression analysis is the data mining technique used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
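
A brief illustrative sketch (hypothetical numbers, scikit-learn assumed): a simple linear regression that projects sales from advertising spend, in the spirit of the cost-projection example above.

    # Minimal regression sketch (hypothetical data).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # advertising spend, in $1000s
    sales = np.array([12.1, 18.9, 26.0, 32.8, 40.2])        # observed sales

    model = LinearRegression().fit(spend, sales)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("projected sales at $6k spend:", model.predict([[6.0]])[0])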

4. Association Rules:
This data mining technique helps to discover links between two or more items and finds hidden patterns in the data set.

Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transactional data or patterns in medical data sets.

The way the algorithm works is that you start with a collection of data, for example, a list of grocery items that have been bought over the last six months, and it calculates the percentage of items being purchased together.

These are the three major measurement techniques (a small worked example follows the list):

• Support:
This measures how often items A and B are purchased together, relative to the entire dataset.
Support = (Transactions containing Item A and Item B) / (Entire dataset)
• Confidence:
This measures how often item B is purchased when item A is purchased.
Confidence = (Transactions containing Item A and Item B) / (Transactions containing Item A)
• Lift:
This compares the confidence with how often item B is purchased on its own.
Lift = Confidence / ((Transactions containing Item B) / (Entire dataset))
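
The worked example below is illustrative only (made-up transactions): it computes support, confidence, and lift for the hypothetical rule {bread} -> {butter} directly from the definitions above.

    # Hypothetical worked example for the rule {bread} -> {butter}.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
        {"bread", "milk"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "butter"} <= t)   # A and B together
    bread = sum(1 for t in transactions if "bread" in t)              # item A
    butter = sum(1 for t in transactions if "butter" in t)            # item B

    support = both / n                  # 2/5 = 0.40
    confidence = both / bread           # 2/4 = 0.50
    lift = confidence / (butter / n)    # 0.50 / 0.60 ≈ 0.83 (below 1: no positive association)

    print(support, confidence, lift)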

5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, and detecting outlying values in wireless sensor network data.
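
A minimal sketch (illustrative values, not from the original text): a simple z-score rule flags points that lie far from the rest of the data, which is the basic idea behind many outlier detectors.

    # Minimal outlier-detection sketch: points more than two standard
    # deviations from the mean are flagged as outliers (illustrative data).
    import numpy as np

    values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.7, 10.2])
    z_scores = (values - values.mean()) / values.std()
    outliers = values[np.abs(z_scores) > 2]   # 25.7 diverges from the rest
    print(outliers)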

6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.

In other words, this data mining technique helps to discover or recognize similar patterns in transaction data over a period of time.
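
As a small illustrative sketch (hypothetical purchase sequences, not from the original text), the occurrence frequency of one candidate subsequence can be counted by checking, for each customer, whether the second item appears somewhere after the first.

    # Counting how often the subsequence phone -> charger occurs (illustrative data).
    sequences = [
        ["phone", "case", "charger"],
        ["laptop", "mouse"],
        ["phone", "charger"],
        ["phone", "headphones"],
    ]

    count = 0
    for seq in sequences:
        # the pattern holds if "charger" appears somewhere after "phone"
        if "phone" in seq and "charger" in seq[seq.index("phone") + 1:]:
            count += 1

    print("phone -> charger occurs in", count, "of", len(sequences), "sequences")  # 2 of 4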

7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

Data Mining Implementation Process


Many different sectors are taking advantage of data mining to boost their business efficiency, including manufacturing, chemicals, marketing, aerospace, etc. Therefore, the need for a standard data mining process increased significantly. Data mining techniques must be reliable and repeatable by people with little or no knowledge of the data mining background. As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was conceived in 1996 and refined through many workshops, with contributions from more than 300 organizations.

Data mining is described as a process of finding hidden, valuable information by evaluating the huge quantities of data stored in data warehouses, using multiple data mining techniques such as Artificial Intelligence (AI), machine learning, and statistics.
Let's examine the implementation process for data mining in detail:

The Cross-Industry Standard Process for Data Mining


(CRISP-DM)
The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases organized as a cyclical process:
1. Business understanding:

This phase focuses on understanding the project goals and requirements from a business point of view, then converting this knowledge into a data mining problem definition and a preliminary plan designed to accomplish the objectives.

Tasks:

• Determine business objectives
• Assess the situation
• Determine data mining goals
• Produce a project plan

Determine business objectives:

• It understands the project targets and prerequisites from a business point of view.
• It thoroughly understands what the customer wants to achieve.
• It reveals significant factors at the start that can impact the outcome of the project.

Assess the situation:

• This requires a more detailed analysis of facts about all the resources, constraints, assumptions, and other factors that ought to be considered.
Determine data mining goals:

• A business goal states the objective in business terminology, for example, increase catalog sales to existing customers.
• A data mining goal states the project objective in technical terms, for example, predict how many items a customer will buy, given their demographic details (age, salary, and city) and the price of the item over the past three years.

Produce a project plan:

• It states the intended plan for achieving the business and data mining goals.
• The project plan should describe the expected set of steps to be performed during the rest of the project, including the choice of techniques and tools.

2. Data Understanding:

Data understanding starts with initial data collection and proceeds with activities to get familiar with the data, identify data quality issues, discover first insights into the data, or detect interesting subsets to form hypotheses about hidden information.

Tasks:

• Collect initial data
• Describe data
• Explore data
• Verify data quality

Collect initial data:

• It acquires the data listed in the project resources.
• It includes data loading, if necessary for data understanding.
• It may lead to initial data preparation steps.
• If various data sources are acquired, then integration is an additional issue, either here or at the subsequent data preparation stage.

Describe data:

• It examines the "gross" or "surface" characteristics of the acquired data.
• It reports on the outcomes.

Explore data:

• It addresses data mining questions that can be resolved by querying, visualizing, and reporting, including:
o the distribution of important attributes and results of simple aggregations;
o relationships between small numbers of attributes;
o characteristics of important sub-populations and simple statistical analyses.
• It may refine the data mining objectives.
• It may contribute to or refine the data description and quality reports.
• It may feed into the transformation and other necessary data preparation steps (a short exploration sketch follows this list).
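
The sketch below is illustrative only (made-up records and column names, pandas assumed): it shows the kind of simple aggregation, distribution, and sub-population checks this task refers to.

    # Minimal data-exploration sketch (hypothetical attributes).
    import pandas as pd

    df = pd.DataFrame({
        "city":    ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
        "segment": ["retail", "retail", "online", "online", "retail"],
        "spend":   [120.0, 340.5, 89.9, 410.0, 150.2],
    })

    print(df.describe())                          # distribution of numeric attributes
    print(df["city"].value_counts())              # important sub-populations
    print(df.groupby("segment")["spend"].mean())  # a simple aggregation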
Verify data quality:

• It examines the quality of the data and addresses any open questions about it.

3. Data Preparation:

• Data preparation usually takes more than 90 percent of the project time.
• It covers all activities needed to build the final data set from the original raw data.
• Data preparation tasks are likely to be performed several times and not in any prescribed order.

Tasks:

• Select data
• Clean data
• Construct data
• Integrate data
• Format data

Select data:

• It decides which data will be used for the analysis.
• Data selection criteria include relevance to the data mining objectives, quality, and technical limitations such as limits on data volume or data types.
• It covers the selection of attributes (columns) as well as the selection of records (rows) in a table.

Clean data:

• It may involve selecting clean subsets of the data, inserting suitable defaults, or more ambitious techniques such as estimating missing values by modeling (a small cleaning sketch follows).
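
A minimal cleaning sketch (hypothetical column names, pandas assumed): a missing categorical value is filled with a default, and a missing numeric value is estimated from the rest of the data.

    # Minimal data-cleaning sketch (illustrative records).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "city": ["Pune", None, "Delhi", "Pune"]})

    df["city"] = df["city"].fillna("Unknown")       # insert an appropriate default
    df["age"] = df["age"].fillna(df["age"].mean())  # estimate the missing value from the data
    print(df)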

Construct data:

• This comprises constructive data preparation operations, such as generating derived attributes, creating entirely new records, or transforming values of existing attributes.

Integrate data:

• Integrating data refers to the methods whereby data is combined from multiple tables or records to create new records or values.

Format data:

• Formatting data refers mainly to syntactic changes made to the data that do not alter its meaning but may be required by the modeling tool.

4. Modeling:

In this phase, various modeling methods are selected and applied, and their parameters are calibrated to optimal values. Some methods have particular requirements on the form of the data; therefore, stepping back to the data preparation phase may be necessary.
Tasks:

• Select modeling technique
• Generate test design
• Build model
• Assess model

Select modeling technique:

• It selects the actual modeling method to be used, for example, a decision tree or a neural network.
• If several methods are applied, then this task is performed separately for each method.

Generate test Design:

• Generate a procedure or mechanism for testing the model's validity and quality before building the model. For example, in classification, error rates are commonly used as quality measures for data mining models. Therefore, the data set is typically separated into a training set and a test set; the model is built on the training set and its quality assessed on the separate test set (a sketch follows).
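
A minimal sketch (illustrative only, scikit-learn and its bundled data set assumed): a test set is held out and the error rate on it is used as the quality measure; any model could stand in for the logistic regression used here.

    # Minimal test-design sketch: held-out test set, error rate as the quality measure.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    error_rate = 1 - model.score(X_test, y_test)
    print(f"Error rate on the held-out test set: {error_rate:.3f}")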

Build model:

• To create one or more models, we need to run the modeling tool on the prepared data set.

Assess model:

• It interprets the models according to the domain expertise, the data mining success criteria, and the desired test design.
• It assesses the success of the modeling and discovery techniques from a technical standpoint.
• Business analysts and domain specialists are contacted later to discuss the data mining outcomes in the business context.

5. Evaluation:

• It evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved.
• The main objective of the evaluation is to determine whether any significant business issue has not been considered adequately.
• At the end of this phase, a decision on the use of the data mining results should be reached.

Tasks:

• Evaluate results
• Review process
• Determine next steps

Evaluate results:
• It assesses the degree to which the model meets the organization's business objectives.
• It tests the model on trial applications in the actual deployment, when time and budget constraints permit, and also assesses the other data mining results produced.
• It unveils additional challenges, suggestions, or hints for future directions.

Review process:

• The review process does a more thorough evaluation of the data mining engagement to determine whether any significant factor or task has somehow been overlooked.
• It reviews quality assurance issues.

Determine next steps:

• It decides how to proceed at this stage.
• It decides whether to finish the project and move on to deployment, to initiate further iterations, or to set up new data mining projects. This includes an analysis of the remaining resources and budget, which influences the decision.

6. Deployment:

Determine:

• Deployment refers to how the outcomes of data mining need to be utilized.

Deploy data mining results by:

• Scoring a database, utilizing the results as business guidelines, or interactive scoring on the web.
• The knowledge acquired will need to be organized and presented in a way that the client can use. Depending on the demands, the deployment phase may be as simple as generating a report or as complicated as implementing a repeatable data mining process across the organization.

Tasks:

• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project

Plan deployment:

• To deploy the data mining outcomes into the business, this task takes the evaluation results and determines a strategy for deployment.
• It includes documenting the procedure for later deployment.

Plan monitoring and maintenance:

• It is important when the data mining results become part of the day-to-day business and its
environment.
• It helps to avoid unnecessarily long periods of misuse of data mining results.
• It needs a detailed analysis of the monitoring process.

Produce final report:

• A final report can be drawn up by the project leader and his team.
• It may only be a summary of the project and its experience.
• It may be a final and comprehensive presentation of data mining.

Review project:

• The project review assesses what went right and what went wrong, what was done well, and what needs to be improved.

Data mining is a significant method where previously unknown and potentially useful
information is extracted from the vast amount of data. The data mining process involves
several components, and these components constitute a data mining system architecture.

Data Mining Architecture


The significant components of data mining systems are a data source, data mining engine,
data warehouse server, the pattern evaluation module, graphical user interface, and
knowledge base.
Data Source:

The actual source of data is the Database, data warehouse, World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may contain information. Another
primary source of data is the World Wide Web or the internet.

Different processes:

Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure: the data may not be complete or accurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from the various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound; several methods may be applied to the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:

The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.
Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization,
classification, clustering, prediction, time-series analysis, etc.

In other words, the data mining engine is the core of the data mining architecture. It comprises instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring the interestingness of a pattern by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.

This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It might utilize an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is generally recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure in order to confine the search to only the interesting patterns.

Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system
and the user. This module helps the user to easily and efficiently use the system without
knowing the complexity of the process. This module cooperates with the data mining system
when the user specifies a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful throughout the data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module interacts with the knowledge base regularly to obtain inputs and also to update it.
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level applications of
specific Data Mining techniques. It is a field of interest to researchers in various fields,
including artificial intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of
large databases. It does this by using Data Mining algorithms to identify what is deemed
knowledge.

Knowledge Discovery in Databases is considered an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets. Data Mining is the core of the KDD procedure and involves the application of algorithms that explore the data, develop the model, and discover previously unknown patterns. The model is used to extract knowledge from the data, analyze it, and make predictions.

The availability and abundance of data today make knowledge discovery and Data Mining a
matter of impressive significance and need. In the recent development of the field, it isn't
surprising that a wide variety of techniques is presently accessible to specialists and experts.

The KDD Process


The knowledge discovery process is iterative and interactive and comprises nine steps. The process is iterative at each stage, implying that moving back to previous steps might be required. The process has many imaginative aspects in the sense that one cannot present a single formula or a complete scientific categorization of the correct decisions for each step and application type. Thus, it is necessary to understand the process and the different requirements and possibilities at each stage.

The process begins with determining the KDD objectives and ends with the implementation of the discovered knowledge. At that point, the loop is closed, and Active Data Mining starts. Subsequently, changes will need to be made in the application domain, for example, offering various features to cell phone users in order to reduce churn. This closes the loop; the effects are then measured on the new data repositories, and the KDD process begins again. The following is a concise description of the nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain

This is the initial preliminary step. It sets the scene for understanding what should be done with the many decisions ahead (transformation, algorithms, representation, etc.). The people in charge of a KDD venture need to understand and characterize the objectives of the end user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge).

2. Choosing and creating a data set on which discovery will be performed

Once the objectives are defined, the data that will be used for the knowledge discovery process should be determined. This includes discovering what data is available, obtaining important data, and then integrating all of it into one data set, including the attributes that will be considered for the process. This step is important because Data Mining learns and discovers from the available data, which is the evidence base for building the models. If some significant attributes are missing, the entire study may fail; in this respect, the more attributes are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off with the opportunity to best understand the phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD comes into play: one begins with the best available data set and later expands it and observes the effect in terms of knowledge discovery and modeling.

3. Preprocessing and cleansing

In this step, data reliability is improved. It includes data cleansing, for example, handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a Data Mining algorithm in this context. For example, when one suspects that a specific attribute is of insufficient reliability or has many missing values, this attribute could become the target of a supervised Data Mining algorithm: a prediction model is created for the attribute, and the missing values can then be predicted. The extent to which one pays attention to this step depends on many factors. Regardless, studying these aspects is important and often revealing in itself with respect to enterprise data systems.

4. Data Transformation

In this stage, better data for Data Mining is prepared and developed. Techniques here include dimensionality reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be crucial for the success of the entire KDD project, and it is typically very project-specific. For example, in medical examinations, the ratio of attributes may often be the most significant factor, rather than each attribute by itself. In business, we may need to consider effects beyond our control as well as efforts and transient issues, for example, studying the effect of advertising accumulation. However, even if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation needed in the next iteration. Thus, the KDD process feeds back into itself and leads to an understanding of the transformation required (a short transformation sketch follows).
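
The sketch below is illustrative only (hypothetical attributes, pandas assumed): it shows two typical transformations mentioned above, a derived ratio attribute and the discretization of a numerical attribute.

    # Minimal transformation sketch (made-up records and attribute names).
    import pandas as pd

    df = pd.DataFrame({"weight_kg": [60, 85, 72],
                       "height_m": [1.65, 1.80, 1.75],
                       "age": [23, 47, 35]})

    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2          # ratio of attributes, not each one alone
    df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 100],
                            labels=["young", "middle", "senior"])  # discretization of a numerical attribute
    print(df)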

5. Prediction and description

We are now prepared to decide which kind of Data Mining to use, for example, classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two major goals in Data Mining: the first is prediction, and the second is description. Prediction is usually referred to as supervised Data Mining, while descriptive Data Mining includes the unsupervised and visualization aspects of Data Mining. Most Data Mining techniques are based on inductive learning, where a model is built explicitly or implicitly by generalizing from a sufficient number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of available data.

6. Selecting the Data Mining algorithm

Having chosen the strategy, we now decide on the tactics. This stage includes selecting a specific technique to be used for searching for patterns, possibly involving multiple inducers. For example, when considering precision versus understandability, the former is better served by neural networks, while the latter is better served by decision trees. For each strategy of meta-learning, there are several possibilities for how it can be accomplished. Meta-learning focuses on explaining what causes a Data Mining algorithm to be successful or not in a particular problem. Thus, this methodology attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another division of the data for training and testing.

7. Utilizing the Data Mining algorithm

Finally, the Data Mining algorithm is applied. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example, by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree (a tuning sketch follows).
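
A brief illustrative sketch (scikit-learn and its bundled wine data set assumed, not from the original text): the minimum number of instances per leaf of a decision tree is varied and each setting is scored with ten-fold cross-validation, mimicking the repeated runs described above.

    # Minimal parameter-tuning sketch for a decision tree (illustrative only).
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)
    for min_leaf in (1, 5, 10, 20):
        clf = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
        score = cross_val_score(clf, X, y, cv=10).mean()   # ten-fold cross-validation
        print(f"min_samples_leaf={min_leaf}: mean accuracy {score:.3f}")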

8. Evaluation

In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. We also consider the preprocessing steps in terms of their impact on the Data Mining algorithm results, for example, adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model. The discovered knowledge is also documented for further use, together with the overall feedback and discovery results acquired by Data Mining.

9. Using the discovered knowledge

Now we are ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a set of data), but now the data becomes dynamic. Data structures may change, certain quantities may become unavailable, and the data domain might be modified, for example, an attribute may take a value that was not expected previously.

History of Data Mining


In the 1990s, the term "Data Mining" was introduced, but data mining is the evolution of a field with an extensive history.

Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the development of regression (1800s). The generation and growing power of computer science have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit, hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).

Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.

Classical statistics:

Statistics is the basis of most of the technologies on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and the relationships within it.
Artificial Intelligence:

AI, or Artificial Intelligence, is based on heuristics as opposed to statistics. It tries to apply human-thought-like processing to statistical problems. Certain AI concepts were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

Machine Learning:

Machine learning is a combination of statistics and AI. It might be considered an evolution of AI because it mixes AI heuristics with complex statistical analysis. Machine learning tries to enable computer programs to learn about the data they are studying so that the programs can make different decisions based on the characteristics of the data examined. It uses statistics for the basic concepts and adds more advanced AI heuristics and algorithms to achieve its goals.

Data Mining tools


Data Mining is the set of techniques that utilize specific algorithms, statistical analysis, artificial intelligence, and database systems to analyze data from different dimensions and perspectives.

Data Mining tools have the objective of discovering patterns, trends, and groupings among large sets of data and transforming data into more refined information.

A data mining tool is a framework, such as RStudio or Tableau, that allows you to perform different types of data mining analysis. With it, we can run various algorithms, such as clustering or classification, on a data set and visualize the results. Such a framework provides better insights into the data and the phenomena the data represent.

The market for data mining tools is thriving: a report from ReportLinker noted that the market would top $1 billion in sales by 2023, up from $591 million in 2018.

These are the most popular data mining tools:


1. Orange Data Mining:

Orange is a comprehensive machine learning and data mining software suite. It supports data visualization, is component-based, is written in the Python programming language, and was developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.

Because the software is component-based, the components of Orange are called "widgets." These widgets range from preprocessing and data visualization to the assessment of algorithms and predictive modeling.

Widgets deliver significant functionalities such as:

• Displaying a data table and allowing feature selection
• Reading data
• Training predictors and comparing learning algorithms
• Visualizing data elements, etc.

Besides, Orange provides a more interactive and enjoyable atmosphere than dull analytical tools, and it is quite pleasant to operate.

Why Orange?

Data coming into Orange is quickly formatted to the desired pattern, and the widgets can easily be moved where needed. Orange is quite appealing to users. It allows its users to make smarter decisions in a short time by rapidly comparing and analyzing data. It is a good open-source data visualization and evaluation tool that suits both beginners and professionals. Data mining can be performed via visual programming or Python scripting. Many analyses are feasible through its visual programming interface (widgets connected by drag and drop), and many visual tools are supported, such as bar charts, scatter plots, trees, dendrograms, and heat maps. A substantial number of widgets (more than 100) are supported.

The tool has machine learning components, add-ons for bioinformatics and text mining, and is packed with features for data analytics. It can also be used as a Python library: Python scripts can be run in a terminal window, in an integrated environment like PyCharm or PythonWin, or in shells like IPython. Orange comprises a canvas interface onto which the user places widgets to create a data analysis workflow. The widgets provide fundamental operations, for example, reading data, showing a data table, selecting features, training predictors, comparing learning algorithms, and visualizing data elements. Orange runs on Windows, macOS, and a variety of Linux operating systems, and comes with multiple regression and classification algorithms.

Orange can read documents in its native format and in other data formats. Orange is dedicated to machine learning techniques for classification, or supervised data mining. Two types of objects are used in classification: learners and classifiers. Learners take class-labeled data and return a classifier. Regression methods are very similar to classification in Orange; both are designed for supervised data mining and require class-labeled data. Ensemble learning combines the predictions of individual models for a gain in precision. The models can either come from different training data or use different learners on the same data set. Learners can also be diversified by altering their parameter sets. In Orange, ensembles are simply wrappers around learners; they act like any other learner and, based on the data, return models that can predict the outcome for any data instance (a short scripting sketch follows).
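
As a minimal scripting sketch (not from the original text), the snippet below assumes a recent Orange 3 release; exact call signatures and class names may differ between versions. A learner is cross-validated on the bundled, class-labeled iris data set.

    # Minimal Orange scripting sketch (Orange 3 assumed; API details vary by version).
    import Orange

    data = Orange.data.Table("iris")                  # bundled class-labeled data set
    learner = Orange.classification.TreeLearner()     # a learner returns a classifier when trained
    results = Orange.evaluation.CrossValidation(data, [learner], k=5)
    print("Classification accuracy:", Orange.evaluation.CA(results))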

2. SAS Data Mining:

SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for
analytics and data management. SAS can mine data, change it, manage information from
various sources, and analyze statistics. It offers a graphical UI for non-technical users.

SAS data miner allows users to analyze big data and provide accurate insight for timely
decision-making purposes. SAS has distributed memory processing architecture that is highly
scalable. It is suitable for data mining, optimization, and text mining purposes.

3. DataMelt Data Mining:

DataMelt is a computation and visualization environment that offers an interactive structure for data analysis and visualization. It is primarily designed for students, engineers, and scientists. It is also known as DMelt.

DMelt is a multi-platform utility written in JAVA. It can run on any operating system which
is compatible with JVM (Java Virtual Machine). It consists of Science and mathematics
libraries.
• Scientific libraries:
Scientific libraries are used for drawing the 2D/3D plots.
• Mathematical libraries:
Mathematical libraries are used for random number generation, algorithms, curve fitting,
etc.

DMelt can be used for the analysis of the large volume of data, data mining, and statistical
analysis. It is extensively used in natural sciences, financial markets, and engineering.

4. Rattle:

Rattle is a GUI-based data mining tool that uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log code tab that generates the equivalent code for any GUI operation.

The data set produced by Rattle can be viewed and edited. Rattle also offers the facility to review the code, use it for many purposes, and extend the code without any restriction.

5. Rapid Miner:

RapidMiner is one of the most popular predictive analysis systems, created by the company of the same name. It is written in the Java programming language and offers an integrated environment for text mining, deep learning, machine learning, and predictive analysis.
The tool can be used for a wide range of applications, including business applications, commercial applications, research, education, training, application development, and machine learning.

RapidMiner provides its server on-premises as well as in public or private cloud infrastructure. It is based on a client/server model. RapidMiner comes with template-based frameworks that enable fast delivery with fewer errors (which are commonly expected in the manual code-writing process).

Data Mining vs Machine Learning


Data Mining relates to extracting information from a large quantity of data. Data mining is the technique of discovering the different kinds of patterns that are inherent in a data set and that are precise, new, and useful. Data Mining works as a subset of business analytics and is similar to experimental studies. Data Mining's origins are databases and statistics.

Machine learning involves algorithms that automatically improve through data-based experience. Machine learning is a way to find new algorithms from experience. It includes the study of algorithms that can automatically extract information from data. Machine learning utilizes data mining techniques and other learning algorithms to construct models of what is happening behind the data so that it can predict future outcomes.

Data Mining and Machine Learning are areas that have influenced each other; although they have much in common, they have different ends.

Data Mining is performed on certain data sets by humans to find interesting patterns among the items in the data set. Data Mining uses techniques created by machine learning for predicting results, while machine learning is the capability of a computer to learn from a mined data set.

Machine learning algorithms take the information that represents the relationship between
items in data sets and creates models in order to predict future results. These models are
nothing more than actions that will be taken by the machine to achieve a result.

What is Data Mining?


Data Mining is the method of extracting previously unknown data patterns from huge sets of data; hence, as the words suggest, we "mine for specific data" from the large data set. Data mining, also called the Knowledge Discovery Process, is a field of science used to determine the properties of data sets. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" (KDD) in 1989, and the term "data mining" appeared in the database community around 1990. Huge sets of data collected from data warehouses or complex data sets such as time series, spatial data, etc. are mined in order to extract interesting correlations and patterns between the data items. The output of a data mining algorithm is often used as input for machine learning algorithms.

What is Machine learning?


Machine learning is related to the development and design of a machine that can learn by itself
from a specified set of data to obtain a desirable result without being explicitly coded. Hence,
machine learning implies 'a machine that learns on its own'. Arthur Samuel, an American pioneer
in the areas of computer gaming and artificial intelligence, coined the term machine learning in
1959. He said that it "gives computers the ability to learn without being explicitly programmed."

Machine learning is a technique that creates complex algorithms for large data processing and
provides outcomes to its users. It utilizes complex programs that can learn through experience
and make predictions.

The algorithms improve by themselves through the frequent input of training data. The aim of
machine learning is to understand information and build models from data that can be understood
and used by humans.

Machine learning algorithms are divided into two types:

• Unsupervised Learning
• Supervised Learning

1. Unsupervised Machine Learning:

Unsupervised learning does not depend on labeled training data to predict results; instead, it uses
techniques such as clustering and association to find structure directly in the data. (Training data
is defined as input for which the output is already known.)

2. Supervised Machine Learning:

As the name implies, supervised learning refers to the presence of a supervisor acting as a teacher.
Supervised learning is a learning process in which we teach or train the machine using data that
is well labeled, meaning some data is already tagged with the correct responses. After that, the
machine is provided with new sets of data so that the supervised learning algorithm analyzes the
training data and gives an accurate result from the labeled data. A minimal sketch contrasting the
two styles is given below.
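
To make the distinction concrete, here is a minimal illustrative sketch (not taken from the text) that
assumes scikit-learn is installed and uses a small synthetic data set: clustering uses only the inputs,
while the classifier is trained on labeled examples.

```python
# Illustrative sketch: unsupervised vs supervised learning on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans                    # unsupervised: no labels used
from sklearn.neighbors import KNeighborsClassifier   # supervised: trained on labeled data
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Unsupervised learning: the algorithm groups the points without seeing y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster assignments for first 5 points:", kmeans.labels_[:5])

# Supervised learning: the algorithm is trained on labeled examples,
# then evaluated on new, unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Accuracy on unseen data:", clf.score(X_test, y_test))
```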

Major Differences between Data Mining and Machine Learning
1. Two components are used to introduce data mining techniques: the first is the database, and the
second is machine learning. The database provides data management techniques, while machine
learning provides methods for data analysis. Machine learning methods, in contrast, are introduced
through algorithms.

2. Data mining utilizes more data to obtain helpful information, and that specific data helps to
predict future results. For example, a marketing company may use last year's data to predict sales.
Machine learning, by contrast, does not depend as much on stored data; it relies on algorithms.
Many transportation companies such as OLA and UBER use machine learning techniques to
calculate the ETA (Estimated Time of Arrival) for rides.
3. Data mining is not capable of self-learning; it follows predefined guidelines and provides the
answer to a specific problem. Machine learning algorithms, however, are self-adjusting: they can
alter their rules according to the situation, find the solution to a specific problem, and resolve it
in their own way.

4. The most important difference is that data mining cannot work without human involvement,
whereas in machine learning human effort is involved only when the algorithm is defined; after
that, the system concludes everything on its own. Once implemented, it can be used indefinitely,
which is not possible in the case of data mining.

5. As machine learning is an automated process, the results it produces tend to be more precise
than those obtained from data mining alone.

6. Data mining utilizes the database, data warehouse server, data mining engine, and pattern
evaluation techniques to obtain useful information, whereas machine learning utilizes neural
networks, predictive models, and automated algorithms to make decisions.

Data Mining Vs Machine Learning

• Origin: Data mining originates from traditional databases with unstructured data, whereas
  machine learning starts from existing algorithms and data.
• Meaning: Data mining extracts information from a huge amount of data, whereas machine
  learning introduces new information from data as well as previous experience.
• History: Data mining became known as knowledge discovery in databases (KDD) in 1989,
  whereas the first machine learning program, Samuel's checkers-playing program, was
  developed in the 1950s.
• Responsibility: Data mining is used to obtain rules from existing data, whereas machine
  learning teaches the computer how to learn and comprehend those rules.
• Abstraction: Data mining abstracts information from the data warehouse, whereas machine
  learning learns from the data and algorithms it is given.
• Applications: Compared to machine learning, data mining can produce outcomes on a smaller
  volume of data; it is also used in cluster analysis. Machine learning needs a large amount of
  data to obtain accurate results and has various applications such as web search, spam filtering,
  credit scoring, and computer design.
• Nature: Data mining involves more human intervention and is largely manual, whereas
  machine learning is automated; once designed and implemented, there is no need for further
  human effort.
• Techniques involved: Data mining is more of a research activity using techniques such as
  machine learning, whereas machine learning is a self-learned, trained system that performs the
  task precisely.
• Scope: Data mining is applied in limited fields, whereas machine learning can be used in a
  vast area.

Facebook Data Mining

In this digital era, social platforms have become inevitable. Whether we like them or not, there is
no escape. Facebook allows us to interact with friends and family and to stay up to date about the
latest things happening around the world. Facebook has made the world seem much smaller.
Facebook is one of the most important channels of online business communication, and business
owners make the most of this platform. One important reason the platform is so widely accessed
is that it is one of the oldest photo and video sharing social media tools.

A Facebook page helps people become aware of a brand through the media content shared. The
platform supports businesses in reaching their audience and building their presence through
Facebook itself.

The platform is useful not only for business accounts but also for accounts with personal blogs.
Bloggers and influencers who post content that attracts customers give users yet another reason
to access Facebook.
As far as ordinary users are concerned, many people nowadays feel they cannot live without
Facebook; checking the site every half hour has become a habit for many.

Facebook, created in 2004, is one of the most popular social media platforms; it now has almost
two billion monthly active users, with five new profiles created every second. Anyone over the
age of 13 can use the site. Users create a free account with a profile in which they share as much
information about themselves as they wish.

Some Facts about Facebook:

• Headquarters: California, US
• Established: February 2004
• Founded by: Mark Zuckerberg
• There are approximately 52 percent Female users and 48 percent Male users on Facebook.
• Facebook stories are viewed by 0.6 Billion viewers on a daily basis.
• In 2019, in 60 seconds on the internet, 1 million people log in to Facebook.
• More than 5 billion messages are posted on Facebook pages collectively, on a monthly basis.

On a Facebook page, a user can incorporate many different kinds of personal data, including
the user's date of birth, hobbies and interests, education, sexual preferences, political party,
and religious affiliations, and current employment. Users can also post photos of themselves as
well as of other people, and they can offer other Facebook users the opportunity to search for and
communicate with them via the website. Researchers have realized that the plentiful personal data
on Facebook, as well as on other social networking platforms, can easily be collected, or mined,
to search for patterns in people's behavior. For example, social researchers at various universities
around the world have collected data from Facebook pages to become familiar with the lives and
social networks of college students. They have also mined data on MySpace to find out how
people express feelings on the web and to assess, based on data posted on MySpace, what youths
think about appropriate internet conduct.

Because academic specialists, particularly those in the social sciences, are collecting data from
Facebook and other internet websites and publishing their findings, numerous university
Institutional Review Boards (IRBs), councils charged by government guidelines to review
research with human subjects, have built up policies and procedures that govern research on the
internet. Some have made strategies specifically relating to data mining on social media platforms
like Facebook. These strategies serve as institution-specific supplements to the Department of
Health and Human Services (HHS) guidelines governing the conduct of research with human
subjects. The formation of these institution-specific strategies shows that at least some university
IRBs view data mining on Facebook as research with human subjects. Thus, at these universities,
research involving data mining on Facebook must undergo IRB review before it may start.

According to the HHS guidelines, all research with human subjects must undergo IRB review and
receive IRB approval before the research may start. This administrative requirement tries to
ensure that human subjects research is conducted as ethically as possible, specifically requiring
that participation in research is voluntary, that the risks to subjects are proportional to the benefits,
and that no subject population is unfairly excluded from or included in the research.
Social Media Data Mining Methods
Applying data mining techniques to social media is relatively new compared to other fields of
research related to social network analysis, especially when we consider that research in social
network analysis dates back to the 1930s. Applications that use data mining techniques developed
by industry and academia are already being used commercially. For example, social media
analytics organizations offer services that track social media and provide customers with data
about how goods and services are recognized and discussed through social media networks.
Analysts in these organizations have applied text mining algorithms and propagation-detection
models to blogs in order to create techniques for better understanding how information moves
through the blogosphere.

Data mining techniques can be applied to social media sites to understand information better and
to make use of the data for analytics, research, and business purposes. Representative fields
include community or group detection, information diffusion, influence propagation, topic
detection and tracking, individual behavior analysis, group behavior analysis, and market research
for organizations.

Representation of Data
It is common to use a graph representation to study social media data sets. A graph comprises a
set of vertices (nodes) and edges (links). Users are usually represented as the nodes in the graph,
while relationships or cooperation between individuals are represented as the links. A minimal
sketch of this representation is given below.
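
As an illustration, here is a minimal sketch of this node-and-link representation, assuming the
networkx library is available and using made-up user names.

```python
# Illustrative sketch: users as nodes, relationships as edges.
import networkx as nx

g = nx.Graph()                      # undirected graph of "friendship" links
g.add_edge("alice", "bob")
g.add_edge("alice", "carol")
g.add_edge("bob", "dave")

print(g.number_of_nodes(), "users,", g.number_of_edges(), "relationships")
print("Alice's connections:", list(g.neighbors("alice")))

# Classic graph measures (degree, centrality, communities) can then be applied directly.
print("Degree centrality:", nx.degree_centrality(g))
```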

The graph depiction is common for information extracted from social networking sites where
people interact with friends, family, and business associates. It helps to create a social
network of friends, family, or business associates. Less apparent is how the graph structure is
applied to blogs, wikis, opinion mining, and similar types of online social media platforms.

If we consider blogs, one graph representation has blogs as the nodes and can be regarded as a
"blog network," while another representation has blog posts as the nodes and can be regarded as
a "post network." In a post network, an edge is created when one blog post references another.
Other techniques used to represent blog networks account for individuals, relationships, content,
and time simultaneously, an approach called internet Online Analytical Processing (iOLAP).
Wikis can be represented by depicting authors as nodes, with edges created when authors
contribute to the same object.

The graphical representation allows the application of classic mathematical graph theory,
traditional techniques for analyzing social networks, and work on mining graph data. The
potentially large size of the graph used to depict a social media platform can present difficulties
for automated processing, as limits on computer memory and processing speed are easily reached,
and often exceeded, when trying to cope with huge social media data sets. Other challenges to
implementing automated procedures for social media data mining include identifying and dealing
with spam, the variety of formats used within the same subcategory of social media, and
continuously changing content and structure.

Data Mining- A Process


No matter what sort of social media is being studied, some fundamentals are essential to consider
so that the most meaningful outcomes are feasible. Every kind of social media and every data
mining purpose may involve distinctive methods and algorithms to produce an advantage from
data mining. Different data sets and data issues call for different kinds of tools. If it is known how
the data should be organized, a classification tool might be appropriate. If we understand what the
data is about but cannot determine trends and patterns in it, a clustering tool may be the best
choice.

The problem itself often dictates the best approach. There is no substitute for understanding the
data as much as possible before applying data mining techniques, as well as understanding the
various data mining tools that are available. A subject-matter analyst might be required to help
better understand the data set. To better understand the tools available, there are a host of data
mining and machine learning texts and other resources that provide detailed information about
particular data mining techniques and algorithms.

Once you understand the issues and select an appropriate data mining approach, consider any
preprocessing that needs to be done. A systematic process may also be required to develop an
adequate set of data to allow reasonable processing times. Pre-processing should include
suitable privacy protection mechanisms. Although social media platforms incorporate huge
amounts of openly accessible data, it is important to guarantee that individual rights and social
media platform copyrights are protected. The effect of spam should be considered along with
the temporal representation.

In addition to preprocessing, it is essential to think about the effect of time. Depending upon the
inquiry and the research, we may get different outcomes at one time compared to another. The
time dimension is an obvious consideration for some areas, for example, topic detection, influence
propagation, and network development; less evident is the effect of time on network
identification, group behavior, and marketing. What defines a network at one point in time can be
significantly different at another point in time. Group behavior and interests change over time,
and what appealed to individuals or groups today may not be trendy tomorrow.

With data depicted as a graph, the tasks start with a selected number of nodes, known as
seeds. Graphs are traversed, starting with the arrangement of seeds, and as the link structure
from the seed nodes is used, data is collected, and the structure itself is also reviewed.
Utilizing the link structure to stretch out from the seed set and gather new information is
known as crawling the network. The application and algorithms that are executed as a crawler
should effectively manage the challenges present in powerful social media platforms such as
restricted sites, format changes, and structure errors (invalid links). As the crawler finds the
new data, it stores the new data in a repository for further analysis. As link data is found, the
crawler updates the data about the network structure.
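
The crawl loop just described can be sketched in a few lines of Python. Here, `fetch_links` is a
hypothetical helper standing in for a platform API call or page parser (it is not part of any real
library), and rate limits and spam handling are simplified away.

```python
# Minimal sketch of crawling a network from a set of seed nodes.
from collections import deque

def crawl(seeds, fetch_links, max_nodes=1000):
    visited, frontier = set(), deque(seeds)
    edges = []                                  # discovered network structure
    repository = {}                             # collected data per node
    while frontier and len(visited) < max_nodes:
        node = frontier.popleft()
        if node in visited:
            continue
        visited.add(node)
        try:
            links = fetch_links(node)           # may fail on restricted or invalid nodes
        except Exception:
            continue                            # skip structure errors (e.g., dead links)
        repository[node] = {"out_degree": len(links)}
        for target in links:
            edges.append((node, target))        # update knowledge of the link structure
            if target not in visited:
                frontier.append(target)
    return repository, edges
```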

Some social media platforms such as Facebook, Twitter, and Technorati provide Application
Programmer Interfaces (APIs) that allow crawler applications to interact with the data sources
directly. However, these platforms usually restrict the number of API transactions per day,
depending on the affiliation the API user has with the platform. For some platforms, it is possible
to collect data (crawl) without utilizing APIs. Given the huge size of the social media
platform data available, it might be necessary to restrict the amount of data that the crawler
collects. When the crawler has collected the data, some postprocessing may be needed to
validate and clean up the data. Traditional social media platforms analysis methods can be
applied, for example, centrality measures and group structure studies. In many cases,
additional data will be related to a node or a link opening opportunities for more complex
methods to consider the more thoughtful semantics that can be exposed with text and data
mining techniques.

We now focus on two particular kinds of social media data to further illustrate how data mining
techniques are applied to social media sites. The two major areas are social networking platforms
and blogs; both are powerful and rich data sources. The two areas offer potential value to the
wider scientific community as well as to business organizations.

Social media platforms: Illustrative Examples


Social media platforms like Facebook or LinkedIn comprise connected users with unique profiles.
Users can interact with their friends and family and can share news, photos, stories, videos,
favorite links, etc. Users have the option to customize their profiles depending on individual
preferences, but common data may include relationship status, birthday, email address, and
hometown. Users can choose how much data they include in their profile and who has access to
it. The amount of data accessible via social media platforms has raised security concerns and is a
related societal issue.

(Figure: a hypothetical graph structure for a typical social media platform; arrows indicate links
to a larger part of the graph.)

It is important to protect personal identity when working with social media platform data. Recent
reports highlight the need to protect privacy, as it has been demonstrated that even anonymized
data of this sort can still reveal individual information when advanced data analysis strategies are
used. Privacy settings can also restrict the ability of data mining applications to consider all the
data on social media platforms; however, malicious techniques can sometimes be used to
circumvent these settings.
Clustering in Data Mining
Clustering is an unsupervised machine learning technique that groups data points into clusters so
that objects in the same cluster are similar to one another.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to
each other, and these subsets are called clusters. Once the data from a customer base is divided
into clusters, we can make an informed decision about who we think is best suited for a given
product.
Let's understand this with an example: suppose we are marketing managers, and we have a
tempting new product to sell. We are sure that the product would bring enormous profit, as long
as it is sold to the right people. So, how can we tell who is best suited for the product from our
company's huge customer base?

Clustering, falling under the category of unsupervised machine learning, is one of the problems
that machine learning algorithms solve.
Clustering uses only the input data to determine patterns, anomalies, or similarities within it.

A good clustering algorithm aims to obtain clusters where:

• The intra-cluster similarity is high, which implies that the data inside a cluster is similar to one
  another.
• The inter-cluster similarity is low, which means each cluster holds data that is not similar to
  the data in other clusters.

A minimal code sketch of this idea is given below.
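
Here is a minimal sketch, assuming scikit-learn and invented customer features, in which k-means
groups similar customers together; the feature values and cluster interpretation are illustrative only.

```python
# Illustrative sketch: clustering a tiny, made-up customer data set with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features, e.g. [age, annual spend in thousands]
customers = np.array([
    [25, 20], [27, 22], [24, 19],     # young, low spenders
    [45, 80], [47, 85], [44, 78],     # middle-aged, high spenders
    [65, 40], [63, 42], [66, 38],     # older, medium spenders
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("Cluster label per customer:", model.labels_)
print("Cluster centers:\n", model.cluster_centers_)
# A new product can then be targeted at the cluster whose center best
# matches the expected buyer profile.
```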

What is a Cluster?
• A cluster is a subset of similar objects
• A subset of objects such that the distance between any of the two objects in the cluster is
less than the distance between any object in the cluster and any object that is not located
inside it.
• A connected region of a multidimensional space with a comparatively high density of
objects.

What is clustering in Data Mining?


• Clustering is the method of converting a group of abstract objects into classes of similar
objects.
• Clustering is a method of partitioning a set of data or objects into a set of significant
subclasses called clusters.
• It helps users to understand the structure or natural grouping in a data set and is used either as
  a stand-alone tool to get better insight into data distribution or as a pre-processing step for other
  algorithms.

Important points:
• Data objects of a cluster can be considered as one group.
• While doing cluster analysis, we first partition the data set into groups based on data similarities
  and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps
  single out useful features that differentiate different groups.

Applications of cluster analysis in data mining:


• Clustering analysis is widely used in many applications, such as data analysis, market research,
  pattern recognition, and image processing.
• It assists marketers in finding distinct groups in their customer base; based on purchasing
  patterns, they can characterize their customer groups.
• It helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications, such as the detection of credit card
  fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution
  of data and to analyze the characteristics of each cluster.
• In biology, it can be used to determine plant and animal taxonomies, to categorize genes with
  similar functionality, and to gain insight into structures inherent in populations.
• It helps in the identification of areas of similar land use in an earth observation database and
  in the identification of groups of houses in a city according to house type, value, and
  geographical location.

Why is clustering used in data mining?


Clustering analysis has been an evolving problem in data mining due to its variety of applications.
The advent of various data clustering tools in the last few years, and their comprehensive use in a
broad range of applications including image processing, computational biology, mobile
communication, medicine, and economics, has contributed to the popularity of these algorithms.
The main issue with data clustering algorithms is that they cannot be standardized: an advanced
algorithm may give the best results with one type of data set but may fail or perform poorly with
other kinds of data sets. Although many efforts have been made to standardize algorithms that
perform well in all situations, no significant achievement has been made so far. Many clustering
tools have been proposed, but each algorithm has its own advantages and disadvantages and cannot
work in all real situations. A good clustering algorithm should therefore satisfy the following
requirements.

1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to perform
clustering should grow approximately according to the complexity order of the algorithm. For
example, k-means clustering is roughly O(n) in the number of objects n (for a fixed number of
clusters and iterations). If we increase the number of data objects 10-fold, the time taken to cluster
them should also increase approximately 10 times; that is, there should be a roughly linear
relationship. If that is not the case, there may be an error in our implementation. The algorithm
should also remain scalable on large data sets; otherwise, we cannot get appropriate results in a
reasonable time. A rough sketch of such a check is given below.
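
A rough sketch, assuming scikit-learn and synthetic data, of checking that k-means run time grows
roughly linearly when the number of objects is increased 10-fold (exact timings depend on the
machine and are illustrative only):

```python
# Illustrative scalability check for k-means: time two data sizes and compare.
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
for n in (10_000, 100_000):
    X = rng.normal(size=(n, 5))
    start = time.perf_counter()
    KMeans(n_clusters=8, n_init=5, random_state=0).fit(X)
    print(f"n={n:>7}: {time.perf_counter() - start:.2f} s")
# If the larger run takes far more than roughly 10x the smaller one, the
# implementation (or the algorithm choice) does not scale as expected.
```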
2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with arbitrary shape:

The clustering algorithm should be able to find arbitrary shape clusters. They should not be
limited to only distance measurements that tend to discover a spherical cluster of small sizes.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on intervals
(numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such
data and may produce poor-quality clusters.

6. High dimensionality:

Clustering tools should be able to handle not only low-dimensional data but also high-dimensional
data spaces.

Text Data Mining


Text data mining can be described as the process of extracting essential information from natural
language text. All the data that we generate via text messages, documents, emails, and files is
written in ordinary language. Text mining is primarily used to draw useful insights or patterns
from such data.
The text mining market has experienced exponential growth and adoption over the last few years
and is also expected to see significant growth and adoption in the coming years. One of the primary
reasons behind the adoption of text mining is higher competition in the business market, with
many organizations seeking value-added solutions to compete with other organizations. With
increasing competition in business and changing customer perspectives, organizations are making
huge investments to find solutions that are capable of analyzing customer and competitor data to
improve competitiveness. The primary sources of data are e-commerce websites, social media
platforms, published articles, surveys, and many more. The larger part of the generated data is
unstructured, which makes it challenging and expensive for organizations to analyze manually.
This challenge, combined with the exponential growth in data generation, has led to the growth
of analytical tools that are not only able to handle large volumes of text data but also help in
decision-making. Text mining software empowers a user to draw useful information from a huge
set of available data sources.

Areas of text mining in data mining:


These are the main areas of text mining:
• Information Extraction:
The automatic extraction of structured data, such as entities, relationships between entities, and
attributes describing entities, from an unstructured source is called information extraction.
• Natural Language Processing:
NLP stands for Natural Language Processing, the ability of computer software to understand
human language as it is spoken or written. NLP is primarily a component of artificial intelligence
(AI). Developing NLP applications is difficult because computers generally expect humans to
"speak" to them in a programming language that is accurate, clear, and highly structured, whereas
human speech is often ambiguous and depends on many complex variables, including slang,
social context, and regional dialects.
• Data Mining:
Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data
mining tools can predict behaviors and future trends that allow businesses to make better
data-driven decisions. Data mining tools can be used to resolve many business problems that have
traditionally been too time-consuming to solve.
• Information Retrieval:
Information retrieval deals with retrieving useful data from the data stored in our systems. As an
analogy, we can view the search engines on websites such as e-commerce sites, or any other sites,
as part of information retrieval.

Text Mining Process:


The text mining process incorporates the following steps to extract information from a document.
• Text transformation:
A text transformation normalizes the text (for example, controlling capitalization) and converts
the document into a representation suitable for mining. The two major ways of representing a
document are:
1. Bag of words
2. Vector space
• Text Pre-processing:
Pre-processing is a significant task and a critical step in text mining, Natural Language Processing
(NLP), and Information Retrieval (IR). In the field of text mining, data pre-processing is used for
extracting useful information and knowledge from unstructured text data. Information Retrieval
(IR) is a matter of choosing which documents in a collection should be retrieved to fulfill the
user's need.
• Feature selection:
Feature selection is a significant part of data mining. It can be defined as the process of reducing
the input to be processed or finding the essential information sources. Feature selection is also
called variable selection.
• Data Mining:
Now, in this step, the text mining procedure merges with the conventional data mining process.
Classic data mining techniques are applied to the resulting structured database.
• Evaluate:
Afterward, the results are evaluated; once the results have been evaluated, the process ends.
Text Mining Applications:

These are some important text mining applications:
• Risk Management:
Risk management is a systematic and logical procedure of identifying, analyzing, treating, and
monitoring the risks involved in any action or process in an organization. Insufficient risk analysis
is often a leading cause of failure. This is particularly true in financial organizations, where the
adoption of risk management software based on text mining technology can effectively enhance
the ability to reduce risk. It enables the administration of millions of sources and petabytes of text
documents, and gives the ability to link the data and access the appropriate information at the
right time.
• Customer Care Service:
Text mining methods, particularly NLP, are finding increasing significance in the field of
customer care. Organizations are investing in text analytics software to improve the overall
customer experience by accessing textual data from different sources such as customer feedback,
surveys, and customer calls. The primary objective of text analysis is to reduce the response time
of the organization and help address customer complaints quickly and productively.
• Business Intelligence:
Companies and business firms have started to use text mining strategies as a major aspect of their
business intelligence. Besides providing significant insights into customer behavior and trends,
text mining strategies also help organizations analyze the strengths and weaknesses of their
competitors, giving them a competitive advantage in the market.
• Social Media Analysis:
Social media analysis helps to track online data, and there are numerous text mining tools designed
particularly for performance analysis of social media sites. These tools help to monitor and
interpret text generated on the internet from news, emails, blogs, etc. Text mining tools can
analyze the total number of posts, followers, and likes of your brand on a social media platform,
which enables you to understand the response of the individuals who are interacting with your
brand and content.

Text Mining Approaches in Data Mining:


The following text mining approaches are used in data mining.

1. Keyword-based Association Analysis:

This approach collects sets of keywords or terms that often occur together and then discovers the
association relationships among them. First, the text data is pre-processed by parsing, stemming,
removing stop words, etc. Once the data has been pre-processed, association mining algorithms
are applied. Here, little human effort is required, so the number of unwanted results and the
execution time are reduced. A minimal sketch of counting keyword co-occurrences is given below.
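
Here is a minimal sketch of the co-occurrence counting at the heart of this approach. The tiny
documents are made up and assumed to be already pre-processed into keyword sets; a real system
would use a full association mining algorithm such as Apriori.

```python
# Illustrative sketch: count how often pairs of keywords co-occur across documents.
from itertools import combinations
from collections import Counter

documents = [
    {"data", "mining", "pattern"},
    {"data", "mining", "cluster"},
    {"data", "warehouse"},
    {"mining", "pattern", "cluster"},
]

pair_counts = Counter()
for keywords in documents:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

min_support = 2   # keep only pairs that co-occur in at least 2 documents
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, "support =", count)
```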

2. Document Classification Analysis:

Automatic document classification:

This analysis is used for the automatic classification of huge numbers of online text documents
such as web pages, emails, etc. Text document classification differs from the classification of
relational data because document databases are not organized according to attribute-value pairs.

Numericizing text:
• Stemming algorithms
A significant pre-processing step before the indexing of input documents is the stemming of
words. The term "stemming" refers to the reduction of words to their roots; for example, different
grammatical forms of a word, such as "order", "ordered", and "ordering", are treated as the same
term. The primary purpose of stemming is to ensure that similar words are recognized as the same
by the text mining program.
• Support for different languages:
Some operations, such as stemming, synonym handling, and the letters that are allowed in words,
are highly language-dependent. Therefore, support for various languages is important.
• Exclude certain characters:
Excluding numbers, specific characters, series of characters, or words that are shorter or longer
than a specific number of letters can be done before the indexing of the input documents.
• Include lists, exclude lists (stop words):
A specific list of words to be indexed can be defined, which is useful when we want to search for
particular words and classify the input documents based on the frequencies with which those
words occur. Additionally, "stop words," meaning terms that are to be excluded from indexing,
can be defined. A default list of English stop words typically includes "the," "a," "since," etc.
These words are used very often in the language but convey very little information in a document.
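
Here is a minimal sketch, in plain Python, of the steps above: lower-casing, excluding numbers
and short tokens, removing stop words, stemming, and producing a bag-of-words count. The tiny
stop-word list and the deliberately crude stemmer are illustrative only.

```python
# Illustrative sketch: pre-processing text into a bag-of-words representation.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "in", "since"}

def simple_stem(word):
    # Crude suffix stripping so that "ordered"/"ordering" map to the same root.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(document, min_length=3):
    tokens = re.findall(r"[a-z]+", document.lower())      # drop numbers and punctuation
    tokens = [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]
    return Counter(simple_stem(t) for t in tokens)

print(bag_of_words("The customers ordered the products; ordering is increasing since 2019."))
```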

Bagging Vs Boosting
We all use the decision tree technique in day-to-day life to make decisions. Organizations use
supervised machine learning techniques such as decision trees to make better decisions and to
generate more surplus and profit.

Ensemble methods combine different decision trees to deliver better predictive results than a
single decision tree. The primary principle behind an ensemble model is that a group of weak
learners come together to form a strong learner.

There are two techniques, described below, that are used to build an ensemble of decision trees.

Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. Here the idea is to
create several subsets of data from the training sample, chosen randomly with replacement. Each
subset of data is then used to train its own decision tree, so we end up with an ensemble of different
models. The average of all the predictions from the numerous trees is used, which is more robust
than a single decision tree.

Random Forest is an extension of bagging. It takes one additional step: in addition to taking a
random subset of the data, it also makes a random selection of features, rather than using all
features, to grow each tree. When we have numerous random trees, the result is called a Random
Forest.

The following steps are taken to implement a Random Forest (a short code sketch follows the list):

• Suppose there are X observations and Y features in the training data set. First, a sample from
  the training data set is taken randomly with replacement.
• A tree is grown to its largest extent on this sample, using a random subset of features at each
  split.
• The above steps are repeated, and the final prediction is based on the collection of predictions
  from the n trees.
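
Here is a minimal sketch of bagging via a random forest, assuming scikit-learn and using one of
its bundled example data sets; the parameter values are illustrative only.

```python
# Illustrative sketch: a random forest as a bagged ensemble of decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees, each trained on a bootstrap sample
    max_features="sqrt",    # random subset of features considered at each split
    random_state=1,
).fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```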

Advantages of using Random Forest technique:

• It manages higher-dimensional data sets very well.
• It handles missing values and maintains accuracy for missing data.

Disadvantages of using the Random Forest technique:

Since the final prediction is based on the mean of the predictions from the subset trees, it will not
give precise continuous values for the regression model.

Boosting:
Boosting is another ensemble procedure for building a collection of predictors. In other words, we
fit consecutive trees, usually on random samples, and at each step the objective is to reduce the
net error from the prior trees.

If a given input is misclassified by one hypothesis, its weight is increased so that the next
hypothesis is more likely to classify it correctly; combining the entire set at the end converts weak
learners into a better-performing model.

Gradient Boosting is an extension of the boosting procedure.

Gradient Boosting = Gradient Descent + Boosting

It utilizes a gradient descent algorithm that can optimize any differentiable loss function. An
ensemble of trees is built one tree at a time, and the individual trees are summed sequentially.
Each new tree tries to recover the loss (the difference between the actual and predicted values).
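
Here is a minimal sketch, assuming scikit-learn and synthetic regression data, of fitting trees
sequentially so that each new tree reduces the remaining loss; the parameter values are illustrative.

```python
# Illustrative sketch: gradient boosting for regression.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=200,     # number of sequential trees
    learning_rate=0.05,   # hyper-parameters such as these need careful tuning
    max_depth=3,          # different differentiable loss functions are also supported
).fit(X_train, y_train)

print("R^2 on unseen data:", gbr.score(X_test, y_test))
```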

Advantages of using Gradient Boosting methods:

• It supports different loss functions.


• It works well with interactions.

Disadvantages of using Gradient Boosting methods:

• It requires careful tuning of various hyper-parameters.

Difference between Bagging and Boosting:


• Sampling: In bagging, various training data subsets are randomly drawn with replacement from
  the whole training data set. In boosting, each new subset contains the components that were
  misclassified by previous models.
• Goal: Bagging attempts to tackle the over-fitting issue, while boosting tries to reduce bias.
• When to use: If the classifier is unstable (high variance), we apply bagging. If the classifier is
  steady and simple (high bias), we apply boosting.
• Weighting: In bagging, every model receives an equal weight. In boosting, models are
  weighted according to their performance.
• Objective: Bagging aims to decrease variance, not bias. Boosting aims to decrease bias, not
  variance.
• Combination: Bagging is the easiest way of combining predictions that belong to the same
  type, whereas boosting combines predictions that belong to different types.
• Model construction: In bagging, every model is constructed independently. In boosting, new
  models are affected by the performance of previously developed models.

Data Mining Vs Data Warehousing


Data warehousing refers to the process of compiling and organizing data into one common
database, whereas data mining refers to the process of extracting useful data from the databases.
The data mining process depends on the data compiled in the data warehousing phase to recognize
meaningful patterns. A data warehouse is created to support management decision-making
systems.

Data Warehouse:
A data warehouse refers to a place where data can be stored for useful mining. It is like a fast
computer system with exceptionally large data storage capacity. Data from an organization's
various systems are copied to the warehouse, where they can be fetched and conformed to remove
errors. Here, advanced queries can be run against the warehouse's store of data.

A data warehouse combines data from numerous sources, which ensures data quality, accuracy,
and consistency. A data warehouse boosts system performance by separating analytics processing
from transactional databases. Data flows into a data warehouse from different databases. A data
warehouse works by organizing data into a schema that describes the format and types of data.
Query tools examine the data tables using this schema.

Data warehouses and databases are both relational data systems, but they are built to serve
different purposes. A data warehouse is built to store a huge amount of historical data and
empowers fast queries over all the data, typically using Online Analytical Processing (OLAP). A
database is made to store current transactions and allow quick access to specific transactions for
ongoing business processes, commonly known as Online Transaction Processing (OLTP).

Important Features of Data Warehouse

The Important features of Data Warehouse are given below:

1. Subject Oriented

A data warehouse is subject-oriented. It provides useful data about a subject instead of the
company's ongoing operations, and these subjects can be customers, suppliers, marketing,
product, promotion, etc. A data warehouse usually focuses on modeling and analysis of data
that helps the business organization to make data-driven decisions.

2. Time-Variant:

The different data present in the data warehouse provides information for a specific period.

3. Integrated

A data warehouse is built by integrating data from heterogeneous sources, such as relational
databases, flat files, etc.
4. Non- Volatile

Non-volatile means that once data has entered the warehouse, it cannot be changed.

Advantages of Data Warehouse:

• More accurate data access


• Improved productivity and performance
• Cost-efficient
• Consistent and quality data

Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge
sets of data that have either been compiled by computer systems or downloaded into the computer.
In the data mining process, the computer analyzes the data and extracts useful information from
it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining
is primarily used to discover and indicate relationships among the data sets.
Data mining aims to enable business organizations to view business behaviors and trend
relationships that allow the business to make data-driven decisions. It is also known as Knowledge
Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and machine
learning systems to discover relationships between data. Data mining tools can answer business
questions that would traditionally be too time-consuming to resolve.

Important features of Data Mining:

The important features of Data Mining are given below:

• It utilizes the Automated discovery of patterns.


• It predicts the expected results.
• It focuses on large data sets and databases
• It creates actionable information.

Advantages of Data Mining:

i. Market Analysis:

Data mining can analyze and predict market behavior, which helps businesses make decisions.
For example, it can predict who is likely to purchase which types of products.

ii. Fraud detection:

Data Mining methods can help to find which cellular phone calls, insurance claims, credit, or
debit card purchases are going to be fraudulent.

iii. Financial Market Analysis:

Data mining techniques are widely used to help model financial markets.

iv. Trend Analysis:

Analyzing current trends in the marketplace is a strategic benefit because it helps in reducing costs
and adjusting the manufacturing process to market demand.

Differences between Data Mining and Data Warehousing:


• Data mining is the process of determining data patterns, whereas a data warehouse is a database
  system designed for analytics.
• Data mining is generally considered the process of extracting useful data from a large set of
  data, whereas data warehousing is the process of combining all the relevant data.
• Business entrepreneurs carry out data mining with the help of engineers, whereas data
  warehousing is entirely carried out by engineers.
• In data mining, data is analyzed repeatedly, whereas in data warehousing, data is stored
  periodically.
• Data mining uses pattern recognition techniques to identify patterns, whereas data warehousing
  is the process of extracting and storing data in a way that allows easier reporting.
• One of the most useful data mining capabilities is the detection and identification of unwanted
  errors that occur in the system, whereas one of the advantages of a data warehouse is its ability
  to be updated frequently, which is why it is ideal for business owners who want to stay up to
  date with the latest data.
• Data mining techniques are cost-efficient compared to other statistical data applications,
  whereas the responsibility of the data warehouse is to simplify every type of business data.
• Data mining techniques are not 100 percent accurate and may lead to serious consequences in
  certain conditions, whereas with a data warehouse there is a possibility that the data required
  for analysis may not have been integrated into the warehouse, which can lead to loss of
  information.
• Companies can benefit from data mining as an analytical tool by acquiring suitable and
  accessible knowledge-based data, whereas a data warehouse stores a huge amount of historical
  data that helps users analyze different periods and trends to make future predictions.
Social Media Mining

Social media is a great source of information and a powerful platform for communication.
Businesses and individuals can make the most of it rather than only sharing photos and videos on
the platform. Social media gives its users the freedom to connect with their target group easily
and effectively. Whether a new group or an established business, both face difficulties in standing
out in the competitive social media landscape, but through these platforms users can market and
develop their brand or content and share it with others.

Social media mining combines social media platforms, social network analysis, and data mining
to provide a convenient and consistent framework for learners, professionals, scientists, and
project managers to understand the fundamentals and potential of social media mining. It
addresses various problems arising from social media data and presents fundamental concepts,
emerging issues, and effective algorithms for data mining and network analysis. It spans multiple
degrees of difficulty that enhance knowledge and help in applying ideas, principles, and techniques
in distinct social media mining situations.

As per the "Global Digital Report," the total number of active users on social media
platforms worldwide in 2019 is 2.41 billion and increases up to 9 % year-on-year. With the
universal use of Social media platforms via the internet, a huge amount of data is accessible.
Social media platforms include many fields of study, such as sociology, business,
psychology, entertainment, politics, news, and other cultural aspects of societies. Applying
data mining to social media can provide exciting views on human behavior and human
interaction. Data mining can be used in combination with social media to understand user's
opinions about a subject, identifying a group of individuals among the masses of a
population, to study group modifications over time, find influential people, or even suggest a
product or activity to an individual.
For example, the 2008 presidential election marked an unprecedented use of social media
platforms in the United States. Social media platforms, including Facebook and YouTube, played
a vital role in raising funds and getting candidates' messages to voters. Researchers extracted blog
data to demonstrate correlations between the amount of social media used by candidates and the
winner of the 2008 presidential campaign.

This example emphasizes the potential of mining social media data to forecast results at the
national level. Data mining of social media can also produce personal and corporate benefits.

Social media mining is related to social computing. Social computing is defined as "any computing
application where software is used as an intermediary or center for a social relationship." Social
computing involves applications used for interpersonal communication as well as applications
and research activities related to "computational social studies," i.e., the computational study of
social behavior.

Social media platforms refer to various kinds of information services used collaboratively by
many people, placed into the subcategories shown below.

Category Examples

Blogs Blogger, LiveJournal, WordPress

Social news Digg, Slashdot

Social bookmarking Delicious, StumbleUpon


Social networking platform Facebook, LinkedIn, Myspace, Orkut

Microblogs Twitter, GoogleBuzz

Opinion mining Epinions, Yelp

Photo and video sharing Flickr, YouTube

Wikis Scholarpedia, Wikihow, Wikipedia, Event

With popular traditional media such as radio, newspapers, and television, communication is
entirely one-way, flowing from the media source or advertiser to the mass of media consumers.
Web 2.0 technologies and modern social media platforms have changed the scene, moving from
one-way media communication driven by media providers to an environment where almost
anyone can publish written, audio, video, or image content to the masses.

This media environment is significantly changing the way businesses communicate with their
clients. It provides unprecedented opportunities for individuals to interact with huge numbers of
people at a very low cost. The relationships present online and expressed through social media
platforms form digitized data sets about social relationships on an unprecedented scale. The
resulting data offers rich opportunities for sociology and insights into consumer behavior and
marketing, among a host of applications in related fields.

The growth in the number of users on social media platforms is incredible. For example, consider
the most popular social networking site, Facebook. Facebook reached over 400 million active
users during its first six years of operation, an exponential growth that has continued since. As
per reports, Facebook is ranked 2nd in the world among websites based on users' daily traffic
engagement on the site.

The broad use of social media platforms is not limited to one geographical region of the world.
Orkut, a popular social networking platform operated by Google, has most of its users outside the
United States, and the use of social media among internet users is now mainstream in many parts
of the globe, including countries in Asia, Africa, Europe, South America, and the Middle East.
Social media is also driving significant changes in how companies operate, and businesses need
to decide on their policies to keep pace with this new media.

Motivations for Data Mining in Social Media:


The data accessible through social media platforms can give us insights into social networks and
societies that were previously not feasible in either scale or extent. This digital media transcends
the limitations of the physical world in studying human relationships and helps to measure popular
social and political beliefs in regional communities without dedicated studies. Social media
records viral marketing trends efficiently and is an ideal source for better understanding and
leveraging influence mechanisms. However, due to specific challenges, it is quite difficult to gain
valuable information from social networking site data without applying data mining techniques.

Data mining techniques can effectively assist in dealing with the three primary challenges of social
media data. First, social media data sets are large. Consider the example of the most popular social
media platform, Facebook, with 2.41 billion active users. Without automated data processing,
analyzing social media data becomes infeasible in any reasonable time frame.

Second, social media data sets can be noisy. For example, spam blogs are abundant in the
blogosphere, as are unimportant tweets on Twitter.

Third, data from online social media platforms is dynamic; frequent modifications and updates
over short periods are not only common but also a significant aspect to consider when dealing
with social media data.

Applying data mining methods to these huge data sets can improve search results for everyday
search engines, enable targeted marketing for businesses, help psychologists study behavior,
personalize consumer web services, provide new insights into social structure for sociologists,
and help identify and prevent spam for all of us.

Moreover, open access to data offers an unprecedented amount of material for researchers to
improve efficiency and optimize data mining techniques. The progress of data mining is built on
huge data sets, and social media is an ideal data source at the cutting edge of data mining for
developing and testing new data mining techniques for academic and industrial data mining
analysts.

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable is
non-deterministic. In other words, the class label of a test record cannot be assumed with certainty
even though its attribute set is the same as that of some training examples. These circumstances
may emerge due to noisy data or the presence of certain confounding factors that influence
classification but are not included in the analysis. For example, consider the task of predicting
whether an individual is at risk of liver illness based on the individual's eating habits and exercise
routine. Although most people who eat healthily and exercise consistently have a lower probability
of developing liver disease, they may still do so due to other factors, for example, consumption of
high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is
healthy or their workout level is sufficient is also subject to interpretation, which in turn may
introduce uncertainties into the learning problem.

Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers based on the Bayesian understanding of probability. The
theorem describes how a degree of belief, expressed as a probability, should be updated in light
of evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide
an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = [P(Y/X) · P(X)] / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is the conditional probability of event X occurring given that Y is true.

P(Y/X) is the conditional probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.
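
As a worked illustration of the formula, here is a short Python sketch with invented probabilities
(these numbers are not from the text); Y stands for "unhealthy diet" and X for "liver illness".

```python
# Worked numeric sketch of Bayes' theorem with illustrative probabilities.
p_x = 0.05          # P(X): prior probability of liver illness
p_y_given_x = 0.70  # P(Y/X): probability of an unhealthy diet among those with the illness
p_y = 0.30          # P(Y): overall probability of an unhealthy diet

# Bayes' theorem: P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(f"P(illness | unhealthy diet) = {p_x_given_y:.3f}")   # about 0.117
```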

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects
the degree of belief in a hypothesis before and after accounting for evidence. For example,
consider a coin. If we toss a fair coin, we get either heads or tails, each with a probability of 50%.
If the coin is flipped a number of times and the outcomes are observed, the degree of belief (for
example, in the coin being fair) may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y:

• P(X), the prior, is the initial degree of belief in X.

• P(X/Y), the posterior, is the degree of belief after accounting for Y.

• The quotient P(Y/X) / P(Y) represents the support that Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y), provided P(Y) ≠ 0, and
P(Y/X) = P(X⋂Y) / P(X), provided P(X) ≠ 0,

where P(X⋂Y) is the joint probability of both X and Y being true. Because both expressions
contain the same joint probability, equating P(X⋂Y) = P(X/Y)·P(Y) = P(Y/X)·P(X) and dividing
by P(Y) gives Bayes' theorem.

Bayesian network:

A Bayesian network falls under the category of Probabilistic Graphical Models (PGMs) and is
used to compute uncertainties by utilizing the concept of probability. Also known as belief
networks, Bayesian networks model uncertainty using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to represent a Bayesian network; like other statistical graphs, a
DAG consists of a set of nodes and links, where the links signify the connections between the
nodes.

The nodes represent random variables, and the edges define the relationships between these
variables.
A DAG models the uncertainty of an event occurring based on the Conditional Probability
Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network.
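
As an illustration, here is a minimal pure-Python sketch of a two-node network (Rain -> WetGrass)
with invented CPT values; a real system would typically use a dedicated probabilistic modelling
library instead.

```python
# Illustrative sketch: a tiny Bayesian network with one edge (Rain -> WetGrass).
p_rain = {True: 0.2, False: 0.8}             # CPT for Rain
p_wet_given_rain = {True: 0.9, False: 0.1}   # CPT for WetGrass given Rain

# Joint probability from the chain rule: P(Rain, Wet) = P(Rain) * P(Wet | Rain)
def joint(rain, wet):
    p_wet = p_wet_given_rain[rain]
    return p_rain[rain] * (p_wet if wet else 1 - p_wet)

# Posterior P(Rain = True | WetGrass = True) via Bayes' theorem
evidence = joint(True, True) + joint(False, True)   # P(Wet = True)
posterior = joint(True, True) / evidence
print(f"P(rain | wet grass) = {posterior:.3f}")      # 0.18 / 0.26 ~= 0.692
```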

Data Mining- World Wide Web

Over the last few years, the World Wide Web has become a significant source of information and,
at the same time, a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as web documents and services, hyperlinks, web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of web mining is to look for patterns in web data by collecting and examining data in
order to gain insights.

What is Web Mining?


Web mining can broadly be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of algorithms to discover patterns in mostly
structured data embedded in a knowledge discovery process. Web mining has the distinctive
property of providing a variety of data types. The web has multiple aspects that yield different
approaches to the mining process: web pages consist of text, web pages are linked via hyperlinks,
and user activity can be monitored via web server logs. These three features lead to the
differentiation between three areas: web content mining, web structure mining, and web usage
mining.
There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information, and knowledge from web page content. In web content mining, each web page is considered as an individual document. The miner can take advantage of the semi-structured nature of web pages, as HTML provides information that concerns not only the layout but also the logical structure. The primary task of content mining is data extraction, where structured data is extracted from unstructured websites. The objective is to facilitate data aggregation over various web sites by using the extracted structured data. Web content mining can also be used to distinguish topics on the web. For example, if a user searches for a specific topic on a search engine, the user will get a list of suggestions.
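As a small illustration of extracting content from a page's semi-structured HTML (not part of the original text), the sketch below uses Python's standard-library HTMLParser to pull the title and visible text out of a hypothetical HTML snippet.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._skip = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())

# Hypothetical page content used only for demonstration
html = ("<html><head><title>Data Mining</title></head>"
        "<body><h1>Web Content Mining</h1>"
        "<p>Extracting knowledge from page content.</p></body></html>")
parser = ContentExtractor()
parser.feed(html)
print(parser.title)           # Data Mining
print(" ".join(parser.text))  # Web Content Mining Extracting knowledge from page content.
```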

2. Web Structured Mining:

Web structure mining is used to discover the link structure of hyperlinks, that is, how web pages are connected to each other by links. In web structure mining, the web is considered as a directed graph, with the web pages being the vertices connected by hyperlinks. The most important application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm. It characterizes a page as highly relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined. For example, web structure mining can help organizations determine the link network between two commercial sites.
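The following is a minimal, simplified PageRank sketch in Python; it is not Google's actual implementation, and the tiny link graph and damping factor of 0.85 are assumptions used only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank scores for a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:            # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical three-page link graph used only for demonstration
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # C collects the most rank since A and B both link to it
```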

3. Web Usage Mining:

Web usage mining is used to extract useful data, information, and knowledge from web log records, and it assists in recognizing user access patterns for web pages. When mining the usage of web resources, the analyst works with records of the requests made by visitors to a website, which are typically collected as web server logs. While the content and structure of the collection of web pages reflect the intentions of the pages' authors, the individual requests show how consumers actually use these pages. Web usage mining may therefore reveal relationships that were not intended by the creator of the pages.
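To show what mining web server logs can look like in practice (this is not part of the original text), here is a small sketch that parses Common Log Format lines with Python's standard library and counts requests per page; the log lines themselves are made up for the example.

```python
import re
from collections import Counter

# Regular expression for the Common Log Format (host, timestamp, request, status, bytes)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Hypothetical log lines used only for demonstration
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 512',
]

page_hits = Counter()
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match and match.group("status") == "200":
        page_hits[match.group("path")] += 1

# The most frequently requested pages hint at the dominant access patterns
print(page_hits.most_common())   # [('/index.html', 2), ('/products.html', 1)]
```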

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

The analysis of preprocessed data can be performed through session analysis, which incorporates visitor records, days, times, sessions, and so on. This data can be used to analyze visitor behavior.

A report is created after this analysis, which contains the details of frequently visited web pages and common entry and exit points.
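Below is a hedged sketch (not from the original text) of how preprocessed log entries might be grouped into per-visitor sessions, starting a new session whenever the gap between a visitor's consecutive requests exceeds a hypothetical 30-minute timeout.

```python
from datetime import datetime, timedelta

# Hypothetical preprocessed log records: (visitor, timestamp, page)
records = [
    ("10.0.0.1", datetime(2024, 1, 1, 10, 0), "/index.html"),
    ("10.0.0.1", datetime(2024, 1, 1, 10, 5), "/products.html"),
    ("10.0.0.1", datetime(2024, 1, 1, 12, 0), "/index.html"),   # new session (long gap)
    ("10.0.0.2", datetime(2024, 1, 1, 10, 2), "/contact.html"),
]

def build_sessions(records, timeout=timedelta(minutes=30)):
    """Group records into per-visitor sessions separated by inactivity gaps."""
    sessions = {}
    last_seen = {}
    for visitor, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        if visitor not in sessions or ts - last_seen[visitor] > timeout:
            sessions.setdefault(visitor, []).append([])   # start a new session
        sessions[visitor][-1].append(page)
        last_seen[visitor] = ts
    return sessions

for visitor, visits in build_sessions(records).items():
    print(visitor, visits)
# 10.0.0.1 [['/index.html', '/products.html'], ['/index.html']]
# 10.0.0.2 [['/contact.html']]
```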

II. OLAP (Online Analytical Processing):

OLAP performs a multidimensional analysis of complex data.

OLAP can be carried out on different parts of log-related data over a specific time period.

OLAP tools can be used to derive important business intelligence metrics.
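As a small illustration of an OLAP-style multidimensional view over log-derived data (not part of the original text), the sketch below builds a pivot table with pandas, assuming that library is available; the records and dimension names are hypothetical.

```python
import pandas as pd

# Hypothetical log-derived records used only for demonstration
hits = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "country": ["US", "IN", "US", "IN"],
    "page":    ["/index.html", "/index.html", "/products.html", "/index.html"],
    "visits":  [120, 80, 95, 60],
})

# A pivot table is a simple multidimensional (OLAP-style) view:
# visits summarized along the date and country dimensions.
cube = pd.pivot_table(hits, values="visits", index="date",
                      columns="country", aggfunc="sum", fill_value=0)
print(cube)
```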

Challenges in Web Mining:


The web poses great challenges for resource and knowledge discovery, based on the following observations:

• The complexity of web pages:

Web pages do not have a unifying structure and are far more complex than traditional text documents. The digital library of the web contains an enormous number of documents, and these libraries are not organized in any particular order.

• The web is a dynamic data source:

The data on the internet is updated rapidly; for example, news, weather, shopping, financial news, sports, and so on.

• Diversity of client networks:

The client network on the web is expanding rapidly. These clients have different interests, backgrounds, and usage purposes. There are more than a hundred million workstations connected to the internet, and the number is still increasing tremendously.

• Relevancy of data:

It is generally the case that a particular user is interested in only a small portion of the web, while the rest of the web contains data that is not relevant to the user and may lead to unwanted results.
• The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too huge
for data warehousing and data mining.

Mining the Web's Link Structure to Recognize Authoritative Web Pages:

The web consists of pages as well as hyperlinks pointing from one page to another. When the creator of a web page creates a hyperlink pointing to another web page, this can be considered the creator's endorsement of the other page. The collective endorsement of a given page by many creators across the web may indicate the importance of the page and may naturally lead to the discovery of authoritative web pages. Web linkage data therefore provide rich information about the relevance, quality, and structure of the web's content, and thus are a rich source for web mining.
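One well-known way to operationalize this idea is the HITS (hubs and authorities) algorithm, which is not named in the text above but follows the same intuition. The sketch below runs a few iterations of hub/authority score updates over a hypothetical link graph.

```python
from math import sqrt

def hits(links, iterations=20):
    """Compute hub and authority scores for a dict {page: [pages it links to]}."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to this page
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values()))
        auth = {p: v / norm for p, v in auth.items()}

        # Hub score: sum of authority scores of pages this page links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values()))
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth

# Hypothetical link graph: A and B both point to C, so C is the strongest authority
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))   # C
```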

Application of Web Mining:


Web mining has extensive applications because of the many uses of the web. Some applications of web mining are listed below.

• Marketing and conversion tracking
• Data analysis of website and application performance
• Audience behavior analysis
• Advertising and campaign performance analysis
• Testing and analysis of a site
