FDSUNIT 1
Big Data literally means large amounts of data. The term captures the idea that useful
inferences can be drawn from a large body of data in ways that were not possible with
smaller datasets: extremely large data sets can be analyzed computationally to reveal
patterns, trends, and associations that are otherwise difficult to identify.
1. Data quality
One of the biggest challenges most businesses face is ensuring that the data they collect is
reliable. When data suffers from inaccuracy, incompleteness, inconsistency, or duplication, it
can lead to incorrect insights and poor decision-making. There are many tools available
for data preparation, deduplication, and enhancement, and ideally some of this functionality
is built into your analytics platform.
Non-standardized data can also be an issue—for example, when units, currencies, or date
formats vary. Standardizing as much as possible, as early as possible, will minimize cleansing
efforts and enable better analysis.
By implementing solutions such as data validation, data cleansing, and proper data
governance, organizations can ensure their data is accurate, consistent, complete, accessible,
and secure. This high-quality data can act as the fuel for effective data analysis and
ultimately lead to better decision-making.
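As a concrete illustration, here is a minimal pandas sketch of these cleansing steps
(deduplication, standardizing date formats, filling missing values). The column names and
sample records are hypothetical, and it assumes pandas 2.0 or later for the mixed-format
date parsing.

```python
import pandas as pd

# Hypothetical sales records showing common quality problems:
# duplicate rows, inconsistent date formats, and missing values.
raw = pd.DataFrame({
    "order_id": [101, 101, 102, 103],
    "order_date": ["2023-01-05", "2023-01-05", "05 Jan 2023", None],
    "amount": [250.0, 250.0, None, 125.5],
})

clean = (
    raw.drop_duplicates(subset="order_id")  # deduplicate on the business key
       .assign(
           # standardize mixed date formats into one datetime type
           # (format="mixed" assumes pandas >= 2.0)
           order_date=lambda df: pd.to_datetime(df["order_date"],
                                                format="mixed", errors="coerce"),
           # simple validation rule: fill missing amounts with the median
           amount=lambda df: df["amount"].fillna(df["amount"].median()),
       )
)
print(clean)
```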
2. Data access
Companies often have data scattered across multiple systems and departments, and in
structured, unstructured, and semi-structured formats. This makes it both difficult to
consolidate and analyze and vulnerable to unauthorized use. Disorganized data poses
challenges for analytics, machine learning, and artificial intelligence projects that work best
with as much data as possible to draw from.
For many companies, the goal is democratization—granting data access across the entire
organization regardless of department. To achieve this while also guarding against
unauthorized access, companies should gather their data in a central repository, such as a
data lake, or connect it directly to analytics applications using APIs and other integration
tools. IT departments should strive to create streamlined data workflows with built-in
automation and authentication to minimize data movement, reduce compatibility or format
issues, and keep a handle on what users and systems have access to their information.
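One hedged way to picture the consolidation step is to pull structured and semi-structured
sources into a single analytics-friendly store. The sketch below is a minimal illustration
with made-up sources; it assumes pandas plus a parquet engine such as pyarrow is available,
with a local parquet file standing in for a data lake location.

```python
import json
import pandas as pd

# Hypothetical departmental sources in different formats.
crm_records = pd.DataFrame({"customer_id": [1, 2], "region": ["EU", "US"]})
support_json = '[{"customer_id": 1, "tickets": 3}, {"customer_id": 2, "tickets": 0}]'
support_records = pd.DataFrame(json.loads(support_json))

# Consolidate into one table keyed on customer_id.
unified = crm_records.merge(support_records, on="customer_id", how="left")

# Land the result in a central, analytics-friendly repository
# (a local parquet file standing in for a data lake location).
unified.to_parquet("customer_360.parquet", index=False)
print(unified)
```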
3. Bad visualizations
Transforming data into graphs or charts through data visualization efforts helps present
complex information in a tangible, accurate way that makes it easier to understand. But using
the wrong visualization method or including too much data can lead to misleading
visualizations and incorrect conclusions. Input errors and oversimplified visualizations could
also cause the resulting report to misrepresent what’s actually going on.
Effective data analytics systems support report generation, provide guidance on
visualizations, and are intuitive enough for business users to operate. Otherwise, the burden
of preparation and output falls on IT, and the quality and accuracy of visualizations can be
questionable. To avoid this, organizations must make sure that the system they choose can
handle structured, unstructured, and semi-structured data.
So how do you achieve effective data visualization? Start with the following three key
concepts (a short plotting sketch follows the list below):
Know your audience: Tailor your visualization to the interests of your viewers. Avoid
technical jargon or complex charts and be selective about the data you include. A CEO wants
very different information than a department head.
Start with a clear purpose: What story are you trying to tell with your data? What key
message do you want viewers to take away? Once you know this, you can choose the most
appropriate chart type. To that end, don’t just default to a pie or bar chart. There are many
visualization options, each suited for different purposes. Line charts show trends over time,
scatter plots reveal relationships between variables, and so on.
Keep it simple: Avoid cluttering your visualization with unnecessary elements. Use clear
labels, concise titles, and a limited color palette for better readability. Avoid misleading
scales, distorted elements, or chart types that might misrepresent the data.
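Here is a minimal matplotlib sketch of those three principles applied to hypothetical monthly
revenue figures: one clear message, a line chart for a trend over time, plain labels, a single
colour, and no decorative clutter.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly revenue (in thousands) for one year.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
revenue = [120, 125, 123, 130, 138, 142, 150, 149, 155, 160, 158, 170]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, revenue, color="tab:blue", linewidth=2)    # single colour, no clutter
ax.set_title("Monthly revenue, 2023 (hypothetical data)")  # concise, purpose-driven title
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($K)")
for side in ("top", "right"):                              # remove unnecessary chart junk
    ax.spines[side].set_visible(False)
plt.tight_layout()
plt.show()
```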
4. Data privacy and security
Controlling access to data is a never-ending challenge that requires data classification as well
as security technology.
At a high level, careful attention must be paid to who is allowed into critical operational
systems to retrieve data, since any damage done here can bring a business to its knees.
Similarly, businesses need to make sure that when users from different departments log into
their dashboards, they see only the data that they should see. Businesses must establish
strong access controls and ensure that their data storage and analytics systems are secure
and compliant with data privacy regulations at every step of the data collection, analysis, and
distribution process.
Before you can decide which roles should have access to various types or pools of data, you
need to understand what that data is. That requires setting up a data classification system.
To get started, consider the following steps:
See what you have: Identify the types of data your organization collects, stores, and
processes, then label it based on sensitivity, potential consequences of a breach, and
regulations it’s subject to, such as HIPAA or GDPR.
Develop a data classification matrix: Define a schema with different categories, such as
public, confidential, and internal use only, and establish criteria for applying these
classifications to data based on its sensitivity, legal requirements, and your company policies.
See who might want access: Outline roles and responsibilities for data classification,
ownership, and access control. A finance department employee will have different access
rights than a member of the HR team, for example.
Then, based on the classification policy, work with data owners to categorize your data. Once
a scheme is in place, consider data classification tools that can automatically scan and
categorize data based on your defined rules.
Finally, set up appropriate data security controls and train your employees on them,
emphasizing the importance of proper data handling and access controls.
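To make the classification matrix and role-based access idea concrete, here is a small
illustrative sketch in plain Python. The categories, roles, datasets, and rules are
assumptions for illustration, not a prescribed policy.

```python
# Hypothetical classification matrix: which roles may read which data classes.
ACCESS_MATRIX = {
    "public":            {"any_role"},
    "internal_use_only": {"finance", "hr", "marketing", "executives"},
    "confidential":      {"finance", "hr", "executives"},
}

# Hypothetical labelling of datasets, based on sensitivity and regulation.
DATASET_CLASSIFICATION = {
    "marketing_site_stats": "public",
    "org_chart":            "internal_use_only",
    "payroll":              "confidential",   # e.g. subject to GDPR/HIPAA-style rules
}

def can_read(role: str, dataset: str) -> bool:
    """Return True if the role is allowed to read the dataset's classification."""
    data_class = DATASET_CLASSIFICATION[dataset]
    allowed = ACCESS_MATRIX[data_class]
    return "any_role" in allowed or role in allowed

print(can_read("finance", "payroll"))     # True
print(can_read("marketing", "payroll"))   # False
```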
5. Talent shortage
Many companies can’t find the talent they need to turn their vast supplies of data into usable
information. The demand for data analysts, data scientists, and other data-related roles has
outpaced the supply of qualified professionals with the necessary skills to handle complex
data analytics tasks. And there are no signs of that demand leveling out, either. By 2026, the
number of jobs requiring data science skills is projected to grow by nearly 28%, according to
the US Bureau of Labor Statistics.
Fortunately, many analytics systems today offer advanced data analytics capabilities, such as
built-in machine learning algorithms, that are accessible to business users without
backgrounds in data science. Tools with automated data preparation and cleaning
functionalities, in particular, can help data analysts get more done.
Companies can also upskill, identifying employees with strong analytical or technical
backgrounds who might be interested in transitioning to data roles and offering paid
training programs, online courses, or data bootcamps to equip them with the necessary
skills.
6. Tool sprawl
It’s not uncommon that, once an organization embarks on a data analytics strategy, it ends
up buying separate tools for each layer of the analytics process. Similarly, if departments act
autonomously, they may wind up buying competing products with overlapping or
counteractive capabilities; this can also be an issue when companies merge.
The result is a hodgepodge of technology, and if it’s deployed on-premises, then somewhere
there’s a data center full of different software and licenses that must be managed.
Altogether, this can lead to waste for the business and add unnecessary complexity to the
architecture. To prevent this, IT leaders should create an organization-wide strategy for data
tools, working with various department heads to understand their needs and requirements.
Issuing a catalog that includes various cloud-based options can help get everyone on a
standardized platform.
7. Cost
Data analytics requires investment in technology, staff, and infrastructure. But unless
organizations are clear on the benefits they’re getting from an analytics effort, IT teams may
struggle to justify the cost of implementing the initiative properly.
Deploying a data analytics platform via a cloud-based architecture can eliminate most
upfront capital expenses while reducing maintenance costs. It can also rein in the problem of
too many one-off tools.
Operationally, an organization’s return on investment comes from the insights that data
analytics can reveal to optimize marketing, operations, supply chains, and other business
functions. To show ROI, IT teams must work with stakeholders to define clear success metrics
that tie back to business goals. Examples might be that findings from data analytics led to a
10% increase in revenue, an 8% reduction in customer churn, or a 15% improvement in
operational efficiency. Suddenly, that cloud service seems like a bargain.
While quantifiable data is important, some benefits might be harder to measure directly, so
IT teams need to think beyond just line-item numbers. For example, a data project might
improve decision-making agility or customer experience, which can lead to long-term gains.
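As a rough illustration of tying metrics back to cost, the short sketch below computes a
simple ROI figure; the revenue uplift, churn savings, and platform cost are hypothetical
inputs.

```python
# Hypothetical annual figures, in dollars.
revenue_uplift = 400_000   # e.g. attributed to analytics-driven campaigns
churn_savings  = 150_000   # e.g. from an 8% reduction in customer churn
platform_cost  = 200_000   # cloud analytics subscription plus staff time

total_benefit = revenue_uplift + churn_savings
roi = (total_benefit - platform_cost) / platform_cost

print(f"ROI: {roi:.0%}")   # -> ROI: 175%
```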
8. Changing technology
The data analytics landscape is constantly evolving, with new tools, techniques, and
technologies emerging all the time. For example, the race is currently on for companies to
get advanced capabilities such as artificial intelligence (AI) and machine learning (ML) into
the hands of business users as well as data scientists. That means introducing new tools that
make these techniques accessible and relevant. But for some organizations, new analytics
technologies may not be compatible with legacy systems and processes. This can cause data
integration challenges that require greater transformations or custom-coded connectors to
resolve.
Evolving feature sets also mean continually evaluating the best product fit for an
organization’s particular business needs. Again, using cloud-based data analytics tools can
smooth over feature and functionality upgrades, as the provider will ensure the latest version
is always available. Compare that to an on-premises system that might only be updated
every year or two, leading to a steeper learning curve between upgrades.
9. Resistance to change
Applying data analytics often requires what can be an uncomfortable level of change.
Suddenly, teams have new information about what’s happening in the business and different
options for how they should react. Leaders accustomed to operating on intuition rather than
data may also feel challenged—or even threatened—by the shift.
10. Goal setting
Without clear goals and objectives, businesses will struggle to determine which data sources
to use for a project, how to analyze data, what they want to do with results, and how they’ll
measure success. A lack of clear goals can lead to unfocused data analytics efforts that don’t
deliver meaningful insights or returns. This can be mitigated by defining the objectives and
key results of a data analytics project before it begins.
2. Staging Area: Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before it can be loaded into the data warehouse.
For this purpose, an ETL tool is typically used (a minimal ETL sketch follows this
component list).
• E (Extract): Data is extracted from the external data sources.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after it has been transformed into the
standard format.
3. Data Warehouse: After cleansing, the data is stored in the data warehouse, which acts as
the central repository. It stores the metadata, while the actual data is held in the data
marts. Note that in this top-down approach the data warehouse stores the data in its purest
form.
4. Data Marts: A data mart is also part of the storage component. It stores the information
of a particular function of an organisation that is handled by a single authority. There can
be as many data marts in an organisation as there are functions. We can also say that a
data mart contains a subset of the data stored in the data warehouse.
5. Data Mining: The practice of analysing the big data present in the data warehouse is
data mining. It is used to find the hidden patterns present in the database or data
warehouse with the help of data mining algorithms.
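As promised in item 2 above, here is a minimal sketch of the staging/ETL step, assuming
pandas and the standard-library sqlite3 module stand in for an ETL tool and the warehouse;
the source records and the sales_fact table are hypothetical.

```python
import sqlite3
import pandas as pd

# --- Extract: hypothetical records pulled from an external source.
extracted = pd.DataFrame({
    "sale_id":   [1, 2, 3],
    "sale_date": ["2023-01-05", "06/01/2023", "2023/01/07"],  # inconsistent formats
    "amount":    ["100.0", "250.5", "80.25"],                  # stored as text
})

# --- Transform: validate and convert into the standard format.
transformed = extracted.assign(
    sale_date=pd.to_datetime(extracted["sale_date"], format="mixed", errors="coerce"),
    amount=pd.to_numeric(extracted["amount"], errors="coerce"),
).dropna()   # discard rows that failed validation

# --- Load: write the cleaned data into the (here, in-memory) warehouse.
warehouse = sqlite3.connect(":memory:")
transformed.to_sql("sales_fact", warehouse, index=False)
print(pd.read_sql("SELECT * FROM sales_fact", warehouse))
```

The same extract-transform-load pattern applies whether the target is SQLite, a commercial
warehouse, or a cloud service; only the load step changes.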
Inmon defines this approach as building the data warehouse as a central repository for the
complete organisation, with data marts created from it only after the complete data
warehouse has been created.
Advantages of Top-Down Approach
1. Since the data marts are created from the data warehouse, it provides a consistent
dimensional view across data marts.
2. This model is also considered the strongest model for coping with business changes,
which is why big organisations prefer to follow this approach.
3. Creating a data mart from the data warehouse is easy.
4. Improved data consistency: The top-down approach promotes data consistency by
ensuring that all data marts are sourced from a common data warehouse. This ensures
that all data is standardized, reducing the risk of errors and inconsistencies in
reporting.
5. Easier maintenance: Since all data marts are sourced from a central data warehouse,
it is easier to maintain and update the data in a top-down approach. Changes can be
made to the data warehouse, and those changes will automatically propagate to all the
data marts that rely on it.
6. Better scalability: The top-down approach is highly scalable, allowing organizations to
add new data marts as needed without disrupting the existing infrastructure. This is
particularly important for organizations that are experiencing rapid growth or have
evolving business needs.
7. Improved governance: The top-down approach facilitates better governance by
enabling centralized control of data access, security, and quality. This ensures that all
data is managed consistently and that it meets the organization’s standards for quality
and compliance.
8. Reduced duplication: The top-down approach reduces data duplication by ensuring
that data is stored only once in the data warehouse. This saves storage space and
reduces the risk of data inconsistencies.
9. Better reporting: The top-down approach enables better reporting by providing a
consistent view of data across all data marts. This makes it easier to create accurate
and timely reports, which can improve decision-making and drive better business
outcomes.
10. Better data integration: The top-down approach enables better data integration by
ensuring that all data marts are sourced from a common data warehouse. This makes
it easier to integrate data from different sources and provides a more complete view of
the organization’s data.
Disadvantages of Top-Down Approach
1. The cost and time required for design and maintenance are very high.
2. Complexity: The top-down approach can be complex to implement and maintain,
particularly for large organizations with complex data needs. The design and
implementation of the data warehouse and data marts can be time-consuming and
costly.
3. Lack of flexibility: The top-down approach may not be suitable for organizations that
require a high degree of flexibility in their data reporting and analysis. Since the design
of the data warehouse and data marts is pre-determined, it may not be possible to
adapt to new or changing business requirements.
4. Limited user involvement: The top-down approach can be dominated by IT
departments, which may lead to limited user involvement in the design and
implementation process. This can result in data marts that do not meet the specific
needs of business users.
5. Data latency: The top-down approach may result in data latency, particularly when
data is sourced from multiple systems. This can impact the accuracy and timeliness of
reporting and analysis.
6. Data ownership: The top-down approach can create challenges around data
ownership and control. Since data is centralized in the data warehouse, it may not be
clear who is responsible for maintaining and updating the data.
7. Cost: The top-down approach can be expensive to implement and maintain,
particularly for smaller organizations that may not have the resources to invest in a
large-scale data warehouse and associated data marts.
8. Integration challenges: The top-down approach may face challenges in integrating
data from different sources, particularly when data is stored in different formats or
structures. This can lead to data inconsistencies and inaccuracies.
What is Bottom-Up Approach?
The bottom-up approach is Ralph Kimball’s approach, in which individual data marts are
built first, each centred on a specific business goal or function such as marketing or sales.
These data marts are extracted, transformed, and loaded first so the organization can
generate reports right away. The data marts are then integrated into a more centralized
and broad data warehouse system. This method is more flexible and cheaper, and it is best
suited to smaller organizations. However, it can create data silos and inconsistencies, which
may prevent the organization from having a coherent view across its various departments.
1. First, the data is extracted from external sources (the same as in the top-down
approach).
2. Then the data goes through the staging area (as explained above) and is loaded into
data marts instead of the data warehouse. The data marts are created first and provide
reporting capability; each addresses a single business area.
Basic Working:
1. It all starts when the user submits a data mining request; the request is then sent to
the data mining engine for pattern evaluation.
2. These applications try to find the solution to the query using the already present
database.
3. The extracted metadata is then sent for analysis to the data mining engine, which
sometimes interacts with the pattern evaluation modules to determine the result.
4. This result is then sent to the front end in an easily understandable manner using a
suitable interface.
A detailed description of the parts of the data mining architecture follows:
1. Data Sources: Database, World Wide Web(WWW), and data warehouse are parts of
data sources. The data in these sources may be in the form of plain text, spreadsheets,
or other forms of media like photos or videos. WWW is one of the biggest sources of
data.
2. Database Server: The database server contains the actual data ready to be
processed. It performs the task of handling data retrieval as per the request of the
user.
3. Data Mining Engine: It is one of the core components of the data mining architecture
that performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in
the data and sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface: Since the user cannot fully understand the complexity of the
data mining process, the graphical user interface helps the user communicate effectively
with the data mining system.
6. Knowledge Base: Knowledge Base is an important part of the data mining engine that
is quite beneficial in guiding the search for the result patterns. Data mining engines
may also sometimes get inputs from the knowledge base. This knowledge base may
contain data from user experiences. The objective of the knowledge base is to make
the result more accurate and reliable.
Types of Data Mining architecture:
1. No Coupling: The no-coupling data mining architecture retrieves data directly from
particular data sources. It does not use a database for retrieving the data, which would
otherwise be a more efficient and accurate way to do so. The no-coupling architecture is
considered poor and is used only for very simple data mining processes.
2. Loose Coupling: In a loose-coupling architecture, the data mining system retrieves data
from the database and stores the results back in those systems. This architecture is suited
to memory-based data mining.
3. Semi-Tight Coupling: It tends to use various advantageous features of the data
warehouse systems. It includes sorting, indexing, and aggregation. In this architecture,
an intermediate result can be stored in the database for better performance.
4. Tight Coupling: In this architecture, the data warehouse is considered one of the most
important components, and its features are employed for performing data mining tasks.
This architecture provides scalability, performance, and integrated information.
Advantages of Data Mining:
• Assists in preventing future adversities by accurately predicting future trends.
• Contributes to the making of important decisions.
• Compresses data into valuable information.
• Provides new trends and unexpected patterns.
• Helps to analyze huge data sets.
• Aids companies to find, attract and retain customers.
• Helps the company to improve its relationship with the customers.
• Assists companies in optimizing production according to the popularity of a certain
product, thus saving costs.
Disadvantages of Data Mining:
• Excessive work intensity requires high-performance teams and staff training.
• The requirement for large investments can also be a problem, as data collection can
consume many resources and come at a high cost.
• Lack of security could also put the data at huge risk, as the data may contain private
customer details.
• Inaccurate data may lead to the wrong output.
• Huge databases are quite difficult to manage.
1. Trend: Trend represents the long-term movement or directionality of the data over
time. It captures the overall tendency of the series to increase, decrease, or remain
stable. Trends can be linear, indicating a consistent increase or decrease, or nonlinear,
showing more complex patterns.
2. Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly,
monthly, or weekly and are typically influenced by factors such as seasons, holidays,
or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the time series
that do not have a fixed period like seasonality. These fluctuations represent economic
or business cycles, which can extend over multiple years and are often associated with
expansions and contractions in economic activity.
4. Irregularity (or Noise): Irregularity, also known as noise or randomness, refers to the
unpredictable or random fluctuations in the data that cannot be attributed to the trend,
seasonality, or cyclical variations. These fluctuations may result from random events,
measurement errors, or other unforeseen factors. Irregularity makes it challenging to
identify and model the underlying patterns in the time series data.
Time Series Visualization
Time series visualization is the graphical representation of data collected over successive
time intervals. It encompasses various techniques such as line plots, seasonal subseries
plots, autocorrelation plots, histograms, and interactive visualizations. These methods help
analysts identify trends, patterns, and anomalies in time-dependent data for better
understanding and decision-making.
Different Time series visualization graphs
1. Line Plots: Line plots display data points over time, allowing easy observation of
trends, cycles, and fluctuations.
2. Seasonal Plots: These plots break down time series data into seasonal components,
helping to visualize patterns within specific time periods.
3. Histograms and Density Plots: Shows the distribution of data values over time,
providing insights into data characteristics such as skewness and kurtosis.
4. Autocorrelation and Partial Autocorrelation Plots: These plots visualize correlation
between a time series and its lagged values, helping to identify seasonality and lagged
relationships.
5. Spectral Analysis: Spectral analysis techniques, such as periodograms and
spectrograms, visualize frequency components within time series data, useful for
identifying periodicity and cyclical patterns.
6. Decomposition Plots: Decomposition plots break down a time series into its trend,
seasonal, and residual components, aiding in understanding the underlying patterns.
These visualization techniques allow analysts to explore, interpret, and communicate
insights from time series data effectively, supporting informed decision-making and
forecasting.
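To tie the components above to the decomposition plot idea, the sketch below builds a
synthetic monthly series with a trend, yearly seasonality, and noise, then splits it apart and
plots the pieces. It assumes numpy, pandas, matplotlib, and statsmodels are installed; the
series itself is made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", periods=72, freq="MS")   # 6 years, monthly

trend       = np.linspace(100, 160, len(idx))               # long-term upward movement
seasonality = 10 * np.sin(2 * np.pi * idx.month / 12)       # repeats every 12 months
noise       = rng.normal(0, 3, len(idx))                    # irregular component

series = pd.Series(trend + seasonality + noise, index=idx, name="demand")

# Decomposition plot: observed, trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=12)
result.plot()
plt.tight_layout()
plt.show()
```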
Feature Selection Techniques in Machine
Learning
Feature selection:
Feature selection is a process that chooses a subset of features from the original features
so that the feature space is optimally reduced according to a certain criterion.
Feature selection is a critical step in the feature construction process. In text
categorization problems, some words simply do not appear very often. Perhaps the word
“groovy” appears in exactly one training document, which is positive. Is it really worth
keeping this word around as a feature? It’s a dangerous endeavor because it’s hard to tell
with just one training example whether it is really correlated with the positive class or is
just noise. You could hope that your learning algorithm is smart enough to figure it out. Or
you could just remove it.
There are three general classes of feature selection algorithms: Filter methods, wrapper
methods and embedded methods.
The role of feature selection in machine learning is:
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Feature selection algorithms are as follows:
1. Instance based approaches: There is no explicit procedure for feature subset
generation. Many small data samples are sampled from the data. Features are weighted
according to their roles in differentiating instances of different classes for a data sample.
Features with higher weights can be selected.
2. Nondeterministic approaches: Genetic algorithms and simulated annealing are also
used in feature selection.
3. Exhaustive complete approaches: Branch and Bound evaluates estimated accuracy
and ABB checks an inconsistency measure that is monotonic. Both start with a full feature
set until the preset bound cannot be maintained.
While building a machine learning model for a real-life dataset, we come across many
features, and not all of them are important every time. Adding unnecessary features while
training the model reduces the overall accuracy of the model, increases its complexity,
decreases its generalization capability, and can make the model biased. The saying
“Sometimes less is better” applies to machine learning models as well. Hence, feature
selection is one of the important steps in building a machine learning model. Its goal is to
find the best possible set of features for building the model.
Some popular techniques of feature selection in machine learning are:
• Filter methods
• Wrapper methods
• Embedded methods
Filter Methods
These methods are generally used while doing the pre-processing step. These methods
select features from the dataset irrespective of the use of any machine learning algorithm.
In terms of computation, they are very fast and inexpensive and are very good for
removing duplicated, correlated, redundant features but these methods do not remove
multicollinearity. Selection of feature is evaluated individually which can sometimes help
when features are in isolation (don’t have a dependency on other features) but will lag
when a combination of features can lead to increase in the overall performance of the
model.
• Chi-square Test – The chi-square test is used for categorical features. It measures the
dependence between each feature and the target by comparing observed and expected
frequencies: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the
expected frequency. Features with the highest chi-square scores are selected.
• Fisher’s Score – Fisher’s Score selects each feature independently according to their
scores under Fisher criterion leading to a suboptimal set of features. The larger the
Fisher’s score is, the better is the selected feature.
• Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the direction of
the relationship with its values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose
variance doesn’t meet the specific threshold. By default, this method removes features
having zero variance. The assumption made using this method is higher variance
features are likely to contain more information.
• Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the
mean absolute difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM)
to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as
AM ≥ GM for a given feature. Higher dispersion ratio implies a more relevant feature.
• Mutual Dependence – This method measures if two variables are mutually dependent,
and thus provides the amount of information obtained for one variable on observing the
other variable. Depending on the presence/absence of a feature, it measures the
amount of information that feature contributes to making the target prediction.
• Relief – This method measures the quality of attributes by randomly sampling an
instance from the dataset and updating each feature’s weight based on the difference
between the selected instance and its two nearest neighbours, one from the same class
and one from the opposite class.
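As referenced above, here is a small scikit-learn sketch of two of the listed filter methods,
Variance Threshold and the chi-square test, applied to the built-in iris dataset; the
threshold and k values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance is below the cut-off.
vt = VarianceThreshold(threshold=0.2)
X_high_var = vt.fit_transform(X)
print("Kept after variance threshold:", X_high_var.shape[1], "of", X.shape[1])

# Chi-square test: score each (non-negative) feature against the class labels
# and keep the k best; iris features are all non-negative, so chi2 applies.
skb = SelectKBest(score_func=chi2, k=2)
X_best = skb.fit_transform(X, y)
print("Chi-square scores:", skb.scores_.round(1))
print("Selected feature indices:", skb.get_support(indices=True))
```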
Wrapper methods:
Wrapper methods, also referred to as greedy algorithms, train the model using a subset of
features in an iterative manner. Based on the conclusions drawn from the previous round of
training, features are added or removed. The stopping criteria for selecting the best subset
are usually pre-defined by the person training the model, such as when the performance of
the model starts to decrease or a specific number of features has been reached. The main
advantage of wrapper methods over filter methods is that they provide an optimal set of
features for training the model, thus resulting in better accuracy than filter methods, but
they are computationally more expensive.
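As a hedged illustration of the wrapper idea, the sketch below uses scikit-learn's recursive
feature elimination (RFE), one common wrapper method: it repeatedly trains a model and
drops the weakest feature until the requested number remains. The estimator and the
number of features to keep are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Wrapper method: the selector trains the model on candidate subsets and
# removes the least important feature at each iteration.
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X, y)

print("Selected features:", selector.support_)   # boolean mask over the 4 features
print("Feature ranking:  ", selector.ranking_)   # 1 = selected
```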