
ASSIGNMENT 1 FRONT SHEET

Qualification BTEC Level 5 HND Diploma in Computing

Unit number and title Unit 17: Business Process Support

Submission date 04/08/2024 Date Received 1st submission

Re-submission Date Date Received 2nd submission

Student Name Nguyen Nam Truong Student ID IT0603

Class IT0603 Assessor name Dinh Van Dong

Student declaration

I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that
making a false declaration is a form of malpractice.

Student’s signature Truong

Grading grid

P5 P6 P7 M3 M4 D2 D1
Summative Feedback: Resubmission Feedback:

Grade: Assessor Signature: Date:


Internal Verifier’s Comments:

Signature & Date:


Contents
A. Introduction
B. Content
I. Discuss how tools and technologies associated with data science are used to support business processes and inform decisions (P5)
1. Data Science
2. What Is Data Science Used For?
3. Data Science Techniques
4. Data Science Tools
5. Data Science Use Cases
II. Design a data science solution to support decision making related to a real-world problem (P6)
1. Problems encountered during data collection
2. Solutions
3. Benefits of using data to address the problem
4. The Business Process Architecture
III. Implement a data science solution to support decision making related to a real-world problem (P7)
1. Data Collection
2. Data Cleaning and Preprocessing
C. Conclusion
D. References
Figure 1: Data Science
Figure 2: Example of what Data Science is used for
Figure 3: Benefits of Data Science
Figure 4: Data Science Techniques
Figure 5: Python language
Figure 6: R programming language
Figure 7: Apache Spark
Figure 8: Jupyter Notebooks
Figure 9: TensorFlow
Figure 10: PyTorch
Figure 11: SQL
Figure 12: Matplotlib
Figure 13: Original data table
Figure 14: Dataset after removing Duplicate rows
Figure 15: Dataset after removing "Not_Useful_Column"
Figure 16: Dataset after removing various inconsistencies
Figure 17: Dataset after Standardizing
Figure 18: Dataset after Splitting
Figure 19: Dataset after replacing
Figure 20: Filtering down Rows of Data
A. Introduction
In the dynamic and highly competitive domain of consumer electronics manufacturing, ABC Manufacturing
has emerged as a trailblazer by embracing data-driven strategies to enhance its supply chain operations.
Recognizing the pivotal role of data and information in navigating complex market dynamics and evolving
consumer preferences, ABC Manufacturing has orchestrated a paradigm shift in its approach to supply
chain management.

By meticulously analyzing historical sales data, market trends, and customer behaviors, ABC
Manufacturing has unlocked the ability to forecast demand with unparalleled accuracy. This foresight
enables the organization to adjust production levels and inventory holdings proactively, minimizing
stockouts, optimizing resource utilization, and ultimately, bolstering cost efficiencies while elevating
customer satisfaction.

Moreover, ABC Manufacturing has harnessed the power of real-time data from sensors and IoT devices
deployed across its production facilities and logistics network. This wealth of data enables the organization
to monitor equipment performance, track energy consumption, and streamline transportation routes,
thereby identifying bottlenecks and optimizing production processes. Through proactive maintenance
interventions triggered by real-time analytics, ABC Manufacturing mitigates the risk of costly breakdowns
and production delays, ensuring operational continuity and efficiency.
B. Content

I. Discuss how tools and technologies associated with data science are used to support business
processes and inform decisions. (P5)

1. Data Science
Data science is a complicated field with many specific topics and abilities, but the fundamental
definition is that it incorporates all the methods for extracting information and understanding from
data. Data is ubiquitous, and it comes in massive and exponentially expanding numbers. Data science
encompasses the methods used to discover, condition, extract, assemble, process, analyze, interpret,
model, visualize, report on, and present data, regardless of its scale.

Data science is an extremely complicated field, owing to the variety and number of academic fields
and technology it employs. Mathematics, statistics, computer science and programming, statistical
modeling, database technologies, signal processing, data modeling, artificial intelligence and learning,
natural language processing, visualization, and predictive analytics are all part of data science.
Data science has a wide range of applications, including social media, medicine, security, health care,
social sciences, biological sciences, engineering, military, business, economics, finance, marketing,
geolocation, and others.

Figure 1: Data Science

Data science allows you to evaluate vast volumes of data and identify trends using formats such as
data visualizations and predictive models. With the capacity to take preventive actions, businesses may
make better judgments, build more efficient operations, strengthen their cybersecurity policies, and
provide better customer experiences. Teams are already using data science to diagnose diseases,
detect malware, and optimize transportation routes.
2. What Is Data Science Used For?
Data science is used to explore connections and patterns within complicated information, leading to
insights that businesses can subsequently utilize to make better decisions. More specifically, data
science is used for complex data analysis, predictive modeling, recommendation generating and data
visualization.

Figure 2: Example of what Data Science is used for

Analysis of Complex Data:

Data science enables speedy and exact analysis. With a variety of software tools and methodologies at
their disposal, data analysts may quickly uncover trends and patterns in even the largest and most
complicated datasets. This allows firms to make smarter decisions, such as how to effectively segment
clients or conduct a comprehensive market analysis.

Predictive Modeling:

Data science can also be used for predictive modeling. In essence, by finding patterns in data through the
use of machine learning, analysts can forecast possible future outcomes with some degree of accuracy.
These models are especially useful in industries like insurance, marketing, healthcare and finance, where
anticipating the likelihood of certain events happening is central to the success of the business.

Recommendation Generation:

Some companies — like Netflix, Amazon and Spotify — rely on data science and big data to generate
recommendations for their users based on their past behavior. It’s thanks to data science that users of
these and similar platforms can be served up content that’s tailored to their preferences and interests.
Data Visualization:

Data science is also used to create data visualizations — think graphs, charts, dashboards — and reporting,
which helps non-technical business leaders and busy executives easily understand otherwise complex
information about the state of their business.

2.1. Benefits of Data Science


Improved Decision Making:

Being able to analyze and glean insights from massive amounts of data gives leaders an accurate
understanding of past developments and concrete evidence for justifying their decisions moving forward.
Companies can then make sound, data-driven decisions that are also more transparent to employees and
other stakeholders.

Figure 3: Benefits of Data Science

Predictive Analytics:

• Data Science provides the tools and methodologies to develop predictive models. These models
enable organizations to harness historical data to make predictions about future events and trends.
This capability is invaluable, particularly for businesses seeking to anticipate customer behavior,
market trends, and potential risks.
• Imagine an e-commerce platform using Data Science to predict which products a customer is likely
to purchase next based on their browsing and purchase history. By doing so, they can proactively
recommend relevant products, leading to increased sales and customer satisfaction. Predictive
analytics is a powerful driver of growth and competitiveness in today’s data-driven economy.

Cost Reduction

• Data Science has the remarkable ability to identify inefficiencies in business processes and supply
chains. By analyzing data, organizations can pinpoint areas where resources are misallocated, or
processes are suboptimal. This insight leads to cost reductions and improved resource allocation.
• For example, a manufacturing company might use Data Science to analyze production data and
identify bottlenecks in their manufacturing process. By addressing these bottlenecks, they can
reduce production costs, increase output, and improve overall efficiency. These cost savings can
be substantial and directly impact the company’s profitability.

Personalization

• In an era where customers expect tailored experiences, personalization is a significant advantage


of Data Science. Through data analysis, companies can personalize their offerings to individual
customers, making each interaction more relevant and engaging.
• Consider a streaming service using Data Science to analyze user preferences and viewing habits. By
leveraging this data, the service can recommend content that aligns with a user’s interests,
increasing user satisfaction and retention. Personalization not only enhances customer
relationships but also drives higher conversion rates and revenue.

Fraud Detection

• Fraud detection and prevention are areas where Data Science plays a pivotal role. Data scientists
develop algorithms that can identify unusual patterns and anomalies in data, flagging potential
fraud or security breaches.
• Financial institutions, for instance, use Data Science to monitor transactions and detect fraudulent
activities. By analyzing transaction data in real-time, they can identify suspicious behavior, such as
unauthorized credit card transactions or identity theft. This proactive approach to fraud detection
helps safeguard both businesses and consumers.

Healthcare Advancements

• In the healthcare sector, Data Science has ushered in a new era of advancements. It enables the
development of predictive models for various purposes, including disease outbreak prediction,
patient diagnosis, and treatment optimization. These applications of Data Science are ultimately
saving lives and improving the quality of healthcare worldwide.
3. Data Science Techniques
There are lots of data science techniques with which data science professionals must be familiar in order
to do their jobs. These are some of the most popular techniques:

Regression

Regression analysis allows you to predict an outcome based on multiple variables and how those variables
affect each other. Linear regression is the most used regression analysis technique. Regression is a type
of supervised learning.

Classification

Classification in data science refers to the process of predicting the category or label of different data
points. Like regression, classification is a subcategory of supervised learning. It’s used for applications such
as email spam filters and sentiment analysis.

Clustering

Clustering, or cluster analysis, is a data science technique used in unsupervised learning. During cluster
analysis, closely associated objects within a data set are grouped together, and then each group is assigned
characteristics. Clustering is done to reveal patterns within data — typically with large, unstructured data
sets.

Anomaly Detection

Anomaly detection, sometimes called outlier detection, is a data science technique in which data points
with relatively extreme values are identified. Anomaly detection is used in industries like finance and
cybersecurity.

Figure 4: Data Science Techniques


What Does a Data Scientist Do?

Data scientists collect, organize, and analyze data so that it may be presented as a clear story with
actionable insights. Data scientists are talented in detecting patterns buried in enormous amounts of data,
and they frequently employ advanced algorithms and machine learning models to assist enterprises in
making accurate assessments and forecasts. A typical data scientist has extensive knowledge of mathematics
and statistics, as well as familiarity with programming.

Job opportunities in data science extend beyond the profession of data scientist. Data analysts, for
example, are responsible for finding actionable information within data sets, analyzing that data, and then
developing reports, dashboards, and visualizations to communicate those insights to others within the
organization.

Another profession, that of data engineer, is responsible for designing, developing, and managing the
systems that data scientists use to access and analyze data. A data engineer's typical responsibilities
include creating data models and pipelines, as well as directing extract, transform, and load (ETL). Each
function in data science requires both technical and soft skills, which must be cultivated during a person's
career.

4. Data Science Tools

4.1. Common data science programming languages


Python

Python is an object-oriented, general-purpose programming language known for having simple syntax and being
easy to use. It’s often used for executing data analysis, building websites and software, and automating various
tasks.

Figure 5: Python language.


R:

R is a programming language that caters to statistical computing and graphics. It’s ideal for creating data
visualizations and building statistical software.

Figure 6: R programming language.

4.2. Popular data science tools


Apache Spark:

Apache Spark is a fast and general-purpose cluster computing system designed for big data processing. It
provides a unified platform for batch processing, streaming analytics, machine learning, and interactive
querying, making it well-suited for a wide range of data-intensive applications. Spark's in-memory
computing capabilities enable it to process large-scale datasets more efficiently than traditional disk-based
systems, leading to significant performance improvements. It offers high-level APIs in multiple
programming languages, including Java, Scala, Python, and R, making it accessible to a broad audience of
developers and data scientists.

Figure 7: Apache Spark

Jupyter Notebooks:

Jupyter Notebooks are interactive web-based environments that enable users to create and share
documents containing live code, equations, visualizations, and narrative text. They support multiple
programming languages, including Python, R, Julia, and others, making them versatile tools for data
analysis, research, and education. Jupyter Notebooks facilitate an iterative and exploratory approach to
data analysis by allowing users to execute code in cells and visualize the results immediately, making them
ideal for prototyping, experimentation, and collaboration.

Figure 8: Jupyter Notebooks

TensorFlow:

TensorFlow is an open-source machine learning framework developed by Google for building and training
deep learning models. It provides a flexible and scalable platform for implementing a wide range of
machine learning algorithms, including neural networks, convolutional neural networks (CNNs), recurrent
neural networks (RNNs), and more. TensorFlow's architecture allows for distributed computing across
multiple CPUs, GPUs, and TPUs, enabling efficient training of complex models on large-scale datasets.

Figure 9: TensorFlow
PyTorch:

PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It is known
for its dynamic computational graph and intuitive interface, which make it particularly suitable for research
and experimentation in deep learning. PyTorch offers a flexible and imperative programming model that
allows users to define and manipulate computational graphs on-the-fly, making it easier to debug and
customize models. It has gained popularity for its ease of use, performance, and support for advanced
techniques such as dynamic neural networks and autograd.

Figure 10: PyTorch

SQL:

Structured Query Language (SQL) is a domain-specific language used for managing and querying
relational databases. It provides a standardized syntax for defining, manipulating, and querying
structured data stored in tables. SQL is essential for data scientists working with relational databases, as it
enables them to retrieve, filter, aggregate, and analyze data efficiently. It supports a wide range of
operations, including data insertion, deletion, updating, and joining, making it a powerful tool for data
manipulation and extraction.

Figure 11: SQL

Matplotlib:

Matplotlib is a comprehensive plotting library for Python, designed to create high-quality static, animated,
and interactive visualizations. It provides a flexible and customizable interface for creating a wide range of
plots and charts, including line plots, scatter plots, bar plots, histograms, and heatmaps. Matplotlib's
extensive gallery of examples and documentation make it easy for users to get started and explore
different visualization techniques. It is often used in conjunction with other data analysis and visualization
libraries, such as Pandas and NumPy, to create informative and visually appealing plots for data
exploration, presentation, and communication.

Figure 12: Matplotlib

5. Data Science Use Cases


Data science helps us achieve many tasks that either were not possible or required a great deal more time
and energy just a few years ago, such as detecting fraud, forecasting revenue, optimizing ride-share
pickups, powering recommendation engines and filtering out spam email.

Healthcare

Data science has led to a number of breakthroughs in the healthcare industry. With a vast network of data
now available via everything from EMRs to clinical databases to personal fitness trackers, medical
professionals are finding new ways to understand disease, practice preventive medicine, diagnose
diseases faster and explore new treatment options. The sensitivity of patient data makes data security an
even bigger point of emphasis in the healthcare space.

Finance

Machine learning and data science have saved the financial industry millions of dollars, and unquantifiable
amounts of time. For example, JP Morgan’s contract intelligence platform uses natural language
processing to process and extract vital data from thousands of commercial credit agreements a year.
Thanks to data science, what would take hundreds of thousands of manual labor hours to complete
is now finished in a few hours.

Cybersecurity

Data science is useful in every industry, but it may be the most important in cybersecurity. For example,
international cybersecurity firm Kaspersky uses data science and machine learning to detect hundreds of
thousands of new samples of malware on a daily basis. Being able to instantaneously detect and learn new
methods of cybercrime through data science is essential to our safety and security in the future.

Logistics

UPS turns to data science to maximize efficiency, both internally and along its delivery routes. The
company’s On-road Integrated Optimization and Navigation (ORION) tool uses data science-backed
statistical modeling and algorithms that create optimal routes for delivery drivers based on weather, traffic
and construction. It’s estimated that data science is saving the logistics company millions of gallons of fuel
and delivery miles each year.

Entertainment

Do you ever wonder how Spotify seems to recommend that perfect song you’re in the mood for? Or how
Netflix knows just what shows you’ll love to binge? Using data science, these media streaming giants learn
your preferences to carefully curate content from their vast libraries that they think will appeal
to your interests.

Product, Sales and Marketing

Many businesses rely on data scientists to build time series forecasting models that help with inventory
management and supply chain optimization. Data scientists are also sometimes tasked with making
proactive recommendations based on budget forecasts made through financial models. Some even
use data mining to segment customers by behavior, tailoring future marketing messages to appeal to
certain groups based on previous brand interactions.
II. Design a data science solution to support decision making related to a real-world problem (P6)

1. Problems encountered during data collection:


Lack of organization in data collection process:

ABC Manufacturing may lack specific procedures and systems to collect data in an organized and efficient
manner. This lack of organization can lead to disparate data collection methods across departments,
resulting in inconsistency, incompleteness, or inaccuracy of data. Without standardized processes,
valuable information may be overlooked or misinterpreted, hindering the company's ability to make
informed decisions.

Lack of centralized data storage:

Data scattered across various locations without a centralized storage system poses significant challenges
for data accessibility, management, and security. Data may reside on individual employees' computers,
departmental servers, or third-party cloud platforms, making it difficult to locate and integrate. This
decentralized approach increases the risk of data loss, unauthorized access, and non-compliance with data
protection regulations, thereby undermining the reliability and integrity of the data.

Data in incorrect format or non-standard:

Data collected from diverse sources may come in different formats or lack standardization, making it
challenging to integrate and analyze effectively. Inconsistent data formats, naming conventions, or
encoding schemes can impede data processing and hinder the extraction of meaningful insights.
Moreover, non-standardized data may violate industry standards or regulatory requirements, exposing
the company to legal and reputational risks.

Inaccurate or redundant data:

The presence of inaccurate, incomplete, or redundant data undermines the reliability and validity of
analytical insights derived from the data. Errors, inconsistencies, or duplications in the data set can distort
analysis results, leading to flawed decision-making. Additionally, redundant data consumes storage space
and computational resources unnecessarily, increasing operational costs and complicating data
management processes.

Data privacy and security concerns:

Data collection processes may not adequately address data privacy and security concerns, putting sensitive
information at risk of unauthorized access, disclosure, or misuse. Lack of encryption, access controls, or
audit trails may expose confidential data to internal or external threats, leading to breaches, regulatory
fines, and reputational damage. Failure to comply with data protection regulations, such as GDPR or
HIPAA, may result in legal liabilities and financial penalties.

Data governance and compliance issues:

ABC Manufacturing may lack robust data governance frameworks and compliance mechanisms to ensure
the integrity, reliability, and ethical use of data. Absence of data ownership, stewardship, and
accountability structures can lead to data silos, conflicts, or inconsistencies, impairing data quality and
trustworthiness. Inadequate documentation, metadata management, or version control may hinder data
traceability, lineage, and auditability, complicating regulatory compliance and risk management efforts.

2. Solutions:
Establish specific data collection procedures:

ABC Manufacturing should develop standardized procedures and protocols for data collection, ensuring
consistency and quality across all data sources. These procedures should outline the methods, frequency,
and responsibilities associated with data collection, as well as guidelines for data validation and quality
assurance.

Build a centralized data storage system:

Implement a centralized data repository or data warehouse to consolidate and organize data from
disparate sources. This centralized storage infrastructure should provide secure, scalable, and efficient
storage solutions, enabling seamless data integration, retrieval, and analysis. By centralizing data
management, the company can enhance data accessibility, reliability, and governance.

Standardize data:

Standardize data formats, structures, and semantics to promote interoperability and consistency across
the organization. Define data schemas, dictionaries, and ontologies to establish common data models and
definitions, facilitating data integration and interpretation. Automated data transformation tools can
assist in converting heterogeneous data into standardized formats, ensuring compatibility and coherence.

Check and filter data:

Implement data validation and cleansing processes to identify and rectify errors, anomalies, or outliers in
the data set. Utilize data profiling, cleansing, and deduplication techniques to eliminate inaccuracies,
incompleteness, or redundancies in the data. By filtering out irrelevant or erroneous data, the company
can improve data quality, relevance, and usability for decision-making purposes.
Enhance data privacy and security measures:

Strengthen data encryption, access controls, and monitoring mechanisms to protect sensitive information
from unauthorized access, disclosure, or tampering. Implement data anonymization or pseudonymization
techniques to minimize privacy risks while preserving data utility. Conduct regular security audits and risk
assessments to identify and mitigate vulnerabilities, ensuring compliance with data protection regulations
and industry standards.

Implement robust data governance frameworks:

Establish clear roles, responsibilities, and accountability structures for data management, including data
ownership, stewardship, and governance bodies. Define policies, standards, and procedures for data
lifecycle management, metadata management, and data quality management. Foster a culture of data
ethics and compliance through training, awareness programs, and ethical guidelines, promoting
transparency, integrity, and trustworthiness in data practices.

3. Benefits of using data to address the problem:


• Enhanced decision-making capabilities: By leveraging accurate, timely, and relevant data, ABC
Manufacturing can gain valuable insights into business performance, customer behavior, and
market trends. Informed decision-making enables the company to identify opportunities, mitigate
risks, and optimize resource allocation, thereby enhancing operational efficiency and
competitiveness.
• Improved operational efficiency: Streamlining data collection, storage, and analysis processes
enhances operational efficiency and agility, enabling faster response times and better resource
utilization. Automation of repetitive tasks, such as data entry, validation, and reporting, reduces
manual effort and human error, freeing up resources for more strategic activities.
• Enhanced data security and compliance: Centralizing data storage and implementing robust
security measures help safeguard sensitive information and mitigate data breaches or
unauthorized access. Adherence to data protection regulations, such as GDPR or CCPA, enhances
trust and credibility among customers, partners, and regulatory authorities, mitigating legal and
reputational risks.
• Facilitated innovation and growth: By harnessing the power of data analytics and insights, ABC
Manufacturing can identify emerging trends, customer preferences, and market opportunities,
driving innovation and growth. Data-driven decision-making empowers the company to adapt to
changing market dynamics, capitalize on competitive advantages, and foster a culture of
continuous improvement and innovation.
4. The Business Process Architecture

Data Acquisition Layer:

Equipment Monitors:

• Capture real-time data from equipment sensors (temperature, vibration, pressure, etc.).
• Transmit data to Central Repository (if applicable) or directly to the Data Ingestion Layer.

Maintenance Records:

• Extract data from historical maintenance records (maintenance activities, repairs, failures).
• Transmit data to Central Repository (if applicable) or directly to the Data Ingestion Layer.

Operational Environment Data:

• Collect environmental data (temperature, humidity, etc.) impacting equipment


performance.
• Transmit data to Central Repository (if applicable) or directly to the Data Ingestion Layer.

Data Ingestion Layer:

• Data Cleaning Module:


• Address missing values, outliers, and inconsistencies in the data.
• Utilize techniques like missing value imputation, outlier removal, and data normalization.
• Data Harmonization Module:
• Combine data from various sources (equipment monitors, records, environment).
• Ensure data compatibility and consistency (unit standardization, formatting).
• Feature Engineering Module:
• Extract relevant characteristics for modeling (statistics, derived variables, time-based).
• Employ techniques like statistical analysis, feature selection, and data transformation.

Data Analysis and Exploration Layer:

• Exploratory Data Analysis (EDA) Module:


• Analyze and visualize data to understand its characteristics, patterns, and relationships.
• Identify potential issues or anomalies requiring further investigation.
• Feature Importance Analysis Module:
• Evaluate the importance of different features for model performance.
• Prioritize features to improve model efficiency and interpretability.
• Data Visualization Module:
• Create visualizations to communicate insights from data analysis to stakeholders.
• Enable informed decision-making about model development and maintenance strategies.

Modeling Layer:

• Model Development Module:


• Select and configure machine learning algorithms (logistic regression, random forest,
gradient boosting).
• Determine appropriate model parameters based on data and prediction objectives.
• Training Module:
• Train the model using historical data from Central Repository (if applicable) or directly
provided data.
• Label training data (equipment failures, maintenance activities).
• Model Evaluation Module:
• Assess model performance using metrics like accuracy, precision, recall, and F1-score.
• Apply cross-validation techniques to ensure model dependability.
Deployment Layer:

• Model Integration Module:


• Integrate the trained model into a software system or platform.
• The system should be capable of receiving real-time data from equipment monitors.
• Real-time Data Processing Module:
• Continuously receive and pre-process real-time sensor data.
• Prepare data for the model following the established procedure.
• Proactive Maintenance Prediction Module:
• Utilize the model to generate forecasts about potential equipment failures within specified
time frames.
• Send automatic alerts to relevant personnel based on predefined thresholds.
• Alert and Notification Management Module:
• Connect the alert system with maintenance personnel or existing maintenance
management systems.
• Trigger timely notifications and alerts.

Maintenance Optimization Layer:

• Maintenance Planning Module:


o Utilize forecasts and alerts to optimize maintenance schedules.
o Prioritize maintenance tasks based on risk level and predicted time.
o Allocate resources effectively for maintenance activities.
• Performance Monitoring Module:
o Track Key Performance Indicators (KPIs) related to equipment uptime, maintenance costs,
and reduction in unplanned downtime.
o Analyze performance data and identify areas for improvement.
• Performance Improvement Feedback Module:
o Analyze performance data and insights gained from monitoring to continuously improve
the predictive model and maintenance processes.
o Feedback is fed back to the Data Ingestion Layer, Data Analysis and Exploration Layer, and
Modeling Layer for continuous optimization.
III. Implement a data science solution to support decision making related to a real-world problem
(P7)

1. Data Collection:
What data to collect:

For ABC Manufacturing, the data collection process should be comprehensive and cover various aspects
of the business operations to gain insights and support decision-making. The types of data to collect may
include:

• Production Data: Information related to the manufacturing processes, such as production


volumes, cycle times, defect rates, machine utilization, and maintenance logs. This data helps in
assessing production efficiency, identifying bottlenecks, and optimizing resource allocation.
• Inventory Data: Details about inventory levels, stock movements, reorder points, lead times, and
supplier information. Monitoring inventory data enables effective inventory management,
minimizes stockouts, reduces carrying costs, and ensures timely delivery to customers.
• Sales and Customer Data: Data on sales transactions, customer orders, sales channels, customer
demographics, purchase history, and customer feedback. Analyzing sales and customer data
provides insights into market trends, customer preferences, buying behaviors, and sales
performance.
• Financial Data: Financial statements, budgetary information, cost breakdowns, profit margins,
cash flow statements, and financial ratios. Financial data analysis aids in assessing the company's
financial health, profitability, liquidity, and solvency, guiding financial planning and investment
decisions.
• Supply Chain Data: Information about suppliers, vendor performance, procurement costs,
transportation logistics, lead times, and supply chain disruptions. Monitoring supply chain data
helps in optimizing supplier relationships, mitigating risks, reducing procurement costs, and
ensuring timely delivery of materials.
• Quality Assurance Data: Quality control metrics, defect rates, product recalls, customer
complaints, and compliance records. Tracking quality assurance data is crucial for maintaining
product quality, meeting regulatory requirements, enhancing customer satisfaction, and
preserving brand reputation.
• Human Resources Data: Employee demographics, attendance records, training history,
performance evaluations, turnover rates, and compensation data. Analyzing HR data enables
workforce planning, talent management, employee engagement, and compliance with labor laws.
How to collect data:

To collect the diverse range of data mentioned above, ABC Manufacturing can employ a combination of
methods and tools tailored to each data source:

• Automated Systems: Implementing automated data collection systems integrated with IoT
sensors, RFID technology, PLCs (Programmable Logic Controllers), and MES (Manufacturing
Execution Systems) to capture real-time data from production lines, equipment, and
machinery.
• Enterprise Software Solutions: Utilizing ERP (Enterprise Resource Planning), CRM (Customer
Relationship Management), SCM (Supply Chain Management), and BI (Business Intelligence)
systems to collect structured data from internal operations, sales channels, and financial
transactions.
• External Data Sources: Leveraging external data sources such as market research reports,
industry databases, government publications, and third-party data providers to supplement
internal data with external market insights and industry benchmarks.
• Surveys and Feedback Mechanisms: Conducting surveys, interviews, focus groups, and
feedback forms with customers, suppliers, employees, and stakeholders to gather qualitative
insights, opinions, and perceptions.
• Web Scraping and Social Media Monitoring: Extracting relevant data from websites, social
media platforms, online forums, and review sites using web scraping tools, sentiment analysis
algorithms, and social listening tools to monitor brand sentiment, customer feedback, and
market trends.
• Manual Data Entry and Observation: Collecting data manually through direct observation,
manual record-keeping, and data entry forms for non-digital processes, qualitative
observations, and subjective assessments.

2. Data Cleaning and Preprocessing:

a. First looking at Data.


I extracted one table from the Power BI presentation I previously mentioned. Upon examining this
table, I noticed that there are instances where the data is incomplete or missing in certain fields.

In the "Last_Name" column, we encounter various inconsistencies such as "/", "…", and "NaN", indicating
missing or improperly formatted data. Similarly, the "Phone_Number" column exhibits errors with some
entries not being valid numbers, containing non-numeric characters, or being labeled as "N/a".
Standardizing and cleaning up this column is necessary to ensure uniformity and accuracy across all
entries.
Moving on to the "Address" column, we observe a lack of uniformity where some entries consist solely of
a street address while others include additional location information and zip codes. To address this, we
need to implement a process to parse and separate the address components to ensure consistency and
facilitate analysis.

Furthermore, inconsistencies in the "Paying Customer" and "Do_Not_Contact" columns are evident, with
entries varying between "Yes" and "No" and potentially other values. Standardizing these columns by
ensuring all entries conform to a consistent format is essential for accurate analysis and decision-making.

Finally, there is a column labeled "Not_Useful_Column" which presumably does not provide relevant
information for our analysis, and we also need to remove duplicate data. Therefore, we need to remove
this column to streamline the dataset and focus only on the relevant data attributes.

Figure 13: Original data table.


b. Removing duplicate Data

The df.drop_duplicates() method in pandas is used to remove duplicate rows from a DataFrame object df.
When you call this method, it returns a new DataFrame with duplicate rows removed based on the values
in all columns.

Input: You pass your DataFrame df to the drop_duplicates() method.

Processing: The method iterates through each row of the DataFrame and identifies duplicate rows based
on the values in all columns.

Identification of Duplicates: Two rows are considered duplicates if they have the same values in all
columns. If a row has identical values to another row in the DataFrame, it is considered a duplicate.

Removal of Duplicates: Once duplicates are identified, only one of the duplicate rows is retained in the
resulting DataFrame, and the other duplicate rows are removed.

Output: The method returns a new DataFrame containing only the unique rows from the original
DataFrame, with duplicates removed.

Figure 14: Dataset after removing Duplicate rows.


c. Dropping Column
The next thing I want to do is remove a column that we don't need, so "Not_Useful_Column" is
the right candidate to delete.

The df.drop(columns="Not_Useful_Column") method removes the column named "Not_Useful_Column" from the
DataFrame df. It returns a new DataFrame with the specified column removed. This action streamlines the
DataFrame by eliminating unnecessary data columns, facilitating analysis or further processing.

Figure 15: Dataset after removing "Not_Useful_Column".

d. Strip
In the "Last_Name" column, we encounter various inconsistencies such as "/", "…", and "NaN", indicating
missing or improperly formatted data.

Here's how it works:

• df["Last_Name"]: Accesses the "Last_Name" column of the DataFrame df.


• .str: Converts the column into a string accessor, allowing string methods to be applied to each element of
the column.
• .lstrip(): Applies the lstrip() method to each string in the column, removing leading whitespace characters.
• The result is a Series containing the modified strings with leading whitespace removed.
Figure 16: Dataset after removing various inconsistencies.
The provided code uses the .loc[] accessor to modify the "Last_Name" column in the DataFrame df
multiple times. Here's what each line does:

• df.loc[:,"Last_Name"] = df["Last_Name"].str.lstrip("..."): This line removes leading occurrences of the
specified characters ("...") from each element in the "Last_Name" column.

• df.loc[:,"Last_Name"] = df["Last_Name"].str.lstrip("/"): This line removes leading forward slashes ("/")
from each element in the "Last_Name" column.

• df.loc[:,"Last_Name"] = df["Last_Name"].str.rstrip("_"): This line removes trailing underscores ("_")
from each element in the "Last_Name" column.

e. Cleaning and Standardizing

Figure 17: Dataset after Standardizing

• df.loc[:,"Phone_Number"] = df["Phone_Number"].str.replace('[^a-zA-Z0-9]',''): This line uses the


str.replace() method to remove all characters in the "Phone_Number" column that are not
alphanumeric (i.e., letters or digits). The regular expression '[^a-zA-Z0-9]' matches any character
that is not a letter (uppercase or lowercase) or a digit, and replaces it with an empty string. This
effectively cleans up the "Phone_Number" column by removing any non-alphanumeric characters.
• df.loc[:,"Phone_Number"]= df["Phone_Number"].str.rstrip('\t'): This line utilizes the str.rstrip()
method to remove trailing whitespace characters, specifically tab characters ('\t'), from the
"Phone_Number" column. This ensures that any leading or trailing tabs are stripped from the
phone numbers, further cleaning up the data.

f. Splitting Column

• df["Address"].str.split(',', 2, expand=True): This part splits each string in the "Address" column by
the comma delimiter (','). The 2 parameter specifies that the splitting should occur at most two
times. The expand=True parameter ensures that the result is returned as a DataFrame with each
part of the split string occupying a separate column.
• df[["Street_Address", "State", "Zip_Code"]] = ...: This assigns the resulting DataFrame from the
str.split() operation to three new columns ("Street_Address", "State", and "Zip_Code") in the original
DataFrame (df).

Figure 18: Dataset after Splitting


g. Standardizing column using replace.

Figure 19: Dataset after replacing.


• df["Do_Not_Contact"].str.replace('Yes','Y'): This code replaces every occurrence of "Yes" in the
"Do_Not_Contact" column with "Y", effectively standardizing the values to a shorter form.
• df["Paying Customer"].str.replace('Yes','Y'): Similarly, this code replaces every occurrence of
"Yes" in the "Paying Customer" column with "Y".
• The same logic applies to replacing "No" with "N" in both columns.

h. Filtering down Rows of Data

Figure 20: Filtering down Rows of Data

Handling Missing Values:

• This line fills any missing values (NaN) in the DataFrame with empty strings ('').

Filtering Rows based on "Do_Not_Contact" Column:

• This loop iterates over each row in the DataFrame.


• If the value in the "Do_Not_Contact" column of a row is 'Y', it drops that row from the DataFrame
using df.drop(x, inplace=True).
This effectively removes any rows where the customer prefers not to be contacted.

Filtering Rows based on Empty "Phone_Number" Column:

• Similar to the previous loop, this loop iterates over each row in the DataFrame.
• If the value in the "Phone_Number" column of a row is an empty string (''), it drops that row from
the DataFrame.
• This ensures that only rows with valid phone numbers are retained.

Resetting Index:

• Finally, this line resets the index of the DataFrame after the rows have been dropped, ensuring that
the index is continuous without any gaps.
• It's important to reset the index after dropping rows to avoid having gaps in the index, which can
be undesirable when working with DataFrames.

Throughout the process of Data Cleaning and Preprocessing, we have tackled and addressed a variety
of common issues encountered in data. From identifying and removing duplicate data, handling missing
values, to standardizing data and eliminating noise, each step has contributed to creating a clean,
standardized dataset ready for analysis and application in decision-making processes.

In summary, the Data Cleaning and Preprocessing process is not only an important part of the data analysis
process but also a crucial step in ensuring the success of data projects and shaping future business
strategies. By ensuring the quality and consistency of the data, we can enhance decision-making
capabilities, improve operational efficiency, and promote sustainable organizational development.

My Code location: git@github.com:NguyenNamTruong/Data-Cleaning-Process.git


C. Conclusion
In conclusion, ABC Manufacturing has exemplified the transformative potential of data-driven strategies
in revolutionizing supply chain management within the consumer electronics manufacturing industry. By
leveraging advanced analytics and real-time data insights, the company has achieved unprecedented
levels of demand forecasting accuracy, enabling proactive adjustments in production levels and inventory
management. This proactive approach not only minimizes stockouts but also optimizes resource
utilization, leading to significant cost efficiencies and heightened customer satisfaction.

Furthermore, ABC Manufacturing's integration of IoT devices and sensors across its production facilities
and logistics network has empowered the organization to monitor equipment performance, track energy
consumption, and streamline transportation routes in real-time. This proactive monitoring and
optimization of production processes have enabled ABC Manufacturing to identify and address
bottlenecks promptly, thereby mitigating the risk of costly breakdowns and production delays. Ultimately,
these data-driven initiatives have ensured operational continuity, efficiency, and competitiveness in the
dynamic landscape of consumer electronics manufacturing. Through its pioneering efforts, ABC
Manufacturing continues to set the benchmark for excellence in supply chain management, serving as a
trailblazer for the industry.
D. References

Daley, S. (2023). What Is Data Science? A Complete Guide. [online] Built In. Available at:
https://builtin.com/data-science.

KDnuggets (2016). Data Science and Big Data, Explained. [online] KDnuggets. Available at:
https://www.kdnuggets.com/2016/11/big-data-data-science-explained.html [Accessed 29 Jan. 2020].

Uthrakrishnan (2023). The Advantages of Data Science: Unleashing the Power of Information. [online]
Medium. Available at: https://medium.com/@uthrakrishnan3/the-advantages-of-data-science-unleashing-the-power-of-information-5157056e74c4 [Accessed 8 Mar. 2024].
