Supervised Machine Learning

Supervised machine learning algorithms are commonly used for predictive analytics and require labeled input and output data to teach models relationships between variables for classification and regression problems; unsupervised learning autonomously discovers patterns without labels, using clustering to group similar data and association to find frequently occurring item combinations; the machine learning process involves data preparation, model training/testing, evaluation, and implementation.


Supervised machine learning algorithms are the most commonly used for predictive analytics. Supervised machine learning requires human interaction to label the data so that it is ready for accurate supervised learning. In supervised learning, the model is taught by example using input and output data sets processed by human experts, usually data scientists. The model learns the relationships between input and output data and then uses that information to formulate predictions based on new datasets. For example, a classification model can learn to identify plants after being trained on a dataset of properly labeled images with the plant species and other identifying characteristics.

Supervised machine learning methods commonly solve regression and classification problems:

 Regression problems involve estimating the mathematical relationship(s) between a continuous variable and one or more other variables. This mathematical relationship can then compute the values of one unknown variable given the known values of the others. Examples of problems that use regression include estimating a car's position and speed using GPS, predicting the trajectory of a tornado using weather data, or predicting the future value of a stock using historical and other data.

 Classification problems consist of a discrete unknown variable. Typically, the issue involves estimating which specific sample belongs to a set of pre-defined classes. Examples of classification are filtering email into spam or non-spam, diagnosing pathologies from medical tests, or identifying faces in a picture.
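
The regression idea above can be illustrated with a toy example. The sketch below is a minimal, pure-Python least-squares fit on made-up data; the numbers and variable names are assumptions for illustration, not from the course material.

```python
# Minimal sketch of supervised regression on toy labeled data (pure Python).
# The model "learns by example": it fits a line y = slope*x + intercept to
# known input/output pairs, then predicts the output for a new input.

def fit_line(xs, ys):
    """Ordinary least-squares fit for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labeled training data: known inputs and their known outputs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)            # 2.0 0.0 for this toy data
print(slope * 6 + intercept)       # predict the output for a new input: 12.0
```

A classification model works the same way in spirit, except the predicted value is a discrete class label rather than a continuous number.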

Unsupervised
Unsupervised machine learning algorithms do not require human experts but
autonomously discover patterns in data. Unsupervised learning mainly deals with
unlabeled data. The model must work on its own to find patterns and information.
Examples of problems solved with unsupervised methods are clustering and
association:

 Clustering methods - Clustering is the grouping of data that have similar characteristics. It helps segment data into groups and analyze each group to find patterns. For example, clustering algorithms identify groups of users based on their online purchasing history and then send each member targeted ads.
 Association methods - Association consists of discovering groups of items
frequently observed together. Online retailers use associations to suggest
additional purchases to a user based on the content of their shopping cart.
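
As a minimal sketch of the clustering idea, the following pure-Python one-dimensional k-means groups hypothetical customer spend values into two segments with no labels involved; the data values and starting centroids are assumptions for illustration.

```python
# Hypothetical sketch: 1-D k-means clustering of customer spend values into
# two groups. No labels are provided; the algorithm finds the groups itself.

def kmeans_1d(values, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its assigned values.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [12, 15, 14, 90, 95, 88]      # two natural groups: low and high spenders
centroids, clusters = kmeans_1d(spend, centroids=[0, 100])
print(sorted(clusters[0]), sorted(clusters[1]))  # [12, 14, 15] [88, 90, 95]
```

Each resulting segment could then be analyzed separately, for example to target each group with different ads.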

Reinforcement
Reinforcement learning teaches the machine through trial and error using feedback
from its actions and experiences, also known as learning from mistakes. It involves
assigning positive values to desired outcomes and negative values to undesired
effects. The result is optimal solutions; the system learns to avoid adverse outcomes
and seek the positive. Practical applications of reinforcement learning include
building artificial intelligence for playing video games, and in robotics and industrial automation.
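
The trial-and-error idea can be sketched with a tiny value table: actions that receive positive feedback gain value, and penalized actions lose it. The action names, rewards, and learning rate below are assumptions for illustration, not part of the course material.

```python
# Illustrative sketch of reinforcement learning: a one-state value table.
# Positive feedback pulls an action's estimated value up; negative feedback
# pulls it down, so the system learns to seek reward and avoid penalties.

q = {"safe": 0.0, "risky": 0.0}          # estimated value of each action
rewards = {"safe": 1.0, "risky": -1.0}   # environment feedback (assumed)
alpha = 0.5                              # learning rate

for _ in range(10):                      # repeated trials
    for action in q:
        # Move the estimate a step toward the observed feedback.
        q[action] += alpha * (rewards[action] - q[action])

best = max(q, key=q.get)
print(best, round(q["safe"], 3), round(q["risky"], 3))  # safe 0.999 -0.999
```

After enough trials the value estimates converge toward the true rewards, and the system prefers the action with the positive outcome.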

3.2.5 The Machine Learning Process


Developing a machine learning solution is seldom a linear process. Several trial-and-error steps are necessary to fine-tune the solution. The details of each step
performed by the Data Crunchers data scientists as they work on the new weed
identification and eradication model are as follows:

Step 1. Data preparation - Perform data cleaning procedures such as transformation into a structured format and removing missing data and noisy/corrupted observations.

Step 2a. Learning data - Create a learning data set used to train the model.

Step 2b. Testing data - Create a test dataset used to evaluate the model
performance. Only perform this step in the case of supervised learning.

Step 3. Learning Process Loop - Selection. An algorithm is chosen based on the problem. Depending on the selected algorithm, additional pre-processing steps might be necessary.

Step 4. Learning Process Loop - Evaluation. The selected algorithm's performance is evaluated on the learning data. If the algorithm and the model reach an acceptable performance on learning data, the solution is validated on the test data. Otherwise, repeat the learning process with a proposed new model and algorithm.

Step 5. Model evaluation - Test the solution on the test data. The performances on
learning data are not necessarily transferrable to test data. The more complex and
fine-tuned the model is, the higher the chances are that the model will become prone
to overfitting, which means it cannot perform accurately against unseen data.
Overfitting can result in going back to the model learning process.

Step 6. Model implementation - After the model achieves satisfactory performance on test data, implement the model. Implementing the model means performing the necessary tasks to scale the machine learning solution to big data.
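
The six steps above can be sketched end to end on toy labeled data. The "model" here is a deliberately trivial majority-class classifier so the example stays self-contained; the weed/crop labels merely echo the scenario, and a real project would substitute an actual learning algorithm.

```python
# The machine learning process steps, sketched end to end (pure Python).
from collections import Counter

# Step 1. Data preparation: drop corrupted observations (here, missing labels).
raw = [("weed", 1), ("crop", 0), ("weed", 1), (None, None),
       ("weed", 1), ("weed", 1)]
data = [(x, y) for x, y in raw if x is not None]

# Steps 2a/2b. Split into a learning (train) set and a testing set (~80/20).
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# Steps 3/4. Learning process loop: "train" the chosen model on learning data.
# Here the model just memorizes the most frequent label in the training set.
majority_label = Counter(y for _, y in train).most_common(1)[0][0]

# Step 5. Model evaluation: measure accuracy on the held-out test data.
correct = sum(1 for _, y in test if y == majority_label)
accuracy = correct / len(test)
print(majority_label, accuracy)  # 1 1.0
```

Step 6 (implementation) would then scale whichever model passes evaluation to the full big data pipeline.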

3.2.7 Training Machines to Recognize Patterns


Earlier, you learned that machine learning is a subset of artificial intelligence.
Artificial Intelligence is the concept that a system can learn from data, identify
patterns, and make decisions with little or no human intervention. Machine learning
has many valuable applications in the field of data analytics. One critical application
is pattern recognition.

Pattern recognition utilizes machine learning algorithms to identify patterns in digital data. These patterns are then applied to different datasets with the goal of recognizing the same or similar patterns in the new data. The data can be contained in many different formats, such as text, photographs, or videos. For example, if referencing classes of birds, a description of a bird would be a pattern. The types could be sparrows, robins, or finches, among others. Using computer vision, image processing technologies, and pattern recognition, we can extract specific patterns from images of birds in this example and compare them to pictures of birds stored in a database.

Pattern recognition uses the concept of learning to classify data based on statistical
information gained from patterns and their representations. Learning enables the
pattern recognition systems to be "trained" and adaptable to provide more accurate
results. When training the pattern recognition system, a portion of the dataset
prepares the system, and the remaining amount tests the system's accuracy. As
shown in the figure below, the data set divides into two groups: train the model and
test the model. The training data set is used to build the model and consists of about
80% of the data. It contains the set of images used to train the system. The testing
data set consists of about 20% of the data and measures the model's accuracy. For
example, if the system that identifies categories of birds can correctly identify seven
out of ten birds, then the system's accuracy is 70%.
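
The 80/20 split and the accuracy figure from this paragraph can be made concrete in a few lines; the dataset size is an assumption for illustration.

```python
# The 80/20 train/test split and the accuracy calculation from the text.
images = list(range(100))                  # a toy dataset of 100 labeled images
split = int(len(images) * 0.8)
train = images[:split]                     # ~80% used to train the system
test = images[split:]                      # ~20% held out to measure accuracy
print(len(train), len(test))               # 80 20

correct, total = 7, 10                     # e.g. 7 of 10 birds identified correctly
accuracy = correct / total
print(f"{accuracy:.0%}")                   # 70%
```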


Pattern recognition algorithms can be applied to different types of digital data, including images, texts, or videos, and can be used to fully automate and solve complicated analytical problems. The applications and use cases for pattern recognition are virtually unlimited. Some examples include:

 Mobile Security - Identifying fingerprints or facial recognition to gain access to a smartphone.
 Engineering - Speech recognition by digital assistant systems such as Alexa,
Google Assistant, and Siri.
 Geology - Detecting specific types of rocks and minerals and interpreting
temporal patterns in seismic array recordings.
 Biomedical - Using biometric patterns to identify tumor and cancer cells in
the body.

 Topic Objective: Explain the concept of AI.

 AI uses computer systems to perform tasks that formerly required human intelligence. AI completes processes more efficiently and effectively. AI impacts numerous aspects of our lives, such as marketing, blogging, healthcare, agriculture, retail experiences, and fitness.

Topic Objective: Explain how big data enables machine learning.

Machine learning is a subset of artificial intelligence based on the concept that a system can learn from data, identify patterns, and make decisions with little or no human intervention. Machine learning comprises both classifiers and algorithms. Classifiers categorize operations, while algorithms are the techniques that organize and orient classifiers.

Machine learning is divided into three primary learning model approaches: supervised, unsupervised, and reinforcement.

A machine learning solution includes:

 Step 1 - Data preparation
 Step 2a - Learning data
 Step 2b - Testing data
 Steps 3 and 4 - Learning process loop
 Step 5 - Model evaluation
 Step 6 - Model implementation

3.3.3 Quiz - Big Data, AI and ML


Check your understanding of big data, AI, and ML by choosing the correct answer to the following questions.

Question 1
Multiple choice question

When several items are grouped, which type of machine learning algorithm can
determine which items in the group predict the presence of other items?

Association

Classification

Clustering

Regression


Question 2
Multiple choice question

Which machine learning algorithm uses data sets verified by experts as its
learning basis?

Association

Clustering

Routing

Supervised


Question 3
Multiple choice question
Which method describes how a machine learns using the reinforcement machine
learning model?

Autonomously discovering patterns in data.

Human interaction to label data for accuracy.

Discovering groups of items frequently observed together.

Trial and error using feedback from actions and experiences.


Question 4
Multiple choice question

Which step in the machine learning process transforms data into a structured
format by removing missing data and corrupted observations?

Testing data

Learning data

Preparing data

Model evaluation


Question 5
Multiple choice question

In training a pattern recognition system, which data set measures the accuracy
achieved by the model?

Model data set

Testing data set

Sample data set

Training data set

Data Scientist
Data scientists apply statistics, machine learning, and analytic approaches to answer
critical business questions. Data scientists interpret and deliver the results of their
findings by using visualization techniques, building data science apps, or narrating
exciting stories about the solutions to their data (business) problems. They work with
data sets of different sizes and run algorithms on large data sets. Data scientists
must be current with the latest automation and machine learning technologies. The
requirements to perform these roles include statistical and analytical skills,
programming knowledge (Python, R, Java), and familiarity with Hadoop, a collection
of open-source software utilities that facilitates working with massive amounts of
data. Data scientists are data wranglers who organize and deliver value from data.

Data Engineer

Data engineers are responsible for building and operationalizing data pipelines to
collect and organize data. They ensure the accessibility and availability of quality
data for data scientists and data analysts by integrating data from disparate sources
and performing data cleaning and transformation. Skills needed for data engineering
roles include understanding the architecture, tools, and methods of data ingestion,
transformation and storage; and proficiency with multiple programming languages
(including Python and Scala). In summary, data engineers build and operate the data
infrastructure needed to prepare data for further analysis by data analysts and
scientists.

Data Analyst

Data analysts query, process, provide reports, and summarize and visualize data.
They leverage existing tools and methods to solve a problem. They help people,
such as business analysts, to understand specific queries with ad-hoc reports and
charts. Data analysts must understand basic statistical principles, cleaning different
data types, visualization, and exploratory data analysis. In short, data analysts
analyze data to help businesses and other organizations make informed decisions.

Practice Item - The Importance of Project Portfolios

Posting or sharing a portfolio showcasing your data analytics skills and abilities is an essential step in furthering your career as a data professional. Research platforms for hosting and sharing your portfolio. Describe the platform that appealed to you most. What seemed most useful about it?


Your answers will vary but may include the following platforms.

Social Media Platforms - Posting on social media (like Twitter, Quora, Reddit, and LinkedIn) can build your legitimacy as a data professional and is a good way to gain more visibility for your projects.
DataCamp Workspace - A collaborative cloud-based notebook where you can
instantly analyze data, collaborate with others, and publish an analysis. When you
create projects, you can share the link to your DataCamp profile so others can have
access.

GitHub - A website and cloud service that allows developers to store, manage, and
monitor their code repositories. It enables users to collaborate on or publish open-source projects.

Kaggle - An online community platform for data enthusiasts to collaborate, find and
publish datasets, publish notebooks, and compete with others to solve data science
challenges. To showcase your work, create a notebook or Kernel that helps others
discover and understand your project.

Sites for Building and Hosting a Personal Website or Blog - Personal websites or blogs are another way to have your projects all in one place and share them inexpensively. These sites allow more control and customization of your content than DataCamp Workspace and Kaggle. WordPress and Wix are good options for building and hosting a blog or website.

Topic Objective: Differentiate between types of data analytics roles.

Data professionals fill three primary roles in organizations: Data Analyst, Data
Engineer, and Data Scientist. A few tools and skills commonly mentioned in job
advertisements for entry-level positions include:

 Demonstrated attention to detail

 Solid verbal and written communication skills

 Ability to work in a team

 Proficiency with spreadsheets, such as Excel

 Familiarity with SQL and databases

 Some experience with object-oriented programming (OOP) languages such as Python and Java

 Familiarity with visualization tools and presentations

Topic Objective: Explain the next steps necessary to create a portfolio showcasing
data analytic skills.

At a basic level, a portfolio is a website with demonstrated examples of your work. Portfolios are a professional way to display your interests and abilities to prospective employers. As you begin to create projects and artifacts for your portfolio, document the process that you used to work through each project, including:

 Questions to answer or problems to solve
 How you chose data that could inform your analysis
 Methods used to analyze the data
 Personal observations and conclusions
 Reports and presentations created

4.3.3 Quiz - Embarking on Your Career in Data Analytics

Check your understanding of embarking on your career in data analytics by choosing the correct answer to the following questions.

Question 1
Multiple choice question

What is the role of a data analyst?

To query and process data, provide reports, summarize and visualize data.

To build and operationalize data pipelines for collecting and organizing data.

To apply statistics, machine learning, and analytic approaches to answer critical business questions.

To enter and validate data, to improve the reliability of the data being collected.


Question 2
Multiple choice question

What is the role of a data scientist?

To query and process data, provide reports, summarize and visualize data.

To build and operationalize data pipelines for collecting and organizing data.

To apply statistics, machine learning, visualization techniques and analytic approaches to answer critical business questions.

To extract clean data from unstructured data.



Question 3
Multiple choice question

What tool is commonly used to showcase data analytic skills to a prospective employer?

Resume

Word document

Portfolio

Excel spreadsheet


Question 4
Multiple choice question

What is the role of a data engineer?

To query and process data, provide reports, summarize and visualize data.

To build and operationalize data pipelines for collecting and organizing data.

To apply statistics, machine learning, and analytic approaches to answer critical business questions using visualization techniques.

To enter and validate data, to improve the reliability of the data being collected.


Question 5
Multiple choice question

Which three are skills that are typical for an entry-level data analyst position? (Choose three.)

Fast typing skills.

Ability to work alone.

Proficiency with spreadsheets.

Familiarity with visualization tools.


Proficiency with multiple programming languages.

Strong verbal and written communication skills.

Question 1
Multiple choice question

Which characteristic describes Boolean data?

A special text data type to be used for Fill-in-blank questions.

A data type that identifies either a true (T) state or a false (F) state.

A data type that identifies either a zero (0) or a one (1).

A text data type to store confidential information such as social security numbers.

The Boolean data type represents either a logical True (T) or False (F) state. It can
be used to test the state of a variable or an expression in computer programming.

Question 2
Matching. Select from lists and then submit.

Refer to the exhibit. Match the column with the data type that it contains.

Shipped
Boolean
Revenue
Floating point
Quantity
Integer
Order number
String
Product category
String

Place the options in the following order:

Shipped - Boolean
Revenue - Floating point
Quantity - Integer
Order number - String
Product category - String

Question 3
Multiple choice question

A sales manager in a large automobile dealership wants to determine the top four
best selling models based on sales data over the past two years. Which two charts
are suitable for the purpose? (Choose two.)

Column chart

Scatter chart

Line chart

Bar chart

Pie chart

Column and bar charts are the type of charts to be used when the purpose is to
display the value of a specific data point and compare that value across similar
categories. Column charts are positioned vertically, and bar charts are similar to
column charts with the exception that they are positioned horizontally.

Question 4
Multiple choice question

What are three types of structured data? (Choose three.)

E-commerce user accounts

Spreadsheet data

Blogs

Newspaper articles

White papers

Data in relational databases

Structured data is entered and maintained in fixed fields within a file or record, such
as data found in relational databases and spreadsheets. Structured data entry
requires a certain format to minimize errors and make it easier for computer
interpretation.

Question 5
Multiple choice question

An online e-commerce shopping site offers money-saving promotions for different products every hour. A data analyst would like to review the sales of products two days prior in the afternoon. Which data type would they search for in their query of the sales data?

Integer

Date and time

String

Floating point

The date and time type is important in recording when a piece of data is generated.

Question 6
Multiple choice question

What is the most cost-effective way for businesses to store their big data?

Onsite storage arrays

On-premises

Cloud storage

Onsite local servers

Cloud storage is the most cost-effective way to store big data. Cloud storage enables
big data storage on servers maintained by a third-party service provider on their
network infrastructure. The cloud service provider purchases, installs, and maintains
all hardware, software, and supporting infrastructure in its data centers. When using cloud services, an organization avoids the enormous costs of building and supporting the infrastructure necessary to store the vast amounts of data.

Question 7
Multiple choice question

Changing the format, structure, or value of data takes place in which phase of the
data pipeline?
Storage

Analysis

Ingestion

Transformation

Data transformation involves the process of changing the format, structure, or values
of data so that it is clean and better organized, making it easier for both humans and
computers to use.
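
As an illustration of this phase, the hypothetical sketch below reshapes raw records into a clean, consistent structure: price strings become numeric values and city names get a uniform format. The field names and values are invented for the example.

```python
# Hypothetical sketch of the transformation phase of a data pipeline:
# change the format, structure, and values of raw records so the data
# is clean and better organized for analysis.
raw = [
    {"price": "$1,200", "city": " austin "},
    {"price": "$950", "city": "DALLAS"},
]

clean = [
    {
        "price": float(r["price"].replace("$", "").replace(",", "")),  # value change
        "city": r["city"].strip().title(),                             # format change
    }
    for r in raw
]
print(clean)  # [{'price': 1200.0, 'city': 'Austin'}, {'price': 950.0, 'city': 'Dallas'}]
```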

Question 8
Multiple choice question

What is unstructured data?

Data that does not fit into the rows and columns of traditional relational data storage
systems.

A large CSV file.

Data that fits into the rows and columns of traditional relational data storage
systems.

Geolocation data.

Unstructured data is data that does not fit into the rows and columns of traditional relational data storage systems. This unstructured data is vast and makes up the largest segment of big data.

Question 9
This question component requires you to select the matching option. When you have selected your answers, select the submit button.

Match the respective big data term to its description.

Velocity
Veracity
Variety
Volume
Describes the amount of data being transported and stored.
Is the process of preventing inaccurate data from spoiling data sets.
Describes the rate at which data is generated.
Describes a type of data that is not ready for processing and analysis.
Place the options in the following order:

Volume - Describes the amount of data being transported and stored
Veracity - Is the process of preventing inaccurate data from spoiling data sets
Velocity - Describes the rate at which data is generated
Variety - Describes a type of data that is not ready for processing and analysis

Question 10
Multiple choice question

What is a major challenge for storage of big data with on-premises legacy data
warehouse architectures?

They cannot process structured data.

Additional servers cannot be easily added to the network architecture.

They cannot process the volume of big data.

They cannot process unstructured data.

The volume of big data and its variety requires storage, management, and retrieval
of virtually limitless volumes of unstructured data, something that is a challenge for
on-premises legacy data warehouse architectures.

Question 11
Multiple choice question

Which step in a typical machine learning process involves testing the solution on the
test data?

Learning process loop

Model evaluation

Data preparation

Learning data

A typical machine learning process would involve several steps including:

 Step 1. Data preparation - Perform data cleaning procedures such as transformation into a structured format and removing missing data and noisy/corrupted observations.
 Step 2. Learning data - Create a learning data set that will be used to train
the model.
 Step 3. Testing data - Create a test set that will be used to evaluate the
model performance. Note that this step is only performed in the case of
supervised learning.
 Step 4. Learning Process loop - An algorithm is chosen based on the
problem required, and its performance is evaluated on the learning data.
 Step 5. Model evaluation - Test the solution on the test data. The performances on learning data are not necessarily transferable to test data. The more complex and fine-tuned the model is, the higher the chances are that the model will become prone to over-fitting, which means it cannot perform accurately against unseen data. Over-fitting can result in going back to the model learning process.
 Step 6. Model implementation - Once the model achieves satisfactory
performance on test data, the model can be implemented. This means
performing the necessary tasks to scale the machine learning solution to Big
Data.

Question 12
Multiple choice question

Which type of machine learning algorithm would be used to train a system to detect
spam in email messages?

Clustering

Classification

Regression

Association

The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.

Question 13
Multiple choice question

Which type of learning algorithm can predict the value of a variable, such as a loan interest rate, based on the value of other variables?

Regression

Classification

Clustering
Association

An example of how a regression algorithm might be used is to predict the cost of a house by looking at variables such as crime rate, average income level in the neighborhood, and how far the house is from a school.

Question 14
Multiple choice question

What are two types of supervised machine learning algorithms? (Choose two.)

Regression

Mode

Association

Mean

Clustering

Classification

Two algorithms used with supervised machine learning are classification and
regression. Supervised machine learning algorithms are the most common
algorithms used in big data analytics.

Question 15
Multiple choice question

What are two applications that would gain artificial intelligence by using the reinforcement learning model? (Choose two.)

Playing video games.

Robotics and industrial automation.

Filtering email into spam or non-spam.

Predicting the trajectory of a tornado using weather data.

Identifying faces in a picture.

The reinforcement learning model teaches the machine through trial and error using feedback from its actions and experiences. It involves assigning positive values to desired outcomes and negative values to undesired effects. Practical applications of reinforcement learning include building artificial intelligence for playing video games and in robotics and industrial automation.

Question 16
This question component requires you to select the matching option. When you have selected your answers, select the submit button.

Match the job title with the matching job description.

Data Scientist
Data Analyst
Data Engineer
Leverage existing tools and problem-solving methods to query and process data,
provide reports, summarize and visualize data.
Build and operationalize data pipelines for collecting and organizing data while
ensuring the accessibility and availability of quality data.
Apply statistics, machine learning, and analytic approaches in order to interpret and
deliver visualized results to critical business questions.

Place the options in the following order:

Data Analyst - Leverage existing tools and problem-solving methods to query and process data, provide reports, summarize and visualize data.
Data Engineer - Build and operationalize data pipelines for collecting and organizing data while ensuring the accessibility and availability of quality data.
Data Scientist - Apply statistics, machine learning, and analytic approaches in order to interpret and deliver visualized results to critical business questions.

Question 17
This question component requires you to select the matching option. When you have selected your answers, select the submit button.

Match the data professional role with the skill sets required.

Data analyst
Data scientist
Data engineer
Ability to understand basic statistical principles, cleaning different types of data, data
visualization, and exploratory data analysis.
Ability to understand the architecture and distribution of data acquisition and storage,
multiple programming languages (including Python and Java), and knowledge of
SQL database design including an understanding of creating and monitoring
machine learning models.
Ability to use statistical and analytical skills, programming knowledge (Python, R,
Java), and familiarity with Hadoop; a collection of open-source software utilities that
facilitates working with massive amounts of data.

Place the options in the following order:

Data Analyst - Ability to understand basic statistical principles, cleaning different types of data, data visualization, and exploratory data analysis.
Data Engineer - Ability to understand the architecture and distribution of data acquisition and storage, multiple programming languages (including Python and Java), and knowledge of SQL database design including an understanding of creating and monitoring machine learning models.
Data Scientist - Ability to use statistical and analytical skills, programming knowledge (Python, R, Java), and familiarity with Hadoop; a collection of open-source software utilities that facilitates working with massive amounts of data.

Question 18
Multiple choice question

What are two job roles normally attributed to a data analyst? (Choose two.)

Turning raw data into information and insight, which can be used to make business
decisions.

Reviewing company databases and external sources to make inferences about data
figures and complete statistical calculations.

Using programming skills to develop, customize and manage integration tools, databases, warehouses, and analytical systems.

Building systems that collect, manage, and convert raw data into usable information.

Working in teams to mine big data for information that can be used to predict
customer behavior and identify new revenue opportunities.


Question 19
Multiple choice question

Which skill set is important for someone seeking to become a data scientist?

The ability to ensure that the database remains stable, maintain backups of the database, and execute database updates and modifications.

An understanding of architecture and distribution of data acquisition and storage, multiple programming languages (including Python and Java), and knowledge of SQL database design.

An understanding of basic statistical principles, cleaning different types of data, data visualization, and exploratory data analysis.

A thorough knowledge of machine learning technologies and programming languages, strong statistical and analytical skills, and familiarity with software utilities that facilitate working with massive amounts of data.

Data scientists apply statistics, machine learning, and analytic approaches to answer
critical business questions. They need thorough knowledge of the latest automation
and machine learning technologies, analytical skills, deep programming knowledge,
and familiarity with Hadoop.

Question 20
Multiple choice question

A data analyst is building a portfolio for future prospective employers and wishes to
include a previously completed project. What three process documentations would
be included in building that portfolio? (Choose three.)

A list of data analytic tools that did not work in the described manner for the project.

The data-based research questions and research problem addressed in the respective project.

The failures of the unstructured data set selected.

The methods used to analyze the data.

The basis of choosing the respective data set.

The hours spent on study and hours worked on projects.

The project documentation for the portfolio should include:

 the data-based research question and research problem addressed
 how the data set was chosen for analysis
 the methods used to analyze the data
 the observations and conclusions
 reports and presentations that were created
