
Chapter 1- Introduction to Data Science

1.1 Introduction
1.1.1 Evolution of Data Science
1.1.2 Data Science Roles
1.1.3 Stages in a Data Science Project
1.1.4 Applications of Data Science in various fields
1.2 Tools and Techniques in Data Science - Introduction
- Python & R
1.3 Data Processing
1.3.1 Data Processing Overview
1.3.2 Data Collection & Data Cleaning
1.3.3 Data Integration and Transformation
1.3.4 Data Reduction
1.3.5 Data Discretization
1.4 Impact of Data Science

1.1 Introduction

Data Science is a multidisciplinary field that uses statistical and
computational methods to extract insights and knowledge from data. To analyze and
understand large data sets, it draws on techniques from computer science,
mathematics, and statistics.

Data mining, machine learning, and data visualization are just a few of the tools and
methods data scientists frequently employ to draw meaning from data. They work with
both structured and unstructured data, including databases, spreadsheets, text,
and images.

A number of sectors, including healthcare, finance, and marketing, use the
insights gained through data analysis to drive innovation, inform business
decisions, and solve challenging problems.

In short, data science is all about:


1. Collecting data from a range of sources, including databases, sensors,
websites, etc.
2. Organizing and processing data to remove errors and inconsistencies,
and making sure it is in a format that can be analyzed.
3. Finding patterns and correlations in the data using statistical and
machine learning approaches.
4. Developing visual representations of the data to aid in comprehension of
the conclusions and insights.
5. Creating mathematical models and computer programs that can classify
and forecast based on data.
6. Conveying clear and understandable facts and insights to others.
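The six steps above can be sketched end to end in a few lines of Python. This is a toy, standard-library-only example: the sales figures and the naive "predict the mean" model are purely illustrative, not a real workflow.

```python
import statistics

# 1-2) "Collect" and clean a small dataset: drop records with missing values.
raw = [("2024-01", 120.0), ("2024-02", None), ("2024-03", 135.5), ("2024-04", 150.0)]
clean = [(month, value) for month, value in raw if value is not None]

# 3) Find a simple pattern: a summary statistic over the cleaned values.
values = [v for _, v in clean]
mean_sales = statistics.mean(values)

# 5) A minimal "model": forecast the next value as the mean of past values.
prediction = mean_sales

# 6) Communicate the insight in plain language.
print(f"Average sales: {mean_sales:.1f}; naive forecast for next month: {prediction:.1f}")
```

In practice step 4 (visualization) would use a plotting library such as matplotlib, and step 5 would use a real statistical or machine learning model, but the shape of the pipeline stays the same.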

1.1.1 Evolution of Data Science

From the 1960s to the Present


1) In 1962, John Tukey wrote a paper titled The Future of Data Analysis that
described a shift in the world of statistics. Tukey was referring to the merging
of statistics and computers, at a time when computers were first being used to
solve mathematical problems and work with statistics rather than doing the work
by hand.
2) In 1974, Peter Naur authored the Concise Survey of Computer Methods,
using the term “Data Science” repeatedly. Naur presented his own
convoluted definition of the new concept:

“The usefulness of data and data processes derives from their application in
building and handling models of reality.”

3) In 1977, the International Association for Statistical Computing (IASC)
was formed. The first phrase of its mission statement reads, “It is the
mission of the IASC to link traditional statistical methodology, modern
computer technology, and the knowledge of domain experts in order to
convert data into information and knowledge.”
4) In 1977, Tukey wrote a second paper, titled Exploratory Data Analysis,
arguing the importance of using data in selecting “which” hypotheses to test,
and that confirmatory data analysis and exploratory data analysis should
work hand-in-hand.
5) In 1989, the Knowledge Discovery in Databases, which would mature into
the ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
organized its first workshop.
6) In 1994, Business Week ran the cover story Database Marketing,
revealing the ominous news that companies had started gathering large
amounts of personal information, with plans to launch strange new marketing
campaigns. The flood of data was, at best, confusing to many company
managers, who were trying to decide what to do with so much disconnected
information.
7) In 1999, Jacob Zahavi pointed out the need for new tools to handle the
massive, and continuously growing, amounts of data available to businesses,
in Mining Data for Nuggets of Knowledge. He wrote:

“Scalability is a huge issue in data mining… Conventional statistical
methods work well with small data sets. Today’s databases, however, can
involve millions of rows and scores of columns of data… Another technical
challenge is developing models that can do a better job analyzing data,
detecting non-linear relationships and interaction between elements…
Special data mining tools may have to be developed to address web-site
decisions.”

8) In 2001, Software-as-a-Service (SaaS) emerged, a precursor to today’s
cloud-based applications. That same year, William S. Cleveland laid out
plans for training data scientists to meet the needs of the future. He
presented an action plan titled Data Science: An Action Plan for Expanding
the Technical Areas of the Field of Statistics. It described how to increase
the technical experience and range of data analysts, specified six areas of
study for university departments, and promoted developing specific resources
for research in each of the six areas. His plan also applies to government
and corporate research.
9) In 2002, the International Council for Science: Committee on Data for
Science and Technology began publishing the Data Science Journal, a
publication focused on issues such as the description of data systems, their
publication on the internet, applications and legal issues. Articles for the
Data Science Journal are accepted by their editors and must follow specific
guidelines.
10) In 2006, Hadoop 0.1.0 was released, an open-source framework for the
distributed storage and processing of large data sets. Hadoop grew out of
Nutch, an open-source web search project. Two problems with processing big
data are storing huge amounts of data and then processing that stored data,
and traditional relational database management systems (RDBMS) were not
designed for large volumes of non-relational data. Hadoop addressed those
problems. Apache Hadoop is now an open-source software library that enables
large-scale analysis of big data.
11) In 2008, the title “data scientist” became a buzzword, and eventually a
part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and
Facebook, are credited with initiating its use as a buzzword. (In 2012, the
Harvard Business Review declared data scientist the sexiest job of
the twenty-first century.)
12) In 2009, the term NoSQL was reintroduced (a variation had been used
since 1998) by Johan Oskarsson, when he organized a discussion on
“open-source, non-relational databases”.
13) In 2011, job listings for data scientists increased by 15,000%. There
was also an increase in seminars and conferences devoted specifically to
Data Science and big data. Data Science had proven itself to be a source of
profits and had become a part of corporate culture.
14) In 2013, IBM shared statistics showing 90% of the data in the world had
been created within the last two years.
15) In 2015, using Deep Learning techniques, Google’s speech recognition,
Google Voice, experienced a dramatic performance jump of 49 percent.
16) In 2015, Bloomberg’s Jack Clark wrote that it had been a landmark
year for artificial intelligence (AI); within Google, the number of software
projects using AI had increased sharply.

Data Science Today


In the past 30 years, Data Science has quietly grown to include businesses and
organizations worldwide. It is now being used by governments, geneticists,
engineers, and even astronomers. During its evolution, Data Science’s use of big
data was not simply a “scaling up” of the data, but included shifting to new
systems for processing data and the ways data gets studied and analyzed.

Data Science has become an important part of business and academic
research. Technically, this includes machine translation, robotics, speech
recognition, the digital economy, and search engines. In terms of research areas,
Data Science has expanded to include the biological sciences, health care,
medical informatics, the humanities, and social sciences. Data Science now
influences economics, governments, and business and finance.

Automated machine learning (AutoML) is one of the new trends in data science.
AutoML streamlines and automates the process of applying machine
learning models. In this way, machine learning becomes more available to
non-experts and more efficient, leading to the democratization of data
science.

1.1.2 Data Science Roles

8 Key Data Science Roles

1) Data Scientist

A data scientist is inherently very curious, trying to understand certain
phenomena through the analysis and modeling of complex data.

Among all the team roles, the data scientist tends to be the strongest in statistics,
math, and machine learning. They should also have a strong foundation in
programming, typically in Python or R. Often they start their careers or studies in
math or a quantitative, research-oriented field such as economics or physics. The role
generally is not entry-level and might require an advanced degree and a few years
of experience.

Typical Responsibilities:

● Develop statistical models, machine learning algorithms, and predictive
analytics solutions to address business challenges.
● Analyze large amounts of complex data to extract insights and drive
decision-making.
● Design experiments to test hypotheses and measure the effectiveness of
solutions.
● Collaborate with data engineers and data analysts to collect and
preprocess data, and build and maintain data pipelines.
● Use data visualization tools to communicate insights and findings to
stakeholders.

Typical Qualifications:

● Bachelor’s or Master’s degree in math, stats, computer science, data
science, or a related quantitative field.
● Strong programming skills in Python or R.
● Strong SQL skills and understanding of databases.
● Strong experience with machine learning algorithms and libraries such as
scikit-learn, TensorFlow, or PyTorch.
● Familiarity with data visualization tools such as Tableau, Power BI, or
matplotlib.
● Strong analytical and problem-solving skills, with the ability to work with
complex and unstructured data.
● Strong communication skills and ability to work collaboratively with
cross-functional teams.

2) Data Engineer

The data engineer is responsible for the collection, storage, and processing of
data. They design, build, and maintain the infrastructure that enables the data
science team to work with large amounts of data. This includes databases, data
pipelines, and data warehousing solutions. The data engineer ensures that data is
available when and where it is needed, and that it is of high quality. Some
organizations have their data engineers sit on a separate team from the data
scientists.

Typical Responsibilities:

● Design, build, and maintain data pipelines to move and transform data
from various sources into a target location such as a data warehouse or
data lake.
● Develop and maintain the infrastructure required to support data science
initiatives, including data warehousing, ETL or ELT tools, and data
integration solutions.
● Ensure data quality, accuracy, and consistency across multiple data
sources.
● Work with data scientists, data analysts, and other stakeholders to
understand data requirements and provide support for data-driven
decision-making.
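The extract-transform-load pattern behind the first responsibility can be sketched in miniature. The table name, columns, and data here are hypothetical, and an in-memory SQLite database stands in for a real warehouse or data lake.

```python
import csv
import io
import sqlite3

# Extract: parse CSV from a source (a string stands in for a file or API response).
source = "id,amount\n1,10.5\n2,\n3,7.25\n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: drop rows with a missing amount and cast fields to proper types.
clean = [(int(r["id"]), float(r["amount"])) for r in rows if r["amount"]]

# Load: write the cleaned records into a target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total, = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
print(total)  # 17.75: the row with a missing amount was dropped in the transform step
```

Production pipelines add scheduling, retries, and data-quality checks (the job of tools like Apache Airflow mentioned below), but each task still follows this extract, transform, load shape.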
Typical Qualifications:

● Bachelor’s or Master’s degree in computer science, data science, or a
related field.
● Strong programming skills in one or more languages such as Python,
Java, or Scala.
● Strong experience with SQL, NoSQL, and data warehousing technologies
such as Redshift, Snowflake, or BigQuery.
● Experience with ETL tools such as Apache Airflow, AWS Glue, or Azure
Data Factory.
● Familiarity with distributed computing frameworks such as Hadoop,
Spark, or Flink.
● Knowledge of data modeling, data integration, and data quality concepts.
● Strong communication skills and ability to work collaboratively with
cross-functional teams.

3) Data Analyst

A data analyst is a professional who is responsible for collecting, processing, and
performing statistical analyses on large sets of data. They use various analytical
tools and techniques to extract meaningful insights from data and communicate
those insights to decision-makers.

This role is similar to a data scientist but data analysts tend to be more focused on
reporting on the current state as opposed to predictive analytics.

Typical Responsibilities:
● Collect and preprocess data from multiple sources to ensure data quality,
accuracy, and consistency.
● Analyze and interpret complex data to identify patterns and trends, and to
provide insights that support business decision-making.
● Develop dashboards and reports using data visualization tools to
communicate insights and findings to stakeholders.
● Collaborate with data scientists and data engineers to collect and
preprocess data, and build and maintain data pipelines.

Typical Qualifications:

● Bachelor’s or Master’s degree in computer science, business analytics, or
a related field.
● Strong proficiency in SQL and data visualization tools such as Tableau,
Power BI, or QlikView.
● Experience with statistical analysis and A/B testing methodologies.
● Familiarity with data modeling and data preprocessing techniques.
● Strong analytical and problem-solving skills, with the ability to work with
complex and unstructured data.
● Strong communication skills and ability to work collaboratively with
cross-functional teams.

4) Machine Learning Engineer

The machine learning engineer is responsible for building and deploying machine
learning models. They work closely with the data scientist to determine the best
algorithms and models to use, and they build and implement these models in a
production environment. The machine learning engineer is also responsible for
monitoring the performance of models and making updates and improvements as
necessary.

Relative to the data scientist, the machine learning engineer tends to be weaker in
math/stats but stronger in writing production code and maintaining production
systems.

Typical Responsibilities:

● Design, develop, and deploy scalable machine learning models and
systems that support business objectives.
● Collaborate with data scientists and data engineers to collect and
preprocess data, and build and maintain data pipelines.
● Develop and maintain data infrastructures that support machine learning
workflows, including data storage, feature engineering, and model
training.
● Design and implement distributed systems that support large-scale
machine learning.
● Develop and maintain machine learning workflows that are efficient,
reproducible, and scalable.
● Implement monitoring and evaluation systems that track model
performance and identify potential issues.
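The last responsibility, monitoring model performance, can be sketched as a simple threshold check on recent prediction error. The function name, baseline value, and tolerance factor are illustrative assumptions; real monitoring systems also track data drift, latency, and prediction distributions.

```python
import statistics

def check_drift(recent_errors, baseline_mae, tolerance=1.5):
    """Flag the model for review when recent mean absolute error
    exceeds the deployment-time baseline by a tolerance factor."""
    recent_mae = statistics.mean(recent_errors)
    return recent_mae > tolerance * baseline_mae

baseline = 0.8                       # MAE measured when the model was deployed
healthy = [0.7, 0.9, 0.8, 0.85]      # recent errors near the baseline
degraded = [1.4, 1.6, 1.5, 1.7]      # recent errors well above it

print(check_drift(healthy, baseline))    # False: within tolerance
print(check_drift(degraded, baseline))   # True: model flagged for retraining
```

A production version would read errors from a metrics store and raise an alert instead of printing, but the core decision rule is this comparison.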

Typical Qualifications:

● Bachelor’s or Master’s degree in computer science, engineering, or a
related field.
● Strong programming skills in one or more languages such as Python,
Java, or C++.
● Experience with machine learning frameworks such as TensorFlow,
PyTorch, or scikit-learn.
● Experience with distributed systems such as Apache Spark, Hadoop, or
Kafka.
● Familiarity with data storage and processing technologies such as SQL,
NoSQL, and Apache Beam.
● Strong analytical and problem-solving skills, with the ability to work with
complex and unstructured data.
● Strong communication skills and ability to work collaboratively with
cross-functional teams.

5) Product Owner

The product owner is responsible for setting the product vision, defining product
requirements, and prioritizing the product backlog. They work with stakeholders to
understand business requirements and ensure that the data science team is
delivering value to the organization. The best product owners are excellent
storytellers who can communicate a compelling vision.


Typical Responsibilities:

● Define and prioritize product requirements that support business
objectives, based on customer needs, data insights, and market trends.
● Work with cross-functional teams, including data scientists, data analysts,
data engineers, and software developers, to develop and deliver
data-driven products that meet customer needs.
● Communicate product requirements and progress to stakeholders,
including senior leadership, customers, and cross-functional teams.
● Develop and maintain product roadmaps that align with business
objectives and account for technical feasibility and resource constraints.
● Verify that solutions delivered serve their intended purpose and often train
stakeholders to understand and use the solutions.

Typical Qualifications:

● Bachelor’s or Master’s degree in a business or informatics field.
● Experience with Agile coordination frameworks, including Scrum,
Kanban, and Data Driven Scrum.
● Familiarity with data science concepts and methodologies, including
statistical analysis, machine learning, and data visualization.
● Excellent communication skills, with the ability to effectively
communicate technical concepts to both technical and non-technical
stakeholders.
● Strong experience in office productivity tools (such as Jira, Asana), flow
diagram tools, and prototyping tools (like Sketch or Figma).
● Strong domain knowledge (or the ability to quickly learn a new
business).
● Familiarity with data governance and regulatory compliance
requirements.
6) Process Expert

The process expert is responsible for ensuring that the team is working effectively
together and with broader stakeholders. They help team members understand and
adopt effective Agile principles and collaboration frameworks such as Scrum,
Kanban, or Data Driven Scrum. They facilitate communication, remove
impediments, and help the team to continuously improve its processes.

This role has different titles including agile coach, process expert (a Data Driven
Scrum role), or Scrum master (a Scrum role). Some organizations split a process
expert’s allocation across multiple teams.

Typical Responsibilities:

● Coach and mentor the team on agile principles, processes, and practices,
and help the team continuously improve.
● Facilitate agile ceremonies, including sprint planning, daily stand-ups,
sprint reviews, and retrospectives.
● Work with the product owner to ensure that the product backlog is
prioritized and refined, and that it aligns with business objectives.
● Facilitate communication and collaboration within the team and with
stakeholders, and remove impediments that prevent the team from
achieving its goals.
● Identify and escalate risks and issues that impact the team’s ability to
deliver on time and with quality.

Typical Qualifications:
● Bachelor’s or Master’s degree in business, computer science, engineering,
or a related field.
● Strong understanding of Agile principles and frameworks including
Scrum, Data Driven Scrum, and Kanban.
● Excellent communication and facilitation skills, with the ability to
communicate effectively with both technical and non-technical
stakeholders.
● Strong problem-solving skills, with the ability to identify and remove
impediments that prevent the team from achieving its goals.
● Strong leadership and coaching skills, with the ability to coach and
mentor the team on Agile practices and principles.
● Familiarity with data science concepts and methodologies, including
statistical analysis, machine learning, and data visualization.

7) Project Manager

Many organizations struggle to apply effective project management practices to
data science. To overcome these challenges, a project manager in data science can
drive project success by applying the right project approaches that cater to the
unique aspects of data science. The data science project manager will work closely
with cross-functional teams, including data scientists, analysts, engineers, product
managers, and stakeholders, to ensure successful project execution.

This role most closely resembles that of a process expert, and many teams have the
same person serve as both project manager and process expert. Other teams might
rely on a lead data scientist to serve as project manager for a specific project.



Typical Responsibilities

● Develop and implement data science project plans, ensuring that projects
are completed on time, within budget, and to quality standards.
● Coordinate and monitor day-to-day tasks and workflows of the project
team.
● Manage stakeholder requests and expectations; provide updates to project
sponsors.
● Scope and define tasks that fulfill the project vision; manage and
document scope using a project management ticketing system such as
Jira or Rally.
● Manage contracts with vendors and suppliers.
● Manage the sourcing of data sets required for upcoming and current
projects.

Typical Qualifications:

● Bachelor’s or Master’s degree in business, computer science, statistics,
mathematics, or a related field.
● Strong understanding of the data science project life cycle.
● Strong understanding of Agile approaches, including Scrum, Data Driven
Scrum, and Kanban.
● Excellent communication, interpersonal, and leadership skills, with the
ability to influence and motivate cross-functional teams.
● Strong problem-solving, analytical, and critical thinking skills, with the
ability to make data-driven decisions.
● Strong experience in office productivity tools (such as Jira, Asana), flow
diagram tools, and prototyping tools (like Sketch or Figma).
● Ability to manage budgets, scope, and schedules.

8) Team Manager

The team manager is responsible for overseeing the data science team, ensuring
that the team is meeting its goals, and managing individual team members’
performance. They lead recruitment, performance management, training, and often
administrative responsibilities such as vendor management. They work with
stakeholders to ensure that the data science team is delivering value to the
organization and aligning with the organization’s strategic goals.

The team manager typically supervises the data scientists, analysts, and engineers.
Sometimes the process expert and product owner also report to the team manager,
but often these roles report through separate org structures like a PMO or product
team.

Typical Responsibilities:

● Lead the data science team, providing guidance, direction, and mentorship
to team members.
● Collaborate with other teams and stakeholders to identify data science
opportunities that align with business objectives.
● Manage the team’s resources, including budget, personnel, and
equipment, and ensure that resources are used efficiently and effectively.
● Develop and maintain relationships with key stakeholders, including
business partners, customers, and vendors.
● Monitor and report on the team’s performance, including progress against
goals, budget, and project milestones.
Typical Qualifications:

● Bachelor’s or Master’s degree in computer science, engineering, statistics,
or a related field.
● Prior experience as a data scientist, product manager, or software
manager.
● Excellent project management skills, with the ability to develop and
implement project plans that meet business objectives.
● Strong leadership and communication skills, with the ability to motivate
and mentor team members and collaborate with stakeholders.
● Excellent problem-solving skills, with the ability to identify and mitigate
risks and issues that impact project delivery.
● Familiarity with data science tools and technologies, such as Python, R,
SQL, and Hadoop.

1.1.3 Stages in a Data Science Project
