1) Data-sci Chapter-1
1.1 Introduction
1.1.1 Evolution of Data Science
1.1.2 Data Science Roles
1.1.3 Stages in a Data Science Project
1.1.4 Applications of Data Science in various fields
1.2 Tools and Techniques in Data Science - Introduction
- Python & R
1.2 Data Processing
1.2.1 Data Processing Overview
1.2.2 Data Collection & Data Cleaning
1.2.3 Data Integration and Transformation
1.2.4 Data Reduction
1.2.5 Data Discretization
1.3 Impact of Data Science
1.1 Introduction
Data Science is a multidisciplinary field that involves the use of statistical and
computational methods to extract insights and knowledge from data. To analyze and
comprehend large data sets, it uses techniques from computer science, mathematics,
and statistics.
Data mining, machine learning, and data visualization are just a few of the tools and
methods we frequently employ to draw meaning from data. They may deal with both
structured and unstructured data, including text and pictures, databases, and
spreadsheets.
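As a rough illustration of the kind of analysis described above, the following is a
minimal sketch in Python (one of the two languages named in this chapter's outline).
The file sales.csv and its columns are hypothetical, not taken from the chapter.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Load structured data (a spreadsheet-like table); file and columns are hypothetical.
df = pd.read_csv("sales.csv")
print(df.describe())  # basic statistical summary of each numeric column

# A simple predictive model: estimate how advertising spend relates to revenue.
model = LinearRegression()
model.fit(df[["ad_spend"]], df["revenue"])
print("Estimated effect of ad spend on revenue:", model.coef_[0])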
Sectors such as healthcare, finance, and marketing use the insights and expertise
gained through data analysis to drive innovation, inform business decisions, and
address challenging problems.
“The usefulness of data and data processes derives from their application in
building and handling models of reality.”
In 2001, William S. Cleveland laid out plans for training data scientists to
meet the needs of the future. He presented an action plan titled Data
Science: An Action Plan for Expanding the Technical Areas of the Field of
Statistics. It described how to increase the technical experience and range
of data analysts and specified six areas of study for university
departments. It promoted developing specific resources for research in each
of the six areas. His plan also applies to government and corporate
research. Also in 2001, Software-as-a-Service (SaaS) emerged; it was the
precursor to cloud-based applications.
9) In 2002, the International Council for Science: Committee on Data for
Science and Technology began publishing the Data Science Journal, a
publication focused on issues such as the description of data systems, their
publication on the internet, applications and legal issues. Articles for the
Data Science Journal are accepted by their editors and must follow specific
guidelines.
10) In 2006, Hadoop 0.1.0, an open-source framework for the distributed
storage and processing of large data sets, was released. Hadoop grew out of
Nutch, an open-source web search project. Two problems with processing big
data are storing huge amounts of data and then processing that stored data.
(Relational database management systems (RDBMS) were not designed to handle
non-relational data at that scale.) Hadoop addressed those problems. Apache
Hadoop is now an open-source software library that enables the distributed
processing of big data.
11) In 2008, the title "data scientist" became a buzzword, and eventually a
part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and
Facebook, are given credit for initiating its use as a buzzword. (In 2012,
Harvard Business Review declared data scientist the sexiest job of the
twenty-first century.)
12) In 2009, the term NoSQL was reintroduced (a variation had been used
since 1998) by Johan Oskarsson, when he organized a discussion on
“open-source, non-relational databases”.
13) In 2011, job listings for data scientists increased by 15,000%. There
was also an increase in seminars and conferences devoted specifically to
Data Science and big data. Data Science had proven itself to be a source of
profits and had become a part of corporate culture.
14) In 2013, IBM shared statistics showing 90% of the data in the world had
been created within the last two years.
15) In 2015, using Deep Learning techniques, Google’s speech recognition,
Google Voice, experienced a dramatic performance jump of 49 percent.
16) In 2015, Bloomberg's Jack Clark wrote that it had been a landmark
year for artificial intelligence (AI); within Google, the number of software
projects using AI increased sharply.
1.1.2 Data Science Roles
1) Data Scientist
Among all the team roles, the data scientist tends to be the strongest in statistics,
math, and machine learning. They should also have a strong foundation in
programming – typically in Python or R. Often they start their careers or studies in
math or a quantitative, research-oriented field such as economics or physics. The role
generally is not entry-level and might require an advanced degree and a few years
of experience.
Typical Responsibilities:
Typical Qualifications:
2) Data Engineer
The data engineer is responsible for the collection, storage, and processing of
data. They design, build, and maintain the infrastructure that enables the data
science team to work with large amounts of data. This includes databases, data
pipelines, and data warehousing solutions. The data engineer ensures that data is
available when and where it is needed, and that it is of high quality. Some
organizations have their data engineers sit in a separate team from the data
scientists.
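As a rough illustration of the pipelines and warehousing solutions mentioned above,
the following is a minimal extract-transform-load (ETL) sketch in Python. The source
file orders.csv, its columns, and the local SQLite file standing in for a data
warehouse are all hypothetical.

import sqlite3
import pandas as pd

# Extract: read raw records from a source system (hypothetical file and columns).
raw = pd.read_csv("orders.csv")

# Transform: enforce types, drop bad rows, and remove duplicate orders.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date"]).drop_duplicates(subset=["order_id"])

# Load: write the cleaned table into the target location
# (SQLite stands in here for a data warehouse or data lake).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)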
Typical Responsibilities:
● Design, build, and maintain data pipelines to move and transform data
from various sources into a target location such as a data warehouse or
data lake.
● Develop and maintain the infrastructure required to support data science
initiatives, including data warehousing, ETL or ELT tools, and data
integration solutions.
● Ensure data quality, accuracy, and consistency across multiple data
sources.
● Work with data scientists, data analysts, and other stakeholders to
understand data requirements and provide support for data-driven
decision-making.
Typical Qualifications:
3) Data Analyst
This role is similar to that of a data scientist, but data analysts tend to be more
focused on reporting on the current state as opposed to predictive analytics.
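A minimal sketch of what reporting on the current state can look like in Python; it
assumes the hypothetical orders table written by the pipeline sketch above and
produces a simple chart that could feed a dashboard or report.

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Read the (hypothetical) cleaned orders table produced by the ETL sketch above.
with sqlite3.connect("warehouse.db") as conn:
    orders = pd.read_sql("SELECT order_date, revenue FROM orders", conn,
                         parse_dates=["order_date"])

# Aggregate revenue by calendar month and plot it for a report or dashboard.
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["revenue"].sum()
monthly.plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")  # image that could be embedded in a report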
Typical Responsibilities:
● Collect and preprocess data from multiple sources to ensure data quality,
accuracy, and consistency.
● Analyze and interpret complex data to identify patterns and trends, and to
provide insights that support business decision-making.
● Develop dashboards and reports using data visualization tools to
communicate insights and findings to stakeholders.
● Collaborate with data scientists and data engineers to collect and
preprocess data, and build and maintain data pipelines.
Typical Qualifications:
4) Machine Learning Engineer
The machine learning engineer is responsible for building and deploying machine
learning models. They work closely with the data scientist to determine the best
algorithms and models to use, and they build and implement these models in a
production environment. The machine learning engineer is also responsible for
monitoring the performance of models and making updates and improvements as
necessary.
Relative to the data scientist, the machine learning engineer tends to be weaker in
math/stats but stronger in writing production code and maintaining production
systems.
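A minimal sketch of the train-and-hand-off workflow described above, using
scikit-learn and joblib; the churn.csv file, its columns, and the choice of model are
hypothetical, not prescribed by the chapter.

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical training data: customer features and a churned/not-churned label.
df = pd.read_csv("churn.csv")
X, y = df[["tenure", "monthly_charges"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train and evaluate a model on a held-out split.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Serialize the trained model so a production service can load and serve it;
# the same artifact can be monitored and replaced when the model is retrained.
joblib.dump(model, "churn_model.joblib")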
Typical Responsibilities:
Typical Qualifications:
5) Product Owner
The product owner is responsible for setting the product vision, defining product
requirements, and prioritizing the product backlog. They work with stakeholders to
understand business requirements and ensure that the data science team is
delivering value to the organization. The best product owners are excellent
story-tellers who can communicate their compelling vision.
Typical Responsibilities:
Typical Qualifications:
6) Process Expert
The process expert is responsible for ensuring that the team is working effectively
together and with broader stakeholders. They help team members understand and
adopt effective Agile principles and collaboration frameworks such as Scrum,
Kanban, or Data Driven Scrum. They facilitate communication, remove
impediments, and help the team to continuously improve its processes.
This role has different titles including agile coach, process expert (a Data Driven
Scrum role), or Scrum master (a Scrum role). Some organizations split a process
expert’s allocation across multiple teams.
Typical Responsibilities:
● Coach and mentor the team on agile principles, processes, and practices,
and help the team continuously improve.
● Facilitate agile ceremonies, including sprint planning, daily stand-ups,
sprint reviews, and retrospectives.
● Work with the product owner to ensure that the product backlog is
prioritized and refined, and that it aligns with business objectives.
● Facilitate communication and collaboration within the team and with
stakeholders, and remove impediments that prevent the team from
achieving its goals.
● Identify and escalate risks and issues that impact the team’s ability to
deliver on time and with quality.
Typical Qualifications:
● Bachelor’s or Master’s degree in business, computer science, engineering,
or a related field.
● Strong understanding of Agile principles and frameworks including
Scrum, Data Driven Scrum, and Kanban.
● Excellent communication and facilitation skills, with the ability to
communicate effectively with both technical and non-technical
stakeholders.
● Strong problem-solving skills, with the ability to identify and remove
impediments that prevent the team from achieving its goals.
● Strong leadership and coaching skills, with the ability to coach and
mentor the team on Agile practices and principles.
● Familiarity with data science concepts and methodologies, including
statistical analysis, machine learning, and data visualization.
7) Project Manager
This role most closely resembles that of a process expert, and many teams have the
same person serve as both project manager and process expert. Other teams might
rely on a lead data scientist to serve as a project manager for a specific project.
Typical Responsibilities:
● Develop and implement data science project plans, ensuring that projects
are completed on time, within budget, and to quality standards.
● Coordinate and monitor day-to-day tasks and workflows of the project
team.
● Manage stakeholder requests and expectations; provide updates to project
sponsors.
● Scope and define tasks that fulfill the project vision; manage and
document scope using a project management ticketing system such as
Jira (from Atlassian) or Rally.
● Manage contracts with vendors and suppliers.
● Manage the sourcing of data sets required for upcoming and current
projects.
Typical Qualifications:
8) Team Manager
The team manager is responsible for overseeing the data science team, ensuring
that the team is meeting its goals, and managing individual team members'
performance. They lead recruitment, performance management, training, and often
administrative responsibilities such as vendor management. They work with
stakeholders to ensure that the data science team is delivering value to the
organization and aligning with the organization’s strategic goals.
The team manager typically supervises the data scientists, analysts, and engineers.
Sometimes the process expert and product owner also report to the team manager
but often these roles report through separate org structures like a PMO or product
team.
Typical Responsibilities:
● Lead the data science team, providing guidance, direction, and mentorship
to team members.
● Collaborate with other teams and stakeholders to identify data science
opportunities that align with business objectives.
● Manage the team’s resources, including budget, personnel, and
equipment, and ensure that resources are used efficiently and effectively.
● Develop and maintain relationships with key stakeholders, including
business partners, customers, and vendors.
● Monitor and report on the team’s performance, including progress against
goals, budget, and project milestones.
Typical Qualifications: