Course Title: Fundamentals of Data Science Course-Code: 24BTELY107 Semester: 1 A.Y: 2024-25 Module-1-PPT
Dr. Prof. Rajasekharaiah K.M.
BE-CSE, M.Tech-CSE, PhD-CSE, M.Phil-CS, PGDIT, LM-ISTE (New Delhi)
Professor, CSE Dept., School of Computer Science & Engineering (SCSE)

Fundamentals of Data Science, Module-1 (8 hrs)
Introduction To Data Science: Definition—Big Data and Data Science Hype—Datafication—Data Science Profile—Meta Data—Definition—Data Scientist—Statistical Inference—Populations and Samples—Populations and Samples of Big Data—Modeling—Data Warehouse—Philosophy of Exploratory Data Analysis—The Data Science Process—A Data Scientist's Role in this Process—Case Study: Real Direct (Housing Market Analysis)

Introduction To Data Science:
Today's world is a Big Data world: organizations store gigabytes to petabytes of data, and so the era of Big Data emerged. Storing data using scientific methods has become inevitable, and the Big Data concept arose from this need. The problem of large-scale data storage has been solved using the HADOOP framework (in wide use after 2010).

• Over the past few years, there has been a lot of hype in the media about "data science" and "Big Data." Today, data rules the world. This has resulted in a huge demand for Data Scientists.
• A Data Scientist helps companies make data-driven decisions to improve their business. Data science is a field that deals with structured, semi-structured, and unstructured data. It involves practices like data cleansing, data preparation, data analysis, and much more.
• Data science combines knowledge of statistics, mathematics, and programming for problem-solving and for capturing data in ingenious ways. This umbrella term (DS) covers the various techniques used to extract insights and information from data.

What is Data Science? (10)
1. Data science is a multidisciplinary field requiring knowledge of:
• Mathematics
• Statistics
• Computer Science
• Domain knowledge
All of the above are needed to extract insights from data (databases, Hadoop, Big Data), combined with domain knowledge and practical skills.
2. Data Science blends (in hybrid mode) various tools, algorithms, and machine learning principles.
3. Simply put, it involves obtaining meaningful information or insights from structured or unstructured data through a process of analysis, programming, and business skills.
4. Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.
5. Data Science is about data gathering, analysis, extraction, and decision-making: finding patterns in data through analysis, and making future predictions. By using Data Science, companies are able to make:
• Better decisions (should we choose A or B?)
• Predictive analyses (what will happen next?)
• Pattern discoveries (find patterns, or hidden information, in the data)
6. DS is a field containing many elements, like mathematics, statistics, computer science, etc. Those who are good at these respective fields, with enough knowledge of the domain in which they are willing to work, can call themselves Data Scientists.
7. It is not an easy thing to do, but it is not impossible either. You need to start from the data, then move through its visualization, programming, formulation, development, and deployment of your model.
8. In the future there will be great demand for data scientist jobs. With that in mind, be ready to prepare yourself to fit into this world.
9. Data Science is a deep study of large amounts of data (Big Data), which involves extracting meaningful insights from raw, structured, and unstructured data processed using the scientific method and different tools, technologies, and algorithms.
10. Data science uses the most powerful hardware, programming (Python or R), and the most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.

Definition
Best definition: Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. (Source: AWS, Amazon Web Services, Amazon's cloud-services platform.)

Need for Data Science / Where is Data Science Needed?
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing, and also in education, politics, etc. Examples of where Data Science is needed:
• Route planning: to discover the best routes to ship goods
• To foresee delays for flights, ships, trains, etc. (through predictive analysis)
• To create promotional offers in the business sector
• To find the best-suited time to deliver goods
• To forecast the next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections (politics)

Why is Data Science important?
• Data science is important because it combines tools, methods, and technology to generate meaning from data.
• Modern organizations are inundated with data; there is a proliferation (rapid increase) of devices that can automatically collect and store information.
• Online systems and payment portals capture more and more data in e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities.

Future of Data Science
Artificial intelligence and machine learning innovations have made data processing faster and more efficient. Industry demand has created an ecosystem of courses, degrees, and job positions within the field of data science. Because of the cross-functional skill set and expertise required, data science shows strong projected growth over the coming decades.

DS – Applications (12)
1. Healthcare
2. Consumer goods (malls and marts)
3. Finance and stock markets
4. Industry
5. Logistics (transportation)
6. Politics
7. Airline route planning
8. E-commerce (online business)
9. Pattern recognition
10. Speech recognition
11. Internet search
12. Video games
How Does a Data Scientist Work?
A Data Scientist requires expertise in several backgrounds:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
A Data Scientist must find patterns within the data. Before he or she can find the patterns, he or she must organize the data in a standard format.
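As a small illustration of that last point, here is a minimal Python sketch, using pandas, of pulling inconsistently keyed raw records into one standard tabular format. The records and field names are made up for illustration.

import pandas as pd

# Raw, inconsistently keyed records as they might arrive from a source
raw_records = [
    {"Name": "Asha", "age": "34", "city": "Bengaluru"},
    {"name": "Ravi", "Age": 41},  # missing city, mixed-case keys
]

# Normalize keys, fill gaps, and enforce types to get one standard format
cleaned = []
for rec in raw_records:
    rec = {key.lower(): value for key, value in rec.items()}
    cleaned.append({
        "name": rec.get("name"),
        "age": int(rec["age"]) if "age" in rec else None,
        "city": rec.get("city", "unknown"),
    })

df = pd.DataFrame(cleaned)
print(df)  # every record now has the same columns: name, age, city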
Data Science - Conclusions
Data Science is a multidisciplinary field that uses scientific methods, algorithms, and computer systems to extract knowledge and insights from structured and unstructured data. It combines aspects of statistics, machine learning, and domain expertise to analyze data and make informed decisions.

How Data Science Works? (overview diagram)

Big Data (Hadoop)
Big data refers to significant volumes of data that cannot be processed effectively with the traditional (database) applications in current use. Examples of databases are dBase, MS-Access, MySQL, NoSQL, Oracle, Sybase, data warehouses, and Hadoop. The processing of big data begins with raw data that is not aggregated and is impossible to store in the memory of a single computer.

Big Data - Outlook (diagram); Big Data - Applications (diagram)

Big Data and Data Science Hype (Boom)

What is Big Data?
Big data refers to large, diverse sets of information that grow at ever-increasing rates. The term encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered (commonly known as the "Three V's" of big data). Big data provides the raw material used in data mining.

Three V's of Big Data
1. Volume: the amount of data
2. Velocity: the speed of data processing
3. Variety: the number of types of data
These three V's define the properties, or dimensions, of big data.

Types of Big Data
1. Structured: Structured data has a standardized format for efficient access by software and humans alike. It is typically tabular, with rows and columns that clearly define data attributes. Computers can effectively process structured data for insights due to its quantitative nature.
2. Unstructured: Unstructured data consists of datasets (typically large collections of files) that are not stored in a structured database format. Unstructured data has an internal structure, but it is not predefined through data models. It may be human-generated or machine-generated, in a textual or non-textual format.
3. Semi-structured: Semi-structured data is data that is not captured or formatted in conventional ways. It does not follow the format of a tabular data model or relational database because it does not have a fixed schema.
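To make the structured/semi-structured distinction concrete, here is a minimal Python sketch; the order records and their fields are invented for illustration. The tabular DataFrame has a fixed schema, while the JSON records can each carry different fields.

import json
import pandas as pd

# Structured: fixed schema, rows and columns
structured = pd.DataFrame({"order_id": [1, 2], "amount": [250.0, 99.5]})

# Semi-structured: JSON records whose fields may vary from record to record
semi_structured = json.loads("""
[
  {"order_id": 1, "amount": 250.0},
  {"order_id": 2, "amount": 99.5, "coupon": {"code": "FEST10", "pct": 10}}
]
""")

print(structured)
print(semi_structured[1]["coupon"]["code"])  # nested, optional field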
Big Data - Uses
• Big data is analyzed for insights that can lead to better decisions and strategic business moves.
• Big data consists of high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing, enabling enhanced insight, decision making, and process automation.
• Businesses that use big data effectively hold a potential competitive advantage over those that don't, because they are able to make faster and more informed business decisions.

Data Collection
• Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be structured or unstructured, such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
• In the process of big data analysis, data collection is the initial step, carried out before analyzing the data for patterns or useful information. The data to be analyzed must be collected from valid sources.
• The collected data is further divided mainly into two types:
1. Primary data
2. Secondary data

Datafication (a Big Data trend, c. 2013)
What is datafication? Datafication is a technological trend turning many aspects of our life into data, which is subsequently transformed into information realized as a new form of value.
What is the function of datafication? Datafication is the act of transforming tasks, processes, and behaviors into data. It is a technological trend that turns actions into quantifiable, measurable data that can be used for real-time tracking, analytics, and insight. Big Data has been called "a revolutionary event that will transform how we live, work, and think."

Benefits of Datafication
1. Datafication is financially advantageous to pursue, since it provides great opportunities for streamlining corporate procedures.
2. Datafication is a cutting-edge process for creating a futuristic framework that is both secure and inventive.

Data Science Profiles
1. Data Analyst: Data analysts are responsible for reviewing data to identify the key information in customers' businesses. Their work is the process of collecting, processing, and analyzing data to extract meaningful insights, and data analysts support decision-making processes.
2. Data Scientist: Data scientists use data to understand the problem at hand. They are responsible for collecting, analyzing, and interpreting data to help drive decision making.
3. Data Engineer: Data engineers are experts responsible for designing, maintaining, and optimizing the data infrastructure used to manage and transform data.

Data Science – 12 Benefits
1. Enhanced decision-making capabilities
2. Streamlined operations
3. Customer insights and personalization
4. Improved work efficiency
5. High demand in the job market
6. Measuring performance
7. Providing information to internal finances
8. Developing better products
9. Increasing efficiency
10. Mitigating (reducing) risk and fraud
11. Predicting outcomes and trends
12. Improving customer experiences

Meta Data - Meaning
Metadata is information about data that helps users understand, find, and reuse it. It is a key part of data science and can be used for many purposes, including:
• Data quality: metadata can help confirm that data is accurate and reliable.
• Data governance: metadata is important for compliance with corporate policies and agreed standards.

Meta Data - Definition
Metadata is defined as data providing information about one or more aspects of other data; it is used to summarize basic information about data, which makes tracking and working with specific data easier. Some examples include the means of creation of the data and the purpose of the data.

Meta Data - Examples
Metadata makes finding and working with data easier, allowing the user to sort or locate specific documents. Some examples of basic metadata are author, date created, date modified, and file size. Metadata is also used for unstructured data such as images, video, web pages, and spreadsheets (Excel).

Statistical Inference (DS)
Meaning: Using data analysis and statistics to draw conclusions about a population is called statistical inference. An inference is a conclusion reached on the basis of evidence and reasoning: "researchers are entrusted with drawing inferences from the data."

Populations and Samples; Populations and Samples of Big Data
A population refers to the entire group of individuals, objects, or items that share a common characteristic within a given context. A sample, on the other hand, is a subset of the population that is selected for analysis.
What are population and sample in big data? A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

Population vs. Sample
• Population: Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible.
• Sample: Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets.
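The following minimal Python sketch illustrates statistical inference from a sample. The population here is simulated (one million synthetic values), so all numbers are purely illustrative: we draw a small sample and use its mean, with an approximate 95% confidence interval, to infer the population mean.

import random
import statistics

random.seed(42)

# Simulated population: a million values we could not realistically measure in full
population = [random.gauss(170, 10) for _ in range(1_000_000)]

# Draw a small sample and infer the population mean from it
sample = random.sample(population, k=200)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / (len(sample) ** 0.5)  # standard error of the mean

# Approximate 95% confidence interval for the population mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print(f"true population mean = {statistics.mean(population):.2f}")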
Modeling in Data Science
Data modeling is the process of analyzing and defining all the different data types your business collects and produces, as well as the relationships between those bits of data.
What is data modeling? (IBM) Data modeling is the process of creating a visual representation of either a whole information system or parts of it, to communicate connections between data points and structures.

Data Science Modeling Steps (10)
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
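As a compact illustration of steps 5 through 8 (split, choose, train, evaluate), here is a minimal Python sketch using scikit-learn's bundled Iris dataset. Logistic regression is chosen only as an example; any suitable model could take its place.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # already-collected, already-clean data

# Step 5: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 6-7: choose a model and train it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 8: evaluate on data the model has never seen
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))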
Data Warehouse
1. A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more. A data warehouse is suited for ad hoc analysis as well as custom reporting. ETL (Extract, Transform, and Load) is the key tool in data warehousing.
2. A data warehouse is a type of data management system designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended for queries and analysis, and they often contain large amounts of historical data.
• Data warehousing is essential for modern data management, providing a strong foundation for organizations to consolidate and analyze data strategically.
• Related topics: data warehouse architecture and the ETL process in a data warehouse.

What is ETL, and what is it used for?
Extract, Transform, and Load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse. ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML).
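Below is a minimal ETL sketch in Python. The source file orders.csv, its columns, and the table name are hypothetical; a SQLite database stands in for the warehouse. It extracts rows from the source, transforms them with simple business rules, and loads the result into the repository.

import sqlite3
import pandas as pd

# Extract: read raw data from a source system (hypothetical CSV file)
raw = pd.read_csv("orders.csv")  # assumed columns: order_id, amount, country

# Transform: apply business rules - drop bad rows, normalize a field
clean = raw.dropna(subset=["amount"])
clean = clean[clean["amount"] > 0]
clean["country"] = clean["country"].str.upper()

# Load: write the cleaned rows into the central repository (SQLite here)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)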
Philosophy of Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) adopts a distinctive philosophy. It focuses on maximizing insight, extracting crucial variables, detecting outliers, and testing assumptions in order to understand the data deeply.

What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA), which frequently uses charts or images, is a key step in fully understanding data and learning all its features. Recognizing patterns, identifying important variables, and seeing whether things are connected are all crucial, as is finding mistakes in the data. EDA yields important information, making the data easier to understand while removing redundant or unnecessary values. It involves looking for patterns, spotting anomalies, testing ideas, and checking assumptions using simple statistics and visualizations.

Why is Exploratory Data Analysis important in data science?
• It facilitates a comprehensive understanding of data before any assumptions are made.
• Its primary objective is to unveil patterns, identify errors, and pinpoint outliers.
• It helps ensure data scientists produce valid results aligned with desired business outcomes.
• Leveraging vast datasets for critical decision-making has become a cornerstone of success.

EDA Tools and Techniques: Python or the R programming language for data exploration.
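As a minimal EDA sketch in Python, using a small made-up dataset, the following computes summary statistics, counts missing values, and flags outliers with the 1.5 x IQR rule; a fuller analysis would add plots such as histograms, box plots, and scatter plots.

import pandas as pd

# Small made-up dataset standing in for real observations
df = pd.DataFrame({"income": [32, 35, 31, 38, 36, 240, 33, None, 34, 37]})

print(df.describe())    # summary statistics
print(df.isna().sum())  # missing values per column

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)         # the value 240 stands out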
Benefits of EDA (Exploratory Data Analysis)
• First, EDA offers numerous benefits when dealing with datasets: it aids in generating insights and in framing queries or investigations.
• Secondly, EDA contributes to assessing the quality and authenticity of the data. Identifying errors, missing values, or biases in the data is crucial so that they can be rectified or adjusted.
• Thirdly, EDA assists in selecting the most suitable techniques and models for your data.

Conclusions: EDA
Exploratory Data Analysis (EDA) is insightful because it encapsulates the attributes and qualities of a dataset. EDA adapts to the characteristics of the data; it operates more as a method than as a rigid procedure, employing visual representations like charts and diagrams alongside both graphical and non-graphical statistical approaches. EDA is an efficient data analysis tool, and in AI-driven sectors like retail, e-commerce, banking, finance, agriculture, healthcare, and more, it holds pivotal significance.

Data Science Process - Life Cycle (6 steps)
Step 1: Define the Problem and Create a Project Charter
Clearly defining the research goals is the first step in the Data Science Process. A project charter outlines the objectives, resources, deliverables, and timeline, ensuring that all stakeholders are aligned.
Step 2: Retrieve Data
Data can be stored in databases, data warehouses, or data lakes within an organization. Accessing this data often involves navigating company policies and requesting permissions.

Step 3: Data Cleansing, Integration, and Transformation
Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data integration combines datasets from different sources, while data transformation prepares the data for modeling by reshaping variables or creating new features, as the sketch below illustrates.
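Here is a minimal Python sketch of Step 3, with all column names and values invented for illustration: it cleanses duplicates and impossible values, integrates two sources with a join on a shared key, and transforms the result by deriving a new feature.

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                          "age": [25, 40, 40, -5]})  # duplicate row, bad age
orders = pd.DataFrame({"cust_id": [1, 2, 3],
                       "amount": [120.0, 80.0, 200.0]})

# Cleansing: drop duplicate rows and impossible ages
customers = customers.drop_duplicates()
customers = customers[customers["age"] > 0]

# Integration: combine the two sources on the shared key
merged = customers.merge(orders, on="cust_id", how="inner")

# Transformation: derive a new feature for modeling
merged["high_value"] = merged["amount"] > 100
print(merged)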
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like scatter plots, histograms, and box plots are used to visualize data and identify trends. This phase helps in selecting the right modeling techniques.

Step 5: Build Models
In this step, machine learning or deep learning models are built to make predictions or classifications based on the data. The choice of algorithm depends on the complexity of the problem and the type of data.

Step 6: Present Findings and Deploy Models
Once the analysis is complete, results are presented to stakeholders. Models are deployed into production systems to automate decision-making or support ongoing analysis.

DATA SCIENTIST'S ROLE
A Data Scientist's role involves leveraging data to derive actionable insights and support data-driven decision-making. Their roles and responsibilities can be summarized into several key areas:
• Problem Definition: Collaborate with stakeholders to understand business objectives and translate them into data science problems.
• Data Collection: Identify and gather relevant data from various sources using techniques like web scraping, APIs, and database querying.
• Data Cleaning and Preprocessing: Clean and preprocess data by handling missing values, removing duplicates, and transforming data into a suitable format for analysis.
• Exploratory Data Analysis (EDA): Perform descriptive statistics and create visualizations to uncover patterns and insights within the data.
• Feature Engineering: Create and select important features to enhance model performance and reduce dimensionality.
• Model Building: Choose appropriate algorithms, train models, and fine-tune parameters to optimize performance.
• Model Evaluation: Evaluate models using relevant metrics and validate them to ensure they generalize well to new data.
• Model Deployment: Develop and implement a strategy for deploying models into production environments, ensuring seamless integration with existing systems.
• Monitoring and Maintenance: Continuously monitor model performance, update and retrain models as necessary, and perform error analysis to refine models.
• Communication and Reporting: Present findings and insights through clear and compelling storytelling, creating reports and dashboards for stakeholders.
• Ethical Considerations: Ensure data privacy, comply with regulations, and mitigate biases to promote fairness and the ethical use of data.
• Continuous Learning: Stay updated with the latest advancements in data science and continuously experiment with new techniques and tools.

Conclusion: A Data Scientist combines technical skills with business acumen to transform data into valuable insights that drive strategic decisions and operational improvements.

Fundamentals of Data Science