Data Science with .NET and Polyglot Notebooks: Programmer's guide to data science using ML.NET, OpenAI, and Semantic Kernel
Language: English
Publisher: Packt Publishing
Release date: Aug 30, 2024
ISBN: 9781835882979

    Book preview

    Data Science with .NET and Polyglot Notebooks - Matt Eland


    Data Science with .NET and Polyglot Notebooks

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Kunal Sawant

    Publishing Product Manager: Debadrita Chatterjee

    Book Project Manager: Manisha Singh

    Senior Editor: Esha Banerjee

    Technical Editor: Jubit Pincy

    Copy Editor: Safis Editing

    Proofreader: Esha Banerjee

    Indexer: Subalakshmi Govindhan

    Production Designer: Joshua Misquitta

    DevRel Marketing Coordinator: Sonia Chauhan

    First published: August 2024

    Production reference: 1230824

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN 978-1-83588-296-2

    www.packtpub.com

    To Jon, Diego, Brett, Aleksei, Michael, Luis, Cesar, Bruno, and all the others who made the things described in this book possible.

    To Sam and Sadukie, who encouraged, supported, and mercilessly heckled me as I began to publicly teach the topics captured in this book.

    To Heather, for her endless love and patience, and for telling me to just go get a master’s degree already!

    Contributors

    About the author

    Matt Eland is a senior software engineering and data science consultant at Leading EDJE in Columbus, Ohio. He loves sharing his journey by teaching software engineering, AI, and data science concepts in the most engaging ways possible. Matt has used machine learning to settle debates over whether certain movies are Christmas movies, reinforcement learning to create digital attack squirrels, data analytics to suggest improvements to his favorite TV show, and AI agents to play board games and create an AI agent with the personality of a dog. He is the author of multiple books and courses, helps organize a user group and regional conferences, holds a Master of Science in Data Analytics, and is a 2x Microsoft MVP in AI and .NET.

    Thank you to my friends and family for their continued support, as well as my coworkers at Leading EDJE and my client sites who smile and nod at my evening and weekend writing endeavors. Special thanks to Heather, Sam, Sadukie, Eddie, Victor, Mike, James, Kanan, and Stephanie for celebrating little milestones with me. Extra special thanks to Sam Gomez and Sam Nasr for being my fellow MVPs and technical reviewers. Finally, thank you to Debadrita, Esha, and Manisha for your work on our second project together at Packt.

    About the reviewers

    Sam Nasr has been a software developer since 1995, focusing primarily on Microsoft technologies. He is a Senior Software Engineer with NIS Technologies, where he consults and teaches clients about the latest .NET technologies. Sam has achieved multiple certifications from Microsoft, such as MCSA, MCAD, MCTS, and MCT, and he has been the leader of the Cleveland C# user group since 2003. He also holds leadership roles in the .NET Study Group and the Azure Cleveland User Group.

    When not coding, Sam loves spending time with his family and friends or volunteering at his local church.

    Samuel Gomez has worked in software development for 15+ years (mostly Microsoft technologies). He is deeply passionate about the problem-solving aspect of his work. Recently, he has dedicated himself to exploring AI and machine learning technologies and has been working on understanding how these technologies can be applied to different aspects of our lives.

    Beyond coding, Sam enjoys spending time with his family. As a soccer enthusiast, he loves to play, watch, and coach the sport.

    Table of Contents

    Preface

    Part 1: Data Analysis in Polyglot Notebooks

    1

    Data Science, Notebooks, and Kernels

    Exploring the field of data science

    The rise of big data

    Data analytics

    Machine learning

    Artificial intelligence

    Data science notebooks and Project Jupyter

    Extending notebooks with kernels

    Polyglot Notebooks and .NET Interactive

    Summary

    Further reading

    2

    Exploring Polyglot Notebooks

    Technical requirements

    Installing Polyglot Notebooks

    Creating your first notebook

    Executing notebook cells

    Adding code cells

    Working with variables

    The Variables view

    Markdown cells

    Declaring classes and methods

    Declaring methods

    Declaring classes

    Working with other languages

    Sharing variables between languages

    Exporting variables

    Troubleshooting notebook execution

    Resolving compiler errors

    Problems with notebook execution

    Diagnostic output for Polyglot Notebooks errors

    Issues and the Polyglot Notebooks repository

    Summary

    Further reading

    3

    Getting Data and Code into Your Notebooks

    Technical requirements

    Importing code and NuGet packages

    Importing code files

    Importing NuGet packages

    Importing project files

    Reading CSV data

    Understanding CSV data

    Reading CSV data into a DataFrame

    Specialized CSV loading scenarios

    Troubleshooting CSV loading errors

    Loading TSV and other delimited file formats

    Getting JSON data with PowerShell

    Building DataFrames from objects

    Connecting to databases with SQL

    Connecting to a SQL database

    Executing SQL from SQL kernels

    Sharing SQL results with other kernels

    Alternative ways of connecting to the Database

    Querying Kusto clusters with KQL

    Summary

    Further reading

    4

    Working with Tabular Data and DataFrames

    Technical requirements

    Understanding data cleaning and data wrangling

    Where unclean data comes from

    The impact of unclean data

    Data cleaning and data wrangling

    Working with DataFrames in C#

    Viewing and sampling data

    Rows

    Getting and setting cell values

    Iterating over rows

    Working with columns

    Columns

    Analyzing columns

    Removing columns

    Renaming columns

    Adding a new column

    Handling missing values

    Sorting, filtering, grouping, and merging data

    Sorting DataFrames

    Grouping and aggregating DataFrames

    Merging DataFrames

    Filtering DataFrames

    DataFrames in other languages

    Summary

    Further reading

    5

    Visualizing Data

    Technical requirements

    Understanding exploratory data analysis

    Data visualization’s role in exploratory data analysis

    Descriptive statistics for EDA

    Extracting insights with descriptive statistics

    Using DataFrame.Description to generate descriptive statistics

    Descriptive statistics with MathNet.Numerics

    Creating a box plot with ScottPlot

    Performing univariate analysis with Plotly.NET

    Plotly and Plotly.NET

    Box plots in Plotly.NET

    Violin plots with Plotly.NET

    Histograms with Plotly.NET

    Summary

    Further reading

    6

    Variable Correlations

    Technical requirements

    Performing multivariate analysis with Plotly.NET

    Loading data and dependencies

    Multivariate analysis with box and violin plots

    Plotting multiple values with scatter plots

    Adding color to a scatter plot

    3D scatter plots with Plotly.NET

    Identifying variable correlations

    Calculating variable correlations

    Building feature correlation matrixes

    Summary

    Further reading

    Part 2: Machine Learning with Polyglot Notebooks and ML.NET

    7

    Classification Experiments with ML.NET AutoML

    Technical requirements

    Understanding machine learning

    Supervised learning

    Classification and regression

    Introducing ML.NET and AutoML

    Understanding AutoML

    AutoML and data pre-processing

    Creating training and testing datasets

    Training a classification model with ML.NET AutoML

    Evaluating binary classification models

    Evaluating our model

    Calculating feature importance

    Predicting values with binary classification models

    Summary

    Further reading

    8

    Regression Experiments with ML.NET AutoML

    Technical requirements

    Understanding regression

    Our regression task

    Regression as a numerical formula

    Our regression dataset

    Performing a regression experiment

    Understanding cross-validation

    Interpreting cross-validation results

    Evaluating regression metrics

    Predicting values for outliers

    Applying PFI to regression models

    Applying a regression model

    Summary

    Further reading

    9

    Beyond AutoML: Pipelines, Trainers, and Transforms

    Technical requirements

    Performing regression without AutoML

    Features and pipelines

    Creating an AutoML pipeline

    Controlling AutoML pipelines

    Customizing the Featurizer

    Customizing the model trainer selector

    Customizing hyperparameter tuning

    Understanding the search space

    Customizing the search space

    Customizing the hyperparameter tuner

    Scaling numeric columns

    Selecting regression algorithms

    Selecting binary classification algorithms

    Summary

    Further reading

    10

    Deploying Machine Learning Models

    Technical requirements

    Introducing our multi-class classification model

    Training our model

    Evaluating multi-class classification models

    Generating test predictions

    Exporting ML.NET models

    Hosting ML.NET models in ASP.NET web applications

    Configuring a PredictionEnginePool

    Using the PredictionEnginePool

    Understanding model performance, data drift, and MLOps

    Detecting model drift

    MLOps and updating models

    Surveying additional ML.NET capabilities

    ONNX and TensorFlow models in ML.NET

    Summary

    Further reading

    Part 3: Exploring Generative AI with Polyglot Notebooks

    11

    Generative AI in Polyglot Notebooks

    Technical requirements

    Understanding Generative AI

    Deploying generative AI models on Azure

    Creating an Azure OpenAI Service

    Deploying models on Azure OpenAI Service

    Getting access credentials for Azure OpenAI

    Connecting to an Azure OpenAI Service

    Chatting with a deployed model

    Customizing model behavior with prompt engineering

    Zero-shot, one-shot, and few-shot inferencing

    Using text embeddings

    Generating images with DALL-E

    Summary

    Further reading

    12

    AI Orchestration with Semantic Kernel

    Technical requirements

    Understanding RAG and AI orchestration

    Introducing Semantic Kernel

    Chatting with Semantic Kernel functions

    Building the Kernel

    Creating a prompt function

    Adding memory to Semantic Kernel

    Defining complex functions

    Creating functions from methods

    Accepting KernelFunction parameters

    Defining a memory function

    Calling multiple functions using plugins

    Examining FunctionResult objects

    Azure OpenAI content filtering

    Handling complex requests with planners

    Knowing where to go from here

    Summary

    Further reading

    Part 4: Polyglot Notebooks in the Enterprise

    13

    Enriching Documentation with Mermaid Diagrams

    Technical requirements

    Introducing Mermaid diagrams

    Communicating logic with flowcharts

    Communicating structure with class diagrams

    Communicating data with Entity Relationship Diagrams

    Communicating behavior with state diagrams

    Communicating flow with sequence diagrams

    Communicating workflow with Git graphs

    Summary

    Further reading

    14

    Extending Polyglot Notebooks

    Technical requirements

    Understanding default formatting behavior

    Default object formatting

    Default collection formatting

    Styling output with custom formatters

    Exploring magic commands

    Creating a Polyglot Notebook extension

    Working with parameters

    Invoking code on kernels

    Summary

    Further reading

    15

    Adopting and Deploying Polyglot Notebooks

    Technical requirements

    Integrating Polyglot Notebooks into your day job

    Enabling rapid experimentation

    Supporting AI and analytics workloads

    Assisting testing workloads

    Training new team members with Polyglot Notebooks

    Sharing Polyglot Notebooks with your team

    Integrating Polyglot Notebooks into Jupyter or JupyterLab

    Storing Notebooks in source control

    Deploying Polyglot Notebooks to GitHub Codespaces

    Configuring GitHub codespaces

    Creating a codespace on GitHub

    Advancing into machine learning and AI

    Adding data science to your day job

    Getting into data science

    Succeeding in data science

    Summary

    Further reading

    Index

    Other Books You May Enjoy

    Preface

    Polyglot Notebooks and its .NET Interactive kernels let you mix together code and markdown cells in an interactive and highly visual notebook experience. By supporting rich documentation and putting the results of each code cell below the cell, Polyglot Notebooks makes it easy for developers to quickly execute code and see results, then make adjustments and try again.

    This makes Polyglot Notebooks ideal for experimentation, documentation, and iterative and analytical tasks such as data analysis, machine learning, and artificial intelligence applications.

    This book will teach you the Polyglot Notebooks environment and then use that as a basis to explore and learn data analysis, machine learning, and generative AI.

    By the time you complete this book, you’ll have learned the basics of machine learning in .NET with ML.NET, data analysis and visualization with DataFrames and Plotly.NET, generative AI with Azure OpenAI and Semantic Kernel, and how these things can tie together to improve and expand your capabilities as a growing developer or data scientist.

    Who this book is for

    This book is for developers who are curious about augmenting their existing .NET development skills with new AI and machine learning capabilities and growing their knowledge and career prospects in the process.

    This book teaches data analytics, machine learning, and generative artificial intelligence to experienced software engineers with .NET backgrounds through a series of small targeted experiments in an interactive notebook environment.

    If you’re a .NET developer wanting to learn machine learning and AI, this book will help you expand your skills into those new areas.

    If you’re thinking about making the same transition I did and specializing in AI and Machine Learning or even becoming a data scientist, this book will help you as you begin your journey into the unknown by starting you in familiar ground with familiar tools and languages.

    This book is also for engineers and engineering managers who are looking to build interactive documentation for their development teams, marry together code and documentation, and better equip other developers for working with their organization’s code.

    If you’re looking to learn something new, find a new way of writing .NET code, gain some new tools for building applications, or explore new ways of sharing code and concepts, this book will help you on that journey.

    What this book covers

    Chapter 1, Data Science, Notebooks, and Kernels, introduces the idea of data science and interactive notebooks – and the scenarios notebooks help developers, data analysts, and data scientists tackle.

    Chapter 2, Exploring Polyglot Notebooks, introduces the Polyglot Notebooks environment and .NET Interactive kernel to the reader by showing how interactive code cells and variable sharing work.

    Chapter 3, Getting Data and Code into Your Notebooks, covers different ways of pulling data into a notebook environment, including importing from CSV and TSV files, executing SQL and KQL queries, and pulling data in from external APIs.

    Chapter 4, Working with Tabular Data and DataFrames, explores data analysis, data cleaning, and feature engineering using the ML.NET DataFrame object.

    Chapter 5, Visualizing Data, shows how variable distributions can be explored, both in terms of descriptive statistics and in terms of various distribution plots in Plotly.NET and ScottPlot.

    Chapter 6, Variable Correlations, takes data visualization to the next level by exploring how variable relationships can be visualized and introduces the idea of correlation scores and correlation matrixes.

    Chapter 7, Classification Experiments with ML.NET AutoML, introduces machine learning through ML.NET by using ML.NET’s automated machine learning capabilities to perform our first classification experiment.

    Chapter 8, Regression Experiments with ML.NET AutoML, expands our exploration into ML.NET by covering the prediction of numerical values with regression.

    Chapter 9, Beyond AutoML: Pipelines, Trainers, and Transforms, moves away from automated machine learning and shows how ML.NET can be customized by using pipelines, transforms, model trainers, and hyperparameter tuning.

    Chapter 10, Deploying Machine Learning Models, concludes our exploration of ML.NET by showing how models can be saved, loaded, and embedded into other applications, such as an ASP.NET web application.

    Chapter 11, Generative AI in Polyglot Notebooks, introduces generative AI and prompt engineering by using Azure OpenAI models for chat completions, image generation, and text embeddings.

    Chapter 12, AI Orchestration with Semantic Kernel, expands on our generative AI knowledge by introducing retrieval-augmented generation (RAG), AI orchestration, and Microsoft Semantic Kernel to achieve complex tasks with AI agents.

    Chapter 13, Enriching Documentation with Mermaid Diagrams, shows how Mermaid diagrams serve as simple, maintainable markdown-based diagrams that help illustrate common engineering workflows.

    Chapter 14, Extending Polyglot Notebooks, talks about how the Polyglot Notebooks experience can be customized and extended by providing custom formatters and magic commands.

    Chapter 15, Adopting and Deploying Polyglot Notebooks, discusses ways of integrating Polyglot Notebooks into your workflows and sharing notebooks with others – including deploying them to a Jupyter Notebook server or hosting them online in a GitHub Codespace.

    To get the most out of this book

    This book is intended for software developers familiar with the basics of .NET development in C# trying to learn more about data science in the .NET ecosystem. The ideal reader will be familiar with basic programming concepts and syntax related to C# and curious about learning more about data science.

    Although the book covers some F#, I do not expect most readers to be familiar with the language and its syntax, so all F# syntax will be explained.

    The following software has been covered in the book:

    VS Code

    .NET (C# and some F#)

    ML.NET

    Other languages covered in brief: PowerShell, SQL, and KQL

    And, you can use any of the following operating systems for this book:

    Windows

    MacOS

    Linux

    In order to run code online in GitHub Codespaces, you will need to create a free GitHub account.

    Chapters 11 and 12 refer to resources available in Azure and OpenAI. These resources may involve minimal costs and require creating an account. However, these chapters can still be understood and experienced without running the code.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Science-with-.NET-and-Polyglot-Notebooks. If there’s an update to the code, it will be updated in the GitHub repository.

    This book’s GitHub repository also features GitHub Codespaces that allow you to execute the code in your browser using a GitHub account. Using GitHub Codespaces is explained more in Chapter 15.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and X (formerly Twitter) handles.

    Here is an example: When the experiment is finished executing, we can grab our best model, an ITransformer, from the result’s best run via result.BestRun.Model.

    A block of code, notebook cell, or other command you might enter is set as follows:

    ExperimentResult result =
        exp.Execute(split.TrainSet, split.TestSet);
    ITransformer model = result.BestRun.Model;
    var metrics = result.BestRun.ValidationMetrics;
    metrics.ConfusionMatrix.GetFormattedConfusionTable()

    The output of any command is written as follows:

    Confusion table
              ||========================
    PREDICTED ||   Win |  Loss |  Draw | Recall
    TRUTH     ||========================
          Win || 6,537 |     3 |   438 | 0.9368
         Loss ||     1 | 6,544 |   432 | 0.9379
         Draw ||   171 |   220 | 4,271 | 0.9161
              ||========================
    Precision ||0.9744 |0.9670 |0.8308 |
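    The Recall and Precision values in such a confusion table can be derived directly from its counts. As a worked check for the Win class (rows are the true class and columns the predicted class, so the remaining entries in the Win row are false negatives and the remaining entries in the Win column are false positives):

```latex
\mathrm{Recall}_{\mathrm{Win}} = \frac{6537}{6537 + 3 + 438} = \frac{6537}{6978} \approx 0.9368
\qquad
\mathrm{Precision}_{\mathrm{Win}} = \frac{6537}{6537 + 1 + 171} = \frac{6537}{6709} \approx 0.9744
```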

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: A Variables pane to view variables in the kernel.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share your thoughts

    Once you’ve read Data Science with .NET and Polyglot Notebooks, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/978-1-83588-296-2

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    Part 1: Data Analysis in Polyglot Notebooks

    We’ll start our journey by introducing Polyglot Notebooks and their role in software engineering, data analysis, and machine learning workflows. We’ll cover the Polyglot Notebooks technology and user interface as well as the basic decisions and actions you’ll make working with notebooks.

    The remainder of this part focuses on loading data into a notebook and then performing data analysis in Polyglot Notebooks using .NET tooling in C# and F#.

    Whether you’re an experienced data analyst or have no prior knowledge, you’ll learn how to load up, analyze, clean, and manipulate data using .NET technologies like the DataFrame.

    You’ll also see how you can effectively understand data distributions and create helpful visuals using libraries like Plotly.NET, Microsoft.Data.Analysis, and MathNet.Numerics.

    This part has the following chapters:

    Chapter 1, Data Science, Notebooks, and Kernels

    Chapter 2, Exploring Polyglot Notebooks

    Chapter 3, Getting Data and Code into Your Notebooks

    Chapter 4, Working with Tabular Data and DataFrames

    Chapter 5, Visualizing Data

    Chapter 6, Variable Correlations

    1

    Data Science, Notebooks, and Kernels

    Data science seems to be at an all-time high in popularity, as advances in computing, storage, and data analysis have made new types of applications not just possible, but accessible to people who traditionally wouldn’t call themselves data scientists.

    This book aims to help developers expand their existing knowledge and capabilities into the fields of data science, machine learning, artificial intelligence (AI), and data analysis.

    In this opening chapter, we’ll cover these broad topics and explore what the field of data science includes, how the various parts of data science relate to each other, and how data science notebooks enable you to perform new tasks in new ways.

    This chapter covers the following topics:

    Exploring the field of data science

    Data science notebooks and Project Jupyter

    Extending notebooks with kernels

    Polyglot Notebooks and .NET Interactive

    Exploring the field of data science

    Let’s start by defining what data science is and isn’t.

    I define data science as the discipline of preparing and analyzing large amounts of data to extract insights and determine future behavior through machine learning and predictive modeling.

    In other words, data science is all about gathering insights from the large amounts of data organizations amass every day. In fact, this is part of why data science has experienced an increase in popularity in the past decade.

    Over the past several decades, organizations have moved more of their applications to be hosted on cloud computing providers such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud. This shift to cloud hosting had several benefits, including the following:

    Ease in scaling web applications and databases as usage grows

    Ease in adding new capabilities such as cloud storage and machine learning

    Automated tools for collecting, storing, and analyzing application telemetry

    While there are many other advantages (and some disadvantages) of cloud hosting, these three capabilities highlight an interesting trend: organizations are collecting and retaining more data in their applications than they were a decade ago.

    The rise of big data

    Most businesses and technology executives I advise agree that the data their organizations collect is a major advantage for their business.

    Market trends support this, with a report from Fortune Business Insights sharing that the global data storage market was valued at 217 billion US dollars in 2021. This figure grew to 247 billion in 2023 and is projected to grow to nearly 800 billion by 2030. See Further reading for more information.

    This need to store and analyze data is so common that we use the term big data to refer to the systemic collection and analysis of large amounts of data to extract and apply insights hidden in that data.

    There are entire books dedicated to the storage of data in various databases, data warehouses, data marts, data lakes, and data lakehouses. This book isn’t focused on how data are stored, but rather on how to analyze data in a systematic way and use data to train and deploy machine learning models and generative AI solutions.

    Because data science is often confused with other fields, let’s explore the related terms of data analytics, machine learning, and AI.

    Data analytics

    Data analytics is the field of using statistics and exploratory data analysis techniques to identify, explore, and communicate the significant trends, patterns, or characteristics of our datasets.

    Data analytics tasks are often performed by data analysts whose job focuses on identifying and communicating trends in data in compelling ways. Depending on the data analyst and their organization, this may be done using a variety of tools, from spreadsheets such as Microsoft Excel to data visualization tools such as Power BI or Tableau, or by working with raw data in data science notebooks using programming languages such as Python or R.

    Data analysis is not just for data analysts but has value for other roles, including the following:

    Developers looking for patterns in errors or usage data

    Managers looking to improve processes and products

    Data scientists looking to understand correlations for machine learning

    I’ve personally found a great deal of value in identifying and communicating trends in data. As a developer, I was able to analyze data to identify patterns in rare software bugs or slow application performance.

    As an engineering manager, I was able to identify the parts of our code that encountered the most problems to prioritize improving code in that area and avoiding common problems. Any technologist in any role can use data analysis to communicate facts to leaders, and to shift conversations from gut feelings toward data-driven decision making.

    As a data scientist and AI specialist, I routinely use data analysis to identify unusual outliers and missing values in data, identify data features that correlate with each other, and guide questions that I can ask to further understand the data and its implications.

    I’ve had the opportunity to lead data analysis and data science workshops in the community, and what I’ve found is almost universal: people across a wide range of job titles see the ability to identify and communicate the challenges in their day-to-day work through data as a way of obtaining the support they need.

    We’ll explore data analysis throughout the rest of Part 1 of this book as we look at using Polyglot Notebooks, tabular data analysis tools, and various charting libraries to identify trends in data.

    Machine learning

    Machine learning involves using various mathematical techniques to identify patterns in data and preserve them in equations or models.

    Machine learning is an umbrella term that encompasses several sub-fields:

    Supervised learning, which seeks to predict new values based on historical trends using a trained machine learning model

    Unsupervised learning, which looks for patterns, groups, and anomalies in existing data

    Reinforcement learning, which seeks to optimize its choices by learning from prior successes and mistakes

    Supervised learning

    Most of the time, when I hear people talking about machine learning, they are referring to supervised learning. Supervised learning involves training a machine learning model from a dataset with historical data that includes values we will want to predict in the future.

    This model training process takes time. For small datasets, model training may be near instantaneous or occur within a minute. However, larger datasets will require hours or potentially even days to train. The training process is important because it generates a trained machine learning model that internalizes trends observed in the training data. These associations allow it to predict values for new data the model hasn’t seen before to a certain degree of accuracy.

    Thankfully, once a model is trained, it can be used for multiple predictions without having to repeat the training process. This means that the slow model training process is a required up-front time investment that allows fast predictions from the trained model.

    Let’s consider a practical machine learning example.

    Imagine you had data on historical car sales. Your dataset would include information such as the make, model, year, color, mileage, and condition of the vehicle. This dataset would also include the amount of money a customer paid when they bought the car.

    You could use this data to train a model to predict the price that another car on your lot will ultimately sell for. A good model would predict a selling price close to the price the car later sells for, while a less accurate model might predict a value much higher or much lower than the actual selling price.
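    Although this book’s examples use ML.NET, the train-then-predict workflow can be sketched in a few lines of plain Python. Here, a single-feature linear model (an ordinary least-squares line relating mileage to price) is fit once on hypothetical historical sales and then reused for predictions; both the data and the choice of a linear model are illustrative assumptions, not the book’s dataset or algorithm:

```python
# Minimal supervised learning sketch: fit price = slope * mileage + intercept
# from historical sales, then reuse the trained model for new predictions.
# (Toy data and a hand-rolled model, for illustration only.)

def train(mileages, prices):
    """Ordinary least-squares fit of a single-feature linear model."""
    n = len(mileages)
    mean_x = sum(mileages) / n
    mean_y = sum(prices) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(mileages, prices))
             / sum((x - mean_x) ** 2 for x in mileages))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, mileage):
    slope, intercept = model
    return slope * mileage + intercept

# Historical (feature, label) pairs: mileage driven -> selling price.
mileages = [10_000, 40_000, 70_000, 100_000, 130_000]
prices   = [27_000, 22_500, 18_000, 14_000, 9_500]

model = train(mileages, prices)       # the slow training step, done once
print(round(predict(model, 55_000)))  # fast predictions, repeated per car; → 20375
```

    Note how training happens once, up front, while `predict` can then be called cheaply for every new car on the lot.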

    To illustrate this, examine the scatter plot in Figure 1.1 showing the relationship between the distance a car has driven and its ultimate resale price:

    Figure 1.1 – A scatter plot of car selling prices and mileage

    Here, the mileage of the car is only one factor in its selling price, but we can see a clear negative trend: the more a car has been driven, the lower its selling price typically is.
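    A trend like this can be quantified with Pearson’s correlation coefficient, which ranges from -1 (a perfect negative relationship) to 1 (a perfect positive relationship). The sketch below uses toy numbers, not the book’s dataset:

```python
# Pearson's correlation coefficient between mileage and selling price.
# A value near -1 confirms a strong negative relationship. (Toy data.)
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

mileages = [10_000, 40_000, 70_000, 100_000, 130_000]
prices   = [27_000, 21_000, 19_500, 13_000, 10_500]
print(f"{pearson(mileages, prices):.3f}")  # → -0.985
```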

    Data visualization

    This scatter plot is one of the chart types we’ll cover in Chapter 6, Visualizing Variable Relationships, using Polyglot Notebooks and the Plotly.NET charting library. We’ll talk more about interpreting and customizing these and other chart types in that chapter.

    In machine learning terminology, we refer to the value we’re trying to predict as a label (sometimes referred to as a target), and the values that factor into this prediction are referred to as features. In our car example, the label would be the selling price of the car, while its features would include the mileage, year, make, color, and other contributing factors. We’ll discuss this terminology more in Part 2 of this book, but it’s important to know that supervised learning requires a labeled dataset: one that includes historical values of the label we’re trying to predict.
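    To make the terminology concrete, here is one way labeled car-sales records might be split into features and labels. The field names and values are hypothetical (in ML.NET, such records are typically modeled as C# classes rather than dictionaries):

```python
# Splitting historical sales records into features (model inputs) and the
# label (the value we want to predict). Field names are hypothetical.
sales = [
    {"make": "Honda", "year": 2019, "mileage": 42_000, "price": 18_500},
    {"make": "Ford", "year": 2016, "mileage": 88_000, "price": 11_200},
]

LABEL = "price"
features = [{k: v for k, v in row.items() if k != LABEL} for row in sales]
labels = [row[LABEL] for row in sales]

print(labels)  # → [18500, 11200]
```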

    We’ll discuss supervised learning more in Part 2 of this book using Microsoft’s open source ML.NET machine learning library.

    Unsupervised learning

    Unsupervised learning is generally used for clustering and anomaly detection applications such as identifying different types of users or flagging suspicious network traffic. A key distinction of unsupervised learning is that it does not need a dataset with values that you’ll try to predict in the future. Instead, unsupervised learning can be used to group data points into distinct groups or clusters or to identify anomalies in data.
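    As a sketch of the clustering idea (not ML.NET’s implementation, which provides its own k-means trainer), here is a tiny hand-rolled one-dimensional k-means that groups users into two clusters by daily session counts, with no labels involved. The data is illustrative:

```python
# Minimal unsupervised-learning sketch: 1-D k-means (k=2) grouping users by
# daily session counts. No labels are needed; the groups emerge from the data.
sessions = [1, 2, 2, 3, 18, 20, 22, 23]  # light users vs. heavy users

def kmeans_1d(points, k=2, iters=20):
    # Crude initialization: spread initial centers across the sorted data.
    centers = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(sessions)
print(sorted(round(c) for c in centers))  # → [2, 21]
```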

    Unsupervised learning is supported by ML.NET but, in my experience, is used less frequently in software development organizations, so it is beyond the scope of this book. See the Further reading section at the end of this chapter for a link to more information.
