Data Science with .NET and Polyglot Notebooks: Programmer's guide to data science using ML.NET, OpenAI, and Semantic Kernel
By Matt Eland
Data Science with .NET and Polyglot Notebooks
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kunal Sawant
Publishing Product Manager: Debadrita Chatterjee
Book Project Manager: Manisha Singh
Senior Editor: Esha Banerjee
Technical Editor: Jubit Pincy
Copy Editor: Safis Editing
Proofreader: Esha Banerjee
Indexer: Subalakshmi Govindhan
Production Designer: Joshua Misquitta
DevRel Marketing Coordinator: Sonia Chauhan
First published: August 2024
Production reference: 1230824
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83588-296-2
www.packtpub.com
To Jon, Diego, Brett, Aleksei, Michael, Luis, Cesar, Bruno, and all the others who made the things described in this book possible.
To Sam and Sadukie, who encouraged, supported, and mercilessly heckled me as I began to publicly teach the topics captured in this book.
To Heather, for her endless love and patience, and for telling me to just go get a master’s degree already!
Contributors
About the author
Matt Eland is a senior software engineering and data science consultant at Leading EDJE in Columbus, Ohio. He loves sharing his journey by teaching software engineering, AI, and data science concepts in the most engaging ways possible. Matt has used machine learning to settle debates over whether certain movies are Christmas movies, reinforcement learning to create digital attack squirrels, data analytics to suggest improvements to his favorite TV show, and AI agents to play board games and to create an AI agent with the personality of a dog. He is the author of multiple books and courses, helps organize a user group and regional conferences, holds a Master of Science in Data Analytics, and is a 2x Microsoft MVP in AI and .NET.
Thank you to my friends and family for their continued support, as well as my coworkers at Leading EDJE and my client sites who smile and nod at my evening and weekend writing endeavors. Special thanks to Heather, Sam, Sadukie, Eddie, Victor, Mike, James, Kanan, and Stephanie for celebrating little milestones with me. Extra special thanks to Sam Gomez and Sam Nasr for being my fellow MVPs and technical reviewers. Finally, thank you to Debadrita, Esha, and Manisha for your work on our second project together at Packt.
About the reviewers
Sam Nasr has been a software developer since 1995, focusing primarily on Microsoft technologies. He is a Senior Software Engineer with NIS Technologies, where he consults and teaches clients about the latest .NET technologies. Sam has achieved multiple certifications from Microsoft, such as MCSA, MCAD, MCTS, and MCT, and he has been the leader of the Cleveland C# since 2003. He also holds leadership roles for the .NET Study Group and Azure Cleveland User Group.
When not coding, Sam loves spending time with his family and friends or volunteering at his local church.
Samuel Gomez has worked in software development for 15+ years (mostly Microsoft technologies). He is deeply passionate about the problem-solving aspect of his work. Recently, he has dedicated himself to exploring AI and machine learning technologies and has been working on understanding how these technologies can be applied to different aspects of our lives.
Beyond coding, Sam enjoys spending time with his family. As a soccer enthusiast, he loves to play, watch, and coach the sport.
Table of Contents
Preface
Part 1: Data Analysis in Polyglot Notebooks
1
Data Science, Notebooks, and Kernels
Exploring the field of data science
The rise of big data
Data analytics
Machine learning
Artificial intelligence
Data science notebooks and Project Jupyter
Extending notebooks with kernels
Polyglot Notebooks and .NET Interactive
Summary
Further reading
2
Exploring Polyglot Notebooks
Technical requirements
Installing Polyglot Notebooks
Creating your first notebook
Executing notebook cells
Adding code cells
Working with variables
The Variables view
Markdown cells
Declaring classes and methods
Declaring methods
Declaring classes
Working with other languages
Sharing variables between languages
Exporting variables
Troubleshooting notebook execution
Resolving compiler errors
Problems with notebook execution
Diagnostic output for Polyglot Notebooks errors
Issues and the Polyglot Notebooks repository
Summary
Further reading
3
Getting Data and Code into Your Notebooks
Technical requirements
Importing code and NuGet packages
Importing code files
Importing NuGet packages
Importing project files
Reading CSV data
Understanding CSV data
Reading CSV data into a DataFrame
Specialized CSV loading scenarios
Troubleshooting CSV loading errors
Loading TSV and other delimited file formats
Getting JSON data with PowerShell
Building DataFrames from objects
Connecting to databases with SQL
Connecting to a SQL database
Executing SQL from SQL kernels
Sharing SQL results with other kernels
Alternative ways of connecting to the Database
Querying Kusto clusters with KQL
Summary
Further reading
4
Working with Tabular Data and DataFrames
Technical requirements
Understanding data cleaning and data wrangling
Where unclean data comes from
The impact of unclean data
Data cleaning and data wrangling
Working with DataFrames in C#
Viewing and sampling data
Rows
Getting and setting cell values
Iterating over rows
Working with columns
Columns
Analyzing columns
Removing columns
Renaming columns
Adding a new column
Handling missing values
Sorting, filtering, grouping, and merging data
Sorting DataFrames
Grouping and aggregating DataFrames
Merging DataFrames
Filtering DataFrames
DataFrames in other languages
Summary
Further reading
5
Visualizing Data
Technical requirements
Understanding exploratory data analysis
Data visualization’s role in exploratory data analysis
Descriptive statistics for EDA
Extracting insights with descriptive statistics
Using DataFrame.Description to generate descriptive statistics
Descriptive statistics with MathNet.Numerics
Creating a box plot with ScottPlot
Performing univariate analysis with Plotly.NET
Plotly and Plotly.NET
Box plots in Plotly.NET
Violin plots with Plotly.NET
Histograms with Plotly.NET
Summary
Further reading
6
Variable Correlations
Technical requirements
Performing multivariate analysis with Plotly.NET
Loading data and dependencies
Multivariate analysis with box and violin plots
Plotting multiple values with scatter plots
Adding color to a scatter plot
3D scatter plots with Plotly.NET
Identifying variable correlations
Calculating variable correlations
Building feature correlation matrixes
Summary
Further reading
Part 2: Machine Learning with Polyglot Notebooks and ML.NET
7
Classification Experiments with ML.NET AutoML
Technical requirements
Understanding machine learning
Supervised learning
Classification and regression
Introducing ML.NET and AutoML
Understanding AutoML
AutoML and data pre-processing
Creating training and testing datasets
Training a classification model with ML.NET AutoML
Evaluating binary classification models
Evaluating our model
Calculating feature importance
Predicting values with binary classification models
Summary
Further reading
8
Regression Experiments with ML.NET AutoML
Technical requirements
Understanding regression
Our regression task
Regression as a numerical formula
Our regression dataset
Performing a regression experiment
Understanding cross-validation
Interpreting cross-validation results
Evaluating regression metrics
Predicting values for outliers
Applying PFI to regression models
Applying a regression model
Summary
Further reading
9
Beyond AutoML: Pipelines, Trainers, and Transforms
Technical requirements
Performing regression without AutoML
Features and pipelines
Creating an AutoML pipeline
Controlling AutoML pipelines
Customizing the Featurizer
Customizing the model trainer selector
Customizing hyperparameter tuning
Understanding the search space
Customizing the search space
Customizing the hyperparameter tuner
Scaling numeric columns
Selecting regression algorithms
Selecting binary classification algorithms
Summary
Further reading
10
Deploying Machine Learning Models
Technical requirements
Introducing our multi-class classification model
Training our model
Evaluating multi-class classification models
Generating test predictions
Exporting ML.NET models
Hosting ML.NET models in ASP.NET web applications
Configuring a PredictionEnginePool
Using the PredictionEnginePool
Understanding model performance, data drift, and MLOps
Detecting model drift
MLOps and updating models
Surveying additional ML.NET capabilities
ONNX and TensorFlow models in ML.NET
Summary
Further reading
Part 3: Exploring Generative AI with Polyglot Notebooks
11
Generative AI in Polyglot Notebooks
Technical requirements
Understanding Generative AI
Deploying generative AI models on Azure
Creating an Azure OpenAI Service
Deploying models on Azure OpenAI Service
Getting access credentials for Azure OpenAI
Connecting to an Azure OpenAI Service
Chatting with a deployed model
Customizing model behavior with prompt engineering
Zero-shot, one-shot, and few-shot inferencing
Using text embeddings
Generating images with DALL-E
Summary
Further reading
12
AI Orchestration with Semantic Kernel
Technical requirements
Understanding RAG and AI orchestration
Introducing Semantic Kernel
Chatting with Semantic Kernel functions
Building the Kernel
Creating a prompt function
Adding memory to Semantic Kernel
Defining complex functions
Creating functions from methods
Accepting KernelFunction parameters
Defining a memory function
Calling multiple functions using plugins
Examining FunctionResult objects
Azure OpenAI content filtering
Handling complex requests with planners
Knowing where to go from here
Summary
Further reading
Part 4: Polyglot Notebooks in the Enterprise
13
Enriching Documentation with Mermaid Diagrams
Technical requirements
Introducing Mermaid diagrams
Communicating logic with flowcharts
Communicating structure with class diagrams
Communicating data with Entity Relationship Diagrams
Communicating behavior with state diagrams
Communicating flow with sequence diagrams
Communicating workflow with Git graphs
Summary
Further reading
14
Extending Polyglot Notebooks
Technical requirements
Understanding default formatting behavior
Default object formatting
Default collection formatting
Styling output with custom formatters
Exploring magic commands
Creating a Polyglot Notebook extension
Working with parameters
Invoking code on kernels
Summary
Further reading
15
Adopting and Deploying Polyglot Notebooks
Technical requirements
Integrating Polyglot Notebooks into your day job
Enabling rapid experimentation
Supporting AI and analytics workloads
Assisting testing workloads
Training new team members with Polyglot Notebooks
Sharing Polyglot Notebooks with your team
Integrating Polyglot Notebooks into Jupyter or JupyterLab
Storing Notebooks in source control
Deploying Polyglot Notebooks to GitHub Codespaces
Configuring GitHub codespaces
Creating a codespace on GitHub
Advancing into machine learning and AI
Adding data science to your day job
Getting into data science
Succeeding in data science
Summary
Further reading
Index
Other Books You May Enjoy
Preface
Polyglot Notebooks and its .NET Interactive kernels let you mix together code and markdown cells in an interactive and highly visual notebook experience. By supporting rich documentation and putting the results of each code cell below the cell, Polyglot Notebooks makes it easy for developers to quickly execute code and see results, then make adjustments and try again.
This makes Polyglot Notebooks ideal for experimentation, documentation, and iterative and analytical tasks such as data analysis, machine learning, and artificial intelligence applications.
This book will teach you the Polyglot Notebooks environment and then use that as a basis to explore and learn data analysis, machine learning, and generative AI.
By the time you complete this book, you’ll have learned the basics of machine learning in .NET with ML.NET, data analysis and visualization with DataFrames and Plotly.NET, generative AI with Azure OpenAI and Semantic Kernel, and how these things can tie together to improve and expand your capabilities as a growing developer or data scientist.
Who this book is for
This book is for developers who are curious about augmenting their existing .NET development skills with new AI and machine learning capabilities and growing their knowledge and career prospects in the process.
This book teaches data analytics, machine learning, and generative artificial intelligence to experienced software engineers with .NET backgrounds through a series of small targeted experiments in an interactive notebook environment.
If you’re a .NET developer wanting to learn machine learning and AI, this book will help you expand your skills into those new areas.
If you’re thinking about making the same transition I did and specializing in AI and machine learning, or even becoming a data scientist, this book will help you begin your journey into the unknown by starting you on familiar ground with familiar tools and languages.
This book is also for engineers and engineering managers who are looking to build interactive documentation for their development teams, marry together code and documentation, and better equip other developers for working with their organization’s code.
If you’re looking to learn something new, find a new way of writing .NET code, gain some new tools for building applications, or explore new ways of sharing code and concepts, this book will help you on that journey.
What this book covers
Chapter 1, Data Science, Notebooks, and Kernels, introduces the idea of data science and interactive notebooks – and the scenarios notebooks help developers, data analysts, and data scientists tackle.
Chapter 2, Exploring Polyglot Notebooks, introduces the Polyglot Notebooks environment and .NET Interactive kernel to the reader by showing how interactive code cells and variable sharing work.
Chapter 3, Getting Data and Code into Your Notebooks, covers different ways of pulling data into a notebook environment including importing from CSV and TSV files, executing SQL and KQL queries, and pulling data in from external APIs.
Chapter 4, Working with Tabular Data and DataFrames, explores data analysis, data cleaning, and feature engineering using the ML.NET DataFrame object.
Chapter 5, Visualizing Data, shows how variable distributions can be explored, both in terms of descriptive statistics and in terms of various distribution plots in Plotly.NET and ScottPlot.
Chapter 6, Variable Correlations, takes data visualization to the next level by exploring how variable relationships can be visualized and introduces the idea of correlation scores and correlation matrixes.
Chapter 7, Classification Experiments with ML.NET AutoML, introduces machine learning through ML.NET by using ML.NET’s automated machine learning capabilities to perform our first classification experiment.
Chapter 8, Regression Experiments with ML.NET AutoML, expands our exploration into ML.NET by covering the prediction of numerical values with regression.
Chapter 9, Beyond AutoML: Pipelines, Trainers, and Transforms, moves away from automated machine learning and shows how ML.NET can be customized by using pipelines, transforms, model trainers, and hyperparameter tuning.
Chapter 10, Deploying Machine Learning Models, concludes our exploration of ML.NET by showing how models can be saved, loaded, and embedded into other applications, such as an ASP.NET web application.
Chapter 11, Generative AI in Polyglot Notebooks, introduces generative AI and prompt engineering by using Azure OpenAI models for chat completions, image generation, and text embeddings.
Chapter 12, AI Orchestration with Semantic Kernel, expands on our generative AI knowledge by introducing retrieval-augmented generation (RAG), AI orchestration, and Microsoft Semantic Kernel to achieve complex tasks with AI agents.
Chapter 13, Enriching Documentation with Mermaid Diagrams, shows how Mermaid diagrams serve as simple maintainable markdown-based diagrams that help illustrate common engineering workflows.
Chapter 14, Extending Polyglot Notebooks, talks about how the Polyglot Notebooks experience can be customized and extended by providing custom formatters and magic commands.
Chapter 15, Adopting and Deploying Polyglot Notebooks, discusses ways of integrating Polyglot Notebooks into your workflows and sharing notebooks with others – including deploying them to a Jupyter Notebook server or hosting them online in a GitHub Codespace.
To get the most out of this book
This book is intended for software developers familiar with the basics of .NET development in C# trying to learn more about data science in the .NET ecosystem. The ideal reader will be familiar with basic programming concepts and syntax related to C# and curious about learning more about data science.
Although the book covers some F#, I do not expect most readers to be familiar with the language and its syntax, so all F# syntax will be explained.
The following software has been covered in the book:
VS Code
.NET (C# and some F#)
ML.NET
Other languages covered in brief: PowerShell, SQL, and KQL
And, you can use any of the following operating systems for this book:
Windows
macOS
Linux
In order to run code online in GitHub Codespaces, you will need to create a free GitHub account.
Chapters 11 and 12 refer to resources available in Azure and OpenAI. These resources may involve minimal costs and require creating an account. However, these chapters can still be understood and experienced without running the code.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Science-with-.NET-and-Polyglot-Notebooks. If there’s an update to the code, it will be updated in the GitHub repository.
This book’s GitHub repository also features GitHub Codespaces that allow you to execute the code in your browser using a GitHub account. Using GitHub Codespaces is explained in more detail in Chapter 15.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and X (formerly Twitter) handles.
Here is an example: When the experiment has finished executing, we can grab our best model, an ITransformer, from the result’s best run via result.BestRun.Model.
A block of code, notebook cell, or other command you might enter is set as follows:
ExperimentResult<MulticlassClassificationMetrics> result =
    exp.Execute(split.TrainSet, split.TestSet);
ITransformer model = result.BestRun.Model;
var metrics = result.BestRun.ValidationMetrics;
metrics.ConfusionMatrix.GetFormattedConfusionTable()
The output of any command is written as follows:
Confusion table
          ||========================
PREDICTED ||   Win |  Loss |  Draw | Recall
TRUTH     ||========================
      Win || 6,537 |     3 |   438 | 0.9368
     Loss ||     1 | 6,544 |   432 | 0.9379
     Draw ||   171 |   220 | 4,271 | 0.9161
          ||========================
Precision ||0.9744 |0.9670 |0.8308 |
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: A Variables pane to view variables in the kernel.
Tips or important notes
Appear like this.
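As a brief aside on reading the sample output shown earlier in these conventions: the Recall and Precision figures in a confusion table follow directly from its counts. The sketch below is not taken from the book's code. It simply retypes the counts from the sample table and recomputes both metrics for each class, assuming that rows represent the true class and columns the predicted class, as the TRUTH and PREDICTED headers suggest. The variable names (labels, counts, recall, precision) are illustrative choices of mine.

```csharp
using System;
using System.Linq;

// Counts retyped from the sample confusion table in the conventions section.
// Assumption: each row is a true class and each column is a predicted class.
string[] labels = { "Win", "Loss", "Draw" };
int[][] counts =
{
    new[] { 6537, 3, 438 },   // truth: Win
    new[] { 1, 6544, 432 },   // truth: Loss
    new[] { 171, 220, 4271 }, // truth: Draw
};

double[] recall = new double[labels.Length];
double[] precision = new double[labels.Length];

for (int i = 0; i < labels.Length; i++)
{
    int correct = counts[i][i];                   // predicted class i and truly class i
    int rowTotal = counts[i].Sum();               // everything that truly is class i
    int columnTotal = counts.Sum(row => row[i]);  // everything predicted as class i

    recall[i] = (double)correct / rowTotal;       // share of true class i that was found
    precision[i] = (double)correct / columnTotal; // share of class i predictions that were right

    Console.WriteLine($"{labels[i]}: recall {recall[i]:F4}, precision {precision[i]:F4}");
}
```

Rounded to four decimal places, each value should agree with the corresponding Recall or Precision entry in the sample table, which is a quick way to check that you are reading the rows and columns the right way around.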
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share your thoughts
Once you’ve read Data Science with .NET and Polyglot Notebooks, we’d love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/978-1-83588-296-2
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
Part 1: Data Analysis in Polyglot Notebooks
We’ll start our journey by introducing Polyglot Notebooks and their role in software engineering, data analysis, and machine learning workflows. We’ll cover the Polyglot Notebooks technology and user interface as well as the basic decisions and actions you’ll make working with notebooks.
The remainder of this part focuses on loading data into a notebook and then performing data analysis in Polyglot Notebooks with .NET tooling in C# and F#.
Whether you’re an experienced data analyst or have no prior knowledge, you’ll learn how to load up, analyze, clean, and manipulate data using .NET technologies like the DataFrame.
You’ll also see how you can effectively understand data distributions and create helpful visuals using libraries like Plotly.NET, Microsoft.Data.Analysis, and MathNet.Numerics.
This part has the following chapters:
Chapter 1, Data Science, Notebooks, and Kernels
Chapter 2, Exploring Polyglot Notebooks
Chapter 3, Getting Data and Code into Your Notebooks
Chapter 4, Working with Tabular Data and DataFrames
Chapter 5, Visualizing Data
Chapter 6, Variable Correlations
1
Data Science, Notebooks, and Kernels
Data science seems to be at an all-time high in popularity, as advances in computing, storage, and data analysis have made new types of applications not just possible but accessible to people who traditionally wouldn’t call themselves data scientists.
This book aims to help developers expand their existing knowledge and capabilities into the fields of data science, machine learning, artificial intelligence (AI), and data analysis.
In this opening chapter, we’ll cover these broad topics and explore what the field of data science includes, how the various parts of data science relate to each other, and how data science notebooks enable you to perform new tasks in new ways.
This chapter covers the following topics:
Exploring the field of data science
Data science notebooks and Project Jupyter
Extending notebooks with kernels
Polyglot Notebooks and .NET Interactive
Exploring the field of data science
Let’s start by defining what data science is and isn’t.
I define data science as the discipline of preparing and analyzing large amounts of data to extract insights and determine future behavior through machine learning and predictive modeling.
In other words, data science is all about gathering insights from the large amounts of data organizations amass every day. In fact, this is part of why data science has experienced an increase in popularity in the past decade.
Over the past several decades, organizations have moved more of their applications to be hosted on cloud computing providers such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud. This shift to cloud hosting had several benefits, including the following:
Ease in scaling web applications and databases as usage grows
Ease in adding new capabilities such as cloud storage and machine learning
Automated tools for collecting, storing, and analyzing application telemetry
While there are many other advantages (and some disadvantages) of cloud hosting, these three capabilities highlight an interesting trend: organizations are collecting and retaining more data in their applications than they were a decade ago.
The rise of big data
Most businesses and technology executives I advise agree that the data their organizations collect is a major advantage for their business.
Market trends support this, with a report from Fortune Business Insights sharing that the global data storage market was valued at 217 billion US dollars in 2021. This figure grew to 247 billion in 2023 and is projected to grow to nearly 800 billion by 2030. See Further reading for more information.
This need to store and analyze data is so common that we use the term big data to refer to the systemic collection and analysis of large amounts of data to extract and apply insights hidden in that data.
There are entire books dedicated to the storage of data in various databases, data warehouses, data marts, data lakes, and data lakehouses. This book isn’t focused on how data are stored, but rather on how to analyze data in a systematic way and use data to train and deploy machine learning models and generative AI solutions.
Because data science is often confused with other fields, let’s explore the related terms of data analytics, machine learning, and AI.
Data analytics
Data analytics is the field of using statistics and exploratory data analysis techniques to identify, explore, and communicate the significant trends, patterns, or characteristics of our datasets.
Data analytics tasks are often performed by data analysts whose job focuses on identifying and communicating trends in data in compelling ways. Depending on the data analyst and their organization, this may be done using a variety of tools, from spreadsheets such as Microsoft Excel to data visualization tools such as Power BI or Tableau, or by working with raw data in data science notebooks using programming languages such as Python or R.
Data analysis is not just for data analysts but has value for other roles, including the following:
Developers looking for patterns in errors or usage data
Managers looking to improve processes and products
Data scientists looking to understand correlations for machine learning
I’ve personally found a great deal of value in identifying and communicating trends in data. As a developer, I was able to analyze data to identify patterns in rare software bugs or slow application performance.
As an engineering manager, I was able to identify the parts of our code that encountered the most problems to prioritize improving code in that area and avoiding common problems. Any technologist in any role can use data analysis to communicate facts to leaders, and to shift conversations from gut feelings toward data-driven decision making.
As a data scientist and AI specialist, I routinely use data analysis to identify unusual outliers and missing values in data, identify data features that correlate with each other, and guide questions that I can ask to further understand the data and its implications.
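As a small illustration of two of these checks, the following sketch counts missing values and flags outliers. It uses plain Python (one of the notebook languages mentioned earlier) rather than the .NET tooling covered later in this book. The response_times_ms values are hypothetical, and the 1.5 x IQR cutoff is only one common rule of thumb for spotting outliers.

```python
from statistics import quantiles

# Hypothetical response times in milliseconds; None marks a missing value.
response_times_ms = [120, 135, None, 128, 131, 990, 125, None, 133, 127]

# Separate the recorded values from the missing ones.
present = [v for v in response_times_ms if v is not None]
missing_count = len(response_times_ms) - len(present)

# Flag outliers using the 1.5 * IQR rule of thumb (an assumption, not the only choice).
q1, _, q3 = quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

print(f"Missing values: {missing_count}")
print(f"Outliers: {outliers}")
```

Findings like these are usually a prompt for further questions, such as why a value is missing or whether an extreme reading is a real event or a recording error.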
I’ve had the opportunity to lead data analysis and data science workshops in the community, and what I’ve found is almost universal: people across many job titles see the ability to use data to identify and communicate the challenges in their day-to-day work as a way of obtaining the support they need.
We’ll explore data analysis throughout the rest of Part 1 of this book as we look at using Polyglot Notebooks, tabular data analysis tools, and various charting libraries to identify trends in data.
Machine learning
Machine learning involves using various mathematical techniques to identify patterns in data and preserve them in equations or models.
Machine learning is an umbrella term that encompasses several sub-fields:
Supervised learning, which seeks to predict new values based on historical trends using a trained machine learning model
Unsupervised learning, which looks for patterns, groups, and anomalies in existing data
Reinforcement learning, which seeks to optimize its choices by learning from prior successes and mistakes
Supervised learning
Most of the time, when I hear people talking about machine learning, they are referring to supervised learning. Supervised learning involves training a machine learning model from a dataset with historical data that includes values we will want to predict in the future.
This model training process takes time. For small datasets, model training may be near instantaneous or occur within a minute. However, larger datasets will require hours or potentially even days to train. The training process is important because it generates a trained machine learning model that internalizes trends observed in the training data. These associations allow it to predict values for new data the model hasn’t seen before to a certain degree of accuracy.
Thankfully, once a model is trained, it can be used for multiple predictions without having to repeat the training process. This means that the slow model training process is a required up-front time investment that allows fast predictions from the trained model.
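The split between slow training and fast prediction can be sketched with a deliberately tiny model. This is a minimal illustration in plain Python, not the ML.NET API: the LinearModel class, its fit and predict methods, and the usage data are all hypothetical, and real training involves far more data and computation than this one-feature least squares fit.

```python
class LinearModel:
    """A tiny one-feature linear model fitted with ordinary least squares."""

    def __init__(self):
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, xs, ys):
        # Training: the comparatively expensive step, performed once.
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        variance = sum((x - mean_x) ** 2 for x in xs)
        self.slope = covariance / variance
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        # Prediction: cheap arithmetic that reuses the stored parameters.
        return self.slope * x + self.intercept


# Hypothetical training data: hours of usage and resulting support tickets.
hours = [1, 2, 3, 4, 5]
tickets = [3, 5, 7, 9, 11]

model = LinearModel().fit(hours, tickets)  # train once
predictions = [model.predict(h) for h in (6, 7, 10)]  # predict many times
print(predictions)
```

The fit call is the only place the training data is examined. Every later predict call uses just the two stored numbers, which is why predictions stay fast no matter how long training took.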
Let’s consider a practical machine learning example.
Imagine you had data on historical car sales. Your dataset would include information such as the make, model, year, color, mileage, and condition of the vehicle. This dataset would also include the amount of money a customer paid when they bought the car.
You could use this data to train a model to predict the price that another car on your lot will ultimately sell for. A good model would predict a selling price close to the price the car will later sell for, while less accurate models might predict a value much higher or much lower than the actual selling price.
To illustrate this, examine the scatter plot in Figure 1.1 showing the relationship between the distance a car has driven and its ultimate resale price:
Figure 1.1 – A scatter plot of car selling prices and mileage
Here, the mileage of the car is only one factor in its selling price, but we can see a clear negative trend: the more a car has been driven, the lower its ultimate selling price typically is.
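To make the car example concrete, here is a rough sketch in plain Python. It fits a straight line to mileage and price, then compares its error against a naive model that always predicts the average price. The five cars are made-up values, not the data behind Figure 1.1, and mean absolute error is just one of several ways to measure how close predictions are to actual prices.

```python
# Hypothetical cars: mileage in thousands of miles, price in thousands of dollars.
mileage = [10, 30, 50, 70, 90]
price = [28, 24, 21, 15, 12]

# Fit a least squares line: price = slope * mileage + intercept.
mean_x = sum(mileage) / len(mileage)
mean_y = sum(price) / len(price)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(mileage, price)) / sum(
    (x - mean_x) ** 2 for x in mileage
)
intercept = mean_y - slope * mean_x


def mean_absolute_error(predicted, actual):
    """Average distance between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Model 1: the fitted line. Model 2: always guess the average price.
line_predictions = [slope * x + intercept for x in mileage]
baseline_predictions = [mean_y] * len(price)

mae_line = mean_absolute_error(line_predictions, price)
mae_baseline = mean_absolute_error(baseline_predictions, price)

print(f"Slope: {slope:.3f} (negative means price falls as mileage rises)")
print(f"Fitted line error: {mae_line:.2f}, baseline error: {mae_baseline:.2f}")
```

For simplicity, this sketch measures error on the same cars it was fitted to. A fairer evaluation would hold back some cars the model never saw during training.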
Data visualization
This scatter plot is one of the types of visuals we’ll cover in Chapter 6, Visualizing Variable Relationships, using Polyglot Notebooks and the Plotly.NET charting library. We’ll talk more about interpreting and customizing these and other chart types in that chapter.
In machine learning terminology, we refer to the value we’re trying to predict as a label (sometimes referred to as a target), and the values that factor into this value are referred to as features. In our car example, the label would be the selling price of the car while its features would include the mileage, year, make, color, and other contributing factors. We’ll discuss this terminology more in Part 2 of this book, but it’s important to know that supervised learning requires this labeled dataset – a dataset with historical labels that we’re trying to predict.
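The distinction between features and a label can be shown by splitting a small labeled dataset into its two parts. This is an illustrative sketch in plain Python. The sales records and the LABEL constant are hypothetical, and every value is made up.

```python
# Hypothetical car sales records; every value here is made up for illustration.
sales = [
    {"make": "Make A", "year": 2018, "mileage": 42000, "color": "blue", "price": 18500},
    {"make": "Make B", "year": 2015, "mileage": 88000, "color": "red", "price": 9900},
    {"make": "Make A", "year": 2021, "mileage": 15000, "color": "white", "price": 27250},
]

LABEL = "price"  # the value we want to predict

# Features: everything except the label. Labels: the historical selling prices.
features = [{k: v for k, v in row.items() if k != LABEL} for row in sales]
labels = [row[LABEL] for row in sales]

print(features[0])
print(labels)
```

Because every row carries a known price, this counts as a labeled dataset. Remove the price column and supervised learning would have nothing to learn from.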
We’ll discuss supervised learning more in Part 2 of this book using Microsoft’s open source ML.NET machine learning library.
Unsupervised learning
Unsupervised learning is generally used for clustering and anomaly detection applications such as identifying different types of users or flagging suspicious network traffic. A key distinction of unsupervised learning is that it does not need a dataset with values that you’ll try to predict in the future. Instead, unsupervised learning can be used to group data points into distinct groups or clusters or to identify anomalies in data.
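As a minimal sketch of clustering, the following plain Python groups hypothetical session lengths into two clusters using a basic k-means loop. The kmeans_1d function and the data are invented for illustration. Production libraries use more careful initialization and stopping rules, and this version assumes k is at least 2.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Group numbers into k clusters with a basic k-means loop (k >= 2)."""
    # Simple deterministic start: spread the centroids across the value range.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical session lengths in minutes; note there is no label column.
session_minutes = [2, 3, 4, 3, 45, 50, 48, 5, 52]

clusters, centroids = kmeans_1d(session_minutes, k=2)
print(clusters)
print(centroids)
```

The algorithm is never told which users are casual and which are heavy. It discovers the two groups purely from how the values sit relative to one another.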
Unsupervised learning is supported by ML.NET but is less frequently used in software development organizations in my experience, and as a result, is beyond the scope of this book. See the Further reading section at the end of this chapter for a link to more information