
Pandas in 7 Days: Utilize Python to Manipulate Data, Conduct Scientific Computing, Time Series Analysis, and Exploratory Data Analysis
Ebook · 690 pages · 9 hours

  • Data Analysis

  • Pandas Library

  • Data Visualization

  • Data Manipulation

  • Python

  • Data Cleaning

  • Dataframe

  • Data Structures

  • Indexing

  • Jupyter Notebook

About this ebook

No matter how large or small your dataset, author Fabio Nelli uses this book to deliver the finest technical coaching on applying Pandas to conduct data analysis with zero worries.

Both newcomers and seasoned professionals will benefit from this book. It teaches you how to use the pandas library in just one week. Every day of the week, you'll learn and practise the features and data analysis exercises listed below:

Day 01: Get familiar with the fundamental data structures of pandas, including declaration, data loading, indexing, and so on.
Day 02: Execute commands and operations related to data selection and extraction, including slicing, sorting, masking, iteration, and query execution.
Day 03: Advanced commands and operations such as grouping, multi-indexing, reshaping, cross-tabulations, and aggregations.
Day 04: Working with several data frames, including comparison, joins, concatenation, and merges.
Day 05: Cleaning, pre-processing, and numerous strategies for data extraction from external files, the web, databases, and other data sources.
Day 06: Working with missing data, interpolation, duplicate labels, boolean data types, text data, and time-series datasets.
Day 07: Introduction to Jupyter Notebooks, interactive data analysis, and analytical reporting with Matplotlib's stunning graphics.
Language: English
Publisher: BPB Online LLP
Release date: Apr 25, 2022
ISBN: 9789355512147


    Book preview

    Pandas in 7 Days - Fabio Nelli

    CHAPTER 1

    Pandas, the Python Library

Before starting to work through seven chapters in seven days, it would be beneficial to introduce the Pandas library. In this chapter, you will see how this library, developed in Python – one of the most successful programming languages – was born, and then, in a few simple words, we will summarize what it is for and what it does.

Pandas is designed specifically for data analysis and is closely linked to this discipline. So, in this chapter, you will also be given a brief overview of the steps involved in data processing and data analysis in general. These concepts will be revisited in the following chapters with practical examples.

    Structure

    In this chapter, we will cover the following topics:

    A bit of history

    Why use Pandas (and Python) for data analysis?

    Data analysis

    Tabular form of data

    Objective

After studying this chapter, you will have an overview of how the Pandas library can serve in data analysis, the type of data it works on (data in tabular form), and the strengths that have made it a great success in recent years and an indispensable tool.

    A bit of history

In 2008, developer Wes McKinney, employed at AQR Capital Management (a global investment management company based in Connecticut, USA), began work on a tool that would allow quantitative analysis to be performed on financial data. In those years, there was already a thriving ecosystem of projects built on R, a programming language designed for statistical analyses, and many companies made extensive use of commercial software based on macros or scripting over data spreadsheets (such as Microsoft Excel).

But McKinney was looking for something new. The R language was quite difficult for the layman and took an enormous amount of time to learn. As for the commercial software, it had a cost and was also very limited in its basic functionality. Meanwhile, the Python language was emerging in the scientific environment as a very flexible and easy-to-learn programming language. Python was free, easy to learn, and could virtually interface with anything, thanks to the integration of many modules and libraries in continuous development.

Already in 2009, McKinney had in hand a module of functions and new data structures that formed the prototype of the library that would later become Pandas. Before leaving the company he worked for, he managed to convince the management to release the library as open source. And in 2012, another of McKinney's colleagues at AQR Capital Management, Chang She, joined the project, becoming the second-largest contributor to the library.

    Figure 1.1 shows in sequence, significant events in the history of the Pandas library:

    Figure 1.1: Pandas library timeline

Thus the pandas module was born, a Python library specifically for data analysis. The goal was to provide Python developers with efficient data structures and all the functionality necessary to carry out data analysis activities with this language in the best possible way.

The choice of using the tabular form as the structured data (like spreadsheets), and of integrating functionality similar to that of the SQL language for manipulating and generating database tables, was the key to the success of this library. Much of the functionality of the R language, and of the macros available in commercial software, was implemented in Python and integrated into the pandas library, making it an even more powerful and flexible tool.

The choice of Python as the programming language allows you to take advantage of many libraries that greatly expand the scope of your analysis projects. One such library is NumPy. This powerful tool for scientific computing, once integrated into the project, allowed the implementation of high-performance indexing and element selection tools, compensating in good part for the computational speed limitations of Python compared to other programming languages, such as C and Fortran.

    Figure 1.2 shows how choosing a Python library like Pandas allows you to take advantage of many other technologies, such as NumPy:

    Figure 1.2: Pandas and other Python libraries for data analysis

In 2017, because of the great diffusion of Python in academic and professional environments, interest in the pandas library and its potential grew, and it became a reference tool for everyone involved in the data analysis sector. Moreover, it was precisely in those years that Data Science began to be talked about almost everywhere.

Riding this wave of success, pandas has developed enormously in recent years, resulting in the release of numerous versions, thanks to the contribution of more and more developers.

Today, many data scientists in major companies that analyze data (Google, Facebook, JP Morgan) use Pandas. This library is the perfect tool for data analysis – powerful, flexible, easy to understand, and able to integrate with many other libraries – and is, therefore, indispensable to know.

The Pandas library is free software developed under the 3-Clause BSD License, also known as the Modified BSD License, and all releases and documentation are freely available on the official website (https://pandas.pydata.org/).

    Why use Pandas (and Python) for data analysis?

As shown earlier, the choice of Python as the programming language for developing Pandas has led to the undisputed success of this library. Since its first appearance in 1991, Python has spread almost exponentially, becoming one of the most used programming languages today. The merit is due precisely to the great flexibility of this language and its ability to integrate libraries that have extended its applicability to many areas of work.

    In this section, we will see in detail the following factors that led to choosing Pandas and the Python language as a tool for data analysis and data science in general:

    A trust gained over the years

    A flexible language that adapts to any context

    Automation, reproducibility, and interaction

    A trust gained over the years

As for data analysis, thanks to the pandas library, Python soon entered into competition with other programming languages and analysis software such as R, Stata, SAS, and MATLAB. Python, however, being a free product, has spread easily, and due to its enormous potential, it has been able to gain wide trust from users. Although at first it was seen as a tool for do-it-yourself calculation, over the years, Python has proven to guarantee excellent results and to be a valid tool in both academic and industrial fields. In fact, today Python enjoys the utmost confidence in the world of data science, and this is largely due to libraries such as Pandas, which provide all the tools necessary to carry out analysis and calculation work of the highest level, at virtually no cost.

    A flexible language that adapts to any context

In many organizations, it is common to find projects or processes in which several programming languages or calculation programs are involved at the same time. Since each works strictly in a specific (sometimes too specific) field, each can be applied only to one or a few steps of data processing, and none manages to cover the entire structure of a project.

    Therefore, to start a project, it is necessary to set up a certain number of development and work environments, in addition to developing interfaces or ways of exchanging information and data between them.

But why keep so many tools and development environments when just one might be more than enough? This is where Python comes in: a language so flexible that it adapts to many applications and can be used in many areas. In more and more organizations, this language has gradually replaced pre-existing technologies (web servers, calculation, programming, notebooks, algorithms, data visualization tools, and so on), unifying them under a single language.

    Automation, reproducibility, and interaction

Pandas is a library based on a programming language. But why not instead choose an application that has already been developed and tested, and therefore does not require development and programming skills? By choosing Pandas, you are forced to deal with programming activities that require time and skill, however simple and intuitive programming in Python can be. So, why choose Pandas?

It is a question of automation, reproducibility, and interaction. Programming languages are gradually replacing large applications that offer everything ready and tested. Applications, however powerful and feature-rich, are narrow and fixed environments. They can offer thousands of features, but often not the one that fits your needs completely.

By choosing a programming language, you give a free hand to those who use it. In this case, the developer or data scientist will choose and build a work environment that fully meets their needs. Python provides a large number of libraries full of ready-made tools, with algorithms and procedures already implemented, thus allowing you to work even at a high level. The data scientist is therefore free to use ready-made tools, integrating them into their personal work environment.

Large applications with ready-made work environments require the continuous presence of a user who interacts with them, selecting items, entering values, and performing continuous checks on the results in order to choose how to continue the analysis and processing of the data. Pandas, being a programming library, is instead a perfect automation tool. Everything can be automated: once the procedure is established, you can convert it into lines of code that can be executed as many times as you want. Furthermore, given that the operations, choices, and data management are carried out by programs that strictly follow commands read from lines of code, rather than by real-time human operations, reproducibility is also guaranteed.

    But why then choose Python and not another programming language like Java or C++?

Python differs from the latter in that it is an interpreted language. The code does not need to be written in the form of complete programs that must then be compiled to run. Python code can also be run one line at a time, letting you see how the system responds to each line and, depending on the result, make decisions or modify the code that follows. You can thus interact with the data analysis process. This adds interaction, one of the aspects that had made software applications advantageous compared to programming languages.
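This line-by-line way of working can be sketched with a few plain Python statements; the values here are invented for illustration. Each statement can be typed into an interactive session, its result inspected, and the next step chosen accordingly:

```python
# Run one statement at a time in an interactive session (the Python
# REPL): inspect each result before deciding the next step.
values = [12.5, 9.8, 15.1, 11.0]   # hypothetical measurements

mean = sum(values) / len(values)   # look at this result first...

# ...then, based on what you saw, write the next line
above_mean = [v for v in values if v > mean]
```

Nothing is compiled: each line executes immediately, which is exactly the interaction discussed above.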

    Data Analysis

The previous two sections should have made you aware of the qualities that make Pandas and Python an indispensable tool for data analysis. The next step is to take a quick look at what data analysis itself is. In the following section, we will have a quick overview of data analysis and some concepts that will help you better understand the purpose of many features in the Pandas library.

    What is data analysis?

Data analysis can be defined as the disciplinary area in which a series of techniques are applied to collect, handle, and process a series of data. The ultimate goal is to obtain the necessary information from these data in order to draw conclusions and support decision-making.

    Therefore, using calculation techniques and tools, we will start with the raw data extracted from the real world, which will undergo manipulations (cleaning, sorting, processing, statistics, views) in order to gradually convert them into useful information for the purpose set at the beginning of the analysis. The objective of the analysis will, therefore, be to evaluate hypotheses, validating or refuting them (scientific world), or to provide information to be able to make decisions (especially in the business and financial world).

    Thus, it is clear that data analysis is a fundamental tool for the academic environment, and more so for the business environment, where making correct decisions can be very important for investments.

    The data scientist

Given the importance of data analysis, increasingly elaborate techniques and tools have been developed over time, hand-in-hand with the development of information technologies. With the increase in the amount of data available, the use of increasingly powerful and efficient tools becomes indispensable.

    Today, the tools in this area are many, offering multiple possible approaches to the development of a data analysis environment. It is, therefore, up to the data scientist to choose the right tools among the various technologies available that can meet the objectives set for the analysis.

The data scientist, in addition to having skills related to calculation techniques and data analysis processing, must also have a broad knowledge of all the available technologies, which will allow them to set up a good work environment, essential for obtaining the desired results.

This professional figure has gradually taken shape over the last five years, becoming increasingly important in many areas of work.

    The analysis process

    However, regardless of the choices of the technologies or tools used, there are general lines on which the data analysis process is based. The data scientist, or anyone who needs to carry out data analysis, must be clear about the main steps into which the data processing technique is divided.

    Therefore, to better understand the aims of the pandas library, it is important to broadly know how a typical data analysis process develops. Once the structure of the process is clear, it will be easy to understand where and when to apply the different methods and functions that the pandas library offers.

    In Figure 1.3, the main steps that make up the data analysis process are shown:

    Figure 1.3: The various steps of the data analysis process

    As shown in Figure 1.3, there is a precise sequence of operations that make up the data analysis process. Let’s see every single step in detail, as follows:

    Data collection: Once you have assessed the nature of the necessary data (numbers, text, images, sounds, and so on) and the characteristics required to meet the objectives of our data analysis, we proceed with the collection. The data sources can be most varied. There are organizations that make a lot of data available online, already partially processed and organized, but the data are not always so easy to find. Sometimes, it is necessary to extract raw data such as fractions of text, the content of web pages, images, and audio files. And therefore, it will be important to prepare a whole series of procedures and technologies to extract the data from these heterogeneous sources to be collected and subsequently processed. Lately, with the advent of the Internet of Things (IoT), there is an increase in the data acquired and released by devices and sensors.

Data processing: The collected data, even if partly processed already, are still raw data. Data from different sources could have varied formats and will certainly be structured differently internally. It is, therefore, essential to make the data obtained as homogeneous as possible, converting them all into the same format, and then processing them to collect them together in a more suitable structured form. As with other data analysis technologies and relational databases, in Pandas the data will be collected by type in a tabular structure – a table formed by rows and columns, where the columns identify particular characteristics and the rows hold the collected samples.

Data cleaning: Even though the collected data have been processed and organized in a suitable structured form, such as a tabular data structure, they are not yet ready for analysis. In fact, the data will almost certainly be incomplete, with empty parts, or contain duplicates or errors. A data cleaning operation is, therefore, required, in which you must correct errors where possible and fill in the missing data with appropriate values. This phase is often the most underestimated, but it is also the one that can later lead to misleading conclusions.

Transforming data: Once the data have been cleaned, integrated, and corrected, they can finally be analyzed. It is in this phase that the actual data analysis is concentrated. Here, the analyst has at their disposal a multitude of techniques to manipulate structured data, acting both on the structure itself and by elaborating new values from those already present. Everything must be aimed at focusing the data, or at least preparing them, so that the useful information you are looking for can be extracted. At this point in the process, it may become clear to the analyst that the data obtained so far are insufficient or unsuitable. You will then have to go back to the previous steps of the process to search for new data sources, delete useless information, or recover previously deleted data.

Modeling data: The previous step is aimed at extracting information that must then be enclosed in a model. The mathematical, conceptual, or decision-making model will be based on hypotheses that can be evaluated here, thanks to the information extracted during the data analysis. The model can, therefore, be validated or refuted, or at least our knowledge of it increased, so as to decide whether to return to the previous steps of the analysis to deepen the model and extract new information, start a new analysis, or definitively close the process and make decisions directly.

    Data Visualization: The data modeling process often needs to be assisted with data visualization. The data are often reported in very large quantities, and it will be very difficult for the analyst to identify the characteristics (trends, clusters, various statistics, and other characteristics) that are hidden inside by reading only rows and columns. Even reporting and evaluating the results of statistical processing often runs the risk of losing the bulk of the information. The human being has a greater ability to extract information from graphic diagrams and, therefore, it is essential to combine the analysis process with good visualization of the data.
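The collection, cleaning, and transformation steps described above can be sketched with pandas; the tiny dataset and its column names are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Data collection: raw records gathered from some hypothetical source
raw = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Milan", "Rome"],
    "sales": [120.0, np.nan, 95.0, 110.0, 95.0],
})

# Data cleaning: fill the missing value with the column mean
# and drop the exact duplicate row
clean = raw.fillna({"sales": raw["sales"].mean()}).drop_duplicates()

# Transforming data: aggregate per city to extract summary information
summary = clean.groupby("city")["sales"].mean()
```

In a real project each step is far richer, but the sequence (collect, clean, transform, summarize) is exactly the one in Figure 1.3.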

    Tabular form of data

You saw in the previous section that data must be processed in order to be structured in tabular form. The structured data inside the pandas library also follow this particular way of ordering individual data. But why this data structure? In this section, we will answer this question by looking at the tabular format, spreadsheets, SQL tables, and, finally, Pandas DataFrames.

    Tabular form of data

From my point of view, the tabular format has always been the most used method to arrange and organize data. Whether for historical reasons or for a natural predisposition of human rationality to arrange data in the form of a table, we find the tabular format historically in the first scientific data collected by hand in previous centuries and in the ledgers of the first merchants and banks, just as today we find it in database tables storing terabytes of data. Even calendars, recipe ingredients, and other simple things in everyday life follow this structure.

    The tabular format is simply the information presented in the form of a table with rows and columns, where the data is divided and sorted following the classification logic established by the headings on the rows and columns.

But it is not just a historical question. The widespread use of this form of organizing data is mainly due to the characteristics that make it ideal for calculating, organizing, and manipulating large amounts of data. Tables are also easily readable and understandable, even for those with no analysis experience.

    It is no coincidence that most of the software used in offices makes use of tools for entering, calculating, and saving data in tabular form. A classic example is spreadsheets, of which the most popular of all is Microsoft Excel.

    Spreadsheets

    A spreadsheet is an application on a computer whose purpose is to organize, analyze, and store data in tabular form. Spreadsheets are nothing more than the digital evolution of paper worksheets. Accountants once collected all the data in large ledgers full of printouts, from which they extracted the accounts. Even the goods in a warehouse were recorded in paper registers made up of rows and columns, in which dates, costs, and various descriptions were reported.

    Such a tool has followed the evolution of the times, becoming an office software like spreadsheets, of which the most famous version is precisely Microsoft Excel.

    Spreadsheets are programs that operate on data entered in cells within different tables. Each cell can contain both numbers and text strings. With appropriate formulas, you can perform calculations between values contained in areas of selected cells and automatically show the result in another cell. By modifying the values in the operand cells, the value in the cell intended for the result will also be modified in real-time. Spreadsheets, compared to paper worksheets, in addition to organizing the data in tabular form, introduce a certain degree of interaction, in which the user is able to select the data visually, insert calculation formulas, and view the results in real-time.

    Figure 1.4 shows how Spreadsheets are the digital evolution of Paper Worksheets, adding new useful properties:

    Figure 1.4: Paper worksheets and spreadsheets

The leap forward was enormous, while the tabular structure was retained.
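The spreadsheet-formula behavior described above has a direct counterpart in pandas: a computed column re-evaluates over every row in one vectorized expression, much like a formula copied down a column. The column names and values below are invented for illustration:

```python
import pandas as pd

# A worksheet-like table of quantities and unit prices
df = pd.DataFrame({
    "quantity": [3, 5, 2],
    "unit_price": [10.0, 4.0, 25.0],
})

# Like entering =A2*B2 in a spreadsheet and copying it down the
# column: one expression computes the result for every row at once
df["total"] = df["quantity"] * df["unit_price"]
```

As in a spreadsheet, changing a value in `quantity` or `unit_price` and re-running the expression updates the corresponding `total`.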

    SQL tables from databases

    Another technology that has enormously marked the evolution of data management is relational databases. Spreadsheets share many features with databases, but they are not quite the same. A table extracted from an SQL query from a database can somewhat resemble a spreadsheet. Not surprisingly, spreadsheets can be used to import large amounts of data into a database and a table can be exported from the database in the form of a spreadsheet. But the similarities end there.

    Database tables are often the result of an SQL query, which is a series of SQL language statements that allow you to extract, select, and transform the data contained within a database, modifying its format and its original structure to get a data table. Therefore, while the spreadsheets introduce excellent calculation tools applicable to the various areas of the data table, the SQL tables are the result of a complex series of manipulations carried out on original data, based on selections of data of interest from a vast amount of data.

Figure 1.5 shows how SQL tables started from the characteristics present in paper worksheets and adapted them to relational databases, adding new and powerful properties:

Figure 1.5: Paper worksheets and SQL tables from databases
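To make the parallel concrete, here is how a typical SQL selection-and-aggregation query maps onto pandas operations; the table and its values are invented for illustration:

```python
import pandas as pd

# A table that might have been loaded from a database
orders = pd.DataFrame({
    "customer": ["Anna", "Luca", "Anna", "Sara"],
    "amount": [250, 40, 130, 90],
})

# Roughly equivalent to:
#   SELECT customer, SUM(amount) FROM orders
#   WHERE amount > 50 GROUP BY customer;
result = (
    orders[orders["amount"] > 50]      # WHERE
    .groupby("customer")["amount"]     # GROUP BY
    .sum()                             # SUM
)
```

The same select-filter-aggregate logic that SQL applies to database tables is applied here to an in-memory table.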

    Pandas and Dataframes

The idea of being able to treat data in memory as if it were an SQL table is incredibly captivating. Likewise, selectively interacting with data selections by applying formulas and calculations, as is done with spreadsheets, can make tabular data management powerful.

Well, the pandas library introduced the DataFrame as its structured data type, which makes all this possible. These objects, which are the basis of data analysis with Pandas, have an internal tabular structure and are the evolution of the two previous technologies. By combining the calculation features of spreadsheets with the way of working of the SQL language, they provide a new and powerful tool for data analysis.

    Furthermore, by exploiting the characteristics of the Python language discussed in the previous sections, we also add aspects of automation, flexibility, interaction, and ease of integration with other technologies, which the two previous technologies only partially cover.
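A minimal sketch of a DataFrame shows both inheritances side by side; the labels and values are invented for illustration:

```python
import pandas as pd

# A small DataFrame: labeled columns over labeled rows
df = pd.DataFrame(
    {"product": ["alpha", "beta", "gamma"], "value": [10, 25, 40]},
    index=["r1", "r2", "r3"],
)

# Spreadsheet-style calculation on a whole column...
doubled = df["value"] * 2

# ...and SQL-style row selection (like WHERE value > 20)
selected = df[df["value"] > 20]
```

One object supports both the formula-like column arithmetic of spreadsheets and the query-like row filtering of SQL tables.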

    Figure 1.6 shows how many fundamental properties of Spreadsheets and SQL Tables can be found in the DataFrame, the main data structures of the Pandas library:

    Figure 1.6: Pandas take the best of all technologies

By introducing the DataFrame, Pandas makes it possible to analyze data quickly and intuitively. We will return to the DataFrame in Chapter 3, (Day 1) Data Structures in the Pandas Library, deepening the characteristics of this type of object, and we will see how to create and manage DataFrames in the Python language with a series of practical examples.

    Conclusion

    In this chapter, we had a quick overview of some fundamental concepts that will be useful to fully understand the pandas library. The general framework that is deduced will help contextualize the various topics introduced in the following chapters. In fact, each chapter will cover a series of topics that can be included in the practice of data analysis in one or more steps of the process. It is, therefore, clear that it is important for the reader to keep in mind the general scheme of a typical process of data analysis.

Pandas is a Python library, and therefore, it is essential to have at least some knowledge of how it came about and why it was built with this programming language. That is why we discussed this language in a few lines, and how it has contributed enormously to the success and spread of this library.

Pandas is designed to be a powerful and efficient tool. Taking up the concepts of data in tabular form, of spreadsheets, and of databases with their SQL tables, we tried to highlight the characteristics that make it so. It is important
