Pandas AI
PANDAS
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating
data.
• The name "Pandas" refers to both "Panel Data" and "Python
Data Analysis". The library was created by Wes McKinney in 2008.
WHY USE PANDAS?
• Pandas allows us to analyze big data and draw conclusions based on statistical
theories.
• Pandas can clean messy data sets, and make them readable and relevant.
• Relevant data is very important in data science.
• Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the maximum value?
• What is the minimum value?
• Pandas can also delete rows that are not relevant or that contain wrong
values, such as empty or NULL values. This is called cleaning the data.
• The source code for Pandas is located in this GitHub
repository: https://github.com/pandas-dev/pandas
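The questions listed above can be sketched in a few lines. This is a minimal illustration, not a full analysis: the column names and values are made-up sample data, and the None entry stands in for an empty value.

```python
import pandas as pd

# Hypothetical sample data; the None entry plays the role of an empty value.
df = pd.DataFrame({
    "duration": [60, 45, 30, None],
    "calories": [400.0, 300.0, 200.0, 250.0],
})

# Cleaning: delete rows that contain empty (NaN) values.
cleaned = df.dropna()

# Correlation between two columns (Pearson is the default method).
corr = cleaned["duration"].corr(cleaned["calories"])

# Average, maximum, and minimum of one column.
avg_calories = cleaned["calories"].mean()
max_calories = cleaned["calories"].max()
min_calories = cleaned["calories"].min()

print(corr, avg_calories, max_calories, min_calories)
```

Here dropna() removes the row with the empty duration before any statistics are computed, so the results describe only the three complete rows.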
HOW TO START WITH PANDAS?
• import pandas as pd
print(pd.__version__)
SERIES -- LABELS
• If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.
• This label can be used to access a specified value.
• With the index argument, you can name your own labels.
• import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
• print(myvar["y"])
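For comparison, here is a small sketch of the default labels described at the start of this slide. It reuses the same list but leaves out the index argument, so the values should be labeled 0, 1, 2.

```python
import pandas as pd

a = [1, 7, 2]

# No index argument is given, so the labels default to 0, 1, 2.
myvar = pd.Series(a)

print(list(myvar.index))
print(myvar[0])  # access the first value through its default label
```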
KEY/VALUE OBJECTS AS SERIES
• You can also use a key/value object, like a dictionary, when creating a
Series.
• The keys of the dictionary become the labels.
• import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
KEY/VALUE OBJECTS AS SERIES (CONT.)
• To select only some of the items in the dictionary, use the index
argument and specify only the items you want to include in the Series.
• import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
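One detail worth knowing when selecting items this way: a label in the index argument that is not a key in the dictionary gets a missing value (NaN) instead of raising an error. The sketch below assumes a made-up label "day4" to show this.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
myvar = pd.Series(calories, index=["day1", "day4"])

print(myvar)
print(myvar.isna())  # True marks the label that had no matching key
```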
DATA FRAMES
• A DataFrame is like a table with rows and columns.
• Pandas uses the loc attribute to return one or more specified row(s).
• With the index argument, you can name your own indexes.
• import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
# Name the rows with the index argument (labels chosen to match the loc lookup)
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
• Use the named index in the loc attribute to return the specified row(s).
• print(df.loc["day2"])
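A minimal, self-contained sketch of the loc behavior described on this slide. The row labels day1 to day3 are an assumption inferred from the df.loc["day2"] call. It also shows that a single label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Row labels are assumed; they are chosen to match df.loc["day2"].
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

one_row = df.loc["day2"]             # a single label returns a Series
two_rows = df.loc[["day1", "day2"]]  # a list of labels returns a DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
```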
READ CSV FILES
• A simple way to store big data sets is to use CSV files (comma-separated
values files).
• CSV files contain plain text and are a well-known format that can be read
by everyone, including Pandas.
• import pandas as pd
df = pd.read_csv('data.csv')
print(df)
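To try read_csv without a file on disk, the sketch below feeds it an in-memory string through io.StringIO. The column names and values are invented sample data, not the contents of data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as data.csv.
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,117,282.4\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.columns.tolist())
```

The first line of the text is treated as the header row, so it becomes the column names rather than a data row.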
READ JSON
• Big data sets are often stored or extracted as JSON.
• JSON is plain text, but has the format of an object, and
is well known in the world of programming, including
Pandas.
• import pandas as pd
df = pd.read_json('data.json')
print(df)
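The same idea works for JSON. The sketch below assumes a small, made-up JSON object in which each top-level key is a column and each nested key is a row label, which matches the default layout read_json expects for a DataFrame.

```python
import io

import pandas as pd

# Hypothetical JSON content; in practice this would live in a file such as data.json.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'

# read_json accepts a file path or a file-like object.
df = pd.read_json(io.StringIO(json_text))

print(df.shape)
print(df.columns.tolist())
```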