Family Main
Family Dataset
Q1. Which family boasts the highest annual income, and which has the lowest?
How do you ascertain this?
In [ ]: import pandas as pd
import numpy as np
In [ ]: family_df = pd.read_csv('family_data.csv')
print(family_df.head())
In [ ]: print(family_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Family 279 non-null object
1 Member 279 non-null object
2 Income 279 non-null int64
3 Spend 279 non-null int64
dtypes: int64(2), object(2)
memory usage: 8.8+ KB
None
In [ ]: print(family_df.describe())
Income Spend
count 2.790000e+02 2.790000e+02
mean 9.477808e+05 3.344265e+05
std 1.001295e+06 3.760808e+05
min 0.000000e+00 1.391000e+03
25% 0.000000e+00 1.550700e+04
50% 5.458300e+05 1.744480e+05
75% 1.808509e+06 5.432650e+05
max 2.979034e+06 1.475664e+06
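In [ ]: # Assumes the 'Family id' column seen in the output has already been derived.
total_income = family_df.groupby('Family id')['Income'].sum()
print(total_income, '\n')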
print(f"The family with the max income: {total_income.idxmax()} with {total_income.
print(f"The family with the min income: {total_income.idxmin()} with {total_income.
Family id
1 4761087
2 2939887
3 2301931
4 2896133
5 1428679
...
96 325062
97 2663794
98 3018609
99 1827150
100 1031646
Name: Income, Length: 100, dtype: int64
We use the groupby method to group the data by family, then the sum method to total
the annual income of each family. Finally, the idxmax and idxmin methods identify the
families with the highest and lowest annual income.
As shown above, the family with the highest annual income is family 6 with 7,804,425 dollars;
the family with the lowest annual income is family 94 with 46,790 dollars.
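The groupby, sum, idxmax and idxmin steps described above can be sketched as follows. This is only a minimal illustration on an invented toy frame, not the actual family_data.csv; the column names `Family id` and `Income` follow the printed output, while `toy_df`, `richest` and `poorest` are hypothetical names.

```python
import pandas as pd

# Toy data with invented values; the notebook itself reads family_data.csv.
toy_df = pd.DataFrame({
    "Family id": [1, 1, 2, 2, 3],
    "Member": ["Adult", "Child", "Adult", "Adult", "Adult"],
    "Income": [500, 0, 300, 400, 100],
})

# Sum the members' incomes within each family to get the family total.
total_income = toy_df.groupby("Family id")["Income"].sum()

# idxmax/idxmin return the index label (the family id);
# max/min return the corresponding amounts.
richest = total_income.idxmax()
poorest = total_income.idxmin()

print(f"Max income: family {richest} with {total_income.max()}")
print(f"Min income: family {poorest} with {total_income.min()}")
```

Because the result of the groupby is a Series indexed by family, idxmax and idxmin give the family identifier directly, without a separate lookup.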
Q2. [...] cover all members' spending? What is the maximum shortfall? How do you determine this?
Empty DataFrame
Columns: [Income, Spend, Surplus]
Index: []
In [ ]: print(family_income_spend)
Income Spend Surplus
Family id
1 4761087 2129097 2631990
2 2939887 890424 2049463
3 2301931 807835 1494096
4 2896133 1128708 1767425
5 1428679 501827 926852
... ... ... ...
96 325062 135954 189108
97 2663794 774694 1889100
98 3018609 1031955 1986654
99 1827150 493578 1333572
100 1031646 258414 773232
As shown above, every family has enough annual income to cover all members' spending, so
there is no shortfall. The family with the smallest surplus is family 94, with 16,761 dollars.
These facts are ascertained from the results of the describe method.
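One way to arrive at a table like family_income_spend and at the checks above is sketched here. It is an assumption about the approach rather than the original cell, and it runs on invented toy values; `toy_df`, `shortfall`, `smallest_family` and `smallest_surplus` are hypothetical names.

```python
import pandas as pd

# Toy data with invented values, using the column names from the output above.
toy_df = pd.DataFrame({
    "Family id": [1, 1, 2, 2, 3],
    "Income": [500, 0, 300, 400, 100],
    "Spend": [200, 50, 100, 150, 90],
})

# Total income and total spending per family.
family_income_spend = toy_df.groupby("Family id")[["Income", "Spend"]].sum()

# Surplus is what remains after spending; a negative value is a shortfall.
family_income_spend["Surplus"] = (
    family_income_spend["Income"] - family_income_spend["Spend"]
)

# Families that cannot cover their spending (empty if every family can).
shortfall = family_income_spend[family_income_spend["Surplus"] < 0]

# The family with the smallest surplus and the amount.
smallest_family = family_income_spend["Surplus"].idxmin()
smallest_surplus = family_income_spend["Surplus"].min()

print(shortfall)
print(f"Smallest surplus: family {smallest_family} with {smallest_surplus}")
```

An empty filter result means that no family has a negative surplus, and idxmin names the family closest to a shortfall.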
Q3. Are there any single-parent families, where only one Adult is present? Are
there any childless families? How do you discern this?
Using the groupby method together with conditions, we count the occurrences of the values
Adult and Child in the Member column for each family. If the number of Adult is 1, then it is a
single-parent family. If the number of Child is 0, then it is a childless family.
There are 40 single-parent families and 35 childless families in the dataset.
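The counting rule above can be sketched as follows. This is a minimal illustration on invented toy data, and it assumes the Member column holds exactly the labels "Adult" and "Child"; `toy_df`, `adults`, `children`, `n_single_parent` and `n_childless` are hypothetical names.

```python
import pandas as pd

# Toy data with invented values; Member is assumed to be "Adult" or "Child".
toy_df = pd.DataFrame({
    "Family id": [1, 1, 2, 2, 3, 3, 3],
    "Member": ["Adult", "Child", "Adult", "Adult", "Adult", "Adult", "Child"],
})

# Boolean masks mark the rows matching each condition; summing a mask
# within each family counts the matching members.
adults = toy_df["Member"].eq("Adult").groupby(toy_df["Family id"]).sum()
children = toy_df["Member"].eq("Child").groupby(toy_df["Family id"]).sum()

# Apply the definitions from the text: exactly one Adult, or zero Child.
n_single_parent = int((adults == 1).sum())
n_childless = int((children == 0).sum())

print(f"Single-parent families: {n_single_parent}")
print(f"Childless families: {n_childless}")
```

Under these definitions a family with one adult and no children would count as both single-parent and childless; combining the two conditions, for example `(adults == 1) & (children > 0)`, is one possible stricter variant.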
To check the accuracy of the dataset, we can use the describe method to inspect its basic
statistics, the isnull method to check for missing data, and the duplicated method to
check for duplicate rows.
As the results below show, the dataset contains no negative figures, missing values, or
duplicate rows.
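In [ ]: summary = family_df.describe()
# Count missing values per column, fully duplicated rows, and
# negative entries in the numeric columns.
missing_values = family_df.isnull().sum()
duplicated_rows = family_df.duplicated().sum()
negative_values = (family_df.select_dtypes(include='number') < 0).sum()
print("Summary Statistics:")
print(summary)
print("\nMissing Values:")
print(missing_values)
print("\nDuplicated Rows:")
print(duplicated_rows)
print("\nNegative Values:")
print(negative_values)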
Summary Statistics:
Income Spend Family id
count 2.790000e+02 2.790000e+02 279.000000
mean 9.477808e+05 3.344265e+05 47.906810
std 1.001295e+06 3.760808e+05 28.701739
min 0.000000e+00 1.391000e+03 1.000000
25% 0.000000e+00 1.550700e+04 22.000000
50% 5.458300e+05 1.744480e+05 47.000000
75% 1.808509e+06 5.432650e+05 71.500000
max 2.979034e+06 1.475664e+06 100.000000
Missing Values:
Family 0
Member 0
Income 0
Spend 0
Family id 0
dtype: int64
Duplicated Rows:
0
Negative Values:
Income 0
Spend 0
Family id 0
dtype: int64
Under the current limitations, ChatGPT and Bind cannot fully answer the questions, but
they can provide useful assistance with the usage of pandas functions. Below are
examples of prompts used during the process:
ChatGPT: "How do I get the count of a column in each group satisfying a specific
condition in Pandas?"
ChatGPT: "How do I get the maximum value of a column in each group in Pandas?"