Family Main

The document analyzes a dataset containing family income and spending information to answer several questions, including identifying the highest and lowest earning families, determining if any families have inadequate income to cover spending, and checking for errors in the data.

Uploaded by

hamburgerhenry13

2. Family Dataset
**Q1. Which family boasts the highest annual income, and which has the lowest? How do you ascertain this?**

In [ ]: import pandas as pd
import numpy as np

In [ ]: family_df = pd.read_csv('family_data.csv')
print(family_df.head())

    Family  Member   Income    Spend
0  family1  Adult1  2376330  1119433
1  family1  Adult2   130268    37337
2  family1  Adult3  2254489   972327
3  family2  Adult1  2292355   649806
4  family2  Adult2   298167   100723

In [ ]: # print the column names, non-null counts, and dtypes
print(family_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Family  279 non-null    object
 1   Member  279 non-null    object
 2   Income  279 non-null    int64
 3   Spend   279 non-null    int64
dtypes: int64(2), object(2)
memory usage: 8.8+ KB
None

In [ ]: print(family_df.describe())

             Income         Spend
count  2.790000e+02  2.790000e+02
mean   9.477808e+05  3.344265e+05
std    1.001295e+06  3.760808e+05
min    0.000000e+00  1.391000e+03
25%    0.000000e+00  1.550700e+04
50%    5.458300e+05  1.744480e+05
75%    1.808509e+06  5.432650e+05
max    2.979034e+06  1.475664e+06

In [ ]: # derive a numeric id from labels such as "family1"
family_df["Family id"] = family_df["Family"].apply(lambda x: int(x.replace("family", "")))

In [ ]: # get the total income for each family
total_income = family_df.groupby('Family id')['Income'].sum().sort_index()

print(total_income, '\n')
print(f"The family with the max income: {total_income.idxmax()} with {total_income.max()}")
print(f"The family with the min income: {total_income.idxmin()} with {total_income.min()}")

Family id
1      4761087
2      2939887
3      2301931
4      2896133
5      1428679
        ...
96      325062
97     2663794
98     3018609
99     1827150
100    1031646
Name: Income, Length: 100, dtype: int64

The family with the max income: 6 with 7804425
The family with the min income: 94 with 46790

We group the rows by family with the groupby method and add up each family's income with
the sum method. The idxmax and idxmin methods then return the ids of the families with the
highest and lowest total annual income.

As shown above, family 6 has the highest annual income, 7,804,425 dollars, and family 94
has the lowest, 46,790 dollars.
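
One caveat with this approach is that idxmax and idxmin each return a single label, so a tie between families would go unnoticed. The sketch below shows one possible way to list every tied family. It runs on made-up toy data, not on family_data.csv, and the names `toy`, `totals`, `top_families` and `bottom_families` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from family_data.csv).
toy = pd.DataFrame({
    "Family": ["family1", "family1", "family2", "family3", "family3"],
    "Member": ["Adult1", "Child1", "Adult1", "Adult1", "Adult2"],
    "Income": [500, 0, 300, 250, 250],
})

# Same steps as in the notebook: numeric id, then total income per family.
toy["Family id"] = toy["Family"].str.replace("family", "", regex=False).astype(int)
totals = toy.groupby("Family id")["Income"].sum()

# idxmax()/idxmin() return one label each; a boolean mask keeps every tied family.
top_families = totals[totals == totals.max()].index.tolist()
bottom_families = totals[totals == totals.min()].index.tolist()

print("Highest total income:", top_families, totals.max())
print("Lowest total income:", bottom_families, totals.min())
```

In the toy data, families 1 and 3 share the highest total, so the mask returns both while idxmax would report only family 1.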

**Q2. Which families do not possess adequate annual income to cover all members' spending? What is the maximum shortfall? How do you determine this?**

In [ ]: total_spend = family_df.groupby('Family id')['Spend'].sum().sort_index()

family_income_spend = pd.concat([total_income, total_spend], axis=1)
family_income_spend["Surplus"] = family_income_spend["Income"] - family_income_spend["Spend"]
# show the family (or families) with the smallest surplus
print(family_income_spend[family_income_spend["Surplus"] == family_income_spend["Surplus"].min()])

           Income  Spend  Surplus
Family id
94          46790  30029    16761

In [ ]: families_deficit = family_income_spend[family_income_spend["Surplus"] < 0]
print(families_deficit, '\n')
# print(f"The family with the max deficit: {families_deficit['Surplus'].idxmin()} w

Empty DataFrame
Columns: [Income, Spend, Surplus]
Index: []

In [ ]: print(family_income_spend)

            Income    Spend  Surplus
Family id
1          4761087  2129097  2631990
2          2939887   890424  2049463
3          2301931   807835  1494096
4          2896133  1128708  1767425
5          1428679   501827   926852
...            ...      ...      ...
96          325062   135954   189108
97         2663794   774694  1889100
98         3018609  1031955  1986654
99         1827150   493578  1333572
100        1031646   258414   773232

[100 rows x 3 columns]

From above, all families have adequate annual income to cover all members' spending, so
there is no shortfall to report. The family with the smallest surplus is family 94, with
16,761 dollars. These facts follow from the Surplus column (total income minus total spend
per family): filtering it for negative values returns an empty DataFrame.
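
Because the deficit table can be empty, as it is here, any code that reports the maximum shortfall has to handle that case explicitly (calling idxmin on an empty Series raises an error). The following is a minimal sketch on made-up toy data; the helper `max_shortfall` is a hypothetical name introduced for illustration and is not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values); family 2 spends more than it earns.
toy = pd.DataFrame({
    "Family id": [1, 1, 2, 2, 3],
    "Income": [900, 0, 400, 0, 700],
    "Spend": [300, 100, 350, 150, 200],
})

# Sum both columns per family in one groupby, then derive the surplus.
per_family = toy.groupby("Family id")[["Income", "Spend"]].sum()
per_family["Surplus"] = per_family["Income"] - per_family["Spend"]

deficits = per_family[per_family["Surplus"] < 0]


def max_shortfall(deficit_rows):
    """Return (family id, shortfall) for the largest deficit, or None if there is none."""
    if deficit_rows.empty:
        return None
    worst = deficit_rows["Surplus"].idxmin()
    return int(worst), int(-deficit_rows.loc[worst, "Surplus"])


print(max_shortfall(deficits))
# An empty selection mirrors the situation in the real dataset.
print(max_shortfall(per_family.iloc[0:0]))
```

Returning None for the empty case keeps "no family is in deficit" distinct from "the largest deficit is zero".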

**Q3. Are there any single-parent families, where only one Adult is present? Are there any childless families? How do you discern this?**

Using groupby together with apply, we count, for each family, how many entries in the
Member column contain the text Adult and how many contain the text Child. A family with
exactly one Adult is treated as a single-parent family, and a family with no Child is
treated as childless.

As shown below, there are 40 single-parent families and 35 childless families in the dataset.

In [ ]: # count the families with exactly one adult
adult_counts = family_df.groupby('Family id')["Member"].apply(lambda x: x.str.contains("Adult").sum())
print("Counts of single-parent families:", adult_counts[adult_counts == 1].count())

# count the families with no children
childless_counts = family_df.groupby('Family id')["Member"].apply(lambda x: x.str.contains("Child").sum())
print("Counts of childless families:", childless_counts[childless_counts == 0].count())

Counts of single-parent families: 40
Counts of childless families: 35
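
The two counts above are computed independently, so a family with one adult and no children is counted in both groups. The sketch below shows one possible way to tabulate adults and children per family in a single step and to separate those cases. It uses made-up toy data and assumes that Member labels look like "Adult1" or "Child2"; only the Adult form is visible in the preview of the real file.

```python
import pandas as pd

# Toy data (hypothetical); assumes Member labels look like "Adult1" or "Child2".
toy = pd.DataFrame({
    "Family id": [1, 1, 1, 2, 3, 3, 4, 4],
    "Member": ["Adult1", "Adult2", "Child1", "Adult1",
               "Adult1", "Child1", "Adult1", "Adult2"],
})

# Extract the role prefix ("Adult" or "Child"), then count roles per family.
role = toy["Member"].str.extract(r"^(Adult|Child)", expand=False)
counts = pd.crosstab(toy["Family id"], role).reindex(
    columns=["Adult", "Child"], fill_value=0
)

one_adult = counts["Adult"] == 1
no_child = counts["Child"] == 0

print("Families with exactly one adult:", int(one_adult.sum()))
print("Childless families:", int(no_child.sum()))
print("One adult and at least one child:", int((one_adult & ~no_child).sum()))
```

The reindex call guarantees that both columns exist even if the data contains no children at all.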

**Q4. Do you suspect any errors within this dataset? Examples may include negative figures, missing or duplicate data, etc. Why?**

To check the dataset for errors, we use the describe method to inspect its basic
statistics, the isnull method to count missing values, and the duplicated method to count
duplicate rows. Negative figures are counted by comparing the numeric columns with zero.
The results below show no negative figures, missing values, or duplicate rows in the
dataset.

In [ ]: summary = family_df.describe()

# Check for missing values
missing_values = family_df.isnull().sum()

# Check for duplicated rows
duplicated_rows = family_df.duplicated().sum()

# Check for negative values
numeric_cols = family_df.select_dtypes(include=['number'])
negative_values = (numeric_cols < 0).sum()

print("Summary Statistics:")
print(summary)
print("\nMissing Values:")
print(missing_values)
print("\nDuplicated Rows:")
print(duplicated_rows)
print("\nNegative Values:")
print(negative_values)

Summary Statistics:
             Income         Spend   Family id
count  2.790000e+02  2.790000e+02  279.000000
mean   9.477808e+05  3.344265e+05   47.906810
std    1.001295e+06  3.760808e+05   28.701739
min    0.000000e+00  1.391000e+03    1.000000
25%    0.000000e+00  1.550700e+04   22.000000
50%    5.458300e+05  1.744480e+05   47.000000
75%    1.808509e+06  5.432650e+05   71.500000
max    2.979034e+06  1.475664e+06  100.000000

Missing Values:
Family 0
Member 0
Income 0
Spend 0
Family id 0
dtype: int64

Duplicated Rows:
0

Negative Values:
Income 0
Spend 0
Family id 0
dtype: int64
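
The duplicated check above compares entire rows, so a member listed twice with different figures would not be flagged. As a hedged sketch of two further checks that could be considered, the example below looks for repeated (Family, Member) keys and for adults with zero income. It runs on made-up toy data, and the decision to treat these cases as suspicious is an assumption, not a finding about family_data.csv.

```python
import pandas as pd

# Toy data (hypothetical); the two "Adult1" rows of family1 differ only in Spend.
toy = pd.DataFrame({
    "Family": ["family1", "family1", "family1", "family2"],
    "Member": ["Adult1", "Adult1", "Child1", "Adult1"],
    "Income": [1000, 1000, 0, 0],
    "Spend": [400, 450, 50, 200],
})

# Full-row duplicates: none here, because the Spend values differ.
full_row_dupes = int(toy.duplicated().sum())

# Key duplicates: the same member label listed more than once within one family.
key_dupes = toy[toy.duplicated(subset=["Family", "Member"], keep=False)]

# Adults with zero income are not necessarily wrong, but may deserve a look.
zero_income_adults = toy[toy["Member"].str.startswith("Adult") & (toy["Income"] == 0)]

print("Full-row duplicates:", full_row_dupes)
print("Rows sharing a (Family, Member) key:", len(key_dupes))
print("Adults with zero income:", len(zero_income_adults))
```

Passing keep=False marks every row of a repeated key, which makes the conflicting entries easy to compare side by side.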

**Q5. Can ChatGPT or Bing assist with the aforementioned four questions? If so, to what extent? How do you issue commands to the AI tool? If not, why not?**

Under their current limitations, ChatGPT and Bing cannot fully answer these questions,
but they can help with the usage of pandas functions. Below are examples of prompts used
during the process:

ChatGPT: "How do I get the count of a column in each group satisfying a specific
condition in Pandas?"

ChatGPT: "How do I get the sum of a column in each group in Pandas?"

ChatGPT: "How do I get the maximum value of a column in each group in Pandas?"
