Here is a simplified Python script for the given problem. It assumes you have a dataset
(e.g., a CSV file) with student names and their scores in several subjects (e.g., Math, Science, English).
### Python Code
```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset into a Pandas DataFrame
# Replace 'your_file.csv' with the path to your dataset
df = pd.read_csv('your_file.csv')
# Handle missing numeric values by replacing them with the mean of the respective column
df = df.fillna(df.mean(numeric_only=True))
# Calculate the average score for each student
df['Average_Score'] = df.iloc[:, 1:].mean(axis=1) # Assuming the first column is student names
# Categorize students into performance levels
def categorize_performance(avg_score):
    if avg_score >= 80:
        return 'High'
    elif avg_score >= 50:
        return 'Medium'
    else:
        return 'Low'
df['Performance_Category'] = df['Average_Score'].apply(categorize_performance)
# Identify the subject with the highest average score across students
subject_avg_scores = df.iloc[:, 1:-2].mean()
highest_avg_subject = subject_avg_scores.idxmax()
# Determine the number of students in each performance category
category_counts = df['Performance_Category'].value_counts()
# Visualization: Bar chart for average score per subject
subject_avg_scores.plot(kind='bar', title='Average Score Per Subject', ylabel='Average Score',
xlabel='Subjects', color='skyblue')
plt.show()
# Visualization: Pie chart for performance category distribution
category_counts.plot(kind='pie', autopct='%1.1f%%', title='Performance Category Distribution',
ylabel='')
plt.show()
```
### Explanation
1. **Data Loading and Cleaning:**
- Loads a CSV file into a Pandas DataFrame.
- Handles missing values by replacing them with the column mean.
2. **Data Manipulation:**
- Calculates the average score for each student.
- Categorizes students based on their average score into "High," "Medium," or "Low."
3. **Analysis:**
- Finds the subject with the highest average score across all students.
- Counts the number of students in each performance category.
4. **Visualization:**
- Creates a bar chart showing the average scores for each subject.
- Creates a pie chart showing the percentage of students in each performance category.
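The averaging and categorization steps above can be sketched on a tiny inline DataFrame; the names and scores below are made up purely for illustration:

```python
import pandas as pd

# Small illustrative dataset (all values are made up)
df = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chen'],
    'Math': [90, 55, 40],
    'Science': [85, 60, 30],
    'English': [80, 50, 45],
})

# Row-wise mean over the score columns
df['Average_Score'] = df[['Math', 'Science', 'English']].mean(axis=1)

def categorize_performance(avg_score):
    if avg_score >= 80:
        return 'High'
    elif avg_score >= 50:
        return 'Medium'
    else:
        return 'Low'

df['Performance_Category'] = df['Average_Score'].apply(categorize_performance)
print(df[['Name', 'Average_Score', 'Performance_Category']])
```

Here Asha averages 85 ("High"), Ben 55 ("Medium"), and Chen about 38.3 ("Low"), which exercises all three branches of the categorizer.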
### Instructions to Run
1. Upload your dataset (e.g., `your_file.csv`) to Google Colab.
2. Replace `'your_file.csv'` in the code with the actual file path.
3. Run the code cells step-by-step in Google Colab.
### Sample Output (Assuming Example Dataset)
**Bar Chart:**
Displays a bar chart with average scores for Math, Science, and English.
**Pie Chart:**
Shows a pie chart with categories like "High" (30%), "Medium" (50%), and "Low" (20%).
**Console Output:**
- Subject with the highest average score: `Science`
- Performance category counts:
```
Medium 5
High 3
Low 2
Name: Performance_Category, dtype: int64
```
Here’s a concise Python script that you can run in Google Colab to analyze a COVID-19 dataset as
described in the question.
### Python Code
```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
# Replace 'covid_data.csv' with the path to your dataset
df = pd.read_csv('covid_data.csv')
# Handle missing values and duplicates
df.fillna(0, inplace=True)
df.drop_duplicates(inplace=True)
# Parse 'Date' and extract Year, Month, and Day
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
# Add a new column for daily new cases (computed per country, in date order,
# so one country's totals are never subtracted from another's)
df = df.sort_values(['Country', 'Date'])
df['New_Cases'] = df.groupby('Country')['Total_Cases'].diff().fillna(0)
# Calculate total cases and deaths globally (latest cumulative total per country)
total_cases = df.groupby('Country')['Total_Cases'].max().sum()
total_deaths = df.groupby('Country')['Total_Deaths'].max().sum()
# Identify the country with the highest number of cases and deaths
country_cases = df.groupby('Country')['Total_Cases'].max()
country_deaths = df.groupby('Country')['Total_Deaths'].max()
highest_cases_country = country_cases.idxmax()
highest_deaths_country = country_deaths.idxmax()
# Filter the data to the last 30 days for a closer look at recent trends
last_30_days = df[df['Date'] >= (df['Date'].max() - pd.Timedelta(days=30))]
# Visualization: Line chart for total cases trend
df.groupby('Date')['Total_Cases'].sum().plot(kind='line', title='Trend of Total COVID-19 Cases Over Time',
                                             ylabel='Total Cases', xlabel='Date')
plt.show()
# Bar chart for top 5 countries with the highest cases
top_5_countries = country_cases.nlargest(5)
top_5_countries.plot(kind='bar', title='Top 5 Countries with Highest Cases', ylabel='Total Cases',
xlabel='Countries', color='orange')
plt.show()
# Pie chart for proportion of cases in continents
continent_cases = df.groupby('Continent')['Total_Cases'].sum()
continent_cases.plot(kind='pie', autopct='%1.1f%%', title='Proportion of Cases by Continent', ylabel='')
plt.show()
# Print key results (with thousands separators for readability)
print(f"Total cases globally: {total_cases:,}")
print(f"Total deaths globally: {total_deaths:,}")
print(f"Country with highest cases: {highest_cases_country}")
print(f"Country with highest deaths: {highest_deaths_country}")
```
---
### Explanation
1. **Data Loading and Cleaning:**
- The dataset is loaded into a DataFrame, missing values are replaced with 0, and duplicates are
dropped.
2. **Data Manipulation:**
- Calculates daily new cases (`New_Cases`).
- Extracts `Year`, `Month`, and `Day` from the `Date` column for analysis.
3. **Analysis:**
- Computes total global cases and deaths.
- Identifies the countries with the highest cases and deaths.
- Filters data for the last 30 days to analyze trends.
4. **Visualization:**
- **Line Chart:** Shows the trend of total cases over time.
- **Bar Chart:** Displays the top 5 countries with the highest cases.
- **Pie Chart:** Shows the proportion of cases by continent.
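The `New_Cases` step deserves care when the file contains several countries: a plain `diff()` would subtract the last row of one country from the first row of the next. A minimal sketch of the per-country version, using a tiny made-up frame:

```python
import pandas as pd

# Tiny illustrative frame with two countries (all values are made up)
df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'] * 2),
    'Total_Cases': [10, 15, 25, 100, 120, 150],
})

# Sort so diff() runs in date order within each country
df = df.sort_values(['Country', 'Date'])
# groupby('Country') keeps each country's series separate;
# the first day of each country has no prior value, hence fillna(0)
df['New_Cases'] = df.groupby('Country')['Total_Cases'].diff().fillna(0)
print(df)
```

Country B's first row gets 0 rather than `100 - 25 = 75`, which is the whole point of grouping before differencing.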
---
### Instructions to Run
1. Upload your dataset (e.g., `covid_data.csv`) to Google Colab.
2. Replace `'covid_data.csv'` in the code with the file name.
3. Run each code cell step-by-step to load, analyze, and visualize the data.
---
### Sample Output (Assuming Example Dataset)
**Console Output:**
```
Total cases globally: 500,000,000
Total deaths globally: 5,000,000
Country with highest cases: USA
Country with highest deaths: Brazil
```
**Visualizations:**
1. Line chart showing the rising trend of total cases globally.
2. Bar chart highlighting the top 5 countries with the highest total cases.
3. Pie chart dividing the proportion of cases by continent.
Here’s a simple Python script that you can run in Google Colab to analyze a sales dataset as described
in the question.
### Python Code
```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
# Replace 'sales_data.csv' with the path to your dataset
df = pd.read_csv('sales_data.csv')
# Handle missing values and duplicates
df.fillna(0, inplace=True)
df.drop_duplicates(inplace=True)
# Add a new column for total revenue
df['Total_Revenue'] = df['Quantity'] * df['Price']
# Group by product category to calculate total revenue and number of items sold
category_summary = df.groupby('Product_Category').agg(
    Total_Revenue=('Total_Revenue', 'sum'),
    Total_Quantity=('Quantity', 'sum'),
)
# Identify the top 3 products generating the highest revenue
top_products = df.groupby('Product').agg(Total_Revenue=('Total_Revenue', 'sum')).nlargest(3, 'Total_Revenue')
# Determine the month with the highest total sales
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month').agg(Total_Revenue=('Total_Revenue', 'sum'))
highest_sales_month = monthly_sales['Total_Revenue'].idxmax()
# Visualization: Bar chart for total revenue by product category
category_summary['Total_Revenue'].plot(kind='bar', title='Total Revenue by Product Category',
ylabel='Total Revenue', xlabel='Product Category', color='green')
plt.show()
# Visualization: Line graph for monthly sales trends
monthly_sales.plot(kind='line', title='Monthly Sales Trends', ylabel='Total Revenue', xlabel='Month',
marker='o', color='blue')
plt.show()
# Print key results
print("Top 3 products generating highest revenue:")
print(top_products)
print(f"Month with highest total sales: {highest_sales_month}")
```
---
### Explanation
1. **Data Loading and Cleaning:**
- Loads the sales dataset into a Pandas DataFrame.
- Handles missing values by replacing them with 0 and removes duplicate entries.
2. **Data Manipulation:**
- Calculates `Total_Revenue` for each transaction as `Quantity × Price`.
- Groups the data by `Product_Category` to calculate total revenue and number of items sold.
3. **Analysis:**
- Identifies the top 3 products generating the highest revenue.
- Determines the month with the highest total sales.
4. **Visualization:**
- **Bar Chart:** Displays total revenue by product category.
- **Line Graph:** Shows monthly sales trends.
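The grouped summary relies on pandas named aggregation (keyword arguments to `.agg()` of the form `new_name=(column, function)`). On a tiny made-up frame it looks like this:

```python
import pandas as pd

# Illustrative transactions (all values are made up)
df = pd.DataFrame({
    'Product_Category': ['Electronics', 'Electronics', 'Furniture'],
    'Quantity': [2, 1, 4],
    'Price': [300.0, 500.0, 50.0],
})
df['Total_Revenue'] = df['Quantity'] * df['Price']

# Named aggregation: output column name = (input column, aggregation)
category_summary = df.groupby('Product_Category').agg(
    Total_Revenue=('Total_Revenue', 'sum'),
    Total_Quantity=('Quantity', 'sum'),
)
print(category_summary)
```

Electronics ends up with revenue 1100.0 across 3 items, Furniture with 200.0 across 4, and the output columns carry the names you chose rather than pandas defaults.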
---
### Instructions to Run
1. Upload your dataset (e.g., `sales_data.csv`) to Google Colab.
2. Replace `'sales_data.csv'` in the code with your dataset's filename.
3. Run each code cell step-by-step to analyze and visualize the data.
---
### Sample Output (Assuming Example Dataset)
**Console Output:**
```
Top 3 products generating highest revenue:
Total_Revenue
Product
Product_A 100000.00
Product_B 80000.00
Product_C 75000.00
Month with highest total sales: 2024-05
```
**Visualizations:**
1. **Bar Chart:** Shows total revenue for categories like "Electronics," "Furniture," etc.
2. **Line Graph:** Displays sales trends over months with peaks and valleys.
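The monthly grouping behind the "month with highest total sales" result uses `dt.to_period('M')`, which collapses every date to its calendar month. A quick made-up example:

```python
import pandas as pd

# Three illustrative transactions across two months (values are made up)
df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-04-10', '2024-05-02', '2024-05-20']),
    'Total_Revenue': [100.0, 400.0, 300.0],
})

# to_period('M') maps each date to its month, e.g. 2024-05-02 -> 2024-05
df['Month'] = df['Date'].dt.to_period('M')
monthly_sales = df.groupby('Month')['Total_Revenue'].sum()
print(monthly_sales.idxmax())  # the Period for May, printed as 2024-05
```

May totals 700.0 against April's 100.0, so `idxmax()` returns the May `Period`, matching the `2024-05`-style output shown above.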
Below is a Python code template for the tourism data analysis problem described. You'll need a
tourism dataset in CSV format to run it. The code includes the required steps, explanations, and
instructions to execute it in Google Colab.
### Code
```python
# Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
# Step 2: Load the Dataset
# Running this cell prompts you to upload your CSV (e.g., tourism_data.csv)
from google.colab import files
uploaded = files.upload()  # Upload the dataset
data = pd.read_csv(list(uploaded.keys())[0])
# Step 3: Data Cleaning
data.drop_duplicates(inplace=True) # Remove duplicate rows
data.dropna(inplace=True) # Drop rows with missing values
# Step 4: Data Manipulation
# Add Total Visitors column
data['Total_Visitors'] = data['Domestic_Visitors'] + data['International_Visitors']
# Extract year and month from the 'Date' column
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
# Step 5: Analysis
# Identify the record (date) with the highest total visitors
highest_month = data.loc[data['Total_Visitors'].idxmax()]
# Calculate the average number of visitors per year
average_visitors_per_year = data.groupby('Year')['Total_Visitors'].mean()
# Proportion of domestic vs international visitors by year
proportion = data.groupby('Year')[['Domestic_Visitors', 'International_Visitors']].sum()
proportion['Domestic_Proportion'] = proportion['Domestic_Visitors'] / (
    proportion['Domestic_Visitors'] + proportion['International_Visitors'])
proportion['International_Proportion'] = proportion['International_Visitors'] / (
    proportion['Domestic_Visitors'] + proportion['International_Visitors'])
# Step 6: Visualization
# Bar Chart - Total Visitors per Month
monthly_totals = data.groupby('Month')['Total_Visitors'].sum()
monthly_totals.plot(kind='bar', title='Total Visitors Per Month', ylabel='Visitors', xlabel='Month')
plt.show()
# Pie Chart - Proportion of Domestic vs International Visitors
latest_year = data['Year'].max()
latest_data = proportion.loc[latest_year]
latest_data[['Domestic_Proportion', 'International_Proportion']].plot(kind='pie', autopct='%1.1f%%',
title=f'Domestic vs International Visitors ({latest_year})', ylabel='')
plt.show()
# Line Graph - Trend of Total Visitors Over the Years
yearly_totals = data.groupby('Year')['Total_Visitors'].sum()
yearly_totals.plot(kind='line', title='Total Visitors Over the Years', ylabel='Visitors', xlabel='Year')
plt.show()
# Step 7: Output Results
print("Month with Highest Total Visitors:")
print(highest_month)
print("\nAverage Visitors Per Year:")
print(average_visitors_per_year)
print("\nProportion of Domestic vs International Visitors by Year:")
print(proportion)
```
---
### Instructions to Run in Google Colab
1. **Upload the Dataset**: When you run the `files.upload()` block, it will prompt you to upload
your file (e.g., `tourism_data.csv`); the code then reads whichever file you upload.
2. **Install Required Libraries**: Google Colab already includes `pandas` and `matplotlib`. No
additional installations are necessary.
3. **Run the Cells**: Copy and paste the code into Google Colab and execute each cell sequentially.
---
### Explanation of the Code
1. **Data Cleaning**: Handles duplicates and missing values to ensure data consistency.
2. **Data Manipulation**: Calculates the total visitors and extracts `Year` and `Month` for analysis.
3. **Analysis**:
- Finds the month with the highest visitors.
- Calculates average yearly visitors.
- Analyzes proportions of domestic and international visitors.
4. **Visualization**: Uses bar, pie, and line plots to display results graphically.
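The proportion step normalizes each year's visitor counts by that year's total, so the two proportion columns always sum to 1. A minimal sketch with made-up yearly totals:

```python
import pandas as pd

# Illustrative yearly totals (all numbers are made up)
proportion = pd.DataFrame(
    {'Domestic_Visitors': [2_000_000, 1_200_000],
     'International_Visitors': [700_000, 500_000]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Divide each column by the year's combined total
total = proportion['Domestic_Visitors'] + proportion['International_Visitors']
proportion['Domestic_Proportion'] = proportion['Domestic_Visitors'] / total
proportion['International_Proportion'] = proportion['International_Visitors'] / total
print(proportion.round(3))
```

For 2019 the domestic share is 2,000,000 / 2,700,000 ≈ 0.741, and each row's two proportions sum to 1 by construction.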
---
### Output
The output includes:
1. **Text Outputs**:
- The month with the highest visitors.
- The average number of visitors per year.
- Proportion data.
2. **Graphs**:
- A bar chart for total visitors per month.
- A pie chart for domestic vs international visitor proportions.
- A line graph showing the trend of total visitors over the years.
Run the code to view the exact outputs based on your dataset. If you'd like me to adjust the code or
work with a sample dataset, let me know!
Here's an example of what the output might look like with a fictional tourism dataset. This will give
you an idea of the expected results:
### **Sample Text Output**
#### **Month with Highest Total Visitors**
```
Date                      2023-07-01 00:00:00
Domestic_Visitors                      500000
International_Visitors                 300000
Total_Visitors                         800000
Year                                     2023
Month                                       7
Name: 189, dtype: object
```
#### **Average Visitors Per Year**
```
Year
2019    450000.0
2020    200000.0
2021    350000.0
2022    500000.0
2023    600000.0
Name: Total_Visitors, dtype: float64
```
#### **Proportion of Domestic vs International Visitors by Year**
```
      Domestic_Visitors  International_Visitors  Domestic_Proportion  International_Proportion
Year
2019            2000000                  700000                0.741                     0.259
2020            1200000                  500000                0.706                     0.294
2021            1500000                  700000                0.682                     0.318
2022            2000000                 1000000                0.667                     0.333
2023            2500000                 1300000                0.658                     0.342
```
---
### **Sample Visualizations**
1. **Bar Chart: Total Visitors Per Month**
A bar chart showing total visitors for each month, with July as the peak month.
2. **Pie Chart: Proportion of Domestic vs International Visitors (2023)**
A pie chart for 2023 might show:
- **65.8% Domestic Visitors**
- **34.2% International Visitors**
3. **Line Graph: Total Visitors Over the Years**
A line graph showing a general upward trend in tourism, with a dip in 2020 (consistent with
pandemic-era travel restrictions) and steady growth afterward.
---
### **Key Notes**
- The outputs will vary depending on your dataset.
- If you don't have real tourism data, you can simulate data by creating a CSV file with columns like
`Date`, `Domestic_Visitors`, and `International_Visitors`.
- Let me know if you’d like help generating sample data for testing!
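Following up on the note about simulating data: a small synthetic CSV can be generated like this. The column names match the code above; the date range, value ranges, and file name are illustrative assumptions.

```python
import pandas as pd
import numpy as np

# Seeded generator so the sample file is reproducible
rng = np.random.default_rng(42)

# One row per month from Jan 2019 through Dec 2023 (60 rows)
dates = pd.date_range('2019-01-01', '2023-12-01', freq='MS')
sample = pd.DataFrame({
    'Date': dates,
    'Domestic_Visitors': rng.integers(100_000, 500_000, len(dates)),
    'International_Visitors': rng.integers(30_000, 200_000, len(dates)),
})
sample.to_csv('tourism_data.csv', index=False)
print(len(sample), 'rows written')
```

Upload the resulting `tourism_data.csv` when prompted by `files.upload()`, and the analysis code runs end to end on the simulated data.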
Here is how you can run the bank campaign analysis code on **Google Colab**, step-by-step:
---
### **Step 1: Open Google Colab**
1. Go to [Google Colab](https://colab.research.google.com/).
2. Create a new notebook by clicking on **"File > New Notebook"**.
---
### **Step 2: Upload the Dataset**
1. Save your dataset (e.g., `bank_campaign_data.csv`) on your local machine.
2. In Google Colab, click on the folder icon in the left sidebar.
3. Click the upload icon and upload your dataset.
---
### **Step 3: Run the Code**
1. Copy and paste the following Python code into a code cell in Colab:
```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Load the dataset
from google.colab import files
uploaded = files.upload() # Upload the dataset here
file_path = list(uploaded.keys())[0] # Get the uploaded file name
data = pd.read_csv(file_path)
# Step 2: Data Cleaning
# Handle missing values with a forward fill
data.ffill(inplace=True)
# Drop duplicate entries
data.drop_duplicates(inplace=True)
# Step 3: Data Manipulation
# Add a column for Contacted_Last_Month
data['Contacted_Last_Month'] = data['campaign'].apply(lambda x: 'Yes' if x > 0 else 'No')
# One-hot encode categorical variables into a separate frame, so the original
# 'job' column stays available for the analysis and plots below
categorical_cols = ['job', 'marital', 'education']
data_encoded = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
# Step 4: Analysis
# Average age of customers who subscribed
avg_age = data[data['y'] == 'yes']['age'].mean()
# Most common job category for subscribed customers
most_common_job = data[data['y'] == 'yes']['job'].mode()[0]
# Proportion of subscribed customers
subscribed_proportion = len(data[data['y'] == 'yes']) / len(data)
# Step 5: Visualization
# Bar chart showing subscription rate by job
sns.countplot(x='job', hue='y', data=data)
plt.title('Subscription Rate by Job')
plt.xticks(rotation=45)
plt.show()
# Pie chart showing subscription proportion (label by value, not by position)
data['y'].value_counts().rename({'no': 'Not Subscribed', 'yes': 'Subscribed'}).plot.pie(autopct='%1.1f%%')
plt.title('Subscription Proportion')
plt.ylabel('')
plt.show()
# Histogram for age distribution
data['age'].plot.hist(bins=10)
plt.title('Distribution of Customer Ages')
plt.xlabel('Age')
plt.show()
# Print analysis results
print(f"Average Age of Subscribed Customers: {avg_age:.2f}")
print(f"Most Common Job for Subscribed Customers: {most_common_job}")
print(f"Proportion of Subscribed Customers: {subscribed_proportion:.2%}")
```
2. Run the cell.
3. When prompted, upload your dataset (e.g., `bank_campaign_data.csv`).
---
### **Sample Output**:
1. The console will display:
```
Average Age of Subscribed Customers: 41.20
Most Common Job for Subscribed Customers: admin
Proportion of Subscribed Customers: 12.50%
```
2. Visualizations:
- **Bar Chart**: Subscription rate by job category.
- **Pie Chart**: Proportion of subscribed vs. not subscribed customers.
- **Histogram**: Age distribution of customers.
---
### **Note**:
Make sure your dataset includes the necessary columns like `age`, `job`, `campaign`, `y`, and other
required fields. Adjust column names in the code if they differ in your dataset.
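One subtlety worth noting: `pd.get_dummies` replaces the original categorical columns, so any code that later reads `data['job']` would fail if the encoding were done in place. Keeping the encoded frame separate avoids this; a minimal sketch with made-up rows:

```python
import pandas as pd

# Tiny illustrative frame (all rows are made up)
data = pd.DataFrame({
    'age': [30, 45, 52],
    'job': ['admin', 'technician', 'admin'],
    'y': ['yes', 'no', 'yes'],
})

# Encode into a NEW frame so 'job' stays available for mode() and plots
data_encoded = pd.get_dummies(data, columns=['job'], drop_first=True)

most_common_job = data[data['y'] == 'yes']['job'].mode()[0]
print(most_common_job)               # raw 'job' column still usable
print(list(data_encoded.columns))    # 'job' replaced by dummy columns here
```

The original frame keeps its `job` column for the analysis, while `data_encoded` holds the numeric dummies (with `drop_first=True` dropping one category per variable) ready for any downstream modeling.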