To load a dataset from a CSV file using Pandas, you'll need to ensure that the file exists
in the specified directory. Here's a complete example that demonstrates how to load the
dataset, perform some basic operations, and visualize the data using Matplotlib.
Let's assume that `[Link]` contains columns `YearsExperience` and `Salary`.
### Step-by-Step Example
#### 1. Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#### 2. Load the Dataset
Make sure `[Link]` is in the same directory as your script, or provide the full path to
the file.
# Load the dataset
dataset = pd.read_csv('[Link]')
# Display the first few rows of the dataset
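print(dataset.head())
# In a notebook, dataset.head() on its own also displays the rows.
# Other quick checks: dataset.tail(), dataset.info(), dataset.shape, dataset.size, dataset.describe()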
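A missing or misplaced file is a common reason `pd.read_csv` fails. The sketch below checks the path before loading, so the error message shows exactly where Python looked. `load_csv` is a hypothetical helper name used for illustration, not part of pandas, and the path you pass in is whatever your own file is called.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Load a CSV into a DataFrame, failing early with a clear message.

    Illustrative helper only; this is not a pandas function.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the absolute path so it is obvious where Python looked.
        raise FileNotFoundError(f"CSV file not found: {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

Calling `load_csv` with a relative name resolves it against the current working directory, which is not always the directory the script lives in.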
#### 3. Explore the Dataset
# Display basic information about the dataset
# (info() prints its report directly and returns None, so no print() is needed)
dataset.info()
# Display summary statistics
print(dataset.describe())
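To see what each inspection call returns without needing the CSV file, here is a small sketch on an in-memory DataFrame. The values are made up purely for illustration, and `toy` is a hypothetical variable name.

```python
import pandas as pd

# Made-up toy values, used only to show what each call returns.
toy = pd.DataFrame(
    {
        "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Salary": [40000, 45000, 50000, 55000, 60000],
    }
)

print(toy.head(2))      # first two rows
print(toy.tail(2))      # last two rows
print(toy.shape)        # (rows, columns) as a tuple
print(toy.size)         # total number of cells: rows * columns
toy.info()              # prints column dtypes and non-null counts itself
print(toy.describe())   # count, mean, std, min, quartiles, max per numeric column
```

Note that `shape` and `size` are attributes, not methods, so they take no parentheses.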
#### 4. Visualize the Data
Create a scatter plot to visualize the relationship between `YearsExperience` and
`Salary`.
# Scatter plot of YearsExperience vs Salary
plt.scatter(dataset['YearsExperience'], dataset['Salary'], color='blue')
# Adding title and labels
plt.title('Years of Experience vs Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
# Display the plot
plt.show()
#### 5. Perform Regression Analysis
Let's perform a simple linear regression to predict Salary based on Years of Experience.
# train_test_split splits the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
# LinearRegression is the model class: it fits an ordinary least squares line
# to the training data and then predicts the target for new inputs.
from sklearn.linear_model import LinearRegression
# mean_squared_error and r2_score evaluate the performance of a regression model.
# Mean Squared Error (MSE): the average squared difference between the actual
# and predicted values. Lower values are better.
# R-squared (R²) score: the proportion of variance in the dependent variable
# that is predictable from the independent variable(s). Higher values
# (closer to 1) are better.
from sklearn.metrics import mean_squared_error, r2_score
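To make the two metric definitions concrete, the sketch below computes both by hand in plain Python. The numbers are made up so the arithmetic is easy to follow, and `mse_by_hand` and `r2_by_hand` are hypothetical helper names; in practice scikit-learn's own functions do this work.

```python
def mse_by_hand(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, not taken from the salary dataset.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(mse_by_hand(actual, predicted))  # (1 + 0 + 1 + 0) / 4 = 0.5
print(r2_by_hand(actual, predicted))   # 1 - 2 / 20 = 0.9
```

R² is not bounded below by 0: a model that predicts worse than always guessing the mean of the target gets a negative score.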
# Define the features (X) and target (y)
X = dataset[['YearsExperience']]
y = dataset['Salary']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# X: the feature(s) of the dataset, here YearsExperience.
# y: the target variable, here Salary.
# test_size=0.2: 20% of the data is held out as the test set.
# random_state=42: fixes the shuffle seed, so the same split is produced on every run.
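The effect of `test_size` and `random_state` can be checked directly. This minimal sketch splits ten placeholder samples twice with the same seed and compares the results; the `_demo`, `_a` and `_b` names are hypothetical and the values carry no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder samples; the values themselves do not matter here.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(len(X_train_a), len(X_test_a))       # 8 training rows, 2 test rows
print(np.array_equal(X_test_a, X_test_b))  # True: same seed, same split
```

Leaving `random_state` unset gives a different shuffle on each run, which also changes the reported metrics from run to run.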
# Create a linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('R-squared:', r2)
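With a single feature, the fitted model is simply a line: predicted value = slope × feature + intercept. As a sanity check on the workflow, the sketch below fits synthetic, noise-free data on the line y = 2x + 1 and reads the slope and intercept back from the model's `coef_` and `intercept_` attributes. The data and the `line_model` name are assumptions for illustration, not part of the salary example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic, noise-free data on the line y = 2x + 1 (not the salary data).
X_line = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_line = 2.0 * X_line.ravel() + 1.0

line_model = LinearRegression()
line_model.fit(X_line, y_line)

# coef_ holds one slope per feature; intercept_ is the constant term.
slope = line_model.coef_[0]
intercept = line_model.intercept_
print(f"slope={slope:.3f}, intercept={intercept:.3f}")

# With a perfect fit, MSE is numerically zero and R-squared is 1.
y_fit = line_model.predict(X_line)
print(mean_squared_error(y_line, y_fit), r2_score(y_line, y_fit))
```

On real data the fit will not be perfect, but the same two attributes give the estimated change in `Salary` per additional year of experience and the predicted salary at zero years.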
# Plot the regression line
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red', linewidth=2)
# Adding title and labels
plt.title('Years of Experience vs Salary (with Regression Line)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
# Display the plot
plt.show()