Duplicate columns when querying SQLAlchemy into Pandas Dataframe?
Last Updated :
29 Nov, 2024
When querying data with SQLAlchemy and loading it into a Pandas DataFrame, you may encounter duplicate column names, especially when joining tables. This happens because SQLAlchemy does not rename or manage overlapping column names in joined tables. As a result, the Pandas DataFrame will contain duplicate column names, which can cause confusion or errors in downstream operations.
Understanding Problem of Duplicate Columns
This scenario occurs especially when we use * in Select statement which results in duplicate columns or Join queries. This is because SQLAlchemy cannot distinguish between the columns that are having the same name but the purpose is different.
Let us consider two tables: users and orders.
- In the users table there are three columns: id, name and email.
- In the orders table there are also three columns id, user_id and order_date.
Here the id in user table means user_id and the id in orders table means order_id. Now we will define the schema, tables and insert data into them. After that we will join the two tables based on user_id and append the results to the dataframe.
Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()
# Define the 'users' table
users = Table('users', metadata,
Column('id', Integer, primary_key=True),
Column('name', String),
Column('email', String))
# Define the 'orders' table
orders = Table('orders', metadata,
Column('id', Integer, primary_key=True),
Column('user_id', Integer),
Column('order_date', String))
# Step 3: Create the tables in the database
metadata.create_all(engine)
# Step 4: Insert sample data into the tables with explicit transaction commit
with engine.connect() as conn:
# Inserting into 'users' table
conn.execute(users.insert(), [
{'id': 1, 'name': 'Alice', 'email': '[email protected]'},
{'id': 2, 'name': 'Bob', 'email': '[email protected]'}
])
# Inserting into 'orders' table
conn.execute(orders.insert(), [
{'id': 1, 'user_id': 1, 'order_date': '2024-01-01'},
{'id': 2, 'user_id': 2, 'order_date': '2024-02-01'}
])
# Explicit commit (although with SQLAlchemy, this should generally happen automatically)
conn.commit()
# Step 5: Query data from both tables (example: JOIN the tables without renaming columns)
query = """
SELECT users.id, users.name, users.email, orders.id, orders.order_date
FROM users
JOIN orders ON users.id = orders.user_id
"""
# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
df = pd.read_sql(query, conn) # Execute query and load into DataFrame
# Display the result
print(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFThe green box denotes the user id and the red box denotes order id. But we can see that the column headings are same and as a result we cannot distinguish them as two separate columns. Therefore we need to handle the column names so as to reduce ambiguity.
Method 1. Renaming columns in the SQL query to reduce redundancy
Renaming columns directly in the SQL query ensures unique column names before data enters Pandas. In general for SQL queries we use AS to provide an alias name for table or particular column. So here when we will be querying the database using SQLAlchemy, we will rename the necessary columns using AS operator.
Let us consider a sample example. Here we have two tables: user details and purchase details. Now we will perform join operation between the two tables and also rename the columns.
Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()
# Define the 'users' table
users = Table('users', metadata,
Column('id', Integer, primary_key=True),
Column('name', String))
# Define the 'purchases' table (instead of orders)
purchases = Table('purchases', metadata,
Column('id', Integer, primary_key=True),
Column('user_id', Integer),
Column('purchase_date', String))
# Step 3: Create the tables in the database
metadata.create_all(engine)
# Step 4: Insert sample data into the tables
with engine.connect() as conn:
conn.execute(users.insert(), [
{'id': 1, 'name': 'Alice'},
{'id': 2, 'name': 'Bob'}
])
conn.execute(purchases.insert(), [
{'id': 1, 'user_id': 1, 'purchase_date': '2024-01-01'},
{'id': 2, 'user_id': 2, 'purchase_date': '2024-02-01'}
])
conn.commit()
# Step 5: Query data from both tables, renaming columns to avoid duplication
query = """
SELECT users.id AS user_id, users.name, purchases.id AS purchase_id, purchases.purchase_date
FROM users
JOIN purchases ON users.id = purchases.user_id
"""
# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
df = pd.read_sql(query, conn)
# Display the result
print(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFHere we can see that the users table has an id column which denotes user id and in the purchase table also there is a column named id which denotes purchase id. So when we query the table we use AS to provide an Alias name for the columns.
Method 2. Using Pandas to Rename Columns
In this after querying the data from the table, we will use rename of the Pandas to rename the duplicate columns present in the table. Below is a sample code that illustrates the use of rename.
Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()
# Define the 'users' table
users = Table('users', metadata,
Column('id', Integer, primary_key=True),
Column('name', String),
Column('email', String))
# Define the 'orders' table
orders = Table('orders', metadata,
Column('id', Integer, primary_key=True),
Column('user_id', Integer),
Column('order_date', String))
# Step 3: Create the tables in the database
metadata.create_all(engine)
# Step 4: Insert sample data into the tables with explicit transaction commit
with engine.connect() as conn:
# Inserting into 'users' table
conn.execute(users.insert(), [
{'id': 1, 'name': 'Alice', 'email': '[email protected]'},
{'id': 2, 'name': 'Bob', 'email': '[email protected]'}
])
# Inserting into 'orders' table
conn.execute(orders.insert(), [
{'id': 1, 'user_id': 1, 'order_date': '2024-01-01'},
{'id': 2, 'user_id': 2, 'order_date': '2024-02-01'}
])
# Explicit commit (although with SQLAlchemy, this should generally happen automatically)
conn.commit()
# Step 5: Query data from both tables (example: JOIN the tables without renaming columns)
query = """
SELECT users.id, users.name, users.email, orders.id as ord_id, orders.order_date
FROM users
JOIN orders ON users.id = orders.user_id
"""
# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
df = pd.read_sql(query, conn) # Execute query and load into DataFrame
# Step 7: Rename columns in the DataFrame
df = df.rename(columns={
'id': 'User_ID',
})
# Display the result
print(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFIn this for user id we have used rename to rename the columns for better understanding of the data.
Method 3. Using Object Relational Mapping to avoid duplicate columns
It might so happen that duplicate columns can appear when we query the data from the tables. So we can use Object Relational Mapping to query the data more effectively. It helps to define the object models and helps us to restructure our queries more effectively.
Here we create session and inside the session we label the columns that have duplicate names. Then the join is done between two tables and the most advantageous part is that we do not have to write SQL queries. Then the data is finally dumped into the dataframe. Below is the sample example that illustrates the use of ORM.
Python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, MetaData, Table
from sqlalchemy.orm import sessionmaker, declarative_base
import pandas as pd
# --- Step 1: Define Database and Tables ---
Base = declarative_base()
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String)
class Order(Base):
__tablename__ = 'orders'
id = Column(Integer, primary_key=True)
user_id = Column(Integer, ForeignKey('users.id'))
total = Column(Integer)
# --- Step 2: Connect to Database ---
engine = create_engine("sqlite:///example.db", echo=True)
Base.metadata.create_all(engine) # Creates tables if not already present
# --- Step 3: Insert Sample Data (if necessary) ---
def insert_sample_data():
# Use session to handle transactions
session = Session()
session.add_all([
User(id=1, name='Alice'),
User(id=2, name='Bob'),
User(id=3, name='Charlie'),
Order(id=101, user_id=1, total=200),
Order(id=102, user_id=2, total=150),
Order(id=103, user_id=3, total=300)
])
session.commit() # Commit all changes in one go
session.close()
# Uncomment below line to insert data only once
insert_sample_data()
# --- Step 4: Create a Session ---
Session = sessionmaker(bind=engine)
session = Session()
# --- Step 5: Query Specific Columns ---
query = session.query(
User.id.label("user_id"),
Order.id.label("order_id"),
Order.total.label("order_total")
).filter(User.id == Order.user_id)
# --- Step 6: Load Query Results into a Pandas DataFrame ---
df = pd.read_sql(query.statement, session.bind)
# --- Step 7: Print the DataFrame ---
print(df)
# --- Step 8: Close the Session ---
session.close()
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFExamples of Handling Duplicate Columns in SQLAlchemy Queries
Here we will be seeing some examples on how to handle the duplicate columns when querying SQLAlchemy into pandas df.
Example 1: Here we will be using two tables to join them and rename the common column names using Alias. - Renaming Columns in the SQL Query
Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()
# Define the 'employees' table
employees = Table('employees', metadata,
Column('id', Integer, primary_key=True),
Column('name', String),
Column('department', String))
# Define the 'projects' table
projects = Table('projects', metadata,
Column('id', Integer, primary_key=True),
Column('employee_id', Integer),
Column('project_name', String),
Column('budget', Integer))
# Step 3: Create the tables in the database
metadata.create_all(engine)
# Step 4: Insert sample data into the tables
with engine.connect() as conn:
conn.execute(employees.insert(), [
{'id': 1, 'name': 'Sophia', 'department': 'Engineering'},
{'id': 2, 'name': 'Liam', 'department': 'Marketing'}
])
conn.execute(projects.insert(), [
{'id': 1, 'employee_id': 1, 'project_name': 'Project A', 'budget': 100000},
{'id': 2, 'employee_id': 2, 'project_name': 'Project B', 'budget': 150000}
])
conn.commit()
# Step 5: Query data with column renaming in SQL
query = """
SELECT employees.id AS employee_id, employees.name AS employee_name, employees.department AS employee_department,
projects.id AS project_id, projects.project_name, projects.budget AS project_budget
FROM employees
JOIN projects ON employees.id = projects.employee_id
"""
# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
df = pd.read_sql(query, conn)
# Display the result
print("Query with column renaming in SQL:")
display(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFExample 2: Here we will be using two tables to join them, fetch the data and rename the common column names using rename function of Pandas - Renaming Columns in the DataFrame Using rename()
Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()
# Define the 'employees' table
employees = Table('employees', metadata,
Column('id', Integer, primary_key=True),
Column('name', String),
Column('department', String))
# Define the 'projects' table
projects = Table('projects', metadata,
Column('id', Integer, primary_key=True),
Column('employee_id', Integer),
Column('project_name', String),
Column('budget', Integer))
# Step 3: Create the tables in the database
metadata.create_all(engine)
# Step 4: Insert sample data into the tables
with engine.connect() as conn:
conn.execute(employees.insert(), [
{'id': 1, 'name': 'Sophia', 'department': 'Engineering'},
{'id': 2, 'name': 'Liam', 'department': 'Marketing'}
])
conn.execute(projects.insert(), [
{'id': 1, 'employee_id': 1, 'project_name': 'Project A', 'budget': 100000},
{'id': 2, 'employee_id': 2, 'project_name': 'Project B', 'budget': 150000}
])
conn.commit()
# Step 5: Query data from both tables
query = """
SELECT employees.id, employees.name, employees.department,
projects.id AS project_id, projects.project_name, projects.budget
FROM employees
JOIN projects ON employees.id = projects.employee_id
"""
# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
df = pd.read_sql(query, conn)
# Step 7: Display the columns before renaming
print("Columns before renaming:")
print(df.columns)
# Step 8: Rename columns in the DataFrame using rename() based on actual column names
df = df.rename(columns={
'id': 'employee_id', # Rename the first 'id' from employees
'project_id': 'project_id', # 'project_id' already has the correct name
'name': 'employee_name', # Rename 'name' from employees to 'employee_name'
'department': 'employee_department', # Rename 'department' from employees
'project_name': 'project_name', # Rename 'project_name' (no change needed)
'budget': 'project_budget' # Rename 'budget' from projects
})
# Step 9: Display the final DataFrame
print("\nQuery with renaming columns in DataFrame using rename():")
print(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DFExample 3: Here we will be using ORM model to establish the relationship without writing the SQL codes - Using ORM
Python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship
import pandas as pd
# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:') # In-memory SQLite database
# Step 2: Define the base class for ORM
Base = declarative_base()
# Define the 'Employee' class (ORM for employees table)
class Employee(Base):
__tablename__ = 'employees'
id = Column(Integer, primary_key=True)
name = Column(String)
department = Column(String)
projects = relationship('Project', backref='employee') # Relationship to projects
# Define the 'Project' class (ORM for projects table)
class Project(Base):
__tablename__ = 'projects'
id = Column(Integer, primary_key=True)
project_name = Column(String)
budget = Column(Integer)
employee_id = Column(Integer, ForeignKey('employees.id'))
# Step 3: Create the tables in the database
Base.metadata.create_all(engine)
# Step 4: Insert sample data into the tables using ORM
Session = sessionmaker(bind=engine)
session = Session()
employee1 = Employee(name='Sophia', department='Engineering', projects=[Project(project_name='Project A', budget=100000)])
employee2 = Employee(name='Liam', department='Marketing', projects=[Project(project_name='Project B', budget=150000)])
session.add(employee1)
session.add(employee2)
session.commit()
# Step 5: Query using ORM and join the tables
query = session.query(Employee.id.label('employee_id'), Employee.name.label('employee_name'),
Employee.department.label('employee_department'),
Project.id.label('project_id'), Project.project_name, Project.budget.label('project_budget'))\
.join(Project)
# Step 6: Load the query result into a Pandas DataFrame
df = pd.read_sql(query.statement, session.bind)
# Display the result
print("Query using ORM to avoid duplicate columns:")
print(df)
Output:
Duplicate columns when querying SQLAlchemy into Pandas DF
Similar Reads
Prevent duplicated columns when joining two Pandas DataFrames
Column duplication usually occurs when the two data frames have columns with the same name and when the columns are not used in the JOIN statement. In this article, let us discuss the three different methods in which we can prevent duplication of columns when joining two data frames. Syntax: pandas.
5 min read
Rename Duplicated Columns after Join in Pyspark dataframe
In this article, we are going to learn how to rename duplicate columns after join in Pyspark data frame in Python. A distributed collection of data grouped into named columns is known as a Pyspark data frame. While handling a lot of data, we observe that not all data is coming from one data frame, t
4 min read
Delete duplicates in a Pandas Dataframe based on two columns
A dataframe is a two-dimensional, size-mutable tabular data structure with labeled axes (rows and columns). It can contain duplicate entries and to delete them there are several ways. The dataframe contains duplicate values in column order_id and customer_id. Below are the methods to remove duplica
2 min read
How to count duplicates in Pandas Dataframe?
Let us see how to count duplicates in a Pandas DataFrame. Our task is to count the number of duplicate entries in a single column and multiple columns. Under a single column : We will be using the pivot_table() function to count the duplicates in a single column. The column in which the duplicates a
2 min read
How to Find & Drop duplicate columns in a Pandas DataFrame?
Letâs discuss How to Find and drop duplicate columns in a Pandas DataFrame. First, Letâs create a simple Dataframe with column names 'Name', 'Age', 'Domicile', and 'Age'/'Marks'. Find Duplicate Columns from a DataFrameTo find duplicate columns we need to iterate through all columns of a DataFrame a
4 min read
Bulk Insert to Pandas DataFrame Using SQLAlchemy - Python
Let's start with SQLAlchemy, a Python library that allows communication with databases(MySQL, PostgreSQL etc.) and Python. This library is used as an Object Relational Mapper tool that translates Python classes to tables in relational databases and automatically converts function calls to SQL statem
3 min read
SQLAlchemy ORM conversion to Pandas DataFrame
In this article, we will see how to convert an SQLAlchemy ORM to Pandas DataFrame using Python. We need to have the sqlalchemy as well as the pandas library installed in the python environment - $ pip install sqlalchemy $ pip install pandasFor our example, we will make use of the MySQL database wher
4 min read
Removing duplicate columns after DataFrame join in PySpark
In this article, we will discuss how to remove duplicate columns after a DataFrame join in PySpark. Create the first dataframe for demonstration:Python3 # Importing necessary libraries from pyspark.sql import SparkSession # Create a spark session spark = SparkSession.builder.appName('pyspark \ - exa
3 min read
Get unique values from a column in Pandas DataFrame
In Pandas, retrieving unique values from DataFrame is used for analyzing categorical data or identifying duplicates. Let's learn how to get unique values from a column in Pandas DataFrame. Get the Unique Values of Pandas using unique()The.unique()method returns a NumPy array. It is useful for identi
5 min read
Removing duplicate rows based on specific column in PySpark DataFrame
In this article, we are going to drop the duplicate rows based on a specific column from dataframe using pyspark in Python. Duplicate data means the same data based on some condition (column values). For this, we are using dropDuplicates() method: Syntax: dataframe.dropDuplicates(['column 1','column
1 min read