Open In App

Duplicate columns when querying SQLAlchemy into Pandas Dataframe?

Last Updated : 29 Nov, 2024
Comments
Improve
Suggest changes
Like Article
Like
Report

When querying data with SQLAlchemy and loading it into a Pandas DataFrame, you may encounter duplicate column names, especially when joining tables. This happens because SQLAlchemy does not rename or manage overlapping column names in joined tables. As a result, the Pandas DataFrame will contain duplicate column names, which can cause confusion or errors in downstream operations.

Understanding Problem of Duplicate Columns

This scenario occurs especially when we use * in Select statement which results in duplicate columns or Join queries. This is because SQLAlchemy cannot distinguish between the columns that are having the same name but the purpose is different.

Let us consider two tables: users and orders.

  • In the users table there are three columns: id, name and email.
  • In the orders table there are also three columns id, user_id and order_date.

Here the id in user table means user_id and the id in orders table means order_id. Now we will define the schema, tables and insert data into them. After that we will join the two tables based on user_id and append the results to the dataframe.

Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()

# Define the 'users' table
users = Table('users', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', String),
              Column('email', String))

# Define the 'orders' table
orders = Table('orders', metadata,
               Column('id', Integer, primary_key=True),
               Column('user_id', Integer),
               Column('order_date', String))

# Step 3: Create the tables in the database
metadata.create_all(engine)

# Step 4: Insert sample data into the tables with explicit transaction commit
with engine.connect() as conn:
    # Inserting into 'users' table
    conn.execute(users.insert(), [
        {'id': 1, 'name': 'Alice', 'email': '[email protected]'},
        {'id': 2, 'name': 'Bob', 'email': '[email protected]'}
    ])
    
    # Inserting into 'orders' table
    conn.execute(orders.insert(), [
        {'id': 1, 'user_id': 1, 'order_date': '2024-01-01'},
        {'id': 2, 'user_id': 2, 'order_date': '2024-02-01'}
    ])
    
    # Explicit commit (although with SQLAlchemy, this should generally happen automatically)
    conn.commit()

# Step 5: Query data from both tables (example: JOIN the tables without renaming columns)
query = """
SELECT users.id, users.name, users.email, orders.id, orders.order_date
FROM users
JOIN orders ON users.id = orders.user_id
"""

# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
    df = pd.read_sql(query, conn)  # Execute query and load into DataFrame

# Display the result
print(df)

Output:

Screenshot-2024-11-21-201314
Duplicate columns when querying SQLAlchemy into Pandas DF

The green box denotes the user id and the red box denotes order id. But we can see that the column headings are same and as a result we cannot distinguish them as two separate columns. Therefore we need to handle the column names so as to reduce ambiguity.

Method 1. Renaming columns in the SQL query to reduce redundancy

Renaming columns directly in the SQL query ensures unique column names before data enters Pandas. In general for SQL queries we use AS to provide an alias name for table or particular column. So here when we will be querying the database using SQLAlchemy, we will rename the necessary columns using AS operator.

Let us consider a sample example. Here we have two tables: user details and purchase details. Now we will perform join operation between the two tables and also rename the columns.

Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()

# Define the 'users' table
users = Table('users', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', String))

# Define the 'purchases' table (instead of orders)
purchases = Table('purchases', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('user_id', Integer),
                  Column('purchase_date', String))

# Step 3: Create the tables in the database
metadata.create_all(engine)

# Step 4: Insert sample data into the tables
with engine.connect() as conn:
    conn.execute(users.insert(), [
        {'id': 1, 'name': 'Alice'},
        {'id': 2, 'name': 'Bob'}
    ])
    conn.execute(purchases.insert(), [
        {'id': 1, 'user_id': 1, 'purchase_date': '2024-01-01'},
        {'id': 2, 'user_id': 2, 'purchase_date': '2024-02-01'}
    ])
    conn.commit()

# Step 5: Query data from both tables, renaming columns to avoid duplication
query = """
SELECT users.id AS user_id, users.name, purchases.id AS purchase_id, purchases.purchase_date
FROM users
JOIN purchases ON users.id = purchases.user_id
"""

# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

# Display the result
print(df)

Output:

Screenshot-2024-11-22-191109
Duplicate columns when querying SQLAlchemy into Pandas DF

Here we can see that the users table has an id column which denotes user id and in the purchase table also there is a column named id which denotes purchase id. So when we query the table we use AS to provide an Alias name for the columns.

Method 2. Using Pandas to Rename Columns

In this after querying the data from the table, we will use rename of the Pandas to rename the duplicate columns present in the table. Below is a sample code that illustrates the use of rename.

Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()

# Define the 'users' table
users = Table('users', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', String),
              Column('email', String))

# Define the 'orders' table
orders = Table('orders', metadata,
               Column('id', Integer, primary_key=True),
               Column('user_id', Integer),
               Column('order_date', String))

# Step 3: Create the tables in the database
metadata.create_all(engine)

# Step 4: Insert sample data into the tables with explicit transaction commit
with engine.connect() as conn:
    # Inserting into 'users' table
    conn.execute(users.insert(), [
        {'id': 1, 'name': 'Alice', 'email': '[email protected]'},
        {'id': 2, 'name': 'Bob', 'email': '[email protected]'}
    ])

    # Inserting into 'orders' table
    conn.execute(orders.insert(), [
        {'id': 1, 'user_id': 1, 'order_date': '2024-01-01'},
        {'id': 2, 'user_id': 2, 'order_date': '2024-02-01'}
    ])

    # Explicit commit (although with SQLAlchemy, this should generally happen automatically)
    conn.commit()

# Step 5: Query data from both tables (example: JOIN the tables without renaming columns)
query = """
SELECT users.id, users.name, users.email, orders.id as ord_id, orders.order_date
FROM users
JOIN orders ON users.id = orders.user_id
"""

# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
    df = pd.read_sql(query, conn)  # Execute query and load into DataFrame

# Step 7: Rename columns in the DataFrame
df = df.rename(columns={
    'id': 'User_ID',
    
})

# Display the result
print(df)

Output:

Screenshot-2024-11-22-200902
Duplicate columns when querying SQLAlchemy into Pandas DF

In this for user id we have used rename to rename the columns for better understanding of the data.

Method 3. Using Object Relational Mapping to avoid duplicate columns

It might so happen that duplicate columns can appear when we query the data from the tables. So we can use Object Relational Mapping to query the data more effectively. It helps to define the object models and helps us to restructure our queries more effectively.

Here we create session and inside the session we label the columns that have duplicate names. Then the join is done between two tables and the most advantageous part is that we do not have to write SQL queries. Then the data is finally dumped into the dataframe. Below is the sample example that illustrates the use of ORM.

Python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, MetaData, Table
from sqlalchemy.orm import sessionmaker, declarative_base
import pandas as pd

# --- Step 1: Define Database and Tables ---
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    total = Column(Integer)

# --- Step 2: Connect to Database ---
engine = create_engine("sqlite:///example.db", echo=True)
Base.metadata.create_all(engine)  # Creates tables if not already present

# --- Step 3: Insert Sample Data (if necessary) ---
def insert_sample_data():
    # Use session to handle transactions
    session = Session()
    session.add_all([
        User(id=1, name='Alice'),
        User(id=2, name='Bob'),
        User(id=3, name='Charlie'),
        Order(id=101, user_id=1, total=200),
        Order(id=102, user_id=2, total=150),
        Order(id=103, user_id=3, total=300)
    ])
    session.commit()  # Commit all changes in one go
    session.close()

# Uncomment below line to insert data only once
insert_sample_data()

# --- Step 4: Create a Session ---
Session = sessionmaker(bind=engine)
session = Session()

# --- Step 5: Query Specific Columns ---
query = session.query(
    User.id.label("user_id"),
    Order.id.label("order_id"),
    Order.total.label("order_total")
).filter(User.id == Order.user_id)

# --- Step 6: Load Query Results into a Pandas DataFrame ---
df = pd.read_sql(query.statement, session.bind)

# --- Step 7: Print the DataFrame ---
print(df)

# --- Step 8: Close the Session ---
session.close()

Output:

Screenshot-2024-11-22-205832
Duplicate columns when querying SQLAlchemy into Pandas DF

Examples of Handling Duplicate Columns in SQLAlchemy Queries

Here we will be seeing some examples on how to handle the duplicate columns when querying SQLAlchemy into pandas df.

Example 1: Here we will be using two tables to join them and rename the common column names using Alias. - Renaming Columns in the SQL Query

Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()

# Define the 'employees' table
employees = Table('employees', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('name', String),
                  Column('department', String))

# Define the 'projects' table
projects = Table('projects', metadata,
                 Column('id', Integer, primary_key=True),
                 Column('employee_id', Integer),
                 Column('project_name', String),
                 Column('budget', Integer))

# Step 3: Create the tables in the database
metadata.create_all(engine)

# Step 4: Insert sample data into the tables
with engine.connect() as conn:
    conn.execute(employees.insert(), [
        {'id': 1, 'name': 'Sophia', 'department': 'Engineering'},
        {'id': 2, 'name': 'Liam', 'department': 'Marketing'}
    ])
    conn.execute(projects.insert(), [
        {'id': 1, 'employee_id': 1, 'project_name': 'Project A', 'budget': 100000},
        {'id': 2, 'employee_id': 2, 'project_name': 'Project B', 'budget': 150000}
    ])
    conn.commit()

# Step 5: Query data with column renaming in SQL
query = """
SELECT employees.id AS employee_id, employees.name AS employee_name, employees.department AS employee_department,
       projects.id AS project_id, projects.project_name, projects.budget AS project_budget
FROM employees
JOIN projects ON employees.id = projects.employee_id
"""

# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

# Display the result
print("Query with column renaming in SQL:")
display(df)

Output:

Screenshot-2024-11-22-211217
Duplicate columns when querying SQLAlchemy into Pandas DF

Example 2: Here we will be using two tables to join them, fetch the data and rename the common column names using rename function of Pandas - Renaming Columns in the DataFrame Using rename()

Python
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the schema (metadata) for the tables
metadata = MetaData()

# Define the 'employees' table
employees = Table('employees', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('name', String),
                  Column('department', String))

# Define the 'projects' table
projects = Table('projects', metadata,
                 Column('id', Integer, primary_key=True),
                 Column('employee_id', Integer),
                 Column('project_name', String),
                 Column('budget', Integer))

# Step 3: Create the tables in the database
metadata.create_all(engine)

# Step 4: Insert sample data into the tables
with engine.connect() as conn:
    conn.execute(employees.insert(), [
        {'id': 1, 'name': 'Sophia', 'department': 'Engineering'},
        {'id': 2, 'name': 'Liam', 'department': 'Marketing'}
    ])
    conn.execute(projects.insert(), [
        {'id': 1, 'employee_id': 1, 'project_name': 'Project A', 'budget': 100000},
        {'id': 2, 'employee_id': 2, 'project_name': 'Project B', 'budget': 150000}
    ])
    conn.commit()

# Step 5: Query data from both tables
query = """
SELECT employees.id, employees.name, employees.department,
       projects.id AS project_id, projects.project_name, projects.budget
FROM employees
JOIN projects ON employees.id = projects.employee_id
"""

# Step 6: Load the query result into a Pandas DataFrame
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

# Step 7: Display the columns before renaming
print("Columns before renaming:")
print(df.columns)

# Step 8: Rename columns in the DataFrame using rename() based on actual column names
df = df.rename(columns={
    'id': 'employee_id',  # Rename the first 'id' from employees
    'project_id': 'project_id',  # 'project_id' already has the correct name
    'name': 'employee_name',  # Rename 'name' from employees to 'employee_name'
    'department': 'employee_department',  # Rename 'department' from employees
    'project_name': 'project_name',  # Rename 'project_name' (no change needed)
    'budget': 'project_budget'  # Rename 'budget' from projects
})

# Step 9: Display the final DataFrame
print("\nQuery with renaming columns in DataFrame using rename():")
print(df)

Output:

Screenshot-2024-11-22-212620
Duplicate columns when querying SQLAlchemy into Pandas DF

Example 3: Here we will be using ORM model to establish the relationship without writing the SQL codes - Using ORM

Python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship
import pandas as pd

# Step 1: Set up SQLAlchemy engine for an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')  # In-memory SQLite database

# Step 2: Define the base class for ORM
Base = declarative_base()

# Define the 'Employee' class (ORM for employees table)
class Employee(Base):
    __tablename__ = 'employees'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    department = Column(String)
    projects = relationship('Project', backref='employee')  # Relationship to projects

# Define the 'Project' class (ORM for projects table)
class Project(Base):
    __tablename__ = 'projects'
    id = Column(Integer, primary_key=True)
    project_name = Column(String)
    budget = Column(Integer)
    employee_id = Column(Integer, ForeignKey('employees.id'))

# Step 3: Create the tables in the database
Base.metadata.create_all(engine)

# Step 4: Insert sample data into the tables using ORM
Session = sessionmaker(bind=engine)
session = Session()

employee1 = Employee(name='Sophia', department='Engineering', projects=[Project(project_name='Project A', budget=100000)])
employee2 = Employee(name='Liam', department='Marketing', projects=[Project(project_name='Project B', budget=150000)])

session.add(employee1)
session.add(employee2)
session.commit()

# Step 5: Query using ORM and join the tables
query = session.query(Employee.id.label('employee_id'), Employee.name.label('employee_name'),
                      Employee.department.label('employee_department'),
                      Project.id.label('project_id'), Project.project_name, Project.budget.label('project_budget'))\
               .join(Project)

# Step 6: Load the query result into a Pandas DataFrame
df = pd.read_sql(query.statement, session.bind)

# Display the result
print("Query using ORM to avoid duplicate columns:")
print(df)

Output:

Screenshot-2024-11-22-212932
Duplicate columns when querying SQLAlchemy into Pandas DF

Next Article
Practice Tags :

Similar Reads