How to read hive partitioned table via pyspark
Last Updated: 16 Oct, 2023
Hive is built on Apache Hadoop and uses a high-level query language called HiveQL (similar to SQL) to retrieve large-scale data stored in distributed storage systems such as the Hadoop Distributed File System (HDFS).
In this article, we will learn how to read a Hive Partitioned Table using Pyspark.
Reading Hive Partitioned Table using Pyspark
A Hive table is a logical representation of structured data stored in the Apache Hive data warehouse system. Before reading a Hive partitioned table using Pyspark, we first need to have such a table.
Let us see the process of creating and reading a Hive Partitioned Table using Pyspark in Python.
Create a Hive Partitioned Table
First, let us create a hive-partitioned table.
Step 1: Import modules
Import the necessary modules that are required to create a hive table in Pyspark.
from pyspark.sql import SparkSession
Step 2: Create a SparkSession
Create a SparkSession with Hive support enabled using the enableHiveSupport() method.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
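By default, managed tables created this way are stored under a spark-warehouse directory in the current working directory. If you want them to land somewhere else, you can optionally set the warehouse location while building the session. A minimal sketch, where the app name and path are only illustrative:
# optional: control where managed tables are stored
spark = (
    SparkSession.builder
    .appName("HivePartitionedTableDemo")  # hypothetical app name
    .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")  # hypothetical path
    .enableHiveSupport()
    .getOrCreate()
)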
Step 3: Create a DataFrame with sample data
Define the sample data as a Python list of tuples. Create a DataFrame using the sample data and specify the column names.
data = [("John", 25, "Male"), ("Alice", 30, "Female"),]
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])
Step 4: Create a partitioned table and save it
Now, create a partitioned Hive table from the DataFrame using the 'write.partitionBy()' function, which takes as its parameter the name of the column on which the table should be partitioned. Then save the table using the 'saveAsTable()' function, which takes the table name as its parameter.
table_name = "tablename"
df.write.partitionBy("column_name").saveAsTable(table_name)
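Under the hood, partitionBy() creates one sub-directory per distinct value of the partition column inside the table's storage location. Assuming the default warehouse directory, the layout looks roughly like this (the part file names are illustrative and will differ on your machine):
spark-warehouse/
└── tablename/
    ├── column_name=value1/
    │   └── part-00000-...snappy.parquet
    └── column_name=value2/
        └── part-00000-...snappy.parquet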
Step 5: Display the Partitioned Data
Finally, call show() on the DataFrame to display the data that was written to the table.
df.show()
Example:
In this example, we start by importing the SparkSession class from the pyspark.sql module. We then create the SparkSession and define a sample DataFrame 'df' with a larger dataset. It contains three columns: "Name", "Age", and "Gender".
Python3
# pyspark code to create a hive partitioned table
# importing modules
from pyspark.sql import SparkSession
# Create a SparkSession with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Create a DataFrame with sample data
data = [
    ("John", 25, "Male"),
    ("Alice", 30, "Female"),
    ("Bob", 27, "Male"),
    ("Emma", 28, "Female"),
    ("David", 32, "Male"),
    ("Sophia", 29, "Female"),
    ("James", 31, "Male"),
    ("Olivia", 26, "Female"),
    ("Ethan", 33, "Male"),
    ("Isabella", 27, "Female"),
]
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])
table_name = "my_partitioned"
# Create a partitioned Hive table
df.write.partitionBy("Gender").saveAsTable(table_name)
# display the table
df.show()
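Note that saveAsTable() raises an error if the table already exists, so running this script a second time will fail. If you want the example to be re-runnable, you can pass an explicit save mode, as in this small sketch:
# overwrite the table if it already exists,
# so the script can be re-run safely
df.write.mode("overwrite").partitionBy("Gender").saveAsTable(table_name)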
Output:
Partitioned Hive Table
Show the Hive Table Partitions
We can use the 'spark.sql()' function to execute a SQL query. It takes the SQL query string as its parameter. The query is as follows:
"SHOW PARTITIONS table_name"
which retrieves the list of partitions for the Hive table.
Python3
# pyspark code to show the partitions
# Show the partitions in the Hive table
spark.sql("SHOW PARTITIONS {}".format(table_name)).show()
Output:
Hive Table Partitions
Read Hive Partitions from the Table
To read a hive partitioned table, we will use the spark.sql() function to execute a SQL query. The query is as follows:
"SELECT * FROM tablename WHERE condition"
which selects all columns (*) from the Hive table for every row where the condition is true. The result is stored in the specific_partitions_df DataFrame.
Example: Reading single Partition of the hive table
In this example, we will read a single hive partition from the table. We will display the partitioned table where the gender is Male.
Python3
# pyspark code to show specific partitions
# Read specific partitions from the Hive table
specific_partitions_df = spark.sql(
    "SELECT * FROM {} WHERE Gender = 'Male'".format(table_name))
# Display the contents of the dataframe
specific_partitions_df.show()
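Since Gender is the partition column, Spark reads only the Gender=Male sub-directory rather than scanning the whole table (partition pruning). You can check this by printing the physical plan; the exact text varies by Spark version, but it should contain a PartitionFilters entry restricting the scan to Gender = 'Male':
# print the physical plan and look for the
# PartitionFilters entry on the Gender column
specific_partitions_df.explain()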
Output:
Reading Specific partition of Hive Table
Example: Reading multiple Partitions of the hive table
In this example, we will read multiple hive partitions from the table. We will display the partitioned table where the gender is Male as well as Female.
Python3
# pyspark code to display multiple partitions from the table.
# in this case, this will be all the data from the hive table
# Read multiple partitions from the Hive table
multiple_partitions_df = spark.sql(
    "SELECT * FROM {} WHERE Gender IN ('Male', 'Female')".format(table_name))
# Displaying contents of the hive table
multiple_partitions_df.show()
Output:
Reading multiple partition of hive table
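As a side note, the same partitions can be read without writing SQL by loading the table through the DataFrame API and filtering on the partition column; Spark applies the same partition pruning either way. A minimal sketch:
from pyspark.sql.functions import col

# read the Hive table via the DataFrame API; the filter on the
# partition column prunes partitions just like the SQL query above
male_df = spark.read.table(table_name).filter(col("Gender") == "Male")
male_df.show()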