Week-7
Split lines from the textbook into arrays of words and perform
operations using select().
Aim: To write a PySpark program that splits lines from the textbook into arrays
of words and performs operations on them using select().
Description:
The split() function from pyspark.sql.functions splits a string column into an
array of substrings and returns the result as a new array column. It does not
change the original column. The separator is treated as a regular expression. If a
single space (" ") is used as the separator, each line is split between words.
Dataset: Text.txt, a plain-text file containing lines from the textbook.
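Before running the program, the effect of the separator can be tried without a Spark session. The sketch below is only an illustration: it uses Python's re.split as a stand-in for what split() does to a single row, on the assumption that both treat the separator as a regular expression (the two regex dialects differ in some details). The sample lines and the helper split_line are hypothetical and are not part of the dataset.

```python
import re

# Hypothetical sample lines standing in for rows of the text file.
sample_lines = [
    "Spark splits each line",
    "two  spaces here",   # note the two consecutive spaces
    "",                   # an empty line
]


def split_line(line, pattern=" "):
    """Split one line on a regex pattern (stand-in for split() applied to one row)."""
    return re.split(pattern, line)


# Split on a single space, as the program does.
single_space = [split_line(line) for line in sample_lines]

# Split on runs of whitespace instead.
whitespace_runs = [split_line(line, r"\s+") for line in sample_lines]

for line, words in zip(sample_lines, single_space):
    print(repr(line), "->", words)
```

With a single space as the separator, two consecutive spaces produce an empty word, and an empty line produces an array holding one empty string. Both cases affect any later word count. Splitting on \s+ avoids the empty words inside a line.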
Program:
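from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import split
# Create (or reuse) a Spark session instead of importing one from pyspark.shell
spark = SparkSession.builder.getOrCreate()
# Read the file; each line becomes one row in a single column named "value"
book = spark.read.text("Text.txt")
book.show()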
# Split each line on a single space into an array column named "words"
lines = book.withColumn("words", split(book["value"], " "))
lines.show()
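# monotonically_increasing_id() is increasing but not consecutive across partitions,
# so number the rows 1..N with row_number() ordered by it
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
linesID = lines.withColumn("ID", row_number().over(Window.orderBy(monotonically_increasing_id())))
linesID.show()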
#lines.select("words").head(5)
# First five lines
linesID.select("value", "ID").filter(linesID.ID <= 5).show()
# Five lines around the middle of the file
middleRowID = lines.count() // 2
linesID.select("value", "ID").filter(
    (linesID.ID >= middleRowID - 2) & (linesID.ID <= middleRowID + 2)
).show()
# Odd-numbered lines
linesID.select("value", "ID").filter(linesID.ID % 2 != 0).show()
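from pyspark.sql.functions import size
# Count the words on each line as the length of the "words" array
word_count_df = lines.select('words', size('words').alias('word_count'))
word_count_df.show()
# Keep only the lines that contain exactly 19 words
filtered_df = word_count_df.filter(word_count_df.word_count == 19)
filtered_df.show(truncate=False)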
from pyspark.sql.functions import expr
# Keep only the three-character words of each line (higher-order filter on the array)
filtered_lines = lines.withColumn("3CharacterWords", expr("filter(words, word -> length(word) = 3)"))
filtered_lines.select("3CharacterWords").show()
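
The rows selected by the program can be cross-checked on a small input with plain Python lists. The sketch below is a stand-in written under stated assumptions, not part of the Spark program: the function expected_selections and the sample lines are hypothetical, line numbers start at 1 as in the ID column, and str.split(" ") is used in place of the Spark split().

```python
def expected_selections(lines, target_word_count=19):
    """Compute, in plain Python, the selections the Spark program performs."""
    # Number the lines from 1 and split each one on a single space.
    rows = [(i, line, line.split(" ")) for i, line in enumerate(lines, start=1)]
    middle = len(rows) // 2
    return {
        # Lines with ID <= 5
        "first_five": [line for i, line, _ in rows if i <= 5],
        # Lines with ID between middle - 2 and middle + 2
        "middle_five": [line for i, line, _ in rows if middle - 2 <= i <= middle + 2],
        # Lines with an odd ID
        "odd_rows": [line for i, line, _ in rows if i % 2 != 0],
        # Lines whose word array has exactly target_word_count entries
        "with_target_count": [line for _, line, words in rows
                              if len(words) == target_word_count],
        # Three-character words of every line
        "three_char_words": [[w for w in words if len(w) == 3] for _, _, words in rows],
    }


# Hypothetical ten-line input; every line has five words.
sample = [f"line {n} has the cat" for n in range(1, 11)]
result = expected_selections(sample, target_word_count=5)

for name, value in result.items():
    print(name, "->", value)
```

Running the Spark program on a file with the same lines should select the same rows, provided the ID column numbers the lines consecutively from 1.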