Week7 Bda


Week-7

Split lines from the textbook into arrays of words and perform
operations using select().
Aim: To write a program that splits lines from a textbook into arrays of words
and performs operations on them using select().
Description:
The split() function in pyspark.sql.functions splits a string column into an array
of substrings and returns the result as a new array column. It does not change the
original column. The separator is treated as a regular expression; if a single
space (" ") is used as the separator, each line is split between words.
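The effect of the separator can be previewed without Spark. The following is a plain-Python sketch, not the PySpark API: the helper names and the sample line are hypothetical, and the second helper strips the line first, which Spark would not do on its own. It shows that splitting on a single space yields empty strings wherever two spaces are adjacent, while a whitespace-run pattern does not.

```python
import re


def split_on_space(line):
    # Analogue of a " " separator: every single space is a boundary,
    # so two adjacent spaces produce an empty string between them.
    return line.split(" ")


def split_on_whitespace_runs(line):
    # Analogue of a regex separator such as "\\s+": a run of whitespace
    # counts as one boundary. The strip() call is an added assumption.
    return re.split(r"\s+", line.strip())


sample = "Big  data is fun"  # hypothetical line containing a double space
print(split_on_space(sample))            # ['Big', '', 'data', 'is', 'fun']
print(split_on_whitespace_runs(sample))  # ['Big', 'data', 'is', 'fun']
```

Empty strings count as array elements, so they would also be counted by a per-line word count based on the array size.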
Dataset:

Program:
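from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import split

# Create (or reuse) a Spark session. The application name is arbitrary.
spark = SparkSession.builder.appName("Week7").getOrCreate()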

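# Read the text file; each line becomes one row in the "value" column.
book = spark.read.text("Text.txt")
book.show()

# Split each line on single spaces into an array column named "words".
lines = book.withColumn("words", split(book["value"], " "))
lines.show()

# Add a 1-based line ID. The IDs are consecutive only within one partition,
# which is assumed here for a small input file.
linesID = lines.withColumn("ID", monotonically_increasing_id() + 1)
linesID.show()
# Alternative: lines.select("words").head(5)

# First five lines.
linesID.select("value", "ID").filter(linesID.ID <= 5).show()

# Five lines around the middle of the file.
middleRowID = lines.count() // 2
linesID.select("value", "ID").filter(
    (linesID.ID >= middleRowID - 2) & (linesID.ID <= middleRowID + 2)
).show()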

# Lines with odd IDs (1st, 3rd, 5th, ...).
linesID.select("value", "ID").filter(linesID.ID % 2 != 0).show()

from pyspark.sql.functions import size

# Count the words on each line, then keep only the lines with exactly 19 words.
word_count_df = lines.select("words", size("words").alias("word_count"))
word_count_df.show()
filtered_df = word_count_df.filter(word_count_df.word_count == 19)
filtered_df.show(truncate=False)

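from pyspark.sql.functions import expr

# Keep only the three-character words from each line's "words" array.
# The SQL expression must stay inside a single string literal.
filtered_lines = lines.withColumn(
    "3CharacterWords", expr("filter(words, word -> length(word) = 3)")
)
filtered_lines.select("3CharacterWords").show()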
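
The ID filters in the program rely on the ID values running 1, 2, 3, and so on. Spark documents monotonically_increasing_id() as unique and increasing but not necessarily consecutive: the current implementation puts the partition index in the upper 31 bits and the row number within the partition in the lower 33 bits. The sketch below is a plain-Python simulation of that layout, not Spark code; simulated_ids and its argument are hypothetical names used only for illustration.

```python
ROW_BITS = 33  # the lower 33 bits hold the row number within a partition


def simulated_ids(partition_sizes):
    """Return the IDs rows would receive, given the row count per partition."""
    ids = []
    for partition_index, row_count in enumerate(partition_sizes):
        base = partition_index << ROW_BITS  # partition index in the upper bits
        ids.extend(base + row for row in range(row_count))
    return ids


print(simulated_ids([4]))     # [0, 1, 2, 3]
print(simulated_ids([2, 2]))  # [0, 1, 8589934592, 8589934593]
```

With a single partition the IDs are consecutive, so the program behaves as intended on a small file. If the input were read into several partitions, a filter such as ID <= 5 would likely miss rows; numbering rows with row_number() over an ordered window is one common alternative in that case.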

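To check the Spark output on a small input by hand, the same selections can be written in plain Python. This is only a reference sketch under stated assumptions: summarize_lines, its parameter names, and the sample lines are hypothetical, and the lines are numbered from 1 in file order, which matches the program only when the IDs are consecutive.

```python
def summarize_lines(lines, target_word_count):
    """Plain-Python counterpart of the DataFrame selections in the program."""
    words = [line.split(" ") for line in lines]   # like split(value, " ")
    numbered = list(enumerate(lines, start=1))    # 1-based line IDs
    middle = len(lines) // 2                      # like lines.count() // 2
    return {
        "first_five": [text for i, text in numbered if i <= 5],
        "middle": [text for i, text in numbered
                   if middle - 2 <= i <= middle + 2],
        "odd": [text for i, text in numbered if i % 2 != 0],
        "with_word_count": [w for w in words if len(w) == target_word_count],
        "three_char_words": [[x for x in w if len(x) == 3] for w in words],
    }


# Hypothetical sample lines, used only to exercise the helper.
sample = [
    "the cat sat",
    "on a mat",
    "and the dog ran off",
    "far",
    "away",
    "to the big red barn",
    "today",
]
result = summarize_lines(sample, target_word_count=3)
for name, value in result.items():
    print(name, value)
```

The list comprehension behind three_char_words plays the role of the filter(words, word -> length(word) = 3) expression: it keeps the array structure per line and drops the words that do not match.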