Pysparkdump

The document contains questions and answers related to Apache Spark. It covers topics like RDDs, DataFrames, caching, partitioning, dynamic allocation, and Spark Streaming. Many questions focus on transformations and actions that can be performed on RDDs and DataFrames as well as features and capabilities of the Spark platform.

Uploaded by

VIDHYA HK
Copyright
© All Rights Reserved

1) Sfpd RDD

Ans; val pairs = sfpd.map(x=>x.parallelize)

2) repartition(5) is equivalent to coalesce(5, shuffle=True)? T or F

Ans; TRUE

3) Which is true for running spark on Hadoop YARN?

Ans; there are two deploy modes, client and cluster

4) What is dynamic allocation?

Ans; dynamic allocation is a property that lets executors be released back to the cluster resource pool if they are idle for a specified period of time

5) Accumulator values that have been incremented can be read from Spark workers? T or F

Ans; FALSE

6) The keys transformation returns an RDD with ordered keys from a key-value pair RDD? T or F

Ans; FALSE (keys returns the keys in their existing order; it does not sort them)

7) groupByKey is less efficient than reduceByKey? T or F

Ans; TRUE (reduceByKey combines values within each partition before the shuffle)
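
A rough sketch of the difference behind question 7, in plain Python rather than Spark. The partition contents and both helper functions are made up for illustration; the point is only that combining inside each partition first sends fewer records across the shuffle.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffle_group_by_key(parts):
    """groupByKey-style model: every record crosses the shuffle unchanged."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)

def shuffle_reduce_by_key(parts, func):
    """reduceByKey-style model: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals, len(shuffled)

group_totals, group_shuffled = shuffle_group_by_key(partitions)
reduce_totals, reduce_shuffled = shuffle_reduce_by_key(partitions, lambda a, b: a + b)

# Same result, fewer shuffled records in the reduceByKey-style model.
print(group_totals == reduce_totals, group_shuffled, reduce_shuffled)  # True 7 4
```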

8) Which partitioner class is used to order keys according to the sort order of their type?

Ans; RangePartitioner
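
The idea of range partitioning can be sketched in plain Python. This is not Spark's RangePartitioner, which picks its boundaries by sampling the data; here the boundaries and keys are hard-coded assumptions, and range_partition is a hypothetical helper.

```python
from bisect import bisect_left

def range_partition(key, bounds):
    """Return the partition index for key, given sorted upper bounds."""
    return bisect_left(bounds, key)

# Assumed boundaries: partition 0 holds keys <= 10, partition 1 holds
# keys in 11..20, partition 2 holds everything larger.
bounds = [10, 20]
keys = [25, 3, 10, 17, 20, 42, 11]

parts = [[] for _ in range(len(bounds) + 1)]
for k in keys:
    parts[range_partition(k, bounds)].append(k)

# Sorting each partition and concatenating yields a globally sorted list,
# which is why this kind of partitioner suits ordered keys.
globally_sorted = [k for part in parts for k in sorted(part)]
print(parts)            # [[3, 10], [17, 20, 11], [25, 42]]
print(globally_sorted)  # [3, 10, 11, 17, 20, 25, 42]
```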

9) The primary ML API for Spark is now the ____-based API?

Ans; DataFrame

10) An existing RDD unhcrRDD contains refugee data?

Ans; val country = unhcrRDD.map(x=>(x(0),x(3))).reduceByKey((a,b)=>a+b)

11) The number of stages in a job is usually the number of RDDs in the DAG, but the scheduler can truncate the lineage when?

Ans; the RDD is cached or persisted

12) Combining a set of filtered edges and filtered vertices from a graph creates what structure?

Ans; subgraph

13) Which RDD function returns max, min, count, mean and standard deviation?

Ans; stats()
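
As a rough picture of the summary such a call gives on a numeric RDD, here is a plain-Python sketch with the standard library. The sample values are made up, and the population standard deviation is used on the assumption that it matches the stdev figure in Spark's summary.

```python
import statistics

# Hypothetical numeric data standing in for the contents of an RDD.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "count": len(values),
    "mean": statistics.fmean(values),
    "stdev": statistics.pstdev(values),  # population standard deviation
    "max": max(values),
    "min": min(values),
}
print(summary)
```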

14) Are Spark broadcast variables the same as variables set in the driver program in PySpark?

Ans;

15) Which of the following in Scala will give the top 10 resolutions, assuming sfpdDF is a DataFrame registered as table sfpd?

Ans; sqlContext.sql("SELECT resolution, count(incidentnum) AS inccount FROM sfpd GROUP BY resolution ORDER BY inccount DESC LIMIT 10")

or

sfpdDF.groupBy("resolution").count.sort……….show(10)
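
The shape of the SQL form (group, count, order descending, limit) can be tried without Spark. The sketch below runs the same kind of statement through Python's built-in sqlite3 module as a stand-in for sqlContext; the table rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfpd (incidentnum TEXT, resolution TEXT)")

# Hypothetical rows; real SFPD data has many more columns.
rows = [
    ("1", "NONE"), ("2", "ARREST"), ("3", "NONE"),
    ("4", "NONE"), ("5", "ARREST"), ("6", "UNFOUNDED"),
]
conn.executemany("INSERT INTO sfpd VALUES (?, ?)", rows)

query = (
    "SELECT resolution, count(incidentnum) AS inccount "
    "FROM sfpd GROUP BY resolution "
    "ORDER BY inccount DESC LIMIT 10"
)
top = conn.execute(query).fetchall()
conn.close()

print(top)  # [('NONE', 3), ('ARREST', 2), ('UNFOUNDED', 1)]
```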

16) Given the pair RDD country that contains tuples (country, count), which one gets the country with the lowest refugee count in Scala?

Ans; val low = country.map(x=>(x._2,x._1)).sortByKey().first
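
The swap-then-sort idea in question 16 can be sketched in plain Python. The names and counts are placeholders, not real figures. Sorting in ascending order puts the lowest count first; a descending sort would return the highest instead.

```python
# Hypothetical (country, count) pairs standing in for the pair RDD.
country = [("A", 120), ("B", 15), ("C", 60)]

# Swap to (count, country) so that the count becomes the sort key.
swapped = [(count, name) for name, count in country]

lowest = sorted(swapped)[0]                 # ascending: smallest count first
highest = sorted(swapped, reverse=True)[0]  # descending: largest count first

print(lowest)   # (15, 'B')
print(highest)  # (120, 'A')
```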

17) Which parameters are required for a windowed operation such as reduceByKeyAndWindow?

Ans; window length and sliding interval
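
How the two parameters interact can be modelled in plain Python. In this sketch, reduce_by_window is a hypothetical helper and both parameters are counted in batches rather than seconds; every slide_interval batches it reduces the values per key over the last window_length batches.

```python
def reduce_by_window(batches, window_length, slide_interval, func):
    """Model of a windowed per-key reduce over a list of micro-batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        totals = {}
        for batch in window:
            for key, value in batch:
                totals[key] = func(totals[key], value) if key in totals else value
        results.append(totals)
    return results

# Hypothetical stream: four micro-batches of (key, value) pairs.
batches = [
    [("a", 1)],
    [("a", 1), ("b", 1)],
    [("b", 1)],
    [("a", 1)],
]

# Window of 3 batches, evaluated every 2 batches.
windows = reduce_by_window(batches, 3, 2, lambda x, y: x + y)
print(windows)  # [{'a': 2, 'b': 1}, {'b': 2, 'a': 2}]
```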

18) What are some of the things you can monitor in the Spark web UI?

Ans; All of the above

19) Which of the following is not a feature of Spark?

Ans; it is cost efficient

20) How do you enable dynamic allocation?

Ans; spark.dynamicAllocation.enabled=true
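
One way the property might be passed is as --conf flags to spark-submit. The sketch below only builds and prints such a command line. The script name app.py is hypothetical, the idle timeout value is an assumption, and the external shuffle service setting is included because dynamic allocation is commonly paired with it.

```python
# Settings related to dynamic allocation; values here are assumptions.
conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

args = ["spark-submit"]
for key, value in conf.items():
    args += ["--conf", f"{key}={value}"]
args.append("app.py")  # hypothetical application script

command_line = " ".join(args)
print(command_line)
```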

21) Which of the below removes the broadcast variable bvar from memory?

Ans; bvar.unpersist()

22) A DataFrame can be created from an existing RDD. In which case would you create the DataFrame from the existing RDD by inferring the schema using case classes?

Ans; if all your users are going to need the dataset parsed in the same way

23) A DStream internally is?

Ans; a continuous sequence of RDDs

24) What does the MEMORY_AND_DISK_SER storage level option for an RDD use?

Ans; in memory, on disk, serialized

25) Which partition sizes hinder Spark performance?

Ans; Both too small and too large

26) Which DataFrame method is used to remove a column from the resultant DataFrame?

Ans; drop()

27) What is the difference between foreach and map?

Ans; foreach is an action and map is a transformation

28) What is the difference between take(1) and first()?

Ans; take(1) returns an array with one element from an RDD; first() returns the element itself, not wrapped in an array

29) Caching can use disk if memory is not available. T or F

Ans; TRUE

30) Spark SQL translates commands into code that is processed by?

Ans; executor nodes

31) Which of the following is true for a Spark application on Hadoop YARN?

Ans; there are two deploy modes, client and cluster mode

32) Apache Spark has APIs in?

Ans; All of the above

33) PySpark is a cluster computing framework that runs on a cluster of commodity hardware and performs data unification. T or F

Ans;

34) Which function is used to call a program written in shell script/Perl from PySpark?

Ans; pipe()
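
The general mechanism behind this answer (feed elements to an external program on stdin, one per line, and read its stdout lines back) can be modelled with the standard library. In this sketch, pipe_partition is a hypothetical helper, and a Python one-liner stands in for the shell or Perl script so that the example stays self-contained.

```python
import subprocess
import sys

def pipe_partition(elements, command):
    """Send elements to command on stdin, one per line; return stdout lines."""
    completed = subprocess.run(
        command,
        input="\n".join(elements) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.splitlines()

# Stand-in for an external script: upper-case every input line.
command = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: print(line.strip().upper())",
]

result = pipe_partition(["spark", "rdd"], command)
print(result)  # ['SPARK', 'RDD']
```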

35) ___ leverages Spark Core's fast scheduling capability for streaming analytics?

Ans; Spark Streaming

36) We can create a DataFrame using?

Ans; All of the above

37) Which DStream output operation is used to write output to the console?

Ans; pprint()

38) Which of the following is not a feature of Spark?

Ans; it is cost efficient

39) Some ways of improving the performance of your Spark app include?

Ans; All of the above

40) Dataset was introduced in which Spark release?

Ans; Spark 1.6
