1) Sfpd RDD
Ans; val pairs = sfpd.map(x=>x.parallelize)
2) repartition(5) is the same as coalesce(5, shuffle = true). T or F?
Ans; TRUE
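As a rough illustration of question 2, the sketch below compares the partition counts produced by the two calls. It assumes a local SparkSession and a made-up RDD of 100 integers; the object name, app name and data are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {
  // Returns the partition counts after repartition(5), coalesce(5, shuffle = true)
  // and coalesce(5) without a shuffle.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Hypothetical data: 100 integers spread over 2 partitions.
    val rdd = sc.parallelize(1 to 100, numSlices = 2)
    val viaRepartition = rdd.repartition(5).getNumPartitions
    val viaShuffle = rdd.coalesce(5, shuffle = true).getNumPartitions
    // Without a shuffle, coalesce cannot increase the partition count.
    val noShuffle = rdd.coalesce(5).getNumPartitions
    (viaRepartition, viaShuffle, noShuffle)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("repartition-sketch").master("local[2]").getOrCreate()
    println(partitionCounts(spark.sparkContext)) // (5,5,2)
    spark.stop()
  }
}
```

The third value shows why the shuffle flag matters: a plain coalesce only merges partitions, so it cannot grow 2 partitions into 5.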
3) Which is true for running Spark on Hadoop YARN?
Ans; there are two deploy modes: client and cluster
4) What is dynamic allocation?
Ans; dynamic allocation is a property where executors can be released back to the cluster
resource pool if they are idle for a specified period of time
5) Accumulators can be incremented by and read from Spark workers? T or F?
Ans; FALSE
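A minimal sketch of why question 5 is FALSE: tasks running on the workers only add to an accumulator, and the total is read on the driver. The local SparkSession, the accumulator name and the sample range are assumptions made for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object AccumulatorSketch {
  // Counts the even numbers in 1..10 with an accumulator.
  def countEvens(sc: SparkContext): Long = {
    val evens = sc.longAccumulator("evens")
    // Tasks on the workers may only add to the accumulator.
    sc.parallelize(1 to 10).foreach(n => if (n % 2 == 0) evens.add(1))
    // The accumulated value is read back on the driver.
    evens.value
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("accumulator-sketch").master("local[2]").getOrCreate()
    println(countEvens(spark.sparkContext)) // 5
    spark.stop()
  }
}
```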
6) The keys transformation returns an RDD with ordered keys from a key-value pair RDD? T or
F
Ans; FALSE (keys returns the key of each pair without sorting)
7) groupByKey is less efficient than reduceByKey?
Ans; TRUE
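The sketch below puts the two operations from question 7 side by side on a tiny, made-up pair RDD. Both produce the same sums; the usual efficiency argument is that reduceByKey combines values inside each partition before the shuffle, while groupByKey sends every value across the network. The object name and data are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object GroupVsReduce {
  // Returns the per-key sums computed with reduceByKey and with groupByKey.
  def sums(sc: SparkContext): (Map[String, Int], Map[String, Int]) = {
    // Hypothetical (key, count) pairs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
    // reduceByKey merges values per partition first, then shuffles partial sums.
    val reduced = pairs.reduceByKey(_ + _).collect().toMap
    // groupByKey shuffles every value, then the sum happens after the shuffle.
    val grouped = pairs.groupByKey().mapValues(_.sum).collect().toMap
    (reduced, grouped)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("group-vs-reduce-sketch").master("local[2]").getOrCreate()
    val (reduced, grouped) = sums(spark.sparkContext)
    println(reduced == grouped) // true: both give a -> 4, b -> 6
    spark.stop()
  }
}
```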
8) Which partitioner class is used to order keys according to the sort order for the given type?
Ans; RangePartitioner
9) The primary ML API for Spark is now the ____-based API?
Ans; DataFrame
10) An existing RDD unhcrRDD contains refugee?
Ans; val country = unhcrRDD.map(x=>(x(0),x(3))).reduceByKey((a,b)=>a+b)
11) The number of stages in a job is the number of RDDs in the DAG, but the scheduler can truncate the lineage when?
Ans; an RDD is cached or persisted
12) Combining a set of filtered edges and filtered vertices from a graph creates what structure?
Ans; subgraph
13) Which RDD function returns max, min, count, mean, and standard deviation?
Ans; stats()
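For question 13, a small sketch of stats() on an RDD of doubles. It assumes a local SparkSession and four made-up values; the returned StatCounter exposes count, mean, min, max and the population standard deviation (stdev).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.StatCounter

object StatsSketch {
  // Summary statistics for a hypothetical RDD[Double].
  def summarize(sc: SparkContext): StatCounter =
    sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).stats()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("stats-sketch").master("local[2]").getOrCreate()
    val s = summarize(spark.sparkContext)
    println(s.count) // 4
    println(s.mean)  // 2.5
    println(s.min)   // 1.0
    println(s.max)   // 4.0
    println(s.stdev) // population standard deviation, about 1.118
    spark.stop()
  }
}
```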
14) Are Spark broadcast variables the same as variables set in the driver program in PySpark?
Ans;
15) Which of the following in Scala will give the top 10 resolutions, assuming sfpdDF is a DataFrame
registered as the table sfpd?
Ans; sqlContext.sql("SELECT resolution, count(incidentnum) AS inccount FROM sfpd GROUP BY resolution ORDER BY inccount DESC LIMIT 10")
or
sfpdDF.groupBy("resolution").count.sort……….show(10)
16) Given the pair RDD country that contains tuples (country, count), which one gets the country with the lowest
refugee count in Scala?
Ans; val low = country.map(x=>(x._2,x._1)).sortByKey().first()
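A hedged sketch of the approach in question 16: swap each tuple so the count becomes the key, sort ascending, and take the first element. Passing false to sortByKey would sort descending and return the highest count instead. The country names and counts below are invented for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LowestCount {
  // Returns (count, country) for the smallest count.
  def lowest(sc: SparkContext): (Int, String) = {
    // Hypothetical (country, count) tuples.
    val country = sc.parallelize(Seq(("A", 300), ("B", 50), ("C", 120)))
    // Swap to (count, country), sort ascending by count, take the first pair.
    country.map(x => (x._2, x._1)).sortByKey().first()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("lowest-count-sketch").master("local[2]").getOrCreate()
    println(lowest(spark.sparkContext)) // (50,B)
    spark.stop()
  }
}
```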
17) Which parameters are required for a windowed operation such as reduceByKeyAndWindow?
Ans; window length and sliding interval
18) What are some of the things you can monitor in the Spark web UI?
Ans; All of the above
19) Which of the following is not a feature of Spark?
Ans; it is cost efficient
20) How do you enable dynamic allocation?
Ans; spark.dynamicAllocation.enabled=true
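One possible way to set the property from question 20 in code is through SparkConf, sketched below. The app name and the 60s idle timeout are illustrative assumptions; the external shuffle service setting is commonly enabled alongside dynamic allocation on YARN, but whether it is needed depends on the cluster setup.

```scala
import org.apache.spark.SparkConf

object DynamicAllocationConf {
  // Builds a configuration with dynamic allocation switched on.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("dynamic-allocation-sketch") // hypothetical app name
      .set("spark.dynamicAllocation.enabled", "true")
      // Often enabled together with dynamic allocation so shuffle files
      // survive executor removal (assumption about the deployment).
      .set("spark.shuffle.service.enabled", "true")
      // Idle period after which an executor may be released (illustrative value).
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

  def main(args: Array[String]): Unit =
    println(build().get("spark.dynamicAllocation.enabled")) // true
}
```

The same keys can also be passed on the command line with spark-submit --conf key=value.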
21) Which of the below removes the broadcast variable bvar from memory?
Ans; bvar.unpersist()
22) A DataFrame can be created from an existing RDD. You would create a DataFrame from an existing
RDD by inferring the schema using case classes in which case?
Ans; if all your users are going to need the dataset parsed in the same way
23) Internally, a DStream is?
Ans;
24) MEMORY_AND_DISK_SER storage level options in RDD?
Ans; in memory, on disk, serialized
25) Which partition sizes hinder Spark performance?
Ans; Both small and large
26) Which DataFrame method is used to remove a column from the resultant DataFrame?
Ans; drop()
27) What is the difference between foreach and map?
Ans; foreach is an action and map is a transformation
28) What is the difference between take(1) and first()?
Ans; take(1) returns an array with one element from an RDD; first() returns that element itself, not
in an array
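The difference in question 28 shows up in the return types, as in this small sketch. It assumes a local SparkSession and a made-up RDD of three integers.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TakeVsFirst {
  // Returns the result of take(1) and first() on the same RDD.
  def both(sc: SparkContext): (Array[Int], Int) = {
    // Hypothetical data.
    val rdd = sc.parallelize(Seq(10, 20, 30))
    val taken: Array[Int] = rdd.take(1) // an array holding one element
    val head: Int = rdd.first()         // the element itself
    (taken, head)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("take-vs-first-sketch").master("local[2]").getOrCreate()
    val (taken, head) = both(spark.sparkContext)
    println(taken.mkString("Array(", ", ", ")")) // Array(10)
    println(head)                                // 10
    spark.stop()
  }
}
```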
29) Caching can use disk if memory is not available. T or F
Ans; TRUE
30) Spark SQL translates commands into code, which is processed by?
Ans; executor nodes
31) Which of the following is true for a Spark application on Hadoop YARN?
Ans; there are two deploy modes: client and cluster mode
32) Apache Spark has APIs in?
Ans; All of the above
33) PySpark is a cluster computing framework that runs on a cluster of commodity hardware and performs data
unification. T or F.
ans;
34) Which function is used to call a program written in shell script/Perl from PySpark?
Ans; pipe()
35) ___ leverages Spark Core's fast scheduling capability for streaming analytics?
Ans; Spark Streaming
36) We can create a DataFrame using
Ans; All of the above
37) Which DStream output operation is used to write output to the console?
Ans; pprint()
38) Which of the following is not a feature of Spark?
Ans; it is cost efficient
39) Some ways of improving the performance of your Spark app include?
Ans; All of the above
40) Dataset was introduced in which Spark release?
Ans; Spark 1.6