Week 8-2
B) The Parameter Server distributes a model over multiple machines and provides two
main operations: Pull (to query parts of the model) and Push (to update parts of the
model).
D) The Parameter Server uses the Collapsed Gibbs Sampling method to update model
parameters by aggregating push updates via subtraction.
Answer:
D) Incorrect - The Parameter Server uses the Collapsed Gibbs Sampling method
to update model parameters.
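Explanation: A Parameter Server is a general architecture for sharing model state
across machines. It is typically paired with gradient-based optimization and is not
tied to Collapsed Gibbs Sampling; pushed updates are usually aggregated by addition,
not subtraction.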
A) It helps to categorize web pages based on the quality of their content, thus improving
the accuracy of search results.
B) It provides an objective and mechanical method for rating the importance of web
pages based on the link structure of the web, addressing challenges of page relevance
amidst a large number of web pages.
C) It ensures that all web pages are indexed equally, regardless of their content or link
structure.
D) It automatically filters out irrelevant web pages by analyzing their content and
metadata.
Answer:
3. What role does the outerJoinVertices() operator serve in Apache Spark's GraphX?
A) It removes all vertices that are not present in the input RDD.
B) It returns a new graph with only the vertices from the input RDD.
C) It joins the input RDD data with vertices and includes all vertices, whether present in
the input RDD or not.
Answer:
A) Incorrect - It removes all vertices that are not present in the input RDD.
Explanation: This operator does not exclude vertices; it retains them regardless of
whether they have matching data.
B) Incorrect - It returns a new graph with only the vertices from the input RDD.
Explanation: The result keeps every vertex of the original graph, not only those that
appear in the input RDD.
C) Correct - It joins the input RDD data with vertices and includes all vertices,
whether present in the input RDD or not.
Explanation: The outerJoinVertices() operator allows for a join between vertex
properties and RDD data, including all vertices in the graph. This is useful when you
want to retain vertices that may not have corresponding data in the input RDD.
D) Incorrect - It does not create a subgraph from the input RDD and vertices.
Explanation: It creates a new graph that retains all vertices from the original graph and
merges them with input data.
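The join semantics described above can be sketched in plain Python. This is only an illustration of the behaviour, not the GraphX API: the function name `outer_join_vertices` and the sample data are hypothetical, dictionaries stand in for the vertex RDD and the input RDD, and `None` plays the role of an empty `Option`.

```python
def outer_join_vertices(vertices, table, map_func):
    """Return new vertex attributes for ALL vertices.

    vertices: dict of vertex id -> attribute (stand-in for the graph's vertices)
    table:    dict of vertex id -> value (stand-in for the input RDD)
    map_func: called as map_func(vertex_id, attribute, value_or_None)
    """
    # Every vertex is kept; vertices without a match receive None.
    return {vid: map_func(vid, attr, table.get(vid))
            for vid, attr in vertices.items()}


vertices = {1: "alice", 2: "bob", 3: "carol"}
out_degrees = {1: 2, 2: 1}  # vertex 3 has no entry in the input data

# Attach the out-degree to each vertex, defaulting to 0 when it is missing.
joined = outer_join_vertices(
    vertices,
    out_degrees,
    lambda vid, name, deg: (name, deg if deg is not None else 0),
)
```

Vertex 3 is still present in `joined`, with the default value supplied by the mapping function.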
4. Which of the following statements accurately describes a key feature of
GraphX, a component built on top of Apache Spark Core?
A) GraphX focuses exclusively on performing machine learning tasks and does not
support graph processing.
B) GraphX allows for efficient graph processing and analysis, supports high-level graph
measures like triangle counting, and integrates the Pregel API for graph traversal.
C) GraphX is primarily used for data ingestion and preprocessing and does not provide
functionalities for graph algorithms or analytics.
D) GraphX provides only basic graph visualization capabilities and does not include
algorithms like PageRank or triangle counting.
Answer:
B) Correct - GraphX allows for efficient graph processing and analysis, supports
high-level graph measures like triangle counting, and integrates the Pregel API
for graph traversal.
Explanation: GraphX is designed for processing graphs at scale and includes
functionalities for both analytical and algorithmic operations, such as triangle counting
and custom graph traversals using the Pregel API.
C) Incorrect - GraphX is primarily used for data ingestion and preprocessing and
does not provide functionalities for graph algorithms or analytics.
Explanation: While it can handle data ingestion, its main purpose is graph processing
and analysis.
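To make "triangle counting" concrete, here is a minimal single-machine sketch in plain Python. It is not GraphX code and makes no claim about how GraphX implements the measure; the function name `triangle_counts` and the sample edge list are hypothetical.

```python
from itertools import combinations


def triangle_counts(edges):
    """Count, for each vertex, the triangles that pass through it.

    edges: iterable of (u, v) pairs, treated as undirected.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    counts = {}
    for v, nbrs in neighbors.items():
        # A triangle through v is a pair of v's neighbours that are also adjacent.
        counts[v] = sum(1 for a, b in combinations(sorted(nbrs), 2)
                        if b in neighbors[a])
    return counts


# One triangle (1, 2, 3) plus a dangling edge to vertex 4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
counts = triangle_counts(edges)
```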
5. Why are substantial indexes and data reuse important in graph processing?
Answer:
A) Join operators add data to graphs and produce new graphs.
B) Structural operators operate on the graph's structure and can create new graphs.
C) Property operators modify vertex or edge properties using user-defined functions,
producing new graphs.
Explanation:
Each type of operator serves to extend the capabilities of graph processing in GraphX,
enabling various transformations and manipulations of graph data.
7. Which RDD operator would you use to combine two RDDs by aligning their keys
and producing a new RDD with tuples of corresponding values?
A) union
B) join
C) sample
D) partitionBy
Answer:
A) Incorrect - union.
Explanation: The union operator combines two RDDs without regard to key
relationships; it simply appends the elements of both RDDs.
B) Correct - join.
Explanation: The join operator combines two RDDs of key-value pairs by key, producing
an RDD of (key, (value1, value2)) tuples for keys that appear in both RDDs.
C) Incorrect - sample.
Explanation: The sample operator creates a new RDD by taking a random sample
from an existing RDD, without combining two RDDs.
D) Incorrect - partitionBy.
Explanation: The partitionBy operator is used to control how data is partitioned across
nodes, not for combining RDDs.
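The difference between union and join can be illustrated with plain Python lists of (key, value) pairs standing in for RDDs. This is a sketch of the semantics only, not Spark code; the helper functions and sample data are hypothetical.

```python
def union(left, right):
    """Append the elements of both collections, ignoring keys."""
    return left + right


def join(left, right):
    """Inner join on key: one (key, (v, w)) tuple per matching pair."""
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]


ages = [("ann", 31), ("bob", 25)]
cities = [("ann", "Oslo"), ("cy", "Rome")]

combined = union(ages, cities)  # all four pairs, keys not aligned
joined = join(ages, cities)     # only "ann" appears on both sides
```

Only the key shared by both inputs survives the join, paired with a tuple of its two values.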
Answer:
Answer:
A) Correct - Recasting graph system optimizations as distributed join
optimization and incremental materialized view maintenance.
Explanation: Treating graph operations similarly to joins allows for better optimization
strategies, which can reduce computational overhead and improve efficiency.
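One way to picture a graph operation as a join: a view that pairs each edge with the attributes of its two endpoints can be built by joining the edge table with the vertex table on the source and destination ids. The plain-Python sketch below uses hypothetical names and data and is not GraphX code.

```python
vertices = {1: "alice", 2: "bob", 3: "carol"}           # vertex id -> attribute
edges = [(1, 2, "follows"), (2, 3, "follows")]          # (src, dst, attribute)


def triplets(vertices, edges):
    """Join each edge with its source and destination vertex attributes."""
    return [((src, vertices[src]), (dst, vertices[dst]), attr)
            for src, dst, attr in edges]


result = triplets(vertices, edges)
```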
10. What are the defining traits of a Parameter Server in distributed machine
learning?
A) Only S1 is true.
B) Only S2 is true.
C) Both S1 and S2 are true.
● Distributes a model over multiple machines: This allows for efficient training
of large models on clusters of machines.
● Offers two operations:
○ Pull: Workers can query parts of the model from the Parameter Server.
○ Push: Workers can update parts of the model by pushing their computed
gradients to the Parameter Server.
These two operations are essential for the coordination and synchronization of
distributed machine learning algorithms.
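The Pull and Push operations can be sketched with a toy, single-process stand-in. The class `ToyParameterServer`, its sharding by key hash, and the choice to aggregate pushed updates by addition are all assumptions made for illustration; a real Parameter Server runs across machines and handles networking, consistency, and fault tolerance.

```python
from collections import defaultdict


class ToyParameterServer:
    """Hypothetical in-memory stand-in for a sharded parameter server."""

    def __init__(self, num_shards=2):
        # Each shard plays the role of one server machine holding part of the model.
        self.shards = [defaultdict(float) for _ in range(num_shards)]

    def _shard_for(self, key):
        # Assumption: parameters are assigned to shards by hashing their key.
        return self.shards[hash(key) % len(self.shards)]

    def pull(self, keys):
        """Query the current values of the requested parameters."""
        return {k: self._shard_for(k)[k] for k in keys}

    def push(self, updates):
        """Apply updates to the model; updates are aggregated by addition."""
        for k, delta in updates.items():
            self._shard_for(k)[k] += delta


server = ToyParameterServer()
server.push({"w0": 0.5, "w1": -1.0})  # update from a first worker
server.push({"w0": 0.25})             # update from a second worker
weights = server.pull(["w0", "w1"])
```

After both pushes, `weights` holds the summed contributions for each parameter.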