Jasnidh Kaur Ahuja
Chandigarh, Chandigarh, India
475 followers
474 connections
Other similar profiles
- Aayush Gupta
  Data Engineer at Amex | Metlife Insurance (Cont.) Fractal | Citi Bank (Cont.) Infosys | Big Data Hadoop | Spark | SQL | Java | Spring boot | RestAPI | Scala (Spark) | Shell Scripting
  Greater Delhi Area
- Romita Banerjee
  Application Development Senior Analyst
  Kolkata
- Anish Kumar
  Data Engineer at Takeda
  Pune
- Rohit Sisodia
  WebSphere Administrator (IBM WebSphere and JBOSS EAP)
  Gurgaon
- Akash Bhati
  Noida
- Utsav Verma
  Technical Team Lead at Nagarro || Ex-Infoscian || GCET
  Gurugram
- Rahul Kumar
  Bengaluru
- Somrita Das
  Bengaluru
- Sayali Kale
  Netherlands
- Jasmine Kaur
  Data Reporting Analyst at Infosys
  India
- Girija Laxmi Iyer
  Irving, TX
- Swapnil Pandey
  Hyderabad
- Sachin Jha
  Technology Lead at Infosys
  Mississauga, ON
- Ashrith Kaparaboina
  Software Engineer II B at Bank of America
  Telangana, India
- Punit Chaudhary
  Kolkata
- Subham Chowdhary
  Founder || Strategist || Consultant || MBA || Software Engineer
  London
- Satish M N
  The Home Depot Canada | Supply Chain | Big Data Analytics
  Toronto, ON
- Pravin Mohature
  Site Reliability Engineer II at Starbucks Coffee Company
  Seattle, WA
- Kartik Sehgal
  Technology Analyst at Infosys
  Bengaluru
- Vivek Dubey
  Quantium | Infosys | Lambton College | Rotman School Of Management | BIT MESRA | Microsoft Certified Power BI | Microsoft Certified Azure Data Scientist Associate | Microsoft Certified Azure Data Engineer Associate | Intellipaat
  Canada
Explore more posts
-
Aashish Raja
I have completed week 28 of Spark Structured Streaming, part 2, covering the topics below in detail:
1. Streaming transformations
2. Triggers in Spark Structured Streaming
3. Fault tolerance in Spark Structured Streaming
4. Time-bound and continuous aggregations in Spark Structured Streaming
5. The watermark feature in streaming and how it interacts with the different output modes
6. Joins in streaming, between two streams or between a stream and a static table
The fault tolerance, aggregation, and watermark topics were covered in particular depth. Thanks to Sumit Mittal & TrendyTech for covering these so thoroughly. (A small PySpark sketch of a watermarked aggregation follows below.) #dataengineering #streaming #sql #spark
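For readers who want to see what a watermarked, time-bound streaming aggregation looks like, here is a minimal PySpark sketch; the rate source, window sizes, and console sink are illustrative assumptions, not details from the post above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-watermark-demo").getOrCreate()

# Illustrative source: a rate stream that generates (timestamp, value) rows
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Event-time aggregation with a watermark, so data arriving more than 10 minutes late is dropped
counts = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count())

# "update" output mode emits only the windows that changed in each trigger
query = (counts.writeStream
    .outputMode("update")
    .format("console")
    .trigger(processingTime="30 seconds")
    .start())
# query.awaitTermination()  # uncomment to keep the stream running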
6
-
Sagar Prajapati
What are SparkSession and SparkContext?

SparkContext: the original main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.

Syntax for SparkContext:
from pyspark import SparkContext
# Create a SparkContext object in local mode, using all available worker threads
sc = SparkContext("local[*]", "example")
# Now you can use 'sc' to create RDDs and perform operations on them

SparkSession: a unified object for performing all Spark operations. In Spark 1.x there were separate objects such as SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext. With Spark 2.x these were combined into one object, the SparkSession, and you can perform all of those operations through it. This unification has made life simpler for Spark developers.

Syntax for SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Name") \
    .getOrCreate()

Why use SparkSession instead of SparkContext? Since Spark 2.0, SparkSession provides a common entry point for a Spark application. Instead of SparkContext, HiveContext, and SQLContext, everything now lives within the SparkSession; it unifies Spark's numerous contexts. Before 2.0 you had to create the separate contexts yourself per JVM; SparkSession resolves that. Check out this Databricks interview questions course: https://lnkd.in/dCjUuEGe
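As a small runnable illustration of the point above (app name and data are placeholders), a SparkSession created this way also exposes the underlying SparkContext, so you rarely need to construct one separately:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("session-vs-context-demo")
         .master("local[*]")
         .getOrCreate())

# The SparkContext is available from the session; no separate SparkContext() call is needed
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())                      # RDD API via the context

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()                             # DataFrame/SQL API via the session

spark.stop()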
39
2 Comments -
Ajay Kadiyala
I attended more than 30 interviews before joining PwC, and I found that many questions were common across them 🎯. Those questions include:
- Explain the Hadoop architecture.
- What are the 5 V's of big data?
- What is the default replication factor in Hadoop? Can you increase or decrease it?
- Difference between Hadoop Gen1 and Hadoop Gen2?
- What are heartbeats in Hadoop, and why are they important?
- Write down a few Linux commands.
- What are partitioning, shuffling, and sorting in MapReduce?
- What is a Record Reader?
- Explain the Sqoop eval command.
- Explain the different optimizations used in Sqoop.
- Explain the combiner in MapReduce.
- What is YARN? Why is it used?
- Features of Sqoop? Explain their significance.
- Explain the boundary values query and its formula.
- Explain the modes available in Sqoop for job execution.
- Difference between the target and warehouse directories?
- What is the split-by command, and when is it used?
- Hive architecture?
- Explain transactional processing vs. analytical processing.
- Difference between Hive and an RDBMS?
- What is seek time in Hive?
- Difference between SQL and HQL?
- Explain UDFs. How many types are there?
- What are views in Hive?
- Explain managed tables and external tables.
- Spark architecture?
- What are transformations and actions? Name a few. (A small PySpark sketch for this one follows below.)
Intermediate to advanced questions include:
- Explain the different optimizations in Hive.
- Explain the types of joins.
- What is a map-side join?
- What are a bucket map join and a sort-merge bucket (SMB) join?
- Explain SCD types in Hive.
- Explain file formats in Hive.
- Explain the CAP theorem.
- Explain RDDs.
- Difference between RDD vs. DataFrame vs. Dataset?
- Broadcast in Spark?
- Explain the Catalyst optimizer.
- Difference between client mode and cluster mode?
- Explain cache and persist.
- Explain Spark performance optimizations.
- Explain accumulators.
Bonus: SQL and coding questions are also important; have a look at frequently asked questions before the interview. Let me know in the #comment section if this was helpful. If you would like to know about my experience with PwC, check out the link in the comment section. Do follow Ajay Kadiyala ✅ #data #interviewexperience #interview #hbase #pwc #job #cloud #experience #comment #like #dataengineering
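As one worked illustration of the "transformations and actions" question in the list above, here is a minimal PySpark sketch (the data and column names are invented for the example): transformations are lazy, and only an action triggers execution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 5), ("c", 10)], ["key", "value"])

# Transformations are lazy: nothing executes yet
filtered = df.filter(col("value") > 3)
doubled = filtered.withColumn("value_x2", col("value") * 2)

# Actions trigger the actual computation
doubled.show()
print(doubled.count())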
245
14 Comments -
Aditya Chandak
Hadoop Vs Spark Interview Question!! Interviewer: Let's discuss the evolution of the Hadoop ecosystem and the rise of Apache Spark. One significant change that occurred was the replacement of Hadoop MapReduce with Spark in many data processing workflows. Can you explain why this transition happened? Candidate: Certainly! The transition from Hadoop MapReduce to Apache Spark can be attributed to several factors. One key reason is Spark's superior performance compared to MapReduce. Spark's in-memory processing model allows it to execute tasks much faster by minimizing data shuffling and disk I/O, leading to significant speed improvements for data processing tasks. Interviewer: That's an important point. Can you elaborate on how Spark's in-memory processing differs from Hadoop MapReduce's disk-based processing? Candidate: Of course! In Hadoop MapReduce, intermediate data between map and reduce tasks is typically written to disk, which can incur overhead due to disk I/O operations. In contrast, Spark keeps intermediate data in-memory whenever possible, reducing the need for disk reads and writes. This in-memory processing model enables Spark to achieve faster processing speeds, especially for iterative algorithms and interactive analytics workloads. Interviewer: I see. So, apart from performance, are there any other reasons why Spark replaced MapReduce in many data processing workflows? Candidate: Absolutely. Another factor is Spark's versatility. While Hadoop MapReduce is primarily designed for batch processing, Spark supports a wide range of processing workloads, including batch processing, interactive querying, streaming data processing, and machine learning. This versatility makes Spark a more attractive choice for organizations looking to address diverse analytical use cases within a single framework. Interviewer: That's a compelling advantage. How did Spark's programming model contribute to its popularity? Candidate: Spark offers a more developer-friendly programming model compared to Hadoop MapReduce. With its rich set of high-level APIs in languages like Scala, Python, and Java, Spark simplifies the development of complex data processing workflows. Additionally, Spark's interactive shell (Spark REPL) facilitates rapid prototyping and experimentation, enhancing developer productivity. Interviewer: Thank you for the insightful explanation. It's clear that Spark's performance, versatility, and ease of use have made it a preferred choice for many organizations seeking to modernize their data processing infrastructure. Candidate: My pleasure! #DataEngineering #PySpark #ETLProject #pyspark #python #azuredataengineer #databricks #awscloud #googlecloud #dataengineer #dataengineerjobs #dataanalysis #dataanalytics #glue #azuredatabricks #azuredatafactory #synapse #sql #sqlserver #powerbi #database #etl #etldeveloper #Python #Pandas #BeginnertoAdvance #Dataframe #Sorting #Joins #Grouping #RealTimeScenarios #sql #awscloud #powerbi #databricks
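To make the "iterative algorithms benefit from in-memory processing" point in the dialogue above concrete, here is a minimal PySpark sketch (the dataset and loop are illustrative assumptions): caching keeps the working set in memory across iterations instead of recomputing or re-reading it on every pass.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-cache-demo").getOrCreate()
sc = spark.sparkContext

# Illustrative dataset reused across many iterations
points = sc.parallelize(range(100_000)).map(lambda x: (x % 10, float(x)))
points.cache()            # keep it in memory instead of recomputing each pass

total = 0.0
for _ in range(5):        # stand-in for an iterative algorithm (e.g. k-means style loops)
    total += points.values().sum()

print(total)
spark.stop()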
5
-
Gnanasekaran G
Databricks: types of clusters. In interviews there may be questions such as:
1. Which types of clusters have you worked on?
2. What are the types of clusters in Databricks?
3. Explain how you would create a cluster for running your notebook.
These questions aim to assess your familiarity with the different types of clusters in Databricks. Here is the answer. There are two types of clusters:
1. Job clusters
2. All-purpose clusters
Job clusters run automated jobs in an expeditious and robust way. The scheduler creates the cluster, runs the job, and terminates the cluster once the job is complete; job clusters cannot be restarted.
All-purpose clusters are used to analyse data collaboratively in interactive notebooks. They can be terminated and restarted, and multiple users can share them to work collaboratively. (A hedged sketch of creating a job that runs on a job cluster follows below.)
#databricks #clusters #jobcluster #allpurposecluster #dataengineering #dataengineer
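A hedged sketch of creating a job that runs on a job cluster through the Databricks Jobs REST API. The workspace URL, token, notebook path, node type, and runtime version are placeholders, and the field names are written from memory of the Jobs API 2.1 format; verify them against the Jobs API documentation for your workspace before using.

import requests

workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder
token = "<personal-access-token>"                                 # placeholder

payload = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "run_notebook",
        "notebook_task": {"notebook_path": "/Repos/demo/etl_notebook"},
        "new_cluster": {                      # job cluster: created for the run, terminated afterwards
            "spark_version": "14.3.x-scala2.12",   # placeholder runtime
            "node_type_id": "i3.xlarge",           # placeholder node type
            "num_workers": 2
        }
    }]
}

resp = requests.post(f"{workspace_url}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=payload)
print(resp.json())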
10
-
Ashutosh Gupta
Week 10 of the "Ultimate Big Data Masters Program" by Sumit Mittal Sir provided an insightful journey into Apache Spark optimizations and performance tuning (part 1). Here's a summary of the key learnings from week 10:
· PySpark optimizations: internals of groupBy
· Normal join vs. broadcast join
· Different types of joins
· Partition skew
· Adaptive Query Execution (AQE)
· Join strategies
· Optimizing a join of two large tables: bucketing
Huge thanks to Sumit Mittal for the crystal-clear explanations. TrendyTech (A short PySpark sketch of broadcast joins and AQE follows below.) #Bigdata #Spark #Join #dataengineer #SQL #optimization #bigdatadeveloper #learningandgrowing
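A small PySpark sketch of two of the topics above, broadcast joins and Adaptive Query Execution; the table names and data are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-optimization-demo").getOrCreate()

# AQE can coalesce shuffle partitions and handle skewed joins at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

orders = spark.createDataFrame(
    [(1, "IN", 100.0), (2, "US", 250.0), (3, "IN", 75.0)],
    ["order_id", "country_code", "amount"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")],
    ["country_code", "country_name"])

# Broadcast the small dimension table so the large fact table is not shuffled
joined = orders.join(broadcast(countries), "country_code")
joined.explain()   # the plan should show a BroadcastHashJoin
joined.show()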
27
1 Comment -
Aditya Chandak
Apache Spark Vs Hadoop MapReduce! Interviewer: Let's delve into the comparison between Apache Spark and Hadoop MapReduce. Can you elaborate on their core differences? Candidate: Certainly. Apache Spark is characterized by its in-memory processing, enabling faster computation by keeping data in memory across multiple processing steps. This contrasts with Hadoop MapReduce, which follows a disk-based processing model, reading data from and writing data to disk after each Map and Reduce phase. This fundamental difference in processing models greatly influences their performance and suitability for various types of workloads. Interviewer: That's insightful. How about their ecosystem support? Candidate: While Hadoop MapReduce benefits from its longstanding presence in the ecosystem, Spark has rapidly gained popularity and built a comprehensive ecosystem of its own. Spark seamlessly integrates with various data sources and storage systems, and its high-level APIs for SQL, streaming, machine learning, and graph processing simplify application development. Additionally, Spark can run both standalone and on existing Hadoop clusters, offering flexibility and compatibility with existing infrastructure. Interviewer: Good points. Now, let's talk about fault tolerance. How do Spark and MapReduce handle failures in distributed environments? Candidate: Both frameworks employ fault tolerance mechanisms, but they differ in their approaches. In MapReduce, fault tolerance is achieved through data replication and re-execution of failed tasks. Intermediate data is persisted to disk after each phase, allowing tasks to be rerun on other nodes in case of failure. On the other hand, Spark leverages lineage and resilient distributed datasets (RDDs) to achieve fault tolerance. RDDs track the lineage of each partition, enabling lost partitions to be recomputed from the original data source. Because Spark primarily operates in-memory, it can recover from failures more quickly compared to MapReduce. Interviewer: That's a comprehensive explanation. Lastly, in what scenarios would you recommend using Apache Spark over Hadoop MapReduce, and vice versa? Candidate: I would recommend using Apache Spark for applications that require real-time processing, iterative algorithms, or interactive analytics. Its in-memory processing capabilities and high-level APIs make it well-suited for these use cases. Conversely, Hadoop MapReduce may be more suitable for batch processing tasks that involve large-scale data processing and do not require real-time or iterative computation. It's essential to consider factors such as performance requirements, processing models, and ecosystem compatibility when choosing between the two frameworks. #spark #hadoop #bigdata
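To illustrate the lineage point in the dialogue above with a toy example: each RDD remembers the chain of transformations that produced it, and this is what Spark replays to rebuild a lost partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

base = sc.parallelize(range(100))
evens = base.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# toDebugString shows the lineage graph Spark would use to recompute lost partitions
print(squared.toDebugString().decode("utf-8"))
print(squared.count())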
11
-
Aditya Chandak
Hadoop vs Spark! Interviewer: There's a common belief that Hadoop has been replaced by Spark. Can you explain why this is a misconception? Candidate: This is a misconception because Hadoop and Spark serve different purposes within the big data ecosystem. Hadoop is an ecosystem that includes HDFS (storage), YARN (resource management), and MapReduce (processing). Spark is primarily a processing engine that can run on top of Hadoop's HDFS and YARN. While Spark has largely replaced Hadoop's MapReduce due to its faster in-memory processing capabilities, it has not replaced Hadoop's storage and resource management components. Interviewer: So, Spark complements rather than replaces Hadoop? Candidate: Exactly. Spark is designed to complement Hadoop by providing a more efficient and versatile processing engine. It can utilize Hadoop's HDFS for distributed storage and YARN for resource management, making them work together seamlessly. Interviewer: Can you elaborate on the technical differences between Hadoop MapReduce and Spark that led to Spark's popularity? Candidate: Certainly. The primary technical differences include: Performance: Spark performs in-memory computations, which makes it significantly faster than Hadoop MapReduce, which relies on disk-based processing. Ease of Use: Spark offers high-level APIs in Java, Scala, Python, and R, making it easier for developers to write and maintain code compared to the more complex Java code required for MapReduce. Unified Framework: Spark provides a unified framework for batch processing, real-time streaming, machine learning, and graph processing, whereas MapReduce is limited to batch processing. These advantages make Spark more suitable for a variety of data processing tasks, leading to its increased adoption over Hadoop MapReduce. Interviewer: What considerations should an organization take into account when transitioning from Hadoop MapReduce to Spark? Candidate: When transitioning from Hadoop MapReduce to Spark, organizations should consider the following: Compatibility: Ensure the existing Hadoop cluster is compatible with Spark, typically achieved through distributions like Cloudera or Hortonworks. Training: Provide adequate training for developers and data engineers to become proficient in Spark. Migration Planning: Develop a detailed migration plan, including testing and validation of Spark jobs to ensure they meet performance and accuracy requirements. Resource Management: Adjust resource allocation and configurations in YARN to optimize for Spark’s in-memory processing model. Gradual Transition: Start with less critical workloads to gain confidence before migrating more critical processes. By considering these factors, organizations can effectively transition to Spark while maintaining the benefits of the Hadoop ecosystem. #DataEngineering #PySpark #ETLProject #pyspark #python #azuredataengineer #databricks #awscloud #googlecloud #dataengineer #dataengineerjobs #dataanalysis
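As a concrete illustration of Spark running on top of Hadoop's storage and resource layers, here is a sketch of a PySpark job that reads from HDFS while YARN manages its resources; the paths, column name, and app name are placeholders, not details from the post.

# Illustrative PySpark job (app.py), submitted with something like:
#   spark-submit --master yarn --deploy-mode cluster app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-hadoop-demo").getOrCreate()

# Spark reads directly from HDFS (Hadoop's storage layer) while YARN schedules the executors
df = spark.read.option("header", True).csv("hdfs:///data/raw/events.csv")   # placeholder path
(df.groupBy("event_type")                                                    # hypothetical column
   .count()
   .write.mode("overwrite")
   .parquet("hdfs:///data/curated/event_counts"))                           # placeholder path

spark.stop()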
43
-
Tanuj Rana
Consider a DataFrame named sales with the columns product_id, quantity_sold, and price, where revenue is quantity_sold * price. Write Spark DataFrame code to find the total quantity sold and the total revenue for each product. #SCALA #SPARK #BIGDATA Karthik K.
INPUT:
PRODUCT_ID,QUANTITY_SOLD,PRICE
101,10,150
101,15,150
102,5,50
102,10,50
103,3,100
103,10,100
QUERY:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SalesAggregation {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder.appName("two_dfS").master("local[*]").getOrCreate()

    // Read the CSV input with an explicit schema
    val df_1 = ss.read.format("csv").option("header", true)
      .option("path", "C:/Users/hp/PRODUCT_ID.txt")
      .schema("PRODUCT_ID Int, QUANTITY_SOLD Int, PRICE Int").load()

    // Derive revenue per row, then aggregate per product
    val df_2 = df_1.withColumn("REVENUE", col("QUANTITY_SOLD") * col("PRICE"))
    df_2.groupBy(col("PRODUCT_ID"))
      .agg(sum(col("QUANTITY_SOLD")).as("TOTAL_QUANTITY_SOLD"),
           sum(col("REVENUE")).as("TOTAL_REVENUE"))
      .show()
  }
}
OUTPUT (expected totals per product):
101: total quantity 25, total revenue 3750
102: total quantity 15, total revenue 750
103: total quantity 13, total revenue 1300
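For comparison, a sketch of the same aggregation in PySpark, using an inline DataFrame instead of the CSV file from the post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum

spark = SparkSession.builder.appName("sales-aggregation").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [(101, 10, 150), (101, 15, 150), (102, 5, 50),
     (102, 10, 50), (103, 3, 100), (103, 10, 100)],
    ["product_id", "quantity_sold", "price"])

result = (sales
    .withColumn("revenue", col("quantity_sold") * col("price"))
    .groupBy("product_id")
    .agg(_sum("quantity_sold").alias("total_quantity_sold"),
         _sum("revenue").alias("total_revenue")))
result.show()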
1
-
Gajulapalli Anil Kumar Reddy
Spark performance issues that make Spark jobs run for a long time. In this post we take a quick look at the major ones:
-> Data partitioning
Symptoms: Uneven load across nodes; some tasks take much longer.
Solution: Use repartition() or coalesce() to manage the number of partitions. Ensure the number of partitions matches the cluster size and task complexity (typically 2-4x the number of cores). Use partitionBy() when writing partitioned datasets.
-> Shuffles and joins
Symptoms: Jobs spend too much time on "shuffle read/write" in the Spark UI.
Solution: Optimize joins by using broadcast joins for smaller datasets (use broadcast()). Use bucketing to reduce shuffle during repeated joins. Avoid wide transformations (e.g., groupByKey) and prefer reduceByKey or aggregateByKey.
-> Caching and persistence
Symptoms: Recomputing the same data repeatedly.
Solution: Cache frequently reused data using cache() or persist() with the appropriate storage level. Unpersist datasets once they are no longer needed to free memory.
-> Serialization
Symptoms: High CPU usage or out-of-memory errors during task execution.
Solution: Use Kryo serialization instead of Java serialization for faster and more compact serialization:
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
-> Skewed data
Symptoms: Some tasks are much slower than others.
Solution: Identify skewed keys using data sampling. Use salting for skewed keys to distribute load evenly (see the salting sketch below). Enable adaptive query execution (AQE) to dynamically handle skew:
spark.conf.set("spark.sql.adaptive.enabled", "true")
-> Resource allocation
Symptoms: Underutilized cluster resources or memory errors.
Solution: Allocate appropriate executor memory, cores, and number of executors. Use Spark's dynamic allocation to adjust resources dynamically:
spark.conf.set("spark.dynamicAllocation.enabled", "true")
-> Query optimization
Symptoms: Long-running queries.
Solution: Leverage the Catalyst optimizer by writing SQL-like transformations (DataFrame/Dataset APIs). Enable cost-based optimization (CBO):
spark.conf.set("spark.sql.cbo.enabled", "true")
-> Monitoring and debugging
Use the Spark UI to identify bottlenecks: look at stages, tasks, and shuffle operations. Use Spark event logs and the history server for deeper insights.
#Bigdata #spark #SparkScala #Optimization #Scala #PySpark #Hadoop #DataEngineer #SparkOptimization Karthik K.
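The salting sketch referenced above, in PySpark. The tables, key names, and number of salt buckets are invented for illustration; the idea is to spread one hot key across several partitions and replicate the small side so every salted key still finds a match.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, floor, rand, lit, explode, array

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

SALT_BUCKETS = 8   # illustrative; tune to the observed skew

# Hypothetical skewed fact table and small dimension table
facts = spark.createDataFrame([("hot_key", i) for i in range(1000)] + [("cold_key", 1)],
                              ["join_key", "value"])
dims = spark.createDataFrame([("hot_key", "A"), ("cold_key", "B")], ["join_key", "label"])

# Add a random salt to the skewed side so one hot key spreads over several partitions
facts_salted = facts.withColumn(
    "salted_key", concat_ws("_", col("join_key"), floor(rand() * SALT_BUCKETS).cast("string")))

# Replicate the dimension side so every salt value has a matching row
dims_salted = (dims
    .withColumn("salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)])))
    .withColumn("salted_key", concat_ws("_", col("join_key"), col("salt").cast("string"))))

joined = facts_salted.join(dims_salted.select("salted_key", "label"), "salted_key")
joined.groupBy("label").count().show()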
-
Susmitha Kanuri
Day 31 of our #ADF Series: Hadoop MapReduce Activity 🚀 Welcome to Day 31! Today, we’re exploring the Hadoop MapReduce Activity in Azure Data Factory, a powerful feature for running MapReduce jobs on Hadoop clusters directly from your ADF pipelines. ✨ What is Hadoop MapReduce Activity? The Hadoop MapReduce Activity allows you to execute MapReduce jobs on a Hadoop cluster within Azure Data Factory pipelines. This is ideal for processing massive amounts of data in a distributed way, leveraging the power of Hadoop for advanced transformations and aggregations. 📌 Real-World Example: Imagine you're working with terabytes of clickstream data from a website. Using the Hadoop MapReduce Activity, you can write and execute a MapReduce job to calculate user click trends and segment them into categories for targeted marketing campaigns. 📈 Common Use Cases: Processing and analyzing massive datasets stored in Hadoop. Executing advanced data transformations or computations using custom MapReduce jobs. Preparing data for machine learning models or downstream analytics. ⚠️ Limitations and Workarounds: Limitation: Requires a properly configured Hadoop cluster linked to Azure Data Factory. Workaround: Ensure cluster networking, linked services, and authentication are set up correctly before using this activity. Limitation: Complex MapReduce jobs can take time to execute. Workaround: Optimize your MapReduce scripts for performance by using combiner functions and partitioning where applicable. 💡 Pro Tip: Pair the Hadoop MapReduce Activity with Copy Activity to transfer processed data into Azure services like Data Lake or Synapse Analytics for further use. How have you used Hadoop MapReduce in your ADF pipelines? Share your thoughts or tips below! 👇 Ref Doc: https://lnkd.in/gb5ewUgd #AzureDataFactory #DataEngineering #ETL #BigData #HadoopMapReduceActivity
18
-
Haider Ali
Optimistic Concurrency Control in Databricks 1. What is Optimistic Concurrency Control (OCC)? - Optimistic Concurrency Control (OCC), also known as optimistic locking, is a non-locking concurrency control method used in transactional systems like relational database management systems and software transactional memory. - It operates under the assumption that multiple transactions can frequently complete without interfering with each other. 2. Optimistic Concurrency Control in Azure Databricks: - Azure Databricks manages transactions at the table level. - Transactions always apply to one table at a time. - For managing concurrent transactions, Azure Databricks employs optimistic concurrency control. - Here's how it works: - There are no locks on reading or writing against a table. - Deadlocks are not possible. - Delta Lake, which is used by default for all tables in Azure Databricks, provides ACID transaction guarantees between reads and writes. - Multiple writers across multiple clusters can simultaneously modify a table partition. - Writers see a consistent snapshot view of the table, and writes occur in a serial order. - Readers continue to see a consistent snapshot view of the table that the Databricks job started with, even when the table is modified during a job. - Metadata changes (such as changes to table protocol, properties, or data schema) cause all concurrent write operations to fail. - Streaming reads also fail when encountering a commit that changes table metadata. - In such cases, the stream must be restarted. 3. Row-Level Concurrency: - Row-level concurrency reduces conflicts between concurrent write operations by detecting changes at the row level. - It automatically resolves conflicts that occur when concurrent writes update or delete different rows in the same data file. - Row-level concurrency is generally available on Databricks Runtime 14.2 and above. - It is supported by default for the following conditions: - Tables with deletion vectors enabled and without partitioning. - Tables with liquid clustering, unless deletion vectors are disabled. - Tables with partitions do not support row-level concurrency but can still avoid conflicts between OPTIMIZE and other write operations when deletion vectors are enabled. #databricks #data #dataengineering
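A hedged sketch of the table property mentioned above for row-level concurrency. The table name and schema are invented, and it assumes a Databricks notebook where spark is already defined and the runtime supports Delta deletion vectors; check the property and runtime requirements against the Databricks documentation.

# Create a Delta table with deletion vectors enabled (placeholder table name and columns)
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_demo (
        event_id BIGINT,
        payload  STRING
    )
    USING DELTA
    TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# With deletion vectors enabled (and Databricks Runtime 14.2+ per the post above),
# concurrent writers that touch different rows of the same file can be resolved at the row level.
spark.sql("UPDATE events_demo SET payload = 'patched' WHERE event_id = 42")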
38
-
Rajkumar Dandu
PySpark interview questions for Data Engineers, 2024:
1. How do you deploy PySpark applications in a production environment?
2. What are some best practices for monitoring and logging PySpark jobs?
3. How do you manage resources and scheduling in a PySpark application?
4. Write a PySpark job to perform a specific data processing task (e.g., filtering data, aggregating results).
5. You have a dataset containing user activity logs with missing values and inconsistent data types. Describe how you would clean and standardize this dataset using PySpark.
6. Given a dataset with nested JSON structures, how would you flatten it into a tabular format using PySpark? (A sketch for this one follows below.)
8. Your PySpark job is running slower than expected due to data skew. Explain how you would identify and address this issue.
9. You need to join two large datasets, but the join operation is causing out-of-memory errors. What strategies would you use to optimize this join?
10. Describe how you would set up a real-time data pipeline using PySpark and Kafka to process streaming data.
11. You are tasked with processing real-time sensor data to detect anomalies. Explain the steps you would take to implement this using PySpark.
12. Describe how you would design and implement an ETL pipeline in PySpark to extract data from an RDBMS, transform it, and load it into a data warehouse.
13. Given a requirement to process and transform data from multiple sources (e.g., CSV, JSON, and Parquet files), how would you handle this in a PySpark job?
14. You need to integrate data from an external API into your PySpark pipeline. Explain how you would achieve this.
15. Describe how you would use PySpark to join data from a Hive table and a Kafka stream.
16. You need to integrate data from an external API into your PySpark pipeline. Explain how you would achieve this.
All the best 👍👍
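The sketch referenced for question 6, flattening nested JSON into a tabular format; the record, field names, and structure are hypothetical examples.

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json-demo").getOrCreate()

# Hypothetical nested record: a user with a nested address and a list of orders
record = {
    "user": {"id": 1, "address": {"city": "Pune", "zip": "411001"}},
    "orders": [{"order_id": "A1", "amount": 250.0}, {"order_id": "A2", "amount": 80.0}],
}
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(record)]))

# Flatten: pull struct fields up with dot paths, explode the array into rows
flat = (df
    .select(col("user.id").alias("user_id"),
            col("user.address.city").alias("city"),
            explode(col("orders")).alias("order"))
    .select("user_id", "city",
            col("order.order_id").alias("order_id"),
            col("order.amount").alias("amount")))
flat.show()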
82
1 Comment -
Neha Gupta
The most common interview question in Databricks interviews: What's the difference between Cache() and Persist() in Apache Spark? Today, I’ll break it down: 🔹Cache(): A quick way to store data in memory (MEMORY_ONLY), improving performance by avoiding recomputation. Ideal for smaller datasets that can easily fit in memory. 🔹Persist(): Offers flexibility with various storage levels: • MEMORY_ONLY: Similar to cache(), stores data in memory, with potential recomputation if memory is limited. • MEMORY_AND_DISK: Retains data in memory when possible and spills to disk if necessary, preventing data loss. • DISK_ONLY: Stores data entirely on disk, useful when memory is constrained. • MEMORY_ONLY_SER and MEMORY_AND_DISK_SER: Stores data in a serialized format, reducing memory usage but requiring more CPU for deserialization. This flexibility is key when working with large datasets that might exceed memory capacity. 🔑 Pro Tip: Use cache() for straightforward, memory-bound tasks, and opt for persist() when you need to customize storage based on your dataset and performance requirements. Mastering this difference can help you optimize your Spark jobs and stand out in your Databricks interview. #DataEngineering #ApacheSpark #Databricks #BigData #InterviewTips
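A minimal PySpark sketch of the difference described above; the dataset is generated inline for illustration, and the storage level is chosen only to show the explicit persist() API.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS bucket")

cached = df.groupBy("bucket").count()
cached.cache()                                    # quick, default caching
cached.count()                                    # materialize it once
cached.show(5)                                    # reuses the cached result

persisted = df.filter("id % 2 = 0")
persisted.persist(StorageLevel.MEMORY_AND_DISK)   # explicit level: spill to disk if memory is tight
persisted.count()

# Release memory when the datasets are no longer needed
cached.unpersist()
persisted.unpersist()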
96
6 Comments -
Ganesh R
Delta Lake is an open-source storage layer that enhances traditional data lakes, built on top of Apache Spark. It brings features commonly associated with data warehouses, like ACID transactions, schema enforcement, and time travel, to data lakes, making them more reliable. 𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗼𝗳 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲! ⚧️ ACID Transactions: 🤿 Schema Enforcement: ⌛ Time Travel: ♎ Scalable Metadata Handling: 🫗 Upserts and Deletes: 🫙 Efficient Data Storage with Compaction: 🎏 Streaming and Batch Processing: 𝗪𝗵𝗲𝗻 𝘁𝗼 𝗨𝘀𝗲 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲! 1️⃣ Reliable Data Lakes: If you are building a data lake and need the reliability of a data warehouse (with ACID transactions and schema enforcement), Delta Lake is a great choice. It solves common issues in traditional data lakes, like eventual consistency and poor data quality. 2️⃣ Handling Large-Scale Data with Many Writes: If you are working with large-scale data, especially with many concurrent write operations or a combination of batch and streaming data, Delta Lake can help manage conflicts and ensure consistency. 3️⃣ Use Cases Requiring Upserts or Deletes: If your workflows require frequent updates to the data (e.g., handling late-arriving data, correcting historical records, or real-time upserts), Delta Lake's MERGE, UPDATE, and DELETE operations are key differentiators. 4️⃣ Data Versioning and Auditing: When you need to track changes in your data over time (e.g., auditing, debugging, or compliance needs), Delta Lake’s time travel feature lets you view and revert to previous versions of your data. 5️⃣ Unified Batch and Stream Processing: If you need to process both streaming and batch data in the same architecture, Delta Lake's ability to handle both data types in a consistent format makes it ideal. 6️⃣ Optimized Data Management: When you need to optimize query performance and manage large amounts of metadata efficiently, Delta Lake provides better management of metadata and file compaction. 𝙀𝙭𝙖𝙢𝙥𝙡𝙚 𝙐𝙨𝙚 𝘾𝙖𝙨𝙚𝙨: 1️⃣ ETL Pipelines: Where data needs to be incrementally loaded and updated over time. 2️⃣ Data Lakes with Schema Changes: Where you expect schema evolution (changes in the structure of your data over time). Real-time Analytics: Where fresh data from streams needs to be merged with historical data. 3️⃣ Data Compliance and Auditing: Where maintaining historical versions of data for regulatory or debugging purposes is necessary. Delta Lake is ideal for use cases where you need the flexibility and scalability of a data lake, with the consistency and reliability typically found in data warehouses.
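A short PySpark sketch of two of the features above, MERGE upserts and time travel. The path and data are placeholders, and it assumes a Spark session with the Delta Lake package on the classpath (e.g. started with the delta-spark package and the Delta SQL extensions shown below).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/customers"          # placeholder path

# Initial write (becomes version 0 of the table)
initial = spark.createDataFrame([(1, "old"), (2, "old")], ["id", "status"])
initial.write.format("delta").mode("overwrite").save(path)

# Upsert with MERGE (an ACID transaction)
updates = spark.createDataFrame([(2, "new"), (3, "new")], ["id", "status"])
updates.createOrReplaceTempView("updates")
spark.sql(f"""
    MERGE INTO delta.`{path}` AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as of the earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()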
44
1 Comment -
Subham Khandelwal
𝗙𝗶𝗹𝗲 𝘀𝗸𝗶𝗽𝗽𝗶𝗻𝗴 is one of the most popular techniques for reading data faster in big data systems. Learn how Hive and Delta Lake implement it 🤔 Upcoming article this weekend on my Medium channel 🗓️ Meanwhile you can subscribe to my Medium channel to read other interesting articles on Spark optimizations for free 📢 https://lnkd.in/gQU-XxVZ #dataengineering
15
-
Shubham Sharma
🎉 HDFS vs. Data Lakes: A Fun Dive into Big Data Storage! 🚀 Hey fellow Data Enthusiasts! 🌟 It's often said that cloud storage is more cost-effective than HDFS. But is that really true? Let’s dive in and find out🏄♂️ 🌈 HDFS (Hadoop Distributed File System): In HDFS, data is stored across multiple servers, or nodes, which provide both storage and computing power. To increase storage capacity in HDFS cluster, you add more nodes. This means that if you need more storage, you also have to add more computing power, even if it's not necessary. Imagine you're at an all-you-can-eat buffet 🍽️, but you have to pay for a drink every time you get a plate. That's HDFS for you! 🧠 Storage and computing power are tightly coupled, so if you want more storage, you have to add more compute power too! 💻📦 It's efficient for big data processing but sometimes feels like buying a new car 🚗 just because you need a new tire! 😂 💡 Data Lake Cloud Storage: Cloud-based data lakes offer more flexibility by separating storage from computing resources, with on-demand scalability for compute resources. Picture a magical pantry 🧚♀️ where you can pick as many cookies 🍪 (data) as you want without having to buy extra milk (compute power) unless you actually need it. That's your data lake! 🌊 With cloud-based data lakes, storage and computing are decoupled. Need more storage? Just add space! Need more computing power? Scale up only the compute! It's like paying for just the cookies you eat! 🎯 💸 Cost-Effectiveness: HDFS is like buying in bulk at a wholesale club—great if you consume a lot, but sometimes you end up with more than you need. Data lakes, especially in the cloud, let you pay-as-you-go, keeping costs low and efficient. It's like choosing between a big box store and a bespoke bakery! 🍰 🔄 Elasticity: In today's fast-paced world, flexibility is key! 🕺💃 Data lakes are like yoga masters—super flexible and able to stretch to meet your needs without breaking a sweat. 🧘♂️🧘♀️ Need to crunch data at scale? Up goes the compute! Just storing? Keep it simple and cost-effective. It's all about adapting to your needs, making every day a good data day! 📊🌞 So, next time you're considering your data storage strategy, remember: HDFS is powerful but comes with a package deal, while data lakes offer flexible options. Choose wisely and make your data day even better! 💼💥 What’s your take on this data showdown? Share your thoughts! #DataStorage #HDFS #DataLake #BigData #CloudComputing #TechTalk #DataDay
11
1 Comment