Spark | Eklavya Online

Spark Advanced Tutorials (Complete Guide Book)

Introduction: Apache Spark is a general-purpose cluster computing system for processing big-data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched …

Spark Interview Questions for Beginners

1. How is Apache Spark different from MapReduce?

- Spark processes data in batches as well as in real time; MapReduce processes data in batches only.
- Spark runs almost 100 times faster than Hadoop MapReduce; Hadoop MapReduce is slower for large-scale data processing.
- Spark stores data in RAM, i.e. …

Spark Interview Questions for Intermediates

1. How do you programmatically specify a schema for a DataFrame? A DataFrame can be created programmatically in three steps:

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.

2. …

Scenario based Hadoop interview questions

1) If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system etc.), and the initial data size is 600 TB, how will you estimate the number of data nodes (n)? Estimating the hardware requirement is always challenging in a Hadoop environment because we never know when …
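A back-of-the-envelope sketch of this estimate can be written in a few lines. The excerpt gives only the 600 TB data size and the 8 TB of usable disk per node; the HDFS replication factor (3) and the 25% headroom for intermediate data are assumptions commonly used in such sizing exercises:

```python
import math

# Rough data-node sizing for an HDFS cluster (assumed parameters noted below).
initial_data_tb = 600          # initial data size from the question
usable_disk_per_node_tb = 8    # 10 x 1 TB disks, minus 2 TB for OS etc.
replication_factor = 3         # HDFS default (assumption)
overhead_margin = 0.25         # headroom for intermediate/temp data (assumption)

raw_storage_needed = initial_data_tb * replication_factor * (1 + overhead_margin)
n = math.ceil(raw_storage_needed / usable_disk_per_node_tb)
print(n)  # 282 data nodes under these assumptions
```

The real answer an interviewer expects also covers growth rate and compression, which is exactly why the excerpt notes that "we never know when" the data profile changes.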

Scenario based Apache Spark Interview Questions

Question 1: What are 'partitions'? A partition is a small piece of a larger chunk of data. Partitioning is logical: Spark uses partitions to manage data so that traffic over the network is kept to a minimum. You could also add that partitioning is the process used to derive the aforementioned small pieces of data from larger chunks, …
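The key-to-partition assignment behind this can be illustrated in plain Python. This is a simulation of how a hash partitioner (such as Spark's HashPartitioner) places keyed records, not Spark code itself; the keys and partition count are made-up values, and Python's hash() stands in for the JVM hashCode:

```python
# Pure-Python sketch of hash partitioning: a record lands in the
# partition given by (hash of its key) mod (number of partitions).
def partition_for(key, num_partitions):
    # Spark uses the key's hashCode; Python's hash() stands in for it here.
    return hash(key) % num_partitions

num_partitions = 4
keys = ["user-1", "user-2", "user-3", "user-4"]
placement = {k: partition_for(k, num_partitions) for k in keys}
for k, p in placement.items():
    print(f"{k} -> partition {p}")
```

Because records with the same key always hash to the same partition, operations like reduceByKey can run within each partition without shuffling every record across the network.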

Apache Spark Programming skill evaluation advanced interview questions

Describe the following code and what the output will be. Output: The main method, calculate, reads two sets of data. (In the example they are provided from a constant inline data structure that is converted into a distributed dataset using parallelize.) The map applied to each of them transforms them into tuples, each consisting of a userId and the …
