Learning Objectives
In this course you will learn about:
- the purpose of Spark, and why and when you would use it.
- the components of the Spark unified stack.
- the basics of the Resilient Distributed Dataset, Spark's primary data abstraction.
- how to download and install Spark standalone.
- an overview of Scala and Python.
Introduction to Spark
Learning Objectives
In this lesson you will learn to:
- Explain the purpose of Spark
- List and describe the components of the Spark unified stack
- Understand the basics of the Resilient Distributed Dataset (RDD)
- Download and install Spark standalone
- Summarize the basics of Scala and Python
- Launch and use Spark's Scala and Python shells
Objectives: After completing this lesson, you should be able to explain the purpose of Spark and understand why and when you would use Spark. You should be able to list and describe the components of the Spark unified stack. You will be able to understand the basics of the Resilient Distributed Dataset, Spark's primary data abstraction. Then you will see how to download and install Spark standalone to test it out yourself. You will get an overview of Scala and Python to prepare for using the two Spark shells.
There is an explosion of data. No matter where you look, data is everywhere. You get data from social media such as Twitter feeds, Facebook posts, SMS, and a variety of other sources. The need to process that data as quickly as possible becomes more important than ever. How can you find out what your customers want and be able to offer it to them right away? You do not want to wait hours for a batch job to complete. You need to have it in minutes or less.
MapReduce has been useful, but the amount of time it takes for the jobs to run is no longer acceptable in most situations. The learning curve for writing a MapReduce job is also steep, as it takes specific programming knowledge and know-how. Also, MapReduce jobs only work for a specific set of use cases. You need something that works for a wider set of use cases. Apache Spark was designed as a computing platform to be fast, general-purpose, and easy to use.
It extends the MapReduce model and takes it to a whole other level. The speed comes from in-memory computations. Applications running in memory allow for much faster processing and response. Spark is even faster than MapReduce for complex applications on disk.
This generality covers a wide range of workloads under one system. You can run batch applications such as MapReduce-type jobs, or iterative algorithms that build upon each other. You can also run interactive queries and process streaming data with your application. In a later slide, you'll see that there are a number of libraries which you can easily use to expand beyond the basic Spark capabilities.
Spark's ease of use lets you pick it up quickly using simple APIs for Scala, Python, and Java. As mentioned, there are additional libraries which you can use for SQL, machine learning, streaming, and graph processing. Spark runs on Hadoop clusters under cluster managers such as Hadoop YARN or Apache Mesos, or even standalone with its own scheduler.
You may be asking, why would I want to use Spark and what would I use it for? As you know, Spark is related to MapReduce in the sense that it expands on its capabilities.
Like MapReduce, Spark provides parallel distributed processing, fault tolerance on commodity hardware, scalability, and so on. Spark adds to the concept aggressively cached in-memory distributed computing, low latency, high-level APIs, and a stack of high-level tools described on the next slide. This saves time and money. There are two groups that we can consider here who would want to use Spark: data scientists and engineers.
You may ask, but aren't they similar? In a sense, yes, they do have overlapping skill sets, but for our purpose, we'll define data scientists as those who need to analyze and model the data to obtain insight. They have techniques to transform the data into something they can use for data analysis. They will use Spark for ad hoc analysis, running interactive queries that give them results immediately. Data scientists may also have experience using SQL, statistics, machine learning, and some programming, usually in Python, MATLAB, or R.
Once the data scientists have obtained insights on the data and someone later determines that there is a need to develop a production data processing application, a web application, or some system to act upon the insight, the people called upon to work on it would be the engineers.
Engineers use Spark's programming API to develop a system that implements business use cases. Spark parallelizes these applications across the clusters while hiding the complexities of distributed systems programming and fault tolerance. Engineers can use Spark to monitor, inspect, and tune applications. For everyone else, Spark is easy to use with a wide range of functionality. The product is mature and reliable.
Here we have a picture of the Spark unified stack. As you can see, the Spark core is at the center of it all. The Spark core is a general-purpose system providing scheduling, distributing, and monitoring of the applications across a cluster. Then you have the components on top of the core that are designed to interoperate closely, letting the users combine them, just like they would any libraries in a software project. The benefit of such a stack is that all the higher layer components will inherit the improvements made at the lower layers.
For example, an optimization to the Spark Core will speed up the SQL, streaming, machine learning, and graph processing libraries as well. The Spark Core is designed to scale up from one to thousands of nodes. It can run over a variety of cluster managers, including Hadoop YARN and Apache Mesos, or it can simply run standalone with its own built-in scheduler. Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive variant of SQL). Spark SQL allows developers to intermix SQL queries with programs written in Python, Scala, or Java.
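As a quick illustration of that intermixing, here is a minimal Scala sketch, not from the course labs: it assumes a Spark 1.x-style SQLContext built from the shell's sc, and a hypothetical JSON file of people records.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)               // sc is the SparkContext from the shell
val people = sqlContext.read.json("people.json")  // hypothetical JSON file of people records
people.registerTempTable("people")                // expose the DataFrame to SQL by name
val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.collect().foreach(println)                  // mix the SQL result back into Scala code
```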
Spark Streaming provides processing of live streams of data. The Spark Streaming API closely matches that of the Spark Core's API, making it easy for developers to move between applications that process data stored in memory versus data arriving in real time. It also provides the same degree of fault tolerance, throughput, and scalability that the Spark Core provides. MLlib is the machine learning library that provides multiple types of machine learning algorithms. All of these algorithms are designed to scale out across the cluster as well.
GraphX is a graph processing library with APIs to manipulate graphs and perform graph-parallel computations.
Here's a brief history of Spark. I'm not going to spend too much time on this as you can easily find more information for yourself if you are interested. Basically, you can see that MapReduce started out over a decade ago. MapReduce was designed as a fault-tolerant framework that ran on commodity systems. Spark came out about a decade later with a similar framework to run data processing on commodity systems, also using a fault-tolerant framework. MapReduce started off as a general batch processing system, but there are two major limitations: 1) difficulty in programming directly in MapReduce, and 2) batch jobs do not fit many use cases. This spawned specialized systems to handle other use cases, and when you try to combine these third-party systems in your applications, there is a lot of overhead.
Taking a look at the code size of some applications on the graph on this slide, you can see that Spark requires considerably less code. Even with Spark's libraries, it only adds a small amount of code because everything is tightly integrated with very little overhead. There is great value in being able to express a wide variety of use cases with few lines of code. Now let's get into the core of Spark.
Spark's primary core abstraction is called the Resilient Distributed Dataset, or RDD. Essentially it is just a distributed collection of elements that is parallelized across the cluster. There are two types of RDD operations: transformations and actions. Transformations do not return a value to the driver. In fact, nothing is evaluated during the definition of these transformation statements. Spark just creates a directed acyclic graph, or DAG, which will only be evaluated at runtime. We call this lazy evaluation.
The fault tolerance aspect of RDDs allows Spark to reconstruct the transformations used to build the lineage to get back the lost data. Actions are when the transformations get evaluated, along with the action that is called for that RDD.
Actions return values. For example, you can do a count on an RDD to get the number of elements within it, and that value is returned. So you have an image of a base RDD shown here on the slide. The first step is loading the dataset from Hadoop. Then you apply successive transformations on it such as filter and map. Nothing actually happens until an action is called. The DAG is just updated each time until an action is called. This provides fault tolerance. For example, let's say a node goes offline. All it needs to do when it comes back online is to re-evaluate the graph to where it left off.
Caching is provided with Spark to enable the processing to happen in memory. If it does not fit in memory, it will spill to disk.
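To make the transformation/action distinction concrete, here is a minimal sketch in the Scala shell; the file path and the filter condition are assumptions, not the course's lab code.

```scala
// Neither of these two lines reads or computes anything; they only build up the DAG.
val lines  = sc.textFile("hdfs://namenode:9000/logs/app.log")  // hypothetical path
val errors = lines.filter(line => line.contains("ERROR"))      // transformation: lazily defined

errors.cache()   // ask Spark to keep this RDD in memory once it has been computed

// count() is an action: only now does Spark evaluate the DAG and return a value,
// spilling to disk if the cached data does not fit in memory.
val numErrors = errors.count()
println(s"error lines: $numErrors")
```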
Spark runs on Windows or Unix-like systems. The lab exercises that are part of this course will be running with IBM BigInsights. The information on this slide shows you how you can install the standalone version yourself with all the default values that come along with it.
To install Spark, simply download a pre-built version of Spark and place the compiled version on each node of your cluster. Then you can manually start the cluster by executing ./sbin/start-master.sh. In the lab, the start script is different, and you will see this in the lab exercise guide.
Again, the default URL to the master UI is on port 8080. The lab exercise environment will be different and you will see this in the guide as well.
You can check out Spark's website for more information. Spark jobs can be written in Scala, Python, or Java; Spark shells are available for Scala and Python. This course will not teach how to program in each specific language, but will cover how to use them within the context of Spark. It is recommended that you have at least some programming background to understand how to code in any of these.
If you are setting up the Spark cluster yourself, you will have to make sure that you have a compatible version of the programming language you choose to use. This information can be found on Spark's website. In the lab environment, everything has been set up for you - all you do is launch the shell and you are ready to go.
Spark itself is written in the Scala language, so it is natural to use Scala to write Spark applications. This course will cover code examples in Scala, Python, and Java. Java 8 supports the functional programming style and includes lambdas, which concisely capture the functionality that is executed by the Spark engine. This bridges the gap between Java and Scala for developing applications on Spark. Java 6 and 7 are supported, but would require more work and an additional library to get the same amount of functionality as you would using Scala or Python.
Here we have a brief overview of Scala. Everything in Scala is an object. The primitive types defined by Java, such as int or boolean, are objects in Scala. Functions are objects in Scala and will play an important role in how applications are written for Spark.
Numbers are objects. This means that in the expression you see here, 1 + 2 * 3 / 4, the individual numbers actually invoke the operators +, *, and / as methods, with the other numbers passed in as arguments using dot notation. Functions are objects. You can pass functions as arguments into another function. You can store them as variables. You can return them from other functions. A function declaration is the function name followed by the list of parameters and then the return type. This slide and the next serve as a very brief overview of Scala. If you wish to learn more about Scala, check out its website for tutorials and guides.
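A few lines of Scala to illustrate those points; this is a minimal sketch of the syntax, not course lab code, and the names are our own.

```scala
// Numbers are objects: the infix expression is really a chain of method invocations.
val a = 1 + 2 * 3 / 4
val b = (1).+(((2).*(3))./(4))   // the same result, written with explicit dot notation

// A function declaration: the name, the parameter list, then the return type.
def square(x: Int): Int = x * x

// Functions are objects: store them in variables and pass them to other functions.
val twice: Int => Int = x => x * 2
def applyTo10(f: Int => Int): Int = f(10)
println(applyTo10(square))   // 100
println(applyTo10(twice))    // 20
```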
Throughout this course, you will see examples in Scala that will have explanations of what they do. Remember, the focus of this course is on the context of Spark. It is not intended to teach Scala, Python, or Java. Anonymous functions are very common in Spark applications. Essentially, if the function you need is only going to be required once, there is really no value in naming it. Just use it anonymously on the go and forget about it. For example, suppose you have a timeFlies function, and in it, you just print a statement to the console. In another function, once per second, you need to call this timeFlies function. Without anonymous functions, you would code it like the top example, by defining the timeFlies function. Using the anonymous function capability, you just provide the function arguments, the right arrow, and the body of the function after the right arrow, as in the bottom example. Because this is the only place you will be using this function, you do not need to name it.
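Here is a hedged sketch of that idea; the names timeFlies and oncePerSecond follow the classic Scala tutorial example and are not part of Spark.

```scala
// Named version: define a timeFlies function, then hand it to oncePerSecond by name.
def timeFlies(): Unit = println("time flies like an arrow...")

def oncePerSecond(callback: () => Unit): Unit = {
  for (_ <- 1 to 3) { callback(); Thread.sleep(1000) }  // loop a few times for the sketch
}

oncePerSecond(timeFlies)

// Anonymous version: the body follows the right arrow, so the function never needs a name.
oncePerSecond(() => println("time flies like an arrow..."))
```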
The Spark shell provides a simple way to learn Spark's API. It is also a powerful tool for analyzing data interactively. The shell is available in either Scala, which runs on the Java VM, or Python. To start up the Scala shell, execute the command spark-shell from within Spark's bin directory. To create an RDD from a text file, invoke the textFile method on the sc object, which is the SparkContext. We'll talk more about these functions in a later lesson.
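For example, a short Scala shell session might look like this; the file name README.md is just an assumed example file.

```scala
// Launched with ./bin/spark-shell; the shell creates the SparkContext for you as sc.
val readme = sc.textFile("README.md")   // creates an RDD backed by the text file
readme.count()                          // action: the number of lines in the file
readme.first()                          // action: the first line of the file
```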
To start up the shell for Python, you would execute the pyspark command from the same bin directory. Then, invoking the textFile command will also create an RDD for that text file.
In the lab exercise, you will start up either of the shells and run a series of RDD transformations and actions to get a feel for how to work with Spark. In a later lesson and exercise, you will get to dive deeper into RDDs. Having completed this lesson, you should now understand what Spark is all about and why you would want to use it. You should be able to list and describe the components in the Spark stack as well as understand the basics of Spark's primary abstraction, the RDD.
You also saw how to download and install Spark standalone if you wish to, or you can use the provided lab environment. You got a brief overview of Scala and saw how to launch and use the two Spark shells.
You have completed this lesson. Go on to the first lab exercise and then proceed to the next lesson in this course.
Resilient Distributed Dataset and DataFrames
Learning Objectives
In this lesson you will learn to:
- Describe Spark’s primary data abstraction
- Understand how to create parallelized collections and external datasets
- Work with Resilient Distributed Dataset (RDD) operations and DataFrames
- Utilize shared variables and key-value pairs
Hi. Welcome to the Spark Fundamentals course. This lesson will cover Resilient Distributed Dataset.
After completing this lesson, you should be able to understand and describe Spark's primary data abstraction, the RDD. You should know how to create parallelized collections from internal and external datasets. You should be able to use RDD operations such as transformations and actions. Finally, I will also show you how to take advantage of Spark's shared variables and key-value pairs. The Resilient Distributed Dataset (RDD) is Spark's primary abstraction. An RDD is a fault-tolerant collection of elements that can be parallelized; in other words, they can be made to be operated on in parallel. They are also immutable. These are the fundamental primary units of data in Spark. When RDDs are created, a directed acyclic graph (DAG) is created.
This type of operation is a transformation. Transformations make updates to that graph, but nothing actually happens until some action is called. Actions are another type of operation. We'll talk more about this shortly. The notion here is that the graph can be replayed on a node that needs to get back to the state it was in before it went offline, thus providing fault tolerance. The elements of the RDD can be operated on in parallel across the cluster. Remember, transformations return a pointer to the RDD created, and actions return values that come from the action. There are three methods for creating an RDD. You can parallelize an existing collection. This means that the data already resides within Spark and can now be operated on in parallel.
As an example, if you have an array of data, you can create an RDD out of it by calling the parallelize method. This method returns a pointer to the RDD, so this new distributed dataset can now be operated upon in parallel throughout the cluster. The second method to create an RDD is to reference a dataset. This dataset can come from any storage source supported by Hadoop, such as HDFS, Cassandra, HBase, Amazon S3, etc.
The third method to create an RDD is to transform an existing RDD to create a new RDD. In other words, let's say you have the array of data that you parallelized earlier. Now you want to filter out strings that are shorter than 20 characters. A new RDD is created using the filter method.
A final point on this slide. Spark supports text files, SequenceFiles and any other Hadoop InputFormat. Here is a quick example of how to create an RDD from an existing collection of data. In the examples throughout the course, unless otherwise indicated, we're going to be using Scala to show how Spark works. In the lab exercises, you will get to work with Python and Java as well. So the first thing is to launch the Spark shell. This command is located under the $SPARK_HOME/bin directory. In the lab environment, SPARK_HOME is the path to where Spark was installed.
Once the shell is up, create some data with values from 1 to 10,000. Then, create an RDD from that data using the parallelize method of the SparkContext, shown as sc on the slide. This means that the data can now be operated on in parallel. We will cover more on the SparkContext, the sc object that invokes the parallelize function, in our programming lesson; for now, just know that when you initialize a shell, the SparkContext, sc, is initialized for you to use. The parallelize method returns a pointer to the RDD. Remember, this only returns a pointer to the RDD; the RDD is not actually materialized until some action is invoked on it. With this new RDD, you can perform additional transformations or actions on it, such as the filter transformation.
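A hedged sketch of that shell session follows; the variable names and the filter condition are our own.

```scala
// Inside spark-shell, where sc is already created for you.
val data = 1 to 10000                   // a local Scala collection
val distData = sc.parallelize(data)     // returns a pointer to a new RDD; nothing runs yet

// A transformation on the new RDD, followed by an action that forces evaluation.
val smallValues = distData.filter(x => x < 10)
smallValues.collect()                   // Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
```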
Another way to create an RDD is from an external dataset. In the example here, we are creating an RDD from a text file using the textFile method of the SparkContext object. You will see plenty more examples of how to create RDDs throughout this course. Here we go over some basic operations. You have seen how to load a file from an external dataset. This time, however, we are loading a file from HDFS. Loading the file creates an RDD, which is only a pointer to the file. The dataset is not loaded into memory yet. Nothing will happen until some action is called. A transformation basically updates the directed acyclic graph (DAG).
So the transformation here is saying map each line s to the length of that line. Then, the action operation is reducing it to get the total length of all the lines. When the action is called, Spark goes through the DAG and applies all the transformations up to that point, followed by the action, and then a value is returned back to the caller. A common example is a MapReduce word count. You first split up the file by words and then map each word into a key-value pair with the word as the key and a value of 1. Then you reduce by the key, which adds up all the values for the same key, effectively counting the number of occurrences of that key. Finally, you call the collect() function, which is an action, to have it print out all the words and their occurrences. Again, you will get to work with this in the lab exercises.
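Here is a hedged sketch of both examples in Scala; the HDFS path is a placeholder, not the course's lab file.

```scala
// Total length of all lines: map is a transformation, reduce is the action.
val lines = sc.textFile("hdfs:///data/somefile.txt")   // placeholder path
val totalLength = lines.map(s => s.length).reduce((a, b) => a + b)

// Classic word count: split lines into words, pair each word with 1, then sum by key.
val counts = lines.flatMap(line => line.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey((a, b) => a + b)
counts.collect().foreach(println)   // collect() is the action that triggers the job
```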
You have heard me mention the directed acyclic graph several times now. On this slide, you will see how to view the DAG of any particular RDD. A DAG is essentially a graph of the business logic, which does not get executed until an action is called. This is often called lazy evaluation.
To view the DAG of an RDD after a series of transformations, use the toDebugString method as you see here on the slide. It will display the series of transformations that Spark will go through once an action is called. You read it from the bottom up. In the sample DAG shown on the slide, you can see that it starts as a textFile and goes through a series of transformations such as map and filter, followed by more map operations. Remember that it is this behavior that allows for fault tolerance. If a node goes offline and comes back on, all it has to do is just grab a copy of this from a neighboring node and rebuild the graph back to where it was before it went offline.
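For instance, on a hypothetical chain of transformations (the exact output format varies with the Spark version):

```scala
val transformed = sc.textFile("hdfs:///data/somefile.txt")   // placeholder path
  .map(line => line.toLowerCase)
  .filter(line => line.contains("spark"))
  .map(line => (line, 1))

// Prints the lineage (the chain of transformations), which you read from the bottom up.
println(transformed.toDebugString)
```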
In the next several slides, you will see at a high level what happens when an action is executed. Let's look at the code first. The goal here is to analyze some log files. In the first line, you load the log from the Hadoop file system. The next two lines filter out the error messages within the log. Before you invoke an action on it, you tell Spark to cache the filtered dataset; it doesn't actually cache it yet, as nothing has been executed up until this point. Then you apply further filters to get the specific error messages relating to mysql and php, followed by the count action to find out how many errors were related to each of those filters.
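For reference while following the steps, here is a minimal sketch of that log-analysis code; the HDFS path and the message layout are assumptions, not the course's exact lab code.

```scala
// The log path and message layout are assumptions, not the course's exact lab code.
val logFile  = sc.textFile("hdfs:///logs/application.log")
val errors   = logFile.filter(line => line.contains("ERROR"))
val messages = errors.map(line => line.split("\t")(1))   // assume the message is the second field
messages.cache()                                         // nothing is cached until an action runs

// First action: the DAG is executed, the filtered data is cached, and a count is returned.
val mysqlErrors = messages.filter(msg => msg.contains("mysql")).count()

// Second action: Spark reads from the cache instead of going back to HDFS.
val phpErrors = messages.filter(msg => msg.contains("php")).count()

println(s"mysql: $mysqlErrors, php: $phpErrors")
```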
Now let's walk through each of the steps. The first thing that happens when you load in the text file is that the data is partitioned into different blocks across the cluster. Then the driver sends the code to be executed on each block. In the example, that would be the various transformations and actions that are sent out to the workers. Actually, it is the executor on each worker that performs the work on each block. You will see a bit more on executors in a later lesson. Then the executors read the HDFS blocks to prepare the data for the operations in parallel. After a series of transformations, you want to cache the results up until that point into memory; a cache is created. After the first action completes, the results are sent back to the driver. In this case, we're looking for messages that relate to mysql. This is then returned back to the driver.
To process the second action, Spark will use the data on the cache -- it doesn't need to go to the HDFS data again. It just reads it from the cache and processes the data from there.
Finally the results go back to the driver and we have completed a full cycle. So a quick recap. This is a subset of some of the transformations available; the full list of them can be found on Spark's website. Remember that transformations are essentially lazy evaluations: nothing is executed until an action is called. Each transformation function basically updates the graph, and when an action is called, the graph is executed. A transformation returns a pointer to the new RDD.
I'm not going to read through this, as you can do so yourself. I'll just point out some things I think are important. The flatMap function is similar to map, but each input can be mapped to 0 or more output items. What this means is that func should return a sequence of objects rather than a single item. The flatMap therefore flattens a list of lists for the operations that follow. Basically, this would be used for MapReduce operations where you might have a text file and, each time a line is read in, you split that line up by spaces to get individual keywords. Each of those lines is ultimately flattened so that you can perform the map operation on it, mapping each keyword to a value of one.
The join function combines two sets of key-value pairs and returns a set where each key maps to a pair of values from the two initial sets. For example, if you have a (K, V) pair and a (K, W) pair, joining them gives you a (K, (V, W)) set.
The reduceByKey function aggregates on each key by using the given reduce function. This is something you would use in a WordCount to sum up the values for each word to count its occurrences.
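A few hedged one-liners to illustrate those three transformations; the sample data is made up for the sketch.

```scala
// flatMap: one input line can produce many output words.
val words = sc.parallelize(Seq("to be or", "not to be")).flatMap(line => line.split(" "))

// reduceByKey: sum the 1s for each word, as in a word count.
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// join: combine (K, V) and (K, W) pairs into (K, (V, W)).
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Austin")))
val joined = ages.join(cities)      // e.g. ("alice", (30, "Paris"))

counts.collect().foreach(println)
joined.collect().foreach(println)
```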
Actions return values. Again, you can find more information on Spark's website; this is just a subset.
The collect function returns all the elements of the dataset as an array to the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data, to make sure your filter function works correctly.
The count function returns the number of elements in a dataset and can also be used to check and test transformations.
The take(n) function returns an array with the first n elements. Note that this is currently not executed in parallel. The driver computes all the elements.
The foreach(func) function runs the function func on each element of the dataset.
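Putting a few of these actions together, here is a minimal sketch with made-up data; the variable names are our own.

```scala
val nums  = sc.parallelize(1 to 100)
val evens = nums.filter(_ % 2 == 0)

evens.count()                    // 50: how many elements survived the filter
evens.take(5)                    // Array(2, 4, 6, 8, 10); computed by the driver, not in parallel
evens.collect()                  // all 50 elements returned to the driver as an array
evens.foreach(x => println(x))   // runs on the executors; in a cluster the output lands in their logs
```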