Learning Objectives

...

  • The definition of Big Data
  • The Hadoop architecture including MapReduce and HDFS
  • How to use the Hadoop file system shell and the Ambari Console to work with HDFS
  • Starting and stopping Hadoop components
  • How to add/remove a node to/from a Hadoop Cluster
  • Determining how much space is available on each node in a Hadoop cluster
  • How to modify Hadoop configuration parameters
  • The concepts and purpose of other components such as MapReduce, Pig, Hive, Flume, Sqoop, and Oozie

Introduction to Hadoop

...

In this lesson you will learn about:

  • The concept of Big Data
  • What Hadoop is
  • Other open source software related to Hadoop
  • How Big Data solutions can work on the Cloud

...

Hadoop Architecture & HDFS

...

Learning objectives

In this lesson you will learn about:

  • The main Hadoop components
  • How HDFS works
  • Data access patterns for which HDFS is designed
  • How data is stored in an HDFS cluster

...

Hadoop Administration

...

Learning objectives

In this lesson you will learn about:

  • Adding and removing nodes from a cluster
  • How to verify the health of a cluster
  • How to start and stop a cluster's components
  • Modifying Hadoop configuration parameters
  • Setting up a rack topology

...

Hadoop Components

...

Learning objectives

In this lesson you will learn about:

  • The MapReduce philosophy
  • How to use Pig and Hive in a Hadoop environment
  • Moving data into Hadoop using Flume and Sqoop
  • Scheduling and controlling Hadoop job execution using Oozie

...

The reduce function is simple: it just sums the values it is given to produce a total for each key.
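To make the idea concrete, here is a minimal sketch of such a reducer using the standard Hadoop Java MapReduce API; the class name SumReducer is hypothetical and not part of the course material.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts emitted for each key by the map phase.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();          // add up the values for this key
            }
            result.set(sum);
            context.write(key, result);      // emit (key, total)
        }
    }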

Pig & Hive

Video: https://www.youtube.com/watch?v=_Ae0x1ffYRs

Pig and Hive have much in common. Both translate high-level languages into MapReduce jobs so that the programmer can work at a higher level than when writing MapReduce jobs in Java, or in other lower-level languages supported by Hadoop through Hadoop Streaming. The high-level languages offered by Pig and Hive let you write programs that are much smaller than the equivalent Java code. When you need to work at a lower level to accomplish something these high-level languages do not support themselves, you have the option to extend them, often by writing user-defined functions in Java. Interoperability works both ways, since programs written in these high-level languages can also be embedded inside other languages. Finally, because these technologies run on top of Hadoop, they share Hadoop's limitations with respect to random reads and writes and low-latency queries.

Now, let us examine what is unique about each technology, starting with Pig. Pig was developed at Yahoo Research around 2006 and moved into the Apache Software Foundation in 2007. Pig's language, called PigLatin, is a data flow language: the kind of language in which you program by connecting things together. Pig can operate on complex data structures, even those with multiple levels of nesting. Unlike SQL, Pig does not require the data to have a schema, so it is well suited to processing unstructured data; however, Pig can still leverage the value of a schema if you choose to supply one. Like SQL, PigLatin is relationally complete, which means it is at least as powerful as relational algebra. Turing completeness additionally requires looping constructs, conditional constructs, and an infinite memory model. PigLatin is not Turing complete on its own, but it is Turing complete when extended with user-defined functions.

Hive is a technology developed at Facebook that turns Hadoop into a data warehouse, complete with a dialect of SQL for querying it. Being an SQL dialect, HiveQL is a declarative language: unlike in PigLatin, you do not specify the data flow; instead, you describe the result you want and Hive figures out how to build a data flow to achieve it. Also unlike Pig, a schema is required, although you are not limited to a single schema. Like PigLatin and SQL, HiveQL on its own is relationally complete but not Turing complete; like Pig, it can be extended through UDFs to become Turing complete. Let us now examine Pig in more detail.

Pig consists of a language and an execution environment. The language is called PigLatin. There are two choices of execution environment: a local environment and a distributed environment. The local environment is good for testing when you do not have a full distributed Hadoop environment deployed. You tell Pig to run in the local environment by passing the -x local option when you start Pig's command line interpreter; you tell it to run in the distributed environment by passing -x mapreduce instead.

Alternatively, you can start the Pig command line interpreter without any arguments and it will default to the distributed environment. There are three different ways to run Pig. You can run your PigLatin code as a script, just by passing the name of your script file to the pig command. You can run it interactively through the Grunt shell, launched by running pig with no script argument. Finally, you can call into Pig from within Java using Pig's embedded form.
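To make this concrete, the three ways of running Pig correspond roughly to the commands sketched below; the script name wordcount.pig is hypothetical.

    # 1. Run a PigLatin script in local mode (drop -x local, or use -x mapreduce, for the cluster)
    pig -x local wordcount.pig

    # 2. Work interactively in the Grunt shell
    pig

    # 3. The embedded form is called from within a Java program (via Pig's PigServer class),
    #    so it is compiled and run like any other Java program rather than from the pig command.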

As mentioned in the overview, Hive is a technology for turning Hadoop into a data warehouse, complete with an SQL dialect for querying it. There are three ways to run Hive. You can run it interactively by launching the Hive shell using the hive command with no arguments. You can run a Hive script by passing the -f option to the hive command along with the path to your script file. Finally, you can execute a Hive program as a single command by passing the -e option to the hive command followed by your Hive program in quotes.
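As a quick reference, the three ways of running Hive correspond to commands like the following; the script file and query shown are hypothetical.

    # 1. Interactive shell
    hive

    # 2. Run a script file
    hive -f my_queries.hql

    # 3. Run a single program passed as a quoted string with -e
    hive -e "SELECT COUNT(*) FROM mytable"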

Flume, Sqoop and Oozie

Video: https://www.youtube.com/watch?v=VL89pHMU1lw

Now let us look at moving data into Hadoop. We will begin with Flume's architecture, then examine the three modes it can run in, followed by a look at the event data model.

Flume is an open source software program developed by Cloudera that acts as a service for aggregating and moving large amounts of data around a Hadoop cluster as the data is produced or shortly thereafter. Its primary use case is the gathering of log files from all the machines in a cluster to persist them in a centralized store such as HDFS.

This topic is not intended to cover all aspects of Sqoop but to give you an idea of its capabilities. Sqoop is an open source product designed to transfer data between relational database systems and Hadoop. It uses JDBC to access the relational systems. Sqoop first accesses the database to understand the schema of the data, then generates a MapReduce application to import or export the data. When you use Sqoop to import data into Hadoop, Sqoop generates a Java class that encapsulates one row of the imported table.
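For instance, importing a single table might look something like the following; the JDBC connection string, credentials, table name, and target directory are all hypothetical.

    # Import the "orders" table from a MySQL database into HDFS
    sqoop import \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser -P \
      --table orders \
      --target-dir /user/hadoop/orders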

You have access to the source code for the generated class, which allows you to quickly develop other MapReduce applications that use the records Sqoop stored in HDFS.

In Flume, you create data flows by building up chains of logical nodes and connecting them to sources and sinks. For example, say you wish to move data from an Apache access log into HDFS. You create a source by tailing access.log and use a logical node to route this to an HDFS sink. Most production Flume deployments have a three-tier design. The agent tier consists of Flume agents co-located with the source of the data that is to be moved.

The collector tier consists of one or more collectors, each of which collects data coming in from multiple agents and forwards it on to the storage tier, which may be a file system such as HDFS. Here is an example of such a design. Say we have four HTTP server nodes producing log files labeled httpd_logx, where x is a number between 1 and 4. Each of these HTTP server nodes has a Flume agent process running on it. There are two collector nodes: Collector1 is configured to take data from Agent1 and Agent2 and route it to HDFS, and Collector2 is configured to take data from Agent3 and Agent4 and route it to HDFS. Having multiple collector processes increases the parallelism with which data flows into HDFS from the web servers.
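As a rough sketch of such a flow, a single agent that tails an Apache access log into HDFS can be described in a Flume properties file like the one below. Note that this uses the configuration syntax of current Apache Flume (Flume NG) rather than the agent/collector terminology above, and the agent name, log path, and HDFS path are all hypothetical.

    # Name the source, channel, and sink belonging to the agent "agent1"
    agent1.sources  = weblog
    agent1.channels = mem
    agent1.sinks    = tohdfs

    # Source: tail the Apache access log as it is written
    agent1.sources.weblog.type     = exec
    agent1.sources.weblog.command  = tail -F /var/log/httpd/access.log
    agent1.sources.weblog.channels = mem

    # Channel: buffer events in memory between source and sink
    agent1.channels.mem.type = memory

    # Sink: write the events into HDFS
    agent1.sinks.tohdfs.type      = hdfs
    agent1.sinks.tohdfs.hdfs.path = hdfs://namenode:8020/flume/weblogs
    agent1.sinks.tohdfs.channel   = mem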

Oozie is an open source job control component used to manage Hadoop jobs. Oozie workflows are collections of actions arranged in a Directed Acyclic Graph (DAG). There is a control dependency between actions, in that a second action cannot run until the preceding action completes. For example, you can have one job execute only when a previous job completed successfully, or specify that several jobs are permitted to execute in parallel but a final job must wait to begin executing until all of the parallel jobs complete. Workflows are written in hPDL, an XML process definition language, and are stored in a file called workflow.xml.
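For orientation, a minimal workflow.xml with a single action might look roughly like the sketch below; the workflow name, action name, and the contents of the map-reduce element are hypothetical and abbreviated.

    <workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="first-job"/>
        <action name="first-job">
            <map-reduce>
                <!-- job-tracker, name-node, and job configuration go here -->
            </map-reduce>
            <ok to="end"/>       <!-- on success, continue to the end node -->
            <error to="fail"/>   <!-- on failure, go to the kill node -->
        </action>
        <kill name="fail">
            <message>Workflow failed at ${wf:lastErrorNode()}</message>
        </kill>
        <end name="end"/>
    </workflow-app>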

Each workflow action starts a job in some remote system, and that remote system performs a callback to Oozie to notify it that the action has completed. The coordinator component can invoke workflows based upon a time interval (for example, once every 15 minutes) or upon the availability of specific data. It might also be necessary to connect workflow jobs that run regularly but at different time intervals; for example, you may want the output of the last four jobs that run every 15 minutes to be the input to a job that runs every hour.
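A time-triggered coordinator, for example one that runs a workflow every 15 minutes, is itself described in a small XML file; the sketch below is hypothetical in its name, dates, and application path.

    <coordinator-app name="every-15-min" frequency="${coord:minutes(15)}"
                     start="2021-01-01T00:00Z" end="2021-12-31T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <!-- HDFS directory containing the workflow.xml to run -->
                <app-path>hdfs://namenode:8020/user/hadoop/apps/example-wf</app-path>
            </workflow>
        </action>
    </coordinator-app>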

A single workflow can invoke a single task or multiple tasks, either in sequence or based upon some control logic.