Job Scheduling Using Apache Oozie

Apache Oozie is a scheduler for managing and running Hadoop jobs in a distributed environment. We can build a desired pipeline by combining different types of tasks, such as Hive, Pig, Sqoop, or MapReduce jobs. Using Apache Oozie you can also schedule when your jobs run. Two or more jobs can also be programmed within a sequence so that they run in parallel with one another. It is a scalable, efficient, and extensible tool.

Oozie is an open-source Java web application that triggers workflow actions. In essence, it uses the Hadoop execution engine to perform the tasks.

Apache Oozie detects the completion of tasks through callbacks and polling. When Oozie starts a task, it provides the task with a unique HTTP callback URL and is notified via that URL when the task is complete. If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Types of Apache Oozie Jobs

In Apache Oozie, three types of jobs exist.

  • Oozie Workflow Jobs

These are Directed Acyclic Graphs (DAGs) that define a sequence of actions to be carried out.

  • Oozie Coordinator Jobs

These are workflow jobs that are triggered by time and by the availability of data.

  • Oozie Bundles

These can be thought of as packages of multiple coordinator and workflow jobs.

Now let's look at each of these jobs one by one.

Workflow with Apache Oozie

A workflow is a sequence of actions arranged in a Directed Acyclic Graph (DAG). The actions depend on each other, since the next action can only be performed after the current action is completed. A workflow action can be any supported job type. Decision trees may be used to determine how and under what conditions a job should run.

You can create different types of actions based on your job, and each type of action can have its own tags. Before executing the workflow, the scripts or jars should be placed in the HDFS path.

Command: oozie job --oozie http://localhost:11000/oozie -config job.properties -run

We can use Fork in situations where we want to run multiple jobs in parallel. Whenever we use a Fork, we must use a Join as the end node of the Fork; for every Fork there should be a Join. Join assumes that all the nodes executing in parallel are children of a single Fork. For example, we can create two tables in parallel.
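The two-tables-in-parallel case can be sketched as the fork/join fragment below. This is a minimal sketch, not taken from the article's workflow: the node names (create-tables, create-table-a, create-table-b, joining) are hypothetical, and the action bodies are elided.

```xml
<!-- Sketch of a fork/join pair; node names are hypothetical. -->
<fork name="create-tables">
    <path start="create-table-a"/>
    <path start="create-table-b"/>
</fork>

<action name="create-table-a">
    <!-- first Hive action goes here -->
    <ok to="joining"/>
    <error to="end"/>
</action>

<action name="create-table-b">
    <!-- second Hive action goes here -->
    <ok to="joining"/>
    <error to="end"/>
</action>

<!-- Both parallel paths must transition to the same join node. -->
<join name="joining" to="end"/>
```

Note that both forked actions transition to the same `<join>` node, which only proceeds once every parallel path has finished.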

If we want to perform an action based on the output of a decision, we can add decision tags. For example, if we already have the Hive table, we don't need to create it again. In that case, we should add a decision tag so that the create-table steps do not run if the table already exists. Decision nodes have a switch tag similar to a switch case.
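The table-exists check can be sketched as a decision node like the one below. The property name tableExists and the action name create-table are hypothetical; `wf:conf()` is Oozie's EL function for reading a workflow configuration property.

```xml
<!-- Sketch of a decision node; the tableExists property and
     create-table action name are hypothetical. -->
<decision name="check-table">
    <switch>
        <!-- Skip table creation when the table is already there. -->
        <case to="end">${wf:conf('tableExists') eq 'true'}</case>
        <!-- Otherwise fall through to the create-table action. -->
        <default to="create-table"/>
    </switch>
</decision>
```

The first `<case>` whose predicate evaluates to true wins; `<default>` is taken when none match.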

We can pass the values of the job-tracker, name-node, script, and parameters directly, but this becomes difficult to manage. This is where a config file (i.e. a .properties file) is useful.

Apache Oozie Coordinator

You can schedule complex workflows, as well as workflows that run regularly, using Coordinators. Oozie Coordinators trigger workflow jobs based on predicates of time, data, or events. Workflows inside the coordinator job start when the specified condition is satisfied.

The definitions required for coordinator jobs are as follows.

  • Start − start datetime for the job.
  • End − end datetime for the job.
  • Timezone − timezone of the coordinator application.
  • Frequency − frequency, in minutes, of executing the job.

A few more properties are available for control information:

  • Timeout − the maximum time, in minutes, for which an action will wait to satisfy the additional conditions before getting discarded. A value of -1 means no timeout: the action will wait forever. The default value is -1.
  • Concurrency − the maximum number of actions for a job that can run in parallel. The default value is 1.
  • Execution − specifies the execution order if multiple instances of the coordinator job have satisfied their execution criteria. It can be:
      • FIFO (default)
      • LIFO
      • LAST_ONLY
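Putting these definitions together, a coordinator application might look like the sketch below. This is an illustrative fragment, not from the article: the app name, dates, and ${appPath} value are assumptions, and the frequency uses Oozie's `coord:days()` EL function to run the workflow once a day.

```xml
<!-- Sketch of a coordinator app; name, dates, and appPath are hypothetical. -->
<coordinator-app name="daily-hive-coord"
                 frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2021-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <timeout>10</timeout>          <!-- wait at most 10 minutes -->
        <concurrency>1</concurrency>   <!-- one action at a time -->
        <execution>FIFO</execution>    <!-- oldest eligible instance first -->
    </controls>
    <action>
        <workflow>
            <app-path>${appPath}</app-path>  <!-- HDFS path of workflow.xml -->
        </workflow>
    </action>
</coordinator-app>
```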

Apache Oozie bundle

The Oozie bundle framework lets you define and execute a collection of coordinator applications, often referred to as a data pipeline. There is no explicit dependency among the coordinator applications in an Oozie bundle; however, you can use the data dependencies of the coordinator applications to create an implicit data application pipeline. A bundle can be started/stopped/suspended/resumed/rerun, which gives better and easier operational control.

Kick-off time − the time when a bundle should start and submit its coordinator applications.
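A bundle definition wrapping a single coordinator might look like the sketch below. The bundle name, coordinator name, and app-path are hypothetical; the kick-off time goes in the bundle's `<controls>` block.

```xml
<!-- Sketch of a bundle app; names and paths are hypothetical. -->
<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- When to start submitting the coordinators below. -->
        <kick-off-time>2021-01-01T00:00Z</kick-off-time>
    </controls>
    <coordinator name="daily-coord">
        <!-- HDFS path of the coordinator.xml -->
        <app-path>${nameNode}/user/${user.name}/coordinators/daily</app-path>
    </coordinator>
</bundle-app>
```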

Scheduling a Hive Job with Apache Oozie

To schedule Hive jobs using Oozie, you need to write a Hive action. Your Oozie job will consist mainly of three items.

  • workflow.xml
  • job.properties
  • Hive script

Note:

The Hortonworks Sandbox can run a complete Hive-Oozie job. If you are using any other platform, make the configuration changes accordingly.

Job.properties

This file contains definitions of all the variables that you use in your workflow.xml. Say you listed the property below in workflow.xml:

<name-node>${nameNode}</name-node>

Then you have to declare nameNode in your job.properties file and assign it the corresponding value.

A sample job.properties is described below.

nameNode=hdfs://sandbox.hortonworks.com:8020

jobTracker=sandbox.hortonworks.com:8050

oozie.libpath=${nameNode}/user/oozie/share/lib/hive

oozie.wf.application.path=${nameNode}/user/${user.name}/workflows

appPath=${nameNode}/user/${user.name}/workflows

Explanation

Let’s explain what each of these says.

oozie.libpath=${nameNode}/user/oozie/share/lib/hive

Indicates the path (in HDFS) where all of the required jars are present.

appPath=${nameNode}/user/${user.name}/workflows

This is the location from which your application's dependent files are picked up.

Workflow.xml

This is where you write your Apache Oozie actions. It contains all the file information and scripts needed to schedule and run the Apache Oozie job. As the name implies, this is an XML file where you need to list the information within the proper tags. The following is an example workflow.xml for running a Hive action.

<workflow-app name="Oozie" xmlns="uri:oozie:workflow:0.1">

<start to="-hive"/>

<action name="-hive">

<hive xmlns="uri:oozie:hive-action:0.2">

<job-tracker>${jobTracker}</job-tracker>

<name-node>${nameNode}</name-node>

<job-xml>${appPath}/hive-site.xml</job-xml>

<configuration>

<property>

<name>oozie.hive.defaults</name>

<value>${appPath}/hive-site.xml</value>

</property>

<property>

<name>hadoop.proxyuser.oozie.hosts</name>

<value>*</value>

</property>

<property>

<name>hadoop.proxyuser.oozie.groups</name>

<value>*</value>

</property>

</configuration>

<script>create_table.hql</script>

</hive>

<ok to="end"/>

<error to="end"/>

</action>

<end name=”end”/>

</workflow-app>

Now let's try to understand what the workflow.xml document means, part by part.

The first line creates the workflow app, and we give it a name for identifying the job (as per our convenience). It specifies that we are defining a workflow application called 'Oozie'; all the other tags stay within this root tag.

The next two tags, start and action, are very self-explanatory: give your action a name (here '-hive') and start your Oozie job with the matching action.

The hive tag is really relevant because it tells what kind of action you are going to run. It could be an MR action, a Pig action, or a Hive action; here we have defined a Hive action.

<job-tracker>${jobTracker}</job-tracker>

<name-node>${nameNode}</name-node>

<job-xml>${appPath}/hive-site.xml</job-xml>

The above tags point to the paths of your job tracker, NameNode, and hive-site.xml. The exact values of these variables are declared in the job.properties file.

<script>create_table.hql</script>

Here you need to fill in the exact name of your script file (here, a Hive script file); it will be picked up and its query executed.

create_table.hql

This is the Hive script you would like to schedule in Oozie. It is pretty simple and self-explanatory.
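The article does not show the script's contents, so here is a minimal sketch of what a create_table.hql could contain; the table name and columns are hypothetical.

```sql
-- create_table.hql: minimal sketch; table and column names are hypothetical.
CREATE TABLE IF NOT EXISTS employees (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```

Using `IF NOT EXISTS` makes the script safe to re-run, which complements the decision-node pattern described earlier.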

Conclusion

I hope you now have a clear understanding of the Apache Oozie scheduler. You can learn more about Apache Oozie through big data online training.