Jobs

Members of a project in Hopsworks can launch the following types of applications through a project’s Jobs service:

  • Python (Hopsworks Enterprise only)
  • Apache Spark
  • Apache Flink

If you are a beginner, it is highly recommended to click on the Spark button on the landing page under the available tours. It will guide you through launching your first Spark application, and the steps for launching any other job type are similar. Details on running Python programs are provided in the Python section below.

Guided tours

To create a new job, click on the Jobs tab from the Project Menu and follow the steps below:

  • Step 1: Press the New Job button in the top left corner
  • Step 2: Give a name for your job
  • Step 3: Select one of the available job types
  • Step 4: Select the executable file of your job that you have uploaded earlier in a Dataset
  • Step 5: If the job is a Spark job, the main class will be inferred from the jar but can also be set manually
  • Step 6 (Optional): Configure default arguments to run your job with
  • Step 7: In the Configure and create tab you can manually specify the desired configuration for your job, any additional dependencies, and arbitrary Spark/Flink parameters.
  • Step 8: Click on the Create button
  • Step 9: Click on the Run button to launch your job. If no default arguments have been configured, a dialog textbox will ask for any runtime arguments the job may require. If this job requires no arguments, the field can be left empty. The figure below shows the dialog.
Job runtime arguments

Job input arguments

After creating a job by following the new job wizard, you can manage all jobs and their runs from the landing page of the Jobs service. The figure below shows a project with 6 jobs, displayed 5 per page. Once a job has run at least once, all its past and current runs are shown in the UI.

Jobs

Jobs UI

Users can interact with the jobs in the following ways:

  1. Search jobs by using the Search text box
  2. Filter jobs by creation date
  3. Set the number of jobs to be displayed per page
  4. Run a job
  5. Stop a job; this stops all ongoing runs of the job.
  6. Edit a job, for example to change the Spark configuration parameters
  7. View the Monitoring UI, with detailed job information such as the Spark UI, YARN, real-time logs and metrics
Job logs

Job real-time logs

  8. View a job’s details
Job real-time logs

Job details

  9. Make a copy of a job

  10. Export a job, which prompts the user to download a json file. A job can then be imported by clicking on New Job and then the Import Job button.

Additionally, users can click on a job to view additional information about its runs.

  1. Information about the run, such as the location of its log files and its id.
  2. Stop a run
  3. Monitoring UI of this particular run
  4. View/Download stdout logs
  5. View/Download stderr logs
Job logs

Job aggregated logs

By default, all files and folders created by Spark are group writable (i.e., umask=007). If you want to change this default umask, you can add the Spark property spark.hadoop.fs.permissions.umask-mode=<umask> under More Spark Properties when you create a new job.
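For example, to make files created by Spark group readable but not group writable, you could add the following line under More Spark Properties (the value 022 is only an illustration; substitute the umask that matches your needs):

    spark.hadoop.fs.permissions.umask-mode=022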

Python

(Available in Hopsworks Enterprise only)

There are three ways of running Python programs in Hopsworks:

  • Jupyter notebooks: Covered in the Jupyter section of the user guide.
  • Jobs UI
  • Programmatically

The GIF below demonstrates how to create a Python job from the Jobs UI by selecting a Python file that is already uploaded in a Hopsworks dataset and attaching a few other files to be immediately available to the application at runtime. However, any file can be made available to the application at runtime by calling, from within the Python application, the copy_to_local function of the hdfs module of the hops Python library: http://hops-py.logicalclocks.com/hops.html#module-hops.hdfs
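As a minimal sketch (the file name below is illustrative, and the exact signature and return value of copy_to_local should be checked against the linked API docs), the running application could fetch a file from the project’s Resources dataset like this:

    from hops import hdfs

    # Copy a file from the project's Resources dataset in HopsFS to the
    # application's local scratch/working directory; the path is relative
    # to the project root.
    local_file = hdfs.copy_to_local("Resources/my_input.csv")

    # Read the local copy like any regular file.
    with open(local_file) as f:
        data = f.read()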

Python new job UI

Create a new Python job from the Jobs UI

You do not have to upload the Python program through the UI to run it. That can also be done from within the Python program by using the upload function of the dataset module of the hops Python library: http://hops-py.logicalclocks.com

To do that, first generate an API key for your project (see Generate an API key), then use the project.connect() function of the same library to connect to a project of your Hopsworks cluster, and finally call dataset.upload.
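A minimal sketch of that flow is shown below; the host, project name, file names and keyword arguments are placeholders, so check the hops-py documentation for the exact signatures of project.connect and dataset.upload:

    from hops import project, dataset

    # Connect to a project on the Hopsworks cluster with an API key
    # (host, project name and key below are placeholders).
    project.connect("myproject", "hopsworks.example.com", api_key="MY_API_KEY")

    # Upload the local Python program to the project's Resources dataset,
    # after which it can be selected as the app file of a Python job.
    dataset.upload("my_job.py", "Resources")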

Docker

(Available in Hopsworks Enterprise only)

The Docker job type enables running your own Docker containers as jobs in Hopsworks. Users are no longer restricted to running only Python, Spark/PySpark and Flink programs, but can use the Jobs service to run any program or service, provided it is packaged in a Docker container.

As seen in the screenshot below, users can set the following Docker job specific properties (advanced properties are optional); a combined textual sketch of these fields is given after the list:

  • Docker image: The location of the Docker image. Currently only publicly accessible docker registries are supported.
  • Docker command: Newline delimited list of commands to run the docker image with.
  • Default arguments: Optional input arguments to be provided to the docker container.
  • Input paths: Newline-delimited list of datasets or directories to be made available to the docker container. Data is copied into the container asynchronously; it is up to the application to wait until all data has been copied. In the example screenshot below, the application sleeps for 20 seconds before using the data.
  • Output path: The location in Hopsworks datasets where the output of the job will be persisted, provided the programs running inside the container redirect their output to the same container-local path. For example, if the output path is set to /Projects/myproject/Resources and a container runs the command echo “hello” >> /Projects/myproject/Resources/hello.txt, then upon job completion the Hopsworks job will copy the entire content of /Projects/myproject/Resources from the docker container to the corresponding path with the same name under Datasets.
  • Environment variables: Newline delimited list of environment variables to be set for the Docker container.
  • Volumes: Newline delimited list of volumes to be mounted with the Docker job.
  • User id / Group Id: Provide the uid and gid to run the Docker container with. For further details, look into the Admin options below.
  • Redirect stdout/stderr: Whether to automatically redirect stdout and stderr to the Logs dataset. Logs will be made available after the job is completed. Disable this setting if you prefer to redirect the logs to another location.
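To make the relationship between these fields concrete, here is an illustrative sketch of a Docker job definition (image, paths and timings are placeholders; the example screenshots further below show the exact field layout). The command waits briefly for the input data to be copied into the container, copies a file, and writes the result under the configured output path so that Hopsworks copies it back to Datasets when the job completes:

    Docker image:   registry.example.com/myimage:latest
    Docker command: /bin/sh
                    -c
                    sleep 20 && cp /Projects/myproject/Resources/input.csv /Projects/myproject/Resources/output.csv
    Input paths:    /Projects/myproject/Resources
    Output path:    /Projects/myproject/Resources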

Admin options

The following options can be set using the Variables service within the Admin UI of Hopsworks:

  • docker_job_mounts_list: Comma-separated list of host paths jobs are allowed to mount. Default is empty.
  • docker_job_mounts_allowed: Whether mounting volumes is allowed. Allowed values: true/false. Default is false.
  • docker_job_uid_strict: Enable or disable strict mode for the uid/gid of docker jobs. In strict mode, users cannot set the uid/gid of the job. Default is true. If false and users do not set a uid and gid, the container will run with the uid/gid set in the Dockerfile. An example combination of these variables is shown below.
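For example, to allow Docker jobs to mount two specific host paths while keeping strict uid/gid mode, an administrator could set the following variable values in the Variables service (the paths are illustrative):

    docker_job_mounts_allowed = true
    docker_job_mounts_list = /srv/shared/data,/srv/shared/models
    docker_job_uid_strict = true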

Examples

Below you can find some examples showing how to set various Docker job options. Although each job uses commands and arguments differently, the output of all the jobs is equivalent. You can choose whichever setup is convenient for your use case; keep in mind that defaultArgs and execution args are provided in a single line (String variable). If the job fails and no out/error logs are available, make sure the commands and arguments are properly formatted, for example that no trailing whitespace characters are present.

The command to run is /bin/sh -c sleep 10 && cp /Projects/p1/Jupyter/README.md /Projects/p1/Resources/README_Jupyter.md && ls /

Example 1: A job with multiple commands and no arguments

Docker job example 1

Create a new Docker job from the Jobs UI using only the “command” property

Example 2: A job with multiple commands and default arguments

Docker job example 2

Create a new Docker job from the Jobs UI using only the “command” and “defaultArgs” properties

Example 3: A job with multiple commands and no arguments (requested upon execution)

Docker job example 3

Create a new Docker job from the Jobs UI using the “command” property and execution arguments

Below you can see how to view the stdout and stderr job logs.

Docker job logs

View Docker jobs logs

Hopsworks IDE Plugin

It is also possible to work on jobs while developing in your IntelliJ/PyCharm IDE by installing the Hopsworks Plugin from the marketplace.

Usage

  • Open the Hopsworks Job Preferences UI for specifying user preferences under Settings -> Tools -> Hopsworks Job Preferences.
  • Input the Hopsworks project preferences and job details you wish to work on.
  • Open a Project and, within the Project Explorer, right click on the program (.jar, .py, .ipynb) you wish to execute as a job on Hopsworks. The available job actions are shown in the context menu (Create, Run, Stop, etc.)
  • Note: The Python job type is only supported in Hopsworks-EE

Actions

  • Create: Create or update a job as specified in Hopsworks Job Preferences
  • Run: Uploads the program to the specified HDFS path and runs the job
  • Stop: Stops a job
  • Delete: Deletes a job
  • Job Execution Status / Job Execution Logs: Get the job status or logs respectively. You can retrieve a particular job execution by specifying the execution id in the ‘Hopsworks Job Preferences’ UI; otherwise the default is the last execution for the specified job name.
Hopsworks Plugin

Working with jobs from Hopsworks IntelliJ/PyCharm plugin

Support for running Flink jobs

  • You can also submit your local program as a Flink job using the plugin. Follow the steps in Create Job to first create a Flink job in Hopsworks.
  • Then click on Run Job. This will first start a Flink cluster if there is no active running Flink job with the same job name; otherwise it will use the active running Flink cluster with that job name. Next, it will upload and submit your program to the running Flink cluster.
  • Set your program’s main class using the Main Class field in preferences. To pass arguments, simply fill in the User Arguments field, separating multiple arguments with spaces, e.g. --arg1 a1 --arg2 a2
Example: Submitting Flink Job from plugin

Example: Submitting Flink Job from plugin