Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Hopsworks supports running Apache Flink jobs as part of the Jobs service within a Hopsworks project. Running Flink jobs on Hopsworks involves starting a Flink session cluster from the Hopsworks UI and then submitting jobs via the Flink Dashboard, which is accessible through the project's Jobs menu. Hopsworks manages the life-cycle of the cluster: deployment, monitoring, and cleanup. The next sections describe in detail how to create and start a Flink session cluster, submit and monitor jobs, and access logs.
Hopsworks supports running long-running Flink session clusters in a project-based multi-tenant environment. First, you need to create the Flink session cluster, which is done in the same way as creating a Spark/PySpark job. Navigate to the project, select Jobs from the left-hand side menu, and step through the New Job wizard by clicking the New button. The wizard exposes the parameters that can be set for the session cluster; a scripted alternative is sketched below.
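As an alternative to the wizard, cluster creation can also be scripted. The following is a minimal sketch using the hopsworks Python library; it assumes that the Jobs API accepts "FLINK" as a configuration type, and the job name is a placeholder.

```python
import hopsworks

# Connect to Hopsworks; prompts for host and API key if not configured.
project = hopsworks.login()
jobs_api = project.get_jobs_api()

# Assumption: "FLINK" is an accepted configuration type; the returned dict
# carries the same parameters the New Job wizard exposes.
flink_config = jobs_api.get_configuration("FLINK")

# Create the session cluster definition and start it (name is illustrative).
job = jobs_api.create_job("my_flink_cluster", flink_config)
execution = job.run()
print("Started execution", execution.id)
```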
For details on how to use these settings, and which additional Flink configuration properties you can set through the Flink properties option, see here.
After clicking Create, you should see the new Flink session cluster, as depicted in the image below, and you can start it by clicking the Run button.
Once the cluster is running, you can access the Flink Dashboard by clicking the blue button on the right of the screen. The Dashboard then opens from within the project, as shown in the image below.
You can then use the Flink Dashboard to submit new jobs by navigating to the Submit new Job menu. For example, you can run Flink's streaming WordCount as follows.
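Flink ships a streaming WordCount example in its flink-examples-streaming jar. The form values below are illustrative; the jar name and file paths are placeholders, not fixed by Hopsworks:

```
Jar file:      flink-examples-streaming.jar
Entry class:   org.apache.flink.streaming.examples.wordcount.WordCount
Program args:  --input /Projects/<project>/Resources/input.txt
               --output /Projects/<project>/Resources/wordcount-output
```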
The Flink Dashboard uses the Flink Monitoring REST API to manage the cluster. Hopsworks exposes this API to clients through an HTTP proxy, which enables Hopsworks to provide secure multi-tenant access to Flink clusters. Users can therefore submit their Flink jobs, typically packaged as jar files, by making HTTP requests to Hopsworks against the endpoints this API provides.
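The sketch below shows what such a submission could look like in Python, using the standard Flink Monitoring REST API endpoints /jars/upload and /jars/:jarid/run. The proxy base URL and the API-key header are hypothetical placeholders; check your Hopsworks deployment for the actual values.

```python
import requests

# Hypothetical values: base URL of the Hopsworks HTTP proxy in front of the
# Flink Monitoring REST API, and a Hopsworks API key for authentication.
FLINK_PROXY = "https://hopsworks.example.com/hopsworks-api/flinkmaster/<cluster-id>"
HEADERS = {"Authorization": "ApiKey <your-api-key>"}

# Upload the job jar (standard Flink endpoint: POST /jars/upload).
with open("WordCount.jar", "rb") as jar:
    resp = requests.post(f"{FLINK_PROXY}/jars/upload",
                         headers=HEADERS,
                         files={"jarfile": jar})
resp.raise_for_status()
# Flink returns the stored path; the jar id is its basename.
jar_id = resp.json()["filename"].split("/")[-1]

# Run the uploaded jar (standard Flink endpoint: POST /jars/<jar-id>/run).
resp = requests.post(
    f"{FLINK_PROXY}/jars/{jar_id}/run",
    headers=HEADERS,
    json={
        "entryClass": "org.apache.flink.streaming.examples.wordcount.WordCount",
        "programArgs": "--input /Projects/<project>/Resources/input.txt",
    },
)
resp.raise_for_status()
print("Submitted job:", resp.json()["jobid"])
```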
Hopsworks also comes with a Python-based client that enables users to submit their Flink programs to Hopsworks remotely, without having to use the Hopsworks UI. Instructions on how to use the Hopsworks Flink command-line client are available at hops-examples. Job progress can still be monitored from the Hopsworks UI.
Users have access to the Flink history server, which shows only jobs that have run in projects the user is a member of, as projects act as sandboxes for both data and jobs. The history server is accessible from the same menu as the Dashboard, by clicking the History Server button.
Logs
Users can access the logs of a Flink session cluster even after it has shut down, by clicking on the Flink job and navigating to the bottom of the page, where both the Execution and the Error logs are available. These logs contain output from the JobManager and the TaskManagers, and are collected by Apache Hadoop YARN, on which the Flink session cluster runs.