This tutorial will go over the basics of using Jupyter notebooks on Hopsworks.
Open the Jupyter Service
Jupyter is provided as a micro-service on Hopsworks and can be found in the main UI inside a project.
Start a Jupyter notebook server
When you start a Jupyter notebook server you can select the ‘Python’ option, which enables a plain Python kernel in JupyterLab; in this mode the notebook server behaves the same as running Jupyter on your local workstation.
If you are doing Machine Learning you should pick the Experiments tab. See HopsML for more information on the Machine Learning pipeline.
For general purpose notebooks, select the Spark tab and run with Static or Dynamic Spark Executors.
Hopsworks supports both JupyterLab and classic Jupyter as Jupyter development frameworks. Clicking Start, as shown in the image below, starts JupyterLab by default. You can choose to start classic Jupyter instead by clicking the arrow next to the Start button. You can also switch from JupyterLab to classic Jupyter from within JupyterLab via Help > Launch Classic Notebook.
Using the previously attached configuration
When you run a notebook, the Jupyter configuration used is stored and attached to the notebook file as an extended attribute. You can use this configuration later to start the Jupyter notebook server directly from the notebook file. When you select a notebook you have previously run, you will see options to view the previously used configuration or to start the Jupyter server with it. Click the Notebook Configuration button to view the previously used configuration. Click the JupyterLab button to start the Jupyter notebook server.
Logs
It can be useful to look at the Jupyter server logs in case of errors, as they provide more detail than the error notification shown in the Jupyter dashboard. For example, if Jupyter cannot start, simply click the Logs button next to the Start button in the Jupyter dashboard.
This will open a new tab (make sure your browser does not block the new tab!) with the Jupyter logs as shown in the figure below.
Jupyter + Spark on Hopsworks
As a user, you will just interact with the Jupyter notebooks, but below you can find a detailed explanation of the technology behind the scenes.
When using Jupyter on Hopsworks, a library called sparkmagic is used to interact with the Hops cluster. When you create a Jupyter notebook on Hopsworks, you first select a kernel. A kernel is simply a program that executes the code in your Jupyter cells; you can think of it as the REPL backend to your Jupyter notebook, which acts as the frontend.
Sparkmagic works with a remote REST server for Spark, called Livy, running inside the Hops cluster. Livy is the interface that Jupyter-on-Hopsworks uses to interact with the Hops cluster. When you run Jupyter cells using the pyspark kernel, the kernel automatically sends the code to Livy in the background; Livy executes it on the cluster and the results are returned to the notebook.
The three Jupyter kernels we support on Hopsworks are Spark (Scala), PySpark, and SparkR.
All notebooks make use of Spark, since that is the standard way to allocate resources and run jobs in the cluster.
By default, all files and folders created by Spark are group writable (i.e. umask=007). If you want to change this default umask, you can add the additional Spark property spark.hadoop.fs.permissions.umask-mode=&lt;umask&gt; in More Spark Properties before starting the Jupyter server.
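For example, to get the more common owner-writable-only permissions, you could add a line like the following (the value 022 is just an illustration) under More Spark Properties:

```
spark.hadoop.fs.permissions.umask-mode=022
```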
In the rest of this tutorial we will focus on the pyspark kernel.
Create a pyspark notebook
After you have started the Jupyter notebook server, you can create a pyspark notebook from the Jupyter dashboard:
When you execute the first cell in a pyspark notebook, the Spark session is automatically created and connected to the Hops cluster.
The notebook will look just like any python notebook, with the difference that the python interpreter is actually running on a Spark driver in the cluster. You can execute regular python code:
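For instance, a cell like the following runs plain Python on the driver (the values are arbitrary, just for illustration):

```
import random

# Plain Python, executed on the Spark driver rather than on your workstation
samples = [random.random() for _ in range(5)]
print(sum(samples) / len(samples))
```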
Since you are executing on the Spark driver, you can also launch jobs on Spark executors in the cluster; the Spark session is available as the variable spark in the notebook:
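A minimal sketch of such a cell, using the pre-created spark session to run work on the executors (the dataset here is made up for illustration):

```
# 'spark' is the SparkSession that was created automatically for this notebook
range_df = spark.range(0, 1000000)          # a distributed dataframe with one million rows
print(range_df.selectExpr("avg(id)").first()[0])
```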
When you execute a cell in Jupyter that starts a Spark job, you can go back to the Hopsworks-Jupyter-UI and you will see that a link to the SparkUI for the job that has been created.
In addition to having access to a regular python interpreter as well as the spark cluster, you also have access to magic commands provided by sparkmagic. You can view a list of all commands by executing a cell with %%help:
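For example, a cell containing only the magic itself prints the list of available commands:

```
%%help
```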
So far throughout this tutorial, the Jupyter notebook has behaved more or less identically to how it would if you started the notebook server locally on your machine with a python kernel, without access to a Hadoop cluster. However, there is one main difference from a user standpoint when using pyspark notebooks instead of regular python notebooks: plotting.
Since the code in a pyspark notebook is executed remotely, in the Spark cluster, regular Python plotting will not work. What you can do, however, is use sparkmagic to download your remote Spark dataframe as a local pandas dataframe and plot it using matplotlib, seaborn, or sparkmagic's built-in visualization. To do this we use the magics %%sql, %%spark, and %%local. The steps to do plotting from a pyspark notebook are illustrated below. Using this approach, you can have large-scale cluster computation and plotting in the same notebook.
Step 1 : Create a remote Spark Dataframe:
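A minimal sketch of step 1, using made-up data and a hypothetical temp view name (demo_table) so the dataframe can be queried with %%sql in the next step:

```
# Runs remotely on the Spark driver; the data is made up for illustration
df = spark.createDataFrame(
    [(i, i % 10) for i in range(1000)],
    ["id", "bucket"],
)
df.createOrReplaceTempView("demo_table")   # hypothetical name, used by %%sql below
```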
Step 2 : Download the Spark Dataframe to a local Pandas Dataframe using %%sql or %%spark:
Note: you should not try to download large Spark dataframes for plotting. When you plot a dataframe, the entire dataframe must fit into memory, so add the flag --maxrows x to limit the number of rows when you download it to the local Jupyter server for plotting.
Using %%sql:
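A sketch of a %%sql cell that queries the hypothetical demo_table view from step 1 and downloads the result as a local pandas dataframe named df_local, capped at 1000 rows:

```
%%sql -o df_local --maxrows 1000
SELECT bucket, COUNT(*) AS cnt FROM demo_table GROUP BY bucket
```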
Using %%spark:
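A similar sketch with %%spark, again assuming the df dataframe from step 1; the remote Spark dataframe named after -o is downloaded into the local notebook as a pandas dataframe with the same name:

```
%%spark -o df_sample --maxrows 1000
df_sample = df.groupBy("bucket").count()   # remote Spark dataframe pulled back locally as pandas 'df_sample'
```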
Step 3 : Plot the pandas dataframe using Python plotting libraries:
When you download a dataframe from Spark to pandas with sparkmagic, it gives you a default visualization of the data using autovizwidget, as you saw in the screenshots above. However, sometimes you want custom plots, using matplotlib or seaborn. To do this, use the sparkmagic %%local magic to access the local pandas dataframe, and then plot as usual. Just make sure that your plotting libraries (e.g. matplotlib or seaborn) are installed on the Jupyter machine; contact a system administrator if they are not already installed.
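A minimal sketch of such a cell, assuming the df_local pandas dataframe downloaded with the %%sql cell above (columns bucket and cnt) and that matplotlib is installed locally:

```
%%local
import matplotlib.pyplot as plt

# df_local was downloaded with '%%sql -o df_local' above
df_local.plot(kind="bar", x="bucket", y="cnt", legend=False)
plt.xlabel("bucket")
plt.ylabel("row count")
plt.show()
```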
Jupyter notebooks have become the lingua franca for data scientists. As with ordinary source code files, we should version them to be able to keep track of the changes we made or collaborate.
Hopsworks Enterprise Edition comes with a feature to allow users to version their notebooks with Git and interact with remote repositories such as GitHub ones. Authenticating against a remote service is done using API keys which are safely stored in Hopsworks.
The first thing we need to do is issue an API key from a remote hosting service. For the purpose of this guide it will be GitHub. To do so, go to your Settings > Developer Settings > Personal access tokens
Then click on Generate new token. Give a distinctive name to the token and select all repo scopes. Finally hit the Generate token button. For more detailed instructions follow GitHub Help.
NOTE: Make sure you copy the token; if you lose it there is no way to recover it and you will have to go through these steps again.
Once we have issued an API key, we need to store it in Hopsworks for later usage. For this purpose we will use the Secrets which store encrypted information accessible only to the owner of the secret. If you wish to, you can share the same secret API key with all the members of a Project.
Go to your account’s Settings on the top right corner and click Secrets. Give a name to the secret, paste the API token from the previous step and finally click Add.
Getting started with versioning your Jupyter notebooks is quite simple. First, copy the web URL of your repository from GitHub or GitLab.
Navigate into a Project and head over to Jupyter from the left panel. Regardless of the mode, Git options are the same. For brevity, here we use Python mode. Expand the Advanced configuration and enable Git by choosing GITHUB or GITLAB, here we use GitHub. More options will appear as shown in figure below. Paste the repository’s web URL from the previous step into GitHub repository URL and from the API key dropdown select the name of the Secret you entered.
Keep in mind that once you’ve enabled Git, you will no longer be able to see notebooks stored in HDFS and vice versa; notebooks versioned with Git will not be visible in the Datasets browser. Another important note is that if you are running Jupyter Servers on Kubernetes and Git is enabled, notebooks are stored in the pod’s local filesystem. So, if you stop Jupyter or the pod gets killed and you haven’t pushed, your modifications will be lost.
That’s the minimum configuration you should have. It will pick the default branch you’ve set in GitHub and set it as base and head branch. By default it will automatically pull from base on Jupyter startup and push to head on Jupyter shutdown. You can change this behaviour by toggling the respective switches. Click on the plus button to create a new branch to commit your changes and push to remote.
Finally hit the Start button on the top right corner!
From within JupyterLab you can perform all the common git operations such as diffing a file, committing your changes, viewing the history of your branch, pulling from a remote or pushing to a remote. For more complicated operations you can always fall back to the good old terminal.
Jupyter is installed in the python environment of your project. This means that if a dependency of Jupyter is removed or an incorrect version is installed it may not work properly. If the Python environment ends up in a state with conflicting libraries installed then an alert will be shown in the interface explaining the issue.
We have provided a large number of example notebooks, available here. Go to Hopsworks and try them out! You can do this either by taking one of the built-in tours on Hopsworks, or by uploading one of the example notebooks to your project and run it through the Jupyter service. You can also have a look at HopsML, which enables large-scale distributed deep learning on Hops.