Hopsworks is a managed platform for scale-out data science, with support for both GPUs and Big Data, in a familiar development environment. Hopsworks can be used either through its user interface or via a REST API. Hopsworks’ unique features are:
Hopsworks supports the following open-source platforms for Data Science:
Hopsworks provides a new, stronger, GDPR-compliant security model for managing sensitive data in a shared data platform. Hopsworks’ security model is built around Projects, which are analogous to GitHub repositories. A project contains datasets, users, and programs (code). Sensitive datasets can be sandboxed inside a project, so that users are prevented from exporting that data from the project or cross-linking it with data in other projects. Note that competitor data platforms provide this capability by creating a whole new cluster for the sensitive dataset.
Sharing data in Hopsworks does not involve copying it: datasets can still be securely shared between projects without being duplicated. Supported datasets in Hopsworks include Hive databases, Kafka topics, and subtrees in HopsFS (HDFS).
Hopsworks implements its project-based multi-tenancy security model by using TLS certificates (instead of Kerberos) for user authentication, with a new certificate created for every user in every project. Hopsworks also provides role-based access control within projects, with predefined DataOwner and DataScientist roles provided for GDPR compliance: Data Owners are responsible for the data and access to the data, while Data Scientists are processors of the data.
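The two properties described above, sharing without copying and sandboxing against export, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Dataset` and `Project` classes, the `sandboxed` flag, and the `share` and `export` methods are hypothetical names invented for illustration and are not part of the Hopsworks API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    sandboxed: bool = False  # hypothetical flag marking a sensitive dataset


@dataclass
class Project:
    name: str
    datasets: dict = field(default_factory=dict)

    def share(self, dataset_name, other):
        # Sharing hands over a reference to the same Dataset object; nothing is copied.
        other.datasets[dataset_name] = self.datasets[dataset_name]

    def export(self, dataset_name):
        # A sandboxed dataset must not leave the project it lives in.
        dataset = self.datasets[dataset_name]
        if dataset.sandboxed:
            raise PermissionError(f"{dataset_name} may not leave project {self.name}")
        return dataset


genomics = Project("genomics")
analytics = Project("analytics")
genomics.datasets["reference_genome"] = Dataset("reference_genome")
genomics.datasets["patient_records"] = Dataset("patient_records", sandboxed=True)

# Both projects now point at one and the same dataset.
genomics.share("reference_genome", analytics)
same_object = (
    analytics.datasets["reference_genome"] is genomics.datasets["reference_genome"]
)

# Exporting the sandboxed dataset is refused.
try:
    genomics.export("patient_records")
    export_blocked = False
except PermissionError:
    export_blocked = True
```

In this model a shared dataset is a second reference to one object, which mirrors the idea that sharing never duplicates data. A real platform would enforce the sandbox in the storage and authentication layers rather than in application objects.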
Hopsworks includes open-source frameworks for scalable data science in a single, secure platform.
Hopsworks is enabled by a unified, scale-out metadata layer: a strongly consistent in-memory data layer that stores metadata for everything from Projects, Users, and Datasets in Hopsworks to filesystem metadata in HopsFS, Kafka ACLs, and YARN quota information. Hopsworks’ metadata layer is kept consistent by mutating it through transactions, and its integrity is ensured using foreign keys.
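To make the role of transactions and foreign keys concrete, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. SQLite merely stands in for the metadata store here, and the `project` and `dataset` tables are simplified, hypothetical schemas rather than the ones Hopsworks uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")
conn.execute(
    "CREATE TABLE dataset ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " project_id INTEGER NOT NULL REFERENCES project(id) ON DELETE CASCADE)"
)

# One transaction: the project and its dataset are created together or not at all.
with conn:
    conn.execute("INSERT INTO project (id, name) VALUES (1, 'demo')")
    conn.execute("INSERT INTO dataset (name, project_id) VALUES ('logs', 1)")

# The second insert references a project that does not exist. The foreign key
# rejects it, and the whole transaction (including project 2) is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO project (id, name) VALUES (2, 'tmp')")
        conn.execute("INSERT INTO dataset (name, project_id) VALUES ('orphan', 99)")
except sqlite3.IntegrityError:
    pass

project_count = conn.execute("SELECT COUNT(*) FROM project").fetchone()[0]
dataset_count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

# Deleting the project cascades to its datasets, so no orphaned metadata remains.
with conn:
    conn.execute("DELETE FROM project WHERE id = 1")
datasets_after_delete = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

The failed transaction leaves the tables exactly as they were before it started, and the cascade shows how a foreign key keeps dependent metadata from outliving the entity it describes.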
Hopsworks provides HopsML as a set of services and platforms to support the full machine learning lifecycle, including:
Hopsworks’ security model is designed to support the processing of sensitive Datasets in a shared (multi-tenant) cluster. The solution is based on Projects. Within a Project, a user may have one of two roles: Data Owner, who is like a superuser, or Data Scientist, who is allowed to run programs (do analysis) but not allowed to:
That is, the Project acts like a sandbox for the data within it.
To realize this security model, Hopsworks implements dynamic role-based access control for projects. That is, users do not have static global roles; a user’s privileges depend on which project is currently active. For example, a user may be a Data Owner in one project but only a Data Scientist in another. The Data Owner role is a strict superset of the Data Scientist role: everything a Data Scientist can do, a Data Owner can do.
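A minimal sketch of dynamic, per-project role resolution is shown below. The permission names (`run_program`, `manage_access`), the `MEMBERSHIPS` table, and the `is_allowed` function are assumptions made for illustration; they do not reflect how Hopsworks stores roles or checks privileges internally.

```python
from enum import Enum


class Role(Enum):
    DATA_SCIENTIST = "Data Scientist"
    DATA_OWNER = "Data Owner"


# Hypothetical permission names, for illustration only.
SCIENTIST_PERMISSIONS = frozenset({"run_program"})
# The Data Owner set is built on top of the Data Scientist set, so it is a
# strict superset by construction.
OWNER_PERMISSIONS = SCIENTIST_PERMISSIONS | frozenset({"manage_access"})

PERMISSIONS = {
    Role.DATA_SCIENTIST: SCIENTIST_PERMISSIONS,
    Role.DATA_OWNER: OWNER_PERMISSIONS,
}

# Roles are keyed by (user, project): there is no global role per user.
MEMBERSHIPS = {
    ("alice", "genomics"): Role.DATA_OWNER,
    ("alice", "marketing"): Role.DATA_SCIENTIST,
}


def is_allowed(user, active_project, permission):
    """Resolve the user's role in the active project, then check the permission."""
    role = MEMBERSHIPS.get((user, active_project))
    if role is None:
        return False  # not a member of the active project
    return permission in PERMISSIONS[role]


owner_in_genomics = is_allowed("alice", "genomics", "manage_access")
owner_in_marketing = is_allowed("alice", "marketing", "manage_access")
```

The same user receives different answers depending only on which project is active, which is the sense in which the access control is dynamic.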
A Data Scientist can
A Data Owner can