Dataset usage guides

The platform provides a number of tools to help you manage your datasets, all of which revolve around a central dataset concept.

Datasets are described by a manifest file, which is a JSON file that resides in the API resource store. Manifests include information such as descriptions, data ownership, citation information and how to interact with the data.

Dataset manifest

The dataset manifest, like all manifests, contains metadata, status and spec fields, with the spec field being specific to the dataset kind, catalog.hubocean.io/dataset.

The metadata field is used to specify the dataset name, and can also be used to set a display-name that is more friendly to humans. The description field can also be used to provide a longer description of the dataset.

The status-field is system-generated and should not be modified by the user. It contains information about the dataset creator and when the dataset was created. It also contains information about who last modified the dataset and when.

The spec-field has the following fields:

attributes: Attributes contain metadata that can shape how the data can be used.

citation: How to cite the dataset.

data_collection: What data collection the dataset is a part of.

documentation: Text documentation describing the dataset.

maintainer: Name and email of the maintainer.

storage_class: What storage class the data should be handled with.

storage_controller: Which storage controller should work with the data.
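As a rough illustration of how these fields fit together, the sketch below assembles a manifest as a Python dictionary and serializes it to JSON. The key spellings (kind, display_name), the nested shape of the maintainer entry, the helper name build_dataset_manifest and all example values are assumptions made for illustration, not the platform's confirmed schema.

```python
import json

DATASET_KIND = "catalog.hubocean.io/dataset"  # dataset kind named in this guide


def build_dataset_manifest(name, storage_class, maintainer, display_name=None, description=None):
    """Assemble a dataset manifest as a plain dict (hypothetical key layout)."""
    metadata = {"name": name}
    if display_name is not None:
        metadata["display_name"] = display_name
    if description is not None:
        metadata["description"] = description

    # One entry per spec field listed above; unset fields are left empty.
    spec = {
        "attributes": [],
        "citation": None,
        "data_collection": None,
        "documentation": None,
        "maintainer": maintainer,
        "storage_class": storage_class,
        "storage_controller": None,
    }
    # No "status" key: status is system-generated and not set by the user.
    return {"kind": DATASET_KIND, "metadata": metadata, "spec": spec}


manifest = build_dataset_manifest(
    name="sea-surface-temperature",
    storage_class="registry.hubocean.io/storage-class/tabular",
    maintainer={"name": "Jane Doe", "email": "jane.doe@example.org"},
    display_name="Sea surface temperature",
)
print(json.dumps(manifest, indent=2))
```

Keeping the manifest as a plain dictionary makes it easy to round-trip through JSON, which matches the manifest's on-disk format.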

Storage controller and class

The storage_class and storage_controller fields specify how the data is stored and which storage controller is used to interact with it. For example, if the dataset is tabular, the storage class will be registry.hubocean.io/storage-class/tabular and the storage controller will be the qualified name of one of the registered tabular storage controllers.

When creating a dataset, the storage_class must be set, but the storage_controller can be left empty. In this case, the system will set storage_controller to the default storage controller for the given storage class.

The system currently supports the following storage classes:

Storage class                                 Description
registry.hubocean.io/storage-class/raw        Raw files, similar to a cloud storage container
registry.hubocean.io/storage-class/tabular    Tabular data

We plan to add more storage classes in the future.
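The defaulting rule for storage_controller described above can be sketched as follows. This is a minimal illustration of the behaviour, not the platform's implementation: the default controller names are placeholders invented for the example (the real qualified names are not given in this guide), and resolve_storage_controller is a hypothetical helper.

```python
STORAGE_CLASS_RAW = "registry.hubocean.io/storage-class/raw"
STORAGE_CLASS_TABULAR = "registry.hubocean.io/storage-class/tabular"

# Placeholder defaults: the real qualified controller names are an assumption here.
DEFAULT_CONTROLLERS = {
    STORAGE_CLASS_RAW: "example.invalid/storage-controller/raw-default",
    STORAGE_CLASS_TABULAR: "example.invalid/storage-controller/tabular-default",
}


def resolve_storage_controller(spec):
    """Return a copy of spec with storage_controller filled in when it is empty."""
    storage_class = spec.get("storage_class")
    if not storage_class:
        raise ValueError("storage_class must be set when creating a dataset")
    if storage_class not in DEFAULT_CONTROLLERS:
        raise ValueError(f"unsupported storage class: {storage_class}")

    resolved = dict(spec)
    if not resolved.get("storage_controller"):
        # Empty controller: fall back to the default for the storage class.
        resolved["storage_controller"] = DEFAULT_CONTROLLERS[storage_class]
    return resolved


resolved = resolve_storage_controller({"storage_class": STORAGE_CLASS_TABULAR})
print(resolved["storage_controller"])
```

An explicitly set storage_controller is left untouched, so the default only applies when the field is empty.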

Data ownership fields

The maintainer, distribution and citation fields are used to specify data ownership information. The maintainer field specifies the maintainer of the dataset by name, email address and, optionally, organisation name. Its purpose is to give users a way to contact the maintainer of the dataset, so it should contain valid contact information.

The distribution field is used to describe the contact information and website of the dataset distributor, publication date and the dataset license.

Users can provide the dataset license as a common license name such as CC-BY, a link to the license text, or the full license text.

While the distribution information is optional when creating a dataset, it must be defined before a dataset can be published on the platform.
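The publication rule above amounts to a simple precondition, sketched below. The function name can_publish and the sub-keys of the distribution entry are hypothetical; the guide only states that distribution information must be defined, so the check does not assume which of its sub-fields are mandatory.

```python
def can_publish(spec):
    """Return True if the spec carries distribution information (illustrative check)."""
    return bool(spec.get("distribution"))


# A dataset may be created without distribution information...
draft_spec = {"storage_class": "registry.hubocean.io/storage-class/tabular"}
print(can_publish(draft_spec))  # False

# ...but it has to be filled in before publishing. Sub-key names are assumptions.
draft_spec["distribution"] = {
    "distributor": {"contact": "data@example.org", "website": "https://example.org"},
    "published_date": "2024-01-01",
    "license": "CC-BY",
}
print(can_publish(draft_spec))  # True
```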

Dataset attributes

Dataset attributes is an umbrella term for dimensions within a dataset, such as table columns or grid properties. Attributes can be further decorated by traits, which are atomic properties such as data units, what is measured and so on. Users can query the traits of a dataset to find datasets that are compatible with their use-case.

The dataset traits are inspired by the Microsoft Common Data Model.
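To make the relationship between attributes and traits concrete, the sketch below models attributes as named dimensions carrying a set of trait strings and finds datasets with an attribute that matches a set of required traits. The Attribute class, the trait names (such as unit.celsius) and the in-memory catalog are all invented for illustration and do not reflect the platform's actual trait vocabulary or query API.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One dimension of a dataset, e.g. a table column (illustrative model)."""
    name: str
    traits: set = field(default_factory=set)


def datasets_with_traits(catalog, required_traits):
    """Return names of datasets with at least one attribute carrying all required traits."""
    required = set(required_traits)
    return sorted(
        dataset_name
        for dataset_name, attributes in catalog.items()
        if any(required <= attribute.traits for attribute in attributes)
    )


# Hypothetical catalog: dataset name -> list of attributes.
catalog = {
    "buoy-observations": [
        Attribute("temperature", {"measures.temperature", "unit.celsius"}),
        Attribute("depth", {"measures.depth", "unit.metre"}),
    ],
    "ship-tracks": [
        Attribute("speed", {"measures.speed", "unit.knot"}),
    ],
}

print(datasets_with_traits(catalog, {"measures.temperature", "unit.celsius"}))
# ['buoy-observations']
```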

Documentation

If external documentation exists for the dataset, it can be provided as a link in the documentation field.

Tags

Users can add tags to a dataset to make it easier to find. Tags are simple strings that can be used to describe the dataset. For example, a dataset containing weather data for a specific city can be tagged with the city name.