Dataset usage guides
The platform provides a number of tools to help you manage your datasets, all of which revolve around a central dataset concept.
Datasets are described by a manifest file, which is a JSON file that resides in the API resource store. Manifests include information such as descriptions, data ownership, citation information and how to interact with the data.
Dataset manifest
The dataset manifest, like all manifests, contains metadata, status and spec fields, with the spec field being unique to the dataset kind catalog.hubocean.io/dataset.
The metadata field is used to specify the dataset name, and can also be used to set a display-name that is more friendly to humans. The description field can also be used to provide a longer description of the dataset.
The status field is system-generated and should not be modified by the user. It records who created the dataset and when, as well as who last modified it and when.
The spec field has the following fields:
attributes
Attributes contain metadata that can shape how the data can be used.
citation
How to cite the dataset.
data_collection
What data collection the dataset is a part of.
documentation
Text documentation describing the dataset.
maintainer
Name and email of the maintainer.
storage_class
What storage class the data should be handled with.
storage_controller
Which storage controller should work with the data.
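To make the overall shape concrete, the following is a minimal sketch of a dataset manifest built as a Python dictionary and serialised to JSON. Only the kind string and the field names listed above come from this guide. The top-level kind key, the keys inside metadata, the sub-keys of maintainer and all example values are assumptions for illustration and may not match the platform's actual schema.

```python
import json

# Hypothetical manifest layout: apart from metadata, spec and the spec field
# names listed in this guide, every key and value here is an assumption.
manifest = {
    "kind": "catalog.hubocean.io/dataset",
    "metadata": {
        "name": "example-dataset",               # assumed key and value
        "display_name": "Example dataset",       # assumed key spelling
        "description": "A longer description.",  # assumed placement
    },
    # status is system-generated, so a client would not include it.
    "spec": {
        "storage_class": "registry.hubocean.io/storage-class/tabular",
        "storage_controller": None,  # left empty so the system can pick a default
        "maintainer": {"name": "Jane Doe", "email": "jane@example.com"},
        "attributes": [],
    },
}

# Serialise to JSON, since manifests are stored as JSON files.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The status block is deliberately absent from the sketch because the guide describes it as system-generated.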
Storage controller and class
The storage_class and storage_controller fields are used to specify how the data is stored, and typically refer to the storage controller that is used to interact with the data. For example, if the dataset is tabular, the storage class will be registry.hubocean.io/storage-class/tabular and the storage controller will be the qualified name of one of the registered tabular storage controllers.
When creating a dataset, the storage_class must be set, but the storage_controller can be left empty. In this case, the system will set storage_controller to the default storage controller for the given storage class.
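The defaulting rule described above can be sketched as a small function. This only imitates the behaviour for illustration; on the platform the defaulting happens in the system itself. The storage class names are taken from this guide, while the controller names and the function name are hypothetical placeholders.

```python
# Storage class names come from this guide; the controller names are
# hypothetical placeholders, not real registered controllers.
DEFAULT_CONTROLLERS = {
    "registry.hubocean.io/storage-class/raw": "example-raw-controller",
    "registry.hubocean.io/storage-class/tabular": "example-tabular-controller",
}


def resolve_storage_controller(spec: dict) -> dict:
    """Return a copy of spec with storage_controller filled in if it is empty."""
    storage_class = spec.get("storage_class")
    if not storage_class:
        raise ValueError("storage_class must be set when creating a dataset")
    if storage_class not in DEFAULT_CONTROLLERS:
        raise ValueError(f"unsupported storage class: {storage_class}")

    resolved = dict(spec)
    if not resolved.get("storage_controller"):
        # Fall back to the default controller for the given storage class.
        resolved["storage_controller"] = DEFAULT_CONTROLLERS[storage_class]
    return resolved
```

An explicitly set storage_controller is left untouched; only an empty or missing value is replaced.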
The system currently supports the following storage classes:
Storage class | Description |
---|---|
registry.hubocean.io/storage-class/raw | Raw files, similar to a cloud storage container |
registry.hubocean.io/storage-class/tabular | Tabular data |
We plan to add more storage classes in the future.
Data ownership fields
The maintainer, distribution, and citation fields are used to specify data ownership information. The maintainer field specifies the maintainer of the dataset by providing a name, an email address and, optionally, an organisation name. Its purpose is to give users a way to contact the maintainer of the dataset, so it should contain valid contact information.
The distribution field describes the contact information and website of the dataset distributor, the publication date and the dataset license.
Users can provide the dataset license as a common license name such as CC-BY, a link to the license text, or the full license text.
While the distribution information is optional when creating a dataset, it must be defined before a dataset can be published on the platform.
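As a rough illustration of the two rules above, the sketch below checks whether a spec has distribution information before publication and guesses which of the three license forms a value takes. The function names, the shape of the distribution value and the length threshold used to detect a full license text are all assumptions; the platform may validate these fields differently.

```python
def classify_license(value: str) -> str:
    """Guess whether a license value is a name, a link, or a full text."""
    text = value.strip()
    if text.startswith(("http://", "https://")):
        return "link"
    # Assumed heuristic: multi-line or long values are treated as full texts.
    if "\n" in text or len(text) > 80:
        return "full_text"
    return "name"


def can_publish(spec: dict) -> bool:
    """Distribution is optional at creation but must be defined to publish."""
    return bool(spec.get("distribution"))
```

For example, classify_license("CC-BY") falls through both checks and is treated as a license name.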
Dataset attributes
Dataset attributes is an umbrella term for dimensions within a dataset, such as table columns or grid properties. Attributes can be further decorated by traits, which are atomic properties such as data units, what is measured and so on. Users can query the traits of a dataset to find datasets that are compatible with their use-case.
The dataset traits are inspired by the Microsoft Common Data Model.
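The idea of querying traits to find compatible datasets can be sketched with plain Python data. The structure below, where each attribute has a name and a list of trait strings, is a hypothetical representation chosen for illustration, and the dataset names and trait names are invented; none of this reflects the platform's actual trait vocabulary or query API.

```python
from typing import Iterable

# Hypothetical datasets: attribute and trait names are made up for this sketch.
datasets = [
    {
        "name": "ctd-casts",
        "attributes": [
            {"name": "temperature", "traits": ["measurement.temperature", "unit.celsius"]},
            {"name": "depth", "traits": ["measurement.depth", "unit.metre"]},
        ],
    },
    {
        "name": "vessel-tracks",
        "attributes": [
            {"name": "speed", "traits": ["measurement.speed", "unit.knot"]},
        ],
    },
]


def dataset_traits(dataset: dict) -> set:
    """Collect every trait attached to any attribute of the dataset."""
    return {
        trait
        for attribute in dataset.get("attributes", [])
        for trait in attribute.get("traits", [])
    }


def find_compatible(candidates: Iterable[dict], required: Iterable[str]) -> list:
    """Return names of datasets whose traits include all required traits."""
    wanted = set(required)
    return [d["name"] for d in candidates if wanted <= dataset_traits(d)]
```

A dataset is considered compatible here only if it carries every requested trait, so requesting traits that live in different datasets returns an empty list.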
Documentation
If external documentation for the dataset exists, it can be provided as a link in the documentation field.
Tags
Users can add tags to a dataset to make it easier to find. Tags are simple strings that can be used to describe the dataset. For example, a dataset containing weather data for a specific city can be tagged with the city name.