Uploading Data
Uploading data consists of two main parts: creating a dataset and uploading the data itself. Datasets can be in raw or tabular form. Here are examples for each of them:
Creating Raw dataset
Datasets are created in the platform. Let's click "My Data" in the top navbar.
Here we can see the data collections we have made ourselves.
To create a new collection just click "Add new" and fill out the fields.
Keywords
Keywords are labels you can add to collections and datasets. They are helpful to categorize resources, and give insight at a glance. After typing in a keyword, hit enter to add it to the list.
Let's create a raw dataset.
Open the collection you created, click "Add Dataset" and fill in the fields with "type" set to "files".
If you would like to add additional information for the dataset, you can click the edit buttons next to the fields.
If you would like to have additional metadata fields, you can add them under "Advanced metadata".
These fields will be shown in the dataset's preview.
To create a raw dataset through the API, you need to POST to https://api.hubocean.earth/catalog
with a request body containing these fields:
{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
    "storage_class": "registry.hubocean.io/storageClass/raw",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ],
    "facets": {
      "additional_metadata_1": "metadata_value",
      "additional_metadata_2": "metadata_value_2"
    }
  }
}
- kind: consists of the resource group (catalog.hubocean.io) and the resource type (dataset)
- version: endpoint version. Currently only "v1alpha3" is supported for this endpoint
- name: the dataset's unique name. Note: if the same request is sent a second time, it will fail because this field must be unique
- display_name: the name visible in the platform. Does not need to be unique
- description: the description visible in the platform under the name field
- storage_controller: raw and tabular datasets have different storage controllers. The example controller is for raw
- storage_class: this field shows that it is a RAW dataset
- contact: the dataset maintainer's contact information
- organisation: the dataset maintainer's organisation
- tags: optional. Tags are used in dataset search
- facets: optional. Facets are used as additional metadata fields
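The POST above can be sketched with Python's requests library. The bearer-token authentication shown here is an assumption (the ODP_TOKEN environment variable is hypothetical); consult the platform's authentication documentation for the actual flow:

```python
import os

API_URL = "https://api.hubocean.earth/catalog"

# Manifest matching the request body described above (tags and facets are optional).
manifest = {
    "kind": "catalog.hubocean.io/dataset",
    "version": "v1alpha3",
    "metadata": {
        "name": "test-dataset",
        "display_name": "Test Dataset",
        "description": "A test dataset",
    },
    "spec": {
        "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
        "storage_class": "registry.hubocean.io/storageClass/raw",
        "maintainer": {
            "contact": "HUB Ocean <info@oceandata.earth>",
            "organisation": "Hub Ocean",
        },
    },
}

# ODP_TOKEN is a hypothetical environment variable holding a bearer token.
token = os.environ.get("ODP_TOKEN")
if token:
    import requests  # third-party: pip install requests

    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json=manifest,
    )
    print(response.status_code, response.json())
```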
To update a dataset's metadata, send PATCH https://api.hubocean.earth/catalog?kind=hubocean.io/catalog&either_id={catalog_uuid}
with the updated metadata as the request body.
Raw dataset creation through the SDK in the Workspace is as easy as through the UI.
First, we create a data collection and a raw dataset:
from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,  # Must be unique. Can be used as an identifier
        display_name="SDK Raw Example",  # Can be a duplicate. Name seen in the UI
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # Both lines below specify that this is RAW storage
        storage_controller="registry.hubocean.io/storageController/storage-raw-cdffs",
        storage_class="registry.hubocean.io/storageClass/raw",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        facets={
            "test_metadata": "test_value"
        },
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)
To update a dataset's metadata, we fetch the existing manifest, update it, and pass it to OdpClient().catalog.update():
from odp.client import OdpClient
from odp.client.exc import OdpResourceNotFoundError, OdpValidationError

client = OdpClient()

# DatasetDto or JSON patch
manifest_update = client.catalog.get("3d797de8-f4ec-48a5-b211-cae1bcfa432c")
manifest_update.metadata.description = "Updated description"

try:
    updated_resource_manifest = client.catalog.update(manifest_update)
    print("Resource updated successfully:", updated_resource_manifest)
except OdpValidationError as e:
    print("Validation Error:", e)
except OdpResourceNotFoundError as e:
    print("Resource Not Found Error:", e)
Uploading Raw data
To upload a file, click "Open file", select a file, and then click the upload button.
Tip
Any file type can be uploaded: satellite imagery, zip files, shapefiles, and so on.
The API upload process is:
- Create a file inside a raw dataset.
- Upload the file.
Create a file by sending a POST https://api.hubocean.earth/data/{dataset_uuid}
request with the request body:
{
  "name": "test",
  "mime_type": "text/plain"
}
- name: the file name. This name is used in the URL
- mime_type: optional. You can read more about common MIME types here
Upload the file with PATCH https://api.hubocean.earth/data/{dataset_uuid}/{file_name}
with the request body containing the file.
Postman is frequently used for sending API requests. Send the file with Postman by selecting "Body" -> "binary" and choosing a file from the computer.
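The two-step upload can also be scripted. Here is a minimal sketch using the requests library, assuming bearer-token authentication; the ODP_TOKEN environment variable and the dataset UUID are placeholders:

```python
import os

# Placeholder identifiers; replace with your raw dataset's UUID and file name.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
FILE_NAME = "test"
CREATE_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}"
UPLOAD_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}/{FILE_NAME}"

# Step 1 body: create the file entry inside the raw dataset.
create_body = {"name": FILE_NAME, "mime_type": "text/plain"}

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    requests.post(CREATE_URL, headers=headers, json=create_body)
    # Step 2: upload the file contents as the raw request body.
    with open("test.txt", "rb") as f:
        requests.patch(UPLOAD_URL, headers=headers, data=f.read())
```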
A file's metadata can be:
- viewed
- updated
View a file's metadata by sending a GET https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata
request.
Update a file's metadata by editing the existing metadata's JSON and sending the updated request body in a PATCH
https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata
request.
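As a sketch, viewing and updating a file's metadata with the requests library could look like this; the ODP_TOKEN environment variable and dataset UUID are placeholders:

```python
import os

# Placeholder identifiers; replace with your dataset's UUID and file name.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
FILE_NAME = "test"
METADATA_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}/{FILE_NAME}/metadata"

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    # View the current metadata.
    metadata = requests.get(METADATA_URL, headers=headers).json()
    # Edit a field and PATCH the updated JSON back.
    metadata["mime_type"] = "text/plain"
    requests.patch(METADATA_URL, headers=headers, json=metadata)
```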
The following example works with a file uploaded into the Workspace. If the code is not run in the Workspace, path_to_file might need to be updated to contain the full path to the file.
# Continuing from the raw dataset example above; `client` is an OdpClient instance.
# FileMetadataDto import path assumed; adjust to your SDK version:
from odp.client.dto.file_dto import FileMetadataDto

# Add the raw dataset's UUID below
dataset = client.catalog.get("UUID")

# If the file and the .py file are both in the same folder, the name alone is sufficient
path_to_file = "test.txt"

with open(path_to_file, "rb") as data:  # Open the file
    file_metadata_dto = FileMetadataDto(
        name=data.name,  # Name of the file inside the dataset
        mime_type="text/plain",  # MIME type of the file
    )
    file_dto = client.raw.create_file(
        resource_dto=dataset,  # Reuse the same dataset we created earlier
        file_metadata_dto=file_metadata_dto,
        contents=data.read(),
    )
create_file() creates the file and, if contents are provided, uploads them. After running the code above, the file should be visible in the platform.
Creating Tabular dataset
A tabular dataset can be created just like a raw one. Open the collection you created and click "Add Dataset", but set "type" to "Tabular" this time.
To create a tabular dataset through the API, you need to POST to https://api.hubocean.earth/catalog with a request body containing these fields:
{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-tabular",
    "storage_class": "registry.hubocean.io/storageClass/tabular",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ]
  }
}
- kind: leave as is
- version: leave as is
- name: the dataset's unique name. Note: if the same request is sent a second time, it will fail because this field must be unique
- display_name: the name visible in the platform. Does not need to be unique
- description: the description visible in the platform under the name field
- storage_controller: leave as is
- storage_class: leave as is. This field shows that it is a TABULAR dataset
- contact: change, keeping the format provided
- organisation: change as needed
- tags: optional. Tags can be added for easier search and grouping of datasets
As with the raw example, we first create a data collection and a tabular dataset.
from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,
        display_name="SDK Tabular Example",
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # Both lines below specify that this is Tabular storage
        storage_controller="registry.hubocean.io/storageController/storage-tabular",
        storage_class="registry.hubocean.io/storageClass/tabular",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)
Uploading Tabular data
Currently, tabular upload works only through the API and the SDK.
For the application to know how to store the uploaded data, we need to create a schema that matches the data to be uploaded. Here is a simple example with one string column.
POST https://api.hubocean.earth/data/{dataset_uuid}/schema with request body:
{
  "table_schema": {
    "SeahorseName": {
      "type": "string"
    }
  }
}
- table_schema: leave as is
- SeahorseName: a column name, chosen according to your needs. It could be named "depth", "location", etc.
- type: the type of data that will be stored. See TableSpec for usable data types
Then you can upload the data itself with POST https://api.hubocean.earth/data/{dataset_uuid}. Here is a body example for the previous schema:
{
  "data": [
    {
      "SeahorseName": "Frank"
    },
    {
      "SeahorseName": "Tom"
    }
  ]
}
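The two API calls (schema creation, then data upload) can be sketched together with the requests library; the ODP_TOKEN environment variable and dataset UUID are placeholders:

```python
import os

# Placeholder identifier; replace with your tabular dataset's UUID.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
BASE_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}"

# Schema with one string column, as in the example above.
schema_body = {"table_schema": {"SeahorseName": {"type": "string"}}}
# Two rows matching that schema.
data_body = {"data": [{"SeahorseName": "Frank"}, {"SeahorseName": "Tom"}]}

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    # The schema must exist before any rows are written.
    requests.post(f"{BASE_URL}/schema", headers=headers, json=schema_body)
    requests.post(BASE_URL, headers=headers, json=data_body)
```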
Through the SDK, the same kind of schema is created with create_schema():
# TableSpec import path assumed; adjust to your SDK version:
from odp.client.dto.table_spec import TableSpec

client.tabular.create_schema(
    resource_dto=dataset,  # Use the tabular dataset created above
    table_spec=TableSpec(table_schema={"Data": {"type": "string"}}),
)
To upload data, we prepare it in the format of the schema. Here is an example of two rows for the previously created schema:
test_data = [{"Data": "Test"}, {"Data": "Test1"}]
client.tabular.write(resource_dto=dataset, data=test_data)
Now you have uploaded tabular data. You can view it with:
uploaded_data = client.tabular.select_as_list(dataset)
print("Data query result:")
print(f"{uploaded_data}\n")
We have more examples for ingesting raw and tabular data in the SDK repository.