
Uploading Data

Uploading data consists of two main parts: creating a dataset and uploading the data itself. A dataset can be either raw or tabular. Here are examples for each of them:

Creating Raw dataset

Datasets are created in the platform under data collections. Let's click "My Data" in the top navbar.

img_13.png
Your data collections

Here we can see the data collections we have made ourselves.

To create a new collection, just click "Add new" and fill out the fields.

Keywords

Keywords are labels you can add to collections and datasets. They are helpful to categorize resources, and give insight at a glance. After typing in a keyword, hit enter to add it to the list.

img_14.png
Creating a new data collection

Let's create a raw dataset.

Open the collection you created, click "Add Dataset" and fill in the fields with "type" set to files.

img_28.png
Set type to "Files" to make a raw dataset.

If you would like to add additional information for the dataset, you can click the edit buttons next to the fields.

img_32.png
Editable fields example in dataset.

If you would like to have additional metadata fields, you can add them under "Advanced metadata".

img_33.png
Advanced metadata edit screen.

These fields will be shown in the dataset's preview.

img_35.png
Dataset preview with advanced metadata.

To create a raw dataset through the API, POST to https://api.hubocean.earth/catalog with a request body containing these fields:

{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
    "storage_class": "registry.hubocean.io/storageClass/raw",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ],
    "facets": {
      "additional_metadata_1": "metadata_value",
      "additional_metadata_2": "metadata_value_2"
    }
  }
}

  • kind: consists of the resource group (catalog.hubocean.io) and the resource type (dataset)
  • version: endpoint version. Currently only "v1alpha3" is supported for this endpoint
  • name: the dataset's unique name. Note: sending the same request a second time will fail because this field must be unique
  • display_name: name visible in the platform. Does not need to be unique
  • description: description visible in the platform under the name field
  • storage_controller: raw and tabular datasets have different storage controllers. The controller above is for raw
  • storage_class: this field indicates that it is a RAW dataset
  • contact: the dataset maintainer's contact information
  • organisation: the dataset maintainer's organisation
  • tags: Optional. Tags are used in dataset search
  • facets: Optional. Facets are used as additional metadata fields
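
The POST above can be sketched in Python with the requests library. This is a minimal sketch, assuming bearer-token authentication; how you obtain the token depends on your credentials setup.

```python
import requests

API_URL = "https://api.hubocean.earth/catalog"

# Manifest matching the request body described above
manifest = {
    "kind": "catalog.hubocean.io/dataset",
    "version": "v1alpha3",
    "metadata": {
        "name": "test-dataset",
        "display_name": "Test Dataset",
        "description": "A test dataset",
    },
    "spec": {
        "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
        "storage_class": "registry.hubocean.io/storageClass/raw",
        "maintainer": {
            "contact": "HUB Ocean <info@oceandata.earth>",
            "organisation": "Hub Ocean",
        },
        "tags": ["Ocean"],
    },
}

def create_raw_dataset(token: str) -> dict:
    """POST the manifest to the catalog endpoint and return the response body."""
    response = requests.post(
        API_URL,
        json=manifest,
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.json()
```

Remember that "name" must be unique, so change it before running the sketch a second time.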

To update a dataset's metadata, send PATCH https://api.hubocean.earth/catalog?kind=hubocean.io/catalog&either_id={catalog_uuid} with the updated metadata as the request body.
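
That PATCH can be sketched as below, using the query parameters as stated above; the bearer-token header is an assumption.

```python
import requests

def update_dataset_metadata(token: str, catalog_uuid: str, updated_metadata: dict) -> dict:
    """PATCH the catalog entry identified by catalog_uuid with updated metadata."""
    response = requests.patch(
        "https://api.hubocean.earth/catalog",
        # kind and either_id query parameters as described above
        params={"kind": "hubocean.io/catalog", "either_id": catalog_uuid},
        json=updated_metadata,
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.json()
```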

Raw dataset creation through the SDK in Workspace is as easy as through the UI.

First, we create a data collection and a raw dataset:

from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,  # Must be unique. Can be used as identifier
        display_name="SDK Raw Example",  # Can be duplicate. Name seen in UI
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # The two lines below specify that this is RAW storage
        storage_controller="registry.hubocean.io/storageController/storage-raw-cdffs",
        storage_class="registry.hubocean.io/storageClass/raw",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        facets={
            "test_metadata": "test_value"
        },
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)

To update a dataset's metadata, we get the catalog entry, update it, and send it to OdpClient().catalog.update():

from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.client.exc import OdpResourceNotFoundError, OdpValidationError

client = OdpClient()

# DatasetDto or JSON patch
manifest_update = client.catalog.get("3d797de8-f4ec-48a5-b211-cae1bcfa432c")
manifest_update.metadata.description = "Updated description"

try:
    updated_resource_manifest = client.catalog.update(manifest_update)
    print("Resource updated successfully:", updated_resource_manifest)
except OdpValidationError as e:
    print("Validation Error:", e)
except OdpResourceNotFoundError as e:
    print("Resource Not Found Error:", e)

Uploading Raw data

To upload a file, click "Open file", select a file and then click upload button.

img_21.png
Uploading data to raw is only a matter of selecting the file on your computer.

Tip

Any file can be uploaded: satellite imagery, zip files, shapefiles, and so on.

The API upload process is:

  1. Create a file inside a raw dataset.
  2. Upload a file.

Create a file by sending POST https://api.hubocean.earth/data/{dataset_uuid} request with request body:

{
  "name": "test",
  "mime_type": "text/plain"
}

  • name: file name. This name is used in the URL
  • mime_type: Optional. You can read more about common MIME types here

Upload the file with PATCH https://api.hubocean.earth/data/{dataset_uuid}/{file_name} with the request body containing the file.
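
Taken together, the two steps can be sketched as follows; the bearer-token header is an assumption:

```python
import requests

API_BASE = "https://api.hubocean.earth/data"

def upload_raw_file(token: str, dataset_uuid: str, path: str, name: str = "test") -> None:
    headers = {"Authorization": f"Bearer {token}"}
    # Step 1: create the file entry inside the raw dataset
    response = requests.post(
        f"{API_BASE}/{dataset_uuid}",
        json={"name": name, "mime_type": "text/plain"},
        headers=headers,
    )
    response.raise_for_status()
    # Step 2: upload the file contents under the name chosen above
    with open(path, "rb") as f:
        response = requests.patch(
            f"{API_BASE}/{dataset_uuid}/{name}",
            data=f.read(),
            headers=headers,
        )
    response.raise_for_status()
```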


Postman is frequently used for sending API requests. To send a file with Postman, select "Body" -> "binary" and choose the file from your computer.

img_46.png
Postman file selection in the request body


File's metadata can be:

  • viewed
  • updated

View a file's metadata by sending a GET https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata request.
Update a file's metadata by editing the existing metadata JSON and sending the updated body in a PATCH https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata request.
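
A minimal sketch of that view-then-update cycle, again assuming bearer-token auth:

```python
import requests

def update_file_metadata(token: str, dataset_uuid: str, file_name: str, changes: dict) -> dict:
    url = f"https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata"
    headers = {"Authorization": f"Bearer {token}"}
    # View: fetch the existing metadata JSON
    current = requests.get(url, headers=headers)
    current.raise_for_status()
    # Update: merge in the changes and PATCH the result back
    merged = {**current.json(), **changes}
    response = requests.patch(url, json=merged, headers=headers)
    response.raise_for_status()
    return response.json()
```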

The following example assumes the file was uploaded into Workspace. If the code is not run in Workspace, path_to_file might need to be updated to contain the full path to the file.

# The client instance from the earlier example is reused here;
# FileMetadataDto is also imported from the SDK's DTO modules.
# Add the raw dataset's UUID below
dataset = client.catalog.get("UUID")

# If file and .py files are both in the same folder, name only is sufficient
path_to_file = "test.txt"

with open(path_to_file, "rb") as data:  # Open the file in binary mode
    file_metadata_dto = FileMetadataDto(
        name=data.name,  # Name of file inside the dataset
        mime_type="text/plain",  # MIME type of the file
    )
    file_dto = client.raw.create_file(
        resource_dto=dataset,  # Reuse the dataset we created earlier
        file_metadata_dto=file_metadata_dto,
        contents=data.read(),
    )

create_file() creates the file and, if contents are provided, uploads it as well. After running the code above, the file should be visible in the platform.


Creating Tabular dataset

A tabular dataset can be created just like a raw one. Open the collection you created and click "Add Dataset", but set "type" to "Tabular" this time.

img_27.png
Click Add Dataset

To create a tabular dataset through the API, POST to https://api.hubocean.earth/catalog with a request body containing these fields:

{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-tabular",
    "storage_class": "registry.hubocean.io/storageClass/tabular",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ]
  }
}

  • kind: leave as is
  • version: leave as is
  • name: the dataset's unique name. Note: sending the same request a second time will fail because this field must be unique
  • display_name: name visible in the platform. Does not need to be unique
  • description: description visible in the platform under the name field
  • storage_controller: leave as is
  • storage_class: leave as is. This field indicates that it is a TABULAR dataset
  • contact: change, keeping the format shown
  • organisation: change as needed
  • tags: Optional. Tags can be added for easier search and grouping of datasets

As with the raw example, we start by creating a data collection and a tabular dataset.

from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,
        display_name="SDK Tabular Example",
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # The two lines below specify that this is Tabular storage
        storage_controller="registry.hubocean.io/storageController/storage-tabular",
        storage_class="registry.hubocean.io/storageClass/tabular",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)

Uploading Tabular data

Currently, tabular upload works only through the API and SDK.

For the application to know how to store the uploaded data, we need to create a schema that matches the data to be uploaded. Here is a simple example with one string column.

POST https://api.hubocean.earth/data/{dataset_uuid}/schema with request body:

{
  "table_schema": {
    "SeahorseName": {
      "type": "string"
    }
  }
}

  • table_schema: leave as is
  • SeahorseName: a column name of your choosing. It could just as well be "depth", "location", etc.
  • type: the type of data that will be stored. See TableSpec for usable data types

Then you can upload the data itself with POST https://api.hubocean.earth/data/{dataset_uuid}. Here is an example body for the previous schema:

{
  "data": [
    {
      "SeahorseName": "Frank"
    },
    {
      "SeahorseName": "Tom"
    }
  ]
}
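
Combined, the schema and data requests above look like this sketch; the bearer-token header is an assumption:

```python
import requests

API_BASE = "https://api.hubocean.earth/data"

# Schema and rows matching the request bodies shown above
schema_body = {"table_schema": {"SeahorseName": {"type": "string"}}}
rows_body = {"data": [{"SeahorseName": "Frank"}, {"SeahorseName": "Tom"}]}

def upload_tabular(token: str, dataset_uuid: str) -> None:
    headers = {"Authorization": f"Bearer {token}"}
    # Create the schema first so the platform knows how to store the rows
    response = requests.post(
        f"{API_BASE}/{dataset_uuid}/schema", json=schema_body, headers=headers
    )
    response.raise_for_status()
    # Then upload the rows themselves
    response = requests.post(
        f"{API_BASE}/{dataset_uuid}", json=rows_body, headers=headers
    )
    response.raise_for_status()
```

Note that every row's keys must match the column names defined in the schema.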

Through the SDK, the same schema is created with create_schema():

# TableSpec is imported from the SDK's DTO modules
client.tabular.create_schema(
    resource_dto=dataset,  # The tabular dataset created above
    table_spec=TableSpec(table_schema={"Data": {"type": "string"}}),
)

To upload data, we prepare it in the format of the schema. Here is an example of two rows for the previously created schema:

test_data = [{"Data": "Test"}, {"Data": "Test1"}]

client.tabular.write(resource_dto=dataset, data=test_data)

Now you have uploaded tabular data. You can view it with:

uploaded_data = client.tabular.select_as_list(dataset)

print("Data query result:")
print(f"{uploaded_data}\n")

We have more examples for ingesting raw and tabular data in the SDK repository.