Uploading Data
Uploading data consists of two main parts: creating a dataset and uploading the data itself. Datasets can be in raw or tabular form. Here are examples for each of them:
Creating Raw dataset
Datasets are created in the platform. Let's click "My Data" in the top navbar.
Here we can see the data collections we have made ourselves.
To create a new collection just click "Add new" and fill out the fields.
Keywords
Keywords are labels you can add to collections and datasets. They are helpful to categorize resources, and give insight at a glance. After typing in a keyword, hit enter to add it to the list.
Let's create a raw dataset.
Open the collection you created, click "Add Dataset" and fill in the fields with "type" set to "files".
If you would like to add additional information for the dataset, you can click the edit buttons next to the fields.
If you would like to have additional metadata fields, you can add them under "Advanced metadata".
These fields will be shown in the dataset's preview.
To create a raw dataset through the API, you need to POST to https://api.hubocean.earth/catalog
with a request body containing these fields:
{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
    "storage_class": "registry.hubocean.io/storageClass/raw",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ],
    "facets": {
      "additional_metadata_1": "metadata_value",
      "additional_metadata_2": "metadata_value_2"
    }
  }
}
- kind: consists of the resource group (catalog.hubocean.io) and the resource type (dataset)
- version: endpoint version. Currently only "v1alpha3" is supported for this endpoint
- name: the dataset's unique name. Note: if the same request is sent a second time, it will fail because this field must be unique
- display_name: the name visible in the platform. Does not need to be unique
- description: the description visible in the platform under the name field
- storage_controller: raw and tabular datasets have different storage controllers. The example controller is for raw
- storage_class: this field shows that it is a RAW dataset
- contact: the dataset maintainer's contact information
- organisation: the dataset maintainer's organisation
- tags: optional. Tags are used in dataset search
- facets: optional. Facets are used as additional metadata fields
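The POST above can be sketched with Python's requests library. The bearer-token authentication shown here is an assumption (the ODP_TOKEN environment variable is hypothetical); consult the platform's authentication documentation for the actual flow:

```python
import os

API_URL = "https://api.hubocean.earth/catalog"

# Manifest matching the request body described above (tags and facets are optional).
manifest = {
    "kind": "catalog.hubocean.io/dataset",
    "version": "v1alpha3",
    "metadata": {
        "name": "test-dataset",
        "display_name": "Test Dataset",
        "description": "A test dataset",
    },
    "spec": {
        "storage_controller": "registry.hubocean.io/storageController/storage-raw-cdffs",
        "storage_class": "registry.hubocean.io/storageClass/raw",
        "maintainer": {
            "contact": "HUB Ocean <info@oceandata.earth>",
            "organisation": "Hub Ocean",
        },
    },
}

# ODP_TOKEN is a hypothetical environment variable holding a bearer token.
token = os.environ.get("ODP_TOKEN")
if token:
    import requests  # third-party: pip install requests

    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {token}"},
        json=manifest,
    )
    print(response.status_code, response.json())
```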
To update a dataset's metadata, send PATCH https://api.hubocean.earth/catalog?kind=hubocean.io/catalog&either_id={catalog_uuid}
with the updated metadata as the request body.
Raw dataset creation through the SDK in the Workspace is as easy as through the UI.
First, we create a data collection and a raw dataset:
from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,  # Must be unique. Can be used as an identifier
        display_name="SDK Raw Example",  # Can be a duplicate. Name seen in the UI
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # Both lines below specify that this is RAW storage
        storage_controller="registry.hubocean.io/storageController/storage-raw-cdffs",
        storage_class="registry.hubocean.io/storageClass/raw",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        facets={
            "test_metadata": "test_value"
        },
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)
To update a dataset's metadata, we fetch the existing manifest, update it, and pass it to OdpClient().catalog.update():
from odp.client import OdpClient
from odp.client.exc import OdpResourceNotFoundError, OdpValidationError

client = OdpClient()

# DatasetDto or JSON patch
manifest_update = client.catalog.get("3d797de8-f4ec-48a5-b211-cae1bcfa432c")
manifest_update.metadata.description = "Updated description"

try:
    updated_resource_manifest = client.catalog.update(manifest_update)
    print("Resource updated successfully:", updated_resource_manifest)
except OdpValidationError as e:
    print("Validation Error:", e)
except OdpResourceNotFoundError as e:
    print("Resource Not Found Error:", e)
Uploading Raw data
To upload a file, click "Open file", select a file, and then click the upload button.
Tip
Any file type can be uploaded: satellite imagery, zip files, shapefiles, and so on.
The API upload process is:
- Create a file inside a raw dataset.
- Upload the file.
Create a file by sending a POST https://api.hubocean.earth/data/{dataset_uuid}
request with the request body:
{
  "name": "test",
  "mime_type": "text/plain"
}
- name: the file name. This name is used in the URL
- mime_type: optional. You can read more about common MIME types here
Upload the file with PATCH https://api.hubocean.earth/data/{dataset_uuid}/{file_name}
with the request body containing the file.
Postman is frequently used for sending API requests. Send the file with Postman by selecting "Body" -> "binary" and choosing a file from the computer.
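The two-step upload can also be scripted. Here is a minimal sketch using the requests library, assuming bearer-token authentication; the ODP_TOKEN environment variable and the dataset UUID are placeholders:

```python
import os

# Placeholder identifiers; replace with your raw dataset's UUID and file name.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
FILE_NAME = "test"
CREATE_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}"
UPLOAD_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}/{FILE_NAME}"

# Step 1 body: create the file entry inside the raw dataset.
create_body = {"name": FILE_NAME, "mime_type": "text/plain"}

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    requests.post(CREATE_URL, headers=headers, json=create_body)
    # Step 2: upload the file contents as the raw request body.
    with open("test.txt", "rb") as f:
        requests.patch(UPLOAD_URL, headers=headers, data=f.read())
```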
A file's metadata can be:
- viewed
- updated
View a file's metadata by sending a GET https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata
request.
Update a file's metadata by editing the existing metadata's JSON and sending the updated request body in a PATCH
https://api.hubocean.earth/data/{dataset_uuid}/{file_name}/metadata
request.
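As a sketch, viewing and updating a file's metadata with the requests library could look like this; the ODP_TOKEN environment variable and dataset UUID are placeholders:

```python
import os

# Placeholder identifiers; replace with your dataset's UUID and file name.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
FILE_NAME = "test"
METADATA_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}/{FILE_NAME}/metadata"

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    # View the current metadata.
    metadata = requests.get(METADATA_URL, headers=headers).json()
    # Edit a field and PATCH the updated JSON back.
    metadata["mime_type"] = "text/plain"
    requests.patch(METADATA_URL, headers=headers, json=metadata)
```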
The following example works with a file uploaded into the Workspace. If the code is not run in the Workspace, path_to_file might need to be updated to contain the full path to the file.
# Continuing from the raw dataset example above; `client` is an OdpClient instance.
# FileMetadataDto import path assumed; adjust to your SDK version:
from odp.client.dto.file_dto import FileMetadataDto

# Add the raw dataset's UUID below
dataset = client.catalog.get("UUID")

# If the file and the .py file are both in the same folder, the name alone is sufficient
path_to_file = "test.txt"

with open(path_to_file, "rb") as data:  # Open the file
    file_metadata_dto = FileMetadataDto(
        name=data.name,  # Name of the file inside the dataset
        mime_type="text/plain",  # MIME type of the file
    )
    file_dto = client.raw.create_file(
        resource_dto=dataset,  # Reuse the same dataset we created earlier
        file_metadata_dto=file_metadata_dto,
        contents=data.read(),
    )
create_file() creates the file and, if contents are provided, uploads them. After running the code above, the file should be visible in the platform.
Creating Tabular dataset
A tabular dataset can be created just like a raw one. Open the collection you created and click "Add Dataset", but set "type" to "Tabular" this time.
To create a tabular dataset through the API, you need to POST to https://api.hubocean.earth/catalog with a request body containing these fields:
{
  "kind": "catalog.hubocean.io/dataset",
  "version": "v1alpha3",
  "metadata": {
    "name": "test-dataset",
    "display_name": "Test Dataset",
    "description": "A test dataset"
  },
  "spec": {
    "storage_controller": "registry.hubocean.io/storageController/storage-tabular",
    "storage_class": "registry.hubocean.io/storageClass/tabular",
    "maintainer": {
      "contact": "HUB Ocean <info@oceandata.earth>",
      "organisation": "Hub Ocean"
    },
    "tags": [
      "Ocean"
    ]
  }
}
- kind: leave as is
- version: leave as is
- name: the dataset's unique name. Note: if the same request is sent a second time, it will fail because this field must be unique
- display_name: the name visible in the platform. Does not need to be unique
- description: the description visible in the platform under the name field
- storage_controller: leave as is
- storage_class: leave as is. This field shows that it is a TABULAR dataset
- contact: change, keeping the format provided
- organisation: change as needed
- tags: optional. Tags can be added for easier search and grouping of datasets
As with the raw example, we first create a data collection and a tabular dataset.
from odp.client import OdpClient
from odp.dto import Metadata
from odp.dto.catalog import DataCollectionDto, DataCollectionSpec
from odp.dto.catalog import DatasetDto, DatasetSpec
from odp.dto.common.contact_info import ContactInfo
from odp.dto.common.license import License

# Instantiate the client without specifying a token provider.
# The token provider will be set based on the environment.
client = OdpClient()

data_collection_name = "collection-manifest-example"
dataset_name = "Dataset-name"

collection = DataCollectionDto(
    metadata=Metadata(
        name=data_collection_name,
        display_name="collection-example",
        description="A test data collection",
    ),
    spec=DataCollectionSpec(
        published_by=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
        published_date="2019-06-19T06:00:00",
        website="https://hubocean.earth",
        license=License(
            name="proprietary",
            full_text="This is a very strict legal text describing the data license.",
            href="www.wikipedia.org",
        ),
        tags=[],
    ),
)

collection = client.catalog.create(collection)
print("Collection was created")

dataset = DatasetDto(
    metadata=Metadata(
        name=dataset_name,
        display_name="SDK Tabular Example",
    ),
    spec=DatasetSpec(
        data_collection=f"catalog.hubocean.io/dataCollection/{data_collection_name}",
        # Both lines below specify that this is Tabular storage
        storage_controller="registry.hubocean.io/storageController/storage-tabular",
        storage_class="registry.hubocean.io/storageClass/tabular",
        maintainer=ContactInfo(
            contact="User McUsername <user.mcusername@emailprovider.com>",
            organisation="Organisation Name",
        ),
    ),
)

dataset = client.catalog.create(dataset)
print("Resource created successfully:", dataset)
Uploading Tabular data
Currently, tabular upload works only through the API and the SDK.
For the application to know how to store the uploaded data, we need to create a schema that matches the data to be uploaded. Here is a simple example with one string column.
POST https://api.hubocean.earth/data/{dataset_uuid}/schema with request body:
{
  "table_schema": {
    "SeahorseName": {
      "type": "string"
    }
  }
}
- table_schema: leave as is
- SeahorseName: a column name, chosen according to your needs. It could be named "depth", "location", etc.
- type: the type of data that will be stored. See TableSpec for usable data types
Then you can upload the data itself with POST https://api.hubocean.earth/data/{dataset_uuid}. Here is a body example for the previous schema:
{
  "data": [
    {
      "SeahorseName": "Frank"
    },
    {
      "SeahorseName": "Tom"
    }
  ]
}
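The two API calls (schema creation, then data upload) can be sketched together with the requests library; the ODP_TOKEN environment variable and dataset UUID are placeholders:

```python
import os

# Placeholder identifier; replace with your tabular dataset's UUID.
DATASET_UUID = "00000000-0000-0000-0000-000000000000"
BASE_URL = f"https://api.hubocean.earth/data/{DATASET_UUID}"

# Schema with one string column, as in the example above.
schema_body = {"table_schema": {"SeahorseName": {"type": "string"}}}
# Two rows matching that schema.
data_body = {"data": [{"SeahorseName": "Frank"}, {"SeahorseName": "Tom"}]}

token = os.environ.get("ODP_TOKEN")  # hypothetical bearer-token variable
if token:
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {token}"}
    # The schema must exist before any rows are written.
    requests.post(f"{BASE_URL}/schema", headers=headers, json=schema_body)
    requests.post(BASE_URL, headers=headers, json=data_body)
```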
Through the SDK, the same kind of schema is created with create_schema():
# TableSpec import path assumed; adjust to your SDK version:
from odp.client.dto.table_spec import TableSpec

client.tabular.create_schema(
    resource_dto=dataset,  # Use the tabular dataset created above
    table_spec=TableSpec(table_schema={"Data": {"type": "string"}}),
)
To upload data, we prepare it in the format of the schema. Here is an example of two rows for the previously created schema:
test_data = [{"Data": "Test"}, {"Data": "Test1"}]
client.tabular.write(resource_dto=dataset, data=test_data)
Now you have uploaded tabular data. You can view it with:
uploaded_data = client.tabular.select_as_list(dataset)
print("Data query result:")
print(f"{uploaded_data}\n")
We have more examples for ingesting raw and tabular data in the SDK repository.