Tabular v2 - Quick Start
The Ocean Data Platform (ODP) is a data platform to find, access and share ocean data. You can connect to ODP using the Python SDK, which provides specific APIs to interact with tabular datasets.
We added a completely new implementation of tabular storage, called Tabular v2.
The key features of Tabular v2 are:
* pyarrow.ipc: provides faster and more efficient data serialization and deserialization.
* simple query: allows you to query the data using a simpler syntax or pyarrow.compute expressions.
* ktable: dynamically partitions the data automatically on the backend.
* transactions: simplify how you modify the data and how errors are handled.
* pyarrow.Schema: provides an easier-to-use schema definition.
* bigcol: stores large columns in separate files, which allows for faster access to the data.
* bsquare: indexes polygons and points to speed up spatial queries.
Getting Started
To get started with Tabular v2, you can follow the Getting Started guide.
Assuming you have everything set up, you should be able to obtain a tabular handler:
import pyarrow as pa
from odp.client import OdpClient

# Connect to ODP, fetch the dataset from the catalog and get a Tabular v2 handler for it
cli = OdpClient()
dataset = cli.catalog.get("my-dataset")
tab = cli.table_v2(dataset)
The above handler can then be used to make changes to that table, and we will use it in the examples below.
Refer to the ODP documentation for more information on how to create or modify datasets.
Create the Table
Before you can start adding data to the table, you need to create the table schema. This can be done using the pyarrow.Schema object:
tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False, metadata={"index": "1"}),
    pa.field("name", pa.string()),
    pa.field("age", pa.int32()),
]))
This sets the schema of the table and marks the id column to be indexed.
Add Data
You can add data to the table using the insert method, which accepts Python dictionaries, pyarrow.RecordBatch objects, or pandas.DataFrame objects.
This should be done inside a transaction:
with tab as tx:
    tx.insert({"id": "1", "name": "Alice", "age": 25})
    tx.insert({"id": "2", "name": "Bob", "age": 30})
Using a transaction prevents other users from seeing intermediate steps until the transaction is committed.
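This also covers error handling. The following is a minimal sketch, assuming that an exception raised inside the with block discards the uncommitted inserts rather than committing them:

try:
    with tab as tx:
        tx.insert({"id": "3", "name": "Carol", "age": 41})
        raise RuntimeError("something went wrong")  # hypothetical failure mid-transaction
except RuntimeError:
    # assumption: the transaction was never committed, so Alice and Bob remain the only rows
    pass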
Query Data
You don't need a transaction to query the data:
for row in tab.select().rows():
    print(row)
You can also filter the data, choose which columns to return, and get dataframes:
for df in tab.select("age < 26", cols=["id"]).dataframes():
    print(df)
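The feature list above mentions pyarrow.compute expressions as an alternative to the string syntax. As a sketch, assuming select() also accepts such an expression as its filter:

import pyarrow.compute as pc

# Same filter as above, written as a pyarrow.compute expression (assumption: select() accepts it)
for df in tab.select(pc.field("age") < 26, cols=["id"]).dataframes():
    print(df)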
Indexing
You can specify in the schema which columns should be indexed.
Indexing works by splitting the data into separate files based on the values of the indexed fields, so queries can skip files that contain no data of interest.
The drawback of this approach is that the more columns you index, the less efficient each index becomes.
The upside is that the more data you have, the more effective indexing gets.
Normally you will start to care about indexing when you have hundreds of thousands of rows.
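For example, a schema that also indexes a frequently filtered column reuses the index metadata key from the create() example above (the country field here is purely illustrative):

tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False, metadata={"index": "1"}),
    # illustrative field: filters on country can now skip files with no matching values
    pa.field("country", pa.string(), metadata={"index": "1"}),
    pa.field("age", pa.int32()),
]))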
Polygons and geospatial queries
Tabular v2 supports polygons and geospatial queries:
tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("geo", pa.string(), metadata={"index": "1", "isGeometry": "1"}),
]))

with tab as tx:
    tx.insert({"id": "1", "geo": "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"})
    tx.insert({"id": "2", "geo": "POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1))"})

for row in tab.select(
    'geo intersect "POLYGON ((0.5 0.5, 0.5 1.5, 1.5 1.5, 1.5 0.5, 0.5 0.5))"',
).rows():
    print(row)
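Since bsquare indexes both polygons and points, point geometries can be stored and queried the same way. As a sketch, assuming WKT POINT values are accepted in the same geo column:

with tab as tx:
    tx.insert({"id": "3", "geo": "POINT (0.7 0.7)"})  # hypothetical point row

for row in tab.select('geo intersect "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"').rows():
    print(row)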
Big cells
Some datasets have columns that are too large to be stored in the same file as the rest of the data. When this happens, we store only the beginning of the data in the column and add a reference to a separate file containing the full data.
While this makes querying the data harder and possibly slower in some cases, it allows for faster access in all other cases and heavy caching of the big data in the client, with better performance overall.
Aggregation
This feature is experimental.
Imagine you have a table:
| id | country | name    | age |
|----|---------|---------|-----|
| 1  | IT      | Alice   | 25  |
| 2  | IT      | Bob     | 30  |
| 3  | UK      | Charlie | 35  |
and you want to know the average age of people in each country. You need to aggregate the data, grouping by the country field:
data = tab.aggregate('country', aggr={"age": "avg"})
This will return a DataFrame with the average age of people in each country:
| country | age  |
|---------|------|
| IT      | 27.5 |
| UK      | 35   |
It is possible to further filter the data using a query expression:
data = tab.aggregate('country', query='age > 28', aggr={"age": "avg"})
You can specify multiple aggregations:
data = tab.aggregate('country', aggr={"age": "avg", "name": "min"})
You can also let the schema of the table dictate the aggregation:
tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("country", pa.string(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
    pa.field("age", pa.int32(), nullable=False, metadata={"aggr": "avg"}),
]))
which will automatically aggregate the age field by average whenever no aggr is specified:
data = tab.aggregate('country')