Tabular v2 - Quick Start
The Ocean Data Platform (ODP) is a platform for finding, accessing, and sharing ocean data. You can connect to ODP using the Python SDK, which provides dedicated APIs for interacting with tabular datasets.
We added a completely new implementation of the tabular storage, called Tabular v2. The key features of Tabular v2 are:
- `pyarrow.ipc`: provides faster and more efficient data serialization and deserialization
- simple query: allows you to query the data using a simpler syntax or `pyarrow.compute` expressions
- ktable: dynamically partitions the data automatically on the backend
- transactions: simplify how you can modify the data and how errors are handled
- `pyarrow.Schema`: provides an easier-to-use schema definition
- bsquare: indexes polygons and points to speed up spatial queries
Setting up the SDK
To get started, install the SDK using pip:
pip install -U odp-sdk
If you're using a Jupyter notebook in an ODP workspace, you can run the install command with a leading `!`:
!pip install -U odp-sdk
Before using the SDK, make sure you’ve signed up for access to the Ocean Data Platform.
- **In a workspace:** If you're running code inside an ODP workspace, authentication is handled automatically; no additional setup is needed.
- **Locally:** If you're using the SDK outside of an ODP workspace, you'll need to authenticate using an API key. You can generate your API key from the account page after signing in. Use the key when creating a client:
from odp.client import OdpClient
cli = OdpClient(api_key="sk_your_secret_api_key")
Getting Started
With the SDK set up, you are ready to explore our vast selection of ocean data!
After getting a dataset UUID from the catalog, you can obtain a tabular handler by instantiating a client and connecting to a dataset:
import pyarrow as pa
from odp.client import OdpClient
cli = OdpClient()
dataset = cli.catalog.get("dataset-uuid")
tab = cli.table_v2(dataset)
The above handler can then be used to interact with the tabular data for that dataset. We will now show some examples of the available functionality.
Create the Table
Before you can start adding data, you need to create the table by defining its schema. This can be done using a `pyarrow.Schema` object:
tab.create(pa.schema([
pa.field("id", pa.string(), nullable=False, metadata={"index":"1"}),
pa.field("name", pa.string()),
pa.field("age", pa.int32()),
]))
This sets the schema of the table and marks the `id` column to be indexed.
(You can later change the schema using the `alter` method.)
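The exact signature of `alter` isn't shown in this guide; as a rough sketch, assuming it accepts a replacement `pyarrow.Schema` (an assumption to verify against the SDK reference), adding a column might look like this:

```python
# Hypothetical sketch: assumes alter() accepts a replacement pyarrow.Schema.
# Verify the exact signature in the odp-sdk reference before relying on it.
tab.alter(pa.schema([
    pa.field("id", pa.string(), nullable=False, metadata={"index": "1"}),
    pa.field("name", pa.string()),
    pa.field("age", pa.int32()),
    pa.field("email", pa.string()),  # newly added column
]))
```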
Add Data
You can add data to the table using the `insert` method, which accepts Python dictionaries, `pyarrow.RecordBatch`, or `pandas.DataFrame`.
This should be done inside a transaction:
with tab as tx:
tx.insert({"id": "1", "name": "Alice", "age": 25})
tx.insert({"id": "2", "name": "Bob", "age": 30})
Using a transaction prevents other readers from seeing intermediate steps until the transaction is committed.
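Since `insert` also accepts `pyarrow.RecordBatch` and `pandas.DataFrame`, you can pass data you already have in those forms instead of individual dictionaries; a minimal sketch:

```python
import pandas as pd
import pyarrow as pa

# Bulk data in Arrow and pandas form (illustrative values)
batch = pa.record_batch({
    "id": ["3", "4"],
    "name": ["Carol", "Dave"],
    "age": [41, 52],
})
df = pd.DataFrame({"id": ["5"], "name": ["Eve"], "age": [28]})

with tab as tx:
    tx.insert(batch)  # a pyarrow.RecordBatch
    tx.insert(df)     # a pandas.DataFrame
```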
Query Data
You don't need a transaction to query the data:
for row in tab.select().rows():
print(row)
You can also filter the data, choose which columns to return, and get dataframes:
for df in tab.select("age < 26", cols=["id"]).dataframes():
print(df)
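As noted in the feature list, `select` also accepts `pyarrow.compute` expressions in place of the string syntax; a minimal sketch, assuming the expression is passed where the query string would go:

```python
import pyarrow.compute as pc

# Same filter as above, expressed as a pyarrow.compute expression
for df in tab.select(pc.field("age") < 26, cols=["id"]).dataframes():
    print(df)
```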
Indexing
You can specify in the schema which columns should be indexed.
Indexing works by splitting the data into separate files based on the values of the indexed fields, so queries can skip files that contain no data of interest.
The drawback of this approach is that the more columns you index, the less efficient each index becomes.
The upside is that the more data you have, the more effective indexing gets.
Normally you will start to care about indexing when you have hundreds of thousands of rows.
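For example, with the schema created earlier, where `id` carries `metadata={"index": "1"}`, a filter on `id` only needs to read the partitions that contain matching values. A brief sketch; the equality syntax here is assumed by analogy with the `age < 26` example above:

```python
# Sketch: filtering on the indexed "id" column lets the backend skip
# partitions whose files contain no matching values.
for row in tab.select('id == "1"').rows():
    print(row)
```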
Polygons and geospatial queries
Tabular v2 supports polygons and geospatial queries:
tab.create(pa.schema([
pa.field("id", pa.string(), nullable=False),
pa.field("geo", pa.string(), metadata={"index":"1", "isGeometry":"1"}),
]))
with tab as tx:
tx.insert({"id": "1", "geo": "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"})
tx.insert({"id": "2", "geo": "POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1))"})
for row in tab.select(
'geo intersect "POLYGON ((0.5 0.5, 0.5 1.5, 1.5 1.5, 1.5 0.5, 0.5 0.5))"',
).rows():
print(row)
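Points expressed as WKT work the same way, since the bsquare index covers both polygons and points; a brief sketch reusing the query form above:

```python
with tab as tx:
    tx.insert({"id": "3", "geo": "POINT (0.7 0.7)"})

# The point lies inside the query polygon, so this row is returned as well
for row in tab.select(
    'geo intersect "POLYGON ((0.5 0.5, 0.5 1.5, 1.5 1.5, 1.5 0.5, 0.5 0.5))"',
).rows():
    print(row)
```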
Aggregation
Experimental
Imagine you have a table:
id | country | name | age |
---|---|---|---|
1 | IT | Alice | 25 |
2 | IT | Bob | 30 |
3 | UK | Charlie | 35 |
and you want to know the average age of people in each country. You need to aggregate the data, grouping by the `country` field:
data = tab.aggregate('country', aggr={"age": "avg"})
This will return a `DataFrame` with the average age of people in each country:
country | age |
---|---|
IT | 27.5 |
UK | 35 |
It is possible to further filter the data using a `query` expression:
data = tab.aggregate('country', query='age > 28', aggr={"age": "avg"})
country | age |
---|---|
IT | 30 |
UK | 35 |
You can specify multiple aggregations:
data = tab.aggregate('country', aggr={"age": "avg", "name": "min"})
country | age | name |
---|---|---|
IT | 27.5 | Alice |
UK | 35 | Charlie |
You can also let the schema of the table dictate the aggregation:
tab.create(pa.schema([
pa.field("id", pa.string(), nullable=False),
pa.field("country", pa.string(), nullable=False),
pa.field("name", pa.string(), nullable=False),
pa.field("age", pa.int32(), nullable=False, metadata={"aggr": "avg"}),
]))
which will automatically aggregate the `age` field by average whenever no `aggr` is specified:
data = tab.aggregate('country')
country | age |
---|---|
IT | 27.5 |
UK | 35 |
Working with files
With the new tabular controller, it is possible to upload files alongside a table.
Those files are not part of the table, and might be hidden from readers.
They can be used to ingest data into a table, either:
- directly (creating the table from them)
- appending (using the table schema to convert them)

A new metadata field becomes relevant for this: `ingest_name`.
Handling files
Obtain a table handle using a dataset or dataset UUID as described above.
Then you can use the raw controller to manage the files:
id = tab.raw.upload("my_file.ext", binary_data) # upload a file, get back an id for it
tab.raw.update_meta(id, {"key": "value"}) # update the metadata of the file
for file in tab.raw.list(): # show all the files, with their ids, names and other metadata
print(file)
tab.raw.download(id) # download a file by id
tab.raw.delete(id) # delete a file by id
It is also possible to identify a file by its filename, if the filename is unique within the dataset:
tab.raw.upload("test.csv", binary_data)
tab.raw.update_meta("test.csv", {"key": "value"})
tab.raw.delete("test.csv")
File metadata
Some metadata is automatically added to the file when uploaded:
- `id`: the id of the file
- `name`: the name of the file
- `size`: the size of the file
- `created`: the date of creation
- `updated`: last time updated
We also support the following properties:
- `geometry`: geometry related to the file. Can be the place of observation
- `mimetype`: mimetype of the file
- `properties`: dictionary of `{key[string]: value[string]}` for misc metadata
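These properties are not filled in automatically; a minimal sketch, assuming they are supplied through `update_meta` with the keys listed above (an assumption, as the guide doesn't show the exact call):

```python
# Sketch: attach optional metadata to an uploaded file.
# Assumes update_meta accepts the geometry/mimetype/properties keys as-is.
tab.raw.update_meta(id, {
    "geometry": "POINT (10.0 63.4)",  # e.g. the place of observation
    "mimetype": "text/csv",
    "properties": {"station": "A1", "cruise": "2024-03"},
})
```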
Ingest from files to table
tab.raw.ingest(id) # ingest a previously uploaded file by id or filename
If the table does not exist, a new table is created using the file data; the file is then converted to the schema of the table and appended.
Only specific file formats are supported:

- **Parquet**: this is pretty straightforward, since we use Arrow for IPC and already support most of the features
- **CSV**: CSV is very limited, and ingestion will only return simple data types. You can use `alter` later to improve the results.
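Putting upload and ingest together, a minimal end-to-end sketch for a CSV file (the file name and contents are illustrative):

```python
csv_bytes = b"id,name,age\n1,Alice,25\n2,Bob,30\n"

file_id = tab.raw.upload("people.csv", csv_bytes)  # upload the raw file
tab.raw.ingest(file_id)  # create the table from it, or append if it already exists

for row in tab.select().rows():
    print(row)
```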
Resilience to changes
Since the introduction of `alter`, the table schema is no longer static.
When ingesting a file, we store the name of the column in the original file as `ingest_name` in the metadata.
This means that further re-ingestions will adapt to schema changes seamlessly.