ODP Python SDK
The new ODP Python SDK provides a simple and efficient way to access datasets on the Ocean Data Platform.
You can install the SDK using pip:
pip install -U odp-sdk
And then use it to connect to the Ocean Data Platform:
from odp.client import Client
cli = Client()
The client will open your browser to perform the authentication process. If you prefer, you can pass an api_key to the Client constructor to authenticate without opening a browser:
from odp.client import Client
cli = Client(api_key=MY_API_KEY)
You can generate an API key in the Ocean Data Platform web interface, under your user profile.
Once the Client is initialized, you need to give it a dataset ID to access a specific dataset. Normally you will use the web UI to find the dataset you want to work with, and then use its dataset ID to access it in the SDK:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
glodap now refers to the GLODAP dataset, which is well known for its globally calibrated ocean carbon data, contains over a million data points, and is widely used by oceanographers. You can use it to retrieve the data, or to modify it if you are an editor.
Tabular
Each dataset may have tabular storage, which is defined by a schema that you can retrieve with:
import pyarrow as pa
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
try:
    schema: pa.Schema = glodap.table.schema()
    print(schema)
except FileNotFoundError:
    print("This dataset does not have a tabular storage.")
The schema is a pyarrow.Schema object, which you can use to understand the structure of the data.
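For example, you can list the column names and types with the standard pyarrow.Schema API (a minimal sketch; nothing here is specific to the ODP SDK):
# Iterate over the fields of the schema obtained above.
for field in schema:
    print(field.name, field.type, "nullable" if field.nullable else "required")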
select()
If you want to retrieve some of the data, you can use the select() method:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select().batches():
    print(batch)
This will return all the data in the dataset table, in chunks of pyarrow.RecordBatch objects.
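For example, a minimal sketch that counts the rows in the table without ever holding more than one chunk in memory (num_rows is standard pyarrow.RecordBatch API):
total_rows = 0
for batch in glodap.table.select().batches():
    total_rows += batch.num_rows  # each chunk is a pyarrow.RecordBatch
print(f"total rows: {total_rows}")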
filter
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select(filter='G2year >= 2020 AND G2year < 2025').batches():
    print(batch)
You can use the optional filter argument to filter the data using arrow-like expressions or an SQL-like syntax. Most pyarrow expressions are supported, and some special syntax has been added for geospatial queries:
ds.table.select(filter='geometry within "POLYGON ((-10 50, -5 50, -5 55, -10 55, -10 50))"').batches()
- intersect: check if the two polygons intersect
- within: check if the polygon on the left is inside the polygon on the right
- contains: check if the polygon on the left contains the polygon on the right
- is null: check if the value of the field is null
- is not null: check if the value of the field is not null
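For example, a short sketch combining one of these operators with a regular comparison in the same filter string (assuming the operators compose with AND as shown above):
# Rows from 2020 onward that actually have a salinity measurement.
for batch in glodap.table.select(filter='G2salinity is not null AND G2year >= 2020').batches():
    print(batch)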
columns
You can specify which columns you want to retrieve using the columns argument:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select(columns=['G2tco2', 'G2year']).batches():
    print(batch)
vars
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
year_min = 2020
year_max = 2025
for batch in glodap.table.select(
    filter='G2year >= $start_year AND G2year < $end_year',
    vars={'start_year': year_min, 'end_year': year_max},
).batches():
    print(batch)
You can bind variables to the filter using the vars argument. It works with either named or positional parameters; the example above uses named variables. For positional parameters, use G2year >= ? AND G2year < ? and pass the values in the vars argument as a list:
for batch in glodap.table.select(filter='G2year >= ? AND G2year < ?', vars=[year_min, year_max]).batches():
    print(batch)
streaming vs full table
The select() method returns a cursor object which can be scanned to retrieve the data in chunks. You can get the native pyarrow.RecordBatch objects by chaining select(...).batches(). You can also get each chunk as a dataframe using select(...).dataframes(), or use select(...).rows() if you want Python dictionaries for each row.
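For example, a minimal sketch of the two other streaming forms described above (assuming dataframes() yields one pandas dataframe per chunk and rows() yields one dictionary per row):
# Stream the data as pandas dataframes, one chunk at a time.
for df in glodap.table.select(filter='G2year >= 2020').dataframes():
    print(len(df))

# Stream the data as plain Python dictionaries, one per row.
for row in glodap.table.select(filter='G2year >= 2020').rows():
    print(row['G2year'])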
All of the above stream the data, meaning that only the chunk currently being processed is kept in memory. This is very efficient for large datasets, and fast, since the first chunk becomes available much sooner. When fetching a limited amount of data, streaming can be inconvenient, so you can use the select(...).all() method to retrieve the entire result at once. You can then chain it to retrieve different formats:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
df = glodap.table.select("G2year==2018").all().dataframe() # pandas dataframe
Aggregation
Often you need aggregated data from a table, for example the average salinity per year in a specific region:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
df = glodap.table.aggregate(
    filter='G2year >= 2020 AND G2year < 2025',
    group_by="G2year",
    aggr={
        'G2salinity': 'avg',
    },
)
This will return a pandas dataframe indexed by the unique values of G2year, with two columns: * for the number of rows in each group and G2salinity for the average salinity.
The filter works the same way as in the select() method, and you can use the vars argument to bind variables.
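For example, a sketch of the same aggregation with bound variables, assuming aggregate() accepts the vars argument exactly like select() does:
df = glodap.table.aggregate(
    filter='G2year >= $start_year AND G2year < $end_year',
    vars={'start_year': 2020, 'end_year': 2025},
    group_by="G2year",
    aggr={'G2salinity': 'avg'},
)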
functions
You can group the aggregation by h3 tiles, a hexagonal grid system that is useful for spatial data. To do so, use h3(my_geo_field, resolution) as the group_by argument, where my_geo_field is the name of the column containing the geometry data and resolution is an integer between 0 and 15 (the higher the resolution, the smaller the hexagons).
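For example, a hedged sketch on a hypothetical dataset with a geometry column named geom and a numeric column named temperature (the dataset ID and column names are placeholders, not real GLODAP fields):
# Hypothetical dataset with a geometry column "geom" and a numeric column "temperature".
ds = cli.dataset(MY_SPATIAL_DATASET_ID)
# Average temperature per H3 tile at resolution 4.
df = ds.table.aggregate(
    group_by='h3(geom, 4)',
    aggr={'temperature': 'avg'},
)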
If you want to aggregate by buckets of numbers, you can use the bucket(my_field, val0, val1, ...) function, where my_field is the name of the column containing the numeric data and val0, val1, etc. are the boundaries of the buckets. You will get a list of rows where the index is the index of the bucket, so if you call bucket(age, 10, 20, 30) you will get all the data less than or equal to 10 in bucket 0, the data between 10 and 20 in bucket 1, between 20 and 30 in bucket 2, and all the data greater than 30 in bucket 3.
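For example, a sketch that buckets GLODAP rows by year and averages the salinity per bucket, assuming bucket(...) is passed in the group_by argument in the same way as h3(...):
# Bucket 0: years up to 2000, bucket 1: 2000-2010, bucket 2: 2010-2020, bucket 3: after 2020.
df = glodap.table.aggregate(
    group_by='bucket(G2year, 2000, 2010, 2020)',
    aggr={'G2salinity': 'avg'},
)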
aggregation and streaming
Internally, the SDK performs a two-step aggregation:
1. the server scans each partition of the table and computes partials, which are sent to the client
2. the client receives the stream of partials and locally builds the final aggregation
This means that even though memory usage is kept low, the final result only becomes available once all the partials have been received.
Modifying data
You need to be an editor of a dataset to modify its tabular data.
The easiest way to become an editor is to create a new dataset yourself in ODP under My Data. You will first have to create the table with a pyarrow.Schema object, which defines the structure of the data:
from odp.client import Client
import pyarrow as pa
cli = Client()
ds = cli.dataset(MY_DATASET_ID)  # use the ID of the dataset you created earlier
schema = pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("value", pa.float64(), False),
])
ds.table.create(schema)
This creates a new table with four columns: id, geo, name, and value. The metadata on the geo column makes it a geometry column (encoded as WKT) and indexes it for fast spatial queries.
You can then insert or modify the data using a transaction:
with ds as tx:
    tx.insert([
        {"id": 1, "geo": "POINT(1 1)", "name": "point1", "value": 1.0},
        {"id": 2, "geo": "POINT(2 2)", "name": "point2", "value": 2.0},
    ])
This will insert two rows into the table and then commit the transaction (unless an exception is raised, in which case the transaction will be completely rolled back).
You can also replace rows using the replace() method:
with ds as tx:
    for row in tx.replace(filter='id = 1').rows():
        # the row has just been removed from the table
        row['value'] = 10.0  # modify the value
        tx.insert(row)       # insert it back with the updated value
It is not required that every row returned by replace() is inserted back: you can skip a row to remove it, or insert something else, even multiple rows. You can also fetch batches instead of rows when speed is important, or use the delete() method to simply remove rows efficiently.
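For example, a hedged sketch assuming delete() accepts the same filter syntax used by select() and replace() (the exact signature is not shown above):
with ds as tx:
    # Remove every row whose value is negative (hypothetical condition).
    tx.delete(filter='value < 0')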
Alter the table schema
import pyarrow as pa
from odp.client import Client
cli = Client()
ds = cli.dataset(MY_DATASET_ID)
ds.table.alter(schema=pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("temp", pa.float64(), False),   # rename value to temp
    pa.field("new_col", pa.string(), True),  # add a new column
]), from_names={"temp": "value"})  # rename the value column to temp
This will change the schema of the table, adding a new nullable column new_col and renaming the value column to temp.
Types can be changed as well, and the backend will convert the existing data to the new types.
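For example, a sketch that narrows the temp column to 32-bit floats using the same alter() call shown above; the conversion of existing values is handled by the backend, as stated:
ds.table.alter(schema=pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("temp", pa.float32(), False),   # was float64; existing values are converted
    pa.field("new_col", pa.string(), True),
]))  # from_names omitted here: no columns are renamed (assumption)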
It might be wise to use the web UI to alter the schema, as it will give you a preview, a better idea of what will happen, and more details on the various options available.
Files
While a dataset can have a table with a schema, it can also have zero or more files associated with it.
from odp.client import Client
cli = Client()
ds = cli.dataset("5cf13ecb-6e12-41b8-8cfc-9c9c6eb2065d") # Aker Biomarine Echo Sounding data
all_files = ds.files.list()
If you are an editor, you can upload more files:
from odp.client import Client
cli = Client()
ds = cli.dataset(MY_DATASET_ID)
file_id = ds.files.upload("hello.txt", b"Hello World!")
ds.files.delete(file_id)  # delete the file we just uploaded
with open("file.csv", "rb") as f:
    fid = ds.files.upload("north_sea.csv", f)

ds.files.update_meta(fid, {
    "name": "north_sea.csv",
    "mime-type": "application/octet-stream",
    "geometry": "POLYGON ((0 51, 4 51, 9 54, 9 59, 5 61, 1 61, 0 58, 0 51))",
    "format": "csv",
})
It might be easier to upload and manage files using the ODP web interface.
Ingest
If you tested the UI above, you probably noticed that some files can be automatically ingested into the table.
If no table exists, a new table is created based on the file data; otherwise, you are given the option to append to, overwrite, or recreate the table.