Ocean Data Platform SDK — Python and R

The Ocean Data Platform exposes curated ocean datasets through a tabular API. Our SDKs (Python and R) let you authenticate, stream Arrow batches, transform results into native data structures, and modify data (insert, update, delete). You can also inspect schemas, perform server-backed aggregations, and manage raw files.

Quick Start

Want a fast tour? Download and run the quick-start script:

Notice: The R SDK is still in beta. Please report any bugs and share your feedback!

Installation

Install from PyPI:

pip install -U odp-sdk

Install from GitHub (requires remotes package):

install.packages("remotes")  # skip if already installed
remotes::install_github("C4IROcean/odp-sdkr")

Once installed, load the package in your R session with library(odp).

Authentication

Authenticate using your API key or via browser:

from odp.client import Client

# Authenticate with API key
cli = Client(api_key="Sk_....")

# Or authenticate via browser (if no api_key provided)
cli = Client()

Authenticate using your API key or via browser/environment variable:

# Authenticate with API key
client <- odp_client(api_key = "Sk_....")

# Or authenticate via browser/environment key
# Sys.setenv(ODP_API_KEY = "Sk_...")
client <- odp_client()

You can generate an API key in the Ocean Data Platform web interface under your user profile.
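To keep the key out of your code, resolve it from the environment before constructing the client. The sketch below assumes the ODP_API_KEY variable name shown in the R example; resolve_api_key is a hypothetical helper, not part of the SDK:

```python
import os

def resolve_api_key(explicit_key=None):
    """Prefer an explicit key; otherwise fall back to the environment."""
    return explicit_key or os.environ.get("ODP_API_KEY")

# In practice, set ODP_API_KEY in your shell or secrets manager instead:
os.environ["ODP_API_KEY"] = "Sk_example"   # for demonstration only
key = resolve_api_key()                    # "Sk_example"
```

You would then pass the resolved key to Client(api_key=key), keeping the literal value out of source control.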

Connecting to a Dataset

Use the dataset ID from the catalog UI (https://app.hubocean.earth/catalog). The example below uses the public GLODAP dataset:

glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
table = glodap.table
glodap <- client$dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
table <- glodap$table

If the dataset has an attached tabular store, you can work with the table using the helpers described below. If the dataset is not tabular, the table calls raise an error.

Reading data

schema()

The schema call returns the Arrow layout for the table so you can plan your queries:

schema = table.schema()  # returns a pyarrow.Schema
print(schema)
schema <- table$schema()
print(schema)

select()

Tables on ODP can be quite large, so rather than waiting for the full table to download, you can stream it in chunks using a cursor and materialize each chunk into familiar structures such as dataframes, geodataframes, or dictionaries.

# Fetch next chunk as Arrow record batch
cursor = table.select().batches()
batch = next(cursor)

# as Pandas or GeoPandas df
cursor = table.select().dataframes()
df = next(cursor)

# as Python dict
cursor = table.select().rows()
row = next(cursor)


# Or iterate any of the above in a loop:
for chunk in cursor:
    print(chunk)
# Fetch next chunk as Arrow record batch
cursor <- table$select()
batch <- cursor$next_batch()

# as R dataframe
cursor <- table$select()
df <- cursor$next_dataframe()

# as Tidyverse tibble
cursor <- table$select()
tibble <- cursor$next_tibble()


# Or iterate any of the above in a loop:
cursor <- table$select()
while (!is.null(batch <- cursor$next_batch())) {
  print(batch)
}

Successive next calls drain only the chunks that have not yet been streamed. To start from the beginning again, open a new cursor with another select().
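The drain-once behavior is the same as that of a plain Python generator, which makes a convenient mental model (a sketch, not the SDK's cursor):

```python
def fake_cursor(chunks):
    """A one-way stream: each call hands out the next chunk, never revisits."""
    for chunk in chunks:
        yield chunk

cursor = fake_cursor([[1, 2], [3, 4], [5, 6]])
first = next(cursor)    # [1, 2]
rest = list(cursor)     # drains what is left: [[3, 4], [5, 6]]
# Reading from the beginning again requires a fresh cursor.
```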

filter

You can add a filter parameter to the select() to narrow down your query. Filters use SQL/Arrow-style expressions, including geospatial helpers.

# Basic filter
cursor = table.select(filter='G2year >= 2020 AND G2year < 2025').batches()

# Geospatial filter
cursor = table.select(filter='geometry within "POLYGON ((-10 50, -5 50, -5 55, -10 55, -10 50))"').dataframes()
# Basic filter
cursor <- table$select(filter = "G2year >= 2020 AND G2year < 2025")

# Geospatial filter
cursor <- table$select(filter = 'geometry within "POLYGON ((-10 50, -5 50, -5 55, -10 55, -10 50))"')

Supported geospatial operations:

  • intersect - check if two polygons intersect
  • within - check if the polygon on the left is inside the polygon on the right
  • contains - check if the polygon on the left contains the polygon on the right
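To build intuition for what these predicates evaluate, here is a minimal ray-casting point-in-polygon test in plain Python. This only illustrates the containment idea; the server evaluates full WKT geometries, and none of the names below come from the SDK:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Same polygon as the filter example above (closing point omitted)
poly = [(-10, 50), (-5, 50), (-5, 55), (-10, 55)]
point_in_polygon(-7, 52, poly)   # True
point_in_polygon(0, 52, poly)    # False
```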

columns

Restrict the projection when you only need specific fields:

cursor = table.select(columns=['G2tco2', 'G2year']).batches()
cursor <- table$select(columns = c("G2tco2", "G2year"))

vars

Bind parameters inside the filter using either named or positional variables:

# Named variables
cursor = table.select(
    filter='G2year >= $start_year AND G2year < $end_year',
    vars={'start_year': 2020, 'end_year': 2025}
).batches()

# Positional variables
cursor = table.select(
    filter='G2year >= ? AND G2year < ?',
    vars=[2020, 2025]
).dataframes()
# Named variables
cursor <- table$select(
  filter = "G2year >= $start AND G2year < $end",
  vars = list(start = 2020, end = 2025)
)

# Positional variables
cursor <- table$select(
  filter = "G2year >= ? AND G2year < ?",
  vars = list(2020, 2025)
)
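Conceptually, binding replaces each placeholder with a typed value before the filter is evaluated. The sketch below illustrates that idea with plain string substitution; the real binding happens server-side, and the bind helper here is hypothetical:

```python
import re

def bind(filter_expr, vars):
    """Illustrative substitution of $name / ? placeholders with literals."""
    def lit(v):
        return repr(v) if isinstance(v, str) else str(v)
    if isinstance(vars, dict):   # named: $start_year, $end_year, ...
        return re.sub(r"\$(\w+)", lambda m: lit(vars[m.group(1)]), filter_expr)
    values = iter(vars)          # positional: ? placeholders, in order
    return re.sub(r"\?", lambda m: lit(next(values)), filter_expr)

bind("G2year >= $start AND G2year < $end", {"start": 2020, "end": 2025})
# "G2year >= 2020 AND G2year < 2025"
```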

all...()

For smaller tables you can fetch the entire table in a single call:

arrow_tab = table.select().all().table()    # Arrow table
df = table.select().all().dataframe()       # GeoPandas or Pandas df
rows = table.select().all().rows()          # Python list of dicts
arrow_tab <- table$select()$all_table()      # Arrow table
df <- table$select()$all_dataframe()         # R dataframe
tibble <- table$select()$all_tibble()        # Tidyverse tibble

Adjust max_rows or max_time if the defaults are not enough to collect all of the data. Note: if the table is too large, stream it with a cursor instead (see above).

stats()

Fetch read-only stats for the table:

stats = table.stats()
print(stats)
stats <- table$stats()

Aggregations

table.aggregate() (Python) / table$aggregate() (R) pushes the heavy lifting to the backend and stitches the partial results together locally.

df = table.aggregate(
    filter='G2year >= 2020 AND G2year < 2025',
    group_by="G2year",
    aggr={
        'G2salinity': 'avg',
        'G2tco2': 'max'
    }
)
print(df)
agg <- table$aggregate(
  group_by = "G2year",
  filter = "G2year >= 2020 AND G2year < 2025",
  aggr = list(G2salinity = "avg", G2tco2 = "max")
)
print(agg)

Aggregation operations: "sum", "min", "max", "count", or "avg".
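The end result of a group_by plus aggr is equivalent to a plain group-and-reduce, which can be sketched in stdlib Python (an illustration of the semantics, not the SDK's implementation):

```python
from collections import defaultdict

def aggregate(rows, group_by, aggr):
    """Group rows (dicts) by one key, then reduce each requested column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)
    ops = {
        "sum": sum,
        "min": min,
        "max": max,
        "count": len,
        "avg": lambda v: sum(v) / len(v),
    }
    return {
        key: {col: ops[op]([r[col] for r in grp]) for col, op in aggr.items()}
        for key, grp in groups.items()
    }

rows = [
    {"G2year": 2020, "G2salinity": 35.0},
    {"G2year": 2020, "G2salinity": 34.0},
    {"G2year": 2021, "G2salinity": 36.0},
]
aggregate(rows, "G2year", {"G2salinity": "avg"})
# {2020: {'G2salinity': 34.5}, 2021: {'G2salinity': 36.0}}
```

The backend computes these reductions per partition and merges the partials, so only the aggregated result travels over the network.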

Advanced grouping expressions:

# H3 hexagonal grid (resolution 0-15)
df = table.aggregate(
    group_by="h3(geometry, 6)",
    aggr={'G2salinity': 'avg'}
)

# Bucketing (numeric ranges)
df = table.aggregate(
    group_by="buckets(G2tco2, 0, 200, 400)",
    aggr={'G2tco2': 'sum'}
)
# H3 hexagonal grid (resolution 0-15)
agg <- table$aggregate(
  group_by = "h3(geometry, 6)",
  aggr = list(G2salinity = "avg")
)

# Bucketing (numeric ranges)
agg <- table$aggregate(
  group_by = "buckets(G2tco2, 0, 200, 400)",
  aggr = list(G2tco2 = "sum")
)
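buckets(...) groups rows by numeric range. Assuming half-open [low, high) intervals between the listed edges (the exact boundary semantics are an assumption here, so check results against your data), the assignment looks like:

```python
import bisect

def bucket_label(value, edges):
    """Map a numeric value to a half-open bucket [edge_i, edge_i+1)."""
    i = bisect.bisect_right(edges, value)
    if i == 0:
        return f"< {edges[0]}"
    if i == len(edges):
        return f">= {edges[-1]}"
    return f"[{edges[i - 1]}, {edges[i]})"

edges = [0, 200, 400]
bucket_label(150, edges)   # "[0, 200)"
bucket_label(450, edges)   # ">= 400"
```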

Writing data

You need to be an editor of a dataset to modify its tabular data.

Creating and inserting

Create from arrow schema:

import pyarrow as pa

schema = pa.schema([
    pa.field("latitude", pa.float64()),
    pa.field("depth", pa.int32())
])
my_table.create(schema)
schema <- arrow::schema(
    latitude = arrow::float64(),
    depth = arrow::int32()
)
my_table$create(schema)

Create directly from data:

import pandas as pd

df = pd.DataFrame({
    'latitude': [10.5, 20.3],
    'depth': [100, 200]
})
my_table.create(df)
df <- data.frame(
    latitude = c(10.5, 20.3),
    depth = c(100L, 200L)
)
my_table$create(df)

Append more data to the existing table:

with my_table as tx:
    tx.insert(df_more)
my_table$insert(df_more)

Transactions

For multi-step workflows (insert, update, delete) within a transaction:

with my_table as tx:
    tx.insert(df)
    df_replace = tx.replace(filter="depth > 50").dataframe()
    df_replace['depth'] = df_replace['depth'] + 5
    tx.insert(df_replace)
    tx.delete(query="depth > 1000")
tx <- table$begin()
tx$insert(df)
df_replace <- tx$replace(filter = "depth > 50")$dataframe()
df_replace$depth <- df_replace$depth + 5L
tx$insert(df_replace)
tx$delete(query = "depth > 1000")
tx$commit()
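In both languages the transaction either applies all pending changes or none of them. The Python `with my_table as tx:` form commits on clean exit and discards the buffered work if the block raises; a toy sketch of that pattern (not the SDK's implementation):

```python
class SketchTransaction:
    """Toy transaction: buffer writes, apply on clean exit, discard on error."""
    def __init__(self, table):
        self.table = table
        self.pending = []

    def insert(self, rows):
        self.pending.extend(rows)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.table.extend(self.pending)   # commit
        self.pending.clear()                  # rollback discards the buffer
        return False                          # re-raise any exception

table_rows = []
with SketchTransaction(table_rows) as tx:
    tx.insert([{"depth": 100}])
# table_rows now contains the inserted row
```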

Schema management

Alter the table schema and re-ingest data, or remove data:

import pyarrow as pa

new_schema = pa.schema([
    pa.field("latitude", pa.float64()),
    pa.field("depth", pa.int32()),
    pa.field("quality", pa.int8())
])
table.alter(new_schema)  # re-ingest with new schema

table.truncate()  # remove all rows
table.drop()      # remove table entirely
new_schema <- arrow::schema(
    latitude = arrow::float64(),
    depth = arrow::int32(),
    quality = arrow::int8()
)
table$alter(new_schema)   # re-ingest with new schema

table$truncate()          # remove all rows
table$drop()              # remove table entirely

Files

Besides a table with a schema, a dataset can also have zero or more files associated with it.

ds = cli.dataset("5cf13ecb-6e12-41b8-8cfc-9c9c6eb2065d")  # Aker Biomarine Echo Sounding data
all_files = ds.files.list()
ds <- client$dataset("5cf13ecb-6e12-41b8-8cfc-9c9c6eb2065d")  # Aker Biomarine Echo Sounding data
all_files <- ds$files$list()

If you are an editor, you can upload more files:

ds = cli.dataset(MY_DATASET_ID)

fid = ds.files.upload("hello.txt", b"Hello World!")
ds.files.delete(fid)  # delete the file we just uploaded

with open("file.csv", "rb") as f:
    fid = ds.files.upload("north_sea.csv", f)
    ds.files.update_meta(fid, {
        "name": "north_sea.csv",
        "mime-type": "application/octet-stream",
        "geometry": "POLYGON ((0 51, 4 51, 9 54, 9 59, 5 61, 1 61, 0 58, 0 51))",
        "format": "csv",
    })
ds <- client$dataset(MY_DATASET_ID)

fid <- ds$files$upload("hello.txt", charToRaw("Hello World!"))
ds$files$delete(fid)  # delete the file we just uploaded

content <- paste(readLines("file.csv"), collapse = "\n")
fid <- ds$files$upload("north_sea.csv", content)
ds$files$update_meta(fid, list(
    name = "north_sea.csv",
    geometry = "POLYGON ((0 51, 4 51, 9 54, 9 59, 5 61, 1 61, 0 58, 0 51))",
    format = "csv"
))

It might be easier to upload and manage files using the ODP web interface.

Ingest

Some files can be ingested automatically into the dataset's table. To ingest a file via the SDK:

ds = cli.dataset(MY_DATASET_ID)

with open("file.csv", "rb") as f:
    fid = ds.files.upload("file.csv", f)

ds.files.ingest(fid)
ds <- client$dataset(MY_DATASET_ID)

content <- paste(readLines("file.csv"), collapse = "\n")
fid <- ds$files$upload("file.csv", content)

ds$files$ingest(fid)

If no table exists, a new table is created based on the file data.

You can use the opt parameter to specify what should happen to an existing table when you ingest data:

  • "append" — Add new data to the existing table.
  • "truncate" — Remove all existing data and add new data.
  • "drop" — Drop the existing table and create a new one with the new data.

By default, data is appended. If the data has a different schema than what already exists in the table, an error is raised.
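The three modes, plus the schema check on the default append, can be sketched against a toy in-memory table (how truncate treats a schema change is an assumption here; the function and table layout are illustrative, not the SDK's API):

```python
def ingest(table, new_rows, new_schema, opt="append"):
    """Sketch of the three ingest modes.

    table: dict with 'schema' (list of column names) and 'rows'.
    """
    if opt == "drop":
        table["schema"] = new_schema       # rebuild with the new layout
        table["rows"] = list(new_rows)
    elif opt == "truncate":
        table["rows"] = list(new_rows)     # keep schema, replace data
    elif opt == "append":
        if table["schema"] != new_schema:
            raise ValueError("schema mismatch: cannot append")
        table["rows"].extend(new_rows)
    return table

tbl = {"schema": ["lat", "depth"], "rows": [(1.0, 10)]}
ingest(tbl, [(2.0, 20)], ["lat", "depth"])               # append (default)
ingest(tbl, [(3.0, 30)], ["lat", "depth"], "truncate")   # replace all rows
```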

Troubleshooting

  • Increase or decrease the timeout when fetching data to control how long a call waits before returning a cursor.
  • Keep the ODP_API_KEY secret; never commit it to source control.