ODP Python SDK
The new ODP Python SDK provides a simple and efficient way to access datasets on the Ocean Data Platform.
You can install the SDK using pip:
pip install -U odp-sdk
And then use it to connect to the Ocean Data Platform:
from odp.client import Client
cli = Client()
The client will open your browser to perform the authentication process. If you prefer, you can pass an api_key to the Client constructor to authenticate without opening a browser:
from odp.client import Client
cli = Client(api_key=MY_API_KEY)
You can generate an API key in the Ocean Data Platform web interface, under your user profile.
Once the Client is initialized, you need to give it a dataset ID to access a specific dataset. Normally you will use the web UI to find the dataset you want to work with, and then use its dataset ID to access it in the SDK:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
glodap now refers to the GLODAP dataset, which is well known for its globally calibrated ocean carbon data, contains over a million data points, and is widely used by oceanographers. You can use it to retrieve the data, or to modify it if you are an editor.
Tabular
Each dataset may have tabular storage, which is defined by a schema that you can retrieve with:
import pyarrow as pa
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
try:
    schema: pa.Schema = glodap.table.schema()
    print(schema)
except FileNotFoundError:
    print("This dataset does not have a tabular storage.")
The schema is a pyarrow.Schema object, which you can use to understand the structure of the data.
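For example, you can list the column names and types with the standard pyarrow.Schema API (a minimal sketch; nothing here is specific to the ODP SDK):
# Iterate over the fields of the schema obtained above.
for field in schema:
    print(field.name, field.type, "nullable" if field.nullable else "required")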
select()
If you want to retrieve some of the data, you can use the select() method:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select().batches():
    print(batch)
This will return all the data in the dataset table, in chunks of pyarrow.RecordBatch objects.
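For example, a minimal sketch that counts the rows in the table without ever holding more than one chunk in memory (num_rows is standard pyarrow.RecordBatch API):
total_rows = 0
for batch in glodap.table.select().batches():
    total_rows += batch.num_rows  # each chunk is a pyarrow.RecordBatch
print(f"total rows: {total_rows}")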
filter
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select(filter='G2year >= 2020 AND G2year < 2025').batches():
    print(batch)
You can use the optional filter argument to filter the data using arrow-like expressions or an SQL-like syntax. Most pyarrow expressions are supported, and some special syntax has been added for geospatial queries:
ds.table.select(filter='geometry within "POLYGON ((-10 50, -5 50, -5 55, -10 55, -10 50))"').batches()
- intersect: check if the two polygons intersect
- within: check if the polygon on the left is inside the polygon on the right
- contains: check if the polygon on the left contains the polygon on the right
- is null: check if the value of the field is null
- is not null: check if the value of the field is not null
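For example, a short sketch combining one of these operators with a regular comparison in the same filter string (assuming the operators compose with AND as shown above):
# Rows from 2020 onward that actually have a salinity measurement.
for batch in glodap.table.select(filter='G2salinity is not null AND G2year >= 2020').batches():
    print(batch)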
columns
You can specify which columns you want to retrieve using the columns argument:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
for batch in glodap.table.select(columns=['G2tco2', 'G2year']).batches():
    print(batch)
vars
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
year_min = 2020
year_max = 2025
for batch in glodap.table.select(
    filter='G2year >= $start_year AND G2year < $end_year',
    vars={'start_year': year_min, 'end_year': year_max},
).batches():
    print(batch)
You can bind variables to the filter using the vars argument. It works with either named or positional parameters; the example above uses named variables. For positional parameters, use G2year >= ? AND G2year < ? and pass the values in the vars argument as a list:
for batch in glodap.table.select(filter='G2year >= ? AND G2year < ?', vars=[year_min, year_max]).batches():
    print(batch)
streaming vs full table
The select() method returns a cursor object which can be scanned to retrieve the data in chunks. You can get the native pyarrow.RecordBatch objects by chaining select(...).batches(). You can also get each chunk as a dataframe using select(...).dataframes(), or use select(...).rows() if you want Python dictionaries for each row.
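For example, a minimal sketch of the two other streaming forms described above (assuming dataframes() yields one pandas dataframe per chunk and rows() yields one dictionary per row):
# Stream the data as pandas dataframes, one chunk at a time.
for df in glodap.table.select(filter='G2year >= 2020').dataframes():
    print(len(df))

# Stream the data as plain Python dictionaries, one per row.
for row in glodap.table.select(filter='G2year >= 2020').rows():
    print(row['G2year'])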
All of the above stream the data, meaning that only the chunk currently being processed is kept in memory. This is very efficient for large datasets, and fast, since the first chunk becomes available much sooner. When fetching a limited amount of data, streaming can be inconvenient, so you can use the select(...).all() method to retrieve the entire result at once. You can then chain it to retrieve different formats:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
df = glodap.table.select("G2year==2018").all().dataframe() # pandas dataframe
Aggregation
Often you need aggregated data from a table, for example the average salinity per year in a specific region:
from odp.client import Client
cli = Client()
glodap = cli.dataset("aea06582-fc49-4995-a9a8-2f31fcc65424")
df = glodap.table.aggregate(
    filter='G2year >= 2020 AND G2year < 2025',
    group_by="G2year",
    aggr={
        'G2salinity': 'avg',
    },
)
This will return a pandas dataframe indexed by the unique values of G2year, with two columns: * for the number of rows in each group and G2salinity for the average salinity.
The filter works the same way as in the select() method, and you can use the vars argument to bind variables.
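For example, a sketch of the same aggregation with bound variables, assuming aggregate() accepts the vars argument exactly like select() does:
df = glodap.table.aggregate(
    filter='G2year >= $start_year AND G2year < $end_year',
    vars={'start_year': 2020, 'end_year': 2025},
    group_by="G2year",
    aggr={'G2salinity': 'avg'},
)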
functions
You can group the aggregation by h3 tiles, a hexagonal grid system that is useful for spatial data. To do so, use h3(my_geo_field, resolution) as the group_by argument, where my_geo_field is the name of the column containing the geometry data and resolution is an integer between 0 and 15 (the higher the resolution, the smaller the hexagons).
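For example, a hedged sketch on a hypothetical dataset with a geometry column named geom and a numeric column named temperature (the dataset ID and column names are placeholders, not real GLODAP fields):
# Hypothetical dataset with a geometry column "geom" and a numeric column "temperature".
ds = cli.dataset(MY_SPATIAL_DATASET_ID)
# Average temperature per H3 tile at resolution 4.
df = ds.table.aggregate(
    group_by='h3(geom, 4)',
    aggr={'temperature': 'avg'},
)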
If you want to aggregate by buckets of numbers, you can use the bucket(my_field, val0, val1, ...) function, where my_field is the name of the column containing the numeric data and val0, val1, etc. are the boundaries of the buckets. You will get a list of rows where the index is the index of the bucket, so if you call bucket(age, 10, 20, 30) you will get all the data less than or equal to 10 in bucket 0, the data between 10 and 20 in bucket 1, between 20 and 30 in bucket 2, and all the data greater than 30 in bucket 3.
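For example, a sketch that buckets GLODAP rows by year and averages the salinity per bucket, assuming bucket(...) is passed in the group_by argument in the same way as h3(...):
# Bucket 0: years up to 2000, bucket 1: 2000-2010, bucket 2: 2010-2020, bucket 3: after 2020.
df = glodap.table.aggregate(
    group_by='bucket(G2year, 2000, 2010, 2020)',
    aggr={'G2salinity': 'avg'},
)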
aggregation and streaming
Internally, the SDK performs a two-step aggregation:
1. the server scans each partition of the table and computes partials, which are sent to the client
2. the client receives the stream of partials and locally builds the final aggregation
This means that even though memory usage is kept low, the final result only becomes available once all the partials have been received.
Modifying data
You need to be an editor of a dataset to modify its tabular data.
The easiest way to become an editor is to create a new dataset yourself in ODP under My Data. You will first have to create the table with a pyarrow.Schema object, which defines the structure of the data:
from odp.client import Client
import pyarrow as pa
cli = Client()
ds = cli.dataset(MY_DATASET_ID)  # use the ID of the dataset you created earlier
schema = pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("value", pa.float64(), False),
])
ds.table.create(schema)
This creates a new table with four columns: id, geo, name, and value. The metadata on the geo column makes it a geometry column (encoded as WKT) and indexes it for fast spatial queries.
You can then insert or modify the data using a transaction:
with ds as tx:
    tx.insert([
        {"id": 1, "geo": "POINT(1 1)", "name": "point1", "value": 1.0},
        {"id": 2, "geo": "POINT(2 2)", "name": "point2", "value": 2.0},
    ])
This will insert two rows into the table and then commit the transaction (unless an exception is raised, in which case the transaction will be completely rolled back).
You can also replace rows using the replace() method:
with ds as tx:
    for row in tx.replace(filter='id = 1').rows():
        # the row has just been removed from the table
        row['value'] = 10.0  # modify the value
        tx.insert(row)       # insert it back with the updated value
It is not required that every row returned by replace() is inserted back: you can skip a row to remove it, or insert something else, even multiple rows. You can also fetch batches instead of rows when speed is important, or use the delete() method to simply remove rows efficiently.
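For example, a hedged sketch assuming delete() accepts the same filter syntax used by select() and replace() (the exact signature is not shown above):
with ds as tx:
    # Remove every row whose value is negative (hypothetical condition).
    tx.delete(filter='value < 0')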
Alter the table schema
import pyarrow as pa
from odp.client import Client
cli = Client()
ds = cli.dataset(MY_DATASET_ID)
ds.table.alter(schema=pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("temp", pa.float64(), False),   # rename value to temp
    pa.field("new_col", pa.string(), True),  # add a new column
]), from_names={"temp": "value"})  # rename the value column to temp
This will change the schema of the table, adding a new nullable column new_col and renaming the value column to temp.
Types can be changed as well, and the backend will convert the existing data to the new types.
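For example, a sketch that narrows the temp column to 32-bit floats using the same alter() call shown above; the conversion of existing values is handled by the backend, as stated:
ds.table.alter(schema=pa.schema([
    pa.field("id", pa.int64(), False),
    pa.field("geo", pa.string(), True, metadata={"isGeometry": "1", "index": "1"}),
    pa.field("name", pa.string(), False),
    pa.field("temp", pa.float32(), False),   # was float64; existing values are converted
    pa.field("new_col", pa.string(), True),
]))  # from_names omitted here: no columns are renamed (assumption)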
It might be wise to use the web UI to alter the schema, as it will give you a preview, a better idea of what will happen, and more details on the various options available.
Files
While a dataset can have a table with a schema, it can also have zero or more files associated with it.
from odp.client import Client
cli = Client()
ds = cli.dataset("5cf13ecb-6e12-41b8-8cfc-9c9c6eb2065d") # Aker Biomarine Echo Sounding data
all_files = ds.files.list()
If you are an editor, you can upload more files:
from odp.client import Client
cli = Client()
ds = cli.dataset(MY_DATASET_ID)
file_id = ds.files.upload("hello.txt", b"Hello World!")
ds.files.delete(file_id)  # delete the file we just uploaded
with open("file.csv", "rb") as f:
    fid = ds.files.upload("north_sea.csv", f)

ds.files.update_meta(fid, {
    "name": "north_sea.csv",
    "mime-type": "application/octet-stream",
    "geometry": "POLYGON ((0 51, 4 51, 9 54, 9 59, 5 61, 1 61, 0 58, 0 51))",
    "format": "csv",
})
It might be easier to upload and manage files using the ODP web interface.
Ingest
If you tested the UI above, you probably noticed that some files can be automatically ingested into the table.
If no table exists, a new table is created based on the file data; otherwise, you are given the option to append to, overwrite, or recreate the table.