Tabular V2 - Reference
table methods
Table.create()
tab.create(pa_schema)
expects a valid pyarrow.Schema
will raise FileAlreadyExists if the table already has a schema
metadata
some fields in the metadata have special meaning, and will change the behaviour of the table:
index: the column will be used to dynamically partition the data, giving a boost to queries on that field
isGeometry: the column will be expected to contain either WKT or WKB (based on the pyarrow type)
big: if set, the column can contain long text or binary data
experimental metadata:
aggr: specify how aggregation should be done on this field: sum, count, avg, min or max
isLat: specify that a column is a latitude
isLon: specify that a column is a longitude
Table.drop()
drop the schema and all the data in the table. this operation is not reversible
tab.drop()
select
fetch rows from a table, with optional filtering and column selection; returns a cursor:
for row in tab.select().rows():
    print(row)
Cursor
when selecting, a cursor will be returned. the cursor must be iterated over to perform any action.
To iterate over a cursor, you must specify how you want to fetch the data:
* rows(): will return a generator of rows
* dataframes(): will return a generator of pandas.DataFrame
* pages(N): will return a generator of pages, each page containing N rows
* batches(): will return a generator of pyarrow.RecordBatch
Note: with pages(0), the cursor will return a page as soon as a chunk is obtained from the server
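The paging behaviour can be illustrated with a small stand-alone sketch (an illustration of the semantics, not the library's implementation):

```python
from itertools import islice

def pages(rows, n):
    # sketch of pages(N): yield lists of up to n rows each
    it = iter(rows)
    while True:
        page = list(islice(it, n))
        if not page:
            return
        yield page

result = list(pages(range(5), 2))
# → [[0, 1], [2, 3], [4]]
```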
filtering
The first argument of a select is the filter that will be applied to the data. If no filter is specified, all the data will be returned.
The syntax for filters is similar to pyarrow.compute.Expression, with only a few differences.
tab.select('age < 26')
tab.select('age < 26 and name == "Alice"')
tab.select('age < 26 or name == "Alice"')
tab.select('age + 10 < 26')
tab.select('age < (1+2)*3')
to support geospatial queries we have the following operators:
tab.select('geofield intersect "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"')
tab.select('geofield contains "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"')
tab.select('geofield within "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"')
tab.select('geofield ~ "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))"')
The last operator acts as a greedy match, and will return all the rows whose geometry might intersect the given one. It is designed to be as fast as possible, but the user must filter out the false positives afterwards. This can be very useful for large datasets.
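A minimal sketch of why a greedy match produces false positives: comparing bounding boxes is cheap, but boxes can overlap even when the geometries themselves do not (the bounding-box logic below is assumed for illustration):

```python
def bbox(points):
    # axis-aligned bounding box of a list of (x, y) points
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_overlap(a, b):
    # two boxes overlap unless one is entirely left/right/above/below the other
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# segment a runs (0,0)-(1,1); segment b runs (0,1)-(0.1,0.9):
# their boxes overlap, but the segments themselves never touch
a = [(0, 0), (1, 1)]
b = [(0, 1), (0.1, 0.9)]
greedy_hit = boxes_overlap(bbox(a), bbox(b))  # True: a false positive
```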
column selection
you can specify a subset of columns to be returned, further limiting the computation and network needs, by specifying the argument cols as a list of strings:
for row in tab.select(cols=["id", "name"]).rows():
    print(row)
bind variables
since pyarrow.compute does not play well with datetime or timestamp objects, or more generally where validation of input is needed, we allow the user to use bind variables in the query:
tab.select('age < 26') # no bind vars
tab.select('age < $age', vars={"age": 26}) # bind by name
tab.select('age < ?', vars=[26]) # bind by position
variables can be specified as named or positional.
This becomes necessary when using structured types, like timestamp:
tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("ts", pa.timestamp("ms"), nullable=False),
]))
tab.select('ts < ?', vars=[datetime.datetime.now()])
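A rough illustration of the two binding styles follows. This only shows how named and positional variables map onto placeholders in the query text; the library presumably validates and serializes the values out-of-band rather than splicing strings:

```python
import re

def bind(query, vars):
    # illustrative sketch only: $name placeholders come from a dict,
    # ? placeholders are consumed from a list in order
    if isinstance(vars, dict):
        return re.sub(r"\$(\w+)", lambda m: repr(vars[m.group(1)]), query)
    vals = iter(vars)
    return re.sub(r"\?", lambda m: repr(next(vals)), query)

q1 = bind("age < $age", {"age": 26})  # bind by name
q2 = bind("age < ?", [26])            # bind by position
# → both "age < 26"
```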
Table.aggregate(group_by, query=None, aggr=None)
experimental
allows aggregation of data using a group by expression, an optional filter, and an aggr definition
tab.create(pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("country", pa.string(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
    pa.field("age", pa.int32(), nullable=False, metadata={"aggr": "avg"}),
]))
with tab as tx:
    tx.insert([
        {"id": "1", "name": "Alice", "country": "UK", "age": 25},
        {"id": "2", "name": "Bob", "country": "UK", "age": 30},
        {"id": "3", "name": "Charlie", "country": "US", "age": 35},
    ])
data = tab.aggregate('country')
# UK 27.5
# US 35
data is returned as a DataFrame with an index set to the group by field.
group by
the group by expression is mandatory, and can be a field or an expression:
tab.aggregate('country') # will return 2 rows, one for UK and one for US
tab.aggregate('age>30') # will return 2 rows, with True and False each
there are some special functions that can be used here, like h3:
h3(lat, lon, res)
this special function can be used on a point cloud to group the data by h3 tile, given a resolution:
tab.aggregate('h3(lat, lon, 8)') # compute the h3 tile of each row based on lat and lon
this assumes that lat and lon contain valid latitude and longitude values; refer to the h3 documentation for more information
Filtering
the query clause is optional, and can be used to filter the data before aggregation. It follows the same syntax as the select method:
tab.aggregate("country", query='age > 30')
aggr
the aggr specifies which fields need to be aggregated, and how.
if not specified, the table schema is used to determine the aggregation method.
if no aggregation method is specified, the field is skipped.
tab.aggregate("country", aggr={"age": "max", "name": "count"})
currently, we support aggregation by avg, sum, min, max and count.
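The aggregation semantics can be sketched in plain Python (an illustration of the behaviour only; the real method runs server-side and returns a DataFrame):

```python
from collections import defaultdict

def aggregate(rows, group_by, aggr):
    # sketch: bucket rows by the group-by field, then reduce each bucket
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)
    out = {}
    for key, members in groups.items():
        result = {}
        for field, how in aggr.items():
            vals = [m[field] for m in members]
            if how == "avg":
                result[field] = sum(vals) / len(vals)
            elif how == "sum":
                result[field] = sum(vals)
            elif how == "min":
                result[field] = min(vals)
            elif how == "max":
                result[field] = max(vals)
            elif how == "count":
                result[field] = len(vals)
        out[key] = result
    return out

rows = [
    {"country": "UK", "age": 25},
    {"country": "UK", "age": 30},
    {"country": "US", "age": 35},
]
res = aggregate(rows, "country", {"age": "avg"})
# → {"UK": {"age": 27.5}, "US": {"age": 35.0}}
```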
transactions
transactions should be used to provide atomicity to the operations on the table:
with tab as tx:
    ...
this creates a context, and inside this context you can safely perform operations on the table. Those operations won't be visible to anyone until the block is exited and the transaction is committed.
If the block exits with an exception, the whole transaction is rolled back and no changes are applied.
Internally, this is done by creating a transaction with begin, and then calling commit or rollback on exit.
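The begin/commit/rollback pattern can be sketched with a minimal context manager (a stand-in class, not the library's internals):

```python
class Tx:
    # sketch of the pattern: commit on clean exit, rollback on exception
    def __init__(self):
        self.state = "new"

    def __enter__(self):
        self.state = "begun"
        return self

    def __exit__(self, exc_type, exc, tb):
        self.state = "rolled back" if exc_type else "committed"
        return False  # let the exception propagate

tx = Tx()
with tx:
    pass
# tx.state → "committed"

tx2 = Tx()
try:
    with tx2:
        raise ValueError("boom")
except ValueError:
    pass
# tx2.state → "rolled back"
```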
TX.insert()
you can append data to a table using the insert method:
with tab as tx:
    tx.insert({"id": "1", "name": "Alice", "age": 25})
    tx.insert({"id": "2", "name": "Bob", "age": 30})
insert() also allows you to use list[dict], pyarrow.RecordBatch or pandas.DataFrame as input.
make sure the schema is correct and the data is valid, otherwise an exception will be raised.
TX.delete()
you can delete data from a table using the delete method:
with tab as tx:
    tx.delete('age < 26')
TX.replace()
you cannot update the data on the table directly, but you can replace some of the data with something else:
with tab as tx:
    for row in tx.replace('age < 26'):
        if row["old"]:
            row["age"] += 30
            tx.insert(row)
This will remove all the rows that match the filter, and return them to the caller. The caller can then decide what to do with each row, insert it back, discard it, insert other rows.
Note: replace computes what to do first, and then returns the rows. inserted rows won't affect the current iteration.
Note: there are limitations on replacing what was inserted in the same transaction.
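A sketch of the replace semantics on a plain list of dicts (illustrative only; note how the matched rows are snapshotted before any re-insert, so inserted rows don't affect the iteration):

```python
def replace(rows, predicate):
    # sketch: remove matching rows first, then hand them back
    # so the caller can decide to re-insert, modify, or discard them
    matched = [r for r in rows if predicate(r)]
    rows[:] = [r for r in rows if not predicate(r)]
    return matched  # snapshot: later inserts don't feed back into it

table = [{"id": "1", "age": 25}, {"id": "2", "age": 30}]
for row in replace(table, lambda r: r["age"] < 26):
    row["age"] += 30
    table.append(row)
# table → [{"id": "2", "age": 30}, {"id": "1", "age": 55}]
```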
BigCol
If a row has a field with large data, the data is stored in a separate buffer and a reference to the buffer is stored in the table instead.
this happens automatically, and the user will get back the full data when querying.
filtering on such a field is also performed on the client after fetching the full data, so this process should be transparent to the user.
The key benefit is to keep the data streams small and fast, and to cache the big files separately.
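A toy sketch of the BigCol idea (the size threshold and reference format are invented for illustration; the real cutoff and storage are library-defined):

```python
SIZE_LIMIT = 64  # hypothetical threshold

def store(row, buffers):
    # large values go to a side buffer; the table keeps only a reference
    out = {}
    for key, value in row.items():
        if isinstance(value, (str, bytes)) and len(value) > SIZE_LIMIT:
            ref = f"buf:{len(buffers)}"
            buffers[ref] = value
            out[key] = ref
        else:
            out[key] = value
    return out

def load(row, buffers):
    # resolve references back to the full data when querying
    return {k: buffers.get(v, v) if isinstance(v, str) else v
            for k, v in row.items()}

buffers = {}
stored = store({"id": "1", "doc": "x" * 100}, buffers)
# stored["doc"] → "buf:0"; load(stored, buffers) restores the full text
```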
bSquare
We convert any polygon added to the system to a bounding square, and use the geometry of this square to index the data.
This provides a convenient way to index the data with low false positives; the client can then perform the more complex query on the full polygon client-side, especially if the polygon is big and stored in a BigCol.
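The bounding-square construction can be sketched as follows (illustration only; the real indexing details are internal to the library):

```python
def bounding_square(points):
    # smallest axis-aligned square containing all (x, y) points,
    # anchored at the minimum corner
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    side = max(max(xs) - min(xs), max(ys) - min(ys))
    return (min(xs), min(ys), min(xs) + side, min(ys) + side)

# a 2x1 rectangle is indexed as a 2x2 square:
sq = bounding_square([(0, 0), (2, 0), (2, 1), (0, 1)])
# → (0, 0, 2, 2)
```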