Raw

With the new tabular controller, it is possible to upload files alongside a table.

Those files are not part of the table, and may be hidden from readers.

They can be used to ingest data into a table, either:
* directly (creating the table from them)
* appending (using the table schema to convert them)

A new metadata field will become relevant for this: ingest_name

Handling files

Obtain a table handle using a dataset or dataset uuid, as described here.

You can then use the raw controller to manage the files:

id = tab.raw.upload("my_file.ext", binary_data)  # upload a file, get back an id for it
tab.raw.update_meta(id, {"key": "value"})  # update the metadata of the file

for file in tab.raw.list():  # show all the files, with their ids, names and other metadata
    print(file)

tab.raw.download(id)  # download a file by id
tab.raw.delete(id)  # delete a file by id

It is also possible to identify a file by its filename, provided the filename is unique within the dataset:

tab.raw.upload("test.csv", binary_data)
tab.raw.update_meta("test.csv", {"key": "value"})

tab.raw.download("test.csv")

tab.raw.delete("test.csv")
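The lookup rule above (an id, or a filename when it is unique) can be illustrated with a small in-memory stand-in. This is only a sketch: RawStore and resolve are hypothetical names that are not part of the tabular controller, and raising an error for an ambiguous filename is an assumption about how such a case might be handled.

```python
import uuid


class RawStore:
    """Hypothetical in-memory stand-in for a raw file store (not the real controller)."""

    def __init__(self):
        self._files = {}  # file id -> {"name": ..., "data": ...}

    def upload(self, name, data):
        file_id = str(uuid.uuid4())
        self._files[file_id] = {"name": name, "data": data}
        return file_id

    def resolve(self, id_or_name):
        """Return the file id for an id, or for a filename that is unique."""
        if id_or_name in self._files:
            return id_or_name
        matches = [fid for fid, f in self._files.items() if f["name"] == id_or_name]
        if len(matches) == 1:
            return matches[0]
        if not matches:
            raise KeyError(f"no file with id or name {id_or_name!r}")
        # Assumption: an ambiguous filename is rejected rather than guessed.
        raise ValueError(f"filename {id_or_name!r} is ambiguous; use the id instead")

    def download(self, id_or_name):
        return self._files[self.resolve(id_or_name)]["data"]


store = RawStore()
first = store.upload("test.csv", b"a,b\n1,2\n")
assert store.download("test.csv") == b"a,b\n1,2\n"  # a unique name works

second = store.upload("test.csv", b"a,b\n3,4\n")  # the same name again
assert store.download(second) == b"a,b\n3,4\n"  # ids always work
```

Once two files share a name, only the id identifies either of them, which is why keeping the id returned by upload is the safer habit.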

Metadata

Some metadata is automatically added to the file when uploaded:
* id: the id of the file
* name: the name of the file
* size: the size of the file
* created: the date of creation
* updated: last time updated

Other metadata can be computed automatically, based on the file format.
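To make the automatic fields concrete, here is a minimal sketch of what such a metadata record could look like. The helpers build_file_meta and update_meta are hypothetical, and the value formats (size in bytes, ISO 8601 UTC timestamps, a uuid as id) are assumptions, since the text above only names the fields.

```python
import uuid
from datetime import datetime, timezone


def build_file_meta(name, data):
    """Hypothetical helper: build the automatic metadata for an uploaded file."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "id": str(uuid.uuid4()),  # assumption: ids are uuid strings
        "name": name,
        "size": len(data),  # assumption: size is counted in bytes
        "created": now,
        "updated": now,
    }


def update_meta(meta, changes):
    """Merge user-supplied keys and refresh the `updated` timestamp."""
    merged = {**meta, **changes}
    merged["updated"] = datetime.now(timezone.utc).isoformat()
    return merged


meta = build_file_meta("test.csv", b"a,b\n1,2\n")
meta = update_meta(meta, {"key": "value"})
print(meta["name"], meta["size"], meta["key"])
```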

Ingest from files to table

tab.raw.ingest(id)  # ingest a previously uploaded file by id or filename

If the table does not exist, a new table is created from the file data.

Otherwise, the file is converted to the schema of the table and appended.
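The two cases (create the table, or convert and append) can be sketched with plain Python structures. This is an illustration only: the ingest function below is a hypothetical stand-in that works on lists of dicts, not the controller's implementation, and deriving the schema from the first row is an assumption.

```python
def ingest(tables, table_name, rows):
    """Create the table from `rows`, or convert `rows` to its schema and append."""
    if table_name not in tables:
        # No table yet: derive the schema from the first row of the file.
        schema = {column: type(value) for column, value in rows[0].items()}
        tables[table_name] = {"schema": schema, "rows": []}
    table = tables[table_name]
    for row in rows:
        # Convert every value to the type recorded in the table schema.
        converted = {column: cast(row[column]) for column, cast in table["schema"].items()}
        table["rows"].append(converted)
    return table


tables = {}
ingest(tables, "readings", [{"n": 1, "label": "a"}])    # creates the table
ingest(tables, "readings", [{"n": "2", "label": "b"}])  # "2" is converted to int
print(tables["readings"]["rows"])
# [{'n': 1, 'label': 'a'}, {'n': 2, 'label': 'b'}]
```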

Only specific file formats are supported:

Parquet

This is pretty straightforward, since we use Arrow for IPC and already support most of its features.

CSV

CSV is very limited, and ingestion will only return simple data types. You can use alter later to improve the results.
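The limitation comes from the format itself: a CSV file carries no type information, so every cell is read as text and a reader can only guess. The sketch below uses the standard csv module and a deliberately simple guess (int, float or str). It is not the controller's inference logic, and infer_simple_type and infer_csv_schema are hypothetical names.

```python
import csv
import io


def infer_simple_type(values):
    """Pick the narrowest of int, float or str that fits every value."""
    for caster, type_name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "str"


def infer_csv_schema(data):
    """Return {column name: simple type name} for CSV bytes."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    rows = list(reader)
    return {
        column: infer_simple_type([row[column] for row in rows])
        for column in (reader.fieldnames or [])
    }


binary_data = b"id,price,day\n1,9.5,2024-01-31\n2,12,2024-02-01\n"
schema = infer_csv_schema(binary_data)
print(schema)
# {'id': 'int', 'price': 'float', 'day': 'str'}
```

In this sketch the day column stays a string, which is the kind of result a later alter would refine into a proper date type.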

Resilience to changes

Since the introduction of alter, the table schema is not static anymore.

When ingesting a file, we store the name each column had in the original file as ingest_name in the metadata.

This means that further re-ingestions will adapt to the schema changes seamlessly.
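A small sketch of the idea, assuming the table keeps an ingest_name next to each current column name. The structure of table_columns and the helper map_file_row are hypothetical, and dropping file columns that have no matching ingest_name is an assumption.

```python
# Hypothetical table schema after an alter: the column first ingested as
# "temp" has since been renamed to "temperature_c".
table_columns = [
    {"name": "station", "ingest_name": "station"},
    {"name": "temperature_c", "ingest_name": "temp"},
]


def map_file_row(row, columns):
    """Rename the keys of a file row from ingest_name to the current column name."""
    by_ingest_name = {col["ingest_name"]: col["name"] for col in columns}
    # Assumption: file columns unknown to the table are dropped.
    return {by_ingest_name[key]: value for key, value in row.items() if key in by_ingest_name}


file_row = {"station": "A1", "temp": 21.5}  # the file still uses the old name
mapped = map_file_row(file_row, table_columns)
print(mapped)
# {'station': 'A1', 'temperature_c': 21.5}
```

Because the match is made on ingest_name rather than on the current column name, the same file can be ingested again after a rename without editing it.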