Raw
With the new tabular controller, it is possible to upload files alongside a table.
Those files are not part of the table and may be hidden from readers.
They can be used to ingest data into a table, either:
* directly (creating the table from them)
* appending (using the table schema to convert them)
A new metadata field will become relevant for this: ingest_name
Handling files
Obtain a table handle using a dataset or dataset UUID, as described here.
Then you can use the raw controller to manage the files:
id = tab.raw.upload("my_file.ext", binary_data)  # upload a file, get back an id for it
tab.raw.update_meta(id, {"key": "value"})  # update the metadata of the file
for file in tab.raw.list():  # show all the files, with their ids, names and other metadata
    print(file)
tab.raw.download(id)  # download a file by id
tab.raw.delete(id)  # delete a file by id
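As a minimal sketch of how these calls fit together, the helper below reads a file from disk, uploads it, and tags it in one step. Both `upload_from_path` and `_StubRaw` are hypothetical: the stub only imitates the `upload` and `update_meta` calls shown above so that the example runs without a server, and its id format and stored fields are assumptions.

```python
import tempfile
from pathlib import Path


class _StubRaw:
    """Hypothetical in-memory stand-in for tab.raw (not the real controller)."""

    def __init__(self):
        self.files = {}

    def upload(self, name, data):
        file_id = str(len(self.files) + 1)  # assumed id format
        self.files[file_id] = {"id": file_id, "name": name, "size": len(data)}
        return file_id

    def update_meta(self, file_id, meta):
        self.files[file_id].update(meta)


def upload_from_path(raw, path, meta=None):
    """Read a local file and upload it through a raw controller."""
    path = Path(path)
    file_id = raw.upload(path.name, path.read_bytes())
    if meta:
        raw.update_meta(file_id, meta)
    return file_id


# Usage with the stub; with a real table handle you would pass tab.raw instead.
raw = _StubRaw()
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "my_file.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    file_id = upload_from_path(raw, sample, {"source": "example"})
```

Passing the controller as an argument keeps the helper independent of how the table handle was obtained.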
Metadata
Some metadata is automatically added to the file when uploaded:
* id: the id of the file
* name: the name of the file
* size: the size of the file
* created_at: the date of creation
Other metadata can be computed automatically based on the file format.
Ingest from files to table
tab.raw.ingest(id) # ingest a previously uploaded file by id
If the table does not exist yet, a new table is created from the file data.
Otherwise, the file is converted to the schema of the table and appended.
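The two paths can be sketched in plain Python. This illustrates the create-or-append logic only and is not the service's implementation: the `ingest_rows` function, the dict-based table, and the use of Python types as a schema are all assumptions.

```python
def ingest_rows(table, rows):
    """Create the table from rows, or convert rows to its schema and append.

    `table` is None or a dict {"schema": {column: type}, "rows": [...]}.
    Assumes `rows` contains at least one row.
    """
    if table is None:
        # No table yet: derive the schema from the first row of the file.
        schema = {name: type(value) for name, value in rows[0].items()}
        return {"schema": schema, "rows": [dict(r) for r in rows]}

    # Table exists: cast each value to the column type before appending.
    for row in rows:
        converted = {name: cast(row[name]) for name, cast in table["schema"].items()}
        table["rows"].append(converted)
    return table


# First ingestion creates the table, the second one is converted and appended.
table = ingest_rows(None, [{"id": 1, "price": 9.5}])
table = ingest_rows(table, [{"id": "2", "price": "3"}])
```

In this sketch the second file carries text values, which are cast to the integer and float columns defined by the first ingestion.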
Only specific file formats are supported:
Parquet
This is pretty straightforward, since we use Arrow for IPC and already support most of the features.
CSV
CSV is very limited, and ingestion will only produce simple data types. You can use alter later to improve the results.
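To illustrate the limitation: a CSV file carries no type information, so every value is read as text and any typing has to be guessed. The sketch below uses Python's standard `csv` module and a hypothetical `infer_simple_type` helper; the inference rules actually used during ingestion are not described here and may differ.

```python
import csv
import io


def infer_simple_type(values):
    """Guess int, float or str for a column of text values (illustrative rule)."""
    for cast in (int, float):
        try:
            for v in values:
                cast(v)
            return cast
        except ValueError:
            continue
    return str


data = "id,price,day\n1,9.5,2024-01-31\n2,3,2024-02-01\n"
rows = list(csv.DictReader(io.StringIO(data)))

# Every raw value is a string, whatever it looks like.
raw_types = {type(v) for row in rows for v in row.values()}

# Dates stay plain strings here; a richer type would need a later alter.
inferred = {name: infer_simple_type([r[name] for r in rows]) for name in rows[0]}
```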
Resilience to changes
Since the introduction of alter, the table schema is not static anymore.
When ingesting a file, we store the name each column had in the original file as ingest_name in the metadata.
This means that further re-ingestions will adapt to the schema changes seamlessly.
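A minimal sketch of the idea, assuming each column keeps its current name together with the `ingest_name` recorded at first ingestion, and assuming alter was used to rename a column. The `columns` layout and the `map_to_current_schema` helper are hypothetical and only illustrate the lookup.

```python
def map_to_current_schema(columns, file_row):
    """Rename the fields of a file row to the table's current column names.

    `columns` is a list of {"name": current_name, "ingest_name": name_in_file}.
    """
    return {col["name"]: file_row[col["ingest_name"]] for col in columns}


# Assumed scenario: the column was ingested as "ts" and later renamed to "timestamp".
columns = [
    {"name": "timestamp", "ingest_name": "ts"},
    {"name": "value", "ingest_name": "value"},
]

# A re-ingested file still uses the original header names.
row = map_to_current_schema(columns, {"ts": "2024-01-31", "value": 4})
```

Because the lookup goes through `ingest_name`, the file does not have to be edited when the table schema changes.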