Support for S3 catalog to work with S3 Tables #1404
Comments
Thanks for raising this @nicor88! Would you be interested in contributing this feature? |
The catalog implementation can be found here |
I also would be interested in this feature. |
@kevinjqliu Unfortunately I don't have the capacity at the moment to contribute to this feature. |
I'm also interested. I will have a look at the reference @nicor88 provided and create a PR if I can get something to work :) |
Super keen to see this happen too! |
It looks like the warehouse location of those S3 tables doesn't support List operations.
The issue seems to come from pyarrow, that does this check:
The |
I created an initial PR #1429 where I am currently working on supporting table creation. I ran into the same issue that @nicor88 described and could work around it by setting
I'm currently going through the pyarrow S3FileSystem implementation to see where this header is being introduced. EDIT: I tried using a different FileIO, and the issue disappears when using properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}, so it seems this is indeed specific to pyarrow. |
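For anyone who wants to try the FileIO switch mentioned in the EDIT, here is a minimal sketch. Only the "py-io-impl" key and its value come from the comment above; the helper name `with_fsspec_file_io` and the placeholder "warehouse" entry are hypothetical and only there for illustration.

```python
def with_fsspec_file_io(catalog_properties: dict) -> dict:
    """Return a copy of the catalog properties with the FileIO set to fsspec.

    The input dict is left untouched so the original (pyarrow-based)
    configuration can still be used for comparison.
    """
    props = dict(catalog_properties)
    # Key and value taken from the workaround described in the comment above.
    props["py-io-impl"] = "pyiceberg.io.fsspec.FsspecFileIO"
    return props


# Hypothetical base configuration; replace with your real catalog properties.
base_properties = {"warehouse": "<your-warehouse-location>"}
properties = with_fsspec_file_io(base_properties)
```

The resulting dict could then be passed as keyword properties when loading a catalog, for example `load_catalog("my_catalog", **properties)`, assuming the catalog implementation honors "py-io-impl".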
@felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the |
@jamesbornholt Great to hear that you folks are keeping an eye on this! IMO that would help accelerate adoption a lot; otherwise all the Iceberg implementations will need to integrate with S3Tables separately, and I have a feeling that maintenance will be non-trivial. |
+1 |
I just re-ran the tests using PyArrowFileIO and it seems to be fixed now, thank you! |
Feature Request / Improvement
Amazon S3 Tables has been launched (see this), and it looks like S3 tables have a managed Iceberg catalog.
Based on https://github.com/awslabs/s3-tables-catalog, it looks like AWS built an S3 catalog wrapper in Java that can be used by query engines like Spark/Trino.
It would be useful to be able to write to S3 tables via pyiceberg.
More context
Based on my understanding, once an S3 table is created, its Iceberg metadata is not initialized.
For a freshly created table, it's possible to retrieve the warehouseLocation -> see get_table.
The warehouseLocation looks like a unique S3 bucket into which you can put S3 objects.
After putting the S3 objects of an Iceberg commit operation (data + metadata), it's possible to use update_table_metadata_location to point the S3 table to the right location.
Note: I'm not 100% sure on the above - and I need to validate it via some tests.
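The flow described above could be sketched roughly as follows. This is untested against a real table bucket and rests on assumptions: `client` is expected to behave like `boto3.client("s3tables")`, the `versionToken` field and the keyword argument names reflect my reading of that API, and the helper name `point_table_at_metadata` and the `metadata_key` parameter are hypothetical. Uploading the data and metadata objects is assumed to have happened already.

```python
def point_table_at_metadata(client, table_bucket_arn, namespace, name, metadata_key):
    """Point an S3 table at a metadata file already uploaded to its warehouse.

    `client` is assumed to expose get_table and update_table_metadata_location
    like the boto3 "s3tables" client. `metadata_key` is the object key of the
    metadata file relative to the table's warehouseLocation.
    """
    # Look up the table to find where its objects live.
    table = client.get_table(
        tableBucketARN=table_bucket_arn, namespace=namespace, name=name
    )
    warehouse = table["warehouseLocation"].rstrip("/")
    metadata_location = f"{warehouse}/{metadata_key}"

    # Assumption: the version token returned by get_table is required so the
    # update only succeeds if nobody else changed the table in between.
    client.update_table_metadata_location(
        tableBucketARN=table_bucket_arn,
        namespace=namespace,
        name=name,
        versionToken=table["versionToken"],
        metadataLocation=metadata_location,
    )
    return metadata_location
```

With boto3 installed, `client` would be created with `boto3.client("s3tables")`; passing the client in keeps the helper easy to exercise with a stub.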