Support for S3 catalog to work with S3 Tables #1404

Open
nicor88 opened this issue Dec 5, 2024 · 12 comments
Comments

@nicor88

nicor88 commented Dec 5, 2024

Feature Request / Improvement

Amazon S3 Tables have been launched (see this), and it looks like S3 Tables have a managed Iceberg catalog.

Based on https://github.com/awslabs/s3-tables-catalog, it looks like AWS built an S3 catalog wrapper in Java that can be used by query engines like Spark/Trino.
It would be useful to be able to write to S3 Tables via PyIceberg.

More context

Based on my understanding, once an S3 table is created, its Iceberg metadata is not initialized.
For a freshly created table, it's possible to retrieve the warehouseLocation (see get_table).
The warehouseLocation looks like a unique S3 bucket into which you can put S3 objects.
After putting the S3 objects of an Iceberg commit operation (data + metadata), it's possible to use update_table_metadata_location to point the S3 table to the right location.

Note: I'm not 100% sure about the above, and I need to validate it with some tests.
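
As a rough illustration of the flow described above (unvalidated, as the note says), here is a minimal sketch. It assumes a boto3-style `s3tables` client whose `get_table` and `update_table_metadata_location` calls take the keyword arguments shown and whose `get_table` response includes `warehouseLocation` and `versionToken`. The `write_commit` callback is hypothetical: it stands in for whatever writes the data and metadata files and returns the new metadata file location.

```python
from typing import Any, Callable


def commit_metadata_to_s3_table(
    s3tables_client: Any,
    table_bucket_arn: str,
    namespace: str,
    name: str,
    write_commit: Callable[[str], str],
) -> str:
    """Write a commit under the table's warehouse location, then repoint the table."""
    # Look up the table to find where its files should live.
    table = s3tables_client.get_table(
        tableBucketARN=table_bucket_arn, namespace=namespace, name=name
    )
    warehouse_location = table["warehouseLocation"]

    # Hypothetical step: write data + metadata files and get back the
    # location of the new metadata file.
    new_metadata_location = write_commit(warehouse_location)

    # Point the S3 table at the newly written metadata file.
    # Assumption: the version token from get_table is passed back here.
    s3tables_client.update_table_metadata_location(
        tableBucketARN=table_bucket_arn,
        namespace=namespace,
        name=name,
        versionToken=table["versionToken"],
        metadataLocation=new_metadata_location,
    )
    return new_metadata_location
```

With boto3 installed and credentials configured, the client would presumably come from `boto3.client("s3tables")`; the function only relies on the two calls named in this issue.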

@kevinjqliu
Contributor

Thanks for raising this @nicor88! Would you be interested to contribute this feature?

@nlm4145

nlm4145 commented Dec 6, 2024

I also would be interested in this feature.

@nicor88
Author

nicor88 commented Dec 9, 2024

@kevinjqliu Unfortunately I don't have the capacity at the moment to contribute to this feature.
I would nevertheless be available to look at the PR and test the implementation.

@felixscherz
Contributor

I'm also interested. I will have a look at the reference @nicor88 provided and create a PR if I can get something to work :)

@petehanssens

Super keen to see this happen too!

@nicor88
Author

nicor88 commented Dec 12, 2024

It looks like the warehouse location of those S3 tables doesn't support List operations.
I tried to point my local warehouse (using SQLite) to the warehouse location of an S3 table, just to validate whether everything could work, and I got this error:

AWS Error UNKNOWN (HTTP status 405) during ListObjectsV2 operation: Unable to parse ExceptionName: MethodNotAllowed Message: The specified method is not allowed against this resource.

The issue seems to come from pyarrow, which does this check:

if not overwrite and self.exists() is True:
    raise FileExistsError(f"Cannot create file, already exists: {self.location}")
output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)

Under the hood, self.exists() triggers a list operation, which is not supported.
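
To make the failure mode concrete, here is a toy, self-contained sketch. `ListNotAllowedStore` and `OutputFile` are hypothetical stand-ins (not pyiceberg or pyarrow classes); the only thing taken from the snippet above is the shape of the `if not overwrite and self.exists() is True` check. Because `and` short-circuits, the existence check (and therefore the list call) is never reached when `overwrite` is true.

```python
from typing import Dict, List


class ListNotAllowedStore:
    """Toy in-memory store that rejects list calls, like the warehouse location above."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def list_objects(self, prefix: str) -> List[str]:
        # Mimics the HTTP 405 MethodNotAllowed seen during ListObjectsV2.
        raise PermissionError("405 MethodNotAllowed during ListObjectsV2")

    def put_object(self, key: str, data: bytes) -> None:
        self.objects[key] = data


class OutputFile:
    """Hypothetical output file whose existence check is implemented via a list call."""

    def __init__(self, store: ListNotAllowedStore, location: str) -> None:
        self.store = store
        self.location = location

    def exists(self) -> bool:
        return self.location in self.store.list_objects(self.location)

    def create(self, data: bytes, overwrite: bool = False) -> None:
        # Same shape as the quoted check: exists() only runs when overwrite is False.
        if not overwrite and self.exists() is True:
            raise FileExistsError(f"Cannot create file, already exists: {self.location}")
        self.store.put_object(self.location, data)
```

In this toy model, `create(data)` fails on the list call, while `create(data, overwrite=True)` skips the existence check and writes the object directly.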

@felixscherz
Contributor

felixscherz commented Dec 14, 2024

I created an initial PR, #1429, where I am currently working on supporting table creation. I ran into the same issue that @nicor88 described and could work around it by setting overwrite=True for now.
However, I now get a different error during the write operation for the table metadata:

AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01

I'm currently going through the pyarrow S3FileSystem implementation to see where this header is being introduced.

EDIT:

I tried using a different FileIO, and the issue disappears when using pyiceberg.io.fsspec.FsspecFileIO explicitly via:

properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

It seems like this is indeed specific to pyarrow.
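
For anyone wanting to try the same switch, here is a small sketch of how the property above might be merged into existing catalog properties. The helper name `with_fsspec_file_io` is hypothetical; the key and value are the ones quoted above.

```python
from typing import Dict, Optional

# Property key and FileIO class path as quoted in this comment.
PY_IO_IMPL = "py-io-impl"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"


def with_fsspec_file_io(properties: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the catalog properties that selects FsspecFileIO."""
    merged = dict(properties or {})
    merged[PY_IO_IMPL] = FSSPEC_FILE_IO
    return merged


# Usage sketch, assuming pyiceberg is installed and a catalog named
# "default" is configured:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("default", **with_fsspec_file_io())
```

Returning a copy rather than mutating the input keeps the original properties available for switching back to the pyarrow FileIO later.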

@jamesbornholt

@felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the x-amz-api-version exception you're seeing; hoping to have that out soon.

@buremba

buremba commented Dec 17, 2024

@jamesbornholt Great to hear that you folks are keeping an eye on this!
Sorry if this is not the right channel for the question, but considering S3 Tables is a mix of catalog and storage layer, is there any plan to provide Iceberg REST compatibility as part of S3 Tables, in addition to the current API?

IMO that would help accelerate adoption a lot; otherwise all the Iceberg implementations will need to integrate with S3 Tables separately, and I have a feeling that maintenance will be non-trivial.

@soumilshah1995

+1

@felixscherz
Contributor

> @felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the x-amz-api-version exception you're seeing; hoping to have that out soon.

I just re-ran the tests using PyArrowFileIO and it seems to be fixed now, thank you!


8 participants