add metadata blob_storage_total_files and blob_storage_file_index on azure blob storage input #89

Open
wants to merge 1 commit into
base: main

mrchypark

This PR adds two new metadata fields to the Azure Blob Storage input:

- blob_storage_total_files: The total number of files in the Azure Blob Storage container.
- blob_storage_file_index: The index of the file currently being processed.

These new metadata fields give users additional context about the progress of file processing in their Azure Blob Storage input.

Changes:

- Added totalFiles and currentIndex fields to the azureBlobStorage struct.
- Modified the Connect method to count the total number of files.
- Updated the blobStorageMetaToBatch function to include the new metadata fields.
- Incremented the currentIndex after processing each file in the ReadBatch method.

These changes will help users track the progress of their Azure Blob Storage input processing, especially when dealing with large numbers of files. The new metadata can be used for logging, monitoring, or implementing custom logic based on the processing progress.
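
To make the four changes above concrete, here is a minimal sketch of the counting and indexing logic. It does not use the Azure SDK or the real plugin code: blobInput, connect, and readNext are hypothetical stand-ins for the azureBlobStorage struct, Connect, and ReadBatch, the listing is an in-memory slice, and a zero-based index is an assumption (the PR description does not say where the index starts).

```go
package main

import "fmt"

// blobInput is a hypothetical stand-in for the azureBlobStorage struct.
// Only the two fields named in the PR description are modelled.
type blobInput struct {
	keys         []string // stands in for the container listing
	totalFiles   int
	currentIndex int
}

// connect counts the files once, mirroring the counting added to Connect.
func (b *blobInput) connect() {
	b.totalFiles = len(b.keys)
	b.currentIndex = 0
}

// readNext returns the next key together with the two metadata values,
// then advances the index. ok is false once every key has been read.
func (b *blobInput) readNext() (key string, meta map[string]int, ok bool) {
	if b.currentIndex >= len(b.keys) {
		return "", nil, false
	}
	key = b.keys[b.currentIndex]
	meta = map[string]int{
		"blob_storage_total_files": b.totalFiles,
		"blob_storage_file_index":  b.currentIndex, // zero-based: an assumption
	}
	b.currentIndex++
	return key, meta, true
}

func main() {
	in := &blobInput{keys: []string{"a.csv", "b.csv", "c.csv"}}
	in.connect()
	for {
		key, meta, ok := in.readNext()
		if !ok {
			break
		}
		fmt.Println(key, meta["blob_storage_file_index"], meta["blob_storage_total_files"])
	}
}
```

With this shape, a consumer can compare the index against the total to detect the last file, provided the listing does not change after connect.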

Testing:

- Tested the new metadata fields with various file counts in Azure Blob Storage containers.
- Verified that blob_storage_total_files remains constant throughout the processing.
- Confirmed that blob_storage_file_index increments correctly for each processed file.

Please review and let me know if any further changes or clarifications are needed.

	}
	return a, nil
}

func (a *azureBlobStorage) Connect(ctx context.Context) error {
	var err error
	a.keyReader, err = newAzureTargetReader(ctx, a.log, a.conf)

	// Count total files
	for {

Collaborator

What if a file is added after the connection is made? Won't the total then be stale?
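
The concern can be illustrated with a small sketch. The container type and snapshotCount function are hypothetical and in-memory; they only model "count once at connect time" and say nothing about how the Azure SDK behaves.

```go
package main

import "fmt"

// container is a hypothetical in-memory stand-in for a blob container.
type container struct{ blobs []string }

// snapshotCount records how many blobs exist at the moment it is called.
func snapshotCount(c *container) int { return len(c.blobs) }

func main() {
	c := &container{blobs: []string{"a", "b", "c"}}

	total := snapshotCount(c)      // taken once, at "connect" time
	c.blobs = append(c.blobs, "d") // a file uploaded afterwards

	// The stored total no longer matches the container contents.
	fmt.Printf("reported total=%d, actual=%d\n", total, len(c.blobs))
}
```

Whether the extra file is ever read depends on how the listing is fetched, which is the open question in this thread.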

Author

I wrote it considering only batch mode. After testing, it seems that in batch mode, the file list is fetched at the time of connection.

func newAzureTargetReader(ctx context.Context, logger *service.Logger, conf bsiConfig) (azureTargetReader, error) {
	if conf.FileReader == nil {
		return newAzureTargetBatchReader(ctx, conf)
	}
	return &azureTargetStreamReader{
		input: conf.FileReader,
		log:   logger,
	}, nil
}

Looking at this, azureTargetStreamReader is used only when conf.FileReader is set; otherwise the batch reader is created.

Author

@mrchypark mrchypark Aug 10, 2024

I still don't know whether the Azure SDK fixes the list at the point the input is created or refreshes it every time the pager fetches a page. Rather than relying on a precomputed total, it would be better if there were a way for the pager to indicate when it has reached the end.
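
One way to explore the "pager signals the end" idea is sketched below. The pager interface here is hypothetical: it is loosely modelled on the More/NextPage shape of the Azure SDK for Go's pager (whose NextPage also takes a context, omitted here), and slicePager is an in-memory implementation used only for illustration. The sketch flags the last item instead of reporting a total, so nothing needs to be counted up front.

```go
package main

import (
	"errors"
	"fmt"
)

// pager is a hypothetical interface, loosely modelled on More/NextPage.
type pager interface {
	More() bool
	NextPage() ([]string, error)
}

// slicePager serves pre-built pages from memory (illustration only).
type slicePager struct {
	pages [][]string
	next  int
}

func (p *slicePager) More() bool { return p.next < len(p.pages) }

func (p *slicePager) NextPage() ([]string, error) {
	if !p.More() {
		return nil, errors.New("no more pages")
	}
	page := p.pages[p.next]
	p.next++
	return page, nil
}

// drain visits every name and tells fn whether it is the final one,
// i.e. the last item of a page after which More reports false.
func drain(p pager, fn func(name string, last bool)) error {
	for p.More() {
		page, err := p.NextPage()
		if err != nil {
			return err
		}
		for i, name := range page {
			fn(name, i == len(page)-1 && !p.More())
		}
	}
	return nil
}

func main() {
	p := &slicePager{pages: [][]string{{"a.csv", "b.csv"}, {"c.csv"}}}
	err := drain(p, func(name string, last bool) {
		fmt.Println(name, "last:", last)
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

A caveat of this sketch: if the final page is empty, no item is ever flagged as last, so a real implementation would need to handle that case separately.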

@jem-davies jem-davies force-pushed the main branch 4 times, most recently from cca1170 to dc98f1c Compare November 7, 2024 18:49