Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Shopify Metafields Sync Performance #200

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

sgandhi1311
Copy link
Member

@sgandhi1311 sgandhi1311 commented Feb 12, 2025

Problem

The current implementation of metafields sync is inefficient because:

  1. It fetches all parent IDs (products, orders, etc.) first
  2. Makes additional API calls for each parent ID to fetch its metafields
  3. Many API calls are wasted on parents that have no metafields
  4. Results in high sync duration and excessive API usage

Solution

Implemented a hybrid approach that:

  1. Directly queries metafields list without fetching parent IDs first
  2. Only falls back to individual parent ID fetching when metafield count exceeds the max page limit (default set to 175)
  3. Uses GraphQL pagination for efficient data retrieval
  4. Maintains existing functionality for edge cases

Implementation Details

  • Added new GraphQL queries to fetch metafields directly
  • Implemented pagination handling at both resource and metafield levels
  • Added fallback mechanism for parents with > max page limit metafields
  • Maintained backwards compatibility with existing bookmark logic

Additional Changes

  1. Add retry for the interruptible sync error from the server side.
  2. Include the new field (= inventoryItem) in the product variants schema.

Benefits

  • Significantly reduced API calls
  • Faster sync completion time
  • Lower resource utilization

Testing

  • Verified data consistency with old approach
  • Confirmed proper handling of bookmarks
  • Performed PR-alpha on the customer connection

Future Considerations

  • Monitor performance on very large stores
  • Consider implementing parallel processing for large datasets or using bulk APIs
  • Add metrics collection for optimization tracking

Related Issues

Closes #[TDL-26998]

AI generated code

https://internal.qlik.dev/general/ways-of-working/code-reviews/#guidelines-for-ai-generated-code

  • this PR has been written with the help of GitHub Copilot or another generative AI tool

@sgandhi1311 sgandhi1311 changed the base branch from master to bugfix-interuptible-sync February 12, 2025 04:10
@sgandhi1311 sgandhi1311 force-pushed the TDL-26998-update-metafields-logic branch from c2905b1 to faa9c2b Compare February 12, 2025 07:19
@sgandhi1311 sgandhi1311 force-pushed the TDL-26998-update-metafields-logic branch from faa9c2b to 3f26fb2 Compare February 12, 2025 07:32
@sgandhi1311 sgandhi1311 changed the title Tdl 26998 update metafields logic Optimize Shopify Metafields Sync Performance Feb 12, 2025
@sgandhi1311 sgandhi1311 force-pushed the TDL-26998-update-metafields-logic branch from 2238558 to edd5b44 Compare February 12, 2025 07:58
@sgandhi1311 sgandhi1311 marked this pull request as ready for review February 12, 2025 08:01

@shopify_error_handling
def call_api(self, query_params, query, data_key):
def call_api(self, query_params, query):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the base class implementation has more explicit error handling, it would be a good idea to carry it over to the modified implementation of the class.

ref: gql_base.py


    @shopify_error_handling
    def call_api(self, query_params):
        """
        - Modifies the default call api implementation to support GraphQL
        - Returns response Object dict
        """
        try:
            query = self.get_query()
            LOGGER.info("Fetching %s %s", self.name, query_params)
            response = shopify.GraphQL().execute(query=query, variables=query_params)
            response = json.loads(response)
            if "errors" in response.keys():
                raise ShopifyAPIError(response['errors'])
            data = response.get("data", {}).get(self.data_key, {})
            return data
        except ShopifyAPIError as gql_error:
            LOGGER.error("GraphQL Error %s", gql_error)
            raise ShopifyAPIError("An error occurred with the GraphQL API.") from gql_error
        except Exception as exception:
            LOGGER.error("Unexpected error occurred.",)
            raise exception

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the exception handling over here as well.


def get_next_page_metafields(self, metafields, query_params, gql_resource):
owner_id = metafields.get("edges", [])[0]["node"]["owner"]["id"]
query = self.get_resource_type_query(gql_resource)()
Copy link
Member

@somethingmorerelevant somethingmorerelevant Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the right way to use a static method with a self reference
please change the method implementation or update the function call.

Metafields.get_resource_type_query(gql_resource)()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the function to the instance method.

@sgandhi1311 sgandhi1311 changed the base branch from bugfix-interuptible-sync to master February 19, 2025 06:45
replication_value = utils.strptime_to_utc(obj["updated_at"])
current_bookmark_value = self.get_bookmark_by_name(f"{self.name}_{resource_type}")
if resource_type not in current_bookmarks:
current_bookmarks[resource_type] = \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct but the function self.get_bookmark_by_name(f"{self.name}_{resource_type}") seems to be called twice — once to set current_bookmark_value and once more when updating current_bookmarks.
Can't we write like this, instead?
current_bookmarks[resource_type] = current_bookmark_value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, made the change

Comment on lines 186 to 194
if resource_type not in current_bookmarks:
current_bookmarks[resource_type] = \
self.get_bookmark_by_name(f"{self.name}_{resource_type}")

if replication_value >= current_bookmarks[resource_type]:
current_bookmarks[resource_type] = max(
replication_value,
current_bookmarks[resource_type]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here,

  • You're checking if resource_type exists in current_bookmarks. If not, you fetch the initial bookmark value from self.get_bookmark_by_name(f"{self.name}_{resource_type}") and store it in current_bookmarks[resource_type]
  • You then check if replication_value >= current_bookmarks[resource_type]. If this condition is True, you update the bookmark for this resource_type by taking the maximum value between the current replication_value and the existing current_bookmarks[resource_type].
    Essentially, you're checking and possibly updating current_bookmarks[resource_type] twice.
    You can combine the check and update process in a single step by using the dict.get() method and reducing the redundancy:
if replication_value >= current_bookmarks.get(resource_type, current_bookmark_value):
    current_bookmarks[resource_type] = max(replication_value, current_bookmarks.get(resource_type, replication_value))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds logical.
I optimized it further.

if replication_value >= current_bookmarks.get(resource_type, current_bookmark_value):
    current_bookmarks[resource_type] = replication_value

Let me know if you think something else.


LOGGER = get_logger()



class Metafields(ShopifyGqlStream):
Copy link
Contributor

@rdeshmukh15 rdeshmukh15 Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not impacting anything at all, still these 2 empty lines should be there before class def.

@@ -1,5 +1,10 @@
# Changelog

## 2.1.0
* Optimize Shopify Metafields Sync Performance [#200](https://github.com/singer-io/tap-shopify/pull/200)
* Include the new field (= inventoryItem) in the product variants schema.
Copy link
Member

@somethingmorerelevant somethingmorerelevant Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Include the new field (= inventoryItem) in the product variants schema.
* Include new field inventoryItem in the product_variants stream.

If this was not intended

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants