Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is utils.get_text_from_blob param with_encoding meant to do? #276

Open
ghost opened this issue Aug 4, 2023 · 3 comments
Open

What is utils.get_text_from_blob param with_encoding meant to do? #276

ghost opened this issue Aug 4, 2023 · 3 comments

Comments

@ghost
Copy link

ghost commented Aug 4, 2023

The first use inside the function makes it return the content as bytes, and assuming 'utf-8'

The second use inside the function returns back to the user the charset used, but surely because of the first use the second use is never called?

The only time the function is actually called with the param set to True is in an error handler of clean.clean_invalid_documents and I don't understand that error handler, as it could end up leaving iati_activities_el undefined and then the very next line tries to use that variable.

b011174 adds both uses

@akmiller01
Copy link
Contributor

The first use is within a try / except block that catches UnicodeDecodeError. So if the downloader.content_as_text() statement within the return line fails to decode as unicode, the function continues along to detect the encoding, and return it along with the content in the second use.

The original function without the with_encoding argument was already detecting the encoding to return the text, it just was never returning it. So when subsequent functions needed the text and the encoding, it didn't need to use chardet again.

The flow-control of clean.clean_invalid_documents calls continue on that error handler within a for loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el.

@ghost
Copy link
Author

ghost commented Oct 11, 2023

The flow-control of clean.clean_invalid_documents calls continue on that error handler within a for loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el.

But it only calls continue on a second error - where utils.get_text_from_blob crashes.
If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:


            try:
                blob_service_client = BlobServiceClient.from_connection_string(
                    config['STORAGE_CONNECTION_STR']) # EXECUTES
                blob_client = blob_service_client.get_blob_client(
                    container=config['SOURCE_CONTAINER_NAME'], blob=blob_name) # EXECUTES

                downloader = blob_client.download_blob() # EXECUTES

                large_parser = etree.XMLParser(huge_tree=True) # EXECUTES
                root = etree.parse(
                    BytesIO(downloader.content_as_bytes()), parser=large_parser) # EXECUTES BUT HAS ISSUES
                iati_activities_el = root.getroot()
                file_encoding = 'utf-8'
            except (AzureExceptions.ResourceNotFoundError) as e:
                logger.warning(
                    f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
                db.updateFileAsNotDownloaded(conn, id)
            except etree.XMLSyntaxError as e:
                logger.warning(
                    f"Cannot parse entire XML for hash: {hash} id: {id}. ") # EXECUTES
                try:
                    file_text, file_encoding = utils.get_text_from_blob(
                        downloader, hash, True) # EXECUTES AND RETURNS AN ENCODING, DOES NOT CRASH
                except:
                    logger.warning(
                        f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
                    db.updateCleanError(
                        conn, id, 'Could not parse')
                    continue

            # loop and save valid activities to clean doc
            activities = iati_activities_el.xpath("iati-activity") # EXECUTES BUT AT THIS POINT iati_activities_el IS IN A BAD STATE?

@akmiller01
Copy link
Contributor

But it only calls continue on a second error - where utils.get_text_from_blob crashes.
If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:

Yes, that sounds correct. So the code could be updated to something like this to work as expected like this:

            try:
                blob_service_client = BlobServiceClient.from_connection_string(
                    config['STORAGE_CONNECTION_STR'])
                blob_client = blob_service_client.get_blob_client(
                    container=config['SOURCE_CONTAINER_NAME'], blob=blob_name)

                downloader = blob_client.download_blob()

                large_parser = etree.XMLParser(huge_tree=True)
                root = etree.parse(
                    BytesIO(downloader.content_as_bytes()), parser=large_parser)
                iati_activities_el = root.getroot()
                file_encoding = 'utf-8'
            except (AzureExceptions.ResourceNotFoundError) as e:
                logger.warning(
                    f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
                db.updateFileAsNotDownloaded(conn, id)
                continue # ADD CONTINUE HERE FOR FILE NOT FOUND
            except etree.XMLSyntaxError as e:
                logger.warning(
                    f"Cannot parse entire XML for hash: {hash} id: {id}. ")
                try:
                    file_text, file_encoding = utils.get_text_from_blob(
                        downloader, hash, True)
                    root = etree.fromstring(file_text, parser=large_parser) # ADD LXML PARSE FILE_TEXT STRING
                    iati_activities_el = root.getroot() # ADD GETROOT
                except:
                    logger.warning(
                        f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
                    db.updateCleanError(
                        conn, id, 'Could not parse')
                    continue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant