Inaccurate web scraping data - how to improve accuracy? #609

lhexin · 2023-07-24T11:17:45Z

Hi

I've just tried using Flowise to interrogate a website (using the Cheerio Web Scraper). I am asking it to find summer offers on Starbucks.com as an example.

The scraper works, and it is pulling in data from Starbucks but it didn't get the context of the offer right, i.e. the offer detail is wrong.

I am assuming there might be a way to improve the accuracy by changing up how the recursive character text splitter works? But I have no idea how.

Any help is appreciated!

HenryHengZJ · 2023-07-25T11:41:13Z

Maintainer

Try using the HtmlToMarkDown Text Splitter. With the recursive text splitter, a lot of HTML gibberish still ends up in the embeddings.

Hopefully this helps - https://docs.flowiseai.com/use-cases/web-scrape-qna
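To illustrate why the splitter choice matters, here is a minimal Python sketch. It is not Flowise's implementation: the `TextExtractor` class, the `chunk` helper (a naive stand-in for a real text splitter) and the sample HTML are all hypothetical. It compares fixed-size chunks of raw HTML with chunks of text extracted from the same markup first.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def chunk(text, size):
    """Naive fixed-size character chunks (hypothetical stand-in for a splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical page fragment, not taken from any real site.
SAMPLE_HTML = (
    '<div class="promo"><style>.promo{color:red}</style>'
    '<h2 class="title">Summer offer</h2>'
    '<p class="body">Buy one iced drink, get one free.</p></div>'
)

raw_chunks = chunk(SAMPLE_HTML, 60)                   # chunks mix tags and text
clean_chunks = chunk(html_to_text(SAMPLE_HTML), 60)   # chunks contain text only
```

In this sketch the raw chunks cut through tags and attributes, so the offer title and its detail land in different chunks surrounded by markup, while the cleaned text keeps them together. That is the kind of context loss the original question describes.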

4 replies

collasanta Jul 25, 2023

Hey Henry, a lot of websites have nested sitemaps under domain/sitemap.xml. I tested the scrape-by-XML feature and it seems to handle only sitemaps that have all the URLs in the root of sitemap.xml.
Do you have plans to add support for that? Thanks
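For readers who want to resolve nested sitemaps outside Flowise, a minimal Python sketch follows. It assumes the standard sitemaps.org format, where a `<sitemapindex>` lists child sitemaps and a `<urlset>` lists pages. The `collect_urls` function, its `fetch` parameter and the `FAKE_SITE` data are hypothetical; `fetch` stands in for whatever HTTP client you would use.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, fetch, max_depth=3, _seen=None):
    """Return page URLs from a sitemap, following nested sitemap indexes.

    `fetch` is any callable mapping a URL to its XML text (hypothetical).
    `max_depth` caps how many levels of nesting are followed.
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen or max_depth < 0:
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    if root.tag.endswith("sitemapindex"):
        # An index file: recurse into each child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(
                    collect_urls(loc.text.strip(), fetch, max_depth - 1, seen)
                )
    else:
        # A plain urlset: collect the page URLs directly.
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                urls.append(loc.text.strip())
    return urls


# In-memory stand-in for a website, so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/offers</loc></url>"
        "<url><loc>https://example.com/menu</loc></url>"
        "</urlset>"
    ),
}

pages = collect_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
```

The resulting list could then be trimmed by hand before being passed to a scraper. The `max_depth` cap and the `seen` set are there because deeply nested or circular sitemaps can otherwise take a very long time to crawl.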

HenryHengZJ Jul 25, 2023
Maintainer

We don't have a near-term plan for that, as users currently don't have a clear way of selecting which links to use and which not to. Crawling nested sitemaps could take a really long time and doesn't fit well with the current UI/UX.

We will improve this in the future when we introduce the workflow pipeline.

collasanta Jul 25, 2023

Yes, I was trying it with websites with hundreds of pages, and I had no clue what was happening in the process when I got long loading times in the chat UI (also, the feature that logs every page that is going to be scraped in the console did not work for me).

Thanks

tomique34 Jan 16, 2025

What about using the Crawl4AI web scraper? I read an article that described Crawl4AI as a very fast scraper that can scrape a complete website in moments and output it in md format. Could you implement it in a Flowise workflow sometime?

Charlotte-br560 · 2024-03-15T17:33:44Z

Improving the accuracy of web scraping data involves refining your scraping technique and understanding the website structure you're targeting. Here are a few tips to enhance accuracy:

Inspect the Website Structure: Before you scrape, inspect the HTML structure of the website to understand how the data is organized. This will help you identify the elements you need to target accurately.

Use Targeted Selectors: Utilize precise CSS selectors or XPath expressions to target specific elements containing the desired data. Avoid broad selectors that may capture irrelevant information.

Handle Dynamic Content: Some websites load content dynamically using JavaScript. Ensure your scraper can handle dynamic content by using tools like Puppeteer or Selenium or by analyzing network requests to replicate AJAX requests.

Implement Error Handling: Build error-handling mechanisms into your scraper to gracefully handle unexpected situations, such as missing data or changes in website layout.

Regularly Update Scraping Logic: Websites frequently undergo updates that may affect scraping accuracy. Review and update your scraping logic regularly to adapt to any changes.

Test and Iterate: Test your scraper with different scenarios and iterate on your scraping logic based on the results. This iterative process can help refine the accuracy of your scraping.

By applying these strategies and experimenting with different approaches, you can improve the accuracy of your web scraping data extraction.
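As a rough illustration of the "targeted selectors" and "error handling" tips, here is a minimal Python sketch using only the standard library. It matches elements by CSS class only, which is a simplified stand-in for real CSS selectors or XPath. The `ClassTextExtractor` class, the `extract_by_class` function and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.results = []
        self._depth = 0       # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.target_class in classes:
            self._depth = 1   # entering a matching element
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # leaving the matching element
            self.results.append(" ".join(self._buffer))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._buffer.append(data.strip())


def extract_by_class(html, target_class):
    """Return the text of each element with `target_class`, or raise if none."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    if not parser.results:
        # Fail loudly so a layout change is noticed instead of silently
        # producing empty data.
        raise ValueError(f"no element with class {target_class!r} found")
    return parser.results


# Hypothetical page, not taken from any real site.
PAGE = (
    '<nav class="menu"><a href="/">Home</a></nav>'
    '<div class="offer"><h2>Summer offer</h2>'
    "<p>Buy one iced drink, get one free.</p></div>"
    "<footer>Terms apply</footer>"
)

offers = extract_by_class(PAGE, "offer")
```

Here only the text inside the `offer` element is kept, and the navigation and footer are ignored. Asking for a class that no longer exists raises an error, which makes a changed page layout visible instead of silently returning nothing.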

1 reply

ChatGurus Mar 21, 2024

Sounds like chatgpt ;)
But yes, this is more complicated than just adding a module in the workflow. It might also be a good idea to scrape the website separately and upload the resulting document (after inspecting and optimizing it).

toi500 · 2024-03-24T00:58:34Z

Apify is the solution for pro scraping. There is even a built-in integration to upsert the data to Pinecone.

Also, you can schedule your runs, so, for example, you can scrape any dynamic website once or twice a week.

If I have time, I would like to make a short tutorial about it. It is a perfect companion for Flowise.

1 reply

Vortigern-source Sep 6, 2024

When you make one please post it here,

thanks
