Web Scraping Documentation and Llama-Index Integration #43

Closed
wants to merge 5 commits into from


@AsH1605 AsH1605 commented Nov 6, 2024

Issue Reference: Resolves #42

Changes:

  • Added functionality to scrape documentation from the given website, including sublinks.
  • Stored scraped content in Llama-Index Document objects with metadata (URLs).
  • Built a Llama-Index index from the collected documents.
  • Created a Colab notebook with sample usage to demonstrate the scraper in action.
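
The scraping flow listed above could be sketched roughly as follows. This is not the code from the PR: it uses only the Python standard library, walks links with an explicit queue instead of recursion, and returns plain dictionaries rather than Llama-Index Document objects. The names `scrape_docs`, `fetch_html`, `_PageParser`, and `max_pages` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class _PageParser(HTMLParser):
    """Collects visible text and href values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.chunks = [], []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch_html(url):
    """Default fetcher (hypothetical helper); swap in a stub for tests."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_docs(start_url, fetch=fetch_html, max_pages=50):
    """Crawl same-host pages reachable from start_url.

    Returns a list of {"text": ..., "metadata": {"url": ...}} records.
    """
    host = urlparse(start_url).netloc
    seen, queue, records = set(), [start_url], []
    while queue and len(records) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = _PageParser()
        parser.feed(html)
        records.append({"text": " ".join(parser.chunks), "metadata": {"url": url}})
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _ = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in seen:
                queue.append(target)
    return records
```

Each record carries the same text-plus-URL shape described in the change list, so it could be wrapped in a Llama-Index Document before indexing. Restricting the crawl to the starting host and capping the page count are assumptions made here to keep the sketch bounded.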

Challenges:

  • Worked around deprecated Llama-Index APIs such as GPTSimpleVectorIndex.
  • Ensured all relative URLs were correctly resolved using urljoin.
  • Implemented a recursive scraper that follows the pages linked from the main documentation.
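
As an illustration of the relative-URL point above, here is a minimal sketch of how `urljoin` resolves the kinds of links a documentation page typically contains. The helper name `resolve_link` and the example URLs are hypothetical; dropping fragments and non-HTTP schemes is an assumption about what a scraper would want, not something the PR states.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped."""
    # urljoin resolves href against the page it was found on;
    # urldefrag then strips any "#fragment" so anchors map to one page.
    absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: links
    return absolute


base = "https://docs.example.test/api/layers/index.html"
for href in ["dense.html", "../models/", "/guides/intro", "#usage", "mailto:a@example.test"]:
    print(href, "->", resolve_link(base, href))
```

Stripping fragments before deduplication keeps a crawler from visiting the same page once per in-page anchor.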

Please review the changes and let me know if further adjustments are needed.

@debrupf2946
Collaborator

Hi @AsH1605, thanks for the contribution. I was a little busy; I am reviewing your code now.
Can you please test your implementation on the Keras documentation?
Please show the results in the notebook that you added in this PR.

@debrupf2946
Collaborator

@AsH1605 please sign all your commits before creating a PR (a YouTube tutorial can walk you through it). Also, you should have made your commits on a separate branch, not directly on main.

@AsH1605
Author

AsH1605 commented Nov 26, 2024

@debrupf2946 I have made the changes; please review them in the next PR.

@AsH1605 AsH1605 closed this Nov 26, 2024
Development

Successfully merging this pull request may close these issues.

feat: Scrapping Documentations from website for building Knowledge Graphs
2 participants