[Web] Setup header for web scrape task #1109

Open
ShihChun-H opened this issue Sep 24, 2024 · 15 comments
Labels
component, feature (New feature or request), good-first-issue (Good for newcomers), hacktoberfest, hacktoberfest2024 (Component improvement issues for Hacktoberfest 2024), help-wanted (Help from the community is appreciated), improvement (Improvement on existing features), instill core

Comments

@ShihChun-H
Member

ShihChun-H commented Sep 24, 2024

Issue Description

Current State

  • The web crawler cannot set authentication headers.

Why We Want to Change?

  • With header-based auth, users can crawl websites that require authentication.
  • It gives VDP more use cases.

Proposed Change

  • Add a headers input to all tasks in the Web operator:
    "headers": {}
  • Users can then set auth and content-type headers.
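
As a rough illustration of the proposed change, the sketch below shows how a task input carrying a headers object could be turned into an HTTP request. It is written in Python for brevity and is not the component's code. The `build_request` helper, the `url` field name, and the example URL and token are assumptions made for this example. Only the `headers` field comes from the proposal.

```python
import urllib.request


def build_request(task_input: dict) -> urllib.request.Request:
    """Build (but do not send) a request from a hypothetical task input."""
    # "headers" is the field proposed in this issue; it is treated as optional.
    # "url" is an assumed input name used only for this illustration.
    headers = task_input.get("headers", {})
    return urllib.request.Request(task_input["url"], headers=headers)


# Hypothetical task input: the token value is a placeholder.
task_input = {
    "url": "https://example.com/private",
    "headers": {
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
    },
}

request = build_request(task_input)

# Inspect the headers that would be sent; no network call is made here.
print(request.header_items())
```

Note that `urllib.request.Request` normalizes header names with `str.capitalize()`, so `Content-Type` is stored as `Content-type`. This is harmless because HTTP header names are case-insensitive.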

Rules for the Component Hackathon

  • Each issue will only be assigned to one person/team at a time.
  • You can only work on one issue at a time.
  • To express interest in an issue, please comment on it and tag @kuroxx, allowing the Instill AI team to assign it to you.
  • Ensure you address all feedback and suggestions provided by the Instill AI team.
  • If no commits are made within five days, the issue may be reassigned to another contributor.
  • Join our Discord to engage in discussions and seek assistance in the #hackathon channel. For technical queries, you can tag @chuang8511.

Component Contribution Guideline | Documentation | Official Go Tutorial

@ShihChun-H ShihChun-H added need-triage Need to be investigated further feature New feature or request labels Sep 24, 2024

@ShihChun-H ShihChun-H added hacktoberfest2024 Component improvement issues for Hacktoberfest 2024 help-wanted Help from the community is appreciated improvement Improvement on existing features instill core component and removed need-triage Need to be investigated further labels Sep 24, 2024
@chuang8511 chuang8511 added the good-first-issue Good for newcomers label Sep 25, 2024
@someshfengde

Hi @ShihChun-H can you please assign this issue to me?

@ShihChun-H
Member Author

Hi @someshfengde, sure. The issue has been assigned to you.

@someshfengde

Thank you, I will start.

@someshfengde

Hi @ShihChun-H, can you please help me get started on this issue? I've been trying to set up my machine according to contributions.md, but it's not working out (it has been more than 50 minutes since I started pulling the Docker images).

Also, can you explain in more detail what I have to do?

From the description, I think I have to add headers to schema/ai-tasks.json. Let me know if I'm on the right path.

I've been thinking of adding this:

    "headers": {
      "title": "Request Headers",
      "description": "HTTP headers to include in the request.",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    }

Thanks :)
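
For readers unfamiliar with this schema shape: `"type": "object"` with `"additionalProperties": {"type": "string"}` accepts any object whose values are all strings. The sketch below checks that rule by hand in Python using only the standard library. `validate_headers` is a hypothetical helper written for illustration, not part of the project or of any JSON Schema library.

```python
def validate_headers(headers: object) -> list:
    """Return a list of problems; an empty list means the value fits the schema."""
    # "type": "object" -> the value must be a JSON object (a dict in Python).
    if not isinstance(headers, dict):
        return ["headers must be an object"]

    problems = []
    # "additionalProperties": {"type": "string"} -> every value must be a string.
    for name, value in headers.items():
        if not isinstance(value, str):
            problems.append(
                f"header {name!r} must be a string, got {type(value).__name__}"
            )
    return problems


print(validate_headers({"Authorization": "Bearer <token>"}))
print(validate_headers({"X-Retry": 3}))
```

A consequence of this schema is that numeric or boolean header values have to be passed as strings, for example `"3"` rather than `3`.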

@chuang8511
Member

Hi @someshfengde,
Thanks for taking the time to work on this.

> can you please help me get started on this issue? I've been trying to set up my machine according to contributions.md, but it's not working out (it has been more than 50 minutes since I started pulling the Docker images)

There could be several reasons. In my experience, you may need to increase your Docker resources.
In Docker Desktop, you can find the resource settings here. Could you please try again?
[image]

Restarting your PC or cleaning up your Docker resources sometimes helps as well.

> Also can you explain in more detail what I have to do?

You have to add more parameters to the Web operator's tasks.json.
Some websites require extra information before they can be scraped, so the scraper needs to set request headers to access them.
You can therefore add more optional parameters to the scraper tasks.
Users can then supply tokens or keys when they scrape specific sites.
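
One way to read "optional parameters" here is that a task still works when no headers are given, and user-supplied headers override any defaults. Below is a minimal Python sketch of that merge, assuming a hypothetical default `User-Agent`. `DEFAULT_HEADERS` and `effective_headers` are illustrative names, not the component's actual code.

```python
# Hypothetical default sent when the user supplies nothing.
DEFAULT_HEADERS = {"User-Agent": "example-scraper/0.1"}


def effective_headers(user_headers=None):
    """Combine defaults with optional user headers; user values win.

    Header names are compared case-insensitively, since HTTP header
    names are case-insensitive.
    """
    # Key by lowercase name so "user-agent" replaces "User-Agent".
    merged = {name.lower(): (name, value) for name, value in DEFAULT_HEADERS.items()}
    for name, value in (user_headers or {}).items():
        merged[name.lower()] = (name, value)
    # Rebuild a plain dict using the spelling of whichever entry won.
    return dict(merged.values())


print(effective_headers())
print(effective_headers({"user-agent": "custom", "X-API-Key": "<key>"}))
```

Keeping the parameter optional means existing recipes that do not set headers keep behaving as before.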

I hope this answers all your questions.
Please feel free to ask if you have any further questions!
Thank you again!

@ShihChun-H
Member Author

Hi @someshfengde, I'm following up to check on your progress and any questions you have run into on this issue. Could you please provide an update? Thanks 🙏

@someshfengde

Sorry, I have been busy for the last couple of days. I will continue working on it in a few hours.

@kuroxx
Collaborator

kuroxx commented Oct 15, 2024

Hey @someshfengde how's it going? If you have any PR for this, don't forget to submit it!

@someshfengde

Sorry, I got pulled into other tasks, and I think this will require a lot of effort from me. Can you please assign it to someone else?

@kuroxx
Collaborator

kuroxx commented Oct 15, 2024

@someshfengde No worries - thank you for letting me know! Good luck with your other tasks

@Sourabh782

Hi @kuroxx, if you don't mind, can I look into this issue?

@kuroxx
Collaborator

kuroxx commented Oct 17, 2024

Hey @Sourabh782, sounds great! I have assigned it to you 🤝

@kuroxx
Collaborator

kuroxx commented Oct 25, 2024

Hey @Sourabh782 how's it going? Any blockers or progress?

If you have questions or need help, we have a Discord community here: https://discord.gg/sevxWsqpGh

@kuroxx
Collaborator

kuroxx commented Oct 29, 2024

Hey @Sourabh782, I'm not sure if you're still working on this, but since it's been 2 weeks now, I will unassign this task. Thanks

Projects
Status: In Progress
Development

No branches or pull requests

6 participants