Add base URL to links #13

Open
RuedigerVoigt opened this issue Apr 11, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

@RuedigerVoigt
Owner

Many documents use relative links like overview.html instead of https://www.example.com/overview.html. It would be useful to convert those into absolute links before the page content is saved.

  • The first step is to find each link and determine whether it is absolute or relative. This check must also cover protocols (URL schemes) other than http and https.
  • The document itself may set a base URL (for example with a <base> element) that differs from the URL of the page just crawled. Such a base URL has to be detected and taken into account.
  • To handle cases like ../../foobar.html, the urllib.parse.urljoin function should be used.
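
The three steps above could be sketched roughly as follows, using only the standard library. This is a minimal illustration under assumptions, not the project's implementation: the names is_absolute, BaseHrefFinder, effective_base and absolutize are hypothetical, and a link counts as absolute here simply because it carries its own scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def is_absolute(url: str) -> bool:
    """Hypothetical helper: True if the URL has its own scheme (https, mailto, ftp, ...)."""
    return bool(urlsplit(url).scheme)


class BaseHrefFinder(HTMLParser):
    """Hypothetical helper: remember the href of the first <base> element, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base(html: str, page_url: str) -> str:
    """Return the base URL set in the document, falling back to the crawled URL."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # The base href may itself be relative to the page URL.
        return urljoin(page_url, finder.base_href)
    return page_url


def absolutize(link: str, base: str) -> str:
    """Leave links with a scheme untouched, resolve everything else against base."""
    if is_absolute(link):
        return link
    return urljoin(base, link)


page_url = "https://www.example.com/docs/a/b/page.html"
print(absolutize("../../foobar.html", effective_base("<html></html>", page_url)))
```

Checking for a scheme instead of a fixed list of prefixes means that links such as mailto: addresses are left alone without naming every protocol explicitly.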

This functionality seems to fit best as an optional feature of the prettify_html function.

@RuedigerVoigt added the enhancement label on Apr 11, 2020