-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
14 changed files
with
4,742 additions
and
10 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,251 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "daba7cc8-d507-422a-ae76-9cea8809a646", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"<img width=\"8%\" alt=\"BeautifulSoup.png\" src=\"https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/BeautifulSoup.png\" style=\"border-radius: 15%\">" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2778f16d-7641-4958-bc78-3dde9c493d65", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"# BeautifulSoup - List social network links from website\n", | ||
"\n", | ||
"Largely inspired by https://github.com/jupyter-naas/awesome-notebooks/blob/master/BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8f1a033f-9362-43d8-8a8f-1878bd2115c4", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"**Description:** This notebook will use BeautifulSoup to list all the social network links from a website. It is usefull for organizations to quickly get a list of all the social networks they are present on." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5a50c831-2700-4fb5-acee-ed0513446815", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"**References:**\n", | ||
"- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9df24602-3917-4f76-b29a-2be92dea558d", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": "## Import libraries" | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"id": "07f7c93e-5f43-4084-a434-d3a591046738", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [], | ||
"ExecuteTime": { | ||
"end_time": "2024-10-13T12:22:09.049638Z", | ||
"start_time": "2024-10-13T12:22:09.047611Z" | ||
} | ||
}, | ||
"source": [ | ||
"import requests\n", | ||
"from bs4 import BeautifulSoup\n", | ||
"from pprint import pprint" | ||
], | ||
"outputs": [], | ||
"execution_count": 33 | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f0fc7961-2be2-4c4e-a915-e5683952df41", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"### Setup Variables\n", | ||
"- `url`: The URL of the website you want to extract social network links from\n", | ||
"- `social_network_links`: List of social network links extracted from website" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"id": "8f88972c-5b30-4656-9e8f-6e9215665131", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [], | ||
"ExecuteTime": { | ||
"end_time": "2024-10-13T12:22:09.052633Z", | ||
"start_time": "2024-10-13T12:22:09.050699Z" | ||
} | ||
}, | ||
"source": [ | ||
"# Inputs\n", | ||
"url = \"https://www.papit.fr/utiles.html\"\n", | ||
"\n", | ||
"# Outputs\n", | ||
"social_network_links = []" | ||
], | ||
"outputs": [], | ||
"execution_count": 34 | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e8d53777-9a93-49ee-a8e7-79bbdf27e029", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": "## Get social network links" | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"id": "2577d407-9da1-431a-b11c-2624a8b749e0", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [], | ||
"ExecuteTime": { | ||
"end_time": "2024-10-13T12:22:09.056374Z", | ||
"start_time": "2024-10-13T12:22:09.053464Z" | ||
} | ||
}, | ||
"source": [ | ||
"def get_social_network_links(url, social_network_links):\n", | ||
" # Make a GET request to the URL and get the HTML content\n", | ||
" response = requests.get(url)\n", | ||
" html_content = response.text\n", | ||
"\n", | ||
" # Create a BeautifulSoup object to parse the HTML content\n", | ||
" soup = BeautifulSoup(html_content, 'html.parser')\n", | ||
"\n", | ||
" # Find all the links on the page\n", | ||
" links = soup.find_all('a')\n", | ||
"\n", | ||
" # Loop through the links and find the social network links\n", | ||
" social_networks = ['facebook', 'twitter', 'linkedin', 'instagram', 'github', 'youtube']\n", | ||
" for link in links:\n", | ||
" href = link.get('href')\n", | ||
" if href:\n", | ||
" for social in social_networks:\n", | ||
" if social in href:\n", | ||
" if href not in social_network_links:\n", | ||
" social_network_links.append(href)\n", | ||
" return social_network_links" | ||
], | ||
"outputs": [], | ||
"execution_count": 35 | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cb5902d1-db8b-4fbd-bbf2-4b660c21a5f2", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": "## Crawling the website and display results" | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"id": "92c2d4d5-58c6-48da-ae99-ad5bfc5b97b9", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [], | ||
"ExecuteTime": { | ||
"end_time": "2024-10-13T12:22:09.429263Z", | ||
"start_time": "2024-10-13T12:22:09.057161Z" | ||
} | ||
}, | ||
"source": [ | ||
"social_network_links = get_social_network_links(url, social_network_links)\n", | ||
"pprint(social_network_links)" | ||
], | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"['https://github.com/phe-sto',\n", | ||
" 'https://github.com/pcko1/bscscan-python',\n", | ||
" 'https://github.com/Polve/bitcoin-rpc-client',\n", | ||
" 'https://github.com/HydraCG/Specifications',\n", | ||
" 'https://github.com/mingqian/zigbee-viewer',\n", | ||
" 'https://github.com/CodeforFR/enthic-dataviz',\n", | ||
" 'https://www.youtube.com/@Computerphile',\n", | ||
" 'https://shellchocolat.github.io//',\n", | ||
" 'https://github.com/papit-fr/papit-frontend']\n" | ||
] | ||
} | ||
], | ||
"execution_count": 36 | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e3c324dc-fea0-47da-8f89-2747ab5fa5c0", | ||
"metadata": { | ||
"papermill": {}, | ||
"tags": [] | ||
}, | ||
"source": [ | ||
" " | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.6" | ||
}, | ||
"naas": { | ||
"notebook_id": "200a5dcebfc9ff32f08e84aaba44cb6125fbc8bbde5f686f467b8626c7ef5f78", | ||
"notebook_path": "BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb" | ||
}, | ||
"papermill": { | ||
"default_parameters": {}, | ||
"environment_variables": {}, | ||
"parameters": {}, | ||
"version": "2.4.0" | ||
}, | ||
"widgets": { | ||
"application/vnd.jupyter.widget-state+json": { | ||
"state": {}, | ||
"version_major": 2, | ||
"version_minor": 0 | ||
} | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.