First version.
phe-sto committed Oct 14, 2024
1 parent cd91f5f commit 8a93279
Showing 14 changed files with 4,742 additions and 10 deletions.
251 changes: 251 additions & 0 deletions 12_bs4_requests.ipynb
@@ -0,0 +1,251 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "daba7cc8-d507-422a-ae76-9cea8809a646",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"<img width=\"8%\" alt=\"BeautifulSoup.png\" src=\"https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/BeautifulSoup.png\" style=\"border-radius: 15%\">"
]
},
{
"cell_type": "markdown",
"id": "2778f16d-7641-4958-bc78-3dde9c493d65",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"# BeautifulSoup - List social network links from website\n",
"\n",
"Largely inspired by https://github.com/jupyter-naas/awesome-notebooks/blob/master/BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb"
]
},
{
"cell_type": "markdown",
"id": "8f1a033f-9362-43d8-8a8f-1878bd2115c4",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**Description:** This notebook uses BeautifulSoup to list all the social network links found on a web page. It is useful for organizations that want a quick list of the social networks they are present on."
]
},
{
"cell_type": "markdown",
"id": "5a50c831-2700-4fb5-acee-ed0513446815",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**References:**\n",
"- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)"
]
},
{
"cell_type": "markdown",
"id": "9df24602-3917-4f76-b29a-2be92dea558d",
"metadata": {
"papermill": {},
"tags": []
},
"source": "## Import libraries"
},
{
"cell_type": "code",
"id": "07f7c93e-5f43-4084-a434-d3a591046738",
"metadata": {
"papermill": {},
"tags": [],
"ExecuteTime": {
"end_time": "2024-10-13T12:22:09.049638Z",
"start_time": "2024-10-13T12:22:09.047611Z"
}
},
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from pprint import pprint"
],
"outputs": [],
"execution_count": 33
},
{
"cell_type": "markdown",
"id": "f0fc7961-2be2-4c4e-a915-e5683952df41",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"## Set up variables\n",
"- `url`: The URL of the website you want to extract social network links from\n",
"- `social_network_links`: List of social network links extracted from website"
]
},
{
"cell_type": "code",
"id": "8f88972c-5b30-4656-9e8f-6e9215665131",
"metadata": {
"papermill": {},
"tags": [],
"ExecuteTime": {
"end_time": "2024-10-13T12:22:09.052633Z",
"start_time": "2024-10-13T12:22:09.050699Z"
}
},
"source": [
"# Inputs\n",
"url = \"https://www.papit.fr/utiles.html\"\n",
"\n",
"# Outputs\n",
"social_network_links = []"
],
"outputs": [],
"execution_count": 34
},
{
"cell_type": "markdown",
"id": "e8d53777-9a93-49ee-a8e7-79bbdf27e029",
"metadata": {
"papermill": {},
"tags": []
},
"source": "## Get social network links"
},
{
"cell_type": "code",
"id": "2577d407-9da1-431a-b11c-2624a8b749e0",
"metadata": {
"papermill": {},
"tags": [],
"ExecuteTime": {
"end_time": "2024-10-13T12:22:09.056374Z",
"start_time": "2024-10-13T12:22:09.053464Z"
}
},
"source": [
"def get_social_network_links(url, social_network_links):\n",
"    # Make a GET request to the URL and get the HTML content\n",
"    # (the 10-second timeout is an added safeguard so the call cannot hang forever)\n",
"    response = requests.get(url, timeout=10)\n",
"    html_content = response.text\n",
"\n",
"    # Create a BeautifulSoup object to parse the HTML content\n",
"    soup = BeautifulSoup(html_content, 'html.parser')\n",
"\n",
"    # Find all the links on the page\n",
"    links = soup.find_all('a')\n",
"\n",
"    # Keep each href that contains a social network name, skipping duplicates\n",
"    social_networks = ['facebook', 'twitter', 'linkedin', 'instagram', 'github', 'youtube']\n",
"    for link in links:\n",
"        href = link.get('href')\n",
"        if href and any(social in href for social in social_networks):\n",
"            if href not in social_network_links:\n",
"                social_network_links.append(href)\n",
"    return social_network_links"
],
"outputs": [],
"execution_count": 35
},
{
"cell_type": "markdown",
"id": "cb5902d1-db8b-4fbd-bbf2-4b660c21a5f2",
"metadata": {
"papermill": {},
"tags": []
},
"source": "## Crawl the website and display the results"
},
{
"cell_type": "code",
"id": "92c2d4d5-58c6-48da-ae99-ad5bfc5b97b9",
"metadata": {
"papermill": {},
"tags": [],
"ExecuteTime": {
"end_time": "2024-10-13T12:22:09.429263Z",
"start_time": "2024-10-13T12:22:09.057161Z"
}
},
"source": [
"social_network_links = get_social_network_links(url, social_network_links)\n",
"pprint(social_network_links)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['https://github.com/phe-sto',\n",
" 'https://github.com/pcko1/bscscan-python',\n",
" 'https://github.com/Polve/bitcoin-rpc-client',\n",
" 'https://github.com/HydraCG/Specifications',\n",
" 'https://github.com/mingqian/zigbee-viewer',\n",
" 'https://github.com/CodeforFR/enthic-dataviz',\n",
" 'https://www.youtube.com/@Computerphile',\n",
" 'https://shellchocolat.github.io//',\n",
" 'https://github.com/papit-fr/papit-frontend']\n"
]
}
],
"execution_count": 36
},
{
"cell_type": "markdown",
"id": "e3c324dc-fea0-47da-8f89-2747ab5fa5c0",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"naas": {
"notebook_id": "200a5dcebfc9ff32f08e84aaba44cb6125fbc8bbde5f686f467b8626c7ef5f78",
"notebook_path": "BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb"
},
"papermill": {
"default_parameters": {},
"environment_variables": {},
"parameters": {},
"version": "2.4.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
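
A possible follow-up to the notebook above: the `social in href` check matches the network name anywhere in the URL, which is why the recorded output includes `https://shellchocolat.github.io//`, a GitHub Pages site rather than a `github.com` page. The sketch below is one way to tighten the filter by comparing the parsed hostname instead. It is not part of the commit; the names `SOCIAL_DOMAINS`, `is_social_link` and `filter_social_links` are hypothetical, and the domain list is simply an assumption derived from the names in `social_networks`.

```python
from urllib.parse import urlparse

# Assumed domain list, derived from the names used in the notebook.
SOCIAL_DOMAINS = (
    "facebook.com",
    "twitter.com",
    "linkedin.com",
    "instagram.com",
    "github.com",
    "youtube.com",
)


def is_social_link(href, domains=SOCIAL_DOMAINS):
    """Return True if the href's hostname is one of the domains or a subdomain of one."""
    # Relative links and schemes such as mailto: have no hostname.
    host = (urlparse(href).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in domains)


def filter_social_links(hrefs, domains=SOCIAL_DOMAINS):
    """Keep social network links in their original order, without duplicates."""
    kept = []
    for href in hrefs:
        if href and is_social_link(href, domains) and href not in kept:
            kept.append(href)
    return kept


hrefs = [
    "https://github.com/phe-sto",
    "https://www.youtube.com/@Computerphile",
    "https://shellchocolat.github.io//",
    "https://github.com/phe-sto",
    "/utiles.html",
]
print(filter_social_links(hrefs))
```

In the notebook, the `hrefs` list would come from `link.get('href')` for each `<a>` tag returned by `soup.find_all('a')`. Whether GitHub Pages sites should count as social links is a judgment call; add `github.io` to the domain list to keep them.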