# LiU Exjobb Crawler

## Table of Contents

- [Introduction](#introduction)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Directory Structure](#directory-structure)
- [Automated Updates](#automated-updates)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
## Introduction

This project automates the process of scraping Exjobb (master's thesis) project data from Linköping University using Selenium. The scraped data is stored in a CSV file and visualized through an interactive Streamlit application. Additionally, a GitHub Actions workflow runs the scraper daily, keeping the data up to date.
## Features

- **Automated Web Scraping**: Uses Selenium to extract project details such as title, organization, research field, and application deadlines.
- **Data Storage**: Saves scraped data in a structured CSV format for easy access and analysis (a loading sketch follows this list).
- **Interactive Visualization**: The Streamlit app provides interactive charts and filters to explore the data.
- **Daily Updates**: A GitHub Actions workflow runs the scraper daily, keeping the dataset current.
- **Data Download**: Users can download filtered data in CSV and Excel formats directly from the Streamlit app.
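As a quick illustration of the dataset these features produce, the CSV can be loaded with pandas. The column names below are assumptions inferred from the fields listed above, not confirmed against the scraper's output:

```python
import pandas as pd

# Load the scraped dataset (path given in the Usage section below).
df = pd.read_csv("data/exjobb_projects.csv")

# Assumed columns, inferred from the feature list; adjust to the actual header.
print(df[["title", "organization", "research_field", "deadline"]].head())
```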
## Prerequisites

Before you begin, ensure you have met the following requirements:

- Python 3.7 or higher installed on your machine. You can download it from [python.org](https://www.python.org/downloads/).
- Google Chrome browser installed. Download it from [google.com/chrome](https://www.google.com/chrome/).
- ChromeDriver compatible with your Chrome version. You can download it from the [ChromeDriver downloads page](https://chromedriver.chromium.org/downloads).
## Installation

1. **Clone the Repository**

   ```bash
   git clone https://github.com/your-username/liu-exjobb-crawler.git
   cd liu-exjobb-crawler
   ```
2. **Create a Virtual Environment (Optional but Recommended)**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```
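   For reference, `requirements.txt` likely includes at least the packages below. This list is an assumption inferred from the tools this README describes (Selenium for scraping, pandas for CSV handling, Streamlit for the app, and openpyxl for the Excel export), not a verbatim copy of the file:

   ```text
   selenium
   pandas
   streamlit
   openpyxl
   ```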
4. **Set Up ChromeDriver**

   - Download the ChromeDriver version that matches your installed Chrome browser.
   - Extract the `chromedriver` executable and place it in a directory that's in your system's `PATH`, or specify its path in the `liu_data.py` script.
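If you would rather point Selenium at an explicit driver location than rely on `PATH`, Selenium 4's `Service` object accepts a path. This is a generic sketch, not a quote from `liu_data.py`; adjust the path to wherever you placed the binary:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at a specific ChromeDriver binary instead of a PATH lookup.
service = Service("/usr/local/bin/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://www.example.com")  # quick sanity check that the driver launches
print(driver.title)
driver.quit()
```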
## Usage

### Running the Scraper

The scraper script `liu_data.py` uses Selenium to navigate the Exjobb website, extract project information, and save it to a CSV file.

```bash
python liu_data.py
```

After running, the scraped data will be available at `data/exjobb_projects.csv`.
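For orientation, the core scraping loop looks roughly like the sketch below. The URL, CSS selectors, and column names are placeholders for illustration, not the actual values used in `liu_data.py`:

```python
import os

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.example.com/exjobb")  # placeholder URL

rows = []
# Placeholder selectors: each project is assumed to be rendered as one card.
for card in driver.find_elements(By.CSS_SELECTOR, ".project-card"):
    rows.append({
        "title": card.find_element(By.CSS_SELECTOR, "h3").text,
        "organization": card.find_element(By.CSS_SELECTOR, ".org").text,
        "research_field": card.find_element(By.CSS_SELECTOR, ".field").text,
        "deadline": card.find_element(By.CSS_SELECTOR, ".deadline").text,
    })
driver.quit()

os.makedirs("data", exist_ok=True)
pd.DataFrame(rows).to_csv("data/exjobb_projects.csv", index=False)
```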
### Running the Streamlit App

The Streamlit application `streamlit_app.py` visualizes the scraped data.

```bash
streamlit run streamlit_app.py
```

This command will open a new tab in your default web browser displaying the interactive dashboard.
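A minimal version of such a dashboard, with one filter and the CSV download mentioned under Features, could look like the sketch below. The column name and widget layout are assumptions, not excerpts from `streamlit_app.py`:

```python
import pandas as pd
import streamlit as st

df = pd.read_csv("data/exjobb_projects.csv")

# Assumed column name: filter projects by research field in the sidebar.
fields = st.sidebar.multiselect(
    "Research field", options=sorted(df["research_field"].dropna().unique())
)
filtered = df[df["research_field"].isin(fields)] if fields else df

st.dataframe(filtered)
st.download_button(
    "Download CSV",
    data=filtered.to_csv(index=False).encode("utf-8"),
    file_name="exjobb_projects_filtered.csv",
    mime="text/csv",
)
```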
## Directory Structure

```text
├── data
│   └── exjobb_projects.csv   # Scraped project data
├── liu_data.py               # Web scraper script
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
└── streamlit_app.py          # Streamlit visualization app
```
- `data/`: Contains the CSV file with the scraped Exjobb project data.
- `liu_data.py`: Python script that performs web scraping using Selenium.
- `streamlit_app.py`: Streamlit application for data visualization and interaction.
- `requirements.txt`: Lists all Python libraries required to run the project.
- `README.md`: Provides an overview and instructions for the project.
## Automated Updates

To ensure that the scraped data is updated daily, a GitHub Actions workflow is set up (a sketch of such a workflow appears after the steps below).

1. **GitHub Actions Workflow**

   The workflow file `.github/workflows/auto.yml` is configured to run the `liu_data.py` script daily at 02:00 UTC.

2. **Setup Steps**

   - Ensure that `.github/workflows/auto.yml` is present in your repository.
   - The workflow installs the necessary dependencies, Chrome, and ChromeDriver before running the scraper.
   - After scraping, it commits and pushes the updated `exjobb_projects.csv` back to the repository.

3. **Monitoring**

   - Navigate to the Actions tab in your GitHub repository to monitor workflow runs.
   - Ensure that the workflow completes successfully and updates the data as expected.
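For reference, a minimal version of such a workflow might look like the sketch below. The step details (runner image, commit method, Python version) are assumptions; consult the actual `.github/workflows/auto.yml` in the repository:

```yaml
name: Daily scrape

on:
  schedule:
    - cron: "0 2 * * *"  # 02:00 UTC daily
  workflow_dispatch:     # allow manual runs

permissions:
  contents: write        # needed to push the updated CSV

jobs:
  scrape:
    runs-on: ubuntu-latest  # GitHub-hosted Ubuntu images ship with Chrome
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python liu_data.py
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/exjobb_projects.csv
          git commit -m "Update scraped data" || echo "No changes to commit"
          git push
```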
## Contributing

Contributions are welcome! Follow these steps to contribute:

1. **Fork the Repository**

2. **Create a New Branch**

   ```bash
   git checkout -b feature/YourFeature
   ```

3. **Make Changes and Commit**

   ```bash
   git commit -m "Add new feature"
   ```

4. **Push to the Branch**

   ```bash
   git push origin feature/YourFeature
   ```

5. **Open a Pull Request**
## License

This project is licensed under the MIT License.
## Contact

For any inquiries or suggestions, please contact me.

🚀