acgbox_crawler

A Python bot that performs ETL (extract, transform, load) on personal favorite lists from gamer.com.tw/acgbox.
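
As a rough illustration of the three ETL stages, here is a minimal sketch that uses only the standard library. Everything in it is an assumption made for illustration: the sample markup, the `title`/`kind` fields, the `favorites` table, and the helper names are hypothetical, and `html.parser` plus an in-memory SQLite database stand in for the requests/BeautifulSoup/MySQL stack this project installs.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup: the real acgbox page structure is not described here.
SAMPLE_HTML = """
<div class="item"><div class="title"> Title A </div><div class="kind">anime</div></div>
<div class="item"><div class="title">Title B</div><div class="kind"> game </div></div>
"""


class ItemParser(HTMLParser):
    """Collect the text of the "title" and "kind" divs of each "item" div."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per item
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        cls = dict(attrs).get("class")
        if cls == "item":
            self.rows.append({})
            self._field = None
        elif cls in ("title", "kind"):
            self._field = cls

    def handle_data(self, data):
        if self._field and self.rows:
            row = self.rows[-1]
            row[self._field] = row.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "div":
            self._field = None


def extract(html):
    """Extract: turn nested divs into a list of row dicts."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def transform(rows):
    """Transform: strip stray whitespace and produce (title, kind) tuples."""
    return [(row["title"].strip(), row["kind"].strip()) for row in rows]


def load(records, con):
    """Load: insert the records, skipping titles that are already stored."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS favorites (title TEXT PRIMARY KEY, kind TEXT)"
    )
    con.executemany("INSERT OR IGNORE INTO favorites VALUES (?, ?)", records)
    con.commit()


con = sqlite3.connect(":memory:")
load(transform(extract(SAMPLE_HTML)), con)
print(con.execute("SELECT title, kind FROM favorites ORDER BY title").fetchall())
```

Keeping the stages as separate functions means each one can be swapped out later, for example replacing the parser or the database, without touching the others.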

Prerequisites

  • Everything is set up on my Arch Linux VM

to_do_list

  • Migration from MySQL to PostgreSQL
  • Find Last Updated Date of a Web Page
  • TritonHo/RDBMS course
  • Implement an update function using the pandas to_sql method with if_exists='append' so that only new ACG collection entries are added?
  • Implement a function to find the last page?
  • Research or implement a function to convert HTML div tags into a table
  • Check which files podman keeps on disk when the MySQL container is not running
  • Implement a CRUD API/query function for MySQL
  • Implement the load_data function
  • Implement the modfy_data function with advanced string replacement in pandas.DataFrame
  • Refactor parts of the code into class acgbox_crawler(object)
  • podman-compose up with docker-selenium
    • WARN[0011] aardvark-dns binary not found, container dns will not be enabled
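
For the to-do item about `to_sql` with `if_exists='append'`, one possible approach is to filter out rows whose key already exists in the table before appending. The sketch below is only an assumption about how that could look: the helper `append_new_rows`, the table name `acg`, and the key column `title` are hypothetical, and an in-memory SQLite database stands in for the project's MySQL database so the example runs on its own.

```python
import sqlite3

import pandas as pd


def append_new_rows(df, table, con, key="title"):
    """Append only rows whose key is not yet in the table; return the count.

    `table` and `key` are interpolated into SQL, so they must be trusted
    identifiers, not user input.
    """
    # SQLite-specific check for whether the table already exists.
    exists = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if exists:
        known = pd.read_sql_query(f'SELECT "{key}" FROM "{table}"', con)[key]
        df = df[~df[key].isin(known)]
    # Also drop duplicates inside the incoming batch itself.
    df = df.drop_duplicates(subset=key)
    if not df.empty:
        # if_exists="append" creates the table on first use, then adds rows.
        df.to_sql(table, con, if_exists="append", index=False)
    return len(df)


con = sqlite3.connect(":memory:")
first = pd.DataFrame({"title": ["A", "B"], "kind": ["anime", "game"]})
second = pd.DataFrame({"title": ["B", "C"], "kind": ["game", "comic"]})
print(append_new_rows(first, "acg", con))   # both rows are new
print(append_new_rows(second, "acg", con))  # only "C" is new
```

This reads every stored key on each run, which is fine for a personal list; for larger tables a unique index with a database-side upsert may be the better choice.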

quick start

setup on Arch Linux

#setup on Arch Linux
#update package databases
sudo pacman -Syy

#install podman
sudo pacman -S podman
#podman-docker
sudo pacman -S podman-docker
#podman-compose
sudo pacman -S podman-compose
#fuse-overlayfs
sudo pacman -S fuse-overlayfs

#if podman fails with: /usr/lib/libc.so.6: version `GLIBC_2.38' not found (required by podman)
#upgrade all packages
sudo pacman -Syu

#check podman
podman --version

#create a MySQL container with podman-compose
cd db_settingup/

#check out the db_settingup.md

start this project and do development

#after setup, use pipenv with your Python project
#Spawns a shell within the virtualenv.
pipenv shell

#if no packages installed
pipenv install

#add some Packages
pipenv install diagrams
pipenv install "psycopg[binary,pool]"
pipenv install requests
pipenv install beautifulsoup4
pipenv install pandas
pipenv install lxml
pipenv install SQLAlchemy
pipenv install PyYAML
pipenv install pymysql
pipenv install fake-useragent
pipenv install user_agent
pipenv install tornado

#generate requirements.txt from Pipfile.lock
pipenv requirements > requirements.txt

#be careful which directory you run from! XD
#Test
pipenv shell
cd src
python main.py
#time a simple command or give resource usage
time python main.py
# real    2m55.699s
# user    0m3.973s
# sys     0m0.858s

#podman: install aardvark-dns so that container DNS is enabled
sudo pacman -S aardvark-dns

#docker/podman with selenium
cd docker-selenium

#podman
podman-compose up -d
podman-compose stop

#docker
docker-compose up -d

misc(miscellaneous)

Important!!!

== We're Using GitHub Under Protest ==

This project is currently hosted on GitHub. This is not ideal; GitHub is a proprietary, trade-secret system that is not Free and Open Source Software (FOSS). We are deeply concerned about using a proprietary system like GitHub to develop our FOSS project. We have an open {bug ticket, mailing list thread, etc.} where the project contributors are actively discussing how we can move away from GitHub in the long term. We urge you to read about the Give up GitHub campaign from the Software Freedom Conservancy to understand some of the reasons why GitHub is not a good place to host FOSS projects.

If you are a contributor who personally has already quit using GitHub, please check this resource for how to send us contributions without using GitHub directly.

Any use of this project's code by GitHub Copilot, past or present, is done without our permission. We do not consent to GitHub's use of this project's code in Copilot.
