Repository for a web crawler that visits a user-specified number of sites using the BFS (breadth-first search) algorithm. Written in shell script (Bash v4.0+).
Starting from https://en.wikipedia.org/wiki/Cloud_computing, the script crawls non-repeating wiki pages using BFS and saves each page to disk. It then processes the saved files to build an indexer: for each file it extracts the words and writes an alphabetically sorted list in which each line contains a word and the number of times that word appears in that file.
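The crawl is a standard FIFO-queue BFS. The sketch below is illustrative rather than the actual `crawler.sh` code: the variable names, the `page_N.txt` output names, and the link-extraction pipeline are all assumptions, and the visited set uses a Bash 4+ associative array (hence the version requirement):

```bash
#!/usr/bin/env bash
# Minimal BFS crawl sketch (hypothetical; not the actual crawler.sh).
limit="${1:-150}"   # default page limit of 150, as described above
seed="https://en.wikipedia.org/wiki/Cloud_computing"
declare -A visited  # associative arrays require Bash 4+
queue=("$seed")
count=0

while ((count < limit && ${#queue[@]} > 0)); do
  url="${queue[0]}"; queue=("${queue[@]:1}")   # dequeue (FIFO => breadth-first)
  [[ -n "${visited[$url]}" ]] && continue      # skip already-visited pages
  visited["$url"]=1
  count=$((count + 1))
  lynx -dump "$url" > "page_${count}.txt"      # save the rendered page to disk
  # enqueue unseen article links found on this page (skip File:, Help:, ... pages)
  while read -r link; do
    [[ -z "${visited[$link]}" ]] && queue+=("$link")
  done < <(lynx -dump -listonly "$url" \
           | grep -o 'https://en\.wikipedia\.org/wiki/[^ ]*' \
           | grep -v '/wiki/.*:')
done
```

The per-file word index described above also maps onto a classic shell pipeline (again a hypothetical sketch, not the repository's code):

```bash
# Hypothetical indexing step: one "word count" line per distinct word,
# alphabetically sorted, for a single saved page.
tr -cs '[:alpha:]' '\n' < page_1.txt \
  | tr '[:upper:]' '[:lower:]' \
  | grep -v '^$' \
  | sort | uniq -c \
  | awk '{print $2, $1}' > index_1.txt
```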
After the indexer files are created, a second script takes a word as input and outputs the total number of times that word appears across all files, plus its count in each individual file that contains it.
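The lookup reduces to scanning every index file for the word and summing the per-file counts. A minimal sketch, assuming index files named `index_*.txt` whose lines follow the `word count` format shown above (the real `count.sh` may differ):

```bash
#!/usr/bin/env bash
# Hypothetical lookup sketch (not the actual count.sh).
word="$1"
total=0
for f in index_*.txt; do
  n=$(awk -v w="$word" '$1 == w {print $2}' "$f")
  if [[ -n "$n" ]]; then
    echo "$f: $n occurrence(s)"   # per-file count
    total=$((total + n))
  fi
done
echo "total: $total"
```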
- Shell script (Bash) version 4 or newer
- Lynx browser (a good source on how to install: https://www.tecmint.com/command-line-web-browsers/)
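Both prerequisites can be checked from the terminal before running anything (illustrative commands):

```bash
bash --version | head -n 1   # should report version 4.0 or newer
lynx -version  | head -n 1   # confirms Lynx is installed
```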
- Download the two bash files: `crawler.sh` and `count.sh`
- Execute the crawler file: `./crawler.sh nnn` (nnn is the maximum number of sites to crawl; if not specified, 150 is assumed)
- Wait for the end of execution. It may take some time depending on how many sites you chose.
- `crawler.sh` will ask you if you want to delete temporary files. Choose your answer.
- Execute the count file: `./count.sh www` (www is the word you want to search for and count)
- `count.sh` will generate a `result.txt` file and ask you if you want to display its contents. Choose your answer.
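For example, a complete run might look like this (the page limit and search word are arbitrary):

```bash
./crawler.sh 50    # crawl 50 pages, starting from the Cloud_computing article
./count.sh cloud   # count occurrences of the word "cloud" across the crawled pages
cat result.txt     # view the saved result
```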