A Beginner’s Hands-on Guide to PySpark with Google Colab: A Tutorial Notebook From Scratch
Note: All the details are saved in a notebook that you can easily run on your own Google Colab!
Colab is essentially a Linux (Ubuntu-based) virtual machine with an optional GPU. You can use the Linux wget command directly to download a dataset onto the server; by default, files land in the /content path.
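As a minimal sketch (the dataset URL below is a placeholder; swap in your own direct download link):

```python
# In a Colab cell, a leading "!" runs a shell command.
# The URL is a placeholder; replace it with your dataset's direct link.
!wget https://example.com/dataset.csv -P /content/

# Verify the file landed in the default /content directory
!ls -lh /content/
```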
Spark ships with readers for many data formats. It can also infer the data type of each column automatically, but doing so requires an extra pass over the data.
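For example, reading a CSV with schema inference turned on might look like this (assuming PySpark is already installed, see the install step below; the file path is the placeholder from the wget step):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("colab-tutorial").getOrCreate()

# inferSchema=True makes Spark scan the data once to guess column types;
# header=True treats the first row as column names
df = spark.read.csv("/content/dataset.csv", header=True, inferSchema=True)
df.printSchema()  # show the inferred column types
```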
First, the command to mount Google Drive in Colab is shown below. After running it, you will be asked to enter an authorization code from your Google account to complete the mount.
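A minimal cell for this looks like:

```python
from google.colab import drive

# Opens an authorization prompt; follow it to grant access,
# then your Drive appears under /content/drive
drive.mount('/content/drive')
```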
If you are competing on Kaggle, the dataset you need is already hosted there, and you can download it directly with the kaggle command-line tool. Create an API token from your Kaggle account page ("Create New API Token"); this downloads a kaggle.json file containing your username and key.
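A sketch of the typical flow, assuming kaggle.json has already been uploaded to /content (the competition name is just an example):

```python
# Install the Kaggle CLI and register your API credentials
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/        # assumes kaggle.json sits in /content
!chmod 600 ~/.kaggle/kaggle.json  # the CLI requires restrictive permissions

# Download a competition's data ("titanic" is a placeholder name)
!kaggle competitions download -c titanic -p /content/
```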
Google provides about 67 GB of disk space (the exact amount varies by runtime). You can also upload files manually with the upload button shown in the image below. This method is suitable for small datasets or your own datasets.
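You can also trigger an upload dialog from code instead of the button:

```python
from google.colab import files

# Opens a file picker; returns a dict mapping filenames to their bytes
uploaded = files.upload()
print(list(uploaded.keys()))  # confirm what was uploaded
```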
First, install PySpark into Colab using pip, as shown below. (For downloading data with wget, the URL can be any direct link, whether a Google or Kaggle link.)
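The install itself is a single pip command; recent PySpark wheels bundle Spark, so no separate Spark download is needed:

```python
# Installing PySpark this way pulls in Spark itself, so the
# SparkSession example above works without extra setup
!pip install -q pyspark
```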
If you have any questions, feel free to ask, and stay tuned for upcoming posts!
Happy Learning! Stick To The Plan!
LinkedIn : https://www.linkedin.com/in/parisan-ahmadi-1410a0a9/
Github : https://github.com/parisa-ahmadi
Telegram Channel : https://t.me/AIwithParissan