A Beginner’s Hands-on Guide to PySpark with Google Colab: A Tutorial Notebook From Scratch
Note: All the details are saved in a notebook that you can easily run on your own Google Colab!
Colab is essentially a Linux (Ubuntu-based) virtual machine with an optional GPU. You can use the Linux wget command directly to download a dataset onto the server; by default, files land in the /content path.
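As a minimal sketch (the dataset URL below is a placeholder; swap in your own direct download link):

```python
# In a Colab cell, a leading "!" runs a shell command.
# The URL is a placeholder; replace it with your dataset's direct link.
!wget https://example.com/dataset.csv -P /content/

# Verify the file landed in the default /content directory
!ls -lh /content/
```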
Spark ships with readers for many data formats. It can also infer the data type of each column automatically, but doing so requires an extra pass over the data.
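For example, reading a CSV with schema inference turned on might look like this (assuming PySpark is already installed, see the install step below; the file path is the placeholder from the wget step):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("colab-tutorial").getOrCreate()

# inferSchema=True makes Spark scan the data once to guess column types;
# header=True treats the first row as column names
df = spark.read.csv("/content/dataset.csv", header=True, inferSchema=True)
df.printSchema()  # show the inferred column types
```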
First, the command to mount Google Drive in Colab is shown below. After running it, you will be asked to enter an authorization code from your Google account to complete the mount.
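A minimal cell for this looks like:

```python
from google.colab import drive

# Opens an authorization prompt; follow it to grant access,
# then your Drive appears under /content/drive
drive.mount('/content/drive')
```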
If you are competing on Kaggle, the dataset you need is already hosted there, and you can download it directly with the kaggle command-line tool. Create an API token from your Kaggle account page ("Create New API Token"); this downloads a kaggle.json file containing your username and key.
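A sketch of the typical flow, assuming kaggle.json has already been uploaded to /content (the competition name is just an example):

```python
# Install the Kaggle CLI and register your API credentials
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/        # assumes kaggle.json sits in /content
!chmod 600 ~/.kaggle/kaggle.json  # the CLI requires restrictive permissions

# Download a competition's data ("titanic" is a placeholder name)
!kaggle competitions download -c titanic -p /content/
```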
Google provides about 67 GB of disk space (the exact amount varies by runtime). You can also upload files manually with the upload button shown in the image below. This method is suitable for small datasets or your own datasets.
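You can also trigger an upload dialog from code instead of the button:

```python
from google.colab import files

# Opens a file picker; returns a dict mapping filenames to their bytes
uploaded = files.upload()
print(list(uploaded.keys()))  # confirm what was uploaded
```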
First, install PySpark into Colab using pip, as shown below. (For downloading data with wget, the URL can be any direct link, whether a Google or Kaggle link.)
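The install itself is a single pip command; recent PySpark wheels bundle Spark, so no separate Spark download is needed:

```python
# Installing PySpark this way pulls in Spark itself, so the
# SparkSession example above works without extra setup
!pip install -q pyspark
```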
If you have any questions, feel free to ask, and stay tuned for upcoming posts!
Happy Learning! Stick To The Plan!
LinkedIn : https://www.linkedin.com/in/parisan-ahmadi-1410a0a9/
Github : https://github.com/parisa-ahmadi
Telegram Channel : https://t.me/AIwithParissan