::::::::::::::::::::::::::::::::::::::::::::::::::


:::::::::::::::::::::::::::::::::::::::::: spoiler

## Setup instructions if your Google Drive is not mounted
If you did not run the commands from episode 7 in this Colab session, you will need to load the `pandas` library and mount your Google Drive:
```python
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
file_location = "drive/MyDrive/lc-python/"
```
You'll need to grant Google all of the permissions it requests to make your Google Drive accessible to Colab.

:::::::::::::::::::::::::::::::::::::::::: spoiler

### What if the files have not been copied to my Google Drive yet?

Uploading files to Google Drive allows the data to persist across Colab sessions. To save time now, run `wget` to download the files directly to the cloud:

```bash
!wget https://github.com/jlchang/cb-python-intro-lesson-template/raw/refs/heads/main/episodes/files/data.zip
!unzip data.zip
```
```python
file_location = ""
```
Remember that next time you use Colab, you'll need to get these files again unless you follow the [Setup instructions](https://broadinstitute.github.io/2024-09-27-python-intro-lesson/#setup) to copy the files to Google Drive.
::::::::::::::::::::::::::::::::::::::::::::::::::

## Put all of our Chicago public library circulation data in a single DataFrame
```python
import glob

dfs = []

for csv in sorted(glob.glob(file_location + 'data/*.csv')):
    year = csv[29:33]  # the 30th to 33rd characters of each path hold the year
    # if you copied your data using wget, year should be set differently:
    # year = csv[5:9]  # the 6th to 9th characters of each path hold the year
    data = pd.read_csv(csv)
    data['year'] = year
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

df.head(3)
```
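The slice indices above depend on the length of the path prefix, so it's worth a quick sanity check. The filenames below are hypothetical stand-ins matching the patterns used in this episode; adjust them to wherever your files actually live:

```python
# Hypothetical path when files are on Google Drive (file_location = "drive/MyDrive/lc-python/")
drive_path = "drive/MyDrive/lc-python/data/2011_circ.csv"
print(drive_path[29:33])  # prints: 2011

# Hypothetical path when files were fetched with wget (file_location = "")
wget_path = "data/2011_circ.csv"
print(wget_path[5:9])     # prints: 2011
```

If the printed value isn't a four-digit year, count the characters in your own paths and adjust the slice.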
::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::::: spoiler


If we had the pickle file that contains all of our Chicago public library circulation data in a single DataFrame, we could use the Pandas `.read_pickle()` method to load the data and use it again.


```python
import pandas as pd

df = pd.read_pickle('data/all_years.pkl')
```
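For reference, a pickle file like this would have been written earlier with `DataFrame.to_pickle()`. Here is a minimal round-trip sketch; the tiny DataFrame and the path are stand-ins, not the lesson's actual data:

```python
import pandas as pd

# Stand-in DataFrame; in the lesson this would be the combined circulation df
df = pd.DataFrame({'branch': ['Austin'], 'ytd': [25527], 'year': ['2011']})

df.to_pickle('all_years.pkl')               # save for a later session
restored = pd.read_pickle('all_years.pkl')  # load it back
print(restored.equals(df))                  # prints: True
```

Pickling preserves dtypes exactly, which is why it's convenient here compared to round-tripping through CSV.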
::::::::::::::::::::::::::::::::::::::::::::::::::

## Tidy Data in Pandas

Let's take a peek at our data:

```python
df.head()
```
| | branch | address | city | zip code | january | february | march | april | may | june | july | august | september | october | november | december | ytd | year |
|---|--------|---------|------|----------|---------|----------|-------|-------|-----|------|------|--------|-----------|---------|----------|----------|-----|------|
| 3 | Austin | 5615 W. Race Ave. | Chicago | 60644.0 | 1755 | 1316 | 1942 | 2200 | 2133 | 2359 | 2080 | 2405 | 2417 | 2571 | 2233 | 2116 | 25527 | 2011 |
| 4 | Austin-Irving | 6100 W. Irving Park Rd. | Chicago | 60634.0 | 12593 | 11791 | 14807 | 14382 | 11754 | 14402 | 14605 | 15164 | 14306 | 15357 | 14069 | 12404 | 165634 | 2011 |
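One detail worth noticing: because `year` was sliced out of a filename, it is stored as a string, not a number. A small sketch of checking and (optionally) converting it, using a toy DataFrame rather than the lesson data:

```python
import pandas as pd

# Toy example: a year sliced from a filename arrives as a string
toy = pd.DataFrame({'branch': ['Austin'], 'year': ['2011']})
print(toy['year'].dtype)  # prints: object (i.e., strings)

# Convert if you need numeric comparisons or sorting by value
toy['year'] = toy['year'].astype(int)
print(toy['year'].dtype)  # int64 on most platforms
```

String years are fine for grouping and labeling; converting matters only when you want arithmetic or numeric ordering.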



```python
df.tail()
```