docs(python): Various improvements to docstrings and user guide #10981

Closed · wants to merge 37 commits

Commits (37)
94621ff
Add some streaming related docstrings
Sep 7, 2023
a03b7c0
Fix lints
Sep 7, 2023
340fca2
update sink strings
Sep 13, 2023
07a4dd4
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 13, 2023
ac2707c
add config example
Sep 17, 2023
c7251fc
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 17, 2023
abe00e1
Add structify example
Sep 22, 2023
52999b9
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 22, 2023
489cd72
checking tests
Sep 25, 2023
0ad6fcd
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 25, 2023
44ff2fb
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
c88b4fd
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
9e4813c
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
2838d33
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
e9b567a
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 28, 2023
eda8aa8
Add scan parquet options
Sep 28, 2023
c56e2b9
update docs
Sep 29, 2023
44a9d92
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 29, 2023
ced5b3f
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 2, 2023
620a5fb
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 3, 2023
f9044cd
Update user guide docs for IO
Oct 3, 2023
020a217
add python JSON
Oct 3, 2023
d5a49a5
Update contributing for user guide
Oct 4, 2023
a9d99f8
Run dprint
Oct 4, 2023
4ea1b87
Run dprint
Oct 4, 2023
4308a50
Fix link
Oct 4, 2023
28b4fd2
fix api link
Oct 4, 2023
0a1e0b5
Update links
Oct 4, 2023
f4d7c7f
Add API links
Oct 4, 2023
495ff0e
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 5, 2023
1a8374e
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 6, 2023
e770381
better float format example
Oct 6, 2023
05e9057
update examples
Oct 9, 2023
0661da2
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 9, 2023
1f938b2
changes from sdg comments
Oct 9, 2023
cb2ba8a
linting
Oct 9, 2023
c4e1b2d
remove commas
Oct 9, 2023
63 changes: 61 additions & 2 deletions CONTRIBUTING.md
@@ -151,8 +151,67 @@ The most important components of Polars documentation are the [user guide](https

### User guide

The user guide is maintained in the `docs` folder.
Further contributing information will be added shortly.
The user guide is maintained in the `docs/user-guide` folder. Before creating a PR, first raise an issue to discuss what you feel is missing or could be improved.

#### Building and serving the user guide

The user guide is built using [MkDocs](https://www.mkdocs.org/). Install the dependencies for building the user guide by running `make requirements` in the root of the repo.

Run `mkdocs serve` to build and serve the user guide so you can view it locally and see updates as you make changes.
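
For example, a minimal local workflow using the two commands above:

```shell
$ make requirements
$ mkdocs serve
```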

#### Creating a new user guide page

Each user guide page is based on a Markdown (`.md`) file. This file must be listed in `mkdocs.yml`.
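
For illustration, a hypothetical `mkdocs.yml` entry for a new page might look like the following (the actual nav structure and paths in `mkdocs.yml` may differ):

```yaml
nav:
  - User guide:
      - IO:
          - user-guide/io/cloud-storage.md
```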

#### Adding a shell code block

To add a shell code block with tabs for Python and Rust, use the following format:

````
=== ":fontawesome-brands-python: Python"

    ```shell
    $ pip install fsspec
    ```

=== ":fontawesome-brands-rust: Rust"

    ```shell
    $ cargo add aws_sdk_s3
    ```
````

#### Adding a code block

The snippets for Python and Rust code blocks are in the `docs/src/python/` and `docs/src/rust/` directories respectively. To add a code snippet with Python or Rust code to a `.md` page, use the following format:
`{{code_block('user-guide/io/cloud-storage','read_parquet',[read_parquet,read_csv])}}`

- The first argument is the path identifying the snippet file or files, in this case `docs/src/python/user-guide/io/cloud-storage.py` and `docs/src/rust/user-guide/io/cloud-storage.rs`. Either or both files may exist.
- The second argument is the snippet name given at the start and end of each snippet in the `.py` or `.rs` file.
- The third argument is a list of links to functions in the API docs. For each element of the list, there must be a corresponding entry in `docs/_build/API_REFERENCE_LINKS.yml` (example entries are shown below).
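
For reference, entries in `docs/_build/API_REFERENCE_LINKS.yml` are either a plain URL or a mapping with a display name, link, and feature flags, as in these entries taken from the diff below:

```yaml
python:
  read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html

rust:
  read_ipc:
    name: IpcReader
    link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/prelude/struct.IpcReader.html
    feature_flags: ['ipc']
```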

If the corresponding `.py` and `.rs` snippet files both exist then each snippet named in the second argument to `code_block` above must exist or the build will fail. An empty snippet should be added to the `.py` or `.rs` file if the snippet is not needed.

Each snippet is formatted as follows:

```python
# --8<-- [start:read_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

df = pl.read_parquet(source)
# --8<-- [end:read_parquet]
```

The snippet is delimited by `--8<-- [start:<snippet_name>]` and `--8<-- [end:<snippet_name>]`. The snippet name must match the name given in the second argument to `code_block` above.
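
If a snippet is only needed in one language, the empty placeholder mentioned above is just the two delimiters with nothing in between, for example in the `.rs` file (this mirrors the placeholders visible in the Rust diff further down):

```
# --8<-- [start:scan_parquet]
# --8<-- [end:scan_parquet]
```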

#### Linting

Before committing, install `dprint` (see above) and run `dprint fmt` from the `docs` directory to lint the markdown files.

### API reference

10 changes: 8 additions & 2 deletions docs/_build/API_REFERENCE_LINKS.yml
@@ -12,8 +12,8 @@ python:
write_csv: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html
read_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_json.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
read_ipc: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_ipc.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
min: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.min.html
max: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.max.html
value_counts: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.value_counts.html
@@ -65,6 +65,7 @@ python:
write_database:
name: write_database
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html
read_database_uri: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_database_uri.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
scan_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html
@@ -73,6 +74,7 @@ python:
write_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_ndjson.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
scan_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_ndjson.html
scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
from_arrow:
name: from_arrow
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html
@@ -197,7 +199,7 @@ rust:
feature_flags: ['json']
read_ndjson:
name: JsonLineReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson_core/ndjson/struct.JsonLineReader.html
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson/core/struct.JsonLineReader.html
feature_flags: ['json']
write_json:
name: JsonWriter
@@ -223,6 +225,10 @@
name: scan_parquet
link: https://pola-rs.github.io/polars/docs/rust/dev/polars/prelude/struct.LazyFrame.html#method.scan_parquet
feature_flags: ['parquet']
read_ipc:
name: IpcReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/prelude/struct.IpcReader.html
feature_flags: ['ipc']
min: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.min
max: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.max
struct:
14 changes: 0 additions & 14 deletions docs/src/python/user-guide/io/aws.py

This file was deleted.

68 changes: 68 additions & 0 deletions docs/src/python/user-guide/io/cloud-storage.py
@@ -0,0 +1,68 @@
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

df = pl.read_parquet(source)
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
"aws_access_key_id": "<secret>",
"aws_secret_access_key": "<secret>",
"aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options)
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
import polars as pl

source = "s3://bucket/*.parquet"


df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id", "value").collect()
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
import pyarrow.dataset as ds

dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
pl.scan_pyarrow_dataset(dset)
.filter("foo" == "a")
.select(["foo", "bar"])
.collect()
)
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]

import polars as pl
import s3fs

df = pl.DataFrame({
"foo": ["a", "b", "c", "d", "d"],
"bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode='wb') as f:
df.write_parquet(f)
# --8<-- [end:write_parquet]

"""
32 changes: 22 additions & 10 deletions docs/src/python/user-guide/io/database.py
@@ -1,32 +1,44 @@
"""
# --8<-- [start:read]
# --8<-- [start:read_uri]
import polars as pl

connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_database(query=query, connection_uri=connection_uri)
# --8<-- [end:read]
pl.read_database_uri(query=query, uri=uri)
# --8<-- [end:read_uri]

# --8<-- [start:read_cursor]
import polars as pl
from sqlalchemy import create_engine

conn = create_engine("sqlite:///test.db")

query = "SELECT * FROM foo"

pl.read_database(query=query, connection=conn.connect())
# --8<-- [end:read_cursor]


# --8<-- [start:adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_database(query=query, connection_uri=connection_uri, engine="adbc")
pl.read_database_uri(query=query, uri=uri, engine="adbc")
# --8<-- [end:adbc]

# --8<-- [start:write]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

df.write_database(table_name="records", connection_uri=connection_uri)
df.write_database(table_name="records", uri=uri)
# --8<-- [end:write]

# --8<-- [start:write_adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})

df.write_database(table_name="records", connection_uri=connection_uri, engine="adbc")
df.write_database(table_name="records", uri=uri, engine="adbc")
# --8<-- [end:write_adbc]

"""
24 changes: 24 additions & 0 deletions docs/src/python/user-guide/io/json-file.py
@@ -0,0 +1,24 @@
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read]
df = pl.read_json("docs/data/path.json")
# --8<-- [end:read]

# --8<-- [start:readnd]
df = pl.read_ndjson("docs/data/path.json")
# --8<-- [end:readnd]

"""

# --8<-- [start:write]
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_json("docs/data/path.json")
# --8<-- [end:write]

# --8<-- [start:scan]
df = pl.scan_ndjson("docs/data/path.json")
# --8<-- [end:scan]
docs/src/rust/user-guide/io/cloud-storage.rs
@@ -1,5 +1,5 @@
"""
# --8<-- [start:bucket]
# --8<-- [start:read_parquet]
use aws_sdk_s3::Region;

use aws_config::meta::region::RegionProviderChain;
@@ -28,5 +28,18 @@ async fn main() {

println!("{:?}", df);
}
# --8<-- [end:bucket]
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]
# --8<-- [end:write_parquet]

"""
20 changes: 0 additions & 20 deletions docs/user-guide/io/aws.md

This file was deleted.

50 changes: 50 additions & 0 deletions docs/user-guide/io/cloud-storage.md
@@ -0,0 +1,50 @@
# Cloud storage

Polars can read and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.
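
As a sketch of how this looks in Python, the same scan call can target any of the providers; only the URL scheme changes (the `s3://`, `gs://` and `az://` schemes below are illustrative and depend on your provider and configuration):

```python
import polars as pl

# The same API is used for all providers; only the URL scheme differs.
df_s3 = pl.scan_parquet("s3://bucket/*.parquet").collect()
df_gcs = pl.scan_parquet("gs://bucket/*.parquet").collect()
df_azure = pl.scan_parquet("az://container/*.parquet").collect()
```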

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

```shell
$ pip install fsspec s3fs adlfs gcsfs
```

=== ":fontawesome-brands-rust: Rust"

```shell
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
```

## Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

{{code_block('user-guide/io/cloud-storage','read_parquet',[read_parquet,read_csv,read_ipc])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.
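
For example, a minimal sketch of reading a CSV file eagerly from S3 (the bucket and path are hypothetical):

```python
import polars as pl

# fsspec downloads the file into an in-memory buffer before the DataFrame is created
df = pl.read_csv("s3://bucket/data.csv")
```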

## Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source URL, such as authentication details or the storage region. Polars looks for these as environment variables, but we can also set them manually by passing a `dict` as the `storage_options` argument.

{{code_block('user-guide/io/cloud-storage','scan_parquet',[scan_parquet])}}
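
As an alternative to `storage_options`, the credentials can be supplied through environment variables before starting the process. For AWS these are the standard variable names (shown as an assumption; other providers use different variables):

```shell
$ export AWS_ACCESS_KEY_ID="<secret>"
$ export AWS_SECRET_ACCESS_KEY="<secret>"
$ export AWS_DEFAULT_REGION="us-east-1"
```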

This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the Rust `object_store` crate internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}

## Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets, such as those using Hive partitioning.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',[scan_pyarrow_dataset])}}

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using `s3fs` for S3, `adlfs` for Azure Blob Storage, and `gcsfs` for Google Cloud Storage. In this example we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',[write_parquet])}}
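
Writing to the other providers follows the same pattern with `adlfs` or `gcsfs` in place of `s3fs`. A minimal sketch for Google Cloud Storage (the bucket name is hypothetical):

```python
import gcsfs
import polars as pl

df = pl.DataFrame({"foo": ["a", "b", "c"], "bar": [1, 2, 3]})

fs = gcsfs.GCSFileSystem()
destination = "gs://bucket/my_file.parquet"

# write the Parquet file through an open file object, exactly as with s3fs
with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)
```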