
docs(python): Various improvements to docstrings and user guide #10981

Status: Closed

Commits (37)
94621ff
Add some streaming related docstrings
Sep 7, 2023
a03b7c0
Fix lints
Sep 7, 2023
340fca2
update sink strings
Sep 13, 2023
07a4dd4
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 13, 2023
ac2707c
add config example
Sep 17, 2023
c7251fc
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 17, 2023
abe00e1
Add structify example
Sep 22, 2023
52999b9
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 22, 2023
489cd72
checking tests
Sep 25, 2023
0ad6fcd
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 25, 2023
44ff2fb
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
c88b4fd
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
9e4813c
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
2838d33
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 27, 2023
e9b567a
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 28, 2023
eda8aa8
Add scan parquet options
Sep 28, 2023
c56e2b9
update docs
Sep 29, 2023
44a9d92
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Sep 29, 2023
ced5b3f
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 2, 2023
620a5fb
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 3, 2023
f9044cd
Update user guide docs for IO
Oct 3, 2023
020a217
add python JSON
Oct 3, 2023
d5a49a5
Update contributing for user guide
Oct 4, 2023
a9d99f8
Run dprint
Oct 4, 2023
4ea1b87
Run dprint
Oct 4, 2023
4308a50
Fix link
Oct 4, 2023
28b4fd2
fix api link
Oct 4, 2023
0a1e0b5
Update links
Oct 4, 2023
f4d7c7f
Add API links
Oct 4, 2023
495ff0e
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 5, 2023
1a8374e
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 6, 2023
e770381
better float format example
Oct 6, 2023
05e9057
update examples
Oct 9, 2023
0661da2
Merge branch 'main' of github.com:pola-rs/polars into low-memory-docs…
Oct 9, 2023
1f938b2
changes from sdg comments
Oct 9, 2023
cb2ba8a
linting
Oct 9, 2023
c4e1b2d
remove commas
Oct 9, 2023
9 changes: 6 additions & 3 deletions py-polars/polars/io/csv/functions.py
@@ -133,7 +133,8 @@ def read_csv(
``utf8-lossy``, the input is first decoded in memory with
python. Defaults to ``utf8``.
low_memory
Reduce memory usage at expense of performance.
Reduce memory usage at the expense of performance when rechunking into
a single array. To work with larger-than-memory datasets, use streaming mode.
rechunk
Make sure that all columns are contiguous in memory by
aggregating the chunks into a single array.
@@ -502,7 +503,8 @@ def read_csv_batched(
``utf8-lossy``, the input is first decoded in memory with
python. Defaults to ``utf8``.
low_memory
Reduce memory usage at expense of performance.
Reduce memory usage at the expense of performance when rechunking into
a single array. To work with larger-than-memory datasets, use streaming mode.
rechunk
Make sure that all columns are contiguous in memory by
aggregating the chunks into a single array.
@@ -781,7 +783,8 @@ def scan_csv(
Lossy means that invalid utf8 values are replaced with ``�``
characters. Defaults to "utf8".
low_memory
Reduce memory usage in expense of performance.
Reduce memory usage at the expense of performance when rechunking into
a single array. To work with larger-than-memory datasets, use streaming mode.
rechunk
Reallocate to contiguous memory when all chunks/ files are parsed.
skip_rows_after_header
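For context, a minimal sketch of the pattern these docstrings point to; the file path and column name are hypothetical:

```python
import polars as pl

# Build a lazy query; low_memory trades some speed for lower memory use.
lf = pl.scan_csv("data.csv", low_memory=True)

# For larger-than-memory inputs, collect in streaming mode so the query
# is processed in batches instead of as a single in-memory table.
df = lf.filter(pl.col("value") > 0).collect(streaming=True)
```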
6 changes: 4 additions & 2 deletions py-polars/polars/io/parquet/functions.py
@@ -80,7 +80,8 @@ def read_parquet(
row_count_offset
Offset to start the row_count column (only use if the name is set).
low_memory
Reduce memory pressure at the expense of performance.
Reduce memory usage at the expense of performance when rechunking into
a single array. To work with larger-than-memory datasets, use streaming mode.
pyarrow_options
Keyword arguments for `pyarrow.parquet.read_table
<https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html>`_.
@@ -215,7 +216,8 @@ def scan_parquet(
particular storage connection.
e.g. host, port, username, password, etc.
low_memory
Reduce memory pressure at the expense of performance.
Reduce memory usage at the expense of performance when rechunking into
a single array. To work with larger-than-memory datasets, use streaming mode.
use_statistics
Use statistics in the parquet to determine if pages
can be skipped from reading.
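The same pattern applies to Parquet sources; again a sketch with hypothetical paths:

```python
import polars as pl

# scan_parquet accepts glob patterns and also exposes low_memory.
lf = pl.scan_parquet("data/*.parquet", low_memory=True)

# Streaming collect processes the scan in batches.
df = lf.select(pl.col("amount").sum()).collect(streaming=True)
```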
49 changes: 43 additions & 6 deletions py-polars/polars/lazyframe/frame.py
@@ -1624,11 +1624,30 @@ def collect(
**kwargs: Any,
) -> DataFrame:
"""
Collect into a DataFrame.
Collect a LazyFrame into a DataFrame.

Note: use :func:`fetch` if you want to run your query on the first `n` rows
Use :func:`fetch` if you want to run your query on the first `n` rows
only. This can be a huge time saver when debugging queries.

By default, all query optimizations are applied. Use the arguments of
``collect`` to turn off particular optimizations.

If ``streaming`` is False, the entire query is processed in a single batch.
If ``streaming`` is True, Polars tries to process the query in batches so
that larger-than-memory datasets can be handled. Use :func:`explain` to
see if Polars can process the query in streaming mode.
Use :func:`polars.set_streaming_chunk_size` to set the size of the
batches.

See Also
--------
polars.collect_all : Collect multiple LazyFrames at the same time.
polars.collect_all_async : Collect multiple LazyFrames at the same time asynchronously.
polars.explain : Print the query plan that is evaluated with collect.
polars.set_streaming_chunk_size : Set the size of streaming batches.
profile : Collect the LazyFrame and time each node in the computation graph.


Parameters
----------
type_coercion
Expand Down Expand Up @@ -1677,6 +1696,24 @@ def collect(
│ c ┆ 6 ┆ 1 │
└─────┴─────┴─────┘

Collect in streaming mode

>>> (
... lf.group_by("a", maintain_order=True)
... .agg(pl.all().sum())
... .collect(streaming=True)
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a ┆ 4 ┆ 10 │
│ b ┆ 11 ┆ 10 │
│ c ┆ 6 ┆ 1 │
└─────┴─────┴─────┘

"""
eager = kwargs.get("eager", False)
if no_optimization or eager:
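A sketch of the workflow the new docstring describes: check the plan with explain, optionally tune the batch size, then collect. The data and chunk size here are illustrative only:

```python
import polars as pl

lf = pl.LazyFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})
query = lf.group_by("a").agg(pl.col("b").sum())

# Sections of the plan that can run in streaming mode are marked as such.
print(query.explain(streaming=True))

# Optionally set the size of the streaming batches before collecting.
pl.set_streaming_chunk_size(50_000)
df = query.collect(streaming=True)
```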
@@ -1830,7 +1867,7 @@ def sink_parquet(
slice_pushdown: bool = True,
) -> DataFrame:
"""
Persists a LazyFrame at the provided path.
Collect and write a LazyFrame in streaming mode to a Parquet file at the given path.

This allows streaming results that are larger than RAM to be written to disk.
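A sketch of what the reworded summary means in practice: the query runs on the streaming engine and the result is written to disk without materializing a full DataFrame. Paths and the column name are hypothetical:

```python
import polars as pl

# sink_parquet executes the lazy query with the streaming engine and
# writes batches straight to the Parquet file.
pl.scan_csv("big_input.csv").filter(pl.col("amount") > 0).sink_parquet(
    "output.parquet"
)
```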

Expand Down Expand Up @@ -1926,7 +1963,7 @@ def sink_ipc(
slice_pushdown: bool = True,
) -> DataFrame:
"""
Persists a LazyFrame at the provided path.
Collect and write a LazyFrame in streaming mode to an IPC file at the given path.

This allows streaming results that are larger than RAM to be written to disk.

@@ -2009,7 +2046,7 @@ def sink_csv(
slice_pushdown: bool = True,
) -> DataFrame:
"""
Persists a LazyFrame at the provided path.
Collect and write a LazyFrame in streaming mode to a CSV file at the given path.

This allows streaming results that are larger than RAM to be written to disk.
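And the CSV counterpart, again with hypothetical paths and column names:

```python
import polars as pl

# Stream a larger-than-RAM result straight to a CSV file on disk.
pl.scan_parquet("events/*.parquet").select(["user_id", "ts"]).sink_csv(
    "events.csv"
)
```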

@@ -2629,7 +2666,7 @@ def group_by(
maintain_order
Ensure that the order of the groups is consistent with the input data.
This is slower than a default group by.
Settings this to ``True`` blocks the possibility
Setting this to ``True`` blocks the possibility
to run on the streaming engine.

Examples
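To illustrate the corrected note: with maintain_order=True the group-by cannot run on the streaming engine, which explain(streaming=True) makes visible in the plan. Toy data, illustrative only:

```python
import polars as pl

lf = pl.LazyFrame({"a": ["x", "y", "x"], "b": [1, 2, 3]})

# maintain_order=True keeps group order stable but blocks streaming,
# so the plan below should not contain a streaming section.
q = lf.group_by("a", maintain_order=True).agg(pl.col("b").sum())
print(q.explain(streaming=True))
```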