
Commit

Minor cleanup of user guide text
stinodego committed Oct 6, 2023
1 parent 126db7d commit 1619420
Showing 3 changed files with 16 additions and 12 deletions.
docs/user-guide/io/cloud-storage.md (6 changes: 4 additions & 2 deletions)
@@ -3,6 +3,7 @@
Polars can read and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

```shell
@@ -31,7 +32,7 @@ Polars can scan a Parquet file in lazy mode from cloud storage. We may need to p

This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the `object_store.rs` library internally to manage the interface with the cloud storage providers and so no extra dependencies are required in Python to scan a cloud Parquet file.
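
As a rough illustration, such a scan might look like the sketch below, assuming a hypothetical bucket `my-bucket` and credentials passed through `storage_options`:

```python
import polars as pl

# Hypothetical bucket and credentials, for illustration only.
storage_options = {
    "aws_access_key_id": "<key>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}

lf = pl.scan_parquet("s3://my-bucket/data.parquet", storage_options=storage_options)
print(lf.schema)  # the schema is available without downloading the data
```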

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md) the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.
If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}
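
A minimal sketch of such a query, assuming hypothetical `id` and `value` columns and credentials configured in the environment:

```python
import polars as pl

# Only the selected columns and matching rows need to be fetched;
# nothing is downloaded until collect() is called.
df = (
    pl.scan_parquet("s3://my-bucket/data.parquet")
    .filter(pl.col("id") < 100)
    .select(["id", "value"])
    .collect()
)
```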

@@ -40,10 +41,11 @@ If we create a lazy query with [predicate and projection pushdowns](../lazy/opti
We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets such as those using Hive partitioning.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',[scan_pyarrow_dataset])}}
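
A sketch of that two-step flow, assuming a hypothetical Hive-partitioned Parquet dataset under `s3://my-bucket/hive-dataset/` and credentials available in the environment:

```python
import polars as pl
import pyarrow.dataset as ds

# Hypothetical Hive-partitioned Parquet dataset; PyArrow resolves the s3:// URI
# with its built-in S3 filesystem, so credentials come from the environment.
dset = ds.dataset(
    "s3://my-bucket/hive-dataset/",
    format="parquet",
    partitioning="hive",
)

lf = pl.scan_pyarrow_dataset(dset)
df = lf.filter(pl.col("year") == 2023).collect()
```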

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using s3fs for S3, adlfs for Azure Blob Storage and gcsfs for Google Cloud Storage. In this example we write a Parquet file to S3.
We can write a `DataFrame` to cloud storage in Python using s3fs for S3, adlfs for Azure Blob Storage and gcsfs for Google Cloud Storage. In this example, we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',[write_parquet])}}
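
A minimal sketch of the s3fs approach, assuming a hypothetical bucket `my-bucket` and credentials picked up from the environment:

```python
import polars as pl
import s3fs

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

fs = s3fs.S3FileSystem()  # credentials are read from the environment

# Open a writable handle on S3 and let Polars write the Parquet file into it.
with fs.open("s3://my-bucket/data.parquet", mode="wb") as f:
    df.write_parquet(f)
```
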
docs/user-guide/io/database.md (18 changes: 10 additions & 8 deletions)
@@ -2,24 +2,25 @@

## Read from a database

Polars can read from a database using either the `pl.read_database_uri` and `pl.read_database` functions.
Polars can read from a database using the `pl.read_database_uri` and `pl.read_database` functions.

### Difference between the `read_database` functions
### Difference between `read_database_uri` and `read_database`

Use `pl.read_database_uri` if you want to specify the database connection with a connection string called a `uri`. For example, the following snippet shows a query to read all columns from the `foo` table in a Postgres database where we use the `uri` to connect:

{{code_block('user-guide/io/database','read_uri',['read_database_uri'])}}
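
A sketch of such a call, with a hypothetical Postgres connection string in place of real credentials:

```python
import polars as pl

# Hypothetical Postgres connection string; substitute real credentials.
uri = "postgresql://username:password@server:5432/database"

df = pl.read_database_uri(query="SELECT * FROM foo", uri=uri)
```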

On the other hand use `pl.read_database` if you want to connect via a connection engine created with a library like SQLAlchemy.
On the other hand, use `pl.read_database` if you want to connect via a connection engine created with a library like SQLAlchemy.

{{code_block('user-guide/io/database','read_cursor',['read_database'])}}
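
A sketch of the SQLAlchemy route, again with a hypothetical connection string; the matching database driver is assumed to be installed:

```python
import polars as pl
from sqlalchemy import create_engine

# Hypothetical connection string, for illustration only.
engine = create_engine("postgresql://username:password@server:5432/database")

df = pl.read_database(query="SELECT * FROM foo", connection=engine.connect())
```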

Note that `pl.read_database_uri` is likely to be faster than `pl.read_database` if you are using a SQLAlchemy or DBAPI2 connection, as these connections may load the data row-wise into Python before copying the data again to the column-wise Apache Arrow format.

### Engines

Polars doesn't manage connections and data transfer from databases by itself. Instead external libraries (known as _engines_) handle this.
Polars doesn't manage connections and data transfer from databases by itself. Instead, external libraries (known as _engines_) handle this.

If you use `pl.read_database` then you specify the engine when you make the connection object. If you use `pl.read_database_uri` then you can specify one of two engines to read from the database:
When using `pl.read_database`, you specify the engine when you create the connection object. When using `pl.read_database_uri`, you can specify one of two engines to read from the database:

- [ConnectorX](https://github.com/sfu-db/connector-x) and
- [ADBC](https://arrow.apache.org/docs/format/ADBC.html)
@@ -46,7 +47,7 @@ It is still early days for ADBC so support for different databases is still limi
$ pip install adbc-driver-sqlite
```

As ADBC is not the default engine you must specify the engine as an argument to `pl.read_database`
As ADBC is not the default engine, you must specify the engine as an argument to `pl.read_database_uri`

{{code_block('user-guide/io/database','adbc',['read_database_uri'])}}
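
A sketch of an ADBC read, assuming a hypothetical local SQLite file `my_database.db` and the `adbc-driver-sqlite` package installed:

```python
import polars as pl

# Hypothetical local SQLite file; requires the adbc-driver-sqlite package.
uri = "sqlite:///my_database.db"

df = pl.read_database_uri(query="SELECT * FROM foo", uri=uri, engine="adbc")
```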

@@ -73,9 +74,10 @@ In this example, we write the `DataFrame` to a table called `records` in the dat

{{code_block('user-guide/io/database','write',['write_database'])}}
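
A sketch of such a write, assuming a hypothetical SQLite database as the target:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Hypothetical SQLite target; the default engine writes via SQLAlchemy.
df.write_database(table_name="records", connection="sqlite:///my_database.db")
```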

In the SQLAlchemy approach Polars converts the `DataFrame` to a Pandas `DataFrame` backed by PyArrow and then uses SQLAlchemy methods on a Pandas `DataFrame` to write to the database.
In the SQLAlchemy approach, Polars converts the `DataFrame` to a Pandas `DataFrame` backed by PyArrow and then uses SQLAlchemy methods on a Pandas `DataFrame` to write to the database.

#### ADBC

As with reading from a database you can also use ADBC to write to a SQLite or Posgres database. As shown above you need to install the appropriate ADBC driver for your database.
As with reading from a database, you can also use ADBC to write to a SQLite or Postgres database. As shown above, you need to install the appropriate ADBC driver for your database.

{{code_block('user-guide/io/database','write_adbc',['write_database'])}}
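
A sketch of the ADBC variant, targeting the same hypothetical SQLite database:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Same hypothetical SQLite target, written through the ADBC engine instead.
df.write_database(
    table_name="records",
    connection="sqlite:///my_database.db",
    engine="adbc",
)
```
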
docs/user-guide/io/json_file.md (4 changes: 2 additions & 2 deletions)
@@ -14,15 +14,15 @@ Reading a JSON file should look familiar:

JSON objects that are delimited by newlines can be read into Polars in a much more performant way than standard JSON.

Polars can read an ND-JSON file into a `DataFrame` using the `read_ndjson` function:
Polars can read an NDJSON file into a `DataFrame` using the `read_ndjson` function:

{{code_block('user-guide/io/json-file','readnd',['read_ndjson'])}}
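
A minimal sketch, assuming a hypothetical file `data/example.ndjson` with one JSON object per line:

```python
import polars as pl

# Hypothetical file; each line holds one JSON object, e.g.
# {"id": 1, "name": "foo"}
# {"id": 2, "name": "bar"}
df = pl.read_ndjson("data/example.ndjson")
```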

## Write

{{code_block('user-guide/io/json-file','write',['write_json','write_ndjson'])}}

## Scan NDJSON
## Scan

`Polars` allows you to _scan_ a JSON input **only for newline-delimited JSON**. Scanning delays the actual parsing of the
file and instead returns a lazy computation holder called a `LazyFrame`.
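
A minimal sketch, assuming the same hypothetical `data/example.ndjson` file and an `id` column:

```python
import polars as pl

lf = pl.scan_ndjson("data/example.ndjson")  # no parsing happens yet

# The file is only read and parsed when the query is collected.
df = lf.filter(pl.col("id") > 1).collect()
```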
