Standardizing File, FTP/SFTP and HTTPS connector docs (#609)
* Standardizing File connector documentation

* Enhancing HTTPS connector documentation

* Standardizing FTP connector documentation

* Updating secrets section

* Update spiceaidocs/docs/components/data-connectors/file.md

Co-authored-by: Phillip LeBlanc <[email protected]>

* Update spiceaidocs/docs/components/data-connectors/ftp.md

Co-authored-by: Phillip LeBlanc <[email protected]>

* Update spiceaidocs/docs/components/data-connectors/https.md

Co-authored-by: Phillip LeBlanc <[email protected]>

* Removing secrets section from File docs

* Improving FTP documents based on feedback

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
slyons and phillipleblanc authored Nov 14, 2024
1 parent 08aa7e0 commit c925a74
Showing 3 changed files with 229 additions and 81 deletions.
83 changes: 71 additions & 12 deletions spiceaidocs/docs/components/data-connectors/file.md
@@ -4,10 +4,8 @@ sidebar_label: 'File Data Connector'
description: 'File Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

The File Data Connector enables federated SQL queries on files stored by locally accessible filesystems. It supports querying individual files or entire directories, where all child files within the directory will be loaded and queried.
The File Data Connector enables federated/accelerated SQL queries on files stored on locally accessible filesystems. It supports querying individual files or entire directories; when a directory is given, all child files within it are loaded and queried.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

@@ -19,20 +17,45 @@ datasets:
name: customer
params:
file_format: parquet
```
## Configuration
### `from`

The `from` field for the File connector takes the form `file://path`, where `path` is the path to the file or directory to read from. See the [examples](#examples) below for relative and absolute paths.

### `name`

- from: file://path/to/orders.csv
name: orders
The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
- from: file://path/to/customer.parquet
name: cool_dataset
params:
file_format: csv
csv_has_header: false
...
```

## Parameters
```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

| Parameter name | Description |
|------------------------|-------------------------------------------------------------------------------------------------------|
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
| `hive_partitioning_enabled`| Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
### `params`

| Parameter name | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------ |
| `file_format` | Specifies the data file format. Required if the format cannot be inferred from the `from` path. |
| `hive_partitioning_enabled` | Enables hive-style partitioning based on the folder structure. Defaults to `false`.              |

For CSV-specific parameters, see [CSV Parameters](/reference/file_format.md#csv).
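
As a minimal sketch of how `hive_partitioning_enabled` might be used with a directory source: the folder layout, dataset name, and partition keys below are hypothetical and only illustrate the `key=value` folder convention that hive-style partitioning refers to.

```yaml
# Hypothetical layout on the local filesystem (names are illustrative only):
#   path/to/sales/year=2024/month=01/data.parquet
#   path/to/sales/year=2024/month=02/data.parquet
datasets:
  - from: file://path/to/sales/
    name: sales
    params:
      file_format: parquet
      # Derive partitions from the key=value folder names above
      hive_partitioning_enabled: true
```

With hive-style partitioning, the folder keys (here `year` and `month`) are typically exposed as columns that can be used in query filters. Confirm the exact behavior against the Spice version in use.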

@@ -52,3 +75,39 @@ datasets:
```

When the file is modified, the acceleration will be refreshed and will include the latest data.

## Examples

### Absolute path

In this example, `path` is an absolute path to the file on the filesystem.

```yaml
datasets:
  - from: file://path/to/customer.parquet
    name: customer
    params:
      file_format: parquet
```

### Relative path

In this example, the path is relative to the directory where the `spicepod.yaml` is located.

```bash
├── foo
│   └── yellow_tripdata_2024-01.parquet
└── spicepod.yaml
```

```yaml
datasets:
  - from: file:foo/yellow_tripdata_2024-01.parquet
    name: trip_data
    params:
      file_format: parquet
```

## Quickstarts and Samples

Refer to the [File quickstart](https://github.com/spiceai/quickstarts/tree/trunk/file) to see an example of the File connector in use.
161 changes: 100 additions & 61 deletions spiceaidocs/docs/components/data-connectors/ftp.md
@@ -4,69 +4,108 @@ sidebar_label: 'FTP/SFTP Data Connector'
description: 'FTP/SFTP Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
FTP (File Transfer Protocol) and SFTP (SSH File Transfer Protocol) are network protocols for transferring files between a client and a server. FTP sends data, including credentials, unencrypted, while SFTP provides encrypted file transfer over SSH.

The FTP/SFTP Data Connector enables federated SQL query across Parquet/CSV files stored in FTP/SFTP servers.
The FTP/SFTP Data Connector enables federated/accelerated SQL queries across files in [supported file formats](/components/data-connectors/index.md#object-store-file-formats) stored on FTP/SFTP servers.

If a folder is provided, all child Parquet/CSV files will be loaded.
```yaml
datasets:
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: my_dataset
    params:
      file_format: csv
      sftp_port: 22
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}
```
## Configuration
<Tabs>
<TabItem value="ftp" label="FTP" default>
### Parameters

The connection to FTP can be configured by providing the following params:

- `file_format`: Specifies the data file format. Required if the format cannot be inferred by from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
- `ftp_port`: Optional, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21`
- `ftp_user`: The username for the FTP server. E.g. `ftp_user: my-ftp-user`
- `ftp_pass`: The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client.
- `hive_partitioning_enabled`: Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

### Examples
```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
ftp_user: my-ftp-user
ftp_pass: ${secrets:my_ftp_password}
hive_partitioning_enabled: false
```

</TabItem>
<TabItem value="sftp" label="SFTP">
### Parameters

The connection to SFTP can be configured by providing the following params:

- `file_format`: Optional, specifies the requested file format.
- `parquet`: (default) Parquet file format.
- `csv`: CSV file format.
- `sftp_port`: Optional, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22`
- `sftp_user`: The username for the SFTP server. E.g. `sftp_user: my-sftp-user`
- `sftp_pass`: The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`.
- `client_timeout`: Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client.
- `hive_partitioning_enabled`: Optional. Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false`

More CSV related parameters can be configured, see [CSV Parameters](/reference/file_format.md#csv)

### Examples
```yaml
- from: sftp://remote-sftp-server.com/path/to/folder/
name: my_dataset
params:
file_format: csv
sftp_port: 20
sftp_user: my-sftp-user
sftp_pass: ${secrets:my_sftp_password}
hive_partitioning_enabled: true
```

</TabItem>
</Tabs>
### `from`

The `from` field takes one of two forms: `ftp://<host>/<path>` or `sftp://<host>/<path>` where `<host>` is the host to connect to and `<path>` is the path to the file or directory to read from.

If a folder is provided, all child files will be loaded.
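
The two kinds of `from` target can be sketched side by side. In this illustrative fragment the host names, paths, dataset names, and secret names are placeholders; the first dataset points at a folder (all child files are loaded) and the second at a single file.

```yaml
datasets:
  # Folder target: every child file under the folder is loaded
  - from: ftp://remote-ftp-server.com/path/to/folder/
    name: all_reports
    params:
      file_format: csv
      ftp_user: my-ftp-user
      ftp_pass: ${secrets:my_ftp_password}

  # Single-file target (hypothetical file name)
  - from: sftp://remote-sftp-server.com/path/to/report.csv
    name: single_report
    params:
      file_format: csv
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}
```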

### `name`

The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: cool_dataset
    params:
      ...
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

### `params`

#### FTP

| Parameter Name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format`               | Specifies the data file format. Required if the format cannot be inferred from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).    |
| `ftp_port` | Optional, specifies the port of the FTP server. Default is 21. E.g. `ftp_port: 21` |
| `ftp_user` | The username for the FTP server. E.g. `ftp_user: my-ftp-user` |
| `ftp_pass` | The password for the FTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_ftp_pass}`. |
| `client_timeout` | Optional. Specifies timeout for FTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for FTP client. |
| `hive_partitioning_enabled` | Optional. Enables hive-style partitioning based on the folder structure. Defaults to `false`.                                                                                                       |

#### SFTP

| Parameter Name              | Description                                                                                                                                                                                         |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format`               | Specifies the data file format. Required if the format cannot be inferred from the `from` path. See [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).    |
| `sftp_port` | Optional, specifies the port of the SFTP server. Default is 22. E.g. `sftp_port: 22` |
| `sftp_user` | The username for the SFTP server. E.g. `sftp_user: my-sftp-user` |
| `sftp_pass` | The password for the SFTP server. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_sftp_pass}`. |
| `client_timeout` | Optional. Specifies timeout for SFTP connection. E.g. `client_timeout: 30s`. When not set, no timeout will be configured for SFTP client. |
| `hive_partitioning_enabled` | Optional. Enables hive-style partitioning based on the folder structure. Defaults to `false`.                                                                                                       |
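
As a rough sketch of how the optional settings in the tables above combine, the following dataset sets an explicit port and a connection timeout for an SFTP source. The host, folder, dataset name, and secret name are placeholders; only the parameter names and defaults come from the tables.

```yaml
datasets:
  - from: sftp://remote-sftp-server.com/path/to/folder/
    name: my_dataset
    params:
      file_format: csv
      sftp_port: 22          # optional; 22 is the documented default
      sftp_user: my-sftp-user
      sftp_pass: ${secrets:my_sftp_password}
      client_timeout: 30s    # optional; when omitted, no client timeout is configured
```

For an FTP source, the corresponding parameters are `ftp_port` (default 21), `ftp_user`, and `ftp_pass`.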

## Examples

### Connecting to FTP

```yaml
- from: ftp://remote-ftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    ftp_user: my-ftp-user
    ftp_pass: ${secrets:my_ftp_password}
    hive_partitioning_enabled: false
```

### Connecting to SFTP

```yaml
- from: sftp://remote-sftp-server.com/path/to/folder/
  name: my_dataset
  params:
    file_format: csv
    sftp_port: 22
    sftp_user: my-sftp-user
    sftp_pass: ${secrets:my_sftp_password}
    hive_partitioning_enabled: false
```

## Quickstarts and Samples

Refer to the [FTP quickstart](https://github.com/spiceai/quickstarts/tree/trunk/ftp) to see an example of the FTP connector in use.

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
66 changes: 58 additions & 8 deletions spiceaidocs/docs/components/data-connectors/https.md
@@ -5,26 +5,76 @@ description: 'HTTP(s) Data Connector Documentation'
pagination_prev: null
---

The HTTP(s) Data Connector enables federated SQL query against a variety of tabular formatted (e.g. Parquet/CSV) files stored at a HTTP endpoint.
The HTTP(s) Data Connector enables federated/accelerated SQL queries across files in [supported file formats](/components/data-connectors/index.md#object-store-file-formats) stored at an HTTP(s) endpoint.

The connector supports Basic HTTP authentication via `param` values.
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}
```
## Configuration
### `from`

The `from` field must contain a valid URI to the location of a [supported file](/components/data-connectors/index.md#object-store-file-formats). For example, `http://static_username@localhost:3001/report.csv`.

### `name`

The dataset name. This will be used as the table name within Spice.

Example:
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: cool_dataset
    params:
      ...
```

### Parameters
```sql
SELECT COUNT(*) FROM cool_dataset;
```

```shell
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
```

- `http_port`: Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
- `http_username`: Optional. Username to provide connection for HTTP basic authentication. Default: None.
- `http_password`: Optional. Password to provide connection for HTTP basic authentication. Default: None. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_http_pass}`.
- `client_timeout`: Optional. Specifies timeout for HTTP operations. Default value is `30s` E.g. `client_timeout: 60s`
### `params`

### Examples
The connector supports HTTP Basic authentication via `params` values.

| Parameter Name | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `http_port`      | Optional. Port to use for the HTTP(s) connection. Default: 80 for HTTP and 443 for HTTPS.                                                                                                                                       |
| `http_username`  | Optional. Username for HTTP basic authentication. Default: None.                                                                                                                                                                |
| `http_password`  | Optional. Password for HTTP basic authentication. Default: None. Use the [secret replacement syntax](../secret-stores/index.md) to load the password from a secret store, e.g. `${secrets:my_http_pass}`.                       |
| `client_timeout` | Optional. Specifies the timeout for HTTP operations. Default: `30s`. E.g. `client_timeout: 60s`                                                                                                                                 |
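
The parameters above might be combined as in the following sketch. The endpoint URL, port, dataset name, user name, and secret name are all hypothetical; it assumes the credentials are supplied through `params` rather than embedded in the URL.

```yaml
datasets:
  # Hypothetical endpoint; the file format is inferred from the .csv extension
  - from: https://example.com/data/report.csv
    name: remote_report
    params:
      http_port: 8443                          # overrides the HTTPS default of 443
      http_username: my-http-user
      http_password: ${secrets:my_http_pass}   # loaded from a secret store
      client_timeout: 60s                      # overrides the 30s default
```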

## Examples

### Basic example
```yaml
datasets:
  - from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
    name: laion_freesound
```

### Using Basic Authentication
```yaml
datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
