From a5a687439628aed116d40830fe163742aacf3a6d Mon Sep 17 00:00:00 2001 From: tianwei Date: Tue, 12 Dec 2023 18:29:19 +0800 Subject: [PATCH 1/6] add dataset guide --- docs/dataset/build.md | 0 docs/dataset/integration.md | 0 docs/dataset/load.md | 0 docs/dataset/version.md | 0 docs/dataset/view.md | 0 .../current/dataset/build.md | 206 ++++++++++++++++++ .../current/dataset/integration.md | 178 +++++++++++++++ .../current/dataset/load.md | 3 + .../current/dataset/version.md | 3 + .../current/dataset/view.md | 3 + .../current/reference/sdk/type.md | 4 +- sidebars.js | 8 +- 12 files changed, 401 insertions(+), 4 deletions(-) create mode 100644 docs/dataset/build.md create mode 100644 docs/dataset/integration.md create mode 100644 docs/dataset/load.md create mode 100644 docs/dataset/version.md create mode 100644 docs/dataset/view.md create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md diff --git a/docs/dataset/build.md b/docs/dataset/build.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/dataset/integration.md b/docs/dataset/integration.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/dataset/load.md b/docs/dataset/load.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/dataset/version.md b/docs/dataset/version.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/dataset/view.md b/docs/dataset/view.md new file mode 100644 index 000000000..e69de29bb diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md new file mode 100644 index 000000000..a5e92b76a --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md @@ -0,0 +1,206 @@ +--- +title: 数据集构建 +--- + +Starwhale 数据集构建方式非常灵活,可以从一些图片/音频/视频/csv/json/jsonl文件构建,也可以写一些Python脚本构建,还可以从Huggingface Hub 导入数据集。 + +## 从数据文件直接构建 + +### 图片 + +支持递归遍历目录中的图片文件,构建Starwhale 数据集,不需要写任何代码: + +- 支持的图片文件格式: `png/jpg/jpeg/webp/svg/apng` +- 图片会转成 Starwhale.Image 类型,并可以在 Starwhale Server Web页面中查看。 +- 支持命令行 `swcli dataset build --image` 和 Python SDK `starwhale.Dataset.from_folder` 两种方式。 +- label机制:当 SDK 设置 `auto_label=True` 或 命令行设置 `--auto-label` 时,会将父目录的名字作为 `label`。 +- metadata机制:可以通过在根目录设置 `metadata.csv` 或 `metadata.jsonl` 文件来扩展数据集的列。 +- caption机制:当在同目录下发现 `{image-name}.txt` 文件时,文件中的内容会被自动导入,填充到 `caption` 列中。 + +假设在 folder 目录中有下面四个文件: + +```console +folder/dog/1.png +folder/cat/2.png +folder/dog/3.png +folder/cat/4.png +``` + +命令方式构建方法: + +```console +❯ swcli dataset build --image folder --name image-folder +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/image-folder/version/latest +🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2... +🦋 update 4 records into dataset +🌺 congratulation! 
you can run swcli dataset info image-folder/version/uw6mdisnf7al
```

```console
❯ swcli dataset head image-folder -n 2
row ───────────────────────────────────────
🌳 id: cat/2.png
🌀 features:
  🔅 file_name : cat/2.png
  🔅 label : cat
  🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
row ───────────────────────────────────────
🌳 id: cat/4.png
🌀 features:
  🔅 file_name : cat/4.png
  🔅 label : cat
  🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
```

Python SDK方式构建:

```python
from starwhale import Dataset
ds = Dataset.from_folder("folder", kind="image")
print(ds)
print(ds.fetch_one().features)
```

```console
🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna...
🦋 update 4 records into dataset
Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna
{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
```

### 视频

支持递归遍历目录中的视频文件,构建Starwhale 数据集,不需要写任何代码:

- 支持的视频文件格式:`mp4/webm/avi`
- 视频会被转成 Starwhale.Video 类型,并可以在 Starwhale Server Web页面中查看。
- 支持命令行 `swcli dataset build --video` 和 Python SDK `starwhale.Dataset.from_folder` 两种方式。
- label, caption 和 metadata 机制与图片方式相同。

### 音频

支持递归遍历目录中的音频文件,构建Starwhale 数据集,不需要写任何代码:

- 支持的音频文件格式:`mp3/wav`
- 音频会被转成 Starwhale.Audio 类型,并可以在 Starwhale Server Web页面中查看。
- 支持命令行 `swcli dataset build --audio` 和 Python SDK `starwhale.Dataset.from_folder` 两种方式。
- label, caption 和 metadata 机制与图片方式相同。

### csv 文件

支持命令行或Python SDK方式将本地或远端的csv文件直接转化成 Starwhale 数据集:

- 支持一个或多个本地csv文件
- 支持对本地目录递归寻找csv文件
- 支持一个或多个以http url方式指定的远端csv文件

命令行方式构建:

```console
❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig
🚧 start to build dataset bundle...
👷 uri local/project/self/dataset/product-desc-modelscope/version/latest
🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe...
🦋 update 3848 records into dataset
🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj
```

Python SDK方式构建:

```python
from starwhale import Dataset
ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset")
```

### json/jsonl 文件

支持命令行或Python SDK方式将本地或远端的json/jsonl文件直接转化成 Starwhale 数据集:

- 支持一个或多个本地json/jsonl文件
- 支持对本地目录递归寻找json/jsonl文件
- 支持一个或多个以http url方式指定的远端json/jsonl文件

对于json文件:

- 默认认为json解析后的对象是list,list中的每个对象是dict,会映射为Starwhale 数据集中的一行。
- 可以通过 `--field-selector` 或 `field_selector` 参数定位具体的某个list。

比如json文件如下:

```json
{
    "p1": {
        "p2":{
            "p3": [
                {"a": 1, "b": 2},
                {"a": 10, "b": 20}
            ]
        }
    }
}
```

那么可以设置 `--field-selector=p1.p2.p3`,准确添加两行数据到数据集中。

命令方式构建:

```console
❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl
🚧 start to build dataset bundle...
👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest
🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la...
🦋 update 906 records into dataset
🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx
```

Python SDK方式构建:

```python
from starwhale import Dataset
myds = Dataset.from_json(
    name="translation",
    text='{"content":{"child_content":[{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}',
    field_selector="content.child_content"
)
print(myds[0].features["zh-cn"])
```

```console
🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y...
🦋 update 2 records into dataset
你好
```

## 从Huggingface Datasets Hub中构建

Huggingface Hub 上有大量的数据集,只需一行代码或一条命令就能转化为 Starwhale 数据集。

:::tip
Huggingface Datasets 转化需要依赖 [datasets](https://pypi.org/project/datasets/) 库。
:::

命令行方式:

```console
swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon
```

Python SDK方式:

```python
from starwhale import Dataset

# You only specify starwhale dataset expected name and huggingface repo name
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
print(ds)
print(len(ds))
print(repr(ds.fetch_one()))
```

## 使用 Starwhale SDK 编写 Python Script 方式构建

## 使用 swcli dataset build + Python Handler 方式构建
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
new file mode 100644
index 000000000..e0a97604b
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
@@ -0,0 +1,178 @@
---
title: 数据集与其他ML库的集成
---

Starwhale 数据集可以与 Pillow, Numpy, Huggingface Datasets, Pytorch 和 Tensorflow 等流行的ML库进行良好的集成,方便进行数据转化。

## Pillow

[Starwhale Image](../reference/sdk/type#image) 类型与 [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) 对象之间支持双向转化。

### 将 Starwhale Image 转化为 Pillow Image

```python
from starwhale import dataset
# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
img = ds.head(n=1)[0].features.image

pil = img.to_pil()
print(pil)
print(pil.size)
```

```console

(640, 480)
```

### 使用 Pillow Image 初始化 Starwhale Image

```python
import numpy
from PIL import Image as PILImage
from starwhale import Image

# generate a random image
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
pil = PILImage.fromarray(random_array, mode="RGB")

img = Image(pil)
print(img)
```

```console
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
```

## Numpy

### 转化为 numpy.ndarray

Starwhale 的以下数据类型可以转化为 numpy.ndarray 对象:

* Image:先转化为Pillow Image类型,然后再转化为 numpy.ndarray 对象。
* Video:将 video bytes 直接转化为 numpy.ndarray 对象。
* Audio:调用 soundfile 库将 audio bytes 转化为 numpy.ndarray 对象。
* BoundingBox:转化为 xywh 格式的 numpy.ndarray 对象。
* Binary:将 bytes 直接转化为 numpy.ndarray 对象。

```python
from starwhale import dataset

# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")

item = ds.head(n=1)[0]

img = item.features.image
img_array = img.to_numpy()
print(img_array)
print(img_array.shape)

bbox = item.features.annotations[0]["bbox"]
print(bbox)
print(bbox.to_numpy())
```

```console

(480, 640, 3)
BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003
array([ 1.08 , 187.69008, 611.58976, 285.84 ])
```

### 使用 numpy.ndarray 初始化 Starwhale Image

当一张图片表示为 numpy.ndarray 对象时,可以用它初始化 Starwhale Image 对象。

```python
import numpy
from starwhale import Image

# generate a random image numpy.ndarray
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
img = Image(random_array)
print(img)
```

```console
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
```

## Huggingface Datasets

Huggingface Hub 上有大量的数据集,只需一行代码就能转化为 Starwhale 数据集。

:::tip
Huggingface Datasets 转化需要依赖 [datasets](https://pypi.org/project/datasets/) 库。
:::

```python
from starwhale import Dataset

# You only specify starwhale dataset expected name and huggingface repo name
# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions")
print(ds)
print(len(ds))
print(repr(ds.fetch_one()))
```

```console
🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise...
+🦋 update 833 records into dataset +Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise +833 +index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None +``` + +## Pytorch + +Starwhale Dataset 可以转化为 Pytorch 的 [torch.utils.dataset.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) 对象,并接受 transform 变换。转化后的 Pytorch dataset 对象就可以传递给 Pytorch dataloader 或 Huggingface Trainer 等。 + +```python +from starwhale import dataset +import torch.utils.data as tdata + +def custom_transform(data): + data["label"] = data["label"] + 100 + return data + +with dataset("simple", create="empty") as ds: + for i in range(0, 10): + ds[i] = {"text": f"{i}-text", "label": i} + ds.commit() + + torch_ds = ds.to_pytorch(transform=custom_transform) + torch_loader = tdata.DataLoader(torch_ds, batch_size=1) + item = next(iter(torch_loader)) + print(item) + print(item["label"]) +``` + +```console +{'text': ['0-text'], 'label': tensor([100])} +tensor([100]) +``` + +## Tensorflow + +Starwhale Dataset 可以转化为 Tensorflow 的 [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) 对象,同时也支持 transform 函数,可以对数据进行变化。 + +```python +from starwhale import dataset + +# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk +# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files +ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64") +tf_ds = ds.to_tensorflow() +print(tf_ds) +``` + +```console +<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}> +``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md new file mode 100644 index 000000000..bcbc362c6 --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md @@ -0,0 +1,3 @@ +--- +title: 数据集加载 +--- \ No newline at end of file diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md new file mode 100644 index 000000000..b76f9028e --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md @@ -0,0 +1,3 @@ +--- +title: 数据集版本控制 +--- \ No newline at end of file diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md new file mode 100644 index 000000000..de090010a --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md @@ -0,0 +1,3 @@ +--- +title: 数据集可视化 +--- \ No newline at end of file diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md index 3291e9598..7506bfe4e 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sdk/type.md @@ -158,7 +158,7 @@ ClassLabel( ) ``` -## Image +## Image {#image} 图片类型。 @@ -175,7 +175,7 @@ Image( |参数|说明| |---|---| -|`fp`|图片的路径、IO对象或文件内容的bytes| +|`fp`|图片的路径、IO对象、numpy对象、pillow image对象或文件内容的bytes| 
|`display_name`|Dataset Viewer上展示的名字| |`shape`|图片的Width、Height和channel| |`mime_type`|MIMEType支持的类型| diff --git a/sidebars.js b/sidebars.js index 1fa49d9eb..32d1ca1f9 100644 --- a/sidebars.js +++ b/sidebars.js @@ -157,7 +157,12 @@ module.exports = { }, collapsed: true, items: [ - "dataset/yaml" + "dataset/yaml", + "dataset/build", + "dataset/load", + "dataset/view", + "dataset/version", + "dataset/integration" ] }, { @@ -219,7 +224,6 @@ module.exports = { "reference/sdk/evaluation", "reference/sdk/model", "reference/sdk/job", - "reference/swcli/server", "reference/sdk/other", ] } From d41bf126c8fe9c14ea6933a613595b35fed49154 Mon Sep 17 00:00:00 2001 From: tianwei Date: Mon, 8 Jan 2024 18:52:58 +0800 Subject: [PATCH 2/6] add build --- .../current/dataset/build.md | 95 +++++++++++++++++++ 1 file changed, 95 insertions(+) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md index a5e92b76a..4a6946815 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md @@ -203,4 +203,99 @@ print(repr(ds.fetch_one())) ## 使用 Starwhale SDK 编写 Python Script 方式构建 +Starwhale Dataset SDK 提供类似Python `dict` 的方式添加或更新数据,实现本地或远端数据集的创建和更新。 + +Starwhale 对每行数据定义了两种属性:`key` 和 `features` 。 + - `key` 类型为 int 或 str,同一个数据集中只有有一种类型的`key`。`key` 表示能够唯一索引到该行数据。 + - `features` 类型为 dict。Starwhale Dataset 采用无Schema设计,所以每一行的 `features` 结构都可以不同。 + - `features` 中的数据支持int, float, str等Python 常量类型,也支持Image, Video, Audio, Text, Binary 等Starwhale 类型,还支持 list, tuple, dict等Python 复合类型。 + +### 数据集初始化 + +要创建、更新或加载数据集,需要先获得一个 Starwhale.Dataset 对象,一般可以采用如下方式获取: + +```python +from starwhale import dataset + +# 创建一个本地的数据集,名称为 new-test,若已经存在这个数据集,则抛出异常 +local_ds = dataset("new-test", create="empty") +print(local_ds) +print(len(local_ds)) + +# 若mnist64数据集不存在就创建一个,若存在就加载这个数据集 +remote_ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64", create="auto") +print(remote_ds) +print(len(remote_ds)) + +# 加载一个已经存在的数据集,名称为mnist64,如该数据集不存在则报错 +existed_ds = dataset("mnist64", create="forbid") +print(existed_ds) +print(len(existed_ds)) +``` + +```console +Dataset: new-test, stash version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf, loading version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf +0 + +Dataset: mnist64, stash version: 4z5wpbpozsxlelma3j6soeatekufymnyxdeihoqo, loading version: vs3gnaauakidjcc5effevaoh63vivu7dzodo5cmc +500 + +Dataset: mnist64, stash version: 3ahtfbizw63myxcz34ebd72lhgc25dualcmtznts, loading version: lwhvvixpimlsghfs2xqmtgrwti4yn2z5nevz7hth +500 +``` + +### 数据集元素添加和更新 + +Dataset 添加完数据后,如调用 `commit` 方式会产生一个新的版本,之后就可以用这个版本进行数据集的记载。 + +#### append 方式 + +Dataset 提供 `append` 函数,调用时自动增加`features`到数据集新的一行。 + +```python +from starwhale import dataset +ds = dataset("new-test", create="empty") + +# key 采用自增ID方式,本例子中 key 为 0 +ds.append({"a": 0, "b": 0}) + +# key 也可以主动声明,但需要保持与其他行的key类型一致 +# 以 list 或 tuple 方式添加的数据,第0个就是该行的`key`, 第1个是`features` +ds.append((1, {"a":1, "b":1})) + +ds.commit() +``` + +#### \_\_setitem\_\_ 方式 + +Dataset 提供 `__setitem__` 函数,提供类似 dict的索引更新值的方式添加数据。 + +```python +ds[2] = {"a":2, "b":2} +ds.commit() +``` + ## 使用 swcli dataset build + Python Handler 方式构建 + +支持 `swcli` 命令行读取某个Python文件中的某个函数作为输入,构建数据集。该函数的返回值需要可迭代。 + +dataset.py 脚本内容如下: + +```python +def iter_item(): + for i in range(100): + # 只返回 features。key为int类型的自增数字。 + yield {"a": i, "b": i} + +def iter_item_with_key(): + for i in range(100): + # 返回 key + features结构。 + 
yield i, {"a": i, "b": i} +``` + +构建数据集时,需要通过`swcli`命令行触发: + +```console +swcli dataset build --handler dataset:iter_item --name test1 +swcli dataset build --handler dataset:iter_item_with_key --name test2 +``` From 0dcf62e0df7e67cae0c938be30a2fddda578a8dd Mon Sep 17 00:00:00 2001 From: tianwei Date: Tue, 9 Jan 2024 17:31:46 +0800 Subject: [PATCH 3/6] add dataset load --- .../current/dataset/index.md | 2 +- .../current/dataset/load.md | 131 +++++++++++++++++- 2 files changed, 131 insertions(+), 2 deletions(-) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/index.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/index.md index 4ce665cf3..b071411ba 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/index.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/index.md @@ -89,7 +89,7 @@ title: Starwhale 数据集 ```python { - "img": GrayscaleImage( + "img": Image( link=Link( "123", offset=32, diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md index bcbc362c6..15c9cfd28 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md @@ -1,3 +1,132 @@ --- title: 数据集加载 ---- \ No newline at end of file +--- + +Starwhale 数据集构建完成后,可以在任意位置访问数据集,加载一条或多条数据,满足训练、评测和微调等数据消费的需求。 + +## 数据集加载的特点 + +- 加载本地 Standalone 实例或远端 Cloud/Server 实例的数据集,数据集唯一索引方式是数据集URI。 + + ```python + from starwhale import dataset + + local_latest_ds = dataset("mnist") + remote_cloud_ds = dataset("https://cloud-cn.starwhale.cn/project/starwhale:helloworld/dataset/mnist64/v2") + remote_server_ds = dataset("cloud://server/project/1/dataset/helloworld") + ``` + +- 远端数据集按需预加载,数据不落盘。 + - Starwhale 数据集加载时,并不会将远端数据集完全下载到本地后再加载。只会加载目标索引关联的数据。 + - 根据目标索引特征,提前加载一些数据,提升Batch性能,用空间换时间。 + + ![dataset-load](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-load.png) + +- 数据索引方式灵活。Starwhale Dataset 类实现了 `__getitem__` 方法,提供key索引和分片索引方式读取相关数据。 + + ```python + from starwhale import dataset + ds = dataset("mnist64") + print(ds[0].features.img) + print(ds[0].features.label) + print(len(ds[:10])) + ``` + + ```console + ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: + 0 + 10 + ``` + +## 数据集元素访问方式 + +### 下标方式 + +可以通过 key 值进行访问。当使用切片时,按根据key排序结果取范围。 + +```python +from starwhale import dataset + +with dataset("empty-new") as ds: + for i in range(0, 100): + ds.append({"a": i}) + ds.commit() + +ds = dataset("empty-new", readonly=True) +print(ds[0].features.a) +print(ds[99].features["a"]) +print(ds[0:10]) +print(ds[99:]) +``` + +```console +0 +99 +10 +2 +``` + +需要注意,这里并不是list的切片语法,并不支持逆序索引,如 `ds[-1]` 或 `ds[1:-1]` 这种表达。 + +### 迭代方式 + +Starwhale Dataset 类实现了 `__iter__` 方法,可以对实例化的Dataset对象进行遍历迭代,这也是训练、评测和微调中常用的数据集访问方式,能获得最佳性能。 + +```python +from starwhale import dataset +ds = dataset("mnist64") +for idx, row in enumerate(ds): + if idx > 10: + break + print(row.index, row.features) +``` + +```console +0 {'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +1 {'img': ArtifactType.Image, display:1, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 1} +2 {'img': ArtifactType.Image, display:2, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 2} +4 {'img': ArtifactType.Image, display:4, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 4} +5 {'img': ArtifactType.Image, display:5, mime_type:MIMEType.PNG, shape:[8, 
8, 1], encoding: , 'label': 5} +3 {'img': ArtifactType.Image, display:3, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 3} +6 {'img': ArtifactType.Image, display:6, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 6} +7 {'img': ArtifactType.Image, display:7, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 7} +8 {'img': ArtifactType.Image, display:8, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 8} +9 {'img': ArtifactType.Image, display:9, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 9} +10 {'img': ArtifactType.Image, display:10, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +``` + +### fetch_one 方法 + +获取数据集第一个元素,一般用来做回归测试或查看一下数据集features结构。与 `head(n=1)` 等价。 + +```python +from starwhale import dataset +ds = dataset("mnist64") +item = ds.fetch_one() +print(item.index) +print(list(item.features.keys())) +``` + +```console +0 │ +['img', 'label'] +``` + +### head 方法 + +获取数据集的n个元素,以列表方式返回。 + +```python +from starwhale import dataset +ds = dataset("mnist64") +items = ds.head(n=5) +print(items[0]) +print(items[0].features) +print(len(items)) +``` + +```console +0 +{'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +5 +``` From 0332a7350cbec19633feb3204bd46703276dd87c Mon Sep 17 00:00:00 2001 From: tianwei Date: Tue, 9 Jan 2024 18:00:15 +0800 Subject: [PATCH 4/6] add dataset view --- docs/concepts/glossary.md | 1 + .../current/concepts/glossary.md | 1 + .../current/dataset/view.md | 29 ++++++++++++++++++- 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/docs/concepts/glossary.md b/docs/concepts/glossary.md index ae7ac19b5..83b5664a6 100644 --- a/docs/concepts/glossary.md +++ b/docs/concepts/glossary.md @@ -18,3 +18,4 @@ On this page you find a list of important terminology used throughout the Starwh * **`model.yaml` file**: A descriptive file defining how to build a Starwhale Model, optional. * **`dataset.yaml` file**: A descriptive file defining how to build a Starwhale Dataset, needs to work with some Python scripts. Used by `swcli dataset build` command, optional. * **`runtime.yaml` file**: A descriptive file defining a Starwhale Runtime, used by `swcli runtime build` command, optional. +* **Starwhale Console**: Web frontend page for Starwhale Server/Cloud instance. 
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/concepts/glossary.md b/i18n/zh/docusaurus-plugin-content-docs/current/concepts/glossary.md index 186788862..a483b9d5e 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/concepts/glossary.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/concepts/glossary.md @@ -18,3 +18,4 @@ title: Starwhale的名词解释 * **`model.yaml` 文件**:是一种定义Starwhale Model如何构建的描述性文件,非必需。 * **`dataset.yaml` 文件**:是一种定义Starwhale Dataset如何构建的描述性文件,需要与一些Python脚本配合使用。`swcli dataset build` 命令会使用。非必需。 * **`runtime.yaml` 文件**:是一种定义Starwhale Runtime的描述性文件,`swcli runtime build` 命令会使用。非必需。 +* **Starwhale Console**: Starwhale Server/Cloud 实例中的Web前端页面。 diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md index de090010a..01df33474 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/view.md @@ -1,3 +1,30 @@ --- title: 数据集可视化 ---- \ No newline at end of file +--- + +Starwhale Console 提供数据集的可视化,支持搜索、过滤、数据对比、数据展示等功能,能有效的显示视频、音频、图片、文本等数据。 + +## 视频 + +能对 `Starwhale.Video` 对象进行呈现,可播放。 + +![video](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/video.png) + +## 图片 + +对 `Starwhale.Image` 和 `Starwhale.GrayscaleImage` 对象进行呈现,同时支持 `Starwhale.BoundingBox` 和 `Starwhale.COCOObjectAnnotation` 对象。 + +![image-simple](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-simple.png) +![image-bbox](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-bbox.png) +![image-mask](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-mask.png) +![image-mask2](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-mask2.png) + +## 音频 + +对 `Starwhale.Audio` 对象进行呈现,可播放。 + +![audio](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/audio.png) + +## 文本 + +![text](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/text.png) From 28f5605883d58f7876dab2073ed0d83f848c7246 Mon Sep 17 00:00:00 2001 From: tianwei Date: Wed, 10 Jan 2024 11:59:26 +0800 Subject: [PATCH 5/6] add version controll --- .../current/dataset/version.md | 107 +++++++++++++++++- 1 file changed, 106 insertions(+), 1 deletion(-) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md index b76f9028e..06a4873c0 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md @@ -1,3 +1,108 @@ --- title: 数据集版本控制 ---- \ No newline at end of file +--- + +Starwhale 数据集支持细粒度的版本控制,能实现对每一行和每一列的变更追溯。Starwhale 的数据集版本控制具备一下特点: + +- 线性版本。设计上以简化操作为核心,不需要考虑branch、merge等复杂的操作。对大规模数据集进行branch merge操作几乎不可能。 +- 细粒度控制。最小单位是某一行的某一列变更后就可以生成一个新的版本。 +- 版本唯一。生成版本时,会产生一个唯一ID,当数据集拷贝到不同实例中,该ID不变,可以通过该ID加载对应的数据集内容。 + +## 构建数据集时生成版本 + +### SDK commit主动调用生成版本 + +当使用 Starwhale Dataset SDK 构建数据集时,当添加完数据后,调用 `commit` 方法时,会产生一个新的版本,得到一个UUID。 + +```python +from starwhale import dataset + +ds1 = dataset("new-ds", create="empty") +ds1["train/0"] = {"a": 1, "b": 10} +ds1["train/1"] = {"a": 2, "b": 20} +version = ds1.commit() +print(version) +ds1.close() + +ds2 = dataset(f"new-ds/version/{version}") +ds2["train/0"].features.c = 100 +ds2["train/1"].features.a = -2 +ds2["train/1"].features.b = -20 +new_version = ds2.commit() +print(new_version) +ds2.close() + 
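# 以只读方式重新加载两个历史版本,验证旧版本数据不受后续修改的影响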
+ds1 = dataset(f"new-ds/version/{version}", readonly=True) +print(f"---{version}") +print(ds1["train/0"].index, ds1["train/0"].features) +print(ds1["train/1"].index, ds1["train/1"].features) +ds2 = dataset(f"new-ds/version/{new_version}", readonly=True) +print(f"---{new_version}") +print(ds2["train/0"].index, ds2["train/0"].features) +print(ds2["train/1"].index, ds2["train/1"].features) +ds1.close() +ds2.close() +``` + +```console +n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +---n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +train/0 {'a': 1, 'b': 10} +train/1 {'a': 2, 'b': 20} +---a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +train/0 {'a': 1, 'b': 10, 'c': 100} +train/1 {'a': -2, 'b': -20} +``` + +### swcli 命令行自动生成版本 + +对于 `swcli dataset build` 命令行构建数据集时,会自动产生一个新版本。 + +```console +❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/json-gec8u5sv/version/latest +🌊 creating dataset local/project/self/dataset/json-gec8u5sv/version/f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom... +🦋 update 906 records into dataset +🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-gec8u5sv/version/f3iz4sdljjt7 +``` + +### Tag 关联版本 + +Starwhale 数据集引入 Tag 概念,可以在 `commit` 或执行数据集构建命令时,指定Tag,实现数据集版本和Tag的关联,之后可以用Tag进行数据集加载。 + +- 数据集版本:一个唯一ID,类似 `f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom`,保证在所有Starwhale 实例上ID唯一。 +- 数据集Tag:可读字符串,类似 `t1`, `t2`, `v0.3`。数据集版本与Tag是一对多的关系。每个Tag只能标识一个版本,但每个数据集版本可以有多个Tag。 + - 手工指定Tag:`commit` 函数中的`tags` 参数,或在`swcli dataset build`命令行中通过`--tag`参数,指定一个或多个Tag。数据集拷贝到其他实例时,可以通过参数设置携带这些Tags。 + - 自动生成的自增Tag:在一个实例范围内,每次commit或build后,会产生类似 `v0`, `v1`, `v2` 这样的自增Tag。数据集拷贝的时候会忽略源实例上的这些Tag,在目的实例上会产生新的自增Tag。 + - `latest` Tag: 自动生成,最后一次调用commit或指定build命令,会将`latest`标记到该版本上。 + +## 通过版本加载数据集 + +通过 Dataset URI 可以加载任意位置的数据集,URI中的version字段,可以用唯一ID、唯一ID简写、自定义Tag、自增Tag和`latest` Tag等多种形式。 + +```python +from starwhale import dataset + +# load with the latest version +print("latest version(default):", dataset("new-ds").loading_version) +print("latest version(specified):", dataset("new-ds/version/latest").loading_version) + +# load with the full specified version +print("uuid version(full):", dataset("new-ds/version/n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw").loading_version) +print("uuid version(prefix):", dataset("new-ds/version/n7uglydp4p").loading_version) + +# load with tag +print("tag version(v0):", dataset("new-ds/version/v0").loading_version) +print("tag version(v1):", dataset("new-ds/version/v1").loading_version) +``` + +```console +latest version(default): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +latest version(specified): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +uuid version(full): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +uuid version(prefix): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +tag version(v0): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +tag version(v1): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +``` From 5ec3c7e3fa28a44280f3e4b32d2b27a7bb8b7476 Mon Sep 17 00:00:00 2001 From: tianwei Date: Wed, 10 Jan 2024 15:50:08 +0800 Subject: [PATCH 6/6] update docs with lint --- docs/dataset/build.md | 300 ++++++++++++++++++ docs/dataset/integration.md | 178 +++++++++++ docs/dataset/load.md | 132 ++++++++ docs/dataset/version.md | 108 +++++++ docs/dataset/view.md | 30 ++ .../current/dataset/build.md | 
32 +- .../current/dataset/integration.md | 26 +- .../current/dataset/load.md | 62 ++-- .../current/dataset/version.md | 6 +- 9 files changed, 810 insertions(+), 64 deletions(-) diff --git a/docs/dataset/build.md b/docs/dataset/build.md index e69de29bb..7bba01d14 100644 --- a/docs/dataset/build.md +++ b/docs/dataset/build.md @@ -0,0 +1,300 @@ +--- +title: Dataset Building +--- + +Starwhale provides a highly flexible method to build datasets, allowing you to build dataset from various file types including images, audio, video, CSV, JSON, and JSONL files. Python scripts and datasets from the Huggingface Hub can also be used for construction. + +## Building from Data Files + +### Image + +Starwhale supports recursively traversing image files within directories to build a dataset without any coding: + +- Supported image formats: `png`, `jpg`, `jpeg`, `webp`, `svg`, `apng`. +- Images are converted to `Starwhale.Image` type and can be viewed in the Starwhale Server Web page. +- Supported by `swcli dataset build --image` command line and `starwhale.Dataset.from_folder` Python SDK. +- **Label mechanism**: when SDK sets `auto_label=True` or command line sets `--auto-label`, the parent directory name will be used as the `label`. +- **Metadata mechanism**: dataset columns can be expanded by setting `metadata.csv` or `metadata.jsonl` files in the root directory. +- **Caption mechanism**: when `{image-name}.txt` files are found in the same directory, the content will be automatically imported and populated into the `caption` column. + +Assuming there are the following four files in the folder directory: + +```console +folder/dog/1.png +folder/cat/2.png +folder/dog/3.png +folder/cat/4.png +``` + +Command line construction: + +```console +❯ swcli dataset build --image folder --name image-folder +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/image-folder/version/latest +🌊 creating dataset local/project/self/dataset/image-folder/version/uw6mdisnf7alg4t4fs2myfug4ie4636w3x4jqcu2... +🦋 update 4 records into dataset +🌺 congratulation! you can run swcli dataset info image-folder/version/uw6mdisnf7al +``` + +```console +❯ swcli dataset head image-folder -n 2 +row ─────────────────────────────────────── +🌳 id: cat/2.png +🌀 features: + 🔅 file_name : cat/2.png + 🔅 label : cat + 🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: +row ─────────────────────────────────────── +🌳 id: cat/4.png +🌀 features: + 🔅 file_name : cat/4.png + 🔅 label : cat + 🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: +``` + +Python SDK construction: + +```python +from starwhale import Dataset +ds = Dataset.from_folder("folder", kind="image") +print(ds) +print(ds.fetch_one().features) +``` + +```console +🌊 creating dataset local/project/self/dataset/folder/version/nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna... +🦋 update 4 records into dataset +Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loading version: nyc2ay4gnyayv4zqalovdgakl3k2esvrne42cjna +{'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: } +``` + +### Video + +Recursive traversal of video files in a directory to construct Starwhale datasets without any coding: + +- Supported video formats: `mp4`, `webm` and `avi`. +- Videos are converted to Starwhale.Video types and can be viewed in the Starwhale Server Web page. 
+- Supported by `swcli dataset build --video` command line and `starwhale.Dataset.from_folder` Python SDK. +- Label, caption and metadata mechanisms are the same as for images. + +### Audio + +Recursive traversal of audio files in a directory to construct Starwhale datasets without any coding: + +- Supported audio formats: `mp3` and `wav`. +- Audio is converted to Starwhale.Audio types and can be viewed in the Starwhale Server Web page. +- Supported by `swcli dataset build --audio` command line and `starwhale.Dataset.from_folder` Python SDK. +- Label, caption and metadata mechanisms are the same as for images. + +### csv Files + +Command line or Python SDK can directly convert local or remote csv files into Starwhale datasets: + +- Support one or more local csv files. +- Support recursive finding of csv files in a local directory. +- Support one or more remote csv files specified by http urls. + +Command line construction: + +```console +❯ swcli dataset build --name product-desc-modelscope --csv https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo\?Revision\=master\&FilePath\=test.csv --encoding=utf-8-sig +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/product-desc-modelscope/version/latest +🌊 creating dataset local/project/self/dataset/product-desc-modelscope/version/wzaz4ccodpyj4jelgupljreyida2bleg5xp7viwe... +🦋 update 3848 records into dataset +🌺 congratulation! dataset build from csv files(('https://modelscope.cn/api/v1/datasets/lcl193798/product_description_generation/repo?Revision=master&FilePath=test.csv',)) has been built. You can run swcli dataset info product-desc-modelscope/version/wzaz4ccodpyj +``` + +Python SDK construction: + +```python +from starwhale import Dataset +ds = Dataset.from_csv(path="http://example.com/data.csv", name="my-csv-dataset") +``` + +### json/jsonl Files + +Command line or Python SDK can directly convert local or remote json/jsonl files into Starwhale datasets: + +- Support one or more local json/jsonl files. +- Support recursive finding of json/jsonl files in a local directory. +- Support one or more remote json/jsonl files specified by http urls. + +For JSON files: + +- By default, the parsed json object is assumed to be a list, and each object in the list is a dict, which maps to one row in the Starwhale dataset. +- The `--field-selector` or `field_selector` parameter can be used to locate a specific list. + +For example, for the json file: + +```json +{ + "p1": { + "p2":{ + "p3": [ + {"a": 1, "b": 2}, + {"a": 10, "b": 20} + ] + } + } +} +``` + +Set `--field-selector=p1.p2.p3` to accurately add two rows of data to the dataset. + +Command line construction: + +```console +❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/json-b0o2zsvg/version/latest +🌊 creating dataset local/project/self/dataset/json-b0o2zsvg/version/q3uoziwqligxdggncqywpund75jz55h3bne6a5la... +🦋 update 906 records into dataset +🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. 
You can run swcli dataset info json-b0o2zsvg/version/q3uoziwqligx +``` + +Python SDK construction: + +```python +from starwhale import Dataset +myds = Dataset.from_json( + name="translation", + text='{"content": {"child_content": [{"en":"hello","zh-cn":"你好"},{"en":"how are you","zh-cn":"最近怎么样"}]}}', + field_selector="content.child_content" +) +print(myds[0].features["zh-cn"]) +``` + +```console +🌊 creating dataset local/project/self/dataset/translation/version/kblfn5zh4cpoqxqbhgdfbvonulr2zefp6lojq44y... +🦋 update 2 records into dataset +你好 +``` + +## Building from Huggingface Hub + +There are numerous datasets available on the Huggingface Hub, which can be converted into Starwhale Dataset with a single line of code or command. + +:::tip +Huggingface Datasets conversion relies on the [datasets](https://pypi.org/project/datasets/) library. +::: + +Command line: + +```console +swcli dataset build -hf lambdalabs/pokemon-blip-captions --name pokemon +``` + +Python SDK: + +```python +from starwhale import Dataset + +# You only specify starwhale dataset expected name and huggingface repo name +# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions +ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions") +print(ds) +print(len(ds)) +print(repr(ds.fetch_one())) +``` + +## Building from Python SDK scripts + +The Starwhale Dataset SDK provides a way similar to Python `dict` to add or update data, enabling the creation and update of local or remote datasets. + +Starwhale defines two attributes for each row of data: `key` and `features`. + +- `key`: int or str type. There is only one type of `key` in a dataset. `key` indicates the unique index of that row of data. +- `features`: dict type. Starwhale Dataset adopts a schema-free design, so the `features` structure of each row can be different. + - `features` data supports Python constant types like int, float, str, as well as Starwhale types like Image, Video, Audio, Text, and Binary. It also supports Python compound types like list, tuple, dict. + +### Dataset Initialization + +To create, update, or load a dataset, you need to get a `Starwhale.Dataset` object, usually in the following ways: + +```python +from starwhale import dataset + +# Create a dataset named new-test in standalone instance. If it exists, raise an exception. +local_ds = dataset("new-test", create="empty") +print(local_ds) +print(len(local_ds)) + +# If the mnist64 dataset does not exist, create one; otherwise, load this existing dataset. +remote_ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64", create="auto") +print(remote_ds) +print(len(remote_ds)) + +# Load the existing dataset named mnist64, and if it does not exist, an error will be raised. +existed_ds = dataset("mnist64", create="forbid") +print(existed_ds) +print(len(existed_ds)) +``` + +```console +Dataset: new-test, stash version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf, loading version: y4touw3btifhkd4f2gg4x3qvydgnfmvoztqqm5cf +0 + +Dataset: mnist64, stash version: 4z5wpbpozsxlelma3j6soeatekufymnyxdeihoqo, loading version: vs3gnaauakidjcc5effevaoh63vivu7dzodo5cmc +500 + +Dataset: mnist64, stash version: 3ahtfbizw63myxcz34ebd72lhgc25dualcmtznts, loading version: lwhvvixpimlsghfs2xqmtgrwti4yn2z5nevz7hth +500 +``` + +### Adding and Updating Dataset Elements + +After adding data, calling `commit` will generate a new version that can then be used to access the dataset. 
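For example, a minimal end-to-end sketch (the dataset name `sdk-quickstart` is illustrative) that appends two rows, commits, and reloads the committed version:

```python
from starwhale import dataset

# create an empty local dataset, add two rows, and commit to produce a new version
with dataset("sdk-quickstart", create="empty") as ds:
    ds.append({"text": "hello"})
    ds.append({"text": "world"})
    version = ds.commit()

# reload the committed version read-only via its unique version id
ds = dataset(f"sdk-quickstart/version/{version}", readonly=True)
print(len(ds))  # 2
```
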
#### The append Method

The Dataset provides the append function, which automatically adds `features` to a new row in the dataset when called.

```python
from starwhale import dataset
ds = dataset("new-test", create="empty")

# key is the auto increment index. The example key is zero.
ds.append({"a": 0, "b": 0})

# Keys in the dataset can also be explicitly declared, but they must maintain consistency with the key types of other rows.
# When data is added in the form of a list or tuple, the first element (at index 0) represents the key for that particular row, while the second element (at index 1) contains the corresponding features.
ds.append((1, {"a":1, "b":1}))

ds.commit()
```

#### \_\_setitem\_\_ Method

The Dataset's `__setitem__` method provides a dict-like way to add data by index.

```python
ds[2] = {"a":2, "b":2}
ds.commit()
```

## Building from Python Handler

The `swcli` command line can read a function from a Python file as input to build a dataset. The return value of the function needs to be iterable.

Example Python script `dataset.py`:

```python
def iter_item():
    for i in range(100):
        # only return features. key is auto increment index.
        yield {"a": i, "b": i}

def iter_item_with_key():
    for i in range(100):
        # key + features
        yield i, {"a": i, "b": i}
```

Trigger the dataset build through the `swcli` command line:

```console
swcli dataset build --handler dataset:iter_item --name test1
swcli dataset build --handler dataset:iter_item_with_key --name test2
```
diff --git a/docs/dataset/integration.md b/docs/dataset/integration.md
index e69de29bb..e755ac264 100644
--- a/docs/dataset/integration.md
+++ b/docs/dataset/integration.md
@@ -0,0 +1,178 @@
---
title: Integration with Other ML Libraries
---

Starwhale datasets can integrate well with popular ML libraries such as Pillow, Numpy, Huggingface Datasets, Pytorch and Tensorflow, facilitating data transformation.

## Pillow

[Starwhale Image](../reference/sdk/type#image) type and [Pillow Image](https://pillow.readthedocs.io/en/stable/reference/Image.html) objects support bidirectional conversion.

### Converting Starwhale Image to Pillow Image

```python
from starwhale import dataset

# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
img = ds.head(n=1)[0].features.image

pil = img.to_pil()
print(pil)
print(pil.size)
```

```console

(640, 480)
```

### Initialize Starwhale Image with Pillow Image

```python
import numpy
from PIL import Image as PILImage
from starwhale import Image

# generate a random image
random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8)
pil = PILImage.fromarray(random_array, mode="RGB")

img = Image(pil)
print(img)
```

```console
ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
```

## Numpy

### Converting to numpy.ndarray

The following Starwhale data types can be converted to `numpy.ndarray` objects:

* `Image`: First convert to Pillow Image type, then to `numpy.ndarray` object.
* `Video`: Directly convert video bytes to `numpy.ndarray` object.
* `Audio`: Use the soundfile library to convert audio bytes to `numpy.ndarray` object.
+* `BoundingBox`: Convert to `numpy.ndarray` object in xywh format. +* `Binary`: Directly convert bytes to `numpy.ndarray` object. + +```python +from starwhale import dataset + +# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk +# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files +ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2") + +item = ds.head(n=1)[0] + +img = item.features.image +img_array = img.to_numpy() +print(img_array) +print(img_array.shape) + +bbox = item.features.annotations[0]["bbox"] +print(bbox) +print(bbox.to_numpy()) +``` + +```console + +(480, 640, 3) +BoundingBox[XYWH]- x:1.0799999999999699, y:187.69008, width:611.5897600000001, height:285.84000000000003 +array([ 1.08 , 187.69008, 611.58976, 285.84 ]) +``` + +### Initialize Starwhale Image with numpy.ndarray + +When an image is represented as a `numpy.ndarray` object, it can be used to initialize a Starwhale Image object. + +```python +import numpy +from starwhale import Image + +# generate a random image numpy.ndarray +random_array = numpy.random.randint(low=0, high=256, size=(100, 100, 3), dtype=numpy.uint8) +img = Image(random_array) +print(img) +``` + +```console +ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding: +``` + +## Huggingface Datasets + +There are numerous datasets on the Huggingface Hub that can be transformed into Starwhale datasets with a single line of code. + +:::tip +Huggingface Datasets conversion relies on the [datasets](https://pypi.org/project/datasets/) library. +::: + +```python +from starwhale import Dataset + +# You only specify starwhale dataset expected name and huggingface repo name +# example: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions +ds = Dataset.from_huggingface("pokemon", "lambdalabs/pokemon-blip-captions") +print(ds) +print(len(ds)) +print(repr(ds.fetch_one())) +``` + +```console +🌊 creating dataset local/project/self/dataset/pokemon/version/r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise... +🦋 update 833 records into dataset +Dataset: pokemon, stash version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise, loading version: r2m3is6ormwcio4gtayop25qk4gmfr6mcei6hise +833 +index:default/train/0, features:{'image': ArtifactType.Image, display:, mime_type:MIMEType.JPEG, shape:[1280, 1280, 3], encoding: , 'text': 'a drawing of a green pokemon with red eyes', '_hf_subset': 'default', '_hf_split': 'train'}, shadow dataset: None +``` + +## Pytorch + +Starwhale Dataset can be converted into Pytorch's [torch.utils.dataset.IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) object and accept transform. The converted Pytorch dataset object can then be passed to Pytorch dataloader or Huggingface Trainer, etc. 
+ +```python +from starwhale import dataset +import torch.utils.data as tdata + +def custom_transform(data): + data["label"] = data["label"] + 100 + return data + +with dataset("simple", create="empty") as ds: + for i in range(0, 10): + ds[i] = {"text": f"{i}-text", "label": i} + ds.commit() + + torch_ds = ds.to_pytorch(transform=custom_transform) + torch_loader = tdata.DataLoader(torch_ds, batch_size=1) + item = next(iter(torch_loader)) + print(item) + print(item["label"]) +``` + +```console +{'text': ['0-text'], 'label': tensor([100])} +tensor([100]) +``` + +## Tensorflow + +Starwhale Dataset can be converted into Tensorflow's [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object and supports transform functions to mutate the data. + +```python +from starwhale import dataset + +# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk +# raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files +ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64") +tf_ds = ds.to_tensorflow() +print(tf_ds) +``` + +```console +<_FlatMapDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'img': TensorSpec(shape=(8, 8, 1), dtype=tf.uint8, name=None)}> +``` diff --git a/docs/dataset/load.md b/docs/dataset/load.md index e69de29bb..1945f6a7b 100644 --- a/docs/dataset/load.md +++ b/docs/dataset/load.md @@ -0,0 +1,132 @@ +--- +title: Dataset Loading +--- + +After Starwhale datasets are constructed, they can be accessed from any location to load one or multiple data items, meeting the needs for training, evaluation and fine-tuning. + +## Features of Dataset Loading + +- Load datasets from local Standalone instances or remote Cloud/Server instances. Datasets are uniquely indexed by dataset URI. + + ```python + from starwhale import dataset + + local_latest_ds = dataset("mnist") + remote_cloud_ds = dataset("https://cloud-cn.starwhale.cn/project/starwhale:helloworld/dataset/mnist64/v2") + remote_server_ds = dataset("cloud://server/project/1/dataset/helloworld") + ``` + +- Remote datasets are loaded on demand without local persistence. + - When loading Starwhale datasets, remote datasets will not be completely downloaded before loading. Only related data based on target indexes will be loaded. + - Some data will be loaded in advance based on target index features to improve batch performance by trading space for time. + + ![dataset-load](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-load.png) + +- Flexible data indexing methods. Starwhale Dataset class implements `__getitem__` to provide key index and slice index methods to read related data. + + ```python + from starwhale import dataset + ds = dataset("mnist64") + print(ds[0].features.img) + print(ds[0].features.label) + print(len(ds[:10])) + ``` + + ```console + ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: + 0 + 10 + ``` + +## Methods to Access Dataset Elements + +### Indexing + +Use key values for accessing. Use slices for ranges sorted by key. 
+ +```python +from starwhale import dataset + +with dataset("empty-new") as ds: + for i in range(0, 100): + ds.append({"a": i}) + ds.commit() + +ds = dataset("empty-new", readonly=True) +print(ds[0].features.a) +print(ds[99].features["a"]) +print(ds[0:10]) +print(ds[99:]) +``` + +```console +0 +99 +10 +2 +``` + +Note that this is not the slicing syntax of a list and does not support reverse indexing expressions like ds[-1] or ds[1:-1]. + +### Iteration + +Starwhale Dataset implements `__iter__` enabling iterating over Dataset instances. This is commonly used in training, evaluation and fine-tuning to achieve the best performance. + +```python +from starwhale import dataset +ds = dataset("mnist64") +for idx, row in enumerate(ds): + if idx > 10: + break + print(row.index, row.features) +``` + +```console +0 {'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +1 {'img': ArtifactType.Image, display:1, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 1} +2 {'img': ArtifactType.Image, display:2, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 2} +4 {'img': ArtifactType.Image, display:4, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 4} +5 {'img': ArtifactType.Image, display:5, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 5} +3 {'img': ArtifactType.Image, display:3, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 3} +6 {'img': ArtifactType.Image, display:6, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 6} +7 {'img': ArtifactType.Image, display:7, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 7} +8 {'img': ArtifactType.Image, display:8, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 8} +9 {'img': ArtifactType.Image, display:9, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 9} +10 {'img': ArtifactType.Image, display:10, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +``` + +### fetch_one Method + +Get the first element of the dataset, usually for testing or viewing dataset features structure. Equivalent to `head(n=1)`. + +```python +from starwhale import dataset +ds = dataset("mnist64") +item = ds.fetch_one() +print(item.index) +print(list(item.features.keys())) +``` + +```console +0 +['img', 'label'] +``` + +### head Method + +Get the first n elements of the dataset, returned as a list. + +```python +from starwhale import dataset +ds = dataset("mnist64") +items = ds.head(n=5) +print(items[0]) +print(items[0].features) +print(len(items)) +``` + +```console +0 +{'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +5 +``` diff --git a/docs/dataset/version.md b/docs/dataset/version.md index e69de29bb..7074024eb 100644 --- a/docs/dataset/version.md +++ b/docs/dataset/version.md @@ -0,0 +1,108 @@ +--- +title: Dataset Versioning +--- + +Starwhale dataset supports fine-grained version control to trace changes to each row and column. The version control of Starwhale Dataset has the following features: + +- **Linear versioning**. The design aims at simplifying operations without complex branch and merge operations. Branch merge on massive datasets is almost impossible. +- **Fine-grained control**. The minimum unit is a change to a column in a row that can generate a new version. +- **Unique version IDs**. When generating a version, a globally unique ID is produced. Copying datasets between instances will keep this ID unchanged. 
The dataset content can be loaded by this ID. + +## Generating Versions During Dataset Construction + +### SDK commit to Actively Create Versions + +When constructing a dataset using the Starwhale Dataset SDK, after adding data, calling the `commit` method will produce a new version and obtain a UUID. + +```python +from starwhale import dataset + +ds1 = dataset("new-ds", create="empty") +ds1["train/0"] = {"a": 1, "b": 10} +ds1["train/1"] = {"a": 2, "b": 20} +version = ds1.commit() +print(version) +ds1.close() + +ds2 = dataset(f"new-ds/version/{version}") +ds2["train/0"].features.c = 100 +ds2["train/1"].features.a = -2 +ds2["train/1"].features.b = -20 +new_version = ds2.commit() +print(new_version) +ds2.close() + +ds1 = dataset(f"new-ds/version/{version}", readonly=True) +print(f"---{version}") +print(ds1["train/0"].index, ds1["train/0"].features) +print(ds1["train/1"].index, ds1["train/1"].features) +ds2 = dataset(f"new-ds/version/{new_version}", readonly=True) +print(f"---{new_version}") +print(ds2["train/0"].index, ds2["train/0"].features) +print(ds2["train/1"].index, ds2["train/1"].features) +ds1.close() +ds2.close() +``` + +```console +n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +---n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw +train/0 {'a': 1, 'b': 10} +train/1 {'a': 2, 'b': 20} +---a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf +train/0 {'a': 1, 'b': 10, 'c': 100} +train/1 {'a': -2, 'b': -20} +``` + +### swcli Command Line + +`swcli dataset build` commands automatically generate a new version: + +```console +❯ swcli dataset build --json https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo\?Revision\=master\&FilePath\=train.jsonl +🚧 start to build dataset bundle... +👷 uri local/project/self/dataset/json-gec8u5sv/version/latest +🌊 creating dataset local/project/self/dataset/json-gec8u5sv/version/f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom... +🦋 update 906 records into dataset +🌺 congratulation! dataset build from ('https://modelscope.cn/api/v1/datasets/damo/100PoisonMpts/repo?Revision=master&FilePath=train.jsonl',) has been built. You can run swcli dataset info json-gec8u5sv/version/f3iz4sdljjt7 +``` + +### Tagging Versions + +Starwhale introduces the concept of Tags, which can be specified during `commit` or when executing dataset construction commands to associate dataset versions with Tags, allowing dataset loading by Tag. + +- Dataset version: A unique ID, similar to `f3iz4sdljjt7rmmfd4rkiak4vkbilp5pbrdgfgom`, ensuring the ID is unique across all Starwhale instances. +- Dataset Tag: A readable string, similar to `t1`, `t2`, `v0.3`. There is a one-to-many relationship between dataset versions and Tags. Each Tag can only identify one version, but each dataset version can have multiple Tags. + - Manually specified Tags: The `tags` parameter in the `commit` function, or the `--tag` parameter in the `swcli dataset build` command, can be used to specify one or multiple Tags. When the dataset is copied to other instances, these Tags can be carried over by parameter settings. + - Automatically generated incremental Tags: Within an instance, after each commit or build, an incremental Tag like `v0`, `v1`, `v2` is generated. When copying the dataset, these Tags are ignored on the source instance, and new incremental Tags are generated on the destination instance. + - `latest` Tag: Automatically generated, the last commit or build command will mark the `latest` Tag on that version. 
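As a quick illustration, a minimal sketch of attaching custom tags at commit time (the dataset name is illustrative; `commit` accepts a `tags` parameter as described above, here assumed to take a list of strings):

```python
from starwhale import dataset

with dataset("tag-demo", create="empty") as ds:
    ds.append({"a": 1})
    # attach custom tags to the new version, in addition to the auto tags (v0, latest)
    version = ds.commit(tags=["v0.3", "stable"])

# the same version can now be loaded by its uuid or by either custom tag
print(dataset("tag-demo/version/v0.3").loading_version == version)  # True
```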

## Loading Datasets by Version

A dataset can be loaded from any location through its dataset URI. The version field of the URI accepts several forms: the full unique ID, a unique-ID prefix, a custom tag, an incremental tag, or the `latest` tag.

```python
from starwhale import dataset

# load the latest version
print("latest version(default):", dataset("new-ds").loading_version)
print("latest version(specified):", dataset("new-ds/version/latest").loading_version)

# load by the full version ID or by an ID prefix
print("uuid version(full):", dataset("new-ds/version/n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw").loading_version)
print("uuid version(prefix):", dataset("new-ds/version/n7uglydp4p").loading_version)

# load by tag
print("tag version(v0):", dataset("new-ds/version/v0").loading_version)
print("tag version(v1):", dataset("new-ds/version/v1").loading_version)
```

```console
latest version(default): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
latest version(specified): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
uuid version(full): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
uuid version(prefix): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
tag version(v0): n7uglydp4pbjrf5rjgct7ygmmwk6ldmzv5j3amaw
tag version(v1): a4gyk3w3uxgklfthle2jjmxw3gx3k7m6icbzfhlf
```
diff --git a/docs/dataset/view.md b/docs/dataset/view.md
index e69de29bb..b8575e33a 100644
--- a/docs/dataset/view.md
+++ b/docs/dataset/view.md
@@ -0,0 +1,30 @@
---
title: Dataset Visualization
---

Starwhale Console visualizes datasets, with support for search, filtering, data comparison, and rich display of video, audio, image, and text data.

## Video

Presents `Starwhale.Video` objects with playback support.

![video](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/video.png)

## Images

Presents `Starwhale.Image` and `Starwhale.GrayscaleImage` objects, with support for `Starwhale.BoundingBox` and `Starwhale.COCOObjectAnnotation` overlays.

![image-simple](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-simple.png)
![image-bbox](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-bbox.png)
![image-mask](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-mask.png)
![image-mask2](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/image-mask2.png)
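
For reference, a row that the Console renders this way can be assembled with the SDK types named above. A minimal sketch, assuming a local `bbox.png` file (the file name and the `view-demo` dataset are hypothetical, and the exact overlay behavior depends on how the Console binds annotations to images):

```python
from starwhale import dataset, Image, BoundingBox

with dataset("view-demo") as ds:
    ds.append({
        "img": Image(fp="bbox.png"),  # displayed as an image in the Console
        "bbox": BoundingBox(x=10, y=20, width=50, height=40),  # candidate box overlay
    })
    ds.commit()
```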

## Audio

Presents `Starwhale.Audio` objects with playback support.

![audio](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/audio.png)

## Text

Presents `Starwhale.Text` objects.

![text](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-view/text.png)
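
Audio and text rows follow the same construction pattern as the image sketch above. A minimal sketch, assuming a local `hello.wav` file (the file and dataset names are hypothetical, and the constructors shown should be checked against your SDK version):

```python
from starwhale import dataset, Audio, Text

with dataset("view-demo") as ds:
    ds.append({
        "voice": Audio(fp="hello.wav"),          # playable in the Console
        "transcript": Text("hello starwhale"),   # rendered as plain text
    })
    ds.commit()
```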

diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
index 4a6946815..5e28a0c6c 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/build.md
@@ -10,12 +10,12 @@ Starwhale dataset building is very flexible; datasets can be built from image/audio/vid
 Supports recursively traversing image files in a directory to build a Starwhale dataset, with no code required:
 
-- Supported image formats: `png/jpg/jpeg/webp/svg/apng`
-- Images are converted to the Starwhale.Image type and can be viewed in the Starwhale Server web UI.
+- Supported image formats: `png/jpg/jpeg/webp/svg/apng`.
+- Images are converted to the `Starwhale.Image` type and can be viewed in the Starwhale Server web UI.
 - Available both via the `swcli dataset build --image` command line and via the Python SDK `starwhale.Dataset.from_folder`.
-- label mechanism: when the SDK sets `auto_label=True` or the command line sets `--auto-label`, the parent directory name is used as the `label`.
-- metadata mechanism: dataset columns can be extended by placing a `metadata.csv` or `metadata.jsonl` file in the root directory.
-- caption mechanism: when a `{image-name}.txt` file is found in the same directory, its content is automatically imported into the `caption` column.
+- **label mechanism**: when the SDK sets `auto_label=True` or the command line sets `--auto-label`, the parent directory name is used as the `label`.
+- **metadata mechanism**: dataset columns can be extended by placing a `metadata.csv` or `metadata.jsonl` file in the root directory.
+- **caption mechanism**: when a `{image-name}.txt` file is found in the same directory, its content is automatically imported into the `caption` column.
 
 Suppose the folder directory contains the following four files:
@@ -40,20 +40,19 @@ folder/cat/4.png
 ```console
 ❯ swcli dataset head image-folder -n 2
 row ───────────────────────────────────────
-🌳 id: cat/2.png
+🌳 id: cat/2.png
 🌀 features:
-  🔅 file_name : cat/2.png
-  🔅 label : cat
-  🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
+  🔅 file_name : cat/2.png
+  🔅 label : cat
+  🔅 file : ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
 row ───────────────────────────────────────
-🌳 id: cat/4.png
+🌳 id: cat/4.png
 🌀 features:
-  🔅 file_name : cat/4.png
-  🔅 label : cat
-  🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
+  🔅 file_name : cat/4.png
+  🔅 label : cat
+  🔅 file : ArtifactType.Image, display:4.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding:
 ```
 
-
 Build with the Python SDK:
@@ -70,12 +69,11 @@ Dataset: folder, stash version: d22hdiwyakdfh5xitcpn2s32gblfbhrerzczkb63, loadin
 {'file_name': 'cat/2.png', 'label': 'cat', 'file': ArtifactType.Image, display:2.png, mime_type:MIMEType.PNG, shape:[None, None, 3], encoding: }
 ```
 
-
 ### Video
 
 Supports recursively traversing video files in a directory to build a Starwhale dataset, with no code required:
 
-- Supported video formats: `mp4/webm/avi`
+- Supported video formats: `mp4/webm/avi`.
 - Videos are converted to the Starwhale.Video type and can be viewed in the Starwhale Server web UI.
 - Available both via the `swcli dataset build --video` command line and via the Python SDK `starwhale.Dataset.from_folder`.
 - The label, caption, and metadata mechanisms are the same as for images.
@@ -212,7 +210,7 @@ Starwhale defines two properties for every data row: `key` and `features`.
 
 ### Dataset Initialization
 
-To create, update, or load a dataset, you first need a Starwhale.Dataset object, typically obtained as follows:
+To create, update, or load a dataset, you first need a `Starwhale.Dataset` object, typically obtained as follows:
 
 ```python
 from starwhale import dataset
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
index e0a97604b..c837bcf6c 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/integration.md
@@ -13,7 +13,7 @@ Starwhale datasets integrate with Pillow, Numpy, Huggingface Datasets, Pytorch, and Tenso
 
 ```python
 from starwhale import dataset
 
-# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
 # raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
 ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
 img = ds.head(n=1)[0].features.image
@@ -44,25 +44,25 @@
 print(img)
 ```
 
 ```console
-ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
 ```
 
 ## Numpy
 
 ### Converting to numpy.ndarray
 
-The following Starwhale data types can be converted to numpy.ndarray objects:
+The following Starwhale data types can be converted to `numpy.ndarray` objects:
 
-* Image: first converted to a Pillow Image, then to a numpy.ndarray object.
-* Video: video bytes are converted directly to a numpy.ndarray object.
-* Audio: audio bytes are converted to a numpy.ndarray object via the soundfile library.
-* BoundingBox: converted to a numpy.ndarray object in xywh format.
-* Binary: bytes are converted directly to a numpy.ndarray object.
+* `Image`: first converted to a Pillow Image, then to a `numpy.ndarray` object.
+* `Video`: video bytes are converted directly to a `numpy.ndarray` object.
+* `Audio`: audio bytes are converted to a `numpy.ndarray` object via the soundfile library.
+* `BoundingBox`: converted to a `numpy.ndarray` object in xywh format.
+* `Binary`: bytes are converted directly to a `numpy.ndarray` object.
 
 ```python
 from starwhale import dataset
 
-# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
 # raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
 ds = dataset("https://cloud.starwhale.cn/project/starwhale:object-detection/dataset/coco128/v2")
 
@@ -100,7 +100,7 @@ print(img)
 ```
 
 ```console
-ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
+ArtifactType.Image, display:, mime_type:MIMEType.UNDEFINED, shape:[None, None, 3], encoding:
 ```
 
 ## Huggingface Datasets
@@ -139,8 +139,8 @@ from starwhale import dataset
 import torch.utils.data as tdata
 
 def custom_transform(data):
-    data["label"] = data["label"] + 100
-    return data
+    data["label"] = data["label"] + 100
+    return data
 
 with dataset("simple", create="empty") as ds:
     for i in range(0, 10):
@@ -166,7 +166,7 @@ Starwhale datasets can be converted to Tensorflow [tensorflow.data.Dataset](https
 
 ```python
 from starwhale import dataset
 
-# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
+# login cloud instance in advance: `swcli instance login` command or `starwhale.login` sdk
 # raw dataset url: https://cloud.starwhale.cn/projects/397/datasets/172/versions/236/files
 ds = dataset("https://cloud.starwhale.cn/project/starwhale:helloworld/dataset/mnist64")
 tf_ds = ds.to_tensorflow()
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
index 15c9cfd28..d35b4c8dc 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/load.md
@@ -8,35 +8,35 @@
 
 After a Starwhale dataset is built, it can be accessed from anywhere and one or more rows can be loaded.
 
 - Load datasets from the local Standalone instance or from a remote Cloud/Server instance; a dataset is uniquely indexed by its dataset URI.
 
-  ```python
-  from starwhale import dataset
-  
-  local_latest_ds = dataset("mnist")
-  remote_cloud_ds = dataset("https://cloud-cn.starwhale.cn/project/starwhale:helloworld/dataset/mnist64/v2")
-  remote_server_ds = dataset("cloud://server/project/1/dataset/helloworld")
-  ```
+  ```python
+  from starwhale import dataset
+
+  local_latest_ds = dataset("mnist")
+  remote_cloud_ds = dataset("https://cloud-cn.starwhale.cn/project/starwhale:helloworld/dataset/mnist64/v2")
dataset("https://cloud-cn.starwhale.cn/project/starwhale:helloworld/dataset/mnist64/v2") + remote_server_ds = dataset("cloud://server/project/1/dataset/helloworld") + ``` - 远端数据集按需预加载,数据不落盘。 - Starwhale 数据集加载时,并不会将远端数据集完全下载到本地后再加载。只会加载目标索引关联的数据。 - 根据目标索引特征,提前加载一些数据,提升Batch性能,用空间换时间。 - + ![dataset-load](https://starwhale-examples.oss-cn-beijing.aliyuncs.com/docs/dataset-load.png) - 数据索引方式灵活。Starwhale Dataset 类实现了 `__getitem__` 方法,提供key索引和分片索引方式读取相关数据。 - ```python - from starwhale import dataset - ds = dataset("mnist64") - print(ds[0].features.img) - print(ds[0].features.label) - print(len(ds[:10])) - ``` - - ```console - ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: - 0 - 10 - ``` + ```python + from starwhale import dataset + ds = dataset("mnist64") + print(ds[0].features.img) + print(ds[0].features.label) + print(len(ds[:10])) + ``` + + ```console + ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: + 0 + 10 + ``` ## 数据集元素访问方式 @@ -48,9 +48,9 @@ Starwhale 数据集构建完成后,可以在任意位置访问数据集,加 from starwhale import dataset with dataset("empty-new") as ds: - for i in range(0, 100): - ds.append({"a": i}) - ds.commit() + for i in range(0, 100): + ds.append({"a": i}) + ds.commit() ds = dataset("empty-new", readonly=True) print(ds[0].features.a) @@ -76,9 +76,9 @@ Starwhale Dataset 类实现了 `__iter__` 方法,可以对实例化的Dataset from starwhale import dataset ds = dataset("mnist64") for idx, row in enumerate(ds): - if idx > 10: - break - print(row.index, row.features) + if idx > 10: + break + print(row.index, row.features) ``` ```console @@ -102,14 +102,14 @@ for idx, row in enumerate(ds): ```python from starwhale import dataset ds = dataset("mnist64") -item = ds.fetch_one() +item = ds.fetch_one() print(item.index) print(list(item.features.keys())) ``` ```console -0 │ -['img', 'label'] +0 +['img', 'label'] ``` ### head 方法 @@ -127,6 +127,6 @@ print(len(items)) ```console 0 -{'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} +{'img': ArtifactType.Image, display:0, mime_type:MIMEType.PNG, shape:[8, 8, 1], encoding: , 'label': 0} 5 ``` diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md index 06a4873c0..c90a06254 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/dataset/version.md @@ -4,9 +4,9 @@ title: 数据集版本控制 Starwhale 数据集支持细粒度的版本控制,能实现对每一行和每一列的变更追溯。Starwhale 的数据集版本控制具备一下特点: -- 线性版本。设计上以简化操作为核心,不需要考虑branch、merge等复杂的操作。对大规模数据集进行branch merge操作几乎不可能。 -- 细粒度控制。最小单位是某一行的某一列变更后就可以生成一个新的版本。 -- 版本唯一。生成版本时,会产生一个唯一ID,当数据集拷贝到不同实例中,该ID不变,可以通过该ID加载对应的数据集内容。 +- **线性版本**。设计上以简化操作为核心,不需要考虑branch、merge等复杂的操作。对大规模数据集进行branch merge操作几乎不可能。 +- **细粒度控制**。最小单位是某一行的某一列变更后就可以生成一个新的版本。 +- **版本唯一**。生成版本时,会产生一个唯一ID,当数据集拷贝到不同实例中,该ID不变,可以通过该ID加载对应的数据集内容。 ## 构建数据集时生成版本