Persian Synthetic Data Generation

This repository contains code and resources for generating high-quality synthetic data in Persian language using State-of-the-Art Language Models (LLMs).

Overview

This project provides tools and utilities to generate synthetic Persian text data using advanced language models. It includes predefined topic trees, prompts, and datasets that can be used to generate domain-specific Persian content.

Project Structure

.
├── datasets/                  # Generated synthetic datasets
├── prompts/                  # Prompt templates and configurations
│   └── topic_prompts.json
├── topic_trees/             # Topic hierarchies and structures
├── gen_synthetic.py         # Main synthetic data generation script
├── gen_synthetic_emotions.py # Emotion-specific data generation
└── Experience.ipynb         # Example notebook with usage demonstrations

Prerequisites

Python 3.x
UV package manager
Required API keys (see Environment Variables)

Installation

Clone the repository:

git clone https://github.com/ParsBench/PersianSyntheticData.git
cd PersianSyntheticData

Install dependencies using UV:

uv venv
source .venv/bin/activate  # On Unix-like systems
# or
.venv\Scripts\activate     # On Windows
uv pip install -r requirements.txt

Environment Variables

Copy the .env.sample file to create your own .env:

cp .env.sample .env

Configure the following variables in your .env file:

Add your API keys and other configuration parameters

Usage

Basic Data Generation

Run the main generation script:

python gen_synthetic.py

Emotion-Specific Data Generation

For generating emotion-related synthetic data:

python gen_synthetic_emotions.py

Project Components

Topic Trees: Hierarchical structures defining the relationships between different Persian topics
Prompts: Carefully crafted prompt templates for generating contextually relevant data
Datasets: Generated synthetic data in JSONL format
Notebooks: Interactive examples and demonstrations in Jupyter notebooks

License

This project is licensed under the Apache License 2.0 - see below for details:

Copyright 2024 ParsBench

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

For the full license text, please see the LICENSE file in the repository.

Citation

If you use this project in your research, please cite:

@misc{parsbench2025,
  title={ParsBench: A Framework for Generating High-Quality Persian Language Datasets},
  author={ParsBench Team, Shariati Shahriar},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/ParsBench/PersianSyntheticData}},
  note={Open-source framework for Persian language dataset generation}
}

Acknowledgments

Special thanks to AvalAI for sponsoring this project through their AvalAward program
This repository was made possible by AvalAI's generous support and commitment to advancing Persian language AI research

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Persian Synthetic Data Generation

Overview

Project Structure

Prerequisites

Installation

Environment Variables

Usage

Basic Data Generation

Emotion-Specific Data Generation

Project Components

License

Citation

Acknowledgments

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
datasets		datasets
prompts		prompts
topic_trees		topic_trees
.env.sample		.env.sample
.gitignore		.gitignore
.python-version		.python-version
Experience.ipynb		Experience.ipynb
LICENSE		LICENSE
README.md		README.md
gen_synthetic.py		gen_synthetic.py
gen_synthetic_emotions.py		gen_synthetic_emotions.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

ParsBench/PersianSyntheticData

Folders and files

Latest commit

History

Repository files navigation

Persian Synthetic Data Generation

Overview

Project Structure

Prerequisites

Installation

Environment Variables

Usage

Basic Data Generation

Emotion-Specific Data Generation

Project Components

License

Citation

Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages