A powerful tool for creating fine-tuning datasets for Large Language Models (LLMs)
Features • Getting Started • Usage • Documentation • Contributing • License
If you like this project, please give it a Star ⭐️, or buy the author a cup of coffee => Support the author
Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with any OpenAI-format LLM API, making the fine-tuning process accessible and efficient.
## Features

- Intelligent Document Processing: Upload Markdown files and automatically split them into meaningful segments
- Smart Question Generation: Extract relevant questions from each text segment
- Answer Generation: Generate comprehensive answers for each question using LLM APIs
- Flexible Editing: Edit questions, answers, and datasets at any stage of the process
- Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
- User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
- Customizable System Prompts: Add custom system prompts to guide model responses
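Since Easy Dataset targets any OpenAI-format API, the request shape it relies on is the standard chat-completions payload. A minimal sketch of that payload (the model name and prompt strings below are placeholders, not values from the project):

```python
import json

def build_chat_request(model, system_prompt, user_message):
    """Build a chat-completions payload in the OpenAI format that
    OpenAI-compatible APIs (including Ollama's compatibility endpoint)
    accept. All argument values here are illustrative placeholders."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request(
    "my-model",                                  # placeholder model name
    "You are a domain expert.",                  # custom system prompt
    "Generate questions for this text segment.", # placeholder task
)
print(json.dumps(payload, indent=2))
```

Any provider that accepts this payload at a `/v1/chat/completions`-style endpoint can be plugged into the project settings.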
| Windows | macOS (Intel) | macOS (Apple Silicon) | Linux |
| --- | --- | --- | --- |
| Setup.exe | Intel | M | AppImage |
## Getting Started

### Prerequisites

- Node.js 18.x or higher
- pnpm (recommended) or npm
### Installation

Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
Install dependencies:
```bash
npm install
```
Build and start the application:

```bash
npm run build
npm run start
```
### Docker Deployment

If you want to build the image yourself, you can use the Dockerfile in the project root directory:
Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
Build the Docker image:
```bash
docker build -t easy-dataset .
```
Run the container:
```bash
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
```
> Note: Replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.

Then open your browser and navigate to `http://localhost:1717`.
## Usage

### 1. Create a Project
- Click the "Create Project" button on the home page
- Enter a project name and description
- Configure your preferred LLM API settings
### 2. Split Text
- Upload your Markdown files in the "Text Split" section
- Review the automatically split text segments
- Adjust the segmentation if needed
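To illustrate what heading-based segmentation of a Markdown file looks like, here is a naive sketch; this is not Easy Dataset's actual splitter (which also considers segment length), just the basic idea:

```python
import re

def split_markdown(text):
    """Naively split a Markdown document into segments at headings.
    Illustrative only -- a real splitter would also merge or re-split
    segments that are too short or too long."""
    # Split at each line that starts a heading, keeping the heading
    # with the text that follows it (zero-width lookahead).
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = """# Intro
Some overview text.

## Details
More specific content.
"""
segments = split_markdown(doc)
print(len(segments))  # one segment per heading
```

Each resulting segment keeps its heading, which gives later question generation some topical context to work with.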
### 3. Generate Questions
- Navigate to the "Questions" section
- Select text segments to generate questions from
- Review and edit the generated questions
- Organize questions using the tag tree
### 4. Generate Datasets
- Go to the "Datasets" section
- Select questions to include in your dataset
- Generate answers using your configured LLM
- Review and edit the generated answers
### 5. Export Datasets
- Click the "Export" button in the Datasets section
- Select your preferred format (Alpaca or ShareGPT)
- Choose file format (JSON or JSONL)
- Add custom system prompts if needed
- Export your dataset
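The two export formats differ mainly in record shape. A rough sketch of one question–answer pair in each format (field names follow common community conventions for Alpaca and ShareGPT; Easy Dataset's exact output may differ slightly):

```python
import json

# A hypothetical question-answer pair produced in the earlier steps
qa = {
    "question": "What is fine-tuning?",
    "answer": "Adapting a pretrained model to a task with additional training.",
}
system_prompt = "You are a helpful domain expert."  # optional custom system prompt

# Alpaca-style record: instruction / input / output fields
alpaca = {
    "instruction": qa["question"],
    "input": "",
    "output": qa["answer"],
    "system": system_prompt,
}

# ShareGPT-style record: a list of conversation turns
sharegpt = {
    "system": system_prompt,
    "conversations": [
        {"from": "human", "value": qa["question"]},
        {"from": "gpt", "value": qa["answer"]},
    ],
}

# JSONL export is simply one JSON object per line
jsonl_line = json.dumps(alpaca, ensure_ascii=False)
print(jsonl_line)
```

Alpaca suits single-turn instruction data, while ShareGPT's turn list generalizes to multi-turn conversations.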
## Project Structure

```
easy-dataset/
├── app/                              # Next.js application directory
│   ├── api/                          # API routes
│   │   ├── llm/                      # LLM API integration
│   │   │   ├── ollama/               # Ollama API integration
│   │   │   └── openai/               # OpenAI API integration
│   │   ├── projects/                 # Project management APIs
│   │   │   ├── [projectId]/          # Project-specific operations
│   │   │   │   ├── chunks/           # Text chunk operations
│   │   │   │   ├── datasets/         # Dataset generation and management
│   │   │   │   │   └── optimize/     # Dataset optimization API
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/        # Question management
│   │   │   │   └── split/            # Text splitting operations
│   │   │   └── user/                 # User-specific project operations
│   ├── projects/                     # Front-end project pages
│   │   └── [projectId]/              # Project-specific pages
│   │       ├── datasets/             # Dataset management UI
│   │       ├── questions/            # Question management UI
│   │       ├── settings/             # Project settings UI
│   │       └── text-split/           # Text processing UI
│   └── page.js                       # Home page
├── components/                       # React components
│   ├── datasets/                     # Dataset-related components
│   ├── home/                         # Home page components
│   ├── projects/                     # Project management components
│   ├── questions/                    # Question management components
│   └── text-split/                   # Text processing components
├── lib/                              # Core libraries and utilities
│   ├── db/                           # Database operations
│   ├── i18n/                         # Internationalization
│   ├── llm/                          # LLM integration
│   │   ├── common/                   # Common LLM utilities
│   │   ├── core/                     # Core LLM client
│   │   └── prompts/                  # Prompt templates
│   │       ├── answer.js             # Answer generation prompts (Chinese)
│   │       ├── answerEn.js           # Answer generation prompts (English)
│   │       ├── question.js           # Question generation prompts (Chinese)
│   │       ├── questionEn.js         # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                # Text splitting utilities
├── locales/                          # Internationalization resources
│   ├── en/                           # English translations
│   └── zh-CN/                        # Chinese translations
├── public/                           # Static assets
│   └── imgs/                         # Image resources
└── local-db/                         # Local file-based database
    └── projects/                     # Project data storage
```
## Documentation

For detailed documentation on all features and APIs, please visit our Documentation Site.