A powerful tool for creating fine-tuning datasets for Large Language Models (LLMs)
Features • Getting Started • Usage • Documentation • Contributing • License
If you like this project, please give it a Star ⭐️, or buy the author a cup of coffee => Support the author
Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with any OpenAI-format LLM API, making the fine-tuning process accessible and efficient.
## Features

- Intelligent Document Processing: Upload Markdown files and automatically split them into meaningful segments
- Smart Question Generation: Extract relevant questions from each text segment
- Answer Generation: Generate comprehensive answers for each question using LLM APIs
- Flexible Editing: Edit questions, answers, and datasets at any stage of the process
- Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
- User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
- Customizable System Prompts: Add custom system prompts to guide model responses
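Since Easy Dataset targets any OpenAI-format API, the request shape it relies on is the standard chat-completions payload. A minimal sketch of that payload (the model name and prompt strings below are placeholders, not values from the project):

```python
import json

def build_chat_request(model, system_prompt, user_message):
    """Build a chat-completions payload in the OpenAI format that
    OpenAI-compatible APIs (including Ollama's compatibility endpoint)
    accept. All argument values here are illustrative placeholders."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request(
    "my-model",                                  # placeholder model name
    "You are a domain expert.",                  # custom system prompt
    "Generate questions for this text segment.", # placeholder task
)
print(json.dumps(payload, indent=2))
```

Any provider that accepts this payload at a `/v1/chat/completions`-style endpoint can be plugged into the project settings.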
| Windows | macOS (Intel) | macOS (Apple Silicon) | Linux |
| --- | --- | --- | --- |
| Setup.exe | Intel | M | AppImage |
## Getting Started

### Prerequisites

- Node.js 18.x or higher
- pnpm (recommended) or npm
### Installation

Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
Install dependencies:
```bash
npm install
```
Build and start the application:

```bash
npm run build
npm run start
```
### Docker Deployment

If you want to build the image yourself, you can use the Dockerfile in the project root directory:
Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
Build the Docker image:
```bash
docker build -t easy-dataset .
```
Run the container:
```bash
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
```
> Note: Replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.

Then open your browser and navigate to `http://localhost:1717`.
## Usage

### 1. Create a Project
- Click the "Create Project" button on the home page
- Enter a project name and description
- Configure your preferred LLM API settings
### 2. Split Text
- Upload your Markdown files in the "Text Split" section
- Review the automatically split text segments
- Adjust the segmentation if needed
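To illustrate what heading-based segmentation of a Markdown file looks like, here is a naive sketch; this is not Easy Dataset's actual splitter (which also considers segment length), just the basic idea:

```python
import re

def split_markdown(text):
    """Naively split a Markdown document into segments at headings.
    Illustrative only -- a real splitter would also merge or re-split
    segments that are too short or too long."""
    # Split at each line that starts a heading, keeping the heading
    # with the text that follows it (zero-width lookahead).
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = """# Intro
Some overview text.

## Details
More specific content.
"""
segments = split_markdown(doc)
print(len(segments))  # one segment per heading
```

Each resulting segment keeps its heading, which gives later question generation some topical context to work with.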
### 3. Generate Questions
- Navigate to the "Questions" section
- Select text segments to generate questions from
- Review and edit the generated questions
- Organize questions using the tag tree
### 4. Generate Datasets
- Go to the "Datasets" section
- Select questions to include in your dataset
- Generate answers using your configured LLM
- Review and edit the generated answers
### 5. Export Datasets
- Click the "Export" button in the Datasets section
- Select your preferred format (Alpaca or ShareGPT)
- Choose file format (JSON or JSONL)
- Add custom system prompts if needed
- Export your dataset
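The two export formats differ mainly in record shape. A rough sketch of one question–answer pair in each format (field names follow common community conventions for Alpaca and ShareGPT; Easy Dataset's exact output may differ slightly):

```python
import json

# A hypothetical question-answer pair produced in the earlier steps
qa = {
    "question": "What is fine-tuning?",
    "answer": "Adapting a pretrained model to a task with additional training.",
}
system_prompt = "You are a helpful domain expert."  # optional custom system prompt

# Alpaca-style record: instruction / input / output fields
alpaca = {
    "instruction": qa["question"],
    "input": "",
    "output": qa["answer"],
    "system": system_prompt,
}

# ShareGPT-style record: a list of conversation turns
sharegpt = {
    "system": system_prompt,
    "conversations": [
        {"from": "human", "value": qa["question"]},
        {"from": "gpt", "value": qa["answer"]},
    ],
}

# JSONL export is simply one JSON object per line
jsonl_line = json.dumps(alpaca, ensure_ascii=False)
print(jsonl_line)
```

Alpaca suits single-turn instruction data, while ShareGPT's turn list generalizes to multi-turn conversations.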
## Project Structure

```
easy-dataset/
├── app/                              # Next.js application directory
│   ├── api/                          # API routes
│   │   ├── llm/                      # LLM API integration
│   │   │   ├── ollama/               # Ollama API integration
│   │   │   └── openai/               # OpenAI API integration
│   │   ├── projects/                 # Project management APIs
│   │   │   ├── [projectId]/          # Project-specific operations
│   │   │   │   ├── chunks/           # Text chunk operations
│   │   │   │   ├── datasets/         # Dataset generation and management
│   │   │   │   │   └── optimize/     # Dataset optimization API
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/        # Question management
│   │   │   │   └── split/            # Text splitting operations
│   │   │   └── user/                 # User-specific project operations
│   ├── projects/                     # Front-end project pages
│   │   └── [projectId]/              # Project-specific pages
│   │       ├── datasets/             # Dataset management UI
│   │       ├── questions/            # Question management UI
│   │       ├── settings/             # Project settings UI
│   │       └── text-split/           # Text processing UI
│   └── page.js                       # Home page
├── components/                       # React components
│   ├── datasets/                     # Dataset-related components
│   ├── home/                         # Home page components
│   ├── projects/                     # Project management components
│   ├── questions/                    # Question management components
│   └── text-split/                   # Text processing components
├── lib/                              # Core libraries and utilities
│   ├── db/                           # Database operations
│   ├── i18n/                         # Internationalization
│   ├── llm/                          # LLM integration
│   │   ├── common/                   # Common LLM utilities
│   │   ├── core/                     # Core LLM client
│   │   └── prompts/                  # Prompt templates
│   │       ├── answer.js             # Answer generation prompts (Chinese)
│   │       ├── answerEn.js           # Answer generation prompts (English)
│   │       ├── question.js           # Question generation prompts (Chinese)
│   │       ├── questionEn.js         # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                # Text splitting utilities
├── locales/                          # Internationalization resources
│   ├── en/                           # English translations
│   └── zh-CN/                        # Chinese translations
├── public/                           # Static assets
│   └── imgs/                         # Image resources
└── local-db/                         # Local file-based database
    └── projects/                     # Project data storage
```
## Documentation

For detailed documentation on all features and APIs, please visit our Documentation Site.