Documentation

1 - Overview

DataMate - Enterprise-level Large Model Data Processing Platform

DataMate is an enterprise-level data processing platform designed for model fine-tuning and RAG retrieval. It provides comprehensive data processing capabilities including data collection, management, cleaning, annotation, synthesis, evaluation, and knowledge base management.

Product Positioning

DataMate is dedicated to solving data pain points in large model implementation, providing a one-stop data governance solution:

  • Full Lifecycle Coverage: From data collection to evaluation, covering the entire data processing lifecycle
  • Enterprise-grade Capabilities: Supports million-scale concurrent data processing with private deployment options
  • Flexible Extension: Rich built-in data processing operators with support for custom operator development
  • Visual Orchestration: Drag-and-drop pipeline design without coding for complex data processing workflows

Core Features

Data Collection

  • Heterogeneous data source collection capabilities based on DataX
  • Supports relational databases, NoSQL, file systems, and other data sources
  • Flexible task configuration and monitoring

Data Management

  • Unified dataset management supporting image, text, audio, video, and multimodal data types
  • Complete data operations: upload, download, preview
  • Tag and metadata management for easy data organization and retrieval

Data Cleaning

  • Rich built-in data cleaning operators
  • Visual cleaning template configuration
  • Supports both batch and stream processing modes
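
Conceptually, a cleaning operator is just a transform over records. The sketch below — a hypothetical whitespace-normalizing, de-duplicating operator; the name and interface are illustrative, not DataMate's operator API — shows the idea:

```python
def clean_records(records: list[str]) -> list[str]:
    """Normalize whitespace and drop empty rows and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.split())  # collapse runs of whitespace
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

print(clean_records(["  hello   world ", "hello world", "", "bye"]))
# ['hello world', 'bye']
```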

Data Annotation

  • Integrated Label Studio for professional annotation capabilities
  • Supports image classification, object detection, text classification, and other annotation types
  • Annotation review and quality control mechanisms

Data Synthesis

  • Data augmentation and synthesis capabilities based on large models
  • Instruction template management and customization
  • Proportional synthesis tasks for diverse data needs
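
The proportional-synthesis idea can be sketched as ratio-weighted sampling over labeled source pools (a minimal illustration; `sample_by_ratio` and its behavior are assumptions, not DataMate's API):

```python
import random

def sample_by_ratio(sources: dict[str, list], ratios: dict[str, float], total: int) -> list:
    """Draw `total` samples across sources according to the given ratios.

    Hypothetical helper illustrating proportional synthesis; not DataMate's API.
    """
    result = []
    for name, ratio in ratios.items():
        count = round(total * ratio)
        pool = sources[name]
        # Sample with replacement so small pools can still meet their quota.
        result.extend(random.choice(pool) for _ in range(count))
    return result

qa = [{"q": "What is RAG?", "a": "Retrieval-augmented generation."}]
chat = [{"q": "Hello", "a": "Hi there!"}]
mixed = sample_by_ratio({"qa": qa, "chat": chat}, {"qa": 0.7, "chat": 0.3}, total=10)
print(len(mixed))  # 10
```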

Data Evaluation

  • Multi-dimensional data quality evaluation metrics
  • Supports both automatic and manual evaluation
  • Detailed evaluation reports

Knowledge Base Management (RAG)

  • Supports multiple document formats for knowledge base construction
  • Automated text chunking and vectorization
  • Integrated vector retrieval for RAG applications
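
The chunking step can be pictured as a sliding window over the document text (a minimal sketch with assumed chunk size and overlap; DataMate's actual chunker is not shown here):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size, overlapping chunks as a RAG preprocessing sketch."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each window starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 3
```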

Operator Marketplace

  • Rich built-in data processing operators
  • Support for operator publishing and sharing
  • Custom operator development capabilities

Pipeline Orchestration

  • Visual drag-and-drop workflow design
  • Multiple node types and configurations
  • Pipeline execution monitoring and debugging

Agent Chat

  • Integrated large language model chat capabilities
  • Knowledge base Q&A
  • Conversation history management

Technical Architecture

Overall Architecture

DataMate adopts a microservices architecture with core components including:

  • Frontend: React 18 + TypeScript + Ant Design + Tailwind CSS
  • Backend: Java 21 + Spring Boot 3.5.6 + Spring Cloud + MyBatis Plus
  • Runtime: Python FastAPI + LangChain + Ray
  • Database: PostgreSQL + Redis + Milvus + MinIO

Microservice Components

  • API Gateway (8080): Unified entry point for routing and authentication
  • Main Application: Core business logic
  • Data Management Service (8092): Dataset management
  • Data Collection Service: Data collection task management
  • Data Cleaning Service: Data cleaning task management
  • Data Annotation Service: Data annotation task management
  • Data Synthesis Service: Data synthesis task management
  • Data Evaluation Service: Data evaluation task management
  • Operator Market Service: Operator marketplace management
  • RAG Indexer Service: Knowledge base indexing
  • Runtime Service (8081): Operator execution engine
  • Backend Python Service (18000): Python backend service

Use Cases

Model Fine-tuning

  • Training data cleaning and quality improvement
  • Data augmentation and synthesis
  • Training data evaluation

RAG Applications

  • Enterprise knowledge base construction
  • Document vectorization and indexing
  • Semantic retrieval and Q&A

Data Governance

  • Unified management of multi-source data
  • Data lineage tracking
  • Data quality monitoring

Deployment Options

DataMate supports multiple deployment methods:

  • Docker Compose: Quick experience and development testing
  • Kubernetes/Helm: Production environment deployment
  • Offline Deployment: Supports air-gapped environment deployment

Comparison with Similar Products

Feature | DataMate | Label Studio | DocArray
------- | -------- | ------------ | --------
Data Management | ✅ Complete dataset management | ❌ Annotation data only | ❌ Document data only
Data Collection | ✅ DataX support | ❌ Not supported | ❌ Not supported
Data Cleaning | ✅ Rich built-in operators | ❌ Not supported | ❌ Not supported
Data Annotation | ✅ Label Studio integration | ✅ Professional tool | ❌ Not supported
Data Synthesis | ✅ LLM-based | ❌ Not supported | ❌ Not supported
Data Evaluation | ✅ Multi-dimensional | ⚠️ Basic | ❌ Not supported
Knowledge Base | ✅ RAG integration | ❌ Not supported | ⚠️ Requires development
Pipeline Orchestration | ✅ Visual orchestration | ❌ Not supported | ❌ Not supported
Operator Extension | ✅ Custom operators | ⚠️ Limited | ⚠️ Requires coding
License | ✅ MIT | ✅ Apache 2.0 | ✅ MIT

2 - Quick Start

Deploy DataMate in 5 minutes

This guide will help you deploy the DataMate platform in 5 minutes.

DataMate supports two main deployment methods:

  • Docker Compose: Suitable for quick experience and development testing
  • Kubernetes/Helm: Suitable for production deployment

Prerequisites

Docker Compose Deployment

  • Docker 20.10+
  • Docker Compose 2.0+
  • At least 4GB RAM
  • At least 10GB disk space

Kubernetes Deployment

  • Kubernetes 1.20+
  • Helm 3.0+
  • kubectl configured with cluster connection
  • At least 8GB RAM
  • At least 20GB disk space

5-Minute Quick Deployment (Docker Compose)

1. Clone the Code

git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

2. Start Services

Use the provided Makefile for one-click deployment:

make install

After running the command, the system will prompt you to select a deployment method:

Choose a deployment method:
1. Docker/Docker-Compose
2. Kubernetes/Helm
Enter choice:

Enter 1 to select Docker Compose deployment.

3. Verify Deployment

After services start, you can access them at:

  • Frontend: http://localhost:30000
  • API Gateway: http://localhost:8080
  • Database: localhost:5432

4. Check Service Status

docker ps

You should see the following containers running:

  • datamate-frontend (Frontend service)
  • datamate-backend (Backend service)
  • datamate-backend-python (Python backend service)
  • datamate-gateway (API gateway)
  • datamate-database (PostgreSQL database)
  • datamate-runtime (Operator runtime)

Optional Components Installation

Install Milvus Vector Database

Milvus is used for vector storage and retrieval in knowledge bases:

make install-milvus

Select Docker Compose deployment method when prompted.

Install Label Studio Annotation Tool

Label Studio is used for data annotation:

make install-label-studio

Access: http://localhost:30001

Default credentials:

Install MinerU PDF Processing Service

MinerU provides enhanced PDF document processing:

make build-mineru
make install-mineru

Install DeerFlow Service

DeerFlow is used for enhanced workflow orchestration:

make install-deer-flow

Using Local Images for Development

If you’ve modified local code, use local images for deployment:

make build
make install dev=true

Offline Environment Deployment

For offline environments, download all images first:

make download SAVE=true

Images will be saved in the dist/ directory. Load images on the target machine:

make load-images

Uninstall

Uninstall DataMate

make uninstall

The system will prompt whether to delete volumes:

  • Select 1: Delete all data (including datasets, configurations, etc.)
  • Select 2: Keep volumes

Uninstall Specific Components

# Uninstall Label Studio
make uninstall-label-studio

# Uninstall Milvus
make uninstall-milvus

# Uninstall DeerFlow
make uninstall-deer-flow

Common Questions

Q: What if service startup fails?

First check if ports are occupied:

# Check port usage
lsof -i :30000
lsof -i :8080

If ports are occupied, modify port mappings in deployment/docker/datamate/docker-compose.yml.
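
As an illustration, remapping the frontend's host port might look like the fragment below (the service name follows the container names listed earlier; the exact keys in the shipped compose file may differ):

```yaml
services:
  datamate-frontend:
    ports:
      - "30080:30000"   # host:container — serve the frontend on 30080 instead of 30000
```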

Q: How to view service logs?

# View all service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs

# View specific service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f datamate-backend

Q: Where is data stored?

Data is persisted through Docker volumes:

  • datamate-dataset-volume: Dataset files
  • datamate-postgresql-volume: Database data
  • datamate-log-volume: Log files

View all volumes:

docker volume ls | grep datamate

2.1 - Installation Guide

Detailed installation and configuration instructions for DataMate

This document provides detailed installation and configuration instructions for the DataMate platform.

System Requirements

Minimum Configuration

Component | Minimum | Recommended
--------- | ------- | -----------
CPU | 4 cores | 8 cores+
RAM | 8 GB | 16 GB+
Disk | 50 GB | 100 GB+
OS | Linux/macOS/Windows | Linux (Ubuntu 20.04+)

Software Dependencies

Docker Compose Deployment

  • Docker 20.10+
  • Docker Compose 2.0+
  • Git (optional, for cloning code)
  • Make (optional, for using Makefile)

Kubernetes Deployment

  • Kubernetes 1.20+
  • Helm 3.0+
  • kubectl (matching cluster version)
  • Git (optional, for cloning code)
  • Make (optional, for using Makefile)

Deployment Method Comparison

Feature | Docker Compose | Kubernetes
------- | -------------- | ----------
Deployment Difficulty | ⭐ Simple | ⭐⭐⭐ Complex
Resource Utilization | ⭐⭐ Fair | ⭐⭐⭐⭐ High
High Availability | ❌ Not supported | ✅ Supported
Scalability | ⭐⭐ Fair | ⭐⭐⭐⭐ Strong
Use Case | Dev/test, small scale | Production, large scale

Docker Compose Deployment

Basic Deployment

1. Prerequisites

# Clone code repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

# Check Docker and Docker Compose versions
docker --version
docker compose version

2. Deploy Using Makefile

# One-click deployment (including Milvus)
make install

Select 1. Docker/Docker-Compose when prompted.

3. Use Docker Compose Directly

If Make is not installed:

# Set image registry (optional)
export REGISTRY=ghcr.io/modelengine-group/

# Start basic services
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

4. Verify Deployment

# Check container status
docker ps

# View service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f

# Access frontend
open http://localhost:30000

Optional Components

Milvus Vector Database

# Using Makefile
make install-milvus

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

Components:

  • milvus-standalone (19530, 9091)
  • milvus-minio (9000, 9001)
  • milvus-etcd

Label Studio Annotation Tool

# Using Makefile
make install-label-studio

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile label-studio up -d

Access: http://localhost:30001

Default credentials:

MinerU PDF Processing

# Build MinerU image
make build-mineru

# Deploy MinerU
make install-mineru

DeerFlow Workflow Service

# Using Makefile
make install-deer-flow

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile deer-flow up -d

Environment Variables

Variable | Default | Description
-------- | ------- | -----------
DB_PASSWORD | password | Database password
DATAMATE_JWT_ENABLE | false | Enable JWT authentication
REGISTRY | ghcr.io/modelengine-group/ | Image registry
VERSION | latest | Image version
LABEL_STUDIO_HOST | - | Label Studio access URL

Data Volume Management

DataMate uses Docker volumes for persistence:

# View all volumes
docker volume ls | grep datamate

# View volume details
docker volume inspect datamate-dataset-volume

# Backup volume data
docker run --rm -v datamate-dataset-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/dataset-backup.tar.gz /data

Kubernetes/Helm Deployment

Prerequisites

# Check cluster connection
kubectl cluster-info
kubectl get nodes

# Check Helm version
helm version

# Create namespace (optional)
kubectl create namespace datamate

Using Makefile

# Deploy DataMate
make install INSTALLER=k8s

# Or deploy to specific namespace
make install NAMESPACE=datamate INSTALLER=k8s

Using Helm

1. Deploy Basic Services

# Deploy DataMate
helm upgrade datamate deployment/helm/datamate/ \
  --install \
  --namespace datamate \
  --create-namespace \
  --set global.image.repository=ghcr.io/modelengine-group/

# Check deployment status
kubectl get pods -n datamate

2. Configure Ingress (Optional)

# Edit values.yaml
cat >> deployment/helm/datamate/values.yaml << EOF
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: datamate.example.com
      paths:
        - path: /
          pathType: Prefix
EOF

# Redeploy
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  -f deployment/helm/datamate/values.yaml

3. Deploy Optional Components

# Deploy Milvus
helm upgrade milvus deployment/helm/milvus \
  --install \
  --namespace datamate

# Deploy Label Studio
helm upgrade label-studio deployment/helm/label-studio/ \
  --install \
  --namespace datamate

Offline Deployment

Prepare Offline Images

1. Download Images

# Download all images locally
make download SAVE=true

# Download specific version
make download VERSION=v1.0.0 SAVE=true

Images saved in dist/ directory.

2. Package and Transfer

# Package
tar czf datamate-images.tar.gz dist/

# Transfer to target server
scp datamate-images.tar.gz user@target-server:/tmp/

Offline Installation

1. Load Images

# Extract on target server
tar xzf datamate-images.tar.gz

# Load all images
make load-images

2. Modify Configuration

Set an empty REGISTRY so Compose uses the locally loaded images:

REGISTRY= docker compose -f deployment/docker/datamate/docker-compose.yml up -d

Upgrade Guide

Docker Compose Upgrade

# 1. Backup data
docker run --rm -v datamate-postgresql-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/postgres-backup.tar.gz /data

# 2. Pull new images
docker pull ghcr.io/modelengine-group/datamate-backend:latest

# 3. Stop services
docker compose -f deployment/docker/datamate/docker-compose.yml down

# 4. Start new version
docker compose -f deployment/docker/datamate/docker-compose.yml up -d

# 5. Verify upgrade
docker ps
docker logs -f datamate-backend

Or use Makefile:

make datamate-docker-upgrade

Kubernetes Upgrade

# 1. Backup data
kubectl exec -n datamate deployment/datamate-database -- \
  pg_dump -U postgres datamate > backup.sql

# 2. Update Helm Chart
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  --set global.image.tag=new-version

Uninstall

Docker Compose Complete Uninstall

# Using Makefile
make uninstall

# Choose to delete volumes for complete cleanup

Or manual uninstall:

# Stop and remove containers
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus --profile label-studio down -v

# Remove all volumes
docker volume rm datamate-dataset-volume \
  datamate-postgresql-volume \
  datamate-log-volume

# Remove network
docker network rm datamate-network

Kubernetes Complete Uninstall

# Uninstall all components
make uninstall INSTALLER=k8s

# Or use Helm
helm uninstall datamate -n datamate
helm uninstall milvus -n datamate
helm uninstall label-studio -n datamate

# Delete namespace
kubectl delete namespace datamate

Troubleshooting

Common Issues

1. Service Won’t Start

# Check port conflicts
netstat -tlnp | grep -E '30000|8080|5432'

# Check disk space
df -h

# Check memory
free -h

# View detailed logs
docker logs datamate-backend --tail 100

2. Database Connection Failed

# Check database container
docker ps | grep database

# Test connection
docker exec -it datamate-database psql -U postgres -d datamate

2.2 - System Architecture

DataMate system architecture design documentation

This document details DataMate’s system architecture, tech stack, and design philosophy.

Overall Architecture

DataMate adopts a microservices architecture, splitting the system into multiple independent services, each responsible for specific business functions. This architecture provides good scalability, maintainability, and fault tolerance.

┌─────────────────────────────────────────────────────────────────┐
│                           Frontend Layer                        │
│                    (React + TypeScript)                         │
│                      Ant Design + Tailwind                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        API Gateway Layer                        │
│                    (Spring Cloud Gateway)                       │
│                      Port: 8080                                 │
└────────────────────────┬────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Java Backend│ │ Python Backend│ │  Runtime     │
│   Services   │ │    Service    │ │   Service    │
├──────────────┤ ├──────────────┤ ├──────────────┤
│· Main App    │ │· RAG Service  │ │· Operator    │
│· Data Mgmt   │ │· LangChain    │ │  Execution   │
│· Collection  │ │· FastAPI      │ │              │
│· Cleaning    │ │              │ │              │
│· Annotation  │ │              │ │              │
│· Synthesis   │ │              │ │              │
│· Evaluation  │ │              │ │              │
│· Operator    │ │              │ │              │
│· Pipeline    │ │              │ │              │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
       └────────────────┼────────────────┘
                        ▼
         ┌──────────────┴──────────────┐
         │                              │
    ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌───────────┐
    │PostgreSQL│   │  Redis  │   │  Milvus  │   │   MinIO   │
    │  (5432)  │   │ (6379)  │   │ (19530)  │   │  (9000)   │
    └──────────┘   └─────────┘   └──────────┘   └───────────┘

Tech Stack

Frontend Tech Stack

Technology | Version | Purpose
---------- | ------- | -------
React | 18.x | UI framework
TypeScript | 5.x | Type safety
Ant Design | 5.x | UI component library
Tailwind CSS | 3.x | Styling framework
Redux Toolkit | 2.x | State management
React Router | 6.x | Routing management
Vite | 5.x | Build tool

Backend Tech Stack (Java)

Technology | Version | Purpose
---------- | ------- | -------
Java | 21 | Runtime environment
Spring Boot | 3.5.6 | Application framework
Spring Cloud | 2023.x | Microservices framework
MyBatis Plus | 3.x | ORM framework
PostgreSQL Driver | 42.x | Database driver
Redis | 5.x | Cache client
MinIO | 8.x | Object storage client

Backend Tech Stack (Python)

Technology | Version | Purpose
---------- | ------- | -------
Python | 3.11+ | Runtime environment
FastAPI | 0.100+ | Web framework
LangChain | 0.1+ | LLM application framework
Ray | 2.x | Distributed computing
Pydantic | 2.x | Data validation

Data Storage

Technology | Version | Purpose
---------- | ------- | -------
PostgreSQL | 15+ | Main database
Redis | 8.x | Cache and message queue
Milvus | 2.6.5 | Vector database
MinIO | RELEASE.2024+ | Object storage

Microservices Architecture

Service List

Service Name | Port | Tech Stack | Description
------------ | ---- | ---------- | -----------
API Gateway | 8080 | Spring Cloud Gateway | Unified entry, routing, auth
Frontend | 30000 | React | Frontend UI
Main Application | - | Spring Boot | Core business logic
Data Management Service | 8092 | Spring Boot | Dataset management
Data Collection Service | - | Spring Boot | Data collection tasks
Data Cleaning Service | - | Spring Boot | Data cleaning tasks
Data Annotation Service | - | Spring Boot | Data annotation tasks
Data Synthesis Service | - | Spring Boot | Data synthesis tasks
Data Evaluation Service | - | Spring Boot | Data evaluation tasks
Operator Market Service | - | Spring Boot | Operator marketplace
RAG Indexer Service | - | Spring Boot | Knowledge base indexing
Runtime Service | 8081 | Python + Ray | Operator execution engine
Backend Python Service | 18000 | FastAPI | Python backend service
Database | 5432 | PostgreSQL | Database

Service Communication

Synchronous Communication

  • API Gateway → Backend Services: HTTP/REST
  • Frontend → API Gateway: HTTP/REST
  • Backend Service ↔ Backend Service: HTTP/REST (Feign Client)

Asynchronous Communication

  • Task Execution: Database task queue
  • Event Notification: Redis Pub/Sub
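
The event-notification pattern can be sketched with an in-memory stand-in for Redis Pub/Sub (an illustration of the pattern only; the real services would use a Redis client):

```python
from collections import defaultdict
from typing import Callable

class MiniPubSub:
    """In-memory stand-in for Redis Pub/Sub, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        # Deliver to every subscriber; return the number notified, as Redis does.
        for handler in self._subscribers[channel]:
            handler(message)
        return len(self._subscribers[channel])

bus = MiniPubSub()
received = []
bus.subscribe("task.finished", received.append)
bus.publish("task.finished", "cleaning-task-42 done")
print(received)  # ['cleaning-task-42 done']
```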

Data Architecture

Data Flow

┌─────────────┐
│  Data       │ Collection task config
│  Collection │ → DataX → Raw data
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Data       │ Dataset management, file upload
│  Management │ → Structured storage
└──────┬──────┘
       │
       ├──────────────┐
       ▼              ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │ Knowledge   │
│  Cleaning   │  │ Base        │
│             │  │             │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │ Vector      │
│  Annotation │  │ Index       │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                │
┌─────────────┐          │
│  Data       │          │
│  Synthesis  │          │
└──────┬──────┘          │
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │  RAG        │
│  Evaluation │  │ Retrieval   │
└─────────────┘  └─────────────┘

Deployment Architecture

Docker Compose Deployment

┌────────────────────────────────────────────────┐
│              Docker Network                    │
│            datamate-network                    │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │Frontend  │  │ Gateway  │  │ Backend  │   │
│  │ :30000   │  │  :8080   │  │          │   │
│  └──────────┘  └──────────┘  └──────────┘   │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │Backend   │  │ Runtime  │  │Database  │   │
│  │  Python  │  │  :8081   │  │  :5432   │   │
│  └──────────┘  └──────────┘  └──────────┘   │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Milvus  │  │  MinIO   │  │  etcd    │   │
│  │  :19530  │  │  :9000   │  │          │   │
│  └──────────┘  └──────────┘  └──────────┘   │
└────────────────────────────────────────────────┘

Kubernetes Deployment

┌────────────────────────────────────────────────┐
│           Kubernetes Cluster                   │
│                                                │
│  Namespace: datamate                           │
│                                                │
│  ┌────────────┐  ┌────────────┐              │
│  │ Deployment │  │ Deployment │              │
│  │  Frontend  │  │  Gateway   │              │
│  │   (3 Pods) │  │  (2 Pods)  │              │
│  └─────┬──────┘  └─────┬──────┘              │
│        │                │                     │
│  ┌─────▼────────────────▼──────┐              │
│  │       Service (LoadBalancer) │              │
│  └──────────────────────────────┘              │
│                                                │
│  ┌────────────┐  ┌────────────┐              │
│  │ StatefulSet│  │ Deployment │              │
│  │  Database  │  │  Backend   │              │
│  └────────────┘  └────────────┘              │
└────────────────────────────────────────────────┘

Security Architecture

Authentication & Authorization

JWT Authentication (Optional)

datamate:
  jwt:
    enable: true  # Enable JWT authentication
    secret: your-secret-key
    expiration: 86400  # 24 hours
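
To see what the gateway's JWT check amounts to, here is a stdlib sketch of HS256 signing and verification (illustrative of the mechanism only — not DataMate's gateway code, and it omits claim checks such as expiration):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_hs256(payload: dict, secret: str) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret.encode(), header + b"." + body, hashlib.sha256).digest())
    return b".".join([header, body, sig]).decode()

def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.encode().split(b".")
    expected = b64url(hmac.new(secret.encode(), header + b"." + body, hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)

token = sign_hs256({"sub": "admin"}, "your-secret-key")
print(verify_hs256(token, "your-secret-key"))  # True
print(verify_hs256(token, "wrong-secret"))     # False
```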

API Key Authentication

datamate:
  api-key:
    enable: false

Data Security

Transport Encryption

  • API Gateway supports HTTPS/TLS
  • Internal service communication can be encrypted

Storage Encryption

  • Database: Transparent data encryption (TDE)
  • MinIO: Server-side encryption
  • Milvus: Encryption at rest

2.3 - Development Environment Setup

Local development environment configuration guide for DataMate

This document describes how to set up a local development environment for DataMate.

Prerequisites

Required Software

Software | Version | Purpose
-------- | ------- | -------
Node.js | 18.x+ | Frontend development
pnpm | 8.x+ | Frontend package management
Java | 21 | Backend development
Maven | 3.9+ | Backend build
Python | 3.11+ | Python service development
Docker | 20.10+ | Containerized deployment
Docker Compose | 2.0+ | Service orchestration
Git | 2.x+ | Version control
Make | 4.x+ | Build automation

Recommended Tools

  • IDE: IntelliJ IDEA (backend) + VS Code (frontend/Python)
  • Database Client: DBeaver, pgAdmin
  • API Testing: Postman, curl
  • Git Client: GitKraken, SourceTree

Code Structure

DataMate/
├── backend/                 # Java backend
│   ├── services/           # Microservice modules
│   │   ├── main-application/
│   │   ├── data-management-service/
│   │   ├── data-cleaning-service/
│   │   └── ...
│   ├── openapi/            # OpenAPI specs
│   └── scripts/            # Build scripts
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/    # Common components
│   │   ├── pages/         # Page components
│   │   ├── services/      # API services
│   │   ├── store/         # Redux store
│   │   └── routes/        # Routes config
│   └── package.json
├── runtime/                # Python runtime
│   └── datamate/          # DataMate runtime
└── deployment/             # Deployment configs
    ├── docker/            # Docker configs
    └── helm/              # Helm charts

Backend Development

1. Install Java 21

# macOS (Homebrew)
brew install openjdk@21

# Linux (Ubuntu/Debian)
sudo apt update
sudo apt install openjdk-21-jdk

# Verify
java -version

2. Install Maven

# macOS
brew install maven

# Linux
sudo apt install maven

# Verify
mvn -version

3. Configure IDE (IntelliJ IDEA)

Install Plugins

  • Lombok Plugin
  • MyBatis Plugin
  • Rainbow Brackets
  • GitToolBox

Import Project

  1. Open IntelliJ IDEA
  2. File → Open
  3. Select backend directory
  4. Wait for Maven dependency download

4. Configure Database

Start Local Database (Docker)

# Start database only
docker compose -f deployment/docker/datamate/docker-compose.yml up -d datamate-database

Connection info:

  • Host: localhost
  • Port: 5432
  • Database: datamate
  • Username: postgres
  • Password: password

5. Run Backend Service

Using Maven

cd backend/services/main-application
mvn spring-boot:run

Using IDE

  1. Find Application class
  2. Right-click → Run
  3. Access http://localhost:8080

Frontend Development

1. Install Node.js

# macOS
brew install node@18

# Linux
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

2. Install pnpm

npm install -g pnpm

3. Install Dependencies

cd frontend
pnpm install

4. Configure Dev Environment

Create .env.development:

VITE_API_BASE_URL=http://localhost:8080
VITE_API_TIMEOUT=30000

5. Start Dev Server

pnpm dev

Access http://localhost:3000

Python Service Development

1. Install Python 3.11

# macOS
brew install python@3.11

# Linux
sudo apt install python3.11 python3.11-venv

2. Create Virtual Environment

cd runtime/datamate
python3.11 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Run Python Service

python operator_runtime.py --port 8081

Local Debugging

Start All Services

Using Docker Compose

# Start base services (database, Redis, etc.)
docker compose -f deployment/docker/datamate/docker-compose.yml up -d \
  datamate-database \
  datamate-redis

# Start Milvus (optional)
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

Start Backend Services

# Terminal 1: Main Application
cd backend/services/main-application
mvn spring-boot:run

# Terminal 2: Data Management Service
cd backend/services/data-management-service
mvn spring-boot:run

Start Frontend

cd frontend
pnpm dev

Start Python Services

# Runtime Service
cd runtime/datamate
python operator_runtime.py --port 8081

# Backend Python Service
cd backend-python
uvicorn main:app --reload --port 18000

Code Standards

Java Code Standards

Naming Conventions

  • Class name: PascalCase UserService
  • Method name: camelCase getUserById
  • Constants: UPPER_CASE MAX_SIZE
  • Variables: camelCase userName

TypeScript Code Standards

Naming Conventions

  • Components: PascalCase UserProfile
  • Types/Interfaces: PascalCase UserData
  • Functions: camelCase getUserData
  • Constants: UPPER_CASE API_BASE_URL

Python Code Standards

Follow PEP 8:

def get_user(user_id: int) -> dict:
    """Get user information

    Args:
        user_id: User ID

    Returns:
        User information dictionary
    """
    return {"id": user_id}  # placeholder body so the example runs

Common Issues

Backend Won’t Start

  1. Check Java version: java -version
  2. Check port conflicts: lsof -i :8080
  3. View logs
  4. Clean and rebuild: mvn clean install

Frontend Won’t Start

  1. Check Node version: node -v
  2. Delete node_modules: rm -rf node_modules && pnpm install
  3. Check port: lsof -i :3000

3 - User Guide

DataMate feature usage guides

This guide introduces how to use each feature module of DataMate.

DataMate provides a comprehensive data processing solution for large models, covering the full process: data collection, management, cleaning, annotation, synthesis, and evaluation.

Typical Use Cases

Model Fine-tuning Scenario

1. Data Collection → 2. Data Management → 3. Data Cleaning → 4. Data Annotation
↓
5. Data Evaluation → 6. Export Training Data

RAG Application Scenario

1. Upload Documents → 2. Vectorization Index → 3. Knowledge Base Management
↓
4. Agent Chat (Knowledge Base Q&A)

Data Augmentation Scenario

1. Prepare Raw Data → 2. Create Instruction Template → 3. Data Synthesis
↓
4. Quality Evaluation → 5. Export Augmented Data

3.1 - Data Collection

Collect data from multiple data sources with DataMate

Data collection module helps you collect data from multiple data sources (databases, file systems, APIs, etc.) into the DataMate platform.

Features Overview

Based on DataX, data collection module supports:

  • Multiple Data Sources: MySQL, PostgreSQL, Oracle, SQL Server, etc.
  • Heterogeneous Sync: Data sync between different sources
  • Batch Collection: Large-scale batch collection and sync
  • Scheduled Tasks: Support scheduled execution
  • Task Monitoring: Real-time monitoring of collection tasks

Supported Data Sources

Data Source Type | Reader | Writer | Description
---------------- | ------ | ------ | -----------
General Relational Databases | | | Supports MySQL, PostgreSQL, OpenGauss, SQL Server, DM, DB2
MySQL | | | Relational database
PostgreSQL | | | Relational database
OpenGauss | | | Relational database
SQL Server | | | Microsoft database
DM (Dameng) | | | Domestic database
DB2 | | | IBM database
StarRocks | | | Analytical database
NAS | | | Network storage
S3 | | | Object storage
GlusterFS | | | Distributed file system
API Collection | | | API interface data
JSON Files | | | JSON format files
CSV Files | | | CSV format files
TXT Files | | | Text files
FTP | | | FTP servers
HDFS | | | Hadoop HDFS

Quick Start

1. Create Collection Task

Step 1: Enter Data Collection Page

Select Data Collection in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

Fill in the following basic information:

  • Name: A meaningful name for the task
  • Timeout: Task execution timeout (seconds)
  • Description: Task purpose (optional)

Step 4: Select Sync Mode

Select the task synchronization mode:

  • Immediate Sync: Execute once immediately after task creation
  • Scheduled Sync: Execute periodically according to schedule rules

When selecting Scheduled Sync, configure the execution policy:

  • Execution Cycle: Hourly / Daily / Weekly / Monthly
  • Execution Time: Select the execution time point

Step 5: Configure Data Source

Select data source type: Choose from dropdown list (e.g., MySQL, CSV, etc.)

Configure data source parameters: Fill in connection parameters based on the selected data source template (form format)

MySQL Example:

  • JDBC URL: jdbc:mysql://localhost:3306/mydb
  • Username: root
  • Password: password
  • Table Name: users

Step 6: Configure Field Extraction

Field mapping is not supported. You can only extract specific fields from the configured SQL.

  • Extract specific fields: Enter the field names you want to extract in the field list
  • Extract all fields: Leave the field list empty to extract all fields from the SQL query result

Step 7: Create and Execute

Click Create button to create the task.

  • If Immediate Sync is selected, task starts immediately
  • If Scheduled Sync is selected, task runs periodically according to schedule

2. Monitor Task Execution

View all collection tasks with status, progress, and operations.

3. Task Management

Each task in the task list has the following actions available:

  • View Execution Records: View all historical executions of the task
  • Delete: Delete the task (note: deleting a task does not delete collected data)

Click the task name to view task details including:

  • Basic configuration
  • Execution record list
  • Data statistics

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data source connection
  2. View execution logs
  3. Check data format
  4. Verify target dataset exists

Q: How to collect large tables?

A:

  1. Use incremental collection
  2. Split into multiple tasks
  3. Adjust concurrent parameters
  4. Use filter conditions

API Reference

3.2 - Data Management

Manage datasets and files with DataMate

Data management module provides unified dataset management capabilities, supporting multiple data types for storage, query, and operations.

Features Overview

Data management module provides:

  • Multiple data types: Image, text, audio, video, and multimodal support
  • File management: Upload, download, preview, delete operations
  • Directory structure: Support for hierarchical directory organization
  • Tag management: Use tags to categorize and retrieve data
  • Statistics: Dataset size, file count, and other statistics

Dataset Types

| Type | Description | Supported Formats |
| --- | --- | --- |
| Image | Image data | JPG, PNG, GIF, BMP, WebP |
| Text | Text data | TXT, MD, JSON, CSV |
| Audio | Audio data | MP3, WAV, FLAC, AAC |
| Video | Video data | MP4, AVI, MOV, MKV |
| Multimodal | Multimodal data | Mixed formats |

Quick Start

1. Create Dataset

Step 1: Enter Data Management Page

In the left navigation, select Data Management.

Step 2: Create Dataset

Click the Create Dataset button in the upper right corner.

Step 3: Fill Basic Information

  • Dataset name: e.g., user_images_dataset
  • Dataset type: Select data type (e.g., Image)
  • Description: Dataset purpose description (optional)
  • Tags: Add tags for categorization (optional)

Step 4: Create Dataset

Click the Create button to complete.

2. Upload Files

Method 1: Drag & Drop

  1. Enter dataset details page
  2. Drag files directly to the upload area
  3. Wait for upload completion

Method 2: Click Upload

  1. Click Upload File button
  2. Select local files
  3. Wait for upload completion

Method 3: Chunked Upload (Large Files)

For large files (>100MB), the system automatically uses chunked upload:

  1. Select large file to upload
  2. System automatically splits the file
  3. Upload chunks one by one
  4. Automatically merge
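
The split-and-merge steps above can be sketched as follows. This is an illustrative sketch, not DataMate's actual upload code; the 5 MB chunk size is a hypothetical value chosen for the example.

```python
# Illustrative sketch of the chunk-split step: split a payload into
# fixed-size chunks that can be uploaded one by one and merged server-side.
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB per chunk (hypothetical value)

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split raw bytes into consecutive chunks of at most chunk_size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

payload = b"x" * (12 * 1024 * 1024)   # a 12 MB payload
chunks = split_into_chunks(payload)   # 12 MB / 5 MB per chunk -> 3 chunks
merged = b"".join(chunks)             # merging restores the original payload
```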

3. Create Directory

Step 1: Enter Dataset

Click dataset name to enter details.

Step 2: Create Directory

  1. Click Create Directory button
  2. Enter directory name
  3. Select parent directory (optional)
  4. Click confirm

Directory structure example:

user_images_dataset/
├── train/
│   ├── cat/
│   └── dog/
├── test/
│   ├── cat/
│   └── dog/
└── validation/
    ├── cat/
    └── dog/

4. Manage Files

View Files

In dataset details page, you can see all files:

| Filename | Size | File Count | Upload Time | Tags | Tag Update Time | Actions |
| --- | --- | --- | --- | --- | --- | --- |
| image1.jpg | 2.3 MB | 1 | 2024-01-15 | Training Set | 2024-01-16 | Download / Rename / Delete |
| image2.png | 1.8 MB | 1 | 2024-01-15 | Validation Set | 2024-01-16 | Download / Rename / Delete |

Preview File

Click Preview button to preview in browser:

  • Image: Display thumbnail and details
  • Text: Display text content
  • Audio: Online playback
  • Video: Online playback

Download File

  • Single file download: Click Download button

Currently, batch download and package download are not supported.

5. Dataset Operations

View Statistics

In dataset details page, you can see:

  • Total files: Total number of files in dataset
  • Total size: Total size of all files

Edit Dataset

Click Edit button to modify:

  • Dataset name
  • Description
  • Tags
  • Associated collection task

Delete Dataset

Click Delete button to delete entire dataset.

Note: Deleting a dataset will also delete all files within it. This action cannot be undone.

Advanced Features

Tag Management

Create Tag

  1. In dataset list page, click Tag Management
  2. Click Create Tag
  3. Enter tag name

Use Tags

  1. Edit dataset
  2. Select existing tags in tag bar
  3. Save dataset

Filter by Tags

In dataset list page, click tags to filter datasets with that tag.

Best Practices

1. Dataset Organization

Recommended directory organization:

project_dataset/
├── raw/              # Raw data
├── processed/        # Processed data
├── train/            # Training data
├── validation/       # Validation data
└── test/             # Test data

2. Naming Conventions

  • Dataset name: Use lowercase letters and underscores, e.g., user_images_2024
  • Directory name: Use meaningful English names, e.g., train, test, processed
  • File name: Keep original filename or use standardized naming

3. Tag Usage

Recommended tag categories:

  • Project tags: project-a, project-b
  • Status tags: raw, processed, validated
  • Type tags: image, text, audio
  • Purpose tags: training, testing, evaluation

4. Data Backup

The system currently does not support automatic backup. To backup data, you can manually download individual files:

  1. Enter dataset details page
  2. Find the file you need to backup
  3. Click the Download button of the file

Common Questions

Q: Large file upload fails?

A: Suggestions for large file uploads:

  1. Use chunked upload: System automatically enables chunked upload
  2. Check network: Ensure stable network connection
  3. Adjust upload parameters: Increase timeout
  4. Use FTP/SFTP: For very large files, use FTP upload

Q: How to import existing data?

A: Three methods to import existing data:

  1. Upload files: Upload via interface
  2. Add files: If files already on server, use add file feature
  3. Data collection: Use data collection module to collect from external sources

Q: Dataset size limit?

A: Dataset size limits:

  • Single file: Maximum 5GB (chunked upload)
  • Total dataset: Limited by storage space
  • File count: No explicit limit

Regularly clean unnecessary files to free up space.

API Reference

For detailed API documentation, see:

3.3 - Data Cleaning

Clean and preprocess data with DataMate

Data cleaning module provides powerful data processing capabilities to help you clean, transform, and optimize data quality.

Features Overview

Data cleaning module provides:

  • Built-in Cleaning Operators: Rich pre-cleaning operator library
  • Visual Configuration: Drag-and-drop cleaning pipeline design
  • Template Management: Save and reuse cleaning templates
  • Batch Processing: Support large-scale data batch cleaning
  • Real-time Preview: Preview cleaning results

Cleaning Operator Types

Data Quality Operators

| Operator | Function | Applicable Data Types |
| --- | --- | --- |
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |

Text Cleaning Operators

| Operator | Function |
| --- | --- |
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |

Quick Start

1. Create Cleaning Task

Step 1: Enter Data Cleaning Page

Select Data Processing in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

  • Task name: e.g., user_data_cleansing
  • Source dataset: Select dataset to clean
  • Output dataset: Select or create output dataset

Step 4: Configure Cleaning Pipeline

  1. Drag operators from left library to canvas
  2. Connect operators to form pipeline
  3. Configure operator parameters
  4. Preview cleaning results

Example pipeline:

Input Data → Deduplication → Null Handling → Format Validation → Output Data
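
The example pipeline above can be sketched as a chain of simple functions. This is an illustrative sketch only; the real operators are configured visually on the canvas, and the function names here are hypothetical.

```python
# Minimal sketch of Deduplication -> Null Handling -> Format Validation.
def deduplicate(records):
    """Drop exact duplicate records, keeping first occurrence."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def handle_nulls(records, default=""):
    """Replace null (None) field values with a default."""
    return [{k: (default if v is None else v) for k, v in r.items()} for r in records]

def validate_format(records, required=("name",)):
    """Keep only records whose required fields are non-empty."""
    return [r for r in records if all(r.get(f) for f in required)]

raw = [
    {"name": "alice", "email": None},
    {"name": "alice", "email": None},     # duplicate, removed
    {"name": None, "email": "b@x.com"},   # fails validation after null fill
]
cleaned = validate_format(handle_nulls(deduplicate(raw)))
```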

2. Use Cleaning Templates

Create Template

  1. Configure cleaning pipeline
  2. Click Save as Template
  3. Enter template name
  4. Save

Use Template

  1. Create cleaning task
  2. Click Use Template
  3. Select template
  4. Adjust as needed

3. Monitor Cleaning Task

View task status, progress, and statistics in task list.

Advanced Features

Custom Operators

Develop custom operators. See:

Conditional Branching

Add conditional branches in pipeline:

Input Data → [Condition Check]
              ├── Satisfied → Pipeline A
              └── Not Satisfied → Pipeline B

Best Practices

1. Pipeline Design

Recommended principles:

  • Modular: Split complex pipelines
  • Reusable: Use templates and parameters
  • Maintainable: Add comments
  • Testable: Test individually before combining

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally when possible
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data format
  2. View execution logs
  3. Check operator parameters
  4. Test individual operators
  5. Reduce data size for testing

Q: Cleaning speed is slow?

A: Optimize:

  1. Reduce operator count
  2. Optimize operator order
  3. Increase concurrency
  4. Use incremental processing

API Reference

3.4 - Data Annotation

Perform data annotation with DataMate

Data annotation module integrates Label Studio to provide professional-grade data annotation capabilities.

Features Overview

Data annotation module provides:

  • Multiple Annotation Types: Image, text, audio, etc.
  • Annotation Templates: Rich annotation templates and configurations
  • Quality Control: Annotation review and consistency checks
  • Team Collaboration: Multi-person collaborative annotation
  • Annotation Export: Export annotation results

Annotation Types

Image Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Image Classification | Classify entire image | Scene recognition |
| Object Detection | Annotate object locations | Object recognition |
| Semantic Segmentation | Pixel-level classification | Medical imaging |
| Key Point Annotation | Annotate key points | Pose estimation |

Text Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Text Classification | Classify text | Sentiment analysis |
| Named Entity Recognition | Annotate entities | Information extraction |
| Text Summarization | Generate summaries | Document understanding |

Quick Start

1. Deploy Label Studio

make install-label-studio

Access: http://localhost:30001

Default credentials:

2. Create Annotation Task

Step 1: Enter Data Annotation Page

Select Data Annotation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., image_classification_task
  • Source dataset: Select dataset to annotate
  • Annotation type: Select type

Step 4: Configure Annotation Template

Image Classification Template:

<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="cat"/>
    <Choice value="dog"/>
    <Choice value="bird"/>
  </Choices>
</View>

Step 5: Configure Annotation Rules

  • Annotation method: Single label / Multi label
  • Minimum annotations: Per sample (for consistency)
  • Review mechanism: Enable/disable review

3. Start Annotation

  1. Enter annotation interface
  2. View sample to annotate
  3. Perform annotation
  4. Click Submit
  5. Auto-load next sample

Advanced Features

Quality Control

Annotation Consistency

Check consistency between annotators:

  • Cohen’s Kappa: Evaluate consistency
  • Majority vote: Use majority annotation results
  • Expert review: Expert reviews disputed annotations
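
Cohen's kappa mentioned above can be computed as in the following self-contained sketch (DataMate computes this internally; the annotator data here is made up):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same samples."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] / n * cb[c] / n for c in set(a) | set(b))   # chance agreement
    return (po - pe) / (1 - pe)

ann1 = ["cat", "cat", "dog", "dog"]
ann2 = ["cat", "cat", "dog", "cat"]
kappa = cohens_kappa(ann1, ann2)  # 1.0 means perfect agreement, 0 means chance level
```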

Pre-annotation

Use models for pre-annotation:

  1. Train or use existing model
  2. Pre-annotate dataset
  3. Annotators correct pre-annotations

Best Practices

1. Annotation Guidelines

Create clear guidelines:

  • Define standards: Clear annotation standards
  • Provide examples: Positive and negative examples
  • Edge cases: Handle edge cases
  • Train annotators: Ensure understanding

Common Questions

Q: Poor annotation quality?

A: Improve:

  1. Refine guidelines
  2. Strengthen training
  3. Increase reviews
  4. Use pre-annotation

3.5 - Data Synthesis

Use large models for data augmentation and synthesis

Data synthesis module leverages large model capabilities to automatically generate high-quality training data, reducing data collection costs.

Features Overview

Data synthesis module provides:

  • Instruction template management: Create and manage synthesis instruction templates
  • Single task synthesis: Create individual synthesis tasks
  • Proportional synthesis task: Synthesize multi-category balanced data by specified ratios
  • Large model integration: Support for multiple LLM APIs
  • Quality evaluation: Automatic evaluation of synthesized data quality

Quick Start

1. Create Instruction Template

Step 1: Enter Data Synthesis Page

In the left navigation, select Data Synthesis → Synthesis Tasks.

Step 2: Create Instruction Template

  1. Click Instruction Templates tab
  2. Click Create Template button

Step 3: Configure Template

Basic Information:

  • Template name: e.g., qa_generation_template
  • Template description: Describe template purpose (optional)
  • Template type: Select template type (Q&A, dialogue, summary, etc.)

Prompt Configuration:

Example prompt:

You are a professional data generation assistant. Generate data based on the following requirements:

Task: Generate Q&A pairs
Topic: {topic}
Count: {count}
Difficulty: {difficulty}

Requirements:
1. Questions should be clear and specific
2. Answers should be accurate and complete
3. Cover different difficulty levels

Output format: JSON
[
  {
    "question": "...",
    "answer": "..."
  }
]

Parameter Configuration:

  • Model: Select LLM to use (GPT-4, Claude, local model, etc.)
  • Temperature: Control generation randomness (0-1)
  • Max tokens: Limit generation length
  • Other parameters: Configure according to model

Step 4: Save Template

Click Save button to save template.

2. Create Synthesis Task

Step 1: Fill Basic Information

  1. Return to Data Synthesis page
  2. Click Create Task button
  3. Fill basic information:
    • Task name: e.g., medical_qa_synthesis
    • Task description: Describe task purpose (optional)

Step 2: Select Dataset and Files

Select required data from existing datasets:

  • Select dataset: Choose the dataset to use from the list
  • Select files:
    • Can select all files from a dataset
    • Can also select specific files from a dataset
    • Support selecting multiple files

Step 3: Select Synthesis Instruction Template

Select an existing template or create a new one:

  • Select from template library: Choose from created templates
  • Template type: Q&A generation, dialogue generation, summary generation, etc.
  • Preview template: View template prompt content

Step 4: Fill Synthesis Configuration

The synthesis configuration consists of four parts:

1. Set Total Synthesis Count

Set the maximum limit for the entire task:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Maximum QA Pairs | Maximum number of QA pairs to generate for entire task | 5000 | 1-100,000 |

This setting is optional, used for total volume control in large-scale synthesis tasks.

2. Configure Text Chunking Strategy

Chunk the input text files, supporting multiple chunking methods:

| Parameter | Description | Default Value |
| --- | --- | --- |
| Chunking Method | Select chunking strategy | Default chunking |
| Chunk Size | Character count per chunk | 3000 |
| Overlap Size | Overlap characters between adjacent chunks | 100 |

Chunking Method Options:

  • Default Chunking (默认分块): Use system default intelligent chunking strategy
  • Chapter-based Chunking (按章节分块): Split by chapter structure
  • Paragraph-based Chunking (按段落分块): Split by paragraph boundaries
  • Fixed Length Chunking (固定长度分块): Split by fixed character length
  • Custom Separator Chunking (自定义分隔符分块): Split by custom delimiter
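
The fixed-length strategy with overlapping boundaries can be sketched as follows (an illustrative sketch, not the platform's chunker; the 3000/100 defaults are taken from the table above):

```python
def fixed_length_chunks(text: str, chunk_size: int = 3000, overlap: int = 100):
    """Fixed-length chunking: each chunk shares `overlap` characters
    with the previous one so context is not cut mid-sentence."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 100  # a 1000-character document
chunks = fixed_length_chunks(doc, chunk_size=300, overlap=50)
```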

3. Configure Question Synthesis Parameters

Set parameters for question generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Question Count | Number of questions generated per chunk | 1 | 1-20 |
| Temperature | Control randomness and diversity of question generation | 0.7 | 0-2 |
| Model | Select CHAT model for question generation | - | Select from model list |

Parameter Notes:

  • Question Count: Number of questions generated per text chunk. Higher value generates more questions.
  • Temperature: Higher values produce more diverse questions, lower values produce more stable questions.

4. Configure Answer Synthesis Parameters

Set parameters for answer generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Temperature | Control stability of answer generation | 0.7 | 0-2 |
| Model | Select CHAT model for answer generation | - | Select from model list |

Parameter Notes:

  • Temperature: Lower values produce more conservative and accurate answers, higher values produce more diverse and creative answers.

Synthesis Types: The system supports two synthesis types:

  • SFT Q&A Synthesis (SFT 问答数据合成): Generate Q&A pairs for supervised fine-tuning
  • COT Chain-of-Thought Synthesis (COT 链式推理合成): Generate data with reasoning process

Step 5: Start Task

Click Start Task button, task will automatically start executing.

3. Create Ratio Synthesis Task

Ratio synthesis tasks are used to synthesize multi-category balanced data in specified proportions.

Step 1: Create Ratio Task

  1. In the left navigation, select Data Synthesis → Ratio Tasks
  2. Click Create Task button

Step 2: Fill Basic Information

| Parameter | Description | Required |
| --- | --- | --- |
| Task Name | Unique identifier for the task | Yes |
| Total Target Count | Target total count for entire ratio task | Yes |
| Task Description | Describe purpose and requirements of ratio task | No |

Example:

  • Task name: balanced_dataset_synthesis
  • Total target count: 10000
  • Task description: Generate balanced data for training and validation sets

Step 3: Select Datasets

Select datasets to participate in the ratio synthesis from existing datasets:

Dataset Selection Features:

  • Search Datasets: Search datasets by keyword
  • Multi-select Support: Can select multiple datasets simultaneously
  • Dataset Information: Display detailed information for each dataset
    • Dataset name and type
    • Dataset description
    • File count
    • Dataset size
    • Label distribution preview (up to 8 labels)

After selecting datasets, the system automatically loads label distribution information for each dataset.

Step 4: Fill Ratio Configuration

Configure specific synthesis rules for each selected dataset:

Ratio Configuration Items:

| Parameter | Description | Range |
| --- | --- | --- |
| Label | Select label from dataset’s label distribution | Based on dataset labels |
| Label Value | Specific value under selected label | Based on label value list |
| Label Update Time | Select label update date range (optional) | Date picker |
| Quantity | Data count to generate for this config | 0 to total target count |

Feature Notes:

  • Auto Distribute: Click “Auto Distribute” button, system automatically distributes total count evenly across datasets
  • Quantity Limit: Each configuration item’s quantity cannot exceed the dataset’s total file count
  • Percentage Calculation: System automatically calculates percentage of each configuration item
  • Delete Configuration: Can delete unwanted configuration items
  • Add Configuration: Each dataset can have multiple different label configurations
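
The "Auto Distribute" behavior described above amounts to an even split of the total target count with the remainder spread over the first items. This is an illustrative sketch; the platform's exact rounding may differ.

```python
def auto_distribute(total: int, n_configs: int) -> list[int]:
    """Evenly distribute a total target count across configuration items,
    giving the first `total % n_configs` items one extra unit."""
    base, rem = divmod(total, n_configs)
    return [base + (1 if i < rem else 0) for i in range(n_configs)]

quantities = auto_distribute(10000, 3)  # total target count from the example above
```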

Example Configuration:

| Dataset | Label | Label Value | Label Update Time | Quantity |
| --- | --- | --- | --- | --- |
| Training Dataset | Category | Training | - | 6000 |
| Training Dataset | Category | Validation | - | 2000 |
| Test Dataset | Category | Test | 2024-01-01 to 2024-12-31 | 2000 |

Step 5: Execute Task

Click Start Task button, the system will create and execute the task according to ratio configuration.

4. Monitor Synthesis Task

View Task List

In data synthesis page, you can see all synthesis tasks:

| Task Name | Template | Status | Progress | Generated Count | Actions |
| --- | --- | --- | --- | --- | --- |
| Medical QA Synthesis | qa_template | Running | 50% | 50/100 | View Details |
| Sentiment Data Synthesis | sentiment_template | Completed | 100% | 1000/1000 | View Details |

Advanced Features

Template Variables

Use variables in prompts for dynamic configuration:

Variable syntax: {variable_name}

Example:

Generate {count} {difficulty} level {type} about {topic}.

Built-in variables:

  • {current_date}: Current date
  • {current_time}: Current time
  • {random_id}: Random ID
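
The substitution of template variables and built-ins can be sketched with `str.format` (an illustrative sketch; `render_template` is a hypothetical name, not the platform's API):

```python
import datetime
import uuid

def render_template(template: str, **variables) -> str:
    """Fill {variable_name} placeholders, merging user variables
    with the built-in variables listed above."""
    now = datetime.datetime.now()
    builtins = {
        "current_date": now.strftime("%Y-%m-%d"),
        "current_time": now.strftime("%H:%M:%S"),
        "random_id": uuid.uuid4().hex[:8],
    }
    return template.format(**{**builtins, **variables})

prompt = render_template(
    "Generate {count} {difficulty} level {type} about {topic}. ({current_date})",
    count=5, difficulty="medium", type="questions", topic="physics",
)
```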

Model Selection

DataMate supports multiple LLMs:

| Model | Type | Description |
| --- | --- | --- |
| GPT-4 | OpenAI | High-quality generation |
| GPT-3.5-Turbo | OpenAI | Fast generation |
| Claude 3 | Anthropic | Long-text generation |
| Wenxin Yiyan | Baidu | Chinese optimized |
| Tongyi Qianwen | Alibaba | Chinese optimized |
| Local Model | Deployed locally | Private deployment |

Best Practices

1. Prompt Design

Good prompts should:

  • Define task clearly: Clearly describe generation task
  • Specify format: Clearly define output format requirements
  • Provide examples: Give expected output examples
  • Control quality: Set quality requirements

Example prompt:

You are a professional educational content creator.

Task: Generate educational Q&A pairs
Subject: {subject}
Grade: {grade}
Count: {count}

Requirements:
1. Questions should be appropriate for the grade level
2. Answers should be accurate, detailed, and easy to understand
3. Each answer should include explanation process
4. Do not generate sensitive or inappropriate content

Output format (JSON):
[
  {
    "id": 1,
    "question": "Question content",
    "answer": "Answer content",
    "explanation": "Explanation content",
    "difficulty": "easy/medium/hard",
    "knowledge_points": ["point1", "point2"]
  }
]

Start generating:

2. Parameter Tuning

Adjust model parameters according to needs:

| Parameter | High Quality | Fast Generation | Creative Generation |
| --- | --- | --- | --- |
| Temperature | 0.3-0.5 | 0.1-0.3 | 0.7-1.0 |
| Max tokens | As needed | Shorter | Longer |
| Top P | 0.9-0.95 | 0.9 | 0.95-1.0 |

Common Questions

Q: Generated data quality is not ideal?

A: Optimization suggestions:

  1. Improve prompt: More detailed and clear instructions
  2. Adjust parameters: Lower temperature, increase max tokens
  3. Provide examples: Give examples in prompt
  4. Change model: Try other LLMs
  5. Manual review: Manual review and filtering

Q: Generation speed is slow?

A: Acceleration suggestions:

  1. Reduce count: Generate in smaller batches
  2. Adjust concurrency: Increase concurrency appropriately
  3. Use faster model: Like GPT-3.5-Turbo
  4. Shorten output: Reduce max tokens
  5. Use local model: Deploy local model for acceleration

API Reference

For detailed API documentation, see:

3.6 - Data Evaluation

Evaluate data quality with DataMate

Data evaluation module provides multi-dimensional data quality evaluation capabilities.

Features Overview

Data evaluation module provides:

  • Quality Metrics: Rich data quality evaluation metrics
  • Automatic Evaluation: Auto-execute evaluation tasks
  • Manual Evaluation: Manual sampling evaluation
  • Evaluation Reports: Generate detailed reports
  • Quality Tracking: Track data quality trends

Evaluation Dimensions

Data Completeness

| Metric | Description | Calculation |
| --- | --- | --- |
| Null Rate | Null value ratio | Null count / Total count |
| Missing Field Rate | Required field missing rate | Missing fields / Total fields |
| Record Complete Rate | Complete record ratio | Complete records / Total records |
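
The completeness metrics above reduce to simple counting, as in this self-contained sketch (the records and field names are made up for illustration):

```python
def completeness_metrics(records, required_fields):
    """Compute null rate, missing field rate, and record complete rate."""
    total_cells = len(records) * len(required_fields)
    nulls = sum(r.get(f) is None for r in records for f in required_fields)
    missing = sum(f not in r for r in records for f in required_fields)
    complete = sum(all(r.get(f) is not None for f in required_fields) for r in records)
    return {
        "null_rate": nulls / total_cells,
        "missing_field_rate": missing / total_cells,
        "record_complete_rate": complete / len(records),
    }

records = [
    {"name": "Alice", "email": "a@x.com", "phone": "13812345678"},
    {"name": "Bob", "email": None, "phone": "13912345678"},
    {"name": "Carol", "phone": None},  # email field missing entirely
]
metrics = completeness_metrics(records, ["name", "email", "phone"])
```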

Data Accuracy

| Metric | Description | Calculation |
| --- | --- | --- |
| Format Correct Rate | Format compliance | Format correct / Total |
| Value Range Compliance | In valid range | In range / Total |
| Consistency Rate | Data consistency | Consistent records / Total |

Quick Start

1. Create Evaluation Task

Step 1: Enter Data Evaluation Page

Select Data Evaluation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., data_quality_evaluation
  • Evaluation dataset: Select dataset to evaluate

Step 4: Configure Evaluation Dimensions

Select dimensions:

  • ✅ Data completeness
  • ✅ Data accuracy
  • ✅ Data uniqueness
  • ✅ Data timeliness

Step 5: Configure Evaluation Rules

Completeness Rules:

Required fields: name, email, phone
Null threshold: 5% (warn if exceeded)

2. Execute Evaluation

Automatic Evaluation

Auto-executes after creation, or click Execute Now.

Manual Evaluation

  1. Click Manual Evaluation tab
  2. View samples to evaluate
  3. Manually evaluate quality
  4. Submit results

3. View Evaluation Report

Overall Score

Overall Quality Score: 85 (Excellent)

Completeness: 90 ⭐⭐⭐⭐⭐
Accuracy: 82 ⭐⭐⭐⭐
Uniqueness: 95 ⭐⭐⭐⭐⭐
Timeliness: 75 ⭐⭐⭐⭐

Detailed Metrics

Completeness:

  • Null rate: 3.2% ✅
  • Missing field rate: 1.5% ✅
  • Record complete rate: 96.8% ✅

Advanced Features

Custom Evaluation Rules

Regex Validation

Field: phone
Rule: ^1[3-9]\d{9}$
Description: China mobile phone number
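
Applying the regex rule above works like this in plain Python (an illustrative sketch of the check, using the example rule verbatim):

```python
import re

PHONE_RULE = re.compile(r"^1[3-9]\d{9}$")  # rule from the example above

def validate_field(value: str) -> bool:
    """Check a field value against the configured regex rule."""
    return PHONE_RULE.fullmatch(value) is not None

valid = validate_field("13812345678")    # 11 digits, second digit in 3-9
invalid = validate_field("12812345678")  # second digit out of range
```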

Value Range Validation

Field: age
Min value: 0
Max value: 120

Comparison Evaluation

Compare different datasets or versions.

Best Practices

1. Regular Evaluation

Recommended schedule:

  • Daily: Critical data
  • Weekly: General data
  • Monthly: All data

2. Establish Baseline

Create quality baseline for each dataset.

3. Continuous Improvement

Based on evaluation results:

  • Clean problem data
  • Optimize collection process
  • Update validation rules

Common Questions

Q: Evaluation task failed?

A: Troubleshoot:

  1. Check dataset exists
  2. Check rule configuration
  3. View execution logs
  4. Test with small sample size

API Reference

3.7 - Knowledge Base Management

Build and manage RAG knowledge bases with DataMate

Knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.

Features Overview

Knowledge base management module provides:

  • Document upload: Support multiple document formats
  • Text chunking: Intelligent text splitting strategies
  • Vectorization: Automatic text-to-vector conversion
  • Vector search: Semantic similarity-based retrieval
  • Knowledge base Q&A: RAG-based intelligent Q&A

Supported Document Formats

| Format | Description | Recommended For |
| --- | --- | --- |
| TXT | Plain text | General text |
| PDF | PDF documents | Documents, reports |
| Markdown | Markdown files | Technical docs |
| JSON | JSON data | Structured data |
| CSV | CSV tables | Tabular data |
| DOCX | Word documents | Office documents |

Quick Start

1. Create Knowledge Base

Step 1: Enter Knowledge Base Page

In the left navigation, select Knowledge Generation.

Step 2: Create Knowledge Base

Click Create Knowledge Base button in upper right.

Step 3: Configure Basic Information

  • Knowledge base name: e.g., company_docs_kb
  • Knowledge base description: Describe purpose (optional)
  • Knowledge base type: General / Professional domain

Step 4: Configure Vector Parameters

  • Embedding model: Select embedding model

    • OpenAI text-embedding-ada-002
    • BGE-M3
    • Custom model
  • Vector dimension: Auto-set based on model

  • Index type: IVF_FLAT / HNSW / IVF_PQ

Step 5: Configure Chunking Strategy

  • Chunking method:

    • By character count
    • By paragraph
    • By semantic
  • Chunk size: Size of each text chunk (character count)

  • Overlap size: Overlap between adjacent chunks

2. Upload Documents

Step 1: Enter Knowledge Base Details

Click knowledge base name to enter details.

Step 2: Upload Documents

  1. Click Upload Document button
  2. Select local files
  3. Wait for upload completion

System will automatically:

  1. Parse document content
  2. Chunk text
  3. Generate vectors
  4. Build index

3. Vector Search

Step 1: Enter Search Page

In knowledge base details page, click Vector Search tab.

Step 2: Enter Query

Enter query in search box, e.g.:

How to use DataMate for data cleaning?

Step 3: View Search Results

System returns most relevant text chunks with similarity scores:

| Rank | Text Chunk | Similarity | Source Doc | Actions |
| --- | --- | --- | --- | --- |
| 1 | DataMate’s data cleaning module… | 0.92 | user_guide.pdf | View |
| 2 | Configure cleaning task… | 0.87 | tutorial.md | View |
| 3 | Cleaning operator list… | 0.81 | reference.txt | View |
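
Similarity scores like those shown come from comparing the query embedding with each chunk embedding, typically by cosine similarity. A minimal sketch (the vectors here are made up, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {"chunk_a": [0.8, 0.2, 0.1], "chunk_b": [0.1, 0.9, 0.3]}
# Rank chunks by similarity to the query, most similar first
ranked = sorted(chunk_vecs, key=lambda c: cosine_similarity(query_vec, chunk_vecs[c]),
                reverse=True)
```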

4. Knowledge Base Q&A (RAG)

Step 1: Enable RAG

In knowledge base details page, click RAG Q&A tab.

Step 2: Configure RAG Parameters

  • LLM: Select LLM to use
  • Retrieval count: Number of text chunks to retrieve
  • Temperature: Control generation randomness
  • Prompt template: Custom Q&A template
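
These parameters feed a retrieval-augmented prompt like the one sketched below. This is an illustrative sketch only; the actual prompt template is configurable, and the chunk data here is made up.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a prompt from the retrieved text chunks and the user question."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "Cite the source documents you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    {"source": "user_guide.pdf", "text": "DataMate's data cleaning module supports..."},
    {"source": "tutorial.md", "text": "Configure cleaning task..."},
]
prompt = build_rag_prompt("What data cleaning operators does DataMate support?", chunks)
```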

Step 3: Q&A

Enter question in dialog box, e.g.:

User: What data cleaning operators does DataMate support?

Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]

Best Practices

1. Document Preparation

Before uploading documents:

  • Unify format: Convert to unified format (PDF, Markdown)
  • Clean content: Remove irrelevant content (headers, ads)
  • Maintain structure: Keep good document structure
  • Add metadata: Add document metadata (author, date, tags)

2. Chunking Strategy Selection

Choose based on document type:

| Document Type | Recommended Strategy | Chunk Size |
| --- | --- | --- |
| Technical docs | Paragraph chunking | - |
| Long reports | Semantic chunking | - |
| Short text | Character chunking | 500 |
| Code | Character chunking | 300 |

Common Questions

Q: Document stuck in “Processing”?

A: Check:

  1. Document format: Ensure format is supported
  2. Document size: Single document under 100MB
  3. Vector service: Check if vector service is running
  4. View logs: Check detailed error messages

Q: Inaccurate search results?

A: Optimization suggestions:

  1. Adjust chunking: Try different chunking methods
  2. Increase chunk size: Add more context
  3. Use reranking: Enable reranking model
  4. Optimize query: Use clearer query statements
  5. Change embedding model: Try other models

API Reference

For detailed API documentation, see:

3.8 - Operator Market

Manage and use DataMate operators

Operator marketplace provides rich data processing operators and supports custom operator development.

Features Overview

Operator marketplace provides:

  • Built-in Operators: Rich built-in data processing operators
  • Operator Publishing: Publish and share custom operators
  • Operator Installation: Install third-party operators
  • Custom Development: Develop custom operators

Built-in Operators

Data Cleaning Operators

| Operator | Function | Input | Output |
| --- | --- | --- | --- |
| Deduplication | Remove duplicates | Dataset | Deduplicated data |
| Null Handler | Handle nulls | Dataset | Filled data |
| Format Converter | Convert format | Original format | New format |

Text Processing Operators

| Operator | Function |
|----------|----------|
| Text Segmentation | Chinese word segmentation |
| Remove Stopwords | Remove common stopwords |
| Text Cleaning | Clean special characters |

Quick Start

1. Browse Operators

Step 1: Enter Operator Market

Select Operator Market in the left navigation.

Step 2: Browse Operators

View all available operators with ratings and installation counts.

2. Install Operator

Install Built-in Operator

Built-in operators are installed by default.

Install Third-party Operator

  1. In operator details page, click Install
  2. Wait for installation completion

3. Use Operator

After installation, use in:

  • Data Cleaning: Add operator node to cleaning pipeline
  • Pipeline Orchestration: Add operator node to workflow

Advanced Features

Develop Custom Operator

Create Operator

  1. In operator market page, click Create Operator
  2. Fill operator information
  3. Write operator code (Python)
  4. Package and publish

Python Operator Example:

import re

class MyTextCleaner:
    def __init__(self, config):
        # Whether to strip non-word, non-space characters (default: on)
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Only strings are cleaned; other types pass through unchanged
        if isinstance(data, str):
            result = data
            if self.remove_special_chars:
                result = re.sub(r'[^\w\s]', '', result)
            return result
        return data
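Before packaging and publishing, it helps to exercise the operator's process method locally. The snippet below re-declares a compact version of the class so it stands alone; the config-dict contract shown is a simplified assumption about the operator interface:

```python
import re

class MyTextCleaner:
    def __init__(self, config):
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Strings are cleaned when the flag is on; everything else passes through
        if isinstance(data, str) and self.remove_special_chars:
            return re.sub(r'[^\w\s]', '', data)
        return data

cleaner = MyTextCleaner({'remove_special_chars': True})
print(cleaner.process("Hello, World!"))  # Hello World

passthrough = MyTextCleaner({'remove_special_chars': False})
print(passthrough.process("Hello, World!"))  # Hello, World!
```

Testing like this locally is also the quickest way to rule out operator bugs before debugging a failed pipeline run.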

Best Practices

1. Operator Design

Good operator design:

  • Single responsibility: One operator does one thing
  • Configurable: Rich configuration options
  • Error handling: Comprehensive error handling
  • Performance: Consider large-scale data

Common Questions

Q: Operator execution failed?

A: Troubleshoot:

  1. View logs
  2. Check configuration
  3. Check data format
  4. Test locally

3.9 - Pipeline Orchestration

Visual workflow orchestration with DataMate

The pipeline orchestration module provides a drag-and-drop visual interface for designing and managing complex data processing workflows.

Features Overview

Pipeline orchestration provides:

  • Visual Designer: Drag-and-drop workflow design
  • Rich Node Types: Data processing, conditions, loops, etc.
  • Flow Execution: Auto-execute and monitor workflows
  • Template Management: Save and reuse flow templates
  • Version Management: Flow version control

Node Types

Data Nodes

| Node | Function | Config |
|------|----------|--------|
| Input Dataset | Read from dataset | Select dataset |
| Output Dataset | Write to dataset | Select dataset |
| Data Collection | Execute collection task | Select task |
| Data Cleaning | Execute cleaning task | Select task |
| Data Synthesis | Execute synthesis task | Select task |

Logic Nodes

| Node | Function | Config |
|------|----------|--------|
| Condition Branch | Execute different branches | Condition expression |
| Loop | Repeat execution | Loop count/condition |
| Parallel | Execute multiple branches in parallel | Branch count |
| Wait | Wait for specified time | Duration |

Quick Start

1. Create Pipeline

Step 1: Enter Pipeline Orchestration Page

Select Pipeline Orchestration in left navigation.

Step 2: Create Pipeline

Click Create Pipeline.

Step 3: Fill Basic Information

  • Pipeline name: e.g., data_processing_pipeline
  • Description: Pipeline purpose (optional)

Step 4: Design Flow

  1. Drag nodes from left library to canvas
  2. Connect nodes
  3. Configure node parameters
  4. Save flow

Example:

Input Dataset → Data Cleaning → Condition Branch
                                    ├── Satisfied → Data Annotation → Output
                                    └── Not Satisfied → Data Synthesis → Output

2. Execute Pipeline

Step 1: Enter Execution Page

Click pipeline name to enter details.

Step 2: Execute Pipeline

Click Execute Now.

Step 3: Monitor Execution

View execution status, progress, and logs.

Advanced Features

Flow Templates

Save as Template

  1. Design flow
  2. Click Save as Template
  3. Enter template name

Use Template

  1. Create pipeline, click Use Template
  2. Select template
  3. Load to designer

Parameterized Flow

Define parameters in pipeline:

{
  "parameters": [
    {
      "name": "input_dataset",
      "type": "dataset",
      "required": true
    }
  ]
}
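Before triggering a run, a caller would typically validate the supplied values against this parameter spec. Below is an illustrative validator (the spec shape follows the JSON above; the function itself is not a DataMate API):

```python
def validate_parameters(spec: list[dict], values: dict) -> list[str]:
    """Return a list of validation errors for a pipeline run's parameters."""
    errors = []
    for param in spec:
        name = param["name"]
        # Required parameters must be supplied by the caller
        if param.get("required") and name not in values:
            errors.append(f"missing required parameter: {name}")
    # Reject values that the spec does not declare
    known = {p["name"] for p in spec}
    errors.extend(f"unknown parameter: {k}" for k in values if k not in known)
    return errors

spec = [{"name": "input_dataset", "type": "dataset", "required": True}]
print(validate_parameters(spec, {}))                           # missing required parameter
print(validate_parameters(spec, {"input_dataset": "ds-001"}))  # []
```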

Scheduled Execution

Configure scheduled execution:

  • Cron expression: 0 0 2 * * ? (Daily at 2 AM)
  • Execution parameters

Best Practices

1. Flow Design

Recommended principles:

  • Modular: Split complex flows
  • Reusable: Use templates
  • Maintainable: Add comments
  • Testable: Test individually

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Flow execution failed?

A: Troubleshoot:

  1. View execution logs
  2. Check node configuration
  3. Check data format
  4. Test nodes individually

3.10 - Agent Chat

Use DataMate Agent for intelligent conversation

The agent chat module integrates large language models to provide intelligent Q&A and knowledge base retrieval capabilities.

Features Overview

Agent chat module provides:

  • Intelligent Chat: Natural language conversation based on LLMs
  • Knowledge Base Q&A: RAG-integrated knowledge base Q&A
  • Multi-turn Dialogue: Support context-aware multi-turn conversation
  • Dialogue Management: Conversation history and management
  • Quick Commands: Quick commands for common operations

Quick Start

1. Access Agent

Select Agent Chat in left navigation, or visit:

  • Local: http://localhost:30000/chat
  • Production: https://your-domain/chat

2. Start Conversation

Enter question in dialog box, e.g.:

User: What data formats does DataMate support?

Assistant: DataMate supports multiple data formats:

Images: JPG, PNG, GIF, BMP, WebP
Text: TXT, MD, JSON, CSV
Audio: MP3, WAV, FLAC, AAC
Video: MP4, AVI, MOV, MKV

These formats can be uploaded and managed in the data management module.

3. Use Knowledge Base Q&A

Enable Knowledge Base

  1. Click Settings button
  2. In Knowledge Base settings, select knowledge base to use
  3. Save settings

Knowledge Base Q&A

User: How to create a data cleaning task?

Assistant: According to the knowledge base documentation:

1. Enter data processing page
2. Click create task button
3. Configure basic information
4. Configure cleaning pipeline (drag operators to canvas)
5. Configure execution parameters
6. Create and execute task

[Source: user_guide.md, data_cleansing.md]

Advanced Features

Conversation Modes

General Chat

Use LLM for general conversation without knowledge base.

Knowledge Base Q&A

Answer questions based on knowledge base content.

Mixed Mode

Combine general chat and knowledge base Q&A.

Quick Commands

| Command | Function | Example |
|---------|----------|---------|
| /dataset | Query datasets | /dataset list |
| /task | Query tasks | /task status |
| /help | Show help | /help |
| /clear | Clear conversation | /clear |

Conversation History

View History

  1. Click History tab on left
  2. Select historical conversation
  3. View conversation content

Continue Conversation

Click historical conversation to continue.

Export Conversation

Export conversation records:

  • Markdown: Export as Markdown file
  • JSON: Export as JSON
  • PDF: Export as PDF

Best Practices

1. Effective Questioning

Get better answers:

  • Be specific: Clear and specific questions
  • Provide context: Include background information
  • Break down: Split complex questions

2. Knowledge Base Usage

Make the most of knowledge base:

  • Select appropriate knowledge base: Choose based on question
  • View sources: Check answer source documents
  • Verify information: Verify with source documents

Common Questions

Q: Inaccurate Agent answers?

A: Improve:

  1. Optimize question: More specific
  2. Check knowledge base: Ensure relevant content exists
  3. Change model: Try more powerful model
  4. Provide context: More background info

4 - API Reference

DataMate API documentation

DataMate provides complete REST APIs supporting programmatic access to all core features.

API Overview

DataMate API is based on REST architecture design, providing the following services:

  • Data Management API: Dataset and file management
  • Data Cleaning API: Data cleaning task management
  • Data Collection API: Data collection task management
  • Data Annotation API: Data annotation task management
  • Data Synthesis API: Data synthesis task management
  • Data Evaluation API: Data evaluation task management
  • Operator Market API: Operator management
  • RAG Indexer API: Knowledge base and vector retrieval
  • Pipeline Orchestration API: Pipeline orchestration management

Authentication

DataMate supports two authentication methods: JWT and API Key.

JWT Authentication

GET /api/v1/data-management/datasets
Authorization: Bearer <your-jwt-token>

Get JWT Token:

POST /api/v1/auth/login
Content-Type: application/json

{
  "username": "admin",
  "password": "password"
}

Response:

{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expiresIn": 86400
}

API Key Authentication

GET /api/v1/data-management/datasets
X-API-Key: <your-api-key>

Common Response Format

Success Response

{
  "code": 200,
  "message": "success",
  "data": {
    // Response data
  }
}

Error Response

{
  "code": 400,
  "message": "Bad Request",
  "error": "Invalid parameter: datasetId",
  "timestamp": "2024-01-15T10:30:00Z",
  "path": "/api/v1/data-management/datasets"
}

Paged Response

{
  "content": [],
  "page": 0,
  "size": 20,
  "totalElements": 100,
  "totalPages": 5,
  "first": true,
  "last": false
}
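The paged shape above lends itself to a simple iteration pattern: request pages until `last` is true. A sketch with the HTTP call abstracted behind a `fetch_page` callable, so the paging logic stays independent of any particular client library:

```python
def iter_all(fetch_page):
    """Yield every item across pages; fetch_page(page) returns a paged response dict."""
    page = 0
    while True:
        resp = fetch_page(page)
        yield from resp["content"]
        if resp["last"]:  # final page reached
            break
        page += 1

# Demo with a fake two-page response
pages = [
    {"content": [1, 2], "last": False},
    {"content": [3], "last": True},
]
print(list(iter_all(lambda p: pages[p])))  # [1, 2, 3]
```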

API Endpoints

Data Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-management/datasets | GET | Get dataset list |
| /data-management/datasets | POST | Create dataset |
| /data-management/datasets/{id} | GET | Get dataset details |
| /data-management/datasets/{id} | PUT | Update dataset |
| /data-management/datasets/{id} | DELETE | Delete dataset |
| /data-management/datasets/{id}/files | GET | Get file list |
| /data-management/datasets/{id}/files/upload | POST | Upload files |

Data Cleaning

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-cleaning/tasks | GET | Get cleaning task list |
| /data-cleaning/tasks | POST | Create cleaning task |
| /data-cleaning/tasks/{id} | GET | Get task details |
| /data-cleaning/tasks/{id} | PUT | Update task |
| /data-cleaning/tasks/{id} | DELETE | Delete task |
| /data-cleaning/tasks/{id}/execute | POST | Execute task |

Data Collection

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-collection/tasks | GET | Get collection task list |
| /data-collection/tasks | POST | Create collection task |
| /data-collection/tasks/{id} | GET | Get task details |
| /data-collection/tasks/{id}/execute | POST | Execute collection task |

Data Synthesis

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-synthesis/tasks | GET | Get synthesis task list |
| /data-synthesis/tasks | POST | Create synthesis task |
| /data-synthesis/templates | GET | Get instruction template list |
| /data-synthesis/templates | POST | Create instruction template |

Operator Market

| Endpoint | Method | Description |
|----------|--------|-------------|
| /operator-market/operators | GET | Get operator list |
| /operator-market/operators | POST | Publish operator |
| /operator-market/operators/{id} | GET | Get operator details |
| /operator-market/operators/{id}/install | POST | Install operator |

RAG Indexer

| Endpoint | Method | Description |
|----------|--------|-------------|
| /rag/knowledge-bases | GET | Get knowledge base list |
| /rag/knowledge-bases | POST | Create knowledge base |
| /rag/knowledge-bases/{id}/documents | POST | Upload documents |
| /rag/knowledge-bases/{id}/search | POST | Vector search |

Error Codes

| Code | Description |
|------|-------------|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 409 | Conflict |
| 500 | Internal Server Error |

Rate Limiting

API call rate limits:

  • Default limit: 1000 requests/hour
  • Burst limit: 100 requests/minute

Exceeding the limit returns 429 Too Many Requests.

Response headers contain rate limiting information:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1642252800
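A client that receives 429 can read these headers to decide how long to back off. A minimal sketch using the header names above (the current time is passed in explicitly so the function stays testable; this is an illustrative helper, not SDK code):

```python
def seconds_until_reset(headers: dict, now_epoch: int) -> int:
    """Seconds to wait before retrying, based on X-RateLimit-Reset (epoch seconds)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0  # budget left, no need to wait
    reset = int(headers["X-RateLimit-Reset"])
    return max(0, reset - now_epoch)

hdrs = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1642252800"}
print(seconds_until_reset(hdrs, now_epoch=1642252790))  # 10
```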

Version Management

API versions are specified through URL paths:

  • Current version: /api/v1/
  • Future versions: /api/v2/

4.1 - Data Management API

Dataset and file management API

Data management API provides capabilities for dataset and file creation, query, update, and deletion.

Basic Information

  • Base URL: http://localhost:8092/api/v1/data-management
  • Authentication: JWT / API Key
  • Content-Type: application/json

Dataset Management

Get Dataset List

GET /data-management/datasets?page=0&size=20&type=text

Query Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| page | integer | No | Page number, starts from 0 |
| size | integer | No | Page size, default 20 |
| type | string | No | Dataset type filter |
| tags | string | No | Tag filter, comma-separated |
| keyword | string | No | Keyword search |
| status | string | No | Status filter |

Response Example:

{
  "content": [
    {
      "id": "dataset-001",
      "name": "text_dataset",
      "description": "Text dataset",
      "type": {
        "code": "TEXT",
        "name": "Text"
      },
      "status": "ACTIVE",
      "fileCount": 1000,
      "totalSize": 1073741824,
      "createdAt": "2024-01-15T10:00:00Z"
    }
  ],
  "page": 0,
  "size": 20,
  "totalElements": 1
}

Create Dataset

POST /data-management/datasets
Content-Type: application/json

{
  "name": "my_dataset",
  "description": "My dataset",
  "type": "TEXT",
  "tags": ["training", "nlp"]
}

Get Dataset Details

GET /data-management/datasets/{datasetId}

Update Dataset

PUT /data-management/datasets/{datasetId}
Content-Type: application/json

{
  "name": "updated_dataset",
  "description": "Updated description"
}

Delete Dataset

DELETE /data-management/datasets/{datasetId}

File Management

Get File List

GET /data-management/datasets/{datasetId}/files?page=0&size=20

Upload File

POST /data-management/datasets/{datasetId}/files/upload/chunk
Content-Type: multipart/form-data

Download File

GET /data-management/datasets/{datasetId}/files/{fileId}/download

Delete File

DELETE /data-management/datasets/{datasetId}/files/{fileId}

Error Response

{
  "code": 400,
  "message": "Bad Request",
  "error": "Invalid parameter: datasetId",
  "timestamp": "2024-01-15T10:30:00Z",
  "path": "/api/v1/data-management/datasets"
}

SDK Usage

Python

from datamate import DataMateClient

client = DataMateClient(
    base_url="http://localhost:8080",
    api_key="your-api-key"
)

# Get datasets
datasets = client.data_management.get_datasets()

# Create dataset
dataset = client.data_management.create_dataset(
    name="my_dataset",
    type="TEXT"
)

cURL

# Get datasets
curl -X GET "http://localhost:8092/api/v1/data-management/datasets" \
  -H "Authorization: Bearer your-jwt-token"

# Create dataset
curl -X POST "http://localhost:8092/api/v1/data-management/datasets" \
  -H "Authorization: Bearer your-jwt-token" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_dataset",
    "type": "TEXT"
  }'

5 - Developer Guide

DataMate architecture and development guide

Developer guide introduces DataMate’s technical architecture, development environment, and contribution process.

DataMate is an enterprise-level data processing platform using microservices architecture, supporting large-scale data processing and custom extensions.

Architecture Documentation

Development Guide

Tech Stack

Frontend

| Technology | Version | Description |
|------------|---------|-------------|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI component library |
| Redux Toolkit | 2.x | State management |
| Vite | 5.x | Build tool |

Backend (Java)

| Technology | Version | Description |
|------------|---------|-------------|
| Java | 21 | Runtime environment |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.x | ORM framework |

Backend (Python)

| Technology | Version | Description |
|------------|---------|-------------|
| Python | 3.11+ | Runtime environment |
| FastAPI | 0.100+ | Web framework |
| LangChain | 0.1+ | LLM framework |
| Ray | 2.x | Distributed computing |

Project Structure

DataMate/
├── backend/                 # Java backend
│   ├── services/           # Microservice modules
│   ├── openapi/            # OpenAPI specs
│   └── scripts/            # Build scripts
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/    # Common components
│   │   ├── pages/         # Page components
│   │   ├── services/      # API services
│   │   └── store/         # Redux store
│   └── package.json
├── runtime/                # Python runtime
│   └── datamate/          # DataMate runtime
└── deployment/             # Deployment config
    ├── docker/            # Docker config
    └── helm/              # Helm Charts

Quick Start

1. Clone Code

git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

2. Start Services

# Start basic services
make install

# Access frontend
open http://localhost:30000

3. Development Mode

# Backend development
cd backend/services/main-application
mvn spring-boot:run

# Frontend development
cd frontend
pnpm dev

# Python service development
cd runtime/datamate
python operator_runtime.py --port 8081

Core Concepts

Microservices Architecture

DataMate uses microservices architecture, each service handles specific business functions:

  • API Gateway: Unified entry, routing, authentication
  • Main Application: Core business logic
  • Data Management Service: Dataset management
  • Data Cleaning Service: Data cleaning
  • Data Synthesis Service: Data synthesis
  • Runtime Service: Operator execution

Operator System

Operators are basic units of data processing:

  • Built-in operators: Common operators provided by platform
  • Custom operators: User-developed custom operators
  • Operator execution: Executed by Runtime Service

Pipeline Orchestration

Pipelines are implemented through visual orchestration:

  • Nodes: Basic units of data processing
  • Connections: Data flow between nodes
  • Execution: Automatic execution according to workflow

Extension Development

Develop Custom Operators

Operator development guide:

  1. Operator Market - Operator usage guide
  2. Python operator development examples
  3. Operator testing and debugging

Integrate External Systems

  • API integration: Integration via REST API
  • Webhook: Event notifications
  • Plugin system: (Coming soon)

Testing

Unit Tests

# Backend tests
cd backend
mvn test

# Frontend tests
cd frontend
pnpm test

# Python tests
cd runtime
pytest

Integration Tests

# Start test environment
make test-env-up

# Run integration tests
make integration-test

# Clean test environment
make test-env-down

Performance Optimization

Backend Optimization

  • Database connection pool configuration
  • Query optimization
  • Caching strategies
  • Asynchronous processing

Frontend Optimization

  • Code splitting
  • Lazy loading
  • Caching strategies

Security

Authentication and Authorization

  • JWT authentication
  • RBAC permission control
  • API Key authentication

Data Security

  • Transport encryption (HTTPS/TLS)
  • Storage encryption
  • Sensitive data masking

5.1 - Backend Architecture

DataMate Java backend architecture design

DataMate backend adopts microservices architecture built on Spring Boot 3.x and Spring Cloud.

Architecture Overview

The DataMate backend uses a microservices architecture, split into multiple independent services:

┌─────────────────────────────────────────────┐
│              API Gateway                    │
│         (Spring Cloud Gateway)              │
│              Port: 8080                     │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┴───────┬───────────────┐
       ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Main       │ │  Data        │ │  Data        │
│ Application  │ │  Management  │ │  Collection  │
└──────────────┘ └──────────────┘ └──────────────┘
       │               │               │
       └───────────────┴───────────────┘
                       │
                       ▼
              ┌────────────────┐
              │   PostgreSQL   │
              │   Port: 5432   │
              └────────────────┘

Tech Stack

Core Frameworks

| Technology | Version | Purpose |
|------------|---------|---------|
| Java | 21 | Programming language |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.5.x | ORM framework |

Support Components

| Technology | Version | Purpose |
|------------|---------|---------|
| Redis | 5.x | Cache and message queue |
| MinIO | 8.x | Object storage |
| Milvus SDK | 2.3.x | Vector database |

Microservices List

API Gateway

Port: 8080

Functions:

  • Unified entry point
  • Route forwarding
  • Authentication and authorization
  • Rate limiting and circuit breaking

Tech: Spring Cloud Gateway, JWT authentication

Main Application

Functions:

  • User management
  • Permission management
  • System configuration
  • Task scheduling

Data Management Service

Port: 8092

Functions:

  • Dataset management
  • File management
  • Tag management
  • Statistics

API Endpoints:

  • /data-management/datasets - Dataset management
  • /data-management/datasets/{id}/files - File management

Runtime Service

Port: 8081

Functions:

  • Operator execution
  • Ray integration
  • Task scheduling

Tech: Python + Ray, FastAPI

Database Design

Main Tables

users (User Table)

| Field | Type | Description |
|-------|------|-------------|
| id | BIGINT | Primary key |
| username | VARCHAR(50) | Username |
| password | VARCHAR(255) | Password (encrypted) |
| email | VARCHAR(100) | Email |
| role | VARCHAR(20) | Role |
| created_at | TIMESTAMP | Creation time |

datasets (Dataset Table)

| Field | Type | Description |
|-------|------|-------------|
| id | VARCHAR(50) | Primary key |
| name | VARCHAR(100) | Name |
| description | TEXT | Description |
| type | VARCHAR(20) | Type |
| status | VARCHAR(20) | Status |
| created_by | VARCHAR(50) | Creator |

Service Communication

Synchronous Communication

Services communicate via HTTP/REST:

// Using Feign Client
@FeignClient(name = "data-management-service")
public interface DataManagementClient {
    @GetMapping("/data-management/datasets/{id}")
    DatasetResponse getDataset(@PathVariable String id);
}

Asynchronous Communication

Using Redis for async messaging:

// Send message
redisTemplate.convertAndSend("task.created", taskMessage);

// Receive message
@RedisListener(topic = "task.created")
public void handleTaskCreated(TaskMessage message) {
    // Handle task creation event
}

Authentication & Authorization

JWT Authentication

@Configuration
public class JwtConfig {
    @Value("${datamate.jwt.secret}")
    private String secret;

    @Value("${datamate.jwt.expiration}")
    private Long expiration;
}

RBAC

@PreAuthorize("hasRole('ADMIN')")
public void adminOperation() {
    // Admin operations
}

Performance Optimization

Database Connection Pool

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 30000

Caching Strategy

@Cacheable(value = "datasets", key = "#id")
public Dataset getDataset(String id) {
    return datasetRepository.findById(id);
}

5.2 - Frontend Architecture

DataMate React frontend architecture design

DataMate frontend is built on React 18 and TypeScript with modern frontend architecture.

Architecture Overview

DataMate frontend adopts SPA architecture:

┌─────────────────────────────────────────────┐
│              Browser                        │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│              React App                      │
│  ┌──────────────────────────────────────┐  │
│  │         Components                   │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         State Management             │  │
│  │         (Redux Toolkit)              │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         Services (API)               │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         Routing                      │  │
│  └──────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

Tech Stack

Core Frameworks

| Technology | Version | Purpose |
|------------|---------|---------|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI components |
| Tailwind CSS | 3.x | Styling |

State Management

| Technology | Version | Purpose |
|------------|---------|---------|
| Redux Toolkit | 2.x | Global state |
| React Query | 5.x | Server state |

Project Structure

frontend/
├── src/
│   ├── components/     # Common components
│   ├── pages/          # Page components
│   ├── services/       # API services
│   ├── store/          # Redux store
│   ├── hooks/          # Custom hooks
│   ├── routes/         # Routes config
│   └── main.tsx        # Entry point

Routing Design

const router = createBrowserRouter([
  { path: "/", Component: Home },
  { path: "/chat", Component: AgentPage },
  {
    path: "/data",
    Component: MainLayout,
    children: [
      {
        path: "management",
        Component: DatasetManagement
      }
    ]
  }
]);

State Management

Redux Toolkit Configuration

export const store = configureStore({
  reducer: {
    dataManagement: dataManagementSlice,
    user: userSlice,
  },
});

Slice Example

export const fetchDatasets = createAsyncThunk(
  'dataManagement/fetchDatasets',
  async (params: GetDatasetsParams) => {
    const response = await getDatasets(params);
    return response.data;
  }
);

Component Design

Page Component

export const DataManagement: React.FC = () => {
  const dispatch = useAppDispatch();
  const { datasets, loading } = useAppSelector(
    (state) => state.dataManagement
  );

  useEffect(() => {
    dispatch(fetchDatasets({ page: 0, size: 20 }));
  }, [dispatch]);

  return (
    <div className="p-6">
      <h1>Data Management</h1>
      <DataTable data={datasets} loading={loading} />
    </div>
  );
};

API Services

Axios Configuration

const request = axios.create({
  baseURL: import.meta.env.VITE_API_BASE_URL,
  timeout: 30000,
});

// Request interceptor
request.interceptors.request.use((config) => {
  const token = localStorage.getItem('token');
  if (token) {
    config.headers.Authorization = `Bearer ${token}`;
  }
  return config;
});

Performance Optimization

Code Splitting

const DataManagement = lazy(() =>
  import('@/pages/DataManagement/Home/DataManagement')
);

React.memo

export const DataCard = React.memo<DataCardProps>(({ data }) => {
  return <div>{data.name}</div>;
});

6 - Appendix

Configuration, troubleshooting, and other reference information

Appendix contains configuration parameters, troubleshooting, and other reference information.

Appendix Content

Configuration

Detailed system configuration documentation:

  • Environment Variables: All configurable environment variables
  • application.yml: Spring Boot configuration file
  • Docker Compose: Container configuration
  • Kubernetes: K8s configuration

Troubleshooting

Common issue troubleshooting steps and solutions:

  • Service startup issues: Container startup failures
  • Database connection issues: Database connection failures
  • Frontend issues: Page loading, API requests
  • Task execution issues: Tasks stuck, execution failures
  • Performance issues: Slow response, memory overflow

Other References

Technical Support

If you encounter issues:

  1. Check Troubleshooting documentation
  2. Search GitHub Issues
  3. Submit a new issue with detailed information

Contributing

Contributions to DataMate are welcome:

  • Report bugs
  • Propose new features
  • Submit code contributions
  • Improve documentation

See Contribution Guide for details.

6.1 - Configuration

DataMate system configuration parameters

This document details various configuration parameters of the DataMate system.

Environment Variables

Common Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| DB_PASSWORD | password | Database password |
| DATAMATE_JWT_ENABLE | false | Enable JWT authentication |
| REGISTRY | ghcr.io/modelengine-group/ | Image registry |
| VERSION | latest | Image version tag |

Database Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| DB_HOST | datamate-database | Database host |
| DB_PORT | 5432 | Database port |
| DB_NAME | datamate | Database name |
| DB_USER | postgres | Database username |
| DB_PASSWORD | password | Database password |

Redis Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| REDIS_HOST | datamate-redis | Redis host |
| REDIS_PORT | 6379 | Redis port |
| REDIS_PASSWORD | - | Redis password (optional) |
| REDIS_DB | 0 | Redis database number |

Milvus Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| MILVUS_HOST | milvus | Milvus host |
| MILVUS_PORT | 19530 | Milvus port |
| MILVUS_INDEX_TYPE | IVF_FLAT | Vector index type |
| MILVUS_EMBEDDING_DIM | 768 | Vector dimension |

MinIO Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| MINIO_ENDPOINT | minio:9000 | MinIO endpoint |
| MINIO_ACCESS_KEY | minioadmin | Access key |
| MINIO_SECRET_KEY | minioadmin | Secret key |
| MINIO_BUCKET | datamate | Bucket name |

LLM Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | - | OpenAI API key |
| OPENAI_BASE_URL | https://api.openai.com/v1 | API base URL |
| OPENAI_MODEL | gpt-4 | Model to use |

JWT Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| JWT_SECRET | default-insecure-key | JWT secret (CHANGE IN PRODUCTION) |
| JWT_EXPIRATION | 86400 | Token expiration (seconds) |

Logging Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| LOG_LEVEL | INFO | Log level |
| LOG_PATH | /var/log/datamate | Log path |

application.yml Configuration

Main Config

datamate:
  jwt:
    enable: ${DATAMATE_JWT_ENABLE:false}
    secret: ${JWT_SECRET:default-insecure-key}
    expiration: ${JWT_EXPIRATION:86400}

  storage:
    type: minio
    endpoint: ${MINIO_ENDPOINT:minio:9000}
    access-key: ${MINIO_ACCESS_KEY:minioadmin}
    secret-key: ${MINIO_SECRET_KEY:minioadmin}

Spring Boot Config

spring:
  datasource:
    url: jdbc:postgresql://${DB_HOST:datamate-database}:${DB_PORT:5432}/${DB_NAME:datamate}
    username: ${DB_USER:postgres}
    password: ${DB_PASSWORD:password}

  jpa:
    hibernate:
      ddl-auto: validate
    show-sql: false

server:
  port: 8092

Docker Compose Configuration

Environment Variables

services:
  datamate-backend:
    environment:
      - DB_PASSWORD=${DB_PASSWORD:-password}
      - LOG_LEVEL=${LOG_LEVEL:-INFO}

Resource Limits

services:
  datamate-backend:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G

Kubernetes Configuration

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: datamate-config
data:
  LOG_LEVEL: "INFO"

Secret

apiVersion: v1
kind: Secret
metadata:
  name: datamate-secret
type: Opaque
data:
  DB_PASSWORD: cGFzc3dvcmQ=  # base64 encoded
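Values under data in a Kubernetes Secret must be base64-encoded; the cGFzc3dvcmQ= above decodes to password. The encoding can be produced, for example, with Python's standard library:

```python
import base64

# Encode a secret value for use in a Kubernetes Secret's `data` field
encoded = base64.b64encode(b"password").decode("ascii")
print(encoded)  # cGFzc3dvcmQ=

# Round-trip check: decoding recovers the original bytes
assert base64.b64decode(encoded) == b"password"
```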

Performance Tuning

Database Connection Pool

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 30000

JVM Parameters

JAVA_OPTS="-Xms2g -Xmx4g -XX:+UseG1GC"

6.2 - Troubleshooting

Common issues and solutions for DataMate

This document provides troubleshooting steps and solutions for common DataMate issues.

Service Startup Issues

Service Won’t Start

Symptoms

Service fails to start or exits immediately after running make install.

Troubleshooting Steps

  1. Check Port Conflicts
# Check port usage
lsof -i :8080  # API Gateway
lsof -i :30000 # Frontend

If port is occupied:

# Kill process
kill -9 <PID>
  2. View Container Logs
# View all containers
docker ps -a

# View specific container logs
docker logs datamate-backend
  3. Check Docker Resources
# View Docker system info
docker system df

# Clean unused resources
docker system prune -a

Common Causes and Solutions

| Cause | Solution |
|-------|----------|
| Port occupied | Kill process or modify port mapping |
| Insufficient memory | Increase Docker memory limit |
| Image not pulled | Run docker pull |
| Network issues | Check firewall and network config |

Container Exits Immediately

Troubleshooting

# View exit code
docker ps -a

# View detailed logs
docker logs <container-name> --tail 100

Database Connection Issues

Cannot Connect to Database

Troubleshooting Steps

  1. Check Database Container
docker ps | grep datamate-database
docker logs datamate-database
  2. Test Database Connection
# Enter database container
docker exec -it datamate-database psql -U postgres -d datamate
  3. Check Database Config
# Check environment variables
docker exec datamate-backend env | grep DB_

Frontend Issues

Frontend Not Accessible

Symptoms

Browser cannot access http://localhost:30000

Troubleshooting

  1. Check Frontend Container
docker ps | grep datamate-frontend
docker logs datamate-frontend
  2. Check Port Mapping
docker port datamate-frontend

API Request Failed

Troubleshooting

  1. Check Browser Console

Open browser DevTools → Network tab

  2. Check API Gateway
docker ps | grep datamate-gateway
docker logs datamate-gateway
  3. Test API
curl http://localhost:8080/actuator/health

Task Execution Issues

Task Stuck

Troubleshooting

  1. View Task Logs
docker logs datamate-backend --tail 100 | grep <task-id>
docker logs datamate-runtime --tail 100
  2. Check System Resources
docker stats

Performance Issues

Slow System Response

Troubleshooting

  1. Check System Resources
docker stats
  2. Check Database Performance
-- View active queries
SELECT * FROM pg_stat_activity WHERE state = 'active';

Memory Overflow

Troubleshooting

# Check exit reason
docker inspect <container> | grep OOMKilled
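
docker inspect prints a JSON array whose State object carries the OOMKilled flag and exit code; exit code 137 usually indicates the kernel OOM killer intervened. A short Python sketch (a hypothetical helper) that summarizes the exit reason from the inspect output:

```python
import json

def exit_reason(inspect_output: str) -> str:
    """Summarize why a container exited from `docker inspect` output."""
    # docker inspect emits a JSON array with one object per container.
    state = json.loads(inspect_output)[0]["State"]
    if state.get("OOMKilled"):
        return "OOM-killed: raise the container memory limit"
    return f"exited with code {state.get('ExitCode')}"

# Inspect output trimmed to the relevant fields.
sample = '[{"State": {"OOMKilled": true, "ExitCode": 137}}]'
print(exit_reason(sample))  # OOM-killed: raise the container memory limit
```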

Log Viewing

View Application Logs

# Backend logs
docker logs datamate-backend --tail 100 -f

# Frontend logs
docker logs datamate-frontend --tail 100 -f

Log File Locations

Service  | Log Path
---------|----------------------------------
Backend  | /var/log/datamate/backend/app.log
Frontend | /var/log/datamate/frontend/
Database | /var/log/datamate/database/
Runtime  | /var/log/datamate/runtime/

Getting Help

If issues persist:

  1. Collect Information

    • Error messages
    • Log files
    • System environment
    • Reproduction steps
  2. Search Existing Issues

Visit GitHub Issues

  3. Submit New Issue

Include:

  • DataMate version
  • OS version
  • Docker version
  • Detailed error messages
  • Reproduction steps

7 - Contribution Guide

Welcome to the DataMate project. We welcome all forms of contributions, including documentation, code, testing, and translation.

DataMate is an enterprise-level open source data processing project dedicated to providing efficient data solutions for model training, AI applications, and data flywheel scenarios. We welcome all developers, document creators, and test engineers to participate through code commits, documentation optimization, issue feedback, and community support.

If this is your first time contributing to an open source project, we recommend reading the Open Source Contribution Newbie Guide first and then working through this guide. All contributions must follow the DataMate Code of Conduct.

Contribution Scope and Methods

Contributions to the DataMate open source project cover the following core scenarios; choose how to participate based on your expertise:

Contribution Type          | Specific Content                                                                                       | Suitable For
---------------------------|--------------------------------------------------------------------------------------------------------|-----------------------------------------------
Code Contribution          | Core feature development, bug fixes, performance optimization, new feature proposals                   | Backend/frontend developers, data engineers
Documentation Contribution | User manual updates, API documentation improvements, tutorial writing, contribution guide optimization | Technical document creators, experienced users
Testing Contribution       | Writing unit/integration tests, reporting test issues, participating in compatibility testing          | Test engineers, QA personnel
Community Contribution     | Answering GitHub Issues, participating in community discussions, sharing use cases                     | All users, tech enthusiasts
Design Contribution        | UI/UX optimization, logo/icon design, documentation visual upgrades                                    | UI/UX designers, visual designers

Thank you for choosing to participate in the DataMate open source project! Whether it’s code, documentation, or community support, every contribution helps the project grow and advances enterprise-level data processing technology. If you encounter any issues during the contribution process, feel free to seek help through community channels.

Getting Started

Development Environment

Before contributing, please set up your development environment:

  1. Clone Repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate
  2. Install Dependencies
# Backend dependencies
cd backend
mvn clean install

# Frontend dependencies
cd ../frontend
pnpm install

# Python dependencies
cd ../runtime
pip install -r requirements.txt
  3. Start Services
# Start basic services
make install dev=true

For detailed setup instructions, see the Developer Guide.

Code Contribution

Code Standards

Java Code Standards

  • Naming Conventions:

    • Class names: PascalCase, e.g. UserService
    • Method names: camelCase, e.g. getUserById
    • Constants: UPPER_CASE, e.g. MAX_SIZE
    • Variables: camelCase, e.g. userName
  • Documentation: Add Javadoc comments for public APIs

/**
 * User service
 *
 * @author Your Name
 * @since 1.0.0
 */
public class UserService {
    /**
     * Get user by ID
     *
     * @param userId user ID
     * @return user information
     */
    public User getUserById(Long userId) {
        // ...
    }
}

TypeScript Code Standards

  • Naming Conventions:
    • Components: PascalCase, e.g. UserProfile
    • Types/Interfaces: PascalCase, e.g. UserData
    • Functions: camelCase, e.g. getUserData
    • Constants: UPPER_CASE, e.g. API_BASE_URL

Python Code Standards

Follow PEP 8:

def get_user(user_id: int) -> dict:
    """
    Get user information

    Args:
        user_id: User ID

    Returns:
        User information dictionary
    """
    # ...

Submitting Code

1. Create Branch

git checkout -b feature/your-feature-name

Branch naming convention:

  • feature/ - New features
  • fix/ - Bug fixes
  • docs/ - Documentation updates
  • refactor/ - Refactoring

2. Make Changes

Follow the code standards mentioned above.

3. Write Tests

# Backend tests
mvn test

# Frontend tests
pnpm test

# Python tests
pytest

4. Commit Changes

git add .
git commit -m "feat: add new feature description"

Commit message format:

  • feat: - New feature
  • fix: - Bug fix
  • docs: - Documentation changes
  • style: - Code style changes
  • refactor: - Refactoring
  • test: - Adding tests
  • chore: - Other changes
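
The prefix convention above is easy to enforce mechanically, for example from a commit-msg git hook. A minimal Python sketch (the optional parenthesized scope is an assumption; adjust to the project's actual policy):

```python
import re

# Prefixes from the commit message format above; an optional
# parenthesized scope, e.g. "feat(pipeline): ...", is an assumption.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)(\([\w\-]+\))?: .+"
)

def is_valid_commit_message(message: str) -> bool:
    """Check the first line of a commit message against the convention."""
    first_line = message.splitlines()[0] if message.strip() else ""
    return COMMIT_RE.match(first_line) is not None

print(is_valid_commit_message("feat: add new feature description"))  # True
print(is_valid_commit_message("updated some files"))                 # False
```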

5. Push and Create PR

git push origin feature/your-feature-name

Then create a Pull Request on GitHub.

Documentation Contribution

Documentation Structure

Documentation is located in the /docs directory:

docs/
├── getting-started/     # Quick start
├── user-guide/          # User guide
├── api-reference/       # API reference
├── developer-guide/     # Developer guide
└── appendix/            # Appendix

Writing Documentation

1. Choose Language

The documentation is bilingual (Chinese and English). When updating a document, please update both language versions.

2. Follow Format

Use Markdown format with Hugo front matter:

---
title: Page Title
description: Page description
weight: 1
---

Content here...

3. Add Examples

Include code examples, commands, and use cases to help users understand.

4. Cross-Reference

Add links to related documentation:

See [Data Management](/docs/user-guide/data-management/) for details.

Testing Contribution

Test Coverage

We aim for comprehensive test coverage:

  • Unit Tests: Test individual functions and classes
  • Integration Tests: Test service interactions
  • E2E Tests: Test complete workflows

Writing Tests

Backend Tests (JUnit)

@Test
public void testGetDataset() {
    // Arrange
    String datasetId = "test-dataset";

    // Act
    Dataset result = datasetService.getDataset(datasetId);

    // Assert
    assertNotNull(result);
    assertEquals("test-dataset", result.getId());
}

Frontend Tests (Jest + React Testing Library)

test('renders data management page', () => {
  render(<DataManagement />);
  expect(screen.getByText('Data Management')).toBeInTheDocument();
});

Reporting Issues

When you find a bug:

  1. Search existing GitHub Issues
  2. If not found, create a new issue with:
    • Clear title
    • Detailed description
    • Steps to reproduce
    • Expected vs actual behavior
    • Environment info

Design Contribution

UI/UX Guidelines

We use Ant Design as the UI component library. When contributing design changes:

  1. Follow Ant Design principles
  2. Ensure consistency with existing design
  3. Consider accessibility
  4. Test on different screen sizes

Design Assets

Design assets should be placed in:

  • Frontend assets: frontend/src/assets/
  • Documentation images: content/en/docs/images/

Community Guidelines

Code of Conduct

  • Be respectful and inclusive
  • Welcome newcomers and help them learn
  • Focus on constructive feedback
  • Collaborate openly

Communication Channels

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: General discussions
  • Pull Requests: Code and documentation contributions

Getting Help

If you need help:

  1. Check existing documentation
  2. Search GitHub Issues
  3. Start a GitHub Discussion

Recognition

Contributors will be recognized in:

  • Contributors List: In the documentation
  • Release Notes: For significant contributions
  • Community Highlights: For outstanding contributions

License

By contributing to DataMate, you agree that your contributions will be licensed under the MIT License.


Thank you for contributing to DataMate! Your contributions help make DataMate better for everyone. 🎉