Documentation
1 - Overview
DataMate - Enterprise-level Large Model Data Processing Platform
DataMate is an enterprise-level data processing platform designed for model fine-tuning and RAG retrieval. It provides comprehensive data processing capabilities including data collection, management, cleaning, annotation, synthesis, evaluation, and knowledge base management.
Product Positioning
DataMate is dedicated to solving data pain points in large model implementation, providing a one-stop data governance solution:
- Full Lifecycle Coverage: From data collection to evaluation, covering the entire data processing lifecycle
- Enterprise-grade Capabilities: Supports million-scale concurrent data processing with private deployment options
- Flexible Extension: Rich built-in data processing operators with support for custom operator development
- Visual Orchestration: Drag-and-drop pipeline design without coding for complex data processing workflows
Core Features
Data Collection
- Heterogeneous data source collection capabilities based on DataX
- Supports relational databases, NoSQL, file systems, and other data sources
- Flexible task configuration and monitoring
Data Management
- Unified dataset management supporting image, text, audio, video, and multimodal data types
- Complete data operations: upload, download, preview
- Tag and metadata management for easy data organization and retrieval
Data Cleaning
- Rich built-in data cleaning operators
- Visual cleaning template configuration
- Supports both batch and stream processing modes
Data Annotation
- Integrated Label Studio for professional annotation capabilities
- Supports image classification, object detection, text classification, and other annotation types
- Annotation review and quality control mechanisms
Data Synthesis
- Data augmentation and synthesis capabilities based on large models
- Instruction template management and customization
- Proportional synthesis tasks for diverse data needs
Data Evaluation
- Multi-dimensional data quality evaluation metrics
- Supports both automatic and manual evaluation
- Detailed evaluation reports
Knowledge Base Management (RAG)
- Supports multiple document formats for knowledge base construction
- Automated text chunking and vectorization
- Integrated vector retrieval for RAG applications
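Automated text chunking, mentioned above, is typically done with fixed-size windows and some overlap so context isn't lost at chunk boundaries. As a rough illustration (the chunk size and overlap below are assumed values, not DataMate's actual settings):

```python
# A minimal sketch of text chunking with overlap, a common approach for RAG
# indexing. Chunk size and overlap are assumed values, not DataMate settings.
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 500, size=200, overlap=50)  # three 200-char chunks
```

Each chunk would then be embedded and stored in the vector database for retrieval.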
Operator Marketplace
- Rich built-in data processing operators
- Support for operator publishing and sharing
- Custom operator development capabilities
Pipeline Orchestration
- Visual drag-and-drop workflow design
- Multiple node types and configurations
- Pipeline execution monitoring and debugging
Agent Chat
- Integrated large language model chat capabilities
- Knowledge base Q&A
- Conversation history management
Technical Architecture
Overall Architecture
DataMate adopts a microservices architecture with core components including:
- Frontend: React 18 + TypeScript + Ant Design + Tailwind CSS
- Backend: Java 21 + Spring Boot 3.5.6 + Spring Cloud + MyBatis Plus
- Runtime: Python FastAPI + LangChain + Ray
- Database: PostgreSQL + Redis + Milvus + MinIO
Microservice Components
- API Gateway (8080): Unified entry point for routing and authentication
- Main Application: Core business logic
- Data Management Service (8092): Dataset management
- Data Collection Service: Data collection task management
- Data Cleaning Service: Data cleaning task management
- Data Annotation Service: Data annotation task management
- Data Synthesis Service: Data synthesis task management
- Data Evaluation Service: Data evaluation task management
- Operator Market Service: Operator marketplace management
- RAG Indexer Service: Knowledge base indexing
- Runtime Service (8081): Operator execution engine
- Backend Python Service (18000): Python backend service
Use Cases
Model Fine-tuning
- Training data cleaning and quality improvement
- Data augmentation and synthesis
- Training data evaluation
RAG Applications
- Enterprise knowledge base construction
- Document vectorization and indexing
- Semantic retrieval and Q&A
Data Governance
- Unified management of multi-source data
- Data lineage tracking
- Data quality monitoring
Deployment Options
DataMate supports multiple deployment methods:
- Docker Compose: Quick experience and development testing
- Kubernetes/Helm: Production environment deployment
- Offline Deployment: Supports air-gapped environment deployment
Comparison with Similar Products
| Feature | DataMate | Label Studio | DocArray |
|---|---|---|---|
| Data Management | ✅ Complete dataset management | ❌ Annotation data only | ❌ Document data only |
| Data Collection | ✅ DataX support | ❌ Not supported | ❌ Not supported |
| Data Cleaning | ✅ Rich built-in operators | ❌ Not supported | ❌ Not supported |
| Data Annotation | ✅ Label Studio integration | ✅ Professional tool | ❌ Not supported |
| Data Synthesis | ✅ LLM-based | ❌ Not supported | ❌ Not supported |
| Data Evaluation | ✅ Multi-dimensional | ⚠️ Basic | ❌ Not supported |
| Knowledge Base | ✅ RAG integration | ❌ Not supported | ⚠️ Requires development |
| Pipeline Orchestration | ✅ Visual orchestration | ❌ Not supported | ❌ Not supported |
| Operator Extension | ✅ Custom operators | ⚠️ Limited | ⚠️ Requires coding |
| License | ✅ MIT | ✅ Apache 2.0 | ✅ MIT |
Next Steps
2 - Quick Start
Deploy DataMate in 5 minutes
This guide will help you deploy DataMate platform in 5 minutes.
DataMate supports two main deployment methods:
- Docker Compose: Suitable for quick experience and development testing
- Kubernetes/Helm: Suitable for production deployment
Prerequisites
Docker Compose Deployment
- Docker 20.10+
- Docker Compose 2.0+
- At least 4GB RAM
- At least 10GB disk space
Kubernetes Deployment
- Kubernetes 1.20+
- Helm 3.0+
- kubectl configured with cluster connection
- At least 8GB RAM
- At least 20GB disk space
5-Minute Quick Deployment (Docker Compose)
1. Clone the Code
```shell
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate
```
2. Start Services
Use the provided Makefile for one-click deployment:

```shell
make install
```

After running the command, the system will prompt you to select a deployment method:
```
Choose a deployment method:
1. Docker/Docker-Compose
2. Kubernetes/Helm
Enter choice:
```
Enter 1 to select Docker Compose deployment.
3. Verify Deployment
After services start, you can access them at:
- Frontend: http://localhost:30000
- API Gateway: http://localhost:8080
- Database: localhost:5432
4. Check Service Status
You should see the following containers running:
- datamate-frontend (Frontend service)
- datamate-backend (Backend service)
- datamate-backend-python (Python backend service)
- datamate-gateway (API gateway)
- datamate-database (PostgreSQL database)
- datamate-runtime (Operator runtime)
Optional Components Installation
Install Milvus Vector Database
Milvus is used for vector storage and retrieval in knowledge bases:

```shell
make install-milvus
```

Select the Docker Compose deployment method when prompted.
Install Label Studio
Label Studio is used for data annotation:

```shell
make install-label-studio
```
Access: http://localhost:30001
Default credentials:
Install MinerU PDF Processing Service
MinerU provides enhanced PDF document processing:
```shell
make build-mineru
make install-mineru
```
Install DeerFlow Service
DeerFlow is used for enhanced workflow orchestration:

```shell
make install-deer-flow
```
Using Local Images for Development
If you’ve modified local code, use local images for deployment:
```shell
make build
make install dev=true
```
Offline Environment Deployment
For offline environments, download all images first:

```shell
make download SAVE=true
```

Images will be saved in the `dist/` directory. Load the images on the target machine:

```shell
make load-images
```
Uninstall
Uninstall DataMate
```shell
make uninstall
```

The system will prompt whether to delete volumes:
- Select `1`: Delete all data (including datasets, configurations, etc.)
- Select `2`: Keep volumes
Uninstall Specific Components
```shell
# Uninstall Label Studio
make uninstall-label-studio

# Uninstall Milvus
make uninstall-milvus

# Uninstall DeerFlow
make uninstall-deer-flow
```
Next Steps
Common Questions
Q: What if service startup fails?
First check if ports are occupied:
```shell
# Check port usage
lsof -i :30000
lsof -i :8080
```
If ports are occupied, modify port mappings in deployment/docker/datamate/docker-compose.yml.
Q: How to view service logs?
```shell
# View all service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs

# View specific service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f datamate-backend
```
Q: Where is data stored?
Data is persisted through Docker volumes:
- `datamate-dataset-volume`: Dataset files
- `datamate-postgresql-volume`: Database data
- `datamate-log-volume`: Log files
View all volumes:
```shell
docker volume ls | grep datamate
```
2.1 - Installation Guide
Detailed installation and configuration instructions for DataMate
This document provides detailed installation and configuration instructions for the DataMate platform.
System Requirements
Minimum Configuration
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8 cores+ |
| RAM | 8 GB | 16 GB+ |
| Disk | 50 GB | 100 GB+ |
| OS | Linux/macOS/Windows | Linux (Ubuntu 20.04+) |
Software Dependencies
Docker Compose Deployment
- Docker 20.10+
- Docker Compose 2.0+
- Git (optional, for cloning code)
- Make (optional, for using Makefile)
Kubernetes Deployment
- Kubernetes 1.20+
- Helm 3.0+
- kubectl (matching cluster version)
- Git (optional, for cloning code)
- Make (optional, for using Makefile)
Deployment Method Comparison
| Feature | Docker Compose | Kubernetes |
|---|---|---|
| Deployment Difficulty | ⭐ Simple | ⭐⭐⭐ Complex |
| Resource Utilization | ⭐⭐ Fair | ⭐⭐⭐⭐ High |
| High Availability | ❌ Not supported | ✅ Supported |
| Scalability | ⭐⭐ Fair | ⭐⭐⭐⭐ Strong |
| Use Case | Dev/test, small scale | Production, large scale |
Docker Compose Deployment
Basic Deployment
1. Prerequisites
```shell
# Clone code repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

# Check Docker and Docker Compose versions
docker --version
docker compose version
```
2. Deploy Using Makefile
```shell
# One-click deployment (including Milvus)
make install
```
Select 1. Docker/Docker-Compose when prompted.
3. Use Docker Compose Directly
If Make is not installed:
```shell
# Set image registry (optional)
export REGISTRY=ghcr.io/modelengine-group/

# Start basic services
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d
```
4. Verify Deployment
```shell
# Check container status
docker ps

# View service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f

# Access frontend
open http://localhost:30000
```
Optional Components
Milvus Vector Database
```shell
# Using Makefile
make install-milvus

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d
```
Components:
- milvus-standalone (19530, 9091)
- milvus-minio (9000, 9001)
- milvus-etcd
Label Studio

```shell
# Using Makefile
make install-label-studio

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile label-studio up -d
```
Access: http://localhost:30001
Default credentials:
MinerU PDF Processing
```shell
# Build MinerU image
make build-mineru

# Deploy MinerU
make install-mineru
```
DeerFlow Workflow Service
```shell
# Using Makefile
make install-deer-flow

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile deer-flow up -d
```
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `DB_PASSWORD` | `password` | Database password |
| `DATAMATE_JWT_ENABLE` | `false` | Enable JWT authentication |
| `REGISTRY` | `ghcr.io/modelengine-group/` | Image registry |
| `VERSION` | `latest` | Image version |
| `LABEL_STUDIO_HOST` | - | Label Studio access URL |
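Services resolve these variables at startup, falling back to the documented defaults when a variable is unset. A minimal sketch of that lookup (the defaults mirror the table above; the helper itself is illustrative, not DataMate's code):

```python
import os

# Defaults mirror the environment-variable table above.
DEFAULTS = {
    "DB_PASSWORD": "password",
    "DATAMATE_JWT_ENABLE": "false",
    "REGISTRY": "ghcr.io/modelengine-group/",
    "VERSION": "latest",
}

def get_config(env=None):
    """Resolve deployment settings, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

config = get_config({"VERSION": "v1.0.0"})  # override a single variable
```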
Data Volume Management
DataMate uses Docker volumes for persistence:
```shell
# View all volumes
docker volume ls | grep datamate

# View volume details
docker volume inspect datamate-dataset-volume

# Backup volume data
docker run --rm -v datamate-dataset-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/dataset-backup.tar.gz /data
```
Kubernetes/Helm Deployment
Prerequisites
```shell
# Check cluster connection
kubectl cluster-info
kubectl get nodes

# Check Helm version
helm version

# Create namespace (optional)
kubectl create namespace datamate
```
Using Makefile
```shell
# Deploy DataMate
make install INSTALLER=k8s

# Or deploy to a specific namespace
make install NAMESPACE=datamate INSTALLER=k8s
```
Using Helm
1. Deploy Basic Services
```shell
# Deploy DataMate
helm upgrade datamate deployment/helm/datamate/ \
  --install \
  --namespace datamate \
  --create-namespace \
  --set global.image.repository=ghcr.io/modelengine-group/

# Check deployment status
kubectl get pods -n datamate
```
2. Configure Ingress (Optional)

```shell
# Edit values.yaml
cat >> deployment/helm/datamate/values.yaml << EOF
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: datamate.example.com
      paths:
        - path: /
          pathType: Prefix
EOF

# Redeploy
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  -f deployment/helm/datamate/values.yaml
```
3. Deploy Optional Components
```shell
# Deploy Milvus
helm upgrade milvus deployment/helm/milvus \
  --install \
  --namespace datamate

# Deploy Label Studio
helm upgrade label-studio deployment/helm/label-studio/ \
  --install \
  --namespace datamate
```
Offline Deployment
Prepare Offline Images
1. Download Images
```shell
# Download all images locally
make download SAVE=true

# Download a specific version
make download VERSION=v1.0.0 SAVE=true
```

Images are saved in the `dist/` directory.
2. Package and Transfer
```shell
# Package
tar czf datamate-images.tar.gz dist/

# Transfer to target server
scp datamate-images.tar.gz user@target-server:/tmp/
```
Offline Installation
1. Load Images
```shell
# Extract on target server
tar xzf datamate-images.tar.gz

# Load all images
make load-images
```
2. Modify Configuration
Use empty REGISTRY for local images:
```shell
REGISTRY= docker compose -f deployment/docker/datamate/docker-compose.yml up -d
```
Upgrade Guide
Docker Compose Upgrade
```shell
# 1. Backup data
docker run --rm -v datamate-postgresql-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/postgres-backup.tar.gz /data

# 2. Pull new images
docker pull ghcr.io/modelengine-group/datamate-backend:latest

# 3. Stop services
docker compose -f deployment/docker/datamate/docker-compose.yml down

# 4. Start new version
docker compose -f deployment/docker/datamate/docker-compose.yml up -d

# 5. Verify upgrade
docker ps
docker logs -f datamate-backend
```
Or use Makefile:
```shell
make datamate-docker-upgrade
```
Kubernetes Upgrade
```shell
# 1. Backup data
kubectl exec -n datamate deployment/datamate-database -- \
  pg_dump -U postgres datamate > backup.sql

# 2. Update Helm Chart
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  --set global.image.tag=new-version
```
Uninstall
Docker Compose Complete Uninstall
```shell
# Using Makefile
make uninstall
# Choose to delete volumes for complete cleanup
```
Or manual uninstall:
```shell
# Stop and remove containers
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus --profile label-studio down -v

# Remove all volumes
docker volume rm datamate-dataset-volume \
  datamate-postgresql-volume \
  datamate-log-volume

# Remove network
docker network rm datamate-network
```
Kubernetes Complete Uninstall
```shell
# Uninstall all components
make uninstall INSTALLER=k8s

# Or use Helm
helm uninstall datamate -n datamate
helm uninstall milvus -n datamate
helm uninstall label-studio -n datamate

# Delete namespace
kubectl delete namespace datamate
```
Troubleshooting
Common Issues
1. Service Won’t Start
```shell
# Check port conflicts
netstat -tlnp | grep -E '30000|8080|5432'

# Check disk space
df -h

# Check memory
free -h

# View detailed logs
docker logs datamate-backend --tail 100
```
2. Database Connection Failed
```shell
# Check database container
docker ps | grep database

# Test connection
docker exec -it datamate-database psql -U postgres -d datamate
```
2.2 - System Architecture
DataMate system architecture design documentation
This document details DataMate’s system architecture, tech stack, and design philosophy.
Overall Architecture
DataMate adopts a microservices architecture, splitting the system into multiple independent services, each responsible for specific business functions. This architecture provides good scalability, maintainability, and fault tolerance.
```
┌─────────────────────────────────────────────────────────────────┐
│                         Frontend Layer                          │
│                      (React + TypeScript)                       │
│                     Ant Design + Tailwind                       │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        API Gateway Layer                        │
│                      (Spring Cloud Gateway)                     │
│                          Port: 8080                             │
└────────────────────────┬────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Java Backend │ │Python Backend│ │   Runtime    │
│   Services   │ │   Service    │ │   Service    │
├──────────────┤ ├──────────────┤ ├──────────────┤
│· Main App    │ │· RAG Service │ │· Operator    │
│· Data Mgmt   │ │· LangChain   │ │  Execution   │
│· Collection  │ │· FastAPI     │ │              │
│· Cleaning    │ │              │ │              │
│· Annotation  │ │              │ │              │
│· Synthesis   │ │              │ │              │
│· Evaluation  │ │              │ │              │
│· Operator    │ │              │ │              │
│· Pipeline    │ │              │ │              │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
       └────────────────┼────────────────┘
                        ▼
          ┌─────────────┴─────────────┐
          │                           │
    ┌─────▼────┐  ┌─────────┐  ┌─────▼────┐
    │PostgreSQL│  │  Redis  │  │  Milvus  │
    │  (5432)  │  │ (6379)  │  │ (19530)  │
    └──────────┘  └─────────┘  └──────────┘
                       │
                 ┌─────▼─────┐
                 │   MinIO   │
                 │  (9000)   │
                 └───────────┘
```
Tech Stack
Frontend Tech Stack
| Technology | Version | Purpose |
|---|---|---|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI component library |
| Tailwind CSS | 3.x | Styling framework |
| Redux Toolkit | 2.x | State management |
| React Router | 6.x | Routing management |
| Vite | 5.x | Build tool |
Backend Tech Stack (Java)
| Technology | Version | Purpose |
|---|---|---|
| Java | 21 | Runtime environment |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.x | ORM framework |
| PostgreSQL Driver | 42.x | Database driver |
| Redis | 5.x | Cache client |
| MinIO | 8.x | Object storage client |
Backend Tech Stack (Python)
| Technology | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Runtime environment |
| FastAPI | 0.100+ | Web framework |
| LangChain | 0.1+ | LLM application framework |
| Ray | 2.x | Distributed computing |
| Pydantic | 2.x | Data validation |
Data Storage
| Technology | Version | Purpose |
|---|---|---|
| PostgreSQL | 15+ | Main database |
| Redis | 8.x | Cache and message queue |
| Milvus | 2.6.5 | Vector database |
| MinIO | RELEASE.2024+ | Object storage |
Microservices Architecture
Service List
| Service Name | Port | Tech Stack | Description |
|---|---|---|---|
| API Gateway | 8080 | Spring Cloud Gateway | Unified entry, routing, auth |
| Frontend | 30000 | React | Frontend UI |
| Main Application | - | Spring Boot | Core business logic |
| Data Management Service | 8092 | Spring Boot | Dataset management |
| Data Collection Service | - | Spring Boot | Data collection tasks |
| Data Cleaning Service | - | Spring Boot | Data cleaning tasks |
| Data Annotation Service | - | Spring Boot | Data annotation tasks |
| Data Synthesis Service | - | Spring Boot | Data synthesis tasks |
| Data Evaluation Service | - | Spring Boot | Data evaluation tasks |
| Operator Market Service | - | Spring Boot | Operator marketplace |
| RAG Indexer Service | - | Spring Boot | Knowledge base indexing |
| Runtime Service | 8081 | Python + Ray | Operator execution engine |
| Backend Python Service | 18000 | FastAPI | Python backend service |
| Database | 5432 | PostgreSQL | Database |
Service Communication
Synchronous Communication
- API Gateway → Backend Services: HTTP/REST
- Frontend → API Gateway: HTTP/REST
- Between backend services: HTTP/REST (Feign Client)
Asynchronous Communication
- Task Execution: Database task queue
- Event Notification: Redis Pub/Sub
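The database-backed task queue mentioned above works by having workers poll for pending rows and atomically mark them as running. A minimal in-memory sketch (the table and status names are illustrative, not DataMate's actual schema; a real deployment would also need row locking for concurrent workers):

```python
import sqlite3

# Illustrative schema: id, task type, and a status column driving the queue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, type TEXT, status TEXT)")
conn.execute("INSERT INTO task (type, status) VALUES ('clean', 'PENDING')")

def claim_next_task(conn):
    """Claim the oldest pending task and mark it running (sketch only)."""
    cur = conn.execute(
        "SELECT id, type FROM task WHERE status = 'PENDING' ORDER BY id LIMIT 1"
    )
    row = cur.fetchone()
    if row is None:
        return None  # queue is empty
    conn.execute("UPDATE task SET status = 'RUNNING' WHERE id = ?", (row[0],))
    return {"id": row[0], "type": row[1]}

task = claim_next_task(conn)  # claims the pending 'clean' task
```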
Data Architecture
Data Flow
```
┌─────────────┐
│    Data     │  Collection task config
│ Collection  │  → DataX → Raw data
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Data     │  Dataset management, file upload
│ Management  │  → Structured storage
└──────┬──────┘
       │
       ├──────────────┐
       ▼              ▼
┌─────────────┐  ┌─────────────┐
│    Data     │  │  Knowledge  │
│  Cleaning   │  │    Base     │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│    Data     │  │   Vector    │
│ Annotation  │  │    Index    │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                │
┌─────────────┐         │
│    Data     │         │
│  Synthesis  │         │
└──────┬──────┘         │
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│    Data     │  │     RAG     │
│ Evaluation  │  │  Retrieval  │
└─────────────┘  └─────────────┘
```
Deployment Architecture
Docker Compose Deployment
```
┌────────────────────────────────────────────────┐
│                 Docker Network                 │
│                datamate-network                │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │Frontend  │  │ Gateway  │  │ Backend  │      │
│  │ :30000   │  │  :8080   │  │          │      │
│  └──────────┘  └──────────┘  └──────────┘      │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │Backend   │  │ Runtime  │  │Database  │      │
│  │ Python   │  │  :8081   │  │  :5432   │      │
│  └──────────┘  └──────────┘  └──────────┘      │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │ Milvus   │  │  MinIO   │  │  etcd    │      │
│  │ :19530   │  │  :9000   │  │          │      │
│  └──────────┘  └──────────┘  └──────────┘      │
└────────────────────────────────────────────────┘
```
Kubernetes Deployment
```
┌────────────────────────────────────────────────┐
│               Kubernetes Cluster               │
│                                                │
│  Namespace: datamate                           │
│                                                │
│  ┌────────────┐      ┌────────────┐            │
│  │ Deployment │      │ Deployment │            │
│  │  Frontend  │      │  Gateway   │            │
│  │  (3 Pods)  │      │  (2 Pods)  │            │
│  └─────┬──────┘      └─────┬──────┘            │
│        │                   │                   │
│  ┌─────▼───────────────────▼─────┐             │
│  │    Service (LoadBalancer)     │             │
│  └───────────────────────────────┘             │
│                                                │
│  ┌────────────┐      ┌────────────┐            │
│  │ StatefulSet│      │ Deployment │            │
│  │  Database  │      │  Backend   │            │
│  └────────────┘      └────────────┘            │
└────────────────────────────────────────────────┘
```
Security Architecture
Authentication & Authorization
JWT Authentication (Optional)
```yaml
datamate:
  jwt:
    enable: true            # Enable JWT authentication
    secret: your-secret-key
    expiration: 86400       # 24 hours
```
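As a rough illustration of how an HMAC-signed, expiring token of this kind works (this is a stdlib-only sketch, not DataMate's actual JWT implementation; the secret value is the example from the config above):

```python
import base64, hashlib, hmac, json, time

SECRET = b"your-secret-key"  # example value, mirrors the config above

def sign_token(payload: dict, secret: bytes) -> str:
    """Produce an HS256-style signed token (illustrative only)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify_token(token: str, secret: bytes):
    """Return the payload if the signature is valid and not expired, else None."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload.get("exp", 0) < time.time():
        return None
    return payload

token = sign_token({"sub": "alice", "exp": time.time() + 86400}, SECRET)
```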
API Key Authentication
```yaml
datamate:
  api-key:
    enable: false
```
Data Security
Transport Encryption
- API Gateway supports HTTPS/TLS
- Internal service communication can be encrypted
Storage Encryption
- Database: Transparent data encryption (TDE)
- MinIO: Server-side encryption
- Milvus: Encryption at rest
Next Steps
2.3 - Development Environment Setup
Local development environment configuration guide for DataMate
This document describes how to set up a local development environment for DataMate.
Prerequisites
Required Software
| Software | Version | Purpose |
|---|---|---|
| Node.js | 18.x+ | Frontend development |
| pnpm | 8.x+ | Frontend package management |
| Java | 21 | Backend development |
| Maven | 3.9+ | Backend build |
| Python | 3.11+ | Python service development |
| Docker | 20.10+ | Containerized deployment |
| Docker Compose | 2.0+ | Service orchestration |
| Git | 2.x+ | Version control |
| Make | 4.x+ | Build automation |
Recommended Software
- IDE: IntelliJ IDEA (backend) + VS Code (frontend/Python)
- Database Client: DBeaver, pgAdmin
- API Testing: Postman, curl
- Git Client: GitKraken, SourceTree
Code Structure
```
DataMate/
├── backend/                  # Java backend
│   ├── services/             # Microservice modules
│   │   ├── main-application/
│   │   ├── data-management-service/
│   │   ├── data-cleaning-service/
│   │   └── ...
│   ├── openapi/              # OpenAPI specs
│   └── scripts/              # Build scripts
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── components/       # Common components
│   │   ├── pages/            # Page components
│   │   ├── services/         # API services
│   │   ├── store/            # Redux store
│   │   └── routes/           # Routes config
│   └── package.json
├── runtime/                  # Python runtime
│   └── datamate/             # DataMate runtime
└── deployment/               # Deployment configs
    ├── docker/               # Docker configs
    └── helm/                 # Helm charts
```
Backend Development
1. Install Java 21
```shell
# macOS (Homebrew)
brew install openjdk@21

# Linux (Ubuntu/Debian)
sudo apt update
sudo apt install openjdk-21-jdk

# Verify
java -version
```
2. Install Maven
```shell
# macOS
brew install maven

# Linux
sudo apt install maven

# Verify
mvn -version
```
Install Plugins
- Lombok Plugin
- MyBatis Plugin
- Rainbow Brackets
- GitToolBox
Import Project
- Open IntelliJ IDEA
- File → Open
- Select the `backend` directory
- Wait for Maven dependencies to download
Start Local Database (Docker)
```shell
# Start database only
docker compose -f deployment/docker/datamate/docker-compose.yml up -d datamate-database
```
Connection info:
- Host: localhost
- Port: 5432
- Database: datamate
- Username: postgres
- Password: password
5. Run Backend Service
Using Maven
```shell
cd backend/services/main-application
mvn spring-boot:run
```
Using IDE
- Find Application class
- Right-click → Run
- Access http://localhost:8080
Frontend Development
1. Install Node.js
# macOS
brew install node@18
# Linux
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
2. Install pnpm

```shell
npm install -g pnpm
```

3. Install Dependencies

```shell
cd frontend
pnpm install
```
4. Configure Environment Variables

Create `.env.development`:

```
VITE_API_BASE_URL=http://localhost:8080
VITE_API_TIMEOUT=30000
```
5. Start Dev Server

```shell
pnpm dev
```

Access http://localhost:3000
Python Service Development
1. Install Python 3.11
# macOS
brew install python@3.11
# Linux
sudo apt install python3.11 python3.11-venv
2. Create Virtual Environment
```shell
cd runtime/datamate
python3.11 -m venv venv
source venv/bin/activate
```
3. Install Dependencies
```shell
pip install -r requirements.txt
```
4. Run Python Service
```shell
python operator_runtime.py --port 8081
```
Local Debugging
Start All Services
Using Docker Compose
```shell
# Start base services (database, Redis, etc.)
docker compose -f deployment/docker/datamate/docker-compose.yml up -d \
  datamate-database \
  datamate-redis

# Start Milvus (optional)
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d
```
Start Backend Services
```shell
# Terminal 1: Main Application
cd backend/services/main-application
mvn spring-boot:run

# Terminal 2: Data Management Service
cd backend/services/data-management-service
mvn spring-boot:run
```
Start Frontend

```shell
cd frontend
pnpm dev
```
Start Python Services
```shell
# Runtime Service
cd runtime/datamate
python operator_runtime.py --port 8081

# Backend Python Service
cd backend-python
uvicorn main:app --reload --port 18000
```
Code Standards
Java Code Standards
Naming Conventions
- Class names: PascalCase, e.g. `UserService`
- Method names: camelCase, e.g. `getUserById`
- Constants: UPPER_CASE, e.g. `MAX_SIZE`
- Variables: camelCase, e.g. `userName`
TypeScript Code Standards
Naming Conventions
- Components: PascalCase, e.g. `UserProfile`
- Types/Interfaces: PascalCase, e.g. `UserData`
- Functions: camelCase, e.g. `getUserData`
- Constants: UPPER_CASE, e.g. `API_BASE_URL`
Python Code Standards
Follow PEP 8:
```python
def get_user(user_id: int) -> dict:
    """Get user information

    Args:
        user_id: User ID

    Returns:
        User information dictionary
    """
    # ...
```
Common Issues
Backend Won’t Start
- Check Java version: `java -version`
- Check port conflicts: `lsof -i :8080`
- View logs
- Clean and rebuild: `mvn clean install`
Frontend Won’t Start
- Check Node version: `node -v`
- Delete node_modules: `rm -rf node_modules && pnpm install`
- Check port: `lsof -i :3000`
Next Steps
3 - User Guide
DataMate feature usage guides
This guide introduces how to use each feature module of DataMate.
DataMate provides comprehensive data processing solutions for large models, covering data collection, management, cleaning, annotation, synthesis, evaluation, and the full process.
Feature Modules
Typical Use Cases
Model Fine-tuning Scenario
```
1. Data Collection → 2. Data Management → 3. Data Cleaning → 4. Data Annotation
                                                                      ↓
                                5. Data Evaluation → 6. Export Training Data
```
RAG Application Scenario
```
1. Upload Documents → 2. Vectorization Index → 3. Knowledge Base Management
                                                          ↓
                               4. Agent Chat (Knowledge Base Q&A)
```
Data Augmentation Scenario
```
1. Prepare Raw Data → 2. Create Instruction Template → 3. Data Synthesis
                                                              ↓
                               4. Quality Evaluation → 5. Export Augmented Data
```
Quick Links
3.1 - Data Collection
Collect data from multiple data sources with DataMate
The data collection module helps you collect data from multiple data sources (databases, file systems, APIs, etc.) into the DataMate platform.
Features Overview
Built on DataX, the data collection module supports:
- Multiple Data Sources: MySQL, PostgreSQL, Oracle, SQL Server, etc.
- Heterogeneous Sync: Data sync between different sources
- Batch Collection: Large-scale batch collection and sync
- Scheduled Tasks: Support scheduled execution
- Task Monitoring: Real-time monitoring of collection tasks
Supported Data Sources
| Data Source Type | Reader | Writer | Description |
|---|---|---|---|
| General Relational Databases | ✅ | ✅ | Supports MySQL, PostgreSQL, OpenGauss, SQL Server, DM, DB2 |
| MySQL | ✅ | ✅ | Relational database |
| PostgreSQL | ✅ | ✅ | Relational database |
| OpenGauss | ✅ | ✅ | Relational database |
| SQL Server | ✅ | ✅ | Microsoft database |
| DM (Dameng) | ✅ | ✅ | Domestic database |
| DB2 | ✅ | ✅ | IBM database |
| StarRocks | ✅ | ✅ | Analytical database |
| NAS | ✅ | ✅ | Network storage |
| S3 | ✅ | ✅ | Object storage |
| GlusterFS | ✅ | ✅ | Distributed file system |
| API Collection | ✅ | ✅ | API interface data |
| JSON Files | ✅ | ✅ | JSON format files |
| CSV Files | ✅ | ✅ | CSV format files |
| TXT Files | ✅ | ✅ | Text files |
| FTP | ✅ | ✅ | FTP servers |
| HDFS | ✅ | ✅ | Hadoop HDFS |
Quick Start
1. Create Collection Task
Step 1: Enter Data Collection Page
Select Data Collection in the left navigation.
Step 2: Create Task
Click Create Task button.
Step 3: Fill in Basic Information
Fill in the following basic information:
- Name: A meaningful name for the task
- Timeout: Task execution timeout (seconds)
- Description: Task purpose (optional)
Step 4: Select Sync Mode
Select the task synchronization mode:
- Immediate Sync: Execute once immediately after task creation
- Scheduled Sync: Execute periodically according to schedule rules
When selecting Scheduled Sync, configure the execution policy:
- Execution Cycle: Hourly / Daily / Weekly / Monthly
- Execution Time: Select the execution time point
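The execution-cycle options above map naturally onto cron-style schedules. A hypothetical helper illustrating that mapping (DataMate's actual scheduler format may differ; the weekday and day-of-month choices below are assumptions):

```python
# Hypothetical helper translating the execution-cycle options into cron
# expressions; illustrative only, not DataMate's scheduler API.
def schedule_to_cron(cycle: str, hour: int = 0, minute: int = 0) -> str:
    cycles = {
        "hourly": f"{minute} * * * *",
        "daily": f"{minute} {hour} * * *",
        "weekly": f"{minute} {hour} * * 1",   # assumed: Mondays
        "monthly": f"{minute} {hour} 1 * *",  # assumed: 1st of the month
    }
    try:
        return cycles[cycle]
    except KeyError:
        raise ValueError(f"unknown cycle: {cycle}")

cron = schedule_to_cron("daily", hour=2)  # run once a day at 02:00
```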
Step 5: Configure the Data Source
- Select data source type: Choose from the dropdown list (e.g., MySQL, CSV)
- Configure data source parameters: Fill in connection parameters based on the selected data source template (form format)
MySQL Example:
- JDBC URL: `jdbc:mysql://localhost:3306/mydb`
- Username: `root`
- Password: `password`
- Table Name: `users`
Step 6: Configure Field Extraction
Field mapping is not supported. You can only extract specific fields from the configured SQL:
- Extract specific fields: Enter the field names you want to extract in the field list
- Extract all fields: Leave the field list empty to extract all fields from the SQL query result
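The extraction rule above (a non-empty field list keeps only those fields; an empty list keeps everything) can be sketched as:

```python
# Sketch of the field-extraction rule described above; illustrative only.
def extract_fields(row: dict, fields: list[str]) -> dict:
    if not fields:
        return dict(row)  # empty field list: extract all fields
    return {name: row[name] for name in fields if name in row}

row = {"id": 1, "name": "alice", "email": "a@example.com"}
subset = extract_fields(row, ["id", "name"])  # keeps only id and name
```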
Step 7: Create and Execute
Click Create button to create the task.
- If Immediate Sync is selected, task starts immediately
- If Scheduled Sync is selected, task runs periodically according to schedule
2. Monitor Task Execution
View all collection tasks with status, progress, and operations.
3. Task Management
Each task in the task list has the following actions available:
- View Execution Records: View all historical executions of the task
- Delete: Delete the task (note: deleting a task does not delete collected data)
Click the task name to view task details including:
- Basic configuration
- Execution record list
- Data statistics
Common Questions
Q: What if task execution fails?
A: Troubleshooting steps:
- Check data source connection
- View execution logs
- Check data format
- Verify target dataset exists
Q: How to collect large tables?
A:
- Use incremental collection
- Split into multiple tasks
- Adjust concurrent parameters
- Use filter conditions
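The "split into multiple tasks" suggestion can be sketched in Python. The helper below is hypothetical (not part of DataMate); it divides a table's primary-key range into contiguous sub-ranges that can each become one collection task's filter condition:

```python
def split_key_ranges(min_id, max_id, num_tasks):
    """Divide an inclusive primary-key range into contiguous sub-ranges,
    one per collection task; sub-range sizes differ by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_tasks)
    ranges, start = [], min_id
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# A 1,000,000-row table split across 4 tasks:
ranges = split_key_ranges(1, 1_000_000, 4)
# [(1, 250000), (250001, 500000), (500001, 750000), (750001, 1000000)]
```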
API Reference
3.2 - Data Management
Manage datasets and files with DataMate
The data management module provides unified dataset management, supporting storage, query, and other operations across multiple data types.
Features Overview
Data management module provides:
- Multiple data types: Image, text, audio, video, and multimodal support
- File management: Upload, download, preview, delete operations
- Directory structure: Support for hierarchical directory organization
- Tag management: Use tags to categorize and retrieve data
- Statistics: Dataset size, file count, and other statistics
Dataset Types
| Type | Description | Supported Formats |
|---|---|---|
| Image | Image data | JPG, PNG, GIF, BMP, WebP |
| Text | Text data | TXT, MD, JSON, CSV |
| Audio | Audio data | MP3, WAV, FLAC, AAC |
| Video | Video data | MP4, AVI, MOV, MKV |
| Multimodal | Multimodal data | Mixed formats |
Quick Start
1. Create Dataset
Step 1: Enter Data Management Page
In the left navigation, select Data Management.
Step 2: Create Dataset
Click the Create Dataset button in the upper right corner.
Step 3: Fill Basic Information
- Dataset name: e.g., user_images_dataset
- Dataset type: Select the data type (e.g., Image)
- Description: Dataset purpose description (optional)
- Tags: Add tags for categorization (optional)
Step 4: Create Dataset
Click the Create button to complete.
2. Upload Files
Method 1: Drag & Drop
- Enter dataset details page
- Drag files directly to the upload area
- Wait for upload completion
Method 2: Click Upload
- Click Upload File button
- Select local files
- Wait for upload completion
Method 3: Chunked Upload (Large Files)
For large files (>100MB), the system automatically uses chunked upload:
- Select large file to upload
- System automatically splits the file
- Upload chunks one by one
- Automatically merge
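The chunked-upload steps above amount to a split-and-merge round trip. A minimal sketch (the 100 MB threshold comes from the text; the function names are illustrative, not DataMate APIs):

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Client side: split the payload into chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def merge_chunks(chunks):
    """Server side: reassemble chunks in upload order."""
    return b"".join(chunks)

payload = b"x" * 250
chunks = split_into_chunks(payload, 100)  # three chunks: 100 + 100 + 50 bytes
assert merge_chunks(chunks) == payload    # lossless round trip
```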
3. Create Directory
Step 1: Enter Dataset
Click dataset name to enter details.
Step 2: Create Directory
- Click Create Directory button
- Enter directory name
- Select parent directory (optional)
- Click confirm
Directory structure example:
user_images_dataset/
├── train/
│ ├── cat/
│ └── dog/
├── test/
│ ├── cat/
│ └── dog/
└── validation/
├── cat/
└── dog/
4. Manage Files
View Files
In dataset details page, you can see all files:
| Filename | Size | File Count | Upload Time | Tags | Tag Update Time | Actions |
|---|---|---|---|---|---|---|
| image1.jpg | 2.3 MB | 1 | 2024-01-15 | Training Set | 2024-01-16 | Download Rename Delete |
| image2.png | 1.8 MB | 1 | 2024-01-15 | Validation Set | 2024-01-16 | Download Rename Delete |
Preview File
Click Preview button to preview in browser:
- Image: Display thumbnail and details
- Text: Display text content
- Audio: Online playback
- Video: Online playback
Download File
- Single file download: Click Download button
Currently, batch download and package download are not supported.
5. Dataset Operations
View Statistics
In dataset details page, you can see:
- Total files: Total number of files in dataset
- Total size: Total size of all files
Edit Dataset
Click Edit button to modify:
- Dataset name
- Description
- Tags
- Associated collection task
Delete Dataset
Click Delete button to delete entire dataset.
Note: Deleting a dataset will also delete all files within it. This action cannot be undone.
Advanced Features
Tag Management
Create Tag
- In the dataset list page, click Tag Management
- Click Create Tag
- Enter the tag name
Assign Tag
- Edit the dataset
- Select existing tags in the tag bar
- Save the dataset
Filter by Tag
In the dataset list page, click a tag to filter the datasets that carry it.
Best Practices
1. Dataset Organization
Recommended directory organization:
project_dataset/
├── raw/ # Raw data
├── processed/ # Processed data
├── train/ # Training data
├── validation/ # Validation data
└── test/ # Test data
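The recommended layout can be created locally before uploading. A small sketch using only the standard library (the directory names match the example above; nothing here calls DataMate):

```python
import tempfile
from pathlib import Path

def build_layout(root, subdirs):
    """Create the recommended dataset layout under `root`."""
    for sub in subdirs:
        Path(root, sub).mkdir(parents=True, exist_ok=True)

root = tempfile.mkdtemp()  # stand-in for your local dataset root
build_layout(root, ["raw", "processed", "train", "validation", "test"])
```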
2. Naming Conventions
- Dataset name: Use lowercase letters and underscores, e.g., user_images_2024
- Directory name: Use meaningful English names, e.g., train, test, processed
- File name: Keep the original filename or use standardized naming
3. Tag Usage
Recommended tag categories:
- Project tags: project-a, project-b
- Status tags: raw, processed, validated
- Type tags: image, text, audio
- Purpose tags: training, testing, evaluation
4. Data Backup
The system currently does not support automatic backup. To back up data, manually download individual files:
- Enter dataset details page
- Find the file you need to backup
- Click the Download button of the file
Common Questions
Q: Large file upload fails?
A: Suggestions for large file uploads:
- Use chunked upload: System automatically enables chunked upload
- Check network: Ensure stable network connection
- Adjust upload parameters: Increase timeout
- Use FTP/SFTP: For very large files, use FTP upload
Q: How to import existing data?
A: Three methods to import existing data:
- Upload files: Upload via interface
- Add files: If files already on server, use add file feature
- Data collection: Use data collection module to collect from external sources
Q: Dataset size limit?
A: Dataset size limits:
- Single file: Maximum 5GB (chunked upload)
- Total dataset: Limited by storage space
- File count: No explicit limit
Regularly clean unnecessary files to free up space.
API Reference
For detailed API documentation, see:
3.3 - Data Cleaning
Clean and preprocess data with DataMate
Data cleaning module provides powerful data processing capabilities to help you clean, transform, and optimize data quality.
Features Overview
Data cleaning module provides:
- Built-in Cleaning Operators: Rich pre-cleaning operator library
- Visual Configuration: Drag-and-drop cleaning pipeline design
- Template Management: Save and reuse cleaning templates
- Batch Processing: Support large-scale data batch cleaning
- Real-time Preview: Preview cleaning results
Cleaning Operator Types
Data Quality Operators
| Operator | Function | Applicable Data Types |
|---|---|---|
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |
Text Cleaning Operators
| Operator | Function |
|---|---|
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |
Quick Start
1. Create Cleaning Task
Step 1: Enter Data Cleaning Page
Select Data Processing in the left navigation.
Step 2: Create Task
Click Create Task button.
- Task name: e.g., user_data_cleansing
- Source dataset: Select the dataset to clean
- Output dataset: Select or create the output dataset
Design the cleaning pipeline:
- Drag operators from the left library to the canvas
- Connect operators to form a pipeline
- Configure operator parameters
- Preview cleaning results
Example pipeline:
Input Data → Deduplication → Null Handling → Format Validation → Output Data
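The example pipeline above can be sketched as plain Python functions chained in order. These toy operators are illustrative only; DataMate's built-in operators are configured visually, not written this way:

```python
def deduplicate(records):
    """Drop exact duplicate records while preserving order."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def fill_nulls(records, default=""):
    """Replace None values with a default."""
    return [{k: default if v is None else v for k, v in r.items()}
            for r in records]

def validate_format(records, required=("name",)):
    """Keep only records whose required fields are non-empty."""
    return [r for r in records if all(r.get(f) for f in required)]

data = [
    {"name": "alice", "email": None},
    {"name": "alice", "email": None},    # duplicate, removed
    {"name": None, "email": "b@x.com"},  # empty name, filtered out
]
result = validate_format(fill_nulls(deduplicate(data)))
# result == [{"name": "alice", "email": ""}]
```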
2. Use Cleaning Templates
Create Template
- Configure cleaning pipeline
- Click Save as Template
- Enter template name
- Save
Use Template
- Create cleaning task
- Click Use Template
- Select template
- Adjust as needed
3. Monitor Cleaning Task
View task status, progress, and statistics in task list.
Advanced Features
Custom Operators
Develop custom operators. See:
Conditional Branching
Add conditional branches in pipeline:
Input Data → [Condition Check]
├── Satisfied → Pipeline A
└── Not Satisfied → Pipeline B
Best Practices
1. Pipeline Design
Recommended principles:
- Modular: Split complex pipelines
- Reusable: Use templates and parameters
- Maintainable: Add comments
- Testable: Test individually before combining
Optimize performance:
- Parallelize: Use parallel nodes
- Reduce data transfer: Process locally when possible
- Batch operations: Use batch operations
- Cache results: Cache intermediate results
Common Questions
Q: Task execution failed?
A: Troubleshooting:
- Check data format
- View execution logs
- Check operator parameters
- Test individual operators
- Reduce data size for testing
Q: Cleaning speed is slow?
A: Optimize:
- Reduce operator count
- Optimize operator order
- Increase concurrency
- Use incremental processing
API Reference
3.4 - Data Annotation
Perform data annotation with DataMate
Data annotation module integrates Label Studio to provide professional-grade data annotation capabilities.
Features Overview
Data annotation module provides:
- Multiple Annotation Types: Image, text, audio, etc.
- Annotation Templates: Rich annotation templates and configurations
- Quality Control: Annotation review and consistency checks
- Team Collaboration: Multi-person collaborative annotation
- Annotation Export: Export annotation results
Annotation Types
Image Annotation
| Type | Description | Use Cases |
|---|---|---|
| Image Classification | Classify entire image | Scene recognition |
| Object Detection | Annotate object locations | Object recognition |
| Semantic Segmentation | Pixel-level classification | Medical imaging |
| Key Point Annotation | Annotate key points | Pose estimation |
Text Annotation
| Type | Description | Use Cases |
|---|---|---|
| Text Classification | Classify text | Sentiment analysis |
| Named Entity Recognition | Annotate entities | Information extraction |
| Text Summarization | Generate summaries | Document understanding |
Quick Start
1. Deploy Label Studio
make install-label-studio
Access: http://localhost:30001
Default credentials:
2. Create Annotation Task
Step 1: Enter Data Annotation Page
Select Data Annotation in the left navigation.
Step 2: Create Task
Click Create Task.
- Task name: e.g., image_classification_task
- Source dataset: Select the dataset to annotate
- Annotation type: Select the annotation type
Image Classification Template:
<View>
<Image name="image" value="$image"/>
<Choices name="choice" toName="image">
<Choice value="cat"/>
<Choice value="dog"/>
<Choice value="bird"/>
</Choices>
</View>
- Annotation method: Single label / Multi label
- Minimum annotations: Per sample (for consistency)
- Review mechanism: Enable/disable review
3. Start Annotation
- Enter annotation interface
- View sample to annotate
- Perform annotation
- Click Submit
- Auto-load next sample
Advanced Features
Quality Control
Annotation Consistency
Check consistency between annotators:
- Cohen’s Kappa: Evaluate consistency
- Majority vote: Use majority annotation results
- Expert review: Expert reviews disputed annotations
Pre-annotation
Use models for pre-annotation:
- Train or use existing model
- Pre-annotate dataset
- Annotators correct pre-annotations
Best Practices
1. Annotation Guidelines
Create clear guidelines:
- Define standards: Clear annotation standards
- Provide examples: Positive and negative examples
- Edge cases: Handle edge cases
- Train annotators: Ensure understanding
Common Questions
Q: Poor annotation quality?
A: Improve:
- Refine guidelines
- Strengthen training
- Increase reviews
- Use pre-annotation
3.5 - Data Synthesis
Use large models for data augmentation and synthesis
Data synthesis module leverages large model capabilities to automatically generate high-quality training data, reducing data collection costs.
Features Overview
Data synthesis module provides:
- Instruction template management: Create and manage synthesis instruction templates
- Single task synthesis: Create individual synthesis tasks
- Ratio synthesis tasks: Synthesize multi-category balanced data in specified proportions
- Large model integration: Support for multiple LLM APIs
- Quality evaluation: Automatic evaluation of synthesized data quality
Quick Start
1. Create Instruction Template
Step 1: Enter Data Synthesis Page
In the left navigation, select Data Synthesis → Synthesis Tasks.
Step 2: Create Instruction Template
- Click Instruction Templates tab
- Click Create Template button
Basic Information:
- Template name: e.g., qa_generation_template
- Template description: Describe the template purpose (optional)
- Template type: Select the template type (Q&A, dialogue, summary, etc.)
Prompt Configuration:
Example prompt:
You are a professional data generation assistant. Generate data based on the following requirements:
Task: Generate Q&A pairs
Topic: {topic}
Count: {count}
Difficulty: {difficulty}
Requirements:
1. Questions should be clear and specific
2. Answers should be accurate and complete
3. Cover different difficulty levels
Output format: JSON
[
{
"question": "...",
"answer": "..."
}
]
Parameter Configuration:
- Model: Select LLM to use (GPT-4, Claude, local model, etc.)
- Temperature: Control generation randomness (0-1)
- Max tokens: Limit generation length
- Other parameters: Configure according to model
Step 4: Save Template
Click Save button to save template.
2. Create Synthesis Task
Step 1: Fill Basic Information
- Return to the Data Synthesis page
- Click the Create Task button
- Fill in basic information:
  - Task name: e.g., medical_qa_synthesis
  - Task description: Describe the task purpose (optional)
Step 2: Select Dataset and Files
Select required data from existing datasets:
- Select dataset: Choose the dataset to use from the list
- Select files:
- Can select all files from a dataset
- Can also select specific files from a dataset
- Support selecting multiple files
Step 3: Select Synthesis Instruction Template
Select an existing template or create a new one:
- Select from template library: Choose from created templates
- Template type: Q&A generation, dialogue generation, summary generation, etc.
- Preview template: View template prompt content
Step 4: Fill Synthesis Configuration
The synthesis configuration consists of four parts:
1. Set Total Synthesis Count
Set the maximum limit for the entire task:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Maximum QA Pairs | Maximum number of QA pairs to generate for entire task | 5000 | 1-100,000 |
This setting is optional, used for total volume control in large-scale synthesis tasks.
2. Configure Text Chunking Strategy
Chunk the input text files, supporting multiple chunking methods:
| Parameter | Description | Default Value |
|---|---|---|
| Chunking Method | Select chunking strategy | Default chunking |
| Chunk Size | Character count per chunk | 3000 |
| Overlap Size | Overlap characters between adjacent chunks | 100 |
Chunking Method Options:
- Default Chunking (默认分块): Use system default intelligent chunking strategy
- Chapter-based Chunking (按章节分块): Split by chapter structure
- Paragraph-based Chunking (按段落分块): Split by paragraph boundaries
- Fixed Length Chunking (固定长度分块): Split by fixed character length
- Custom Separator Chunking (自定义分隔符分块): Split by custom delimiter
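Fixed-length chunking with the default parameters (chunk size 3000, overlap 100) can be sketched as follows. This illustrates the strategy, not DataMate's actual implementation; note the tiny example leaves a short trailing chunk:

```python
def chunk_text(text, chunk_size=3000, overlap=100):
    """Fixed-length chunking: successive chunks start (chunk_size - overlap)
    characters apart, so adjacent chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Tiny example so the sliding window is visible:
chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
# chunks == ['abcd', 'defg', 'ghij', 'j']
```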
3. Configure Question Synthesis Parameters
Set parameters for question generation:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Question Count | Number of questions generated per chunk | 1 | 1-20 |
| Temperature | Control randomness and diversity of question generation | 0.7 | 0-2 |
| Model | Select CHAT model for question generation | - | Select from model list |
Parameter Notes:
- Question Count: Number of questions generated per text chunk. Higher value generates more questions.
- Temperature: Higher values produce more diverse questions, lower values produce more stable questions.
4. Configure Answer Synthesis Parameters
Set parameters for answer generation:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Temperature | Control stability of answer generation | 0.7 | 0-2 |
| Model | Select CHAT model for answer generation | - | Select from model list |
Parameter Notes:
- Temperature: Lower values produce more conservative and accurate answers, higher values produce more diverse and creative answers.
Synthesis Types:
The system supports two synthesis types:
- SFT Q&A Synthesis (SFT 问答数据合成): Generate Q&A pairs for supervised fine-tuning
- COT Chain-of-Thought Synthesis (COT 链式推理合成): Generate data with reasoning process
Step 5: Start Task
Click Start Task button, task will automatically start executing.
3. Create Ratio Synthesis Task
Ratio synthesis tasks are used to synthesize multi-category balanced data in specified proportions.
Step 1: Create Ratio Task
- In the left navigation, select Data Synthesis → Ratio Tasks
- Click Create Task button
Step 2: Fill Basic Information
| Parameter | Description | Required |
|---|---|---|
| Task Name | Unique identifier for the task | Yes |
| Total Target Count | Target total count for the entire ratio task | Yes |
| Task Description | Describe the purpose and requirements of the ratio task | No |
Example:
- Task name: balanced_dataset_synthesis
- Total target count: 10000
- Task description: Generate balanced data for training and validation sets
Step 3: Select Datasets
Select datasets to participate in the ratio synthesis from existing datasets:
Dataset Selection Features:
- Search Datasets: Search datasets by keyword
- Multi-select Support: Can select multiple datasets simultaneously
- Dataset Information: Display detailed information for each dataset
- Dataset name and type
- Dataset description
- File count
- Dataset size
- Label distribution preview (up to 8 labels)
After selecting datasets, the system automatically loads label distribution information for each dataset.
Step 4: Fill Ratio Configuration
Configure specific synthesis rules for each selected dataset:
Ratio Configuration Items:
| Parameter | Description | Range |
|---|---|---|
| Label | Select label from dataset’s label distribution | Based on dataset labels |
| Label Value | Specific value under selected label | Based on label value list |
| Label Update Time | Select label update date range (optional) | Date picker |
| Quantity | Data count to generate for this config | 0 to total target count |
Feature Notes:
- Auto Distribute: Click “Auto Distribute” button, system automatically distributes total count evenly across datasets
- Quantity Limit: Each configuration item’s quantity cannot exceed the dataset’s total file count
- Percentage Calculation: System automatically calculates percentage of each configuration item
- Delete Configuration: Can delete unwanted configuration items
- Add Configuration: Each dataset can have multiple different label configurations
Example Configuration:
| Dataset | Label | Label Value | Label Update Time | Quantity |
|---|---|---|---|---|
| Training Dataset | Category | Training | - | 6000 |
| Training Dataset | Category | Validation | - | 2000 |
| Test Dataset | Category | Test | 2024-01-01 to 2024-12-31 | 2000 |
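The Auto Distribute behavior described above (spreading the total target count evenly across datasets) can be sketched like this. The function is hypothetical; DataMate's exact rounding rules may differ:

```python
def auto_distribute(total, dataset_names):
    """Spread `total` as evenly as possible; the first few datasets
    absorb the remainder so the counts always sum to `total`."""
    base, extra = divmod(total, len(dataset_names))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(dataset_names)}

plan = auto_distribute(10000, ["train_ds", "val_ds", "test_ds"])
# plan == {"train_ds": 3334, "val_ds": 3333, "test_ds": 3333}
```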
Step 5: Execute Task
Click Start Task button, the system will create and execute the task according to ratio configuration.
4. Monitor Synthesis Task
View Task List
In data synthesis page, you can see all synthesis tasks:
| Task Name | Template | Status | Progress | Generated Count | Actions |
|---|---|---|---|---|---|
| Medical QA Synthesis | qa_template | Running | 50% | 50/100 | View Details |
| Sentiment Data Synthesis | sentiment_template | Completed | 100% | 1000/1000 | View Details |
Advanced Features
Template Variables
Use variables in prompts for dynamic configuration:
Variable syntax: {variable_name}
Example:
Generate {count} {difficulty} level {type} about {topic}.
Built-in variables:
- {current_date}: Current date
- {current_time}: Current time
- {random_id}: Random ID
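Variable substitution of this kind maps naturally onto Python's str.format. A minimal sketch, assuming built-in variables are filled automatically and user-supplied variables are passed through (the real template engine may behave differently):

```python
import random
import string
from datetime import datetime

def render_prompt(template, **variables):
    """Substitute {name} placeholders; built-ins plus user variables."""
    builtins = {
        "current_date": datetime.now().strftime("%Y-%m-%d"),
        "current_time": datetime.now().strftime("%H:%M:%S"),
        "random_id": "".join(random.choices(string.ascii_lowercase, k=8)),
    }
    return template.format(**{**builtins, **variables})

prompt = render_prompt(
    "Generate {count} {difficulty} level {type} about {topic}.",
    count=5, difficulty="medium", type="questions", topic="history",
)
# prompt == "Generate 5 medium level questions about history."
```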
Model Selection
DataMate supports multiple LLMs:
| Model | Type | Description |
|---|---|---|
| GPT-4 | OpenAI | High-quality generation |
| GPT-3.5-Turbo | OpenAI | Fast generation |
| Claude 3 | Anthropic | Long-text generation |
| Wenxin Yiyan | Baidu | Chinese optimized |
| Tongyi Qianwen | Alibaba | Chinese optimized |
| Local Model | Deployed locally | Private deployment |
Best Practices
1. Prompt Design
Good prompts should:
- Define task clearly: Clearly describe generation task
- Specify format: Clearly define output format requirements
- Provide examples: Give expected output examples
- Control quality: Set quality requirements
Example prompt:
You are a professional educational content creator.
Task: Generate educational Q&A pairs
Subject: {subject}
Grade: {grade}
Count: {count}
Requirements:
1. Questions should be appropriate for the grade level
2. Answers should be accurate, detailed, and easy to understand
3. Each answer should include explanation process
4. Do not generate sensitive or inappropriate content
Output format (JSON):
[
{
"id": 1,
"question": "Question content",
"answer": "Answer content",
"explanation": "Explanation content",
"difficulty": "easy/medium/hard",
"knowledge_points": ["point1", "point2"]
}
]
Start generating:
2. Parameter Tuning
Adjust model parameters according to needs:
| Parameter | High Quality | Fast Generation | Creative Generation |
|---|---|---|---|
| Temperature | 0.3-0.5 | 0.1-0.3 | 0.7-1.0 |
| Max tokens | As needed | Shorter | Longer |
| Top P | 0.9-0.95 | 0.9 | 0.95-1.0 |
Common Questions
Q: Generated data quality is not ideal?
A: Optimization suggestions:
- Improve prompt: More detailed and clear instructions
- Adjust parameters: Lower temperature, increase max tokens
- Provide examples: Give examples in prompt
- Change model: Try other LLMs
- Manual review: Manual review and filtering
Q: Generation speed is slow?
A: Acceleration suggestions:
- Reduce count: Generate in smaller batches
- Adjust concurrency: Increase concurrency appropriately
- Use faster model: Like GPT-3.5-Turbo
- Shorten output: Reduce max tokens
- Use local model: Deploy local model for acceleration
API Reference
For detailed API documentation, see:
3.6 - Data Evaluation
Evaluate data quality with DataMate
Data evaluation module provides multi-dimensional data quality evaluation capabilities.
Features Overview
Data evaluation module provides:
- Quality Metrics: Rich data quality evaluation metrics
- Automatic Evaluation: Auto-execute evaluation tasks
- Manual Evaluation: Manual sampling evaluation
- Evaluation Reports: Generate detailed reports
- Quality Tracking: Track data quality trends
Evaluation Dimensions
Data Completeness
| Metric | Description | Calculation |
|---|---|---|
| Null Rate | Null value ratio | Null count / Total count |
| Missing Field Rate | Required field missing rate | Missing fields / Total fields |
| Record Complete Rate | Complete record ratio | Complete records / Total records |
Data Accuracy
| Metric | Description | Calculation |
|---|---|---|
| Format Correct Rate | Format compliance | Format correct / Total |
| Value Range Compliance | In valid range | In range / Total |
| Consistency Rate | Data consistency | Consistent records / Total |
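The completeness formulas above translate directly into code. A sketch (treating None and empty strings as nulls is an assumption here; DataMate's null definition may differ):

```python
def null_rate(records, field):
    """Null count / total count (None and "" counted as null here)."""
    if not records:
        return 0.0
    nulls = sum(1 for r in records if r.get(field) in (None, ""))
    return nulls / len(records)

def record_complete_rate(records, required_fields):
    """Complete records / total records."""
    if not records:
        return 0.0
    complete = sum(1 for r in records
                   if all(r.get(f) not in (None, "") for f in required_fields))
    return complete / len(records)

rows = [{"name": "a", "email": "a@x.com"},
        {"name": "b", "email": None},
        {"name": "", "email": "c@x.com"}]
# null_rate(rows, "email") -> 1/3; record_complete_rate(rows, ["name", "email"]) -> 1/3
```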
Quick Start
1. Create Evaluation Task
Step 1: Enter Data Evaluation Page
Select Data Evaluation in the left navigation.
Step 2: Create Task
Click Create Task.
- Task name: e.g., data_quality_evaluation
- Evaluation dataset: Select the dataset to evaluate
Select dimensions:
- ✅ Data completeness
- ✅ Data accuracy
- ✅ Data uniqueness
- ✅ Data timeliness
Completeness Rules:
Required fields: name, email, phone
Null threshold: 5% (warn if exceeded)
2. Execute Evaluation
Automatic Evaluation
Auto-executes after creation, or click Execute Now.
Manual Evaluation
- Click Manual Evaluation tab
- View samples to evaluate
- Manually evaluate quality
- Submit results
3. View Evaluation Report
Overall Score
Overall Quality Score: 85 (Excellent)
Completeness: 90 ⭐⭐⭐⭐⭐
Accuracy: 82 ⭐⭐⭐⭐
Uniqueness: 95 ⭐⭐⭐⭐⭐
Timeliness: 75 ⭐⭐⭐⭐
Detailed Metrics
Completeness:
- Null rate: 3.2% ✅
- Missing field rate: 1.5% ✅
- Record complete rate: 96.8% ✅
Advanced Features
Custom Evaluation Rules
Regex Validation
Field: phone
Rule: ^1[3-9]\d{9}$
Description: China mobile phone number
Value Range Validation
Field: age
Min value: 0
Max value: 120
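Both custom rules above can be expressed directly in Python. A sketch using the exact regex and range from the examples (the function names are illustrative, not a DataMate API):

```python
import re

def validate_phone(value):
    """Regex rule from the example above (China mainland mobile numbers)."""
    return bool(re.fullmatch(r"1[3-9]\d{9}", value))

def validate_range(value, min_value=0, max_value=120):
    """Value-range rule from the example above."""
    return min_value <= value <= max_value

assert validate_phone("13812345678")
assert not validate_phone("12345")
assert validate_range(42) and not validate_range(150)
```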
Comparison Evaluation
Compare different datasets or versions.
Best Practices
1. Regular Evaluation
Recommended schedule:
- Daily: Critical data
- Weekly: General data
- Monthly: All data
2. Establish Baseline
Create quality baseline for each dataset.
3. Continuous Improvement
Based on evaluation results:
- Clean problem data
- Optimize collection process
- Update validation rules
Common Questions
Q: Evaluation task failed?
A: Troubleshoot:
- Check dataset exists
- Check rule configuration
- View execution logs
- Test with small sample size
API Reference
3.7 - Knowledge Base Management
Build and manage RAG knowledge bases with DataMate
Knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.
Features Overview
Knowledge base management module provides:
- Document upload: Support multiple document formats
- Text chunking: Intelligent text splitting strategies
- Vectorization: Automatic text-to-vector conversion
- Vector search: Semantic similarity-based retrieval
- Knowledge base Q&A: RAG-based intelligent Q&A
Supported Document Formats
| Format | Description | Recommended For |
|---|---|---|
| TXT | Plain text | General text |
| PDF | PDF documents | Documents, reports |
| Markdown | Markdown files | Technical docs |
| JSON | JSON data | Structured data |
| CSV | CSV tables | Tabular data |
| DOCX | Word documents | Office documents |
Quick Start
1. Create Knowledge Base
Step 1: Enter Knowledge Base Page
In the left navigation, select Knowledge Generation.
Step 2: Create Knowledge Base
Click Create Knowledge Base button in upper right.
- Knowledge base name: e.g., company_docs_kb
- Knowledge base description: Describe the purpose (optional)
- Knowledge base type: General / Professional domain
- Embedding model: Select the embedding model
  - OpenAI text-embedding-ada-002
  - BGE-M3
  - Custom model
- Vector dimension: Auto-set based on the model
- Index type: IVF_FLAT / HNSW / IVF_PQ
2. Upload Documents
Step 1: Enter Knowledge Base Details
Click knowledge base name to enter details.
Step 2: Upload Documents
- Click Upload Document button
- Select local files
- Wait for upload completion
System will automatically:
- Parse document content
- Chunk text
- Generate vectors
- Build index
3. Vector Search
Step 1: Enter Search Page
In knowledge base details page, click Vector Search tab.
Step 2: Enter Query
Enter query in search box, e.g.:
How to use DataMate for data cleaning?
Step 3: View Search Results
System returns most relevant text chunks with similarity scores:
| Rank | Text Chunk | Similarity | Source Doc | Actions |
|---|---|---|---|---|
| 1 | DataMate’s data cleaning module… | 0.92 | user_guide.pdf | View |
| 2 | Configure cleaning task… | 0.87 | tutorial.md | View |
| 3 | Cleaning operator list… | 0.81 | reference.txt | View |
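Under the hood, results like these are ranked by vector similarity. A toy sketch of cosine-similarity retrieval over two-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and DataMate uses a vector index such as HNSW rather than a linear scan):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=3):
    """Rank stored (text, vector) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]

index = [("data cleaning module", [1.0, 0.0]),
         ("pricing page", [0.0, 1.0])]
hits = search([0.9, 0.1], index)
# hits[0][1] == "data cleaning module"
```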
4. Knowledge Base Q&A (RAG)
Step 1: Enable RAG
In knowledge base details page, click RAG Q&A tab.
Step 2: Configure Q&A Parameters
- LLM: Select the LLM to use
- Retrieval count: Number of text chunks to retrieve
- Temperature: Control generation randomness
- Prompt template: Customize the Q&A template
Step 3: Q&A
Enter question in dialog box, e.g.:
User: What data cleaning operators does DataMate support?
Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]
Best Practices
1. Document Preparation
Before uploading documents:
- Unify format: Convert to unified format (PDF, Markdown)
- Clean content: Remove irrelevant content (headers, ads)
- Maintain structure: Keep good document structure
- Add metadata: Add document metadata (author, date, tags)
2. Chunking Strategy Selection
Choose based on document type:
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Technical docs | Paragraph chunking | - |
| Long reports | Semantic chunking | - |
| Short text | Character chunking | 500 |
| Code | Character chunking | 300 |
Common Questions
Q: Document stuck in “Processing”?
A: Check:
- Document format: Ensure format is supported
- Document size: Single document under 100MB
- Vector service: Check if vector service is running
- View logs: Check detailed error messages
Q: Inaccurate search results?
A: Optimization suggestions:
- Adjust chunking: Try different chunking methods
- Increase chunk size: Add more context
- Use reranking: Enable reranking model
- Optimize query: Use clearer query statements
- Change embedding model: Try other models
API Reference
For detailed API documentation, see:
3.8 - Operator Market
Manage and use DataMate operators
Operator marketplace provides rich data processing operators and supports custom operator development.
Features Overview
Operator marketplace provides:
- Built-in Operators: Rich built-in data processing operators
- Operator Publishing: Publish and share custom operators
- Operator Installation: Install third-party operators
- Custom Development: Develop custom operators
Built-in Operators
Data Cleaning Operators
| Operator | Function | Input | Output |
|---|---|---|---|
| Deduplication | Remove duplicates | Dataset | Deduplicated data |
| Null Handler | Handle nulls | Dataset | Filled data |
| Format Converter | Convert format | Original format | New format |
Text Processing Operators
| Operator | Function |
|---|---|
| Text Segmentation | Chinese word segmentation |
| Remove Stopwords | Remove common stopwords |
| Text Cleaning | Clean special characters |
Quick Start
1. Browse Operators
Step 1: Enter Operator Market
Select Operator Market in the left navigation.
Step 2: Browse Operators
View all available operators with ratings and installation counts.
2. Install Operator
Install Built-in Operator
Built-in operators are installed by default.
Install Third-party Operator
- In operator details page, click Install
- Wait for installation completion
3. Use Operator
After installation, use in:
- Data Cleaning: Add operator node to cleaning pipeline
- Pipeline Orchestration: Add operator node to workflow
Advanced Features
Develop Custom Operator
Create Operator
- In operator market page, click Create Operator
- Fill operator information
- Write operator code (Python)
- Package and publish
Python Operator Example:
import re

class MyTextCleaner:
    """Example custom operator: strips punctuation from string inputs."""

    def __init__(self, config):
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Only string inputs are cleaned; everything else passes through.
        if isinstance(data, str):
            result = data
            if self.remove_special_chars:
                result = re.sub(r'[^\w\s]', '', result)
            return result
        return data

# Usage: MyTextCleaner({'remove_special_chars': True}).process('Hello, world!')
# returns 'Hello world'
Best Practices
1. Operator Design
Good operator design:
- Single responsibility: One operator does one thing
- Configurable: Rich configuration options
- Error handling: Comprehensive error handling
- Performance: Consider large-scale data
Common Questions
Q: Operator execution failed?
A: Troubleshoot:
- View logs
- Check configuration
- Check data format
- Test locally
3.9 - Pipeline Orchestration
Visual workflow orchestration with DataMate
Pipeline orchestration module provides drag-and-drop visual interface for designing and managing complex data processing workflows.
Features Overview
Pipeline orchestration provides:
- Visual Designer: Drag-and-drop workflow design
- Rich Node Types: Data processing, conditions, loops, etc.
- Flow Execution: Auto-execute and monitor workflows
- Template Management: Save and reuse flow templates
- Version Management: Flow version control
Node Types
Data Nodes
| Node | Function | Config |
|---|---|---|
| Input Dataset | Read from dataset | Select dataset |
| Output Dataset | Write to dataset | Select dataset |
| Data Collection | Execute collection task | Select task |
| Data Cleaning | Execute cleaning task | Select task |
| Data Synthesis | Execute synthesis task | Select task |
Logic Nodes
| Node | Function | Config |
|---|---|---|
| Condition Branch | Execute different branches | Condition expression |
| Loop | Repeat execution | Loop count/condition |
| Parallel | Execute multiple branches in parallel | Branch count |
| Wait | Wait for specified time | Duration |
Quick Start
1. Create Pipeline
Step 1: Enter Pipeline Orchestration Page
Select Pipeline Orchestration in left navigation.
Step 2: Create Pipeline
Click Create Pipeline.
Step 3: Fill Basic Information
- Pipeline name: e.g., data_processing_pipeline
- Description: Pipeline purpose (optional)
Step 4: Design Flow
- Drag nodes from left library to canvas
- Connect nodes
- Configure node parameters
- Save flow
Example:
Input Dataset → Data Cleaning → Condition Branch
├── Satisfied → Data Annotation → Output
└── Not Satisfied → Data Synthesis → Output
2. Execute Pipeline
Step 1: Enter Execution Page
Click pipeline name to enter details.
Step 2: Execute Pipeline
Click Execute Now.
Step 3: Monitor Execution
View execution status, progress, and logs.
Advanced Features
Flow Templates
Save as Template
- Design flow
- Click Save as Template
- Enter template name
Use Template
- Create pipeline, click Use Template
- Select template
- Load to designer
Parameterized Flow
Define parameters in pipeline:
{
"parameters": [
{
"name": "input_dataset",
"type": "dataset",
"required": true
}
]
}
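Before executing a parameterized pipeline, every required parameter must be supplied. A minimal validation sketch in Python (the schema shape mirrors the JSON above; the function name is illustrative, not part of the DataMate API):

```python
def validate_parameters(schema, values):
    """Check supplied values against a pipeline parameter schema.

    schema: list of parameter definitions, as in the pipeline JSON above.
    values: dict mapping parameter name to the supplied value.
    Raises ValueError when a required parameter is missing.
    """
    missing = [p["name"] for p in schema
               if p.get("required") and p["name"] not in values]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    return values

# Schema matching the example above
schema = [{"name": "input_dataset", "type": "dataset", "required": True}]
```

Calling `validate_parameters(schema, {})` raises, while supplying `input_dataset` passes the values through unchanged.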
Scheduled Execution
Configure scheduled execution:
- Cron expression: 0 0 2 * * ? (daily at 2 AM)
- Execution parameters
Best Practices
1. Flow Design
Recommended principles:
- Modular: Split complex flows
- Reusable: Use templates
- Maintainable: Add comments
- Testable: Test individually
Optimize performance:
- Parallelize: Use parallel nodes
- Reduce data transfer: Process locally
- Batch operations: Use batch operations
- Cache results: Cache intermediate results
Common Questions
Q: Flow execution failed?
A: Troubleshoot:
- View execution logs
- Check node configuration
- Check data format
- Test nodes individually
3.10 - Agent Chat
Use DataMate Agent for intelligent conversation
The agent chat module integrates large language models to provide intelligent Q&A and knowledge base retrieval capabilities.
Features Overview
Agent chat module provides:
- Intelligent Chat: Natural language conversation based on LLMs
- Knowledge Base Q&A: RAG-integrated knowledge base Q&A
- Multi-turn Dialogue: Support context-aware multi-turn conversation
- Dialogue Management: Conversation history and management
- Quick Commands: Quick commands for common operations
Quick Start
1. Access Agent
Select Agent Chat in left navigation, or visit:
- Local: http://localhost:30000/chat
- Production: https://your-domain/chat
2. Start Conversation
Enter question in dialog box, e.g.:
User: What data formats does DataMate support?
Assistant: DataMate supports multiple data formats:
Images: JPG, PNG, GIF, BMP, WebP
Text: TXT, MD, JSON, CSV
Audio: MP3, WAV, FLAC, AAC
Video: MP4, AVI, MOV, MKV
These formats can be uploaded and managed in the data management module.
3. Use Knowledge Base Q&A
Enable Knowledge Base
- Click Settings button
- In Knowledge Base settings, select knowledge base to use
- Save settings
Knowledge Base Q&A
User: How to create a data cleaning task?
Assistant: According to the knowledge base documentation:
1. Enter data processing page
2. Click create task button
3. Configure basic information
4. Configure cleaning pipeline (drag operators to canvas)
5. Configure execution parameters
6. Create and execute task
[Source: user_guide.md, data_cleansing.md]
Advanced Features
Conversation Modes
General Chat
Use LLM for general conversation without knowledge base.
Knowledge Base Q&A
Answer questions based on knowledge base content.
Mixed Mode
Combine general chat and knowledge base Q&A.
Quick Commands
| Command | Function | Example |
|---|---|---|
| /dataset | Query datasets | /dataset list |
| /task | Query tasks | /task status |
| /help | Show help | /help |
| /clear | Clear conversation | /clear |
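Quick commands follow a simple slash-prefix shape. How such input could be dispatched is sketched below (illustrative only; the real parsing is internal to DataMate):

```python
def parse_command(text):
    """Split a quick command like '/dataset list' into (command, args).

    Returns None when the input is ordinary chat text rather than a command.
    """
    if not text.startswith("/"):
        return None
    parts = text.split()
    return parts[0], parts[1:]
```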
Conversation History
View History
- Click History tab on left
- Select historical conversation
- View conversation content
Continue Conversation
Click historical conversation to continue.
Export Conversation
Export conversation records:
- Markdown: Export as Markdown file
- JSON: Export as JSON
- PDF: Export as PDF
Best Practices
1. Effective Questioning
Get better answers:
- Be specific: Clear and specific questions
- Provide context: Include background information
- Break down: Split complex questions
2. Knowledge Base Usage
Make the most of knowledge base:
- Select appropriate knowledge base: Choose based on question
- View sources: Check answer source documents
- Verify information: Verify with source documents
Common Questions
Q: Inaccurate Agent answers?
A: Improve:
- Optimize question: More specific
- Check knowledge base: Ensure relevant content exists
- Change model: Try more powerful model
- Provide context: More background info
4 - API Reference
DataMate API documentation
DataMate provides complete REST APIs supporting programmatic access to all core features.
API Overview
DataMate API is based on REST architecture design, providing the following services:
- Data Management API: Dataset and file management
- Data Cleaning API: Data cleaning task management
- Data Collection API: Data collection task management
- Data Annotation API: Data annotation task management
- Data Synthesis API: Data synthesis task management
- Data Evaluation API: Data evaluation task management
- Operator Market API: Operator management
- RAG Indexer API: Knowledge base and vector retrieval
- Pipeline Orchestration API: Pipeline orchestration management
Authentication
DataMate supports two authentication methods:
JWT Authentication (Recommended)
GET /api/v1/data-management/datasets
Authorization: Bearer <your-jwt-token>
Get JWT Token:
POST /api/v1/auth/login
Content-Type: application/json
{
"username": "admin",
"password": "password"
}
Response:
{
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"expiresIn": 86400
}
API Key Authentication
GET /api/v1/data-management/datasets
X-API-Key: <your-api-key>
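The two authentication methods differ only in the header they attach. A small helper that builds the right headers for either scheme (a sketch; header names match the examples above):

```python
def auth_headers(jwt_token=None, api_key=None):
    """Build request headers for DataMate's JWT or API-key authentication."""
    if jwt_token:
        return {"Authorization": f"Bearer {jwt_token}"}
    if api_key:
        return {"X-API-Key": api_key}
    raise ValueError("Provide either jwt_token or api_key")
```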
Success Response
{
"code": 200,
"message": "success",
"data": {
// Response data
}
}
Error Response
{
"code": 400,
"message": "Bad Request",
"error": "Invalid parameter: datasetId",
"timestamp": "2024-01-15T10:30:00Z",
"path": "/api/v1/data-management/datasets"
}
Paged Response
{
"content": [],
"page": 0,
"size": 20,
"totalElements": 100,
"totalPages": 5,
"first": true,
"last": false
}
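Clients typically walk paged responses until `last` is true. That loop can be written once against any page-fetching callable and reused across endpoints (an illustrative sketch, not part of an official SDK):

```python
def iter_all(fetch_page, size=20):
    """Yield every item from a paged endpoint.

    fetch_page(page, size) must return a dict shaped like the paged
    response above ("content", "last", ...).
    """
    page = 0
    while True:
        resp = fetch_page(page, size)
        yield from resp["content"]
        if resp.get("last", True):
            break
        page += 1
```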
API Endpoints
Data Management
| Endpoint | Method | Description |
|---|---|---|
| /data-management/datasets | GET | Get dataset list |
| /data-management/datasets | POST | Create dataset |
| /data-management/datasets/{id} | GET | Get dataset details |
| /data-management/datasets/{id} | PUT | Update dataset |
| /data-management/datasets/{id} | DELETE | Delete dataset |
| /data-management/datasets/{id}/files | GET | Get file list |
| /data-management/datasets/{id}/files/upload | POST | Upload files |
Data Cleaning
| Endpoint | Method | Description |
|---|---|---|
| /data-cleaning/tasks | GET | Get cleaning task list |
| /data-cleaning/tasks | POST | Create cleaning task |
| /data-cleaning/tasks/{id} | GET | Get task details |
| /data-cleaning/tasks/{id} | PUT | Update task |
| /data-cleaning/tasks/{id} | DELETE | Delete task |
| /data-cleaning/tasks/{id}/execute | POST | Execute task |
Data Collection
| Endpoint | Method | Description |
|---|---|---|
| /data-collection/tasks | GET | Get collection task list |
| /data-collection/tasks | POST | Create collection task |
| /data-collection/tasks/{id} | GET | Get task details |
| /data-collection/tasks/{id}/execute | POST | Execute collection task |
Data Synthesis
| Endpoint | Method | Description |
|---|---|---|
| /data-synthesis/tasks | GET | Get synthesis task list |
| /data-synthesis/tasks | POST | Create synthesis task |
| /data-synthesis/templates | GET | Get instruction template list |
| /data-synthesis/templates | POST | Create instruction template |
Operator Market
| Endpoint | Method | Description |
|---|---|---|
| /operator-market/operators | GET | Get operator list |
| /operator-market/operators | POST | Publish operator |
| /operator-market/operators/{id} | GET | Get operator details |
| /operator-market/operators/{id}/install | POST | Install operator |
RAG Indexer
| Endpoint | Method | Description |
|---|---|---|
| /rag/knowledge-bases | GET | Get knowledge base list |
| /rag/knowledge-bases | POST | Create knowledge base |
| /rag/knowledge-bases/{id}/documents | POST | Upload documents |
| /rag/knowledge-bases/{id}/search | POST | Vector search |
Error Codes
| Code | Description |
|---|---|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 409 | Conflict |
| 500 | Internal Server Error |
Rate Limiting
API call rate limits:
- Default limit: 1000 requests/hour
- Burst limit: 100 requests/minute
Exceeding the limit returns 429 Too Many Requests.
Response headers contain rate limiting information:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1642252800
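When a 429 arrives, a well-behaved client waits until the window resets before retrying. A sketch of that computation using the headers above (pure arithmetic; wiring it into an HTTP client retry loop is left out):

```python
def seconds_until_reset(headers, now):
    """Seconds to wait before retrying, from X-RateLimit response headers.

    headers: response headers as a dict; now: current Unix timestamp.
    Returns 0 when requests remain in the current window.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    reset = int(headers["X-RateLimit-Reset"])  # Unix timestamp of window reset
    return max(0, reset - now)
```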
Version Management
API versions are specified through URL paths:
- Current version: /api/v1/
- Future versions: /api/v2/
4.1 - Data Management API
Dataset and file management API
Data management API provides capabilities for dataset and file creation, query, update, and deletion.
- Base URL: http://localhost:8092/api/v1/data-management
- Authentication: JWT / API Key
- Content-Type: application/json
Dataset Management
Get Dataset List
GET /data-management/datasets?page=0&size=20&type=text
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| page | integer | No | Page number, starts from 0 |
| size | integer | No | Page size, default 20 |
| type | string | No | Dataset type filter |
| tags | string | No | Tag filter, comma-separated |
| keyword | string | No | Keyword search |
| status | string | No | Status filter |
Response Example:
{
"content": [
{
"id": "dataset-001",
"name": "text_dataset",
"description": "Text dataset",
"type": {
"code": "TEXT",
"name": "Text"
},
"status": "ACTIVE",
"fileCount": 1000,
"totalSize": 1073741824,
"createdAt": "2024-01-15T10:00:00Z"
}
],
"page": 0,
"size": 20,
"totalElements": 1
}
Create Dataset
POST /data-management/datasets
Content-Type: application/json
{
"name": "my_dataset",
"description": "My dataset",
"type": "TEXT",
"tags": ["training", "nlp"]
}
Get Dataset Details
GET /data-management/datasets/{datasetId}
Update Dataset
PUT /data-management/datasets/{datasetId}
Content-Type: application/json
{
"name": "updated_dataset",
"description": "Updated description"
}
Delete Dataset
DELETE /data-management/datasets/{datasetId}
File Management
Get File List
GET /data-management/datasets/{datasetId}/files?page=0&size=20
Upload File
POST /data-management/datasets/{datasetId}/files/upload/chunk
Content-Type: multipart/form-data
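The chunked upload endpoint expects the client to split large files into sequential parts. The client-side splitting can be sketched as follows (the chunk size and tuple layout are assumptions for illustration; check the endpoint's actual contract):

```python
def split_chunks(data, chunk_size=5 * 1024 * 1024):
    """Split raw bytes into sequential chunks for a chunked upload.

    Yields (chunk_index, total_chunks, chunk_bytes) tuples; each tuple
    would back one multipart/form-data POST to the upload endpoint.
    """
    total = (len(data) + chunk_size - 1) // chunk_size or 1
    for i in range(total):
        yield i, total, data[i * chunk_size:(i + 1) * chunk_size]
```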
Download File
GET /data-management/datasets/{datasetId}/files/{fileId}/download
Delete File
DELETE /data-management/datasets/{datasetId}/files/{fileId}
Error Response
{
"code": 400,
"message": "Bad Request",
"error": "Invalid parameter: datasetId",
"timestamp": "2024-01-15T10:30:00Z",
"path": "/api/v1/data-management/datasets"
}
SDK Usage
Python
from datamate import DataMateClient
client = DataMateClient(
base_url="http://localhost:8080",
api_key="your-api-key"
)
# Get datasets
datasets = client.data_management.get_datasets()
# Create dataset
dataset = client.data_management.create_dataset(
name="my_dataset",
type="TEXT"
)
cURL
# Get datasets
curl -X GET "http://localhost:8092/api/v1/data-management/datasets" \
-H "Authorization: Bearer your-jwt-token"
# Create dataset
curl -X POST "http://localhost:8092/api/v1/data-management/datasets" \
-H "Authorization: Bearer your-jwt-token" \
-H "Content-Type: application/json" \
-d '{
"name": "my_dataset",
"type": "TEXT"
}'
5 - Developer Guide
DataMate architecture and development guide
Developer guide introduces DataMate’s technical architecture, development environment, and contribution process.
DataMate is an enterprise-level data processing platform using microservices architecture, supporting large-scale data processing and custom extensions.
Architecture Documentation
Development Guide
Tech Stack
Frontend
| Technology | Version | Description |
|---|---|---|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI component library |
| Redux Toolkit | 2.x | State management |
| Vite | 5.x | Build tool |
Backend (Java)
| Technology | Version | Description |
|---|---|---|
| Java | 21 | Runtime environment |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.x | ORM framework |
Backend (Python)
| Technology | Version | Description |
|---|---|---|
| Python | 3.11+ | Runtime environment |
| FastAPI | 0.100+ | Web framework |
| LangChain | 0.1+ | LLM framework |
| Ray | 2.x | Distributed computing |
Project Structure
DataMate/
├── backend/ # Java backend
│ ├── services/ # Microservice modules
│ ├── openapi/ # OpenAPI specs
│ └── scripts/ # Build scripts
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # Common components
│ │ ├── pages/ # Page components
│ │ ├── services/ # API services
│ │ └── store/ # Redux store
│ └── package.json
├── runtime/ # Python runtime
│ └── datamate/ # DataMate runtime
└── deployment/ # Deployment config
├── docker/ # Docker config
└── helm/ # Helm Charts
Quick Start
1. Clone Code
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate
2. Start Services
# Start basic services
make install
# Access frontend
open http://localhost:30000
3. Development Mode
# Backend development
cd backend/services/main-application
mvn spring-boot:run
# Frontend development
cd frontend
pnpm dev
# Python service development
cd runtime/datamate
python operator_runtime.py --port 8081
Core Concepts
Microservices Architecture
DataMate uses microservices architecture, each service handles specific business functions:
- API Gateway: Unified entry, routing, authentication
- Main Application: Core business logic
- Data Management Service: Dataset management
- Data Cleaning Service: Data cleaning
- Data Synthesis Service: Data synthesis
- Runtime Service: Operator execution
Operator System
Operators are basic units of data processing:
- Built-in operators: Common operators provided by platform
- Custom operators: User-developed custom operators
- Operator execution: Executed by Runtime Service
Pipeline Orchestration
Pipelines are implemented through visual orchestration:
- Nodes: Basic units of data processing
- Connections: Data flow between nodes
- Execution: Automatic execution according to workflow
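Nodes and connections form a directed acyclic graph, and execution runs a node once all of its upstream nodes have finished. A minimal topological-order sketch of that idea (illustrative of the concept only, not DataMate's actual scheduler):

```python
from collections import deque

def execution_order(nodes, edges):
    """Return a valid execution order for a pipeline DAG.

    nodes: iterable of node ids; edges: list of (upstream, downstream) pairs.
    """
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for up, down in edges:
        indegree[down] += 1
        downstream[up].append(down)
    # Start from nodes with no upstream dependencies
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("Pipeline contains a cycle")
    return order
```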
Extension Development
Develop Custom Operators
Operator development guide:
- Operator Market - Operator usage guide
- Python operator development examples
- Operator testing and debugging
Integrate External Systems
- API integration: Integration via REST API
- Webhook: Event notifications
- Plugin system: (Coming soon)
Testing
Unit Tests
# Backend tests
cd backend
mvn test
# Frontend tests
cd frontend
pnpm test
# Python tests
cd runtime
pytest
Integration Tests
# Start test environment
make test-env-up
# Run integration tests
make integration-test
# Clean test environment
make test-env-down
Backend Optimization
- Database connection pool configuration
- Query optimization
- Caching strategies
- Asynchronous processing
Frontend Optimization
- Code splitting
- Lazy loading
- Caching strategies
Security
Authentication and Authorization
- JWT authentication
- RBAC permission control
- API Key authentication
Data Security
- Transport encryption (HTTPS/TLS)
- Storage encryption
- Sensitive data masking
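Sensitive-data masking typically replaces most of a value while keeping just enough context to recognize it. A hedged sketch of one common approach (the platform's actual masking rules are not documented here):

```python
def mask(value, keep=2, char="*"):
    """Mask a string, keeping the first and last `keep` characters."""
    if len(value) <= keep * 2:
        return char * len(value)
    return value[:keep] + char * (len(value) - keep * 2) + value[-keep:]
```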
5.1 - Backend Architecture
DataMate Java backend architecture design
DataMate backend adopts microservices architecture built on Spring Boot 3.x and Spring Cloud.
Architecture Overview
DataMate backend uses microservices architecture, splitting into multiple independent services:
┌─────────────────────────────────────────────┐
│ API Gateway │
│ (Spring Cloud Gateway) │
│ Port: 8080 │
└──────────────┬──────────────────────────────┘
│
┌───────┴───────┬───────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Main │ │ Data │ │ Data │
│ Application │ │ Management │ │ Collection │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└───────────────┴───────────────┘
│
▼
┌────────────────┐
│ PostgreSQL │
│ Port: 5432 │
└────────────────┘
Tech Stack
Core Frameworks
| Technology | Version | Purpose |
|---|---|---|
| Java | 21 | Programming language |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.5.x | ORM framework |
Support Components
| Technology | Version | Purpose |
|---|---|---|
| Redis | 5.x | Cache and message queue |
| MinIO | 8.x | Object storage |
| Milvus SDK | 2.3.x | Vector database |
Microservices List
API Gateway
Port: 8080
Functions:
- Unified entry point
- Route forwarding
- Authentication and authorization
- Rate limiting and circuit breaking
Tech: Spring Cloud Gateway, JWT authentication
Main Application
Functions:
- User management
- Permission management
- System configuration
- Task scheduling
Data Management Service
Port: 8092
Functions:
- Dataset management
- File management
- Tag management
- Statistics
API Endpoints:
- /data-management/datasets - Dataset management
- /data-management/datasets/{id}/files - File management
Runtime Service
Port: 8081
Functions:
- Operator execution
- Ray integration
- Task scheduling
Tech: Python + Ray, FastAPI
Database Design
Main Tables
users (User Table)
| Field | Type | Description |
|---|---|---|
| id | BIGINT | Primary key |
| username | VARCHAR(50) | Username |
| password | VARCHAR(255) | Password (encrypted) |
| email | VARCHAR(100) | Email |
| role | VARCHAR(20) | Role |
| created_at | TIMESTAMP | Creation time |
datasets (Dataset Table)
| Field | Type | Description |
|---|---|---|
| id | VARCHAR(50) | Primary key |
| name | VARCHAR(100) | Name |
| description | TEXT | Description |
| type | VARCHAR(20) | Type |
| status | VARCHAR(20) | Status |
| created_by | VARCHAR(50) | Creator |
Service Communication
Synchronous Communication
Services communicate via HTTP/REST:
// Using Feign Client
@FeignClient(name = "data-management-service")
public interface DataManagementClient {
@GetMapping("/data-management/datasets/{id}")
DatasetResponse getDataset(@PathVariable String id);
}
Asynchronous Communication
Using Redis for async messaging:
// Send message
redisTemplate.convertAndSend("task.created", taskMessage);
// Receive message
@RedisListener(topic = "task.created")
public void handleTaskCreated(TaskMessage message) {
// Handle task creation event
}
Authentication & Authorization
JWT Authentication
@Configuration
public class JwtConfig {
@Value("${datamate.jwt.secret}")
private String secret;
@Value("${datamate.jwt.expiration}")
private Long expiration;
}
RBAC
@PreAuthorize("hasRole('ADMIN')")
public void adminOperation() {
// Admin operations
}
Database Connection Pool
spring:
datasource:
hikari:
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 30000
Caching Strategy
@Cacheable(value = "datasets", key = "#id")
public Dataset getDataset(String id) {
return datasetRepository.findById(id);
}
5.2 - Frontend Architecture
DataMate React frontend architecture design
DataMate frontend is built on React 18 and TypeScript with modern frontend architecture.
Architecture Overview
DataMate frontend adopts SPA architecture:
┌─────────────────────────────────────────────┐
│ Browser │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ React App │
│ ┌──────────────────────────────────────┐ │
│ │ Components │ │
│ └──────────────────────────────────────┘ │
│ ┌──────────────────────────────────────┐ │
│ │ State Management │ │
│ │ (Redux Toolkit) │ │
│ └──────────────────────────────────────┘ │
│ ┌──────────────────────────────────────┐ │
│ │ Services (API) │ │
│ └──────────────────────────────────────┘ │
│ ┌──────────────────────────────────────┐ │
│ │ Routing │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Tech Stack
Core Frameworks
| Technology | Version | Purpose |
|---|---|---|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI components |
| Tailwind CSS | 3.x | Styling |
State Management
| Technology | Version | Purpose |
|---|---|---|
| Redux Toolkit | 2.x | Global state |
| React Query | 5.x | Server state |
Project Structure
frontend/
├── src/
│ ├── components/ # Common components
│ ├── pages/ # Page components
│ ├── services/ # API services
│ ├── store/ # Redux store
│ ├── hooks/ # Custom hooks
│ ├── routes/ # Routes config
│ └── main.tsx # Entry point
Routing Design
const router = createBrowserRouter([
{ path: "/", Component: Home },
{ path: "/chat", Component: AgentPage },
{
path: "/data",
Component: MainLayout,
children: [
{
path: "management",
Component: DatasetManagement
}
]
}
]);
State Management
export const store = configureStore({
reducer: {
dataManagement: dataManagementSlice,
user: userSlice,
},
});
Slice Example
export const fetchDatasets = createAsyncThunk(
'dataManagement/fetchDatasets',
async (params: GetDatasetsParams) => {
const response = await getDatasets(params);
return response.data;
}
);
Component Design
Page Component
export const DataManagement: React.FC = () => {
const dispatch = useAppDispatch();
const { datasets, loading } = useAppSelector(
(state) => state.dataManagement
);
useEffect(() => {
dispatch(fetchDatasets({ page: 0, size: 20 }));
}, [dispatch]);
return (
<div className="p-6">
<h1>Data Management</h1>
<DataTable data={datasets} loading={loading} />
</div>
);
};
API Services
Axios Configuration
const request = axios.create({
baseURL: import.meta.env.VITE_API_BASE_URL,
timeout: 30000,
});
// Request interceptor
request.interceptors.request.use((config) => {
const token = localStorage.getItem('token');
if (token) {
config.headers.Authorization = `Bearer ${token}`;
}
return config;
});
Code Splitting
const DataManagement = lazy(() =>
import('@/pages/DataManagement/Home/DataManagement')
);
React.memo
export const DataCard = React.memo<DataCardProps>(({ data }) => {
return <div>{data.name}</div>;
});
6 - Appendix
Configuration, troubleshooting, and other reference information
Appendix contains configuration parameters, troubleshooting, and other reference information.
Appendix Content
Configuration
Detailed system configuration documentation:
- Environment Variables: All configurable environment variables
- application.yml: Spring Boot configuration file
- Docker Compose: Container configuration
- Kubernetes: K8s configuration
Troubleshooting
Common issue troubleshooting steps and solutions:
- Service startup issues: Container startup failures
- Database connection issues: Database connection failures
- Frontend issues: Page loading, API requests
- Task execution issues: Tasks stuck, execution failures
- Performance issues: Slow response, memory overflow
Other References
Technical Support
If you encounter issues:
- Check Troubleshooting documentation
- Search GitHub Issues
- Submit a new issue with detailed information
Contributing
Contributions to DataMate are welcome:
- Report bugs
- Propose new features
- Submit code contributions
- Improve documentation
See Contribution Guide for details.
6.1 - Configuration
DataMate system configuration parameters
This document details various configuration parameters of the DataMate system.
Environment Variables
Common Configuration
| Variable | Default | Description |
|---|---|---|
| DB_PASSWORD | password | Database password |
| DATAMATE_JWT_ENABLE | false | Enable JWT authentication |
| REGISTRY | ghcr.io/modelengine-group/ | Image registry |
| VERSION | latest | Image version tag |
Database Configuration
| Variable | Default | Description |
|---|---|---|
| DB_HOST | datamate-database | Database host |
| DB_PORT | 5432 | Database port |
| DB_NAME | datamate | Database name |
| DB_USER | postgres | Database username |
| DB_PASSWORD | password | Database password |
Redis Configuration
| Variable | Default | Description |
|---|---|---|
| REDIS_HOST | datamate-redis | Redis host |
| REDIS_PORT | 6379 | Redis port |
| REDIS_PASSWORD | - | Redis password (optional) |
| REDIS_DB | 0 | Redis database number |
Milvus Configuration
| Variable | Default | Description |
|---|---|---|
| MILVUS_HOST | milvus | Milvus host |
| MILVUS_PORT | 19530 | Milvus port |
| MILVUS_INDEX_TYPE | IVF_FLAT | Vector index type |
| MILVUS_EMBEDDING_DIM | 768 | Vector dimension |
MinIO Configuration
| Variable | Default | Description |
|---|---|---|
| MINIO_ENDPOINT | minio:9000 | MinIO endpoint |
| MINIO_ACCESS_KEY | minioadmin | Access key |
| MINIO_SECRET_KEY | minioadmin | Secret key |
| MINIO_BUCKET | datamate | Bucket name |
LLM Configuration
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | - | OpenAI API key |
| OPENAI_BASE_URL | https://api.openai.com/v1 | API base URL |
| OPENAI_MODEL | gpt-4 | Model to use |
JWT Configuration
| Variable | Default | Description |
|---|---|---|
| JWT_SECRET | default-insecure-key | JWT secret (CHANGE IN PRODUCTION) |
| JWT_EXPIRATION | 86400 | Token expiration (seconds) |
Logging Configuration
| Variable | Default | Description |
|---|---|---|
| LOG_LEVEL | INFO | Log level |
| LOG_PATH | /var/log/datamate | Log path |
application.yml Configuration
Main Config
datamate:
jwt:
enable: ${DATAMATE_JWT_ENABLE:false}
secret: ${JWT_SECRET:default-insecure-key}
expiration: ${JWT_EXPIRATION:86400}
storage:
type: minio
endpoint: ${MINIO_ENDPOINT:minio:9000}
access-key: ${MINIO_ACCESS_KEY:minioadmin}
secret-key: ${MINIO_SECRET_KEY:minioadmin}
Spring Boot Config
spring:
datasource:
url: jdbc:postgresql://${DB_HOST:datamate-database}:${DB_PORT:5432}/${DB_NAME:datamate}
username: ${DB_USER:postgres}
password: ${DB_PASSWORD:password}
jpa:
hibernate:
ddl-auto: validate
show-sql: false
server:
port: 8092
Docker Compose Configuration
Environment Variables
services:
datamate-backend:
environment:
- DB_PASSWORD=${DB_PASSWORD:-password}
- LOG_LEVEL=${LOG_LEVEL:-INFO}
Resource Limits
services:
datamate-backend:
deploy:
resources:
limits:
cpus: '2'
memory: 4G
Kubernetes Configuration
ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: datamate-config
data:
LOG_LEVEL: "INFO"
Secret
apiVersion: v1
kind: Secret
metadata:
name: datamate-secret
type: Opaque
data:
DB_PASSWORD: cGFzc3dvcmQ= # base64 encoded
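Kubernetes Secret values must be base64-encoded, as in the manifest above. Encoding them from Python:

```python
import base64

def encode_secret(value):
    """Base64-encode a secret value for a Kubernetes Secret manifest."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")
```

For example, `encode_secret("password")` produces the `cGFzc3dvcmQ=` value shown in the manifest. (Alternatively, use a `stringData` field to let Kubernetes encode values for you.)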
Database Connection Pool
spring:
datasource:
hikari:
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 30000
JVM Parameters
JAVA_OPTS="-Xms2g -Xmx4g -XX:+UseG1GC"
6.2 - Troubleshooting
Common issues and solutions for DataMate
This document provides troubleshooting steps and solutions for common DataMate issues.
Service Startup Issues
Service Won’t Start
Symptoms
Service fails to start or exits immediately after running make install.
Troubleshooting Steps
- Check Port Conflicts
# Check port usage
lsof -i :8080 # API Gateway
lsof -i :30000 # Frontend
If port is occupied:
# Kill process
kill -9 <PID>
- View Container Logs
# View all containers
docker ps -a
# View specific container logs
docker logs datamate-backend
- Check Docker Resources
# View Docker system info
docker system df
# Clean unused resources
docker system prune -a
Common Causes and Solutions
| Cause | Solution |
|---|---|
| Port occupied | Kill process or modify port mapping |
| Insufficient memory | Increase Docker memory limit |
| Image not pulled | Run docker pull |
| Network issues | Check firewall and network config |
Troubleshooting
# View exit code
docker ps -a
# View detailed logs
docker logs <container-name> --tail 100
Database Connection Issues
Cannot Connect to Database
Troubleshooting Steps
- Check Database Container
docker ps | grep datamate-database
docker logs datamate-database
- Test Database Connection
# Enter database container
docker exec -it datamate-database psql -U postgres -d datamate
- Check Database Config
# Check environment variables
docker exec datamate-backend env | grep DB_
Frontend Issues
Frontend Not Accessible
Symptoms
Browser cannot access http://localhost:30000
Troubleshooting
- Check Frontend Container
docker ps | grep datamate-frontend
docker logs datamate-frontend
- Check Port Mapping
docker port datamate-frontend
API Request Failed
Troubleshooting
- Check Browser Console
Open browser DevTools → Network tab
- Check API Gateway
docker ps | grep datamate-gateway
docker logs datamate-gateway
- Test API
curl http://localhost:8080/actuator/health
Task Execution Issues
Task Stuck
Troubleshooting
- View Task Logs
docker logs datamate-backend --tail 100 | grep <task-id>
docker logs datamate-runtime --tail 100
- Check System Resources
Slow System Response
Troubleshooting
- Check System Resources
- Check Database Performance
-- View active queries
SELECT * FROM pg_stat_activity WHERE state = 'active';
Memory Overflow
Troubleshooting
# Check exit reason
docker inspect <container> | grep OOMKilled
Log Viewing
View Application Logs
# Backend logs
docker logs datamate-backend --tail 100 -f
# Frontend logs
docker logs datamate-frontend --tail 100 -f
Log File Locations
| Service | Log Path |
|---|---|
| Backend | /var/log/datamate/backend/app.log |
| Frontend | /var/log/datamate/frontend/ |
| Database | /var/log/datamate/database/ |
| Runtime | /var/log/datamate/runtime/ |
Getting Help
If issues persist:
Collect Information
- Error messages
- Log files
- System environment
- Reproduction steps
Search Existing Issues
Visit GitHub Issues
Submit New Issue
Include:
- DataMate version
- OS version
- Docker version
- Detailed error messages
- Reproduction steps
7 - Contribution Guide
Welcome to the DataMate project. We welcome all forms of contributions including documentation, code, testing, translation, etc.
DataMate is an enterprise-level open source data processing project dedicated to providing efficient data solutions for model training, AI applications, and data flywheel scenarios. We welcome all developers, document creators, and test engineers to participate through code commits, documentation optimization, issue feedback, and community support.
If this is your first time contributing to an open source project, we recommend reading Open Source Contribution Newbie Guide first, then proceed with this guide. All contributions must follow the DataMate Code of Conduct.
Contribution Scope and Methods
DataMate open source project contributions cover the following core scenarios. You can choose your participation based on your expertise:
| Contribution Type | Specific Content | Suitable For |
|---|---|---|
| Code Contribution | Core feature development, bug fixes, performance optimization, new feature proposals | Backend/frontend developers, data engineers |
| Documentation Contribution | User manual updates, API documentation improvements, tutorial writing, contribution guide optimization | Technical document creators, experienced users |
| Testing Contribution | Write unit/integration tests, feedback test issues, participate in compatibility testing | Test engineers, QA personnel |
| Community Contribution | Answer GitHub Issues, participate in community discussions, share use cases | All users, tech enthusiasts |
| Design Contribution | UI/UX optimization, logo/icon design, documentation visual upgrade | UI/UX designers, visual designers |
Thank you for choosing to participate in the DataMate open source project! Whether it’s code, documentation, or community support, every contribution helps the project grow and advances enterprise-level data processing technology. If you encounter any issues during the contribution process, feel free to seek help through community channels.
Getting Started
Development Environment
Before contributing, please set up your development environment:
- Clone Repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate
- Install Dependencies
# Backend dependencies
cd backend
mvn clean install
# Frontend dependencies
cd frontend
pnpm install
# Python dependencies
cd runtime
pip install -r requirements.txt
- Start Services
# Start basic services
make install dev=true
For detailed setup instructions, see the Developer Guide.
Code Contribution
Code Standards
Java Code Standards
/**
* User service
*
* @author Your Name
* @since 1.0.0
*/
public class UserService {
/**
* Get user by ID
*
* @param userId user ID
* @return user information
*/
public User getUserById(Long userId) {
// ...
}
}
TypeScript Code Standards
- Naming Conventions:
  - Components: PascalCase (UserProfile)
  - Types/Interfaces: PascalCase (UserData)
  - Functions: camelCase (getUserData)
  - Constants: UPPER_CASE (API_BASE_URL)
Python Code Standards
Follow PEP 8:
def get_user(user_id: int) -> dict:
"""
Get user information
Args:
user_id: User ID
Returns:
User information dictionary
"""
# ...
Submitting Code
1. Create Branch
git checkout -b feature/your-feature-name
Branch naming convention:
- `feature/` - New features
- `fix/` - Bug fixes
- `docs/` - Documentation updates
- `refactor/` - Refactoring
2. Make Changes
Follow the code standards mentioned above.
3. Write Tests
```shell
# Backend tests
mvn test

# Frontend tests
pnpm test

# Python tests
pytest
```
4. Commit Changes
```shell
git add .
git commit -m "feat: add new feature description"
```
Commit message format:
- `feat:` - New feature
- `fix:` - Bug fix
- `docs:` - Documentation changes
- `style:` - Code style changes
- `refactor:` - Refactoring
- `test:` - Adding tests
- `chore:` - Other changes
5. Push and Create PR
```shell
git push origin feature/your-feature-name
```
Then create a Pull Request on GitHub.
Documentation Contribution
Documentation Structure
Documentation is located in the /docs directory:
```
docs/
├── getting-started/   # Quick start
├── user-guide/        # User guide
├── api-reference/     # API reference
├── developer-guide/   # Developer guide
└── appendix/          # Appendix
```
Writing Documentation
1. Choose Language
Documentation is bilingual (Chinese and English). When updating documentation, please update both language versions.
2. Use Front Matter
Use Markdown format with Hugo front matter:
```markdown
---
title: Page Title
description: Page description
weight: 1
---

Content here...
```
3. Add Examples
Include code examples, commands, and use cases to help users understand.
4. Cross-Reference
Add links to related documentation:
```markdown
See [Data Management](/docs/user-guide/data-management/) for details.
```
Testing Contribution
Test Coverage
We aim for comprehensive test coverage:
- Unit Tests: Test individual functions and classes
- Integration Tests: Test service interactions
- E2E Tests: Test complete workflows
Writing Tests
Backend Tests (JUnit)
```java
@Test
public void testGetDataset() {
    // Arrange
    String datasetId = "test-dataset";

    // Act
    Dataset result = datasetService.getDataset(datasetId);

    // Assert
    assertNotNull(result);
    assertEquals("test-dataset", result.getId());
}
```
Frontend Tests (Jest + React Testing Library)
```typescript
test('renders data management page', () => {
  render(<DataManagement />);
  expect(screen.getByText('Data Management')).toBeInTheDocument();
});
```
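Python Tests (pytest)
A minimal pytest sketch following the same Arrange/Act/Assert layout; `parse_user` is a hypothetical helper used only for illustration, not a DataMate API:

```python
def parse_user(raw: dict) -> dict:
    """Normalize a raw user record (illustrative helper only)."""
    return {"id": int(raw["id"]), "name": raw["name"].strip()}


def test_parse_user():
    # Arrange
    raw = {"id": "42", "name": "  Alice "}
    # Act
    user = parse_user(raw)
    # Assert
    assert user == {"id": 42, "name": "Alice"}
```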
Reporting Issues
When you find a bug:
- Search existing GitHub Issues
- If not found, create new issue with:
- Clear title
- Detailed description
- Steps to reproduce
- Expected vs actual behavior
- Environment info
Design Contribution
UI/UX Guidelines
We use Ant Design as the UI component library. When contributing design changes:
- Follow Ant Design principles
- Ensure consistency with existing design
- Consider accessibility
- Test on different screen sizes
Design Assets
Design assets should be placed in:
- Frontend assets: `frontend/src/assets/`
- Documentation images: `content/en/docs/images/`
Code of Conduct
- Be respectful and inclusive
- Welcome newcomers and help them learn
- Focus on constructive feedback
- Collaborate openly
Communication Channels
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: General discussions
- Pull Requests: Code and documentation contributions
Getting Help
If you need help:
- Check existing documentation
- Search GitHub Issues
- Start a GitHub Discussion
Recognition
Contributors will be recognized in:
- Contributors List: In the documentation
- Release Notes: For significant contributions
- Community Highlights: For outstanding contributions
License
By contributing to DataMate, you agree that your contributions will be licensed under the MIT License.
Thank you for contributing to DataMate! Your contributions help make DataMate better for everyone. 🎉