Documentation

1 - Overview

DataMate - Enterprise-level Large Model Data Processing Platform

DataMate is an enterprise-level data processing platform designed for model fine-tuning and RAG retrieval. It provides comprehensive data processing capabilities including data collection, management, cleaning, annotation, synthesis, evaluation, and knowledge base management.

Product Positioning

DataMate is dedicated to solving data pain points in large model implementation, providing a one-stop data governance solution:

  • Full Lifecycle Coverage: From data collection to evaluation, covering the entire data processing lifecycle
  • Enterprise-grade Capabilities: Supports million-scale concurrent data processing with private deployment options
  • Flexible Extension: Rich built-in data processing operators with support for custom operator development
  • Visual Orchestration: Drag-and-drop pipeline design without coding for complex data processing workflows

Core Features

Data Collection

  • Heterogeneous data source collection capabilities based on DataX
  • Supports relational databases, NoSQL, file systems, and other data sources
  • Flexible task configuration and monitoring

Data Management

  • Unified dataset management supporting image, text, audio, video, and multimodal data types
  • Complete data operations: upload, download, preview
  • Tag and metadata management for easy data organization and retrieval

Data Cleaning

  • Rich built-in data cleaning operators
  • Visual cleaning template configuration
  • Supports both batch and stream processing modes
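
Conceptually, a cleaning operator is just a transform over records. The sketch below — a hypothetical whitespace-normalizing, de-duplicating operator; the name and interface are illustrative, not DataMate's operator API — shows the idea:

```python
def clean_records(records: list[str]) -> list[str]:
    """Normalize whitespace and drop empty rows and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.split())  # collapse runs of whitespace
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

print(clean_records(["  hello   world ", "hello world", "", "bye"]))
# ['hello world', 'bye']
```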

Data Annotation

  • Integrated Label Studio for professional annotation capabilities
  • Supports image classification, object detection, text classification, and other annotation types
  • Annotation review and quality control mechanisms

Data Synthesis

  • Data augmentation and synthesis capabilities based on large models
  • Instruction template management and customization
  • Proportional synthesis tasks for diverse data needs
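
The proportional-synthesis idea can be sketched as ratio-weighted sampling over labeled source pools (a minimal illustration; `sample_by_ratio` and its behavior are assumptions, not DataMate's API):

```python
import random

def sample_by_ratio(sources: dict[str, list], ratios: dict[str, float], total: int) -> list:
    """Draw `total` samples across sources according to the given ratios.

    Hypothetical helper illustrating proportional synthesis; not DataMate's API.
    """
    result = []
    for name, ratio in ratios.items():
        count = round(total * ratio)
        pool = sources[name]
        # Sample with replacement so small pools can still meet their quota.
        result.extend(random.choice(pool) for _ in range(count))
    return result

qa = [{"q": "What is RAG?", "a": "Retrieval-augmented generation."}]
chat = [{"q": "Hello", "a": "Hi there!"}]
mixed = sample_by_ratio({"qa": qa, "chat": chat}, {"qa": 0.7, "chat": 0.3}, total=10)
print(len(mixed))  # 10
```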

Data Evaluation

  • Multi-dimensional data quality evaluation metrics
  • Supports both automatic and manual evaluation
  • Detailed evaluation reports

Knowledge Base Management (RAG)

  • Supports multiple document formats for knowledge base construction
  • Automated text chunking and vectorization
  • Integrated vector retrieval for RAG applications
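
The chunking step can be pictured as a sliding window over the document text (a minimal sketch with assumed chunk size and overlap; DataMate's actual chunker is not shown here):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size, overlapping chunks as a RAG preprocessing sketch."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each window starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 3
```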

Operator Marketplace

  • Rich built-in data processing operators
  • Support for operator publishing and sharing
  • Custom operator development capabilities

Pipeline Orchestration

  • Visual drag-and-drop workflow design
  • Multiple node types and configurations
  • Pipeline execution monitoring and debugging

Agent Chat

  • Integrated large language model chat capabilities
  • Knowledge base Q&A
  • Conversation history management

Technical Architecture

Overall Architecture

DataMate adopts a microservices architecture with core components including:

  • Frontend: React 18 + TypeScript + Ant Design + Tailwind CSS
  • Backend: Java 21 + Spring Boot 3.5.6 + Spring Cloud + MyBatis Plus
  • Runtime: Python FastAPI + LangChain + Ray
  • Database: PostgreSQL + Redis + Milvus + MinIO

Microservice Components

  • API Gateway (8080): Unified entry point for routing and authentication
  • Main Application: Core business logic
  • Data Management Service (8092): Dataset management
  • Data Collection Service: Data collection task management
  • Data Cleaning Service: Data cleaning task management
  • Data Annotation Service: Data annotation task management
  • Data Synthesis Service: Data synthesis task management
  • Data Evaluation Service: Data evaluation task management
  • Operator Market Service: Operator marketplace management
  • RAG Indexer Service: Knowledge base indexing
  • Runtime Service (8081): Operator execution engine
  • Backend Python Service (18000): Python backend service

Use Cases

Model Fine-tuning

  • Training data cleaning and quality improvement
  • Data augmentation and synthesis
  • Training data evaluation

RAG Applications

  • Enterprise knowledge base construction
  • Document vectorization and indexing
  • Semantic retrieval and Q&A

Data Governance

  • Unified management of multi-source data
  • Data lineage tracking
  • Data quality monitoring

Deployment Options

DataMate supports multiple deployment methods:

  • Docker Compose: Quick experience and development testing
  • Kubernetes/Helm: Production environment deployment
  • Offline Deployment: Supports air-gapped environment deployment

Comparison with Similar Products

Feature | DataMate | Label Studio | DocArray
------- | -------- | ------------ | --------
Data Management | ✅ Complete dataset management | ❌ Annotation data only | ❌ Document data only
Data Collection | ✅ DataX support | ❌ Not supported | ❌ Not supported
Data Cleaning | ✅ Rich built-in operators | ❌ Not supported | ❌ Not supported
Data Annotation | ✅ Label Studio integration | ✅ Professional tool | ❌ Not supported
Data Synthesis | ✅ LLM-based | ❌ Not supported | ❌ Not supported
Data Evaluation | ✅ Multi-dimensional | ⚠️ Basic | ❌ Not supported
Knowledge Base | ✅ RAG integration | ❌ Not supported | ⚠️ Requires development
Pipeline Orchestration | ✅ Visual orchestration | ❌ Not supported | ❌ Not supported
Operator Extension | ✅ Custom operators | ⚠️ Limited | ⚠️ Requires coding
License | ✅ MIT | ✅ Apache 2.0 | ✅ MIT

2 - Quick Start

Deploy DataMate in 5 minutes

This guide will help you deploy the DataMate platform in 5 minutes.

DataMate supports two main deployment methods:

  • Docker Compose: Suitable for quick experience and development testing
  • Kubernetes/Helm: Suitable for production deployment

Prerequisites

Docker Compose Deployment

  • Docker 20.10+
  • Docker Compose 2.0+
  • At least 4GB RAM
  • At least 10GB disk space

Kubernetes Deployment

  • Kubernetes 1.20+
  • Helm 3.0+
  • kubectl configured with cluster connection
  • At least 8GB RAM
  • At least 20GB disk space

5-Minute Quick Deployment (Docker Compose)

1. Clone the Code

git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

2. Start Services

Use the provided Makefile for one-click deployment:

make install

After running the command, the system will prompt you to select a deployment method:

Choose a deployment method:
1. Docker/Docker-Compose
2. Kubernetes/Helm
Enter choice:

Enter 1 to select Docker Compose deployment.

3. Verify Deployment

After services start, you can access them at:

  • Frontend: http://localhost:30000
  • API Gateway: http://localhost:8080
  • Database: localhost:5432

4. Check Service Status

docker ps

You should see the following containers running:

  • datamate-frontend (Frontend service)
  • datamate-backend (Backend service)
  • datamate-backend-python (Python backend service)
  • datamate-gateway (API gateway)
  • datamate-database (PostgreSQL database)
  • datamate-runtime (Operator runtime)

Optional Components Installation

Install Milvus Vector Database

Milvus is used for vector storage and retrieval in knowledge bases:

make install-milvus

Select Docker Compose deployment method when prompted.

Install Label Studio Annotation Tool

Label Studio is used for data annotation:

make install-label-studio

Access: http://localhost:30001

Default credentials:

Install MinerU PDF Processing Service

MinerU provides enhanced PDF document processing:

make build-mineru
make install-mineru

Install DeerFlow Service

DeerFlow is used for enhanced workflow orchestration:

make install-deer-flow

Using Local Images for Development

If you’ve modified local code, use local images for deployment:

make build
make install dev=true

Offline Environment Deployment

For offline environments, download all images first:

make download SAVE=true

Images will be saved in the dist/ directory. Load images on the target machine:

make load-images

Uninstall

Uninstall DataMate

make uninstall

The system will prompt whether to delete volumes:

  • Select 1: Delete all data (including datasets, configurations, etc.)
  • Select 2: Keep volumes

Uninstall Specific Components

# Uninstall Label Studio
make uninstall-label-studio

# Uninstall Milvus
make uninstall-milvus

# Uninstall DeerFlow
make uninstall-deer-flow

Common Questions

Q: What if service startup fails?

First check if ports are occupied:

# Check port usage
lsof -i :30000
lsof -i :8080

If ports are occupied, modify port mappings in deployment/docker/datamate/docker-compose.yml.
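
As an illustration, remapping the frontend's host port might look like the fragment below (the service name follows the container names listed earlier; the exact keys in the shipped compose file may differ):

```yaml
services:
  datamate-frontend:
    ports:
      - "30080:30000"   # host:container — serve the frontend on 30080 instead of 30000
```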

Q: How to view service logs?

# View all service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs

# View specific service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f datamate-backend

Q: Where is data stored?

Data is persisted through Docker volumes:

  • datamate-dataset-volume: Dataset files
  • datamate-postgresql-volume: Database data
  • datamate-log-volume: Log files

View all volumes:

docker volume ls | grep datamate

2.1 - Installation Guide

Detailed installation and configuration instructions for DataMate

This document provides detailed installation and configuration instructions for the DataMate platform.

System Requirements

Minimum Configuration

Component | Minimum | Recommended
--------- | ------- | -----------
CPU | 4 cores | 8 cores+
RAM | 8 GB | 16 GB+
Disk | 50 GB | 100 GB+
OS | Linux/macOS/Windows | Linux (Ubuntu 20.04+)

Software Dependencies

Docker Compose Deployment

  • Docker 20.10+
  • Docker Compose 2.0+
  • Git (optional, for cloning code)
  • Make (optional, for using Makefile)

Kubernetes Deployment

  • Kubernetes 1.20+
  • Helm 3.0+
  • kubectl (matching cluster version)
  • Git (optional, for cloning code)
  • Make (optional, for using Makefile)

Deployment Method Comparison

Feature | Docker Compose | Kubernetes
------- | -------------- | ----------
Deployment Difficulty | ⭐ Simple | ⭐⭐⭐ Complex
Resource Utilization | ⭐⭐ Fair | ⭐⭐⭐⭐ High
High Availability | ❌ Not supported | ✅ Supported
Scalability | ⭐⭐ Fair | ⭐⭐⭐⭐ Strong
Use Case | Dev/test, small scale | Production, large scale

Docker Compose Deployment

Basic Deployment

1. Prerequisites

# Clone code repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

# Check Docker and Docker Compose versions
docker --version
docker compose version

2. Deploy Using Makefile

# One-click deployment (including Milvus)
make install

Select 1. Docker/Docker-Compose when prompted.

3. Use Docker Compose Directly

If Make is not installed:

# Set image registry (optional)
export REGISTRY=ghcr.io/modelengine-group/

# Start basic services
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

4. Verify Deployment

# Check container status
docker ps

# View service logs
docker compose -f deployment/docker/datamate/docker-compose.yml logs -f

# Access frontend
open http://localhost:30000

Optional Components

Milvus Vector Database

# Using Makefile
make install-milvus

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

Components:

  • milvus-standalone (19530, 9091)
  • milvus-minio (9000, 9001)
  • milvus-etcd

Label Studio Annotation Tool

# Using Makefile
make install-label-studio

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile label-studio up -d

Access: http://localhost:30001

Default credentials:

MinerU PDF Processing

# Build MinerU image
make build-mineru

# Deploy MinerU
make install-mineru

DeerFlow Workflow Service

# Using Makefile
make install-deer-flow

# Or Docker Compose
docker compose -f deployment/docker/datamate/docker-compose.yml --profile deer-flow up -d

Environment Variables

Variable | Default | Description
-------- | ------- | -----------
DB_PASSWORD | password | Database password
DATAMATE_JWT_ENABLE | false | Enable JWT authentication
REGISTRY | ghcr.io/modelengine-group/ | Image registry
VERSION | latest | Image version
LABEL_STUDIO_HOST | - | Label Studio access URL

Data Volume Management

DataMate uses Docker volumes for persistence:

# View all volumes
docker volume ls | grep datamate

# View volume details
docker volume inspect datamate-dataset-volume

# Backup volume data
docker run --rm -v datamate-dataset-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/dataset-backup.tar.gz /data

Kubernetes/Helm Deployment

Prerequisites

# Check cluster connection
kubectl cluster-info
kubectl get nodes

# Check Helm version
helm version

# Create namespace (optional)
kubectl create namespace datamate

Using Makefile

# Deploy DataMate
make install INSTALLER=k8s

# Or deploy to specific namespace
make install NAMESPACE=datamate INSTALLER=k8s

Using Helm

1. Deploy Basic Services

# Deploy DataMate
helm upgrade datamate deployment/helm/datamate/ \
  --install \
  --namespace datamate \
  --create-namespace \
  --set global.image.repository=ghcr.io/modelengine-group/

# Check deployment status
kubectl get pods -n datamate

2. Configure Ingress (Optional)

# Edit values.yaml
cat >> deployment/helm/datamate/values.yaml << EOF
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: datamate.example.com
      paths:
        - path: /
          pathType: Prefix
EOF

# Redeploy
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  -f deployment/helm/datamate/values.yaml

3. Deploy Optional Components

# Deploy Milvus
helm upgrade milvus deployment/helm/milvus \
  --install \
  --namespace datamate

# Deploy Label Studio
helm upgrade label-studio deployment/helm/label-studio/ \
  --install \
  --namespace datamate

Offline Deployment

Prepare Offline Images

1. Download Images

# Download all images locally
make download SAVE=true

# Download specific version
make download VERSION=v1.0.0 SAVE=true

Images saved in dist/ directory.

2. Package and Transfer

# Package
tar czf datamate-images.tar.gz dist/

# Transfer to target server
scp datamate-images.tar.gz user@target-server:/tmp/

Offline Installation

1. Load Images

# Extract on target server
tar xzf datamate-images.tar.gz

# Load all images
make load-images

2. Modify Configuration

Set an empty REGISTRY so Compose uses the locally loaded images:

REGISTRY= docker compose -f deployment/docker/datamate/docker-compose.yml up -d

Upgrade Guide

Docker Compose Upgrade

# 1. Backup data
docker run --rm -v datamate-postgresql-volume:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/postgres-backup.tar.gz /data

# 2. Pull new images
docker pull ghcr.io/modelengine-group/datamate-backend:latest

# 3. Stop services
docker compose -f deployment/docker/datamate/docker-compose.yml down

# 4. Start new version
docker compose -f deployment/docker/datamate/docker-compose.yml up -d

# 5. Verify upgrade
docker ps
docker logs -f datamate-backend

Or use Makefile:

make datamate-docker-upgrade

Kubernetes Upgrade

# 1. Backup data
kubectl exec -n datamate deployment/datamate-database -- \
  pg_dump -U postgres datamate > backup.sql

# 2. Update Helm Chart
helm upgrade datamate deployment/helm/datamate/ \
  --namespace datamate \
  --set global.image.tag=new-version

Uninstall

Docker Compose Complete Uninstall

# Using Makefile
make uninstall

# Choose to delete volumes for complete cleanup

Or manual uninstall:

# Stop and remove containers
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus --profile label-studio down -v

# Remove all volumes
docker volume rm datamate-dataset-volume \
  datamate-postgresql-volume \
  datamate-log-volume

# Remove network
docker network rm datamate-network

Kubernetes Complete Uninstall

# Uninstall all components
make uninstall INSTALLER=k8s

# Or use Helm
helm uninstall datamate -n datamate
helm uninstall milvus -n datamate
helm uninstall label-studio -n datamate

# Delete namespace
kubectl delete namespace datamate

Troubleshooting

Common Issues

1. Service Won’t Start

# Check port conflicts
netstat -tlnp | grep -E '30000|8080|5432'

# Check disk space
df -h

# Check memory
free -h

# View detailed logs
docker logs datamate-backend --tail 100

2. Database Connection Failed

# Check database container
docker ps | grep database

# Test connection
docker exec -it datamate-database psql -U postgres -d datamate

2.2 - System Architecture

DataMate system architecture design documentation

This document details DataMate’s system architecture, tech stack, and design philosophy.

Overall Architecture

DataMate adopts a microservices architecture, splitting the system into multiple independent services, each responsible for specific business functions. This architecture provides good scalability, maintainability, and fault tolerance.

┌─────────────────────────────────────────────────────────────────┐
│                           Frontend Layer                        │
│                    (React + TypeScript)                         │
│                      Ant Design + Tailwind                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                        API Gateway Layer                        │
│                    (Spring Cloud Gateway)                       │
│                      Port: 8080                                 │
└────────────────────────┬────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Java Backend│ │ Python Backend│ │  Runtime     │
│   Services   │ │    Service    │ │   Service    │
├──────────────┤ ├──────────────┤ ├──────────────┤
│· Main App    │ │· RAG Service  │ │· Operator    │
│· Data Mgmt   │ │· LangChain    │ │  Execution   │
│· Collection  │ │· FastAPI      │ │              │
│· Cleaning    │ │              │ │              │
│· Annotation  │ │              │ │              │
│· Synthesis   │ │              │ │              │
│· Evaluation  │ │              │ │              │
│· Operator    │ │              │ │              │
│· Pipeline    │ │              │ │              │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
       │                │                │
       └────────────────┼────────────────┘
                        ▼
         ┌──────────────┴──────────────┐
         │                              │
    ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌───────────┐
    │PostgreSQL│   │  Redis  │   │  Milvus  │   │   MinIO   │
    │  (5432)  │   │ (6379)  │   │ (19530)  │   │  (9000)   │
    └──────────┘   └─────────┘   └──────────┘   └───────────┘

Tech Stack

Frontend Tech Stack

Technology | Version | Purpose
---------- | ------- | -------
React | 18.x | UI framework
TypeScript | 5.x | Type safety
Ant Design | 5.x | UI component library
Tailwind CSS | 3.x | Styling framework
Redux Toolkit | 2.x | State management
React Router | 6.x | Routing management
Vite | 5.x | Build tool

Backend Tech Stack (Java)

Technology | Version | Purpose
---------- | ------- | -------
Java | 21 | Runtime environment
Spring Boot | 3.5.6 | Application framework
Spring Cloud | 2023.x | Microservices framework
MyBatis Plus | 3.x | ORM framework
PostgreSQL Driver | 42.x | Database driver
Redis | 5.x | Cache client
MinIO | 8.x | Object storage client

Backend Tech Stack (Python)

Technology | Version | Purpose
---------- | ------- | -------
Python | 3.11+ | Runtime environment
FastAPI | 0.100+ | Web framework
LangChain | 0.1+ | LLM application framework
Ray | 2.x | Distributed computing
Pydantic | 2.x | Data validation

Data Storage

Technology | Version | Purpose
---------- | ------- | -------
PostgreSQL | 15+ | Main database
Redis | 8.x | Cache and message queue
Milvus | 2.6.5 | Vector database
MinIO | RELEASE.2024+ | Object storage

Microservices Architecture

Service List

Service Name | Port | Tech Stack | Description
------------ | ---- | ---------- | -----------
API Gateway | 8080 | Spring Cloud Gateway | Unified entry, routing, auth
Frontend | 30000 | React | Frontend UI
Main Application | - | Spring Boot | Core business logic
Data Management Service | 8092 | Spring Boot | Dataset management
Data Collection Service | - | Spring Boot | Data collection tasks
Data Cleaning Service | - | Spring Boot | Data cleaning tasks
Data Annotation Service | - | Spring Boot | Data annotation tasks
Data Synthesis Service | - | Spring Boot | Data synthesis tasks
Data Evaluation Service | - | Spring Boot | Data evaluation tasks
Operator Market Service | - | Spring Boot | Operator marketplace
RAG Indexer Service | - | Spring Boot | Knowledge base indexing
Runtime Service | 8081 | Python + Ray | Operator execution engine
Backend Python Service | 18000 | FastAPI | Python backend service
Database | 5432 | PostgreSQL | Database

Service Communication

Synchronous Communication

  • API Gateway → Backend Services: HTTP/REST
  • Frontend → API Gateway: HTTP/REST
  • Backend Service ↔ Backend Service: HTTP/REST (Feign Client)

Asynchronous Communication

  • Task Execution: Database task queue
  • Event Notification: Redis Pub/Sub
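
The event-notification pattern can be sketched with an in-memory stand-in for Redis Pub/Sub (an illustration of the pattern only; the real services would use a Redis client):

```python
from collections import defaultdict
from typing import Callable

class MiniPubSub:
    """In-memory stand-in for Redis Pub/Sub, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        # Deliver to every subscriber; return the number notified, as Redis does.
        for handler in self._subscribers[channel]:
            handler(message)
        return len(self._subscribers[channel])

bus = MiniPubSub()
received = []
bus.subscribe("task.finished", received.append)
bus.publish("task.finished", "cleaning-task-42 done")
print(received)  # ['cleaning-task-42 done']
```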

Data Architecture

Data Flow

┌─────────────┐
│  Data       │ Collection task config
│  Collection │ → DataX → Raw data
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Data       │ Dataset management, file upload
│  Management │ → Structured storage
└──────┬──────┘
       │
       ├──────────────┐
       ▼              ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │ Knowledge   │
│  Cleaning   │  │ Base        │
│             │  │             │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │ Vector      │
│  Annotation │  │ Index       │
└──────┬──────┘  └──────┬──────┘
       │                │
       ▼                │
┌─────────────┐          │
│  Data       │          │
│  Synthesis  │          │
└──────┬──────┘          │
       │                │
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│  Data       │  │  RAG        │
│  Evaluation │  │ Retrieval   │
└─────────────┘  └─────────────┘

Deployment Architecture

Docker Compose Deployment

┌────────────────────────────────────────────────┐
│              Docker Network                    │
│            datamate-network                    │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │Frontend  │  │ Gateway  │  │ Backend  │   │
│  │ :30000   │  │  :8080   │  │          │   │
│  └──────────┘  └──────────┘  └──────────┘   │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │Backend   │  │ Runtime  │  │Database  │   │
│  │  Python  │  │  :8081   │  │  :5432   │   │
│  └──────────┘  └──────────┘  └──────────┘   │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  Milvus  │  │  MinIO   │  │  etcd    │   │
│  │  :19530  │  │  :9000   │  │          │   │
│  └──────────┘  └──────────┘  └──────────┘   │
└────────────────────────────────────────────────┘

Kubernetes Deployment

┌────────────────────────────────────────────────┐
│           Kubernetes Cluster                   │
│                                                │
│  Namespace: datamate                           │
│                                                │
│  ┌────────────┐  ┌────────────┐              │
│  │ Deployment │  │ Deployment │              │
│  │  Frontend  │  │  Gateway   │              │
│  │   (3 Pods) │  │  (2 Pods)  │              │
│  └─────┬──────┘  └─────┬──────┘              │
│        │                │                     │
│  ┌─────▼────────────────▼──────┐              │
│  │       Service (LoadBalancer) │              │
│  └──────────────────────────────┘              │
│                                                │
│  ┌────────────┐  ┌────────────┐              │
│  │ StatefulSet│  │ Deployment │              │
│  │  Database  │  │  Backend   │              │
│  └────────────┘  └────────────┘              │
└────────────────────────────────────────────────┘

Security Architecture

Authentication & Authorization

JWT Authentication (Optional)

datamate:
  jwt:
    enable: true  # Enable JWT authentication
    secret: your-secret-key
    expiration: 86400  # 24 hours
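
To see what the gateway's JWT check amounts to, here is a stdlib sketch of HS256 signing and verification (illustrative of the mechanism only — not DataMate's gateway code, and it omits claim checks such as expiration):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_hs256(payload: dict, secret: str) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret.encode(), header + b"." + body, hashlib.sha256).digest())
    return b".".join([header, body, sig]).decode()

def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.encode().split(b".")
    expected = b64url(hmac.new(secret.encode(), header + b"." + body, hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)

token = sign_hs256({"sub": "admin"}, "your-secret-key")
print(verify_hs256(token, "your-secret-key"))  # True
print(verify_hs256(token, "wrong-secret"))     # False
```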

API Key Authentication

datamate:
  api-key:
    enable: false

Data Security

Transport Encryption

  • API Gateway supports HTTPS/TLS
  • Internal service communication can be encrypted

Storage Encryption

  • Database: Transparent data encryption (TDE)
  • MinIO: Server-side encryption
  • Milvus: Encryption at rest

2.3 - Development Environment Setup

Local development environment configuration guide for DataMate

This document describes how to set up a local development environment for DataMate.

Prerequisites

Required Software

Software | Version | Purpose
-------- | ------- | -------
Node.js | 18.x+ | Frontend development
pnpm | 8.x+ | Frontend package management
Java | 21 | Backend development
Maven | 3.9+ | Backend build
Python | 3.11+ | Python service development
Docker | 20.10+ | Containerized deployment
Docker Compose | 2.0+ | Service orchestration
Git | 2.x+ | Version control
Make | 4.x+ | Build automation

Recommended Tools

  • IDE: IntelliJ IDEA (backend) + VS Code (frontend/Python)
  • Database Client: DBeaver, pgAdmin
  • API Testing: Postman, curl
  • Git Client: GitKraken, SourceTree

Code Structure

DataMate/
├── backend/                 # Java backend
│   ├── services/           # Microservice modules
│   │   ├── main-application/
│   │   ├── data-management-service/
│   │   ├── data-cleaning-service/
│   │   └── ...
│   ├── openapi/            # OpenAPI specs
│   └── scripts/            # Build scripts
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/    # Common components
│   │   ├── pages/         # Page components
│   │   ├── services/      # API services
│   │   ├── store/         # Redux store
│   │   └── routes/        # Routes config
│   └── package.json
├── runtime/                # Python runtime
│   └── datamate/          # DataMate runtime
└── deployment/             # Deployment configs
    ├── docker/            # Docker configs
    └── helm/              # Helm charts

Backend Development

1. Install Java 21

# macOS (Homebrew)
brew install openjdk@21

# Linux (Ubuntu/Debian)
sudo apt update
sudo apt install openjdk-21-jdk

# Verify
java -version

2. Install Maven

# macOS
brew install maven

# Linux
sudo apt install maven

# Verify
mvn -version

3. Configure IDE (IntelliJ IDEA)

Install Plugins

  • Lombok Plugin
  • MyBatis Plugin
  • Rainbow Brackets
  • GitToolBox

Import Project

  1. Open IntelliJ IDEA
  2. File → Open
  3. Select backend directory
  4. Wait for Maven dependency download

4. Configure Database

Start Local Database (Docker)

# Start database only
docker compose -f deployment/docker/datamate/docker-compose.yml up -d datamate-database

Connection info:

  • Host: localhost
  • Port: 5432
  • Database: datamate
  • Username: postgres
  • Password: password

5. Run Backend Service

Using Maven

cd backend/services/main-application
mvn spring-boot:run

Using IDE

  1. Find Application class
  2. Right-click → Run
  3. Access http://localhost:8080

Frontend Development

1. Install Node.js

# macOS
brew install node@18

# Linux
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

2. Install pnpm

npm install -g pnpm

3. Install Dependencies

cd frontend
pnpm install

4. Configure Dev Environment

Create .env.development:

VITE_API_BASE_URL=http://localhost:8080
VITE_API_TIMEOUT=30000

5. Start Dev Server

pnpm dev

Access http://localhost:3000

Python Service Development

1. Install Python 3.11

# macOS
brew install python@3.11

# Linux
sudo apt install python3.11 python3.11-venv

2. Create Virtual Environment

cd runtime/datamate
python3.11 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Run Python Service

python operator_runtime.py --port 8081

Local Debugging

Start All Services

Using Docker Compose

# Start base services (database, Redis, etc.)
docker compose -f deployment/docker/datamate/docker-compose.yml up -d \
  datamate-database \
  datamate-redis

# Start Milvus (optional)
docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d

Start Backend Services

# Terminal 1: Main Application
cd backend/services/main-application
mvn spring-boot:run

# Terminal 2: Data Management Service
cd backend/services/data-management-service
mvn spring-boot:run

Start Frontend

cd frontend
pnpm dev

Start Python Services

# Runtime Service
cd runtime/datamate
python operator_runtime.py --port 8081

# Backend Python Service
cd backend-python
uvicorn main:app --reload --port 18000

Code Standards

Java Code Standards

Naming Conventions

  • Class name: PascalCase UserService
  • Method name: camelCase getUserById
  • Constants: UPPER_CASE MAX_SIZE
  • Variables: camelCase userName

TypeScript Code Standards

Naming Conventions

  • Components: PascalCase UserProfile
  • Types/Interfaces: PascalCase UserData
  • Functions: camelCase getUserData
  • Constants: UPPER_CASE API_BASE_URL

Python Code Standards

Follow PEP 8:

def get_user(user_id: int) -> dict:
    """Get user information

    Args:
        user_id: User ID

    Returns:
        User information dictionary
    """
    return {"id": user_id}  # placeholder body so the example runs

Common Issues

Backend Won’t Start

  1. Check Java version: java -version
  2. Check port conflicts: lsof -i :8080
  3. View logs
  4. Clean and rebuild: mvn clean install

Frontend Won’t Start

  1. Check Node version: node -v
  2. Delete node_modules: rm -rf node_modules && pnpm install
  3. Check port: lsof -i :3000

3 - User Guide

DataMate feature usage guides

This guide introduces how to use each feature module of DataMate.

DataMate provides a comprehensive data processing solution for large models, covering the full process: data collection, management, cleaning, annotation, synthesis, and evaluation.

Typical Use Cases

Model Fine-tuning Scenario

1. Data Collection → 2. Data Management → 3. Data Cleaning → 4. Data Annotation
↓
5. Data Evaluation → 6. Export Training Data

RAG Application Scenario

1. Upload Documents → 2. Vectorization Index → 3. Knowledge Base Management
↓
4. Agent Chat (Knowledge Base Q&A)

Data Augmentation Scenario

1. Prepare Raw Data → 2. Create Instruction Template → 3. Data Synthesis
↓
4. Quality Evaluation → 5. Export Augmented Data

3.1 - Data Collection

Collect data from multiple data sources with DataMate

Data collection module helps you collect data from multiple data sources (databases, file systems, APIs, etc.) into the DataMate platform.

Features Overview

Based on DataX, data collection module supports:

  • Multiple Data Sources: MySQL, PostgreSQL, Oracle, SQL Server, etc.
  • Heterogeneous Sync: Data sync between different sources
  • Batch Collection: Large-scale batch collection and sync
  • Scheduled Tasks: Support scheduled execution
  • Task Monitoring: Real-time monitoring of collection tasks

Supported Data Sources

Data Source Type | Reader | Writer | Description
---------------- | ------ | ------ | -----------
General Relational Databases | | | Supports MySQL, PostgreSQL, OpenGauss, SQL Server, DM, DB2
MySQL | | | Relational database
PostgreSQL | | | Relational database
OpenGauss | | | Relational database
SQL Server | | | Microsoft database
DM (Dameng) | | | Domestic database
DB2 | | | IBM database
StarRocks | | | Analytical database
NAS | | | Network storage
S3 | | | Object storage
GlusterFS | | | Distributed file system
API Collection | | | API interface data
JSON Files | | | JSON format files
CSV Files | | | CSV format files
TXT Files | | | Text files
FTP | | | FTP servers
HDFS | | | Hadoop HDFS

Quick Start

1. Create Collection Task

Step 1: Enter Data Collection Page

Select Data Collection in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

Fill in the following basic information:

  • Name: A meaningful name for the task
  • Timeout: Task execution timeout (seconds)
  • Description: Task purpose (optional)

Step 4: Select Sync Mode

Select the task synchronization mode:

  • Immediate Sync: Execute once immediately after task creation
  • Scheduled Sync: Execute periodically according to schedule rules

When selecting Scheduled Sync, configure the execution policy:

  • Execution Cycle: Hourly / Daily / Weekly / Monthly
  • Execution Time: Select the execution time point

Step 5: Configure Data Source

Select data source type: Choose from dropdown list (e.g., MySQL, CSV, etc.)

Configure data source parameters: Fill in connection parameters based on the selected data source template (form format)

MySQL Example:

  • JDBC URL: jdbc:mysql://localhost:3306/mydb
  • Username: root
  • Password: password
  • Table Name: users

Step 6: Configure Field Extraction

Field mapping is not supported. You can only extract specific fields from the configured SQL.

  • Extract specific fields: Enter the field names you want to extract in the field list
  • Extract all fields: Leave the field list empty to extract all fields from the SQL query result

Step 7: Create and Execute

Click Create button to create the task.

  • If Immediate Sync is selected, task starts immediately
  • If Scheduled Sync is selected, task runs periodically according to schedule

2. Monitor Task Execution

View all collection tasks with status, progress, and operations.

3. Task Management

Each task in the task list has the following actions available:

  • View Execution Records: View all historical executions of the task
  • Delete: Delete the task (note: deleting a task does not delete collected data)

Click the task name to view task details including:

  • Basic configuration
  • Execution record list
  • Data statistics

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data source connection
  2. View execution logs
  3. Check data format
  4. Verify target dataset exists

Q: How to collect large tables?

A:

  1. Use incremental collection
  2. Split into multiple tasks
  3. Adjust concurrent parameters
  4. Use filter conditions

API Reference

3.2 - Data Management

Manage datasets and files with DataMate

Data management module provides unified dataset management capabilities, supporting multiple data types for storage, query, and operations.

Features Overview

Data management module provides:

  • Multiple data types: Image, text, audio, video, and multimodal support
  • File management: Upload, download, preview, delete operations
  • Directory structure: Support for hierarchical directory organization
  • Tag management: Use tags to categorize and retrieve data
  • Statistics: Dataset size, file count, and other statistics

Dataset Types

| Type | Description | Supported Formats |
| --- | --- | --- |
| Image | Image data | JPG, PNG, GIF, BMP, WebP |
| Text | Text data | TXT, MD, JSON, CSV |
| Audio | Audio data | MP3, WAV, FLAC, AAC |
| Video | Video data | MP4, AVI, MOV, MKV |
| Multimodal | Multimodal data | Mixed formats |

Quick Start

1. Create Dataset

Step 1: Enter Data Management Page

In the left navigation, select Data Management.

Step 2: Create Dataset

Click the Create Dataset button in the upper right corner.

Step 3: Fill Basic Information

  • Dataset name: e.g., user_images_dataset
  • Dataset type: Select data type (e.g., Image)
  • Description: Dataset purpose description (optional)
  • Tags: Add tags for categorization (optional)

Step 4: Create Dataset

Click the Create button to complete.

2. Upload Files

Method 1: Drag & Drop

  1. Enter dataset details page
  2. Drag files directly to the upload area
  3. Wait for upload completion

Method 2: Click Upload

  1. Click Upload File button
  2. Select local files
  3. Wait for upload completion

Method 3: Chunked Upload (Large Files)

For large files (>100MB), the system automatically uses chunked upload:

  1. Select large file to upload
  2. System automatically splits the file
  3. Upload chunks one by one
  4. Automatically merge
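
The split-and-merge steps above can be sketched as follows. This is an illustrative sketch, not DataMate's actual upload code; the 5 MB chunk size is a hypothetical value chosen for the example.

```python
# Illustrative sketch of the chunk-split step: split a payload into
# fixed-size chunks that can be uploaded one by one and merged server-side.
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB per chunk (hypothetical value)

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split raw bytes into consecutive chunks of at most chunk_size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

payload = b"x" * (12 * 1024 * 1024)   # a 12 MB payload
chunks = split_into_chunks(payload)   # 12 MB / 5 MB per chunk -> 3 chunks
merged = b"".join(chunks)             # merging restores the original payload
```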

3. Create Directory

Step 1: Enter Dataset

Click dataset name to enter details.

Step 2: Create Directory

  1. Click Create Directory button
  2. Enter directory name
  3. Select parent directory (optional)
  4. Click confirm

Directory structure example:

user_images_dataset/
├── train/
│   ├── cat/
│   └── dog/
├── test/
│   ├── cat/
│   └── dog/
└── validation/
    ├── cat/
    └── dog/

4. Manage Files

View Files

In dataset details page, you can see all files:

| Filename | Size | File Count | Upload Time | Tags | Tag Update Time | Actions |
| --- | --- | --- | --- | --- | --- | --- |
| image1.jpg | 2.3 MB | 1 | 2024-01-15 | Training Set | 2024-01-16 | Download / Rename / Delete |
| image2.png | 1.8 MB | 1 | 2024-01-15 | Validation Set | 2024-01-16 | Download / Rename / Delete |

Preview File

Click Preview button to preview in browser:

  • Image: Display thumbnail and details
  • Text: Display text content
  • Audio: Online playback
  • Video: Online playback

Download File

  • Single file download: Click Download button

Currently, batch download and package download are not supported.

5. Dataset Operations

View Statistics

In dataset details page, you can see:

  • Total files: Total number of files in dataset
  • Total size: Total size of all files

Edit Dataset

Click Edit button to modify:

  • Dataset name
  • Description
  • Tags
  • Associated collection task

Delete Dataset

Click Delete button to delete entire dataset.

Note: Deleting a dataset will also delete all files within it. This action cannot be undone.

Advanced Features

Tag Management

Create Tag

  1. In dataset list page, click Tag Management
  2. Click Create Tag
  3. Enter tag name

Use Tags

  1. Edit dataset
  2. Select existing tags in tag bar
  3. Save dataset

Filter by Tags

In dataset list page, click tags to filter datasets with that tag.

Best Practices

1. Dataset Organization

Recommended directory organization:

project_dataset/
├── raw/              # Raw data
├── processed/        # Processed data
├── train/            # Training data
├── validation/       # Validation data
└── test/             # Test data

2. Naming Conventions

  • Dataset name: Use lowercase letters and underscores, e.g., user_images_2024
  • Directory name: Use meaningful English names, e.g., train, test, processed
  • File name: Keep original filename or use standardized naming

3. Tag Usage

Recommended tag categories:

  • Project tags: project-a, project-b
  • Status tags: raw, processed, validated
  • Type tags: image, text, audio
  • Purpose tags: training, testing, evaluation

4. Data Backup

The system currently does not support automatic backup. To backup data, you can manually download individual files:

  1. Enter dataset details page
  2. Find the file you need to backup
  3. Click the Download button of the file

Common Questions

Q: Large file upload fails?

A: Suggestions for large file uploads:

  1. Use chunked upload: System automatically enables chunked upload
  2. Check network: Ensure stable network connection
  3. Adjust upload parameters: Increase timeout
  4. Use FTP/SFTP: For very large files, use FTP upload

Q: How to import existing data?

A: Three methods to import existing data:

  1. Upload files: Upload via interface
  2. Add files: If files already on server, use add file feature
  3. Data collection: Use data collection module to collect from external sources

Q: Dataset size limit?

A: Dataset size limits:

  • Single file: Maximum 5GB (chunked upload)
  • Total dataset: Limited by storage space
  • File count: No explicit limit

Regularly clean unnecessary files to free up space.

API Reference

For detailed API documentation, see:

3.3 - Data Cleaning

Clean and preprocess data with DataMate

Data cleaning module provides powerful data processing capabilities to help you clean, transform, and optimize data quality.

Features Overview

Data cleaning module provides:

  • Built-in Cleaning Operators: Rich pre-cleaning operator library
  • Visual Configuration: Drag-and-drop cleaning pipeline design
  • Template Management: Save and reuse cleaning templates
  • Batch Processing: Support large-scale data batch cleaning
  • Real-time Preview: Preview cleaning results

Cleaning Operator Types

Data Quality Operators

| Operator | Function | Applicable Data Types |
| --- | --- | --- |
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |

Text Cleaning Operators

| Operator | Function |
| --- | --- |
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |

Quick Start

1. Create Cleaning Task

Step 1: Enter Data Cleaning Page

Select Data Processing in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

  • Task name: e.g., user_data_cleansing
  • Source dataset: Select dataset to clean
  • Output dataset: Select or create output dataset

Step 4: Configure Cleaning Pipeline

  1. Drag operators from left library to canvas
  2. Connect operators to form pipeline
  3. Configure operator parameters
  4. Preview cleaning results

Example pipeline:

Input Data → Deduplication → Null Handling → Format Validation → Output Data
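
The example pipeline above can be sketched as a chain of simple functions. This is an illustrative sketch only; the real operators are configured visually on the canvas, and the function names here are hypothetical.

```python
# Minimal sketch of Deduplication -> Null Handling -> Format Validation.
def deduplicate(records):
    """Drop exact duplicate records, keeping first occurrence."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def handle_nulls(records, default=""):
    """Replace null (None) field values with a default."""
    return [{k: (default if v is None else v) for k, v in r.items()} for r in records]

def validate_format(records, required=("name",)):
    """Keep only records whose required fields are non-empty."""
    return [r for r in records if all(r.get(f) for f in required)]

raw = [
    {"name": "alice", "email": None},
    {"name": "alice", "email": None},     # duplicate, removed
    {"name": None, "email": "b@x.com"},   # fails validation after null fill
]
cleaned = validate_format(handle_nulls(deduplicate(raw)))
```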

2. Use Cleaning Templates

Create Template

  1. Configure cleaning pipeline
  2. Click Save as Template
  3. Enter template name
  4. Save

Use Template

  1. Create cleaning task
  2. Click Use Template
  3. Select template
  4. Adjust as needed

3. Monitor Cleaning Task

View task status, progress, and statistics in task list.

Advanced Features

Custom Operators

Develop custom operators. See:

Conditional Branching

Add conditional branches in pipeline:

Input Data → [Condition Check]
              ├── Satisfied → Pipeline A
              └── Not Satisfied → Pipeline B

Best Practices

1. Pipeline Design

Recommended principles:

  • Modular: Split complex pipelines
  • Reusable: Use templates and parameters
  • Maintainable: Add comments
  • Testable: Test individually before combining

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally when possible
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data format
  2. View execution logs
  3. Check operator parameters
  4. Test individual operators
  5. Reduce data size for testing

Q: Cleaning speed is slow?

A: Optimize:

  1. Reduce operator count
  2. Optimize operator order
  3. Increase concurrency
  4. Use incremental processing

API Reference

3.4 - Data Annotation

Perform data annotation with DataMate

Data annotation module integrates Label Studio to provide professional-grade data annotation capabilities.

Features Overview

Data annotation module provides:

  • Multiple Annotation Types: Image, text, audio, etc.
  • Annotation Templates: Rich annotation templates and configurations
  • Quality Control: Annotation review and consistency checks
  • Team Collaboration: Multi-person collaborative annotation
  • Annotation Export: Export annotation results

Annotation Types

Image Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Image Classification | Classify entire image | Scene recognition |
| Object Detection | Annotate object locations | Object recognition |
| Semantic Segmentation | Pixel-level classification | Medical imaging |
| Key Point Annotation | Annotate key points | Pose estimation |

Text Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Text Classification | Classify text | Sentiment analysis |
| Named Entity Recognition | Annotate entities | Information extraction |
| Text Summarization | Generate summaries | Document understanding |

Quick Start

1. Deploy Label Studio

make install-label-studio

Access: http://localhost:30001

Default credentials:

2. Create Annotation Task

Step 1: Enter Data Annotation Page

Select Data Annotation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., image_classification_task
  • Source dataset: Select dataset to annotate
  • Annotation type: Select type

Step 4: Configure Annotation Template

Image Classification Template:

<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="cat"/>
    <Choice value="dog"/>
    <Choice value="bird"/>
  </Choices>
</View>

Step 5: Configure Annotation Rules

  • Annotation method: Single label / Multi label
  • Minimum annotations: Per sample (for consistency)
  • Review mechanism: Enable/disable review

3. Start Annotation

  1. Enter annotation interface
  2. View sample to annotate
  3. Perform annotation
  4. Click Submit
  5. Auto-load next sample

Advanced Features

Quality Control

Annotation Consistency

Check consistency between annotators:

  • Cohen’s Kappa: Evaluate consistency
  • Majority vote: Use majority annotation results
  • Expert review: Expert reviews disputed annotations
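
Cohen's kappa mentioned above can be computed as in the following self-contained sketch (DataMate computes this internally; the annotator data here is made up):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same samples."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] / n * cb[c] / n for c in set(a) | set(b))   # chance agreement
    return (po - pe) / (1 - pe)

ann1 = ["cat", "cat", "dog", "dog"]
ann2 = ["cat", "cat", "dog", "cat"]
kappa = cohens_kappa(ann1, ann2)  # 1.0 means perfect agreement, 0 means chance level
```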

Pre-annotation

Use models for pre-annotation:

  1. Train or use existing model
  2. Pre-annotate dataset
  3. Annotators correct pre-annotations

Best Practices

1. Annotation Guidelines

Create clear guidelines:

  • Define standards: Clear annotation standards
  • Provide examples: Positive and negative examples
  • Edge cases: Handle edge cases
  • Train annotators: Ensure understanding

Common Questions

Q: Poor annotation quality?

A: Improve:

  1. Refine guidelines
  2. Strengthen training
  3. Increase reviews
  4. Use pre-annotation

3.5 - Data Synthesis

Use large models for data augmentation and synthesis

Data synthesis module leverages large model capabilities to automatically generate high-quality training data, reducing data collection costs.

Features Overview

Data synthesis module provides:

  • Instruction template management: Create and manage synthesis instruction templates
  • Single task synthesis: Create individual synthesis tasks
  • Proportional synthesis task: Synthesize multi-category balanced data by specified ratios
  • Large model integration: Support for multiple LLM APIs
  • Quality evaluation: Automatic evaluation of synthesized data quality

Quick Start

1. Create Instruction Template

Step 1: Enter Data Synthesis Page

In the left navigation, select Data Synthesis → Synthesis Tasks.

Step 2: Create Instruction Template

  1. Click Instruction Templates tab
  2. Click Create Template button

Step 3: Configure Template

Basic Information:

  • Template name: e.g., qa_generation_template
  • Template description: Describe template purpose (optional)
  • Template type: Select template type (Q&A, dialogue, summary, etc.)

Prompt Configuration:

Example prompt:

You are a professional data generation assistant. Generate data based on the following requirements:

Task: Generate Q&A pairs
Topic: {topic}
Count: {count}
Difficulty: {difficulty}

Requirements:
1. Questions should be clear and specific
2. Answers should be accurate and complete
3. Cover different difficulty levels

Output format: JSON
[
  {
    "question": "...",
    "answer": "..."
  }
]

Parameter Configuration:

  • Model: Select LLM to use (GPT-4, Claude, local model, etc.)
  • Temperature: Control generation randomness (0-1)
  • Max tokens: Limit generation length
  • Other parameters: Configure according to model

Step 4: Save Template

Click Save button to save template.

2. Create Synthesis Task

Step 1: Fill Basic Information

  1. Return to Data Synthesis page
  2. Click Create Task button
  3. Fill basic information:
    • Task name: e.g., medical_qa_synthesis
    • Task description: Describe task purpose (optional)

Step 2: Select Dataset and Files

Select required data from existing datasets:

  • Select dataset: Choose the dataset to use from the list
  • Select files:
    • Can select all files from a dataset
    • Can also select specific files from a dataset
    • Support selecting multiple files

Step 3: Select Synthesis Instruction Template

Select an existing template or create a new one:

  • Select from template library: Choose from created templates
  • Template type: Q&A generation, dialogue generation, summary generation, etc.
  • Preview template: View template prompt content

Step 4: Fill Synthesis Configuration

The synthesis configuration consists of four parts:

1. Set Total Synthesis Count

Set the maximum limit for the entire task:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Maximum QA Pairs | Maximum number of QA pairs to generate for entire task | 5000 | 1-100,000 |

This setting is optional, used for total volume control in large-scale synthesis tasks.

2. Configure Text Chunking Strategy

Chunk the input text files, supporting multiple chunking methods:

| Parameter | Description | Default Value |
| --- | --- | --- |
| Chunking Method | Select chunking strategy | Default chunking |
| Chunk Size | Character count per chunk | 3000 |
| Overlap Size | Overlap characters between adjacent chunks | 100 |

Chunking Method Options:

  • Default Chunking (默认分块): Use system default intelligent chunking strategy
  • Chapter-based Chunking (按章节分块): Split by chapter structure
  • Paragraph-based Chunking (按段落分块): Split by paragraph boundaries
  • Fixed Length Chunking (固定长度分块): Split by fixed character length
  • Custom Separator Chunking (自定义分隔符分块): Split by custom delimiter
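
The fixed-length strategy with overlapping boundaries can be sketched as follows (an illustrative sketch, not the platform's chunker; the 3000/100 defaults are taken from the table above):

```python
def fixed_length_chunks(text: str, chunk_size: int = 3000, overlap: int = 100):
    """Fixed-length chunking: each chunk shares `overlap` characters
    with the previous one so context is not cut mid-sentence."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 100  # a 1000-character document
chunks = fixed_length_chunks(doc, chunk_size=300, overlap=50)
```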

3. Configure Question Synthesis Parameters

Set parameters for question generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Question Count | Number of questions generated per chunk | 1 | 1-20 |
| Temperature | Control randomness and diversity of question generation | 0.7 | 0-2 |
| Model | Select CHAT model for question generation | - | Select from model list |

Parameter Notes:

  • Question Count: Number of questions generated per text chunk. Higher value generates more questions.
  • Temperature: Higher values produce more diverse questions, lower values produce more stable questions.

4. Configure Answer Synthesis Parameters

Set parameters for answer generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Temperature | Control stability of answer generation | 0.7 | 0-2 |
| Model | Select CHAT model for answer generation | - | Select from model list |

Parameter Notes:

  • Temperature: Lower values produce more conservative and accurate answers, higher values produce more diverse and creative answers.

Synthesis Types: The system supports two synthesis types:

  • SFT Q&A Synthesis (SFT 问答数据合成): Generate Q&A pairs for supervised fine-tuning
  • COT Chain-of-Thought Synthesis (COT 链式推理合成): Generate data with reasoning process

Step 5: Start Task

Click Start Task button, task will automatically start executing.

3. Create Ratio Synthesis Task

Ratio synthesis tasks are used to synthesize multi-category balanced data in specified proportions.

Step 1: Create Ratio Task

  1. In the left navigation, select Data Synthesis → Ratio Tasks
  2. Click Create Task button

Step 2: Fill Basic Information

| Parameter | Description | Required |
| --- | --- | --- |
| Task Name | Unique identifier for the task | Yes |
| Total Target Count | Target total count for entire ratio task | Yes |
| Task Description | Describe purpose and requirements of ratio task | No |

Example:

  • Task name: balanced_dataset_synthesis
  • Total target count: 10000
  • Task description: Generate balanced data for training and validation sets

Step 3: Select Datasets

Select datasets to participate in the ratio synthesis from existing datasets:

Dataset Selection Features:

  • Search Datasets: Search datasets by keyword
  • Multi-select Support: Can select multiple datasets simultaneously
  • Dataset Information: Display detailed information for each dataset
    • Dataset name and type
    • Dataset description
    • File count
    • Dataset size
    • Label distribution preview (up to 8 labels)

After selecting datasets, the system automatically loads label distribution information for each dataset.

Step 4: Fill Ratio Configuration

Configure specific synthesis rules for each selected dataset:

Ratio Configuration Items:

| Parameter | Description | Range |
| --- | --- | --- |
| Label | Select label from dataset’s label distribution | Based on dataset labels |
| Label Value | Specific value under selected label | Based on label value list |
| Label Update Time | Select label update date range (optional) | Date picker |
| Quantity | Data count to generate for this config | 0 to total target count |

Feature Notes:

  • Auto Distribute: Click “Auto Distribute” button, system automatically distributes total count evenly across datasets
  • Quantity Limit: Each configuration item’s quantity cannot exceed the dataset’s total file count
  • Percentage Calculation: System automatically calculates percentage of each configuration item
  • Delete Configuration: Can delete unwanted configuration items
  • Add Configuration: Each dataset can have multiple different label configurations
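
The "Auto Distribute" behavior described above amounts to an even split of the total target count with the remainder spread over the first items. This is an illustrative sketch; the platform's exact rounding may differ.

```python
def auto_distribute(total: int, n_configs: int) -> list[int]:
    """Evenly distribute a total target count across configuration items,
    giving the first `total % n_configs` items one extra unit."""
    base, rem = divmod(total, n_configs)
    return [base + (1 if i < rem else 0) for i in range(n_configs)]

quantities = auto_distribute(10000, 3)  # total target count from the example above
```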

Example Configuration:

| Dataset | Label | Label Value | Label Update Time | Quantity |
| --- | --- | --- | --- | --- |
| Training Dataset | Category | Training | - | 6000 |
| Training Dataset | Category | Validation | - | 2000 |
| Test Dataset | Category | Test | 2024-01-01 to 2024-12-31 | 2000 |

Step 5: Execute Task

Click Start Task button, the system will create and execute the task according to ratio configuration.

4. Monitor Synthesis Task

View Task List

In data synthesis page, you can see all synthesis tasks:

| Task Name | Template | Status | Progress | Generated Count | Actions |
| --- | --- | --- | --- | --- | --- |
| Medical QA Synthesis | qa_template | Running | 50% | 50/100 | View Details |
| Sentiment Data Synthesis | sentiment_template | Completed | 100% | 1000/1000 | View Details |

Advanced Features

Template Variables

Use variables in prompts for dynamic configuration:

Variable syntax: {variable_name}

Example:

Generate {count} {difficulty} level {type} about {topic}.

Built-in variables:

  • {current_date}: Current date
  • {current_time}: Current time
  • {random_id}: Random ID
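
The substitution of template variables and built-ins can be sketched with `str.format` (an illustrative sketch; `render_template` is a hypothetical name, not the platform's API):

```python
import datetime
import uuid

def render_template(template: str, **variables) -> str:
    """Fill {variable_name} placeholders, merging user variables
    with the built-in variables listed above."""
    now = datetime.datetime.now()
    builtins = {
        "current_date": now.strftime("%Y-%m-%d"),
        "current_time": now.strftime("%H:%M:%S"),
        "random_id": uuid.uuid4().hex[:8],
    }
    return template.format(**{**builtins, **variables})

prompt = render_template(
    "Generate {count} {difficulty} level {type} about {topic}. ({current_date})",
    count=5, difficulty="medium", type="questions", topic="physics",
)
```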

Model Selection

DataMate supports multiple LLMs:

| Model | Type | Description |
| --- | --- | --- |
| GPT-4 | OpenAI | High-quality generation |
| GPT-3.5-Turbo | OpenAI | Fast generation |
| Claude 3 | Anthropic | Long-text generation |
| Wenxin Yiyan | Baidu | Chinese optimized |
| Tongyi Qianwen | Alibaba | Chinese optimized |
| Local Model | Deployed locally | Private deployment |

Best Practices

1. Prompt Design

Good prompts should:

  • Define task clearly: Clearly describe generation task
  • Specify format: Clearly define output format requirements
  • Provide examples: Give expected output examples
  • Control quality: Set quality requirements

Example prompt:

You are a professional educational content creator.

Task: Generate educational Q&A pairs
Subject: {subject}
Grade: {grade}
Count: {count}

Requirements:
1. Questions should be appropriate for the grade level
2. Answers should be accurate, detailed, and easy to understand
3. Each answer should include explanation process
4. Do not generate sensitive or inappropriate content

Output format (JSON):
[
  {
    "id": 1,
    "question": "Question content",
    "answer": "Answer content",
    "explanation": "Explanation content",
    "difficulty": "easy/medium/hard",
    "knowledge_points": ["point1", "point2"]
  }
]

Start generating:

2. Parameter Tuning

Adjust model parameters according to needs:

| Parameter | High Quality | Fast Generation | Creative Generation |
| --- | --- | --- | --- |
| Temperature | 0.3-0.5 | 0.1-0.3 | 0.7-1.0 |
| Max tokens | As needed | Shorter | Longer |
| Top P | 0.9-0.95 | 0.9 | 0.95-1.0 |

Common Questions

Q: Generated data quality is not ideal?

A: Optimization suggestions:

  1. Improve prompt: More detailed and clear instructions
  2. Adjust parameters: Lower temperature, increase max tokens
  3. Provide examples: Give examples in prompt
  4. Change model: Try other LLMs
  5. Manual review: Manual review and filtering

Q: Generation speed is slow?

A: Acceleration suggestions:

  1. Reduce count: Generate in smaller batches
  2. Adjust concurrency: Increase concurrency appropriately
  3. Use faster model: Like GPT-3.5-Turbo
  4. Shorten output: Reduce max tokens
  5. Use local model: Deploy local model for acceleration

API Reference

For detailed API documentation, see:

3.6 - Data Evaluation

Evaluate data quality with DataMate

Data evaluation module provides multi-dimensional data quality evaluation capabilities.

Features Overview

Data evaluation module provides:

  • Quality Metrics: Rich data quality evaluation metrics
  • Automatic Evaluation: Auto-execute evaluation tasks
  • Manual Evaluation: Manual sampling evaluation
  • Evaluation Reports: Generate detailed reports
  • Quality Tracking: Track data quality trends

Evaluation Dimensions

Data Completeness

| Metric | Description | Calculation |
| --- | --- | --- |
| Null Rate | Null value ratio | Null count / Total count |
| Missing Field Rate | Required field missing rate | Missing fields / Total fields |
| Record Complete Rate | Complete record ratio | Complete records / Total records |
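
The completeness metrics above reduce to simple counting, as in this self-contained sketch (the records and field names are made up for illustration):

```python
def completeness_metrics(records, required_fields):
    """Compute null rate, missing field rate, and record complete rate."""
    total_cells = len(records) * len(required_fields)
    nulls = sum(r.get(f) is None for r in records for f in required_fields)
    missing = sum(f not in r for r in records for f in required_fields)
    complete = sum(all(r.get(f) is not None for f in required_fields) for r in records)
    return {
        "null_rate": nulls / total_cells,
        "missing_field_rate": missing / total_cells,
        "record_complete_rate": complete / len(records),
    }

records = [
    {"name": "Alice", "email": "a@x.com", "phone": "13812345678"},
    {"name": "Bob", "email": None, "phone": "13912345678"},
    {"name": "Carol", "phone": None},  # email field missing entirely
]
metrics = completeness_metrics(records, ["name", "email", "phone"])
```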

Data Accuracy

| Metric | Description | Calculation |
| --- | --- | --- |
| Format Correct Rate | Format compliance | Format correct / Total |
| Value Range Compliance | In valid range | In range / Total |
| Consistency Rate | Data consistency | Consistent records / Total |

Quick Start

1. Create Evaluation Task

Step 1: Enter Data Evaluation Page

Select Data Evaluation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., data_quality_evaluation
  • Evaluation dataset: Select dataset to evaluate

Step 4: Configure Evaluation Dimensions

Select dimensions:

  • ✅ Data completeness
  • ✅ Data accuracy
  • ✅ Data uniqueness
  • ✅ Data timeliness

Step 5: Configure Evaluation Rules

Completeness Rules:

Required fields: name, email, phone
Null threshold: 5% (warn if exceeded)

2. Execute Evaluation

Automatic Evaluation

Auto-executes after creation, or click Execute Now.

Manual Evaluation

  1. Click Manual Evaluation tab
  2. View samples to evaluate
  3. Manually evaluate quality
  4. Submit results

3. View Evaluation Report

Overall Score

Overall Quality Score: 85 (Excellent)

Completeness: 90 ⭐⭐⭐⭐⭐
Accuracy: 82 ⭐⭐⭐⭐
Uniqueness: 95 ⭐⭐⭐⭐⭐
Timeliness: 75 ⭐⭐⭐⭐

Detailed Metrics

Completeness:

  • Null rate: 3.2% ✅
  • Missing field rate: 1.5% ✅
  • Record complete rate: 96.8% ✅

Advanced Features

Custom Evaluation Rules

Regex Validation

Field: phone
Rule: ^1[3-9]\d{9}$
Description: China mobile phone number
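
Applying the regex rule above works like this in plain Python (an illustrative sketch of the check, using the example rule verbatim):

```python
import re

PHONE_RULE = re.compile(r"^1[3-9]\d{9}$")  # rule from the example above

def validate_field(value: str) -> bool:
    """Check a field value against the configured regex rule."""
    return PHONE_RULE.fullmatch(value) is not None

valid = validate_field("13812345678")    # 11 digits, second digit in 3-9
invalid = validate_field("12812345678")  # second digit out of range
```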

Value Range Validation

Field: age
Min value: 0
Max value: 120

Comparison Evaluation

Compare different datasets or versions.

Best Practices

1. Regular Evaluation

Recommended schedule:

  • Daily: Critical data
  • Weekly: General data
  • Monthly: All data

2. Establish Baseline

Create quality baseline for each dataset.

3. Continuous Improvement

Based on evaluation results:

  • Clean problem data
  • Optimize collection process
  • Update validation rules

Common Questions

Q: Evaluation task failed?

A: Troubleshoot:

  1. Check dataset exists
  2. Check rule configuration
  3. View execution logs
  4. Test with small sample size

API Reference

3.7 - Knowledge Base Management

Build and manage RAG knowledge bases with DataMate

Knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.

Features Overview

Knowledge base management module provides:

  • Document upload: Support multiple document formats
  • Text chunking: Intelligent text splitting strategies
  • Vectorization: Automatic text-to-vector conversion
  • Vector search: Semantic similarity-based retrieval
  • Knowledge base Q&A: RAG-based intelligent Q&A

Supported Document Formats

| Format | Description | Recommended For |
| --- | --- | --- |
| TXT | Plain text | General text |
| PDF | PDF documents | Documents, reports |
| Markdown | Markdown files | Technical docs |
| JSON | JSON data | Structured data |
| CSV | CSV tables | Tabular data |
| DOCX | Word documents | Office documents |

Quick Start

1. Create Knowledge Base

Step 1: Enter Knowledge Base Page

In the left navigation, select Knowledge Generation.

Step 2: Create Knowledge Base

Click Create Knowledge Base button in upper right.

Step 3: Configure Basic Information

  • Knowledge base name: e.g., company_docs_kb
  • Knowledge base description: Describe purpose (optional)
  • Knowledge base type: General / Professional domain

Step 4: Configure Vector Parameters

  • Embedding model: Select embedding model

    • OpenAI text-embedding-ada-002
    • BGE-M3
    • Custom model
  • Vector dimension: Auto-set based on model

  • Index type: IVF_FLAT / HNSW / IVF_PQ

Step 5: Configure Chunking Strategy

  • Chunking method:

    • By character count
    • By paragraph
    • By semantic
  • Chunk size: Size of each text chunk (character count)

  • Overlap size: Overlap between adjacent chunks

2. Upload Documents

Step 1: Enter Knowledge Base Details

Click knowledge base name to enter details.

Step 2: Upload Documents

  1. Click Upload Document button
  2. Select local files
  3. Wait for upload completion

System will automatically:

  1. Parse document content
  2. Chunk text
  3. Generate vectors
  4. Build index

3. Vector Search

Step 1: Enter Search Page

In knowledge base details page, click Vector Search tab.

Step 2: Enter Query

Enter query in search box, e.g.:

How to use DataMate for data cleaning?

Step 3: View Search Results

System returns most relevant text chunks with similarity scores:

| Rank | Text Chunk | Similarity | Source Doc | Actions |
| --- | --- | --- | --- | --- |
| 1 | DataMate’s data cleaning module… | 0.92 | user_guide.pdf | View |
| 2 | Configure cleaning task… | 0.87 | tutorial.md | View |
| 3 | Cleaning operator list… | 0.81 | reference.txt | View |
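
Similarity scores like those shown come from comparing the query embedding with each chunk embedding, typically by cosine similarity. A minimal sketch (the vectors here are made up, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {"chunk_a": [0.8, 0.2, 0.1], "chunk_b": [0.1, 0.9, 0.3]}
# Rank chunks by similarity to the query, most similar first
ranked = sorted(chunk_vecs, key=lambda c: cosine_similarity(query_vec, chunk_vecs[c]),
                reverse=True)
```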

4. Knowledge Base Q&A (RAG)

Step 1: Enable RAG

In knowledge base details page, click RAG Q&A tab.

Step 2: Configure RAG Parameters

  • LLM: Select LLM to use
  • Retrieval count: Number of text chunks to retrieve
  • Temperature: Control generation randomness
  • Prompt template: Custom Q&A template
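
These parameters feed a retrieval-augmented prompt like the one sketched below. This is an illustrative sketch only; the actual prompt template is configurable, and the chunk data here is made up.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a prompt from the retrieved text chunks and the user question."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "Cite the source documents you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    {"source": "user_guide.pdf", "text": "DataMate's data cleaning module supports..."},
    {"source": "tutorial.md", "text": "Configure cleaning task..."},
]
prompt = build_rag_prompt("What data cleaning operators does DataMate support?", chunks)
```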

Step 3: Q&A

Enter question in dialog box, e.g.:

User: What data cleaning operators does DataMate support?

Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]

Best Practices

1. Document Preparation

Before uploading documents:

  • Unify format: Convert to unified format (PDF, Markdown)
  • Clean content: Remove irrelevant content (headers, ads)
  • Maintain structure: Keep good document structure
  • Add metadata: Add document metadata (author, date, tags)

2. Chunking Strategy Selection

Choose based on document type:

| Document Type | Recommended Strategy | Chunk Size |
| --- | --- | --- |
| Technical docs | Paragraph chunking | - |
| Long reports | Semantic chunking | - |
| Short text | Character chunking | 500 |
| Code | Character chunking | 300 |

Common Questions

Q: Document stuck in “Processing”?

A: Check:

  1. Document format: Ensure format is supported
  2. Document size: Single document under 100MB
  3. Vector service: Check if vector service is running
  4. View logs: Check detailed error messages

Q: Inaccurate search results?

A: Optimization suggestions:

  1. Adjust chunking: Try different chunking methods
  2. Increase chunk size: Add more context
  3. Use reranking: Enable reranking model
  4. Optimize query: Use clearer query statements
  5. Change embedding model: Try other models

API Reference

For detailed API documentation, see:

3.8 - Operator Market

Manage and use DataMate operators

Operator marketplace provides rich data processing operators and supports custom operator development.

Features Overview

Operator marketplace provides:

  • Built-in Operators: Rich built-in data processing operators
  • Operator Publishing: Publish and share custom operators
  • Operator Installation: Install third-party operators
  • Custom Development: Develop custom operators

Built-in Operators

Data Cleaning Operators

| Operator | Function | Input | Output |
| --- | --- | --- | --- |
| Deduplication | Remove duplicates | Dataset | Deduplicated data |
| Null Handler | Handle nulls | Dataset | Filled data |
| Format Converter | Convert format | Original format | New format |

Text Processing Operators

| Operator | Function |
|----------|----------|
| Text Segmentation | Chinese word segmentation |
| Remove Stopwords | Remove common stopwords |
| Text Cleaning | Clean special characters |

Quick Start

1. Browse Operators

Step 1: Enter Operator Market

Select Operator Market in the left navigation.

Step 2: Browse Operators

View all available operators with ratings and installation counts.

2. Install Operator

Install Built-in Operator

Built-in operators are installed by default.

Install Third-party Operator

  1. In operator details page, click Install
  2. Wait for installation completion

3. Use Operator

After installation, use in:

  • Data Cleaning: Add operator node to cleaning pipeline
  • Pipeline Orchestration: Add operator node to workflow

Advanced Features

Develop Custom Operator

Create Operator

  1. In operator market page, click Create Operator
  2. Fill operator information
  3. Write operator code (Python)
  4. Package and publish

Python Operator Example:

import re

class MyTextCleaner:
    def __init__(self, config):
        # Whether to strip non-word, non-space characters (default: on)
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Only strings are cleaned; other types pass through unchanged
        if isinstance(data, str):
            result = data
            if self.remove_special_chars:
                result = re.sub(r'[^\w\s]', '', result)
            return result
        return data
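Before packaging and publishing, it helps to exercise the operator's process method locally. The snippet below re-declares a compact version of the class so it stands alone; the config-dict contract shown is a simplified assumption about the operator interface:

```python
import re

class MyTextCleaner:
    def __init__(self, config):
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Strings are cleaned when the flag is on; everything else passes through
        if isinstance(data, str) and self.remove_special_chars:
            return re.sub(r'[^\w\s]', '', data)
        return data

cleaner = MyTextCleaner({'remove_special_chars': True})
print(cleaner.process("Hello, World!"))  # Hello World

passthrough = MyTextCleaner({'remove_special_chars': False})
print(passthrough.process("Hello, World!"))  # Hello, World!
```

Testing like this locally is also the quickest way to rule out operator bugs before debugging a failed pipeline run.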

Best Practices

1. Operator Design

Good operator design:

  • Single responsibility: One operator does one thing
  • Configurable: Rich configuration options
  • Error handling: Comprehensive error handling
  • Performance: Consider large-scale data

Common Questions

Q: Operator execution failed?

A: Troubleshoot:

  1. View logs
  2. Check configuration
  3. Check data format
  4. Test locally

3.9 - Pipeline Orchestration

Visual workflow orchestration with DataMate

The pipeline orchestration module provides a drag-and-drop visual interface for designing and managing complex data processing workflows.

Features Overview

Pipeline orchestration provides:

  • Visual Designer: Drag-and-drop workflow design
  • Rich Node Types: Data processing, conditions, loops, etc.
  • Flow Execution: Auto-execute and monitor workflows
  • Template Management: Save and reuse flow templates
  • Version Management: Flow version control

Node Types

Data Nodes

| Node | Function | Config |
|------|----------|--------|
| Input Dataset | Read from dataset | Select dataset |
| Output Dataset | Write to dataset | Select dataset |
| Data Collection | Execute collection task | Select task |
| Data Cleaning | Execute cleaning task | Select task |
| Data Synthesis | Execute synthesis task | Select task |

Logic Nodes

| Node | Function | Config |
|------|----------|--------|
| Condition Branch | Execute different branches | Condition expression |
| Loop | Repeat execution | Loop count/condition |
| Parallel | Execute multiple branches in parallel | Branch count |
| Wait | Wait for specified time | Duration |

Quick Start

1. Create Pipeline

Step 1: Enter Pipeline Orchestration Page

Select Pipeline Orchestration in left navigation.

Step 2: Create Pipeline

Click Create Pipeline.

Step 3: Fill Basic Information

  • Pipeline name: e.g., data_processing_pipeline
  • Description: Pipeline purpose (optional)

Step 4: Design Flow

  1. Drag nodes from left library to canvas
  2. Connect nodes
  3. Configure node parameters
  4. Save flow

Example:

Input Dataset → Data Cleaning → Condition Branch
                                    ├── Satisfied → Data Annotation → Output
                                    └── Not Satisfied → Data Synthesis → Output

2. Execute Pipeline

Step 1: Enter Execution Page

Click pipeline name to enter details.

Step 2: Execute Pipeline

Click Execute Now.

Step 3: Monitor Execution

View execution status, progress, and logs.

Advanced Features

Flow Templates

Save as Template

  1. Design flow
  2. Click Save as Template
  3. Enter template name

Use Template

  1. Create pipeline, click Use Template
  2. Select template
  3. Load to designer

Parameterized Flow

Define parameters in pipeline:

{
  "parameters": [
    {
      "name": "input_dataset",
      "type": "dataset",
      "required": true
    }
  ]
}
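Before triggering a run, a caller would typically validate the supplied values against this parameter spec. Below is an illustrative validator (the spec shape follows the JSON above; the function itself is not a DataMate API):

```python
def validate_parameters(spec: list[dict], values: dict) -> list[str]:
    """Return a list of validation errors for a pipeline run's parameters."""
    errors = []
    for param in spec:
        name = param["name"]
        # Required parameters must be supplied by the caller
        if param.get("required") and name not in values:
            errors.append(f"missing required parameter: {name}")
    # Reject values that the spec does not declare
    known = {p["name"] for p in spec}
    errors.extend(f"unknown parameter: {k}" for k in values if k not in known)
    return errors

spec = [{"name": "input_dataset", "type": "dataset", "required": True}]
print(validate_parameters(spec, {}))                           # missing required parameter
print(validate_parameters(spec, {"input_dataset": "ds-001"}))  # []
```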

Scheduled Execution

Configure scheduled execution:

  • Cron expression: 0 0 2 * * ? (Daily at 2 AM)
  • Execution parameters

Best Practices

1. Flow Design

Recommended principles:

  • Modular: Split complex flows
  • Reusable: Use templates
  • Maintainable: Add comments
  • Testable: Test individually

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Flow execution failed?

A: Troubleshoot:

  1. View execution logs
  2. Check node configuration
  3. Check data format
  4. Test nodes individually

3.10 - Agent Chat

Use DataMate Agent for intelligent conversation

The agent chat module integrates large language models to provide intelligent Q&A and knowledge base retrieval capabilities.

Features Overview

Agent chat module provides:

  • Intelligent Chat: Natural language conversation based on LLMs
  • Knowledge Base Q&A: RAG-integrated knowledge base Q&A
  • Multi-turn Dialogue: Support context-aware multi-turn conversation
  • Dialogue Management: Conversation history and management
  • Quick Commands: Quick commands for common operations

Quick Start

1. Access Agent

Select Agent Chat in left navigation, or visit:

  • Local: http://localhost:30000/chat
  • Production: https://your-domain/chat

2. Start Conversation

Enter question in dialog box, e.g.:

User: What data formats does DataMate support?

Assistant: DataMate supports multiple data formats:

Images: JPG, PNG, GIF, BMP, WebP
Text: TXT, MD, JSON, CSV
Audio: MP3, WAV, FLAC, AAC
Video: MP4, AVI, MOV, MKV

These formats can be uploaded and managed in the data management module.

3. Use Knowledge Base Q&A

Enable Knowledge Base

  1. Click Settings button
  2. In Knowledge Base settings, select knowledge base to use
  3. Save settings

Knowledge Base Q&A

User: How to create a data cleaning task?

Assistant: According to the knowledge base documentation:

1. Enter data processing page
2. Click create task button
3. Configure basic information
4. Configure cleaning pipeline (drag operators to canvas)
5. Configure execution parameters
6. Create and execute task

[Source: user_guide.md, data_cleansing.md]

Advanced Features

Conversation Modes

General Chat

Use LLM for general conversation without knowledge base.

Knowledge Base Q&A

Answer questions based on knowledge base content.

Mixed Mode

Combine general chat and knowledge base Q&A.

Quick Commands

| Command | Function | Example |
|---------|----------|---------|
| /dataset | Query datasets | /dataset list |
| /task | Query tasks | /task status |
| /help | Show help | /help |
| /clear | Clear conversation | /clear |

Conversation History

View History

  1. Click History tab on left
  2. Select historical conversation
  3. View conversation content

Continue Conversation

Click historical conversation to continue.

Export Conversation

Export conversation records:

  • Markdown: Export as Markdown file
  • JSON: Export as JSON
  • PDF: Export as PDF

Best Practices

1. Effective Questioning

Get better answers:

  • Be specific: Clear and specific questions
  • Provide context: Include background information
  • Break down: Split complex questions

2. Knowledge Base Usage

Make the most of knowledge base:

  • Select appropriate knowledge base: Choose based on question
  • View sources: Check answer source documents
  • Verify information: Verify with source documents

Common Questions

Q: Inaccurate Agent answers?

A: Improve:

  1. Optimize question: More specific
  2. Check knowledge base: Ensure relevant content exists
  3. Change model: Try more powerful model
  4. Provide context: More background info

4 - API Reference

DataMate API documentation

DataMate provides complete REST APIs supporting programmatic access to all core features.

API Overview

DataMate API is based on REST architecture design, providing the following services:

  • Data Management API: Dataset and file management
  • Data Cleaning API: Data cleaning task management
  • Data Collection API: Data collection task management
  • Data Annotation API: Data annotation task management
  • Data Synthesis API: Data synthesis task management
  • Data Evaluation API: Data evaluation task management
  • Operator Market API: Operator management
  • RAG Indexer API: Knowledge base and vector retrieval
  • Pipeline Orchestration API: Pipeline orchestration management

Authentication

DataMate supports two authentication methods: JWT and API Key.

JWT Authentication

GET /api/v1/data-management/datasets
Authorization: Bearer <your-jwt-token>

Get JWT Token:

POST /api/v1/auth/login
Content-Type: application/json

{
  "username": "admin",
  "password": "password"
}

Response:

{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expiresIn": 86400
}

API Key Authentication

GET /api/v1/data-management/datasets
X-API-Key: <your-api-key>

Common Response Format

Success Response

{
  "code": 200,
  "message": "success",
  "data": {
    // Response data
  }
}

Error Response

{
  "code": 400,
  "message": "Bad Request",
  "error": "Invalid parameter: datasetId",
  "timestamp": "2024-01-15T10:30:00Z",
  "path": "/api/v1/data-management/datasets"
}

Paged Response

{
  "content": [],
  "page": 0,
  "size": 20,
  "totalElements": 100,
  "totalPages": 5,
  "first": true,
  "last": false
}
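The paged shape above lends itself to a simple iteration pattern: request pages until `last` is true. A sketch with the HTTP call abstracted behind a `fetch_page` callable, so the paging logic stays independent of any particular client library:

```python
def iter_all(fetch_page):
    """Yield every item across pages; fetch_page(page) returns a paged response dict."""
    page = 0
    while True:
        resp = fetch_page(page)
        yield from resp["content"]
        if resp["last"]:  # final page reached
            break
        page += 1

# Demo with a fake two-page response
pages = [
    {"content": [1, 2], "last": False},
    {"content": [3], "last": True},
]
print(list(iter_all(lambda p: pages[p])))  # [1, 2, 3]
```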

API Endpoints

Data Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-management/datasets | GET | Get dataset list |
| /data-management/datasets | POST | Create dataset |
| /data-management/datasets/{id} | GET | Get dataset details |
| /data-management/datasets/{id} | PUT | Update dataset |
| /data-management/datasets/{id} | DELETE | Delete dataset |
| /data-management/datasets/{id}/files | GET | Get file list |
| /data-management/datasets/{id}/files/upload | POST | Upload files |

Data Cleaning

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-cleaning/tasks | GET | Get cleaning task list |
| /data-cleaning/tasks | POST | Create cleaning task |
| /data-cleaning/tasks/{id} | GET | Get task details |
| /data-cleaning/tasks/{id} | PUT | Update task |
| /data-cleaning/tasks/{id} | DELETE | Delete task |
| /data-cleaning/tasks/{id}/execute | POST | Execute task |

Data Collection

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-collection/tasks | GET | Get collection task list |
| /data-collection/tasks | POST | Create collection task |
| /data-collection/tasks/{id} | GET | Get task details |
| /data-collection/tasks/{id}/execute | POST | Execute collection task |

Data Synthesis

| Endpoint | Method | Description |
|----------|--------|-------------|
| /data-synthesis/tasks | GET | Get synthesis task list |
| /data-synthesis/tasks | POST | Create synthesis task |
| /data-synthesis/templates | GET | Get instruction template list |
| /data-synthesis/templates | POST | Create instruction template |

Operator Market

| Endpoint | Method | Description |
|----------|--------|-------------|
| /operator-market/operators | GET | Get operator list |
| /operator-market/operators | POST | Publish operator |
| /operator-market/operators/{id} | GET | Get operator details |
| /operator-market/operators/{id}/install | POST | Install operator |

RAG Indexer

| Endpoint | Method | Description |
|----------|--------|-------------|
| /rag/knowledge-bases | GET | Get knowledge base list |
| /rag/knowledge-bases | POST | Create knowledge base |
| /rag/knowledge-bases/{id}/documents | POST | Upload documents |
| /rag/knowledge-bases/{id}/search | POST | Vector search |

Error Codes

| Code | Description |
|------|-------------|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not Found |
| 409 | Conflict |
| 500 | Internal Server Error |

Rate Limiting

API call rate limits:

  • Default limit: 1000 requests/hour
  • Burst limit: 100 requests/minute

Exceeding the limit returns 429 Too Many Requests.

Response headers contain rate limiting information:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1642252800
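A client that receives 429 can read these headers to decide how long to back off. A minimal sketch using the header names above (the current time is passed in explicitly so the function stays testable; this is an illustrative helper, not SDK code):

```python
def seconds_until_reset(headers: dict, now_epoch: int) -> int:
    """Seconds to wait before retrying, based on X-RateLimit-Reset (epoch seconds)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0  # budget left, no need to wait
    reset = int(headers["X-RateLimit-Reset"])
    return max(0, reset - now_epoch)

hdrs = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1642252800"}
print(seconds_until_reset(hdrs, now_epoch=1642252790))  # 10
```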

Version Management

API versions are specified through URL paths:

  • Current version: /api/v1/
  • Future versions: /api/v2/

4.1 - Data Management API

Dataset and file management API

Data management API provides capabilities for dataset and file creation, query, update, and deletion.

Basic Information

  • Base URL: http://localhost:8092/api/v1/data-management
  • Authentication: JWT / API Key
  • Content-Type: application/json

Dataset Management

Get Dataset List

GET /data-management/datasets?page=0&size=20&type=text

Query Parameters:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| page | integer | No | Page number, starts from 0 |
| size | integer | No | Page size, default 20 |
| type | string | No | Dataset type filter |
| tags | string | No | Tag filter, comma-separated |
| keyword | string | No | Keyword search |
| status | string | No | Status filter |

Response Example:

{
  "content": [
    {
      "id": "dataset-001",
      "name": "text_dataset",
      "description": "Text dataset",
      "type": {
        "code": "TEXT",
        "name": "Text"
      },
      "status": "ACTIVE",
      "fileCount": 1000,
      "totalSize": 1073741824,
      "createdAt": "2024-01-15T10:00:00Z"
    }
  ],
  "page": 0,
  "size": 20,
  "totalElements": 1
}

Create Dataset

POST /data-management/datasets
Content-Type: application/json

{
  "name": "my_dataset",
  "description": "My dataset",
  "type": "TEXT",
  "tags": ["training", "nlp"]
}

Get Dataset Details

GET /data-management/datasets/{datasetId}

Update Dataset

PUT /data-management/datasets/{datasetId}
Content-Type: application/json

{
  "name": "updated_dataset",
  "description": "Updated description"
}

Delete Dataset

DELETE /data-management/datasets/{datasetId}

File Management

Get File List

GET /data-management/datasets/{datasetId}/files?page=0&size=20

Upload File

POST /data-management/datasets/{datasetId}/files/upload/chunk
Content-Type: multipart/form-data

Download File

GET /data-management/datasets/{datasetId}/files/{fileId}/download

Delete File

DELETE /data-management/datasets/{datasetId}/files/{fileId}

Error Response

{
  "code": 400,
  "message": "Bad Request",
  "error": "Invalid parameter: datasetId",
  "timestamp": "2024-01-15T10:30:00Z",
  "path": "/api/v1/data-management/datasets"
}

SDK Usage

Python

from datamate import DataMateClient

client = DataMateClient(
    base_url="http://localhost:8080",
    api_key="your-api-key"
)

# Get datasets
datasets = client.data_management.get_datasets()

# Create dataset
dataset = client.data_management.create_dataset(
    name="my_dataset",
    type="TEXT"
)

cURL

# Get datasets
curl -X GET "http://localhost:8092/api/v1/data-management/datasets" \
  -H "Authorization: Bearer your-jwt-token"

# Create dataset
curl -X POST "http://localhost:8092/api/v1/data-management/datasets" \
  -H "Authorization: Bearer your-jwt-token" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_dataset",
    "type": "TEXT"
  }'

5 - Developer Guide

DataMate architecture and development guide

Developer guide introduces DataMate’s technical architecture, development environment, and contribution process.

DataMate is an enterprise-level data processing platform using microservices architecture, supporting large-scale data processing and custom extensions.

Architecture Documentation

Development Guide

Tech Stack

Frontend

| Technology | Version | Description |
|------------|---------|-------------|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI component library |
| Redux Toolkit | 2.x | State management |
| Vite | 5.x | Build tool |

Backend (Java)

| Technology | Version | Description |
|------------|---------|-------------|
| Java | 21 | Runtime environment |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.x | ORM framework |

Backend (Python)

| Technology | Version | Description |
|------------|---------|-------------|
| Python | 3.11+ | Runtime environment |
| FastAPI | 0.100+ | Web framework |
| LangChain | 0.1+ | LLM framework |
| Ray | 2.x | Distributed computing |

Project Structure

DataMate/
├── backend/                 # Java backend
│   ├── services/           # Microservice modules
│   ├── openapi/            # OpenAPI specs
│   └── scripts/            # Build scripts
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/    # Common components
│   │   ├── pages/         # Page components
│   │   ├── services/      # API services
│   │   └── store/         # Redux store
│   └── package.json
├── runtime/                # Python runtime
│   └── datamate/          # DataMate runtime
└── deployment/             # Deployment config
    ├── docker/            # Docker config
    └── helm/              # Helm Charts

Quick Start

1. Clone Code

git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate

2. Start Services

# Start basic services
make install

# Access frontend
open http://localhost:30000

3. Development Mode

# Backend development
cd backend/services/main-application
mvn spring-boot:run

# Frontend development
cd frontend
pnpm dev

# Python service development
cd runtime/datamate
python operator_runtime.py --port 8081

Core Concepts

Microservices Architecture

DataMate uses microservices architecture, each service handles specific business functions:

  • API Gateway: Unified entry, routing, authentication
  • Main Application: Core business logic
  • Data Management Service: Dataset management
  • Data Cleaning Service: Data cleaning
  • Data Synthesis Service: Data synthesis
  • Runtime Service: Operator execution

Operator System

Operators are basic units of data processing:

  • Built-in operators: Common operators provided by platform
  • Custom operators: User-developed custom operators
  • Operator execution: Executed by Runtime Service

Pipeline Orchestration

Pipelines are implemented through visual orchestration:

  • Nodes: Basic units of data processing
  • Connections: Data flow between nodes
  • Execution: Automatic execution according to workflow

Extension Development

Develop Custom Operators

Operator development guide:

  1. Operator Market - Operator usage guide
  2. Python operator development examples
  3. Operator testing and debugging

Integrate External Systems

  • API integration: Integration via REST API
  • Webhook: Event notifications
  • Plugin system: (Coming soon)

Testing

Unit Tests

# Backend tests
cd backend
mvn test

# Frontend tests
cd frontend
pnpm test

# Python tests
cd runtime
pytest

Integration Tests

# Start test environment
make test-env-up

# Run integration tests
make integration-test

# Clean test environment
make test-env-down

Performance Optimization

Backend Optimization

  • Database connection pool configuration
  • Query optimization
  • Caching strategies
  • Asynchronous processing

Frontend Optimization

  • Code splitting
  • Lazy loading
  • Caching strategies

Security

Authentication and Authorization

  • JWT authentication
  • RBAC permission control
  • API Key authentication

Data Security

  • Transport encryption (HTTPS/TLS)
  • Storage encryption
  • Sensitive data masking

5.1 - Backend Architecture

DataMate Java backend architecture design

DataMate backend adopts microservices architecture built on Spring Boot 3.x and Spring Cloud.

Architecture Overview

The DataMate backend uses a microservices architecture, split into multiple independent services:

┌─────────────────────────────────────────────┐
│              API Gateway                    │
│         (Spring Cloud Gateway)              │
│              Port: 8080                     │
└──────────────┬──────────────────────────────┘
               │
       ┌───────┴───────┬───────────────┐
       ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Main       │ │  Data        │ │  Data        │
│ Application  │ │  Management  │ │  Collection  │
└──────────────┘ └──────────────┘ └──────────────┘
       │               │               │
       └───────────────┴───────────────┘
                       │
                       ▼
              ┌────────────────┐
              │   PostgreSQL   │
              │   Port: 5432   │
              └────────────────┘

Tech Stack

Core Frameworks

| Technology | Version | Purpose |
|------------|---------|---------|
| Java | 21 | Programming language |
| Spring Boot | 3.5.6 | Application framework |
| Spring Cloud | 2023.x | Microservices framework |
| MyBatis Plus | 3.5.x | ORM framework |

Support Components

| Technology | Version | Purpose |
|------------|---------|---------|
| Redis | 5.x | Cache and message queue |
| MinIO | 8.x | Object storage |
| Milvus SDK | 2.3.x | Vector database |

Microservices List

API Gateway

Port: 8080

Functions:

  • Unified entry point
  • Route forwarding
  • Authentication and authorization
  • Rate limiting and circuit breaking

Tech: Spring Cloud Gateway, JWT authentication

Main Application

Functions:

  • User management
  • Permission management
  • System configuration
  • Task scheduling

Data Management Service

Port: 8092

Functions:

  • Dataset management
  • File management
  • Tag management
  • Statistics

API Endpoints:

  • /data-management/datasets - Dataset management
  • /data-management/datasets/{id}/files - File management

Runtime Service

Port: 8081

Functions:

  • Operator execution
  • Ray integration
  • Task scheduling

Tech: Python + Ray, FastAPI

Database Design

Main Tables

users (User Table)

| Field | Type | Description |
|-------|------|-------------|
| id | BIGINT | Primary key |
| username | VARCHAR(50) | Username |
| password | VARCHAR(255) | Password (encrypted) |
| email | VARCHAR(100) | Email |
| role | VARCHAR(20) | Role |
| created_at | TIMESTAMP | Creation time |

datasets (Dataset Table)

| Field | Type | Description |
|-------|------|-------------|
| id | VARCHAR(50) | Primary key |
| name | VARCHAR(100) | Name |
| description | TEXT | Description |
| type | VARCHAR(20) | Type |
| status | VARCHAR(20) | Status |
| created_by | VARCHAR(50) | Creator |

Service Communication

Synchronous Communication

Services communicate via HTTP/REST:

// Using Feign Client
@FeignClient(name = "data-management-service")
public interface DataManagementClient {
    @GetMapping("/data-management/datasets/{id}")
    DatasetResponse getDataset(@PathVariable String id);
}

Asynchronous Communication

Using Redis for async messaging:

// Send message
redisTemplate.convertAndSend("task.created", taskMessage);

// Receive message
@RedisListener(topic = "task.created")
public void handleTaskCreated(TaskMessage message) {
    // Handle task creation event
}

Authentication & Authorization

JWT Authentication

@Configuration
public class JwtConfig {
    @Value("${datamate.jwt.secret}")
    private String secret;

    @Value("${datamate.jwt.expiration}")
    private Long expiration;
}

RBAC

@PreAuthorize("hasRole('ADMIN')")
public void adminOperation() {
    // Admin operations
}

Performance Optimization

Database Connection Pool

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 30000

Caching Strategy

@Cacheable(value = "datasets", key = "#id")
public Dataset getDataset(String id) {
    return datasetRepository.findById(id);
}

5.2 - Frontend Architecture

DataMate React frontend architecture design

DataMate frontend is built on React 18 and TypeScript with modern frontend architecture.

Architecture Overview

DataMate frontend adopts SPA architecture:

┌─────────────────────────────────────────────┐
│              Browser                        │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│              React App                      │
│  ┌──────────────────────────────────────┐  │
│  │         Components                   │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         State Management             │  │
│  │         (Redux Toolkit)              │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         Services (API)               │  │
│  └──────────────────────────────────────┘  │
│  ┌──────────────────────────────────────┐  │
│  │         Routing                      │  │
│  └──────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

Tech Stack

Core Frameworks

| Technology | Version | Purpose |
|------------|---------|---------|
| React | 18.x | UI framework |
| TypeScript | 5.x | Type safety |
| Ant Design | 5.x | UI components |
| Tailwind CSS | 3.x | Styling |

State Management

| Technology | Version | Purpose |
|------------|---------|---------|
| Redux Toolkit | 2.x | Global state |
| React Query | 5.x | Server state |

Project Structure

frontend/
├── src/
│   ├── components/     # Common components
│   ├── pages/          # Page components
│   ├── services/       # API services
│   ├── store/          # Redux store
│   ├── hooks/          # Custom hooks
│   ├── routes/         # Routes config
│   └── main.tsx        # Entry point

Routing Design

const router = createBrowserRouter([
  { path: "/", Component: Home },
  { path: "/chat", Component: AgentPage },
  {
    path: "/data",
    Component: MainLayout,
    children: [
      {
        path: "management",
        Component: DatasetManagement
      }
    ]
  }
]);

State Management

Redux Toolkit Configuration

export const store = configureStore({
  reducer: {
    dataManagement: dataManagementSlice,
    user: userSlice,
  },
});

Slice Example

export const fetchDatasets = createAsyncThunk(
  'dataManagement/fetchDatasets',
  async (params: GetDatasetsParams) => {
    const response = await getDatasets(params);
    return response.data;
  }
);

Component Design

Page Component

export const DataManagement: React.FC = () => {
  const dispatch = useAppDispatch();
  const { datasets, loading } = useAppSelector(
    (state) => state.dataManagement
  );

  useEffect(() => {
    dispatch(fetchDatasets({ page: 0, size: 20 }));
  }, [dispatch]);

  return (
    <div className="p-6">
      <h1>Data Management</h1>
      <DataTable data={datasets} loading={loading} />
    </div>
  );
};

API Services

Axios Configuration

const request = axios.create({
  baseURL: import.meta.env.VITE_API_BASE_URL,
  timeout: 30000,
});

// Request interceptor
request.interceptors.request.use((config) => {
  const token = localStorage.getItem('token');
  if (token) {
    config.headers.Authorization = `Bearer ${token}`;
  }
  return config;
});

Performance Optimization

Code Splitting

const DataManagement = lazy(() =>
  import('@/pages/DataManagement/Home/DataManagement')
);

React.memo

export const DataCard = React.memo<DataCardProps>(({ data }) => {
  return <div>{data.name}</div>;
});

6 - Appendix

Configuration, troubleshooting, and other reference information

Appendix contains configuration parameters, troubleshooting, and other reference information.

Appendix Content

Configuration

Detailed system configuration documentation:

  • Environment Variables: All configurable environment variables
  • application.yml: Spring Boot configuration file
  • Docker Compose: Container configuration
  • Kubernetes: K8s configuration

Troubleshooting

Common issue troubleshooting steps and solutions:

  • Service startup issues: Container startup failures
  • Database connection issues: Database connection failures
  • Frontend issues: Page loading, API requests
  • Task execution issues: Tasks stuck, execution failures
  • Performance issues: Slow response, memory overflow

Other References

Technical Support

If you encounter issues:

  1. Check Troubleshooting documentation
  2. Search GitHub Issues
  3. Submit a new issue with detailed information

Contributing

Contributions to DataMate are welcome:

  • Report bugs
  • Propose new features
  • Submit code contributions
  • Improve documentation

See Contribution Guide for details.

6.1 - Configuration

DataMate system configuration parameters

This document details various configuration parameters of the DataMate system.

Environment Variables

Common Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| DB_PASSWORD | password | Database password |
| DATAMATE_JWT_ENABLE | false | Enable JWT authentication |
| REGISTRY | ghcr.io/modelengine-group/ | Image registry |
| VERSION | latest | Image version tag |

Database Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| DB_HOST | datamate-database | Database host |
| DB_PORT | 5432 | Database port |
| DB_NAME | datamate | Database name |
| DB_USER | postgres | Database username |
| DB_PASSWORD | password | Database password |

Redis Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| REDIS_HOST | datamate-redis | Redis host |
| REDIS_PORT | 6379 | Redis port |
| REDIS_PASSWORD | - | Redis password (optional) |
| REDIS_DB | 0 | Redis database number |

Milvus Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| MILVUS_HOST | milvus | Milvus host |
| MILVUS_PORT | 19530 | Milvus port |
| MILVUS_INDEX_TYPE | IVF_FLAT | Vector index type |
| MILVUS_EMBEDDING_DIM | 768 | Vector dimension |

MinIO Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| MINIO_ENDPOINT | minio:9000 | MinIO endpoint |
| MINIO_ACCESS_KEY | minioadmin | Access key |
| MINIO_SECRET_KEY | minioadmin | Secret key |
| MINIO_BUCKET | datamate | Bucket name |

LLM Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | - | OpenAI API key |
| OPENAI_BASE_URL | https://api.openai.com/v1 | API base URL |
| OPENAI_MODEL | gpt-4 | Model to use |

JWT Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| JWT_SECRET | default-insecure-key | JWT secret (CHANGE IN PRODUCTION) |
| JWT_EXPIRATION | 86400 | Token expiration (seconds) |

Logging Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| LOG_LEVEL | INFO | Log level |
| LOG_PATH | /var/log/datamate | Log path |

application.yml Configuration

Main Config

datamate:
  jwt:
    enable: ${DATAMATE_JWT_ENABLE:false}
    secret: ${JWT_SECRET:default-insecure-key}
    expiration: ${JWT_EXPIRATION:86400}

  storage:
    type: minio
    endpoint: ${MINIO_ENDPOINT:minio:9000}
    access-key: ${MINIO_ACCESS_KEY:minioadmin}
    secret-key: ${MINIO_SECRET_KEY:minioadmin}

Spring Boot Config

spring:
  datasource:
    url: jdbc:postgresql://${DB_HOST:datamate-database}:${DB_PORT:5432}/${DB_NAME:datamate}
    username: ${DB_USER:postgres}
    password: ${DB_PASSWORD:password}

  jpa:
    hibernate:
      ddl-auto: validate
    show-sql: false

server:
  port: 8092

Docker Compose Configuration

Environment Variables

services:
  datamate-backend:
    environment:
      - DB_PASSWORD=${DB_PASSWORD:-password}
      - LOG_LEVEL=${LOG_LEVEL:-INFO}

Resource Limits

services:
  datamate-backend:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G

Kubernetes Configuration

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: datamate-config
data:
  LOG_LEVEL: "INFO"

Secret

apiVersion: v1
kind: Secret
metadata:
  name: datamate-secret
type: Opaque
data:
  DB_PASSWORD: cGFzc3dvcmQ=  # base64 encoded
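Values under data in a Kubernetes Secret must be base64-encoded; the cGFzc3dvcmQ= above decodes to password. The encoding can be produced, for example, with Python's standard library:

```python
import base64

# Encode a secret value for use in a Kubernetes Secret's `data` field
encoded = base64.b64encode(b"password").decode("ascii")
print(encoded)  # cGFzc3dvcmQ=

# Round-trip check: decoding recovers the original bytes
assert base64.b64decode(encoded) == b"password"
```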

Performance Tuning

Database Connection Pool

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 30000

JVM Parameters

JAVA_OPTS="-Xms2g -Xmx4g -XX:+UseG1GC"

6.2 - Troubleshooting

Common issues and solutions for DataMate

This document provides troubleshooting steps and solutions for common DataMate issues.

Service Startup Issues

Service Won’t Start

Symptoms

Service fails to start or exits immediately after running make install.

Troubleshooting Steps

  1. Check Port Conflicts
# Check port usage
lsof -i :8080  # API Gateway
lsof -i :30000 # Frontend

If port is occupied:

# Kill process
kill -9 <PID>
  2. View Container Logs
# View all containers
docker ps -a

# View specific container logs
docker logs datamate-backend
  3. Check Docker Resources
# View Docker system info
docker system df

# Clean unused resources
docker system prune -a

Common Causes and Solutions

| Cause | Solution |
|-------|----------|
| Port occupied | Kill process or modify port mapping |
| Insufficient memory | Increase Docker memory limit |
| Image not pulled | Run docker pull |
| Network issues | Check firewall and network config |

Container Exits Immediately

Troubleshooting

# View exit code
docker ps -a

# View detailed logs
docker logs <container-name> --tail 100

Database Connection Issues

Cannot Connect to Database

Troubleshooting Steps

  1. Check Database Container
docker ps | grep datamate-database
docker logs datamate-database
  2. Test Database Connection
# Enter database container
docker exec -it datamate-database psql -U postgres -d datamate
  3. Check Database Config
# Check environment variables
docker exec datamate-backend env | grep DB_

Frontend Issues

Frontend Not Accessible

Symptoms

Browser cannot access http://localhost:30000

Troubleshooting

  1. Check Frontend Container
docker ps | grep datamate-frontend
docker logs datamate-frontend
  2. Check Port Mapping
docker port datamate-frontend

API Request Failed

Troubleshooting

  1. Check Browser Console

Open browser DevTools → Network tab

  2. Check API Gateway
docker ps | grep datamate-gateway
docker logs datamate-gateway
  3. Test API
curl http://localhost:8080/actuator/health

Task Execution Issues

Task Stuck

Troubleshooting

  1. View Task Logs
docker logs datamate-backend --tail 100 | grep <task-id>
docker logs datamate-runtime --tail 100
  2. Check System Resources
docker stats

Performance Issues

Slow System Response

Troubleshooting

  1. Check System Resources
docker stats
  2. Check Database Performance
-- View active queries
SELECT * FROM pg_stat_activity WHERE state = 'active';

Memory Overflow

Troubleshooting

# Check exit reason
docker inspect <container> | grep OOMKilled
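
docker inspect prints a JSON array whose State object carries the OOMKilled flag and exit code; exit code 137 usually indicates the kernel OOM killer intervened. A short Python sketch (a hypothetical helper) that summarizes the exit reason from the inspect output:

```python
import json

def exit_reason(inspect_output: str) -> str:
    """Summarize why a container exited from `docker inspect` output."""
    # docker inspect emits a JSON array with one object per container.
    state = json.loads(inspect_output)[0]["State"]
    if state.get("OOMKilled"):
        return "OOM-killed: raise the container memory limit"
    return f"exited with code {state.get('ExitCode')}"

# Inspect output trimmed to the relevant fields.
sample = '[{"State": {"OOMKilled": true, "ExitCode": 137}}]'
print(exit_reason(sample))  # OOM-killed: raise the container memory limit
```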

Log Viewing

View Application Logs

# Backend logs
docker logs datamate-backend --tail 100 -f

# Frontend logs
docker logs datamate-frontend --tail 100 -f

Log File Locations

Service  | Log Path
---------|----------------------------------
Backend  | /var/log/datamate/backend/app.log
Frontend | /var/log/datamate/frontend/
Database | /var/log/datamate/database/
Runtime  | /var/log/datamate/runtime/

Getting Help

If issues persist:

  1. Collect Information

    • Error messages
    • Log files
    • System environment
    • Reproduction steps
  2. Search Existing Issues

Visit GitHub Issues

  3. Submit New Issue

Include:

  • DataMate version
  • OS version
  • Docker version
  • Detailed error messages
  • Reproduction steps

7 - Contribution Guide

Welcome to the DataMate project. We welcome all forms of contributions, including documentation, code, testing, and translation.

DataMate is an enterprise-level open source data processing project dedicated to providing efficient data solutions for model training, AI applications, and data flywheel scenarios. We welcome all developers, document creators, and test engineers to participate through code commits, documentation optimization, issue feedback, and community support.

If this is your first time contributing to an open source project, we recommend reading the Open Source Contribution Newbie Guide first and then working through this guide. All contributions must follow the DataMate Code of Conduct.

Contribution Scope and Methods

Contributions to the DataMate open source project cover the following core scenarios; choose how to participate based on your expertise:

Contribution Type          | Specific Content                                                                                       | Suitable For
---------------------------|--------------------------------------------------------------------------------------------------------|-----------------------------------------------
Code Contribution          | Core feature development, bug fixes, performance optimization, new feature proposals                   | Backend/frontend developers, data engineers
Documentation Contribution | User manual updates, API documentation improvements, tutorial writing, contribution guide optimization | Technical document creators, experienced users
Testing Contribution       | Writing unit/integration tests, reporting test issues, participating in compatibility testing          | Test engineers, QA personnel
Community Contribution     | Answering GitHub Issues, participating in community discussions, sharing use cases                     | All users, tech enthusiasts
Design Contribution        | UI/UX optimization, logo/icon design, documentation visual upgrades                                    | UI/UX designers, visual designers

Thank you for choosing to participate in the DataMate open source project! Whether it’s code, documentation, or community support, every contribution helps the project grow and advances enterprise-level data processing technology. If you encounter any issues during the contribution process, feel free to seek help through community channels.

Getting Started

Development Environment

Before contributing, please set up your development environment:

  1. Clone Repository
git clone https://github.com/ModelEngine-Group/DataMate.git
cd DataMate
  2. Install Dependencies
# Backend dependencies
cd backend
mvn clean install

# Frontend dependencies
cd ../frontend
pnpm install

# Python dependencies
cd ../runtime
pip install -r requirements.txt
  3. Start Services
# Start basic services
make install dev=true

For detailed setup instructions, see the Developer Guide.

Code Contribution

Code Standards

Java Code Standards

  • Naming Conventions:

    • Class names: PascalCase, e.g. UserService
    • Method names: camelCase, e.g. getUserById
    • Constants: UPPER_CASE, e.g. MAX_SIZE
    • Variables: camelCase, e.g. userName
  • Documentation: Add Javadoc comments for public APIs

/**
 * User service
 *
 * @author Your Name
 * @since 1.0.0
 */
public class UserService {
    /**
     * Get user by ID
     *
     * @param userId user ID
     * @return user information
     */
    public User getUserById(Long userId) {
        // ...
    }
}

TypeScript Code Standards

  • Naming Conventions:
    • Components: PascalCase, e.g. UserProfile
    • Types/Interfaces: PascalCase, e.g. UserData
    • Functions: camelCase, e.g. getUserData
    • Constants: UPPER_CASE, e.g. API_BASE_URL

Python Code Standards

Follow PEP 8:

def get_user(user_id: int) -> dict:
    """
    Get user information

    Args:
        user_id: User ID

    Returns:
        User information dictionary
    """
    # ...

Submitting Code

1. Create Branch

git checkout -b feature/your-feature-name

Branch naming convention:

  • feature/ - New features
  • fix/ - Bug fixes
  • docs/ - Documentation updates
  • refactor/ - Refactoring

2. Make Changes

Follow the code standards mentioned above.

3. Write Tests

# Backend tests
mvn test

# Frontend tests
pnpm test

# Python tests
pytest

4. Commit Changes

git add .
git commit -m "feat: add new feature description"

Commit message format:

  • feat: - New feature
  • fix: - Bug fix
  • docs: - Documentation changes
  • style: - Code style changes
  • refactor: - Refactoring
  • test: - Adding tests
  • chore: - Other changes
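
The prefix convention above is easy to enforce mechanically, for example from a commit-msg git hook. A minimal Python sketch (the optional parenthesized scope is an assumption; adjust to the project's actual policy):

```python
import re

# Prefixes from the commit message format above; an optional
# parenthesized scope, e.g. "feat(pipeline): ...", is an assumption.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)(\([\w\-]+\))?: .+"
)

def is_valid_commit_message(message: str) -> bool:
    """Check the first line of a commit message against the convention."""
    first_line = message.splitlines()[0] if message.strip() else ""
    return COMMIT_RE.match(first_line) is not None

print(is_valid_commit_message("feat: add new feature description"))  # True
print(is_valid_commit_message("updated some files"))                 # False
```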

5. Push and Create PR

git push origin feature/your-feature-name

Then create a Pull Request on GitHub.

Documentation Contribution

Documentation Structure

Documentation is located in the /docs directory:

docs/
├── getting-started/     # Quick start
├── user-guide/          # User guide
├── api-reference/       # API reference
├── developer-guide/     # Developer guide
└── appendix/            # Appendix

Writing Documentation

1. Choose Language

The documentation is bilingual (Chinese and English). When updating a document, please update both language versions.

2. Follow Format

Use Markdown format with Hugo front matter:

---
title: Page Title
description: Page description
weight: 1
---

Content here...

3. Add Examples

Include code examples, commands, and use cases to help users understand.

4. Cross-Reference

Add links to related documentation:

See [Data Management](/docs/user-guide/data-management/) for details.

Testing Contribution

Test Coverage

We aim for comprehensive test coverage:

  • Unit Tests: Test individual functions and classes
  • Integration Tests: Test service interactions
  • E2E Tests: Test complete workflows

Writing Tests

Backend Tests (JUnit)

@Test
public void testGetDataset() {
    // Arrange
    String datasetId = "test-dataset";

    // Act
    Dataset result = datasetService.getDataset(datasetId);

    // Assert
    assertNotNull(result);
    assertEquals("test-dataset", result.getId());
}

Frontend Tests (Jest + React Testing Library)

test('renders data management page', () => {
  render(<DataManagement />);
  expect(screen.getByText('Data Management')).toBeInTheDocument();
});

Reporting Issues

When you find a bug:

  1. Search existing GitHub Issues
  2. If not found, create a new issue with:
    • Clear title
    • Detailed description
    • Steps to reproduce
    • Expected vs actual behavior
    • Environment info

Design Contribution

UI/UX Guidelines

We use Ant Design as the UI component library. When contributing design changes:

  1. Follow Ant Design principles
  2. Ensure consistency with existing design
  3. Consider accessibility
  4. Test on different screen sizes

Design Assets

Design assets should be placed in:

  • Frontend assets: frontend/src/assets/
  • Documentation images: content/en/docs/images/

Community Guidelines

Code of Conduct

  • Be respectful and inclusive
  • Welcome newcomers and help them learn
  • Focus on constructive feedback
  • Collaborate openly

Communication Channels

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: General discussions
  • Pull Requests: Code and documentation contributions

Getting Help

If you need help:

  1. Check existing documentation
  2. Search GitHub Issues
  3. Start a GitHub Discussion

Recognition

Contributors will be recognized in:

  • Contributors List: In the documentation
  • Release Notes: For significant contributions
  • Community Highlights: For outstanding contributions

License

By contributing to DataMate, you agree that your contributions will be licensed under the MIT License.


Thank you for contributing to DataMate! Your contributions help make DataMate better for everyone. 🎉