User Guide

DataMate feature usage guides

This guide describes how to use each feature module of DataMate.

DataMate provides comprehensive data processing solutions for large models, covering the full pipeline: data collection, management, cleaning, annotation, synthesis, and evaluation.

Feature Modules

Typical Use Cases

Model Fine-tuning Scenario

1. Data Collection → 2. Data Management → 3. Data Cleaning → 4. Data Annotation
↓
5. Data Evaluation → 6. Export Training Data

RAG Application Scenario

1. Upload Documents → 2. Vectorization Index → 3. Knowledge Base Management
↓
4. Agent Chat (Knowledge Base Q&A)

Data Augmentation Scenario

1. Prepare Raw Data → 2. Create Instruction Template → 3. Data Synthesis
↓
4. Quality Evaluation → 5. Export Augmented Data

1 - Data Collection

Collect data from multiple data sources with DataMate

The data collection module helps you collect data from multiple sources (databases, file systems, APIs, etc.) into the DataMate platform.

Features Overview

Built on DataX, the data collection module supports:

  • Multiple Data Sources: MySQL, PostgreSQL, Oracle, SQL Server, etc.
  • Heterogeneous Sync: Data sync between different sources
  • Batch Collection: Large-scale batch collection and sync
  • Scheduled Tasks: Support scheduled execution
  • Task Monitoring: Real-time monitoring of collection tasks

Supported Data Sources

| Data Source Type | Description |
| --- | --- |
| General Relational Databases | Supports MySQL, PostgreSQL, OpenGauss, SQL Server, DM, DB2 |
| MySQL | Relational database |
| PostgreSQL | Relational database |
| OpenGauss | Relational database |
| SQL Server | Microsoft database |
| DM (Dameng) | Chinese domestic database |
| DB2 | IBM database |
| StarRocks | Analytical database |
| NAS | Network storage |
| S3 | Object storage |
| GlusterFS | Distributed file system |
| API Collection | API interface data |
| JSON Files | JSON format files |
| CSV Files | CSV format files |
| TXT Files | Text files |
| FTP | FTP servers |
| HDFS | Hadoop HDFS |

Quick Start

1. Create Collection Task

Step 1: Enter Data Collection Page

Select Data Collection in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

Fill in the following basic information:

  • Name: A meaningful name for the task
  • Timeout: Task execution timeout (seconds)
  • Description: Task purpose (optional)

Step 4: Select Sync Mode

Select the task synchronization mode:

  • Immediate Sync: Execute once immediately after task creation
  • Scheduled Sync: Execute periodically according to schedule rules

When selecting Scheduled Sync, configure the execution policy:

  • Execution Cycle: Hourly / Daily / Weekly / Monthly
  • Execution Time: Select the execution time point

Step 5: Configure Data Source

Select data source type: Choose from dropdown list (e.g., MySQL, CSV, etc.)

Configure data source parameters: Fill in connection parameters based on the selected data source template (form format)

MySQL Example:

  • JDBC URL: jdbc:mysql://localhost:3306/mydb
  • Username: root
  • Password: password
  • Table Name: users
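Before creating the task, it can help to sanity-check the connection parameters. The helper below is a hypothetical illustration (not part of DataMate) that parses a MySQL JDBC URL of the form shown above into its components:

```python
import re

def parse_jdbc_mysql_url(url):
    """Split a MySQL JDBC URL into host, port, and database name.

    Hypothetical helper for checking the URL entered in the data-source
    form; raises ValueError on malformed input.
    """
    m = re.match(r"jdbc:mysql://([^:/]+):(\d+)/(\w+)$", url)
    if not m:
        raise ValueError(f"not a valid MySQL JDBC URL: {url!r}")
    host, port, database = m.groups()
    return {"host": host, "port": int(port), "database": database}

print(parse_jdbc_mysql_url("jdbc:mysql://localhost:3306/mydb"))
```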

Step 6: Configure Field Extraction

Field mapping is not supported. You can only extract specific fields from the configured SQL.

  • Extract specific fields: Enter the field names you want to extract in the field list
  • Extract all fields: Leave the field list empty to extract all fields from the SQL query result

Step 7: Create and Execute

Click Create button to create the task.

  • If Immediate Sync is selected, task starts immediately
  • If Scheduled Sync is selected, task runs periodically according to schedule

2. Monitor Task Execution

View all collection tasks with status, progress, and operations.

3. Task Management

Each task in the task list has the following actions available:

  • View Execution Records: View all historical executions of the task
  • Delete: Delete the task (note: deleting a task does not delete collected data)

Click the task name to view task details including:

  • Basic configuration
  • Execution record list
  • Data statistics

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data source connection
  2. View execution logs
  3. Check data format
  4. Verify target dataset exists

Q: How to collect large tables?

A:

  1. Use incremental collection
  2. Split into multiple tasks
  3. Adjust concurrent parameters
  4. Use filter conditions

API Reference

2 - Data Management

Manage datasets and files with DataMate

The data management module provides unified dataset management, supporting storage, query, and operations across multiple data types.

Features Overview

The data management module provides:

  • Multiple data types: Image, text, audio, video, and multimodal support
  • File management: Upload, download, preview, delete operations
  • Directory structure: Support for hierarchical directory organization
  • Tag management: Use tags to categorize and retrieve data
  • Statistics: Dataset size, file count, and other statistics

Dataset Types

| Type | Description | Supported Formats |
| --- | --- | --- |
| Image | Image data | JPG, PNG, GIF, BMP, WebP |
| Text | Text data | TXT, MD, JSON, CSV |
| Audio | Audio data | MP3, WAV, FLAC, AAC |
| Video | Video data | MP4, AVI, MOV, MKV |
| Multimodal | Multimodal data | Mixed formats |

Quick Start

1. Create Dataset

Step 1: Enter Data Management Page

In the left navigation, select Data Management.

Step 2: Create Dataset

Click the Create Dataset button in the upper right corner.

Step 3: Fill Basic Information

  • Dataset name: e.g., user_images_dataset
  • Dataset type: Select data type (e.g., Image)
  • Description: Dataset purpose description (optional)
  • Tags: Add tags for categorization (optional)

Step 4: Create Dataset

Click the Create button to complete.

2. Upload Files

Method 1: Drag & Drop

  1. Enter dataset details page
  2. Drag files directly to the upload area
  3. Wait for upload completion

Method 2: Click Upload

  1. Click Upload File button
  2. Select local files
  3. Wait for upload completion

Method 3: Chunked Upload (Large Files)

For large files (>100MB), the system automatically uses chunked upload:

  1. Select large file to upload
  2. System automatically splits the file
  3. Upload chunks one by one
  4. Automatically merge
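The chunk-splitting step can be sketched as follows. This is a minimal illustration of the idea only; the actual chunk size and upload protocol are internal to DataMate (a 5 MB chunk size is assumed here):

```python
def split_into_chunks(data: bytes, chunk_size: int = 5 * 1024 * 1024):
    """Split a byte buffer into fixed-size chunks (the last may be smaller)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# A 12 MB payload split with 5 MB chunks yields 3 chunks: 5 MB, 5 MB, 2 MB.
chunks = split_into_chunks(b"\x00" * (12 * 1024 * 1024))
print([len(c) for c in chunks])
```

Each chunk would then be uploaded independently and merged server-side, which is what makes the upload resumable for large files.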

3. Create Directory

Step 1: Enter Dataset

Click dataset name to enter details.

Step 2: Create Directory

  1. Click Create Directory button
  2. Enter directory name
  3. Select parent directory (optional)
  4. Click confirm

Directory structure example:

user_images_dataset/
├── train/
│   ├── cat/
│   └── dog/
├── test/
│   ├── cat/
│   └── dog/
└── validation/
    ├── cat/
    └── dog/

4. Manage Files

View Files

In dataset details page, you can see all files:

| Filename | Size | File Count | Upload Time | Tags | Tag Update Time | Actions |
| --- | --- | --- | --- | --- | --- | --- |
| image1.jpg | 2.3 MB | 1 | 2024-01-15 | Training Set | 2024-01-16 | Download / Rename / Delete |
| image2.png | 1.8 MB | 1 | 2024-01-15 | Validation Set | 2024-01-16 | Download / Rename / Delete |

Preview File

Click Preview button to preview in browser:

  • Image: Display thumbnail and details
  • Text: Display text content
  • Audio: Online playback
  • Video: Online playback

Download File

  • Single file download: Click Download button

Currently, batch download and package download are not supported.

5. Dataset Operations

View Statistics

In dataset details page, you can see:

  • Total files: Total number of files in dataset
  • Total size: Total size of all files

Edit Dataset

Click Edit button to modify:

  • Dataset name
  • Description
  • Tags
  • Associated collection task

Delete Dataset

Click Delete button to delete entire dataset.

Note: Deleting a dataset will also delete all files within it. This action cannot be undone.

Advanced Features

Tag Management

Create Tag

  1. In dataset list page, click Tag Management
  2. Click Create Tag
  3. Enter tag name

Use Tags

  1. Edit dataset
  2. Select existing tags in tag bar
  3. Save dataset

Filter by Tags

In dataset list page, click tags to filter datasets with that tag.

Best Practices

1. Dataset Organization

Recommended directory organization:

project_dataset/
├── raw/              # Raw data
├── processed/        # Processed data
├── train/            # Training data
├── validation/       # Validation data
└── test/             # Test data

2. Naming Conventions

  • Dataset name: Use lowercase letters and underscores, e.g., user_images_2024
  • Directory name: Use meaningful English names, e.g., train, test, processed
  • File name: Keep original filename or use standardized naming

3. Tag Usage

Recommended tag categories:

  • Project tags: project-a, project-b
  • Status tags: raw, processed, validated
  • Type tags: image, text, audio
  • Purpose tags: training, testing, evaluation

4. Data Backup

The system does not currently support automatic backup. To back up data, manually download individual files:

  1. Enter dataset details page
  2. Find the file you need to backup
  3. Click the Download button of the file

Common Questions

Q: Large file upload fails?

A: Suggestions for large file uploads:

  1. Use chunked upload: System automatically enables chunked upload
  2. Check network: Ensure stable network connection
  3. Adjust upload parameters: Increase timeout
  4. Use FTP/SFTP: For very large files, use FTP upload

Q: How to import existing data?

A: Three methods to import existing data:

  1. Upload files: Upload via interface
  2. Add files: If files already on server, use add file feature
  3. Data collection: Use data collection module to collect from external sources

Q: Dataset size limit?

A: Dataset size limits:

  • Single file: Maximum 5GB (chunked upload)
  • Total dataset: Limited by storage space
  • File count: No explicit limit

Regularly clean unnecessary files to free up space.

API Reference

For detailed API documentation, see:

3 - Data Cleaning

Clean and preprocess data with DataMate

The data cleaning module provides powerful data processing capabilities to help you clean, transform, and improve data quality.

Features Overview

The data cleaning module provides:

  • Built-in Cleaning Operators: Rich pre-cleaning operator library
  • Visual Configuration: Drag-and-drop cleaning pipeline design
  • Template Management: Save and reuse cleaning templates
  • Batch Processing: Support large-scale data batch cleaning
  • Real-time Preview: Preview cleaning results

Cleaning Operator Types

Data Quality Operators

| Operator | Function | Applicable Data Types |
| --- | --- | --- |
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |

Text Cleaning Operators

| Operator | Function |
| --- | --- |
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |

Quick Start

1. Create Cleaning Task

Step 1: Enter Data Cleaning Page

Select Data Processing in the left navigation.

Step 2: Create Task

Click Create Task button.

Step 3: Configure Basic Information

  • Task name: e.g., user_data_cleansing
  • Source dataset: Select dataset to clean
  • Output dataset: Select or create output dataset

Step 4: Configure Cleaning Pipeline

  1. Drag operators from left library to canvas
  2. Connect operators to form pipeline
  3. Configure operator parameters
  4. Preview cleaning results

Example pipeline:

Input Data → Deduplication → Null Handling → Format Validation → Output Data
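The pipeline above can be sketched in plain Python. This is an illustrative stand-in for the visual pipeline, assuming records are dictionaries; the real operators are configured on the canvas:

```python
def deduplicate(records):
    """Drop exact duplicate records, preserving first occurrence."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def drop_nulls(records, required):
    """Keep only records where every required field is present and non-empty."""
    return [r for r in records if all(r.get(f) not in (None, "") for f in required)]

raw = [
    {"name": "Alice", "email": "a@example.com"},
    {"name": "Alice", "email": "a@example.com"},   # duplicate
    {"name": "Bob", "email": None},                # null email
]
cleaned = drop_nulls(deduplicate(raw), required=["name", "email"])
print(cleaned)  # only the first Alice record survives
```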

2. Use Cleaning Templates

Create Template

  1. Configure cleaning pipeline
  2. Click Save as Template
  3. Enter template name
  4. Save

Use Template

  1. Create cleaning task
  2. Click Use Template
  3. Select template
  4. Adjust as needed

3. Monitor Cleaning Task

View task status, progress, and statistics in task list.

Advanced Features

Custom Operators

Develop custom operators. See:

Conditional Branching

Add conditional branches in pipeline:

Input Data → [Condition Check]
              ├── Satisfied → Pipeline A
              └── Not Satisfied → Pipeline B

Best Practices

1. Pipeline Design

Recommended principles:

  • Modular: Split complex pipelines
  • Reusable: Use templates and parameters
  • Maintainable: Add comments
  • Testable: Test individually before combining

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally when possible
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Task execution failed?

A: Troubleshooting:

  1. Check data format
  2. View execution logs
  3. Check operator parameters
  4. Test individual operators
  5. Reduce data size for testing

Q: Cleaning speed is slow?

A: Optimize:

  1. Reduce operator count
  2. Optimize operator order
  3. Increase concurrency
  4. Use incremental processing

API Reference

4 - Data Annotation

Perform data annotation with DataMate

The data annotation module integrates Label Studio to provide professional-grade data annotation capabilities.

Features Overview

The data annotation module provides:

  • Multiple Annotation Types: Image, text, audio, etc.
  • Annotation Templates: Rich annotation templates and configurations
  • Quality Control: Annotation review and consistency checks
  • Team Collaboration: Multi-person collaborative annotation
  • Annotation Export: Export annotation results

Annotation Types

Image Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Image Classification | Classify entire image | Scene recognition |
| Object Detection | Annotate object locations | Object recognition |
| Semantic Segmentation | Pixel-level classification | Medical imaging |
| Key Point Annotation | Annotate key points | Pose estimation |

Text Annotation

| Type | Description | Use Cases |
| --- | --- | --- |
| Text Classification | Classify text | Sentiment analysis |
| Named Entity Recognition | Annotate entities | Information extraction |
| Text Summarization | Generate summaries | Document understanding |

Quick Start

1. Deploy Label Studio

make install-label-studio

Access: http://localhost:30001

Default credentials:

2. Create Annotation Task

Step 1: Enter Data Annotation Page

Select Data Annotation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., image_classification_task
  • Source dataset: Select dataset to annotate
  • Annotation type: Select type

Step 4: Configure Annotation Template

Image Classification Template:

<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="cat"/>
    <Choice value="dog"/>
    <Choice value="bird"/>
  </Choices>
</View>

Step 5: Configure Annotation Rules

  • Annotation method: Single label / Multi label
  • Minimum annotations: Per sample (for consistency)
  • Review mechanism: Enable/disable review

3. Start Annotation

  1. Enter annotation interface
  2. View sample to annotate
  3. Perform annotation
  4. Click Submit
  5. Auto-load next sample

Advanced Features

Quality Control

Annotation Consistency

Check consistency between annotators:

  • Cohen’s Kappa: Evaluate consistency
  • Majority vote: Use majority annotation results
  • Expert review: Expert reviews disputed annotations
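For reference, Cohen's kappa for two annotators reduces to (p_o - p_e) / (1 - p_e), observed agreement corrected for chance agreement. A self-contained sketch (not a DataMate API):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same samples.

    Note: undefined (division by zero) when chance agreement p_e is 1.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "cat", "dog", "cat", "cat", "dog"]
print(round(cohens_kappa(a, b), 3))  # 5/6 observed agreement, 0.5 by chance
```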

Pre-annotation

Use models for pre-annotation:

  1. Train or use existing model
  2. Pre-annotate dataset
  3. Annotators correct pre-annotations

Best Practices

1. Annotation Guidelines

Create clear guidelines:

  • Define standards: Clear annotation standards
  • Provide examples: Positive and negative examples
  • Edge cases: Handle edge cases
  • Train annotators: Ensure understanding

Common Questions

Q: Poor annotation quality?

A: Improve:

  1. Refine guidelines
  2. Strengthen training
  3. Increase reviews
  4. Use pre-annotation

5 - Data Synthesis

Use large models for data augmentation and synthesis

The data synthesis module leverages large model capabilities to automatically generate high-quality training data, reducing data collection costs.

Features Overview

Data synthesis module provides:

  • Instruction template management: Create and manage synthesis instruction templates
  • Single task synthesis: Create individual synthesis tasks
  • Ratio synthesis tasks: Synthesize multi-category balanced data in specified ratios
  • Large model integration: Support for multiple LLM APIs
  • Quality evaluation: Automatic evaluation of synthesized data quality

Quick Start

1. Create Instruction Template

Step 1: Enter Data Synthesis Page

In the left navigation, select Data Synthesis → Synthesis Tasks.

Step 2: Create Instruction Template

  1. Click Instruction Templates tab
  2. Click Create Template button

Step 3: Configure Template

Basic Information:

  • Template name: e.g., qa_generation_template
  • Template description: Describe template purpose (optional)
  • Template type: Select template type (Q&A, dialogue, summary, etc.)

Prompt Configuration:

Example prompt:

You are a professional data generation assistant. Generate data based on the following requirements:

Task: Generate Q&A pairs
Topic: {topic}
Count: {count}
Difficulty: {difficulty}

Requirements:
1. Questions should be clear and specific
2. Answers should be accurate and complete
3. Cover different difficulty levels

Output format: JSON
[
  {
    "question": "...",
    "answer": "..."
  }
]

Parameter Configuration:

  • Model: Select LLM to use (GPT-4, Claude, local model, etc.)
  • Temperature: Control generation randomness (0-1)
  • Max tokens: Limit generation length
  • Other parameters: Configure according to model

Step 4: Save Template

Click Save button to save template.

2. Create Synthesis Task

Step 1: Fill Basic Information

  1. Return to Data Synthesis page
  2. Click Create Task button
  3. Fill basic information:
    • Task name: e.g., medical_qa_synthesis
    • Task description: Describe task purpose (optional)

Step 2: Select Dataset and Files

Select required data from existing datasets:

  • Select dataset: Choose the dataset to use from the list
  • Select files:
    • Can select all files from a dataset
    • Can also select specific files from a dataset
    • Support selecting multiple files

Step 3: Select Synthesis Instruction Template

Select an existing template or create a new one:

  • Select from template library: Choose from created templates
  • Template type: Q&A generation, dialogue generation, summary generation, etc.
  • Preview template: View template prompt content

Step 4: Fill Synthesis Configuration

The synthesis configuration consists of four parts:

1. Set Total Synthesis Count

Set the maximum limit for the entire task:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Maximum QA Pairs | Maximum number of QA pairs to generate for the entire task | 5000 | 1-100,000 |

This setting is optional, used for total volume control in large-scale synthesis tasks.

2. Configure Text Chunking Strategy

Chunk the input text files, supporting multiple chunking methods:

| Parameter | Description | Default Value |
| --- | --- | --- |
| Chunking Method | Select chunking strategy | Default chunking |
| Chunk Size | Character count per chunk | 3000 |
| Overlap Size | Overlap characters between adjacent chunks | 100 |

Chunking Method Options:

  • Default Chunking (默认分块): Use system default intelligent chunking strategy
  • Chapter-based Chunking (按章节分块): Split by chapter structure
  • Paragraph-based Chunking (按段落分块): Split by paragraph boundaries
  • Fixed Length Chunking (固定长度分块): Split by fixed character length
  • Custom Separator Chunking (自定义分隔符分块): Split by custom delimiter
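Fixed-length chunking with overlap works roughly like this — a sketch using the default chunk size of 3000 and overlap of 100 from the table above, not DataMate's actual implementation:

```python
def chunk_text(text, chunk_size=3000, overlap=100):
    """Split text into fixed-length chunks; adjacent chunks share `overlap` chars."""
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # last chunk reached the end
            break
    return chunks

chunks = chunk_text("x" * 7000)
print([len(c) for c in chunks])  # 7000 chars -> chunks of 3000, 3000, 1200
```

The overlap ensures a sentence cut at a chunk boundary still appears intact in the next chunk, which matters when each chunk is sent to the model independently.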

3. Configure Question Synthesis Parameters

Set parameters for question generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Question Count | Number of questions generated per chunk | 1 | 1-20 |
| Temperature | Control randomness and diversity of question generation | 0.7 | 0-2 |
| Model | Select CHAT model for question generation | - | Select from model list |

Parameter Notes:

  • Question Count: Number of questions generated per text chunk. Higher value generates more questions.
  • Temperature: Higher values produce more diverse questions, lower values produce more stable questions.

4. Configure Answer Synthesis Parameters

Set parameters for answer generation:

| Parameter | Description | Default Value | Range |
| --- | --- | --- | --- |
| Temperature | Control stability of answer generation | 0.7 | 0-2 |
| Model | Select CHAT model for answer generation | - | Select from model list |

Parameter Notes:

  • Temperature: Lower values produce more conservative and accurate answers, higher values produce more diverse and creative answers.

Synthesis Types: The system supports two synthesis types:

  • SFT Q&A Synthesis (SFT 问答数据合成): Generate Q&A pairs for supervised fine-tuning
  • COT Chain-of-Thought Synthesis (COT 链式推理合成): Generate data with reasoning process

Step 5: Start Task

Click Start Task button, task will automatically start executing.

3. Create Ratio Synthesis Task

Ratio synthesis tasks are used to synthesize multi-category balanced data in specified proportions.

Step 1: Create Ratio Task

  1. In the left navigation, select Data Synthesis → Ratio Tasks
  2. Click Create Task button

Step 2: Fill Basic Information

| Parameter | Description | Required |
| --- | --- | --- |
| Task Name | Unique identifier for the task | Yes |
| Total Target Count | Target total count for entire ratio task | Yes |
| Task Description | Describe purpose and requirements of ratio task | No |

Example:

  • Task name: balanced_dataset_synthesis
  • Total target count: 10000
  • Task description: Generate balanced data for training and validation sets

Step 3: Select Datasets

Select datasets to participate in the ratio synthesis from existing datasets:

Dataset Selection Features:

  • Search Datasets: Search datasets by keyword
  • Multi-select Support: Can select multiple datasets simultaneously
  • Dataset Information: Display detailed information for each dataset
    • Dataset name and type
    • Dataset description
    • File count
    • Dataset size
    • Label distribution preview (up to 8 labels)

After selecting datasets, the system automatically loads label distribution information for each dataset.

Step 4: Fill Ratio Configuration

Configure specific synthesis rules for each selected dataset:

Ratio Configuration Items:

| Parameter | Description | Range |
| --- | --- | --- |
| Label | Select label from dataset’s label distribution | Based on dataset labels |
| Label Value | Specific value under selected label | Based on label value list |
| Label Update Time | Select label update date range (optional) | Date picker |
| Quantity | Data count to generate for this config | 0 to total target count |

Feature Notes:

  • Auto Distribute: Click “Auto Distribute” button, system automatically distributes total count evenly across datasets
  • Quantity Limit: Each configuration item’s quantity cannot exceed the dataset’s total file count
  • Percentage Calculation: System automatically calculates percentage of each configuration item
  • Delete Configuration: Can delete unwanted configuration items
  • Add Configuration: Each dataset can have multiple different label configurations

Example Configuration:

| Dataset | Label | Label Value | Label Update Time | Quantity |
| --- | --- | --- | --- | --- |
| Training Dataset | Category | Training | - | 6000 |
| Training Dataset | Category | Validation | - | 2000 |
| Test Dataset | Category | Test | 2024-01-01 to 2024-12-31 | 2000 |
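The “Auto Distribute” behavior can be approximated as an even split with the remainder spread over the first items. The exact rounding rule is an assumption here, shown only to illustrate the idea:

```python
def auto_distribute(total, n_items):
    """Evenly distribute `total` across `n_items`; earlier items absorb the remainder."""
    base, rem = divmod(total, n_items)
    return [base + (1 if i < rem else 0) for i in range(n_items)]

# 10000 items across 3 datasets: the first dataset takes the leftover 1.
print(auto_distribute(10000, 3))
```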

Step 5: Execute Task

Click Start Task button, the system will create and execute the task according to ratio configuration.

4. Monitor Synthesis Task

View Task List

In data synthesis page, you can see all synthesis tasks:

| Task Name | Template | Status | Progress | Generated Count | Actions |
| --- | --- | --- | --- | --- | --- |
| Medical QA Synthesis | qa_template | Running | 50% | 50/100 | View Details |
| Sentiment Data Synthesis | sentiment_template | Completed | 100% | 1000/1000 | View Details |

Advanced Features

Template Variables

Use variables in prompts for dynamic configuration:

Variable syntax: {variable_name}

Example:

Generate {count} {difficulty} level {type} about {topic}.

Built-in variables:

  • {current_date}: Current date
  • {current_time}: Current time
  • {random_id}: Random ID
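Substitution of this kind can be sketched with Python's `str.format_map`. How DataMate injects the built-ins is an assumption; the snippet only illustrates merging platform variables with user-supplied ones:

```python
from datetime import datetime
from uuid import uuid4

def render_template(template, **user_vars):
    """Fill {variable_name} placeholders; user vars override platform built-ins."""
    builtins = {
        "current_date": datetime.now().strftime("%Y-%m-%d"),
        "current_time": datetime.now().strftime("%H:%M:%S"),
        "random_id": uuid4().hex[:8],
    }
    return template.format_map({**builtins, **user_vars})

prompt = render_template(
    "Generate {count} {difficulty} level {type} about {topic}.",
    count=5, difficulty="medium", type="questions", topic="astronomy",
)
print(prompt)
```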

Model Selection

DataMate supports multiple LLMs:

| Model | Type | Description |
| --- | --- | --- |
| GPT-4 | OpenAI | High-quality generation |
| GPT-3.5-Turbo | OpenAI | Fast generation |
| Claude 3 | Anthropic | Long-text generation |
| Wenxin Yiyan | Baidu | Chinese optimized |
| Tongyi Qianwen | Alibaba | Chinese optimized |
| Local Model | Deployed locally | Private deployment |

Best Practices

1. Prompt Design

Good prompts should:

  • Define task clearly: Clearly describe generation task
  • Specify format: Clearly define output format requirements
  • Provide examples: Give expected output examples
  • Control quality: Set quality requirements

Example prompt:

You are a professional educational content creator.

Task: Generate educational Q&A pairs
Subject: {subject}
Grade: {grade}
Count: {count}

Requirements:
1. Questions should be appropriate for the grade level
2. Answers should be accurate, detailed, and easy to understand
3. Each answer should include explanation process
4. Do not generate sensitive or inappropriate content

Output format (JSON):
[
  {
    "id": 1,
    "question": "Question content",
    "answer": "Answer content",
    "explanation": "Explanation content",
    "difficulty": "easy/medium/hard",
    "knowledge_points": ["point1", "point2"]
  }
]

Start generating:

2. Parameter Tuning

Adjust model parameters according to needs:

| Parameter | High Quality | Fast Generation | Creative Generation |
| --- | --- | --- | --- |
| Temperature | 0.3-0.5 | 0.1-0.3 | 0.7-1.0 |
| Max tokens | As needed | Shorter | Longer |
| Top P | 0.9-0.95 | 0.9 | 0.95-1.0 |

Common Questions

Q: Generated data quality is not ideal?

A: Optimization suggestions:

  1. Improve prompt: More detailed and clear instructions
  2. Adjust parameters: Lower temperature, increase max tokens
  3. Provide examples: Give examples in prompt
  4. Change model: Try other LLMs
  5. Manual review: Manual review and filtering

Q: Generation speed is slow?

A: Acceleration suggestions:

  1. Reduce count: Generate in smaller batches
  2. Adjust concurrency: Increase concurrency appropriately
  3. Use faster model: Like GPT-3.5-Turbo
  4. Shorten output: Reduce max tokens
  5. Use local model: Deploy local model for acceleration

API Reference

For detailed API documentation, see:

6 - Data Evaluation

Evaluate data quality with DataMate

The data evaluation module provides multi-dimensional data quality evaluation capabilities.

Features Overview

The data evaluation module provides:

  • Quality Metrics: Rich data quality evaluation metrics
  • Automatic Evaluation: Auto-execute evaluation tasks
  • Manual Evaluation: Manual sampling evaluation
  • Evaluation Reports: Generate detailed reports
  • Quality Tracking: Track data quality trends

Evaluation Dimensions

Data Completeness

| Metric | Description | Calculation |
| --- | --- | --- |
| Null Rate | Null value ratio | Null count / Total count |
| Missing Field Rate | Required field missing rate | Missing fields / Total fields |
| Record Complete Rate | Complete record ratio | Complete records / Total records |

Data Accuracy

| Metric | Description | Calculation |
| --- | --- | --- |
| Format Correct Rate | Format compliance | Format correct / Total |
| Value Range Compliance | In valid range | In range / Total |
| Consistency Rate | Data consistency | Consistent records / Total |
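The completeness metrics above reduce to simple ratios. A purely illustrative sketch over dictionary records (not DataMate's evaluation engine):

```python
def completeness_metrics(records, required_fields):
    """Compute null rate, missing-field rate, and record-complete rate."""
    total_cells = len(records) * len(required_fields)
    # Null: field absent or empty; missing: field key absent entirely.
    nulls = sum(r.get(f) in (None, "") for r in records for f in required_fields)
    missing = sum(f not in r for r in records for f in required_fields)
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return {
        "null_rate": nulls / total_cells,
        "missing_field_rate": missing / total_cells,
        "record_complete_rate": complete / len(records),
    }

records = [
    {"name": "Alice", "email": "a@example.com"},
    {"name": "Bob", "email": None},
    {"name": "Carol"},  # email field missing entirely
]
print(completeness_metrics(records, ["name", "email"]))
```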

Quick Start

1. Create Evaluation Task

Step 1: Enter Data Evaluation Page

Select Data Evaluation in the left navigation.

Step 2: Create Task

Click Create Task.

Step 3: Configure Basic Information

  • Task name: e.g., data_quality_evaluation
  • Evaluation dataset: Select dataset to evaluate

Step 4: Configure Evaluation Dimensions

Select dimensions:

  • ✅ Data completeness
  • ✅ Data accuracy
  • ✅ Data uniqueness
  • ✅ Data timeliness

Step 5: Configure Evaluation Rules

Completeness Rules:

Required fields: name, email, phone
Null threshold: 5% (warn if exceeded)

2. Execute Evaluation

Automatic Evaluation

Auto-executes after creation, or click Execute Now.

Manual Evaluation

  1. Click Manual Evaluation tab
  2. View samples to evaluate
  3. Manually evaluate quality
  4. Submit results

3. View Evaluation Report

Overall Score

Overall Quality Score: 85 (Excellent)

Completeness: 90 ⭐⭐⭐⭐⭐
Accuracy: 82 ⭐⭐⭐⭐
Uniqueness: 95 ⭐⭐⭐⭐⭐
Timeliness: 75 ⭐⭐⭐⭐

Detailed Metrics

Completeness:

  • Null rate: 3.2% ✅
  • Missing field rate: 1.5% ✅
  • Record complete rate: 96.8% ✅

Advanced Features

Custom Evaluation Rules

Regex Validation

Field: phone
Rule: ^1[3-9]\d{9}$
Description: China mobile phone number

Value Range Validation

Field: age
Min value: 0
Max value: 120
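Both rule types above can be expressed as simple predicates — a sketch using the phone regex and age range from the examples, not DataMate's rule engine:

```python
import re

PHONE_RULE = re.compile(r"^1[3-9]\d{9}$")  # China mobile number, as above

def validate_phone(value):
    """True if the value matches the phone-number pattern."""
    return bool(PHONE_RULE.match(value))

def validate_range(value, min_value=0, max_value=120):
    """True if the value lies within [min_value, max_value]."""
    return min_value <= value <= max_value

print(validate_phone("13812345678"), validate_range(35))
```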

Comparison Evaluation

Compare different datasets or versions.

Best Practices

1. Regular Evaluation

Recommended schedule:

  • Daily: Critical data
  • Weekly: General data
  • Monthly: All data

2. Establish Baseline

Create quality baseline for each dataset.

3. Continuous Improvement

Based on evaluation results:

  • Clean problem data
  • Optimize collection process
  • Update validation rules

Common Questions

Q: Evaluation task failed?

A: Troubleshoot:

  1. Check dataset exists
  2. Check rule configuration
  3. View execution logs
  4. Test with small sample size

API Reference

7 - Knowledge Base Management

Build and manage RAG knowledge bases with DataMate

The knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.

Features Overview

The knowledge base management module provides:

  • Document upload: Support multiple document formats
  • Text chunking: Intelligent text splitting strategies
  • Vectorization: Automatic text-to-vector conversion
  • Vector search: Semantic similarity-based retrieval
  • Knowledge base Q&A: RAG-based intelligent Q&A

Supported Document Formats

| Format | Description | Recommended For |
| --- | --- | --- |
| TXT | Plain text | General text |
| PDF | PDF documents | Documents, reports |
| Markdown | Markdown files | Technical docs |
| JSON | JSON data | Structured data |
| CSV | CSV tables | Tabular data |
| DOCX | Word documents | Office documents |

Quick Start

1. Create Knowledge Base

Step 1: Enter Knowledge Base Page

In the left navigation, select Knowledge Generation.

Step 2: Create Knowledge Base

Click Create Knowledge Base button in upper right.

Step 3: Configure Basic Information

  • Knowledge base name: e.g., company_docs_kb
  • Knowledge base description: Describe purpose (optional)
  • Knowledge base type: General / Professional domain

Step 4: Configure Vector Parameters

  • Embedding model: Select embedding model

    • OpenAI text-embedding-ada-002
    • BGE-M3
    • Custom model
  • Vector dimension: Auto-set based on model

  • Index type: IVF_FLAT / HNSW / IVF_PQ

Step 5: Configure Chunking Strategy

  • Chunking method:

    • By character count
    • By paragraph
    • By semantic
  • Chunk size: Size of each text chunk (character count)

  • Overlap size: Overlap between adjacent chunks

2. Upload Documents

Step 1: Enter Knowledge Base Details

Click knowledge base name to enter details.

Step 2: Upload Documents

  1. Click Upload Document button
  2. Select local files
  3. Wait for upload completion

System will automatically:

  1. Parse document content
  2. Chunk text
  3. Generate vectors
  4. Build index

3. Vector Search

Step 1: Enter Search Page

In knowledge base details page, click Vector Search tab.

Step 2: Enter Query

Enter query in search box, e.g.:

How to use DataMate for data cleaning?

Step 3: View Search Results

System returns most relevant text chunks with similarity scores:

| Rank | Text Chunk | Similarity | Source Doc | Actions |
| --- | --- | --- | --- | --- |
| 1 | DataMate’s data cleaning module… | 0.92 | user_guide.pdf | View |
| 2 | Configure cleaning task… | 0.87 | tutorial.md | View |
| 3 | Cleaning operator list… | 0.81 | reference.txt | View |
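Under the hood, vector search ranks chunks by similarity between the query embedding and each chunk embedding. A toy cosine-similarity ranking in pure Python — the 3-dimensional vectors are made up for illustration; real embeddings come from the configured model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; a real index stores model-generated vectors per chunk.
chunks = {
    "cleaning module overview": [0.9, 0.1, 0.0],
    "collection task setup": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]
ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
print(ranked[0])
```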

4. Knowledge Base Q&A (RAG)

Step 1: Enable RAG

In knowledge base details page, click RAG Q&A tab.

Step 2: Configure RAG Parameters

  • LLM: Select LLM to use
  • Retrieval count: Number of text chunks to retrieve
  • Temperature: Control generation randomness
  • Prompt template: Custom Q&A template

Step 3: Q&A

Enter question in dialog box, e.g.:

User: What data cleaning operators does DataMate support?

Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]
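Conceptually, RAG combines the retrieved chunks and the user question into a single prompt for the LLM. A minimal sketch of that assembly step (the helper name and default template are illustrative assumptions):

```python
def build_rag_prompt(question, retrieved_chunks, template=None):
    """Assemble a RAG prompt: retrieved context first, then the user question."""
    template = template or (
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    # Separate chunks so the model can tell where one passage ends.
    context = "\n---\n".join(retrieved_chunks)
    return template.format(context=context, question=question)
```

The "Prompt template" parameter above customizes exactly this template; the "Retrieval count" parameter controls how many chunks end up in the context.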

Best Practices

1. Document Preparation

Before uploading documents:

  • Unify format: Convert to unified format (PDF, Markdown)
  • Clean content: Remove irrelevant content (headers, ads)
  • Maintain structure: Keep good document structure
  • Add metadata: Add document metadata (author, date, tags)

2. Chunking Strategy Selection

Choose based on document type:

| Document Type | Recommended Strategy | Chunk Size |
|---------------|----------------------|------------|
| Technical docs | Paragraph chunking | - |
| Long reports | Semantic chunking | - |
| Short text | Character chunking | 500 |
| Code | Character chunking | 300 |

Common Questions

Q: Document stuck in “Processing”?

A: Check:

  1. Document format: Ensure format is supported
  2. Document size: Single document under 100MB
  3. Vector service: Check if vector service is running
  4. View logs: Check detailed error messages

Q: Inaccurate search results?

A: Optimization suggestions:

  1. Adjust chunking: Try different chunking methods
  2. Increase chunk size: Add more context
  3. Use reranking: Enable reranking model
  4. Optimize query: Use clearer query statements
  5. Change embedding model: Try other models

API Reference

For detailed API documentation, see:

8 - Operator Market

Manage and use DataMate operators

Operator marketplace provides rich data processing operators and supports custom operator development.

Features Overview

Operator marketplace provides:

  • Built-in Operators: Rich built-in data processing operators
  • Operator Publishing: Publish and share custom operators
  • Operator Installation: Install third-party operators
  • Custom Development: Develop custom operators

Built-in Operators

Data Cleaning Operators

| Operator | Function | Input | Output |
|----------|----------|-------|--------|
| Deduplication | Remove duplicates | Dataset | Deduplicated data |
| Null Handler | Handle nulls | Dataset | Filled data |
| Format Converter | Convert format | Original format | New format |

Text Processing Operators

| Operator | Function |
|----------|----------|
| Text Segmentation | Chinese word segmentation |
| Remove Stopwords | Remove common stopwords |
| Text Cleaning | Clean special characters |

Quick Start

1. Browse Operators

Step 1: Enter Operator Market

Select Operator Market in the left navigation.

Step 2: Browse Operators

View all available operators with ratings and installation counts.

2. Install Operator

Install Built-in Operator

Built-in operators are installed by default.

Install Third-party Operator

  1. In operator details page, click Install
  2. Wait for installation completion

3. Use Operator

After installation, use in:

  • Data Cleaning: Add operator node to cleaning pipeline
  • Pipeline Orchestration: Add operator node to workflow

Advanced Features

Develop Custom Operator

Create Operator

  1. In operator market page, click Create Operator
  2. Fill operator information
  3. Write operator code (Python)
  4. Package and publish

Python Operator Example:

import re

class MyTextCleaner:
    """Example custom operator: strips special characters from string records."""

    def __init__(self, config):
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Pass non-string records through unchanged.
        if not isinstance(data, str):
            return data
        result = data
        if self.remove_special_chars:
            # Keep only word characters and whitespace.
            result = re.sub(r'[^\w\s]', '', result)
        return result

Best Practices

1. Operator Design

Good operator design:

  • Single responsibility: One operator does one thing
  • Configurable: Rich configuration options
  • Error handling: Comprehensive error handling
  • Performance: Consider large-scale data

Common Questions

Q: Operator execution failed?

A: Troubleshoot:

  1. View logs
  2. Check configuration
  3. Check data format
  4. Test locally

9 - Pipeline Orchestration

Visual workflow orchestration with DataMate

Pipeline orchestration module provides drag-and-drop visual interface for designing and managing complex data processing workflows.

Features Overview

Pipeline orchestration provides:

  • Visual Designer: Drag-and-drop workflow design
  • Rich Node Types: Data processing, conditions, loops, etc.
  • Flow Execution: Auto-execute and monitor workflows
  • Template Management: Save and reuse flow templates
  • Version Management: Flow version control

Node Types

Data Nodes

| Node | Function | Config |
|------|----------|--------|
| Input Dataset | Read from dataset | Select dataset |
| Output Dataset | Write to dataset | Select dataset |
| Data Collection | Execute collection task | Select task |
| Data Cleaning | Execute cleaning task | Select task |
| Data Synthesis | Execute synthesis task | Select task |

Logic Nodes

| Node | Function | Config |
|------|----------|--------|
| Condition Branch | Execute different branches | Condition expression |
| Loop | Repeat execution | Loop count/condition |
| Parallel | Execute multiple branches in parallel | Branch count |
| Wait | Wait for specified time | Duration |

Quick Start

1. Create Pipeline

Step 1: Enter Pipeline Orchestration Page

Select Pipeline Orchestration in left navigation.

Step 2: Create Pipeline

Click Create Pipeline.

Step 3: Fill Basic Information

  • Pipeline name: e.g., data_processing_pipeline
  • Description: Pipeline purpose (optional)

Step 4: Design Flow

  1. Drag nodes from left library to canvas
  2. Connect nodes
  3. Configure node parameters
  4. Save flow

Example:

Input Dataset → Data Cleaning → Condition Branch
                                    ├── Satisfied → Data Annotation → Output
                                    └── Not Satisfied → Data Synthesis → Output
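The branching flow above can be sketched in code; `run_pipeline` and `condition_branch` are illustrative helpers, not DataMate's execution engine:

```python
def run_pipeline(record, steps):
    """Run a record through a linear list of processing steps."""
    for step in steps:
        record = step(record)
    return record

def condition_branch(predicate, on_true, on_false):
    """Build a step that routes records to one of two sub-pipelines."""
    def step(record):
        branch = on_true if predicate(record) else on_false
        return run_pipeline(record, branch)
    return step
```

Each node in the designer corresponds to one step; the condition branch node wraps two sub-pipelines and picks one per record based on the condition expression.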

2. Execute Pipeline

Step 1: Enter Execution Page

Click pipeline name to enter details.

Step 2: Execute Pipeline

Click Execute Now.

Step 3: Monitor Execution

View execution status, progress, and logs.

Advanced Features

Flow Templates

Save as Template

  1. Design flow
  2. Click Save as Template
  3. Enter template name

Use Template

  1. Create pipeline, click Use Template
  2. Select template
  3. Load to designer

Parameterized Flow

Define parameters in pipeline:

{
  "parameters": [
    {
      "name": "input_dataset",
      "type": "dataset",
      "required": true
    }
  ]
}
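Before execution, supplied values would be checked against such definitions. A minimal sketch (the `validate_parameters` helper is hypothetical, not part of DataMate's API):

```python
def validate_parameters(definitions, values):
    """Check supplied values against parameter definitions like the JSON above."""
    errors = []
    for p in definitions:
        # A required parameter must be present in the supplied values.
        if p.get("required") and p["name"] not in values:
            errors.append(f"missing required parameter: {p['name']}")
    return errors
```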

Scheduled Execution

Configure scheduled execution:

  • Cron expression: 0 0 2 * * ? (Daily at 2 AM)
  • Execution parameters

Best Practices

1. Flow Design

Recommended principles:

  • Modular: Split complex flows
  • Reusable: Use templates
  • Maintainable: Add comments
  • Testable: Test individually

2. Performance Optimization

Optimize performance:

  • Parallelize: Use parallel nodes
  • Reduce data transfer: Process locally
  • Batch operations: Use batch operations
  • Cache results: Cache intermediate results

Common Questions

Q: Flow execution failed?

A: Troubleshoot:

  1. View execution logs
  2. Check node configuration
  3. Check data format
  4. Test nodes individually

10 - Agent Chat

Use DataMate Agent for intelligent conversation

Agent chat module integrates large language models to provide intelligent Q&A and knowledge base retrieval capabilities.

Features Overview

Agent chat module provides:

  • Intelligent Chat: Natural language conversation based on LLMs
  • Knowledge Base Q&A: RAG-integrated knowledge base Q&A
  • Multi-turn Dialogue: Support context-aware multi-turn conversation
  • Dialogue Management: Conversation history and management
  • Quick Commands: Quick commands for common operations

Quick Start

1. Access Agent

Select Agent Chat in left navigation, or visit:

  • Local: http://localhost:30000/chat
  • Production: https://your-domain/chat

2. Start Conversation

Enter question in dialog box, e.g.:

User: What data formats does DataMate support?

Assistant: DataMate supports multiple data formats:

Images: JPG, PNG, GIF, BMP, WebP
Text: TXT, MD, JSON, CSV
Audio: MP3, WAV, FLAC, AAC
Video: MP4, AVI, MOV, MKV

These formats can be uploaded and managed in the data management module.

3. Use Knowledge Base Q&A

Enable Knowledge Base

  1. Click Settings button
  2. In Knowledge Base settings, select knowledge base to use
  3. Save settings

Knowledge Base Q&A

User: How to create a data cleaning task?

Assistant: According to the knowledge base documentation:

1. Enter data processing page
2. Click create task button
3. Configure basic information
4. Configure cleaning pipeline (drag operators to canvas)
5. Configure execution parameters
6. Create and execute task

[Source: user_guide.md, data_cleansing.md]

Advanced Features

Conversation Modes

General Chat

Use LLM for general conversation without knowledge base.

Knowledge Base Q&A

Answer questions based on knowledge base content.

Mixed Mode

Combine general chat and knowledge base Q&A.

Quick Commands

| Command | Function | Example |
|---------|----------|---------|
| /dataset | Query datasets | /dataset list |
| /task | Query tasks | /task status |
| /help | Show help | /help |
| /clear | Clear conversation | /clear |
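A quick-command router like the one described above can be sketched as follows; `dispatch` and the handler names are illustrative assumptions, not DataMate's implementation:

```python
def dispatch(message, handlers, fallback):
    """Route '/command arg' messages to handlers; other text goes to the LLM fallback."""
    if message.startswith("/"):
        cmd, _, arg = message[1:].partition(" ")
        if cmd in handlers:
            return handlers[cmd](arg)
    # Unknown commands and plain text fall through to normal chat.
    return fallback(message)
```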

Conversation History

View History

  1. Click History tab on left
  2. Select historical conversation
  3. View conversation content

Continue Conversation

Click historical conversation to continue.

Export Conversation

Export conversation records:

  • Markdown: Export as Markdown file
  • JSON: Export as JSON
  • PDF: Export as PDF

Best Practices

1. Effective Questioning

Get better answers:

  • Be specific: Clear and specific questions
  • Provide context: Include background information
  • Break down: Split complex questions

2. Knowledge Base Usage

Make the most of knowledge base:

  • Select appropriate knowledge base: Choose based on question
  • View sources: Check answer source documents
  • Verify information: Verify with source documents

Common Questions

Q: Inaccurate Agent answers?

A: Improve:

  1. Optimize question: More specific
  2. Check knowledge base: Ensure relevant content exists
  3. Change model: Try more powerful model
  4. Provide context: More background info