User Guide
DataMate feature usage guides
This guide introduces how to use each feature module of DataMate.
DataMate provides comprehensive data processing solutions for large models, covering the full process of data collection, management, cleaning, annotation, synthesis, and evaluation.
Feature Modules
Typical Use Cases
Model Fine-tuning Scenario
1. Data Collection → 2. Data Management → 3. Data Cleaning → 4. Data Annotation
↓
5. Data Evaluation → 6. Export Training Data
RAG Application Scenario
1. Upload Documents → 2. Vectorization Index → 3. Knowledge Base Management
↓
4. Agent Chat (Knowledge Base Q&A)
Data Augmentation Scenario
1. Prepare Raw Data → 2. Create Instruction Template → 3. Data Synthesis
↓
4. Quality Evaluation → 5. Export Augmented Data
Quick Links
1 - Data Collection
Collect data from multiple data sources with DataMate
The data collection module helps you collect data from multiple sources (databases, file systems, APIs, etc.) into the DataMate platform.
Features Overview
Based on DataX, the data collection module supports:
- Multiple Data Sources: MySQL, PostgreSQL, Oracle, SQL Server, etc.
- Heterogeneous Sync: Data sync between different sources
- Batch Collection: Large-scale batch collection and sync
- Scheduled Tasks: Support scheduled execution
- Task Monitoring: Real-time monitoring of collection tasks
Supported Data Sources
| Data Source Type | Reader | Writer | Description |
|---|---|---|---|
| General Relational Databases | ✅ | ✅ | Supports MySQL, PostgreSQL, OpenGauss, SQL Server, DM, DB2 |
| MySQL | ✅ | ✅ | Relational database |
| PostgreSQL | ✅ | ✅ | Relational database |
| OpenGauss | ✅ | ✅ | Relational database |
| SQL Server | ✅ | ✅ | Microsoft database |
| DM (Dameng) | ✅ | ✅ | Domestic database |
| DB2 | ✅ | ✅ | IBM database |
| StarRocks | ✅ | ✅ | Analytical database |
| NAS | ✅ | ✅ | Network storage |
| S3 | ✅ | ✅ | Object storage |
| GlusterFS | ✅ | ✅ | Distributed file system |
| API Collection | ✅ | ✅ | API interface data |
| JSON Files | ✅ | ✅ | JSON format files |
| CSV Files | ✅ | ✅ | CSV format files |
| TXT Files | ✅ | ✅ | Text files |
| FTP | ✅ | ✅ | FTP servers |
| HDFS | ✅ | ✅ | Hadoop HDFS |
Quick Start
1. Create Collection Task
Step 1: Enter Data Collection Page
Select Data Collection in the left navigation.
Step 2: Create Task
Click Create Task button.
Step 3: Fill in Basic Information
Fill in the following basic information:
- Name: A meaningful name for the task
- Timeout: Task execution timeout (seconds)
- Description: Task purpose (optional)
Step 4: Select Sync Mode
Select the task synchronization mode:
- Immediate Sync: Execute once immediately after task creation
- Scheduled Sync: Execute periodically according to schedule rules
When selecting Scheduled Sync, configure the execution policy:
- Execution Cycle: Hourly / Daily / Weekly / Monthly
- Execution Time: Select the execution time point
Step 5: Configure Data Source
Select data source type: Choose from dropdown list (e.g., MySQL, CSV, etc.)
Configure data source parameters: Fill in connection parameters based on the selected data source template (form format)
MySQL Example:
- JDBC URL: `jdbc:mysql://localhost:3306/mydb`
- Username: `root`
- Password: `password`
- Table Name: `users`
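Since collection is based on DataX, the MySQL form values above map roughly onto a DataX reader job. A hedged sketch for illustration only, using the example values: only the reader half is shown, the writer side (the target dataset) is configured by DataMate, and exact field names depend on the DataX version.

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "password",
            "column": ["*"],
            "connection": [
              {
                "table": ["users"],
                "jdbcUrl": ["jdbc:mysql://localhost:3306/mydb"]
              }
            ]
          }
        }
      }
    ]
  }
}
```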
Step 6: Configure Field Extraction
Field mapping is not supported. You can only extract specific fields from the configured SQL.
- Extract specific fields: Enter the field names you want to extract in the field list
- Extract all fields: Leave the field list empty to extract all fields from the SQL query result
Step 7: Create and Execute
Click Create button to create the task.
- If Immediate Sync is selected, task starts immediately
- If Scheduled Sync is selected, task runs periodically according to schedule
2. Monitor Task Execution
View all collection tasks with status, progress, and operations.
3. Task Management
Each task in the task list has the following actions available:
- View Execution Records: View all historical executions of the task
- Delete: Delete the task (note: deleting a task does not delete collected data)
Click the task name to view task details including:
- Basic configuration
- Execution record list
- Data statistics
Common Questions
Q: Task execution failed?
A: Troubleshooting:
- Check data source connection
- View execution logs
- Check data format
- Verify target dataset exists
Q: How to collect large tables?
A:
- Use incremental collection
- Split into multiple tasks
- Adjust concurrent parameters
- Use filter conditions
API Reference
2 - Data Management
Manage datasets and files with DataMate
The data management module provides unified dataset management, supporting storage, query, and operations across multiple data types.
Features Overview
Data management module provides:
- Multiple data types: Image, text, audio, video, and multimodal support
- File management: Upload, download, preview, delete operations
- Directory structure: Support for hierarchical directory organization
- Tag management: Use tags to categorize and retrieve data
- Statistics: Dataset size, file count, and other statistics
Dataset Types
| Type | Description | Supported Formats |
|---|---|---|
| Image | Image data | JPG, PNG, GIF, BMP, WebP |
| Text | Text data | TXT, MD, JSON, CSV |
| Audio | Audio data | MP3, WAV, FLAC, AAC |
| Video | Video data | MP4, AVI, MOV, MKV |
| Multimodal | Multimodal data | Mixed formats |
Quick Start
1. Create Dataset
Step 1: Enter Data Management Page
In the left navigation, select Data Management.
Step 2: Create Dataset
Click the Create Dataset button in the upper right corner.
Step 3: Fill in Dataset Information
- Dataset name: e.g., `user_images_dataset`
- Dataset type: Select data type (e.g., Image)
- Description: Dataset purpose description (optional)
- Tags: Add tags for categorization (optional)
Step 4: Create Dataset
Click the Create button to complete.
2. Upload Files
Method 1: Drag & Drop
- Enter dataset details page
- Drag files directly to the upload area
- Wait for upload completion
Method 2: Click Upload
- Click Upload File button
- Select local files
- Wait for upload completion
Method 3: Chunked Upload (Large Files)
For large files (>100MB), the system automatically uses chunked upload:
- Select large file to upload
- System automatically splits the file
- Upload chunks one by one
- Automatically merge
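The chunked-upload flow above can be sketched as follows; the 100 MB threshold comes from the guide, while the helper functions themselves are illustrative assumptions, not DataMate's upload code:

```python
# Illustrative sketch of chunked upload. Only the splitting logic is shown;
# the actual upload and server-side merge are handled by the platform.
CHUNK_THRESHOLD = 100 * 1024 * 1024  # files larger than ~100MB are chunk-uploaded


def needs_chunking(size_bytes: int) -> bool:
    """Decide whether a file should be uploaded in chunks."""
    return size_bytes > CHUNK_THRESHOLD


def split_into_chunks(data: bytes, chunk_size: int):
    """Split a payload into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```

Joining the chunks back together reproduces the original payload, which is what the automatic merge step relies on.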
3. Create Directory
Step 1: Enter Dataset
Click dataset name to enter details.
Step 2: Create Directory
- Click Create Directory button
- Enter directory name
- Select parent directory (optional)
- Click confirm
Directory structure example:
user_images_dataset/
├── train/
│ ├── cat/
│ └── dog/
├── test/
│ ├── cat/
│ └── dog/
└── validation/
├── cat/
└── dog/
4. Manage Files
View Files
In dataset details page, you can see all files:
| Filename | Size | File Count | Upload Time | Tags | Tag Update Time | Actions |
|---|---|---|---|---|---|---|
| image1.jpg | 2.3 MB | 1 | 2024-01-15 | Training Set | 2024-01-16 | Download Rename Delete |
| image2.png | 1.8 MB | 1 | 2024-01-15 | Validation Set | 2024-01-16 | Download Rename Delete |
Preview File
Click Preview button to preview in browser:
- Image: Display thumbnail and details
- Text: Display text content
- Audio: Online playback
- Video: Online playback
Download File
- Single file download: Click Download button
Currently, batch download and package download are not supported.
5. Dataset Operations
View Statistics
In dataset details page, you can see:
- Total files: Total number of files in dataset
- Total size: Total size of all files
Edit Dataset
Click Edit button to modify:
- Dataset name
- Description
- Tags
- Associated collection task
Delete Dataset
Click Delete button to delete entire dataset.
Note: Deleting a dataset will also delete all files within it. This action cannot be undone.
Advanced Features
Tag Management
Create Tag
- In dataset list page, click Tag Management
- Click Create Tag
- Enter tag name
Apply Tag to Dataset
- Edit the dataset
- Select existing tags in the tag bar
- Save the dataset
Filter by Tag
In dataset list page, click a tag to filter datasets with that tag.
Best Practices
1. Dataset Organization
Recommended directory organization:
project_dataset/
├── raw/ # Raw data
├── processed/ # Processed data
├── train/ # Training data
├── validation/ # Validation data
└── test/ # Test data
2. Naming Conventions
- Dataset name: Use lowercase letters and underscores, e.g., `user_images_2024`
- Directory name: Use meaningful English names, e.g., `train`, `test`, `processed`
- File name: Keep original filename or use standardized naming
3. Tag Usage
Recommended tag categories:
- Project tags: `project-a`, `project-b`
- Status tags: `raw`, `processed`, `validated`
- Type tags: `image`, `text`, `audio`
- Purpose tags: `training`, `testing`, `evaluation`
4. Data Backup
The system currently does not support automatic backup. To backup data, you can manually download individual files:
- Enter dataset details page
- Find the file you need to backup
- Click the Download button of the file
Common Questions
Q: Large file upload fails?
A: Suggestions for large file uploads:
- Use chunked upload: System automatically enables chunked upload
- Check network: Ensure stable network connection
- Adjust upload parameters: Increase timeout
- Use FTP/SFTP: For very large files, use FTP upload
Q: How to import existing data?
A: Three methods to import existing data:
- Upload files: Upload via interface
- Add files: If files already on server, use add file feature
- Data collection: Use data collection module to collect from external sources
Q: Dataset size limit?
A: Dataset size limits:
- Single file: Maximum 5GB (chunked upload)
- Total dataset: Limited by storage space
- File count: No explicit limit
Regularly clean unnecessary files to free up space.
API Reference
For detailed API documentation, see:
3 - Data Cleaning
Clean and preprocess data with DataMate
The data cleaning module provides powerful data processing capabilities to help you clean, transform, and improve data quality.
Features Overview
Data cleaning module provides:
- Built-in Cleaning Operators: Rich pre-cleaning operator library
- Visual Configuration: Drag-and-drop cleaning pipeline design
- Template Management: Save and reuse cleaning templates
- Batch Processing: Support large-scale data batch cleaning
- Real-time Preview: Preview cleaning results
Cleaning Operator Types
Data Quality Operators
| Operator | Function | Applicable Data Types |
|---|---|---|
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |
Text Cleaning Operators
| Operator | Function |
|---|---|
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |
Quick Start
1. Create Cleaning Task
Step 1: Enter Data Cleaning Page
Select Data Processing in the left navigation.
Step 2: Create Task
Click Create Task button.
- Task name: e.g., `user_data_cleansing`
- Source dataset: Select dataset to clean
- Output dataset: Select or create output dataset
- Drag operators from left library to canvas
- Connect operators to form pipeline
- Configure operator parameters
- Preview cleaning results
Example pipeline:
Input Data → Deduplication → Null Handling → Format Validation → Output Data
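The example pipeline above can be sketched as a chain of operator functions over a list of records. This is an illustration only, not the actual DataMate operator API:

```python
def deduplicate(records):
    """Remove exact duplicate records, preserving first occurrence."""
    seen, out = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            out.append(record)
    return out


def handle_nulls(records, default=""):
    """Replace None field values with a default value."""
    return [{k: (default if v is None else v) for k, v in r.items()} for r in records]


def run_pipeline(records, operators):
    """Apply each operator in order, feeding its output to the next."""
    for op in operators:
        records = op(records)
    return records
```

Composing small, single-purpose operators like this is what makes pipelines easy to preview and reuse as templates.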
2. Use Cleaning Templates
Create Template
- Configure cleaning pipeline
- Click Save as Template
- Enter template name
- Save
Use Template
- Create cleaning task
- Click Use Template
- Select template
- Adjust as needed
3. Monitor Cleaning Task
View task status, progress, and statistics in task list.
Advanced Features
Custom Operators
Develop custom operators. See:
Conditional Branching
Add conditional branches in pipeline:
Input Data → [Condition Check]
├── Satisfied → Pipeline A
└── Not Satisfied → Pipeline B
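The conditional branch above routes each record down one of two sub-pipelines; a minimal sketch, again assuming the simplified record-list convention rather than DataMate's real operator interface:

```python
def conditional_branch(records, condition, pipeline_a, pipeline_b):
    """Route records satisfying `condition` into pipeline A, the rest into B."""
    satisfied = [r for r in records if condition(r)]
    rest = [r for r in records if not condition(r)]
    return pipeline_a(satisfied) + pipeline_b(rest)
```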
Best Practices
1. Pipeline Design
Recommended principles:
- Modular: Split complex pipelines
- Reusable: Use templates and parameters
- Maintainable: Add comments
- Testable: Test individually before combining
Optimize performance:
- Parallelize: Use parallel nodes
- Reduce data transfer: Process locally when possible
- Batch operations: Use batch operations
- Cache results: Cache intermediate results
Common Questions
Q: Task execution failed?
A: Troubleshooting:
- Check data format
- View execution logs
- Check operator parameters
- Test individual operators
- Reduce data size for testing
Q: Cleaning speed is slow?
A: Optimize:
- Reduce operator count
- Optimize operator order
- Increase concurrency
- Use incremental processing
API Reference
4 - Data Annotation
Perform data annotation with DataMate
The data annotation module integrates Label Studio to provide professional-grade data annotation capabilities.
Features Overview
Data annotation module provides:
- Multiple Annotation Types: Image, text, audio, etc.
- Annotation Templates: Rich annotation templates and configurations
- Quality Control: Annotation review and consistency checks
- Team Collaboration: Multi-person collaborative annotation
- Annotation Export: Export annotation results
Annotation Types
Image Annotation
| Type | Description | Use Cases |
|---|---|---|
| Image Classification | Classify entire image | Scene recognition |
| Object Detection | Annotate object locations | Object recognition |
| Semantic Segmentation | Pixel-level classification | Medical imaging |
| Key Point Annotation | Annotate key points | Pose estimation |
Text Annotation
| Type | Description | Use Cases |
|---|---|---|
| Text Classification | Classify text | Sentiment analysis |
| Named Entity Recognition | Annotate entities | Information extraction |
| Text Summarization | Generate summaries | Document understanding |
Quick Start
1. Deploy Label Studio
make install-label-studio
Access: http://localhost:30001
Default credentials:
2. Create Annotation Task
Step 1: Enter Data Annotation Page
Select Data Annotation in the left navigation.
Step 2: Create Task
Click Create Task.
- Task name: e.g., `image_classification_task`
- Source dataset: Select dataset to annotate
- Annotation type: Select type
Image Classification Template:
<View>
<Image name="image" value="$image"/>
<Choices name="choice" toName="image">
<Choice value="cat"/>
<Choice value="dog"/>
<Choice value="bird"/>
</Choices>
</View>
- Annotation method: Single label / Multi label
- Minimum annotations: Per sample (for consistency)
- Review mechanism: Enable/disable review
3. Start Annotation
- Enter annotation interface
- View sample to annotate
- Perform annotation
- Click Submit
- Auto-load next sample
Advanced Features
Quality Control
Annotation Consistency
Check consistency between annotators:
- Cohen’s Kappa: Evaluate consistency
- Majority vote: Use majority annotation results
- Expert review: Expert reviews disputed annotations
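The Cohen's Kappa consistency check mentioned above can be computed with a short sketch (two annotators, nominal labels):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of samples both annotators labeled the same.
    p_observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(count_a) | set(count_b)
    )
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

Values near 1 indicate strong agreement; values near 0 mean agreement is no better than chance, which is usually the signal to send disputed samples to expert review.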
Pre-annotation
Use models for pre-annotation:
- Train or use existing model
- Pre-annotate dataset
- Annotators correct pre-annotations
Best Practices
1. Annotation Guidelines
Create clear guidelines:
- Define standards: Clear annotation standards
- Provide examples: Positive and negative examples
- Edge cases: Handle edge cases
- Train annotators: Ensure understanding
Common Questions
Q: Poor annotation quality?
A: Improve:
- Refine guidelines
- Strengthen training
- Increase reviews
- Use pre-annotation
5 - Data Synthesis
Use large models for data augmentation and synthesis
The data synthesis module leverages large model capabilities to automatically generate high-quality training data, reducing data collection costs.
Features Overview
Data synthesis module provides:
- Instruction template management: Create and manage synthesis instruction templates
- Single task synthesis: Create individual synthesis tasks
- Proportional synthesis task: Synthesize multi-category balanced data by specified ratios
- Large model integration: Support for multiple LLM APIs
- Quality evaluation: Automatic evaluation of synthesized data quality
Quick Start
1. Create Instruction Template
Step 1: Enter Data Synthesis Page
In the left navigation, select Data Synthesis → Synthesis Tasks.
Step 2: Create Instruction Template
- Click Instruction Templates tab
- Click Create Template button
Basic Information:
- Template name: e.g., `qa_generation_template`
- Template description: Describe template purpose (optional)
- Template type: Select template type (Q&A, dialogue, summary, etc.)
Prompt Configuration:
Example prompt:
You are a professional data generation assistant. Generate data based on the following requirements:
Task: Generate Q&A pairs
Topic: {topic}
Count: {count}
Difficulty: {difficulty}
Requirements:
1. Questions should be clear and specific
2. Answers should be accurate and complete
3. Cover different difficulty levels
Output format: JSON
[
{
"question": "...",
"answer": "..."
}
]
Parameter Configuration:
- Model: Select LLM to use (GPT-4, Claude, local model, etc.)
- Temperature: Control generation randomness (0-1)
- Max tokens: Limit generation length
- Other parameters: Configure according to model
Step 4: Save Template
Click Save button to save template.
2. Create Synthesis Task
Step 1: Create Task
- Return to Data Synthesis page
- Click Create Task button
- Fill basic information:
- Task name: e.g., `medical_qa_synthesis`
- Task description: Describe task purpose (optional)
Step 2: Select Dataset and Files
Select required data from existing datasets:
- Select dataset: Choose the dataset to use from the list
- Select files:
- Can select all files from a dataset
- Can also select specific files from a dataset
- Support selecting multiple files
Step 3: Select Synthesis Instruction Template
Select an existing template or create a new one:
- Select from template library: Choose from created templates
- Template type: Q&A generation, dialogue generation, summary generation, etc.
- Preview template: View template prompt content
Step 4: Fill Synthesis Configuration
The synthesis configuration consists of four parts:
1. Set Total Synthesis Count
Set the maximum limit for the entire task:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Maximum QA Pairs | Maximum number of QA pairs to generate for entire task | 5000 | 1-100,000 |
This setting is optional, used for total volume control in large-scale synthesis tasks.
2. Configure Text Chunking Strategy
Chunk the input text files, supporting multiple chunking methods:
| Parameter | Description | Default Value |
|---|---|---|
| Chunking Method | Select chunking strategy | Default chunking |
| Chunk Size | Character count per chunk | 3000 |
| Overlap Size | Overlap characters between adjacent chunks | 100 |
Chunking Method Options:
- Default Chunking (默认分块): Use system default intelligent chunking strategy
- Chapter-based Chunking (按章节分块): Split by chapter structure
- Paragraph-based Chunking (按段落分块): Split by paragraph boundaries
- Fixed Length Chunking (固定长度分块): Split by fixed character length
- Custom Separator Chunking (自定义分隔符分块): Split by custom delimiter
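The fixed-length chunking option above, with the default 3000-character chunks and 100-character overlap, can be sketched as:

```python
def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 100):
    """Fixed-length chunking with overlap between adjacent chunks.

    Defaults mirror the parameter table above; the function itself is an
    illustrative sketch, not DataMate's implementation.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each chunk re-reads `overlap` trailing chars
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, which matters for question generation quality.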
3. Configure Question Synthesis Parameters
Set parameters for question generation:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Question Count | Number of questions generated per chunk | 1 | 1-20 |
| Temperature | Control randomness and diversity of question generation | 0.7 | 0-2 |
| Model | Select CHAT model for question generation | - | Select from model list |
Parameter Notes:
- Question Count: Number of questions generated per text chunk. Higher value generates more questions.
- Temperature: Higher values produce more diverse questions, lower values produce more stable questions.
4. Configure Answer Synthesis Parameters
Set parameters for answer generation:
| Parameter | Description | Default Value | Range |
|---|---|---|---|
| Temperature | Control stability of answer generation | 0.7 | 0-2 |
| Model | Select CHAT model for answer generation | - | Select from model list |
Parameter Notes:
- Temperature: Lower values produce more conservative and accurate answers, higher values produce more diverse and creative answers.
Synthesis Types:
The system supports two synthesis types:
- SFT Q&A Synthesis (SFT 问答数据合成): Generate Q&A pairs for supervised fine-tuning
- COT Chain-of-Thought Synthesis (COT 链式推理合成): Generate data with reasoning process
Step 5: Start Task
Click the Start Task button; the task will start executing automatically.
3. Create Ratio Synthesis Task
Ratio synthesis tasks are used to synthesize multi-category balanced data in specified proportions.
Step 1: Create Ratio Task
- In the left navigation, select Data Synthesis → Ratio Tasks
- Click Create Task button
Step 2: Fill Basic Information
| Parameter | Description | Required |
|---|---|---|
| Task Name | Unique identifier for the task | Yes |
| Total Target Count | Target total count for entire ratio task | Yes |
| Task Description | Describe purpose and requirements of ratio task | No |
Example:
- Task name: `balanced_dataset_synthesis`
- Total target count: 10000
- Task description: Generate balanced data for training and validation sets
Step 3: Select Datasets
Select datasets to participate in the ratio synthesis from existing datasets:
Dataset Selection Features:
- Search Datasets: Search datasets by keyword
- Multi-select Support: Can select multiple datasets simultaneously
- Dataset Information: Display detailed information for each dataset
- Dataset name and type
- Dataset description
- File count
- Dataset size
- Label distribution preview (up to 8 labels)
After selecting datasets, the system automatically loads label distribution information for each dataset.
Step 4: Fill Ratio Configuration
Configure specific synthesis rules for each selected dataset:
Ratio Configuration Items:
| Parameter | Description | Range |
|---|---|---|
| Label | Select label from dataset’s label distribution | Based on dataset labels |
| Label Value | Specific value under selected label | Based on label value list |
| Label Update Time | Select label update date range (optional) | Date picker |
| Quantity | Data count to generate for this config | 0 to total target count |
Feature Notes:
- Auto Distribute: Click “Auto Distribute” button, system automatically distributes total count evenly across datasets
- Quantity Limit: Each configuration item’s quantity cannot exceed the dataset’s total file count
- Percentage Calculation: System automatically calculates percentage of each configuration item
- Delete Configuration: Can delete unwanted configuration items
- Add Configuration: Each dataset can have multiple different label configurations
Example Configuration:
| Dataset | Label | Label Value | Label Update Time | Quantity |
|---|---|---|---|---|
| Training Dataset | Category | Training | - | 6000 |
| Training Dataset | Category | Validation | - | 2000 |
| Test Dataset | Category | Test | 2024-01-01 to 2024-12-31 | 2000 |
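The Auto Distribute behavior noted above (evenly splitting the total count across configurations) can be sketched as follows; the remainder-to-the-front convention is an assumption for illustration:

```python
def auto_distribute(total: int, n_configs: int):
    """Split `total` as evenly as possible; remainder units go to the front."""
    base, remainder = divmod(total, n_configs)
    return [base + (1 if i < remainder else 0) for i in range(n_configs)]
```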
Step 5: Execute Task
Click the Start Task button; the system will create and execute the task according to the ratio configuration.
4. Monitor Synthesis Task
View Task List
In data synthesis page, you can see all synthesis tasks:
| Task Name | Template | Status | Progress | Generated Count | Actions |
|---|---|---|---|---|---|
| Medical QA Synthesis | qa_template | Running | 50% | 50/100 | View Details |
| Sentiment Data Synthesis | sentiment_template | Completed | 100% | 1000/1000 | View Details |
Advanced Features
Template Variables
Use variables in prompts for dynamic configuration:
Variable syntax: {variable_name}
Example:
Generate {count} {difficulty} level {type} about {topic}.
Built-in variables:
- `{current_date}`: Current date
- `{current_time}`: Current time
- `{random_id}`: Random ID
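Variable substitution of this kind can be done with simple placeholder replacement; a minimal sketch (the renderer and the choice of built-ins filled here are illustrative, not DataMate's actual implementation):

```python
import uuid
from datetime import date


def render_template(template: str, variables: dict) -> str:
    """Replace {name} placeholders with values; built-ins are filled automatically."""
    builtins = {
        "current_date": date.today().isoformat(),
        "random_id": uuid.uuid4().hex[:8],
    }
    merged = {**builtins, **variables}  # user variables override built-ins
    for name, value in merged.items():
        template = template.replace("{" + name + "}", str(value))
    return template
```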
Model Selection
DataMate supports multiple LLMs:
| Model | Type | Description |
|---|---|---|
| GPT-4 | OpenAI | High-quality generation |
| GPT-3.5-Turbo | OpenAI | Fast generation |
| Claude 3 | Anthropic | Long-text generation |
| Wenxin Yiyan | Baidu | Chinese optimized |
| Tongyi Qianwen | Alibaba | Chinese optimized |
| Local Model | Deployed locally | Private deployment |
Best Practices
1. Prompt Design
Good prompts should:
- Define task clearly: Clearly describe generation task
- Specify format: Clearly define output format requirements
- Provide examples: Give expected output examples
- Control quality: Set quality requirements
Example prompt:
You are a professional educational content creator.
Task: Generate educational Q&A pairs
Subject: {subject}
Grade: {grade}
Count: {count}
Requirements:
1. Questions should be appropriate for the grade level
2. Answers should be accurate, detailed, and easy to understand
3. Each answer should include explanation process
4. Do not generate sensitive or inappropriate content
Output format (JSON):
[
{
"id": 1,
"question": "Question content",
"answer": "Answer content",
"explanation": "Explanation content",
"difficulty": "easy/medium/hard",
"knowledge_points": ["point1", "point2"]
}
]
Start generating:
2. Parameter Tuning
Adjust model parameters according to needs:
| Parameter | High Quality | Fast Generation | Creative Generation |
|---|---|---|---|
| Temperature | 0.3-0.5 | 0.1-0.3 | 0.7-1.0 |
| Max tokens | As needed | Shorter | Longer |
| Top P | 0.9-0.95 | 0.9 | 0.95-1.0 |
Common Questions
Q: Generated data quality is not ideal?
A: Optimization suggestions:
- Improve prompt: More detailed and clear instructions
- Adjust parameters: Lower temperature, increase max tokens
- Provide examples: Give examples in prompt
- Change model: Try other LLMs
- Manual review: Manual review and filtering
Q: Generation speed is slow?
A: Acceleration suggestions:
- Reduce count: Generate in smaller batches
- Adjust concurrency: Increase concurrency appropriately
- Use faster model: Like GPT-3.5-Turbo
- Shorten output: Reduce max tokens
- Use local model: Deploy local model for acceleration
API Reference
For detailed API documentation, see:
6 - Data Evaluation
Evaluate data quality with DataMate
The data evaluation module provides multi-dimensional data quality evaluation capabilities.
Features Overview
Data evaluation module provides:
- Quality Metrics: Rich data quality evaluation metrics
- Automatic Evaluation: Auto-execute evaluation tasks
- Manual Evaluation: Manual sampling evaluation
- Evaluation Reports: Generate detailed reports
- Quality Tracking: Track data quality trends
Evaluation Dimensions
Data Completeness
| Metric | Description | Calculation |
|---|---|---|
| Null Rate | Null value ratio | Null count / Total count |
| Missing Field Rate | Required field missing rate | Missing fields / Total fields |
| Record Complete Rate | Complete record ratio | Complete records / Total records |
Data Accuracy
| Metric | Description | Calculation |
|---|---|---|
| Format Correct Rate | Format compliance | Format correct / Total |
| Value Range Compliance | In valid range | In range / Total |
| Consistency Rate | Data consistency | Consistent records / Total |
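The completeness metrics above follow directly from their formulas; a short illustrative sketch (treating `None` and empty strings as nulls is an assumption, not a DataMate rule):

```python
def null_rate(records, field):
    """Null Rate = null count / total count for one field."""
    nulls = sum(1 for r in records if r.get(field) in (None, ""))
    return nulls / len(records)


def record_complete_rate(records, required_fields):
    """Record Complete Rate = complete records / total records."""
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return complete / len(records)
```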
Quick Start
1. Create Evaluation Task
Step 1: Enter Data Evaluation Page
Select Data Evaluation in the left navigation.
Step 2: Create Task
Click Create Task.
- Task name: e.g., `data_quality_evaluation`
- Evaluation dataset: Select dataset to evaluate
Select dimensions:
- ✅ Data completeness
- ✅ Data accuracy
- ✅ Data uniqueness
- ✅ Data timeliness
Completeness Rules:
Required fields: name, email, phone
Null threshold: 5% (warn if exceeded)
2. Execute Evaluation
Automatic Evaluation
Auto-executes after creation, or click Execute Now.
Manual Evaluation
- Click Manual Evaluation tab
- View samples to evaluate
- Manually evaluate quality
- Submit results
3. View Evaluation Report
Overall Score
Overall Quality Score: 85 (Excellent)
Completeness: 90 ⭐⭐⭐⭐⭐
Accuracy: 82 ⭐⭐⭐⭐
Uniqueness: 95 ⭐⭐⭐⭐⭐
Timeliness: 75 ⭐⭐⭐⭐
Detailed Metrics
Completeness:
- Null rate: 3.2% ✅
- Missing field rate: 1.5% ✅
- Record complete rate: 96.8% ✅
Advanced Features
Custom Evaluation Rules
Regex Validation
Field: phone
Rule: ^1[3-9]\d{9}$
Description: China mobile phone number
Value Range Validation
Field: age
Min value: 0
Max value: 120
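The two custom rules above can be checked with standard tooling; a minimal sketch using Python's `re` module, with the phone pattern taken from the example:

```python
import re

PHONE_RULE = r"^1[3-9]\d{9}$"  # the China mobile number rule from the example


def regex_valid(value: str, pattern: str) -> bool:
    """Regex validation for a field value."""
    return re.match(pattern, value) is not None


def range_valid(value, min_value, max_value) -> bool:
    """Value range validation with inclusive bounds."""
    return min_value <= value <= max_value
```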
Comparison Evaluation
Compare different datasets or versions.
Best Practices
1. Regular Evaluation
Recommended schedule:
- Daily: Critical data
- Weekly: General data
- Monthly: All data
2. Establish Baseline
Create quality baseline for each dataset.
3. Continuous Improvement
Based on evaluation results:
- Clean problem data
- Optimize collection process
- Update validation rules
Common Questions
Q: Evaluation task failed?
A: Troubleshoot:
- Check dataset exists
- Check rule configuration
- View execution logs
- Test with small sample size
API Reference
7 - Knowledge Base Management
Build and manage RAG knowledge bases with DataMate
The knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.
Features Overview
Knowledge base management module provides:
- Document upload: Support multiple document formats
- Text chunking: Intelligent text splitting strategies
- Vectorization: Automatic text-to-vector conversion
- Vector search: Semantic similarity-based retrieval
- Knowledge base Q&A: RAG-based intelligent Q&A
Supported Document Formats
| Format | Description | Recommended For |
|---|---|---|
| TXT | Plain text | General text |
| PDF | PDF documents | Documents, reports |
| Markdown | Markdown files | Technical docs |
| JSON | JSON data | Structured data |
| CSV | CSV tables | Tabular data |
| DOCX | Word documents | Office documents |
Quick Start
1. Create Knowledge Base
Step 1: Enter Knowledge Base Page
In the left navigation, select Knowledge Generation.
Step 2: Create Knowledge Base
Click Create Knowledge Base button in upper right.
- Knowledge base name: e.g., `company_docs_kb`
- Knowledge base description: Describe purpose (optional)
- Knowledge base type: General / Professional domain
Embedding model: Select embedding model
- OpenAI text-embedding-ada-002
- BGE-M3
- Custom model
Vector dimension: Auto-set based on model
Index type: IVF_FLAT / HNSW / IVF_PQ
2. Upload Documents
Step 1: Enter Knowledge Base Details
Click knowledge base name to enter details.
Step 2: Upload Documents
- Click Upload Document button
- Select local files
- Wait for upload completion
System will automatically:
- Parse document content
- Chunk text
- Generate vectors
- Build index
3. Vector Search
Step 1: Enter Search Page
In knowledge base details page, click Vector Search tab.
Step 2: Enter Query
Enter query in search box, e.g.:
How to use DataMate for data cleaning?
Step 3: View Search Results
System returns most relevant text chunks with similarity scores:
| Rank | Text Chunk | Similarity | Source Doc | Actions |
|---|---|---|---|---|
| 1 | DataMate’s data cleaning module… | 0.92 | user_guide.pdf | View |
| 2 | Configure cleaning task… | 0.87 | tutorial.md | View |
| 3 | Cleaning operator list… | 0.81 | reference.txt | View |
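The similarity scores in the results table come from comparing the query vector against each chunk vector; a minimal sketch of cosine-similarity ranking (an illustration of the principle, not DataMate's vector index):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, chunks):
    """chunks: list of (text, vector); returns (text, score) ranked by similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In production this brute-force scan is replaced by an approximate index such as the IVF_FLAT or HNSW options listed earlier, but the ranking principle is the same.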
4. Knowledge Base Q&A (RAG)
Step 1: Enable RAG
In knowledge base details page, click RAG Q&A tab.
- LLM: Select LLM to use
- Retrieval count: Number of text chunks to retrieve
- Temperature: Control generation randomness
- Prompt template: Custom Q&A template
Step 3: Q&A
Enter question in dialog box, e.g.:
User: What data cleaning operators does DataMate support?
Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]
Best Practices
1. Document Preparation
Before uploading documents:
- Unify format: Convert to unified format (PDF, Markdown)
- Clean content: Remove irrelevant content (headers, ads)
- Maintain structure: Keep good document structure
- Add metadata: Add document metadata (author, date, tags)
2. Chunking Strategy Selection
Choose based on document type:
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Technical docs | Paragraph chunking | - |
| Long reports | Semantic chunking | - |
| Short text | Character chunking | 500 |
| Code | Character chunking | 300 |
Common Questions
Q: Document stuck in “Processing”?
A: Check:
- Document format: Ensure format is supported
- Document size: Single document under 100MB
- Vector service: Check if vector service is running
- View logs: Check detailed error messages
Q: Inaccurate search results?
A: Optimization suggestions:
- Adjust chunking: Try different chunking methods
- Increase chunk size: Add more context
- Use reranking: Enable reranking model
- Optimize query: Use clearer query statements
- Change embedding model: Try other models
API Reference
For detailed API documentation, see:
8 - Operator Market
Manage and use DataMate operators
Operator marketplace provides rich data processing operators and supports custom operator development.
Features Overview
Operator marketplace provides:
- Built-in Operators: Rich built-in data processing operators
- Operator Publishing: Publish and share custom operators
- Operator Installation: Install third-party operators
- Custom Development: Develop custom operators
Built-in Operators
Data Cleaning Operators
| Operator | Function | Input | Output |
|---|---|---|---|
| Deduplication | Remove duplicates | Dataset | Deduplicated data |
| Null Handler | Handle nulls | Dataset | Filled data |
| Format Converter | Convert format | Original format | New format |
Text Processing Operators
| Operator | Function |
|---|---|
| Text Segmentation | Chinese word segmentation |
| Remove Stopwords | Remove common stopwords |
| Text Cleaning | Clean special characters |
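These text operators compose naturally: clean, segment, then remove stopwords. A rough sketch using whitespace splitting as a stand-in for segmentation (real Chinese word segmentation needs a dedicated tokenizer, and the stopword list here is a made-up example, not DataMate's built-in list):

```python
import re

# Example list only; real deployments load a full stopword file.
STOPWORDS = {"the", "a", "of"}

def clean_text(text, stopwords=STOPWORDS):
    # Text Cleaning: strip special characters, keep word chars and spaces.
    text = re.sub(r"[^\w\s]", "", text)
    # Text Segmentation (whitespace split as a stand-in) + Remove Stopwords.
    return [t for t in text.split() if t.lower() not in stopwords]
```

For example, `clean_text("The quality, of data!")` yields `["quality", "data"]`.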
Quick Start
1. Browse Operators
Step 1: Enter Operator Market
Select Operator Market in the left navigation.
Step 2: Browse Operators
View all available operators with ratings and installation counts.
2. Install Operator
Install Built-in Operator
Built-in operators are installed by default.
Install Third-party Operator
- In operator details page, click Install
- Wait for installation completion
3. Use Operator
After installation, use in:
- Data Cleaning: Add operator node to cleaning pipeline
- Pipeline Orchestration: Add operator node to workflow
Advanced Features
Develop Custom Operator
Create Operator
- In operator market page, click Create Operator
- Fill operator information
- Write operator code (Python)
- Package and publish
Python Operator Example:
```python
import re


class MyTextCleaner:
    def __init__(self, config):
        # Whether to strip non-word, non-space characters (default: on).
        self.remove_special_chars = config.get('remove_special_chars', True)

    def process(self, data):
        # Only clean string inputs; pass other types through unchanged.
        if isinstance(data, str):
            result = data
            if self.remove_special_chars:
                result = re.sub(r'[^\w\s]', '', result)
            return result
        return data


# Quick local test before packaging and publishing:
cleaner = MyTextCleaner({'remove_special_chars': True})
cleaner.process('Hello, world!')  # special characters removed
```
Best Practices
1. Operator Design
Good operator design:
- Single responsibility: One operator does one thing
- Configurable: Rich configuration options
- Error handling: Comprehensive error handling
- Performance: Consider large-scale data
Common Questions
Q: Operator execution failed?
A: Troubleshoot:
- View logs
- Check configuration
- Check data format
- Test locally
9 - Pipeline Orchestration
Visual workflow orchestration with DataMate
Pipeline orchestration module provides drag-and-drop visual interface for designing and managing complex data processing workflows.
Features Overview
Pipeline orchestration provides:
- Visual Designer: Drag-and-drop workflow design
- Rich Node Types: Data processing, conditions, loops, etc.
- Flow Execution: Auto-execute and monitor workflows
- Template Management: Save and reuse flow templates
- Version Management: Flow version control
Node Types
Data Nodes
| Node | Function | Config |
|---|---|---|
| Input Dataset | Read from dataset | Select dataset |
| Output Dataset | Write to dataset | Select dataset |
| Data Collection | Execute collection task | Select task |
| Data Cleaning | Execute cleaning task | Select task |
| Data Synthesis | Execute synthesis task | Select task |
Logic Nodes
| Node | Function | Config |
|---|---|---|
| Condition Branch | Execute different branches | Condition expression |
| Loop | Repeat execution | Loop count/condition |
| Parallel | Execute multiple branches in parallel | Branch count |
| Wait | Wait for specified time | Duration |
Quick Start
1. Create Pipeline
Step 1: Enter Pipeline Orchestration Page
Select Pipeline Orchestration in left navigation.
Step 2: Create Pipeline
Click Create Pipeline.
Step 3: Fill Basic Information
- Pipeline name: e.g., data_processing_pipeline
- Description: Pipeline purpose (optional)
Step 4: Design Flow
- Drag nodes from left library to canvas
- Connect nodes
- Configure node parameters
- Save flow
Example:
Input Dataset → Data Cleaning → Condition Branch
├── Satisfied → Data Annotation → Output
└── Not Satisfied → Data Synthesis → Output
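The branch above reads as ordinary control flow. A sketch assuming hypothetical task functions (`clean`, `meets_condition`, `annotate`, and `synthesize` are illustrations of the node semantics, not DataMate APIs):

```python
def run_pipeline(dataset, clean, meets_condition, annotate, synthesize):
    # Input Dataset -> Data Cleaning
    cleaned = clean(dataset)
    # Condition Branch: annotate when the condition is satisfied,
    # otherwise fall through to Data Synthesis. Either branch -> Output.
    if meets_condition(cleaned):
        return annotate(cleaned)
    return synthesize(cleaned)
```

In the visual designer the same logic is expressed by wiring an Input Dataset node through Data Cleaning into a Condition Branch with two outgoing edges.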
2. Execute Pipeline
Step 1: Enter Execution Page
Click pipeline name to enter details.
Step 2: Execute Pipeline
Click Execute Now.
Step 3: Monitor Execution
View execution status, progress, and logs.
Advanced Features
Flow Templates
Save as Template
- Design flow
- Click Save as Template
- Enter template name
Use Template
- Create pipeline, click Use Template
- Select template
- Load to designer
Parameterized Flow
Define parameters in pipeline:
```json
{
  "parameters": [
    {
      "name": "input_dataset",
      "type": "dataset",
      "required": true
    }
  ]
}
```
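At execution time, a definition like this is typically checked against the values the caller supplies before the flow starts. A minimal sketch of that validation (the logic here is an assumption about how such definitions are consumed, not DataMate's implementation):

```python
def validate_params(definitions, values):
    # Every definition marked "required" must appear in the supplied values.
    missing = [d["name"] for d in definitions
               if d.get("required") and d["name"] not in values]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    # Return only the declared parameters, in definition order.
    return {d["name"]: values.get(d["name"]) for d in definitions}
```

Executing the pipeline without `input_dataset` would then fail fast with a clear error instead of partway through the flow.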
Scheduled Execution
Configure scheduled execution:
- Cron expression: 0 0 2 * * ? (daily at 2 AM)
- Execution parameters
Best Practices
1. Flow Design
Recommended principles:
- Modular: Split complex flows
- Reusable: Use templates
- Maintainable: Add comments
- Testable: Test individually
2. Performance Optimization
Optimize performance:
- Parallelize: Use parallel nodes
- Reduce data transfer: Process locally
- Batch operations: Use batch operations
- Cache results: Cache intermediate results
Common Questions
Q: Flow execution failed?
A: Troubleshoot:
- View execution logs
- Check node configuration
- Check data format
- Test nodes individually
10 - Agent Chat
Use DataMate Agent for intelligent conversation
Agent chat module integrates large language models to provide intelligent Q&A and knowledge base retrieval capabilities.
Features Overview
Agent chat module provides:
- Intelligent Chat: Natural language conversation based on LLMs
- Knowledge Base Q&A: RAG-integrated knowledge base Q&A
- Multi-turn Dialogue: Support context-aware multi-turn conversation
- Dialogue Management: Conversation history and management
- Quick Commands: Quick commands for common operations
Quick Start
1. Access Agent
Select Agent Chat in left navigation, or visit:
- Local: http://localhost:30000/chat
- Production: https://your-domain/chat
2. Start Conversation
Enter question in dialog box, e.g.:
User: What data formats does DataMate support?
Assistant: DataMate supports multiple data formats:
Images: JPG, PNG, GIF, BMP, WebP
Text: TXT, MD, JSON, CSV
Audio: MP3, WAV, FLAC, AAC
Video: MP4, AVI, MOV, MKV
These formats can be uploaded and managed in the data management module.
3. Use Knowledge Base Q&A
Enable Knowledge Base
- Click Settings button
- In Knowledge Base settings, select knowledge base to use
- Save settings
Knowledge Base Q&A
User: How to create a data cleaning task?
Assistant: According to the knowledge base documentation:
1. Enter data processing page
2. Click create task button
3. Configure basic information
4. Configure cleaning pipeline (drag operators to canvas)
5. Configure execution parameters
6. Create and execute task
[Source: user_guide.md, data_cleansing.md]
Advanced Features
Conversation Modes
General Chat
Use LLM for general conversation without knowledge base.
Knowledge Base Q&A
Answer questions based on knowledge base content.
Mixed Mode
Combine general chat and knowledge base Q&A.
Quick Commands
| Command | Function | Example |
|---|---|---|
| /dataset | Query datasets | /dataset list |
| /task | Query tasks | /task status |
| /help | Show help | /help |
| /clear | Clear conversation | /clear |
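Slash commands like these are usually handled by a small dispatcher that splits the command name from its arguments and routes anything else to normal chat. A hypothetical sketch (the handler names are illustrative, not DataMate internals):

```python
def dispatch(message, handlers):
    # Route "/dataset list" to handlers["/dataset"] with args ["list"].
    # Messages not starting with "/" fall through to normal chat (None).
    if not message.startswith("/"):
        return None
    command, *args = message.split()
    handler = handlers.get(command)
    if handler is None:
        return f"Unknown command: {command}"
    return handler(args)
```

For example, with a `/clear` handler registered, `dispatch("/clear", handlers)` invokes it with an empty argument list.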
Conversation History
View History
- Click History tab on left
- Select historical conversation
- View conversation content
Continue Conversation
Click historical conversation to continue.
Export Conversation
Export conversation records:
- Markdown: Export as Markdown file
- JSON: Export as JSON
- PDF: Export as PDF
Best Practices
1. Effective Questioning
Get better answers:
- Be specific: Clear and specific questions
- Provide context: Include background information
- Break down: Split complex questions
2. Knowledge Base Usage
Make the most of knowledge base:
- Select appropriate knowledge base: Choose based on question
- View sources: Check answer source documents
- Verify information: Verify with source documents
Common Questions
Q: Inaccurate Agent answers?
A: Improve:
- Optimize question: More specific
- Check knowledge base: Ensure relevant content exists
- Change model: Try more powerful model
- Provide context: More background info