Data Cleaning

Clean and preprocess data with DataMate

The data cleaning module helps you clean and transform data and improve its quality.

Features Overview

The data cleaning module provides:

Built-in Cleaning Operators: A library of ready-made cleaning operators
Visual Configuration: Drag-and-drop cleaning pipeline design
Template Management: Save and reuse cleaning templates
Batch Processing: Clean large datasets in batches
Real-time Preview: Preview cleaning results

Cleaning Operator Types

Data Quality Operators

Operator	Function	Applicable Data Types
Deduplication	Remove duplicates	All types
Null Handling	Handle null values	All types
Outlier Detection	Detect outliers	Numerical
Format Validation	Validate format	All types
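
To make the table concrete, here is a minimal plain-Python sketch of what deduplication, null handling, and outlier detection typically do. It is not DataMate's implementation or API: the function names, the record shape (a list of dicts), and the z-score rule for outliers are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def deduplicate(rows):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_nulls(rows, field, default):
    """Replace a missing or None value in `field` with `default`."""
    return [
        {**row, field: default if row.get(field) is None else row[field]}
        for row in rows
    ]


def find_outliers(values, threshold=3.0):
    """Return values whose z-score magnitude exceeds `threshold`."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

On small samples a z-score cannot grow very large, so a lower threshold (for example 2.0) may be needed for an obvious outlier to be flagged.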

Text Cleaning Operators

Operator	Function
Remove Special Chars	Remove special characters
Case Conversion	Convert case
Remove Stopwords	Remove common stopwords
Text Segmentation	Chinese word segmentation
HTML Tag Cleaning	Clean HTML tags
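
The text operators above can be approximated with the Python standard library. The sketch below is illustrative only and does not reflect how DataMate implements them: the function names and the tiny English stopword set are hypothetical, and Chinese word segmentation is left out because it needs a dedicated tokenizer.

```python
import re
from html.parser import HTMLParser

# Hypothetical stopword list; a real operator would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of"}


class _TextExtractor(HTMLParser):
    """Collects the text content of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def remove_special_chars(text):
    # Keep word characters (including non-ASCII letters) and whitespace.
    return re.sub(r"[^\w\s]", "", text)


def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)


def clean_text(text):
    """HTML tag cleaning, special character removal, lowercasing, stopwords."""
    text = strip_html(text)
    text = remove_special_chars(text)
    text = text.lower()
    return remove_stopwords(text)
```

For example, `clean_text("<p>The QUICK, brown fox &amp; the dog!</p>")` returns `"quick brown fox dog"`.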

Quick Start

1. Create Cleaning Task

Step 1: Enter Data Cleaning Page

Select Data Processing in the left navigation.

Step 2: Create Task

Click the Create Task button.

Step 3: Configure Basic Information

Task name: e.g., user_data_cleansing
Source dataset: Select the dataset to clean
Output dataset: Select or create the output dataset

Step 4: Configure Cleaning Pipeline

Drag operators from the library on the left onto the canvas
Connect the operators to form a pipeline
Configure operator parameters
Preview cleaning results

Example pipeline:

Input Data → Deduplication → Null Handling → Format Validation → Output Data
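
Conceptually, a pipeline like this applies each operator to the output of the previous one. The sketch below mimics that flow in plain Python. It is not generated by or tied to DataMate: every function name, the `name` and `email` fields, and the simplified email pattern are assumptions made for the example.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def deduplicate(rows):
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def handle_nulls(rows):
    # Replace a missing, None, or empty name with a placeholder.
    return [{**row, "name": row.get("name") or "unknown"} for row in rows]


def validate_format(rows):
    # Keep only records whose email matches the pattern.
    return [
        row for row in rows
        if isinstance(row.get("email"), str) and EMAIL_RE.fullmatch(row["email"])
    ]


def run_pipeline(rows, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Ann", "email": "ann@example.com"},
    {"name": None, "email": "bob@example.com"},
    {"name": "Cy", "email": "not-an-email"},
]
cleaned = run_pipeline(raw, [deduplicate, handle_nulls, validate_format])
```

Here `cleaned` holds two records: Ann, and Bob's record with the name filled in as `"unknown"`. The duplicate and the record with the malformed email are dropped.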

2. Use Cleaning Templates

Create Template

Configure cleaning pipeline
Click Save as Template
Enter template name
Save

Use Template

Create cleaning task
Click Use Template
Select template
Adjust as needed
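
One way to picture a template is as a saved list of operators with their parameters, which a later task loads and optionally adjusts. The sketch below models that idea with JSON. The template format, the operator names, and the `overrides` argument are hypothetical and say nothing about how DataMate stores templates internally.

```python
import json

# Hypothetical operator registry: name -> function(rows, **params).
OPERATORS = {
    "drop_nulls": lambda rows, field: [r for r in rows if r.get(field) is not None],
    "lowercase": lambda rows, field: [{**r, field: r[field].lower()} for r in rows],
}


def save_template(name, steps):
    """Serialize a pipeline definition so it can be reused later."""
    return json.dumps({"name": name, "steps": steps})


def load_template(text):
    return json.loads(text)


def apply_template(template, rows, overrides=None):
    """Run the template's steps in order, with per-operator parameter overrides."""
    overrides = overrides or {}
    for step in template["steps"]:
        params = {**step["params"], **overrides.get(step["operator"], {})}
        rows = OPERATORS[step["operator"]](rows, **params)
    return rows


saved = save_template("city_cleanup", [
    {"operator": "drop_nulls", "params": {"field": "city"}},
    {"operator": "lowercase", "params": {"field": "city"}},
])
template = load_template(saved)
result = apply_template(template, [{"city": "Paris"}, {"city": None}])
```

In this sketch the `overrides` argument plays the role of the "Adjust as needed" step: the saved template stays unchanged while a single task tweaks its parameters.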

3. Monitor Cleaning Task

View task status, progress, and statistics in the task list.

Advanced Features

Custom Operators

You can develop your own operators. See:

Operator Market - Operator development guide

Conditional Branching

Add conditional branches to a pipeline:

Input Data → [Condition Check]
              ├── Satisfied → Pipeline A
              └── Not Satisfied → Pipeline B
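
The branching diagram can be read as a router that sends data down one of two sub-pipelines. The sketch below assumes the condition is checked per record. That is an assumption, as are all the names in it, and it is not DataMate's API.

```python
def branch(condition, if_true, if_false):
    """Build a step that routes each record to one of two sub-pipelines."""
    def run(rows):
        matched = [row for row in rows if condition(row)]
        rest = [row for row in rows if not condition(row)]
        # Matched records come first in the output, then the rest.
        return if_true(matched) + if_false(rest)
    return run


def pipeline_a(rows):
    # Records with an email: normalize it.
    return [{**row, "email": row["email"].strip().lower()} for row in rows]


def pipeline_b(rows):
    # Records without an email: flag them for manual review.
    return [{**row, "needs_review": True} for row in rows]


route = branch(lambda row: bool(row.get("email")), pipeline_a, pipeline_b)
result = route([
    {"id": 1, "email": " A@Example.com "},
    {"id": 2, "email": None},
])
```

Note that this version groups the output by branch, so the original record order is not preserved when matching and non-matching records are interleaved.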

Best Practices

1. Pipeline Design

Recommended principles:

Modular: Split complex pipelines into smaller stages
Reusable: Use templates and parameters
Maintainable: Add comments
Testable: Test operators individually before combining them

2. Performance Optimization

To improve performance:

Parallelize: Use parallel nodes
Reduce data transfer: Process locally when possible
Batch operations: Process records in batches rather than one at a time
Cache results: Cache intermediate results
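
As a rough illustration of batching and parallelism, the sketch below splits the input into fixed-size batches and cleans them on a thread pool. It is a generic Python pattern, not a description of DataMate's parallel nodes. The function names, batch size, and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def clean_batch(batch):
    # Stand-in for a real cleaning operator applied to a whole batch.
    return [value.strip().lower() for value in batch]


def clean_in_parallel(values, batch_size=2, workers=4):
    """Clean batches concurrently; Executor.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(clean_batch, batched(values, batch_size))
        return [item for batch in results for item in batch]
```

In CPython, threads mainly help when the work is I/O-bound (for example, reading from storage). For CPU-heavy cleaning, a process pool is usually the better fit.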

Common Questions

Q: What should I do if a task fails?

A: Work through these troubleshooting steps:

Check data format
View execution logs
Check operator parameters
Test individual operators
Reduce data size for testing
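
The last two steps (testing operators individually on a reduced dataset) can be sketched as a small helper that runs each step in turn on a sample and reports the first one that raises. The helper and the step names are hypothetical. In DataMate itself, the equivalent information comes from the execution logs.

```python
def find_failing_step(rows, steps, sample_size=100):
    """Run named steps one by one on a small sample.

    Returns (step_name, error) for the first step that raises,
    or None if every step succeeds.
    """
    sample = rows[:sample_size]
    for name, step in steps:
        try:
            sample = step(sample)
        except Exception as exc:  # broad on purpose: we only want to locate the failure
            return name, repr(exc)
    return None


steps = [
    ("strip", lambda rows: [r.strip() for r in rows]),
    ("to_int", lambda rows: [int(r) for r in rows]),
]
problem = find_failing_step([" 1 ", "x"], steps)
```

Here `problem` names the `to_int` step, because `"x"` cannot be converted to an integer.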

Q: What can I do if cleaning is slow?

A: Try the following optimizations:

Reduce operator count
Optimize operator order
Increase concurrency
Use incremental processing
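
Incremental processing means cleaning only the records that arrived since the previous run. The sketch below tracks a high-water mark for that purpose. It assumes each record has a monotonically increasing `id`, and the `state` dict and function names are hypothetical rather than DataMate features.

```python
def clean_incrementally(rows, state, clean):
    """Clean only records newer than the last processed id, then advance the mark."""
    last_id = state.get("last_id", 0)
    new_rows = [row for row in rows if row["id"] > last_id]
    if new_rows:
        state["last_id"] = max(row["id"] for row in new_rows)
    return clean(new_rows)


def normalize(rows):
    # Stand-in cleaning step: trim and title-case the name.
    return [{**row, "name": row["name"].strip().title()} for row in rows]


state = {}
first = clean_incrementally(
    [{"id": 1, "name": " ann "}, {"id": 2, "name": "BOB"}], state, normalize
)
second = clean_incrementally(
    [{"id": 1, "name": " ann "}, {"id": 2, "name": "BOB"}, {"id": 3, "name": "cy"}],
    state,
    normalize,
)
```

The second call processes only the record with `id` 3, because the first two were already handled.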

API Reference

Data Cleaning API

Data Management - Manage cleaned data
Operator Market - Get more cleaning operators

Last modified February 12, 2026: :memo: fix data-collection & data management (#4) (ed058f0)