# Data Cleaning

Clean and preprocess data with DataMate.

The data cleaning module provides powerful data processing capabilities that help you clean, transform, and improve the quality of your data.

## Features Overview

The data cleaning module provides:
- Built-in Cleaning Operators: a rich library of ready-made cleaning operators
- Visual Configuration: drag-and-drop cleaning pipeline design
- Template Management: save and reuse cleaning templates
- Batch Processing: large-scale batch cleaning of data
- Real-time Preview: preview cleaning results before running a task
## Cleaning Operator Types

### Data Quality Operators
| Operator | Function | Applicable Data Types |
|---|---|---|
| Deduplication | Remove duplicates | All types |
| Null Handling | Handle null values | All types |
| Outlier Detection | Detect outliers | Numerical |
| Format Validation | Validate format | All types |
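The table above can be read as plain transformations on records. Here is a minimal Python sketch of what the first three operators do, illustrative only and not DataMate's internal implementation:

```python
from statistics import mean, stdev

def deduplicate(rows):
    """Deduplication: keep the first occurrence of each record."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())  # hashable view of the record
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def fill_nulls(rows, field, default):
    """Null handling: replace missing values in `field` with a default."""
    return [{**r, field: r[field] if r.get(field) is not None else default}
            for r in rows]

def flag_outliers(values, k=3.0):
    """Outlier detection: flag values more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > k * s for v in values]
```

Format validation is usually a per-field regular-expression or type check and follows the same record-in, record-out pattern.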
### Text Cleaning Operators
| Operator | Function |
|---|---|
| Remove Special Chars | Remove special characters |
| Case Conversion | Convert case |
| Remove Stopwords | Remove common stopwords |
| Text Segmentation | Chinese word segmentation |
| HTML Tag Cleaning | Clean HTML tags |
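As an illustration of what these text operators do, here is a sketch using only Python's standard library. The tiny stopword list is a placeholder, and Chinese word segmentation (which needs a dedicated segmenter) is omitted:

```python
import re
from html import unescape

STOPWORDS = {"the", "a", "an", "is", "of"}  # tiny illustrative stopword list

def strip_html(text):
    """HTML tag cleaning: drop tags and decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", "", text))

def remove_special_chars(text):
    """Remove special chars: keep letters, digits, and whitespace only."""
    return re.sub(r"[^\w\s]", "", text)

def remove_stopwords(text):
    """Case conversion plus stopword removal on whitespace tokens."""
    return " ".join(t for t in text.lower().split() if t not in STOPWORDS)
```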
## Quick Start

### 1. Create a Cleaning Task

#### Step 1: Open the Data Cleaning Page

Select Data Processing in the left navigation bar.

#### Step 2: Create a Task

Click the Create Task button.

#### Step 3: Configure Basic Information
- Task name: e.g., `user_data_cleansing`
- Source dataset: select the dataset to clean
- Output dataset: select or create the output dataset
#### Step 4: Configure the Cleaning Pipeline
- Drag operators from left library to canvas
- Connect operators to form pipeline
- Configure operator parameters
- Preview cleaning results
Example pipeline:
Input Data → Deduplication → Null Handling → Format Validation → Output Data
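Conceptually, such a pipeline is function composition: each operator consumes the records the previous one produced. A minimal sketch, where the operator functions are hypothetical stand-ins for DataMate's built-in operators:

```python
def run_pipeline(rows, steps):
    """Input Data -> step1 -> step2 -> ... -> Output Data."""
    for step in steps:
        rows = step(rows)
    return rows

# Each operator is just a function from records to records.
drop_empty = lambda rows: [r for r in rows if r.get("name")]
normalize = lambda rows: [{**r, "name": r["name"].strip().lower()} for r in rows]

cleaned = run_pipeline(
    [{"name": " Alice "}, {"name": ""}, {"name": "BOB"}],
    [drop_empty, normalize],
)
# cleaned == [{"name": "alice"}, {"name": "bob"}]
```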
### 2. Use Cleaning Templates

#### Create Template
- Configure cleaning pipeline
- Click Save as Template
- Enter template name
- Save
#### Use Template
- Create cleaning task
- Click Use Template
- Select template
- Adjust as needed
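Under the hood, a template is essentially a saved pipeline configuration. The sketch below illustrates the save / load / adjust cycle with a hypothetical JSON schema; DataMate's actual template format may differ:

```python
import json

# Hypothetical template schema: operator names plus their parameters.
template = {
    "name": "user_data_cleansing",
    "operators": [
        {"type": "deduplication", "params": {}},
        {"type": "null_handling", "params": {"strategy": "median"}},
        {"type": "format_validation", "params": {"field": "email", "pattern": r".+@.+"}},
    ],
}

saved = json.dumps(template)   # "Save as Template"
restored = json.loads(saved)   # "Use Template" ...
# ... then "Adjust as needed":
restored["operators"][1]["params"]["strategy"] = "drop"
```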
### 3. Monitor the Cleaning Task

View task status, progress, and statistics in the task list.
## Advanced Features

### Custom Operators

You can develop custom operators. See:

- Operator Market: operator development guide
### Conditional Branching

Add conditional branches to a pipeline:
Input Data → [Condition Check]
├── Satisfied → Pipeline A
└── Not Satisfied → Pipeline B
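The branch above can be modeled as routing each record into one of two sub-pipelines based on a predicate. A minimal sketch, not DataMate's API:

```python
def branch(rows, condition, pipeline_a, pipeline_b):
    """Route each record to Pipeline A or Pipeline B based on a condition check."""
    satisfied = [r for r in rows if condition(r)]
    not_satisfied = [r for r in rows if not condition(r)]
    return pipeline_a(satisfied) + pipeline_b(not_satisfied)

out = branch(
    [{"score": 0.9}, {"score": 0.1}],
    lambda r: r["score"] >= 0.5,                       # condition check
    lambda rows: [{**r, "tier": "A"} for r in rows],   # Pipeline A
    lambda rows: [{**r, "tier": "B"} for r in rows],   # Pipeline B
)
```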
## Best Practices

### 1. Pipeline Design

Recommended principles:
- Modular: split complex pipelines into smaller, focused stages
- Reusable: use templates and parameters
- Maintainable: add comments to operators and stages
- Testable: test each operator individually before combining them
### 2. Performance Optimization

To optimize performance:

- Parallelize: use parallel nodes where operators are independent
- Reduce data transfer: process data locally when possible
- Batch operations: process records in batches rather than one at a time
- Cache results: cache intermediate results to avoid recomputation
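Two of these ideas, batch operations and caching intermediate results, are easy to sketch in plain Python (illustrative only, not DataMate's execution engine):

```python
from functools import lru_cache

# Cache results: a pure, expensive transform is memoized, so re-running a
# pipeline on unchanged values skips the recomputation.
@lru_cache(maxsize=None)
def expensive_normalize(value: str) -> str:
    return value.strip().lower()

def batch(iterable, size):
    """Batch operations: yield fixed-size chunks instead of single records."""
    bucket = []
    for item in iterable:
        bucket.append(item)
        if len(bucket) == size:
            yield bucket
            bucket = []
    if bucket:
        yield bucket
```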
## Common Questions

Q: What should I do if a task fails?

A: Troubleshoot as follows:
- Check data format
- View execution logs
- Check operator parameters
- Test individual operators
- Reduce data size for testing
Q: Why is cleaning slow?

A: Try the following optimizations:
- Reduce operator count
- Optimize operator order
- Increase concurrency
- Use incremental processing
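Incremental processing means cleaning only the records added since the last run, tracked with a watermark. A minimal sketch, where the `id` watermark field is an assumption for illustration:

```python
def incremental(rows, last_seen_id):
    """Incremental processing: clean only records added since the last run."""
    new = [r for r in rows if r["id"] > last_seen_id]
    # Advance the watermark so the next run skips everything seen so far.
    watermark = max((r["id"] for r in new), default=last_seen_id)
    return new, watermark
```

A monotonically increasing id or timestamp column is what makes this safe; without one, incremental runs can miss or re-clean records.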
## Related Documentation

- Data Management: manage cleaned data
- Operator Market: get more cleaning operators