Knowledge Base Management

Build and manage RAG knowledge bases with DataMate

The knowledge base management module helps you build enterprise knowledge bases for efficient vector retrieval and RAG applications.

Features Overview

The knowledge base management module provides:

Document upload: Supports multiple document formats
Text chunking: Configurable text splitting strategies
Vectorization: Automatic text-to-vector conversion
Vector search: Retrieval based on semantic similarity
Knowledge base Q&A: Retrieval-augmented (RAG) question answering

Supported Document Formats

Format	Description	Recommended For
TXT	Plain text	General text
PDF	PDF documents	Documents, reports
Markdown	Markdown files	Technical docs
JSON	JSON data	Structured data
CSV	CSV tables	Tabular data
DOCX	Word documents	Office documents

Quick Start

1. Create Knowledge Base

Step 1: Enter Knowledge Base Page

In the left navigation, select Knowledge Generation.

Step 2: Create Knowledge Base

Click the Create Knowledge Base button in the upper right.

Step 3: Configure Basic Information

Knowledge base name: e.g., company_docs_kb
Knowledge base description: Describe purpose (optional)
Knowledge base type: General / Professional domain

Step 4: Configure Vector Parameters

Embedding model: Select an embedding model
- OpenAI text-embedding-ada-002
- BGE-M3
- Custom model
Vector dimension: Set automatically based on the selected model
Index type: IVF_FLAT / HNSW / IVF_PQ

Step 5: Configure Chunking Strategy

Chunking method:
- By character count
- By paragraph
- By semantics
Chunk size: Size of each text chunk (character count)
Overlap size: Overlap between adjacent chunks
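
To make the chunk size and overlap settings concrete, here is a minimal sketch of character-count chunking. It is an illustration only, not DataMate's implementation: the function name `chunk_by_characters` and its parameters are hypothetical, and it assumes both settings are measured in characters.

```python
def chunk_by_characters(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Adjacent chunks share `overlap` characters (hypothetical helper,
    not part of any DataMate API).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_by_characters("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

A larger overlap repeats more text across chunk boundaries, which helps keep a sentence that straddles a boundary retrievable, at the cost of storing more vectors.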

2. Upload Documents

Step 1: Enter Knowledge Base Details

Click the knowledge base name to open its details page.

Step 2: Upload Documents

Click the Upload Document button
Select local files
Wait for the upload to complete

The system then automatically:

Parses the document content
Chunks the text
Generates vectors
Builds the index

3. Vector Search

Step 1: Enter Search Page

On the knowledge base details page, click the Vector Search tab.

Step 2: Enter Query

Enter a query in the search box, e.g.:

How to use DataMate for data cleaning?

Step 3: View Search Results

The system returns the most relevant text chunks with their similarity scores:

Rank	Text Chunk	Similarity	Source Doc	Actions
1	DataMate’s data cleaning module…	0.92	user_guide.pdf	View
2	Configure cleaning task…	0.87	tutorial.md	View
3	Cleaning operator list…	0.81	reference.txt	View
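
As a rough illustration of how a ranked result list like this can be produced, the sketch below scores every chunk against the query with cosine similarity and keeps the top results. This page does not state which similarity metric DataMate uses, so cosine similarity is an assumption here, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Rank (text, vector) pairs by similarity to query_vec, best first.

    Brute-force scan over all chunks; assumed metric, not DataMate's code.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This scan compares the query against every stored vector. Index types such as HNSW exist so that a search does not have to do that on large collections.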

4. Knowledge Base Q&A (RAG)

Step 1: Enable RAG

On the knowledge base details page, click the RAG Q&A tab.

Step 2: Configure RAG Parameters

LLM: Select the LLM to use
Retrieval count: Number of text chunks to retrieve per question
Temperature: Controls the randomness of generation
Prompt template: Custom Q&A template
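
To show how the retrieval count and prompt template typically interact, here is a minimal sketch that keeps the top-ranked chunks and fills them into a template. The template text, its `{context}` and `{question}` placeholders, and the `build_prompt` function are all assumptions made for illustration. DataMate's actual template syntax is not documented on this page.

```python
# Hypothetical template; the placeholder names are assumptions.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, ranked_chunks, retrieval_count, template=PROMPT_TEMPLATE):
    """Build an LLM prompt from the best-ranked chunks.

    ranked_chunks: list of (text, source) pairs, most relevant first.
    retrieval_count: how many of those chunks to include.
    """
    selected = ranked_chunks[:retrieval_count]
    context = "\n".join(
        f"[{i}] {text} (source: {source})"
        for i, (text, source) in enumerate(selected, start=1)
    )
    return template.format(context=context, question=question)
```

The resulting string would be sent to the selected LLM together with the temperature setting. Raising the retrieval count gives the model more context but also lengthens the prompt.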

Step 3: Q&A

Enter a question in the dialog box, e.g.:

User: What data cleaning operators does DataMate support?

Assistant: DataMate supports rich data cleaning operators, including:
1. Data quality operators: deduplication, null handling, outlier detection...
2. Text cleaning operators: remove special chars, case conversion...
3. Image cleaning operators: format conversion, quality detection...
[Source: user_guide.pdf, tutorial.md]

Best Practices

1. Document Preparation

Before uploading documents:

Unify format: Convert documents to a single format (e.g., PDF or Markdown)
Clean content: Remove irrelevant content (page headers, ads)
Maintain structure: Keep the document structure clear and consistent
Add metadata: Attach document metadata (author, date, tags)

2. Chunking Strategy Selection

Choose a strategy based on the document type:

Document Type	Recommended Strategy	Chunk Size
Technical docs	Paragraph chunking	-
Long reports	Semantic chunking	-
Short text	Character chunking	500
Code	Character chunking	300
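
If you create knowledge bases from scripts, the table above can be encoded as a simple lookup. The sketch below is one possible way to do that. The keys, the `ChunkingChoice` type, and `recommend_chunking` are hypothetical names, and `None` stands in for rows where the table gives no chunk size.

```python
from typing import NamedTuple, Optional


class ChunkingChoice(NamedTuple):
    method: str
    chunk_size: Optional[int]  # None where the table lists no size


# Mirrors the recommendation table; key names are illustrative only.
RECOMMENDED = {
    "technical_docs": ChunkingChoice("paragraph", None),
    "long_reports": ChunkingChoice("semantic", None),
    "short_text": ChunkingChoice("character", 500),
    "code": ChunkingChoice("character", 300),
}


def recommend_chunking(document_type: str) -> ChunkingChoice:
    """Return the recommended chunking choice for a document type."""
    try:
        return RECOMMENDED[document_type]
    except KeyError:
        raise ValueError(
            f"no recommendation for document type: {document_type!r}"
        ) from None
```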

Common Questions

Q: Why is a document stuck in “Processing”?

A: Check the following:

Document format: Ensure the format is supported
Document size: Ensure each document is under 100 MB
Vector service: Check that the vector service is running
Logs: Check the logs for detailed error messages
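
The first two checks can be run locally before uploading. The sketch below is a hypothetical pre-upload check, not a DataMate tool. The extension set is inferred from the supported formats table (treating `.md` as the Markdown extension is an assumption), and it assumes "100 MB" means 100 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumed extensions for the formats listed on this page.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".md", ".json", ".csv", ".docx"}
MAX_SIZE_BYTES = 100 * 1024 * 1024  # assumption: 100 MB read as 100 MiB


def check_document(path, size_bytes=None):
    """Return a list of problems; an empty list means both checks pass.

    size_bytes may be passed explicitly; otherwise it is read from disk.
    """
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    if size_bytes is None:
        size_bytes = p.stat().st_size
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

If both checks pass and the document is still stuck, the vector service status and the logs are the remaining places to look.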

Q: Why are search results inaccurate?

A: Try the following:

Adjust chunking: Try a different chunking method
Increase chunk size: Give each chunk more context
Use reranking: Enable a reranking model
Optimize query: Phrase the query more clearly
Change embedding model: Try a different embedding model

API Reference

For detailed API documentation, see:

RAG Indexer API

Agent Chat - Q&A with knowledge base
Data Management - Manage knowledge base documents
Pipeline Orchestration - Integrate knowledge base into pipelines


Last modified February 12, 2026: :memo: fix data-collection & data management (#4) (ed058f0)