Data Processing Core

📋 Overview

DataProcessCore is a unified file-processing class. It detects the file format automatically, offers several chunking strategies, and accepts input from a local path, a remote URL, MinIO storage, or in-memory bytes.

⭐ Key Features

1. Core Processing Method: `file_process()`

Function Signature:

python

def file_process(
    self,
    file_path_or_url: Optional[str] = None,
    file_data: Optional[bytes] = None,
    chunking_strategy: str = "basic",
    destination: str = "local",
    filename: Optional[str] = None,
    **params,
) -> List[Dict]

Parameters:

Parameter	Type	Required	Description	Options
`file_path_or_url`	`str`	No*	Local file path or remote URL	Any valid file path or URL
`file_data`	`bytes`	No*	File byte data (for memory processing)	Any valid byte data
`chunking_strategy`	`str`	No	Chunking strategy	`"basic"`, `"by_title"`, `"none"`
`destination`	`str`	No	Source type: where the file is read from	`"local"`, `"minio"`, `"url"`
`filename`	`str`	No**	Filename	Any valid filename
`**params`	`dict`	No	Additional processing parameters	See parameter details below

*Note: Either `file_path_or_url` or `file_data` must be provided.

**Note: When using `file_data`, `filename` is required.
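
The two notes above amount to a small precondition check. The sketch below spells it out as a standalone function; `check_file_process_inputs` is a hypothetical helper written for illustration, not part of the SDK, and its behavior when both inputs are supplied (it prefers `file_data`) is an assumption, since the documentation does not say which one wins.

```python
from typing import Optional


def check_file_process_inputs(
    file_path_or_url: Optional[str] = None,
    file_data: Optional[bytes] = None,
    filename: Optional[str] = None,
) -> str:
    """Hypothetical helper: return the input mode, or raise ValueError."""
    # Rule 1: at least one of the two inputs must be present.
    if file_path_or_url is None and file_data is None:
        raise ValueError("Provide either file_path_or_url or file_data")
    # Rule 2: in-memory bytes carry no name, so filename must accompany them.
    if file_data is not None and not filename:
        raise ValueError("filename is required when file_data is provided")
    # Assumption: if both inputs are given, the in-memory data takes priority.
    return "memory" if file_data is not None else "path_or_url"
```

Running a check like this before calling `file_process()` surfaces a missing `filename` early, before any file is read.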

Chunking Strategy (chunking_strategy) Details:

Strategy	Description	Use Case	Output Characteristics
`"basic"`	Basic chunking strategy	Most document processing scenarios	Automatic chunking based on content length
`"by_title"`	Title-based chunking	Structured documents with clear headings	Chunks divided by document structure
`"none"`	No chunking	Small files or when full content is needed	Returns complete content without chunking
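
To make the difference between the three strategies concrete, here is a toy sketch that applies each one to a list of already-parsed elements. It is an illustration only, not the SDK's implementation: the `chunk_elements` function, the `type`/`text` element shape, and the `max_chars` budget are all assumptions introduced for this example.

```python
from typing import Dict, List


def chunk_elements(
    elements: List[Dict[str, str]],
    strategy: str = "basic",
    max_chars: int = 40,
) -> List[str]:
    """Toy illustration of the three strategies (hypothetical, not the SDK code)."""
    if strategy == "none":
        # Return the complete content as a single chunk.
        return ["\n".join(e["text"] for e in elements)]

    if strategy == "by_title":
        # Open a new chunk at every title element.
        sections: List[List[str]] = []
        for e in elements:
            if e["type"] == "title" or not sections:
                sections.append([])
            sections[-1].append(e["text"])
        return ["\n".join(s) for s in sections]

    if strategy == "basic":
        # Pack consecutive elements until the length budget would be exceeded.
        packed: List[str] = []
        current = ""
        for e in elements:
            candidate = f"{current}\n{e['text']}" if current else e["text"]
            if current and len(candidate) > max_chars:
                packed.append(current)
                current = e["text"]
            else:
                current = candidate
        if current:
            packed.append(current)
        return packed

    raise ValueError(f"Unknown chunking strategy: {strategy!r}")


elements = [
    {"type": "title", "text": "Intro"},
    {"type": "text", "text": "First paragraph."},
    {"type": "title", "text": "Usage"},
    {"type": "text", "text": "Second paragraph."},
]
for name in ("basic", "by_title", "none"):
    print(name, len(chunk_elements(elements, name)))
```

With `"by_title"` the chunk boundaries follow the headings regardless of length, while `"basic"` ignores structure and only looks at size.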

📁 Supported File Formats

Text files: .txt, .md, .csv
Documents: .pdf, .docx, .pptx
Images: .jpg, .png, .gif (with OCR)
Web content: HTML, URLs
Archives: .zip, .tar
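
One simple way to picture automatic format detection is a lookup from file extension to the categories listed above. The sketch below is a guess at the idea, not the SDK's actual detection logic: `FORMAT_CATEGORIES`, `detect_category`, the `.html` extension, and the fallback of treating extensionless HTTP(S) URLs as web content are all assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions taken from the list above; ".html" is an assumed spelling of "HTML".
FORMAT_CATEGORIES = {
    "text": {".txt", ".md", ".csv"},
    "document": {".pdf", ".docx", ".pptx"},
    "image": {".jpg", ".png", ".gif"},
    "web": {".html"},
    "archive": {".zip", ".tar"},
}


def detect_category(name_or_url: str) -> str:
    """Hypothetical helper: map a filename, path, or URL to a format category."""
    parsed = urlparse(name_or_url)
    # For URLs, look only at the path component so query strings are ignored.
    suffix = Path(parsed.path or name_or_url).suffix.lower()
    for category, suffixes in FORMAT_CATEGORIES.items():
        if suffix in suffixes:
            return category
    # Assumption: an HTTP(S) URL without a known extension is a web page.
    if parsed.scheme in ("http", "https"):
        return "web"
    return "unknown"
```

A real detector would likely also inspect MIME types or file contents, since extensions can be missing or misleading.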

💡 Usage Examples

python

from nexent.data_process import DataProcessCore

# Initialize processor
processor = DataProcessCore()

# Process local file
results = processor.file_process(
    file_path_or_url="/path/to/document.pdf",
    chunking_strategy="by_title"
)

# Process from URL
results = processor.file_process(
    file_path_or_url="https://example.com/document.pdf",
    destination="url"
)

# Process from memory
with open("document.pdf", "rb") as f:
    file_data = f.read()
    
results = processor.file_process(
    file_data=file_data,
    filename="document.pdf",
    chunking_strategy="basic"
)
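
The local and URL examples above differ only in the `destination` argument, so that choice can be derived from the source string. The following is a minimal sketch under that assumption; `build_file_process_kwargs` is a hypothetical convenience function, not an SDK API, and it cannot recognize MinIO sources.

```python
from typing import Any, Dict


def build_file_process_kwargs(
    source: str, chunking_strategy: str = "basic"
) -> Dict[str, Any]:
    """Hypothetical helper: pick `destination` from the shape of `source`."""
    # Assumption: anything starting with http:// or https:// is a remote URL;
    # everything else is treated as a local path. MinIO sources cannot be
    # recognized this way and need destination="minio" set explicitly.
    is_url = source.lower().startswith(("http://", "https://"))
    return {
        "file_path_or_url": source,
        "destination": "url" if is_url else "local",
        "chunking_strategy": chunking_strategy,
    }
```

The result can be unpacked straight into the call, for example `processor.file_process(**build_file_process_kwargs("https://example.com/document.pdf"))`.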

For detailed configuration and advanced usage, see the complete SDK documentation.
