
Data Processing Core

📋 Overview

DataProcessCore is a unified file-processing class. It detects the file format automatically, processes the file, and splits the content into chunks according to a configurable chunking strategy. Input can come from a local file path, a remote URL, or in-memory bytes.

⭐ Key Features

1. Core Processing Method: file_process()

Function Signature:

```python
def file_process(self,
                 file_path_or_url: Optional[str] = None,
                 file_data: Optional[bytes] = None,
                 chunking_strategy: str = "basic",
                 destination: str = "local",
                 filename: Optional[str] = None,
                 **params) -> List[Dict]
```

Parameters:

| Parameter | Type | Required | Description | Options |
| --- | --- | --- | --- | --- |
| file_path_or_url | str | No* | Local file path or remote URL | Any valid file path or URL |
| file_data | bytes | No* | File byte data (for memory processing) | Any valid byte data |
| chunking_strategy | str | No | Chunking strategy | "basic", "by_title", "none" |
| destination | str | No | Destination type, indicating file source | "local", "minio", "url" |
| filename | str | No** | Filename | Any valid filename |
| **params | dict | No | Additional processing parameters | See parameter details below |

*Note: Either file_path_or_url or file_data must be provided.

**Note: When using file_data, filename is required.
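The two notes and the option lists above can be checked before a call is made. The following is a minimal sketch of such a pre-flight check. The helper name `check_file_process_args` is hypothetical and is not part of DataProcessCore; whether the library itself raises on these conditions, and with which exception type, is not stated here.

```python
from typing import Optional

# Allowed values copied from the parameter table above.
CHUNKING_STRATEGIES = {"basic", "by_title", "none"}
DESTINATIONS = {"local", "minio", "url"}


def check_file_process_args(
    file_path_or_url: Optional[str] = None,
    file_data: Optional[bytes] = None,
    chunking_strategy: str = "basic",
    destination: str = "local",
    filename: Optional[str] = None,
) -> None:
    """Hypothetical pre-flight check; NOT part of DataProcessCore."""
    # Rule (*): at least one input source must be given.
    if file_path_or_url is None and file_data is None:
        raise ValueError("Provide either file_path_or_url or file_data")
    # Rule (**): in-memory data needs an accompanying filename.
    if file_data is not None and not filename:
        raise ValueError("filename is required when file_data is used")
    if chunking_strategy not in CHUNKING_STRATEGIES:
        raise ValueError(f"Unknown chunking_strategy: {chunking_strategy!r}")
    if destination not in DESTINATIONS:
        raise ValueError(f"Unknown destination: {destination!r}")
```

Calling the helper with the same keyword arguments that will be passed to file_process() surfaces a missing filename or a misspelled strategy before any file is read.
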

Chunking Strategy (chunking_strategy) Details:

| Strategy | Description | Use Case | Output Characteristics |
| --- | --- | --- | --- |
| "basic" | Basic chunking strategy | Most document processing scenarios | Automatic chunking based on content length |
| "by_title" | Title-based chunking | Structured documents with clear headings | Chunks divided by document structure |
| "none" | No chunking | Small files or when full content is needed | Returns complete content without chunking |

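To make the difference between the three strategies concrete, here is a toy illustration on a plain string. It is not the library's algorithm: the function `chunk_text`, the `max_chars` limit, the "content" key, and the use of lines starting with `#` as headings are all assumptions made for this sketch.

```python
from typing import Dict, List


def chunk_text(text: str, chunking_strategy: str = "basic",
               max_chars: int = 40) -> List[Dict]:
    """Toy illustration of the three strategies; NOT DataProcessCore's logic."""
    if chunking_strategy == "none":
        # Return the complete content as a single chunk.
        pieces = [text]
    elif chunking_strategy == "by_title":
        # Start a new chunk whenever a heading line (assumed: starts with "#") appears.
        pieces, current = [], []
        for line in text.splitlines():
            if line.startswith("#") and current:
                pieces.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            pieces.append("\n".join(current))
    elif chunking_strategy == "basic":
        # Split purely by length, ignoring document structure.
        pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    else:
        raise ValueError(f"Unknown chunking_strategy: {chunking_strategy!r}")
    # Drop whitespace-only pieces and wrap each one in a dict.
    return [{"content": p.strip()} for p in pieces if p.strip()]


doc = "# Intro\nHello world.\n# Usage\nCall the API.\n"
by_title_chunks = chunk_text(doc, "by_title")  # one chunk per heading
whole = chunk_text(doc, "none")                # a single chunk
```

The length-based split can cut through a sentence or heading, which is why a structure-aware strategy tends to suit documents with clear headings.
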
📁 Supported File Formats

  • Text files: .txt, .md, .csv
  • Documents: .pdf, .docx, .pptx
  • Images: .jpg, .png, .gif (with OCR)
  • Web content: HTML, URLs
  • Archives: .zip, .tar
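
Since detection is automatic, no format argument is needed. If a caller wants to filter files before submitting them, the extension list above can be turned into a simple lookup. This is a sketch only: `format_category` and the category names are hypothetical, and the library's own detection may rely on more than the file extension.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the list above. Web content (HTML, URLs) is not
# keyed by extension there, so it is left out of this lookup.
FORMAT_CATEGORIES = {
    "text": {".txt", ".md", ".csv"},
    "document": {".pdf", ".docx", ".pptx"},
    "image": {".jpg", ".png", ".gif"},
    "archive": {".zip", ".tar"},
}


def format_category(filename: str) -> Optional[str]:
    """Hypothetical helper: map a filename to a category, or None if unlisted."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```
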

💡 Usage Examples

```python
from nexent.data_process import DataProcessCore

# Initialize processor
processor = DataProcessCore()

# Process local file
results = processor.file_process(
    file_path_or_url="/path/to/document.pdf",
    chunking_strategy="by_title"
)

# Process from URL
results = processor.file_process(
    file_path_or_url="https://example.com/document.pdf",
    destination="url"
)

# Process from memory
with open("document.pdf", "rb") as f:
    file_data = f.read()

results = processor.file_process(
    file_data=file_data,
    filename="document.pdf",
    chunking_strategy="basic"
)
```
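
file_process() returns a list of dictionaries, but the keys of each dictionary are not listed here. A cautious way to explore the output is to count the chunks and the keys they carry without assuming any particular key name. The helper `summarize_chunks` and the sample data below are hypothetical stand-ins; real chunks come from `processor.file_process(...)` and may have different keys.

```python
from collections import Counter
from typing import Dict, List


def summarize_chunks(results: List[Dict]) -> Dict:
    """Hypothetical helper: report chunk count and how often each key occurs."""
    key_counts = Counter(key for chunk in results for key in chunk)
    return {"chunk_count": len(results), "keys": dict(key_counts)}


# Stand-in data for illustration only; the key names are assumptions.
sample = [
    {"content": "First chunk"},
    {"content": "Second chunk", "filename": "document.pdf"},
]
summary = summarize_chunks(sample)
print(summary["chunk_count"], sorted(summary["keys"]))
```
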

For detailed configuration and advanced usage, see the complete SDK documentation.