Semantic extract operator

`SemanticExtractOperator`

Bases: Operator, ServiceClient

The semantic extract operator extracts entities from natural-language text fields using large language models (LLMs).

Attributes:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `entities` | list[dict] | N/A | List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional) |
| `context` | str | "" | Additional context information that provides domain knowledge or additional instructions for the extraction |
| `demonstrations` | str | "" | Additional demonstrations to help in-context learning |
| `extract_with_single_prompt` | bool | True | If true, extract all entities in a single prompt, else extract each entity with individual prompt |
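
As an illustration, an `attributes` dictionary for this operator might look like the sketch below. The attribute keys come from the table above; the entity names, the field names (`summary`, `requirements`), and the context text are hypothetical values chosen for the example, not values defined by the library.

```python
# Hypothetical attribute values; only the keys are taken from the attribute table.
attributes = {
    "entities": [
        {
            "name": "skill",
            "description": "Technical skills mentioned in the text",
            # Hypothetical field names; restricts extraction to these fields.
            "extract_on_fields": ["summary", "requirements"],
        },
        # Only 'name' is given here, so all fields of the record are used.
        {"name": "location"},
    ],
    "context": "Records are job postings.",
    "demonstrations": "",
    # True: one prompt for all entities; False: one prompt per entity.
    "extract_with_single_prompt": True,
}

print([entity["name"] for entity in attributes["entities"]])
```

Setting `extract_with_single_prompt` to `False` would issue one prompt per entity instead, which costs more calls but keeps each prompt focused on a single entity.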

Source code in blue/operators/semantic_extract_operator.py

class SemanticExtractOperator(Operator, ServiceClient):
    """
    Semantic extract operator extracts entities from natural language text fields using LLM models.

    Attributes:
    ----------
    | Name                     | Type          | Required | Default | Description                                                                                                                   |
    |--------------------------|---------------|----------|---------|-------------------------------------------------------------------------------------------------------------------------------|
    | `entities`                 | list[dict]    | :fontawesome-solid-circle-check: {.green-check}     | N/A     | List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional) |
    | `context`                  | str           |     | ""      | Additional context information that provides domain knowledge or additional instructions for the extraction                  |
    | `demonstrations`           | str           |     | ""      | Additional demonstrations to help in-context learning                                                                       |
    | `extract_with_single_prompt` | bool         |     | True    | If true, extract all entities in a single prompt, else extract each entity with individual prompt                             |
    """

    SINGLE_EXTRACT_PROMPT = """## Task
You are given one data record in JSON format. Your job is to extract specific entities from this record and return them in strict JSON format.

## Entities to Extract
${entities}

## Context
${context}

## Demonstrations
${demonstrations}

## Output Requirements
- Return a **single JSON object** as the output.
- The JSON object must contain:
  - Keys: exact entity names from the list above (preserve original case and formatting).
  - Values: arrays of extracted items (may be empty if nothing is found).
- Do not introduce extra keys or change entity names.
- **Deduplication rule (within the same entity key):**
  - Remove duplicates and near-duplicates (e.g., case/spacing/punctuation variants, obvious aliases, or versioned forms).
  - Keep **one canonical value** if there are multiple semantically equivalent values.
  - Example: `"skill": ["python", "python3.10", "Python "]` → `"skill": ["python"]`.
- Only extract entities from the specified fields for each entity name.
- Be precise and only return relevant entities.

## Additional Notes
- If no entities are found, still return an object with all keys mapped to empty arrays.
- Do not include explanations, comments, or text outside the JSON object.
- Validate that the final output is syntactically valid JSON.

---

### Data Record
${data_record}

---

### Output JSON Object
"""

    INDIVIDUAL_EXTRACT_PROMPT = """## Task
You are given text content and need to extract specific entities of type "${entity_name}" from it.

## Entity to Extract
- **Name**: ${entity_name}${entity_type_info}
- **Description**: ${entity_description}
- **Extract from fields**: ${extract_fields}${type_line}

## Context
${context}

## Demonstrations
${demonstrations}

## Output Requirements
- Return a **JSON array** of extracted ${entity_name} entities
- Each element should be a string representing one ${entity_name}
- If no ${entity_name} entities are found, return an empty array []
- Be precise and only extract relevant ${entity_name} entities
- **Deduplication rule:**
  - Remove duplicates and near-duplicates (e.g., case/spacing/punctuation variants, obvious aliases, or versioned forms)
  - Keep **one canonical value** if there are multiple semantically equivalent values
  - Example: `["python", "python3.10", "Python "]` → `["python"]`
- Do not include explanations, comments, or text outside the JSON array
- Validate that the final output is syntactically valid JSON

---

### Text to Extract From
${text_to_extract}

---

### Output JSON Array
"""

    PROPERTIES = {
        # openai related properties
        "openai.api": "ChatCompletion",
        "openai.model": "gpt-4o",
        "openai.stream": False,
        "openai.max_tokens": 4096,
        "openai.temperature": 0,
        # io related properties
        "input_json": "[{\"role\": \"user\"}]",
        "input_context": "$[0]",
        "input_context_field": "content",
        "input_field": "messages",
        "input_template": SINGLE_EXTRACT_PROMPT,
        "output_path": "$.choices[0].message.content",
        # service related properties
        "service_prefix": "openai",
        # output transformations: remove Markdown code-fence markers and the literal text "json" from the response
        "output_transformations": [{"transformation": "replace", "from": "```", "to": ""}, {"transformation": "replace", "from": "json", "to": ""}],
        "output_strip": True,
        "output_cast": "json",
    }

    name = "semantic_extract"
    description = "Extracts entities from natural language text fields using LLM models"
    default_attributes = {
        "entities": {
            "type": "list[dict]",
            "description": "List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional)",
            "required": True,
        },
        "context": {
            "type": "str",
            "description": "Additional context information that provides domain knowledge or additional instructions for the extraction",
            "required": False,
            "default": "",
        },
        "demonstrations": {"type": "str", "description": "Additional demonstrations to help in-context learning", "required": False, "default": ""},
        "extract_with_single_prompt": {
            "type": "bool",
            "description": "If true, extract all entities in a single prompt, else extract each entity with individual prompt",
            "required": False,
            "default": True,
        },
    }

    def __init__(self, description: str = None, properties: Dict[str, Any] = None):
        super().__init__(
            self.name,
            function=semantic_extract_operator_function,
            description=description or self.description,
            properties=properties,
            validator=semantic_extract_operator_validator,
            explainer=semantic_extract_operator_explainer,
        )

    def _initialize_properties(self):
        super()._initialize_properties()

        # attribute definitions
        self.properties["attributes"] = self.default_attributes

        # service_url, set as default
        self.properties["service_url"] = PROPERTIES["services.openai.service_url"]
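
The `${...}` placeholders in the prompt templates use the same syntax as Python's `string.Template`, and the `output_transformations`, `output_strip`, and `output_cast` properties describe how the model response is post-processed. The sketch below imitates both steps on a shortened template. How the library itself performs the substitution and the transformations is not shown in this source, so `render_prompt` and `clean_output` are hypothetical helpers written under that assumption, and the sample response is made up.

```python
import json
from string import Template

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the listing readable

# Shortened stand-in for SINGLE_EXTRACT_PROMPT (same placeholder names).
PROMPT = Template(
    "## Entities to Extract\n${entities}\n\n"
    "## Context\n${context}\n\n"
    "### Data Record\n${data_record}\n"
)


def render_prompt(entities, context, record):
    """Fill the template placeholders (hypothetical helper)."""
    entity_lines = "\n".join(
        f"- {e['name']}: {e.get('description', '')}" for e in entities
    )
    return PROMPT.substitute(
        entities=entity_lines,
        context=context,
        data_record=json.dumps(record),
    )


def clean_output(raw):
    """Mimic the configured replace, strip, and JSON-cast steps (hypothetical helper)."""
    for old in (FENCE, "json"):
        raw = raw.replace(old, "")
    return json.loads(raw.strip())


prompt = render_prompt(
    [{"name": "skill", "description": "Technical skills"}],
    "Records are job postings.",
    {"summary": "Looking for a Python developer"},
)

# A made-up model response wrapped in a Markdown code fence.
response = f'{FENCE}json\n{{"skill": ["python"]}}\n{FENCE}'
extracted = clean_output(response)
print(extracted)  # {'skill': ['python']}
```

In this sketch the second replacement removes every occurrence of `json`, so an extracted value such as `"jsonschema"` would also be altered. Whether the library's `replace` transformation behaves the same way is not confirmed by the source shown here.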

`semantic_extract_operator_explainer(output, input_data, attributes)`

Generate explanation for semantic extract operator execution.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `output` | `Any` | The output result from the operator execution. | required |
| `input_data` | `List[List[Dict[str, Any]]]` | The input data that was processed. | required |
| `attributes` | `Dict[str, Any]` | The attributes used for the operation. | required |

Returns:

| Type | Description |
|------|-------------|
| `Dict[str, Any]` | Dictionary containing explanation of the operation. |

Source code in blue/operators/semantic_extract_operator.py

def semantic_extract_operator_explainer(output: Any, input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Generate explanation for semantic extract operator execution.

    Parameters:
        output: The output result from the operator execution.
        input_data: The input data that was processed.
        attributes: The attributes used for the operation.

    Returns:
        Dictionary containing explanation of the operation.
    """
    return default_operator_explainer(output, input_data, attributes)

`semantic_extract_operator_function(input_data, attributes, properties=None)`

Extract entities from natural-language text fields using large language models (LLMs).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input_data` | `List[List[Dict[str, Any]]]` | List of JSON arrays (List[List[Dict[str, Any]]]) containing records with text fields to extract entities from. | required |
| `attributes` | `Dict[str, Any]` | Dictionary containing extraction parameters including entities, context, demonstrations, and extract_with_single_prompt. | required |
| `properties` | `Dict[str, Any]` | Optional properties dictionary containing service configuration. Defaults to None. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `List[List[Dict[str, Any]]]` | List containing extracted entities for each record in the input data. |

Source code in blue/operators/semantic_extract_operator.py

def semantic_extract_operator_function(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> List[List[Dict[str, Any]]]:
    """Extract entities from natural language text fields using LLM models.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]) containing records with text fields to extract entities from.
        attributes: Dictionary containing extraction parameters including entities, context, demonstrations, and extract_with_single_prompt.
        properties: Optional properties dictionary containing service configuration. Defaults to None.

    Returns:
        List containing extracted entities for each record in the input data.
    """
    entities = attributes.get('entities', [])
    context = attributes.get('context', '')
    demonstrations = attributes.get('demonstrations', '')
    extract_with_single_prompt = attributes.get('extract_with_single_prompt', True)

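    # Nothing to extract when there is no input, the first group is empty, or no entities are given.
    # Note: an empty first group ends the whole call, even if later groups contain records.
    if not input_data or not input_data[0] or not entities:
        return []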

    service_client = ServiceClient(name="semantic_extract_operator_service_client", properties=properties)

    results = []
    for data_group in input_data:
        if not data_group:
            results.append([])
            continue

        if extract_with_single_prompt:
            # Extract all entities in a single prompt
            result = _extract_all_entities_single_prompt(data_group, entities, context, demonstrations, service_client, properties)
        else:
            # Extract each entity with individual prompts
            result = _extract_entities_individual_prompts(data_group, entities, context, demonstrations, service_client, properties)

        results.append(result)

    return results
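
The grouping behaviour of `semantic_extract_operator_function` can be observed without calling an LLM. The sketch below copies the control flow from the source above but replaces the service call with `fake_extract`, a hypothetical stand-in that returns one empty result per record. The real per-record output shape depends on the `_extract_*` helpers, which are not shown here.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def fake_extract(group: List[Record], entities: List[Dict[str, Any]]) -> List[Record]:
    """Hypothetical stand-in for the LLM-backed helpers: one result dict per record."""
    return [{e["name"]: [] for e in entities} for _ in group]


def run(input_data: List[List[Record]], attributes: Dict[str, Any]) -> List[List[Record]]:
    entities = attributes.get("entities", [])
    # Same early exit as the source: no input, empty first group, or no entities.
    if not input_data or not input_data[0] or not entities:
        return []
    results = []
    for group in input_data:
        # Empty groups are kept as empty lists so positions line up with the input.
        results.append(fake_extract(group, entities) if group else [])
    return results


attrs = {"entities": [{"name": "skill"}]}
groups = [
    [{"summary": "Python developer"}],
    [],
    [{"summary": "Go engineer"}, {"summary": "Chef"}],
]

out = run(groups, attrs)
print([len(group) for group in out])  # [1, 0, 2]
print(run([[], [{"summary": "x"}]], attrs))  # []
```

The second call shows the early exit: because the first group is empty, the function returns an empty list even though the second group contains a record.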

`semantic_extract_operator_validator(input_data, attributes, properties=None)`

Validate semantic extract operator attributes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input_data` | `List[List[Dict[str, Any]]]` | List of JSON arrays (List[List[Dict[str, Any]]]) to validate. | required |
| `attributes` | `Dict[str, Any]` | Dictionary containing operator attributes to validate. | required |
| `properties` | `Dict[str, Any]` | Optional properties dictionary. Defaults to None. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `bool` | True if attributes are valid, False otherwise. |

Source code in blue/operators/semantic_extract_operator.py

def semantic_extract_operator_validator(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> bool:
    """Validate semantic extract operator attributes.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]) to validate.
        attributes: Dictionary containing operator attributes to validate.
        properties: Optional properties dictionary. Defaults to None.

    Returns:
        True if attributes are valid, False otherwise.
    """
    try:
        if not default_operator_validator(input_data, attributes, properties):
            return False
    except Exception:
        return False

    entities = attributes.get('entities', [])

    if not isinstance(entities, list) or len(entities) == 0:
        return False

    for entity in entities:
        if not isinstance(entity, dict):
            return False
        if 'name' not in entity:
            return False
        if 'description' in entity and not isinstance(entity['description'], str):
            return False
        if 'type' in entity and not isinstance(entity['type'], str):
            return False
        if 'extract_on_fields' in entity:
            if not isinstance(entity['extract_on_fields'], list):
                return False
            if len(entity['extract_on_fields']) == 0:
                return False
    return True
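
To see which entity definitions pass the checks above, the entity-specific part of the validator can be exercised on its own. `entities_are_valid` below is a hypothetical standalone copy of those checks. It omits the `default_operator_validator` call, whose behaviour is not shown in this source.

```python
from typing import Any


def entities_are_valid(entities: Any) -> bool:
    """Standalone copy of the entity checks (hypothetical helper, not part of the library)."""
    if not isinstance(entities, list) or len(entities) == 0:
        return False
    for entity in entities:
        if not isinstance(entity, dict) or "name" not in entity:
            return False
        if "description" in entity and not isinstance(entity["description"], str):
            return False
        if "type" in entity and not isinstance(entity["type"], str):
            return False
        if "extract_on_fields" in entity:
            fields = entity["extract_on_fields"]
            # The field list must be a non-empty list when it is present.
            if not isinstance(fields, list) or len(fields) == 0:
                return False
    return True


print(entities_are_valid([{"name": "skill"}]))                           # True
print(entities_are_valid([{"name": "skill", "extract_on_fields": []}]))  # False: empty field list
print(entities_are_valid([{"description": "no name"}]))                  # False: 'name' is missing
print(entities_are_valid([]))                                            # False: nothing to extract
print(entities_are_valid([{"name": 42}]))                                # True: the type of 'name' is not checked
```

The last call reflects the checks as written: the presence of `name` is required, but its type is not verified.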

Last update: 2025-10-08