Semantic extract operator

SemanticExtractOperator

Bases: Operator, ServiceClient

The semantic extract operator uses an LLM to extract entities from natural-language text fields.

Attributes:

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| `entities` | list[dict] | Yes | N/A | List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional) |
| `context` | str | No | "" | Additional context information that provides domain knowledge or additional instructions for the extraction |
| `demonstrations` | str | No | "" | Additional demonstrations to help in-context learning |
| `extract_with_single_prompt` | bool | No | True | If true, extract all entities in a single prompt, else extract each entity with individual prompt |

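
To make the attribute table concrete, here is a minimal sketch of an attributes dictionary and a matching input. Only the key names come from the table; the entity names, field names, the `type` value, and the record itself are hypothetical.

```python
# Hypothetical values; only the key names are taken from the attribute table.
attributes = {
    "entities": [
        {
            "name": "skill",
            "description": "Technical skills mentioned by the candidate",
            "extract_on_fields": ["summary", "experience"],
            "type": "str",  # assumed value; the table only says 'type' is optional
        },
        # 'name' is the only required key; without 'extract_on_fields'
        # the entity is extracted from all fields.
        {"name": "employer"},
    ],
    "context": "Records are resumes from software engineers.",
    "demonstrations": "",
    "extract_with_single_prompt": True,
}

# One group holding one record, matching the List[List[Dict[str, Any]]] input shape.
input_data = [
    [{"summary": "Built data pipelines in Python and Spark.", "experience": "Acme Corp"}]
]
```
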
Source code in blue/operators/semantic_extract_operator.py
class SemanticExtractOperator(Operator, ServiceClient):
    """
    Semantic extract operator extracts entities from natural language text fields using LLM models.

    Attributes:
    ----------
    | Name                     | Type          | Required | Default | Description                                                                                                                   |
    |--------------------------|---------------|----------|---------|-------------------------------------------------------------------------------------------------------------------------------|
    | `entities`                 | list[dict]    | :fontawesome-solid-circle-check: {.green-check}     | N/A     | List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional) |
    | `context`                  | str           |     | ""      | Additional context information that provides domain knowledge or additional instructions for the extraction                  |
    | `demonstrations`           | str           |     | ""      | Additional demonstrations to help in-context learning                                                                       |
    | `extract_with_single_prompt` | bool         |     | True    | If true, extract all entities in a single prompt, else extract each entity with individual prompt                             |
    """

    SINGLE_EXTRACT_PROMPT = """## Task
You are given one data record in JSON format. Your job is to extract specific entities from this record and return them in strict JSON format.

## Entities to Extract
${entities}

## Context
${context}

## Demonstrations
${demonstrations}

## Output Requirements
- Return a **single JSON object** as the output.
- The JSON object must contain:
  - Keys: exact entity names from the list above (preserve original case and formatting).
  - Values: arrays of extracted items (may be empty if nothing is found).
- Do not introduce extra keys or change entity names.
- **Deduplication rule (within the same entity key):**
  - Remove duplicates and near-duplicates (e.g., case/spacing/punctuation variants, obvious aliases, or versioned forms).
  - Keep **one canonical value** if there are multiple semantically equivalent values.
  - Example: `"skill": ["python", "python3.10", "Python "]` → `"skill": ["python"]`.
- Only extract entities from the specified fields for each entity name.
- Be precise and only return relevant entities.

## Additional Notes
- If no entities are found, still return an object with all keys mapped to empty arrays.
- Do not include explanations, comments, or text outside the JSON object.
- Validate that the final output is syntactically valid JSON.

---

### Data Record
${data_record}

---

### Output JSON Object
"""

    INDIVIDUAL_EXTRACT_PROMPT = """## Task
You are given text content and need to extract specific entities of type "${entity_name}" from it.

## Entity to Extract
- **Name**: ${entity_name}${entity_type_info}
- **Description**: ${entity_description}
- **Extract from fields**: ${extract_fields}${type_line}

## Context
${context}

## Demonstrations
${demonstrations}

## Output Requirements
- Return a **JSON array** of extracted ${entity_name} entities
- Each element should be a string representing one ${entity_name}
- If no ${entity_name} entities are found, return an empty array []
- Be precise and only extract relevant ${entity_name} entities
- **Deduplication rule:**
  - Remove duplicates and near-duplicates (e.g., case/spacing/punctuation variants, obvious aliases, or versioned forms)
  - Keep **one canonical value** if there are multiple semantically equivalent values
  - Example: `["python", "python3.10", "Python "]` → `["python"]`
- Do not include explanations, comments, or text outside the JSON array
- Validate that the final output is syntactically valid JSON

---

### Text to Extract From
${text_to_extract}

---

### Output JSON Array
"""

    PROPERTIES = {
        # openai related properties
        "openai.api": "ChatCompletion",
        "openai.model": "gpt-4o",
        "openai.stream": False,
        "openai.max_tokens": 4096,
        "openai.temperature": 0,
        # io related properties
        "input_json": "[{\"role\": \"user\"}]",
        "input_context": "$[0]",
        "input_context_field": "content",
        "input_field": "messages",
        "input_template": SINGLE_EXTRACT_PROMPT,
        "output_path": "$.choices[0].message.content",
        # service related properties
        "service_prefix": "openai",
        # output transformations
        "output_transformations": [{"transformation": "replace", "from": "```", "to": ""}, {"transformation": "replace", "from": "json", "to": ""}],
        "output_strip": True,
        "output_cast": "json",
    }

    name = "semantic_extract"
    description = "Extracts entities from natural language text fields using LLM models"
    default_attributes = {
        "entities": {
            "type": "list[dict]",
            "description": "List of entities to extract. Each dict has 'name', 'description' (optional), 'extract_on_fields' (optional list of field names - if not provided, extracts from all fields), and 'type' (optional)",
            "required": True,
        },
        "context": {
            "type": "str",
            "description": "Additional context information that provides domain knowledge or additional instructions for the extraction",
            "required": False,
            "default": "",
        },
        "demonstrations": {"type": "str", "description": "Additional demonstrations to help in-context learning", "required": False, "default": ""},
        "extract_with_single_prompt": {
            "type": "bool",
            "description": "If true, extract all entities in a single prompt, else extract each entity with individual prompt",
            "required": False,
            "default": True,
        },
    }

    def __init__(self, description: str = None, properties: Dict[str, Any] = None):
        super().__init__(
            self.name,
            function=semantic_extract_operator_function,
            description=description or self.description,
            properties=properties,
            validator=semantic_extract_operator_validator,
            explainer=semantic_extract_operator_explainer,
        )

    def _initialize_properties(self):
        super()._initialize_properties()

        # attribute definitions
        self.properties["attributes"] = self.default_attributes

        # service_url, set as default
        self.properties["service_url"] = PROPERTIES["services.openai.service_url"]
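
The prompt templates use `${name}` placeholders, and the `output_transformations`, `output_strip`, and `output_cast` properties describe how the raw model reply is cleaned up. The sketch below illustrates both steps under stated assumptions: it uses Python's `string.Template` because the placeholder syntax matches, and it treats each `replace` transformation as a plain substring replacement. The source does not show how the library performs either step, and `render_prompt` and `postprocess` are hypothetical helpers, not part of the operator.

```python
import json
from string import Template

# Shortened stand-in for SINGLE_EXTRACT_PROMPT with the same placeholder names.
PROMPT = Template(
    "## Entities to Extract\n${entities}\n\n"
    "## Context\n${context}\n\n"
    "## Demonstrations\n${demonstrations}\n\n"
    "### Data Record\n${data_record}\n"
)


def render_prompt(entities, context, demonstrations, record):
    """Fill the placeholders; the entity line format is an assumption."""
    entity_lines = "\n".join(
        f"- {e['name']}: {e.get('description', '')}" for e in entities
    )
    return PROMPT.substitute(
        entities=entity_lines,
        context=context,
        demonstrations=demonstrations,
        data_record=json.dumps(record),
    )


def postprocess(raw):
    """Apply the two 'replace' transformations, strip, then cast to JSON."""
    fence = "`" * 3  # the code-fence marker removed by the first transformation
    text = raw.replace(fence, "").replace("json", "")
    return json.loads(text.strip())


prompt = render_prompt(
    [{"name": "skill", "description": "Technical skills"}],
    "",
    "",
    {"summary": "Python developer"},
)

# A reply wrapped in a fenced "json" block, as chat models often return.
reply = "`" * 3 + "json\n" + '{"skill": ["python"]}' + "\n" + "`" * 3
result = postprocess(reply)
print(result)
```

If the transformations really are plain substring replacements, as this sketch assumes, the second one also rewrites extracted values that contain the text `json`; for example, `"json-schema"` would come back as `"-schema"`.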

semantic_extract_operator_explainer(output, input_data, attributes)

Generate explanation for semantic extract operator execution.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `output` | Any | The output result from the operator execution. | required |
| `input_data` | List[List[Dict[str, Any]]] | The input data that was processed. | required |
| `attributes` | Dict[str, Any] | The attributes used for the operation. | required |

Returns:

| Type | Description |
|------|-------------|
| Dict[str, Any] | Dictionary containing explanation of the operation. |

Source code in blue/operators/semantic_extract_operator.py
def semantic_extract_operator_explainer(output: Any, input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Generate explanation for semantic extract operator execution.

    Parameters:
        output: The output result from the operator execution.
        input_data: The input data that was processed.
        attributes: The attributes used for the operation.

    Returns:
        Dictionary containing explanation of the operation.
    """
    return default_operator_explainer(output, input_data, attributes)

semantic_extract_operator_function(input_data, attributes, properties=None)

Extract entities from natural language text fields using LLM models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input_data` | List[List[Dict[str, Any]]] | List of JSON arrays (List[List[Dict[str, Any]]]) containing records with text fields to extract entities from. | required |
| `attributes` | Dict[str, Any] | Dictionary containing extraction parameters including entities, context, demonstrations, and extract_with_single_prompt. | required |
| `properties` | Dict[str, Any] | Optional properties dictionary containing service configuration. Defaults to None. | None |

Returns:

| Type | Description |
|------|-------------|
| List[List[Dict[str, Any]]] | List containing extracted entities for each record in the input data. |

Source code in blue/operators/semantic_extract_operator.py
def semantic_extract_operator_function(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> List[List[Dict[str, Any]]]:
    """Extract entities from natural language text fields using LLM models.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]) containing records with text fields to extract entities from.
        attributes: Dictionary containing extraction parameters including entities, context, demonstrations, and extract_with_single_prompt.
        properties: Optional properties dictionary containing service configuration. Defaults to None.

    Returns:
        List containing extracted entities for each record in the input data.
    """
    entities = attributes.get('entities', [])
    context = attributes.get('context', '')
    demonstrations = attributes.get('demonstrations', '')
    extract_with_single_prompt = attributes.get('extract_with_single_prompt', True)

    if not input_data or not input_data[0] or not entities:
        return []

    service_client = ServiceClient(name="semantic_extract_operator_service_client", properties=properties)

    results = []
    for data_group in input_data:
        if not data_group:
            results.append([])
            continue

        if extract_with_single_prompt:
            # Extract all entities in a single prompt
            result = _extract_all_entities_single_prompt(data_group, entities, context, demonstrations, service_client, properties)
        else:
            # Extract each entity with individual prompts
            result = _extract_entities_individual_prompts(data_group, entities, context, demonstrations, service_client, properties)

        results.append(result)

    return results
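
The grouping behaviour of the function above can be tried without an LLM service. The following sketch mirrors its control flow and replaces the two private extraction helpers with a callback; `extract_groups` and `fake_extract` are hypothetical names, and the shape of each per-record result is an assumption, since the helpers' return values are not shown here.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def extract_groups(
    input_data: List[List[Record]],
    entities: List[Dict[str, Any]],
    extract_group: Callable[[List[Record]], List[Record]],
) -> List[List[Record]]:
    """Mirror the grouping logic, with the LLM call replaced by a callback."""
    # Same early exit as the operator function: nothing to do without
    # input, a non-empty first group, or at least one entity.
    if not input_data or not input_data[0] or not entities:
        return []
    results = []
    for group in input_data:
        # Empty groups keep their position in the output as empty lists.
        results.append(extract_group(group) if group else [])
    return results


def fake_extract(group: List[Record]) -> List[Record]:
    """Hypothetical stand-in: treats every word of 'text' as a 'word' entity."""
    return [{"word": record.get("text", "").split()} for record in group]


entities = [{"name": "word"}]
trailing_empty = extract_groups([[{"text": "hello world"}], []], entities, fake_extract)
leading_empty = extract_groups([[], [{"text": "hello"}]], entities, fake_extract)
print(trailing_empty)
print(leading_empty)
```

Note the asymmetry that follows from the early-exit condition: an empty group after the first one yields an empty list in its position, while an empty first group makes the whole call return an empty list.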

semantic_extract_operator_validator(input_data, attributes, properties=None)

Validate semantic extract operator attributes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input_data` | List[List[Dict[str, Any]]] | List of JSON arrays (List[List[Dict[str, Any]]]) to validate. | required |
| `attributes` | Dict[str, Any] | Dictionary containing operator attributes to validate. | required |
| `properties` | Dict[str, Any] | Optional properties dictionary. Defaults to None. | None |

Returns:

| Type | Description |
|------|-------------|
| bool | True if attributes are valid, False otherwise. |

Source code in blue/operators/semantic_extract_operator.py
def semantic_extract_operator_validator(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> bool:
    """Validate semantic extract operator attributes.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]) to validate.
        attributes: Dictionary containing operator attributes to validate.
        properties: Optional properties dictionary. Defaults to None.

    Returns:
        True if attributes are valid, False otherwise.
    """
    try:
        if not default_operator_validator(input_data, attributes, properties):
            return False
    except Exception:
        return False

    entities = attributes.get('entities', [])

    if not isinstance(entities, list) or len(entities) == 0:
        return False

    for entity in entities:
        if not isinstance(entity, dict):
            return False
        if 'name' not in entity:
            return False
        if 'description' in entity and not isinstance(entity['description'], str):
            return False
        if 'type' in entity and not isinstance(entity['type'], str):
            return False
        if 'extract_on_fields' in entity:
            if not isinstance(entity['extract_on_fields'], list):
                return False
            if len(entity['extract_on_fields']) == 0:
                return False
    return True
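
To see which entity definitions pass the checks above, the sketch below copies the entity-level rules into a standalone function. It deliberately omits the call to `default_operator_validator`, whose behaviour is not shown here, so it covers only the `entities` attribute; `entities_are_valid` is a hypothetical name.

```python
from typing import Any


def entities_are_valid(entities: Any) -> bool:
    """Standalone copy of the entity checks; the default validator is omitted."""
    if not isinstance(entities, list) or len(entities) == 0:
        return False
    for entity in entities:
        if not isinstance(entity, dict) or "name" not in entity:
            return False
        if "description" in entity and not isinstance(entity["description"], str):
            return False
        if "type" in entity and not isinstance(entity["type"], str):
            return False
        if "extract_on_fields" in entity:
            fields = entity["extract_on_fields"]
            # When given, the field list must be a non-empty list.
            if not isinstance(fields, list) or len(fields) == 0:
                return False
    return True


checks = {
    "name only": [{"name": "skill"}],
    "empty field list": [{"name": "skill", "extract_on_fields": []}],
    "missing name": [{"description": "no name"}],
    "not a list": {"name": "skill"},
}
for label, candidate in checks.items():
    print(label, entities_are_valid(candidate))
```

Only the first case is accepted. To extract from all fields, leave `extract_on_fields` out rather than passing an empty list.
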
Last update: 2025-10-08