Skip to content

Query breakdown operator

QueryBreakdownOperator

Bases: Operator, ServiceClient

Query Breakdown Operator

This operator processes a natural language query and breaks it down into subqueries using common table expressions (CTEs). It leverages LLM models to generate SQL statements, descriptions, columns, dependencies, and generality scores for each subquery.

Attributes:

Name Type Required Default Description
query str - Natural language query to process
context str "" Optional context to provide domain knowledge
schema str [] Optional schema to match
num_alternatives int 1 Optional number of alternatives to generate
Source code in blue/operators/query_breakdown_operator.py
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
class QueryBreakdownOperator(Operator, ServiceClient):
    """Query Breakdown Operator

    This operator processes a natural language query and breaks it down into subqueries using common table expressions (CTEs).
    It leverages LLM models to generate SQL statements, descriptions, columns, dependencies, and generality scores for each subquery.

    Attributes:
    ----------
    | Name             | Type | Required | Default | Description                                      |
    |-----------------|------|----------|---------|--------------------------------------------------|
    | `query`           | str  | :fontawesome-solid-circle-check: {.green-check}     | -       | Natural language query to process               |
    | `context`         | str  |     | ""      | Optional context to provide domain knowledge    |
    | `schema`          | str  |     | []      | Optional schema to match                        |
    | `num_alternatives` | int  |     | 1       | Optional number of alternatives to generate     |

    """

    PROMPT = """
Your task is to process a natural language query, and break it down to its subqueries.  
Your strategy is to translate the natural language query into SQL, defining each subquery as common table expressions (CTE):
* Each subquery should be sufficiently self-contained and independent in terms of specific data to retrieve. 
* Break down into as many subsqueries as necessary but don't do excessively. 
* Only last subquery should join all the subqueries to answer the natural language query. Return the results in JSON format.

Return the results in JSON format.
The response should be a valid JSON array containing the following information for each CTE:
- 'description': natural language description of the CTE, representing the subquery
- 'sql': sql statement corresponding to the CTE
- 'table':  name of the CTE table, use short generic table names 
- 'columns': a list of columns of the table, each with a name and type (suitable for sql)
- 'dependency': a list of dependent tables names, as defined in respective CTEs
- 'generality': score of 0 to 10, where a score of 10 indicating whether the subquery can be completed answered through public general knowledge sources and a score of 0 indicating subquery can only be answered through private data sources

Here are additional requirements:
- Create a total of ${num_alternatives} alternatives.
- The output should be a JSON array, each containing an alternative set of CTEs. 
- Each alternative set of CTE should be in a JSON array with each CTE as a JSON object in the array.
- As such the overall output should be a JSON array of JSON arrays each representing an alternative set. The second level of JSONArray contains JSON objects.
- Output should not be 3 levels of JSON Arrays. 
- Avoid using IN within a CTE. Instead breakdown further and create another CTE and have another CTE finally that uses JOIN to put the subqueries together. 
- Dependency field should only contain table names defined in other CTEs.
- Use columns with ids sparingly. When joining especially different tables use columns that have values instead of ids.
- There might be optional context provided. Use it to assist the query if provided.
- There might be specificed schema, whenever possible try to match it.
- Each object should contain the requested information in a structured format
- Please try to return non-empty output. If the query is not clear, please use your best judgement to return a non-empty output.
- The response should be well-formatted and easy to parse.
- Output the JSON directly. Do not generate explanation or other additional output.

Below is an example: 
Query: 
what jobs are available for data scientists in bay area? 

Output: 
[[{"description":"Retrieve all jobs for data scientist roles","sql":"SELECT job_title, company_name, location FROM jobs WHERE LOWER(job_title) LIKE '%data scientist%'","table":"ds_roles","columns":[{"name":"job_title","type":"VARCHAR"},{"name":"company_name","type":"VARCHAR"},{"name":"location","type":"VARCHAR"}],"dependency":[],"generality":3},{"description":"Retrieve locations that are considered part of the Bay Area","sql":"SELECT city FROM locations WHERE city IN ('San Francisco', 'San Jose', 'Oakland', 'Palo Alto', 'Mountain View', 'Sunnyvale', 'Berkeley')","table":"bay_locs","columns":[{"name":"city","type":"VARCHAR"}],"dependency":[],"generality":8},{"description":"Join data scientist job listings with Bay Area locations to find available data scientist jobs in the Bay Area","sql":"SELECT d.job_title, d.company_name, d.location FROM ds_roles d JOIN bay_locs b ON LOWER(d.location) = LOWER(b.city)","table":"ds_jobs_bayarea","columns":[{"name":"job_title","type":"VARCHAR"},{"name":"company_name","type":"VARCHAR"},{"name":"location","type":"VARCHAR"}],"dependency":["ds_roles","bay_locs"],"generality":2}]]

Below is Your task:

Query:
${query}

Schema:
${schema}

Context:
${context}

Output:
"""

    PROPERTIES = {
        # openai related properties
        "openai.api": "ChatCompletion",
        # "openai.model": "gpt-4o",
        "openai.model": "gpt-4o",
        "openai.stream": False,
        "openai.max_tokens": 4096,
        "openai.temperature": 0,
        # io related properties
        "input_json": "[{\"role\": \"user\"}]",
        "input_context": "$[0]",
        "input_context_field": "content",
        "input_field": "messages",
        "input_template": PROMPT,
        "output_path": "$.choices[0].message.content",
        # service related properties
        "service_prefix": "openai",
        # output transformations
        "output_transformations": [{"transformation": "replace", "from": "```", "to": ""}, {"transformation": "replace", "from": "json", "to": ""}],
        "output_strip": True,
        "output_cast": "json",
    }

    name = "query_breakdown"
    description = "Processes natural language query and break it down to subqueries using common table expressions"
    default_attributes = {
        "query": {"type": "str", "description": "Natural language query to process", "required": True},
        "context": {"type": "str", "description": "Optional context to provide domain knowledge", "required": False, "default": ""},
        "schema": {"type": "str", "description": "Optional schema to match", "required": False, "default": []},
        "num_alternatives": {"type": "int", "description": "Optional number of alternatives to generate", "required": False, "default": 1},
    }

    def __init__(self, description: str = None, properties: Dict[str, Any] = None):
        super().__init__(
            self.name,
            function=query_breakdown_operator_function,
            description=description or self.description,
            properties=properties,
            validator=query_breakdown_operator_validator,
            explainer=query_breakdown_operator_explainer,
        )

    def _initialize_properties(self):
        super()._initialize_properties()

        # attribute definitions
        self.properties["attributes"] = self.default_attributes

        # service_url, set as default
        self.properties["service_url"] = PROPERTIES["services.openai.service_url"]

query_breakdown_operator_explainer(output, input_data, attributes)

Explain query breakdown operator output. Currently only returns attributes and output.

Parameters:

Name Type Description Default
output Any

The output result from the operator execution.

required
input_data List[List[Dict[str, Any]]]

The input data that was processed.

required
attributes Dict[str, Any]

The attributes used for the operation.

required

Returns:

Type Description
Dict[str, Any]

Dictionary containing explanation of the operation.

Source code in blue/operators/query_breakdown_operator.py
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
def query_breakdown_operator_explainer(output: Any, input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any]) -> Dict[str, Any]:
    """
    Explain query breakdown operator output. Currently only returns attributes and output.

    Parameters:
        output: The output result from the operator execution.
        input_data: The input data that was processed.
        attributes: The attributes used for the operation.

    Returns:
        Dictionary containing explanation of the operation."""
    query_breakdown_explanation = {
        'output': output,
        "attributes": attributes,
    }
    return query_breakdown_explanation

query_breakdown_operator_function(input_data, attributes, properties=None)

Process natural language query using LLM models, identifies subsqueries along with metadata such as sql, columns, dependencies, and generality of the subquery.

Parameters:

Name Type Description Default
input_data List[List[Dict[str, Any]]]

List of JSON arrays (List[List[Dict[str, Any]]]), passes through unchanged.

required
attributes Dict[str, Any]

Dictionary containing query breakdown parameters including query, context, num_alternatives, and schema.

required
properties Dict[str, Any]

Optional properties dictionary containing service client information. Defaults to None.

None

Returns:

Type Description
List[List[Dict[str, Any]]]

List containing the output of the query breakdown execution.

Source code in blue/operators/query_breakdown_operator.py
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
def query_breakdown_operator_function(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> List[List[Dict[str, Any]]]:
    """
    Process natural language query using LLM models, identifies subsqueries along with metadata such as sql, columns, dependencies, and generality of the subquery.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]), passes through unchanged.
        attributes: Dictionary containing query breakdown parameters including query, context, num_alternatives, and schema.
        properties: Optional properties dictionary containing service client information. Defaults to None.

    Returns:
        List containing the output of the query breakdown execution.
    """
    # Extract attributes
    query = attributes.get('query', '')
    context = attributes.get('context', '')
    num_alternatives = attributes.get('num_alternatives', 1)
    schema = attributes.get('schema', '')

    # Validate input
    if not query or not query.strip():
        return []

    service_client = ServiceClient(name="query_breakdown_operator_service_client", properties=properties)
    additional_data = {'query': query, 'context': context, 'num_alternatives': num_alternatives, 'schema': schema}

    results = service_client.execute_api_call({}, properties=properties, additional_data=additional_data)
    return results

query_breakdown_operator_validator(input_data, attributes, properties=None)

Validate query breakdown operator attributes.

Parameters:

Name Type Description Default
input_data List[List[Dict[str, Any]]]

List of JSON arrays (List[List[Dict[str, Any]]]) to validate.

required
attributes Dict[str, Any]

Dictionary containing operator attributes to validate.

required
properties Dict[str, Any]

Optional properties dictionary. Defaults to None.

None

Returns:

Type Description
bool

True if attributes are valid, False otherwise.

Source code in blue/operators/query_breakdown_operator.py
45
46
47
48
49
50
51
52
53
54
55
56
def query_breakdown_operator_validator(input_data: List[List[Dict[str, Any]]], attributes: Dict[str, Any], properties: Dict[str, Any] = None) -> bool:
    """
    Validate query breakdown operator attributes.

    Parameters:
        input_data: List of JSON arrays (List[List[Dict[str, Any]]]) to validate.
        attributes: Dictionary containing operator attributes to validate.
        properties: Optional properties dictionary. Defaults to None.

    Returns:
        True if attributes are valid, False otherwise."""
    return default_operator_validator(input_data, attributes, properties)
Last update: 2025-10-09