Natural Language Analytics on Apache Iceberg
Natural language analytics (NLA) is the ability to query a data lakehouse by asking questions in plain English: “What was our fastest-growing product category last quarter?”, and receiving data-backed, accurate answers without writing SQL. Apache Iceberg is the ideal storage foundation for NLA because its governed, versioned, well-documented tables provide the authoritative data that LLMs need to generate trustworthy answers.
The Three-Layer NLA Architecture
Every production natural language analytics system over Iceberg has three layers:
1. Semantic Layer → business context (what does each table/column mean?)
2. LLM Layer → natural language → SQL translation (grounded in semantic context)
3. Execution Layer → SQL → Iceberg data → results
Without the semantic layer, the LLM generates SQL against raw schemas it doesn’t understand, producing incorrect or hallucinated results. Without the execution layer, there’s no connection to actual data. Without the LLM layer, it’s not natural language.
Layer 1: The Semantic Layer
The semantic layer is the most important component for NLA accuracy. It translates raw Iceberg schemas into AI-understandable business context:
# Semantic catalog: register tables with business context
semantic_catalog = {
"analytics.orders": {
"description": "Canonical order table. Every order placed by customers since 2020-01-01.",
"columns": {
"order_id": "Unique identifier for each order. Never null.",
"customer_id": "Foreign key to analytics.customers. Links order to customer profile.",
"total": "Net order revenue in USD after discounts. Excludes tax and shipping.",
"order_date": "Calendar date the order was placed (not shipped or delivered).",
"status": "Order lifecycle status. Values: PENDING, PROCESSING, SHIPPED, DELIVERED, CANCELLED.",
"region": "Geographic sales region. Values: AMER (Americas), EMEA (Europe/Middle East/Africa), APAC (Asia Pacific)."
},
"metrics": {
"revenue": "SUM(total) WHERE status IN ('SHIPPED', 'DELIVERED')",
"order_count": "COUNT(*) WHERE status != 'CANCELLED'",
"aov": "SUM(total) / COUNT(*) WHERE status IN ('SHIPPED', 'DELIVERED')"
},
"business_rules": [
"Only SHIPPED and DELIVERED orders count as revenue",
"CANCELLED orders are excluded from all revenue metrics",
"order_date is in UTC timezone"
]
}
}
Layer 2: LLM Text-to-SQL (NL2SQL)
The LLM generates SQL from natural language, grounded in semantic context:
import anthropic
client = anthropic.Anthropic()
def natural_language_to_sql(question: str, table_context: dict) -> str:
"""Convert natural language question to SQL using LLM + semantic context."""
context = f"""
Table: analytics.orders
Description: {table_context['description']}
Columns:
{chr(10).join([f"- {col}: {desc}" for col, desc in table_context['columns'].items()])}
Pre-defined metrics:
{chr(10).join([f"- {m}: {d}" for m, d in table_context['metrics'].items()])}
Business rules:
{chr(10).join([f"- {r}" for r in table_context['business_rules']])}
"""
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=1024,
system=f"""You are a SQL expert. Generate a single, correct SQL SELECT query to answer
the user's question using the following table context. Return ONLY the SQL query, no explanation.
{context}""",
messages=[{"role": "user", "content": f"Question: {question}"}]
)
return response.content[0].text.strip()
Layer 3: SQL Execution Against Iceberg
Execute the generated SQL against Iceberg:
import duckdb
from pyiceberg.catalog import load_catalog
def execute_on_iceberg(sql: str) -> pd.DataFrame:
"""Execute SQL against Iceberg tables, return results."""
catalog = load_catalog("my_catalog", **{...})
table = catalog.load_table("analytics.orders")
arrow = table.scan().to_arrow()
conn = duckdb.connect()
conn.register("orders", arrow)
return conn.execute(sql).fetchdf()
Full NLA Pipeline
def answer_business_question(question: str) -> str:
"""
End-to-end natural language analytics over Iceberg.
1. Get semantic context
2. Generate SQL from question
3. Execute against Iceberg
4. Synthesize natural language answer
"""
# 1. Get semantic context for relevant tables
context = semantic_catalog["analytics.orders"]
# 2. Generate SQL
sql = natural_language_to_sql(question, context)
print(f"Generated SQL: {sql}")
# 3. Execute against Iceberg
results = execute_on_iceberg(sql)
# 4. Synthesize answer
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=1024,
system="You are a business analyst. Synthesize the data into a clear, concise answer.",
messages=[{
"role": "user",
"content": f"""
Question: {question}
Data:
{results.to_markdown()}
Provide a clear business answer.
"""
}]
)
return response.content[0].text
# Example
answer = answer_business_question(
"Which regions showed revenue growth above 20% this quarter compared to last?"
)
print(answer)
Dremio’s AI Agent: Production NLA on Iceberg
Dremio’s AI Agent is the production-ready implementation of NLA over Iceberg:
- Semantic Layer: Dremio Virtual Datasets with Wiki documentation and column descriptions provide the business context.
- NL2SQL: Dremio’s AI Agent uses the semantic layer to generate highly accurate SQL.
- Execution: Dremio’s Intelligent Query Engine executes against Iceberg via Apache Polaris.
- Governance: All queries are authenticated, authorized, and audit-logged.
- MCP Integration: AI Agents in Claude, ChatGPT, or custom applications connect via MCP.
The Dremio approach eliminates the need to build and maintain a custom semantic catalog: the Dremio platform manages the business context layer that makes NLA accurate.
Common NLA Failure Modes (and How Iceberg + Semantic Layer Fixes Them)
| Failure Mode | Cause | Fix |
|---|---|---|
| Wrong numerical results | LLM uses wrong filter (e.g., includes cancelled orders in revenue) | Business rules in semantic layer |
| Column name confusion | rev vs revenue: LLM guesses wrong | Clear column descriptions |
| Join errors | LLM doesn’t know customer_id joins to customers.id | Relationship declarations in semantic catalog |
| Stale data | LLM answers from training data, not live tables | Iceberg query execution grounds answers in current data |
| Hallucinated metrics | LLM invents numbers | Iceberg query execution with LIMIT enforcement |