Usage¶
A practical, end-to-end tour of the schemadex Python API. Every section below is runnable as written — paste a snippet into a REPL with the package installed and it should produce sensible output against your own database.
Higher-level design context lives in architecture.md. For the resolution-path numbers cited here, see benchmark.md.
1. Install¶
The default wheel ships every backend (Postgres, SQLite, MySQL, DuckDB, BigQuery, Snowflake, MSSQL) plus the agent-facing describe / resolve / run-SQL surface.
# Everything
pip install schemadex
# With one of the optional framework adapters
pip install "schemadex[mcp]" # MCP server entry point
pip install "schemadex[langchain]" # langchain_core.tools.Tool factory
pip install "schemadex[langgraph]" # LangGraph node helper
pip install "schemadex[llamaindex]" # LlamaIndex retriever
pip install "schemadex[dspy]" # DSPy module
pip install "schemadex[litellm]" # LiteLLM message builder
# Or several at once
pip install "schemadex[mcp,langchain,langgraph]"
Building a slim wheel from source¶
The published wheel includes every backend; the binary is therefore
heavy. Users who need a smaller artifact can build one locally with
maturin, picking only the backends they care about (full recipe in
slim-wheels.md):
# Postgres + SQLite, no DuckDB / BigQuery / Snowflake / MSSQL / MySQL
git clone https://github.com/AmiyaMandal1/schemadex
cd schemadex
maturin build --release --no-default-features \
--features schemadex-py/postgres,schemadex-py/sqlite \
--out dist
pip install dist/schemadex-*.whl
Rust git dependency¶
schemadex-core is not on crates.io yet; pull it directly from GitHub:
# Cargo.toml
[dependencies]
schemadex-core = {
git = "https://github.com/AmiyaMandal1/schemadex",
features = ["postgres", "sqlite", "duckdb_backend"],
}
Verify the install¶
2. Connecting to a database¶
SchemaCache.from_url(url) is the single entry point. The URL scheme
selects the backend; credentials follow the conventions of the
underlying driver.
| Backend | URL scheme | Credentials |
|---|---|---|
| Postgres | postgres://user:pass@host:port/db |
password in URL or PGPASSWORD env |
| SQLite | sqlite:///abs/path/to.db or sqlite://./rel.db?mode=rwc |
none |
| MySQL | mysql://user:pass@host:port/db |
password in URL |
| DuckDB | duckdb:// (in-memory) or duckdb:///abs/path.duckdb |
none |
| MSSQL | mssql://user:pass@host:port/db |
?encrypt=true / ?trust_cert=true query params |
| BigQuery | bigquery://project[/dataset] |
GOOGLE_APPLICATION_CREDENTIALS env or ADC (gcloud auth application-default login) |
| Snowflake | snowflake://account/database[/schema] |
SNOWFLAKE_USER, SNOWFLAKE_PRIVATE_KEY_PATH, optional SNOWFLAKE_WAREHOUSE, SNOWFLAKE_ROLE |
One-liners per backend:
from schemadex import SchemaCache
# Postgres
cache = SchemaCache.from_url("postgres://app:secret@db.internal:5432/orders")
# SQLite (absolute path)
cache = SchemaCache.from_url("sqlite:///var/lib/app/data.db")
# MySQL
cache = SchemaCache.from_url("mysql://app:secret@db.internal:3306/orders")
# DuckDB (file)
cache = SchemaCache.from_url("duckdb:///tmp/analytics.duckdb")
# MSSQL with self-signed cert
cache = SchemaCache.from_url(
"mssql://sa:Pass1!@sqlserver:1433/orders?encrypt=true&trust_cert=true"
)
# BigQuery (ADC must be set up first: `gcloud auth application-default login`)
cache = SchemaCache.from_url("bigquery://my-project/analytics")
# Snowflake (key-pair JWT)
# export SNOWFLAKE_USER=...; export SNOWFLAKE_PRIVATE_KEY_PATH=/path/to/key.p8
cache = SchemaCache.from_url("snowflake://my-account/ANALYTICS/PUBLIC")
3. Basic cache operations¶
Once a SchemaCache is built, everything else is local:
from schemadex import SchemaCache
cache = SchemaCache.from_url("postgres://app:secret@db/orders")
# List every cached table (qualified name when the backend has schemas).
print(cache.list_tables())
# ['public.customers', 'public.orders', 'public.shipments']
# Get the full structured record for one table.
table = cache.get_table("public.orders")
print(table["columns"][0])
# {'name': 'id', 'data_type': 'integer', 'nullable': False, ...}
# Per-database DDL fingerprint (None if no fingerprint was computed).
print(cache.fingerprint())
# 'sha256:9f2c…'
# Where this cache lives on disk.
print(cache.cache_path())
# '/Users/you/Library/Caches/schemadex/9b21c4e8…/database.json.zst'
# Full dump as a JSON string (suitable for diffing or shipping to a peer).
blob = cache.to_json()
The first call to from_url(url) writes a zstd-compressed envelope to
~/.cache/schemadex/<url-hash>/database.json.zst (on macOS the path is
under ~/Library/Caches/, on Linux $XDG_CACHE_HOME or ~/.cache).
Subsequent calls within the TTL (24 hours by default; override with
ttl_seconds=) read the envelope back instead of hitting the database.
The architecture doc has the full cache-layout spec
(architecture.md).
4. Refreshing the cache¶
When DDL changes you can re-introspect without re-paying the cost on every table:
# Re-introspect everything. Tables whose DDL hash is unchanged are
# left alone; only changed tables are rewritten.
changed, unchanged = cache.refresh("postgres://app:secret@db/orders")
print(f"{len(changed)} changed, {len(unchanged)} unchanged")
# Re-introspect a single table by qualified or bare name.
changed, unchanged = cache.refresh_table(
"postgres://app:secret@db/orders",
"public.orders",
)
Both calls return a (changed, unchanged) tuple of qualified table
names. The DDL fingerprint check happens inside the Rust core — if a
table's hash matches the persisted snapshot, it doesn't get rewritten.
5. Resolving fuzzy column names¶
This is the primary friction-removal feature. The agent says
customer_idd; you want customer_id.
r = cache.resolve("public.orders", "customer_idd")
print(r.matched) # 'customer_id'
print(r.confidence) # 0.94
print(r.alternatives) # [('customer_uuid', 0.71), ('id', 0.55)]
Scoring model:
- Base score is Jaro-Winkler similarity between candidate and every column name on the table.
confidence == 1.0means an exact (case-insensitive) match.alternativescontains up to three runner-up(name, score)pairs.
In practice, treat anything >= 0.85 as "trust the match," anything
between 0.70 and 0.85 as "ask the user," and lower than 0.70 as
"the candidate is probably wrong; fall back to listing columns."
Embedding fallback¶
For semantic misses Jaro-Winkler can't bridge (review_body ↔ body,
state ↔ status), resolve_with_embedding re-ranks the candidates by
cosine similarity of embeddings produced by a local Ollama model:
from schemadex import SchemaCache, resolve_with_embedding
cache = SchemaCache.from_url("postgres://app:secret@db/reviews")
r = resolve_with_embedding(
cache,
"users",
"review_body",
threshold=0.85, # only embed if lexical < 0.85
model="nomic-embed-text-v2-moe",
ollama_url="http://localhost:11434",
)
print(r.matched, r.confidence)
Requires ollama serve to be running locally and the requested model to
be pulled (ollama pull nomic-embed-text-v2-moe). If Ollama is
unreachable or the model is missing, resolve_with_embedding logs a
warning to stderr and returns the original lexical result — it never
raises.
Synonym dictionary¶
For domain-specific aliases the lexical scorer doesn't know about
(amount_cents is the project's name for total, etc.) provide a YAML
synonym map:
# .schemadex/synonyms.yaml
public.orders:
total: amount_cents
shipped_at: dispatched_at
public.customers:
email: contact_email
# Option A: load once, reuse on every resolve call.
cache.load_synonyms(".schemadex/synonyms.yaml")
r = cache.resolve("public.orders", "total",
synonyms_path=".schemadex/synonyms.yaml")
print(r.matched) # 'amount_cents'
The parsed synonym map is cached on the SchemaCache instance, so
repeated calls with the same synonyms_path don't re-read the file.
6. Token-budgeted schema descriptions¶
The describe API renders the cache into a chunk of text small enough to fit in an LLM prompt:
prompt, tokens = cache.describe_for_agent(
max_tokens=1500,
hint="orders by region",
include_samples=True,
include_foreign_keys=True,
)
print(f"-- {tokens} tokens --")
print(prompt)
Key behaviors:
hintbiases the per-table relevance score. Tables that match the hint are kept; low-ranked tables are dropped first when the budget is tight.max_tokensis counted withtiktoken(cl100k_baseencoding), the same tokenizer the GPT-4o-family and Claude-class models use, so the estimate is within a few percent of the model's own count.- When the rendered output exceeds the budget, the truncation hierarchy fires in this order:
- drop sample values,
- drop column / table comments,
- drop foreign keys,
- drop columns past ordinal 8 per table,
- drop lowest-ranked tables.
You can also restrict the output to specific tables — useful when the agent already knows roughly what it wants:
prompt, tokens = cache.describe_for_agent(
max_tokens=800,
tables=["public.orders", "public.customers"],
)
For very tight budgets (under ~512 tokens on a 50-table schema), see the token-budget stress run in benchmark.md.
7. Running SQL¶
Schemadex includes a read-only SQL runner with built-in markdown rendering. The connection pool is created lazily per URL and shared across calls, so the first invocation pays the connect cost and the rest don't.
# Blocking call — returns once the entire result set has been rendered.
text, tokens = cache.run_sql(
"postgres://app:secret@db/orders",
"SELECT id, email FROM customers ORDER BY id LIMIT 50",
token_budget=1024,
)
print(text)
# | id | email |
# | --- | --- |
# | 1 | a@example.com |
# ...
# Streaming variant — the backend stops pulling rows as soon as the
# rendered table would exceed `token_budget`. Use this for queries that
# might return millions of rows.
text, tokens = cache.run_sql_streaming(
"postgres://app:secret@db/orders",
"SELECT * FROM big_table",
token_budget=512,
)
Safety model:
- Both functions parse the SQL through
assert_readonlyfirst, which rejects everything exceptSELECT/WITH/EXPLAIN/SHOW/DESCRIBE/DESC. run_sql(..., allow_write=True)skips the read-only guard. Only do this if you have already validated the SQL yourself —DELETE,DROP, andUPDATEwill reach the database.run_sql_streamingalways enforces the read-only check; there is noallow_writeescape hatch on the streaming path.
The output is a markdown pipe-table. If rows had to be dropped to fit
the budget, the renderer appends _(truncated to N rows)_ so the agent
knows it didn't see everything.
8. SQL pre-validation and error-to-hint¶
validate_sql runs the cached schema over a query before you execute
it. It is a heuristic, regex-driven check — not a full SQL parser — but
it catches the typos LLM SQL agents actually emit.
issues = cache.validate_sql("SELECT emial FROM users")
print(issues)
# [{
# 'kind': 'unknown_column',
# 'table': 'users',
# 'identifier': 'emial',
# 'suggestion': 'email',
# 'confidence': 0.95,
# }]
hint_for_error runs the same logic in reverse: given a raw database
error message, it pulls out the likely-real identifier and emits a
structured hint the agent can retry with:
hint = cache.hint_for_error('column "emial" does not exist')
print(hint)
# {
# 'kind': 'unknown_column',
# 'table': None,
# 'original_identifier': 'emial',
# 'suggested_identifier': 'email',
# 'confidence': 0.95,
# 'human_message': "column 'emial' does not exist — did you mean 'email'?",
# }
hint_for_error returns None if the error text doesn't match a known
pattern.
9. Sampling and sentinel detection¶
Sampling collects top-K values and percentiles per column, then flags
any value covering more than 40% of the column as a sentinel — the
common case being unhelpful placeholders like 'No Delay' or 'N/A'
that an agent shouldn't filter on naively.
cache = SchemaCache.from_url(
"postgres://app:secret@db/outages",
sample_values=True,
sample_top_k=10,
sample_sentinel_threshold=0.4,
sample_rows=10000,
)
table = cache.get_table("public.outages")
delay = next(c for c in table["columns"] if c["name"] == "delay_code")
print(delay["sample"]["sentinel"])
# ('No Delay', 0.80)
Sampling is supported on Postgres, SQLite, MySQL, and DuckDB.
PII redaction is enabled by default. Columns named email, phone,
ssn, password, etc., are skipped at sample time — the sample field
on those columns is None. The same skip applies to columns whose
comment carries a "PII" / "personally identifiable" marker. To override,
construct a custom SamplingPolicy in Rust and clear redaction; the
Python binding exposes only the safe defaults.
10. Async API¶
The async variants share the same tokio runtime as the sync API and don't block the event loop:
import asyncio
from schemadex import from_url_async, run_sql_async, refresh_async
URL = "postgres://app:secret@db/orders"
async def main():
cache = await from_url_async(URL)
text, tokens = await run_sql_async(
cache,
URL,
"SELECT id, email FROM customers LIMIT 50",
token_budget=1024,
)
changed, unchanged = await refresh_async(cache, URL)
asyncio.run(main())
refresh_table_async is the single-table sibling of refresh_async.
All async functions accept the same sampling kwargs as their sync
counterparts.
11. MCP server¶
Install with pip install "schemadex[mcp]". The schemadex-mcp
console script speaks the Model Context Protocol over stdio:
Wire it into Claude Code by adding to ~/.claude/mcp.json:
{
"mcpServers": {
"schemadex": {
"command": "schemadex-mcp",
"args": ["--url", "sqlite:///path/to/db.sqlite"]
}
}
}
The agent then has these tools, with no extra wiring:
| Tool | Description |
|---|---|
list_tables |
List every table in the connected database. |
describe_for_agent |
Render a token-budgeted schema description, optionally biased by a free-text hint. |
resolve_column |
Fuzzy-resolve a candidate column name on a table. Returns matched + confidence + up to three alternatives. |
run_sql |
Run a read-only SQL query and return a markdown-rendered result table that fits inside token_budget. |
validate_sql |
Pre-validate a SQL query against the cached schema. Returns a list of issues; an empty list means the query references only known identifiers. |
hint_for_error |
Wrap a raw database error message in a structured hint ({kind, original_identifier, suggested_identifier, human_message}). |
Operational flags:
# Expose /health and /metrics (Prometheus text format) on port 9100.
schemadex-mcp --url ... --metrics-port 9100
# Print the registered tool catalog as JSON and exit. Useful for
# agents that don't speak MCP but want the JSON Schema for each tool.
schemadex-mcp --url ... --print-schemas
The /metrics endpoint exposes schemadex_cache_tables,
schemadex_introspection_seconds_total, schemadex_run_sql_calls_total,
and schemadex_run_sql_errors_total.
12. Agent framework adapters¶
Each adapter ships as a runnable example under examples/. Copy the
file into your own codebase or import it directly.
LangChain (examples/langchain_tools.py)¶
from langchain_core.tools import Tool
from schemadex import SchemaCache
from examples.langchain_tools import make_schema_tools
cache = SchemaCache.from_url("sqlite:///demo.sqlite")
tools = make_schema_tools(cache)
# `tools` is a list of LangChain Tool objects: list_tables,
# describe_for_agent, resolve_column. Hand them to AgentExecutor.
LangGraph (examples/langgraph_node.py)¶
from langgraph.graph import StateGraph
from examples.langgraph_node import schema_node, AgentState
graph = StateGraph(AgentState)
graph.add_node("schema", schema_node(cache, max_tokens=2048))
graph.set_entry_point("schema")
LlamaIndex (examples/llamaindex_retriever.py)¶
from llama_index.core.query_engine import RetrieverQueryEngine
from examples.llamaindex_retriever import SchemaIndexRetriever
retriever = SchemaIndexRetriever(cache, max_tokens=2048)
engine = RetrieverQueryEngine.from_args(retriever)
response = engine.query("which regions had the most refunds?")
DSPy (examples/dspy_module.py)¶
import dspy
from examples.dspy_module import SchemadexContext
ctx = SchemadexContext(cache, max_tokens=2048)
prediction = ctx(question="orders by region")
print(prediction.schema) # the schema description
print(prediction.schema_tokens) # the token count
LiteLLM (examples/litellm_adapter.py)¶
from examples.litellm_adapter import schemadex_completion
resp = schemadex_completion(
cache,
"which regions had the most refunds?",
model="ollama/qwen2.5-coder:3b",
)
print(resp.choices[0].message.content)
13. dbt manifest¶
When a project already has a dbt manifest.json, you can build a
SchemaCache directly off it — no live warehouse needed. Useful for
offline development, CI, and sandboxed environments where the agent
can't reach production:
from schemadex import dbt_source
cache = dbt_source.from_manifest("target/manifest.json")
print(cache.list_tables())
Under the hood the manifest is projected into a synthetic SQLite
database whose schema mirrors every dbt model / source / seed. From
there it's the same SchemaCache you'd get from a live connection —
the same resolve, describe_for_agent, and so on.
14. Jupyter magic¶
A small IPython extension is bundled. Load it once, set a default URL, then hit the cache from a notebook line:
%load_ext schemadex
# Set a default URL for the rest of the session.
%schemadex_url sqlite:///tmp/demo.sqlite
# List every table.
%schemadex list-tables
# Describe one table.
%schemadex describe orders
# Resolve a column.
%schemadex resolve orders customer_idd
You can also pass --url inline if you switch databases mid-session:
15. Tracing + OpenTelemetry¶
schemadex-core emits structured tracing spans on every public entry
point. For Python users, the spans surface anywhere a Rust
tracing_subscriber is installed.
Plain log output via RUST_LOG:
# High-signal cache events; sqlx warnings only.
RUST_LOG=schemadex=info,sqlx=warn cargo run --example my_app
# Per-backend introspection calls.
RUST_LOG=schemadex_core::backends::postgres=debug,schemadex=info cargo run
For Rust users who want the same spans as OTLP traces, build with the
otel feature and call init_otel once at startup:
use schemadex_core::init_otel;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
init_otel("schemadex", "http://localhost:4317")?;
// ... build SchemaCache, run queries — every span is now an OTLP span.
Ok(())
}
The full collector recipe (Jaeger / Honeycomb / Datadog all work over
OTLP/gRPC on 4317) lives in observability.md.
16. Schema diff¶
schemadex-diff is a small Rust binary that compares two cache
snapshots and prints a markdown changelog:
Each input is one of:
- a schemadex cache envelope (the JSON inside
~/.cache/schemadex/<url-hash>/database.json.zstafter decompression), - or a bare
DatabaseJSON object emitted bycache.to_json().
The on-disk cache is zstd-compressed; decompress it first or write a
fresh dump via to_json():
zstd -d ~/.cache/schemadex/9b21c4e8…/database.json.zst -o old.json
# ... later, after a refresh:
zstd -d ~/.cache/schemadex/9b21c4e8…/database.json.zst -o new.json
schemadex-diff --from old.json --to new.json
Output covers tables added or removed, columns added or removed per
surviving table, and column-type changes. If both snapshots are
equivalent the binary prints no changes and exits 0.
17. Troubleshooting¶
Q: ImportError: No module named 'schemadex._native'
A: The compiled extension didn't ship with the wheel for your
platform. Reinstall via pip install --force-reinstall schemadex, or
build from source with maturin develop --release from a clone of the
repo.
Q: The cache returns stale data after a DDL change.
A: Call cache.refresh(url) (the DDL fingerprint check then leaves
unchanged tables alone), or delete ~/.cache/schemadex/<url-hash>/ to
force a full re-introspection on the next from_url.
Q: resolve_with_embedding always returns the lexical result.
A: Ollama isn't reachable or the model isn't pulled. Check that
ollama serve is running and try curl
http://localhost:11434/api/tags to confirm. The fallback is silent on
purpose — it logs to stderr and never raises so it can't break an
agent loop.
Q: MSSQL fails with Login failed against a self-signed cert.
A: Append ?encrypt=true&trust_cert=true to the URL — without
trust_cert=true, TDS rejects the certificate before authentication.
Q: BigQuery raises ApplicationDefaultCredentialsError.
A: Run gcloud auth application-default login once on the host, or
set GOOGLE_APPLICATION_CREDENTIALS to a service-account JSON path.
Q: pip install schemadex fails on ARM Linux.
A: Before v1.0.0 there was no manylinux-aarch64 wheel; pip would
fall back to building from source, which needs a Rust toolchain. Either
install a recent rustc and let it build, or upgrade to v1.0.0+ where
the wheel is published.
Q: run_sql hangs on the first call.
A: That's the connection-pool handshake. Subsequent calls reuse
the pool. If it never returns, the URL is wrong or the database is
unreachable; cancel the call and verify with a plain psql /
mysql / etc. client.
Q: cache.to_json() is huge.
A: Expected — it's the full uncompressed dump. The on-disk
database.json.zst is the same content zstd-compressed (~10× smaller
on real schemas).
18. Versioning and stability¶
schemadex follows SemVer 2.0.0. The exact
contract — what counts as public, what counts as breaking — is spelled
out in semver.md. Pre-1.0 minor versions (0.x.0) may
break; patch versions (0.x.y) won't.
For 0.x → 1.0 migration notes see
migration-0.x-to-1.0.md. The deprecation
policy (one minor with DeprecationWarning before any public removal)
lives in deprecation.md.