Inferring JSON Structure Schemas from Your Data with Structurize
If you have JSON data and need a schema for it, the new json2s command in
Structurize can help. It analyzes your JSON files, figures out what types
you’re working with, and produces a valid JSON Structure schema document — one
that you can use for validation, code generation, or documentation.
What is Structurize?
Structurize is a schema conversion toolkit that transforms between various schema formats: JSON Schema, JSON Structure, Avro Schema, Protocol Buffers, XSD, and more. It also generates code in multiple languages (C#, Python, TypeScript, Java, Go, Rust, C++) and exports schemas to SQL databases and data formats like Parquet and Iceberg.
The tool ships under two package names — structurize and avrotize — both
sharing the same codebase. Choose whichever aligns with your primary use case.
JSON Structure users will likely prefer structurize.
Install it with:
pip install structurize
The json2s Command: Schema Inference from JSON
The json2s command reads one or more JSON files and infers a JSON Structure
schema from them. It handles single JSON objects, JSON arrays, and JSONL
(newline-delimited JSON) files.
Basic Usage
structurize json2s data.json --out schema.jstruct.json --type-name MyType
Parameters:
<json_files...>— One or more JSON files to analyze--out— Output path for the JSON Structure schema (stdout if omitted)--type-name— Name for the root type (default: “Document”)--base-id— Base URI for$idgeneration (default: “https://example.com/”)--sample-size— Maximum records to sample (0 = all, default: 0)--infer-choices— Detect discriminated unions (more on this below)
Multiple Files and JSONL
The command accepts multiple input files, merging their structures into a unified schema. This is useful when your data is split across files or when you want to analyze several examples together.
JSONL files (one JSON object per line) are first-class citizens. The inferrer reads each line as a separate document and consolidates their structures.
# Multiple JSON files
structurize json2s orders.json users.json events.json --out unified.jstruct.json
# A JSONL file with many records
structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent
Detecting Discriminated Unions with --infer-choices
Here’s where things get interesting. Many event-driven systems, APIs, and
message formats use discriminated unions: a single field (often called
type, kind, or event_type) determines which variant of a structure you’re
dealing with.
Consider this JSONL file with three event types:
{"event_type": "user_created", "user_id": "u123", "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"}
{"event_type": "user_created", "user_id": "u456", "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"}
{"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123", "total": 99.50, "items": [{"sku": "A1", "qty": 2}]}
{"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456", "total": 150.00, "items": [{"sku": "B2", "qty": 1}]}
{"event_type": "payment_received", "payment_id": "pay-001", "order_id": "ord-001", "amount": 99.50, "method": "card"}
{"event_type": "payment_received", "payment_id": "pay-002", "order_id": "ord-002", "amount": 150.00, "method": "paypal"}
Without --infer-choices: A Flat Object
Running the basic inference:
structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent
Produces a single object type with all fields merged:
{
"$schema": "https://json-structure.org/meta/core/v0/#",
"$id": "https://example.com/DomainEvent",
"type": "object",
"name": "DomainEvent",
"properties": {
"event_type": { "type": "string" },
"user_id": { "type": "string" },
"email": { "type": "string" },
"created_at": { "type": "string" },
"order_id": { "type": "string" },
"total": { "type": "double" },
"items": { "type": "array", "items": { ... } },
"payment_id": { "type": "string" },
"amount": { "type": "double" },
"method": { "type": "string" }
},
"required": ["event_type"]
}
This works, but it loses the structure: email only makes sense for
user_created events, items only for order_placed, and so on. All fields
become optional except event_type, which is the only one present in every
record.
With --infer-choices: An Inline Union
Add the --infer-choices flag:
structurize json2s events.jsonl --infer-choices --out events.jstruct.json --type-name DomainEvent
Now the inferrer detects that event_type is a discriminator whose values
correlate with distinct field signatures. It produces a JSON Structure
choice type — an inline union:
{
"$schema": "https://json-structure.org/meta/core/v0/#",
"$id": "https://example.com/schemas/DomainEvent",
"type": "choice",
"name": "DomainEvent",
"$extends": "#/definitions/DomainEventBase",
"selector": "event_type",
"choices": {
"order_placed": { "type": { "$ref": "#/definitions/OrderPlaced" } },
"payment_received": { "type": { "$ref": "#/definitions/PaymentReceived" } },
"user_created": { "type": { "$ref": "#/definitions/UserCreated" } }
},
"definitions": {
"DomainEventBase": {
"abstract": true,
"type": "object",
"name": "DomainEventBase",
"properties": {
"event_type": { "type": "string" }
}
},
"OrderPlaced": {
"type": "object",
"name": "OrderPlaced",
"$extends": "#/definitions/DomainEventBase",
"properties": {
"items": { "type": "array", "items": { ... } },
"order_id": { "type": "string" },
"total": { "type": "double" },
"user_id": { "type": "string" }
},
"required": ["items", "order_id", "total", "user_id"]
},
"PaymentReceived": {
"type": "object",
"name": "PaymentReceived",
"$extends": "#/definitions/DomainEventBase",
"properties": {
"amount": { "type": "double" },
"method": { "type": "string" },
"order_id": { "type": "string" },
"payment_id": { "type": "string" }
},
"required": ["amount", "method", "order_id", "payment_id"]
},
"UserCreated": {
"type": "object",
"name": "UserCreated",
"$extends": "#/definitions/DomainEventBase",
"properties": {
"created_at": { "type": "string" },
"email": { "type": "string" },
"user_id": { "type": "string" }
},
"required": ["created_at", "email", "user_id"]
}
}
}
This is a proper inline union:
selectorpoints to the discriminator field (event_type)choicesmaps each discriminator value to a variant type$extendsreferences an abstract base type with common fields- Each variant extends the base and adds its specific fields
The choice keys (order_placed, payment_received, user_created) match the
actual values in the data, so instances validate correctly.
Validating the Result
Using the json-structure Python SDK, we can verify that both the schema and the original instances are valid:
import json
from json_structure import SchemaValidator, InstanceValidator
with open('events_schema.jstruct.json') as f:
schema = json.load(f)
# Validate the schema itself
sv = SchemaValidator(extended=True)
errors = sv.validate(schema)
print('Schema valid:', len(errors) == 0)
# Validate each instance
iv = InstanceValidator(schema, extended=True)
with open('events.jsonl') as f:
for line in f:
if line.strip():
instance = json.loads(line)
errors = iv.validate(instance)
print(f"{instance['event_type']}: {'valid' if not errors else errors}")
Output:
Schema valid: True
user_created: valid
user_created: valid
order_placed: valid
order_placed: valid
payment_received: valid
payment_received: valid
All six instances validate against the inferred schema.
How the Algorithm Works
The --infer-choices option uses a clustering algorithm:
-
Document Fingerprinting: Each JSON object is characterized by its field signature — the set of top-level keys it contains.
-
Jaccard Similarity Clustering: Documents with similar field signatures are grouped together. A two-pass refinement handles edge cases.
-
Discriminator Detection: The algorithm looks for fields whose values correlate strongly with cluster membership. A field like
event_typethat has distinct values for each cluster is a strong discriminator candidate. -
Sparse Data Filtering: If documents have high overlap (same basic structure with some optional fields), they’re treated as a single type with optional properties rather than distinct variants.
-
Nested Discriminators: The algorithm can detect discriminators inside nested objects (up to 2 levels deep), handling envelope patterns like CloudEvents with typed payloads.
The result is a schema that captures the polymorphic structure of your data rather than flattening everything into a single bag of optional fields.
Use Cases
- Event Sourcing: Infer schemas from event logs with multiple event types
- API Documentation: Generate schemas from sample API responses
- Message Queues: Document Kafka/RabbitMQ message formats
- Data Lake Schemas: Create schemas for semi-structured data in Parquet or Iceberg
- Code Generation: Feed the schema into structurize’s code generators to produce typed classes
Getting Started
Install structurize:
pip install structurize
Point it at your data:
structurize json2s your-data.jsonl --infer-choices --out schema.jstruct.json --type-name YourType
Validate the result with the json-structure SDK, or use structurize to convert the schema to code, documentation, or other formats.
The json2s command with --infer-choices bridges the gap between the JSON
data you have and the structured schema you need. It understands that your
data isn’t just a blob of fields — it’s a collection of distinct types
with a common discriminator. And it produces schemas that reflect that structure.