# Log Schema Support for Zeek
This Zeek package generates schemas for Zeek's logs. For every log your Zeek installation produces (such as conn.log or tls.log), the schema describes each log field, including its name, type, docstring, and more. The package supports popular schema formats and understands Zeek's log customization in detail. The schema export code is extensible, allowing you to produce your own schemas.
## Quickstart
Install this package via `zkg install logschema`. The package has no dependencies and currently works with Zeek 5.2 and newer.
To get a JSON Schema of each Zeek log in your installation, run:
```console
$ zeek logschema/export/jsonschema
```
Your local directory now contains a JSON Schema file for each of Zeek's logs. For example, for your conn.log:
```console
$ cat zeek-conn-log.schema.json | jq
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Schema for Zeek conn.log",
  "description": "JSON Schema for Zeek conn.log",
  "type": "object",
  "properties": {
    "ts": {
      "type": "number",
      "description": "This is the time of the first packet."
    },
    ...
}
```
To instead get a schema in CSV format, run this:
```console
$ zeek logschema/export/csv
```
This combines all schema information in one file:
```console
$ cat zeek-logschema.csv
log,field,type,record_type,script,is_optional,default,docstring,package
analyzer,ts,time,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"Timestamp of confirmation or violation.",-
analyzer,cause,string,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"What caused this log entry to be produced. This can\ncurrently be ""violation"" or ""confirmation"".",-
analyzer,analyzer_kind,string,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"The kind of analyzer involved. Currently ""packet"", ""file""\nor ""protocol"".",-
...
```
## Background
Zeek features a powerful logging framework that manages Zeek's log streams, log writes, and their eventual output format. The format of Zeek's log entries is highly site-specific: it depends on the configuration of log filters, enrichments that add additional fields to existing logs, new logs produced by add-on protocol parsers, and so on.
Zeek does not automatically provide a description of what the resulting log data looks like after all of this customization. This package closes that gap, allowing users to verify that their logs still look the same after an upgrade, that they're compatible with a given log ingester, etc.
The package does this by using reflection APIs at runtime. It scans registered log streams to retrieve each log's underlying Zeek record type and study its fields, and inspects a configurable log filter on each of those streams to understand included/excluded fields, separator naming, field name mappings, etc. For each schema format, a registered exporter then translates the gathered information into suitable output.
## Using the package
The package does nothing when loaded via `@load packages` or `@load logschema`.
Instead, you load the desired exporters, each of which resides in its own script in `logschema/export/<format>`. Exports run at startup: in standalone Zeek this means right after `zeek_init()` handlers have executed; when running in a cluster, it means once the cluster is up and running.
Many aspects of the export are customizable, and you can roll your own logic for when to run (and perhaps re-run) schema generation at runtime if desired.
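For instance, here is a minimal sketch that defers schema generation until shutdown, using the `Log::Schema::run_at_startup` and `Log::Schema::run_export()` knobs described later in this README:

```zeek
@load logschema/export/csv

# Sketch: skip the built-in startup export and produce schemas at
# shutdown instead, e.g. after Zeek has finished reading a trace.
redef Log::Schema::run_at_startup = F;

event zeek_done() {
    Log::Schema::run_export();
}
```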
## Schema information
For each log stream known to Zeek, the package determines the following for each of the log's fields:

- the name (such as `uid` or `service`),
- its type in the Zeek scripting language (such as `string` or `count`),
- the record type containing the field (such as `Conn::Info` or `conn_id`),
- whether the field is optional (*),
- the default value of the field, if any,
- the field's docstring,
- the Zeek script that defined the field (*),
- the package that added the field, if applicable (*).
(*) Only available when using Zeek 6 or newer.
The package then filters this information based on modifications applied by the log filter in effect, which can include/exclude fields, transform field names, add extension fields, etc.
At this point, each schema exporter decides how to use the resulting field metadata. Not all schema formats support all of this information -- for example, a schema language may have no concept of the Zeek package providing a log field.
## Supported schema formats
### JSON Schema
```zeek
@load logschema/export/jsonschema
```
This exporter produces JSON Schema files. By default it writes one schema file per log, named `zeek-{logname}-log.schema.json`. Each log field becomes a property in the schema. The schemas feature the type of each field when rendered in JSON, a description (from Zeek's docstrings), default values, and whether a field is required. They currently do not annotate or enforce formats (e.g., to convey that an address string is formatted as an IP address), and they don't yet apply all conceivable constraints (such as the integer range of a port number). The schemas also don't currently prohibit `additionalProperties`.
The schemas are "data-centric", not "metainformation-centric". For example, the Zeek script defining a given log field is currently not included, because JSON Schema doesn't provide an immediate keyword to do so. We may add vocabulary to convey such things in the future.
Each log's schema is self-contained.
Note that Zeek logs in JSON format are technically JSONL documents, i.e., every line in a log is a JSON document. Keep this in mind when validating logs.
#### Customization
Redef `Log::Schema::JSONSchema::filename` to control the file output; see below for details.
#### Example
```console
$ cat zeek-conn-log.schema.json | jq
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Schema for Zeek conn.log",
  "description": "JSON Schema for Zeek conn.log",
  "type": "object",
  "properties": {
    "ts": {
      "type": "number",
      "description": "This is the time of the first packet."
    },
    ...
}
```
#### Validation
Using the Sourcemeta `jsonschema` CLI:

```console
$ npm install --global @sourcemeta/jsonschema
$ zeek -r test.pcap LogAscii::use_json=T
$ zeek logschema/export/jsonschema
```
Now:
```console
$ jsonschema validate zeek-conn-log.schema.json conn.log
$
$ # Pass! Now mismatch schema and log:
$ jsonschema validate zeek-conn-log.schema.json ssl.log
fail: /home/christian/t4/logs/ssl.log
error: Schema validation failure
  The value was expected to be an object that defines properties "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "ts", and "uid"
  at instance location ""
  at evaluate path "/required"
```
### CSV
```zeek
@load logschema/export/csv
```
The CSV exporter renders the schema into comma-separated rows, with one row per log field. By default it produces a file called `zeek-logschema.csv`. A header line explaining each column is optional and included by default. The line-oriented nature makes this format great for diffing.
For "complex" columns, such as default values or the docstrings, the formatter
uses JSON representation of the resulting strings. It escapes \"
to ""
, but
leaves escaped newline in place.
#### Customization
Redef `Log::Schema::CSV::filename` to control the file output; see below for details.
To disable the header line, use the following:
```zeek
redef Log::Schema::CSV::add_header = F;
```
To change the separator from commas to another string:
```zeek
redef Log::Schema::CSV::separator = ":";
```
To change the string used for unset `&optional` fields from the default of `"-"`:
```zeek
redef Log::Schema::CSV::unset_field = "";
```
#### Example
```console
$ cat zeek-logschema.csv
log,field,type,record_type,script,is_optional,default,docstring,package
analyzer,ts,time,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"Timestamp of confirmation or violation.",-
analyzer,cause,string,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"What caused this log entry to be produced. This can\ncurrently be ""violation"" or ""confirmation"".",-
analyzer,analyzer_kind,string,Analyzer::Logging::Info,base/frameworks/analyzer/logging.zeek,false,-,"The kind of analyzer involved. Currently ""packet"", ""file""\nor ""protocol"".",-
...
```
### Zeek Log
```zeek
@load logschema/export/log
```
This exporter looks a lot like the CSV format, but produces a regular Zeek log named `logschema` with the schema information (and yes, the log itself gets reflected in the schema :-). This is a handy way to record and archive schema information as part of your regular Zeek setup.
#### Example
```console
$ cat logschema.log
#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path logschema
#open 2025-05-20-18-00-08
#fields log field _type record_type script is_optional _default docstring package
#types string string string string string bool string string string
analyzer ts time Analyzer::Logging::Info base/frameworks/analyzer/logging.zeek F - Timestamp of confirmation or violation. -
analyzer cause string Analyzer::Logging::Info base/frameworks/analyzer/logging.zeek F - What caused this log entry to be produced. This can\x0acurrently be "violation" or "confirmation". -
analyzer analyzer_kind string Analyzer::Logging::Info base/frameworks/analyzer/logging.zeek F - The kind of analyzer involved. Currently "packet", "file"\x0aor "protocol". -
...
```
### Zeek-y JSON
```zeek
@load logschema/export/json
```
This exporter simply runs the package's internal log state through `to_json()` to produce the schema, and is just a handful of lines of code. While simple, it naturally features all the information the log analysis builds up. We mostly consider this a development/troubleshooting tool and wouldn't recommend it for actual schema use.
By default, this writes a single output file called `zeek-logschema.json`. The toplevel JSON value is an object, with each key being a Zeek `Log::ID` and the value the corresponding state.
#### Customization
Redef `Log::Schema::JSON::filename` to control the file output; see below for details.
#### Example
```console
$ cat zeek-logschema.json | jq
{
  "Analyzer::Logging::LOG": {
    "name": "analyzer",
    "fields": {
      "ts": {
        "name": "ts",
        "type": "time",
        "record_type": "Analyzer::Logging::Info",
        "script": "base/frameworks/analyzer/logging.zeek",
        "is_optional": false,
        "docstring": "Timestamp of confirmation or violation."
      },
      "cause": {
        "name": "cause",
        "type": "string",
        "record_type": "Analyzer::Logging::Info",
        "script": "base/frameworks/analyzer/logging.zeek",
        "is_optional": false,
        "docstring": "What caused this log entry to be produced. This can\ncurrently be \"violation\" or \"confirmation\"."
      },
      ...
```
## Choosing a log filter
By default, the package studies the `default` filter on each log stream. You can adjust this by redef'ing `Log::Schema::logfilter`.
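For instance, assuming your streams carry a filter named "json" (a hypothetical name), you could point the package at it like this:

```zeek
# Sketch: study the filter named "json" rather than "default".
# The filter name is hypothetical; use whatever your streams define.
redef Log::Schema::logfilter = "json";
```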
## Configuring filenames
All exporters except the Zeek log one write their schemas to files. You can configure how they do this by adjusting a per-exporter filename pattern. This pattern supports keyword substitutions, as follows:
- `{log}`: the name of the log, such as "`conn`". This keyword also controls whether the exporter writes one file per log or all schemas into a single file: when the filename pattern features this keyword, it's one file per log; otherwise it's a single file.
- `{filter}`: the log filter used for the export, such as "`default`".
- `{pid}`: the PID of the Zeek process, handy for disambiguating multiple runs.
- `{version}`: the Zeek version string, as produced by `zeek_version()`.
- `strftime()` conversion characters, such as `%Y-%m-%d`, based on `current_time()`.
Using "-" as filename will cause the schemas to be written to stdout.
## Customizing log metadata
The package provides a hook to make arbitrary changes to the log metadata before the exporters produce schemas from it. Let's say you want to patch up the docstring of conn.log's `service` field. With this in `test.zeek` ...
```zeek
hook Log::Schema::adapt(logs: Log::Schema::LogsTable) {
    logs[Conn::LOG]$fields["service"]$docstring = "My much better docstring";
}
```
... creating a JSON Schema yields:
```console
$ zeek logschema/export/jsonschema ./test.zeek
$ cat zeek-conn-log.schema.json | jq '.properties["service"]'
{
  "type": "string",
  "description": "My much better docstring"
}
```
Consult the logschema package's `Field` record for details on the available log field metadata.
## Writing your own exporter
Writing an exporter involves three steps:
1. Create a record of type `Log::Schema::Exporter` with a name for your exporter and the needed function callbacks. The record features callbacks for every log the reflection processes (`$process_log()`), a finalization pass over all state prior to output (`$finalize_schema()`), a callback to write all information to a single file (`$write_all_schemas()`), a callback to write a single log's schema to a file (`$write_single_schema()`), and a custom output routine for when filenames don't apply (`$custom_export()`).
2. Register this exporter with a call to `Log::Schema::add_exporter()`. This usually happens in a `zeek_init()` handler.
3. Run the export. You can use the default logic, in which case you need to do nothing. To roll your own logic, redef `Log::Schema::run_at_startup` to `F` to disable built-in schema production, and call `Log::Schema::run_export()` where- and whenever you see fit.
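For orientation, here is a hypothetical minimal exporter covering steps 1 and 2. The record field names and the `$custom_export()` callback signature are assumptions based on the descriptions above, not the package's verbatim API:

```zeek
# Hypothetical sketch: an exporter that prints a field count per log
# via its custom output routine. The callback signature is assumed;
# see the bundled exporters for the actual interface.
event zeek_init() {
    Log::Schema::add_exporter(Log::Schema::Exporter(
        $name = "fieldcount",  # assumed field name
        $custom_export = function(logs: Log::Schema::LogsTable) {
            for ( id, rec in logs )
                print fmt("%s: %d fields", rec$name, |rec$fields|);
        }));
}
```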
Take a look at the exporters in this package to get you started.
## Common pitfalls
### Completeness
Log streams nearly always get defined in `zeek_init()` event handlers. That's why the package looks for registered log streams after those handlers have run. However, script authors are free to create Zeek logs at any time and under arbitrary conditions, so the package will not automatically see such logs. We suggest the use of custom `Log::Schema::run_export()` invocations in that case.
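For instance, a sketch with a hypothetical event standing in for whatever signals that your dynamically created log stream now exists:

```zeek
# Sketch: re-run schema generation once a late-created log stream
# exists. MyModule::stream_created is a hypothetical event.
event MyModule::stream_created() {
    Log::Schema::run_export();
}
```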
### Ever-changing default field values
A few Zeek logs use `&default` attributes for which this package produces different output from run to run in schema formats that capture default values, such as CSV. Specifically, the SMB logs have timestamps defaulting to current network time, producing different timestamps every time you generate the schema. You can adjust this and other troublesome output via the `Log::Schema::adapt()` hook mentioned above:
```zeek
hook Log::Schema::adapt(logs: Log::Schema::LogsTable) {
    logs[SMB::FILES_LOG]$fields["ts"]$_default = 0.0;
    logs[SMB::MAPPING_LOG]$fields["ts"]$_default = 0.0;
}
```
(You can also suppress this particular churn by redef'ing `allow_network_time_forward=F`, which will keep these timestamps at 0.0 when producing the schema at startup. You will probably not want to use this approach if you're running Zeek in production while producing schemas, since it affects Zeek's internal handling of time.)
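If that tradeoff is acceptable, for instance when generating schemas offline, the redef is simply:

```zeek
# Keeps network time at 0.0 during startup-time schema generation;
# avoid in production, since it changes Zeek's internal time handling.
redef allow_network_time_forward = F;
```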