When I start wrangling a new data source, I search for a lot of terms, I ask a lot of questions, and I take a lot of notes. I like to publish those notes on the company wiki. Working in public tends to improve the quality of the docs: no one has to re-invent the wheel, I sometimes get comments that improve them, and there is a clear stub for others to maintain.
These notes often end up as onboarding directives (“go read these pages to learn about our data”), and they’re valuable to skim when revisiting data.
When I was in a highly mature data science organization with weak engineering practices, everyone was building on everyone else’s data, and data documentation was useful for tracking alternative datasets and tracing them back to canonical sources. Now I’m in a maturing data science organization with strong engineering practices, and these documents are useful because they stop data users from repeatedly annoying the engineers with the same questions.
I’ve landed on a general template while documenting many, many dozens of datasets. I want to share it in case it can be useful to someone else.
$typeOfData data documentation
The page naming convention supports searching for “X data”.
The first section is a very short orientation (ideally 1-3 sentences). It gives just enough for the reader to figure out if the data are relevant to their task. When the dataset scope is complex, this section tends to be a bit longer to explain what is (not) included.
Background
This section varies in length. It covers anything someone might need to know that isn’t immediately apparent, such as:
- Where do these data come from? Do we have any contractual or PII obligations?
- How does the system that creates these data work? (for instance, how does Tor work, how does network communication work, why would a certificate be revoked, what is domain registration)
Summary of dataset
This section contains a short paragraph or small table that always gives the most basic facts:
- Where can you find the data?
- What time period do the data cover?
- How often is the source refreshed? By whom?
- Roughly how large are the data on disk (per unit of time)?
- Which team(s) are the best POC for these data? (useful in itself and as an indicator of data maturity)
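A couple of these facts (rough size on disk and the time period covered) can be checked straight from the files; the rest have to come from the owning team. A minimal sketch, assuming the canonical copy is a directory of Parquet files with a timestamp column (the path and column name here are invented):

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the canonical copy; substitute the real path.
DATA_DIR = Path("/data/warehouse/certificates/")

files = sorted(DATA_DIR.glob("**/*.parquet"))

# Rough size on disk.
total_bytes = sum(f.stat().st_size for f in files)
print(f"{len(files)} files, {total_bytes / 1e9:.1f} GB total")

# Time period covered, assuming files are written in date order and each
# has a 'timestamp' column; reading only that column keeps this cheap.
first = pd.read_parquet(files[0], columns=["timestamp"])
last = pd.read_parquet(files[-1], columns=["timestamp"])
print("covers", first["timestamp"].min(), "to", last["timestamp"].max())
```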
Organizations sometimes keep multiple copies of the same data. Maybe that looks like Parquet and .ndjson and Avro copies of every file. Maybe that looks like GCS and BigQuery and Databricks. Maybe that looks like something else entirely. I get opinionated here: I think it’s unreasonable to document all of those copies, so it’s important to decide which one is the “organizationally canonical” source and write data documentation that reflects only that canonical source. (And I think the canonical source itself should have documentation discussing how it is loaded.)
Fields
This section is the core of the documentation. It contains a very, very long table. Each field gets its own row. I use three columns:
- field name
- example values
- notes
The examples must be edited to be notional. The company wiki never has the same controls as R&D systems do, so I always edit the examples so they do not actually appear in the dataset.
The “notes” column is the core. This is where I capture relevant facts: whether this field is sufficient to be a primary key, whether it’s a hash of another field, whether it’s an enum masquerading as a string and how frequent each value is, whether it is actually the same as the identically named field in that other dataset, whether it’s a dirty raw version of some other field, and so on.
I often stub out a row for each field, and then I fill in example values and notes as I work with each field.
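To save some typing, the stub can be generated from a sample of the data. A rough pandas sketch that prints a markdown table in the field name / example values / notes layout (the profiling hints only seed the notes, and the printed example values still need to be replaced with notional ones before publishing):

```python
import pandas as pd

# Hypothetical sample of the dataset being documented.
df = pd.read_parquet("sample.parquet")

print("| field name | example values | notes |")
print("|---|---|---|")
for col in df.columns:
    series = df[col]
    example = str(series.dropna().iloc[0]) if series.notna().any() else ""

    # Cheap profiling hints to seed the notes column.
    hints = []
    if series.is_unique:
        hints.append("unique in this sample; possible primary key")
    if series.nunique() <= 10:
        top = series.value_counts(normalize=True).head(3)
        hints.append("low cardinality: " + ", ".join(f"{v} ({p:.0%})" for v, p in top.items()))

    print(f"| {col} | {example} | {'; '.join(hints)} |")
```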
Limitations and gotchas
This section reiterates the most frequent mistakes that someone is likely to make with these data. The entire section is often just a handful of very short bullet points.
It includes big-picture gotchas. What kinds of data will not be visible in this dataset? What kinds of mistakes will people make? (This is the place where I underscore the implications of what is already documented – “remember to deduplicate when you JOIN on this data, or your join will explode”.)
It also includes the most important field-level gotchas (“the domains in this dataset always have a trailing period appended, which means all naive JOINs will silently fail to match”).
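For gotchas like these, a tiny worked example often lands better than prose alone. A hypothetical pandas illustration (the frames and columns are invented):

```python
import pandas as pd

# Invented toy frames to illustrate both gotchas.
certs = pd.DataFrame({"domain": ["example.com.", "example.org."]})    # trailing periods
whois = pd.DataFrame({"domain": ["example.com", "example.com", "example.org"],
                      "registrar": ["A", "A", "B"]})                  # duplicate rows

# A naive merge matches nothing (trailing periods) and, once that is
# fixed, fans out on the duplicated right-hand side. Handle both first.
certs["domain"] = certs["domain"].str.rstrip(".")
whois = whois.drop_duplicates(subset=["domain"])

joined = certs.merge(whois, on="domain", how="left")
print(joined)
```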
Different people read different parts of the page, and duplicating warnings is usually valuable, so I think the repetition is worthwhile.
Usage
When the data source is canonical but opaque, and there isn’t organizational interest in maintaining either a usable copy of all the data or a centralized query library, it can be useful to provide an example bit of SQL or Java code. I write a bit of code that massages the data into a useful format while correctly avoiding the gotchas. If the organizational appetite for data maturity is there, that’s even better; maintain the usable copy or query library instead.
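What that snippet looks like depends on the stack. As a hedged illustration of the idea in pandas rather than SQL or Java (the path, columns, and gotchas below are invented), it might read something like:

```python
import pandas as pd

# Hypothetical canonical location; adjust to the real source.
raw = pd.read_parquet("/data/warehouse/domains/")

domains = (
    raw[["domain", "first_seen", "registrar_id"]]           # the columns most analyses need
    # Gotcha: domains carry a trailing period that breaks naive joins.
    .assign(domain=lambda d: d["domain"].str.rstrip("."))
    # Gotcha: the raw feed repeats rows across refreshes.
    .drop_duplicates(subset=["domain", "first_seen"])
    # Gotcha: first_seen arrives as a string, not a timestamp.
    .assign(first_seen=lambda d: pd.to_datetime(d["first_seen"], utc=True))
)
```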
Related resources
This section distinguishes these data from similar data sources. It also curates links to any other related data that someone might not even know they were looking for (e.g., WHOIS might link to DNS data as a source of FQDNs (fully qualified domain names) rather than only SLDs (second-level domains)).
And that’s that. I don’t use all the sections every time, and I sometimes move them around, but I do tend to keep to this structure. I find public documents in a consistent structure to be very valuable for working with data.