Engineering Principles Scientific Practice

The vision, team, and story behind DataJoint

What if complexity could fuel discovery, instead of breaking it?

Today’s labs are built for improvisation, not scale. Most rely on fragmented tools, poorly documented workflows, and manual processes strung together with brittle scripts. They function, until complexity overwhelms them.

And complexity is rising fast. We’ve stretched loosely managed systems to the breaking point. The result? A crisis of replication and waste, and progress that’s incremental at best.

At the same time, the potential for acceleration has never been greater. AI and automation promise to transform the speed and depth of discovery. But AI will only amplify the mess if its inputs are not reliable, structured, and context-rich.

Science has hit a complexity ceiling. DataJoint exists to break through.

“The common goal … is to accelerate scientific knowledge generation, potentially by orders of magnitude, while achieving greater control and reproducibility in the scientific process.”

Automated Research Workflows for Accelerated Discovery

The Computational Database

Experimental science requires both reproducibility and flexibility.

A computational result that cannot be recreated is not valid science. And reproducing a given result requires linking it with the raw data, metadata, code, parameters, and sequence of transformations that produced it.

But experiments are always changing: code, parameters, algorithms, instruments, processes. Every change threatens to sever one of those critical links. Given the complexity, the conditions for reproducibility are rarely met in studies of any scale.
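To make the linking requirement concrete, here is a minimal sketch in plain Python of a provenance record that ties a result to its raw data, code version, parameters, and upstream steps. It is an illustration only, not DataJoint's implementation: the `Provenance` class, its field names, and the `make_provenance` helper are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    """Everything needed to recreate one result (hypothetical field names)."""
    raw_data_sha256: str   # checksum of the raw input
    code_version: str      # for example, a version-control commit id
    parameters: tuple      # sorted (name, value) pairs
    upstream: tuple = ()   # fingerprints of earlier transformation steps

    def fingerprint(self) -> str:
        # Serialize deterministically, then hash the serialized record.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


def make_provenance(raw_bytes, code_version, parameters, upstream=()):
    """Build a record; parameters are sorted so that dict order does not matter."""
    return Provenance(
        raw_data_sha256=hashlib.sha256(raw_bytes).hexdigest(),
        code_version=code_version,
        parameters=tuple(sorted(parameters.items())),
        upstream=tuple(upstream),
    )


raw = b"example raw recording"  # made-up stand-in for a raw data file
original = make_provenance(raw, "abc123", {"threshold": 0.5, "window": 10})
rerun = make_provenance(raw, "abc123", {"window": 10, "threshold": 0.5})
changed = make_provenance(raw, "abc123", {"threshold": 0.6, "window": 10})

print(original.fingerprint() == rerun.fingerprint())    # True: same inputs
print(original.fingerprint() == changed.fingerprint())  # False: one parameter changed
```

Any change to the data, code version, or parameters yields a different fingerprint, so a stored result can be checked against the conditions that produced it.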

DataJoint’s core innovation is a database model that delivers flexibility without sacrificing integrity and reproducibility.

Our solution, the computational database, is fundamental infrastructure for reproducible science. It unifies every aspect of a study – data, code, and workflows – and manages computation and change. It makes scientific processes flexible, repeatable, and ready for next-generation AI.

Figure: a flow of data tables. Green manual tables (ProcessingTask, Curation) contain user-entered data; brown computed tables (Processing, Segmentation) run code to insert new rows. The arrangement is described as a spreadsheet with entire data tables instead of cells.
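The "spreadsheet with entire data tables instead of cells" idea can be sketched in a few lines of plain Python. This is a toy model, not DataJoint's API: the `ManualTable` and `ComputedTable` classes, the `populate` method as written here, and the placeholder computations are assumptions made for illustration. The table names come from the diagram, the dependency order is assumed, and Curation is omitted for brevity.

```python
class ManualTable:
    """Rows are entered by a person or an acquisition script."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # primary key -> row dict

    def insert(self, key, **attributes):
        if key in self.rows:
            raise KeyError(f"{self.name}: duplicate key {key!r}")
        self.rows[key] = dict(attributes)


class ComputedTable:
    """Rows are produced by running `make` once per upstream row."""

    def __init__(self, name, upstream, make):
        self.name = name
        self.upstream = upstream
        self.make = make  # function: upstream row -> computed row
        self.rows = {}

    def populate(self):
        """Compute only the missing rows and return the keys that were added."""
        new_keys = [key for key in self.upstream.rows if key not in self.rows]
        for key in new_keys:
            self.rows[key] = self.make(self.upstream.rows[key])
        return new_keys


processing_task = ManualTable("ProcessingTask")
processing = ComputedTable(
    "Processing", processing_task,
    make=lambda task: {"n_frames": len(task["frames"])},  # placeholder computation
)
segmentation = ComputedTable(
    "Segmentation", processing,
    make=lambda proc: {"n_masks": proc["n_frames"] // 2},  # placeholder computation
)

processing_task.insert(1, frames=[0.1, 0.4, 0.3, 0.9])
print(processing.populate())    # [1]
print(segmentation.populate())  # [1]

processing_task.insert(2, frames=[0.2, 0.8])
print(processing.populate())    # [2]  only the new row is computed
print(processing.populate())    # []   nothing left to do
```

As in a spreadsheet, downstream values are derived from upstream entries by code rather than typed in by hand; here the unit of recalculation is a table row instead of a cell.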

The SciOps Discipline

The computational database provides the infrastructure. SciOps defines the discipline.

SciOps brings structure and operational rigor to every stage of the research process, using technology-enabled methods that raise a lab’s operational maturity.

SciOps replaces disconnected tools and manual handoffs with a continuously running system for research. It demands an integrated approach to scientific work: modular workflows, automated quality control, versioned code and process, and real-time collaboration around shared pipelines.

DataJoint is helping define the SciOps discipline and co-leads, with the Johns Hopkins Applied Physics Lab, an alliance of academic and industry partners. Email us to learn more about SciOps or the Alliance.

“SciOps is a methodology that unifies experimental design, data collection, processing, analysis, and dissemination into a seamless, repeatable pipeline that enhances efficiency, reproducibility, and scalability in scientific research.”

Erik Johnson et al., SciOps: A New Operational Model for Reproducible Science (2024) (under review by Nature Methods)

Before

Level 1 (Initial): Ad Hoc Processes, DIY Custom Development

Results with

Level 2 (Managed): Established Processes, Repeatable Processes, Role Specialization, Quality Control

Level 3 (Defined): Sharable Processes, Open-Source Ecosystems, FAIR Data, FAIR Workflows

Level 4 (Scalable): Automated Workflows, SciOps Pipeline, Collaborative Environments, Teamflow

Level 5 (Optimizing): Closed-loop Discovery, AI + Human in the Loop

SciOps Core Principles

Modularity

Automation

Transparency

Traceability

Continuous Improvement

The History

Built for scientists. Proven at scale. Open by design.

DataJoint’s story began with a scientist: Dimitri Yatsenko, an expert in data architecture and systems engineering who set aside a successful career to study the brain. The neuroscience lab presented a too-common scene: cutting-edge experiments with fragile workflows, burdensome manual processes, and a lack of rigor. So, he invented a new type of system – a computational database – and released it as an open-source project called DataJoint.

DataJoint quickly gained traction in high-stakes, high-complexity research – such as the landmark MICrONS study recently published in Nature. It has enabled dozens of labs to collaborate, process petabytes of data, and push the limits of what’s scientifically possible.

In 2020, NIH stepped in to amplify DataJoint’s reach, funding our evolution from a DIY system used primarily on big-budget projects with significant engineering capabilities into an accessible commercial platform within reach of every lab.

Today, DataJoint’s operating platform has been adopted by leading labs across systems neuroscience, pathology, and rehabilitation. And while the platform has grown in capability and support, its foundation remains open: DataJoint Python gives labs a common language to describe their data, code, and computational workflows. Anyone can read, understand, and extend your pipeline. And you can take your data with you.