- Discoverability follows a zeta-like scaling curve
- Early modes hold the strongest signal
- Better encoders can move signal forward
- Extra modalities can improve sample efficiency
- Model winners can flip as data grows
If your health data feels like a mountain, this paper offers a rule for knowing when more of it will actually help. The authors argue that discoverability — the chance of finding a real signal in noisy biomedical data — does not rise smoothly with sample size. Instead, it follows a zeta-like scaling law, shaped by how data spectra decay and how well signals line up across modalities. In their framework, common metrics such as AUC can be written as an effective signal-to-noise quantity that builds across spectral modes of an encoder and a cross-modal operator. That matters because the model’s representation can change the curve: sparse models, low-rank embeddings, and multimodal contrastive objectives can push useful signal into earlier, more stable modes and improve sample efficiency. The paper also predicts cross-over behavior, where simpler models win when data are scarce, but higher-capacity or multimodal encoders take over after enough data stabilizes extra degrees of freedom. The authors say this could help anticipate when scaling data, improving representations, or adding modalities like text is most likely to accelerate discovery in areas such as multimodal disease classification, imaging genetics, functional MRI, and topological data analysis.
Biomedical datasets now reach millions of people, and models can hold billions of adjustable settings, called parameters. That size brings a cruel question. When does one more pile of data unlock a real discovery? When does it just add noise? If you work with health data, you know the pain. A bigger file is not always a better answer. The framework says the rise in skill does not have to be smooth. It can follow a zeta-like curve. The Riemann zeta function adds up terms of the form 1/n^s, so it is the natural shape for anything built from summed power-law pieces. The surprise is simple. The gain depends not just on size, but on how well the clues line up.
How a model finds the right clues
The framework treats AUC, short for area under the ROC curve, as built-up signal compared with noise. That signal builds across ranked modes, the model's strongest-to-weakest channels for carrying clues. Under mild assumptions, the build-up follows a zeta-like law. That law comes from two decay patterns. One is the data's covariance spectrum. That is just how fast the data's directions fade in strength. The other is the task signal that lines up with those directions. When both fade in power-law form, a few strong parts lead and a long tail follows. The curve then bends the way the Riemann zeta function suggests. The result is a shape that can bend, slow, or cross over as sample size grows. That helps explain why gains can look uneven across tasks, data types, and model choices.
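A toy sketch in Python can make the build-up concrete. It is not the paper's formula. It assumes each ranked mode k adds a power-law amount, scale * k^-(a + b), to a squared signal-to-noise value, where a and b stand in for the two decay patterns. It then turns that value into AUC with the standard result for two equal-variance Gaussian classes. The function names, exponents, and scale are all made up for illustration.

```python
import math


def partial_zeta(exponent: float, num_modes: int) -> float:
    """Truncated zeta series: sum of k**(-exponent) for k = 1..num_modes."""
    return sum(k ** -exponent for k in range(1, num_modes + 1))


def auc_from_snr_squared(snr_squared: float) -> float:
    """AUC for two equal-variance Gaussian classes whose means sit
    d' = sqrt(snr_squared) standard deviations apart: AUC = Phi(d' / sqrt(2))."""
    d_prime = math.sqrt(snr_squared)
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), so Phi(d' / sqrt(2)) uses erf(d' / 2).
    return 0.5 * (1.0 + math.erf(d_prime / 2.0))


def toy_auc(num_modes: int, spectrum_exponent: float = 1.0,
            alignment_exponent: float = 0.5, scale: float = 0.5) -> float:
    """Hypothetical model: mode k adds scale * k**-(a + b) to the squared SNR."""
    combined = spectrum_exponent + alignment_exponent
    return auc_from_snr_squared(scale * partial_zeta(combined, num_modes))


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the paper.
    for modes in (1, 3, 10, 100, 1000):
        print(modes, round(toy_auc(modes), 3))
```

In this sketch the first few modes do most of the work. Each later mode still helps, but by less and less, which is the bending shape the zeta-like law describes.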
Why representation learning changes the curve
An encoder, the part of a model that turns raw data into usable numbers, can change where the signal lands. Sparse models keep only a few useful parts. Low-rank embeddings, or compact number lists, squeeze data into a smaller set of hidden directions. Multimodal training uses more than one data type. Contrastive learning pulls matching items together and pushes mismatched ones apart. Together, these methods move useful signal into earlier, more stable modes. That makes the drop from strong modes to weak ones steeper. In plain terms, the model wastes less effort on weak clues. It spends more of its capacity on the signal that arrives early and cleanly. The same setup also helps explain why adding text or language embeddings may help, even when they repeat some information.
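One way to picture "moving signal forward" is to ask what share of the total signal sits in the first few modes. The Python sketch below does that for a plain power-law decay. It is an illustration under stated assumptions, not the authors' method: the name `early_share`, the labels, and both exponents are hypothetical, and a steeper exponent simply stands in for a better encoder.

```python
def early_share(decay_exponent: float, early_modes: int = 5,
                total_modes: int = 1000) -> float:
    """Fraction of the total power-law signal carried by the first few modes."""
    weights = [k ** -decay_exponent for k in range(1, total_modes + 1)]
    return sum(weights[:early_modes]) / sum(weights)


if __name__ == "__main__":
    # Exponents are illustrative, not values from the paper.
    for label, exponent in (("shallow decay", 1.1), ("steeper decay", 2.0)):
        print(label, round(early_share(exponent), 3))
```

With the steeper exponent, a larger share of the signal lands in the first five modes. That is the sense in which a model "wastes less effort on weak clues".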
Four areas are named in the abstract:
- Multimodal disease classification, which mixes more than one data type, is one target area.
- Imaging genetics, which links scans and genes, is another.
- Functional MRI, a brain scan that tracks activity, is a third.
- Topological data analysis, a way to study shapes in data, is the fourth.
“The same questions keep coming up: how much data do we need to make reliable discoveries?”
Why it matters for real study design
For a lab, this curve is a planning tool. It says not all extra data is equal. The key is whether new samples sharpen the early modes or only feed the weak tail. That makes model choice part of sample planning. Sparse models, low-rank embeddings, and multimodal training can move signal forward. So a smaller dataset can act larger than it looks. The framework also gives a reason to add another modality, even when it partly repeats what you already have. If it steepens the shared decay, discoverability can rise faster. That is useful for disease typing, brain scans, and gene-image links.
What the cross-over point means
The surprise now has a practical edge. Bigger data can change who wins. A simple model may look best at first. A richer multimodal encoder may pull ahead later. That means a lab should not freeze model choice too early. It should watch for the cross-over point where the curve flips. If the zeta law holds, better data plans can save time and avoid wasted compute. They can also stop a small-study result from being mistaken for the final word.
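The flip can be shown with a toy comparison in Python. This is a sketch under loud assumptions, not the paper's model. It supposes a model resolves one more mode for every `samples_per_mode` samples, up to a `capacity` limit, and that mode k adds a power-law amount to the squared signal-to-noise value. The simple model here learns its few modes quickly. The rich model needs more data per mode but can keep going. Every name and number is hypothetical.

```python
import math


def toy_auc_at(n_samples: int, capacity: int, samples_per_mode: int,
               decay_exponent: float = 1.5, scale: float = 0.5) -> float:
    """Toy AUC: one more mode is resolved per `samples_per_mode` samples,
    up to `capacity` modes; mode k adds scale * k**-decay_exponent."""
    resolved = min(capacity, n_samples // samples_per_mode)
    snr_squared = scale * sum(k ** -decay_exponent for k in range(1, resolved + 1))
    # Equal-variance Gaussian classes: AUC = Phi(d' / sqrt(2)) = 0.5 * (1 + erf(d' / 2)).
    return 0.5 * (1.0 + math.erf(math.sqrt(snr_squared) / 2.0))


def first_crossover(sample_sizes, simple, rich):
    """Return the first sample size where the rich model beats the simple one."""
    for n in sample_sizes:
        if toy_auc_at(n, **rich) > toy_auc_at(n, **simple):
            return n
    return None


# Made-up settings: the simple model saturates at 5 modes, the rich one at 200.
SIMPLE = {"capacity": 5, "samples_per_mode": 10}
RICH = {"capacity": 200, "samples_per_mode": 100}

if __name__ == "__main__":
    print(first_crossover(range(50, 5001, 50), SIMPLE, RICH))  # prints 600
```

With these made-up settings the simple model leads until 600 samples, where the rich model resolves its sixth mode and moves ahead. A real cross-over point would depend on the data and the encoders, so the number itself means nothing outside the toy.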
