Tsallis Statistics in AI: One Knob for Heavy Tails, Sparse Attention, and Robust Loss

Ever since I read Tsallis's 1988 paper, which gave me the intuition and motivation for my master's thesis, I have been trying to understand the connection between non-extensive statistical mechanics and the practical problems we face in artificial intelligence research. My first approach was driven by the hypothesis that a generalized approach could be useful for learning on graphs (my research in this direction has produced interesting results, but a fundamental connection is still open). More recently, I have tried to draw a broader connection by organizing a discussion that is scattered across many papers from the artificial intelligence community, papers that apply a Tsallis approach to reinforcement learning, generalizations of softmax, attention architectures, and so on. Out of that effort I built a library called qjax, which exposes some of the main primitives of Tsallis statistics in what I hope is a useful and lightweight Python package.

When I started looking at this more carefully, I realized that modern machine learning rests on a statistical foundation so familiar that we rarely state it out loud. Softmax, cross-entropy, the Gaussian, and the Kullback–Leibler divergence, the everyday building blocks of a deep learning system, all descend from a single source: the Shannon entropy of classical Boltzmann–Gibbs statistics. Every time I reach for any of them, I am implicitly accepting the worldview that comes attached, namely that the systems we model are extensive (the information in independent parts simply adds up) and light-tailed (rare events are exponentially rare).

That worldview is an extraordinarily successful default, but I keep reminding myself that it is still a default. Real data is full of heavy tails, long-range correlations, and structure that the exponential family handles only awkwardly — which is why we keep patching it by hand, with robust losses here, sparse activations there, and outlier-tolerant distributions somewhere else. What struck me is that each patch is invented in isolation, with its own derivation and its own justification.

This is where I think Tsallis statistics offers something more economical: a principled generalization of the classical theory that absorbs all three of those patches into one continuous parameter. It does not discard Shannon — it contains Shannon as a single limiting case and opens a tunable dial in every other direction. In this post I want to explain where that dial comes from, the three places I have found it pays off in AI, and how the qjax library makes it differentiable enough to learn rather than guess.

A short history, and the person behind it

The idea traces to a 1988 paper, Possible Generalization of Boltzmann–Gibbs Statistics, by the physicist Constantino Tsallis. Tsallis was born in Athens in 1943, raised in Argentina, and trained in physics in France, earning his doctorate in Paris before settling in Brazil, where he became a central figure at the Centro Brasileiro de Pesquisas Físicas (CBPF) in Rio de Janeiro. The story of the central formula is now part of physics folklore: Tsallis has said the inspiration came from looking at a multiplicative structure, $p^q$, written on a slip of paper at a conference, which led him to ask what entropy would look like if Boltzmann's logarithm were deformed by a power.

The resulting quantity, now called the Tsallis entropy, was proposed as a foundation for non-extensive statistical mechanics: a framework for systems where the standard assumption that entropy is additive over independent subsystems breaks down. Over the following decades it grew from a speculative generalization into a large research program spanning thousands of papers, with applications to turbulence, high-energy collisions, financial markets, biological systems, and anomalous diffusion — anywhere power-law behavior and long-range interactions dominate. It has not been without controversy among physicists, but its mathematical core is clean and well understood, and it has proven remarkably portable.

That portability is what brings it to machine learning. The same deformation that captures heavy-tailed physics turns out to describe exactly the behaviors deep learning practitioners engineer by hand — and unlike a physical system, a neural network can simply differentiate through the parameter to find the right value.

The entropic index $q$

At the heart of the framework is a generalization of Shannon entropy:

$$S_q(p) = \frac{1 - \sum_i p_i^{\,q}}{q - 1}.$$

The parameter $q$ is the entropic index. It has one essential property: as $q \to 1$, $S_q$ collapses back to ordinary Shannon entropy, $-\sum_i p_i \ln p_i$. So Tsallis statistics doesn't replace the classical theory; it contains it as the special case $q = 1$ and opens a continuous dial in every other direction.
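As a quick numerical check of that limit, here is a minimal sketch in plain Python (standard library only; the function name `tsallis_entropy` is my own for illustration and is not meant to mirror qjax's implementation). It also checks the standard non-additivity rule for independent systems, $S_q(A,B) = S_q(A) + S_q(B) + (1-q)\,S_q(A)\,S_q(B)$, which is where the word "non-extensive" comes from.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1) of a probability vector."""
    if abs(q - 1.0) < 1e-12:
        # q = 1 is a removable singularity: use the Shannon limit directly.
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

p = [0.5, 0.3, 0.2]
shannon = tsallis_entropy(p, 1.0)
near_one = tsallis_entropy(p, 1.0 + 1e-6)   # should be very close to Shannon
print(shannon, near_one)

# Non-additivity for two independent systems A and B at q = 2.
a, b, q = [0.7, 0.3], [0.6, 0.4], 2.0
joint = [ai * bj for ai in a for bj in b]   # product distribution
s_a, s_b = tsallis_entropy(a, q), tsallis_entropy(b, q)
lhs = tsallis_entropy(joint, q)
rhs = s_a + s_b + (1.0 - q) * s_a * s_b
print(lhs, rhs)
```

At $q = 1$ the cross term vanishes and the entropy of the joint system is the plain sum, which is the additive Shannon case.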

The same deformation runs through the whole toolkit. Define the $q$-logarithm and $q$-exponential:

$$\ln_q x = \frac{x^{1-q} - 1}{1 - q}, \qquad \exp_q x = \big[1 + (1-q)\,x\big]_+^{\frac{1}{1-q}},$$

and the rest follows: $q$-deformed cross-entropy, divergence, and the $q$-Gaussian all drop out by swapping $\ln$ for $\ln_q$. Each one recovers its textbook counterpart at $q = 1$.
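The two definitions above can be sketched in a few lines of plain Python. This is an illustration under my own naming (`q_log`, `q_exp` here are local helpers, not the qjax functions), with the $q = 1$ case handled explicitly and the $[\cdot]_+$ clipping made visible.

```python
import math

def q_log(x, q):
    """ln_q(x) = (x^(1-q) - 1) / (1 - q), falling back to ln(x) at q = 1."""
    if abs(1.0 - q) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_exp(x, q):
    """exp_q(x) = [1 + (1-q) x]_+^(1/(1-q)), falling back to exp(x) at q = 1."""
    if abs(1.0 - q) < 1e-12:
        return math.exp(x)
    base = max(1.0 + (1.0 - q) * x, 0.0)    # the [.]_+ clipping
    if base == 0.0:
        # Clipped region: exp_q is 0 for q < 1 and diverges for q > 1.
        return 0.0 if q < 1.0 else math.inf
    return base ** (1.0 / (1.0 - q))

# The two functions are inverses of each other for any fixed q.
round_trip = q_exp(q_log(2.0, 1.5), 1.5)
print(round_trip)

# Close to q = 1 they approach the ordinary log and exp.
print(q_log(2.0, 1.0 + 1e-6), math.log(2.0))
print(q_exp(1.0, 1.0 + 1e-6), math.exp(1.0))
```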

The reason this matters for AI is that the three things $q$ controls (tail weight, sparsity, and loss boundedness) are exactly the three things we keep fighting by hand.

1. Heavy tails: the $q$-Gaussian

The Gaussian is the maximum-entropy distribution under a variance constraint when the entropy is Shannon's. Maximize Tsallis entropy instead and you get the $q$-Gaussian:

$$\mathcal{G}_q(x) \propto \exp_q(-\beta x^2).$$

  • $q = 1$: the ordinary Gaussian.
  • $1 < q < 3$: power-law tails. This is in fact a Student-$t$ distribution with $\nu = (3-q)/(q-1)$ degrees of freedom; the variance is finite for $q < 5/3$ and infinite for $5/3 \le q < 3$.
  • $q < 1$: compact support, tails cut off entirely.

A single parameter slides from light-tailed to heavy-tailed. This is useful anywhere Gaussian assumptions are too optimistic about outliers: exploration noise in reinforcement learning, robust regression, or generative models over data that simply isn't normal. Instead of choosing a distribution family, you choose a point on a continuum.
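To make the three regimes concrete, here is a small sketch in plain Python of the unnormalized density $\exp_q(-\beta x^2)$. The helper names are mine and are not the qjax API. It also checks the Student-$t$ correspondence numerically, using the standard mapping $\nu = (3-q)/(q-1)$ with $\beta = 1/(3-q)$.

```python
import math

def q_gaussian_kernel(x, q, beta=1.0):
    """Unnormalized q-Gaussian exp_q(-beta * x^2)."""
    if abs(1.0 - q) < 1e-12:
        return math.exp(-beta * x * x)
    base = max(1.0 - (1.0 - q) * beta * x * x, 0.0)   # [.]_+ clipping
    return base ** (1.0 / (1.0 - q))

def student_t_kernel(x, nu):
    """Unnormalized Student-t density (1 + x^2 / nu)^(-(nu + 1) / 2)."""
    return (1.0 + x * x / nu) ** (-(nu + 1.0) / 2.0)

x = 4.0
light = q_gaussian_kernel(x, 1.0)     # Gaussian: exponentially small
heavy = q_gaussian_kernel(x, 1.5)     # power-law tail: much larger
compact = q_gaussian_kernel(x, 0.5)   # outside the support: exactly 0
print(light, heavy, compact)

# Student-t correspondence for q = 1.5: nu = 3, beta = 1 / (3 - q).
q = 1.5
nu = (3.0 - q) / (q - 1.0)
same_a = q_gaussian_kernel(2.0, q, beta=1.0 / (3.0 - q))
same_b = student_t_kernel(2.0, nu)
print(same_a, same_b)
```

For $q = 0.5$ and $\beta = 1$ the support ends at $|x| = 1/\sqrt{(1-q)\beta} = \sqrt{2}$, which is why the value at $x = 4$ is exactly zero.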

2. Sparse attention: entmax

Softmax always returns strictly positive weights — every token attends to every other token, if only a little. Often you want real zeros: hard, interpretable sparsity.

Tsallis entropy gives a clean derivation. Define the activation as a regularized argmax over the probability simplex:

$$\mathrm{entmax}_q(z) = \arg\max_{p \in \Delta}\,\big[\,\langle p, z\rangle + S_q(p)\,\big].$$

  • $q = 1$: the regularizer is Shannon entropy, and you recover softmax (dense).
  • $q = 2$: you recover sparsemax, a true Euclidean projection onto the simplex with exact zeros. The match is exact when the regularizer is normalized as $S_q(p)/q$, as in Peters et al.; with $S_q$ exactly as written, the same map is applied to $z/q$.
  • $1 < q < 2$: a smooth interpolation, controllably sparse.

So sparse attention isn't a separate trick bolted on; it's the same softmax you already use, viewed at a different $q$. The entropic index becomes a sparsity knob you can set or learn.
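For intuition, here is a hedged sketch of how such a regularized argmax can be solved numerically in plain Python. It assumes the normalization of Peters et al. (regularizer $S_q(p)/q$), for which the solution has the form $p_i = [(q-1)z_i - \tau]_+^{1/(q-1)}$, and it finds the threshold $\tau$ by bisection. The function name and the bisection approach are my choices for illustration; I am not claiming this is how qjax computes it.

```python
import math

def entmax(z, q, n_iter=60):
    """Sketch of q-entmax for q >= 1 via bisection on the threshold tau."""
    if abs(q - 1.0) < 1e-6:
        # q = 1: ordinary softmax (shifted by the max for stability).
        m = max(z)
        e = [math.exp(zi - m) for zi in z]
        total = sum(e)
        return [ei / total for ei in e]
    scaled = [(q - 1.0) * zi for zi in z]
    power = 1.0 / (q - 1.0)

    def mass(tau):
        # Total probability mass for a given threshold; decreasing in tau.
        return sum(max(s - tau, 0.0) ** power for s in scaled)

    hi = max(scaled)      # mass(hi) = 0 < 1
    lo = hi - 1.0         # mass(lo) >= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    p = [max(s - tau, 0.0) ** power for s in scaled]
    total = sum(p)        # renormalize away the residual bisection error
    return [pi / total for pi in p]

z = [1.0, 0.5, -1.0]
p_soft = entmax(z, 1.0)     # dense: every entry strictly positive
p_mid = entmax(z, 1.5)      # sparse: last entry is exactly 0
p_sparse = entmax(z, 2.0)   # sparsemax: approximately [0.75, 0.25, 0.0]
print(p_soft, p_mid, p_sparse)
```

The entries below the threshold come out as exact zeros rather than small positive numbers, because they are clipped before the power is applied.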

3. Robust loss: bounded cross-entropy

This is the most immediately practical case. Standard cross-entropy is $-\ln p_c$ for the true class $c$. It is unbounded: a confidently mislabeled example ($p_c \to 0$) incurs arbitrarily large loss, so an over-parameterized network is dragged into memorizing the noise.

Swap $\ln$ for $\ln_q$ and you get Tsallis cross-entropy:

$$\mathcal{L}_q(p, c) = -\ln_q p_c = \frac{1 - p_c^{\,1-q}}{1 - q}.$$

For $q < 1$ this loss is bounded above by $1/(1-q)$. Its gradient saturates on points it cannot fit, so the model shrugs off unfittable (likely mislabeled) examples instead of contorting itself to match them. As $q \to 1$ it reduces to ordinary cross-entropy, so you pay nothing in the clean-data limit and gain robustness as labels get noisier.
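The bound and the saturating gradient are easy to see numerically. The sketch below is plain Python with helper names of my own (not qjax functions). The gradient helper assumes $p$ comes from a softmax over logits, in which case the derivative of the loss with respect to the true-class logit is $-p_c^{\,1-q}(1 - p_c)$.

```python
import math

def tsallis_cross_entropy(p_c, q):
    """L_q = -ln_q(p_c) = (1 - p_c^(1-q)) / (1 - q); ordinary -ln(p_c) at q = 1."""
    if abs(1.0 - q) < 1e-12:
        return -math.log(p_c)
    return (1.0 - p_c ** (1.0 - q)) / (1.0 - q)

def grad_wrt_true_logit(p_c, q):
    """dL_q/dz_c when p = softmax(z): -p_c^(1-q) * (1 - p_c)."""
    return -(p_c ** (1.0 - q)) * (1.0 - p_c)

q = 0.5
bound = 1.0 / (1.0 - q)   # upper bound on the loss for q < 1
for p_c in (0.9, 0.1, 1e-3, 1e-9):
    ce = tsallis_cross_entropy(p_c, 1.0)   # grows without limit as p_c -> 0
    tce = tsallis_cross_entropy(p_c, q)    # stays below `bound`
    print(p_c, ce, tce)

# On a hopeless example the Shannon gradient stays near -1,
# while the q < 1 gradient fades towards 0.
g_shannon = grad_wrt_true_logit(1e-9, 1.0)
g_tsallis = grad_wrt_true_logit(1e-9, q)
print(g_shannon, g_tsallis)
```

The choice $q = 0.5$ here is only for illustration; the closer $q$ is to 1, the larger the bound and the closer the behavior is to ordinary cross-entropy.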

In practice, on a small classifier trained with label noise ranging from 0% to 40%, $q = 1$ and $q < 1$ are indistinguishable when labels are clean (roughly 98–99% accuracy). As noise grows, the Shannon baseline carves spurious wrong-class regions while the Tsallis loss keeps the decision boundary clean.

$q$ is a parameter, not just a hyperparameter

Here is the part that makes Tsallis statistics genuinely interesting for deep learning rather than just a statistical curiosity.

Every construction above is a closed form in $q$ that is finite and differentiable everywhere, including at the removable singularity $q = 1$. That means $q$ is not a discrete setting you grid-search. It is an ordinary differentiable argument. You can put it next to your weights and learn it by gradient descent:

$$\frac{\partial \mathcal{L}}{\partial q} \quad \text{is well-defined.}$$

The right amount of non-extensivity — how heavy the tails should be, how sparse the attention, how bounded the loss — can be discovered from data rather than guessed. Non-extensivity stops being a modeling assumption and becomes something the model infers.
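To see why the derivative with respect to $q$ survives at $q = 1$, write $\ln_q x = \big(e^{t \ln x} - 1\big)/t$ with $t = 1 - q$. Expanding in $t$ gives $\ln x + \tfrac{t}{2}(\ln x)^2 + \dots$, so for the Tsallis cross-entropy $\partial \mathcal{L}_q / \partial q$ at $q = 1$ equals $\tfrac{1}{2}(\ln p_c)^2$, a finite number. The sketch below checks this in plain Python, using a central finite difference where a JAX program would use automatic differentiation. The `expm1` formulation and the helper names are my own choices, not a description of qjax internals.

```python
import math

def q_log_stable(x, q):
    """ln_q(x) as expm1(t * ln x) / t with t = 1 - q; accurate near q = 1."""
    t = 1.0 - q
    u = math.log(x)
    if t == 0.0:
        return u          # the q = 1 limit
    return math.expm1(t * u) / t

def loss(p_c, q):
    """Tsallis cross-entropy -ln_q(p_c)."""
    return -q_log_stable(p_c, q)

def dloss_dq(p_c, q, h=1e-4):
    """Central finite-difference estimate of the derivative of the loss in q."""
    return (loss(p_c, q + h) - loss(p_c, q - h)) / (2.0 * h)

p_c = 0.1
numeric = dloss_dq(p_c, 1.0)            # evaluated right at the singular point
analytic = 0.5 * math.log(p_c) ** 2     # (ln p_c)^2 / 2 from the expansion
print(numeric, analytic)
```

The sign is positive, which matches the earlier picture: lowering $q$ below 1 lowers the loss on poorly fit examples.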

The library: qjax

qjax packages all of this as pure, differentiable, jit/vmap-friendly JAX functions. Because $q$ is just another argument, you can hold it fixed or learn it end-to-end.

```python
import jax, jax.numpy as jnp
import qjax

# q-deformed functions (recover log / exp as q -> 1)
qjax.q_log(2.0, q=1.5)
qjax.q_exp(1.0, q=1.5)

# Tsallis information measures
p = jnp.array([0.5, 0.3, 0.2])
qjax.tsallis_entropy(p, q=2.0)          # -> Shannon entropy as q -> 1
qjax.tsallis_divergence(p, p, q=2.0)    # -> KL divergence as q -> 1

# q-Gaussian: heavy-tailed for 1 < q < 3
x = jnp.linspace(-4, 4, 100)
qjax.q_gaussian_pdf(x, q=1.5, beta=1.0)
samples = qjax.sample(jax.random.PRNGKey(0), q=1.5, beta=1.0, shape=(1000,))

# Sparse softmax: q=1 -> softmax, q=2 -> sparsemax (exact zeros)
qjax.tsallis_entmax(jnp.array([2.0, 1.0, -1.0]), q=2.0)

# A learnable q
nll = lambda q: -jnp.mean(qjax.q_gaussian_logpdf(x, q, 1.0))
grad_q = jax.grad(nll)(1.5)             # gradient w.r.t. the entropic index
```

Every primitive is a single closed form in $q$ and recovers its Boltzmann–Gibbs–Shannon counterpart in the $q \to 1$ limit, with numerics tested across that limit, gradients, and jit/vmap.

Install it with:

```shell
uv add qjax
```

qjax is MIT-licensed and built on JAX. If you've ever reached for sparsemax, a robust loss, or heavy-tailed noise as separate tools, it's worth seeing them as three settings of the same dial.

Where I am, and where this goes

When I look back at the path from that 1988 paper to this small library, what stays with me is how a single idea, deforming the logarithm by a power, keeps reappearing under different names across fields that rarely talk to each other. I did not set out to unify anything; I just kept noticing the same $q$ showing up in places I thought were unrelated, and qjax is my attempt to put those observations in one place where I (and hopefully others) can actually play with them.

I want to be honest that this is a beginning, not a conclusion. The fundamental connection I chased on graphs is still open, and I suspect the deepest payoff of treating $q$ as a learnable parameter is something I have only glimpsed: what does it mean when a network settles on a particular non-extensivity for a given dataset? I do not have a clean answer yet, and I find that exciting rather than discouraging.

So if any of this resonates, I would love for you to try qjax, break it, and tell me where the abstraction leaks. The whole reason I built it was to make these ideas cheap enough to experiment with — and the best outcome I can imagine is someone finding the connection I am still looking for.

References

  1. C. Tsallis. Possible Generalization of Boltzmann–Gibbs Statistics. Journal of Statistical Physics, 52(1–2):479–487, 1988. doi:10.1007/BF01016429
  2. C. Tsallis. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer, 2009.
  3. A. F. T. Martins and R. F. Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML, 2016. PMLR v48
  4. B. Peters, V. Niculae, and A. F. T. Martins. Sparse Sequence-to-Sequence Models. ACL, 2019. arXiv:1905.05702
  5. Z. Zhang and M. R. Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. NeurIPS, 2018. arXiv:1805.07836
  6. JAX: composable transformations of Python+NumPy programs. github.com/jax-ml/jax