Zipfian Patterns in DNA
Scope & Motivation
Zipf-like patterns appear in many natural systems, but their presence in DNA was mind-blowing to me, even though it's still debated. Some studies claim strongly Zipfian behavior, while others argue that it's a coincidence; the conclusion often depends on how “words” in DNA are defined and at what scale the sequence is analyzed. I’m investigating whether Zipf-type patterns in DNA reflect underlying biological organization or evolutionary constraints, or are just statistical artifacts of long sequences.
My primary goal is not to explain why Zipf patterns might appear in DNA, but to investigate whether the strength of these patterns varies across species, across individuals of the same species, or between coding and non-coding regions.
Current Direction
This is a huge project that will probably take years, if not decades, and I'm currently at a very early stage.
I’m currently working through the existing literature, which varies widely in methodology, especially in how DNA “tokens” are defined. All the research I've encountered so far uses fixed-length k-mers, which makes me wonder how variable-length tokens could be used instead (I already have an idea for a tokenization algorithm that I'll try later).
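To make the "token definition" question concrete, here is a minimal sketch of fixed-length k-mer tokenization. The function name `kmers`, its `step` and `offset` parameters, and the toy sequence are all my own illustrative choices, not taken from any particular paper. It shows that even with k fixed, the same sequence yields different token counts depending on whether the windows overlap.

```python
from collections import Counter

def kmers(seq, k, step=1, offset=0):
    """Yield fixed-length k-mers from seq (hypothetical helper).

    step=1 gives overlapping (sliding-window) tokens; step=k gives
    non-overlapping tokens, with offset choosing the starting frame.
    """
    seq = seq.upper()
    for i in range(offset, len(seq) - k + 1, step):
        yield seq[i:i + k]

seq = "ATGCGATGCA"  # toy sequence, made up for illustration

# Same sequence, same k, two different segmentations.
overlapping = Counter(kmers(seq, 3))
non_overlapping = Counter(kmers(seq, 3, step=3))
```

With overlapping windows every position starts a token, so neighboring tokens share k-1 bases and their counts are not independent. That is one reason the two segmentations can give different rank-frequency curves.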
The immediate focus is on understanding how sensitive Zipf-like behavior is to segmentation choices, gaining more domain knowledge, and attempting to replicate existing results.
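One rough way to probe that sensitivity is to fit a straight line to log(frequency) against log(rank) for several values of k and compare the slopes. The sketch below assumes overlapping k-mers and an ordinary least-squares fit, which is a crude estimator that I chose for simplicity, not the method of any specific study. The function name `zipf_slope` is hypothetical, and the input is a uniformly random placeholder sequence meant only as a null baseline, not real genomic data.

```python
import math
import random
from collections import Counter

def zipf_slope(seq, k):
    """Least-squares slope of log(frequency) vs. log(rank) for overlapping k-mers.

    A slope near -1 would look classically Zipfian; this is a rough
    diagnostic, not a rigorous power-law test.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct k-mers to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Placeholder input: a uniformly random sequence (assumption, not real DNA).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))

# Fitted slope for each token length k.
slopes = {k: zipf_slope(seq, k) for k in (3, 4, 5, 6)}
```

Running the same function on real sequences and comparing against the random baseline gives a first impression of how much of the apparent pattern depends on k alone.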