The Zipf Mystery


Hey, Vsauce. Michael here. About 6 percent of
everything you say and read and write is the word “the,” the most used word in the
English language. About one out of every 16 words we encounter on a daily basis
is “the.” The top 20 most common English words in order are “the,” “of,” “and,” “to,” “a,” “in,” “is,” “I,” “that,” “it,” “for,” “you,” “was,” “with,” “on,” “as,” “have,” “but,” “be,” “they.”
That’s a fun fact. A piece of trivia but it’s also more. You see, whether the most
commonly used words are ranked across an entire language, or in just one book or
article, almost every time a bizarre pattern emerges. The second most used
word will appear about half as often as the most used. The third one third as
often. The fourth one fourth as often. The fifth one fifth as often. The sixth one sixth
as often, and so on all the way down. Seriously. For some reason, the number of
times a word is used is just proportional to one over its rank. Word
frequency and ranking on a log log graph follow a nice straight line. A power-law.
This phenomenon is called Zipf’s Law and it doesn’t only apply to English. It also
applies to other languages, like, well, all of them. Even ancient languages we haven’t been
able to translate yet. And here’s the thing. We have no idea why.
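
The straight line on a log-log graph can be checked numerically. Below is a minimal sketch, not a method from the video: it ranks words by count and fits a least-squares slope to log(frequency) against log(rank). The function names `rank_frequency` and `log_log_slope` are made up for illustration, and the `ideal` counts are synthetic, built to follow one-over-rank exactly.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return word counts sorted from most to least frequent."""
    return [count for _, count in Counter(words).most_common()]

def log_log_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic, idealized counts: rank r appears exactly 1/r as often as rank 1.
ideal = [1_000_000 / rank for rank in range(1, 1001)]
slope = log_log_slope(ideal)
print(round(slope, 3))  # a slope near -1 is the signature of Zipf's law
```

To try it on real text, pass `rank_frequency(text.lower().split())` to `log_log_slope`. Real corpora tend to land near -1 rather than exactly on it.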
It’s surprising that something as complex as reality should be conveyed by
something as creative as language in such a predictable way. How predictable?
Well, watch this. According to WordCount.org, which ranks words as found in the
British National Corpus, “sauce” is the 5,555th most common English word.
Now, here is a list of how many times every word on Wikipedia and in the
entire Gutenberg Corpus of tens of thousands of public domain books shows
up. The most used word, ‘the,’ shows up about 181 million times. Knowing these two
things, we can estimate that the word “sauce” should appear about thirty
thousand times on Wikipedia and Gutenberg combined.
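
The arithmetic behind that estimate is a single division. This sketch assumes the idealized one-over-rank rule and uses only the two figures quoted above; the function name `zipf_estimate` is hypothetical.

```python
def zipf_estimate(top_count, rank):
    """Expected count of the word at `rank` under an ideal 1/rank law."""
    return top_count / rank

THE_COUNT = 181_000_000  # approximate count of "the" quoted above
SAUCE_RANK = 5_555       # rank of "sauce" quoted above

estimate = zipf_estimate(THE_COUNT, SAUCE_RANK)
print(f"{estimate:,.0f}")  # on the order of thirty thousand
```

Note that the rank and the count come from different corpora, as they do in the estimate above, so this is a rough check rather than a precise prediction.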
And it pretty much does. What gives? The world is chaotic. Things are
distributed in a myriad of ways, not just power laws. And language is personal, intentional, idiosyncratic. What about the
world and ourselves could cause such complex activities and behaviors to
follow such a basic rule? We literally don’t know. More than a century of
research has yet to close the case. Moreover, Zipf’s law doesn’t just
mysteriously describe word use. It’s also found in city populations, solar
flare intensities, protein sequences and immune receptors, the amount of traffic
websites get, earthquake magnitudes, the number of times academic papers are
cited, last names, the firing patterns of neural networks, ingredients used in
cookbooks, the number of phone calls people receive, the diameter of Moon
craters, the number of people that die in wars, the popularity of opening chess
moves, even the rate at which we forget. There are plenty of theories about why
language is ‘zipf-y,’ but no firm conclusions and this video doesn’t contain a
definite explanation either. Sorry, I know that’s a bummer, since we appear to like
knowing more than mystery. But that said, we also ask more than we answer. So
let’s dive into Zipf’s ramifications, some related patterns, some possible
explanations and the depth of the mystery itself.
Zipf’s law was popularized by George Zipf, a linguist at Harvard University. It is a
discrete form of the continuous Pareto distribution from which we get the
Pareto Principle. Because so many real-world processes behave this way,
the Pareto Principle tells us that, as a rule of thumb, it’s worth assuming that 20% of
the causes are responsible for 80% of the outcome, like in language, where the most
frequently used 18 percent of words account for over 80% of word occurrences.
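
To see how a one-over-rank law produces this kind of imbalance, here is a rough sketch that sums idealized Zipf weights. The vocabulary size of 100,000 is an assumption chosen for illustration, not a figure from the video, and `zipf_share` is a hypothetical helper.

```python
def zipf_share(top_fraction, vocabulary_size):
    """Share of all occurrences taken by the top fraction of ranked words,
    assuming each word's frequency is exactly proportional to 1/rank."""
    cutoff = round(top_fraction * vocabulary_size)
    weights = [1 / rank for rank in range(1, vocabulary_size + 1)]
    return sum(weights[:cutoff]) / sum(weights)

# Assumed vocabulary size; real corpora differ and only roughly follow 1/rank.
share = zipf_share(0.18, 100_000)
print(f"{share:.0%}")
```

Under these assumptions the top 18 percent of words cover well over 80 percent of occurrences. The exact figure shifts with the assumed vocabulary size.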
In 1896, Vilfredo Pareto showed that approximately 80% of the land in Italy
was owned by just twenty percent of the population. It is said that he later
noticed in his garden 20 percent of his pea pods contained eighty percent of the
peas. He and other researchers looked at other datasets and found that this 80-20
imbalance comes up a lot in the world. The richest 20% of humans have 82.7% of
the world’s income. In the US, 20% of patients use eighty percent of health
care resources. In 2002, Microsoft reported that 80% of the errors and
crashes in Windows and Office are caused by 20% of the bugs detected. A common
rule of thumb in the business world states that 20% of your customers are
responsible for 80% of your profits and eighty percent of the complaints you
receive will come from 20% of your customers. A book titled “The 80/20 Principle”
even says that in a home or office, 20% of the carpet receives 80 percent of
the wear. Oh, and as Woody Allen famously said, “eighty percent of success is just
showing up.” The Pareto Principle is everywhere, which is good. By focusing on just 20 percent of what’s
wrong, you can often expect to solve eighty percent of the problems. A variety
of different unrelated factors cause this to be true from case to case, but if
we can get to the bottom of what causes some of them, maybe we’ll find that one or more of
those mechanisms is responsible for Zipf’s law in language. George Zipf
himself thought language’s interesting rank-frequency distribution was a consequence
of the Principle of Least Effort. The tendency for life and things to follow
the path of least resistance. Zipf believed it drove much of human behavior and
hypothesized that as language developed in our species, speakers naturally
preferred drawing from as few words as possible to get their thoughts out there.
It was easier. But in order to understand what was being said, listeners preferred larger vocabularies
that gave more specificity, so that they had to do less work. The compromise
between listening and speaking, Zipf felt, led to the current state of language.
A few words are used often and many many many words are used rarely. Recent papers have suggested that having
a few short, often used, predictable words helps dissipate information load density
on listeners, spacing out important vocab so that the information rate is more
constant. This makes sense and much has been learned by applying the least
effort principle to other behaviors, but later researchers argued that for
language, the explanation was even simpler. Just a few years after Zipf’s
seminal paper, Benoit Mandelbrot showed that there may be nothing mysterious
about Zipf’s law at all, because even if you just randomly type on a keyboard you
will produce words distributed according to Zipf’s law. It’s a pretty cool point and
this is why it happens. There are exponentially more different long words
than short words. For instance, the English alphabet can be used to make 26 one
letter words, but 26 squared 2 letter words. Also, in random typing, whenever the
space bar is pressed a word terminates. Since there’s always a certain chance that
the space bar will be pressed, longer stretches of time before it happens are exponentially less likely than
shorter ones. The combination of these exponentials is pretty ‘Zipf-y.’
For example, if all 26 letters and the spacebar are equally likely to be typed,
after a letter is typed and a word has begun, the probability that the next
input will be a space, thus creating a one letter word, is just one in 27.
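
The arithmetic of this random-typing model can be sketched in a few lines. It assumes, as above, 26 letters and a space bar that are all equally likely; the function names are made up for illustration, and `rank_range` additionally assumes that shorter words always outrank longer ones.

```python
ALPHABET = 26          # letters on the imaginary keyboard
KEYS = ALPHABET + 1    # plus the space bar, all keys equally likely

def length_probability(n):
    """Chance that a word, once begun, ends up exactly n letters long."""
    # n - 1 further letters (26 in 27 each), then one space (1 in 27).
    return (ALPHABET / KEYS) ** (n - 1) * (1 / KEYS)

def word_probability(n):
    """Chance of one particular n-letter word; all ALPHABET**n are equally likely."""
    return length_probability(n) / ALPHABET ** n

def rank_range(n):
    """First and last rank occupied by n-letter words on a most-used list."""
    first = sum(ALPHABET ** k for k in range(1, n)) + 1
    return first, first + ALPHABET ** n - 1

for n in range(1, 7):
    first, last = rank_range(n)
    print(f"{n} letters: ranks {first}-{last}, each word {word_probability(n):.3e}")
```

Each extra letter multiplies the number of possible words by 26 while dividing the chance of any one of them by 27, which is what produces the staircase of frequencies across ranks.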
And sure enough, if you randomly generate characters or hire a proverbial typing
monkey, about one out of every 27, or 3.7 percent, of the stuff between spaces
will be single letters. Two-letter words appear when, after beginning a word, any
character but the space bar is hit – a 26 in 27 chance – and then the space bar. The probability of a three-letter word is that
of a letter, another letter and then a space. If we divide by the number of
unique words of each length there can be, we get the frequency of occurrence
expected for any particular word given its length. For example, the letter V will
make up about 0.142 percent of random typing. The word “Vsauce,”
about 0.00000000993 percent. Longer words are less likely, but watch this. Let’s spread
these frequencies out according to the ranks they’d take up on a most often
used list. There are 26 possible one-letter words, so each of the top 26
ranked words is expected to occur about this often. The next 676 ranks will be taken up by two-letter words that show up about
this often. If we extend each frequency according to how many members it has,
we get Zipf. Subsequent researchers have detailed how changing up the initial
conditions can smooth the steps out. Our mysterious distribution has been created
out of nothing but the inevitabilities of math. So maybe there is no mystery. Maybe words
are just the result of humans randomly segmenting the observable world and the
mental world into labels and Zipf’s law describes what naturally happens when
you do that. Case closed. And as always, thanks for… wait a minute! Actual language is very different from
random typing. Communication is deterministic to a certain extent.
Utterances and topics arrive based on what was said before. And the vocabulary
we have to work with certainly isn’t the result of purely random naming.
For example, the monkey typing model can’t explain why even the names of the
elements, the planets and the days of the week are used in language according to
Zipf’s law. Sets like these are constrained by the natural world and they’re not the
result of us randomly segmenting the world into labels. Furthermore, when given
a list of novel words, words they’ve never heard or used before, like when
prompted to write a story about alien creatures with strange names, people will
naturally tend to use the name of one alien twice as often as another, three
times as often as another… Zipf’s law appears to be built into our brains. Perhaps there
is something about the way thoughts and topics of discussion ebb and flow that
contributes to Zipf’s law. Another way ‘Zipf-ian’ distributions
occur is via processes that change according to how they’ve previously
operated. These are called preferential attachment processes.
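
A process like this is easy to simulate. The sketch below is one possible model, not taken from the video: items are pictured as paper clips joined into chains, two clips are picked uniformly at random, and their chains are linked. Because a bigger chain holds more clips, it is proportionally more likely to be picked. The function name, the clip and link counts, and the seed are all arbitrary choices for illustration.

```python
import random
from collections import Counter

def link_random_clips(num_clips, num_links, seed=0):
    """Repeatedly pick two clips at random and join their chains.
    Returns the chain sizes, largest first."""
    rng = random.Random(seed)
    parent = list(range(num_clips))  # each clip starts as its own chain

    def find(clip):
        # Follow parent links to the chain's representative clip.
        while parent[clip] != clip:
            parent[clip] = parent[parent[clip]]  # shorten the path as we go
            clip = parent[clip]
        return clip

    for _ in range(num_links):
        a, b = rng.sample(range(num_clips), 2)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:       # linking within one chain changes no sizes
            parent[root_b] = root_a

    sizes = Counter(find(clip) for clip in range(num_clips))
    return sorted(sizes.values(), reverse=True)

chains = link_random_clips(num_clips=10_000, num_links=5_000)
top_fifth = chains[: max(1, len(chains) // 5)]
print(len(chains), chains[:5], round(sum(top_fifth) / sum(chains), 2))
```

The printout shows how many chains exist, the five longest, and the share of all clips held by the longest fifth of chains. Running it with more links makes the largest chain swallow an ever bigger share.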
They occur when something – money, views, attention, variation, friends, jobs,
anything really – is given out according to how much is already possessed.
To go back to the carpet example, if most people walk from the living room to the
kitchen across a certain path, furniture will be placed elsewhere, making that
path even more popular. The more views a video or image or post has,
the more likely it is to get recommended automatically or make the news for
having so many views, both of which give it more views. It’s like a snowball rolling down a
snowy hill. The more snow it accumulates, the bigger its surface area becomes for
collecting more and the faster it grows. There doesn’t have to be a deliberate
choice driving a preferential attachment process. It can happen naturally. Try this.
Take a bunch of paper clips and grab any two at random. Link them together and then throw them
back in the pile. Now, repeat over and over again. If you grab paper clips that
are already part of a chain, link ’em anyway. More often than not after a while
you will have a distribution that looks ‘Zipf-ian.’ A small number of chains
contain a disproportionate amount of the total paperclip count. This is simply
because the longer a chain gets, the greater proportion of the whole it
contains, which gives it a better chance of being picked up in the future and
consequently made even longer. The rich get richer, the big get bigger,
the popular get popular-er. It’s just math. Perhaps language’s Zipf mystery is, if not
caused by it, at least strengthened by preferential attachment. Once a word is
used, it’s more likely to be used again soon. Critical points may play a role as well. Writing and conversation often stick to a
topic until a critical point is reached and the subject is changed and
the vocabulary shifts. Processes like these are known to result in power laws. So, in
the end, it seems tenable that all these mechanisms might collude to make Zipf’s
law the most natural way for language to be. Perhaps some of our vocabulary and
grammar were developed randomly, according to Mandelbrot’s theory. And the natural
way conversation and discussion follow preferential attachment and criticality,
coupled with the principle of least effort when speaking and listening, are
all responsible for the relationship between word rank and frequency. It’s a shame that the answer isn’t
simpler, but it’s fascinating because of the consequences it has on what
communication is made of. Roughly speaking, and this is mind blowing, nearly
half of any book, conversation or article will be nothing but the same 50 to 100
words. And nearly half of the distinct words used will appear in that selection only
once. That’s not so surprising when you consider the fact that one word accounts
for 6 percent of what we say. The top 25 most used words make up about a third of
everything we say and the top 100 about half. Seriously. I mean, whether it’s all the
words in “Wet Hot American Summer,” or all the words in Plato’s “Complete Works” or
in the complete works of Edgar Allan Poe or the Bible itself, only about 100 words
are used for nearly half of everything written or said. In Alice’s Adventures in
Wonderland 44% and in Tom Sawyer 49.8% of the unique words used appear only
once in the book. A word that is used only once in a given selection of words
is called a ‘hapax legomenon.’ Hapax legomena are vitally important to
understanding languages. If a word has only been found once in the entire known
collection of an ancient language, it can be very difficult to figure out what it
means. Now, there is no corpus of everything ever said or written in
English, but there are very very large collections and it’s fun to find hapax legomena in them.
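
Finding them in a text of your own takes only a word count. This is a minimal sketch with a deliberately crude tokenizer (lowercase runs of letters and apostrophes), which is an assumption and not how any of the corpora mentioned here split words; `hapax_legomena` is a hypothetical helper.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words used exactly once, and their share of distinct words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    share = len(hapaxes) / len(counts) if counts else 0.0
    return hapaxes, share

sample = "the cat saw the dog and the dog saw a bird"
hapaxes, share = hapax_legomena(sample)
print(hapaxes, f"{share:.0%} of distinct words")
```

The share is measured against distinct words, not total word occurrences, which matches the way the Alice and Tom Sawyer figures above are stated.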
For instance, and this probably won’t be the case after I
mention it, but the word “quizzaciously” is in the Oxford English Dictionary, but
appears nowhere on Wikipedia or in the Gutenberg corpus or in the British
National Corpus or the American National Corpus, but it does appear when searched in
just one result on Google. Fittingly, in a book titled “ElderSpeak” that lists it
as a ‘rare word.’ Quizzaciously, by the way, means “in a mocking manner,” as in
“The parodist rattled off quizzaciously, ‘Hey, Vsauce. Michael here. But who is Michael
and how much does here weigh?’” It’s a little sad that quizzaciously
has been used so infrequently. It’s a fun word, but that’s the way things go in
a ‘Zipf-ian’ system. Some things get all the love, some get little. Most of what you
experience on a day-to-day basis is forgotten, forgettable. The Dictionary of Obscure Sorrows, as it often does, has a word for this – Olēka – the awareness of how
few days are memorable. I’ve been alive for almost 11,000 days
but I couldn’t tell you something about each one of them. I mean, not even close. Most of what we do and see and think and
say and hear and feel is forgotten at a rate quite similar to Zipf’s law,
which makes sense. If a number of factors naturally selected for thinking and
talking about the world with tools in a ‘Zipf-ian’ way, it makes sense we’d
remember it that way too. Some things really well, most things hardly at all.
But it bums me out sometimes because it means that so much is forgotten,
even things that at the time you thought you could never forget. My locker number – senior year – its combination, the jokes
I liked when I saw a comedian on stage, the names of people I saw every day 10
years ago. So many memories are gone. When I look at all the books I’ve read and
realize that I can’t remember every detail from them, it’s a little
disappointing. I mean, why even bother if the Pareto Principle dictates that my
‘Zipf-ian’ mind will consciously remember pretty much only the titles and a few
basic reactions years later? Ralph Waldo Emerson makes me feel better.
He once said, “I cannot remember the books I’ve read any more than the meals I have
eaten. Even so, they have made me.” And as always, thanks for watching.

100 thoughts on “The Zipf Mystery”

  1. Have you noticed that the Zipf law looks to climb up exponentially? Perhaps there is a relation between Zipfs Law and The Law of Accellerating Returns.

  2. Who turned their phone back upright during the first "Thanks for watching" then was like wAiT tHeRe'S mOrE?

  3. Time to make sauce the most common word

    Sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce sauce

  4. If 80% of the things wrong on Wikipedia can be found on 20% of the pages, which pages are responsible for so much misinformation?

  5. 18:50 Memory is more than what you can recall at will (not that such a thing even really exists). Recognition is much more durable.

  6. Hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong hoongingging gongong

  7. Simulation theory? The theory that everything is written in time since the Big Bang and free will doesn’t exist?

  8. the greek name for when a word is used just once across a particular set of works etc. is known as Hapax legomenon

  9. 4 years later, Google now lists 7,550 results for "quizzaciously." Michael Stevens has incredibly quantifiably changed the path of word use in the English language. I wonder how much more this word will be used in another 4 years.

  10. Vsauce is smarter than more than half of native English speakers. He literally uses the word 'literally' correctly.

  11. This is the matrix. Reality is predifine. Self awarness is programmed. Try to scap it. see what happens.

  12. I decided to check google trends, and quizzaciously was zero from 2004 (earliest time they had data on it) to August 2015. In September 2015 the word at a 100.

  13. Literally every Vsauce video in a nutshell: Hey Vsauce, Micheal here! The sky exist, or does it? But first let's question your whole Life

  14. the be to of and a in that have i it for not on with he as you do at this but his by from they we say her she or an will my one all would there their what so up out if about who get which go me when make can like time no just him know take people into year your good some could them see other than then now look only come its over think also back after use two how our work first well way even new want because any these give day most us

  15. "Talk:quizzaciously

    A famous YouTuber humorously coined this word in a video so it's all over the Internet, but real usage is rare or non-existent. The adjective quizzacious itself is a single writer's nonce word."

  16. And some people still doubt the Bible wisdom: "For whoever has, to him more will be given, and he will have abundance" Mathew 13:12

  17. well a graph can show a myriad of things in the same way, it is odd but the 80 20 imbalance is weird i can't explain it

  18. lol, bussssssssssssssssssted! Time index 1:11 shows a "ranking" that does not match the list at time index 2:36, or the list at time index 16:15. This is the issue with what is called data mining. There are seemigly patterns because of how the data is shifted around, yet the conclusions are without support.
    1:11— the of and to a in is i that it for you was with on as have but be they
    2:36— the of and to in a was is that he for as it
    16:15– the be to of and a in that have I it for not on with
    How "of" is shifted to make the pattern work in the third; or how "be" is near the end in the first, missing in the second, and second in the third; or "a" is fifth in all three. Just kidding, it is fifth in the first and sixth in the other two. I did this to demonstrate about the ease of shifting for data mining to support a conclusion in readers that my conclusion was wrong.

    lol, bussssssssssssssssssted, again! The concept of the random typing following zipfs claim, and the applying of zipfs claim to word usage distribution does not match. According to zipfs claim "the" in random typing would occur far less than zipfs claim of "the" in word usage distribution.

    The zipfs claim is actually a psychological effect of human though patterns "fitting" information into a biased pattern. The video even alludes to this when talking about repeating a word again near itself. A fun game to play with this is to get someone to say tin 10X, then to quickly ask them what an aluminum can is made of. There is a high chance they will say "tin." I saw a video about how people were given a series of numbers and then asked to see if they could then come up with a pattern. Once they had their pattern, they were then given a next number in the series that did not follow their pattern. Most were stumped. Some attempted to explain the next number, or to slightly alter their pattern to fit it. Yet the solution was a pattern that was nothing like their pattern. This is an issue when trying to solve problems and to predict a favorable outcome. When the pattern is not correct, it leads to real difficulties down the road. And this is why biases are problematic. Biases lead to allocation of resources towards a desired outcome based upon a biased pattern, sometimes based upon data mining, and then leading to an undesired consequence.

  19. A lot of the words at the top of the list are prepositions, articles, and pronouns; words of that type correspond to the ways that we process and connect bits of information, stuff that's usually represented in language by descriptors, which are situationally specific. I would say that the most commonly used words are ones that correspond to the nuts and bolts of human thought processes, and therefore, it's not surprising to see that shared across different individuals and different languages. What is the word "the"? What does it represent? How you answer that question and how you relate it to the fundamentals of cognition itself, may offer an answer to why it's the most commonly used word.
