You may have seen the term TF*IDF being tossed around in the last year or so, but no one could blame you if you haven’t started paying attention yet.
A lot of SEO fads come and go, and some of the most interesting ones just end up attracting penalties, later on, right?
But TF*IDF is something a little different.
It’s not a manipulation of search engines; it’s a method of analyzing the topics in content, and it’s built on the same principles as the search engines themselves. Because of that, it has amazing potential for SEOs who need a truly objective method to measure and improve content.
I just recently wrapped up a case study into exactly what it’s capable of, and the results are quite interesting.
In case some of you are where I was only a few months ago, I want to cover what I learned about TF*IDF and how it’s used before I get to the results of my personal experiments with it.
The crash course starts in the next section, but if you’re an experienced user already, you can find the results of my personal tests and some comparisons of the top TF*IDF tools near the end.
Looking forward to your questions and comments.
What is TF*IDF?
So what is TF*IDF? An acronym? An equation? A really obscure text emoji?
It’s at least two of those things.
In literal terms, it means Term Frequency times Inverse Document Frequency.
TF*IDF is an equation that combines those two measurements, how frequently a term is used on a page (TF) and how rare that term is across all pages of a collection (IDF), to assign a score, or weight, to the importance of that term to the page.
I know… nerd alert, right?
We’ll look at why this is so important to SEOs in a bit, but first, let’s look at where it came from.
The equation has a very long history in academia, where researchers in fields as diverse as linguistics and information architecture have used it as a way to analyze massive libraries of documents in a short amount of time.
It’s also used by information retrieval programs (including all search engines) to efficiently sort and judge the relevance of millions of results.
There is an important difference between what you want to do and what the search engine wants to do with this same information.
The search engine wants to consider a collection made up of all the results on the web, while you want to compare one page or website to just the sites that are out-performing it: namely, the top 10.
Let’s look at TF and IDF in more depth…
The Equations that take you to TF*IDF
You need to do a little more math to get both of the measurements involved, TF and IDF, but I promise it won’t be difficult. Depending on the application, the equations for TF*IDF can get a lot more complicated than the examples I’m using below.
Simplified or not, you generally don’t want to be caught doing this work by hand if you’re trying to optimize a site. These equations will help you understand how TF*IDF functions, but it’s the tools I’m discussing at the end that really open up the potential.
Solve the first one, Term Frequency, by doing a raw count of the number of times a term appears on one page. Then, plug that number into the equation below:
Term frequency = (raw count of terms) / (total word count of document)
Alone, the TF score can tell you whether you’re using a word too rarely or too often, but it’s only really useful when weighed against the other measure.
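If you'd like to see the term frequency equation as code, here's a minimal Python sketch. The `tokenize` helper and the sample sentence are my own assumptions for illustration; real tools will split and normalize words in their own ways.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_frequency(term, text):
    """Raw count of the term divided by the total word count of the document."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # an empty page has no term frequency to speak of
    return tokens.count(term.lower()) / len(tokens)

# Hypothetical sample text: 12 words, 2 of which are "PPC".
page = "PPC ads work best when every PPC campaign has a clear goal"
print(term_frequency("ppc", page))
```

This only handles single words. Multi-word phrases need a slightly different counting approach.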
Calculate the Inverse Document Frequency by dividing the total number of documents in the chosen collection by the number of documents the term appears in, and then taking the logarithm of the result, like so:
Inverse Document Frequency (term) = log (number of docs / docs containing keyword)
With the IDF score, you can now measure the importance of a phrase to a page, not just its number of uses. This is important because it’s putting you in the mindset of the people who are building search engine algorithms.
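Here's a rough Python sketch of the IDF side, assuming a base-10 logarithm (that's the base that matches the worked numbers later in this article) and a tiny, made-up collection of four "documents". Returning 0 for a term that appears nowhere is my own choice, not part of the equation.

```python
import math

def inverse_document_frequency(term, documents):
    """log10(total docs / docs containing the term)."""
    term = term.lower()
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0  # assumption: a term found nowhere carries no weight
    return math.log10(len(documents) / containing)

# Hypothetical mini-collection, purely for illustration.
docs = [
    "the ppc campaign",
    "the seo audit",
    "the link report",
    "the content plan",
]
print(inverse_document_frequency("ppc", docs))  # appears in 1 of 4 docs
print(inverse_document_frequency("the", docs))  # appears in all 4 docs
```

A word that shows up in every document ends up with an IDF of zero, which is exactly how the equation dampens filler words.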
Why does it Matter to SEOs?
The end goal of being able to fill out this equation is to be able to give an actionable relevance score to your content. Using the TF*IDF tools available now, you can then compare your scores to the scores of the top-performing pages for any term.
By grading pages on this measure, you can nearly pull back the curtain on how Google might grade sites dedicated to the same topic.
It’s unknown whether Google is using TF*IDF in their algorithm and, if they are, whether it’s a mutated form of it. That said, there have been some private correlation studies that I’ve been privy to whose data suggests that it’s likely.
TF*IDF analysis allows you to optimize the balance of terms in your content according to what is already being rewarded by the algorithm.
That’s huge for SEOs because it marks the return of something all the old hats knew and…loved?
Keyword Density Returns?
Nope. No one loved the days when keyword density reigned.
However, TF*IDF could mark a return to the primacy of phrases and keywords as an important marker, just in a very different way.
Keyword density strategies were an early attempt to game out how Google might really be using TF*IDF for its indexing and recall.
People were trying to keyword stuff, so then algos and filters came out to combat it (hi, panda).
So, in a way, keyword density is back. It ran away from home as a surly teen and has returned as a mature adult with a degree in the sciences.
Keyword density was an early, limited tactic that mostly encouraged bad habits. Measuring term usage with TF*IDF will give you an idea of balance (at least as far as the top results are using those terms). It reveals what is considered natural, in a very precise way.
Using TF*IDF to enhance Keyword Research
TF*IDF goes a step further than keyword density in the way that it opens you to insights about whole families of words on a website.
For example, imagine that you’ve already completed keyword research to optimize a page for “DUI lawyer Chicago”. Most keyword research tools are going to spit out keywords like “DUI lawyer in chicago”, “chicago DUI attorney”, etc.
When you use the TF*IDF tools that I’m covering later on, you’ll also be able to find related non-SEO terms that are being used by the top-ranked pages that you would have never found before using normal keyword research. Terms like “legal”, “experienced”, “rights” and “practice”.
These words wouldn’t have shown up in keyword research tools because the articles themselves aren’t ranking for them, yet they’re needed to tell the story of the search intent.
Let’s put the equation to use.
Fortunately, you won’t need to do it by hand for your sites. There’s always a tool to use, and you’re only a few scrolls from seeing the ones I’ve tested for results.
Putting TF*IDF to Use
Oh, no. More math.
At this point you may be having high school flashbacks, twisting around in your chair looking desperately for the wall clock that will tell you when you’re free.
Don’t worry, this time, I’m going to do the math. Immediately after this, we’ll get to the juicy stuff—How to put TF*IDF to use.
Let’s take a look at the equation in action…
Say that a document, such as a client’s landing page you’re examining, contains the term “PPC” 12 times, and is about 100 words in length. If you wanted to begin analyzing this piece of content, you would begin by plugging that into the term frequency equation from earlier.
TF (PPC) = (12 / 100) = 0.12
Now, say that you wanted to understand how this usage compared to the usage of this term on the rest of the web. From a sample size of 10,000,000, at least some of these pages are going to be about web services and will include references to PPC. Let’s say, 300,000 of them.
We can use those numbers to finish the Inverse-Document Frequency equation.
IDF (PPC) = log (10,000,000/300,000) = 1.52
Now you score your page based on that term with the TF*IDF equation
TF*IDF (PPC) = 0.12 * 1.52 = 0.182
That’s a great score. Or is it?
The truth is, it’s not really a matter of meeting a limit. You want to bring your score for targeted terms into balance with the best-performing URLs on page 1.
A high score for a certain term isn’t necessarily a good thing (12 uses in 100 words is a lot, after all).
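If you want to check the arithmetic yourself, here's a small Python sketch of the PPC example. The function names are mine, and I'm assuming a base-10 log since that's what produces 1.52. Note that the 0.182 above comes from multiplying by the rounded IDF of 1.52; at full precision the score lands closer to 0.183.

```python
import math

def tf(raw_count, total_words):
    """Term frequency: raw count divided by the document's word count."""
    return raw_count / total_words

def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 log (an assumption)."""
    return math.log10(total_docs / docs_with_term)

tf_ppc = tf(12, 100)
idf_ppc = idf(10_000_000, 300_000)

# Score computed the way the article does it (IDF rounded to 1.52 first).
article_score = round(tf_ppc * round(idf_ppc, 2), 3)
# Score computed at full precision.
full_score = round(tf_ppc * idf_ppc, 3)

print(tf_ppc, round(idf_ppc, 2), article_score, full_score)
```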
What about Common Terms like “the” and “of”?
You may be wondering, what about the noise?
What about all the common words like “of”, “the” or “and”? Because of the way the equation is structured, this noise isn’t really a problem.
The entire set of documents uses these words frequently, so the prominence of those words is scaled down considerably.
Let’s go back to the equation. To really illustrate the difference, we’ll say that there are as many uses of “of” on the page as there are of “PPC”.
TF (OF) = (12 / 100) = 0.12
But look what happens when we finish the IDF equation with the knowledge that the vast majority of results are going to contain the word “of”, say 8,000,000 of them.
IDF (OF) = log (10,000,000/8,000,000) = 0.097
That would make the final TF*IDF value:
TF*IDF (OF) = 0.12 * 0.097 = 0.012
The TF*IDF value increases proportionally to the number of times the phrase is used in the document, but in this case it is so offset by the frequency of the word throughout the rest of the collection that its score craters compared to the last example.
In other words, the more common the word is, the smaller IDF becomes.
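To watch that shrinkage happen, here's a quick Python sketch that holds TF at 0.12 and varies only how many documents contain the word. The 300,000 and 8,000,000 figures are the ones from the examples above; the 1,000,000 and 10,000,000 rows are extra points I've added purely for illustration, and the base-10 log is again an assumption.

```python
import math

TOTAL_DOCS = 10_000_000
TF = 0.12  # same term frequency for every row

def idf(docs_with_term, total_docs=TOTAL_DOCS):
    """Base-10 inverse document frequency."""
    return math.log10(total_docs / docs_with_term)

# Document frequencies, from rare to present in every document.
doc_frequencies = [300_000, 1_000_000, 8_000_000, 10_000_000]
scores = [(count, TF * idf(count)) for count in doc_frequencies]

for count, score in scores:
    print(f"{count:>10,} docs -> TF*IDF {score:.3f}")
```

A word that appears in every single document scores zero no matter how often you use it.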
What about Phrases?
Search engines tend to give an outsize weight to multi-word phrases over single terms.
This is especially true when the natural quality of language is being considered.
Naturally, you want to carry these considerations over to how you perform your TF*IDF assessments.
Fortunately, that takes no extra effort on your part. Most TF*IDF tools are capable of calculating keywords as 2-word and 3-word versions.
When TF*IDF was used exclusively for academic and research purposes, terms were already calculated as either 2-word sets called bigrams, or 3-word sets called trigrams. That same practice was adopted by search engines, so it’s important to analyze your content the same way they do.
Using the example of a PPC page from before, let’s look at a phrase that might appear on that page, and what the phrases may suggest about the topic.
“A PPC campaign needs many ads”
Each set of two words in this phrase could be calculated as a set of bigrams.
- A PPC
- PPC campaign
- campaign needs
- etc.
When a third word is added, it becomes even clearer how much important context is added when longer phrases are considered.
- A PPC Campaign
- PPC campaign needs
- campaign needs many
- etc.
Not all TF*IDF tools are capable of handling more than two-word combinations. I’ll go into more detail on the capabilities of each in the tool comparison located further down.
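For the curious, splitting a phrase into bigrams and trigrams takes only a few lines of Python. This is a bare-bones sketch with a hypothetical `ngrams` helper; it splits on whitespace only and doesn't strip punctuation or stop words the way a real tool might.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "A PPC campaign needs many ads"
bigrams = ngrams(phrase, 2)
trigrams = ngrams(phrase, 3)

print(bigrams)
print(trigrams)
```

Each of those phrases can then be counted and scored just like a single word.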
How to use TF*IDF
TF*IDF fits neatly into the content development process of almost any SEO.
It’s a method of learning more before you’ve started building content, and then knowing where and how to perfect it again.
Once you’ve chosen a tool, it’s only a step-by-step process to get more insight into each keyword choice. If you have not chosen a TF*IDF tool yet, you can find the data from the tests I performed with them in the next section.
1) Write content
Write content to the highest standards you know, or refer to a piece of content that you’re optimizing for a client. Create a list of one, two or three-word topics that you want to cover and take it to the TF*IDF tool that you’ve chosen.
Your goal here is to target keywords and the URLs of the top domains that target them to reveal what topics you are missing, and what topics you aren’t covering in enough depth.
2) Plug into a TF*IDF tool
Each tool works in a slightly different way, as you’ll be able to see, below. They also track different information, but the most useful ones are geared toward helping you understand how your competitors are finding success with their use of keywords.
Take advantage of any features your chosen tool has to help you discover terms that are associated with the 10-20 top-ranking URLs, and then produce scores that reflect the weight of each term they’re using.
3) Re-optimize content
Now that you have a complete idea of the topics covered by each of your competitors and an understanding of how frequently those words are used, you can use that information to refine your own content.
Take a second pass at the content and look for natural ways to introduce topics that you haven’t covered yet. Remember, your motivation is not to stuff unnaturally, but to restore natural connections where they’re currently missing.
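To make the "missing topics" idea concrete, here's a simplified Python sketch that lists words used by at least half of the competing pages but absent from yours. Everything here is an assumption for illustration: the `missing_terms` helper, the 50% threshold, and the sample snippets. It also skips the IDF weighting entirely, so on real pages common filler words would slip through unless you filter them.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def missing_terms(my_text, competitor_texts, min_share=0.5):
    """Terms used by at least min_share of competitor pages but not by mine."""
    mine = set(tokenize(my_text))
    pages_using = Counter()
    for text in competitor_texts:
        pages_using.update(set(tokenize(text)))  # count each page once per term
    threshold = min_share * len(competitor_texts)
    return sorted(
        term for term, pages in pages_using.items()
        if pages >= threshold and term not in mine
    )

# Hypothetical snippets standing in for full pages.
my_page = "Chicago DUI lawyer with free consultation"
competitors = [
    "experienced DUI lawyer in Chicago protects client rights",
    "Chicago DUI attorney with an experienced legal practice",
    "know the rights a Chicago DUI lawyer defends",
]
gaps = missing_terms(my_page, competitors)
print(gaps)
```

The output is a list of candidate topics to work into your content naturally, not a list of words to stuff.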
4) Publish
Publish the content updated with the insights that you’ve recently gleaned from your searches. From here, you can continue to analyze it, and any changes in the ranks.
5) Show before and after TF*IDF graphs
One of the rewards of TF*IDF is that it allows you to track performance at a very minute level. Before and after each adjustment you make to your content, you can produce graphs of how the balance of topics on your pages has changed. These are useful to clients who are interested in seeing specific metrics for changes you’re making in their content.
Now, we’re ready to get into the part you’ve been waiting for!
I’ve had a chance to play with all of the biggest TF*IDF tools on my own sites, and I have a lot to show you about what they can do.
But first, let me share some results I’ve gotten from testing TF*IDF in the actual Interwebs.
Testing Results
I’d like to preface this section by saying that I’ve actually been testing TF*IDF for over a year.
Ever since I first looked into niche-based semantic density algorithms, the concept struck a harmonious chord with me.
And although the right mindset going into any kind of experimentation is agnosticism, I really wanted TF*IDF to work.
That said… for a very long time, I got lackluster results.
And then things changed.
I’m about to walk you through the timeline, but first, let me describe how I tested it.
Identifying Testcases for TF*IDF Experiments
Creating single-variable test structures is quite challenging for this particular scenario.
What is a single variable test?
In a super controlled test environment, you would have two groups of testcases.
One group would be the control group.
In the control group, you don’t change anything. You’re simply getting a “baseline” result to compare against the experimental group.
The experimental group is completely identical to the control group in most regards.
The web pages might have the same types of backlinks, they target the same keywords, etc. All these variables must be similar and constant between each other, or else the test is flawed.
However, with the experimental group, you change one thing. This is the “single variable”. And in this case, it would be TF*IDF optimization.
For the websites in the experimental group, you would perform TF*IDF optimization, let them sit, and then compare the results against the control group.
The challenge with SEO testing is that you can never control all the variables. There’s always noise coming along in the form of backlinks, traffic, competition, algorithm changes, etc.
You know how SEO is. It’s noisy AF.
One way people like to create SEO tests is by using gibberish words.
Let’s say we create 10 inner-pages on the same domain, all targeting some made-up word like “flubblegoblin”.
They’d take up the entire first page of Google since there are no search results for “flubblegoblin” (yet).
These pages would be very similar in length, optimization, etc.
You could then optimize three of them with TF*IDF, let them sit, and then if TF*IDF works, they should start ranking #1-3, right?
But this approach is flawed from the start.
You’d have to optimize their content with respect to all the other pages you’ve built, which were already created similarly to each other.
Thus, if you set up the experiment correctly from the beginning, there would be no optimization possible. They’re already identical.
So dead end here too.
Alas, I went with the following approach to testing.
I would isolate multiple pages on multiple live websites that had the following characteristics:
- Static rankings for at least a month’s time
- Not receiving any backlinks or internal link juice
I would then apply TF*IDF optimization and let them sit for about 30 days and look out for increases or decreases in rankings.
I’m not entirely happy with this approach as a lot of “noise” can enter in this experiment structure from algorithm changes, the websites aging themselves, etc.
So, I decided to combat this inaccuracy by testing over multiple phases and many different pages.
Now onto the show.
Phase 1 – Between December 2017 and March 2018
Aka, the dark ages.
Optimization tools:
- From the Future’s free tool
- Text-tools.net (Use code MATT-TFIDF for 35% off)
My first experiments with TF*IDF optimization were run between the dates mentioned above.
I ran experiments on three different occasions, on 12 different URLs, and tracked 36 different keywords (3 per URL).
In each case, the results were left to settle for 45 days (just in case).
Here are the lackluster results:
Whomp whomp.
There didn’t seem to be much effect in either the positive or negative direction.
After so much testing and results like these, why did I continue?
Because, as I mentioned before, I was really into the concept and I was (to be frank) quite surprised it didn’t do anything.
I started doubting my testcase integrity and the tools I was using.
Eventually, I just told myself I would continue to test this periodically just to “checkup” on things.
Phase 2 – April 2018
For this second round of testing, I decided to stick to Text Tools for the analysis and optimization.
Why?
For one, because the software allowed for in-tool adjustments, so I could edit my text and re-evaluate on the fly (I’ll be doing a tool review later in this article).
And two, because the owner gave me a free license (thanks Michael).
I was surprised to see the following results the 2nd time around.
On two of the three testcases, we experienced positive movement.
It wasn’t groundbreaking movement, but enough to show a trend.
But here was the kicker.
Around this time, a core algorithm update was released. It happened in March, to be exact.
The two sites that showed positive movement were getting beaten up by this algorithm update at the time.
And while all pages on the site were experiencing a loss in rankings, the pages where I was testing TF*IDF either held their ground or gained rankings.
And then I found articles like this…
If these algorithm updates were really about relevance, then what better indicator of relevance than the damn words that show up on web pages.
The coincidence was enough to pique my interest.
Was it enough for me to completely sign off on TF*IDF and add it to my standard operating procedures (SOP)?
Absolutely not.
Only more testing could do that.
Phase 3 – May 2018
Nothing changed in this experiment.
I continued to use Text Tools as my software of choice.
The only thing different was new testcases and a different date.
The trends remained the same as in phase 2.
More positive results.
This time I dug into things and noticed some patterns.
Results typically get worse before they get better
In 61% of the keywords I was tracking, the keywords got worse before they got better.
Only after 22-24 days had passed since the initial kick-off and re-caching of the newly optimized text did the rankings start to turn the corner.
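If you track rankings yourself and want to flag this pattern, here's one hedged way to sketch it in Python. The definition is my own assumption: a keyword "got worse before it got better" if it ranked worse than its starting position at some point and finished better than where it started. The rank histories below are invented for illustration and are not my test data.

```python
def got_worse_before_better(ranks):
    """ranks: positions over time, 1 = top spot; the first entry is the baseline."""
    baseline, final = ranks[0], ranks[-1]
    dipped = max(ranks) > baseline   # a larger number means a worse position
    improved = final < baseline      # ended above where it started
    return dipped and improved

# Hypothetical rank histories.
dip_then_gain = [12, 14, 15, 13, 11, 9]
steady_gain = [12, 11, 10, 9]
dip_only = [12, 14, 15]

print(got_worse_before_better(dip_then_gain))
print(got_worse_before_better(steady_gain))
print(got_worse_before_better(dip_only))
```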
By optimizing one keyword, you might deoptimize another
I do a lot of affiliate SEO, so most of the pages I was experimenting with were review pages.
So, when deciding which keywords to analyze and optimize for I would typically go for “best ___” keywords like “best protein powder”.
Yet, for the testing, I was tracking a wide range of keywords such as “protein powder benefits”.
Those keywords that aren’t really related to review-oriented queries like “best protein powder” or “protein powder reviews” were more likely to experience negative movement.
Phase 4 – August 2018
This time around I decided to use a different tool: Link Assistant’s Website Auditor.
I switched things up from Text Tools as there’s (what I believe to be) a flaw in its implementation, which I’ll discuss later.
Here are the results:
At this point, I started to feel comfortable enough with the results to warrant writing this article and to start incorporating this technique into our SOP.
Especially with results like these that required zero link building:
Tool Comparison: Website Auditor vs Text Tools
Here’s a comparison of two of the most popular tools on the market which can be used for TF*IDF content analysis and optimization: Link Assistant’s Website Auditor vs Text Tools (Use code MATT-TFIDF for 35% off).
TF*IDF Tool Comparison
Tool | Platform | Usability | Accuracy | Cost | Our Choice |
---|---|---|---|---|---|
Website Auditor | | | Winner | Winner | Winner |
Text Tools | Winner | Winner | | | |
Platform (Winner: Text Tools)
Text Tools is run in the cloud. You log in to their platform and all the analysis is run server-side.
Obviously, this is the way most of us like to run our software these days (if possible) so we’re giving our vote to Text Tools when it comes to platform.
Website Auditor is a downloadable piece of software. The free version of it includes TF*IDF analysis.
It’s a pretty solid tool, as you can see below.
Nonetheless, we still prefer to work on the cloud so the vote goes to Text Tools.
Usability (Winner: Text Tools)
Right off the bat, Website Auditor has a big strike against it since you can’t save projects.
This is a feature which is unlocked when you upgrade to the paid version of the tool, so I guess it’s a moot point, but I just thought I would throw it in there.
Going back the other way, Text Tools is kind of glitchy on Chrome. At least the version I’m playing with right now.
For the life of me, I can’t switch between the various tabs in the analysis mode on Chrome. I’m stuck in overview mode and can’t get into the juicy stuff like “Compare” where you analyze your URL vs the analysis of the competition.
That said, on Firefox everything is fine.
But where the scale clearly tips in Text Tools’ favor is in the optimization phase.
I envision a productive TF*IDF workflow to work like this:
- Analysis of the competition
- Comparison against your content
- Optimization of your content
- Re-comparison against your content
- Publish
Text Tools allows you to copy and paste your page’s text into the tool itself. If you make changes to the content, you can simply edit the content in the tool, and re-analyze to see how you’ve done.
Website Auditor only compares against URLs. You either need to make changes to your live content or publish your content in a Google doc and have the tool analyze that.
It’s not a deal breaker, but it takes time and it’s annoying.
I mentioned this to Website Auditor’s staff, and they say they “added it to the queue”.
Let’s see.
Accuracy (Winner: Website Auditor)
As my team and I were playing around with Text Tools, we started noticing something strange.
Let’s say you analyze a term like “keyword cannibalization”.
When comparing the result vs my article on keyword cannibalization, you’ll find a result that looks like this:
You’ll notice that for the word “strategy” my content (yellow line) gets a zero because I don’t have that word on my page.
But what you’ll find is that even though it appears that the average is about 3.4, I would just need to add the word “strategy” once to jump up to adequate numbers.
I talked to the developer Michael Kaiser about this (lovely guy by the way), and he said his tool denotes the y-axis as a “weight”, calculated internally. And a lot of the time, adding a word once to an article is enough to satisfy the weight requirement.
This is fine, but I’m more looking for actual guidance of how many times each word should appear in the article.
Website Auditor delivers that.
It will break down, for phrases or single words, exactly how many times the competition is including each term, how many times you include it, and whether or not you might need to add or remove some.
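To show the kind of guidance I mean, here's a rough Python sketch that compares how often you use a term against the competitor average. To be clear, this is not how Website Auditor works internally; the `usage_report` helper, the sample snippets, and the simple add/remove rule are all my own assumptions. It also compares raw counts, so it ignores differences in page length.

```python
import re
from statistics import mean

def count_term(term, text):
    """Count whole-word (or whole-phrase) occurrences, ignoring case."""
    pattern = rf"\b{re.escape(term.lower())}\b"
    return len(re.findall(pattern, text.lower()))

def usage_report(term, my_text, competitor_texts):
    """Compare my usage of a term with the average usage across competitors."""
    mine = count_term(term, my_text)
    average = mean([count_term(term, text) for text in competitor_texts])
    if mine < average:
        advice = "add"
    elif mine > average:
        advice = "remove"
    else:
        advice = "ok"
    return {"term": term, "mine": mine, "competitor_average": average, "advice": advice}

# Hypothetical snippets standing in for full pages.
my_page = "Keyword cannibalization happens when two pages compete."
competitors = [
    "A content strategy prevents cannibalization. Review your strategy often. Strategy matters.",
    "Strategy first: a keyword strategy and a linking strategy beat a rushed strategy.",
    "Plan a strategy, then audit the strategy.",
]
report = usage_report("strategy", my_page, competitors)
print(report)
```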
Such as with the word “strategy”:
Cost (Winner: Website Auditor)
Unfair competition. You can’t compete with free.
Our Choice: Website Auditor
Text Tools has a lot of things going for it. I’d much rather work on the cloud and perform my edits inside a tool so I can do a quick reanalysis.
But at the end of the day, I’m looking for guidance, on a granular level, of the niche average keyword density for each phrase and word. For this reason alone, we’re going with Website Auditor.
Here’s a quick video on how to use Website Auditor for TF*IDF analysis, whipped up by LeadSpring’s own Anthony Lam:
FAQ
What is TF IDF SEO?
This is the SEO process of optimizing your content’s keyword density with the guidance of the algorithm known as TF IDF.
How does TF IDF work?
TF IDF refers to the term frequency times the inverse document frequency. TF grows higher with the number of times a given word shows up on a page, while IDF decreases the value of commonly used words such as “and”.
Each word gets a score, which can be used to determine the importance of various words in content.
Does Google use TF IDF?
Not likely in its entirety. If they do use it, it’s an advanced version that has evolved past its original understanding in the 1970s.
Who invented TF IDF?
British computer scientist Karen Spärck Jones is credited with introducing the inverse document frequency concept behind TF*IDF in 1972.
Can TF IDF be negative?
No. With the formulas used in this article, neither TF nor IDF can ever be negative.
Conclusion
I hope this article has helped clear things up regarding the extremely useful, yet often misunderstood, TF*IDF analysis.
You’ve not only learned the mathematics behind it but also how it applies to SEO and creating relevance in your articles.
You’ve also seen some test results of how optimization shows up in the SERPs, as well as a comparison of the most popular tools on the market.
If you have any questions, please use the comment box below.
Great stuff Matt.
Hey Matt, I heard about TF/IDF at CMSEO 2018. So I decided to give it a test. After coming back from Thailand, I used this technique on many pages by using the ryte.com tool and got a good success rate. Although I have also used text tools but not so good while using it.
Hi,
Thanks for sharing this article with us. I’m working hard on my blog to achieve some success and to get some audience to it. Hopefully, Health blog [link removed by moderator] can grab attention but it needs hard work and smart work as well. So, I will follow your points to make my on-page SEO successful.
Thanks and Regards
Pretty huge post. It will take some real good time to read and understand the latest trend in SEO.
BY the way, again you did a good work.
Keep it up.
Vikas
There are free tools that do the same sort of thing.
2 of the 3 tools mentioned in this article are free. 🙂
Hi Matt,
Amazing work! As always!
This is a lot of information to take in at once but I was wondering if I understood correctly.. Basically you should compare your TD-IDF with the competition on the page one only?
Thank you
Some people do the top 3 pages. I like page 1.
Have good results as well using this method.
Thanks Matt, good article. I’ve been playing about with text tools and you’re at least the second person to suggest it’s not as accurate as some others. Would love to see a detailed comparison with other tools such as how TF*IDF works in Cora, Cognitive SEO and any others.
I’ve been using website auditor for a while now too and have found it to be the easiest to use for TF-IDF. I have NOT done testing like you have, so was happy to read this and see that my efforts to improve my content are (probably!) not in vain.
Seems like a tightened up version of using LSI keywords…
Thanks Matt. I tested this as well through 2018. I also noticed that the pages that had the TF/IDF words included tended to rise in rankings 4-8 weeks after adding them (without additional links added). I will have to differ with the common thought that keyword density of the exact phrases your trying to rank for doesn’t matter anymore. I find over and over again that it is very important. That being said I do density different than other SEO’s.
Cheers!
If my site wasn’t impacted at all by the algorithm update that focused on relevance, is TF*IDF something I need to focus on?
If you’re #1 already, then ignore it.
Hey, do we need to count keywords in the whole page, or only in the body part, without the menus, footers, etc.?
Body. But the tools aren’t good at that. I wouldn’t worry too much about it.
Hey Matt, thanks for the great article!
Quick question: when you guys use TF*IDF in your SOPs to optimise a piece of content, at what point exactly do you do it?
We usually use TF*IDF rather late in the optimisation process, because we feel like we need to make sure first the content is about the same length/style/structure as the competitors’ so that we could compare apples to apples and not say a 100w article to a 5000w article.
How do you guys go about this?
Thanks and regards,
Tim
We pushed it towards the front of the process. When we get content written, we pass the writers a list of keywords they need to hit. We now mingle the TF*IDF words into that list. They can’t tell the difference.
After we get that content back, we tweak it a bit to hit more exact numbers.
When you give writers exact instructions (like say the word “machine” 25 times), typically they return a poor piece of content.
Hey Matt,
that’s interesting. So you guys don’t specify any keyword quantity, you simply provide the writers a list of individual keywords that must be included at least once, then optimize the quantity on your own after?
That’s right.
Good read Matt.
Whats interesting is now we are moving into full machine learning an AI in so many daily applications (Google Home, Assistant etc) the search engines need to start incorporating more in search as we ‘humans’ speak entirely differently in terms of context, synonyms, stemming, lsi.
The AI in Google Assistant needs to be able to deliver more accurate results – this is the beauty about incorporating this in our content, we are relevant!
More testing is needed, and I’m wondering if it would be good to test (or if its possible) english language but in a foreign language Google. Or Bing? As Bing is more on page, and less about links…
🙂
Good points. Feel free to test it. I’m all tested out (for the time being) on this subject. Back delving hardcore into links. 🙂
Thanks Matt.
I have a question about choosing competing pages in the TFIDF analysis.
Let’s say I’m analyzing a targeted “Best” affiliate article. If there are 2-3 big “Guide” type articles in the top 10, would you remove those guides from consideration in the TFIDF algorithm, instead focusing on comparing my article against other articles that have the same intent?
If I also plan on writing a big “Guide” type article, it seems like comparing those articles against my “Best” affiliate article is a bad idea because there will be many more occurences of keywords in the guides compared to a tightly-focused review article. That seems to be counter-productive as it may start to make my article sound unnatural since it’s likely much shorter than the big guides.
Really good question. Google is clearly able to tell the difference between different content types as they match a search intent. I wonder if they’re holding keyword frequency separate between them. Worth a test.
Hey Matt,
Great article and thanks for this. Have you tested SEMrush’s Semantic Analysis in their On-Page SEO Checker that shows TF*IDF? I am wondering if it’s as accurate as Website Auditor?
Haven’t, brother. Canceled SEMRush a bit ago when Ahrefs shaped up their site audit tool.
Good read, thanks for that. I’ve got the paid version of the Powersuite tools but have only played around with TF-IDF a little bit. You’ve convinced me to delve in to it further 🙂
Come back and let us know what you think.
Thanks for publishing your case study Matt! Appreciate it.
Very interesting read – thanks for sharing Matt 🙂
Great post, as always, Mr. Diggity. “In 61% of the keywords … got worse before they got better. Only after 22-24 days … did the rankings start to turn the corner.” Exactly 🙂 . Cognitive SEO is a great tool for TF IDF as well.
Nice. Will give it a look.
Amazing job. Congrats!
You, sir, are a scientist.
This really is some quite unique methodology here and it’s great that you’ve found some tools to help with this too.
Even better is that it’s actionable and can give you an edge.
Thanks Matt
Hi,
A very nice, comprehensive, detailed article you got there.
Been using SEOlyze since 2015, and it has been doing miracles for the couple of websites we used it on. I’d suggest you test that tool as well, it is very powerful and they keep on adding more stuff to it.
Thanks for the tip. Will check it out.
If TF IDF really benefits rankings, then top 10 pages may end up being packed with articles that have similar structure and covering the same sub-topics. However, that’s probably the case for most competitive niches anyway.
Great article.
Yes, I thought so as well, but you can target the same topics and subtopics from a different angle, maybe a controversial angle or a more scientific angle (especially when it comes to well-being and fitness).
True. TF*IDF just makes perfect sense when you are a search engine and have to deal with millions of pages that have no UX signals. You just look at what the majority of articles are about and assume that’s the standard for this topic.
But it doesn’t mean we have to re-write top ranking articles since we can approach the topic from many different angles.
I think that TF*IDF is a major part of the puzzle, with the other element being NLP (Natural Language Processing). After all, Google’s algorithm doesn’t actually understand what it ‘reads’, so writing your content in a way that helps Google infer meaning should also help with relevancy and rankings. You’ll find out more on the background of NLP in this interesting article: https://www.briggsby.com/on-page-seo-for-nlp
Here are a few tips I use when writing for Google:
1. Connect questions to any answers you provide in your content
2. Add units, e.g. ‘The boiling point of water is 100 degrees centigrade’
3. Reduce dependency hops when explaining a topic area, i.e. get to the point within the same sentence, if possible.
4. Don’t write long content for the sake of it; get to the point quickly and efficiently
5. Write for Google comprehension level – i.e. 5th grade or less
6. Use related words, terms, and phrases to the topic, not just the obvious ones
7. Ask yourself: does this content answer the reader’s intent?
8. Be careful with your headers and don’t use unrelated text within them
9. Structure headers carefully – and use them as lists with clear topic/sub-topic relationships
Testing many of these factors single-variable right now, especially #3. Results are $.
Awesome post as always, Matt. I’ve been using TF*IDF for quite some years now, and it’s true that single-variable testing is kind of a pain in the ass, so I just followed my gut and common sense and kept using it. Did you try the ryte.com tool? I think it’s worth a shot as a cloud solution. Let me know what you think if you did. Peace.
Will check it out.
Awesome stuff Matt! I have been using Website Auditor for years and LOVE it for their TF-IDF capabilities. While I have not run any specific tests (this makes me want to though), I definitely know it has an impact on my rank. I think using this gives many an edge. Most website owners will not invest in these tools or take the extra steps, so lots of room to outrank in my opinion!
Detailed and very helpful info as always Matt. Thanks. I believe text-tools uses WDF*IDF, and the formula for that is a little bit different than TF*IDF, even though the end result is going to be similar. The formula I’ve been using for text analysis is here: http://wdf-idf.com/
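For readers who want to see how the two weightings differ, here is a minimal Python sketch. It assumes one commonly cited form of WDF, log2(freq + 1) / log2(L) with L being the document length in words, and a plain log10(N / n) for IDF. The function names and the toy corpus are hypothetical, and the tools mentioned in this thread may well compute things differently.

```python
import math

def tf(term, doc_tokens):
    """Plain term frequency: share of the document's words that are `term`."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)

def wdf(term, doc_tokens):
    """Within-document frequency (assumed form): log2(freq + 1) / log2(L)."""
    length = len(doc_tokens)
    if length < 2:  # log2(1) == 0, so avoid dividing by zero
        return 0.0
    return math.log2(doc_tokens.count(term) + 1) / math.log2(length)

def idf(term, corpus):
    """Inverse document frequency (assumed form): log10(N / n)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing) if containing else 0.0

# Toy corpus (made up): 10 "pages", only the first one mentions "seo"
page = ["seo"] * 4 + ["filler"] * 12
corpus = [page] + [["filler"] * 16 for _ in range(9)]

tf_idf = tf("seo", page) * idf("seo", corpus)
wdf_idf = wdf("seo", page) * idf("seo", corpus)
print(f"TF*IDF:  {tf_idf:.3f}")
print(f"WDF*IDF: {wdf_idf:.3f}")
```

The logarithm in WDF dampens repeated use of a term, so doubling the keyword count raises the WDF score by less than double, while plain TF scales linearly.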
Hey Matt,
You can run Website Auditor on a server and schedule analysis on a periodic basis too. I guess that will need a paid version, and you can export those reports to an Excel sheet and keep updating.
Talk with their team, they can help you automate the process.
This article is a very deep one, and thanks for pointing out that we can do a TF*IDF analysis on a Google Doc too. I didn’t think of that at all 🙂
Thanks Matt.
Suresh
At our agency we have been using WA for a while now and we had some great results with our clients. It is well worth it.
Have you ever tried Semrush – SEO Writing Assistant (beta)? It checks if your texts follow best SEO recommendations.
Haven’t. Thanks for the rec.
np
Check out TopicalRelevance.com – it is a term frequency counter. Intended to be used to generate a list of relevant keywords you can send to your writers. Similar to the ones in your article.
As detailed as always, Matt.
I’ve been using TF-IDF since last year, when you first mentioned it in the Group. Using WA, it helps my R&R sites when they get stuck on pages 2 and 3.
Hey Matt, great guide on TF-IDF. I knew about it before but thought that I’d need paid tools to do it and frankly didn’t know where to start.
Now I do.
I noticed in the video that Anthony starts optimizing content by starting from the top and working down. What I mean is, he’s adding words that Site Auditor shows in green as sufficiently used on the page.
I’m confused.
Isn’t it the point to include words that the tool says are missing, and to remove those that are overused?
Why would we want to change the words that are already sufficiently used on the page?
Doesn’t Site Auditor give us a passing grade so that we don’t have to waste time with them?
Thanks for your reply.
Hey Nikola
There are several ways you can go about it.
1. If it says “OK”, then that’s a pass. You move on to the next keyword. This is the quicker way of doing it and you’re still satisfying the criteria of falling within the min and max.
2. Getting the current keyword count/TFIDF value as close as you can to the average – I personally try to get it slightly above the average (but still far from the max).
If I’m near the max, I might be pushing it in terms of over optimization.
If I’m near the min, I might be underoptimized.
This method does take longer, so it’s up to you which method you choose.
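The two approaches can be sketched in Python roughly like this. It is only an illustration: the function names, the sample numbers, and the `headroom` fraction used to express "slightly above the average" are hypothetical choices, not values taken from Website Auditor.

```python
def usage_status(count, minimum, maximum):
    """Approach 1: pass/fail check against the competitors' min and max."""
    if count < minimum:
        return "under"  # possibly under-optimized
    if count > maximum:
        return "over"   # possibly over-optimized
    return "ok"

def target_count(average, maximum, headroom=0.25):
    """Approach 2: aim slightly above the average, still far from the max.

    `headroom` is an assumed fraction of the gap between average and max.
    """
    return average + headroom * (maximum - average)

# Made-up example: competitors use the keyword between 3 and 12 times, 6 on average
print(usage_status(5, minimum=3, maximum=12))
print(target_count(average=6, maximum=12))
```

The first function is the quick pass; the second gives a number to edit toward if you want to spend the extra time.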
Hey Anthony,
that’s good to read.
I needed it as TF-IDF is really interesting to me, and I was worried I wasn’t doing it the right way- hence no benefit.
Thanks for the clarification,
What if you are doing it for a product-name keyword, where the first page has sites like Amazon, eBay, etc., but also has affiliate sites reviewing the product?
Would you compare yourself to the whole of page 1, or would you pick only the sites that are similar to yours, the affiliate sites?
In general, do you exclude any specific URLs like YouTube or Amazon, or do you use the whole of page 1 for optimizing?
I’d compare to similar sites.
Really in-depth article. Thanks Matt
Hello
Matt Diggity, you are doing very well. I’m your super fan. I have a question: people are saying content is king and quality content matters…
My question is: there are a lot of sites ranking in 1st position which don’t have any content, like
Online Image Optimizer
https://imagecompressor.com
Can you please describe how they rank without content?
They’re fulfilling the search intent. People are searching for something like “online image compression tool” and getting what they came for. Which is signaled to Google by the time on site, etc.
Hey Matt,
Do you include your own URL in the analysis? For example, if your page is ranking #8, would you include it in the comparison against the other 9 sites ranking in the top 10?
Knowing when and when not to include your own page is something that really confuses me.
Best regards
Yup 🙂
A long but very informative read. Very interesting topic as I had never heard of TF-IDF before. Thanks for sharing.
Amazing article, thanks for sharing your knowledge!
Great article Matt! Really enjoyed the read! Keep them coming!!
Forgot to say… Big fan of TF IDF for SEO!
Wow, amazing article Matt. I’ve always wanted to read in-depth case studies like this instead of many articles out there which just tell you to do something without any reliable data to back it up.
I’m glad I found your stuff. Keep up the great work!