Understanding Duplicate Content for SEO

What constitutes duplicate content and how do you fix it?

Duplicate content is a common issue in the world of SEO. Whether it stems from a new site going live with accidentally copied content, or from a legacy issue surrounding specific products or pages on a pre-existing site, duplicate content can crop up anywhere.

Duplicate content sits in an interesting position: although it is generally recognised as bad, the reasons why it can damage a site, what actually constitutes duplicate content, and how it can be fixed (bar "remove the duplicate pages") are often unknown.

Throughout the course of this article we are going to delve deeper into the world of duplicate content to shed a bit more light on why duplicate content is not ideal for SEO, and how search engines deal with duplicate content as a whole.

What is Duplicate Content?

Let's start at the start when it comes to the topics of SEO, duplicate content, and search engines. Content, of all types, tends to fit into three categories - unique, near duplicate/somewhat unique, and duplicated.

Unique content is the content that we always herald as SEOs. Unique content is fresh content that is completely different on every page. As SEOs, this is the golden content, as generally speaking unique content is a lot easier to optimise (with a few exceptions) than duplicated content.

Near duplicate or 'somewhat' unique content (depending on whether you are a glass half-full or a glass half-empty kind of person) is content where there are similarities to other pages, and potentially a portion of the page is the same, but not all. Each page still has a chunk of unique content on it.

Duplicated content is where the content on a page is completely copied from another source, or is an exact duplicate of another page. Sometimes duplicated or "extremely similar" pages are unavoidable, and we'll explore how to deal with those further down the article, but, generally speaking, duplicated content is a bad thing in the world of SEO.

What doesn't constitute duplicate content?

While, throughout the course of this article, we'll be focusing on what duplicate content is, it is also important to understand what duplicate content is not.

There are two core exclusions to duplicate content.

Common content. Headers, footers, and menu structures do not count towards duplicate content - even though they will often be the same for every page on a site.
Code. Code doesn't count towards duplicate content, so you don't need to worry about the HTML structure of each page being the same.

The only thing that counts towards duplicate copy is the main content of a page.

How duplicate does duplicate content need to be?

The other question that often comes up is how unoriginal content needs to be in order to be considered duplicate.

Generally speaking, search engines understand a page in chunks of contextual phrases called "shingles". A shingle is a short phrase, often described as being five or ten words in length, that contains contextual clues for a search engine. The technique was developed in 1997: the shingles from one page are compared with those from other pages and, where there are too many identical shingles, the page is considered duplicated.
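To make the idea of shingles a little more concrete, here is a minimal Python sketch. It splits two pieces of text into overlapping five-word shingles and measures what share of those shingles they have in common (a Jaccard similarity). The function names, the five-word window, and the choice of Jaccard similarity are all assumptions made for illustration; this is not how any particular search engine scores pages.

```python
from __future__ import annotations


def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 5) -> float:
    """Share of distinct shingles the two texts have in common (0.0 to 1.0)."""
    shingles_a, shingles_b = shingles(a, size), shingles(b, size)
    if not shingles_a and not shingles_b:
        return 1.0  # Two empty texts are trivially identical.
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)


# Hypothetical page copy, differing only in the final word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
print(jaccard_similarity(page_a, page_b))
```

In this example the two sentences share four of six distinct shingles, giving a similarity of roughly 0.67. Where the line between "similar" and "duplicate" sits is a judgement call, as no threshold has been published.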

There are a couple of important points to make when talking about shingles. Firstly, despite being over two decades old, this method of identifying duplicate content is still considered to be the best way of analysing content for duplication. Secondly, no search engine has disclosed how many words are too many words with regard to duplication. This means content needs to be dealt with carefully in order to ensure it is unique in the eyes of search engines.

That being said, there are a couple of points to note. Firstly, snippets and quotes generally tend to be acceptable within digital content. This means you are highly unlikely to get penalised for a quote, especially if it is marked up correctly within the code. Secondly, as mentioned before, there are countermeasures you can put in place to ensure that content that needs to be duplicated, for whatever reason, doesn't incur any penalty in the SERP.

Why is duplicate content a bad thing?

Understanding what constitutes duplicate content is one thing, but understanding why it is bad is something completely different. Why can duplicated content end up with your website being penalised?

When it comes to duplicate content, the answer most SEOs will give is that, if you have more than one page talking about a specific topic, or if your content has been syndicated to another site, then Google won't necessarily know which page to rank. Rather than one page being chosen as the dominant page, search engines will flit between the pages as they tweak the algorithm and each page in turn seems the most relevant, pushing the others out of the SERP. This can create a fluctuating SERP, where none of the pages perform well due to their relation to the other pages.

That said, there are a couple of other, more interesting, technical explanations as to why duplicated content is a bad thing.

Duplicate Content Wastes Authority

It is well known in the world of SEO that authority is an incredibly important ranking factor for helping search engines determine which pages to show. Authority is passed through backlinks; however, it can be diluted or ignored in specific circumstances.

Backlinks aren't easy to get in the SEO world and the last thing you want is for a high quality link to be ignored. Unfortunately, when it comes to duplicated content, if a page is removed from the SERP because it is considered low quality (something that is likely to happen with entirely duplicated pages), then all backlinks leading to that page will be completely wasted. They will pass none of their authority, and that can be a real problem.

Importantly, what this means, and something that will be emphasised when talking about crawl budgets below, is that wasted authority doesn't just affect the page that is duplicated. Wasting authority works against the site as a whole.

Understanding Crawl Budgets

One of the reasons why duplicate content can be such a problem for a site is due to crawl budgets. A crawl budget is the set allowance a website crawler has for searching a site. Generally speaking, it is determined by two different criteria - the crawl capacity limit, and the crawl demand.

One of the elements that impacts crawl demand is the notion of perceived inventory. Perceived inventory is the picture that Googlebot (in particular) builds of your site so that it knows the resources it must spend to crawl it. Duplicated content can inflate that perceived inventory, and this can waste the crawl budget of the crawler. Wasting the crawl budget of the crawler means that certain pages on your site don't get crawled.

Once again, as with wasting authority, this means that duplicated content doesn't just affect the page that has been duplicated but instead it affects the site as a whole. Since it is not possible to control the Google crawl budget, it is not possible to state when it will run out.

How to Fix Duplicate Content

Solution 1: Remove the Duplication

There are a number of core ways of fixing duplicate content. The first is incredibly simple: remove the duplicate so that the content becomes unique. That said, there are occasions where it is not possible to remove a duplication. Those could be:

The duplication exists on another website due to syndicated copy.
The duplication exists due to a product variant. This may be the case when there are various sizes of a clothing item, like a pair of shoes having a page for size 6, size 7, and size 8, despite the shoes being the exact same pair.

In these cases, there are a number of different options, depending on the situation.

Solution 2: Canonical Tag

In most cases, the most suitable solution is a canonical tag. This means implementing a canonical tag on each of the duplicated pages (including the original) that references back to the original. This is done by inserting the following into the <head> of the page:

<link rel="canonical" href="https://www.website.co.uk/original/" />

Here, "https://www.website.co.uk/original/" is the URL of the original page. The tag helps show search engines which page is the original, and passes authority to that original page where appropriate.

It is important to note that canonical tags are the best solution for syndicated content appearing on different sites. Asking the other site to insert a canonical tag on its copy of the page will help ensure that authority gets passed back, that only one version is seen as the original, and that the other versions aren't seen as duplicated.
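If you want to check which URL a page declares as its canonical, something like the following Python sketch could help. It uses only the standard library's html.parser. The class and function names are hypothetical, and it assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical duplicate page pointing back at the original.
duplicate_page = (
    "<html><head>"
    '<link rel="canonical" href="https://www.website.co.uk/original/" />'
    "</head><body>Copied content</body></html>"
)
print(find_canonical(duplicate_page))
```

Running a check like this across each duplicated page is one way to confirm that they all point back to the same original URL.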

Solution 3: Robots.txt

The next solution is arguably the next best in regards to practicality from an SEO perspective, and that is adding the duplicate pages to the Robots.txt of the site. What this will do is add a directive in the Robots.txt telling search engine crawlers not to crawl those pages.
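As a rough illustration, the Python sketch below parses a small Robots.txt and checks whether given URLs may be crawled, using the standard library's urllib.robotparser. The Disallow paths are hypothetical, borrowed from the shoe size example earlier in this article, and the domain is the placeholder used in the canonical tag example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical Robots.txt blocking two duplicate product-variant pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shoes/size-6/
Disallow: /shoes/size-7/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given Robots.txt rules allow user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/shoes/", "/shoes/size-6/", "/shoes/size-8/"):
    url = "https://www.website.co.uk" + path
    print(url, is_crawlable(ROBOTS_TXT, url))
```

Only the two listed variant pages are blocked in this sketch; the main shoes page and any unlisted variant remain crawlable.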

Solution 4: NoIndex

The final solution is to insert a NoIndex meta tag in the duplicated pages. A NoIndex tag looks like the below:

<meta name="robots" content="noindex">

What this does is tell the search engine crawler not to index the page, which keeps it out of the SERP. While this isn't as preferable as a canonical tag, it will help avoid the issue of duplicated content.
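To confirm that a duplicated page actually carries the tag, a small check along the lines of the Python sketch below may be useful. It relies on the standard library's html.parser, the class and function names are hypothetical, and it only looks at the robots meta tag shown above.

```python
from html.parser import HTMLParser


class NoIndexFinder(HTMLParser):
    """Sets noindex to True if a robots meta tag contains the noindex directive."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        # The content attribute may list several comma-separated directives.
        content = attr_map.get("content") or ""
        directives = [part.strip().lower() for part in content.split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = NoIndexFinder()
    finder.feed(html)
    return finder.noindex


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))
```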

SEO, Duplicate Content and Ensuring Your Site Remains Penalty Free

There is no doubt about it - duplicate content can be a nightmare for SEO and can result in some pretty nasty penalties against a site. That said, when dealt with correctly, there is no reason why duplicate content needs to cause long lasting issues.

If you are having any issues with duplicated content or would like to talk to us about our SEO capabilities, then you can find out more about our SEO by clicking on the button below.