What Is Duplicate Content?
Duplicate content is content that is an exact copy of content found elsewhere. However, the term can also refer to near-identical content, such as pages that differ only by a product, brand, or location name.
Simply swapping out a few words will not necessarily save a page from being deemed duplicate content, and your organic search performance can suffer as a result.
Duplicate content also refers to content that is the same across multiple webpages on your site or across two or more separate sites. However, there are many methods to prevent or minimize the impact of duplicate content that can be handled by technical fixes.
In this guide, I’ll look deeper into the causes of duplicate content, the best ways to avoid it, and how to make sure competitors can’t copy your content and claim that they are the original creator.
The Impact of Duplicate Content
Pages created with duplicate content can lead to several problems in Google Search results and, occasionally, even penalties. The most common duplicate content issues include:
The wrong version of pages showing in SERPs
Key pages unexpectedly not performing well in SERPs or experiencing indexing problems
Fluctuations or decreases in core site metrics (traffic, rank positions, or E-A-T criteria)
Other unexpected actions by search engines as a result of confusing prioritization signals
Although no one is sure which elements of content will be prioritized and deprioritized by Google, the search engine giant has always advised webmasters and content creators to ‘make pages primarily for users, not for search engines.’
With this in mind, the starting point for any webmaster or SEO should be to create unique content that brings unique value to users. However, this is not always easy or even possible. Factors such as templating content, search functionality, UTM tags, sharing of information, or syndicating content can be fraught with the risk of duplication.
Ensuring that your own site does not run the risk of duplication of content entails a combination of a clear architecture, regular maintenance, and technical understanding to combat the creation of duplicate content as much as possible.
Methods to Prevent Duplicate Content
There are many different methods and strategies to prevent the creation of duplicate content on your own site and to prevent other sites from benefiting from copying your content:
As a starting point, it is wise to take a general look at your site’s taxonomy. Whether you have a new, existing, or revised site, mapping out the pages from a crawl and assigning each a unique H1 and focus keyword is a great start. Organizing your content into topic clusters can help you develop a thoughtful strategy that limits duplication.
Possibly the most important element in combating duplication of content, whether on your own site or across multiple sites, is the canonical tag.
The rel=canonical element is a snippet of HTML code that makes clear to Google that the publisher owns a piece of content, even when that content can be found elsewhere. These tags tell Google which version of a page is the ‘main version.’
The canonical tag can be used for print vs. web versions of content, mobile and desktop page versions, or multiple location targeting pages. It can be used for any other instances where duplicate pages exist that stem from the main version page, too.
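As an illustration, the canonical tag is placed in the page’s `<head>`. A minimal sketch, using example.com as a placeholder domain, for a print version pointing back to the main web version:

```html
<!-- On the print version (https://example.com/guide/print),
     declaring the main web version as the canonical page -->
<link rel="canonical" href="https://example.com/guide/" />
```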
There are two types of canonical tags: those that point to another page and those that point to the page itself. A canonical that points to another page tells search engines that the other version is the ‘master version.’
A canonical that points to the page it sits on is known as a self-referencing canonical tag. Referencing canonicals are an essential part of recognizing and eliminating duplicate content, while self-referencing canonicals are a matter of good practice.
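A self-referencing canonical uses the same element but points at the page’s own URL (again, example.com is a placeholder):

```html
<!-- On https://example.com/guide/ itself: a self-referencing canonical -->
<link rel="canonical" href="https://example.com/guide/" />
```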
Another useful technical item to check when analyzing the risk of duplicate content on your site is the meta robots tag, along with the signals your pages are currently sending to search engines.
Meta robots tags are useful if you want to exclude a certain page, or pages, from being indexed by Google and would prefer them not to show in search results.
By adding the ‘noindex’ meta robots tag to the HTML code of a page, you effectively tell Google you don’t want it shown in SERPs. This method is preferred over robots.txt blocking because it allows granular control over a particular page or file, whereas robots.txt rules typically operate at a larger scale.
Although this instruction can be given for many reasons, Google will understand this directive and should exclude the duplicate pages from SERPs.
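A minimal example of the tag, placed in the `<head>` of the page you want excluded:

```html
<!-- Tells search engines not to include this page in their index -->
<meta name="robots" content="noindex" />
```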
URL parameters tell search engines how to crawl a site effectively and efficiently, but they are also a frequent cause of content duplication because their usage creates copies of a page. For example, if parameters generate several different URLs for the same product page, Google may deem them duplicate content.
However, parameter handling facilitates more effective and efficient crawling, and resolving parameter-driven duplication is straightforward. Particularly for larger sites and sites with integrated search functionality, it is important to manage parameters through Google Search Console and Bing Webmaster Tools.
By flagging parameterized pages in the respective tool, you make clear to the search engine that these pages should not be crawled and what additional action, if any, to take.
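To illustrate with hypothetical URLs, all of the following might serve the same product page, yet each is a distinct URL to a search engine:

```
https://example.com/shoes/
https://example.com/shoes/?sort=price
https://example.com/shoes/?utm_source=newsletter&utm_medium=email
```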
Several structural URL elements can cause duplication issues on a website. Many of these are caused because of the way search engines perceive URLs. If there are no other directives or instructions, a different URL will always mean a different page.
This lack of clarity, or unintentional wrong signaling, can cause fluctuations or decreases in core site metrics (traffic, rank positions, or E-A-T criteria) if not addressed. As already covered, URL parameters created by search functionality, tracking codes, and other third-party elements can cause multiple versions of a page to exist.
The most common ways that duplicate versions of URLs can occur include: HTTP and HTTPS versions of pages, www. and non-www., and pages with trailing slashes and those without.
In the case of www vs. non-www and trailing slash vs. non-trailing slash, identify the version most commonly used on your site and stick to it on all pages to avoid the risk of duplication. Then set up redirects to the version of the page that should be indexed, e.g., mysite.com > www.mysite.com.
HTTP URLs, on the other hand, also represent a security issue: the HTTPS version of the page uses encryption (SSL), making the page secure, so traffic should be redirected from HTTP to HTTPS.
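The www and HTTPS consolidation described above can be implemented as server-level 301 redirects. A minimal sketch assuming Apache with mod_rewrite enabled (the domain is a placeholder; nginx and other servers use different syntax):

```apacheconf
# .htaccess sketch: force HTTPS and the www version via 301 redirects
RewriteEngine On

# HTTP -> HTTPS
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# non-www -> www
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```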
Redirects are very useful for eliminating duplicate content. Pages duplicated from another can be redirected and fed back to the main version of the page.
Where there are pages on your site with high volumes of traffic or link value that are duplicated from another page, redirects may be a viable option to address the problem.
When using redirects to remove duplicate content, remember two things: always redirect to the higher-performing page to limit the impact on your site’s performance, and, where possible, use a 301 (permanent) redirect. If you want more information on which redirects to implement, check out our guide to 301 redirects.
What If My Content Has Been Copied Against My Will?
What should you do if your content has been copied and you have not used a canonical tag to signify that your content is the original?
Use Search Console to identify how regularly your site is being indexed.
Contact the webmaster responsible for the site that has copied your content and ask for attribution or removal.
Use self-referencing canonical tags on all new content created to ensure that your content is recognized as the ‘true source’ of the information.
Duplicate Content Review
Avoiding duplicate content starts with creating unique, quality content for your site; however, preventing others from copying you can be more complex. The safest way to avoid duplicate content issues is to think carefully about site structure and focus on your users and their journeys on site. When content duplication occurs due to technical factors, the tactics covered above should alleviate the risk to your site.
When considering the risks of duplicate content, it is important to send the right signals to Google to mark your content as the original source. This is true especially if your content is syndicated or you have found your content has been replicated by other sources previously.
Depending on how the duplication has occurred, you may employ one or several of these tactics to establish your content as the original source and have other versions recognized as duplicates.