Repetition Hinders Conversions: Why And How To Hide Duplicates?

Alexandra Konovalenko
16 January 2021


Want to climb higher in search engine results?

In this article, you will learn how page parameters can impede the promotion of your site. You will also find out what duplicate pages are, analyze the reasons why they appear, and look at ways to hide them from indexing.

You will also hear the opinion of SEO expert Alexander Alaev on duplicate pages and how to neutralize their negative influence on a website.

Identical pages: where do they come from and how can you find them?

Duplicates are pages with different URLs but completely or partially identical content (the latter are known as incomplete duplicates). Dynamic URLs with parameters are the most common cause of duplicates.

Parameters are usually used for: 

  • search (internal site search generates result pages using parameters);

  • tracking of traffic and/or search query sources (UTM tags are typically used in contextual advertising);

  • pagination (content is split into separate pages to make a product catalog easier to browse and faster to load; in this case, page content is fully or partially duplicated);

  • separating different site versions (mobile, language versions);

  • filtering and sorting products in a catalog.

You can find duplicate pages using online tools such as Siteliner and Copyscape, or desktop crawlers like Xenu, Screaming Frog, ComparseR, etc. Duplicate pages almost always have the same title and description.

Multiple URL combinations with duplicate content pose a problem for search engine optimization. Duplicate URLs are especially typical for Ecommerce sites, where the search, sorting, and filtering features contribute to them.

For example: 

URL without parameters: https://www.test.com/bytovaya-tehnika/

URLs with parameters of: 

  • search: https://www.test.com/search?q=ноутбуки;

  • sorting: https://www.test.com/pylesosy/?s=price;

  • filter: https://www.test.com/c/355/clothes-zhenskaya-odezhda/?brands=23673

Partial duplication of pages is also common in the Ecommerce industry, because the product descriptions in a catalog and in the product cards are identical. For this reason, it is better not to display the full product information on catalog pages.

What are the dangers of duplicate pages?

Sites with a high degree of content uniqueness are ranked higher, so repetitive content and meta tags will not improve your positions. Repeated content also hurts site indexing and, as a consequence, its search engine optimization: the search engine cannot determine which page is the duplicate and which is the main one. Because of this, the desired pages may disappear from search results and the duplicates may be shown instead.

  • The pages considered relevant to a query will constantly change, so you end up working against yourself and the site can sink in search results. This is called cannibalization.

  • With a large number of duplicate pages, the site takes longer to index, which also hurts traffic from search. Often, two pages with similar content are indexed only once, and there is no way to make sure the right one is chosen.

  • The number of pages that a search robot plans to crawl on your site is limited. You will be spending this crawl budget not on the pages you really want to appear in search results, but on duplicates.

Of course, simply removing all the pages with parameters is not a solution. Such pages improve the user experience and help make the site better. For example, the search bar increases conversions, while search tracking gives the site owner valuable information about users and how they interact with the content. Filtering and sorting are familiar and very convenient for online shoppers because they help narrow down searches and make them faster. It would be foolish to give up such features.

How to deal with the duplicate pages?

Let's consider the ways that will help reduce the negative impact of duplicate pages on the promotion of the site.

1. Using a canonical tag

If the pages with similar content are indexed as separate pages, it is important to use canonical tags.

The canonical tag is an element added to the page code in the <head> section. It informs search engines that this page is not the main version and points to the location of the main one.

Sometimes in Ecommerce the same product is placed in different categories and the URL is generated from the category. As a consequence, there are several page addresses for the same product, and you need to select the main, or canonical, address. The effect of setting the canonical tag is similar to a 301 redirect, but without the redirect itself: all the authority (weight) and accumulated characteristics of the non-canonical page are transferred to the canonical page. To set it up, add a special tag to the <head> section of the non-canonical page.

For instance, on the product sorting page https://www.test.com/noutbuki/?s=price the tag <link rel="canonical" href="https://www.test.com/noutbuki/"> is placed, which points to the main page (without the ?s=price parameter).

It is worth mentioning that specifying a canonical link without http:// or https:// is an error. The link must be absolute, i.e. include the protocol, the domain and the address itself.
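
To make this concrete, here is a minimal sketch of how such a tag might sit in the <head> of the non-canonical sorting page (using the illustrative test.com URLs from the example above; the title text is hypothetical):

<head>
  <title>Laptops sorted by price</title>
  <!-- correct: an absolute URL that includes the protocol and domain -->
  <link rel="canonical" href="https://www.test.com/noutbuki/" />
  <!-- a link without https://, e.g. href="/noutbuki/", would be an error here -->
</head>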

A canonical tag tells search engines which page you prefer to show in search results. It will not work if the content of the pages differs significantly: the tag is advisory rather than mandatory, so it can be ignored.

Sites that use canonical tags often also have a self-referencing canonical, i.e. a tag that points to the very page it is placed on. This is not a mistake.

For instance, on the page https://www.test.com/pr89849/ the following tag may be specified:
<link rel="canonical" href="https://www.test.com/pr89849/" />

2. Hiding duplicates with robots.txt

Pages can also be hidden from search engine crawlers by using the Disallow: directive in the robots.txt file. This method works best if there are only a few duplicates, or if the duplicates are generated with the same parameters; otherwise the process can take a long time. You need to make changes to the robots.txt file, which governs search engine crawlers and can be found at www.yoursite.com/robots.txt. If the file isn't there, you can use a robots.txt generator or simply create a new .txt file and put the directives in it. Afterwards, save the file as robots.txt and place it in the root directory of the domain (next to the main index.php).

By writing the Disallow directive, you prohibit the indexing of certain pages. The rule below blocks all URLs containing a question mark, i.e. those with parameters:

Disallow: /*?

An example of the www.test.com/robots.txt file:

User-Agent: YadirectBot
Disallow:

User-Agent: YandexDirect
Disallow:

User-agent: *
Disallow: /account
Disallow: /admin
Disallow: /cabinet
Disallow: /company-contacts
Disallow: /context/
Disallow: /error/
Disallow: /feedback/
Disallow: /map/frame_map
Disallow: /opensearch.xml
Disallow: /opinions/create
Disallow: /order_mobile
Disallow: /order_mobile_confirm
Disallow: /order_v2
Disallow: /preview
Disallow: /product_opinion/create
Disallow: /product_view/ajax_
Disallow: /product_view/get_products_for_overlay
Disallow: /company/mark_invalid_phone
Disallow: /redirect
Disallow: /remote
Disallow: /search
Disallow: /shopping_cart
Disallow: /shop_settings/
Disallow: /social_auth/
Disallow: /tracker/
Disallow: /*/shopping_cart
Disallow: /*/partner_links
Disallow: /m*/offers*.html
Disallow: /for-you
Allow: /*?_escaped_fragment_=
Disallow: /*?
Allow: /.well-known/assetlinks.json

Each search engine has its own user agents, and in robots.txt we can specify how each of them should behave on the site. We can also address all crawlers at once using User-agent: *.

Using "Disallow: /" or "Disallow:" prohibits indexing of all pages on the site. Therefore, it is needed to pay attention when working with robots.txt.

3. Placing meta robots and x-robots-tag

There are two other methods that determine the rules for search engine crawler behavior on a page:

  • x-robots-tag (http header);

  • robots (meta tag).

X-Robots-Tag is part of the HTTP header and is mainly used to limit the indexing of non-HTML files, such as PDF documents, pictures, videos, etc., or other resources where a meta tag cannot be placed. However, setting up an x-robots-tag is rather involved, so it is used quite rarely.
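
The goal is for the server's response to include a header such as X-Robots-Tag: noindex, nofollow. As a rough sketch, assuming an Apache server with mod_headers enabled, it can be added for all PDF files like this:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

The server will then return that header with every PDF, and search engines will keep those files out of the index.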

Placing the robots meta tag in the HTML code of a page is much easier. It is used to hide pages from indexing: in <head> you add <meta name="robots" content="noindex" />. This removes the page from the index, or prevents it from getting there in the first place.

The robots meta tag works with the following directives (a combined example follows the list):

  • noindex: do not index,

  • index: index the page (it is not an obligatory directive, because if you do not specify noindex, the page is indexed by default),

  • follow: follow links on page (just like index, it is not a mandatory directive);

  • nofollow: do not follow links on the page,

  • some other, less common options are described in Google Search Central.
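
For illustration, a short sketch of how combined directives might look in a page's <head>:

<!-- do not index this page, but follow the links on it -->
<meta name="robots" content="noindex, follow" />

<!-- do not index the page and do not follow its links -->
<meta name="robots" content="noindex, nofollow" />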

An interesting experiment was described by blogger Igor Bakalov, who tested whether search engines actually take meta robots into account. The results were quite surprising, and you might want to check out his findings.

It is fair to add that usually only the robots.txt file is edited manually; the other methods of managing the indexing of individual pages or the site as a whole are handled from the CMS, so no special knowledge is required.

"Personally, I prefer to use meta robots wherever possible."

Alexander Alaev

head of the web studio “АлаичЪ и Ко” and author of “блог Алаича”


"Since there is more than one method, there is a logical question: which is the best one to use? From my own experience, I can highlight the most reliable way, namely the meta tag robots. Meta tags can be used not only to get rid of duplicates, but in any other case when you want to hide a certain page from search engines. A tag canonical is less universal as it requires to have identical pages while x-robots-tags are challenging even for experts in the field.

In fact, we choose between robots.txt and meta robots. Using robots.txt is quick and easy, although it has a drawback: pages closed this way can still appear in search engine results if there are external links to them (keep in mind that robots.txt blocks the page content from indexing, not the snippet). Meta robots do not have this problem. Although, if we talk specifically about SEO and the impact on it, this "problem" is not really a problem.

Personally, I prefer to use meta robots wherever possible. However, you can choose a simpler and faster way like robots.txt. Again, it does not make any difference for SEO."

Bottom line

You cannot completely avoid duplicate pages caused by page parameters. However, you do need to exclude them from indexing so that you don't lose your ranking in search results. The methods described above will help you do this.

Direct transitions from Multisearch will also be useful. The function works like this: entering a specific search query takes the user directly to the category/brand/filter/product page, bypassing the search results page. This increases the number of pageviews in the catalog rather than within the non-indexed pages of internal search, and it has a positive effect on the site's ranking, since the number of users and the quality of sessions on landing pages are taken into account by Google and Yandex algorithms.


Alexandra Konovalenko
Author of articles for the MultiSearch blog.
