

Want to climb higher in search engine results?
In this article, you will learn how page parameters can impede the promotion of your site. You will also discover what duplicate pages are, why they appear, and how to hide them from indexing.
You will also hear the opinion of SEO expert Alexander Alaev on duplicate pages and how to neutralize their negative influence on a website.
Identical pages: where do they come from and how can you find them?
Duplicates are pages with different URLs but completely or partially identical content (the latter are also called partial duplicates). Dynamic URLs with parameters are the most common cause of duplicates.
Parameters are usually used for:
- search (the site's internal search generates result pages using parameters);
- tracking the sources of traffic and/or search queries (UTM tags are typically used in contextual advertising);
- pagination (content is split into separate pages to make a product catalog easier to browse and faster to load; in this case the pages duplicate each other fully or partially);
- separating different versions of the site (mobile, language);
- filtering and sorting products in a catalog.
You can find duplicate pages with online tools such as Siteliner and Copyscape, or with desktop crawlers like Xenu, Screaming Frog, ComparseR, etc. Duplicate pages almost always share the same title and description.
Multiple URL combinations with duplicate content are a problem for SEO. Parameter-based duplicates are especially typical of e-commerce sites, where search, sorting, and filtering features all contribute to them.
For example:
URL without parameters: https://www.test.com/bytovaya-tehnika/
URLs with parameters:
- search: https://www.test.com/search?q=ноутбуки;
- sorting: https://www.test.com/pylesosy/?s=price;
- filter: https://www.test.com/c/355/clothes-zhenskaya-odezhda/?brands=23673
Partial duplication is also common in e-commerce, because the product descriptions in a catalog listing and on the product cards are identical. For this reason, it is better not to display the full product information on catalog pages.
What are the dangers of duplicate pages?
Sites with highly unique content get priority in ranking; repetitive content and meta tags will not improve it. Repeated content hurts site indexing and, as a consequence, search engine optimization: the search engine cannot determine which page is the duplicate and which is the main one, so the pages you want may disappear from the results and the duplicates may be shown instead.
- The page considered relevant to a query may constantly change, so you end up working against yourself in the algorithm and the site can sink in search results. This is called cannibalization.
- With a large number of duplicates, indexing the site takes longer, which also costs you traffic from search. Often only one of two pages with similar content gets indexed, and there is no way to make sure it is the right one.
- The number of pages a search robot will crawl on your site is limited, so you end up spending your crawl budget on duplicate pages instead of the pages you really want to appear in search engine results.
Of course, simply removing all pages with parameters is not a solution. Such pages improve the user experience and help make the site better. For example, the search bar increases conversions, while search tracking gives the site owner valuable information about users and how they interact with the content. Filtering and sorting options are familiar and very convenient for online shoppers because they help narrow down searches and make them faster. It would be foolish to give up such features.
How to deal with duplicate pages?
Let's consider methods that help reduce the negative impact of duplicate pages on the promotion of a site.
1. Using a canonical tag
If pages with similar content are indexed as separate pages, it is important to use canonical tags.
The canonical tag is an element added to a page's code inside the <head> section. It tells search engines that this page is not the main version and points to the location of the main page.
In e-commerce, the same product is sometimes placed in different categories, and the URL is generated from the category. As a consequence, the same product has several addresses, and you need to select the main, or canonical, one. Setting a canonical tag has a result similar to a 301 redirect, but without the redirect: all the authority (weight) and accumulated signals of the non-canonical page are transferred to the canonical page. To do this, you add a special piece of code to the <head> section of the duplicate page.
It is worth mentioning that specifying a canonical link without http:// or https:// is an error. The link must be absolute, i.e. include the protocol, the domain and the address itself.
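For example, the sorted listing https://www.test.com/pylesosy/?s=price from the examples above could declare the unsorted catalog page as its canonical version. This is only an illustrative sketch using the sample test.com addresses, not a ready-made configuration:
<link rel="canonical" href="https://www.test.com/pylesosy/" />
The tag is placed in the <head> of the duplicate page, and the href contains the full absolute URL, including the protocol and the domain.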
A canonical tag tells search engines which page to prefer in search results. It will not work if the content of the pages differs significantly: the tag can be ignored because it is advisory in nature, not mandatory.
Sites that use canonical tags often place a self-referencing canonical tag, i.e. one that points to the very page it is located on. This is not a mistake.
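A self-referencing tag on the main catalog page from the sketch above would simply point back to its own address:
<link rel="canonical" href="https://www.test.com/pylesosy/" />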
2. Hiding duplicates with robots.txt
Pages can also be hidden from search engine crawlers by using the Disallow directive in your robots.txt file. This method works best when there are only a few duplicates, or when the duplicates are generated by the same parameters; otherwise the process can take a long time. You need to edit the robots.txt file, which governs search engine crawlers. The file can be found at www.yoursite.com/robots.txt. If it isn't there, you can use a robots.txt generator or simply create a new .txt file and put the directives in it. The file should then be saved as robots.txt and placed in the root directory of the domain (next to the main index.php).
The Disallow directive prohibits the indexing of certain pages. The rule below blocks all URLs containing a question mark, i.e. those with parameters:
Disallow: /*?
Example file: www.test.com/robots.txt
Each search engine has its own user agents, and robots.txt lets you specify how each user agent should behave on the site. You can also address all crawlers at once with User-agent: *.
Note that "Disallow: /" blocks the entire site, while an empty "Disallow:" allows everything, so you need to be careful when working with robots.txt.
3. Placing meta robots and x-robots-tag
There are two other methods that determine the rules for search engine crawlers' behavior on a page:
- x-robots-tag (an HTTP header);
- robots (a meta tag).
X-Robots-Tag is part of the HTTP response header and is mainly used to limit the indexing of non-HTML files: PDF documents, pictures, videos, and other resources into which a meta tag cannot be placed. However, the procedure for setting an X-Robots-Tag is more complicated, so it is used quite rarely.
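As a sketch, a server response for a PDF file carrying this header could look like the following (how the header is actually added depends on your web server configuration):
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow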
Placing the robots meta tag in the HTML code of a page is much easier. It is used to hide pages from indexing: add <meta name="robots" content="noindex" /> inside <head>. This removes an already indexed page from the index and prevents it from being included again.
The meta robots tag works with the following directives (a combined example follows the list):
- noindex: do not index the page;
- index: index the page (not a mandatory directive, because if noindex is not specified, the page is indexed by default);
- follow: follow the links on the page (like index, not a mandatory directive);
- nofollow: do not follow the links on the page;
- some other, less common options are described in the Google Search Central documentation.
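For example, a duplicate page that should stay out of the index while its links are still crawled could combine directives like this (a common pattern, shown here as an illustration):
<meta name="robots" content="noindex, follow" />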
Blogger Igor Bakalov described an interesting experiment in which he tested whether search engines actually take the meta robots tag into account. The results were quite surprising, and you may want to check out his write-up.
It is fair to add that usually only the robots.txt file is edited manually; the other methods of managing the indexing of individual pages or the site as a whole are handled from the CMS and require no special knowledge.
"Personally, I prefer to use meta robots wherever possible."
Bottom line
You cannot completely avoid duplicate pages caused by page parameters. However, you do need to exclude them from indexing so that you don't lose ranking in search results. The methods described above will help you do this.
Direct transitions from Multisearch will also be useful. The feature works like this: entering a specific search query takes the user directly to the relevant category/brand/filter/product page, bypassing the search results page. This shifts pageviews to the catalog rather than to the non-indexed pages of the internal search, which has a positive effect on the site's ranking, since Google and Yandex algorithms take into account the number of users and the quality of sessions on landing pages.