Google Search Console
28 and 30 September, 1, 3 and
What is it
The Google Search Console describes itself as follows:
“Search Console tools and reports help you measure your site's Search traffic and performance, fix issues, and make your site shine in Google Search results.”
Webmasters or website owners can couple their site with the tool by obtaining a file from Google, with a name like google------------.html, where in place of the hyphens there is a long and complicated code. That name is also inside the file. By placing that file in the root directory of the website, so it becomes visible from the outside, webmasters or website owners prove to Google that they have control over the site, so they are authorised to see the performance data.
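As an aside, whether such a file is really reachable from the outside is easy to check with curl, which is essentially what Google does during verification. A sketch, with a made-up placeholder instead of the real file name:

# Hypothetical file name; the real one contains the long code issued by Google.
curl --head https://rudhar.com/google0123456789abcdef.html

If the file is in place, the first response line reports an HTTP 200 status; a 404 here means verification will fail.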
Why so many?
After pressing Start Now, I can go to Indexing, Pages. There I read:
Not indexed 7.22K, 11 reasons, Indexed 959
The number of files in my sitemap (not counting *.mp[34]) is 2139. So why 7.22K? Why so many? And why were only 959 files indexed? The first questions first, the other one later.
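(The sitemap count itself is easy to reproduce. A sketch, assuming a standard XML sitemap in which every address sits in its own <loc> element:

grep -o '<loc>[^<]*</loc>' sitemap.xml | grep -c -v -E '\.mp[34]</loc>'

The first grep lists all addresses, the second counts those that do not end in .mp3 or .mp4.)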
There are so many because Google remembers and keeps trying an insane number of URLs that are obsolete, superfluous, or intentionally redirected, or that in other ways make no sense. There are several categories.
Not found (404)
On 28 September 2025 there were 1118 URLs in this category. Examples:
https://wordpress.rudhar.com/fonetics/barrosnl.htm
https://wordpress.rudhar.com/sfreview/siworin/siworin18.htm
https://aa.aa.rudhar.com/recensie/amydocia.htm
https://aa.aa.rudhar.com/musica/frmusih.htm
I don’t know how and where Google picked these up. As far as I can remember such URLs have not been in public web pages as hyperlinks. But it is true that I used to type such addresses in a browser as a test, when years ago I was experimenting with subdomains and with proper server settings (for Apache, later nginx) to handle them.
It may have had to do with rudhar.wordpress.com, which I still have, but no longer maintain.
Does Google monitor and remember what I type in the address line of my browser? Seems unlikely. Only if I type something that does not qualify as a valid URL, so the browser retries it as a search string for the default search engine.
Anyway, there were times when I had my web server engine silently accept such addresses, so that https://aa.aa.rudhar.com/recensie/amydocia.htm fetched and showed the same content as https://rudhar.com/recensie/amydocia.htm. There were also times when I redirected such addresses to the simpler ones. And recently (for a few years already, I guess) I deem them invalid. Also, my Let’s Encrypt TLS certificate now only allows https://www.rudhar.com/… and https://rudhar.com/…
By querying URLs that start with aa.aa etc., the Googlebot ignores the invalid certificate. Invalid for this type of request, that is. In my opinion it should not.
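(For completeness: such a certificate, covering exactly these two names and no others, is what a Let’s Encrypt client produces when asked for only those names. A sketch, not necessarily the exact command I use:

certbot certonly --nginx -d rudhar.com -d www.rudhar.com

Any other host name, aa.aa.rudhar.com included, then fails certificate validation in a well-behaved client.)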
The reason that I set nginx (or perhaps Apache, when I still used it) to return a ‘404 Not Found’ on any request whose host is not simply www.rudhar.com or rudhar.com, was my hope that Google would delete all such addresses from its list of URLs to try for my site. But it doesn’t. It keeps remembering them, keeps retrying them, and then presents them in a list in the Google Search Console, as a problem that _I_ should fix. For above the list there is this:
“First detected: 4/29/23 Done fixing? VALIDATE FIX”
In reality, in my opinion, this is a problem that Google caused, and that Google should fix. Just stop retrying, and delete them from the list, after consistently receiving a 404 a number of times. Use only the sitemaps that I kindly provide.
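The mechanism itself is trivial. A minimal sketch of such an nginx configuration (paths and details are assumptions here, not my literal setup) is a catch-all server block next to the real one:

# Any request whose Host header is not one of the two canonical names
# ends up in this default block and gets a 404.
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;

    # Some certificate is needed to complete the TLS handshake at all;
    # for foreign host names it will of course not match.
    ssl_certificate     /etc/letsencrypt/live/rudhar.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rudhar.com/privkey.pem;

    return 404;
}

# The real site answers only under the two canonical names.
server {
    listen 443 ssl;
    server_name rudhar.com www.rudhar.com;
    root /var/www/rudhar;   # assumed path
    ssl_certificate     /etc/letsencrypt/live/rudhar.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rudhar.com/privkey.pem;
}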
Pages with redirect
There were 1203 such URLs. Many of them have http:// instead of https://. Yes, I redirect those. Is that a problem? I would think every decent website now does this, after it was Google, of all companies, that strongly promoted that all websites should encrypt all traffic. If Google doesn’t mean this list as a list of problems, then why is there this
“First detected: 4/29/23 Done fixing? VALIDATE FIX”
above the list, here too? As if _I_ caused a problem? No, here too the problem is with Google, not with me. It acts as if it isn’t aware of a now very common convention.
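The redirect itself is as ordinary as it gets; in nginx it is roughly this one block (a sketch, not my literal configuration):

# Port 80: permanently redirect every plain-http request to its https counterpart.
server {
    listen 80;
    server_name rudhar.com www.rudhar.com;
    return 301 https://$host$request_uri;
}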
Under this redirection category I also see a lot of addresses of the type https://aa.aa.rudhar.com/…, last visited in September 2024. They would probably be moved to the 404 list if I ran “validate fix”. But why should I? It makes no sense! The former redirect, and the current 404, are intended behaviour, not errors. Google should be smart enough to understand that, and ignore these harmless situations.
Blocked by robots.txt
There were 1885 addresses in this list when I made my notes on 28 September 2025. Examples:
https://rudhar.com/dcia?cerca=aere?&enia=1&deia=1&esia=1&whwo=1
https://rudhar.com/dcia?cerca=app?el%7B1%2C2%7Dar&iaen=1&ianl=1
They are parametrised dictionary searches, from examples in the manual for my Interlingua dictionary interface, or from discussions on Facebook, Telegram, etc. It makes no sense for a search engine robot to include all those searches. Therefore I blocked them. The blocking was intentional. It is not a problem that I should fix, Google. Please try to understand that. Better to just silently ignore such cases.
A similar case is that of parametrised searches on my own site, using my local search engine Siworin. Some examples:
https://rudhar.com/cgi-bin/s.cgi?q=d3sse
https://rudhar.com/cgi-bin/s.cgi?q=daariver
https://rudhar.com/cgi-bin/s.cgi?q=dataschuren
(This “daariver” is a misspelled Dutch word, it should be ‘daarover’, but it is in a literal quote from somebody else, so as a webmaster, I won’t correct it.)
Here too, visiting these search results makes no sense for a search engine robot, because any pages that could be found with them are also in the sitemaps, and if they aren’t, that is intentional: they have content that should be findable through my own local search engine, but that need not be and should not be in Google’s index. My choice, my decision.
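In robots.txt terms, the blocking amounts to something like this (a sketch; my actual file may word it differently):

User-agent: *
# Parametrised Interlingua dictionary searches: not for crawlers.
Disallow: /dcia
# Siworin site-search results: not for crawlers either.
Disallow: /cgi-bin/s.cgi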
Why so few?
Submitted or known
At the top of the Google Search Console’s menu page, instead of “All known pages” I can also choose “All submitted pages”. Submitted via sitemaps, I suppose. Or mostly that. Indeed that changes
Not indexed 7.22K, 11 reasons, Indexed 959
to
Not indexed 1.26K, 3 reasons, Indexed 853
This data was collected on 28 September 2025, as usual in this article. 1.26K is in fact 1264, to be exact. The total is 2117, and 853 is a little over 40% of it. Why is 60% not indexed? Too many pages on one site? Then why doesn’t Google say so? Or is it too many only when including all those thousands of bogus addresses (see the previous chapters) that Google refuses to forget?
Under “Why pages aren’t indexed”, three reasons are given:
“Duplicate, Google chose different canonical than user”, 4 pages
“Crawled - currently not indexed”, 696 pages
“Discovered - currently not indexed”, 564 pages
Canonical
First the simple one, with only 4 occurrences. That’s clearly a bug. In the past, I had many cases of “Duplicate without user-selected canonical”. The duplicates weren’t due to actual duplicate content, duplicate HTML pages, but due to Google accessing the same page under several URLs. I failed to indicate what should be seen as the canonical address. To improve that, I added this line to my nginx configuration file:
add_header Link "<https://rudhar.com$uri>; rel=\"canonical\"";
As a result, when issuing for example this command (--head may also be written as -I):
curl --head https://rudhar.com/sfreview/sfreview.htm
the headers shown now include this line:
Link: <https://rudhar.com/sfreview/sfreview.htm>; rel="canonical"
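For completeness: the same hint can also be given inside each page, with a link element in the HTML head, like this:

<link rel="canonical" href="https://rudhar.com/sfreview/sfreview.htm">

The HTTP header has the advantage that a single nginx line covers every page on the site, without editing thousands of HTML files.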
In 3 of these 4 cases, Google is stubborn enough to deviate from what I tell
my web server to specify, as follows:
https://rudhar.com/index3.htm => https://rudhar.com/index5.htm
https://rudhar.com/index4.htm => https://rudhar.com/index1.htm
https://rudhar.com/index8.htm => https://rudhar.com/index6.htm
These are real errors. The files are similar, but not the same. There is no valid reason to deviate. What I specify is correct. These are old pages with frames (but one has an iframe), which I keep on the site as historic curiosities. But that’s no reason to make this mistake. Google shouldn’t try to be smarter than me. I myself know best what is going on on my website, having been its sole webmaster for the past 28 years or so.
The fourth case is:
https://rudhar.com/index.htm => https://rudhar.com/
A harmless, but unnecessary change. No further comment. It makes me so tired.
Crawled or discovered
Now back to:
“Crawled - currently not indexed”, 696 pages
“Discovered - currently not indexed”, 564 pages
That could make sense as a temporary situation, lasting a few days or so. But it’s permanent, although the numbers vary somewhat over time. They’re presented as reasons, but they’re not. They are facts. The reasons remain unknown. When pressing “LEARN MORE” I reach a page where, among other things, I find:
Crawled - currently not indexed
The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.
Discovered - currently not indexed
The page was found by Google, but not crawled yet. Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl. This is why the last crawl date is empty on the report.
That’s not true. None of these situations apply.
There is the possibility, via “INSPECT URL”, to request indexing for individual pages. That shouldn’t be necessary, given the presence of proper sitemaps on my site. And it takes a promised “minute or two”, which in reality is about ten seconds. Why so long? My site responds in milliseconds; in the log files I sometimes see badly behaving crawlers fetch 50 pages in one second. The delay is probably meant to discourage such requests.
Recommendation
In the presence of one or more sitemaps, Google should crawl and index only the URLs mentioned there, all of them, and not use any other URLs, whether found on the web or obtained from whatever other origin.
This could be done very efficiently, even for large sites. I purposely provide two complete sitemaps, in two different orders:
- sitemap.xml is sorted by file date. That date is updated with every change of the content, no matter how tiny, so every corrected typo or added remark puts a file at the top of the list. By comparing the URL dates with the Googlebot’s last crawl date, Google could, as soon as it encounters an older URL in this sitemap, skip everything that follows, because those entries are guaranteed to be older still, and so already in Google’s index with their latest content (see the sketch after this list).
- rssfull.xml (an RSS feed) contains the same URLs, but now sorted by the date of first publication. Later small corrections or additions do not alter that date. This can be useful to quickly and efficiently pick up new articles.
For extra efficiency, there are also variants of this that contain only references to articles published in the last two months or the last two weeks. If the Googlebot visits regularly, it need only handle those to stay fully up to date with my site.
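To make the date-sorted idea concrete, here is a small sketch in Python of how a crawler could use sitemap.xml: walk the entries from newest to oldest and stop at the first one that is not newer than the previous crawl. The element names are those of the standard sitemap protocol; the last-crawl date is an assumed example value.

# Sketch: incremental recrawl of a sitemap sorted by <lastmod>, newest first.
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP = "https://rudhar.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed example value; a real crawler would remember its own last visit.
last_crawl = datetime(2025, 9, 28, tzinfo=timezone.utc)

with urllib.request.urlopen(SITEMAP) as f:
    root = ET.parse(f).getroot()

to_fetch = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    if modified.tzinfo is None:        # date-only values: treat as UTC
        modified = modified.replace(tzinfo=timezone.utc)
    if modified <= last_crawl:
        break                          # everything below is older: already indexed
    to_fetch.append(loc)

print(len(to_fetch), "URLs changed since the last crawl")

One pass over a list of 2139 entries, usually stopping after the first few; nothing even remotely expensive.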
Conclusion
As said, Google indexes only about 40% of the files in my website. Almost 60% remains unindexed, and therefore unfindable. Perhaps my site isn’t typical, but if it is, and many other sites also have this problem, it means 60% of the whole World Wide Web is unfindable with Google as your search engine.
What does this do to SEO (Search Engine Optimization)? Using the right terms in the right places, hoping to improve your ranking in Google’s search results, is useless for pages that weren’t indexed in the first place. They won’t be found whatever you do.
Do other search engines do this better, handle this appropriately? I haven’t investigated that yet, and haven’t made up my mind about what to use for my own searches instead of Google. But here’s a list of alternatives: Mojeek, DuckDuckGo, Bing, Yahoo, Ecosia, Qwant, Brave, MetaGer, Kagi.
Embargo lifted
I first published this article under embargo, protected by a username and password. On 5 October 2025 at 13:13B (CEST), in the Google Search Console, I sent feedback to Google, providing the URL and the access credentials, asking them for comments.
They did not respond. I see that as a symptom of what I call the arrogance of success. It reminds me of Microsoft. And of HP, and of IBM, many years ago.
12 October 2025, 11:30B: embargo lifted, password removed, the article is now publicly available.
Copyright © 2025 by R. Harmsen, all rights reserved.