Google Search Console
28 and 30 September, 1, 3 and
What is it
The Google Search Console describes itself as follows:
“Search Console tools and reports help you measure your site's Search traffic and performance, fix issues, and make your site shine in Google Search results.”
Webmasters or website owners can couple their site with the tool by obtaining a file from Google, with a name like google------------.html, where in place of the hyphens there is a long and complicated code. That name is also inside the file. By placing that file in the root directory of the website, so it becomes visible from the outside, webmasters or website owners prove to Google that they have control over the site, so they are authorised to see the performance data.
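As an aside, whether such a file is really reachable from the outside is easy to check with curl, which is essentially what Google does during verification. A sketch, with a made-up placeholder instead of the real file name:

# Hypothetical file name; the real one contains the long code issued by Google.
curl --head https://rudhar.com/google0123456789abcdef.html

If the file is in place, the first response line reports an HTTP 200 status; a 404 here means verification will fail.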
Why so many?
After pressing Start Now, I can go to Indexing, Pages. There I read:
Not indexed 7.22K, 11 reasons, Indexed 959
The number of files in my sitemap (not counting *.mp[34]) is 2139. So why 7.22K? Why so many? And why were only 959 files indexed? The first questions first, the other one later.
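(The sitemap count itself is easy to reproduce. A sketch, assuming a standard XML sitemap in which every address sits in its own <loc> element:

grep -o '<loc>[^<]*</loc>' sitemap.xml | grep -c -v -E '\.mp[34]</loc>'

The first grep lists all addresses, the second counts those that do not end in .mp3 or .mp4.)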
There are so many because Google remembers and keeps trying an insane number of URLs that are obsolete, superfluous, or intentionally redirected, or that in other ways make no sense. There are several categories.
Not found (404)
On 28 September 2025 there were 1118 URLs in this category. Examples:
https://wordpress.rudhar.com/fonetics/barrosnl.htm
https://wordpress.rudhar.com/sfreview/siworin/siworin18.htm
https://aa.aa.rudhar.com/recensie/amydocia.htm
https://aa.aa.rudhar.com/musica/frmusih.htm
I don’t know how and where Google picked these up. As far as I can remember such URLs have not been in public web pages as hyperlinks. But it is true that I used to type such addresses in a browser as a test, when years ago I was experimenting with subdomains and with proper server settings (for Apache, later nginx) to handle them.
It may have had to do with rudhar.wordpress.com, which I still have, but no longer maintain.
Does Google monitor and remember what I type in the address line of my browser? Seems unlikely. Only if I type something that does not qualify as a valid URL, so the browser retries it as a search string for the default search engine.
Anyway, there were times when I had my web server engine silently accept such addresses, so that https://aa.aa.rudhar.com/recensie/amydocia.htm fetched and showed the same content as https://rudhar.com/recensie/amydocia.htm. There were also times when I redirected such addresses to the simpler ones. And recently (for a few years already, I guess) I deem them invalid. Also, my Let’s Encrypt TLS certificate now only allows https://www.rudhar.com/… and https://rudhar.com/…
By querying URLs that start with aa.aa etc., the Googlebot ignores the invalid certificate. Invalid for this type of request, that is. In my opinion it should not.
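(For completeness: such a certificate, covering exactly these two names and no others, is what a Let’s Encrypt client produces when asked for only those names. A sketch, not necessarily the exact command I use:

certbot certonly --nginx -d rudhar.com -d www.rudhar.com

Any other host name, aa.aa.rudhar.com included, then fails certificate validation in a well-behaved client.)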
The reason that I set nginx (or perhaps Apache, when I still used it) to return a ‘404 Not Found’ on any request whose host is not simply www.rudhar.com or rudhar.com, was my hope that Google would delete all such addresses from its list of URLs to try for my site. But it doesn’t. It keeps remembering them, keeps retrying them, and then presents them in a list in the Google Search Console, as a problem that _I_ should fix. For above the list there is this:
“First detected: 4/29/23 Done fixing? VALIDATE FIX”
In reality, in my opinion, this is a problem that Google caused, and that Google should fix. Just stop retrying, and delete them from the list, after consistently receiving a 404 a number of times. Use only the sitemaps that I kindly provide.
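The mechanism itself is trivial. A minimal sketch of such an nginx configuration (paths and details are assumptions here, not my literal setup) is a catch-all server block next to the real one:

# Any request whose Host header is not one of the two canonical names
# ends up in this default block and gets a 404.
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;

    # Some certificate is needed to complete the TLS handshake at all;
    # for foreign host names it will of course not match.
    ssl_certificate     /etc/letsencrypt/live/rudhar.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rudhar.com/privkey.pem;

    return 404;
}

# The real site answers only under the two canonical names.
server {
    listen 443 ssl;
    server_name rudhar.com www.rudhar.com;
    root /var/www/rudhar;   # assumed path
    ssl_certificate     /etc/letsencrypt/live/rudhar.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rudhar.com/privkey.pem;
}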
Pages with redirect
There were 1203 such URLs. Many of them have http:// instead of https://. Yes, I redirect those. Is that a problem? I would think every decent website now does this, after it was Google, of all companies, that strongly promoted that all websites should encrypt all traffic. If Google doesn’t mean this list as a list of problems, then why is there this
“First detected: 4/29/23 Done fixing? VALIDATE FIX”
above the list, here too? As if _I_ caused a problem? No, here too the problem is with Google, not with me. It acts as if it isn’t aware of a now very common convention.
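The redirect itself is as ordinary as it gets; in nginx it is roughly this one block (a sketch, not my literal configuration):

# Port 80: permanently redirect every plain-http request to its https counterpart.
server {
    listen 80;
    server_name rudhar.com www.rudhar.com;
    return 301 https://$host$request_uri;
}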
Under this redirection category I also see a lot of addresses of the type https://aa.aa.rudhar.com/…, last visited in September 2024. They would probably be moved to the 404 list if I ran “validate fix”. But why should I? It makes no sense! The former redirect, and the current 404, are intended behaviour, not errors. Google should be smart enough to understand that, and ignore these harmless situations.
Blocked by robots.txt
There were 1885 addresses in this list when I made my notes on 28 September 2025. Examples:
https://rudhar.com/dcia?cerca=aere?&enia=1&deia=1&esia=1&whwo=1
https://rudhar.com/dcia?cerca=app?el%7B1%2C2%7Dar&iaen=1&ianl=1
They are parametrised dictionary searches, from examples in the manual for my Interlingua dictionary interface, or from discussions on Facebook, Telegram, etc. It makes no sense for a search engine robot to include all those searches. Therefore I blocked them. The blocking was intentional. It is not a problem that I should fix, Google. Please try to understand that. Better to just silently ignore such cases.
A similar case is that of parametrised searches on my own site, using my local search engine Siworin. Some examples:
https://rudhar.com/cgi-bin/s.cgi?q=d3sse
https://rudhar.com/cgi-bin/s.cgi?q=daariver
https://rudhar.com/cgi-bin/s.cgi?q=dataschuren
(This “daariver” is a misspelled Dutch word, it should be ‘daarover’, but it is in a literal quote from somebody else, so as a webmaster, I won’t correct it.)
Here too, visiting these search results makes no sense for a search engine robot, because any pages that could be found with them are also in the sitemaps, and if they aren’t, that is intentional: they have content that should be findable through my own local search engine, but that need not be and should not be in Google’s index. My choice, my decision.
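In robots.txt terms, the blocking amounts to something like this (a sketch; my actual file may word it differently):

User-agent: *
# Parametrised Interlingua dictionary searches: not for crawlers.
Disallow: /dcia
# Siworin site-search results: not for crawlers either.
Disallow: /cgi-bin/s.cgi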
Why so few?
Submitted or known
At the top of the Google Search Console’s menu page, instead of “All known pages” I can also choose “All submitted pages”. Submitted via sitemaps, I suppose. Or mostly that. Indeed that changes
Not indexed 7.22K, 11 reasons, Indexed 959
to
Not indexed 1.26K, 3 reasons, Indexed 853
This data was collected on 28 September 2025, as usual in this article. 1.26K is in fact 1264, to be exact. The total is 2117, and 853 is a little over 40% of it. Why is 60% not indexed? Too many pages on one site? Then why doesn’t Google say so? Or is it too many only when including all those thousands of bogus addresses (see the previous chapters) that Google refuses to forget?
Under “Why pages aren’t indexed”, three reasons are given:
“Duplicate, Google chose different canonical than user”, 4 pages
“Crawled - currently not indexed”, 696 pages
“Discovered - currently not indexed”, 564 pages
Canonical
First the simple one, with only 4 occurrences. That’s clearly a bug. In the past, I had many cases of “Duplicate without user-selected canonical”. The duplicates weren’t due to actual duplicate content, duplicate HTML pages, but due to Google accessing the same page under several URLs. I failed to indicate what should be seen as the canonical address. To improve that, I added this line to my nginx configuration file:
add_header Link "<https://rudhar.com$uri>; rel=\"canonical\"";
As a result, when issuing for example this command (--head may also be written as -I):
curl --head https://rudhar.com/sfreview/sfreview.htm
the headers shown now include this line:
Link: <https://rudhar.com/sfreview/sfreview.htm>; rel="canonical"
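For completeness: the same hint can also be given inside each page, with a link element in the HTML head, like this:

<link rel="canonical" href="https://rudhar.com/sfreview/sfreview.htm">

The HTTP header has the advantage that a single nginx line covers every page on the site, without editing thousands of HTML files.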
In 3 of these 4 cases, Google is stubborn enough to deviate from what I tell
my web server to specify, as follows:
https://rudhar.com/index3.htm => https://rudhar.com/index5.htm
https://rudhar.com/index4.htm => https://rudhar.com/index1.htm
https://rudhar.com/index8.htm => https://rudhar.com/index6.htm
These are real errors. The files are similar, but not the same. There is no valid reason to deviate. What I specify is correct. These are old pages with frames (but one has an iframe), which I keep on the site as historic curiosities. But that’s no reason to make this mistake. Google shouldn’t try to be smarter than me. I myself know best what is going on on my website, having been its sole webmaster for the past 28 years or so.
The fourth case is:
https://rudhar.com/index.htm => https://rudhar.com/
A harmless, but unnecessary change. No further comment. It makes me so tired.
Crawled or discovered
Now back to:
“Crawled - currently not indexed”, 696 pages
“Discovered - currently not indexed”, 564 pages
That could make sense as a temporary situation, lasting a few days or so. But it’s permanent, although the numbers vary somewhat over time. They’re presented as reasons, but they’re not. They are facts. The reasons remain unknown. When pressing “LEARN MORE” I reach a page where, among other things, I find:
Crawled - currently not indexed
The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.
Discovered - currently not indexed
The page was found by Google, but not crawled yet. Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl. This is why the last crawl date is empty on the report.
That’s not true. None of these situations apply.
There is the possibility, via “INSPECT URL”, to request indexing for individual pages. That shouldn’t be necessary, given the presence of proper sitemaps on my site. And it takes a promised “minute or two”, which in reality is about ten seconds. Why so long? My site responds in milliseconds; in the log files I sometimes see badly behaving crawlers fetch 50 pages in one second. The delay is probably meant to discourage such requests.
Recommendation
In the presence of one or more sitemaps, Google should crawl and index only the URLs mentioned there, all of them, and not use any other URLs, whether found on the web or obtained from whatever other origin.
This could be done very efficiently, even for large sites. I purposely provide two complete sitemaps, in two different orders:
- sitemap.xml is sorted by file date. That date is updated with every change of the content, no matter how tiny, so every corrected typo or added remark puts a file at the top of the list. By comparing the URL dates with the Googlebot’s last crawl date, Google could, as soon as it encounters an older URL in this sitemap, skip everything that follows, because those entries are guaranteed to be older still, and so already in Google’s index with their latest content (see the sketch after this list).
- rssfull.xml (an RSS feed) contains the same URLs, but now sorted by the date of first publication. Later small corrections or additions do not alter that date. This can be useful to quickly and efficiently pick up new articles.
For extra efficiency, there are also variants of this that contain only references to articles published in the last two months or the last two weeks. If the Googlebot visits regularly, it need only handle those to stay fully up to date with my site.
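To make the date-sorted idea concrete, here is a small sketch in Python of how a crawler could use sitemap.xml: walk the entries from newest to oldest and stop at the first one that is not newer than the previous crawl. The element names are those of the standard sitemap protocol; the last-crawl date is an assumed example value.

# Sketch: incremental recrawl of a sitemap sorted by <lastmod>, newest first.
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP = "https://rudhar.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed example value; a real crawler would remember its own last visit.
last_crawl = datetime(2025, 9, 28, tzinfo=timezone.utc)

with urllib.request.urlopen(SITEMAP) as f:
    root = ET.parse(f).getroot()

to_fetch = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    if modified.tzinfo is None:        # date-only values: treat as UTC
        modified = modified.replace(tzinfo=timezone.utc)
    if modified <= last_crawl:
        break                          # everything below is older: already indexed
    to_fetch.append(loc)

print(len(to_fetch), "URLs changed since the last crawl")

One pass over a list of 2139 entries, usually stopping after the first few; nothing even remotely expensive.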
Conclusion
As said, Google indexes only about 40% of the files in my website. Almost 60% remains unindexed, and therefore unfindable. Perhaps my site isn’t typical, but if it is, and many other sites also have this problem, it means 60% of the whole World Wide Web is unfindable with Google as your search engine.
What does this do to SEO (Search Engine Optimization)? Using the right terms in the right places, hoping to improve your ranking in Google’s search results, is useless for pages that weren’t indexed in the first place. They won’t be found whatever you do.
Do other search engines do this better, handle this appropriately? I haven’t investigated that yet, and haven’t made up my mind about what to use for my own searches instead of Google. But here’s a list of alternatives: Mojeek, DuckDuckGo, Bing, Yahoo, Ecosia, Qwant, Brave, MetaGer, Kagi.
Embargo lifted
I first published this article under embargo, protected by a username and password. On 5 October 2025 at 13:13B (CEST), in the Google Search Console, I sent feedback to Google, providing the URL and the access credentials, asking them for comments.
They did not respond. I see that as a symptom of what I call the arrogance of success. It reminds me of Microsoft. And of HP, and of IBM, many years ago.
12 October 2025, 11:30B: embargo lifted, password removed, the article is now publicly available.
Copyright © 2025 by R. Harmsen, all rights reserved.