So after a while, I had a clear, machine-readable publication date in all of my web pages. The next step, of course, was to read the data from the HTML files and use it to generate an XML sitemap and RSS files.
Earlier on I had already written code to extract the title from HTML, but in a primitive, quick-and-dirty way that worked only for my own pages, which by personal convention always have the title, including its tags (in lowercase), on a single line, even though HTML syntax doesn’t require that.
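To show how fragile that shortcut is, here is a minimal sketch of the same idea in shell. It is not my actual code: the function name extract_title and the two sample files are made up for this illustration.

```shell
# Quick-and-dirty title extraction: only works when <title>...</title>
# sits on one line, with lowercase tags.
extract_title() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p' "$1" | head -n 1
}

# Two throwaway sample files (hypothetical content).
tmpdir=$(mktemp -d)
printf '<html><head>\n<title>My page</title>\n</head></html>\n' > "$tmpdir/ok.html"
printf '<html><head>\n<title>\nMy page\n</title>\n</head></html>\n' > "$tmpdir/split.html"

extract_title "$tmpdir/ok.html"      # prints: My page
extract_title "$tmpdir/split.html"   # prints nothing: the title spans several lines

rm -rf "$tmpdir"
```

The second file is perfectly valid HTML, yet the one-line pattern misses its title. That is why I wanted a real parser.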
Therefore this time (still by the end of March 2021), I wanted genuine parsing. Somebody must have already written code for that, right? I googled and found this: Parse html using C. It mentions these:
I had trouble installing Gumbo Parser. In hindsight, this was largely because I tried it in a test directory on a vfat volume, created for me by Jetico’s BestCrypt. I chose vfat so that backups made by copying the container files would have a good chance of being decipherable on Windows or Apple Macintosh too.
The installation procedure of Gumbo Parser requires symbolic links. Vfat doesn’t support those. Proper Unix filesystems, like ext3 or ext4, do. So to install Gumbo Parser on a Linux system, use a location somewhere in its own disk volumes.
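Had I known, a quick test like the following would have saved me some time. It is only a sketch: the function name supports_symlinks is my own invention, and all it does is try to create a link in the given directory.

```shell
# Return 0 if a symbolic link can be created in directory $1, else non-zero.
supports_symlinks() {
    dir=$1
    probe="$dir/.symlink-probe-$$"
    if ln -s /dev/null "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Example: check the temporary directory before building there.
if supports_symlinks "${TMPDIR:-/tmp}"; then
    echo "symlinks OK"
else
    echo "no symlinks here, build somewhere else"
fi
```

On a vfat volume the ln -s call fails, so the function returns non-zero and you know to pick another location.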
I also had problems with a C++ compiler being required, although I don’t use C++ and don’t need it. In the end I decided to use Gumbo anyway, as described later in this article. But at that point there were, all in all, too many problems, and I decided to skip to the next candidate.
LexBor was written by Alexander Borisov. I quote from the file README.md: “The lexbor project is being developed using the C language, without dependencies.” Sounds good, that’s how I like it. See also Lexbor.com.
Based on instructions found in the two locations, I tried:
curl -O https://lexbor.com/keys/lexbor_signing.key
sudo apt-key add lexbor_signing.key
That worked. Then:
sudo apt install liblexbor
but that didn’t work. Error message:
E: Unable to locate package liblexbor.
A pity. But I wanted to have an HTML parsing library also on FreeBSD, at least for the time being. There’s no apt there, only pkg. Building from sources seemed the better option. So, again following instructions in the LexBor docs, I tried:
git clone https://github.com/lexbor/lexbor/
cd lexbor
cmake . -DLEXBOR_BUILD_TESTS=ON -DLEXBOR_BUILD_EXAMPLES=ON -DLEXBOR_BUILD_SEPARATELY=ON
There was no cmake on my system, but of course I could sudo apt install cmake.
That worked in late March 2021. Now that I try it again in mid-November, it does not:
CMake Error: Could not find CMAKE_ROOT !!!
CMake has most likely not been installed correctly.
In March, it started a search for a CXX compiler. That probably means C++. Why would that be needed? The project promised to be in pure C. So why not just use cc? I have been using Unix systems on and off since 1985, and there was always a C compiler called cc.
(Note that after installing a fresh Ubuntu Server, it contains no
compilers at all, not even for C. Also the utility
is unavailable. Kinda makes sense, because you may well have a web server that doesn’t require any compilation to install it. But mine does, so: sudo apt install build-essential.)
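Before starting a build on a fresh machine, it helps to see which tools are actually there. A minimal sketch, in which the helper name have_tool and the list of tools are my own choice:

```shell
# Succeeds if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on the tools that the builds in this article asked for.
for tool in cc c++ make cmake; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: missing"
    fi
done
```

Anything reported as missing has to be installed first, or avoided altogether.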
My conclusion about LexBor was, unfortunately: too many problems. If it doesn’t install flawlessly, out of the box, I’m gone, I’ll try something else instead.
So I cloned HTML Tidy, read README/BUILD.md, and followed the instructions. However, it had the same problem as LexBor: it got stuck on not finding CXX. I can’t find that acceptable. An installation has to work flawlessly, and shouldn’t require things that aren’t really needed. Or it should
install those itself, which is the philosophy behind
(Debian, Ubuntu, Mint), and also
24 March 2021. After a good night’s sleep, I decided to give Google’s Gumbo Parser another try, stop being stubborn, and meekly install everything that Gumbo needs: libtool, m4, automake, and even a C++ compiler, namely GNU’s g++.
And I got it to work. That was under FreeBSD. I wrote a little wrapper around Gumbo (see getfhtml.h and getfhtml.c) to make it do what I want it to do. So far, so good.
A few days ago, however, when I wanted to install Gumbo also on Linux Mint, in preparation for eventually moving web hosting from FreeBSD to Ubuntu (mainly because of its release cycle with five years of support), I got a different problem, even though g++ was installed. Error messages:
checking for g++... g++
checking whether the C++ compiler works... no
configure: error: in `/var/www/toolwrk/gumbo-parser':
configure: error: C++ compiler cannot create executables
See `config.log' for more details
And in that config.log file I found:
g++: fatal error: cannot execute 'cc1plus': execvp: No such file or directory
Searching for that message, I found this Stack Overflow page. The suggestions there didn’t help in my case. Moreover, I just find this unacceptable. Like the other libraries I tried, Gumbo is supposed to be written in pure C99. So it shouldn’t force me to install a C++ compiler, probably only for some examples that I don’t need and don’t use. And if it does force me, there shouldn’t be any vague and incomprehensible problems like this latest one.
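For what it’s worth, the symptom itself can be confirmed without running configure. The sketch below relies on the GCC option -print-prog-name, which prints the full path of an internal program, or just echoes the bare name when it can’t find it. The helper name is_installed_prog is made up, and this only diagnoses; it fixes nothing.

```shell
# Succeeds only if $1 is an absolute path to an executable file.
# A bare name such as "cc1plus" (what g++ echoes back when the
# program is not found) is rejected.
is_installed_prog() {
    case $1 in
        /*) [ -x "$1" ] ;;
        *)  return 1 ;;
    esac
}

if command -v g++ >/dev/null 2>&1; then
    path=$(g++ -print-prog-name=cc1plus)
    if is_installed_prog "$path"; then
        echo "cc1plus found: $path"
    else
        echo "g++ is present, but cc1plus is missing"
    fi
else
    echo "g++ is not installed"
fi
```

If this reports cc1plus as missing, the g++ command is there but the actual C++ compiler behind it is not, which matches the message in config.log.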
But I couldn’t give up. I already had my wrapper, and integration with earlier code, and I wanted to keep using that. Then I had this bright idea: from the cloned gumbo-parser directory, I ran
find . -name '*.[ch]'
Result: nearly all C code resides in the directory
Can’t I compile that directly and build the library I need?
Simple answer: yes I can.
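To see at a glance where the sources live, the find output can be summarised per directory. A small sketch, with a made-up function name and a throwaway demo tree instead of the real checkout:

```shell
# Count *.c and *.h files per directory, busiest directory first.
count_c_files_by_dir() {
    find "${1:-.}" -name '*.[ch]' -exec dirname {} \; | sort | uniq -c | sort -rn
}

# Demo on a temporary tree (directory and file names are hypothetical).
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/examples"
touch "$demo/src/a.c" "$demo/src/a.h" "$demo/src/b.c" "$demo/examples/x.c"
count_c_files_by_dir "$demo"
rm -rf "$demo"
```

Run without an argument from inside the cloned directory, it lists each directory with its number of C files, so the one that matters stands out.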
The article Shared libraries with GCC on Linux showed me how to create a shared library; I had never done that before.
This solution makes automake unnecessary too. All that is needed is make, and standard Unix userland utilities. Plain and simple.
So after defining LOCALINC and LOCALLIB, now I do just this:
cc -Wall -fPIC -shared -Wl,-soname -Wl,libgumbo.so.1.0.0 *.c -o libgumbo.so.1.0.0
sudo cp gumbo.h tag_enum.h $LOCALINC
sudo mv libgumbo.so.1.0.0 $LOCALLIB
sudo ln -s libgumbo.so.1.0.0 $LOCALLIB/libgumbo.so.1
sudo ln -s libgumbo.so.1.0.0 $LOCALLIB/libgumbo.so
# ldconfig updates the cache, so the new library will be found
sudo ldconfig
However, seeing Soname in Wikipedia, I might better have used libtool (of which: libtoolize) anyway. I don’t know. What I have works. I’ll leave it at that.
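As far as I understand the usual convention, a shared library goes by three names: the real file name with the full version, the soname with only the major version, and the linker name without any version. The sketch below just prints those three names. The function lib_names is hypothetical, and note that my own command above passes the full version as the soname, which works for me but differs from this convention.

```shell
# Print the three conventional names of a shared library.
# $1 = base name (e.g. libgumbo), $2 = full version (e.g. 1.0.0)
lib_names() {
    base=$1
    version=$2
    major=${version%%.*}          # everything before the first dot
    echo "real=$base.so.$version"
    echo "soname=$base.so.$major"
    echo "linker=$base.so"
}

lib_names libgumbo 1.0.0
# prints:
# real=libgumbo.so.1.0.0
# soname=libgumbo.so.1
# linker=libgumbo.so
```

The two symbolic links in my installation steps correspond to the second and third of these names.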
Update 14 November, while installing
and seeing what it does: the SONAME can be put in by passing -Wl options to gcc, which it will pass on to the linker, ld. I added those in the above. Still no