So after a while, I had a clear, machine-readable publication date in all of my web pages. The next step, of course, was to read the data from the HTML files and use it to generate an XML sitemap and RSS files.
Earlier on I had already written code to extract the title from HTML, but in a primitive, quick-and-dirty way that worked only for my own pages. By personal convention, those have the whole title, including the tags (in lowercase), on a single line, even though HTML syntax doesn’t require that.
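For illustration, a single-line shortcut of this kind might look roughly like the sketch below. It is not the code actually used for these pages; the function name title_from_line is made up, and the sketch assumes lowercase tags without attributes, both on the same line.

```c
#include <stddef.h>
#include <string.h>

/* Naive title extraction (hypothetical sketch): succeeds only when
 * "<title>" and "</title>", in lowercase and without attributes,
 * occur on the same line.
 * Returns 1 and copies the title into out, or returns 0. */
int title_from_line(const char *line, char *out, size_t outsz)
{
    static const char open_tag[] = "<title>";
    static const char close_tag[] = "</title>";

    if (outsz == 0)
        return 0;

    const char *start = strstr(line, open_tag);
    if (start == NULL)
        return 0;                      /* no opening tag on this line */
    start += sizeof open_tag - 1;      /* skip past "<title>" */

    const char *end = strstr(start, close_tag);
    if (end == NULL)
        return 0;                      /* closing tag is on another line */

    size_t len = (size_t)(end - start);
    if (len >= outsz)
        len = outsz - 1;               /* truncate rather than overflow */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Fed one line at a time, this finds the title in a line like <title>My page</title>, but it fails as soon as the tags are in uppercase, carry attributes, or are spread over several lines. All of those are valid HTML, which is why a real parser is the safer choice.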
Therefore this time (still by the end of March 2021), I wanted genuine parsing. Somebody must have already written code for that, right? I googled and found this: Parse html using C. It mentions the libraries I tried below: Google’s Gumbo Parser, LexBor and HTML Tidy.
I had trouble installing Gumbo Parser. In hindsight, this was largely because I tried it in a test directory on a vfat volume, created for me by Jetico’s BestCrypt. I chose vfat so that backups made by copying the container files would have a good chance of also being readable on Windows or Apple Macintosh.
The installation procedure of Gumbo Parser requires symbolic links. Vfat doesn’t support those. Proper Unix filesystems, like ext3 or ext4, do. So to install Gumbo Parser on a Linux system, use a location somewhere in its own disk volumes.
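Whether a given directory sits on a filesystem that supports symbolic links can be probed before starting an install. The following is a minimal sketch using the POSIX calls symlink() and unlink(); the function name dir_supports_symlinks and the probe file name are made up for this example.

```c
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if a symbolic link can be created in dir, 0 if not.
 * A result of 0 may also mean the directory does not exist or is not
 * writable, so it is only a rough indication.
 * The probe name "symlink_probe.<pid>" is arbitrary (hypothetical). */
int dir_supports_symlinks(const char *dir)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/symlink_probe.%ld",
                     dir, (long)getpid());

    if (n < 0 || (size_t)n >= sizeof path)
        return 0;                      /* path would not fit */

    /* The link target need not exist; a dangling link is enough. */
    if (symlink("probe-target", path) != 0)
        return 0;                      /* fails on vfat, for instance */

    unlink(path);                      /* clean up the probe link */
    return 1;
}
```

On an ext4 volume a call like dir_supports_symlinks("/var/www/toolwrk") should return 1, whereas on a vfat mount it is expected to return 0.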
Gumbo also required a C++ compiler, which I don’t use and don’t need. In the end I did decide to use Gumbo anyway, as described later in this article. But at this point there were too many problems, so I skipped to the next candidate.
LexBor was written by Alexander Borisov. I quote from the file README.md: “The lexbor project is being developed using the C language, without dependencies.” Sounds good, that’s how I like it. See also Lexbor.com.
Based on instructions found in the two locations, I tried:
curl -O https://lexbor.com/keys/lexbor_signing.key
sudo apt-key add lexbor_signing.key
That worked. Then:
sudo apt install liblexbor
but that didn’t work. Error message:
E: Unable to locate package liblexbor.
A pity. But I wanted to have an HTML parsing library also on FreeBSD, at least for the time being. There’s no apt there, only pkg. Building from sources seemed the better option. So, again following instructions in the LexBor docs, I tried:
git clone https://github.com/lexbor/lexbor/
cd lexbor
cmake . -DLEXBOR_BUILD_TESTS=ON -DLEXBOR_BUILD_EXAMPLES=ON -DLEXBOR_BUILD_SEPARATELY=ON
make
make test
There was no cmake on my system, but of course I could install it: sudo apt install cmake.
That worked. (That was in late March 2021; now that I try it again in mid-November, it does not:
CMake Error: Could not find CMAKE_ROOT !!!
CMake has most likely not been installed correctly.
)
In March, cmake started a search for a CXX compiler. That probably means C++. Why would that be needed? The project promised to be in pure C. So why not just use cc? I have been using Unix systems on and off since 1985, and there was always a C compiler called cc.
(Note that a freshly installed Ubuntu Server contains no compilers at all, not even for C. The utility make is unavailable too. That kind of makes sense, because you may well have a web server that doesn’t require any compilation to install it. But mine does. Remedy, found here: sudo apt install build-essential.)
My conclusion about LexBor was, unfortunately: too many problems. If it doesn’t install flawlessly, out of the box, I’m gone, I’ll try something else instead.
HTML Tidy’s original author was Dave Raggett. It is now maintained by others. It is on SourceForge and there’s an example here. But that’s an older version. The current one is on GitHub.
So I cloned that, read the files README.md and README/BUILD.md, and followed the instructions. However, HTML Tidy had the same problem as LexBor: it got stuck on not finding CXX. I can’t find that acceptable. An installation has to work flawlessly, and shouldn’t require things that aren’t really needed. Or it should install those itself, which is the philosophy behind apt (Debian, Ubuntu, Mint), and also pkg (FreeBSD).
24 March 2021. After a good night’s sleep, I decided to give Google’s Gumbo Parser another try, stop being stubborn, and meekly install everything that Gumbo needs: libtool, m4, automake, and even a C++ compiler, namely GNU’s g++.
And I got it to work. That was under FreeBSD. I wrote a little wrapper around Gumbo (see getfhtml.h and getfhtml.c) to make it do what I want it to do. So far, so good.
A few days ago, however, when I wanted to install Gumbo also on Linux Mint, in preparation for eventually moving web hosting from FreeBSD to Ubuntu (mainly because of its release cycle with five years of support), I got a different problem, even though g++ was installed. Error messages:
checking for g++... g++
checking whether the C++ compiler works... no
configure: error: in `/var/www/toolwrk/gumbo-parser':
configure: error: C++ compiler cannot create executables
See `config.log' for more details
And in that config.log file I found:
g++: fatal error: cannot execute 'cc1plus': execvp: No such file or directory
compilation terminated.
Searching for that message led me to this Stack Overflow page. The suggestions there didn’t help in my case. Moreover, I just find this unacceptable. Like the other libraries I tried, Gumbo is supposed to be written in pure C99. So it shouldn’t force me to install a C++ compiler, probably only for some examples that I don’t need and don’t use. And if it does force me, there shouldn’t be any vague and incomprehensible problems like this latest one.
But I couldn’t give up. I already had my wrapper, and integration with
earlier code, and I wanted to keep using that. Then I had this bright
idea: from the cloned gumbo-parser directory, I ran
find . -name '*.[ch]'
Result: nearly all C code resides in the directory src.
Can’t I compile that directly and build the library I need?
Simple answer: yes I can.
The page Shared libraries with GCC on Linux showed me how to create a shared library, something I had never done before.
This solution makes libtool, m4 and automake unnecessary too. All that is needed is cc, make, and standard Unix userland utilities. Plain and simple.
So after defining
LOCALINC=/usr/local/include
LOCALLIB=/usr/local/lib
now I do just this:
cd src
cc -Wall -fPIC -shared -Wl,-soname -Wl,libgumbo.so.1.0.0 *.c -o libgumbo.so.1.0.0
sudo cp gumbo.h tag_enum.h $LOCALINC
sudo mv libgumbo.so.1.0.0 $LOCALLIB
cd $LOCALLIB
sudo ln -s libgumbo.so.1.0.0 libgumbo.so.1
sudo ln -s libgumbo.so.1.0.0 libgumbo.so
# ldconfig updates the cache, so the new library will be found
sudo ldconfig
However, having read the Soname article in Wikipedia, I might have done better to use libtool (which includes libtoolize) anyway. I don’t know. What I have works. I’ll leave it at that.
Update 14 November: while installing libmaxminddb and seeing what it does, I learned that the SONAME can be set by passing -Wl options to gcc, which passes them on to the linker ld. I added those to the command above. Still no libtool needed.
Copyright © 2021 by R. Harmsen, all rights reserved.