Yesterday and today, I implemented a result ranking for my siworin search engine. Before that change, search results were presented in the order in which they were found. That often wasn’t optimal. So now I collect results before displaying them, assign a ranking to each, sort on that ranking, and then present the result, highest ranking first.
There is a new module consisting of
siworin-ranking.c, that is in
siworin-match.c in function
I used to call function
I now call function
siworin_ranking_add1hit of source file
There, the search results are collected in a
realloc’ed table, which was prepared by the function
siworin_ranking_init. Once all results for the current search
are there, the table is
qsorted on rank, then function
siworin_ranking_finalise traverses the table and repeatedly
siworin_displaymatches to display the search results.
The ranking algorithm is rather simple:
In this C source
siworin-ranking.c, I do something that may
seem strange and dangerous (but I am convinced that it isn’t): I write
binary values, of pointers and integers, into records that are not defined
in C as those data types, but as a
malloc’ed area of unsigned
characters instead. I use the standard library function
to do that.
To read back the data, in the statement
thisone = (struct hfile_wordvector_s *)
(hitstab + i * binrecsize + sizeof (rank_t));
I cast a calculated pointer to
unsigned char to a pointer to
struct hfile_wordvector_s, defined in
That pointer I pass to function
then uses the data it points to (directly and indirectly) as if nothing strange
The reason for doing it the way I did it, is in the data structure
for the search words and the matches, as created by function
setup_matchvectors in source file
The data structure can be mentally depicted by the words being next to each
Each word has a pointer to a vertical array of locations in the HTML files,
encoded as file number and byte offset. Each word also has an index number
associated with it, into that locations array, so as to encode the match
currently under investigation.
The number of words is variable between searches, but within one search it is always the same. Perhaps under newer standards of C, you could solve this with variable-size arrays. But I never understood how that works, and I’m not really interested in learning it.
The order of the words etc. can change even within a single search, because
of the repeated
qsort done in
which means the index increment to find new potential matches needs only be
done on the very first element.
I think my solution is legal in C, safe and portable, because the binary
data is written, cast and read within the same C module (source file),
so also within the same program and the same hardware platform. By
sizeofs and a
assumptions are made about the size of pointers, floating point numbers
and integers, nor about data alignment, nor about endianness. So nothing
can ever go wrong.
I think. But do feel free to disagree.