Yesterday and today, I implemented a result ranking for my siworin search engine. Before that change, search results were presented in the order in which they were found. That often wasn’t optimal. So now I collect results before displaying them, assign a ranking to each, sort on that ranking, and then present the result, highest ranking first.
There is a new module consisting of siworin-ranking.h
and siworin-ranking.c
, that is in
between siworin-match.c
and siworin-displ.c
.
Where in siworin-match.c
in function find_matches
I used to call function siworin_displaymatches
, instead
I now call function siworin_ranking_add1hit
of source file
siworin-ranking.c
.
There, the search results are collected in a malloc
’ed and
realloc
’ed table, which was prepared by the function
siworin_ranking_init
. Once all results for the current search
are there, the table is qsort
ed on rank, then function
siworin_ranking_finalise
traverses the table and repeatedly
calls siworin_displaymatches
to display the search results.
The ranking algorithm is rather simple:
In this C source siworin-ranking.c
, I do something that may
seem strange and dangerous (but I am convinced that it isn’t): I write
binary values, of pointers and integers, into records that are not defined
in C as those data types, but as a malloc
’ed area of unsigned
characters instead. I use the standard library function memcpy
to do that.
To read back the data, in the statement
thisone = (struct hfile_wordvector_s *)
(hitstab + i * binrecsize + sizeof (rank_t));
I cast a calculated pointer to unsigned char
to a pointer to
a structure struct hfile_wordvector_s
, defined in
siworin-match.h
.
That pointer I pass to function siworin_displaymatches
, which
then uses the data it points to (directly and indirectly) as if nothing strange
had happened.
The reason for doing it the way I did it, is in the data structure
for the search words and the matches, as created by function
setup_matchvectors
in source file siworin-match.c
.
The data structure can be mentally depicted by the words being next to each
other horizontally.
Each word has a pointer to a vertical array of locations in the HTML files,
encoded as file number and byte offset. Each word also has an index number
associated with it, into that locations array, so as to encode the match
currently under investigation.
The number of words is variable between searches, but within one search it is always the same. Perhaps under newer standards of C, you could solve this with variable-size arrays. But I never understood how that works, and I’m not really interested in learning it.
The order of the words etc. can change even within a single search, because
of the repeated qsort
done in siworin-match.c
,
which means the index increment to find new potential matches needs only be
done on the very first element.
I think my solution is legal in C, safe and portable, because the binary
data is written, cast and read within the same C module (source file),
so also within the same program and the same hardware platform. By
consistently using sizeof
s and a typedef
, no
assumptions are made about the size of pointers, floating point numbers
and integers, nor about data alignment, nor about endianness. So nothing
can ever go wrong.
I think. But do feel free to disagree.
Copyright © 2022 by R. Harmsen, all rights reserved.