Simple word indexer (19)

13 February 2022

Ranking search results

Yesterday and today, I implemented a result ranking for my siworin search engine. Before that change, search results were presented in the order in which they were found. That often wasn’t optimal. So now I collect results before displaying them, assign a ranking to each, sort on that ranking, and then present the result, highest ranking first.

There is a new module consisting of siworin-ranking.h and siworin-ranking.c, that is in between siworin-match.c and siworin-displ.c. Where in siworin-match.c in function find_matches I used to call function siworin_displaymatches, instead I now call function siworin_ranking_add1hit of source file siworin-ranking.c.

There, the search results are collected in a malloc’ed and realloc’ed table, which was prepared by the function siworin_ranking_init. Once all results for the current search are there, the table is qsorted on rank, then function siworin_ranking_finalise traverses the table and repeatedly calls siworin_displaymatches to display the search results.

The ranking algorithm is rather simple:

More words are better than fewer words.
Infrequent words are special and therefore more relevant than trivial, frequent words.
Adjacent words provide better hits than more distant words.

Memory management

In this C source siworin-ranking.c, I do something that may seem strange and dangerous (but I am convinced that it isn’t): I write binary values, of pointers and integers, into records that are not defined in C as those data types, but as a malloc’ed area of unsigned characters instead. I use the standard library function memcpy to do that.

To read back the data, in the statement
thisone = (struct hfile_wordvector_s *) (hitstab + i * binrecsize + sizeof (rank_t));
I cast a calculated pointer to unsigned char to a pointer to a structure struct hfile_wordvector_s, defined in siworin-match.h. That pointer I pass to function siworin_displaymatches, which then uses the data it points to (directly and indirectly) as if nothing strange had happened.

The reason for doing it the way I did it, is in the data structure for the search words and the matches, as created by function setup_matchvectors in source file siworin-match.c. The data structure can be mentally depicted by the words being next to each other horizontally. Each word has a pointer to a vertical array of locations in the HTML files, encoded as file number and byte offset. Each word also has an index number associated with it, into that locations array, so as to encode the match currently under investigation.

The number of words is variable between searches, but within one search it is always the same. Perhaps under newer standards of C, you could solve this with variable-size arrays. But I never understood how that works, and I’m not really interested in learning it.

The order of the words etc. can change even within a single search, because of the repeated qsort done in siworin-match.c, which means the index increment to find new potential matches needs only be done on the very first element.

I think my solution is legal in C, safe and portable, because the binary data is written, cast and read within the same C module (source file), so also within the same program and the same hardware platform. By consistently using sizeofs and a typedef, no assumptions are made about the size of pointers, floating point numbers and integers, nor about data alignment, nor about endianness. So nothing can ever go wrong.

I think. But do feel free to disagree.

Any next article, ever?