Simple word indexer (6)

27–. Continued from the previous.


The algorithm may seem simple, primitive and slow, because there is no way to update the index incrementally: it can only be rebuilt from scratch each time. In practice, however, it is reasonably fast.

The data I tested it with consisted of 1875 HTML files, whose paths took 78,827 bytes to specify. They contained 2,107,851 words, requiring 47,157,512 bytes to store the words and encode their locations. Of these, 186,106 words were unique, and storing those took 11,267,657 bytes.

Tests were performed in four different hardware and OS configurations:

Cnf  OS                  CPU                                 Clock    Cores          Wrk. mem.  Storage
A    FreeBSD 12.2        Intel® Xeon® E312xx (Sandy Bridge)  3.2 GHz  80% of 1 core  1 GB       HDD (magnetic hard disk drive)
B    Ubuntu Server 20.04 Intel® Xeon® E312xx (Sandy Bridge)  3.2 GHz  70% of 1 core  1 GB       SSD (solid state drive)
C    Linux Mint 20.1     Intel® Core™ i3-6100U               2.3 GHz  100% of 4 cores  4 GB     HDD (magnetic hard disk drive)
D    Linux Mint 20.1     Intel® Core™ i3-10110U              2.1 GHz  100% of 4 cores  8 GB     SSD (solid state drive)

Below is a table of how long the steps took in those situations. All test steps were performed several times in succession, to take advantage of disk caching. Step 4 was done with
#define MAX_OCCURR 250
All times were measured with time(1) and are expressed in milliseconds (ms).
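As an aside, time(1) is the portable way to measure a command; where millisecond wall-clock arithmetic in a script is wanted, GNU date can be used instead. A minimal sketch (the %N format is a GNU extension and an assumption here; it is not available in BSD date):

```shell
# Measure elapsed wall-clock time of a command in milliseconds.
# date +%s%3N prints Unix seconds plus 3 digits of the nanosecond field,
# i.e. milliseconds since the epoch (GNU date only).
start=$(date +%s%3N)
seq 1 100000 | sort -n > /dev/null   # some work to time
end=$(date +%s%3N)
echo "elapsed: $((end - start)) ms"
```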

Step            Conf. A  Conf. B  Conf. C  Conf. D
1 find               51       40       28       65
2 wordsep          3370     4900     2880     1670
3 sort             5950     5160     1200      796
4 combine          1130     1410      860      574
2, 3 & 4 piped   11,350   11,060     4610     2943
5 pathoffs            3        4        7        5
6 bsearch             4        4        6        4
6 egrep              26       35       36       43
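The piped run of steps 2, 3 and 4 is conceptually similar to the following stand-in pipeline built from standard tools. This is only an illustration: tr stands in for wordsep (which also records file and offset information), and uniq -c stands in for combine (which encodes the locations instead of merely counting):

```shell
# Stand-in for:  find ... | wordsep | sort | combine
# tr splits the input into one word per line, sort orders the words,
# and uniq -c merges duplicates with a count.
printf 'the cat sat on the mat\n' \
  | tr -cs '[:alpha:]' '\n' \
  | sort \
  | uniq -c
# the duplicated word 'the' comes out with count 2
```

The point of piping is the same as in the real indexer: the largest intermediate file never has to be written to disk.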

Note: in configuration D, the combined test of steps 2, 3 and 4 was the only case in which any effect of having more than one processor core could be seen. Much of the sorting can already be done while the word separation is still in progress, leaving only a final merge to complete the sort. Likewise, combining the file data can proceed while sort writes out the result of merging its pre-sorted temporary files or in-memory tables. For example, one of the 7 measurements that I averaged read:

real	0m2.942s
user	0m3.065s
sys	0m0.225s

So the elapsed (real) time, 2942 ms, is less than the sum of user and system time, 3065 + 225 = 3290 ms. The effect of the parallelism, however, is still rather limited.

Run separately, steps 2, 3 and 4 take 1670 + 796 + 574 = 3040 ms in configuration D, against 2943 ms piped. So combining the three steps with two pipes, which avoids writing the largest intermediate file to disk, improves performance only slightly.

We can conclude that with this amount of data, the steps that have to be performed often, viz. the searching, are quite fast. This is true even if regular expressions are used, so egrep has to wade through all the words sequentially. Using a more efficient algorithm like a binary search doesn’t even make that much difference, and isn’t necessary for performance reasons.
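The two search styles from step 6 can be illustrated on a small, hypothetical word list (the real list is the sorted unique-words file the indexer produces; the file name here is an assumption):

```shell
# Hypothetical sorted word list; the indexer's real file is much larger.
printf 'apple\nbanana\ncherries\ncherry\n' > words.txt

# Regular-expression search: egrep scans all words sequentially,
# but can match patterns such as a common prefix.
egrep '^cherr' words.txt
# prints: cherries
#         cherry

rm -f words.txt
```

A binary search, as the bsearch step does, exploits the sorted order and touches only O(log n) entries, which is why it stays at a few milliseconds; but as the table shows, at this scale the sequential scan is nearly as fast.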

The steps that create the index files are much slower, but normally they need to run only about once a day. Even then, their elapsed time stays well below one minute in all configurations.

For larger amounts of data, the indexing times will probably grow roughly linearly (the sort step slightly worse than that), while the binary search grows only logarithmically.

Next, we’ll have a look at the word extractor.