27–. Continued from the previous.
The algorithm may seem rather simple, primitive and slow, because there is no way to update the index incrementally: it can only be rebuilt completely from scratch each time. In practice, however, it is reasonably fast.
The data I tested it with consisted of 1875 HTML files, whose paths took 78,827 bytes to specify. They contained 2,107,851 words, taking up 47,157,512 bytes to store them and encode their locations. Of these, 186,106 were unique words, which required 11,267,657 bytes of storage.
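(On average that is about 42 bytes per file path, roughly 22 bytes per word occurrence, and about 61 bytes per unique word.)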
Tests were performed in four different hardware and OS configurations:
| Conf. | OS | CPU | Clock | Cores | Working memory | Storage |
|---|---|---|---|---|---|---|
| A | FreeBSD 12.2 | Intel® Xeon® E312xx (Sandy Bridge) | 3.2 GHz | 80% of 1 core | 1 GB | HDD (magnetic hard disk drive) |
| B | Ubuntu Server 20.04 | Intel® Xeon® E312xx (Sandy Bridge) | 3.2 GHz | 70% of 1 core | 1 GB | SSD (solid state drive) |
| C | Linux Mint 20.1 | Intel® Core™ i3-6100U | 2.3 GHz | 100% of 4 cores | 4 GB | HDD (magnetic hard disk drive) |
| D | Linux Mint 20.1 | Intel® Core™ i3-10110U | 2.1 GHz | 100% of 4 cores | 8 GB | SSD (solid state drive) |
Below is a table of how long the steps took in those situations. All test steps were performed several times in succession, to take advantage of disk caching. Step 4 was done with `#define MAX_OCCURR 250`. All times were measured with `time(1)` and are expressed in milliseconds (ms).
| Step | Conf. A | Conf. B | Conf. C | Conf. D |
|---|---|---|---|---|
| 1 find | 51 | 40 | 28 | 65 |
| 2 wordsep | 3370 | 4900 | 2880 | 1670 |
| 3 sort | 5950 | 5160 | 1200 | 796 |
| 4 combine | 1130 | 1410 | 860 | 574 |
| 2, 3 & 4 piped | 11,350 | 11,060 | 4610 | 2943 |
| 5 pathoffs | 3 | 4 | 7 | 5 |
| 6 bsearch | 4 | 4 | 6 | 4 |
| 6 egrep | 26 | 35 | 36 | 43 |
Note: in configuration D, the combined test of steps 2, 3 and 4 was the only case in which any effect of having more than one processor core could be seen: much of the sorting can already be done while the word separation is still in progress, leaving only a final merge to complete the sort. Likewise, combining the file data can proceed while `sort` writes out the result of merging its pre-sorted temporary files or in-memory tables.
For example, one of the 7 measurements that I averaged read:
real 0m2.942s user 0m3.065s sys 0m0.225s
So the elapsed (real) time of 2942 ms is less than the sum of the user and system times, 3065 + 225 = 3290 ms. However, the effect of the parallelism is still rather limited.
The separate steps add up to 1670 + 796 + 574 = 3040 ms, so combining the three steps with two pipes, which avoids writing the largest intermediate file to disk, improves performance only slightly (2943 ms for the piped run).
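As a rough sketch of what such a piped run looks like when measured with time(1): the program names wordsep and combine are those from the table above, but the file names and invocation details below are assumptions, not the actual scripts.

```sh
# Hypothetical sketch of the piped variant of steps 2, 3 and 4; the file
# names and the exact way wordsep and combine are invoked are assumptions.
time sh -c 'wordsep < filelist.txt | sort | combine > index.txt'
```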
We can conclude that with this amount of data, the steps that have to be performed often, viz. the searching, are quite fast. This is true even if regular expressions are used, so that egrep has to wade through all the words sequentially. Using a more efficient algorithm such as a binary search doesn't even make that much difference, and isn't necessary for performance reasons.
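To make that concrete: a binary search over the 186,106 unique words needs only about log2(186,106) ≈ 18 probes, while egrep reads the whole word list. Purely as an illustration (the file name and the search pattern below are made up, not taken from the real index), the sequential variant amounts to something like:

```sh
# Illustrative only: egrep scans the entire sorted word file sequentially.
# The file name and the pattern are assumptions, not the actual index data.
time egrep '^exampl' sortedwords.txt
```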
The steps to create the index files are much slower, but it normally suffices to perform them only once a day on average. And even their elapsed time remains way below one minute in all configurations.
For larger amounts of data, the index-building times will probably increase roughly linearly (strictly speaking the sort grows as n log n), while the time for the binary search grows only logarithmically.
Next, we’ll have a look at the word extractor.
Copyright © 2021 by R. Harmsen, all rights reserved.