27–. Continued from the previous.
The algorithm may seem rather simple, primitive and slow, because there is no way to update the index incrementally: it can only be rebuilt completely from scratch each time. In practice, however, it is reasonably fast.
The data I tested it with consisted of 1875 HTML files, whose paths took 78,827 bytes to specify. They contained 2,107,851 words, taking up 47,157,512 bytes to store them and encode their locations. Of these, 186,106 were unique words, which required 11,267,657 bytes of storage.
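(On average that is about 42 bytes per file path, roughly 22 bytes per word occurrence, and about 61 bytes per unique word.)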
Tests were performed in four different hardware and OS configurations:
| Conf. | OS | CPU | Clock | Cores | Working memory | Storage |
|---|---|---|---|---|---|---|
| A | FreeBSD 12.2 | Intel® Xeon® E312xx (Sandy Bridge) | 3.2 GHz | 80% of 1 core | 1 GB | HDD (magnetic hard disk drive) |
| B | Ubuntu Server 20.04 | Intel® Xeon® E312xx (Sandy Bridge) | 3.2 GHz | 70% of 1 core | 1 GB | SSD (solid state drive) |
| C | Linux Mint 20.1 | Intel® Core™ i3-6100U | 2.3 GHz | 100% of 4 cores | 4 GB | HDD (magnetic hard disk drive) |
| D | Linux Mint 20.1 | Intel® Core™ i3-10110U | 2.1 GHz | 100% of 4 cores | 8 GB | SSD (solid state drive) |
Below is a table of how long the steps took in those situations. All test steps were performed several times in succession, to take advantage of disk caching. Step 4 was done with `#define MAX_OCCURR 250`. All times were measured with `time(1)` and are expressed in milliseconds (ms).
| Step | Conf. A | Conf. B | Conf. C | Conf. D |
|---|---|---|---|---|
| 1 find | 51 | 40 | 28 | 65 |
| 2 wordsep | 3370 | 4900 | 2880 | 1670 |
| 3 sort | 5950 | 5160 | 1200 | 796 |
| 4 combine | 1130 | 1410 | 860 | 574 |
| 2, 3 & 4 piped | 11,350 | 11,060 | 4610 | 2943 |
| 5 pathoffs | 3 | 4 | 7 | 5 |
| 6 bsearch | 4 | 4 | 6 | 4 |
| 6 egrep | 26 | 35 | 36 | 43 |
Note: in configuration D, the combined test of steps 2, 3 and 4 was the only case in which any effect of having more than one processor core could be seen: much of the sorting can already be done while the word separation is still in progress, leaving only a final merge to complete the sort. Likewise, combining the file data can proceed while `sort` writes out the result of merging its pre-sorted temporary files or in-memory tables.
For example, one of the 7 measurements that I averaged read:
real 0m2.942s user 0m3.065s sys 0m0.225s
So the elapsed (real) time of 2942 ms is less than the sum of the user and system times, 3065 + 225 = 3290 ms. However, the effect of the parallelism is still rather limited.
The separate steps add up to 1670 + 796 + 574 = 3040 ms, so combining the three steps with two pipes, which avoids writing the largest intermediate file to disk, improves performance only slightly (2943 ms for the piped run).
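As a rough sketch of what such a piped run looks like when measured with time(1): the program names wordsep and combine are those from the table above, but the file names and invocation details below are assumptions, not the actual scripts.

```sh
# Hypothetical sketch of the piped variant of steps 2, 3 and 4; the file
# names and the exact way wordsep and combine are invoked are assumptions.
time sh -c 'wordsep < filelist.txt | sort | combine > index.txt'
```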
We can conclude that with this amount of data, the steps that have to be performed often, viz. the searching, are quite fast. This is true even if regular expressions are used, so that egrep has to wade through all the words sequentially. Using a more efficient algorithm such as a binary search doesn't even make that much difference, and isn't necessary for performance reasons.
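To make that concrete: a binary search over the 186,106 unique words needs only about log2(186,106) ≈ 18 probes, while egrep reads the whole word list. Purely as an illustration (the file name and the search pattern below are made up, not taken from the real index), the sequential variant amounts to something like:

```sh
# Illustrative only: egrep scans the entire sorted word file sequentially.
# The file name and the pattern are assumptions, not the actual index data.
time egrep '^exampl' sortedwords.txt
```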
The steps to create the index files are much slower, but it normally suffices to perform them only once a day on average. And even their elapsed time remains way below one minute in all configurations.
For larger amounts of data, the index-building times will probably increase roughly linearly (strictly speaking the sort grows as n log n), while the time for the binary search grows only logarithmically.
Next, we’ll have a look at the word extractor.
Copyright © 2021 by R. Harmsen, all rights reserved.