| title | draft |
|---|---|
| when two macros are faster than one | false |
While working on my Database datapack (still WIP), I knew I'd want to find the most efficient way to dynamically access dynamically populated arrays. I had some ideas and decided to benchmark them using Kragast's Benchmark Datapack. This process was really illuminating to me, and I hope it will be for you as well. Thanks for all the help from PuckiSilver, amandin, and Nicoder.
## scenario
The following are the dataset and constraints I used to test different methods of accessing data within an array.
### dataset
The testing data is stored in the storage #_macro.array. The array is populated with a total of 500 entries, each having id and string fields.
```
[
  {
    id: 1,
    string: "entry1"
  },
  ...
  {
    id: 500,
    string: "entry500"
  }
]
```
This dataset could also be represented as a table:
| id | string |
|---|---|
| 1 | "entry1" |
| ... | ... |
| 500 | "entry500" |
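For reference, an equivalent dataset is easy to generate in TypeScript (mirroring the analogy used later in this post; the `Entry` type name is my own, not part of the datapack):

```typescript
// Generate the same 500-entry test dataset used throughout this post.
// The Entry type name is illustrative, not part of the datapack.
type Entry = { id: number; string: string };

const array: Entry[] = Array.from({ length: 500 }, (_, i) => ({
  id: i + 1,
  string: `entry${i + 1}`,
}));
```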
### constraints
The objective is to create an interface that receives a keyword, say entry500, and searches #_macro.array for an entry where the value of string matches the keyword.
The keyword must be able to be entered by a player at runtime, and #_macro.array can have an arbitrary number of custom entries created by a player.
In TypeScript, it would look something like this:
```typescript
function searchArray(keyword: string) {
  // search logic
  return theRelevantEntry
}

searchArray('entry500')
```
In mcfunction, this is not so straightforward. Macros would make this really clean:
```mcfunction
function test_namespace:search_array {keyword: "entry500"}
```
Unfortunately, macros come with a performance hit. In this particular situation, we can bypass macros altogether. While it's less elegant, it is more performant to store the keyword in NBT storage prior to calling the function. The storage can be removed after the function is run:
```mcfunction
data modify storage test_namespace:test_namespace temp.keyword set value 'entry500'
function test_namespace:search_array
data remove storage test_namespace:test_namespace temp.keyword
```
Once the entry is found, it is stored in the temp.result storage, which can then be consumed by another function.
Now for the logic to do the actual array searching. Here, the performance hit of running macros is worth it as the alternative involves a massive number of commands to manually iterate over an array. As we'll see later when benchmarking functions, manual iteration is really slow. Macros it is...
## one macro
Macros allow us to reach into our array and pick out an entry with a matching value in the `string` property. This is something that I didn't realize was possible (for some reason), and it was pointed out by PuckiSilver and amandin on the Datapack Hub Discord server.
`... one_macro.array[{string:$(keyword)}]`
This method is super clean and results in a one-liner that is wordy but simple:
```mcfunction
# one_macro/_searcharray.mcfunction
$data modify storage test_namespace:test_namespace temp.result set from storage test_namespace:test_namespace one_macro.array[{string:$(keyword)}]
```
_searcharray can then be called using the temp.keyword storage:
```mcfunction
# one_macro/run.mcfunction
data modify storage test_namespace:test_namespace temp.keyword set value 'entry500'
function test_namespace:one_macro/_searcharray with storage test_namespace:test_namespace temp
data remove storage test_namespace:test_namespace temp.keyword
# call the function that consumes temp.result, then remove it
data remove storage test_namespace:test_namespace temp.result
```
## two macros
Another way to crack the problem is through indexing. This was my original plan when I didn't realize that `...[{string:$(keyword)}]` was possible.
This method requires the creation of an index of the field that is going to be searched. The index is a list of key/value pairs:
```
{
  entry1: 0,
  entry2: 1,
  ...
  entry500: 499
}
```
The key, e.g. `entry2`, corresponds with the value of a `string` field in the main array, while the value `1` indicates the main array index where we'll find the full entry. The index can be searched with a direct path, `index.$(keyword)`, and the main array can then be accessed directly by index, `array[$(index)]`. Keep in mind that the index must already exist prior to running the search function. In a practical application, an index could be updated every time the main array is updated. A scheduled task could also audit the index to ensure that it's up to date.
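In TypeScript terms, the two-step lookup amounts to an object keyed by keyword that points into the main array. A rough sketch (the names here are illustrative, not the actual storage layout):

```typescript
type Entry = { id: number; string: string };

const array: Entry[] = Array.from({ length: 500 }, (_, i) => ({
  id: i + 1,
  string: `entry${i + 1}`,
}));

// Build the index once: keyword -> position in the main array.
const index: Record<string, number> = {};
array.forEach((entry, i) => {
  index[entry.string] = i;
});

// Two direct lookups, mirroring _searchindex and _searcharray.
const result = array[index["entry500"]];
```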
The index search looks like this:
```mcfunction
# two_macro/_searchindex.mcfunction
$data modify storage test_namespace:test_namespace temp.index set from storage test_namespace:test_namespace two_macro.index.$(keyword)
```
And the array search looks like this:
```mcfunction
# two_macro/_searcharray.mcfunction
$data modify storage test_namespace:test_namespace temp.result set from storage test_namespace:test_namespace two_macro.array[$(index)]
```
The index and array search functions are then called using the temp.keyword storage:
```mcfunction
# two_macro/run.mcfunction
data modify storage test_namespace:test_namespace temp.keyword set value 'entry500'
function test_namespace:two_macro/_searchindex with storage test_namespace:test_namespace temp
function test_namespace:two_macro/_searcharray with storage test_namespace:test_namespace temp
data remove storage test_namespace:test_namespace temp.keyword
data remove storage test_namespace:test_namespace temp.index
# call the function that consumes temp.result, then remove it
data remove storage test_namespace:test_namespace temp.result
```
## two is faster than one??
I ran benchmarks on a simple iteration-based function and the single-macro function suggested by PuckiSilver and amandin. I also threw in the two-macro indexing function since I had already coded it. I assumed using one macro would be faster than two, but I was curious exactly how much faster it would be.
As expected, the iteration-based function was sloooooow. Both macro functions blew it out of the water. Unexpectedly, however, the two_macro function more than doubled the performance of the one_macro function. Here are the results (bigger is better):
| function | benchmark |
|---|---|
| iteration | 416 |
| one_macro | 30342 |
| two_macro | 72450 |
The two_macro function is 2.4x faster than the one_macro function.
What the heck is going on? How does adding an entire second macro function improve performance??
It turns out that the clever and convenient `one_macro.array[{string:$(keyword)}]` filters the array by iterating over it. Since that iteration happens in the game's Java code, it's still much faster than iterating in mcfunction, but its cost is O(n) in the size of the array. In contrast, the two_macro approach accesses values directly by key and by index, and those operations cost O(1). This was confirmed by Nicoder. While I haven't tested it, this means that, when run on a larger dataset, the gap between two_macro and one_macro should continue to widen.
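The difference is the same one you'd see in TypeScript between a linear `find` and a lookup in a prebuilt `Map` (a loose analogy for intuition, not the game's actual internals):

```typescript
type Entry = { id: number; string: string };

const array: Entry[] = Array.from({ length: 500 }, (_, i) => ({
  id: i + 1,
  string: `entry${i + 1}`,
}));

// one_macro: scans entries until a match is found -- O(n) per query.
const viaScan = array.find((e) => e.string === "entry500");

// two_macro: a prebuilt index gives O(1) lookups per query.
const index = new Map(array.map((e, i) => [e.string, i] as [string, number]));
const viaIndex = array[index.get("entry500")!];
```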
## takeaways
Indexing is cool. If you find yourself in a situation where you're working with moderate-to-large arrays and are able to index in advance of querying data, it's absolutely worth it from a query performance standpoint.
However, indexing is pretty expensive, and also requires active preplanning when writing a datapack. When items are added, updated, or deleted, the index will also need to be updated. A scheduled task should probably be run every so often to audit indexes and identify potential errors. Indexing existing fields that do not already have an index could be annoying.
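To make the bookkeeping concrete, here's a hypothetical TypeScript sketch of keeping an index in sync with writes, plus an audit pass (the helper names are mine; a datapack would do the equivalent with data modify commands):

```typescript
type Entry = { id: number; string: string };

const array: Entry[] = [];
const index: Record<string, number> = {};

// Every write to the main array also updates the index.
function addEntry(entry: Entry): void {
  index[entry.string] = array.length;
  array.push(entry);
}

// A scheduled audit can verify every entry is indexed at the right spot.
function auditIndex(): boolean {
  return array.every((entry, i) => index[entry.string] === i);
}

addEntry({ id: 1, string: "entry1" });
addEntry({ id: 2, string: "entry2" });
```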
Point being: if it's worth it, it's worth it; if it's not, the one_macro one-liner is simpler and fast enough for most applications.