Running Benchmarks
Benchmark experiments should be run with every functional change to the code. This is how you do it.
Whenever you change something in the code's functionality, you need to run the benchmark experiments to ensure it doesn't impede performance. Also, if you implement an alternative approach, the benchmark experiments can be a good way to compare it to our current best implementation.
The benchmark test suite is designed to evaluate the performance of Monty in different scenarios. These include rotated and translated objects, noisy observations, similar objects, different action spaces, multiple LMs, multiple objects, real-world data, and continual unsupervised learning.
For more details on the current benchmark experiments, see this page.
When merging a change that impacts the performance on the benchmark experiments, you need to update the table in our documentation here.
How to Run a Benchmark Experiment
To run a benchmark experiment, simply call
python /benchmarks/run.py -e run_name
and replace run_name
with the name of the benchmark experiment. All benchmark experiment configs are in the benchmarks/configs/
folder. So for example, for running the quickest benchmark experiment you would call
python benchmarks/run.py -e randrot_10distinctobj_surf_agent
Go Ahead and Run the Command Above!
If you run the
randrot_10distinctobj_surf_agent
experiment using the command above, you will be able to follow along with all the following data analysis guides since we use this as an example.This should take about 1.5 minutes on an M3 MacBook or 5 minutes distributed on 16 CPU cores.
How to Report the Benchmark Performance
TODO: Figure out a way to have timings comparable
Currently we run all benchmark experiments on our cloud infrastructure, using 16 CPUs. However, external people will not have access to this and their hardware may give very different runtimes. How do we deal with this?
How to Report Benchmark Performance
Where to Find the Results
If you are using a wandb logger (used by default in the benchmark experiment configs), you can view the experiment results in the wandb dashboard. If you go into the "Runs" tab (selected on the left), you should see the summary statistics in the columns starting with "overall/".
If you are not using wandb, you can also calculate the statistics from the saved .csv file.
from tbp.monty.frameworks.utils.logging_utils import (load_stats,
print_overall_stats,
print_unsupervised_stats)
_, eval_stats, _, _ = load_stats(log_path + 'run_name',
load_train=False,
load_eval=True,
load_models=False,
load_detailed=False)
print_overall_stats(eval_stats)
# for the learning from scratch experiments, load the training csv instead and call
print_unsupervised_stats(train_stats, epoch_len=10) # 10 is the number of objects shown in an epoch
Where to Report the Results
If your code affected any of the benchmark results you should update the benchmark results table here in the same PR. See our guide on contributing to documentation for instructions on how to edit documentation.
Updated about 7 hours ago