Automation

For anyone who has ever had to do boring, repetitive tasks, there is always the wish that they could be done without any interaction at all. For a number of professional applications, automation can be a primary requirement - the ability to press a button, let something go, and get consistency every time removes headaches and opens the door to scaling out the process.

When it comes to benchmarking, having an automated test suite brings several benefits. Tests can have consistent delays between each run to provide the same conditions for temperature and turbo ramps, the suite can arrange the cache and standardize defragmentation, and it lends itself to repeated, consistent results. Bonus points are then awarded if the testing can be scaled out to multiple systems at once. Sitting at a system and running tests with irregular gaps between them adds more degrees of freedom on things that might not be consistent and can affect the results. Plus it becomes incredibly dull, incredibly fast. I mean OEM product manufacturing line dull. To all my fellow reviewers out there, I know the pain when you have several hundred hours of gameplay on something like Far Cry 5, but it’s all just benchmarking.

This is where I point to the well-known graph about automation (original source unknown):

For small tasks or projects, sometimes doing the work manually is quicker. If a task takes 5 minutes to do manually, but 8 hours to write a script that shaves 5 seconds off each run, the script has to be run 5760 times for the payoff. If the script is run 50 times a day, the payoff arrives after about 115 days. This ignores scaling out - if the script allows multiple systems to run concurrently, the payoff comes much sooner - and for a lot of tasks that makes it a no-brainer to put the effort in. Otherwise, three years later, it becomes ultimately depressing when running Cinebench for the 80,000th time. (Insert stories from TheDailyWTF about a boss who does not want automation because it might kill their job.) Insert obligatory XKCD.
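
For anyone who wants to plug in their own numbers, the same back-of-the-envelope sum looks like this in AHK (purely illustrative, using the figures above):

    scriptCost := 8 * 60 * 60                           ; 8 hours to write the script, in seconds
    savedPerRun := 5                                     ; seconds saved every time it runs
    runsPerDay := 50
    breakEvenRuns := scriptCost / savedPerRun            ; 5760 runs
    breakEvenDays := Round(breakEvenRuns / runsPerDay)   ; ~115 days
    MsgBox, Break-even after %breakEvenRuns% runs (~%breakEvenDays% days)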

When I first started at AnandTech, testing motherboards, I did not run anything automated. Going through a basic motherboard test suite manually took three days, because you have to be alert and present every time a test finishes in order to start the next one (and if the mind wanders, that 2-minute test becomes 15 minutes before you realize it's done). For our 2015 CPU benchmark suite, a basic script performed about 20 tests and lasted around 4 hours. It looked like spaghetti code, and very quickly became annoying to manage and update, especially when a benchmark decided it wasn't going to work or needed to be bypassed – there was no easy way to add benchmarks either. On top of this, benchmark installation was manual. Insert more XKCD. Thank you XKCD.

The new scripts for our Windows 10 testing are larger, modular, and more involved. The goal was essentially to automate everything that was feasibly possible within my knowledge (or that didn't require much learning), with no user interaction required. Over the course of two months, while testing which benchmarks were usable and applicable, two major scripts were written: CPU Tests and CPU Gaming Tests.

How to Automate: Batch Files, PowerShell, and AHK

There are many ways to automate in a system. Ganesh, for example, uses PowerShell almost exclusively to call benchmarks from the command line. To say that PowerShell is a glorified command prompt doesn't do it justice, but Ganesh ensures that his workloads for mini-PC testing can only ever run from the command line, and the results can be parsed therein. 

I'm not as au fait with PowerShell (if I had time for a crash course, it'd be on my to-do list), so I use a combination of batch files and a tool called AutoHotKey (AHK for short). AHK is a simple enough scripting language that can run programs, call command-line functions, call PowerShell scripts, emulate mouse movements, clicks, and keyboard presses, and perform internal math, with subroutine support. It is like a poor man's C++, with an alarming number of foibles, such as poor type definition and zero type checking, but it can work if you treat it right.
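
To give a flavor of what that looks like, a handful of AHK (v1) one-liners cover most of what the benchmark scripts need - launching programs, faking input, pausing, and doing basic math. This is a trivial illustration rather than anything from the actual suite:

    Run, notepad.exe                      ; launch a program
    WinWaitActive, Untitled - Notepad     ; wait until its window has focus
    Send, Benchmarking is fun{Enter}      ; emulate keyboard presses
    Sleep, 2000                           ; fixed delay, in milliseconds
    score := (120.5 + 118.2 + 119.9) / 3  ; basic expression support for results
    MsgBox, Average: %score%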

For each benchmark I tested for suitability, whether a fixed benchmark like Cinebench or a custom workload such as WinRAR or Blender, I tried to get the test to run from a simple batch file command line and manipulate the output. For Cinebench 15, the output is part of stderr; for Photoscan, a results file is written out by the Python script that Agisoft provided (and that I've edited). For WinRAR it is a timing function wrapped around a command-line call pointing at the workload, and for Civilization 6 it's a simple flag after adjusting the settings file. For benchmarks like Gears Tactics or Cinebench R10, there is no command-line option and we have to turn to AHK to simulate keyboard presses.
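
For the command-line cases, the wrapper is usually little more than a timer around a RunWait call, with stderr redirected to a file for parsing. A rough sketch is below; the executable, arguments, and paths are placeholders rather than any particular benchmark's real flags:

    start := A_TickCount
    ; cmd.exe handles the 2> redirection; paths kept free of spaces to avoid quoting pain
    RunWait, %ComSpec% /c C:\Bench\workload.exe -run input.dat 2> C:\Bench\stderr.txt, C:\Bench, Hide
    seconds := (A_TickCount - start) / 1000
    FileRead, errText, C:\Bench\stderr.txt    ; some tools print their score to stderr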

So with each benchmark profiled, the individual tests are written as separate functions in AHK with three stages: preparation/installation, execution, and result parsing.

Preparation involves ensuring that the benchmark can be run in its current state, installing it if it isn't, and deleting any previous temporary results file (if present) so that the directory structure is valid where needed. With the right preparation, running each test in the same manner makes the result as consistent as possible. Parsing the output into something suitable usually means going through an output file and applying the appropriate regular expressions to pull out the required value. Some tests automatically allow for repeated results (Corona or 3DPMv2), whereas others need multiple runs specified (WinRAR), and those results can be put into an array and averaged or geomeaned using AHK. A final function takes the results and copies them into a custom results directory.
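
Put together, each benchmark function ends up looking roughly like the skeleton below. The names, paths, and regular expression are illustrative placeholders, and the geometric mean at the end is just one of the aggregation options mentioned above:

    ; Sketch of a single benchmark function: prepare, execute, parse
    TestExampleBench(repeats := 3)
    {
        results := []
        FileDelete, C:\Bench\example\result.txt             ; preparation: clear stale output
        Loop, %repeats%
        {
            RunWait, C:\Bench\example\bench.exe -o result.txt, C:\Bench\example, Hide
            FileRead, raw, C:\Bench\example\result.txt
            RegExMatch(raw, "Time:\s*([\d\.]+)", m)          ; pull out the value we care about
            results.Push(m1)
        }
        logSum := 0
        for index, value in results                          ; geometric mean of the repeats
            logSum += Ln(value)
        return Exp(logSum / results.Length())
    }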

Outside of the testing functions is a general preparation element to the script. For our testing we have four main modes: the full list of tests, a short list of tests (determined in the script), running a single test, and an option to continue from a certain point of a full test run (in case one benchmark needed attention and errored out the process, such as a web benchmark when the server host fails). The initialization of one of our scripts asks which benchmark suite is required, and detects the CPU/GPU present in the system, before offering a default location to save the results based on the CPU/GPU combo. By having the results location determined when the script is started, we can move results to the directory as each test finishes, and the results are parsed into an easy-to-read format for a mental check before they go into the database. I keep the results location on a NAS, so as the script uploads each benchmark's results I can start looking at them while the remaining benchmarks are still running. Useful when running to a deadline! We also do additional checks on the state of Spectre and Meltdown fixes in the OS, to ensure consistency.
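
A stripped-down sketch of that start-of-run housekeeping is below. The registry key is the standard Windows location for the CPU name string; the suite prompt, NAS share, and directories are placeholders (and GPU detection is omitted for brevity):

    InputBox, mode, Benchmark suite, Full / short / single / continue?
    RegRead, cpuName, HKLM, HARDWARE\DESCRIPTION\System\CentralProcessor\0, ProcessorNameString
    resultsDir := "\\NAS\Results\" . cpuName . "_" . A_YYYY . A_MM . A_DD
    FileCreateDir, %resultsDir%
    ; ...then after each test function returns:
    FileCopy, C:\Bench\temp\*.txt, %resultsDir%, 1      ; push results to the NAS as we go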

Sanity Checks of Results and Running Order

Mental checks of results become important - being able to spot an outlier, or identifying when a result seems abnormal. For example, through the initial testing, I noticed that the result from one of our web tests (scoring ~100ms) was staying in the clipboard for the next web test (scoring 700ms). This gave a much lower average for the second test - and this only happened on fast CPUs. Similarly with game tests, with the benchmark repeated multiple times, sometimes a result (for whatever reason) might be 10% down on all the others. So either automatic detection of outliers needs to be in place (which doesn't work if two results out of four repeats are bad), or a manual mental check needs to take place. There are a few things that automation can't easily replace, such as experience. This is where, for some tests, an average might be representative, while for others a median might be more appropriate.
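
One way to partially automate that sanity check is to compare each repeat against the median of its set and flag anything too far out - a crude sketch, with an arbitrary 10% threshold, and a human eye still very much the backstop:

    FlagOutliers(results)                        ; results: a simple array of numbers
    {
        list := ""
        for index, value in results
            list .= value . "`n"
        Sort, list, N                            ; numeric sort, one value per line
        vals := StrSplit(RTrim(list, "`n"), "`n")
        median := vals[Ceil(vals.Length() / 2)]  ; middle element (rounded) is close enough here
        for index, value in results
            if (Abs(value - median) / median > 0.10)
                MsgBox, Run %index% (%value%) is more than 10`% from the median (%median%)
    }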

Also worth noting is the benchmark running order. Experience with our previous automation has shown that the shortest tests should run first, in order to populate our results directory on the NAS more quickly, and the longer tests should be near the end but not right at the end. The tests that more frequently cause unpredictable errors (e.g. DLL support on a new platform causing a system to hang, or a benchmark reliant on online license servers which could be down for maintenance) are put in last, so an overnight run will get through as many tests as possible before tackling potential breaks in the testing.

GPU Tests and Steam

The methods listed above work for our CPU and CPU Gaming tests. The CPU Gaming tests have an additional element, given that we are using games from Steam, and we are using only one log-in account across multiple systems under test at once. For the most part, if the game title is happy to run offline, the test can be run offline. Unfortunately there are some games (GTA, RDR2) where the benchmark script will run 95% smoother when the user is logged in, due to online DRM checks.

For this, the script I’ve written uses a test-and-lock mechanism when trying to log in to Steam, and only tries to run the online tests if the account is not already signed in elsewhere. If the account is already signed in on a different system, the first system will instead automatically run one of the offline tests and come back afterwards to see if online is available. If not, it will run another of the offline tests, check again, and so on, until there are no more offline tests to run, at which point it will sit and probe every 120 seconds for access to Steam. The machine that is online will run both sets of the online tests back-to-back, and then go back offline to run the rest of the offline tests, freeing the lock for any other machine that needs it. Some of this uses Steam's APIs, probing how Steam’s login mechanism works, and undocumented features.
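
The real scripts lean on Steam's own login state for the test-and-lock part, but the overall control flow can be illustrated with a simple (and admittedly non-atomic) lock file on a shared NAS path. Everything here - paths, game lists, and the helper functions - is a placeholder to show the idea rather than the actual mechanism:

    TryAcquireSteamLock(lockFile)
    {
        if (FileExist(lockFile))
            return false                            ; another system holds the online slot
        FileAppend, %A_ComputerName%, %lockFile%    ; claim it for this machine
        return true
    }

    lockPath := "\\NAS\locks\steam.lock"
    offlineTests := ["OfflineGameA", "OfflineGameB", "OfflineGameC"]
    onlineDone := false
    while (!onlineDone || offlineTests.Length())
    {
        if (!onlineDone && TryAcquireSteamLock(lockPath))
        {
            RunOnlineTests()                        ; the GTA / RDR2 style online-DRM titles
            FileDelete, %lockPath%                  ; free the slot for the next machine
            onlineDone := true
            continue
        }
        if (offlineTests.Length())
            RunOneOfflineTest(offlineTests.Pop())   ; do useful work while we wait
        else
            Sleep, 120000                           ; nothing left offline: probe again in 120 s
    }

    ; Placeholder stubs so the sketch stands alone
    RunOnlineTests()
    {
        ; run the online-DRM titles back-to-back
    }
    RunOneOfflineTest(name)
    {
        ; run a single offline-friendly benchmark by name
        ToolTip, Running %name%
        Sleep, 2000
        ToolTip
    }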

110 Comments

  • Smell This - Monday, July 20, 2020 - link


    ;- )
  • Oxford Guy - Monday, July 20, 2020 - link

    "If there’s a CPU, old or new, you want to see tested, then please drop a comment below."

    • i7-3820. This one is especially interesting because it had roughly the same number of transistors as Piledriver on roughly the same node (Intel 32nm vs. GF 32 nm).

    • 5775C

    • 5675C (which matched or even outperformed the 5775C in some games due to thermal throttling)

    • 5775C with TDP bypassed or increased if this is possible, to avoid the aforementioned throttling

    • I would really really like you to add Deserts of Kharak to your games test suite. It is the only game I know of that showed Piledriver beating Intel's chips. That unusual performance suggests that it was possible to get more performance out of Piledriver if developers targeted that CPU for optimization and/or the game's engine somehow simply suited it particularly.

    • 8320E or 8370E at 4.7 GHz (non-turbo) with 2133 CAS 9-11-10 RAM, the most optimal Piledriver setup. The 9590 was not the most performant of the FX line, likely because of the turbo. A straight overclock coupled with tuned RAM (not 1600 CAS 10 nonsense) makes a difference. 4.7 GHz is a realistic speed achievable by a large AIO or small loop. If you want air cooling only then drop to 4.5 GHz but keep the fast RAM. The point of testing this is to see what people were able to get in the real world from the AMD alternative for all the years they had to wait for Zen. Since we were stuck with Piledriver as the most performant alternative to Intel for so many years, it's worth including for historical context. The "E" models don't have to be used but their lower leakage makes higher clocks less stressful on cooling than a 9000 series. 4.7 GHz was obtainable on a cheap motherboard like the Gigabyte UD3P, with strong airflow to the VRM sink.

    • VIA's highest-performance model. If it won't work with Windows 10 then run the tests on it with 8.1. The thing is, though... VIA released an update fairly recently that should make it compatible with Windows 10. I saw Youtube footage of it gaming, in fact, with a discrete card. It really would be a refreshing thing to see VIA included, even though it's such a bit player.

    • Lynnfield at 3 GHz.

    • i7-9700K, of course.
  • Oxford Guy - Monday, July 20, 2020 - link

    Regarding Deserts of Kharak... It may be that it took advantage of the extra cores. That would make it noteworthy also as an early example of a game that scaled to 8 threads.
  • Oxford Guy - Monday, July 20, 2020 - link

    Also, the Chinese X86 CPU, the one based on Zen 1, would be very nice to have included.
  • Oxford Guy - Monday, July 20, 2020 - link

    VIA CPUs tested with games as recently as 2019 (there was another video of the quad core but I didn't find it today with a quick search):

    https://www.youtube.com/watch?v=JPvKwqSMo-k
    https://www.youtube.com/watch?v=Da0BkEW459E

    The Zhaoxin KaiXian KX-U6880A would be nice to see included, not just the Chinese Zen 1 derivative.
  • Oxford Guy - Monday, July 20, 2020 - link

    "due to thermal throttling"

    TDP throttling, to be more accurate. I suppose it could throttle due to current demand rather than temp.
  • axer1234 - Monday, July 20, 2020 - link

    Honestly, I would love to know how different generations of processors perform today, especially at higher core counts: Prescott-era Pentium 4, Athlon II, Phenom X6, Core 2 Duo, Core 2 Quad, Nehalem, Sandy Bridge, Bulldozer, etc., with today's generation of workloads and offerings.

    In many scenarios, like Word, Excel, PowerPoint, and Photoshop, it all still works very well in many offices; it's just the new generation of applications slowing things down for almost the same work.
  • herefortheflops - Monday, July 20, 2020 - link

    @Dr. Cutress,

    As someone who has been dealing with similar or greater product testing challenges and configuration complexity for the better part of a decade or so, I would like to commend you for your ambitious goals and efforts so far. Additionally, I could be of high value to your effort if you are willing to discuss. I have reviewed the Bench database in depth (as well as competing websites) and I have come to the conclusion that the AnandTech Bench data is of very limited usefulness at present - it would require some significant changes to the data being collected/reported and to the way things have been done to this point. I do understand where the industry is going, the questions the readers are going to be asking of the data, and the major comparisons that will be attempted with the data. Unfortunately, much of your effort may easily become irrelevant unless you proceed with some extreme caution to provide data with more utility. I also know methods to accomplish the desired result while reducing the size and cost of the task at hand. Reply by e-mail if you are interested in talking.

    Best,
    -A potential contributor to your effort.
  • Bensam123 - Tuesday, July 21, 2020 - link

    Despite how impressive this is, one thing that still hasn't been tackled is multiplayer performance, and it vastly changes recommendations for CPUs (it doesn't affect GPUs as much).

    It goes from recommending a 6-core chip hands down to trying to make a case for 4-core chips still in this day and age. I own a 3900X and 2800 and I can tell you hands down that Modern Warfare will gobble 70% of that 12-core chip, sometimes a bit more; that's equivalent to maxing out an 8-core of the same series. That vastly changes recommendations and data points. It's not just Modern Warfare. Overwatch, Black Ops 3 (same engine as MW), and recently Hyper Scape will make use of those extra cores. I have a widget to monitor CPU utilization in the background and I can check Task Manager. If I had a better video card I'm positive it would've sucked down even more of those 12 cores (my GPU is running at 100% load according to MSI AB).

    This is a huge deal. While I understand - I get it - that it's hard to reliably reproduce the same results in a multiplayer environment because it changes so much, and that it's generally seen as taboo from a hardware benchmarking standpoint, it is vastly different from single-player workloads to the point that it requires completely different recommendations. Given how many people are making expensive hardware choices specifically because they play multiplayer games, I would even say most tech reviews in this day and age are irrelevant for CPU recommendations outside of the casual single-player gamer. GPU recommendations are still very much on par; CPU recommendations are not remotely.

    I talk about this frequently on my stream, and it's why I still recommended the 1600 AF even when it was sitting at $105-125 - it's a steal if you play multiplayer games - while most people who either read benchmarking websites or run benchmarks themselves will start making a case for a 4-core Intel. A 6-core is a must at the very least in this day and age.

    AnandTech, it's time to tread new ground and go into uncharted territory. Single-player results and multiplayer results are too different; you can't keep spinning the wheel and expect things to remain the same. You can verify this yourself just by running Task Manager in the background while playing one of the games I mentioned at the lowest settings: regardless of being able to repeat those results exactly, you'll see it's definitely a multi-core landscape for newer multiplayer games.

    Not even touched on in the article.
  • Bensam123 - Tuesday, July 21, 2020 - link

    For clarification, the 70% figure is with SMT off.
