A couple of weeks ago, a popular account on Twitter posted this:

This sparked a bit of discussion, including this quote tweet:

I think these tweets, particularly the second one, demonstrate some common misconceptions about evaluations and benchmarking that I’ve been seeing recently, so I figured this could be a useful case study to explore how …