On ImageNet, a dataset of images of everyday objects, the researchers trained 50 versions of an image recognition model.

Despite all 50 models scoring more or less the same on the training test, suggesting they were equally accurate, their performance on the stress tests varied wildly.

The stress tests used ImageNet-C, a dataset of images from ImageNet that have been pixelated or otherwise altered, and ObjectNet, a dataset of images of everyday objects in unusual poses, such as chairs on their backs, upside-down teapots, and T-shirts hanging from hooks.

Some of the 50 models did well with the pixelated images, some did well with the unusual poses, and some did far better overall than others.

Yet as far as the standard training process was concerned, they were all the same.
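To make the shape of that experiment concrete, here is a minimal, self-contained sketch of the idea, not the paper's actual setup: synthetic data and a small scikit-learn network stand in for ImageNet and the real models, and added noise is a crude stand-in for ImageNet-C-style corruption. It illustrates how models that differ only in random seed can score alike on the clean test set yet spread out under a shift.

```python
# A minimal sketch (an assumption-laden stand-in, not the paper's setup):
# train identically configured models that differ only in random seed,
# then compare them on a clean test set versus a "stress" set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data stands in for ImageNet.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Added Gaussian noise is a crude stand-in for ImageNet-C-style corruption.
rng = np.random.default_rng(0)
X_stress = X_test + rng.normal(scale=1.5, size=X_test.shape)

for seed in range(10):  # the study trained 50 copies; 10 keeps this quick
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=600, random_state=seed)
    model.fit(X_train, y_train)
    clean = model.score(X_test, y_test)
    stress = model.score(X_stress, y_test)
    print(f"seed {seed:2d}: clean accuracy {clean:.3f}, stress accuracy {stress:.3f}")
```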

In each case, the researchers found the same problem: models that should have been equally accurate performed differently when tested with real-world data, such as different retinal scans or skin types.

"The biggest and most immediate takeaway is we need to do a lot more testing," he says.

The stress tests were tailored to each task, using either data taken from the real world or data designed to mimic it.

Some stress tests are also at odds with one another: for example, models that were good at recognizing pixelated images were often bad at recognizing high-contrast images.

Training a single model that passes all stress tests may not always be possible.

One option is to add an extra stage to the training and testing process in which many models are produced at once, rather than just one.

These competing models could then be tested again on specific real-world tasks to select the best one for the job.
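Below is a hedged sketch of what that extra stage could look like, under the same toy assumptions as the snippet above (synthetic data, a small scikit-learn network, and noise and rescaling as crude stand-ins for task-specific stress sets). It is not the paper's or Google's actual pipeline, just the train-many-then-select idea in miniature.

```python
# Sketch of a two-stage pipeline: train many candidates at once, then pick
# the best per deployment target using task-specific stress sets.
# All data, models, and "corruptions" below are toy assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: produce a pool of models that differ only in random seed.
candidates = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=600, random_state=s)
    .fit(X_train, y_train)
    for s in range(10)
]

# Stage 2: re-test the pool on stress sets that mimic each deployment
# target, and select the best candidate for each one.
rng = np.random.default_rng(0)
stress_sets = {
    "noisy inputs": X_test + rng.normal(scale=1.5, size=X_test.shape),
    "rescaled inputs": X_test * 3.0,
}
for target, X_stress in stress_sets.items():
    best = max(candidates, key=lambda m: m.score(X_stress, y_test))
    print(f"{target}: best candidate scores {best.score(X_stress, y_test):.3f}")
```

The selection criterion here is a bare argmax over stress-set accuracy; the study's broader point is that whatever criterion is used has to encode the requirements of the actual deployment setting.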

It could be worth it for a company like Google, which is building and deploying large models, says Yannic Kilcher, a machine-learning researcher at ETH Zurich.

Google could offer 50 different versions of an NLP model and developers of applications could choose the one that worked best for them, he says.

"We have to get better at specifying exactly what our models require," he says.

"Because often what happens is that we only discover those requirements after the model fails in the world."

Read the original article, "The way we train AI is essentially flawed," at https://www.technologyreview.com/2020/11/18/1012234/training-machine-learning-broken-real-world-heath-nlp-computer-vision/