Reference for error/accuracy


New member
From the 2019 AI benchmark paper:

"For each corresponding test, the L1 loss is computed between the target and actual outputs produced by the deep learning models."
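
As a minimal sketch of the metric quoted above: the quote does not say whether the L1 loss is averaged or summed, so the mean used here is an assumption, and `l1_loss` and the sample arrays are hypothetical names and values for illustration only.

```python
import numpy as np

def l1_loss(target, actual):
    """Mean absolute difference between target and actual outputs.

    Assumption: the loss is averaged over elements; a summed variant
    would only differ by a constant factor.
    """
    target = np.asarray(target, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    return float(np.mean(np.abs(target - actual)))

# Hypothetical target (reference) and actual (device) outputs.
target = np.array([0.5, 1.0, -2.0])
actual = np.array([0.5, 1.25, -2.5])

loss = l1_loss(target, actual)
print(loss)  # 0.25
```

The number is only as meaningful as the target it is measured against, which is the point of the rest of this post.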

Ideally, the error of a [rounded] result (by whatever metric you like) should be computed against an exact result, not against a result that could carry similar [rounding] errors. That usually isn't feasible, but comparing against a reference whose rounding error (relative to exact) is significantly smaller than that of the actual outputs is likely good enough.
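
A small illustration of this point, using the single value 0.1 as a stand-in for a model output (the variable names are hypothetical). Measured against a reference of the same precision the error looks like zero, while a higher-precision reference gives nearly the same answer as the exact value.

```python
from fractions import Fraction
import numpy as np

exact = Fraction(1, 10)          # the true value, as an exact rational

result_fp32 = np.float32(0.1)    # "actual output" stored in FP32
ref_fp32 = np.float32(0.1)       # reference with the same rounding error
ref_fp64 = np.float64(0.1)       # reference with far smaller rounding error

# Same-precision reference: the shared rounding error cancels out.
err_vs_fp32 = abs(float(result_fp32) - float(ref_fp32))

# Higher-precision reference: close to the error against the exact value.
err_vs_fp64 = abs(float(result_fp32) - float(ref_fp64))
err_vs_exact = float(abs(Fraction(float(result_fp32)) - exact))

print(err_vs_fp32)   # 0.0
print(err_vs_fp64)   # about 1.49e-09
print(err_vs_exact)  # about 1.49e-09
```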

Results (forward-pass/inference) from training also have error. Even with IEEE-754 compliance and the same types, moving from one platform to another can produce different results. Section 11 of IEEE-754-2019 covers reproducibility, which typically isn't feasible across platforms for ML operations. The main problem is that floating-point addition isn't associative: dot-product/summation results (from matrix-multiply or convolutional layers) are sensitive to order/grouping.
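
The order/grouping sensitivity can be shown with three FP32 values (chosen here purely for illustration). Near 1e8 the spacing between adjacent FP32 values is 8, so adding 1 to -1e8 is lost to rounding.

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Same three operands, two groupings.
left = (a + b) + c    # (1e8 - 1e8) + 1 = 1.0
right = a + (b + c)   # -1e8 + 1 rounds back to -1e8, so the sum is 0.0

print(left, right)    # 1.0 0.0
```

A platform that splits a dot product across lanes or tiles is effectively choosing a different grouping, which is enough to change the result without violating IEEE-754.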

For <=16-bit inference, I suggest producing reference results that use at least FP32 intermediates/outputs. And if you are trying to determine the error of FP32 inference, you'd need FP64 as a reference. Training could still happen in FP16, but a final inference pass using the more accurate FP32 would be best. Maybe an FP16 reference is good enough for int8, but FP16 has only 11 bits of significand precision (against 8 bits for unsigned 8-bit data), so it may be only marginally more accurate.
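
A sketch of why FP16 intermediates make a weak reference. It uses a deliberately simple summation of ones rather than real network data, and the loop is a hypothetical stand-in for an accumulator inside a dot product.

```python
import numpy as np

# 10 stored fraction bits plus the implicit leading bit.
precision_bits = np.finfo(np.float16).nmant + 1   # 11

# Every unsigned 8-bit value survives a round trip through FP16 ...
u8 = np.arange(256)
u8_exact = bool(np.array_equal(u8.astype(np.float16).astype(np.int64), u8))

# ... but integers above 2**11 = 2048 do not: 2049 rounds to 2048.
fp16_2049 = float(np.float16(2049))

# Accumulating 4096 ones: the FP16 accumulator stalls at 2048 because
# 2048 + 1 rounds back to 2048, while the FP32 accumulator is exact.
acc16 = np.float16(0)
acc32 = np.float32(0)
for _ in range(4096):
    acc16 = np.float16(acc16 + np.float16(1))
    acc32 = np.float32(acc32 + np.float32(1))

print(precision_bits, u8_exact, fp16_2049)  # 11 True 2048.0
print(float(acc16), float(acc32))           # 2048.0 4096.0
```

So individual int8 values fit in FP16 with a few bits to spare, but sums of their products outgrow 11 bits quickly, which is why FP32 intermediates are the safer choice for the reference.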

If FP16 references are used, maybe this is only a problem for FP16 inference?

Andrey Ignatov

Staff member
Hi @bagofwater,

Thank you for your suggestions.

"For <=16-bit inference, I suggest producing reference results that use at least FP32 intermediates/outputs."

For FP16 inference, the targets are generated in FP32 mode, which provides an accuracy of about 7 significant decimal digits, so there are no issues here.


New member

Do you also have the error for FP16 vs. this FP32 mode on the same training platform? That would give an idea of what just changing the tensor types to FP16 does to accuracy, and it would be a target for the inference platforms. I guess this will only work easily if all the tensor values are within FP16 range; otherwise some scaling will be needed. So, maybe only if training happened in FP16 (where the ranges fit FP16 by construction).
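
To make the range concern concrete, here is a minimal sketch with one hypothetical activation value that exceeds the FP16 maximum of 65504. The power-of-two scale is an assumption chosen so that the scaling itself adds no rounding error.

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)   # 65504.0

# Hypothetical FP32 activation outside FP16 range.
activations = np.array([131072.0], dtype=np.float32)   # 2**17

# Straight cast: the value overflows to infinity.
with np.errstate(over="ignore"):
    naive = activations.astype(np.float16)

# Scale into range before the cast, undo the scale afterwards in FP32.
scale = np.float32(0.25)
scaled = (activations * scale).astype(np.float16)   # 32768.0, representable
restored = scaled.astype(np.float32) / scale

print(fp16_max)             # 65504.0
print(naive[0])             # inf
print(float(restored[0]))   # 131072.0
```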

For int8 inference, are the targets also from FP32? Producing some kind of reference int8 accuracy will clearly be more difficult, as scaling is certainly needed (and how much effort goes into determining the scales, and at what granularity, could be quite impactful).
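
A rough sketch of how much scale granularity can matter. It assumes symmetric max-abs int8 quantization, which is one common scheme and not necessarily what any benchmark or runtime uses; `quantize_dequantize` and the weight values are hypothetical.

```python
import numpy as np

def quantize_dequantize(w, scale):
    """Symmetric int8 round trip: quantize with `scale`, then dequantize."""
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float64) * scale

# Two output channels (rows) with very different magnitudes.
weights = np.array([[0.01, -0.02, 0.03],
                    [10.0, -20.0, 127.0]])

# Per-tensor: one scale for everything, dominated by the large channel.
tensor_scale = np.max(np.abs(weights)) / 127.0
per_tensor = quantize_dequantize(weights, tensor_scale)

# Per-channel: one scale per row, broadcast across that row.
channel_scale = np.max(np.abs(weights), axis=1, keepdims=True) / 127.0
per_channel = quantize_dequantize(weights, channel_scale)

per_tensor_err = float(np.mean(np.abs(weights - per_tensor)))
per_channel_err = float(np.mean(np.abs(weights - per_channel)))

print(per_tensor_err)    # about 0.01: the small channel collapses to zero
print(per_channel_err)   # much smaller
```

With the per-tensor scale the whole first channel quantizes to zero, so a reference int8 accuracy depends heavily on which of these choices is taken as the baseline.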

Are the error/accuracy measures for fp16 and int8 on the same scale, so that one can compare fp16 vs. int8 accuracy for the same network?