Much higher latencies with NNAPI on Snapdragon 888

noodles

New member
| Model | CPU int8 (ms) | CPU FP16 (ms) | CPU FP32 (ms) | GPU int8 (ms) | GPU FP16 (ms) | GPU FP32 (ms) | NNAPI int8 (ms) | NNAPI FP16 (ms) | NNAPI FP32 (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TF-Lite (float) | Unsupported | 218 | 220 | Unsupported | 30 | 31 | Unsupported | 22 | 41 |
| TF-Lite (weight quantized) | Unsupported | 169 | 163 | Unsupported | 19 | 32 | Unsupported | 194 | 156 |
| TF-Lite (full quantized) | 149 | Unsupported | Unsupported | 25 | Unsupported | Unsupported | 8710 | Unsupported | Unsupported |

I ran a very simple model, ESPCN, for super resolution; it is just a 3-layer convolutional network. The results are very strange. Is there something wrong with the quantization, or with something else? If quantization is the reason, how do I deal with it?
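For reference, the fully quantized variant in the last row would normally be produced with post-training full-integer quantization along these lines (a rough sketch only; the saved-model path, file name, and random calibration data are placeholders, not the exact script used):

```python
import numpy as np
import tensorflow as tf

# Sketch of full-integer post-training quantization for a small SR model.
# "espcn_saved_model" is an assumed path to the trained float model.
def representative_dataset():
    # A few calibration inputs with the same shape/range as real frames
    # (random values here only to keep the sketch self-contained).
    for _ in range(100):
        yield [np.random.rand(1, 245, 530, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("espcn_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 kernels so weights and activations are both quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("espcn_int8.tflite", "wb") as f:
    f.write(converter.convert())
```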
 
Last edited:

noodles

New member
| Model | CPU int8 (µs) | CPU FP16 (µs) | CPU FP32 (µs) | GPU int8 (µs) | GPU FP16 (µs) | GPU FP32 (µs) | NNAPI int8 (µs) | NNAPI FP16 (µs) | NNAPI FP32 (µs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TF-Lite (float) | Unsupported | 218000 | 220000 | Unsupported | 30000 | 31400 | Unsupported | 22600 | 41600 |
| TF-Lite (weight quantized) | Unsupported | 169000 | 163000 | Unsupported | 19200 | 32400 | Unsupported | 194000 | 156000 |
| TF-Lite (full quantized) | 149000 | Unsupported | Unsupported | 25900 | Unsupported | Unsupported | 8710000 | Unsupported | Unsupported |

I ran a very simple model, ESPCN, for super resolution; it is just a 3-layer convolutional network. The results are very strange. Is there something wrong with the quantization, or with something else? If quantization is the reason, how do I deal with it?
By the way, my input is large, a 245 × 530 image, and the output is a 980 × 2120 image.
I also used Netron to inspect the model and confirm that the input, output, and weights are int8.
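In addition to Netron, the tensor types can be checked programmatically; a small sketch (assuming the converted file is named espcn_int8.tflite):

```python
import tensorflow as tf

# Sketch: inspect the converted model's I/O types and quantization parameters.
# "espcn_int8.tflite" is an assumed file name for the fully quantized model.
interpreter = tf.lite.Interpreter(model_path="espcn_int8.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_input_details() + interpreter.get_output_details():
    scale, zero_point = detail["quantization"]
    print(detail["name"], detail["shape"], detail["dtype"], scale, zero_point)
    # For a fully quantized model, dtype should be int8 with a non-zero scale.
```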
 
Last edited:

noodles

New member
I also ran the quantized FSRCNN from MAI-2021, and its latency with NNAPI is still much higher than on the CPU or GPU. However, the models that come with AI Benchmark run very fast with NNAPI-int8.
 

Andrey Ignatov

Administrator
Staff member
Hi @noodles,

Is there something wrong with the quantization, or with something else? If quantization is the reason, how do I deal with it?

There are basically two issues:

1. The first issue is that the ESPCN model contains sigmoid and tanh activations, which are not well supported by the TFLite quantizer and NNAPI (see the sketch below this list).

2. The second issue is the same as the one described in this thread; the solution can be found in this post.
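
For the first point, an illustrative sketch of the same 3-layer topology with the sigmoid/tanh activations swapped for ReLU, which quantizes and maps to NNAPI more reliably (the layer widths and the 4x scale factor are assumptions, not the exact ESPCN configuration used above):

```python
import tensorflow as tf

# Sketch only: 3-layer super-resolution network without sigmoid/tanh.
inp = tf.keras.Input(shape=(245, 530, 1))
x = tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu")(inp)
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.Conv2D(16, 3, padding="same")(x)   # 4*4 channels for 4x upscaling
out = tf.nn.depth_to_space(x, 4)                        # 245x530 -> 980x2120
nnapi_friendly_espcn = tf.keras.Model(inp, out)
nnapi_friendly_espcn.summary()
```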

Finally, in the case of the TFLite GPU delegate, it mostly dequantizes the provided INT8 model to FP16, so there is almost no runtime difference between these two modes.
 