NNAPI usage on Exynos 2100 devices

iska

New member
Dear Members,

First of all, I would like to thank you for creating the benchmark table. It provides a helpful overview.

We will soon replace the smartphones we use for object detection research with newer ones, and the Samsung S21 devices show promising performance.
The question is how well TFLite uses the Android NNAPI in combination with the Exynos 2100.

Did you encounter any problems while using the TFLite NNAPI delegate?
For example, were many operations in the graph not supported?

Could you determine which unit (NPU, GPU, DSP) of the Exynos 2100 was used for the inference?

Thanks a lot
iska
 

Andrey Ignatov

Administrator
Staff member
Hi @iska,

how well TFLite uses the Android NNAPI in combination with the Exynos 2100

In general, the performance of the Exynos 2100 chipset is quite good, and it is almost fully compatible with Android NNAPI.

Could you determine which unit (NPU, GPU, DSP) of the Exynos 2100 was used for the inference?

All quantized models run on its NPU (which was designed for this inference type only), while floating-point networks are accelerated with the Mali-G78 GPU. You can find the corresponding inference times for many popular models here (NN-INT8 / NN-FP16 results).

For example, were many operations in the graph not supported?

As mentioned above, the majority of NNAPI ops are supported. You can additionally download a more detailed development guideline provided directly by the Samsung Exynos team using this link.
 

iska

New member
Hello Andrey,

thank you very much for answering in detail and providing further links.
That will help us in our decision :)
 

cruiser

New member
Hello @Andrey Ignatov

Could you please tell me a little more about running models with NNAPI on the Exynos 2100? I tried running an int8 mobilenet_v2 model using TFLite NNAPI with the eden-drv accelerator, but it does not seem to use the accelerator (it shows a timing of ~12 ms, while in your benchmark it's 2.2 ms). Do I need to download an SDK or drivers to be able to use the NPU of the Exynos 2100? I have read somewhere that you use the Eden delegate for Samsung phones, but I couldn't find this delegate in TFLite.
 

cruiser

New member
I am getting this number both in my own scripts (which run NNAPI through TFLite) and in the AI Benchmark app.
Currently in PRO Mode -> Custom Model for mobilenet-v2 I'm getting:
CPU-int8: 4 ms, CPU-FP: 6.6 ms, NN-int8: 11 ms, NN-fp32: 19 ms.
When I turn on the Eden delegate instead of NNAPI, I'm getting 15-25 ms for NN-int8.
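
When comparing numbers from a custom script against the benchmark app, it helps to time inference the same way every time: discard a few warm-up runs, then take the median over many runs. A minimal sketch, assuming `run_once` is any zero-argument callable that performs one inference (for example a TFLite interpreter's `invoke`). The helper name `median_latency_ms` and the warm-up and run counts are illustrative choices, not something AI Benchmark documents:

```python
import statistics
import time


def median_latency_ms(run_once, warmup=5, runs=50):
    """Return the median wall-clock time of run_once() in milliseconds."""
    # Warm-up runs are discarded so one-off setup cost does not skew the result.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload; in a real script this would wrap one inference call.
    print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")
```

The median is used rather than the mean so that an occasional slow run (thermal throttling, background activity) has less influence on the reported figure.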
 

Andrey Ignatov

Administrator
Staff member
@cruiser,

The problem is most likely related to your model quantization code. There are lots of tricks there; you should try to look at this script first and make sure that you are using uint8 ops only.
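
One way to check the "uint8 ops only" condition is to list the tensor types in the converted model. A minimal sketch, assuming the input is a list of dicts with `name` and `dtype` keys, which is the shape of what `tf.lite.Interpreter.get_tensor_details()` returns. The helper `find_unexpected_tensors` is hypothetical, and allowing `int32` alongside `uint8` is an assumption based on quantized bias tensors normally being stored as int32:

```python
def find_unexpected_tensors(tensor_details, allowed=("uint8", "int32")):
    """Return (name, dtype_name) pairs for tensors whose dtype is not allowed."""
    unexpected = []
    for detail in tensor_details:
        dtype = detail["dtype"]
        # NumPy scalar types expose __name__ (e.g. "uint8"); fall back to str().
        dtype_name = getattr(dtype, "__name__", str(dtype))
        if dtype_name not in allowed:
            unexpected.append((detail["name"], dtype_name))
    return unexpected


if __name__ == "__main__":
    # Hand-written stand-in for interpreter.get_tensor_details() output.
    details = [
        {"name": "input", "dtype": "uint8"},
        {"name": "conv/bias", "dtype": "int32"},
        {"name": "dequantize", "dtype": "float32"},
    ]
    print(find_unexpected_tensors(details))
```

With TensorFlow installed, the list could come from an interpreter created with `tf.lite.Interpreter(model_path=...)`. Any float32 tensor reported here suggests that part of the graph was not fully quantized.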
 

cruiser

New member
The problem is most likely related to your model quantization code. There are lots of tricks there; you should try to look at this script first and make sure that you are using uint8 ops only.
I've updated my Android build and now AI Benchmark shows 2.2 ms for mobilenet v2. But I'm still unable to reproduce that in my own scripts. I also tried doing everything as in the script you mentioned, but that doesn't work either. Could you please share the TFLite model for mobilenet_v2 that you use in the benchmark?
 

Andrey Ignatov

Administrator
Staff member
Could you please share the TFLite model for mobilenet_v2 that you use in the benchmark?

The standard MobileNet-V2 model is used in the benchmark without any structural modifications. If you do the quantization correctly, you should be able to get the same results when running the model using AI Benchmark's PRO Mode.
 

cruiser

New member
The standard MobileNet-V2 model is used in the benchmark without any structural modifications. If you do the quantization correctly, you should be able to get the same results when running the model using AI Benchmark's PRO Mode.
I've tried to run all the models from the TensorFlow site and this one from TFHub, but they still refuse to work properly. I've tried running them in PRO Mode -> Custom Model and they show the same timings of ~11 ms. Could you please give a link to a standard MobileNet-V2 with proper quantization?
 

Andrey Ignatov

Administrator
Staff member
they show the same timings of ~11 ms. Could you please give a link to a standard MobileNet-V2 with proper quantization?

You can download a standard MobileNet-V2 TFLite model that will show around 2 ms on the Exynos 2100 platform using this link. Note, however, that its accuracy is low because only a very basic quantization approach was applied. If you use TensorFlow's post-training quantization tools, the resulting model will contain ops that no vendor currently supports properly, so its runtime will be higher (around 6 ms on the Exynos 2100; the corresponding quantized model can be downloaded here).
 

cruiser

New member
Thank you, it works for me as well! Maybe you know of some docs regarding the quantization format expected by the Exynos 2100, so that I could quantize other networks? I am OK with writing the quantization myself, but it's hard to do without knowing the expected format.
 

Andrey Ignatov

Administrator
Staff member
Maybe you know of some docs regarding the quantization format expected by the Exynos 2100

You can use the TF v1 API to get the same fully quantized model:

Python:
converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [input], [output])

converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
input_arrays = converter.get_input_arrays()
# Define the correct input stats, more information about it:
# https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
# Using "dummy" quantization ranges
converter.default_ranges_stats = (-10.0, 10.0)

tflite_model = converter.convert()
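
The `quantized_input_stats` pair is a (mean, std_dev) tuple, and the TensorFlow documentation describes the mapping as `real_input_value = (quantized_input_value - mean_value) / std_dev_value`. A small worked sketch of what (127.5, 127.5) means for a uint8 input; the helper name `dequantize_input` is made up for illustration:

```python
def dequantize_input(quantized_value, mean, std_dev):
    """Map a uint8 input value to the real value the model sees."""
    return (quantized_value - mean) / std_dev


if __name__ == "__main__":
    mean, std_dev = 127.5, 127.5
    for q in (0, 128, 255):
        print(q, "->", round(dequantize_input(q, mean, std_dev), 4))
```

So with these stats the uint8 range 0..255 corresponds to real inputs in [-1, 1], which matches a model trained on images normalized to that interval. A model trained on inputs in [0, 1] would need different stats.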

Note that to get acceptable accuracy, you need to perform quantization-aware training first and remove the converter.default_ranges_stats option; otherwise the same "dummy" interval will be used for quantizing all tensors.
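
To see why a single "dummy" interval hurts accuracy, consider what an 8-bit grid over (-10, 10) does to small values. The sketch below is a simplified model of uint8 affine quantization written for illustration, not the converter's actual code; `make_uint8_quantizer` is a hypothetical helper:

```python
def make_uint8_quantizer(range_min, range_max):
    """Build quantize/dequantize functions for a fixed real-valued range."""
    scale = (range_max - range_min) / 255.0
    zero_point = round(-range_min / scale)

    def quantize(x):
        q = round(x / scale) + zero_point
        return max(0, min(255, q))  # clamp to the uint8 range

    def dequantize(q):
        return (q - zero_point) * scale

    return quantize, dequantize, scale


if __name__ == "__main__":
    for lo, hi in ((-10.0, 10.0), (-0.1, 0.1)):
        quantize, dequantize, scale = make_uint8_quantizer(lo, hi)
        levels = [quantize(v) for v in (0.0, 0.01, 0.03)]
        print(f"range ({lo}, {hi}): step={scale:.5f}, levels={levels}")
```

With the (-10, 10) range the step is about 0.078, so values such as 0.01 and 0.03 land on the same level as 0.0 and become indistinguishable. With a range fitted to the data, here (-0.1, 0.1), they map to distinct levels, which is what per-tensor ranges from quantization-aware training provide.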
 

cruiser

New member
You can use TF-v1 API to get the same fully quantized model:
Thank you for the suggestion, but for some reason it still doesn't work this way. Here's the full script I tried with TF 2.5.0:

Python:
import tensorflow as tf
import numpy as np

tf.compat.v1.disable_eager_execution()

# Model consisting of 1 convolution layer
input_shape = (1, 224, 224, 3)
x_in = tf.keras.layers.Input(shape=input_shape[1:])
x = tf.keras.layers.Conv2D(64, 3, padding='same')(x_in)
model = tf.keras.Model(x_in, x)

with tf.compat.v1.Session() as sess:
    tf.compat.v1.keras.backend.set_session(sess)
    preds = model.predict(np.zeros(input_shape))

    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, model.inputs, model.outputs)
    converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
    input_arrays = converter.get_input_arrays()
    # Define the correct input stats, more information about it:
    # https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
    converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
    # Using "dummy" quantization ranges
    converter.default_ranges_stats = (-10.0, 10.0)

    tflite_model = converter.convert()
    # Write the converted flatbuffer to disk, closing the file afterwards
    with open('quantized_model.tflite', 'wb') as f:
        f.write(tflite_model)

I used a very simple model consisting of a single convolution. I get a timing of 16 ms on the Exynos 2100, while the Snapdragon 888 shows 3.3 ms. Do you know what the cause could be?
 

cruiser

New member
Try to define your model using tf.compat.v1 layers instead of Keras.
Tried, but unfortunately it didn't help. The code I used:

Python:
import tensorflow as tf

def create_conv_layer(input_data, num_input_channels, num_filters, filter_shape, name):
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels, num_filters]
    weights = tf.compat.v1.Variable(tf.compat.v1.truncated_normal(conv_filt_shape, stddev=0.03), name=name+'_W')
    bias = tf.compat.v1.Variable(tf.compat.v1.truncated_normal([num_filters]), name=name+'_b')
    out_layer = tf.compat.v1.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')
    out_layer += bias
    return out_layer

with tf.compat.v1.Session() as sess:
    input_shape = (1, 224, 224, 3)
    # Model consisting of 1 convolution layer
    x_in = tf.compat.v1.placeholder(tf.compat.v1.float32, input_shape)
    x_out = create_conv_layer(x_in, 3, 64, [3, 3], name='layer1')
    init = tf.compat.v1.global_variables_initializer()
    sess.run(init)

    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [x_in], [x_out])
    converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
    input_arrays = converter.get_input_arrays()
    # Define the correct input stats, more information about it:
    # https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
    converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
    # Using "dummy" quantization ranges
    converter.default_ranges_stats = (-10.0, 10.0)

    tflite_model = converter.convert()
    # Write the converted flatbuffer to disk, closing the file afterwards
    with open('quantized_model.tflite', 'wb') as f:
        f.write(tflite_model)
 

Andrey Ignatov

Administrator
Staff member
Tried, but unfortunately it didn't help. The code I used:

Well, that's not surprising - you are trying to get an output of size 224 x 224 x 64.

This code, combined with a stride of 3, will work:

Python:
x_in = tf.compat.v1.placeholder(tf.compat.v1.float32, input_shape)
x_1 = create_conv_layer(x_in, 3, 64, [3, 3], name='layer_middle')
x_2 = create_conv_layer(x_1, 64, 3, [3, 3], name='layer_middle')
x_out = create_conv_layer(x_2, 3, 3, [3, 3], name='x_out')

Python:
# Inside create_conv_layer, use a stride of 3 instead of 1:
out_layer = tf.compat.v1.nn.conv2d(input_data, weights, [1, 3, 3, 1], padding='SAME')
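
The size difference between the two variants can be worked out by hand: with padding='SAME', each spatial dimension of the output is ceil(input / stride). A minimal sketch of that arithmetic, assuming the stride-3 call replaces the stride-1 call inside create_conv_layer and therefore applies to all three layers; the helper names are made up for illustration:

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a convolution with padding='SAME'."""
    return math.ceil(size / stride)


def activation_counts(input_hw, layer_channels, stride):
    """Yield (height_or_width, channels, element_count) after each conv layer."""
    size = input_hw
    for channels in layer_channels:
        size = same_padding_output(size, stride)
        yield size, channels, size * size * channels


if __name__ == "__main__":
    # Original single layer: 64 filters, stride 1.
    print(list(activation_counts(224, [64], stride=1)))
    # Suggested three layers (64, 3, 3 filters), stride 3.
    print(list(activation_counts(224, [64, 3, 3], stride=3)))
```

The single stride-1 layer emits 224 x 224 x 64 = 3,211,264 values as the model output, whereas the strided three-layer variant ends with a 9 x 9 x 3 output of 243 values, and its largest intermediate tensor is 75 x 75 x 64 = 360,000 values.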
 