NNAPI usage on Exynos 2100 devices

iska · Feb 11, 2021

Dear Members,

First of all, I would like to thank you for creating the benchmark table. It provides a helpful overview.

We will soon replace our research object detection smartphones with newer ones, and the Samsung S21 devices show promising performance.
The question is how well TFLite uses the Android NNAPI in combination with the Exynos 2100.

Did you encounter any problems while using the TFLite NNAPI delegate?
For example, were many operations in the graph not supported?

Could you determine which unit (NPU, GPU, DSP) of the Exynos 2100 was used for the inference?

Thanks a lot
iska

Andrey Ignatov · Feb 11, 2021

Hi @iska,

iska said:
how well TFLite uses the Android NNAPI in combination with the Exynos 2100

In general, the performance of the Exynos 2100 chipset is quite good, and it is almost fully compatible with Android NNAPI.

iska said:
Could you determine which unit (NPU, GPU, DSP) of the Exynos 2100 was used for the inference?

All quantized models are running on its NPU (that was designed for this inference type only), while the floating-point networks are accelerated with the Mali-G78 GPU. You can find the corresponding inference time for many popular models here (NN-INT8 / NN-FP16 results).

iska said:
For example, were many operations in the graph not supported?

As mentioned above, the majority of NNAPI ops are supported. You can also download a more detailed development guideline provided directly by the Samsung Exynos team using this link.

iska · Feb 12, 2021

Hello Andrey,

Thank you very much for answering in detail and providing further links.
That will help us in our decision.

cruiser · Apr 14, 2021

Hello @Andrey Ignatov

Could you please tell me a little more about running models with NNAPI on the Exynos 2100? I tried running an int8 mobilenet_v2 model using TFLite NNAPI with the eden-drv accelerator, but it does not seem to use the accelerator (it shows a timing of ~12 ms, while in your benchmark it's 2.2 ms). Do I need to download some SDK or drivers to be able to use the NPU of the Exynos 2100? I have read somewhere that you use the Eden delegate for Samsung phones, but I couldn't find this delegate in TFLite.

Andrey Ignatov · Apr 16, 2021

Hi @cruiser,

cruiser said:
it shows timing ~12 ms

Are you getting this number when running your model using AI Benchmark (PRO Mode -> Custom Model) or with your own scripts?

cruiser · Apr 17, 2021

I am getting this number both in my own scripts (which run NNAPI using TFLite) and in the AI Benchmark app.
Currently, in PRO Mode -> Custom Model for mobilenet-v2, I'm getting:
CPU-int8: 4 ms, CPU-FP: 6.6 ms, NN-int8: 11 ms, NN-fp32: 19 ms.
When I turn on the Eden delegate instead of NNAPI, I'm getting 15-25 ms for NN-int8.

Andrey Ignatov · Apr 21, 2021

@cruiser,

The problem is most likely related to your model quantization code. There are lots of tricks there; you should look at this script first and make sure that you are using uint8 ops only.
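
One way to act on the "uint8 ops only" advice is to list the tensor types stored in the converted file. The sketch below is only an illustration: the helper names are made up for this example, and tolerating int32 is an assumption based on quantized convolution biases normally being stored as int32. `load_tensor_details` relies on `tf.lite.Interpreter.get_tensor_details()` and therefore needs TensorFlow installed; the other helpers are plain Python.

```python
# Sketch: flag tensors in a .tflite model that are neither uint8 nor int32.
# Assumption: int32 is tolerated because quantized conv biases are stored as int32.

ALLOWED_DTYPES = {"uint8", "int32"}


def dtype_name(dtype):
    """Return a plain name such as 'uint8' for a NumPy scalar type or a string."""
    return dtype if isinstance(dtype, str) else getattr(dtype, "__name__", str(dtype))


def unexpected_tensors(tensor_details, allowed=ALLOWED_DTYPES):
    """List (name, dtype) pairs whose dtype is outside the allowed set."""
    return [
        (t["name"], dtype_name(t["dtype"]))
        for t in tensor_details
        if dtype_name(t["dtype"]) not in allowed
    ]


def load_tensor_details(model_path):
    """Read tensor metadata with the TFLite Python interpreter (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    interpreter = tf.lite.Interpreter(model_path=model_path)
    return interpreter.get_tensor_details()


if __name__ == "__main__":
    # Hand-written stand-in for get_tensor_details() output; not a real model.
    demo = [
        {"name": "input", "dtype": "uint8"},
        {"name": "conv/bias", "dtype": "int32"},
        {"name": "dequantize_out", "dtype": "float32"},
    ]
    print(unexpected_tensors(demo))
```

For a real file, `unexpected_tensors(load_tensor_details("quantized_model.tflite"))` would return an empty list if every tensor is uint8 or int32. Any float32 or int8 entries are candidates for the ops that fall back to the CPU.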

cruiser · Apr 27, 2021

Andrey Ignatov said:
@cruiser,

The problem is most likely related to your model quantization code. There are lots of tricks there; you should look at this script first and make sure that you are using uint8 ops only.

I've updated my Android build and now AI Benchmark shows 2.2 ms for mobilenet v2. But I'm still unable to reproduce that in my own scripts. I also tried doing everything as in the script you mentioned, but that doesn't work either. Could you please share the tflite model for mobilenet_v2 that you use in the benchmark?

Andrey Ignatov · Apr 27, 2021

cruiser said:
Could you please share tflite model for mobilenet_v2 that you use in benchmark?

The standard MobileNet-V2 model is used in the benchmark without any structural modifications. If you do the quantization correctly, you should be able to get the same results when running the model using AI Benchmark's PRO Mode.

cruiser · May 3, 2021

Andrey Ignatov said:
The standard MobileNet-V2 model is used in the benchmark without any structural modifications. If you do the quantization correctly, you should be able to get the same results when running the model using AI Benchmark's PRO Mode.

I've tried to run all models from the TensorFlow site and this one from TFHub, but they still refuse to work properly. I've tried running them in PRO Mode -> Custom Model and they show the same timings of ~11 ms. Could you please give a link to a standard MobileNet-V2 with proper quantization?

Andrey Ignatov · May 13, 2021

cruiser said:
they show same timings of ~11 ms. Could you please give a link to a standard Mobilenet-V2 with a proper quantization

You can download a standard MobileNet-V2 TFLite model that will show around 2ms on the Exynos 2100 platform using this link. Note, however, that its accuracy is not high as only a very basic quantization approach was applied. If you use TensorFlow's post-training quantization tools, the resulting model will contain ops not supported properly by any vendor now, thus its runtime would be higher (around 6ms on the Exynos 2100, the corresponding quantized model can be downloaded here).

cruiser · May 13, 2021

Thank you, it works for me as well! Do you know of any docs on the quantization format expected by the Exynos 2100, so that I could quantize different networks? I am OK with writing the quantization on my own, but it's hard to do without knowing the expected format.

Andrey Ignatov · May 19, 2021

cruiser said:
Maybe you know some docs regarding quantization format that is expected by Exynos 2100

You can use TF-v1 API to get the same fully quantized model:

Python:

converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [input], [output])

converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
input_arrays = converter.get_input_arrays()
# Define the correct input stats, more information about it:
# https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
# Using "dummy" quantization ranges
converter.default_ranges_stats = (-10.0, 10.0)

tflite_model = converter.convert()

Note that to get an acceptable accuracy, you need to perform quantization-aware training first and remove the converter.default_ranges_stats option. Otherwise, the same "dummy" interval will be used for quantizing all tensors.
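
To make the two numeric options above concrete, here is a small sketch of the arithmetic behind them. The first function follows the relation given in the TF1 converter documentation for `quantized_input_stats`, real = (quantized - mean) / std_dev. The second uses the textbook affine quantization formula for a (min, max) range; it is an assumption that this matches the converter closely, since its exact rounding may differ. The function names are hypothetical.

```python
def dequantize_input(q, mean, std_dev):
    """Map a uint8 input value to the real value the model sees."""
    return (q - mean) / std_dev


def affine_params(range_min, range_max, num_levels=256):
    """Textbook affine uint8 parameters for a (min, max) range.

    Assumption: plain formula for illustration; the converter may round differently.
    """
    scale = (range_max - range_min) / (num_levels - 1)
    zero_point = round(-range_min * (num_levels - 1) / (range_max - range_min))
    return scale, zero_point


if __name__ == "__main__":
    # Input stats (127.5, 127.5) map the uint8 range onto [-1, 1].
    for q in (0, 255):
        print(q, "->", dequantize_input(q, 127.5, 127.5))

    # The "dummy" range (-10, 10) gives one shared step size for every tensor.
    scale, zero_point = affine_params(-10.0, 10.0)
    print("scale:", scale, "zero_point:", zero_point)
```

With the dummy range, every tensor that lacks its own statistics is quantized in steps of roughly 0.078 regardless of its actual value range, which is consistent with the accuracy warning above.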

cruiser · May 19, 2021

Andrey Ignatov said:
You can use TF-v1 API to get the same fully quantized model:

Thank you for the suggestion, but for some reason it still doesn't work this way. Here's the full script I tried with TF 2.5.0:

Python:

import tensorflow as tf
import numpy as np

tf.compat.v1.disable_eager_execution()

# Model consisting of 1 convolution layer
input_shape = (1, 224, 224, 3)
x_in = tf.keras.layers.Input(shape=input_shape[1:])
x = tf.keras.layers.Conv2D(64, 3, padding='same')(x_in)
model = tf.keras.Model(x_in, x)

with tf.compat.v1.Session() as sess:
    tf.compat.v1.keras.backend.set_session(sess)
    preds = model.predict(np.zeros(input_shape))

    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, model.inputs, model.outputs)
    converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
    input_arrays = converter.get_input_arrays()
    # Define the correct input stats, more information about it:
    # https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
    converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
    # Using "dummy" quantization ranges
    converter.default_ranges_stats = (-10.0, 10.0)

    tflite_model = converter.convert()
    open('quantized_model.tflite', 'wb').write(tflite_model)

I used a very simple model consisting of a single convolution. I get a timing of 16 ms on the Exynos 2100, while the Snapdragon 888 shows 3.3 ms. Do you know what the cause could be?

Andrey Ignatov · May 19, 2021

cruiser said:
Do you know what can be the cause?

Try to define your model using tf.compat.v1 layers instead of Keras.

cruiser · May 19, 2021

Andrey Ignatov said:
Try to define your model using tf.compat.v1 layers instead of Keras.

Tried, but unfortunately it didn't help. The code I used:

Python:

import tensorflow as tf

def create_conv_layer(input_data, num_input_channels, num_filters, filter_shape, name):
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels, num_filters]
    weights = tf.compat.v1.Variable(tf.compat.v1.truncated_normal(conv_filt_shape, stddev=0.03), name=name+'_W')
    bias = tf.compat.v1.Variable(tf.compat.v1.truncated_normal([num_filters]), name=name+'_b')
    out_layer = tf.compat.v1.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')
    out_layer += bias
    return out_layer

with tf.compat.v1.Session() as sess:
    input_shape = (1, 224, 224, 3)
    # Model consisting of 1 convolution layer
    x_in = tf.compat.v1.placeholder(tf.compat.v1.float32, input_shape)
    x_out = create_conv_layer(x_in, 3, 64, [3, 3], name='layer1')
    init = tf.compat.v1.global_variables_initializer()
    sess.run(init)

    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [x_in], [x_out])
    converter.inference_type = tf.compat.v1.lite.constants.QUANTIZED_UINT8
    input_arrays = converter.get_input_arrays()
    # Define the correct input stats, more information about it:
    # https://stackoverflow.com/questions/54830869/understanding-tf-contrib-lite-tfliteconverter-quantization-parameters
    converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}
    # Using "dummy" quantization ranges
    converter.default_ranges_stats = (-10.0, 10.0)

    tflite_model = converter.convert()
    open('quantized_model.tflite', 'wb').write(tflite_model)

Andrey Ignatov · May 27, 2021

cruiser said:
Tried, but unfortunately it didn't help. The code I used:

Well, that's not surprising - you are trying to get an output of size 224 x 224 x 64.

This code + stride of 3 will work:

Python:

x_in = tf.compat.v1.placeholder(tf.compat.v1.float32, input_shape)
x_1 = create_conv_layer(x_in, 3, 64, [3, 3], name='layer_middle')
x_2 = create_conv_layer(x_1, 64, 3, [3, 3], name='layer_middle')
x_out = create_conv_layer(x_2, 3, 3, [3, 3], name='x_out')

Python:

out_layer = tf.compat.v1.nn.conv2d(input_data, weights, [1, 3, 3, 1], padding='SAME')
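
The point about the 224 x 224 x 64 output can be checked with a few lines of arithmetic. This is a sketch with hypothetical helper names; it only uses the standard rule that a convolution with padding='SAME' produces ceil(size / stride) outputs per spatial dimension.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a convolution with padding='SAME'."""
    return math.ceil(size / stride)


def output_elements(height, width, channels, stride):
    """Number of values in the output tensor of one 'SAME' convolution."""
    out_h = same_padding_output(height, stride)
    out_w = same_padding_output(width, stride)
    return out_h * out_w * channels


if __name__ == "__main__":
    # First layer of the test model: 224 x 224 input, 64 output channels.
    for stride in (1, 3):
        side = same_padding_output(224, stride)
        total = output_elements(224, 224, 64, stride)
        print(f"stride {stride}: {side} x {side} x 64 = {total} values")
```

With stride 1 the single layer has to write about 3.2 million output values, while stride 3 brings that down to 360,000, so the timing of the one-layer test is dominated by moving a large output tensor rather than by the convolution itself.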

imao · Oct 7, 2021

Hello @Andrey Ignatov

I've been struggling with testing my custom model on the Exynos 2100 using the NNAPI accelerator.

In my case, the models listed on the site seem to work fine.
This model shows ~8 ms on the Exynos 2100.

Python:

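import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Add
from tensorflow.keras.optimizers import Adam

def unet(pretrained_weights = None,input_size = (256,256,1)):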
    inputs = Input(input_size)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(inputs)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(1, 3,  padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Add()([conv1, inputs])
    model = tf.keras.Model(inputs,  conv1)
    model.compile(optimizer = Adam(lr = 1e-4), loss = 'mse')
    if pretrained_weights:
        model.load_weights(pretrained_weights)

    return model

But the problem is that if I change the model, the inference becomes very slow.
In my case, just deleting or adding a single convolution layer pushes the inference time over 70 ms.
Any idea?

Andrey Ignatov · Oct 8, 2021

imao said:
But the problem is if I changed model, the inference time is very slow.

Are you using exactly the same conversion script as in this post?

As for the model you've shared above, its runtime will not be below 70ms as it is quite heavy. To check if it is running with acceleration, you can first run it on CPU and then with NNAPI to see if there will be a 5-10 times difference in the runtime.
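
A rough way to see why the posted model counts as heavy is to count its multiply-accumulate operations (MACs). The sketch below is an estimate under stated assumptions: the layer list is read off the listing above (one 1 -> 64 convolution, eighteen 64 -> 64 convolutions, one 64 -> 1 convolution, all 3 x 3 on a 256 x 256 input with 'same' padding), biases and the final Add are ignored, and the helper names are hypothetical.

```python
def conv_macs(height, width, kernel, in_ch, out_ch):
    """Multiply-accumulates for one stride-1, 'same'-padded convolution."""
    return height * width * kernel * kernel * in_ch * out_ch


def stack_macs(height, width, kernel, channels):
    """Sum MACs over consecutive layers given the channel count at each stage."""
    return sum(
        conv_macs(height, width, kernel, c_in, c_out)
        for c_in, c_out in zip(channels, channels[1:])
    )


if __name__ == "__main__":
    # Channel counts: 1 -> 64, then eighteen 64 -> 64 layers, then 64 -> 1.
    channels = [1] + [64] * 19 + [1]
    total = stack_macs(256, 256, 3, channels)
    print(f"{total / 1e9:.1f} GMACs per inference")
```

Under these assumptions the model needs roughly 43.6 billion MACs per inference, almost all of it in the 64 -> 64 layers, so removing or adding one such layer changes the total by only about 5%.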

imao · Oct 11, 2021

@Andrey Ignatov Thanks for the reply!
Yes, I am using the same script you mentioned.

As for the model you've shared above, its runtime will not be below 70ms as it is quite heavy

How about the model (assets/models/vgg_quant.tflite) in your APK?
The model I shared above has the same structure as that one (vgg_quant). The model packed in the APK (vgg_quant.tflite) seems to work fine, and its runtime is almost the same as here.
I'm using this site (netron) to inspect the model structure. I think the model I shared above has the same structure as vgg_quant.tflite.

Andrey Ignatov · Oct 15, 2021

imao said:
I think model I shared above has same structure(vgg_quant.tflite)

Yes, the structure seems to be the same. The difference in the runtime is likely caused by using different TF converters: the model you extracted from the APK file was generated using the TF 2.2 nightly build, and lots of things have changed since that time, especially if you are working with quantized models. Samsung Eden delegate was primarily optimized for older TF builds, and thus there is probably some discrepancy between the recent quantizer and the Eden drivers leading to the observed performance degradation.

imao · Oct 18, 2021

@Andrey Ignatov

Code:

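import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Add
from tensorflow.keras.optimizers import Adam

def vgg_quant(pretrained_weights = None,input_size = (256,256,1)):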
    inputs = Input(input_size)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(inputs)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(1, 3,  padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Add()([conv1, inputs])
    model = tf.keras.Model(inputs,  conv1)
    model.compile(optimizer = Adam(lr = 1e-4), loss = 'mse')
    if pretrained_weights:
        model.load_weights(pretrained_weights)

    return model

Code:

def vgg_quant_modified(pretrained_weights = None,input_size = (256,256,1)):
    inputs = Input(input_size)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(inputs)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)

    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(64, 3, activation = 'relu', padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Conv2D(1, 3,  padding = 'same',bias_initializer='he_normal', kernel_initializer = 'he_normal')(conv1)
    conv1 = Add()([conv1, inputs])
    model = tf.keras.Model(inputs,  conv1)
    model.compile(optimizer = Adam(lr = 1e-4), loss = 'mse')
    if pretrained_weights:
        model.load_weights(pretrained_weights)

    return model

The difference in the runtime is likely caused by using different TF converters: the model you extracted from the APK file was generated using the TF 2.2 nightly build, and lots of things have changed since that time, especially if you are working with quantized models.

Hmm. Does the TF converter version affect performance even when the models differ so little? (I'm also using TF 2.2.0.)
The vgg_quant I shared takes 7~8 ms to run on the Exynos 2100, but the second one (vgg_quant_modified) takes 70 ms.

@cruiser Did you solve this?

Andrey Ignatov · Nov 2, 2021

imao said:
Does the TF converter version affect performance even when the models differ so little?

Yes, the TF converter can drastically affect the performance, as it maps the original TF ops to the corresponding TFLite layers, whose specifications are updated quite often (you can notice that even standard ops like convolutions or splits have multiple revisions: V1, V2, V3, etc.). Thus, it might be the case that some accelerator supports, e.g., conv_v1 but not conv_v2, especially when quantized inference is involved.

susite · Dec 7, 2021

Hi, I have some questions about the Exynos NPU.
Does the Exynos NPU not support FP16? Does it support only int8 inference?
What does eden-drv mean: NPU or DSP?

susite · Dec 7, 2021

"The NPU of the Exynos 9820 supports only quantized inference and consists of the controller and two cores (Fig. 2) having 1024 multiply-accumulate (MAC) units [78]."
How about other Exynos chips (990, 1080, 2100)? Do all of them support only int8 inference?

Andrey Ignatov · Dec 17, 2021

susite said:
Does the Exynos NPU not support FP16? Does it support only int8 inference?

susite said:
How about other Exynos chips (990, 1080, 2100)? Do all of them support only int8 inference?

Yes, they all support only int8 inference. For floating-point models, you can run them on Mali GPUs either with the Samsung Eden or with the TFLite GPU delegates.
