How to reproduce Float16 models on HTP


New member
Dear AI Benchmark team,

The app was super helpful in benchmarking our NN on Android with the Float16 inference on HTP showing the best performance.

Our question however is, how are you guys running Float16 models on the HTP chip?
The "old" Qualcomm SNPE SDK does not offer the HTP runtime at all, while the newer QNN SDK only seems to be able to run Int8 models on HTP (Float16 models get stuck).
Any pointers on how to reproduce this runtime would be appreciated!

Thank you and best regards,


New member
Answering my own question for anyone else trying to optimize their models for Qualcomm hardware, based on some of our findings in the meantime:
  • Generally, we have found the Qualcomm tools and documentation to be somewhat confusing or inconsistent at times, so be prepared for that.
  • To go from Float32 to Float16 models for the GPU / HTP, use the `--float_bw` parameter of the qnn-pytorch-converter (and the other conversion tools). This parameter is missing from the documentation of at least some QNN versions. We generally advise also checking the --help output of the tools, which sometimes provides better documentation than the official docs.
  • The Float32 or Float16 model coming from the conversion tool can directly run on the QNN GPU backend.
  • Only Float16 (or Int8-quantized) models can run on the HTP. In our experience, the HTP crashes silently unless the Float16 model is loaded from a cached context binary; see the "qnn-context-binary-generator" tool. Neither the need to create a context binary nor the possibility of running Float16 on HTP is mentioned in the Qualcomm docs.
  • Float16 on HTP is the fastest way to deploy our non-quantized models on Qualcomm: about 20% faster than the QNN GPU backend and 45% faster than TF Lite delegates or NNAPI on the GPU.
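For reference, the workflow above looks roughly like the following. This is a sketch from our setup, not an authoritative recipe: the exact flag names, the input name/shape, and the intermediate qnn-model-lib-generator step may differ between QNN SDK versions, so check each tool's --help output.

```shell
# 1) Convert the Float32 PyTorch model to Float16 using the
#    (undocumented) --float_bw parameter. "input" and the shape
#    are placeholders for your own model.
qnn-pytorch-converter \
    --input_network model.pt \
    --input_dim "input" 1,3,224,224 \
    --float_bw 16 \
    --output_path model.cpp

# 2) Compile the generated sources into a model library
#    (assumed step; produces e.g. libmodel.so under ./libs).
qnn-model-lib-generator -c model.cpp -b model.bin -o libs

# 3) Create a cached context binary for the HTP backend -- in our
#    experience this is required for Float16 to run on HTP at all.
qnn-context-binary-generator \
    --model libs/aarch64-android/libmodel.so \
    --backend libQnnHtp.so \
    --binary_file model_htp.serialized
```

The resulting serialized context binary is what you load on-device with the HTP backend instead of the model library itself.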
Hope this helps; let me know if anyone has any questions!

Manuel Kolmet (GLASS Imaging Inc.)


New member
Hey @Mako443 , thanks for answering your own question and sharing the solutions; very much appreciated. We are also working with Qualcomm SoCs (SNPE, QNN) and are increasingly frustrated by the state of their ecosystem (their developer forum literally runs on a potato server) and the unavailability of support. Some of your answers helped us a lot.

If you are interested in exchanging Qualcomm optimization experiences, feel free to DM me.

Best, E.