TFLite Micro is the most popular neural network inference engine for microcontrollers. It is designed to run lightweight models on low-power hardware.
The most common scenario is that you train a model with a framework like PyTorch or TensorFlow, then quantize it to a TFLite model. Finally, you run the model on an embedded device through the TFLite Micro library. However, the model usually loses precision during the quantization process. People would like to use the TFLite Python API to run the model on their computer and verify that the accuracy drop is acceptable before deploying it to the embedded device. The problem is that the TFLite Python API cannot really simulate the model's precision when it runs on the embedded device with the TFLite Micro library. There are already lots of discussions about this issue, see GitHub discussions 1 and 2
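For context, the quantization step in that workflow typically uses the TFLite converter. Below is a minimal sketch of post-training int8 quantization; the SavedModel path, input shape, and the random representative dataset are placeholders you would replace with your own model and calibration data.

import numpy as np
import tensorflow as tf

# Placeholder: point this at your own exported SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Post-training int8 quantization needs a small representative dataset
# so the converter can calibrate activation ranges.
def representative_dataset():
    for _ in range(100):
        # Placeholder shape; feed real samples from your data here.
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)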
Unknown to many people, TFLite Micro also has a Python API, and it is different from the TFLite Python API. The TFLite Micro Python API is designed to run on the same principles as the TFLite Micro C++ library, which means it can provide a more accurate simulation of how the model will perform on embedded devices.
In this blog I will compare the differences between the TFLite Python API and the TFLite Micro Python API, and show how the two APIs can run the same model yet produce different results.
TFLite Python API
TFLite has since been decoupled from TensorFlow and is now a standalone library called LiteRT.
Loading model:
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path=model)
Allocate tensors:
interpreter.allocate_tensors()
It performs dynamic memory allocation: a. it allocates memory buffers for the model inputs and outputs, b. allocates memory for all intermediate computation results between layers, c. allocates workspace memory needed during inference, and d. finalizes the computation graph and prepares it for execution.
Get input and output tensors details:
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
Set input tensor:
interpreter.set_tensor(input_details[0]['index'], x)
Run inference:
interpreter.invoke()
Get output tensor:
output_data = interpreter.get_tensor(output_details[0]['index'])
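Putting the TFLite Python API steps together, a minimal end-to-end sketch looks like the following; the model path is a placeholder and the input is a dummy array shaped from the model's own input details.

import numpy as np
from ai_edge_litert.interpreter import Interpreter

# Placeholder model path.
interpreter = Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
x = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])

interpreter.set_tensor(input_details[0]['index'], x)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data.shape)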
TFLite Micro Python API
The TFLite Micro Python API is designed to run on the same principles as the TFLite Micro C++ library, so it is more closely aligned with the C++ API.
Loading the model: there are two ways to load the model file, one from a file and another from a bytearray.
from tflite_micro.python.tflite_micro import runtime

# Load from file
interpreter = runtime.Interpreter.from_file(model_path=model)

# Load from bytearray
interpreter = runtime.Interpreter.from_bytes(model=model_bytes)
There is no need to allocate tensors. TFLite Micro pre-allocates all memory at compile time or initialization using a fixed-size arena. Embedded systems often lack heap allocation or have strict memory constraints, so TFLite Micro avoids malloc/free entirely.
Get input and output tensors details:
input_details = interpreter.get_input_details(index)
output_details = interpreter.get_output_details(index)
Compared to the TFLite Python API, where you can get all input and output details through a single function call, the TFLite Micro Python API requires you to specify the index of the input or output tensor to get its details. For example, if you have multiple inputs or outputs, you can get the details of each one by specifying its index:
input_details = interpreter.get_input_details(0)   # Get details of first input tensor
output_details = interpreter.get_output_details(1) # Get details of second output tensor
Set input tensor:
# Set input tensor using the input index
interpreter.set_input(x, 0)

# If you have second input tensor:
interpreter.set_input(y, 1)
Compared to the TFLite Python API, where you set the input tensor using the index from the input details, in the TFLite Micro Python API you set the input tensor using the input index directly. The index range depends on how many inputs you have. The argument order is also reversed: the data comes first, then the index.
Run inference:
interpreter.invoke()
Get output tensor:
output_data = interpreter.get_output(0)
The same as setting the input tensor, you use the output index, whose range depends on how many outputs the model has.
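Putting the TFLite Micro Python API steps together, a minimal end-to-end sketch could look like the one below. The model path is a placeholder, and the sketch assumes the details dict returned by get_input_details exposes 'shape' and 'dtype' keys like its TFLite counterpart; check the package if the keys differ.

import numpy as np
from tflite_micro.python.tflite_micro import runtime

# Placeholder model path.
interpreter = runtime.Interpreter.from_file(model_path="model_int8.tflite")

# No allocate_tensors() call: memory comes from a fixed-size arena.
input_details = interpreter.get_input_details(0)
x = np.zeros(input_details['shape'], dtype=input_details['dtype'])

interpreter.set_input(x, 0)   # data first, then the input index
interpreter.invoke()
output_data = interpreter.get_output(0)
print(output_data.shape)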
Difference between TFLite and TFLite Micro Output
I will use the yolov8n example from Ultralytics to compare the differences between tflite and tflite-micro. They provide a script that runs inference with a YOLOv8 TFLite model and visualizes the result, which makes it quite convenient. I made another class that subclasses the YOLOv8TFLite example but uses the tflite-micro Python package to do the inference (a sketch of the idea is shown below).
If you are interested in following the example, you can refer to my repo tflite_vs_tflite-micro.
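The subclassing idea itself is small. Here is a hedged sketch; the attribute name self.model and the run_inference hook are illustrative placeholders, not the exact names used in the Ultralytics YOLOv8TFLite example, so see the repo for the real code.

from tflite_micro.python.tflite_micro import runtime

class YOLOv8TFLiteMicro(YOLOv8TFLite):  # YOLOv8TFLite comes from the Ultralytics example
    def __init__(self, model_path, conf=0.25, iou=0.45):
        super().__init__(model_path, conf, iou)
        # Swap the TFLite interpreter created by the parent for a
        # TFLite Micro one; pre/post-processing is reused unchanged.
        self.model = runtime.Interpreter.from_file(model_path=model_path)

    def run_inference(self, x):
        # Hypothetical hook: data first, then index, and no allocate_tensors().
        self.model.set_input(x, 0)
        self.model.invoke()
        return self.model.get_output(0)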
Inference Speed
Compared to the TFLite Python package, the TFLite Micro Python package is much slower to run on a PC. In this example, the TFLite Python package only takes 0.04 seconds while the TFLite Micro Python package takes 76.7 seconds, roughly 1917 times slower!
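The timings above are simple wall-clock measurements around a single invoke() call; a minimal sketch of how you might measure them yourself, assuming an interpreter is already loaded as in the earlier snippets:

import time

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start
print(f"inference took {elapsed:.3f} s")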
Inference Result
If you look closely, you can see that the bounding box of the bus is different, and tflite predicts 0.34 confidence for the cut-off person while tflite-micro predicts 0.39.
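If you want to quantify the divergence instead of eyeballing the rendered boxes, one way is to feed the same input through both interpreters and compare the raw output tensors. A minimal sketch, assuming both interpreters, the input array x, and the TFLite input/output details are already set up as in the snippets above (the two interpreter variables are renamed here only to tell them apart):

import numpy as np

# tflite_interpreter: ai_edge_litert Interpreter, tensors already allocated
# micro_interpreter: tflite_micro runtime Interpreter
tflite_interpreter.set_tensor(input_details[0]['index'], x)
tflite_interpreter.invoke()
tflite_out = tflite_interpreter.get_tensor(output_details[0]['index'])

micro_interpreter.set_input(x, 0)
micro_interpreter.invoke()
micro_out = micro_interpreter.get_output(0)

# Cast to float so quantized int8 outputs can be compared numerically.
diff = np.abs(tflite_out.astype(np.float32) - micro_out.astype(np.float32))
print("max abs diff:", diff.max(), "mean abs diff:", diff.mean())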
Conclusion
From the example above, we can see that, due to their implementation differences, the TFLite Python API and the TFLite Micro Python API produce different results. From my past experience, I have seen worse cases than this, where the tflite and tflite-micro results diverge a lot. My suggestion: if you want to check or reproduce your embedded tflite-micro results with a Python script, the tflite-micro Python package is a better choice than the tflite Python package. But due to its slow inference speed, the tflite-micro package is not suitable for evaluating an entire evaluation dataset.