Achieving real-time performance on edge hardware requires optimization. This guide details strategies for reducing model size and latency while maintaining accuracy.
1. Post-Training Quantization (PTQ)
Kneron supports standard INT8 PTQ. This is the default mode and requires no retraining.
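For reference, INT8 PTQ maps floating-point tensors to 8-bit integers using a scale and zero point derived from the calibration data. The snippet below is a minimal NumPy illustration of that mapping, not the Kneron toolchain API; the variable names are placeholders.

# Minimal illustration of asymmetric INT8 quantization (not the Kneron toolchain API)
import numpy as np

def quantize_int8(x, x_min, x_max):
    # Scale and zero point come from the observed calibration range [x_min, x_max].
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Inverse mapping; the difference from the original x is the quantization error.
    return (q.astype(np.float32) - zero_point) * scale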
Calibration Dataset
The quality of quantization depends on your calibration dataset.
- Use 50-100 images.
- Images should be representative of the target domain (e.g., if deploying for face recognition, use face images).
- Preprocessing (resize/normalization) must match the inference-time logic exactly (a sketch follows this list).
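As a sketch of the last point, the snippet below builds a calibration set whose preprocessing mirrors a hypothetical inference pipeline. The 224x224 size, mean, and std values are placeholders; substitute whatever your deployment actually uses.

# Calibration set with preprocessing that matches inference (values are placeholders)
import glob
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed normalization
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(path, size=(224, 224)):
    img = Image.open(path).convert("RGB").resize(size)  # same resize as inference
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - MEAN) / STD                             # same normalization as inference

calibration_set = [preprocess(p) for p in sorted(glob.glob("calib_images/*.jpg"))[:100]]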
2. Quantization Aware Training (QAT)
If PTQ results in a significant accuracy drop (>1%), consider QAT.
QAT simulates quantization noise during training, allowing the weights to adapt.
# PyTorch example (conceptual); assumes `model` is an eager-mode FP32 model
import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# ... Retrain for a few epochs, then convert ...
# quantized_model = torch.quantization.convert(model.eval(), inplace=False)
3. Optimization Tips
Reduce Input Resolution
Halving both input dimensions (e.g., 640x640 -> 320x320) reduces the compute load roughly 4x, since convolution cost scales with the number of spatial positions. Check whether your accuracy tolerates this (see the export sketch below).
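If you export the model to ONNX before feeding it to the Kneron toolchain, the reduced resolution is fixed at export time. The sketch below assumes a PyTorch model named `model` and is not tied to any particular Kneron API.

# Export at the reduced input resolution (assumes a PyTorch model named `model`)
import torch

dummy = torch.randn(1, 3, 320, 320)  # 320x320 instead of 640x640
torch.onnx.export(model, dummy, "model_320.onnx", opset_version=13)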
Pruning
Prune near-zero weights. The Kneron toolchain supports unstructured pruning at up to 50% sparsity (a PyTorch-side sketch follows).
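As a sketch under the assumption that you prune on the PyTorch side before export, torch.nn.utils.prune can zero out the smallest-magnitude weights; how the Kneron toolchain exploits the resulting sparsity is outside this snippet.

# Unstructured L1 pruning of conv weights to 50% sparsity (assumes `model` is defined)
import torch
import torch.nn.utils.prune as prune

for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero 50% of the weights
        prune.remove(module, "weight")  # bake the zeros into the weight tensor before export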
4. Performance Analysis
Use the profiler to identify bottlenecks.
| Metric | Target (KL720) | Action if Failed |
|---|---|---|
| Latency | < 33 ms (30 fps) | Reduce depth or input size. |
| Accuracy Loss | < 1.0% | Increase calibration set size or use QAT. |
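For a quick host-side sanity check alongside the profiler, a plain wall-clock loop like the one below gives a rough latency figure. `run_inference` is a stand-in for whatever call drives your model; it is not a Kneron API.

# Rough host-side latency check; `run_inference` is a placeholder for your inference call
import time

def measure_latency(run_inference, sample, warmup=10, iters=100):
    for _ in range(warmup):
        run_inference(sample)  # discard warm-up runs
    start = time.perf_counter()
    for _ in range(iters):
        run_inference(sample)
    return (time.perf_counter() - start) / iters * 1000.0  # average ms per inference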