Achieving real-time performance on edge hardware requires optimization. This guide details strategies for reducing model size and latency while maintaining accuracy.
1. Post-Training Quantization (PTQ)
Kneron supports standard INT8 PTQ. This is the default mode and requires no retraining.
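For reference, INT8 PTQ maps floating-point tensors to 8-bit integers using a scale and zero point derived from the calibration data. The snippet below is a minimal NumPy illustration of that mapping, not the Kneron toolchain API; the variable names are placeholders.

# Minimal illustration of asymmetric INT8 quantization (not the Kneron toolchain API)
import numpy as np

def quantize_int8(x, x_min, x_max):
    # Scale and zero point come from the observed calibration range [x_min, x_max].
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-x_min / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Inverse mapping; the difference from the original x is the quantization error.
    return (q.astype(np.float32) - zero_point) * scale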
Calibration Dataset
The quality of quantization depends on your calibration dataset.
- Use 50-100 images.
- Images should be representative of the target domain (e.g., if deploying for face recognition, use face images).
- Preprocessing (resize/normalization) must match the inference-time logic exactly (a sketch follows this list).
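As a sketch of the last point, the snippet below builds a calibration set whose preprocessing mirrors a hypothetical inference pipeline. The 224x224 size, mean, and std values are placeholders; substitute whatever your deployment actually uses.

# Calibration set with preprocessing that matches inference (values are placeholders)
import glob
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed normalization
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(path, size=(224, 224)):
    img = Image.open(path).convert("RGB").resize(size)  # same resize as inference
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - MEAN) / STD                             # same normalization as inference

calibration_set = [preprocess(p) for p in sorted(glob.glob("calib_images/*.jpg"))[:100]]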
2. Quantization Aware Training (QAT)
If PTQ results in a significant accuracy drop (>1%), consider QAT.
QAT simulates quantization noise during training, allowing the weights to adapt.
# PyTorch example (conceptual); assumes `model` is an eager-mode FP32 model
import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# ... Retrain for a few epochs, then convert ...
# quantized_model = torch.quantization.convert(model.eval(), inplace=False)
3. Optimization Tips
Reduce Input Resolution
Halving both input dimensions (e.g., 640x640 -> 320x320) reduces the compute load roughly 4x, since convolution cost scales with the number of spatial positions. Check whether your accuracy tolerates this (see the export sketch below).
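If you export the model to ONNX before feeding it to the Kneron toolchain, the reduced resolution is fixed at export time. The sketch below assumes a PyTorch model named `model` and is not tied to any particular Kneron API.

# Export at the reduced input resolution (assumes a PyTorch model named `model`)
import torch

dummy = torch.randn(1, 3, 320, 320)  # 320x320 instead of 640x640
torch.onnx.export(model, dummy, "model_320.onnx", opset_version=13)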
Pruning
Prune near-zero weights. The Kneron toolchain supports unstructured pruning at up to 50% sparsity (a PyTorch-side sketch follows).
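As a sketch under the assumption that you prune on the PyTorch side before export, torch.nn.utils.prune can zero out the smallest-magnitude weights; how the Kneron toolchain exploits the resulting sparsity is outside this snippet.

# Unstructured L1 pruning of conv weights to 50% sparsity (assumes `model` is defined)
import torch
import torch.nn.utils.prune as prune

for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero 50% of the weights
        prune.remove(module, "weight")  # bake the zeros into the weight tensor before export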
4. Performance Analysis
Use the profiler to identify bottlenecks.
| Metric | Target (KL720) | Action if Failed |
|---|---|---|
| Latency | < 33 ms (30 fps) | Reduce depth or input size. |
| Accuracy Loss | < 1.0% | Increase calibration set size or use QAT. |
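For a quick host-side sanity check alongside the profiler, a plain wall-clock loop like the one below gives a rough latency figure. `run_inference` is a stand-in for whatever call drives your model; it is not a Kneron API.

# Rough host-side latency check; `run_inference` is a placeholder for your inference call
import time

def measure_latency(run_inference, sample, warmup=10, iters=100):
    for _ in range(warmup):
        run_inference(sample)  # discard warm-up runs
    start = time.perf_counter()
    for _ in range(iters):
        run_inference(sample)
    return (time.perf_counter() - start) / iters * 1000.0  # average ms per inference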