Model Optimization & Quantization

Achieving real-time performance on edge hardware requires optimization. This guide details strategies for reducing model size and latency while maintaining accuracy.

1. Post-Training Quantization (PTQ)

Kneron supports standard INT8 PTQ. This is the default mode and requires no retraining.

Calibration Dataset

The quality of quantization depends on your calibration dataset.

  • Use 50-100 images.
  • Images should be representative of the target domain (e.g., if deploying for face recognition, use face images).
  • Preprocessing (resize/normalization) must match the inference-time pipeline exactly, as shown in the sketch below.
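The following sketch shows one way to prepare such a calibration set. The directory names, the 640x640 resize, and the normalization constants are illustrative assumptions, not Kneron toolchain defaults; substitute whatever your deployed inference pipeline actually uses.

# Illustrative calibration-set preparation; preprocessing must mirror inference exactly.
import os
import numpy as np
from PIL import Image

CALIB_DIR = "calibration_images"   # 50-100 representative images
OUT_DIR = "calibration_npy"
os.makedirs(OUT_DIR, exist_ok=True)

def preprocess(path, size=(640, 640), mean=127.5, scale=1 / 127.5):
    # Same resize and normalization as the deployed inference pipeline.
    img = Image.open(path).convert("RGB").resize(size)
    arr = (np.asarray(img, dtype=np.float32) - mean) * scale
    return np.transpose(arr, (2, 0, 1))   # HWC -> CHW

for name in sorted(os.listdir(CALIB_DIR)):
    out_path = os.path.join(OUT_DIR, os.path.splitext(name)[0] + ".npy")
    np.save(out_path, preprocess(os.path.join(CALIB_DIR, name)))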

2. Quantization Aware Training (QAT)

If PTQ results in a significant accuracy drop (> 1%), consider QAT.

QAT simulates quantization noise during training, allowing the weights to adapt.

# PyTorch example (conceptual)
import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# ... retrain for a few epochs so the weights adapt to the simulated quantization ...
model_int8 = torch.quantization.convert(model.eval(), inplace=False)

3. Optimization Tips

Reduce Input Resolution

Halving the input resolution (e.g., 640x640 -> 320x320) cuts the compute load by roughly 4x, because convolutional compute scales with the number of spatial positions: (320 x 320) / (640 x 640) = 1/4. Check whether your accuracy requirements tolerate the smaller input.
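As a minimal sketch, assuming a PyTorch model that is exported to ONNX before conversion, the smaller input size can be fixed at export time. The model variable and output filename are placeholders, not part of the Kneron toolchain API.

# Re-export with a reduced, fixed input resolution ('model' is a placeholder).
import torch

model.eval()
dummy = torch.randn(1, 3, 320, 320)   # previously (1, 3, 640, 640)
torch.onnx.export(model, dummy, "model_320.onnx", opset_version=11)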

Pruning

Zero out individual weights with near-zero magnitude. The Kneron toolchain supports unstructured pruning up to 50% sparsity.
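A generic PyTorch sketch of magnitude-based unstructured pruning is shown below; it uses torch.nn.utils.prune rather than the Kneron toolchain's own pruning pass, and 'model' is a placeholder. Fine-tune for a few epochs afterwards to recover accuracy.

# Zero out the 50% of weights with the smallest magnitude in conv and linear layers.
import torch
import torch.nn.utils.prune as prune

for module in model.modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the zeros into the weight tensor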

4. Performance Analysis

Use the profiler to identify bottlenecks.

Metric          Target (KL720)      Action if Failed
Latency         < 33 ms (30 fps)    Reduce depth or input size.
Accuracy Loss   < 1.0%              Increase calibration set size or use QAT.
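To sanity-check the latency target outside the profiler, a simple wall-clock measurement can be used. This is a sketch, not a toolchain feature: run_inference is a placeholder for whatever inference call your deployment uses (for example, the Kneron runtime API).

# Average end-to-end latency in milliseconds over repeated runs.
import time

def measure_latency(run_inference, warmup=10, iters=100):
    for _ in range(warmup):          # discard warm-up runs
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    return (time.perf_counter() - start) / iters * 1000.0

# Example: print(f"Average latency: {measure_latency(my_run):.1f} ms")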