Neural Network Quantization Overview
Neural Network Quantization 101
Table of Contents
1. Introduction
2. INT8 vs INT4
3. Matrix Form
1. Introduction
2. INT8 vs INT4
Quantization Ranges
| Ranges | INT8 | INT4 |
|---|---|---|
| Input range (fp) | [-1, 1] | [-1, 1] |
| Weight range (fp) | [-0.5, 0.5] | [-0.5, 0.5] |
| Quantized range | [-128, 127] | [-8, 7] |
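
The step sizes implied by these ranges can be worked out by hand. The sketch below assumes the scale is the width of the floating-point range divided by the number of integer steps; the symbols $r_{\min}$, $r_{\max}$, $q_{\min}$, $q_{\max}$ are introduced here for illustration and are not named in the original.

$$
s = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}
$$

$$
s_x^{\text{INT8}} = \frac{1 - (-1)}{127 - (-128)} = \frac{2}{255}, \qquad
s_w^{\text{INT8}} = \frac{0.5 - (-0.5)}{127 - (-128)} = \frac{1}{255}
$$

$$
s_x^{\text{INT4}} = \frac{1 - (-1)}{7 - (-8)} = \frac{2}{15}, \qquad
s_w^{\text{INT4}} = \frac{0.5 - (-0.5)}{7 - (-8)} = \frac{1}{15}
$$

With only 15 steps instead of 255, each INT4 step is 17 times coarser than the corresponding INT8 step.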
Quantization
| Steps | INT8 | INT4 |
|---|---|---|
| $s_x$ | 2/255 ≈ 0.007843 | 2/15 ≈ 0.1333 |
| $s_w$ | 1/255 ≈ 0.0039216 | 1/15 ≈ 0.0667 |
| $q_x$ | 96 | 6 |
| $q_w$ | 102 | 6 |
| $q_y = q_x \cdot q_w$ | 9792 | 36 |
| $y = s_x \cdot s_w \cdot q_y$ | 0.3012 | 0.3200 |
| Error | +0.0012 | +0.0200 |
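
The arithmetic in the table can be reproduced with a short Python sketch. The original does not state the floating-point operands, so `X = 0.75` and `W = 0.4` are assumptions, chosen because they round to the $q_x$ and $q_w$ values in the table. The helper names (`scale`, `quantize`, `quantized_product`) are hypothetical, and the zero point is taken to be 0, as the table's formula for $y$ suggests.

```python
def scale(fp_min, fp_max, q_min, q_max):
    """Step size: float range width divided by the number of integer steps."""
    return (fp_max - fp_min) / (q_max - q_min)


def quantize(value, s, q_min, q_max):
    """Round to the nearest step and clamp to the integer range (zero point assumed 0)."""
    return max(q_min, min(q_max, round(value / s)))


def quantized_product(x, w, q_min, q_max):
    """Quantize x and w, multiply as integers, then dequantize the product."""
    s_x = scale(-1.0, 1.0, q_min, q_max)   # input range from the table
    s_w = scale(-0.5, 0.5, q_min, q_max)   # weight range from the table
    q_x = quantize(x, s_x, q_min, q_max)
    q_w = quantize(w, s_w, q_min, q_max)
    q_y = q_x * q_w                        # integer-only multiply
    y = s_x * s_w * q_y                    # back to floating point
    return q_x, q_w, q_y, y


# Assumed operands (not given in the original).
X, W = 0.75, 0.4

for name, (q_min, q_max) in {"INT8": (-128, 127), "INT4": (-8, 7)}.items():
    q_x, q_w, q_y, y = quantized_product(X, W, q_min, q_max)
    print(f"{name}: q_x={q_x} q_w={q_w} q_y={q_y} y={y:.4f} error={y - X * W:+.4f}")
```

Running it prints one line per bit width, which can be compared against the rows above. Changing `X` and `W` shows how the rounding error varies with the operands while the step sizes stay fixed.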
3. Matrix Form
http://chuzcjoe.github.io/CGV/cgv-neural-network-quantization-overview/
