Neural Network Quantization Overview

Neural Network Quantization 101

Table of Contents

1. Introduction
2. INT8 vs INT4
3. Matrix Form

1. Introduction

2. INT8 vs INT4

Quantization Ranges

| Ranges | INT8 | INT4 |
| --- | --- | --- |
| Input range (fp) | [-1, 1] | [-1, 1] |
| Weight range (fp) | [-0.5, 0.5] | [-0.5, 0.5] |
| Quant range | [-128, 127] | [-8, 7] |
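
The step sizes used in the next table follow from these ranges. The sketch below is one way to compute them, assuming the step is the floating-point range width divided by the number of integer steps (2^bits - 1). The helper names `step_size` and `quant_range` are illustrative and not from the original post.

```python
def step_size(fp_min: float, fp_max: float, bits: int) -> float:
    """Floating-point range width divided by the number of integer steps."""
    levels = 2 ** bits                      # 256 codes for INT8, 16 for INT4
    return (fp_max - fp_min) / (levels - 1)


def quant_range(bits: int) -> tuple:
    """Signed two's-complement integer range, e.g. (-128, 127) for 8 bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


if __name__ == "__main__":
    for bits in (8, 4):
        s_x = step_size(-1.0, 1.0, bits)    # input range [-1, 1]
        s_w = step_size(-0.5, 0.5, bits)    # weight range [-0.5, 0.5]
        print(f"INT{bits}: range={quant_range(bits)} s_x={s_x:.6f} s_w={s_w:.7f}")
```

For 8 bits this gives s_x = 2/255 and s_w = 1/255; for 4 bits it gives 2/15 and 1/15.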

Quantization

| Steps | INT8 | INT4 |
| --- | --- | --- |
| $s_x$ | 2/255 ≈ 0.007843 | 2/15 ≈ 0.1333 |
| $s_w$ | 1/255 ≈ 0.0039216 | 1/15 ≈ 0.0667 |
| $q_x$ | 96 | 6 |
| $q_w$ | 102 | 6 |
| $q_y$ | 9792 | 36 |
| $y = s_x \cdot s_w \cdot q_y$ | 0.3012 | 0.3200 |
| Error | +0.0012 | +0.0200 |
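
The table can be traced step by step with a short script. This is a minimal sketch, not the author's implementation. The post does not state the floating-point operands, so x = 0.75 and w = 0.4 are assumptions, chosen because they round to the integer codes listed in the table. The function names are likewise illustrative.

```python
def quantize(value: float, scale: float, q_min: int, q_max: int) -> int:
    """Round to the nearest integer code and clamp to the quantized range."""
    q = round(value / scale)   # Python rounds exact .5 ties to even; no ties occur here
    return max(q_min, min(q_max, q))


def quantized_product(x: float, w: float, bits: int):
    """Quantize x and w, multiply the integer codes, then rescale to floating point."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    steps = 2 ** bits - 1
    s_x = 2.0 / steps          # input range [-1, 1]
    s_w = 1.0 / steps          # weight range [-0.5, 0.5]
    q_x = quantize(x, s_x, q_min, q_max)
    q_w = quantize(w, s_w, q_min, q_max)
    q_y = q_x * q_w            # integer product
    y = s_x * s_w * q_y        # dequantized result
    return q_x, q_w, q_y, y


if __name__ == "__main__":
    x, w = 0.75, 0.4           # assumed operands, not given in the post
    for bits in (8, 4):
        q_x, q_w, q_y, y = quantized_product(x, w, bits)
        print(f"INT{bits}: q_x={q_x} q_w={q_w} q_y={q_y} y={y:.4f} error={y - x * w:+.4f}")
```

With these assumed operands the script reproduces the integer codes in the table (96, 102, 9792 for INT8 and 6, 6, 36 for INT4). The coarser INT4 grid yields the larger error.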

3. Matrix Form

Author: Joe Chu

Posted on: 2025-04-25

Updated on: 2026-05-02
