llama.cpp

Commit Graph

Author	SHA1	Message	Date
Kawrakow	1bfc153e2f	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)	1 year ago
Kawrakow	5ecff35151	Adding a simple program to measure speed of dot products (#1041)	1 year ago

Kawrakow

1bfc153e2f

ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)

* A faster version for Q4_1 x Q8_0 dot products

The idea behind this change is that Q8_0 quantized values
are used many times in the matrix multiplications they take
part in. In the current implementation, evaluating each dot
product requires computing the sum of the quants in the
Q8_0 vector, so the same operation is repeated many times.
Here we pre-compute the sum during Q8_0 quantization, store
it in the now-modified block_q8_0 struct, and reuse it in
the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot
products), this change speeds up the Q4_1 * Q8_0 dot
product by 80%, making the performance identical to
Q4_0 * Q8_0.

In practice, I see a ~15% gain in token-prediction speed
on an M2 and a ~5% gain on a Ryzen 7950X. The speed gain
in prompt evaluation is much bigger (around 50%).

I have only done the change for the scalar version,
ARM_NEON, and AVX2, so we still need an AVX implementation.

* Cleaning up

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
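
The pre-computed sum described in commit 1bfc153e2f can be illustrated with a small sketch. It assumes a Q4_1 block stores values as x ≈ d_x * q + m_x (q in 0..15) and a Q8_0 block stores values as y ≈ d_y * p (p in -127..127), so a block dot product expands to d_x * d_y * Σ q·p + m_x * d_y * Σ p. The last factor depends only on the Q8_0 block, which is why it can be computed once at quantization time. The dictionary fields `d`, `m`, `qs`, and `s`, the block size of 32, and all function names are assumptions made for this illustration. This is not the ggml struct layout or the code from the commit.

```python
QK = 32  # values per block (assumed for this sketch)

def quantize_q4_1(xs):
    """Quantize to 4-bit codes q in 0..15 so that x ~= d * q + m."""
    lo, hi = min(xs), max(xs)
    d = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant block
    qs = [min(15, max(0, round((x - lo) / d))) for x in xs]
    return {"d": d, "m": lo, "qs": qs}

def quantize_q8_0(ys):
    """Quantize to 8-bit codes p in -127..127 so that y ~= d * p.

    Also stores s = d * sum(p), computed once here (hypothetical field).
    """
    amax = max(abs(y) for y in ys)
    d = amax / 127 or 1.0  # fall back to 1.0 for an all-zero block
    qs = [round(y / d) for y in ys]
    return {"d": d, "qs": qs, "s": d * sum(qs)}

def dot_naive(bx, by):
    """Block dot product that recomputes sum(p) on every call."""
    sum_qp = sum(q * p for q, p in zip(bx["qs"], by["qs"]))
    sum_p = sum(by["qs"])
    return bx["d"] * by["d"] * sum_qp + bx["m"] * by["d"] * sum_p

def dot_presummed(bx, by):
    """Block dot product that reuses the sum stored in the Q8_0 block."""
    sum_qp = sum(q * p for q, p in zip(bx["qs"], by["qs"]))
    return bx["d"] * by["d"] * sum_qp + bx["m"] * by["s"]
```

Both functions return the same value up to floating-point rounding. The second one skips one pass over the Q8_0 codes per dot product, and that saving adds up when the same Q8_0 block is multiplied against many Q4_1 rows.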

Kawrakow

5ecff35151

Adding a simple program to measure speed of dot products (#1041)

On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
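
For readers who want to reproduce this kind of measurement, the following is a minimal timing sketch. It is not the program added in commit 5ecff35151. The function names `dot` and `time_dot_us`, the vector length, and the repeat count are all assumptions chosen for illustration, and pure-Python timings will not be comparable to the microsecond figures quoted above.

```python
import random
import time

def dot(a, b):
    """Plain scalar dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def time_dot_us(dot_fn, n=4096, repeats=200, seed=0):
    """Return the average duration of one dot_fn call in microseconds."""
    rng = random.Random(seed)
    a = [rng.uniform(-1, 1) for _ in range(n)]
    b = [rng.uniform(-1, 1) for _ in range(n)]
    dot_fn(a, b)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        dot_fn(a, b)
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6
```

Calling `time_dot_us(dot)` and then `time_dot_us(other_dot)` with the same arguments compares two implementations on identical inputs, since the seed fixes the test vectors.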

2 Commits (master)