510 Commits (master)
 

Author SHA1 Message Date
Georgi Gerganov 955ef9a5d5
ggml : alternative Q4_3 implementation using modified Q8_0 (#1109)
* ggml : prefer vzip to vuzp

This way we always use the same type of instruction across all quantizations

* ggml : alternative Q4_3 implementation using modified Q8_0

* ggml : fix Q4_3 scalar implementation

* ggml : slight improvement of Q4_3 - no need for loop unrolling

* ggml : fix AVX paths for Q8_0 quantization
1 year ago
Stephan Walter c5aa5e5777
ggml : AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring (#1099)
* AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring

* finish AVX vectorization of quantize_row_q8_0

* Rename hsum_int_8 to hsum_i32_8
1 year ago
Clint Herron e9a9cb0c54
examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107)
* Moving parameters to separate lines for readability.

* Increasing repeat_penalty to 1.1 to make alpaca more usable by default.

* Adding trailing newline.
1 year ago
xaedes b6e7f9b09e
llama : add api for getting/setting the complete state: rng, logits, embedding and kv_cache (#1105)
* reserve correct size for logits

* add functions to get and set the whole llama state:

including rng, logits, embedding and kv_cache

* remove unused variables

* remove trailing whitespace

* fix comment
1 year ago
slaren 50cb666b8a
Improve cuBLAS performance by using a memory pool (#1094)
* Improve cuBLAS performance by using a memory pool

* Move cuda specific definitions to ggml-cuda.h/cu

* Add CXX flags to nvcc

* Change memory pool synchronization mechanism to a spin lock
General code cleanup
1 year ago
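The pooling idea in the commit above (reuse previously allocated buffers, guarded by a spin lock) can be illustrated with a minimal host-side sketch. Everything here is an assumption made for illustration: the class name `BufferPool`, the slot count, and the use of `std::malloc`/`std::free` as a stand-in for device allocation. It is not the code from `ggml-cuda.cu`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical pool capacity; the real value is not stated in the commit message.
constexpr int kMaxBuffers = 16;

struct PoolBuffer {
    void*       ptr  = nullptr;
    std::size_t size = 0;
};

class BufferPool {
public:
    // Returns a buffer of at least `size` bytes, reusing a cached one when possible.
    // `*actual_size` receives the real capacity of the returned buffer.
    void* acquire(std::size_t size, std::size_t* actual_size) {
        assert(actual_size != nullptr);
        lock();
        for (auto& b : slots_) {
            if (b.ptr != nullptr && b.size >= size) {
                void* p = b.ptr;
                *actual_size = b.size;
                b = PoolBuffer{};          // slot becomes free again
                unlock();
                return p;
            }
        }
        unlock();
        *actual_size = size;
        return std::malloc(size);          // stand-in for a device allocation
    }

    // Hands a buffer back to the pool; frees it if every slot is occupied.
    void release(void* ptr, std::size_t size) {
        lock();
        for (auto& b : slots_) {
            if (b.ptr == nullptr) {
                b.ptr  = ptr;
                b.size = size;
                unlock();
                return;
            }
        }
        unlock();
        std::free(ptr);
    }

    ~BufferPool() {
        for (auto& b : slots_) std::free(b.ptr);  // free(nullptr) is a no-op
    }

private:
    // Spin lock: busy-wait until the flag was previously clear.
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }

    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    PoolBuffer       slots_[kMaxBuffers];
};
```

A spin lock is a plausible fit here because the critical section is only a short scan over a small array, so waiting threads rarely spin for long.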
apaz 25d7abbd1f
llama : fixed rlimit error message (#888) 1 year ago
源文雨 018f2279f5
cmake : link threads publicly to ggml (#1042)
* fix: ld link test-tokenizer-0 error

```
cmake3 --build . --config Release
[  5%] Built target ggml
[ 16%] Built target llama
[ 22%] Linking CXX executable ../bin/test-tokenizer-0
../libllama.a(ggml.c.o): In function 'ggml_graph_compute':
ggml.c:(.text+0xf2db): undefined reference to 'pthread_create'
ggml.c:(.text+0xf9d4): undefined reference to 'pthread_join'
collect2: error: ld returned 1 exit status
gmake[2]: *** [bin/test-tokenizer-0] Error 1
gmake[1]: *** [tests/CMakeFiles/test-tokenizer-0.dir/all] Error 2
gmake: *** [all] Error 2
```

* Update CMakeLists.txt

* Update CMakeLists.txt

* Update CMakeLists.txt
1 year ago
Alex Klinkhamer 9411288271
main : evaluate tokens in batches after swapping context (#1014)
* examples : evaluate tokens in batches after swapping context

* Update examples/main/main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
xaedes 8687c1f258
llama : remember and restore kv cache data pointers (#1104)
because their value is stored in buf and overwritten by memcpy
1 year ago
Kawrakow 1bfc153e2f
ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)
* A faster version for Q4_1 x Q8_0 dot products

The idea is that Q8_0 quantized
values get used many times in the matrix multiplications
where they are involved. In the current implementations,
when we are evaluating the dot products, we need to compute
the sum of the quants in the Q8_0 vector, so the same
operation is repeated many times. Here we pre-compute
the sum during Q8_0 quantization, store it in the
now modified block_q8_0 struct, and then reuse this
result in the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot
products), this change speeds up the Q4_1 * Q8_0 dot
product by 80%, making the performance identical to
Q4_0 * Q8_0.

In practical application, I see a ~15% gain in speed for
token prediction on M2, and ~5% gain on Ryzen 7950X.
The speed gain in the prompt evaluation is much bigger
(around 50%).

I have only done the change for the scalar version,
ARM_NEON, and AVX2, so we still need an AVX implementation.

* Cleaning up

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
1 year ago
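The algebra behind the Q4_1 x Q8_0 commit above can be sketched as follows. If a Q4_1 value decodes to `d1*q + m` and a Q8_0 value to `d0*p`, the block dot product splits into `d1*d0*sum(q*p) + m*d0*sum(p)`. The factor `d0*sum(p)` depends only on the Q8_0 block, so it can be computed once during quantization and reused. The struct layouts, the field name `s`, and the function names below are hypothetical simplifications (one 4-bit quant per byte, scalar code only), not ggml's actual definitions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // assumed block size for this sketch

// Simplified Q4_1-style block: value_i = d * q[i] + m, with q[i] in 0..15.
struct BlockQ4_1 {
    float d;
    float m;
    std::array<uint8_t, kBlock> q;  // one quant per byte for clarity (not packed)
};

// Simplified Q8_0-style block with a cached sum: value_i = d * q[i].
struct BlockQ8 {
    float d;
    float s;  // d * sum(q), computed once at quantization time
    std::array<int8_t, kBlock> q;
};

// Quantize kBlock floats and cache the scaled sum of the quants.
BlockQ8 quantize_q8(const float* x) {
    assert(x != nullptr);
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));

    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float id = (b.d != 0.0f) ? 1.0f / b.d : 0.0f;

    int sum = 0;
    for (int i = 0; i < kBlock; ++i) {
        b.q[i] = static_cast<int8_t>(std::lround(x[i] * id));
        sum += b.q[i];
    }
    b.s = b.d * static_cast<float>(sum);
    return b;
}

// Dot product that reuses the cached sum instead of re-adding the Q8 quants.
float dot_q4_1_q8(const BlockQ4_1& a, const BlockQ8& b) {
    int isum = 0;
    for (int i = 0; i < kBlock; ++i) isum += int(a.q[i]) * int(b.q[i]);
    return a.d * b.d * static_cast<float>(isum) + a.m * b.s;
}

// Straightforward reference: dequantize both sides, then multiply and add.
float dot_reference(const BlockQ4_1& a, const BlockQ8& b) {
    float acc = 0.0f;
    for (int i = 0; i < kBlock; ++i) {
        acc += (a.d * float(a.q[i]) + a.m) * (b.d * float(b.q[i]));
    }
    return acc;
}
```

In a matrix multiplication the same Q8_0 block is paired with many Q4_1 blocks, which is why moving the sum into quantization pays off.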
slaren 3d59769c3b
Show perplexity ETA in hours and minutes (#1096) 1 year ago
Georgi Gerganov d40fded93e
llama : fix comment for "output.weight" tensor 1 year ago
Stephan Walter 2510c1831f
Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088)
* Add ggml-model-*.bin checksums for 7B, 13B, 30B
* Add ggml-model-*.bin checksums for 65B

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
1 year ago
Georgi Gerganov 12b5900dbc
ggml : sync ggml (add GPT-NeoX RoPE implementation) 1 year ago
Georgi Gerganov 9ff334f3c9
ggml : fix bug in ggml_compute_forward_dup_f32() 1 year ago
slaren 2005469ea1
Add Q4_3 support to cuBLAS (#1086) 1 year ago
Georgi Gerganov 8a1756abdf
ggml : do not break cuBLAS build (Q4_3 is not yet implemented) 1 year ago
Georgi Gerganov 66aab46079
ggml : fix Q4_3 quantization
Broke it during conflict resolution in last PR
1 year ago
Kawrakow 38de86a711
llama : multi-threaded quantization (#1075)
* Multi-threading quantization.

Not much gain for simple quantizations, but it will be important
for quantizations that require more CPU cycles.

* Multi-threading for quantize-stats

It now does the job in ~14 seconds on my Mac for
Q4_0, Q4_1 and Q4_2. Single-threaded it was taking
more than 2 minutes after adding the more elaborate
version of Q4_2.

* Reviewer comments

* Avoiding compiler confusion

After changing chunk_size to const int as suggested by
@ggerganov, clang and GCC started to warn me that I don't
need to capture it in the lambda. So, I removed it from the
capture list. But that makes the MSVC build fail. So,
making it a constexpr to make every compiler happy.

* Still fighting with lambda captures in MSVC

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
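A rough sketch of chunked multi-threading in the spirit of the commit above: worker threads repeatedly claim the next chunk under a mutex and process it. To sidestep the lambda-capture disagreement between compilers mentioned in the message, this sketch keeps the chunk size as a namespace-scope `constexpr`, which a lambda can use without capturing. That is this example's choice, not necessarily what the commit ended up doing. `process_chunk` is a hypothetical stand-in for real quantization work.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Namespace-scope constant: usable inside lambdas without appearing in a capture list.
constexpr std::size_t kChunkSize = 1024;

// Hypothetical per-chunk work: round each value to the nearest multiple of 1/8.
void process_chunk(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = std::round(src[i] * 8.0f) / 8.0f;
}

void process_parallel(const std::vector<float>& src, std::vector<float>& dst, int n_threads) {
    assert(n_threads >= 1);
    dst.resize(src.size());

    std::mutex  mtx;
    std::size_t next = 0;  // start index of the next unclaimed chunk

    auto worker = [&]() {
        for (;;) {
            std::size_t first;
            {
                // Claim a chunk; only this bookkeeping is serialized.
                std::lock_guard<std::mutex> lock(mtx);
                first = next;
                next += kChunkSize;
            }
            if (first >= src.size()) break;
            const std::size_t n = std::min(kChunkSize, src.size() - first);
            process_chunk(src.data() + first, dst.data() + first, n);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 1; t < n_threads; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part in the work
    for (auto& th : pool) th.join();
}
```

Because each chunk writes to a disjoint range of `dst`, the result is identical to a single-threaded pass regardless of scheduling.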
Georgi Gerganov e0305ead3a
ggml : add Q4_3 quantization (#1082) 1 year ago
Ivan Komarov 6a9661ea5a
ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074)
[Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed.

This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).
1 year ago
源文雨 5addcb120c
fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080) 1 year ago
Stephan Walter c8c2c52482
AVX2 optimization for vec_dot_q4_2_q8_0 (#1068) 1 year ago
slaren 02d6988121
Improve cuBLAS performance by dequantizing on the GPU (#1065) 1 year ago
CRD716 834695fe3a
Minor: Readme fixed grammar, spelling, and misc updates (#1071) 1 year ago
Kawrakow f7d05095b4
Q4_2 quantization with rmse-optimized scale and quants (#1062)
* Q4_2 quantization with rmse-optimized scale and quants

For quantize-stats we get
q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012

For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks.

Quantization is slow (~90 seconds on my Mac for 7B) because it is not
multi-threaded as in PR #896.

* ggml : satisfy the sanitizer builds

Not sure why this makes them fail

* Better follow ggml conventions for function names

* Fixed type as per reviewer comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
Georgi Gerganov 884e7d7a2b
ggml : use 8-bit precision for Q4_1 intermediate results (#1047)
* ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

* ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmlaq_n_f32

56 ms/token with Q4_1 !

* ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

* gitignore : ignore ppl-*.txt files

---------

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
1 year ago
Georgi Gerganov 7cd5c4a3e9
readme : add warning about Q4_2 and Q4_3 1 year ago
Stephan Walter f3d4edf504
ggml : Q4 cleanup - remove 4-bit dot product code (#1061)
* Q4 cleanup

* Remove unused AVX512 Q4_0 code
1 year ago
slaren 8944a13296
Add NVIDIA cuBLAS support (#1044) 1 year ago
slaren 6667401238
Multi-threaded ggml_cpy (#1035)
* Multi-threaded ggml_cpy

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Also fix wdata offset in ggml_compute_forward_add_q_f32

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
Georgi Gerganov 77a73403ca
ggml : add new Q4_2 quantization (ARM only) (#1046)
* ggml : Q4_2 ARM

* ggml : add ggml_is_quantized()

* llama : update llama_type_name() with Q4_2 entry

* ggml : speed-up q4_2

- 4 threads: ~100ms -> ~90ms
- 8 threads:  ~55ms -> ~50ms

* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
1 year ago
Georgi Gerganov 50a8a2af97
ggml : scratch that - vmlaq_n_f32 is always better
Had a background process that was messing with the timings
1 year ago
Georgi Gerganov 4caebf6d40
gitignore : vdot 1 year ago
Georgi Gerganov dcdd65e296
ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators 1 year ago
Kawrakow 5ecff35151
Adding a simple program to measure speed of dot products (#1041)
On my Mac, the direct Q4_1 product is marginally slower
(~69 vs ~55 us for Q4_0). The SIMD-ified ggml version
is now almost 2X slower (~121 us).

On a Ryzen 7950X CPU, the direct product for Q4_1 quantization
is faster than the AVX2 implementation (~60 vs ~62 us).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
1 year ago
Georgi Gerganov 7faa7460f0
readme : update hot topics about new LoRA functionality 1 year ago
Georgi Gerganov 5af8e32238
ci : do not run on drafts 1 year ago
Ivan Komarov 42747220b4
Do not close file after mmap (Windows version) (#1034) 1 year ago
Atsushi Tatsuma e9298af389
readme : add Ruby bindings (#1029) 1 year ago
Cameron 4ad73137a1
add 4_0 to default outfile namestr dict (#1031)
this came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file
1 year ago
slaren 315a95a4d3
Add LoRA support (#820) 1 year ago
Arik Poznanski efd05648c8
llama : well-defined static initialization of complex objects (#927)
* Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program start, for example a crash in Release builds while Debug builds work

* replaced use of auto with exact type to avoid using -std=c++14

* Made the accessor functions for static maps be static const
1 year ago
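The "initialization on first use" pattern described in the commit above can be sketched like this. The enum, map contents, and function names are hypothetical and chosen only for illustration. The accessor spells out its return type rather than relying on `auto` return type deduction, which would need C++14.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical key type for the lookup table.
enum class ModelSize { k7B, k13B, k30B, k65B };

// The map is a function-local static, so it is constructed the first time the
// accessor is called rather than at an unspecified point during static
// initialization. Since C++11 this construction is also thread-safe.
static const std::map<ModelSize, std::string>& model_names() {
    static const std::map<ModelSize, std::string> names = {
        {ModelSize::k7B,  "7B"},
        {ModelSize::k13B, "13B"},
        {ModelSize::k30B, "30B"},
        {ModelSize::k65B, "65B"},
    };
    return names;
}

// Convenience lookup built on the accessor.
static const std::string& model_name(ModelSize size) {
    const auto it = model_names().find(size);
    assert(it != model_names().end());
    return it->second;
}
```

Other globals can call `model_name(...)` from their own initializers without depending on the order in which translation units are initialized.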
Georgi Gerganov eb17a026fd
quantize-stats : fix bug in --type argument 1 year ago
Georgi Gerganov 69b740289f
ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c 1 year ago
Ivan Komarov f266259ad9
Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933) 1 year ago
slaren 47f61aaa5f
Fix: do not close file on mmap (#1017) 1 year ago
Georgi Gerganov 3173a62eb9
stdout : vertical align outputs for better readability 1 year ago
Pavol Rusnak 489537e6cf
examples: add missing <ctime> include for time() (#1011) 1 year ago
nanahi 2d3481c721
Fix msys2 build error and warnings (#1009) 1 year ago