llama.cpp

Commit Graph

Author	SHA1	Message	Date
Pavol Rusnak	859fee6dfb	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	1 year ago
Stephan Walter	4afcc37869	Update SHA256SUMS after quantization change (#1181 ) Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	1 year ago
ostix360	667c501334	py : cast lora_alpha to int in convert-lora-to-ggml (#1170 ) Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	1 year ago
Pavol Rusnak	bb98e77be7	nix: use convert.py instead of legacy wrapper convert-pth-to-ggml.py (#981 )	1 year ago
Georgi Gerganov	7a32fcb3b2	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179 ) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3	1 year ago
unbounded	dd0eabc049	ggml : use full range for Q4_0 and Q4_2 quantization (#729 ) * Use full range for q4_0 quantization By keeping the sign of the highest magnitude, we can make sure the highest value maps to -8, which is currently unused. This is a bit of a freebie since it is fully backwards compatible with the current format. * Update quantize_row_q4_0 for AVX/AVX2 * Update quantize_row_q4_0 for WASM Untested * Update quantize_row_q4_0 for Arm NEON * Update quantize_row_q4_0 for PowerPC Untested * Use full range for q4_2 quantization	1 year ago
xaedes	54bb60e268	ggml : fix bug in ggml_compute_forward_sum_f32 (#1162 ) The sum over all rows is now computed instead of just the last row	1 year ago
Georgi Gerganov	8a0f8673ba	ggml : export symbols (#1155 )	1 year ago
xaedes	0c5692345d	examples : add save_load_state example (#1150 ) * add save_load_state example * use <cstdio> instead of <iostream> and fprintf / printf instead of cout * renamed save-load-state example files replacing underscores by dashes	1 year ago
Georgi Gerganov	957c8ae21d	llama : increase scratch buffer size for 65B (ref #1152 ) Temporary solution	1 year ago
mgroeber9110	9b0a4d4214	examples/main README improvements and some light refactoring (#1131 )	1 year ago
Stephan Walter	2ec83428de	Fix build for gcc 8 and test in CI (#1154 )	1 year ago
slaren	e4cf982e0d	Fix cuda compilation (#1128 ) * Fix: Issue with CUBLAS compilation error due to missing -fPIC flag --------- Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>	1 year ago
Georgi Gerganov	c4fe84fb0d	llama : refactor get / set state + remove redundant kv cache API (#1143 )	1 year ago
slaren	1d78fecdab	Fix LoRA acronym (#1145 )	1 year ago
Georgi Gerganov	284685f169	scripts : add helper scripts to synch ggml repo	1 year ago
DannyDaemonic	edce63baa9	Added README.md for main with examples and explanations (#1139 )	1 year ago
Georgi Gerganov	ec9cdb6752	ggml : do not print perf ops that have not been used at all	1 year ago
Georgi Gerganov	e4422e299c	ggml : better PERF prints + support "LLAMA_PERF=1 make"	1 year ago
Stephan Walter	53c8434398	Improve AVX2 for vec_dot_q4_3_q8_0 (#1138 )	1 year ago
Pavol Rusnak	c6524f46eb	readme : update gpt4all instructions (#980 )	1 year ago
Yishuo Wang	c9e2c26f41	A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512 (#1119 )	1 year ago
Georgi Gerganov	0e018fe008	ggml : fix Q4_3 cuBLAS	1 year ago
Stephan Walter	857308d1e8	ci : trigger CI for drafts, but not most PR actions (#1125 )	1 year ago
Stephan Walter	c50b628810	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122 )	1 year ago
unbounded	5f939498d5	ggml : unit test for quantization functions (#953 ) * Unit test for quantization functions Use the ggml_internal_get_quantize_fn function to loop through all quantization formats and run a sanity check on the result. Also add a microbenchmark that times these functions directly without running the rest of the GGML graph. * test-quantize-fns: CI fixes Fix issues uncovered in CI - need to use sizes divisible by 32*8 for loop unrolling - use intrinsic header that should work on Mac * test-quantize: remove Per PR comment, subsumed by test-quantize-fns * test-quantize: fix for q8_0 intermediates	1 year ago
wbpxre150	36b4f7e064	llama : print timings on ctrl+c exit (#1021 ) * print timings on ctrl+c exit * remove redundant free memory call. * add global pointer to ctx.	1 year ago
eiery	10f19c1121	llama : have n_batch default to 512 (#1091 ) * set default n_batch to 512 when using BLAS * spacing * alternate implementation of setting different n_batch for BLAS * set n_batch to 512 for all cases	1 year ago
Howard Su	7e312f165c	cmake : fix build under Windows when enable BUILD_SHARED_LIBS (#1100 ) * Fix build under Windows when enable BUILD_SHARED_LIBS * Make AVX512 test on Windows to build the shared libs	1 year ago
Georgi Gerganov	872c365a91	ggml : fix AVX build + update to new Q8_0 format	1 year ago
Georgi Gerganov	955ef9a5d5	ggml : alternative Q4_3 implementation using modified Q8_0 (#1109 ) * ggml : prefer vzip to vuzp This way we always use the same type of instruction across all quantizations * ggml : alternative Q4_3 implementation using modified Q8_0 * ggml : fix Q4_3 scalar implementation * ggml : slight improvement of Q4_3 - no need for loop unrolling * ggml : fix AVX paths for Q8_0 quantization	1 year ago
Stephan Walter	c5aa5e5777	ggml : AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring (#1099 ) * AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring * finish AVX vectorization of quantize_row_q8_0 * Rename hsum_int_8 to hsum_i32_8	1 year ago
Clint Herron	e9a9cb0c54	examples : Improve Alpaca Default Repeat Penalty: Better Match Alpaca.cpp Experience (#1107 ) * Moving parameters to separate lines for readability. * Increasing repeat_penalty to 1.1 to make alpaca more usable by default. * Adding trailing newline.	1 year ago
xaedes	b6e7f9b09e	llama : add api for getting/setting the complete state: rng, logits, embedding and kv_cache (#1105 ) * reserve correct size for logits * add functions to get and set the whole llama state: including rng, logits, embedding and kv_cache * remove unused variables * remove trailing whitespace * fix comment	1 year ago
slaren	50cb666b8a	Improve cuBLAS performance by using a memory pool (#1094 ) * Improve cuBLAS performance by using a memory pool * Move cuda specific definitions to ggml-cuda.h/cu * Add CXX flags to nvcc * Change memory pool synchronization mechanism to a spin lock General code cleanup	1 year ago
apaz	25d7abbd1f	llama : fixed rlimit error message (#888 )	1 year ago
源文雨	018f2279f5	cmake : link threads publicly to ggml (#1042 ) * fix: ld link test-tokenizer-0 error ``` cmake3 --build . --config Release [ 5%] Built target ggml [ 16%] Built target llama [ 22%] Linking CXX executable ../bin/test-tokenizer-0 ../libllama.a(ggml.c.o): In function 'ggml_graph_compute': ggml.c:(.text+0xf2db): undefined reference to 'pthread_create' ggml.c:(.text+0xf9d4): undefined reference to 'pthread_join' collect2: error: ld returned 1 exit status gmake[2]: *** [bin/test-tokenizer-0] Error 1 gmake[1]: *** [tests/CMakeFiles/test-tokenizer-0.dir/all] Error 2 gmake: *** [all] Error 2 ``` * Update CMakeLists.txt * Update CMakeLists.txt * Update CMakeLists.txt	1 year ago
Alex Klinkhamer	9411288271	main : evaluate tokens in batches after swapping context (#1014 ) * examples : evaluate tokens in batches after swapping context * Update examples/main/main.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	1 year ago
xaedes	8687c1f258	llama : remember and restore kv cache data pointers (#1104 ) because their value is stored in buf and overwritten by memcpy	1 year ago
Kawrakow	1bfc153e2f	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083 ) * A faster version for Q4_1 x Q8_0 dot products The idea behind being that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we are evaluating the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. Here we pre-compute the sum during Q8_0 quantization, store it in the now modified block_q8_0 struct, and then reuse this result in the subsequent dot products. In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0. In practical application, I see a ~15% gain in speed for token prediction on M2, and ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%). I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation. * Cleaning up --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	1 year ago
slaren	3d59769c3b	Show perplexity ETA in hours and minutes (#1096 )	1 year ago
Georgi Gerganov	d40fded93e	llama : fix comment for "output.weight" tensor	1 year ago
Stephan Walter	2510c1831f	Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088 ) * Add ggml-model-*.bin checksums for 7B, 13B, 30B * Add ggml-model-*.bin checksums for 65B --------- Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	1 year ago
Georgi Gerganov	12b5900dbc	ggml : sync ggml (add GPT-NeoX RoPE implementation)	1 year ago
Georgi Gerganov	9ff334f3c9	ggml : fix bug in ggml_compute_forward_dup_f32()	1 year ago
slaren	2005469ea1	Add Q4_3 support to cuBLAS (#1086 )	1 year ago
Georgi Gerganov	8a1756abdf	ggml : do not break cuBLAS build (Q4_3 is not yet implemented)	1 year ago
Georgi Gerganov	66aab46079	ggml : fix Q4_3 quantization Broke it during conflict resolution in last PR	1 year ago
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	1 year ago
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	1 year ago
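
Commit dd0eabc049 above describes the full-range Q4_0 idea: keep the sign of the highest-magnitude value so that it maps to level -8, which was previously unused. The following is a minimal Python sketch of that idea only, not the ggml C code. The function names, the unpacked list of integer levels, and the 32-element test block are all assumptions made for illustration; the real format packs two 4-bit values per byte.

```python
def quantize_block_full_range(xs):
    """Quantize one block of floats to integer levels in [-8, 7].

    Hypothetical helper for illustration. Returns (d, qs) such that
    xs[i] is approximately d * qs[i].
    """
    # Pick the element with the largest magnitude, keeping its sign.
    vmax = max(xs, key=abs, default=0.0)
    # Dividing by -8 makes that element land on level -8, so the
    # full 16-level range [-8, 7] is used instead of [-7, 7].
    d = vmax / -8.0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    # Python's round() rounds ties to even; a C implementation using
    # roundf() would round ties away from zero instead.
    qs = [min(7, max(-8, round(x * inv_d))) for x in xs]
    return d, qs


def dequantize_block(d, qs):
    """Reconstruct approximate floats from the scale and levels."""
    return [d * q for q in qs]


if __name__ == "__main__":
    block = [0.1 * (i - 10) for i in range(32)]  # assumed sample data
    d, qs = quantize_block_full_range(block)
    print("scale:", d)
    print("levels:", qs)
    print("reconstructed:", dequantize_block(d, qs))
```

Because the scale is derived from the signed maximum, the largest-magnitude element is reconstructed almost exactly, while all other elements fall on the remaining levels. An all-zero block yields a zero scale and all-zero levels.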


440 Commits (859fee6dfb00fab7ce6bc215b4adae78d82f4759)