510 Commits (master)
 

Author SHA1 Message Date
Georgi Gerganov 84ca9c2ecf
examples : fix save-load-state + rename llama-util.h 1 year ago
Georgi Gerganov 334637e43e
common : change default parameters to pre-#1126 (#1223) 1 year ago
Ivan Stepanov dd7eff57d8
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.

New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat

Ignore EOS fix: -inf should be used.

* mirostat

* Added --logit-bias and --no-penalize-nl, removed std::span

* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

* Save and load example adjust

* Tests

* Windows build fix

* Windows test fix
1 year ago
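
The "Ignore EOS fix: -inf should be used" note above can be illustrated with a small sketch. It is not the llama.cpp code; `softmax` and `ignore_token` are hypothetical helpers written for this example. The point is that only a logit of negative infinity gives a token exactly zero probability, while any finite logit leaves it some chance of being sampled.

```python
import math

def softmax(logits):
    """Plain softmax; assumes at least one logit is finite."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def ignore_token(logits, token_id):
    """Return a copy of logits in which token_id can never be sampled."""
    out = list(logits)
    out[token_id] = -math.inf
    return out

logits = [1.0, 2.0, 3.0]          # pretend index 2 is the EOS token
masked = softmax(ignore_token(logits, 2))
finite = softmax([1.0, 2.0, 0.0])  # a finite logit still leaves probability mass
```

Here `masked[2]` is exactly zero and the remaining probabilities renormalize, whereas `finite[2]` stays positive.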
slaren 7fc50c051a
cuBLAS: use host pinned memory and dequantize while copying (#1207)
* cuBLAS: dequantize simultaneously while copying memory

* cuBLAS: use host pinned memory

* cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory

* cuBLAS: also pin kv cache

* fix rebase
1 year ago
Henri Vasserman b1ee8f59b4
cuBLAS: non-contiguous tensor support (#1215)
* Cuda: non-contiguous tensor support

* remove extra stuff

* rename

* fix error

* more fixes, now OpenBLAS and CLBlast build too

* now then?
1 year ago
Stephan Walter 36d19a603b
Remove Q4_3 which is no better than Q5 (#1218) 1 year ago
Georgi Gerganov 7f15c5c477
readme : update hot topics 1 year ago
Georgi Gerganov 55390bcaf2
ggml : sync ggml (ggml_alibi) 1 year ago
CRD716 5fba3c016b
examples : add Jeopardy example (#1168)
* Basic Setup

* Prevent Results.txt from coming up

* Prefixes, Line separators, etc

* editorcheck

* introduction to give more consistent results

* Basic graph thing

* Grading, ready for testing!

* Y'all ready to get funky?

* fix column removal stuff

* missed a few
1 year ago
Evan Jones 1481a9cf25
llama : add session file format and saved sessions in main (#1169) 1 year ago
Georgi Gerganov 11d902364b
ggml : add helper debug printf in soft_max 1 year ago
0cc4m 7296c961d9
ggml : add CLBlast support (#1164)
* Allow use of OpenCL GPU-based BLAS using CLBlast instead of OpenBLAS for context processing

* Improve CLBlast implementation, avoid recreating buffers, remove redundant transfers

* Finish merge of CLBlast support

* Move CLBlast implementation to separate file

Add buffer reuse code (adapted from slaren's cuda implementation)

* Add q4_2 and q4_3 CLBlast support, improve code

* Double CLBlast speed by disabling OpenBLAS thread workaround

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>

* Fix device selection env variable names

* Fix cast in opencl kernels

* Add CLBlast to CMakeLists.txt

* Replace buffer pool with static buffers a, b, qb, c

Fix compile warnings

* Fix typos, use GGML_TYPE defines, improve code

* Improve btype dequant kernel selection code, add error if type is unsupported

* Improve code quality

* Move internal stuff out of header
* Use internal enums instead of CLBlast enums
* Remove leftover C++ includes and defines
* Make event use easier to read

Co-authored-by: Henri Vasserman <henv@hot.ee>

* Use c compiler for opencl files

* Simplify code, fix include

* First check error, then release event

* Make globals static, fix indentation

* Rename dequant kernels file to conform with other file names

* Fix import cl file name

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
Folko-Ven 78ec543733
Correcting link to w64devkit (#1214)
Correcting link to w64devkit (change seeto to skeeto).
1 year ago
Johannes Gäßler 92a6e13a31
Add Manjaro CUDA include and lib dirs to Makefile (#1212) 1 year ago
Yann Follet 04aaae1d79
add avx2 for dot_q8_0_q8_0, 2x faster than scalar (#1211) 1 year ago
Stephan Walter 0b2da20538
ggml : slightly faster AVX2 implementation for Q5 (#1197) 1 year ago
Georgi Gerganov f9be42add0
readme : add quantization info 1 year ago
Georgi Gerganov 574406dc7e
ggml : add Q5_0 and Q5_1 quantization (#1187)
* ggml : add Q5_0 quantization (cuBLAS only)

* ggml : fix Q5_0 qh -> uint32_t

* ggml : fix q5_0 histogram stats

* ggml : q5_0 scalar dot product

* ggml : q5_0 ARM NEON dot

* ggml : q5_0 more efficient ARM NEON using uint64_t masks

* ggml : rename Q5_0 -> Q5_1

* ggml : adding Q5_0 mode

* quantize : add Q5_0 and Q5_1 to map

* ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195)

---------

Co-authored-by: Stephan Walter <stephan@walter.name>
1 year ago
Ásgeir Bjarni Ingvarsson 87a6f846d3
Allow setting the rng seed after initialization. (#1184)
The llama_set_state_data function restores the rng state to what it
was at the time llama_copy_state_data was called. But users may want
to restore the state and proceed with a different seed.
1 year ago
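
The restore-then-reseed behavior described above can be sketched with an analogy. The example uses Python's standard `random` module rather than the llama.cpp API: `getstate` and `setstate` stand in for copying and restoring the state, and `seed` stands in for setting a new seed afterwards. The seed values are arbitrary.

```python
import random

rng = random.Random(1234)
rng.random()                      # advance the generator a little

saved = rng.getstate()            # analogous to copying the state
first = [rng.random() for _ in range(3)]

rng.setstate(saved)               # restoring replays the same sequence
replay = [rng.random() for _ in range(3)]

rng.setstate(saved)               # restore again, then choose a new seed
rng.seed(9999)
different = [rng.random() for _ in range(3)]
```

`replay` repeats `first`, while `different` diverges because the seed was changed after the restore.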
DaniAndTheWeb ea3ad7eb60
Updating build instructions to include BLAS support (#1183)
* Updated build information

First update to the build instructions to include BLAS.

* Update README.md

* Update information about BLAS

* Better BLAS explanation

Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit.

* Better BLAS explanation

* BLAS for Mac

Specifying that BLAS is already supported on Macs using the Accelerate Framework.

* Clarify the effect of BLAS

* Windows Make instructions

Added the instructions to build with Make on Windows

* Fixing typo

* Fix trailing whitespace
1 year ago
Pavol Rusnak 859fee6dfb
quantize : use `map` to assign quantization type from `string` (#1191)
instead of `int` (while `int` option still being supported)

This allows the following usage:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0`

instead of:

`./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`
1 year ago
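
A minimal sketch of the lookup described above, assuming a table keyed by name with the integer form still accepted. It is written in Python for illustration and is not the code in `quantize`. Only the `q4_0` to `2` pairing comes from the commit message; `parse_quant_type` and the table name are hypothetical.

```python
# Only the q4_0 -> 2 pairing is taken from the commit message above;
# a real table would list every supported type.
QUANT_TYPES = {"q4_0": 2}

def parse_quant_type(arg):
    """Accept either a type name ("q4_0") or its legacy integer ("2")."""
    if arg in QUANT_TYPES:
        return QUANT_TYPES[arg]
    try:
        value = int(arg)
    except ValueError:
        raise ValueError(f"unknown quantization type: {arg!r}") from None
    if value not in QUANT_TYPES.values():
        raise ValueError(f"unknown quantization type: {arg!r}")
    return value
```

Both `parse_quant_type("q4_0")` and `parse_quant_type("2")` resolve to the same type, and anything else is rejected.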
Stephan Walter 4afcc37869
Update SHA256SUMS after quantization change (#1181)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
1 year ago
ostix360 667c501334
py : cast lora_alpha to int in convert-lora-to-ggml (#1170)
Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
1 year ago
Pavol Rusnak bb98e77be7
nix: use convert.py instead of legacy wrapper convert-pth-to-ggml.py (#981) 1 year ago
Georgi Gerganov 7a32fcb3b2
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) (#1179)
* ggml : add Q8_0 quantization format (rename the old one to Q8_1)

* tests : fix test-quantize-fns

* ggml : finalize Q8_0 implementation

* ggml : use q4_0_q8_0 and q4_2_q8_0

* ggml : fix Q8_0 dot product bug (ARM)

* ggml : Q8_0 unroll x2

* ggml : fix bug - using wrong block type

* ggml : extend quantize_fns_t with "vec_dot_type"

* ggml : fix Q8_0 to use 255 values out of 256

* ggml : fix assert using wrong QK4_2 instead of QK4_3
1 year ago
unbounded dd0eabc049
ggml : use full range for Q4_0 and Q4_2 quantization (#729)
* Use full range for q4_0 quantization

By keeping the sign of the highest magnitude, we can make sure the
highest value maps to -8, which is currently unused.
This is a bit of a freebie since it is fully backwards compatible with
the current format.

* Update quantize_row_q4_0 for AVX/AVX2

* Update quantize_row_q4_0 for WASM

Untested

* Update quantize_row_q4_0 for Arm NEON

* Update quantize_row_q4_0 for PowerPC

Untested

* Use full range for q4_2 quantization
1 year ago
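
The full-range idea above can be sketched as follows. This is a Python illustration, not the ggml implementation. The symmetric baseline (levels -7..7) is an assumption about the earlier scheme, inferred from the remark that -8 was unused. All function names are hypothetical.

```python
def quantize_symmetric(xs):
    """Assumed baseline: scale by the absolute maximum, levels -7..7."""
    amax = max(abs(x) for x in xs)
    d = amax / 7 if amax else 0.0
    return d, [round(x / d) if d else 0 for x in xs]

def quantize_full_range(xs):
    """Keep the sign of the largest-magnitude value so it maps to -8."""
    peak = max(xs, key=abs)
    d = peak / -8 if peak else 0.0
    # Clamp to the 4-bit signed range; a value of +8 can only occur when
    # another element has the same magnitude as the peak, opposite sign.
    return d, [min(7, max(-8, round(x / d))) if d else 0 for x in xs]

def dequantize(d, qs):
    """Unchanged decode step: scale times integer level."""
    return [d * q for q in qs]

block = [0.1, -0.5, 0.8, -0.3]
d_sym, q_sym = quantize_symmetric(block)
d_full, q_full = quantize_full_range(block)
```

Because decoding is still scale times level, data quantized either way is read back by the same `dequantize`, which is why the change can be described as backwards compatible.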
xaedes 54bb60e268
ggml : fix bug in ggml_compute_forward_sum_f32 (#1162)
The sum over all rows is now computed instead of just the last row
1 year ago
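
The symptom described above (only the last row contributing to the sum) can be reproduced with a toy example. The actual C code is not shown in this log, so the overwritten accumulator below is only one plausible shape of such a bug; both function names are hypothetical.

```python
def sum_last_row_only(rows):
    """Buggy shape: the accumulator is overwritten on every row."""
    total = 0.0
    for row in rows:
        total = sum(row)
    return total

def sum_all_rows(rows):
    """Fixed shape: every row is added to the running total."""
    total = 0.0
    for row in rows:
        total += sum(row)
    return total

rows = [[1.0, 2.0], [3.0, 4.0]]
```

With these rows the buggy version returns the sum of the second row only, while the fixed version returns the sum of all four elements.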
Georgi Gerganov 8a0f8673ba
ggml : export symbols (#1155) 1 year ago
xaedes 0c5692345d
examples : add save_load_state example (#1150)
* add save_load_state example

* use <cstdio> instead of <iostream> and fprintf / printf instead of cout

* renamed save-load-state example files replacing underscores by dashes
1 year ago
Georgi Gerganov 957c8ae21d
llama : increase scratch buffer size for 65B (ref #1152)
Temporary solution
1 year ago
mgroeber9110 9b0a4d4214
examples/main README improvements and some light refactoring (#1131) 1 year ago
Stephan Walter 2ec83428de
Fix build for gcc 8 and test in CI (#1154) 1 year ago
slaren e4cf982e0d
Fix cuda compilation (#1128)
* Fix: Issue with CUBLAS compilation error due to missing -fPIC flag

---------

Co-authored-by: B1gM8c <89020353+B1gM8c@users.noreply.github.com>
1 year ago
Georgi Gerganov c4fe84fb0d
llama : refactor get / set state + remove redundant kv cache API (#1143) 1 year ago
slaren 1d78fecdab
Fix LoRA acronym (#1145) 1 year ago
Georgi Gerganov 284685f169
scripts : add helper scripts to synch ggml repo 1 year ago
DannyDaemonic edce63baa9
Added README.md for main with examples and explanations (#1139) 1 year ago
Georgi Gerganov ec9cdb6752
ggml : do not print perf ops that have not been used at all 1 year ago
Georgi Gerganov e4422e299c
ggml : better PERF prints + support "LLAMA_PERF=1 make" 1 year ago
Stephan Walter 53c8434398
Improve AVX2 for vec_dot_q4_3_q8_0 (#1138) 1 year ago
Pavol Rusnak c6524f46eb
readme : update gpt4all instructions (#980) 1 year ago
Yishuo Wang c9e2c26f41
A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using AVX512 (#1119) 1 year ago
Georgi Gerganov 0e018fe008
ggml : fix Q4_3 cuBLAS 1 year ago
Stephan Walter 857308d1e8
ci : trigger CI for drafts, but not most PR actions (#1125) 1 year ago
Stephan Walter c50b628810
Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122) 1 year ago
unbounded 5f939498d5
ggml : unit test for quantization functions (#953)
* Unit test for quantization functions

Use the ggml_internal_get_quantize_fn function to loop through all
quantization formats and run a sanity check on the result.

Also add a microbenchmark that times these functions directly without
running the rest of the GGML graph.

* test-quantize-fns: CI fixes

Fix issues uncovered in CI
 - need to use sizes divisible by 32*8 for loop unrolling
 - use intrinsic header that should work on Mac

* test-quantize: remove

Per PR comment, subsumed by test-quantize-fns

* test-quantize: fix for q8_0 intermediates
1 year ago
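
A rough sketch of the kind of sanity check described above: loop over every format, quantize and dequantize a block of values, and compare the round-trip error against a bound. The two toy formats and all names here are hypothetical stand-ins; the real test iterates over the ggml quantization functions. The default length of 256 follows the note about sizes divisible by 32*8.

```python
import random

# Hypothetical toy formats: name -> number of positive integer levels.
TOY_FORMATS = {"toy4": 7, "toy8": 127}

def quantize(xs, levels):
    """Absolute-maximum scaling onto the integer levels -levels..levels."""
    amax = max(abs(x) for x in xs)
    d = amax / levels if amax else 0.0
    return d, [round(x / d) if d else 0 for x in xs]

def dequantize(d, qs):
    return [d * q for q in qs]

def roundtrip_errors(n=256, seed=0):
    """Return {format: (max abs error, expected bound)} for random input."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    report = {}
    for name, levels in TOY_FORMATS.items():
        d, qs = quantize(xs, levels)
        err = max(abs(a - b) for a, b in zip(xs, dequantize(d, qs)))
        report[name] = (err, d / 2)  # rounding error is at most half a step
    return report
```

A check then passes when every format's error stays within its bound, up to floating-point tolerance.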
wbpxre150 36b4f7e064
llama : print timings on ctrl+c exit (#1021)
* print timings on ctrl+c exit

* remove redundant free memory call.

* add global pointer to ctx.
1 year ago
eiery 10f19c1121
llama : have n_batch default to 512 (#1091)
* set default n_batch to 512 when using BLAS

* spacing

* alternate implementation of setting different n_batch for BLAS

* set n_batch to 512 for all cases
1 year ago
Howard Su 7e312f165c
cmake : fix build under Windows when enable BUILD_SHARED_LIBS (#1100)
* Fix build under Windows when enable BUILD_SHARED_LIBS

* Make AVX512 test on Windows to build the shared libs
1 year ago
Georgi Gerganov 872c365a91
ggml : fix AVX build + update to new Q8_0 format 1 year ago