Commit Graph

27 Commits (8520fc310eab87f2c4612f2a00d4adbd44a20d0d)

Georgi Gerganov 8520fc310e
Disable BLAS altogether - the bug is not just for quantized mat mul 1 year ago
Georgi Gerganov b3f460e941
Disable BLAS branch in mul_mat - seems there is a bug 1 year ago
Georgi Gerganov 7a9b6c3a8b
Reduce memory usage and allocate enough memory for largest context (#473)
* Reduce memory usage and allocate enough memory for large contexts

* Simpler scratch buffer usage

* Reenable BLAS for quantized mul_mat

* Fix number of layers in 30B and 65B

* Fix KV cache size for F32
1 year ago
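
As a rough illustration of the scratch-buffer idea in the commit above, here is a minimal sketch assuming the ggml_set_scratch API of this era; the buffer size and function name `example_with_scratch` are illustrative, not values from llama.cpp.

```c
// Sketch of routing intermediate tensors through a reusable scratch buffer.
// Assumes the ggml_set_scratch / struct ggml_scratch API; size is illustrative.
#include "ggml.h"
#include <stdlib.h>

void example_with_scratch(struct ggml_context * ctx) {
    static size_t scratch_size = 128u*1024*1024;   // illustrative size
    static void * scratch_data = NULL;
    if (scratch_data == NULL) {
        scratch_data = malloc(scratch_size);
    }

    // route subsequent tensor allocations into the scratch buffer ...
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, scratch_size, scratch_data });

    // ... build intermediate ops here ...

    // ... then switch back to the context's own memory for results that
    // must outlive the scratch buffer
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, 0, NULL });
}
```
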
Cameron Kaiser 481044d50c
additional optimizations for POWER9 (#454) 1 year ago
comex 563cdc391d
Support calling mlock() on loaded model data on Linux and macOS (#453)
* Support calling mlock() on loaded model data on Linux and macOS

This is enabled by a new --mlock command line option.

Using mlock() disables swapping and memory compression for the model
data.  Doing so can be useful on systems where the model takes up a
large fraction of system RAM.  In my experience, macOS is quite eager to
start compressing llama.cpp's memory, which then makes it halt for a few
seconds while it decompresses, even with a model that uses "only" 25GB
out of 32GB.

Of course, this comes at the cost of forcing the system to swap or
compress other processes' memory instead, so it needs to be used with
care and shouldn't be enabled by default.

In theory it should be possible to support this on Windows as well using
VirtualLock(), but I'm not much of a Windows user.

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
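
A generic sketch of the underlying call described above: pinning a region of memory with mlock() so it cannot be swapped or compressed. Here the region is a read-only mapping of a file; llama.cpp locks its loaded model buffer rather than a mapping, and the path and error handling are illustrative only.

```c
// Map a file and pin it in RAM with mlock() (Linux/macOS).
// Illustrative sketch, not the llama.cpp loader code.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("models/7B/ggml-model-q4_0.bin", O_RDONLY); // illustrative path
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void * addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // pin the pages; may fail if RLIMIT_MEMLOCK is too low
    if (mlock(addr, st.st_size) != 0) {
        perror("mlock");
    }

    // ... use the model data ...

    munlock(addr, st.st_size);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```
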
Stephan Walter 69c92298a9
Deduplicate q4 quantization functions (#383)
* Deduplicate q4 quantization functions

* Use const; add basic test

* Re-enable quantization test

* Disable AVX2 flags in CI

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
Valentyn Bezshapkin 97940520e8
fix: add POSIX functionality for Linux compilation (#51)
* fix: add POSIX functionality for Linux compilation

* fix: older standard for compatibility
1 year ago
Georgi Gerganov f5a77a629b
Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
1 year ago
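
The refactor above exposes the internals through a C-style interface. Below is a generic sketch of the opaque-handle pattern such an API follows; the names (my_model_*, my_params) are hypothetical and are not the actual llama.h declarations.

```c
// Generic sketch of a C-style API built around an opaque handle.
// Hypothetical names, not the real llama.h signatures.
#ifdef __cplusplus
extern "C" {
#endif

typedef struct my_model my_model;            // opaque to callers

typedef struct {
    int n_ctx;                               // context length
    int seed;
} my_params;

my_model * my_model_load(const char * path, my_params params);
int        my_model_eval(my_model * m, const int * tokens, int n_tokens);
void       my_model_free(my_model * m);

#ifdef __cplusplus
}
#endif
```
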
Kevin Lo 715d292ee0
Add OpenBSD support (#314) 1 year ago
Casey Primozic 2e664f1ff4
Add initial AVX512 support for dot product on Linux (#320)
* Update Makefile to detect AVX512 support and add compiler flags if it's available

* Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time

* Perform 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16

* Use built-in AVX512 horizontal reduce add to get the sum at the end

* Manual unrolling of the inner dot product loop to reduce loop counter overhead
1 year ago
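
A simplified sketch of the steps described above (sign-extend 8-bit to 16-bit, multiply-add 32 values at a time, horizontal reduce). It is not the actual ggml kernel, which also unpacks the 4-bit quants and applies per-block scales; `dot_i8_avx512` is an illustrative name.

```c
// Simplified AVX512 dot product over two int8 vectors (n a multiple of 32).
// Requires AVX512F + AVX512BW.
#include <immintrin.h>
#include <stdint.h>

int32_t dot_i8_avx512(const int8_t * a, const int8_t * b, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 32) {
        // load 32 signed bytes from each input and sign-extend to 16 bits
        __m512i va = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i *)(a + i)));
        __m512i vb = _mm512_cvtepi8_epi16(_mm256_loadu_si256((const __m256i *)(b + i)));
        // 32 x 16-bit products, summed pairwise into 32-bit lanes
        acc = _mm512_add_epi32(acc, _mm512_madd_epi16(va, vb));
    }
    // horizontal reduction of the 16 32-bit lanes
    return _mm512_reduce_add_epi32(acc);
}
```
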
Georgi Gerganov 22213a17b5
Change RMSNorm eps to 1e-6 (#173)
I think this is what is used in the Python code
1 year ago
Stephan Walter 367946c668
Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
1 year ago
Matvey Soloviev 904d2a8d6a
Q4_1 quantization (#193)
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
1 year ago
Nebula 9b4a15b17d
Fix RMS norm in GGML (#191) 1 year ago
hoangmit 6eac39ba95
Add RMS norm and use it (#187)
* add ggml_rms_norm

* update op num
1 year ago
hoangmit 113e685d18
inline -> static inline for "bytesFromNibbles" (#161)
Without the "static" prefix, it fails to compile with clang
1 year ago
Ronsor 47857e564c
Don't use vdotq_s32 if it's not available (#139)
* Don't use vdotq_s32 if it's not available

`dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available.

Reintroduces the code removed in 84d9015 if `__ARM_FEATURE_DOTPROD` isn't defined.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 year ago
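
A simplified sketch of the guard described above: use vdotq_s32 only when the dotprod extension is available, otherwise fall back to widening NEON multiplies. Not the exact ggml kernel; `dot_i8x16` is an illustrative name.

```c
// Accumulate a 16-byte signed dot product, with a fallback for CPUs that
// lack the ARM dotprod extension (e.g. Raspberry Pi 4).
#include <arm_neon.h>

static inline int32x4_t dot_i8x16(int32x4_t acc, int8x16_t a, int8x16_t b) {
#if defined(__ARM_FEATURE_DOTPROD)
    // single instruction: 4-way dot products accumulated into 32-bit lanes
    return vdotq_s32(acc, a, b);
#else
    // widen to 16-bit products, then pairwise-accumulate into 32-bit lanes
    const int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
    const int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    return vpadalq_s16(vpadalq_s16(acc, lo), hi);
#endif
}
```
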
Thomas Klausner 41be0a3b3d
Add NetBSD support. (#90) 1 year ago
Georgi Gerganov 84d9015c4a
Use vdotq_s32 to improve performance (#67)
* 10% performance boost on ARM

* Back to original change
1 year ago
Georgi Gerganov c80e2a8f2a
Revert "10% performance boost on ARM"
This reverts commit 113a9e83eb.

There are some reports of illegal instruction errors.
Moving this change to the vdotq_s32 branch until it is resolved
1 year ago
Georgi Gerganov 54a0e66ea0
Check for vdotq_s32 availability 1 year ago
Georgi Gerganov 543c57e991
Amend previous commit - forgot to update the non-QRDMX branch 1 year ago
Georgi Gerganov 113a9e83eb
10% performance boost on ARM 1 year ago
Sebastián A eb062bb012
Windows fixes (#31)
* Apply fixes suggested to build on windows

Issue: https://github.com/ggerganov/llama.cpp/issues/22

* Remove unsupported VLAs

* MSVC: Remove features that are only available on MSVC C++20.

* Fix zero initialization of the other fields.

* Change the use of vector for stack allocations.
1 year ago
Georgi Gerganov f1eaff4721
Add AVX2 support for x86 architectures thanks to @Const-me ! 1 year ago
Georgi Gerganov 007a8f6f45
Support all LLaMA models + change Q4_0 quantization storage 1 year ago
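
For context, the Q4_0 scheme of this period stores, per block of 32 weights, a single scale and 16 bytes of packed 4-bit quants, with dequantization x = d*(q - 8). The struct below is an illustration only; the exact storage arrangement is what this commit changes.

```c
// Conceptual layout and dequantization of one Q4_0 block (32 weights).
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;            // scale
    uint8_t qs[QK/2];     // 32 quants, two 4-bit values per byte
} block_q4_0;

static void dequantize_block_q4_0(const block_q4_0 * b, float * y) {
    for (int i = 0; i < QK/2; i++) {
        const uint8_t byte = b->qs[i];
        y[2*i + 0] = ((int8_t)(byte & 0x0F) - 8) * b->d;
        y[2*i + 1] = ((int8_t)(byte >>   4) - 8) * b->d;
    }
}
```
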
Georgi Gerganov 26c0846629
Initial release 1 year ago