ollama

mirror of https://github.com/jmorganca/ollama synced 2025-10-06 00:32:49 +02:00

Author	SHA1	Message	Date
Daniel Hiltgen	517807cdf2	perf: build graph for next batch async to keep GPU busy (#11863) * perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.	2025-08-29 14:20:28 -07:00
Daniel Hiltgen	fdd4d479a3	integration: add qwen2.5-vl (#10815) Replace the older llava model with qwen2.5 for vision tests. Skip split-batch test on small VRAM systems to avoid excessive test time.	2025-05-22 09:12:32 -07:00
Daniel Hiltgen	ed4e139314	Integration test improvements (#9654) Add some new test coverage for various model architectures, and switch from orca-mini to the small llama model.	2025-04-16 14:25:55 -07:00
Bruce MacDonald	9876c9faa4	chore(all): replace instances of interface with any (#10067) Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.	2025-04-02 09:44:27 -07:00
Jesse Gross	9679f40146	ml: Allow models to constrain inputs to a single batch Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes #9697	2025-03-14 15:38:54 -07:00
Daniel Hiltgen	8a9bb0d000	Add basic mllama integration tests (#7455)	2024-10-31 17:25:48 -07:00
Daniel Hiltgen	68dfc6236a	refined test timing adjust timing on some tests so they don't timeout on small/slow GPUs	2024-06-14 14:51:40 -07:00
Daniel Hiltgen	34b9db5afc	Request and model concurrency This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.	2024-04-22 19:29:12 -07:00
Patrick Devine	1b272d5bcd	change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347)	2024-03-26 13:04:17 -07:00
Daniel Hiltgen	7b6cbc10ec	Integration tests conditionally pull If images aren't present, pull them. Also fixes the expected responses	2024-03-25 08:57:45 -07:00
Daniel Hiltgen	949b6c01e0	Revamp go based integration tests This uplevels the integration tests to run the server which can allow testing an existing server, or a remote server.	2024-03-23 14:24:18 +01:00

11 Commits
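
The message for 9876c9faa4 notes that any is an alias for interface{} introduced in Go 1.18. A small sketch of what that means in practice follows; the describe and sameType helpers are hypothetical names used only for this illustration and do not come from the repository.

```go
package main

import (
	"fmt"
	"reflect"
)

// describe accepts a value of any type; writing the parameter as
// interface{} instead of any would declare the identical function.
func describe(v any) string {
	return fmt.Sprintf("%T", v)
}

// sameType reports whether any and interface{} are the same type by
// comparing the reflected types of pointers to variables of each.
func sameType() bool {
	var a any
	var b interface{}
	return reflect.TypeOf(&a) == reflect.TypeOf(&b)
}

func main() {
	fmt.Println(describe(42))       // int
	fmt.Println(describe("ollama")) // string
	fmt.Println(sameType())         // true
}
```

Because the two spellings name the same type, a replacement like the one in that commit is purely textual and does not change behavior.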