run_lm_server.py#
Run the self-hosting language model server.
This server controls a vLLM server for language model inference and provides an OpenAI-compatible API for training.
scripts/run_lm_server.py#
Run the self-hosting language model server.
usage: scripts/run_lm_server.py [-h] [--log-to-file] [--lm-server-port int]
                                [--vllm-port int] [--max-training-jobs int]
                                [--vllm-num-gpus {int,auto}]
                                [--vllm-clear-cache | --no-vllm-clear-cache]
                                [--vllm-max-lora-rank {int,auto}]
                                [--accelerate-config-path str]
                                [--debug | --no-debug]
                                [--external | --no-external]
                                [--reload | --no-reload]
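For example, the server might be launched from the repository root as sketched below; the flag values are illustrative, not required.

```python
# Sketch: launch the language model server with explicit ports.
# Run from the repository root; all flag values here are illustrative.
import subprocess

subprocess.run(
    [
        "python",
        "scripts/run_lm_server.py",
        "--lm-server-port", "5000",
        "--vllm-port", "8000",
        "--vllm-num-gpus", "auto",
    ],
    check=True,
)
```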
- -h, --help#
show this help message and exit
- --log-to-file#
Whether to log vLLM and trainer output to files instead of stdout and stderr.
- --lm-server-port <int>#
The port on which the main language model server will run. (default: 5000)
- --vllm-port <int>#
The port on which the vLLM server will run. (default: 8000)
- --max-training-jobs <int>#
The maximum number of concurrent training jobs allowed. (default: 1)
- --vllm-num-gpus <{int,auto}>#
The maximum number of GPUs to use for the vLLM server. If set to ‘auto’, it will use all available GPUs. The actual number of GPUs used may be less than this value, because it must divide the number of attention heads in the model. (default: auto)
- --vllm-clear-cache, --no-vllm-clear-cache#
Whether to clear the Hugging Face model cache before starting the server. If True, all cached models other than the one being loaded will be cleared before starting the vLLM server. (default: False)
- --vllm-max-lora-rank <{int,auto}>#
The maximum rank for LoRA layers permitted in the vLLM server. This should be set to the maximum rank of the LoRA layers in the model being trained. If set to ‘auto’, it will use the rank of the LoRA model to be served, if it is available. If no LoRA model is available, it will use the vLLM default value. (default: auto)
- --accelerate-config-path <str>#
Path to the configuration file for the accelerate library. If the filename ends with .jinja2, it will be treated as a Jinja2 template and rendered. If empty, no configuration file will be passed to the accelerate command; in this case, the accelerate command will use the default configuration file, which is usually located at ~/.cache/huggingface/accelerate/default_config.yaml. Relative paths are resolved against the current working directory, or, if that fails, against the template directory nip/language_model_server/templates/. (default: accelerate_config.yaml.jinja2)
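As an illustration of the .jinja2 handling described above, the sketch below renders such a template to a plain YAML file with the jinja2 library. The template variable used is hypothetical; the actual template context is defined by the server.

```python
# Illustrative sketch of rendering a ".jinja2" accelerate config to plain YAML.
# The "num_processes" variable is hypothetical; the server supplies its own context.
from pathlib import Path

import jinja2

template_path = Path("nip/language_model_server/templates/accelerate_config.yaml.jinja2")
template = jinja2.Template(template_path.read_text())
Path("accelerate_config.yaml").write_text(template.render(num_processes=1))
```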
- --debug, --no-debug#
Whether to enable debug mode. (default: False)
- --external, --no-external#
Whether to run the server in external mode, with host set to ‘0.0.0.0’. This allows the server to be accessed from outside the local machine. (default: False)
- --reload, --no-reload#
Whether to enable auto-reload for the uvicorn server. This auto-reloads the server when any of the source files change, at the cost of some performance. (default: False)
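Since the underlying vLLM server speaks the standard OpenAI-compatible inference API, it can be queried with the openai client once the server is running. This is a minimal sketch assuming the default --vllm-port and a hypothetical model name; the training API exposed on --lm-server-port has its own routes and is not shown here.

```python
# Minimal sketch: query the OpenAI-compatible vLLM inference endpoint started
# by this script. Assumes the default --vllm-port (8000); substitute the model
# actually being served.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
response = client.chat.completions.create(
    model="my-model",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```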