LocalInference

LocalInference provides a local inference implementation powered by executorch.

Llama Stack currently supports on-device inference for iOS, with Android support coming soon. (You can already run on-device inference on Android today with executorch, PyTorch's on-device inference library.)

Installation

We're working on making LocalInference easier to set up. For now, you'll need to import it via .xcframework (a quick link-check sketch follows the steps below):

  1. Clone the executorch submodule in this repo and its dependencies: git submodule update --init --recursive

  2. Install CMake for the executorch build

  3. Drag LocalInference.xcodeproj into your project

  4. Add LocalInference as a framework in your app target

  5. Add a package dependency on https://github.com/pytorch/executorch (branch latest)

  6. Add all the kernels / backends from executorch (but not executorch itself!) as frameworks in your app target:

    • backend_coreml
    • backend_mps
    • backend_xnnpack
    • kernels_custom
    • kernels_optimized
    • kernels_portable
    • kernels_quantized
  7. In "Build Settings" > "Other Linker Flags" > "Any iOS Simulator SDK", add:

    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_optimized-simulator-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_custom-simulator-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_quantized-simulator-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-simulator-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_coreml-simulator-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_mps-simulator-release.a
    
  8. In "Build Settings" > "Other Linker Flags" > "Any iOS SDK", add:

    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_optimized-ios-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_custom-ios-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libkernels_quantized-ios-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_xnnpack-ios-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_coreml-ios-release.a
    -force_load
    $(BUILT_PRODUCTS_DIR)/libbackend_mps-ios-release.a
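
To confirm that the framework and the executorch kernels/backends are linked correctly, it can help to build a trivial reference to the API used later in this README. This is only a sketch: it assumes the module is imported under the framework name (LocalInference) and exposes the queue-based initializer shown in "Using LocalInference" below.

import Foundation
import LocalInference // assumed module name; use whatever your framework target exposes

// If this compiles and links, the xcframework and the executorch
// kernels/backends added above are wired up correctly.
let queue = DispatchQueue(label: "org.meta.llamastack")
let inferenceService = LocalInferenceService(queue: queue)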
    

Preparing a model

  1. Prepare a .pte file following the executorch docs
  2. Bundle the .pte and tokenizer.model files into your app (a quick sanity-check sketch follows this list)
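
A missing "Target Membership" checkbox is a common reason the model later fails to load, so it can help to verify at runtime that both files actually made it into the bundle. A minimal sketch, assuming the resource names used later in this README (llama32_1b_spinquant.pte and tokenizer.model):

import Foundation

// Both URLs must resolve; a nil here usually means the file was not added to the
// app target (check "Target Membership" for the .pte and tokenizer files).
let modelURL = Bundle.main.url(forResource: "llama32_1b_spinquant", withExtension: "pte")
let tokenizerURL = Bundle.main.url(forResource: "tokenizer", withExtension: "model")
precondition(modelURL != nil && tokenizerURL != nil, "Model or tokenizer missing from the app bundle")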

We now support models quantized using SpinQuant and QAT+LoRA, which offer a significant performance boost. The numbers below are from the demo app on an iPhone 13 Pro:

| Llama 3.2 1B | Tokens / sec (total): Haiku | Tokens / sec (total): Paragraph | Time-to-first-token (sec): Haiku | Time-to-first-token (sec): Paragraph |
|---|---|---|---|---|
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |

Using LocalInference

  1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
  init () {
    runnerQueue = DispatchQueue(label: "org.meta.llamastack")
    inferenceService = LocalInferenceService(queue: runnerQueue)
    agentsService = LocalAgentsService(inference: inferenceService)
  }
  2. Before making any inference calls, load your model from your bundle:
let mainBundle = Bundle.main
inferenceService.loadModel(
    modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
    tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
    completion: {_ in } // use to handle load failures
)
  3. Make inference calls (or agents calls) as you normally would with LlamaStack (an end-to-end sketch follows this example):
for await chunk in try await agentsService.initAndCreateTurn(
    messages: [
    .UserMessage(Components.Schemas.UserMessage(
        content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
        role: .user))
    ]
) {
    // Handle each streamed chunk here (e.g. append it to your UI state).
}
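
Putting the pieces together, the whole flow can be wrapped in a single type. This is only a sketch that reuses the calls shown above (the queue-based initializers, loadModel, and initAndCreateTurn); the class name, the print placeholder, and the module imports are illustrative assumptions, not part of the library.

import Foundation
// Assumed module names: import whatever modules provide LocalInferenceService,
// LocalAgentsService, and the generated Components.Schemas types in your project.
import LocalInference
import LlamaStackClient

final class LocalInferenceExample {
    private let runnerQueue: DispatchQueue
    private let inferenceService: LocalInferenceService
    private let agentsService: LocalAgentsService

    init() {
        runnerQueue = DispatchQueue(label: "org.meta.llamastack")
        inferenceService = LocalInferenceService(queue: runnerQueue)
        agentsService = LocalAgentsService(inference: inferenceService)

        // Load the bundled model before making any inference or agent calls.
        let mainBundle = Bundle.main
        inferenceService.loadModel(
            modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
            tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
            completion: { _ in } // handle load failures here
        )
    }

    // Streams a single agent turn for the given prompt text.
    func runTurn(_ text: String) async throws {
        for await chunk in try await agentsService.initAndCreateTurn(
            messages: [
                .UserMessage(Components.Schemas.UserMessage(
                    content: .case1(text),
                    role: .user))
            ]
        ) {
            // Placeholder: print each streamed chunk; replace with your own handling.
            print(String(describing: chunk))
        }
    }
}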

Troubleshooting

If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

Hold Option and choose Product > Clean Build Folder Immediately

rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData