As AI systems evolve, Vision-Language Large Models (VLLMs) are transforming how machines understand and interact with the world. These powerful models combine computer vision and natural language processing, enabling applications such as image captioning, visual question answering, multimodal agents, and interactive assistants that "see" and "speak".
While most users access these capabilities through cloud APIs from big tech platforms, there's a growing demand, and necessity, for running VLLMs locally. From privacy-conscious developers to edge-device innovators, many are choosing to bring AI inference back to the user's machine. Thanks to tools like Ollama and Streamlit, it's now possible to deploy advanced multimodal AI applications entirely offline, with minimal setup.
In this post, we'll explore the advantages of local VLLMs and walk through building your own intelligent image-based chatbot using open-source models and Python tooling, all running locally on your computer.
Centralized AI services offer convenience and scalability, but they come with trade-offs. Here's why moving AI closer to the user can be a game-changer.
As open-source models like LLaVA, MiniGPT-4, and BakLLaVA mature, they offer capabilities rivaling cloud models without the data-sharing risks. Developers now have a rare opportunity: to build intelligent, multimodal apps where users own the code, the models, and their data.
Ollama makes it easy to run open-source LLMs and VLLMs locally using Docker-like workflows. We'll combine it with Streamlit to create a user-friendly frontend for interacting with a VLLM through image uploads and text prompts.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Install Streamlit
pip install streamlit
# Optional: For image handling
pip install pillow
Ollama supports models like llava that handle both images and text. To run LLaVA locally:
ollama run llava
This will download the model and expose a local API endpoint for chat and vision input.
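Before building the UI, it's worth a quick sanity check that the endpoint is up. The snippet below is a minimal sketch (the prompt text is just an example) that sends a text-only request to Ollama's generate API on its default port 11434:

import requests

# Text-only sanity check against the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llava", "prompt": "Say hello in one sentence.", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])

If this prints a short reply, the model is loaded and ready for the Streamlit app below.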
Save this code as app.py:
import streamlit as st
from PIL import Image
import requests
import base64
import io

st.title("Local VLLM Agent with Ollama")

image = st.file_uploader("Upload an image", type=["jpg", "png", "jpeg"])
prompt = st.text_input("Enter your prompt")

if st.button("Ask"):
    if image and prompt:
        # Convert image to base64
        img = Image.open(image)
        buffered = io.BytesIO()
        img.save(buffered, format="PNG")
        img_base64 = base64.b64encode(buffered.getvalue()).decode()

        # Send to Ollama
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llava",
                "prompt": prompt,
                "images": [img_base64],
                "stream": False
            }
        )

        if response.status_code == 200:
            result = response.json()["response"]
            st.success(result)
        else:
            st.error("Failed to get response from Ollama")
    else:
        st.warning("Please upload an image and enter a prompt.")
Then launch the app:

streamlit run app.py
This will open a local web interface where you can upload an image and enter a prompt to interact with the vision-language model running via Ollama.
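For longer answers, you may prefer to stream tokens as they are generated rather than wait for the full response. The sketch below reuses the same generate endpoint with "stream": True, which returns one JSON object per line; the image path example.jpg is a placeholder for any local test image:

import base64
import json
import requests

# Encode a local test image (placeholder path) the same way app.py does
with open("example.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode()

# Stream the answer chunk by chunk from the local Ollama server
with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "Describe this image.",
        "images": [img_base64],
        "stream": True,
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)

In the Streamlit app, the same loop could append chunks to a st.empty() placeholder so the answer appears as it is generated.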
Local Vision-Language Models mark a turning point in AI, putting sophisticated multimodal reasoning directly into the hands of developers, researchers, and makers. By combining tools like Ollama and Streamlit, you can go from raw models to polished applications in hours, not weeks.
Whether you're building educational tools, assistive tech, industrial automation, or just experimenting with AI on your own terms, running VLLMs locally opens up a future of responsible, fast, and flexible AI development.
Happy building!
Author: Tien Pham