Call your deployed Flash endpoints using HTTP requests for queue-based and load-balanced configurations.
After deploying your Flash app with flash deploy, you can call your endpoints directly via HTTP. The request format depends on whether you’re using queue-based or load-balanced configurations.
Queue-based endpoints (created with the @Endpoint(name=..., gpu=...) decorator) expose two routes for job submission: /run (asynchronous) and /runsync (synchronous).
Both routes return a JSON job object. A completed job looks like this:

```json
{
  "id": "job-abc-123",
  "status": "COMPLETED",
  "output": {
    "generated_text": "Hello world from GPU!"
  }
}
```
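To consume this response programmatically, parse the JSON and branch on the status field. A minimal sketch, using the sample response above as input:

```python
import json

# Example response body, taken from the sample above
response_body = (
    '{"id": "job-abc-123", "status": "COMPLETED",'
    ' "output": {"generated_text": "Hello world from GPU!"}}'
)

job = json.loads(response_body)
if job["status"] == "COMPLETED":
    result = job["output"]["generated_text"]
    print(result)  # Hello world from GPU!
else:
    print(f"Job {job['id']} finished with status {job['status']}")
```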
The /runsync endpoint has a 60-second client-side timeout by default. If you’ve configured execution_timeout_ms on your endpoint, the client timeout uses that value instead. For jobs that take longer than 60 seconds, set execution_timeout_ms to prevent /runsync requests from timing out.
Use /run for long-running jobs that you’ll check later. Use /runsync for quick jobs where you want immediate results (with timeout protection).
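For the asynchronous /run flow, a client typically polls the job until it reaches a terminal state. The helper below is a generic sketch: get_status stands in for whatever call fetches the job record (for example, an HTTP GET against the endpoint's status route), and the terminal status names besides COMPLETED are assumptions, not confirmed by this page:

```python
import time

# Assumed terminal states; COMPLETED matches the example response above
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}

def wait_for_job(get_status, poll_interval=2.0, timeout=300.0):
    """Poll get_status() until the job reaches a terminal state or timeout expires.

    get_status: zero-argument callable returning a job dict like
                {"id": ..., "status": ..., "output": ...}
    """
    deadline = time.monotonic() + timeout
    while True:
        job = get_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"job {job.get('id')} did not finish within {timeout}s")
        time.sleep(poll_interval)
```

In practice, get_status would wrap an authenticated request for the job's current state; injecting it as a callable keeps the polling logic testable without a live endpoint.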
A single load-balanced endpoint can serve multiple routes:
```python
from runpod_flash import Endpoint

api = Endpoint(name="api-server", cpu="cpu5c-4-8", workers=(1, 5))

# All these routes share one endpoint URL
@api.post("/generate")
async def generate_text(prompt: str): ...

@api.post("/translate")
async def translate_text(text: str): ...

@api.get("/health")
async def health_check(): ...
```
```bash
# All use the same base URL with different paths
curl -X POST https://abc123xyz.api.runpod.ai/generate -H "..." -d '{...}'
curl -X POST https://abc123xyz.api.runpod.ai/translate -H "..." -d '{...}'
curl -X GET https://abc123xyz.api.runpod.ai/health -H "..."
```
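The same calls can be made from Python with the standard library. The sketch below only builds the request object without sending it; the base URL is the example deployment URL from the curl commands above, and the Bearer-token Authorization header and /generate payload shape are assumptions to replace with your deployment's actual values:

```python
import json
import urllib.request

BASE_URL = "https://abc123xyz.api.runpod.ai"   # example URL from above
API_KEY = "YOUR_RUNPOD_API_KEY"                # placeholder; assumed Bearer auth

payload = json.dumps({"prompt": "Hello"}).encode()  # assumed request body shape
req = urllib.request.Request(
    f"{BASE_URL}/generate",
    data=payload,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send the request; omitted here
```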