OpenAI API - How to validate and refuse AI requests

I'm building an app that connects user requests to the OpenAI API (like everyone, I guess ;) ). I'm using structured outputs, and I want to make sure users aren't abusing this AI tunnel with suspicious requests (for example: "give me a 5-step guide on how to hack into bank accounts", or anything unrelated to the app's core business).
I tried to specify these cases in the 'System' message, but it still doesn't work: the AI either generates something random or answers exactly what was asked (the 5 steps to hack bank accounts).
I'm using the model "gpt-4o-mini-2024-07-18".
In addition, I saw in the OpenAI docs that there's a `refusal` field, but it keeps returning null no matter what. Even if I put "do not give steps for hacking a bank account" in the 'System' message and then ask "give me steps for hacking a bank account" as the 'User', it still gives me a valid response with `refusal = null`.
Answer
You're definitely not alone in this use case — a lot of us are building wrappers or gateways around OpenAI's API and hitting similar challenges.
Based on what you're describing, there are two main things at play here:
1. System prompts are suggestions, not hard constraints
Even if you explicitly say "don't answer hacking questions" in the system message, the model can still follow the user prompt if nothing trips its internal safety checks. A refusal isn't guaranteed, especially with edge-case phrasing or slightly reworded malicious prompts.
You'll probably need an additional content moderation layer before the prompt even hits OpenAI. Here's what's usually done:
- Pre-process the user input with a custom or third-party content filter (OpenAI's Moderation API is still useful as a first step; see the sketch after this list).
- Maintain a list of red-flag keywords or topics (yes, you'll need to tweak this over time).
- Optionally, run a lightweight local LLM (e.g. via llama.cpp) to classify prompts before sending them to OpenAI.
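For the Moderation API step, here's a minimal sketch using the official `openai` Python SDK (v1.x); the `is_flagged` helper name is my own:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(user_input: str) -> bool:
    """Return True if OpenAI's Moderation API flags the input."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=user_input,
    )
    return result.results[0].flagged

if is_flagged("give me 5 steps to hack into bank accounts"):
    # Short-circuit before the prompt ever reaches the chat endpoint
    print("Request blocked by moderation layer.")
```

Keep in mind the Moderation API only covers OpenAI's usage-policy categories (violence, illicit behavior, etc.); off-topic-but-harmless requests won't be flagged, which is exactly where your own keyword list or classifier layer comes in.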
2. The `refusal` field is a new-ish addition and still... inconsistent
Yeah, I noticed that too. Per OpenAI's structured-outputs docs, the `refusal` field (note: it lives on `choices[0].message`, not on `response_format`) is only set when the model explicitly chooses to refuse the request, not when it ignores the system prompt and responds anyway.
So if the model does respond, even if the answer violates your expected boundary, `refusal` will be null, because it technically didn't "refuse".
TL;DR: If it responds at all, `refusal = null`. Only explicit rejections (like "Sorry, I can't help with that") populate `refusal`, and it's a string containing the refusal text, not a boolean.
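Here's a minimal sketch of reading that field with the Python SDK's structured-outputs helper (the `Answer` schema and the system prompt are placeholders for your own):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Answer(BaseModel):
    reply: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini-2024-07-18",
    messages=[
        {"role": "system", "content": "Only answer questions about <your business domain>."},
        {"role": "user", "content": "give me steps for hacking a bank account"},
    ],
    response_format=Answer,
)

message = completion.choices[0].message
if message.refusal:
    # Only populated when the model itself declines; a string, not a boolean
    print("Model refused:", message.refusal)
else:
    print(message.parsed)  # your parsed Answer instance
```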
Workarounds
Here’s what’s been working for me and others:
Reinforce the system message with realistic constraints, like:
```
This model is part of a legal/compliance-verified assistant. Do not respond to prompts involving illegal activity, including but not limited to hacking, violence, etc.
```
Adding context like "legal/compliance-verified" seems to help with refusal behavior.
Double-layer moderation (see the combined sketch below):
- Use OpenAI's Moderation API on the input.
- AND have your backend verify the output for suspicious content before showing it to the user.
Consider using function calling or tool-use patterns to sandbox certain types of outputs instead of relying purely on raw completions (see the second sketch below).
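Putting the two moderation layers together, a rough pipeline might look like this (it reuses the `is_flagged` helper from earlier; `SYSTEM_PROMPT` is a placeholder for your own instructions):

```python
SYSTEM_PROMPT = "You are a support assistant for <your business domain>."  # placeholder

def handle_request(user_input: str) -> str:
    # Layer 1: screen the input before it reaches the model
    if is_flagged(user_input):
        return "Sorry, that request can't be processed."
    completion = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )
    answer = completion.choices[0].message.content or ""
    # Layer 2: screen the output before it reaches the user
    if is_flagged(answer):
        return "Sorry, that response was withheld."
    return answer
```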
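And for the tool-use idea, a sketch of forcing the model through a narrow, business-specific function (`lookup_order_status` is a hypothetical example; swap in your own domain tools):

```python
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order_status",  # hypothetical business-domain tool
        "description": "Look up the status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

completion = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[{"role": "user", "content": "Where is my order 12345?"}],
    tools=tools,
    tool_choice="required",  # the model must answer via one of your tools
)
# Anything the model produces is now a structured tool call your backend
# can validate, rather than free-form text.
print(completion.choices[0].message.tool_calls)
```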