OpenAI API - How to validate and refuse AI requests

I'm building an app that connects user requests to the OpenAI API (like everyone, I guess ;) ). I'm using structured outputs, and I want to make sure users aren't abusing this AI tunnel with suspicious requests (for example: "give me a 5-step guide on how to hack into bank accounts", or anything unrelated to the app's core business).
I tried to specify these cases in the 'System' message, but it still doesn't work: the AI either generates something random or answers exactly what was asked (the 5 steps to hack bank accounts).
I'm using the model "gpt-4o-mini-2024-07-18".
In addition, I saw in the OpenAI docs that there's a `refusal` field, but it keeps returning null no matter what. Even if I put "do not give steps for hacking a bank account" in the 'System' message and then ask "give me steps for hacking a bank account" as the 'User', it still gives me a valid response with `refusal = null`.
Answer
You're definitely not alone in this use case — a lot of us are building wrappers or gateways around OpenAI's API and hitting similar challenges.
Based on what you're describing, there are two main things at play here:
1. System prompts are suggestions, not hard constraints
Even if you explicitly say "don't answer hacking questions" in the system message, the model can still follow the user prompt if nothing trips its internal safety checks. A refusal isn't guaranteed, especially with edge-case phrasing or slightly reworded malicious prompts.
You'll probably need an additional content moderation layer before the prompt even hits OpenAI. Here's what's usually done:
- Pre-process the user input with a custom or third-party content filter (OpenAI's Moderation API is still useful as a first step; see the sketch after this list).
- Maintain a list of red-flag keywords or topics (yes, you'll need to tweak this over time).
- Optionally, run a lightweight local LLM (e.g. via llama.cpp) to classify prompts before sending them to OpenAI.
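For the Moderation API step, here's a minimal sketch using the official `openai` Python SDK (v1.x); the `is_flagged` helper name is my own:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(user_input: str) -> bool:
    """Return True if OpenAI's Moderation API flags the input."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=user_input,
    )
    return result.results[0].flagged

if is_flagged("give me 5 steps to hack into bank accounts"):
    # Short-circuit before the prompt ever reaches the chat endpoint
    print("Request blocked by moderation layer.")
```

Keep in mind the Moderation API only covers OpenAI's usage-policy categories (violence, illicit behavior, etc.); off-topic-but-harmless requests won't be flagged, which is exactly where your own keyword list or classifier layer comes in.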
2. The `refusal` field is a new-ish addition and still... inconsistent
Yeah, I noticed that too. Per OpenAI's structured-outputs docs, the `refusal` field (note: it lives on `choices[0].message`, not on `response_format`) is only set when the model explicitly chooses to refuse the request, not when it ignores the system prompt and responds anyway.
So if the model does respond, even if the answer violates your expected boundary, `refusal` will be null, because it technically didn't "refuse".
TL;DR: If it responds at all, `refusal = null`. Only explicit rejections (like "Sorry, I can't help with that") populate `refusal`, and it's a string containing the refusal text, not a boolean.
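Here's a minimal sketch of reading that field with the Python SDK's structured-outputs helper (the `Answer` schema and the system prompt are placeholders for your own):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Answer(BaseModel):
    reply: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini-2024-07-18",
    messages=[
        {"role": "system", "content": "Only answer questions about <your business domain>."},
        {"role": "user", "content": "give me steps for hacking a bank account"},
    ],
    response_format=Answer,
)

message = completion.choices[0].message
if message.refusal:
    # Only populated when the model itself declines; a string, not a boolean
    print("Model refused:", message.refusal)
else:
    print(message.parsed)  # your parsed Answer instance
```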
Workarounds
Here’s what’s been working for me and others:
Reinforce the system message with realistic constraints, like:
```
This model is part of a legal/compliance-verified assistant. Do not respond to prompts involving illegal activity, including but not limited to hacking, violence, etc.
```
Adding context like "legal/compliance-verified" seems to help with refusal behavior.
Double-layer moderation (see the combined sketch below):
- Use OpenAI's Moderation API on the input.
- AND have your backend verify the output for suspicious content before showing it to the user.
Consider using function calling or tool-use patterns to sandbox certain types of outputs instead of relying purely on raw completions (see the second sketch below).
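Putting the two moderation layers together, a rough pipeline might look like this (it reuses the `is_flagged` helper from earlier; `SYSTEM_PROMPT` is a placeholder for your own instructions):

```python
SYSTEM_PROMPT = "You are a support assistant for <your business domain>."  # placeholder

def handle_request(user_input: str) -> str:
    # Layer 1: screen the input before it reaches the model
    if is_flagged(user_input):
        return "Sorry, that request can't be processed."
    completion = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )
    answer = completion.choices[0].message.content or ""
    # Layer 2: screen the output before it reaches the user
    if is_flagged(answer):
        return "Sorry, that response was withheld."
    return answer
```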
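And for the tool-use idea, a sketch of forcing the model through a narrow, business-specific function (`lookup_order_status` is a hypothetical example; swap in your own domain tools):

```python
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order_status",  # hypothetical business-domain tool
        "description": "Look up the status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

completion = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[{"role": "user", "content": "Where is my order 12345?"}],
    tools=tools,
    tool_choice="required",  # the model must answer via one of your tools
)
# Anything the model produces is now a structured tool call your backend
# can validate, rather than free-form text.
print(completion.choices[0].message.tool_calls)
```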