AI Content Detection

As models advance, AI content detection tools are struggling to keep up. Text generated by the latest Gemini, GPT and Claude models is fooling even the best of them.

We’ve decided to bring AI content detection back in-house to keep pace. Each time a new model comes out, the classifier needs to be fine-tuned on that model’s output.

Our base model, DEJAN-LM, was pre-trained on a 10,000,000-sentence dataset using masked language modelling (MLM) on top-quality content from websites with excellent editorial practices. DEJAN-LM is a web article expert.
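For readers unfamiliar with MLM, here is a minimal sketch of how one training example can be prepared. It is an illustration only, not the DEJAN-LM pipeline: the whitespace tokenisation, the `[MASK]` token and the 15% masking rate (the convention popularised by BERT) are all assumptions, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"  # assumed mask token, not confirmed for DEJAN-LM


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for one MLM training example.

    labels[i] holds the original token where a mask was applied,
    and None where the token was left untouched.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must predict this token
            labels.append(tok)
        else:
            masked.append(tok)    # visible context, no loss computed here
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    sentence = "good editorial practice makes articles easier to trust".split()
    print(mask_tokens(sentence, rng=random.Random(42)))
```

During pre-training the model sees the masked sequence and is scored only on the positions where a label is present.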

The model was fine-tuned for AI content detection on a 20,000,000-sentence dataset: 50% original human content and 50% AI paraphrase or derivative content.
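A 50/50 split like this can be assembled in a few lines. The sketch below is an assumption about how such a set might be built, not a description of the actual pipeline; the function name `build_balanced_dataset` and the label scheme (0 for human, 1 for AI) are hypothetical.

```python
import random


def build_balanced_dataset(human_sentences, ai_sentences, seed=0):
    """Return a shuffled list of (sentence, label) pairs with equal class sizes.

    Label 0 marks human text and label 1 marks AI text (assumed convention).
    """
    rng = random.Random(seed)
    n = min(len(human_sentences), len(ai_sentences))
    # Downsample the larger class so both contribute exactly n sentences.
    human = rng.sample(list(human_sentences), n)
    ai = rng.sample(list(ai_sentences), n)
    rows = [(s, 0) for s in human] + [(s, 1) for s in ai]
    rng.shuffle(rows)  # avoid feeding the classifier one class at a time
    return rows
```

Downsampling the larger class is only one way to balance the data; weighting the loss per class would be an alternative.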

Test Results

[Figures: detection results for text generated by GPT-4, GPT-4.5, GPT-4o-mini, GPT-4o, GPT-o3 and GPT-o4-mini]

Manual Algorithm & Heuristics

It’s clear that OpenAI’s latest model flies under the radar of deep-learning-based detection, so we went old school. We processed the 20,000,000-sentence dataset to extract the top 1,000 words for each class, ranked by how often they occur in the dataset. We then normalise the counts so that non-discriminating words, those about equally common in both classes, eliminate themselves.
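The word-list step can be sketched as follows. The exact normalisation is not spelled out above, so this example assumes one plausible reading: convert counts to each word's share of its class, then keep only the excess over the other class, so a word with the same share in both drops out. The tokeniser and the names `top_words` and `class_weights` are hypothetical.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # crude word tokeniser (assumption)


def top_words(sentences, k=1000):
    """Map each of the k most frequent words to its share of all tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(TOKEN_RE.findall(sentence.lower()))
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.most_common(k)}


def class_weights(human_sentences, ai_sentences, k=1000):
    """Return (human_weights, ai_weights).

    A word's weight is how much larger its share is in one class than in
    the other. Words that are not more frequent in a class get no weight
    there, so words shared equally by both classes vanish from both lists.
    """
    human = top_words(human_sentences, k)
    ai = top_words(ai_sentences, k)
    human_w = {w: f - ai.get(w, 0.0)
               for w, f in human.items() if f > ai.get(w, 0.0)}
    ai_w = {w: f - human.get(w, 0.0)
            for w, f in ai.items() if f > human.get(w, 0.0)}
    return human_w, ai_w
```

One limitation of this sketch is that a word outside the other class's top `k` is treated as having zero share there, which slightly overstates its weight.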

The two lists of top words and their weights were used in a simple ranking algorithm to help our deep learning model where it struggles.
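A simple scorer over two weighted word lists might look like the sketch below. The scoring rule is an assumption, since the ranking algorithm is not specified: it sums the weights of AI-leaning and human-leaning words found in the text and reports the AI share. `heuristic_ai_likelihood` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(r"[a-z']+")  # crude word tokeniser (assumption)


def heuristic_ai_likelihood(text, human_w, ai_w):
    """Return a value in [0, 1]: the AI share of the matched word weight.

    human_w and ai_w map words to positive weights for each class.
    Text containing no weighted words scores 0.0, meaning no evidence.
    """
    tokens = TOKEN_RE.findall(text.lower())
    ai_score = sum(ai_w.get(t, 0.0) for t in tokens)
    human_score = sum(human_w.get(t, 0.0) for t in tokens)
    total = ai_score + human_score
    if total == 0.0:
        return 0.0
    return ai_score / total


if __name__ == "__main__":
    # Toy weights for illustration only, not values from the real lists.
    ai_weights = {"delves": 0.3, "tapestry": 0.2}
    human_weights = {"cat": 0.1}
    print(heuristic_ai_likelihood("The cat delves", human_weights, ai_weights))
```

Because it relies only on word frequencies, a scorer like this is cheap to run alongside the deep learning model and can be regenerated whenever the dataset changes.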

As a result, the classification confidence for the elusive GPT-o4-mini rose from a mere 20.8% to 68.1%, which puts it in the “Yes, it’s AI generated!” category.

  • Model AI Likelihood: 20.8%
  • Heuristic AI Likelihood: 47.3%
  • Combined AI Likelihood: 68.1%
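The figures above are consistent with a simple additive combination, since 20.8 + 47.3 = 68.1. The sketch below assumes that reading. The cap at 100%, the 50% decision threshold and the "Likely human" label are assumptions for illustration and are not stated in the results.

```python
def combined_ai_likelihood(model_pct, heuristic_pct):
    """Add the two percentage scores, capped at 100 (the cap is an assumption)."""
    return min(100.0, model_pct + heuristic_pct)


def verdict(combined_pct, threshold=50.0):
    """Map a combined score to a category; the threshold is an assumed value."""
    if combined_pct >= threshold:
        return "Yes, it's AI generated!"
    return "Likely human"  # hypothetical label for the other category


if __name__ == "__main__":
    score = combined_ai_likelihood(20.8, 47.3)
    print(f"{score:.1f}%", verdict(score))
```

Under these assumptions the deep learning score alone (20.8%) falls below the threshold, and the heuristic contribution is what carries the text over it.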

