- Dataset: 6,069 samples
- Labels: 0 = Human / 1 = AI
- Train / test split: 4,855 / 1,214
- Features: TF-IDF (1–2 grams), max 10,000 features
What We Found
| Model | Accuracy | Precision | Recall | F1 | Notes |
|---|---|---|---|---|---|
| TF-IDF + Logistic Regression (full test set) | 0.9885 | 0.9886 | 0.9886 | 0.9886 | Fast baseline, strong and stable on full 1,214-sample test split. |
| BERT (fine-tuned subset) | 0.9975 | 0.9951 | 1.0000 | 0.9975 | Evaluated on a reduced subset (train=800, test=400), so a direct comparison with the full-test-set baseline is limited. |
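The table's metrics follow scikit-learn's standard definitions; a minimal sketch of how such scores are computed from a prediction vector (the label arrays below are illustrative only, not the project's actual outputs):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels only (0 = Human, 1 = AI); not the project's real predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of texts flagged AI, how many really were AI
rec = recall_score(y_true, y_pred)      # of AI texts, how many were caught
f1 = f1_score(y_true, y_pred)
print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# → acc=0.875 precision=1.000 recall=0.750 f1=0.857
```

Precision and recall here are reported for the AI (positive) class, which matches how the BERT row's perfect recall should be read: every AI sample in its subset was caught.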
Method Details
Data cleaning and preprocessing produced a balanced dataset of 6,069 samples after normalization. The baseline model used TF-IDF features plus Logistic Regression and achieved strong generalization on the full 1,214-sample test split.
The BERT experiment improved metrics on its evaluated subset, with especially strong recall for AI class detection. Because BERT was run on a smaller subset for compute feasibility, the result is best treated as a directional improvement signal, not a strict apples-to-apples benchmark against the full baseline.
The OpenAI prompted-comparison section in the notebook adds a practical "external judge" perspective and highlights the prompt-sensitivity risk of relying on LLM-only classification.
Challenges
Compute Constraints
BERT fine-tuning required a reduced subset (800 train / 400 test), making throughput manageable but limiting one-to-one comparability with full-dataset baseline results.
External Dependency Friction
The notebook runs note unauthenticated Hugging Face requests (rate-limit risk) and optional OpenAI API usage (sensitivity to cost, rate limits, and reproducibility).
Prompt Sensitivity
OpenAI detection behavior depends on the system-prompt framing (strict forensic vs. balanced vs. style-sensitive), which can shift decision thresholds and response consistency.
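One pragmatic mitigation, not taken from the notebook but a common pattern, is to query the same text under each prompt framing and only trust a label when the framings agree. A minimal sketch with a stubbed judge standing in for the real API call (the variant names and stub behavior are illustrative assumptions):

```python
from collections import Counter

PROMPT_VARIANTS = ["strict_forensic", "style_sensitive", "balanced"]  # hypothetical names

def stub_judge(variant: str, text: str) -> str:
    """Stand-in for an LLM call; returns 'AI' or 'Human'. Swap in a real client call."""
    # Illustrative behavior: the strict variant defaults to Human on short inputs.
    if variant == "strict_forensic" and len(text) < 40:
        return "Human"
    return "AI" if "as an ai" in text.lower() else "Human"

def consensus_label(text: str, min_agree: int = 2) -> str:
    """Majority vote across prompt framings; abstain when agreement is too weak."""
    votes = Counter(stub_judge(v, text) for v in PROMPT_VARIANTS)
    label, count = votes.most_common(1)[0]
    return label if count >= min_agree else "uncertain"

print(consensus_label("As an AI language model, I cannot..."))  # → AI
```

Raising `min_agree` to 3 turns the vote into a unanimity requirement, trading coverage for consistency across framings.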
Notebook Code Snippets
TF-IDF + Logistic Regression baseline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_tfidf_baseline(X_train, X_test, y_train, y_test):
    vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    model = LogisticRegression(max_iter=2000, random_state=SEED)
    model.fit(X_train_tfidf, y_train)
    pred = model.predict(X_test_tfidf)
    return model, vectorizer, pred
```

BERT subset sizing and training configuration:

```python
from transformers import TrainingArguments

BERT_TRAIN_SAMPLES = min(800, len(X_train_raw))
BERT_TEST_SAMPLES = min(400, len(X_test_raw))

training_args = TrainingArguments(
    output_dir=str(outputs_dir),
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
)
```

OpenAI prompted comparison (prompt bodies elided as in the notebook):

```python
SYSTEM_PROMPTS = {
    "strict_forensic": "...Prefer high precision...",
    "style_sensitive": "...Focus on repetition and generic abstraction...",
    "balanced": "...calibrated classifier...",
}

response = client.responses.create(
    model=OPENAI_MODEL,
    input=[{"role": "system", "content": system_prompt}, {"role": "user", "content": text}],
)
```

Performance Summary
The project baseline already performs very strongly and is deployment-friendly due to speed and simplicity. BERT shows further gains on a smaller evaluated subset, indicating deeper contextual representations can help when compute budget allows. Overall, the practical recommendation is to keep TF-IDF + Logistic Regression as the default production model and use BERT or prompted LLM analysis as higher-cost secondary validation paths when extra accuracy or interpretive checks are needed.
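That recommendation can be wired up as a simple escalation gate: serve the fast baseline everywhere and invoke the expensive secondary check only when the baseline is unsure. A minimal sketch, where the confidence band, the toy training data, and the `secondary_check` stub are all illustrative assumptions rather than project values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_with_escalation(model, vectorizer, text, secondary_check, band=0.15):
    """Serve the cheap TF-IDF baseline; escalate only near the decision boundary."""
    p_ai = model.predict_proba(vectorizer.transform([text]))[0][1]  # P(class 1 = AI)
    if abs(p_ai - 0.5) >= band:                    # confident enough: trust the baseline
        return int(p_ai >= 0.5), "baseline"
    return secondary_check(text), "secondary"      # uncertain: pay for BERT / LLM judge

# Tiny illustrative fit so the sketch runs end to end; real use would pass the
# project's trained baseline model and vectorizer instead.
vec = TfidfVectorizer()
toy_texts = ["delve into the tapestry of synergy", "lol see u at 5",
             "leverage holistic paradigms", "grabbing coffee brb"]
toy_labels = [1, 0, 1, 0]
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(toy_texts), toy_labels)

label, route = classify_with_escalation(clf, vec, "delve into synergy", lambda t: 1)
print(label, route)
```

Widening `band` routes more traffic to the secondary path, letting accuracy be traded against cost without retraining anything.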