- Dataset: 6,069 samples
- Labels: 0 = Human / 1 = AI
- Train / test split: 4,855 / 1,214
- Features: TF-IDF (1–2 grams), max 10,000 features
What We Found
| Model | Accuracy | Precision | Recall | F1 | Notes |
|---|---|---|---|---|---|
| TF-IDF + Logistic Regression (full test set) | 0.9885 | 0.9886 | 0.9886 | 0.9886 | Fast baseline, strong and stable on full 1,214-sample test split. |
| BERT (fine-tuned subset) | 0.9975 | 0.9951 | 1.0000 | 0.9975 | Evaluated on a reduced subset (train=800, test=400), so a direct comparison with the full-test-set baseline is limited. |
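The table's metrics follow scikit-learn's standard definitions; a minimal sketch of how such scores are computed from a prediction vector (the label arrays below are illustrative only, not the project's actual outputs):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels only (0 = Human, 1 = AI); not the project's real predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of texts flagged AI, how many really were AI
rec = recall_score(y_true, y_pred)      # of AI texts, how many were caught
f1 = f1_score(y_true, y_pred)
print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# → acc=0.875 precision=1.000 recall=0.750 f1=0.857
```

Precision and recall here are reported for the AI (positive) class, which matches how the BERT row's perfect recall should be read: every AI sample in its subset was caught.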
Method Details
Data cleaning and preprocessing produced a balanced dataset of 6,069 samples after normalization. The baseline model used TF-IDF features plus Logistic Regression and achieved strong generalization on the full 1,214-sample test split.
The BERT experiment improved metrics on its evaluated subset, with especially strong recall for AI class detection. Because BERT was run on a smaller subset for compute feasibility, the result is best treated as a directional improvement signal, not a strict apples-to-apples benchmark against the full baseline.
The OpenAI prompted-comparison section in the notebook adds a practical "external judge" perspective and highlights the prompt-sensitivity risk of relying on LLM-only classification.
Challenges
Compute Constraints
BERT fine-tuning required a reduced subset (800 train / 400 test), making throughput manageable but limiting one-to-one comparability with full-dataset baseline results.
External Dependency Friction
The notebook runs note unauthenticated Hugging Face requests (rate-limit risk) and optional OpenAI API usage (sensitivity to cost, rate limits, and reproducibility).
Prompt Sensitivity
OpenAI detection behavior depends on the system-prompt framing (strict forensic vs. balanced vs. style-sensitive), which can shift decision thresholds and response consistency.
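One pragmatic mitigation, not taken from the notebook but a common pattern, is to query the same text under each prompt framing and only trust a label when the framings agree. A minimal sketch with a stubbed judge standing in for the real API call (the variant names and stub behavior are illustrative assumptions):

```python
from collections import Counter

PROMPT_VARIANTS = ["strict_forensic", "style_sensitive", "balanced"]  # hypothetical names

def stub_judge(variant: str, text: str) -> str:
    """Stand-in for an LLM call; returns 'AI' or 'Human'. Swap in a real client call."""
    # Illustrative behavior: the strict variant defaults to Human on short inputs.
    if variant == "strict_forensic" and len(text) < 40:
        return "Human"
    return "AI" if "as an ai" in text.lower() else "Human"

def consensus_label(text: str, min_agree: int = 2) -> str:
    """Majority vote across prompt framings; abstain when agreement is too weak."""
    votes = Counter(stub_judge(v, text) for v in PROMPT_VARIANTS)
    label, count = votes.most_common(1)[0]
    return label if count >= min_agree else "uncertain"

print(consensus_label("As an AI language model, I cannot..."))  # → AI
```

Raising `min_agree` to 3 turns the vote into a unanimity requirement, trading coverage for consistency across framings.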
Notebook Code Snippets
TF-IDF + Logistic Regression baseline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_tfidf_baseline(X_train, X_test, y_train, y_test):
    vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    model = LogisticRegression(max_iter=2000, random_state=SEED)
    model.fit(X_train_tfidf, y_train)
    pred = model.predict(X_test_tfidf)
    return model, vectorizer, pred
```

BERT subset sizing and training configuration:

```python
from transformers import TrainingArguments

BERT_TRAIN_SAMPLES = min(800, len(X_train_raw))
BERT_TEST_SAMPLES = min(400, len(X_test_raw))

training_args = TrainingArguments(
    output_dir=str(outputs_dir),
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
)
```

OpenAI prompted comparison (prompt bodies elided as in the notebook):

```python
SYSTEM_PROMPTS = {
    "strict_forensic": "...Prefer high precision...",
    "style_sensitive": "...Focus on repetition and generic abstraction...",
    "balanced": "...calibrated classifier...",
}

response = client.responses.create(
    model=OPENAI_MODEL,
    input=[{"role": "system", "content": system_prompt}, {"role": "user", "content": text}],
)
```

Performance Summary
The project baseline already performs very strongly and is deployment-friendly due to speed and simplicity. BERT shows further gains on a smaller evaluated subset, indicating deeper contextual representations can help when compute budget allows. Overall, the practical recommendation is to keep TF-IDF + Logistic Regression as the default production model and use BERT or prompted LLM analysis as higher-cost secondary validation paths when extra accuracy or interpretive checks are needed.
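That recommendation can be wired up as a simple escalation gate: serve the fast baseline everywhere and invoke the expensive secondary check only when the baseline is unsure. A minimal sketch, where the confidence band, the toy training data, and the `secondary_check` stub are all illustrative assumptions rather than project values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_with_escalation(model, vectorizer, text, secondary_check, band=0.15):
    """Serve the cheap TF-IDF baseline; escalate only near the decision boundary."""
    p_ai = model.predict_proba(vectorizer.transform([text]))[0][1]  # P(class 1 = AI)
    if abs(p_ai - 0.5) >= band:                    # confident enough: trust the baseline
        return int(p_ai >= 0.5), "baseline"
    return secondary_check(text), "secondary"      # uncertain: pay for BERT / LLM judge

# Tiny illustrative fit so the sketch runs end to end; real use would pass the
# project's trained baseline model and vectorizer instead.
vec = TfidfVectorizer()
toy_texts = ["delve into the tapestry of synergy", "lol see u at 5",
             "leverage holistic paradigms", "grabbing coffee brb"]
toy_labels = [1, 0, 1, 0]
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(toy_texts), toy_labels)

label, route = classify_with_escalation(clf, vec, "delve into synergy", lambda t: 1)
print(label, route)
```

Widening `band` routes more traffic to the secondary path, letting accuracy be traded against cost without retraining anything.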