Training Concepts

Supervised Learning

Teaching a model by showing it labeled examples. Like training a spam filter by feeding it thousands of emails already marked "spam" or "not spam." The model learns patterns from these examples and applies them to new data it has never seen.

circle-check

Reinforcement Learning (RLHF)

Training through trial and error, using rewards and penalties instead of fixed labels. Think of teaching a dog with treats and corrections. The model tries different approaches, gets feedback on what works, and gradually improves.

Example: An AI learning to play a game gets points for good moves and loses points for mistakes. Over thousands of rounds, it figures out winning strategies.

circle-check

Direct Preference Optimization (DPO)

Instead of saying "this answer is right, that one is wrong," you're saying "I prefer this response over that one." The model learns to match human preferences, making it better at producing helpful, appropriate answers.

Example: You see two customer support responses to the same question: one clear and polite, one vague and wordy. You choose the better one. The model learns from thousands of these preferences.

circle-check

Human-in-the-Loop (HITL)

Keeping humans actively involved in critical decisions instead of letting AI run fully automated. This could involve labeling data, reviewing model outputs, or approving high-stakes decisions.

Why it matters: Content moderators double-check AI flags. Doctors review AI-suggested diagnoses. Support agents correct AI-drafted replies. The AI learns while humans handle edge cases and sensitive decisions.

circle-check

Process Supervision

Evaluating how an AI reaches an answer, not just whether the final answer is correct. Instead of only rewarding correct outputs, you check the reasoning steps and give feedback on whether the process follows good practices.

Example: An AI might reach the right conclusion but skip important safety checks. Process supervision teaches models to follow proper workflows, not just guess answers that seem correct.

circle-check

Rubrics & Verifiers

Creating evaluation frameworks that define what "good" looks like, then checking if outputs meet those standards. Rubrics are detailed scoring criteria (like a grading sheet), while verifiers are systems that automatically check whether specific requirements are met.

Example: A writing rubric might score clarity (1-5), accuracy (1-5), and tone (1-5). A verifier might automatically check that a medical AI's response includes required disclaimers or that code runs without errors.

circle-check

Red-Teaming

Deliberately testing an AI's limits by trying to make it fail, produce harmful outputs, or bypass safety measures. Like ethical hackers who break into systems to find vulnerabilities before bad actors do.

Example: You might try creative prompts to see if a chatbot can be tricked into giving dangerous advice, revealing private information, or producing biased content. Finding these weaknesses helps developers build better guardrails.

circle-check

Adversarial Testing

Similar to red-teaming but broader—systematically creating challenging scenarios to test model robustness. This includes unusual inputs, edge cases, and real-world complexity that might confuse or break the AI.

Example: Testing a self-driving AI with rare weather conditions, confusing road signs, or unusual pedestrian behavior. Or testing a language model with ambiguous questions, contradictory instructions, or inputs from different languages mixed together.

circle-check

Constitutional AI / Rule-Based Training

Teaching models to follow specific principles or rules rather than learning only from examples. The AI is given explicit guidelines (a "constitution") about what responses are acceptable and what crosses the line.

Example: Instead of just showing examples of harmful vs. helpful content, you give the model principles like "Be helpful, harmless, and honest" or "Never provide instructions for illegal activities." The model learns to evaluate its own outputs against these rules.

circle-check

Last updated