ResearchAudio.io
The Complete ML Jargon Guide
Part 2: Training, Architectures, and Common Terms
Welcome to Part 2 of our ML jargon guide. Yesterday we covered the foundations: core concepts, how models learn, neural network components, and language models.
Today we complete the picture with training approaches, model architectures, multimodal systems, and the buzzwords you hear constantly in AI discussions.
Part 2 Contents
5. Training Approaches — Different ways to teach AI
6. Architectures — Blueprints for building AI systems
7. Multimodal Systems — AI that sees, reads, and hears
8. Common Terms — Buzzwords decoded

Training Approaches
Understanding these methods explains how models go from raw text prediction to helpful assistants.
How Modern LLMs Are Built
Pre-training (learn from the internet) → SFT (learn from examples) → RLHF (learn from feedback)
Each stage builds on the previous. Pre-training creates raw capability, SFT teaches format, RLHF aligns with preferences.

Pre-training
The initial phase where a model learns from massive amounts of general data—often most of the internet. Creates a "foundation model" with broad knowledge. Extremely expensive, costing millions in compute. The result can predict text but is not yet helpful.
Analogy: General education from kindergarten through 12th grade. Broad foundation before specialization.
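If you like seeing ideas in code, here is the smallest possible version of "learning to predict text": a bigram model that counts which word follows which in a tiny made-up corpus. Real pre-training learns far richer patterns over trillions of tokens, but the objective — predict the next token from what came before — is the same.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which — a toy stand-in for pre-training."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for cur, nxt in zip(words, words[1:]):
        counts[cur][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most likely next word seen during 'training'."""
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ran on the grass"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" — it followed "the" most often
```

After "training," the model can only predict likely continuations — it is not yet helpful, which is exactly why the later stages exist.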
Fine-tuning
Takes a pre-trained model and trains it further on specific data to specialize it. Much cheaper than pre-training because you build on existing knowledge. Companies fine-tune on internal documents to create custom assistants.
Analogy: A trained chef taking a two-week Japanese cuisine course. They specialize existing skills.
RL and RLHF
Reinforcement Learning (RL): Learning through trial and error with rewards. RLHF (RL from Human Feedback): Humans rate outputs as good or bad; the model learns to produce higher-rated responses. This is how ChatGPT became helpful.
Analogy: Training a dog with treats. Good behavior gets rewards. RLHF is treats for AI.
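Here is the "treats for AI" loop in miniature. The responses and ratings are made up for illustration, and real RLHF trains a reward model and updates billions of parameters with RL — but the core idea is the same: rated feedback nudges the model toward preferred outputs.

```python
# Toy RLHF-style loop: human ratings steer preferences toward good responses.
responses = [
    "Here's a clear, step-by-step answer.",
    "lol idk, figure it out",
    "Good question — let me explain.",
]
human_rating = {responses[0]: +1, responses[1]: -1, responses[2]: +1}
scores = {r: 0.0 for r in responses}  # the "policy": a preference per response

for round_ in range(10):              # each round, every candidate is tried and rated
    for r in responses:
        scores[r] += 0.1 * human_rating[r]  # reward nudges preference up or down

best = max(scores, key=scores.get)
print(best)  # a well-rated response; the rude one sinks to the bottom
```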
SFT (Supervised Fine-Tuning)
Training with human-written examples of ideal responses. Provide pairs of prompts and perfect answers; the model learns to mimic that style. SFT typically happens before RLHF—teach format with examples first, then refine with feedback.
Analogy: A writing class where the teacher shows A+ essays. Learn by studying excellent examples.
LoRA (Low-Rank Adaptation)
Efficient fine-tuning technique. Instead of updating all billions of parameters, LoRA adds small "adapter" layers and trains only those. Original model stays frozen. Dramatically reduces memory and compute while achieving similar results.
Analogy: Instead of renovating your entire house, you add a new room. Main structure stays intact.
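The savings are easy to see with some back-of-the-envelope bookkeeping. For a single weight matrix W of size d × k, LoRA trains two small factors B (d × r) and A (r × k) instead of W itself. The sizes below are illustrative:

```python
# LoRA bookkeeping sketch: freeze W (d x k), train only B (d x r) and A (r x k).
d, k, r = 1000, 1000, 8

full_params = d * k            # parameters updated by ordinary fine-tuning
lora_params = d * r + r * k    # parameters updated by LoRA (B and A only)

# The adapted weight is W' = W + B @ A; after training, B @ A can be
# merged back into W, so inference costs nothing extra.
print(full_params, lora_params, full_params // lora_params)
```

For this one matrix, LoRA trains roughly 60× fewer parameters, and the gap grows as layers get larger.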
Architectures
Different blueprints for building AI systems. The transformer has become dominant, but alternatives provide context.
Attention: What the Model Focuses On
"The cat sat on the mat because it was tired"
When processing "it," attention determines that "cat" is most relevant. This is how models resolve references.

Transformer
The architecture behind GPT, Claude, Llama, and most modern AI. Introduced in the 2017 paper "Attention Is All You Need." Processes all tokens simultaneously using attention mechanisms. Unlike older sequential models, transformers can be massively parallelized.
Analogy: Older models read like a typewriter, one letter at a time. Transformers read like you—seeing the whole page at once.
Attention Mechanism
Allows the model to focus on relevant parts of input when processing each position. For every token, attention computes how much to "attend to" every other token. Self-attention, multi-head attention, and cross-attention are variations.
Analogy: At a crowded party, you focus on one conversation while background noise fades away.
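The mechanism itself fits in a few lines. This is a stripped-down scaled dot-product attention for one query over two tokens, with hand-made vectors standing in for learned ones: score every key against the query, softmax the scores into weights, then blend the value vectors by those weights.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a tiny sequence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output is the weight-averaged mix of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

keys   = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for "cat" and "mat"
values = [[5.0, 5.0], [1.0, 1.0]]
query  = [1.0, 0.2]                # "it" points mostly toward "cat"
out, weights = attention(query, keys, values)
print(weights)  # the first weight is larger: "it" attends to "cat"
```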
CNN (Convolutional Neural Network)
Designed for processing images. Uses "filters" that slide across the image, detecting patterns at different scales—first edges, then shapes, then objects. Dominated computer vision before transformers. Still efficient for many image tasks.
Analogy: Examining a photograph through a magnifying glass, one patch at a time. First edges, then shapes, then "that is a cat."
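The sliding-filter idea is simple enough to run by hand. Below, a single 3×3 vertical-edge filter slides across a tiny made-up grayscale strip (0 = dark, 9 = bright) — a one-filter sketch of the operation CNNs stack into whole networks:

```python
# One 3x3 filter slid over a tiny "image": the building block of a CNN.
image = [
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
]
# Vertical-edge filter: responds where brightness changes left-to-right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            patch_sum = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
            row.append(patch_sum)
        out.append(row)
    return out

print(convolve(image, kernel))  # [[27, -27, -27]]
```

The positive response marks the dark-to-bright edge and the negative ones the bright-to-dark edge — exactly the "first edges" stage of the hierarchy.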
MoE (Mixture of Experts)
Contains multiple "expert" sub-networks. A router decides which experts handle each input. The model has many total parameters but activates only a fraction of them per input, making it efficient. Mixtral uses MoE, and GPT-4 is widely reported to as well.
Analogy: A hospital with specialists. Triage routes you to the right expert—cardiologist, orthopedist, etc.
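A toy router makes the triage concrete. Everything here — the two "experts," the gate weights, the input features — is invented for illustration; real MoE layers learn the gate and run neural sub-networks, but the top-1 routing logic looks like this:

```python
# Toy top-1 routing: a gate scores each expert and only the winner runs.
experts = {
    "math": lambda x: x * x,   # pretend specialist sub-networks
    "text": lambda x: x + 1,
}
gate = {"math": [1.0, 0.0], "text": [0.0, 1.0]}  # per-expert scoring weights

def route(features, x):
    # Score each expert against the input's features; run only the top one.
    scores = {name: sum(w * f for w, f in zip(ws, features))
              for name, ws in gate.items()}
    winner = max(scores, key=scores.get)
    return winner, experts[winner](x)

print(route([0.9, 0.1], 3))  # ('math', 9): math-ish input goes to the math expert
print(route([0.2, 0.8], 3))  # ('text', 4)
```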
Multimodal Systems
Modern AI increasingly handles multiple input types—text, images, audio, video—simultaneously.
Multimodal: Multiple Input Types
📝 Text · 🖼 Images · 🔊 Audio · 🎬 Video → Unified Understanding

Multimodal
AI that processes multiple input types: text, images, audio, video. GPT-4V, Claude with vision, and Gemini are multimodal. You can show them images and ask questions—much more powerful than describing things in words.
Analogy: Humans are naturally multimodal—we see, hear, and read simultaneously. AI is catching up.
Vision Encoder
Converts images into numerical representations (embeddings) the language model can understand. Takes raw pixels and outputs numbers in the same format as text embeddings, allowing unified processing.
Analogy: Your eyes and visual cortex. Light hits retina (pixels), converts to neural signals (embeddings).
CLIP
OpenAI model connecting images and text in a shared embedding space. Trained on hundreds of millions of image-caption pairs. Given an image, it finds matching text; given text, it finds matching images. Powers many search and generation systems.
Analogy: A translator who speaks both "image" and "text" languages fluently.
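"Shared embedding space" just means images and captions become vectors you can compare directly. Here is the matching step with tiny hand-made vectors standing in for real CLIP embeddings: cosine similarity picks the caption closest to the image.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings in a shared space (hand-made, not real CLIP vectors).
image_embedding = [0.9, 0.1, 0.0]  # a photo of a cat
captions = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a stock chart":    [0.0, 0.0, 1.0],
}

best = max(captions, key=lambda c: cosine(image_embedding, captions[c]))
print(best)  # the cat caption sits closest to the cat image
```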
Common Terms
These terms appear constantly in AI discussions. Understanding them helps you follow conversations and evaluate claims.
Hallucination
When AI confidently generates false information. Invents citations, states incorrect facts, makes up events—all while sounding confident. A fundamental limitation because LLMs predict plausible text, not necessarily true text.
Analogy: A friend who always has a story but sometimes fills gaps with plausible-sounding made-up details.
RAG (Retrieval-Augmented Generation)
Instead of relying solely on training knowledge, RAG first searches a database for relevant information, then feeds that context to generate a response. Grounds output in specific sources and reduces hallucination.
Analogy: Open-book versus closed-book exam. RAG looks up information before answering.
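A minimal retrieve-then-generate sketch, with made-up documents: score each document by word overlap with the question, then prepend the winner to the prompt as context. Production RAG systems use embedding search instead of word overlap, but the pipeline shape is the same.

```python
import re

docs = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 to 7 business days.",
    "Gift cards never expire.",
]

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs):
    # Pick the document sharing the most words with the question.
    q = tokens(question)
    return max(docs, key=lambda d: len(q & tokens(d)))

question = "What is the refund policy for returns?"
context = retrieve(question, docs)
prompt = f"Context: {context}\n\nQuestion: {question}"
print(context)  # the refund document is retrieved as grounding
```

The model then answers from the supplied context rather than from memory alone — the "open book."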
Prompt Engineering
Crafting instructions to get better AI outputs. Small wording changes can dramatically affect quality. Techniques include being specific, providing examples, asking for step-by-step reasoning, and specifying output format.
Analogy: Interview technique. "Tell me about yourself" rambles. "What is your biggest achievement?" gets focused responses.
Chain of Thought
Asking the model to show reasoning step by step before the final answer. Adding "Let's think step by step" can significantly improve accuracy on math, logic, and reasoning problems.
Analogy: "Show your work" on a math test. Writing steps helps catch errors along the way.
Zero-shot and Few-shot
Zero-shot: No examples, just task description. Few-shot: A few examples provided first. Few-shot often works better because examples demonstrate expected format and style.
Analogy: Asking someone to write a poem (zero-shot) vs showing them three poems you like first (few-shot).
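In practice the difference is just what you put in the prompt string. A sketch with an invented sentiment task — the reviews and labels are illustrative, and either string would be sent to whatever LLM API you use:

```python
# Zero-shot vs few-shot prompts for a made-up sentiment task.
task = "Label the sentiment as positive or negative."

zero_shot = f"{task}\nReview: The food was amazing.\nSentiment:"

examples = [
    ("This movie was a waste of time.", "negative"),
    ("Absolutely loved the soundtrack.", "positive"),
]
shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
few_shot = f"{task}\n{shots}\nReview: The food was amazing.\nSentiment:"

print(few_shot)  # the examples demonstrate the expected label format
```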
AGI (Artificial General Intelligence)
Hypothetical AI that can perform any intellectual task a human can. Current AI is "narrow"—excellent at specific tasks but not truly general-purpose. AGI would learn any skill, reason across domains, adapt to novel situations. Timelines hotly debated.
Analogy: Current AI is a chess grandmaster who cannot make toast. AGI would learn any skill you throw at it.
Benchmark
Standardized test for comparing AI models. MMLU tests general knowledge, HumanEval tests coding, HellaSwag tests common sense. Enables apples-to-apples comparisons, though models can optimize for benchmarks without matching real-world performance.
Analogy: Standardized tests like the SAT. Everyone takes the same test for fair comparison.
Complete Quick Reference

Fundamentals
Model — Learned patterns from data
Training — Learning phase
Inference — Using phase
Parameters — Adjustable values

Learning
Loss — Error measurement
Gradient — Direction to improve
Epoch — One complete data pass
Backprop — Tracing errors backward

Language Models
Token — Text unit, about 4 chars
Embedding — Words as numbers
Context — How much model sees
Temperature — Creativity level

Key Terms
Transformer — Dominant architecture
Attention — Focus mechanism
RAG — Retrieval + generation
Hallucination — Confident errors

Combined with Part 1, you now have 45 essential terms to navigate any AI conversation. Save both parts as your reference guide.
ResearchAudio.io
AI research explained clearly.