RLHF
Reinforcement learning from human feedback
PEFT
Parameter-efficient fine-tuning
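A minimal sketch of one popular PEFT technique, LoRA: instead of updating a full weight matrix, a small low-rank update is trained while the pre-trained weights stay frozen. The dimensions below are illustrative, not from any particular model.

```python
import numpy as np

# LoRA sketch: train a low-rank update B @ A instead of the full matrix W.
d, r = 512, 8                      # hidden size, low rank (illustrative values)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))    # frozen pre-trained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable, r x d
B = np.zeros((d, r))               # trainable, d x r (initialized to zero)

x = rng.standard_normal(d)         # an input activation
h = W @ x + B @ (A @ x)            # forward pass: frozen path + low-rank update

full_params = d * d                # 262,144 parameters in W
lora_params = 2 * d * r            # 8,192 trainable parameters (~3% of full)
```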
Self-Attention
The model attends to different parts of the input sequence to learn the contextual dependencies between words
Multi-headed Self-Attention
- Multiple sets of self-attention weights (heads) are learned in parallel, independently of each other
- The outputs of all heads are concatenated and fed through a feed-forward network to produce the encoder output
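The two points above can be sketched with NumPy. Dimensions are illustrative; each head gets its own independently learned Q/K/V projections, and the head outputs are concatenated and projected back to the model dimension.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads

x = rng.standard_normal((seq_len, d_model))          # token embeddings
Wq, Wk, Wv = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.standard_normal((d_model, d_model))         # output projection

heads = []
for h in range(n_heads):                             # heads are independent
    Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
    weights = softmax(Q @ K.T / np.sqrt(d_head))     # per-head attention weights
    heads.append(weights @ V)

out = np.concatenate(heads, axis=-1) @ Wo            # (seq_len, d_model)
```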
How many parameters does a model with general knowledge about the world have?
Hundreds of billions
How much task-specific data do you need to fine-tune for a single task, like summarizing dialog or acting as a customer service agent for a single company?
Often just 500-1,000 examples can result in good performance
Context window
The amount of text (number of tokens) available for the prompt
Inference
Using the trained model to generate text (a completion) from a prompt
Completion
Output of the model
Entity recognition
A word-classification task that identifies all the people and places in a text
Foundation models by decreasing number of parameters
PaLM -> BLOOM -> GPT-3 -> LLaMA -> Flan-T5 -> BERT
RNN
Recurrent neural network
What’s so important about the transformer architecture?
It learns the relevance and context of every word in a sentence to every other word (self-attention), and it processes all input tokens in parallel, so it scales efficiently on multi-core GPUs
Instruction Fine Tuning
Adapting a pre-trained model to specific tasks and datasets by training it on examples of prompts that include instructions, paired with their desired completions
RAG
Retrieval Augmented Generation
Knowledge-base data is used for the retrieval portion of the solution: relevant documents are retrieved and added to the prompt as context
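A minimal sketch of the retrieval step, using a toy keyword-overlap retriever over a small hypothetical knowledge base (production RAG systems typically use vector similarity search over embeddings instead):

```python
import re

knowledge_base = [
    "Returns are accepted within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Shipping is free on orders over $50.",
]

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, docs, k=1):
    # Rank documents by how many words they share with the query.
    return sorted(docs, key=lambda d: len(words(query) & words(d)), reverse=True)[:k]

query = "When are returns accepted?"
context = retrieve(query, knowledge_base)[0]

# The retrieved text is added to the prompt as context before calling the LLM.
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
```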
Origin of the Transformer Architecture
Attention Is All You Need (Vaswani et al., 2017)
What are attention weights?
The model learns the relevance of each word to all other words during training
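This learned relevance can be sketched as a matrix of attention weights: a softmax over scaled dot products of query and key vectors, where row i is word i's attention distribution over every word in the sequence. Dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 16                         # e.g. the tokens of "the cat sat down"
Q = rng.standard_normal((seq_len, d))      # query vectors (learned projections)
K = rng.standard_normal((seq_len, d))      # key vectors

scores = Q @ K.T / np.sqrt(d)              # relevance of each word to each other word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

# weights[i, j] = how strongly word i attends to word j
```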
What are the two distinct parts of the transformer architecture?
Encoder and decoder
Tokenize
Convert words into numbers (token IDs), where each number represents a position in a dictionary of all the possible words the model can work with
Embedding Layer
A trainable vector space where each token is represented as a high-dimensional vector that encodes its meaning and context
What was the embedding vector size in Attention Is All You Need?
512 dimensions
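A minimal sketch of tokenization followed by an embedding layer: each token ID indexes a row of a trainable matrix, giving one 512-dimensional vector per token (the dimension used in Attention Is All You Need). The tiny vocabulary here is purely illustrative.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}   # toy dictionary
d_model = 512

rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((len(vocab), d_model))  # trainable weights

# Tokenize: map each word to its ID, falling back to <unk> for unknown words.
token_ids = [vocab.get(w, vocab["<unk>"]) for w in "the cat sat".split()]
embeddings = embedding_matrix[token_ids]   # shape (3, 512): one vector per token
```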
Positional encoding
Encodes each word's position in the sequence, preserving word-order information even though tokens are processed in parallel
What is passed into the encoder/decoder?
The token embeddings summed with their positional encodings
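A minimal sketch of the sinusoidal positional encoding from Attention Is All You Need, added to the token embeddings before they enter the encoder. Dimensions are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # word positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return pe

seq_len, d_model = 3, 512
embeddings = np.random.default_rng(0).standard_normal((seq_len, d_model))
encoder_input = embeddings + positional_encoding(seq_len, d_model)  # what's passed in
```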