Machine Learning FAQ
It is always a pleasure to engage in discussions about machine learning. Below, I have collected some of the most frequently asked questions that I have answered via email or on social media platforms, in the hope that they are useful to others!
The only thing to do with good advice is to pass it on. It is never of any use to oneself.
— Oscar Wilde
General Questions About Machine Learning and Data Science
- What is data-centric AI, how does it compare to the conventional modeling paradigm, and how do we decide it’s the right fit for a project?
- What are machine learning and data science?
- Why do you and other people sometimes implement machine learning algorithms from scratch?
- Which learning path/discipline in data science should I focus on?
- At what point should one start contributing to open source?
- How important do you think having a mentor is to the learning process?
- Where are the best online communities centered around data science/machine learning or python?
- How would you explain machine learning to a software engineer?
- What would your curriculum for a machine learning beginner look like?
- What is the Definition of Data Science?
- How do Data Scientists perform model selection? Is it different from Kaggle?
- What are the main differences between statistical modeling and machine learning?
Questions About the Machine Learning Field
- What are the different approaches for dealing with limited labeled data in supervised machine learning settings?
- How are Artificial Intelligence and Machine Learning related?
- What are some real-world examples of applications of machine learning in the field?
- What are the different fields of study in data mining?
- What are the differences in research nature between the two fields, machine learning and data mining?
- How do I know if the problem is solvable through machine learning?
- What are the origins of machine learning?
- How was classification, as a learning machine, developed?
- Which machine learning algorithms can be considered as among the best?
- What are the broad categories of classifiers?
- What is the difference between a classifier and a model?
- What is the difference between a parametric learning algorithm and a nonparametric learning algorithm?
- What is the difference between a cost function and a loss function in machine learning?
Questions about Machine Learning Concepts and Statistics
Activation Functions
- What are the common activation functions for neural networks?
- Why is the ReLU function not differentiable at x=0?
- What is the derivative of the logistic sigmoid function?
Cost/Loss Functions and Optimization
- Consider Poisson regression and ordinal regression; when do we use which over the other?
- Is the cross-entropy loss a proper metric?
- Is the squared error loss a proper metric?
- What is the derivative of the mean squared error?
- How do we regularize generalized linear models?
- What is the difference between likelihood and probability?
- What are gradient descent and stochastic gradient descent?
- How to compute gradients with backpropagation for arbitrary loss and activation functions?
- Fitting a model via closed-form equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning – what is the difference?
- How do you derive the Gradient Descent rule for Linear Regression and Adaline?
- Why are there so many ways to compute the Cross Entropy Loss in PyTorch and how do they differ?
- How is stochastic gradient descent implemented in the context of machine learning and deep learning?
Deployment and Production
Regression Analysis
- What is the difference between Pearson R and Simple Linear Regression?
- What is the difference between covariance and correlation?
Tree-based Models
- How does the random forest model work? How is it different from bagging and boosting in ensemble models?
- What are the disadvantages of using the classic decision tree algorithm for a large dataset?
- Why are implementations of decision tree algorithms usually binary, and what are the advantages of the different impurity metrics?
- Why are we growing decision trees via entropy instead of the classification error?
- When can a random forest perform terribly?
- Does random forest select a subset of features for every tree or every node?
Model Evaluation
- How do you compare supervised learning algorithms efficiency- and accuracy-wise?
- What is overfitting?
- How can I avoid overfitting?
- Is it always better to have the largest possible number of folds when performing cross validation?
- When training an SVM classifier, is it better to have a large or small number of support vectors?
- How do I evaluate a model?
- What is the best validation metric for multi-class classification?
- What factors should I consider when choosing a predictive model technique?
- What are the best toy datasets to help visualize and understand classifier behavior?
- How do I select SVM kernels?
- Interlude: Comparing and Computing Performance Metrics in Cross-Validation – Imbalanced Class Problems and 3 Different Ways to Compute the F1 Score
Logistic Regression
- What is the relationship between the negative log-likelihood and logistic loss?
- What is Softmax regression and how is it related to Logistic regression?
- Why is logistic regression considered a linear model?
- What is the probabilistic interpretation of regularized logistic regression?
- Does regularization in logistic regression always result in a better fit and better generalization?
- What is the major difference between naive Bayes and logistic regression?
- What exactly is the “softmax and the multinomial logistic loss” in the context of machine learning?
- What is the relation between Logistic Regression and Neural Networks and when to use which?
- Logistic Regression: Why sigmoid function?
- Is there an analytical solution to Logistic Regression similar to the Normal Equation for Linear Regression?
Neural Networks and Deep Learning
- In deep learning, we often use the terms embedding vectors, representations, and latent space. What do these concepts have in common, and how do they differ?
- What is self-supervised learning and when is it useful?
- What is few-shot learning? And how does it differ from the conventional training procedure for supervised learning?
- What is the lottery ticket hypothesis, and if it holds true, how can it be useful in practice?
- What are some of the common ways to reduce overfitting in neural networks through the use of altered or additional data?
- What are some of the common ways to reduce overfitting in neural networks through model or training loop modifications?
- What does a training loop in PyTorch look like?
- What is the difference between deep learning and conventional machine learning?
- Can you give a visual explanation of the backpropagation algorithm for neural networks?
- Why did it take so long for deep networks to be invented?
- What are some good books/papers for learning deep learning?
- Why are there so many deep learning libraries?
- Why do some people hate neural networks/deep learning?
- How can I know if Deep Learning works better for a specific problem than SVM or random forest?
- What is wrong when my neural network’s error increases?
- How do I debug an artificial neural network algorithm?
- What is the difference between a Perceptron, Adaline, and neural network model?
- What is the basic idea behind the dropout technique?
- Is dropout applied before or after the non-linear activation function?
- Is the logistic sigmoid function just a rescaled version of the hyperbolic tangent (tanh) function?
Convolutional Neural Networks
- Can Fully Connected Layers be Replaced by Convolutional Layers?
- What are the first successful convolutional neural networks trained on a GPU?
Other Algorithms for Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Ensemble Methods
Preprocessing, Feature Selection, and Feature Extraction
- Why do we need to re-use training parameters to transform test data?
- What are the different dimensionality reduction methods in machine learning?
- What is the difference between LDA and PCA for dimensionality reduction?
- When should I apply data normalization/standardization?
- Does mean centering or feature scaling affect a Principal Component Analysis?
- How do you attack a machine learning problem with a large number of features?
- What are some common approaches for dealing with missing data?
- What is the difference between filter, wrapper, and embedded methods for feature selection?
- Should data preparation/pre-processing step be considered one part of feature engineering? Why or why not?
- Is a bag of words feature representation for text classification considered as a sparse matrix?
Naive Bayes
- Why is the Naive Bayes Classifier naive?
- What is the decision boundary for Naive Bayes?
- Is it possible to mix different variable types in Naive Bayes, for example, binary and continuous features?
Other
- What is Euclidean distance in terms of machine learning?
- When should one use median, as opposed to the mean or average?
Programming Languages and Libraries for Data Science and Machine Learning
- Is R used extensively today in data science?
- What is the main difference between TensorFlow and scikit-learn?
Large Language Models (LLMs)
Foundations
- What is a large language model (LLM), and how is it different from earlier language models?
- What are the main stages of building a large language model (LLM) from scratch?
- How does tokenization work, and why do LLMs usually rely on subword tokenizers such as BPE?
- Why can an embedding layer be interpreted as a linear layer applied to one-hot encoded tokens?
- How should someone progress from a simple GPT implementation to a modern production-style LLM stack?
Pretraining and Generation
- How does next-token prediction train a large language model?
- How are input-target training examples constructed for LLM pretraining?
- What does pretraining on unlabeled text actually teach an LLM?
- How does autoregressive text generation work at inference time?
- How do temperature, top-k, and top-p sampling differ?
- Why do LLMs sometimes repeat themselves or get stuck in loops during generation?
- What is perplexity, and what does it actually tell us about an LLM?
- Why can a model have low training loss but still generate poor text?
- Why is inference sequential while training is much more parallel?
Attention, Transformers, and Context
- What is self-attention, and why is it the core mechanism behind modern LLMs?
- What is causal attention, and why can GPT-style models not look at future tokens?
- Why do transformer-based LLMs use multi-head attention instead of a single attention mechanism?
- What role does positional information play in a transformer-based LLM?
- What are the main building blocks of a GPT-style model?
- What is a KV cache, and why does it make LLM inference faster?
- Why does context length matter so much in LLM training and inference?
- What are the tradeoffs between short-context and long-context LLMs?
- Why is the KV cache such a big memory bottleneck at long context lengths?
Finetuning and Adaptation
- What is the difference between pretraining, finetuning, and instruction finetuning?
- How can a pretrained LLM be finetuned for text classification?
- How does instruction finetuning make a base model more useful in practice?
- What is LoRA, and when is parameter-efficient finetuning preferable to full finetuning?
- Why is full finetuning so expensive compared with LoRA?
- How do rank and alpha affect LoRA behavior in practice?
- Where should LoRA adapters be inserted in an LLM for the biggest impact?
- What is the difference between a base model, an instruct model, and a reasoning model?
- Why can a pretrained model answer some questions but still be bad at following instructions?
- What are good ways to build an instruction dataset from scratch?
- Why do prompt templates matter so much during instruction finetuning?
- When should prompt tokens be masked out of the loss during instruction finetuning?
- How do instruct tuning, tool use, and reasoning-style training differ?
Alignment and Evaluation
- What is Direct Preference Optimization (DPO), and how does it differ from supervised finetuning?
- What does it mean to align an LLM?
- How is RLHF different from DPO at a high level?
- Why can preference tuning improve style even when supervised finetuning already works?
- Why is evaluating LLM outputs difficult, and what are common ways to evaluate them?
- What makes LLM evaluation harder than classification evaluation?
- When should you use exact-match metrics versus LLM-as-a-judge evaluation?
Architectures and Model Families
- What is grouped-query attention (GQA), and why do many modern LLMs use it?
- What is mixture-of-experts (MoE), and how does it differ from a dense LLM?
- What is sliding-window attention, and when is it useful?
- How do architectures such as GPT, Llama, Qwen, and Gemma differ at a high level?
- Why do many modern LLMs use RMSNorm instead of LayerNorm?
- What is RoPE, and why did many models move away from learned absolute positional embeddings?
- What is SwiGLU, and why is it common in modern LLM feed-forward layers?
- Why do some LLMs remove bias terms from linear layers?
- How do modern open models balance quality, speed, and memory?
- What architectural changes turned GPT-style models into Llama-style models?
- When is GQA enough, and when do you also want sliding-window attention?
- Why do MoE models have huge parameter counts but lower active compute per token?
- What is the difference between dense Qwen variants and MoE-style variants?
Efficiency and Deployment
- What are the most common practical bottlenecks when training or running an LLM on limited hardware?
- How does batching affect LLM training speed and memory use?
- Why does mixed precision such as bfloat16 help so much in practice?
- What is FlashAttention, and why did it matter so much for LLM training speed?
- What does torch.compile actually help with in LLM workloads?
- Why can model loading require much more memory than expected?
- What is memory-mapped weight loading, and when is it useful?
- What are the main reasons an LLM project fails on consumer hardware?
- What are good first optimizations before moving from one GPU to multi-GPU training?