100 Deep Learning Interview Questions
Terms in this set (100)
1
What is deep learning primarily based on?
Multi-layer neural networks
2
What is a neuron in a neural network?
A parametric function with learnable weights and bias
3
What does an activation function do?
Introduces non-linearity into the network
4
What is a commonly used activation function?
ReLU
5
What does ReLU stand for?
Rectified Linear Unit
6
Why are non-linear activation functions necessary?
To allow the network to model complex functions
7
What is the typical output activation for binary classification?
Sigmoid
8
What is the typical output activation for multi-class classification?
Softmax
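Cards 3–8 name ReLU, sigmoid, and softmax; here is a minimal sketch of all three (NumPy is an assumption, since the set names no library):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes logits into (0, 1); typical output for binary classification.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Normalized exponentials over the last axis; typical output for
    # multi-class classification. Subtracting the max improves stability.
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / np.sum(z, axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(sigmoid(logits))  # elementwise probabilities in (0, 1)
print(softmax(logits))  # non-negative, sums to 1
```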
9
What is backpropagation?
A gradient-based algorithm to update weights
10
Which optimization algorithm is an extension of SGD with adaptive learning rates and momentum?
Adam
11
What is the main purpose of the loss function?
To measure the error between prediction and target
12
Which loss is commonly used for multi-class classification?
Categorical cross-entropy
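Cards 9–12 fit together in a single training step. A minimal PyTorch sketch (PyTorch and all shapes are illustrative assumptions) of backpropagation with categorical cross-entropy and Adam:

```python
import torch
import torch.nn as nn

# Toy setup: 10 input features, 3 classes; sizes are made up for illustration.
model = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss()             # categorical cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 10)                     # a mini-batch of 32 samples
y = torch.randint(0, 3, (32,))              # integer class targets

optimizer.zero_grad()                       # clear gradients from the last step
loss = loss_fn(model(x), y)                 # error between prediction and target
loss.backward()                             # backpropagation: compute gradients
optimizer.step()                            # Adam: adaptive, momentum-based update
```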
13
What is a feedforward neural network?
A network where information flows only from input to output
14
In deep learning, what does 'depth' usually refer to?
Number of layers in the network
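A sketch of cards 13–14 in code (PyTorch assumed, sizes arbitrary): a feedforward network whose "depth" is its stack of layers.

```python
import torch.nn as nn

# Information flows strictly from input to output; depth = three Linear layers.
mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
```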
15
What is weight initialization?
Setting initial values for model parameters
16
Which initialization method is commonly used with ReLU networks?
He initialization
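He initialization (cards 15–16) in PyTorch, where it is exposed as "Kaiming" initialization; the layer sizes are illustrative:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# He (Kaiming) initialization: weight variance scaled for ReLU non-linearities,
# keeping activation magnitudes roughly stable across layers.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)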
17
What problem do vanishing gradients cause?
Early layers learn very slowly
18
Which activation helps mitigate vanishing gradients?
ReLU
19
What is a mini-batch in training?
A subset of training samples used per update
20
What characterizes batch gradient descent?
Uses the entire dataset per update
21
What is an epoch?
One full pass through the training dataset
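Cards 19–21 map directly onto a standard training loop; a sketch with a toy PyTorch dataset (all sizes are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1,000 samples split into mini-batches of 32.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 3, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(3):                 # each epoch = one full pass over the data
    for x_batch, y_batch in loader:    # each iteration = one mini-batch update
        pass                           # forward/backward/step would go here
```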
22
What does dropout do?
Randomly disables neurons during training
23
Why is dropout used?
To prevent overfitting
24
What does batch normalization do?
Normalizes activations within a mini-batch
25
Which benefit is associated with batch normalization?
Allows higher learning rates and stabilizes training
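Dropout and batch normalization (cards 22–25) side by side in one illustrative PyTorch block (sizes arbitrary); note how train/eval mode changes their behavior:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # normalizes activations within the mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly disables 50% of activations during training
    nn.Linear(128, 10),
)

model.train()              # dropout active, batch norm uses batch statistics
out = model(torch.randn(32, 64))
model.eval()               # dropout off, batch norm uses running statistics
```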
26
What is a convolution in CNNs?
A dot product between a filter and a local region of the input
27
Which type of data are CNNs especially good at?
Grid-like data such as images
28
What is a filter (kernel) in a CNN?
A learned weight matrix applied locally across the input
29
What does pooling in CNNs achieve?
Reduces spatial dimensions and provides some translation invariance
30
Which is a common pooling operation?
Max pooling
31
What does 'stride' mean in convolution?
Step size of the filter across input
32
What is padding in convolutional layers?
Adding zeros around input to control output size
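Cards 26–32 can be verified numerically: a PyTorch sketch (shapes are the point, channel counts are arbitrary) showing how stride and padding control the output size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)

# 3x3 kernels; padding=1 preserves 32x32, stride=2 halves it.
same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

print(same(x).shape)        # torch.Size([1, 16, 32, 32])
print(down(x).shape)        # torch.Size([1, 16, 16, 16])
print(pool(same(x)).shape)  # torch.Size([1, 16, 16, 16])
```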
33
What is a fully connected (dense) layer?
Layer where each neuron connects to all neurons of the previous layer
34
Which architecture is classically used for image classification?
ResNet
35
What is the main idea behind residual connections in ResNet?
Add input of a block to its output to ease optimization
36
Why were residual networks introduced?
To solve vanishing gradient and degradation in very deep networks
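A simplified residual block in the spirit of cards 35–36 (PyTorch assumed; this is a sketch, not the exact ResNet block):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = F(x) + x (identity shortcut)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # add the block's input to its output

x = torch.randn(1, 64, 8, 8)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 8, 8])
```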
37
What is a recurrent neural network (RNN) designed for?
Sequential data with temporal dependencies
38
What is the main issue with vanilla RNNs?
They suffer from vanishing and exploding gradients
39
Which architectures were introduced to mitigate RNN gradient issues?
LSTM and GRU
40
What do LSTM cells introduce to remember information?
Gates controlling information flow
41
Which gate in LSTM controls how much new information to store?
Input gate
42
Which gate in LSTM controls how much information to discard from memory?
Forget gate
43
What is a GRU?
Gated Recurrent Unit
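Minimal LSTM usage for cards 37–43 (PyTorch assumed; sequence length, feature size, and hidden size are made up). The input, forget, and output gates live inside the cell and are managed for you:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

x = torch.randn(4, 20, 8)          # (batch, time, features)
output, (h_n, c_n) = lstm(x)       # gates control what the cell stores/discards

print(output.shape)                # torch.Size([4, 20, 32]) - all time steps
print(h_n.shape, c_n.shape)        # final hidden and cell states: [1, 4, 32]
```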
44
What is a sequence-to-sequence (seq2seq) model?
Model that maps one sequence to another sequence
45
In a seq2seq model, what do encoder and decoder do?
Encoder compresses input sequence, decoder generates output sequence
46
What is attention in deep learning (at a high level)?
A mechanism to focus on important parts of the input
47
What is the key idea of self-attention?
Comparing each token with all other tokens in the sequence
48
What are queries, keys, and values used for in attention?
To compute attention scores and weighted combinations of representations
49
Why are positional encodings needed in self-attention models?
To encode sequence order information
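Cards 46–49 in one function: single-head scaled dot-product self-attention (PyTorch assumed; projection sizes are arbitrary, and the positional encodings of card 49 would be added to `x` beforehand but are omitted here):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to Q, K, V
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # compare every token pair
    weights = F.softmax(scores, dim=-1)          # attention distribution
    return weights @ v                           # weighted combination of values

# Illustrative sizes: 5 tokens with 16-dim embeddings.
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([5, 16])
```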
50
What is the main difference between CNNs and RNNs?
CNNs process local spatial patterns, RNNs model temporal dependencies
51
What is weight sharing in CNNs?
Same kernel applied across different spatial locations
52
What is transfer learning?
Using a model trained on one task as starting point for another
53
Which scenario commonly uses transfer learning?
Limited labeled data for the new task, with a large pretrained model available
54
What is fine-tuning in deep learning?
Starting from pretrained weights and continuing training on a new task
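A transfer-learning sketch for cards 52–54, assuming recent torchvision versions and a hypothetical 5-class target task:

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (torchvision's ResNet-18 here).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained backbone so only the new head trains at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the new task (class count is made up).
model.fc = nn.Linear(model.fc.in_features, 5)
# For full fine-tuning, unfreeze the backbone and continue training
# with a small learning rate.
```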
55
What is the main purpose of data augmentation?
Generate additional realistic training samples from existing ones
56
Which is a common image data augmentation technique?
Random cropping and flipping
57
What is label smoothing?
Assigning soft target probabilities instead of hard 0/1 labels
58
What is gradient clipping?
Limiting the magnitude of gradients to prevent exploding gradients
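Label smoothing and gradient clipping (cards 57–58) in one PyTorch training step (model and shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
# Label smoothing: soft targets like 0.9/0.05/0.05 instead of hard 1/0/0.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
optimizer.zero_grad()
loss_fn(model(x), y).backward()
# Gradient clipping: rescale gradients so their global norm stays <= 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```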
59
Which optimization algorithm uses exponentially decaying averages of past gradients and squared gradients?
Adam
60
What is the role of the learning rate?
Controls the update step size in parameter space
61
What can happen if the learning rate is too high?
Model may diverge or oscillate
62
What can happen if the learning rate is too low?
Training may be very slow or get stuck in poor local minima
63
What is early stopping?
Stopping training before convergence based on validation performance
64
What is a validation set used for?
Tuning hyperparameters and monitoring overfitting
65
Why is batch size important?
It affects gradient noise, memory usage, and training stability
66
What is a known risk of large batch sizes?
Can lead to sharp minima and sometimes worse generalization
67
What is overparameterization in deep learning?
Model has many more parameters than training samples
68
Which regularization method penalizes large weights directly in the loss?
Weight decay (L2 regularization)
69
What is the difference between L1 and L2 regularization?
L1 encourages sparsity, L2 encourages small weights
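Cards 68–69 as explicit penalty terms (PyTorch assumed; the coefficients are illustrative):

```python
import torch

def regularizers(params, l1=1e-5, l2=1e-4):
    """Penalty terms to add to the loss; coefficients are illustrative."""
    params = list(params)                          # materialize to iterate twice
    l1_term = sum(p.abs().sum() for p in params)   # L1: pushes weights to zero
    l2_term = sum((p ** 2).sum() for p in params)  # L2: keeps weights small
    return l1 * l1_term + l2 * l2_term

model = torch.nn.Linear(10, 3)
# L2 is often applied via the optimizer's weight_decay instead:
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
penalty = regularizers(model.parameters())         # or add this to the loss
```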
70
What is a bottleneck layer in an autoencoder?
The narrow central layer representing compressed features
71
What is an autoencoder trained to do?
Reconstruct its input at the output
72
What is a denoising autoencoder?
Learns to reconstruct clean input from a corrupted version
73
What is a variational autoencoder (VAE) mainly used for?
Learning a continuous latent distribution to generate samples
74
Which term appears in the VAE loss function?
Reconstruction loss plus KL divergence regularization
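A plain autoencoder covering cards 70–74 (PyTorch assumed; the 8-dim bottleneck and layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain autoencoder: reconstruct the input through a narrow bottleneck."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                                     nn.Linear(32, 8))    # bottleneck layer
        self.decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                                     nn.Linear(32, 64))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 64)
loss = nn.functional.mse_loss(model(x), x)  # reconstruct input at the output
# A denoising AE would feed model(x + noise) and still compare against x;
# a VAE would add a KL divergence term on the latent distribution.
```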
75
What is a skip connection?
Connection that skips one or more layers and adds earlier activations to later ones
76
What does 'exploding gradients' refer to?
Gradients becoming extremely large
77
Which method helps with exploding gradients?
Gradient clipping
78
What is the purpose of using a learning rate schedule?
To adjust learning rate over time for better convergence
79
What is a common learning rate schedule?
Exponential decay
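Exponential decay (cards 78–79) as a PyTorch scheduler (optimizer, gamma, and epoch count are illustrative):

```python
import torch

model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Exponential decay: multiply the learning rate by gamma every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(5):
    # ... one epoch of training steps would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # lr decays: 0.09, 0.081, 0.0729, ...
```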
80
What is a receptive field in CNNs?
Region of the input that affects a particular neuron
81
How can you increase the receptive field in CNNs?
Use larger kernels, more layers, or dilation
82
What is dilation in convolution?
Inserting gaps between kernel elements so the filter skips input positions, enlarging the receptive field
83
What is channel-wise (depthwise) convolution?
Convolution applied separately per input channel
84
Why are depthwise separable convolutions used?
To reduce parameters and computational cost
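The savings claimed in cards 83–84 can be counted directly (PyTorch assumed; 3x3 kernels with 32 in / 64 out channels are an arbitrary example):

```python
import torch.nn as nn

# Depthwise separable convolution = per-channel depthwise conv + 1x1 pointwise.
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
pointwise = nn.Conv2d(32, 64, kernel_size=1)
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(depthwise) + n_params(pointwise))  # 2,432 parameters
print(n_params(standard))                         # 18,496 parameters
```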
85
What is a hyperparameter in deep learning?
A configuration value set manually before training, such as the learning rate or number of layers
86
What is an example of a hyperparameter?
Learning rate
87
What is the difference between training loss and validation loss?
Training loss is computed on training data, validation loss on held-out validation data
88
What does it typically indicate if validation loss starts increasing while training loss continues decreasing?
Overfitting
89
Which metric is commonly used for multi-class classification evaluation in deep networks?
Accuracy
90
What is top-k accuracy?
Fraction of samples whose correct class is within the model’s top k predictions
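Top-k accuracy (card 90) as a short function (PyTorch assumed; the toy sizes are made up):

```python
import torch

def top_k_accuracy(logits, targets, k=5):
    """Fraction of samples whose true class is among the k highest logits."""
    topk = logits.topk(k, dim=-1).indices          # (batch, k) predicted classes
    correct = (topk == targets.unsqueeze(-1)).any(dim=-1)
    return correct.float().mean().item()

logits = torch.randn(100, 10)                      # toy: 100 samples, 10 classes
targets = torch.randint(0, 10, (100,))
print(top_k_accuracy(logits, targets, k=5))        # ~0.5 for random logits
```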
91
Why might you use mixed-precision training?
To reduce memory usage and speed up training using lower precision arithmetic
92
Which numeric formats are typically involved in mixed-precision training?
Float16/BFloat16 and Float32
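A mixed-precision training step for cards 91–92, assuming a CUDA GPU and the `torch.cuda.amp` API (newer PyTorch versions expose the same tools under `torch.amp`):

```python
import torch

model = torch.nn.Linear(10, 3).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 underflow

x = torch.randn(32, 10, device='cuda')
y = torch.randint(0, 3, (32,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():        # run the forward pass in lower precision
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()          # scaled backward pass
scaler.step(optimizer)                 # unscales gradients, then updates
scaler.update()
```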
93
What is gradient checkpointing used for?
Trading extra computation for reduced memory usage
94
What is the main challenge of training very deep networks?
Vanishing/exploding gradients and optimization difficulties
95
What does 'end-to-end training' mean?
Training the entire pipeline jointly from raw inputs to final outputs
96
What is a common challenge when deploying deep learning models?
Fitting models into resource constraints and achieving low latency
97
Why might you prune a deep neural network?
To reduce parameters and speed up inference
98
What is knowledge distillation in deep learning?
Using a large teacher model to train a smaller student model
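A standard distillation loss for card 98 (PyTorch assumed; temperature, alpha, and the toy logits are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft teacher targets."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * temperature ** 2                 # standard gradient-scale correction
    return alpha * hard + (1 - alpha) * soft

# Toy logits for a batch of 8 samples and 3 classes.
s, t = torch.randn(8, 3), torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y))
```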
99
How does deep learning differ from classic ML?
Deep learning models automatically learn hierarchical features from raw data
100
Which factor is usually most critical for successful deep learning?
Sufficient data, compute, and good optimization/hyperparameters