
100 Deep Learning Interview Questions

100 terms · By guiem

Terms in this set

1

What is deep learning primarily based on?

Multi-layer neural networks

2

What is a neuron in a neural network?

A parametric function with learnable weights and bias

3

What does an activation function do?

Introduces non-linearity into the network

4

Which of the following is a commonly used activation function?

ReLU

5

What does ReLU stand for?

Rectified Linear Unit
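
As a quick illustration, ReLU is just an elementwise max(0, x); a minimal NumPy sketch:

import numpy as np

def relu(x):
    # Elementwise max(0, x): negatives are zeroed, positives pass through unchanged.
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]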

6

Why are non-linear activation functions necessary?

To allow the network to model complex functions

7

What is the typical output activation for binary classification?

Sigmoid

8

What is the typical output activation for multi-class classification?

Softmax
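
For concreteness, a minimal NumPy sketch of both output activations (sigmoid for a single logit, softmax for a vector of logits):

import numpy as np

def sigmoid(z):
    # Squashes a logit into (0, 1): a probability for the positive class.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, then normalize to a distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1.0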

9

What is backpropagation?

A gradient-based algorithm to update weights
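
A toy sketch of the idea (an assumed scalar example, not a full implementation): one neuron y_hat = w*x + b with squared loss, where the chain rule gives the gradients used for the weight update.

# toy backprop: single neuron, squared loss
x, y = 2.0, 5.0
w, b, lr = 0.5, 0.0, 0.1

for _ in range(20):
    y_hat = w * x + b            # forward pass
    dL_dy = 2 * (y_hat - y)      # dL/dy_hat for loss (y_hat - y)^2
    w -= lr * dL_dy * x          # chain rule: dL/dw = dL/dy_hat * x
    b -= lr * dL_dy              # chain rule: dL/db = dL/dy_hat * 1

print(w * x + b)  # approaches the target 5.0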

10

Which optimization algorithm is an extension of SGD with adaptive learning rates and momentum?

Adam

11

What is the main purpose of the loss function?

To measure the error between prediction and target

12

Which loss is commonly used for multi-class classification?

Categorical cross-entropy
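
A minimal NumPy sketch of the loss for a single sample:

import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot target, y_pred: predicted class probabilities.
    # The eps guards against log(0).
    return -np.sum(y_true * np.log(y_pred + eps))

print(categorical_cross_entropy(np.array([0., 1., 0.]),
                                np.array([0.1, 0.8, 0.1])))  # ~0.223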

13

What is a feedforward neural network?

A network where information flows only from input to output

14

In deep learning, what does 'depth' usually refer to?

Number of layers in the network

15

What is weight initialization?

Setting initial values for model parameters

16

Which initialization method is commonly used with ReLU networks?

He initialization
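
A minimal NumPy sketch; the key point is the standard deviation sqrt(2/fan_in):

import numpy as np

def he_init(fan_in, fan_out, seed=0):
    # Variance 2/fan_in keeps the activation scale roughly constant across
    # layers when about half the ReLU units are active.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_init(256, 128)
print(W.std())  # close to sqrt(2/256) ~ 0.088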

17

What problem do vanishing gradients cause?

Early layers learn very slowly

18

Which activation helps mitigate vanishing gradients?

ReLU

19

What is a mini-batch in training?

A subset of training samples used per update

20

Which of the following best describes batch gradient descent?

Uses the entire dataset per update

21

What is an epoch?

One full pass through the training dataset

22

What does dropout do?

Randomly disables neurons during training

23

Why is dropout used?

To prevent overfitting
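
A minimal NumPy sketch of inverted dropout, the variant most frameworks use:

import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    # Zero each unit with probability p during training and rescale the
    # survivors by 1/(1-p) so expected activations match at test time.
    if not training:
        return x
    mask = np.random.default_rng(seed).random(x.shape) >= p
    return x * mask / (1.0 - p)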

24

What does batch normalization do?

Normalizes activations within a mini-batch

25

Which benefit is associated with batch normalization?

Allows higher learning rates and stabilizes training
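
A minimal NumPy sketch of the training-time forward pass (inference instead uses running statistics, omitted here):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch (axis 0), then apply the
    # learned scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta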

26

What is a convolution in CNNs?

A dot product between a filter and a local region of the input
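
A naive NumPy sketch (stride 1, no padding); note that, as in most frameworks, this is technically cross-correlation since the kernel is not flipped:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and take the dot product with
    # each local patch.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out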

27

Which type of data are CNNs especially good at?

Grid-like data such as images

28

What is a filter (kernel) in a CNN?

A learned weight matrix applied locally across the input

29

What does pooling in CNNs achieve?

Reduces spatial dimension and provides translation invariance

30

Which is a common pooling operation?

Max pooling

31

What does 'stride' mean in convolution?

Step size of the filter across input

32

What is padding in convolutional layers?

Adding zeros around input to control output size
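
Stride and padding together determine the output size via the standard formula, sketched here:

def conv_output_size(n, k, p=0, s=1):
    # floor((n + 2p - k) / s) + 1 for input size n, kernel k, padding p, stride s.
    return (n + 2 * p - k) // s + 1

print(conv_output_size(32, 3, p=1, s=1))  # 32: "same" padding for a 3x3 kernel
print(conv_output_size(32, 3, p=0, s=2))  # 15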

33

What is a fully connected (dense) layer?

Layer where each neuron connects to all neurons of the previous layer

34

Which architecture is classically used for image classification?

ResNet

35

What is the main idea behind residual connections in ResNet?

Add input of a block to its output to ease optimization
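
A minimal PyTorch-style sketch (simplified relative to the actual ResNet block, which also uses batch normalization):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection: add block input to its output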

36

Why were residual networks introduced?

To solve vanishing gradient and degradation in very deep networks

37

What is a recurrent neural network (RNN) designed for?

Sequential data with temporal dependencies

38

What is the main issue with vanilla RNNs?

They suffer from vanishing and exploding gradients

39

Which architectures were introduced to mitigate RNN gradient issues?

LSTM and GRU

40

What do LSTM cells introduce to remember information?

Gates controlling information flow

41

Which gate in LSTM controls how much new information to store?

Input gate

42

Which gate in LSTM controls how much information to discard from memory?

Forget gate
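
A minimal NumPy sketch of one LSTM step showing all the gates together (parameters for the four gates are stacked in W, U, b and split evenly):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate memory content
    c_new = f * c + i * g                         # forget old, store new
    h_new = o * np.tanh(c_new)                    # expose gated memory as output
    return h_new, c_new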

43

What is a GRU?

Gated Recurrent Unit

44

What is a sequence-to-sequence (seq2seq) model?

Model that maps one sequence to another sequence

45

In a seq2seq model, what do encoder and decoder do?

The encoder compresses the input sequence into a representation; the decoder generates the output sequence from it

46

What is attention in deep learning (at a high level)?

A mechanism to focus on important parts of the input

47

Which of the following is a key idea of self-attention?

Comparing each token with all other tokens in the sequence

48

What are queries, keys, and values used for in attention?

To compute attention scores and weighted combinations of representations
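
A minimal NumPy sketch of scaled dot-product attention, the core computation behind these terms:

import numpy as np

def attention(Q, K, V):
    # Scores compare each query with every key; a row-wise softmax turns
    # them into weights that mix the value vectors. Scaling by sqrt(d_k)
    # keeps the logits in a reasonable range.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V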

49

Why are positional encodings needed in self-attention models?

To encode sequence order information

50

What is the main difference between CNNs and RNNs?

CNNs exploit local spatial patterns, while RNNs model temporal dependencies

51

What is weight sharing in CNNs?

Same kernel applied across different spatial locations

52

What is transfer learning?

Using a model trained on one task as starting point for another

53

Which scenario commonly uses transfer learning?

Limited labeled data for new task and a large pretrained model

54

What is fine-tuning in deep learning?

Starting from pretrained weights and continuing training on a new task

55

What is the main purpose of data augmentation?

Generate additional realistic training samples from existing ones

56

Which is a common image data augmentation technique?

Random cropping and flipping

57

What is label smoothing?

Assigning soft target probabilities instead of hard 0/1 labels
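
A minimal NumPy sketch: with smoothing factor eps, a fraction eps of the probability mass is spread uniformly over all K classes:

import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k

print(smooth_labels(np.array([0., 1., 0.])))  # [0.033 0.933 0.033]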

58

What is gradient clipping?

Limiting the magnitude of gradients to prevent exploding gradients
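
A minimal NumPy sketch of clipping by global norm, one common variant:

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients jointly if their combined L2 norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]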

59

Which optimization algorithm uses exponentially decaying averages of past gradients and squared gradients?

Adam
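
A minimal NumPy sketch of a single Adam update (m and v are the decaying averages; t is the step count starting at 1):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # decaying average of gradients
    v = b2 * v + (1 - b2) * g * g        # decaying average of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction (averages start at zero)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v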

60

What is the role of the learning rate?

Controls the update step size in parameter space

61

What can happen if the learning rate is too high?

Model may diverge or oscillate

62

What can happen if the learning rate is too low?

Training may be very slow or get stuck in poor local minima

63

What is early stopping?

Stopping training before convergence based on validation performance
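
A minimal sketch of the logic; the validation losses below are made-up example values standing in for a real training loop:

val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59]
best, patience, wait = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, wait = val_loss, 0         # improvement: reset the counter
    else:
        wait += 1                        # no improvement this epoch
        if wait >= patience:
            print(f"early stop at epoch {epoch}, best={best}")
            break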

64

What is a validation set used for?

Tuning hyperparameters and monitoring overfitting

65

Why is batch size important?

It affects gradient noise, memory usage, and training stability

66

Which is true about large batch sizes?

Can lead to sharp minima and sometimes worse generalization

67

What is overparameterization in deep learning?

Model has many more parameters than training samples

68

Which regularization method penalizes large weights directly in the loss?

Weight decay (L2 regularization)

69

What is the difference between L1 and L2 regularization?

L1 encourages sparsity, while L2 encourages small weights
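
A minimal NumPy sketch of the two penalty terms added to the training loss:

import numpy as np

def l1_penalty(w, lam=1e-4):
    # Gradient is lam * sign(w): pushes weights exactly to zero (sparsity).
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=1e-4):
    # Gradient is 2 * lam * w: shrinks all weights toward zero (weight decay).
    return lam * np.sum(w ** 2)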

70

What is a bottleneck layer in an autoencoder?

The narrow central layer representing compressed features

71

What is an autoencoder trained to do?

Reconstruct its input at the output

72

What is a denoising autoencoder?

Learns to reconstruct clean input from a corrupted version

73

What is a variational autoencoder (VAE) mainly used for?

Learning a continuous latent distribution to generate samples

74

Which term appears in the VAE loss function?

Reconstruction loss plus KL divergence regularization
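
A minimal NumPy sketch of the KL term for a Gaussian posterior against a standard normal prior (the total loss adds this to the reconstruction loss):

import numpy as np

def vae_kl(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))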

75

What is a skip connection?

Connection that skips one or more layers and adds earlier activations to later ones

76

What does 'exploding gradients' refer to?

Gradients becoming extremely large

77

Which method helps with exploding gradients?

Gradient clipping

78

What is the purpose of using a learning rate schedule?

To adjust learning rate over time for better convergence

79

Which of the following is a common learning rate schedule?

Exponential decay
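
A minimal sketch of exponential decay:

def exponential_decay(lr0, epoch, gamma=0.95):
    # Multiply the base learning rate by gamma once per epoch.
    return lr0 * gamma ** epoch

print(exponential_decay(0.1, 10))  # ~0.0599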

80

What is a receptive field in CNNs?

Region of the input that affects a particular neuron

81

How can you increase the receptive field in CNNs?

Use larger kernels, more layers, or dilation

82

What is dilation in convolution?

Spacing out kernel elements so the filter skips input positions, enlarging the receptive field

83

What is channel-wise (depthwise) convolution?

Convolution applied separately per input channel

84

Why are depthwise separable convolutions used?

To reduce parameters and computational cost
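
The saving is easy to see by counting weights (biases ignored in this sketch):

def conv_params(c_in, c_out, k):
    # Standard conv: one k x k filter per (input, output) channel pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise (one k x k filter per input channel) plus a pointwise 1x1 conv.
    return c_in * k * k + c_in * c_out

print(conv_params(64, 128, 3))                 # 73728
print(depthwise_separable_params(64, 128, 3))  # 8768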

85

What is a hyperparameter in deep learning?

A setting chosen before training rather than learned, such as the learning rate or number of layers

86

Which of the following is a hyperparameter?

Learning rate

87

What is the difference between training loss and validation loss?

Training loss is computed on training data, validation loss on held-out validation data

88

What does it typically indicate if validation loss starts increasing while training loss continues decreasing?

Overfitting

89

Which metric is commonly used for multi-class classification evaluation in deep networks?

Accuracy

90

What is top-k accuracy?

Fraction of samples for which the correct class is among the model's top k predictions
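
A minimal NumPy sketch:

import numpy as np

def top_k_accuracy(logits, labels, k=5):
    # Correct if the true class is among the k highest-scoring classes.
    topk = np.argsort(logits, axis=1)[:, -k:]
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])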

91

Why might you use mixed-precision training?

To reduce memory usage and speed up training using lower precision arithmetic

92

Which numeric formats are typically involved in mixed-precision training?

Float16/BFloat16 and Float32

93

What is gradient checkpointing used for?

Trading extra computation for reduced memory usage

94

What is the main challenge of training very deep networks?

Vanishing/exploding gradients and optimization difficulties

95

What does 'end-to-end training' mean?

Training the entire pipeline jointly from raw inputs to final outputs

96

What is a common challenge when deploying deep learning models?

Fitting models into resource constraints and achieving low latency

97

Why might you prune a deep neural network?

To reduce parameters and speed up inference

98

What is knowledge distillation in deep learning?

Using a large teacher model to train a smaller student model
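
A minimal PyTorch sketch of the usual soft-target loss (temperature T softens both distributions; the T^2 factor is the conventional scaling):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence pulls the student's softened distribution toward the teacher's.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T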

99

Which of the following is true about deep learning vs. classic ML?

Deep learning models automatically learn hierarchical features from raw data

100

Which factor is usually most critical for successful deep learning?

Sufficient data, compute, and good optimization/hyperparameters