100 Deep Learning Interview Questions
Terms in this set (100)
1
What is deep learning primarily based on?
Multi-layer neural networks
2
What is a neuron in a neural network?
A parametric function with learnable weights and bias
3
What does an activation function do?
Introduces non-linearity into the network
4
What is a commonly used activation function?
ReLU
5
What does ReLU stand for?
Rectified Linear Unit
6
Why are non-linear activation functions necessary?
To allow the network to model complex functions
7
What is the typical output activation for binary classification?
Sigmoid
8
What is the typical output activation for multi-class classification?
Softmax
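Cards 3–8 name ReLU, sigmoid, and softmax; here is a minimal sketch of all three (NumPy is an assumption, since the set names no library):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes logits into (0, 1); typical output for binary classification.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Normalized exponentials over the last axis; typical output for
    # multi-class classification. Subtracting the max improves stability.
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / np.sum(z, axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(sigmoid(logits))  # elementwise probabilities in (0, 1)
print(softmax(logits))  # non-negative, sums to 1
```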
9
What is backpropagation?
A gradient-based algorithm to update weights
10
Which optimization algorithm is an extension of SGD with adaptive learning rates and momentum?
Adam
11
What is the main purpose of the loss function?
To measure the error between prediction and target
12
Which loss is commonly used for multi-class classification?
Categorical cross-entropy
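Cards 9–12 fit together in a single training step. A minimal PyTorch sketch (PyTorch and all shapes are illustrative assumptions) of backpropagation with categorical cross-entropy and Adam:

```python
import torch
import torch.nn as nn

# Toy setup: 10 input features, 3 classes; sizes are made up for illustration.
model = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss()             # categorical cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 10)                     # a mini-batch of 32 samples
y = torch.randint(0, 3, (32,))              # integer class targets

optimizer.zero_grad()                       # clear gradients from the last step
loss = loss_fn(model(x), y)                 # error between prediction and target
loss.backward()                             # backpropagation: compute gradients
optimizer.step()                            # Adam: adaptive, momentum-based update
```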
13
What is a feedforward neural network?
A network where information flows only from input to output
14
In deep learning, what does 'depth' usually refer to?
Number of layers in the network
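A sketch of cards 13–14 in code (PyTorch assumed, sizes arbitrary): a feedforward network whose "depth" is its stack of layers.

```python
import torch.nn as nn

# Information flows strictly from input to output; depth = three Linear layers.
mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
```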
15
What is weight initialization?
Setting initial values for model parameters
16
Which initialization method is commonly used with ReLU networks?
He initialization
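He initialization (cards 15–16) in PyTorch, where it is exposed as "Kaiming" initialization; the layer sizes are illustrative:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# He (Kaiming) initialization: weight variance scaled for ReLU non-linearities,
# keeping activation magnitudes roughly stable across layers.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)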
17
What problem do vanishing gradients cause?
Early layers learn very slowly
18
Which activation helps mitigate vanishing gradients?
ReLU
19
What is a mini-batch in training?
A subset of training samples used per update
20
What characterizes batch gradient descent?
Uses the entire dataset per update
21
What is an epoch?
One full pass through the training dataset
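Cards 19–21 map directly onto a standard training loop; a sketch with a toy PyTorch dataset (all sizes are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1,000 samples split into mini-batches of 32.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 3, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(3):                 # each epoch = one full pass over the data
    for x_batch, y_batch in loader:    # each iteration = one mini-batch update
        pass                           # forward/backward/step would go here
```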
22
What does dropout do?
Randomly disables neurons during training
23
Why is dropout used?
To prevent overfitting
24
What does batch normalization do?
Normalizes activations within a mini-batch
25
Which benefit is associated with batch normalization?
Allows higher learning rates and stabilizes training
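Dropout and batch normalization (cards 22–25) side by side in one illustrative PyTorch block (sizes arbitrary); note how train/eval mode changes their behavior:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # normalizes activations within the mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly disables 50% of activations during training
    nn.Linear(128, 10),
)

model.train()              # dropout active, batch norm uses batch statistics
out = model(torch.randn(32, 64))
model.eval()               # dropout off, batch norm uses running statistics
```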
26
What is a convolution in CNNs?
A dot product between a filter and a local region of the input
27
Which type of data are CNNs especially good at?
Grid-like data such as images
28
What is a filter (kernel) in a CNN?
A learned weight matrix applied locally across the input
29
What does pooling in CNNs achieve?
Reduces spatial dimensions and provides some translation invariance
30
Which is a common pooling operation?
Max pooling
31
What does 'stride' mean in convolution?
Step size of the filter across input
32
What is padding in convolutional layers?
Adding zeros around input to control output size
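Cards 26–32 can be verified numerically: a PyTorch sketch (shapes are the point, channel counts are arbitrary) showing how stride and padding control the output size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)

# 3x3 kernels; padding=1 preserves 32x32, stride=2 halves it.
same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

print(same(x).shape)        # torch.Size([1, 16, 32, 32])
print(down(x).shape)        # torch.Size([1, 16, 16, 16])
print(pool(same(x)).shape)  # torch.Size([1, 16, 16, 16])
```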
33
What is a fully connected (dense) layer?
Layer where each neuron connects to all neurons of the previous layer
34
Which architecture is classically used for image classification?
ResNet
35
What is the main idea behind residual connections in ResNet?
Add input of a block to its output to ease optimization
36
Why were residual networks introduced?
To solve vanishing gradient and degradation in very deep networks
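A simplified residual block in the spirit of cards 35–36 (PyTorch assumed; this is a sketch, not the exact ResNet block):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = F(x) + x (identity shortcut)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # add the block's input to its output

x = torch.randn(1, 64, 8, 8)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 8, 8])
```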
37
What is a recurrent neural network (RNN) designed for?
Sequential data with temporal dependencies
38
What is the main issue with vanilla RNNs?
They suffer from vanishing and exploding gradients
39
Which architectures were introduced to mitigate RNN gradient issues?
LSTM and GRU
40
What do LSTM cells introduce to remember information?
Gates controlling information flow
41
Which gate in LSTM controls how much new information to store?
Input gate
42
Which gate in LSTM controls how much information to discard from memory?
Forget gate
43
What is a GRU?
Gated Recurrent Unit
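Minimal LSTM usage for cards 37–43 (PyTorch assumed; sequence length, feature size, and hidden size are made up). The input, forget, and output gates live inside the cell and are managed for you:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

x = torch.randn(4, 20, 8)          # (batch, time, features)
output, (h_n, c_n) = lstm(x)       # gates control what the cell stores/discards

print(output.shape)                # torch.Size([4, 20, 32]) - all time steps
print(h_n.shape, c_n.shape)        # final hidden and cell states: [1, 4, 32]
```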
44
What is a sequence-to-sequence (seq2seq) model?
Model that maps one sequence to another sequence
45
In a seq2seq model, what do encoder and decoder do?
Encoder compresses input sequence, decoder generates output sequence
46
What is attention in deep learning (at a high level)?
A mechanism to focus on important parts of the input
47
What is the key idea of self-attention?
Comparing each token with all other tokens in the sequence
48
What are queries, keys, and values used for in attention?
To compute attention scores and weighted combinations of representations
49
Why are positional encodings needed in self-attention models?
To encode sequence order information
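Cards 46–49 in one function: single-head scaled dot-product self-attention (PyTorch assumed; projection sizes are arbitrary, and the positional encodings of card 49 would be added to `x` beforehand but are omitted here):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to Q, K, V
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # compare every token pair
    weights = F.softmax(scores, dim=-1)          # attention distribution
    return weights @ v                           # weighted combination of values

# Illustrative sizes: 5 tokens with 16-dim embeddings.
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([5, 16])
```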
50
What is the main difference between CNNs and RNNs?
CNNs process local spatial patterns, RNNs model temporal dependencies
51
What is weight sharing in CNNs?
Same kernel applied across different spatial locations
52
What is transfer learning?
Using a model trained on one task as starting point for another
53
Which scenario commonly uses transfer learning?
Limited labeled data for the new task, with a large pretrained model available
54
What is fine-tuning in deep learning?
Starting from pretrained weights and continuing training on a new task
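A transfer-learning sketch for cards 52–54, assuming recent torchvision versions and a hypothetical 5-class target task:

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (torchvision's ResNet-18 here).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained backbone so only the new head trains at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the new task (class count is made up).
model.fc = nn.Linear(model.fc.in_features, 5)
# For full fine-tuning, unfreeze the backbone and continue training
# with a small learning rate.
```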
55
What is the main purpose of data augmentation?
Generate additional realistic training samples from existing ones
56
Which is a common image data augmentation technique?
Random cropping and flipping
57
What is label smoothing?
Assigning soft target probabilities instead of hard 0/1 labels
58
What is gradient clipping?
Limiting the magnitude of gradients to prevent exploding gradients
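Label smoothing and gradient clipping (cards 57–58) in one PyTorch training step (model and shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)
# Label smoothing: soft targets like 0.9/0.05/0.05 instead of hard 1/0/0.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
optimizer.zero_grad()
loss_fn(model(x), y).backward()
# Gradient clipping: rescale gradients so their global norm stays <= 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```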
59
Which optimization algorithm uses exponentially decaying averages of past gradients and squared gradients?
Adam
60
What is the role of the learning rate?
Controls the update step size in parameter space
61
What can happen if the learning rate is too high?
Model may diverge or oscillate
62
What can happen if the learning rate is too low?
Training may be very slow or get stuck in poor local minima
63
What is early stopping?
Stopping training before convergence based on validation performance
64
What is a validation set used for?
Tuning hyperparameters and monitoring overfitting
65
Why is batch size important?
It affects gradient noise, memory usage, and training stability
66
What is a known risk of large batch sizes?
Can lead to sharp minima and sometimes worse generalization
67
What is overparameterization in deep learning?
Model has many more parameters than training samples
68
Which regularization method penalizes large weights directly in the loss?
Weight decay (L2 regularization)
69
What is the difference between L1 and L2 regularization?
L1 encourages sparsity, L2 encourages small weights
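Cards 68–69 as explicit penalty terms (PyTorch assumed; the coefficients are illustrative):

```python
import torch

def regularizers(params, l1=1e-5, l2=1e-4):
    """Penalty terms to add to the loss; coefficients are illustrative."""
    params = list(params)                          # materialize to iterate twice
    l1_term = sum(p.abs().sum() for p in params)   # L1: pushes weights to zero
    l2_term = sum((p ** 2).sum() for p in params)  # L2: keeps weights small
    return l1 * l1_term + l2 * l2_term

model = torch.nn.Linear(10, 3)
# L2 is often applied via the optimizer's weight_decay instead:
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
penalty = regularizers(model.parameters())         # or add this to the loss
```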
70
What is a bottleneck layer in an autoencoder?
The narrow central layer representing compressed features
71
What is an autoencoder trained to do?
Reconstruct its input at the output
72
What is a denoising autoencoder?
Learns to reconstruct clean input from a corrupted version
73
What is a variational autoencoder (VAE) mainly used for?
Learning a continuous latent distribution to generate samples
74
Which term appears in the VAE loss function?
Reconstruction loss plus KL divergence regularization
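A plain autoencoder covering cards 70–74 (PyTorch assumed; the 8-dim bottleneck and layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain autoencoder: reconstruct the input through a narrow bottleneck."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                                     nn.Linear(32, 8))    # bottleneck layer
        self.decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                                     nn.Linear(32, 64))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 64)
loss = nn.functional.mse_loss(model(x), x)  # reconstruct input at the output
# A denoising AE would feed model(x + noise) and still compare against x;
# a VAE would add a KL divergence term on the latent distribution.
```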
75
What is a skip connection?
Connection that skips one or more layers and adds earlier activations to later ones
76
What does 'exploding gradients' refer to?
Gradients becoming extremely large
77
Which method helps with exploding gradients?
Gradient clipping
78
What is the purpose of using a learning rate schedule?
To adjust learning rate over time for better convergence
79
What is a common learning rate schedule?
Exponential decay
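Exponential decay (cards 78–79) as a PyTorch scheduler (optimizer, gamma, and epoch count are illustrative):

```python
import torch

model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Exponential decay: multiply the learning rate by gamma every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(5):
    # ... one epoch of training steps would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # lr decays: 0.09, 0.081, 0.0729, ...
```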
80
What is a receptive field in CNNs?
Region of the input that affects a particular neuron
81
How can you increase the receptive field in CNNs?
Use larger kernels, more layers, or dilation
82
What is dilation in convolution?
Inserting gaps between kernel elements so the filter skips input positions, enlarging the receptive field
83
What is channel-wise (depthwise) convolution?
Convolution applied separately per input channel
84
Why are depthwise separable convolutions used?
To reduce parameters and computational cost
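The savings claimed in cards 83–84 can be counted directly (PyTorch assumed; 3x3 kernels with 32 in / 64 out channels are an arbitrary example):

```python
import torch.nn as nn

# Depthwise separable convolution = per-channel depthwise conv + 1x1 pointwise.
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)
pointwise = nn.Conv2d(32, 64, kernel_size=1)
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(depthwise) + n_params(pointwise))  # 2,432 parameters
print(n_params(standard))                         # 18,496 parameters
```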
85
What is a hyperparameter in deep learning?
A configuration value set manually before training, such as the learning rate or number of layers
86
What is an example of a hyperparameter?
Learning rate
87
What is the difference between training loss and validation loss?
Training loss is computed on training data, validation loss on held-out validation data
88
What does it typically indicate if validation loss starts increasing while training loss continues decreasing?
Overfitting
89
Which metric is commonly used for multi-class classification evaluation in deep networks?
Accuracy
90
What is top-k accuracy?
Fraction of samples whose correct class is within the model’s top k predictions
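Top-k accuracy (card 90) as a short function (PyTorch assumed; the toy sizes are made up):

```python
import torch

def top_k_accuracy(logits, targets, k=5):
    """Fraction of samples whose true class is among the k highest logits."""
    topk = logits.topk(k, dim=-1).indices          # (batch, k) predicted classes
    correct = (topk == targets.unsqueeze(-1)).any(dim=-1)
    return correct.float().mean().item()

logits = torch.randn(100, 10)                      # toy: 100 samples, 10 classes
targets = torch.randint(0, 10, (100,))
print(top_k_accuracy(logits, targets, k=5))        # ~0.5 for random logits
```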
91
Why might you use mixed-precision training?
To reduce memory usage and speed up training using lower precision arithmetic
92
Which numeric formats are typically involved in mixed-precision training?
Float16/BFloat16 and Float32
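A mixed-precision training step for cards 91–92, assuming a CUDA GPU and the `torch.cuda.amp` API (newer PyTorch versions expose the same tools under `torch.amp`):

```python
import torch

model = torch.nn.Linear(10, 3).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 underflow

x = torch.randn(32, 10, device='cuda')
y = torch.randint(0, 3, (32,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():        # run the forward pass in lower precision
    loss = torch.nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()          # scaled backward pass
scaler.step(optimizer)                 # unscales gradients, then updates
scaler.update()
```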
93
What is gradient checkpointing used for?
Trading extra computation for reduced memory usage
94
What is the main challenge of training very deep networks?
Vanishing/exploding gradients and optimization difficulties
95
What does 'end-to-end training' mean?
Training the entire pipeline jointly from raw inputs to final outputs
96
What is a common challenge when deploying deep learning models?
Fitting models into resource constraints and achieving low latency
97
Why might you prune a deep neural network?
To reduce parameters and speed up inference
98
What is knowledge distillation in deep learning?
Using a large teacher model to train a smaller student model
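A standard distillation loss for card 98 (PyTorch assumed; temperature, alpha, and the toy logits are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft teacher targets."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean',
    ) * temperature ** 2                 # standard gradient-scale correction
    return alpha * hard + (1 - alpha) * soft

# Toy logits for a batch of 8 samples and 3 classes.
s, t = torch.randn(8, 3), torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y))
```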
99
How does deep learning differ from classic ML?
Deep learning models automatically learn hierarchical features from raw data
100
Which factor is usually most critical for successful deep learning?
Sufficient data, compute, and good optimization/hyperparameters