Brief of AI/ML coding interview questions (Leetcode style)
Some implementation examples for machine learning coding interviews.
Practice platforms
- tensortonic.com: can auto-push solutions to GitHub
- deep-ml.com
- GitHub: TorchCode: provides advanced implementations (e.g., top-k/top-p sampling; KV-cache, Flash, Grouped-Query, and Multi-Head attention; speculative decoding; RL losses). More of my implementations can be found at GitHub: QuanHNguyen232/TensorTonic-Solutions.
Backprop
import numpy as np

N = 3
# Toy class probabilities: random positive scores, normalized per row so log() is well defined
probs = np.random.randint(1, 10, (N, 10)).astype(float)
probs = probs / probs.sum(axis=1, keepdims=True)
y = np.random.randint(0, 10, (N,))
epsilon = 1e-12
# 1. Fancy-indexing form: pick the probability of the true class per sample
preds = probs[np.arange(N), y]
loss = -np.mean(np.log(preds + epsilon))
# 2. One-hot form: same loss via an explicit one-hot target matrix
y_truth = np.zeros_like(probs)
y_truth[np.arange(N), y] = 1  # (N, out) == probs.shape
per_sample = np.sum(y_truth * np.log(probs + epsilon), axis=1)  # (N,)
loss1 = -np.mean(per_sample)
assert abs(loss - loss1) <= 1e-6, f"expect loss==loss1, but got {loss}(loss)!={loss1}(loss1)"
Complete code
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # expects x to already be sigmoid(z); sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return x * (1 - x)

def single_neuron_backpropagation(features, labels, initial_weights, initial_bias, learning_rate, epochs):
    # Initialize weights and bias
    weights = initial_weights
    bias = initial_bias
    # Training loop
    for epoch in range(epochs):
        # Forward pass
        z = np.dot(features, weights) + bias
        y_pred = sigmoid(z)
        # Compute loss
        loss = np.mean((y_pred - labels) ** 2)
        # Backpropagation (dL/dy_pred includes the 1/N from the mean)
        d_loss_d_ypred = 2 * (y_pred - labels) / len(labels)
        d_ypred_d_z = sigmoid_derivative(y_pred)
        d_z_d_w = features
        # Compute gradients
        d_loss_d_w = np.dot(d_z_d_w.T, d_loss_d_ypred * d_ypred_d_z)
        d_loss_d_b = np.sum(d_loss_d_ypred * d_ypred_d_z)
        # Update weights and bias
        weights -= learning_rate * d_loss_d_w
        bias -= learning_rate * d_loss_d_b
        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss}")
    return weights, bias

# Example usage
np.random.seed(42)
features = np.random.randn(100, 3)  # 100 samples, 3 features
labels = np.random.randint(0, 2, 100)  # Binary labels
initial_weights = np.random.randn(3)
initial_bias = 0.0
learning_rate = 0.01
epochs = 1000
final_weights, final_bias = single_neuron_backpropagation(
    features, labels, initial_weights, initial_bias, learning_rate, epochs
)
print("Final weights:", final_weights)
print("Final bias:", final_bias)

# Test the trained neuron
def predict(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

test_features = np.random.randn(5, 3)  # 5 test samples
predictions = predict(test_features, final_weights, final_bias)
print("Predictions for test samples:", predictions)
####################################
def calculate_d_theta(X: list[list[float]], Y: list[float], Y_hat: list[float]) -> list[float]:
    m = len(Y)  # Number of training examples
    n = len(X[0])  # Number of features
    d_theta = [0.0] * (n + 1)  # +1 for the bias term
    for j in range(n + 1):
        for i in range(m):
            if j == 0:
                # For bias term (θ₀)
                d_theta[j] += Y_hat[i] - Y[i]
            else:
                # For other parameters (θ₁, θ₂, ..., θₙ)
                d_theta[j] += (Y_hat[i] - Y[i]) * X[i][j-1]
    # Average the gradients
    d_theta = [d / m for d in d_theta]
    return d_theta
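The double loop above is equivalent to the vectorized form `d_theta = (1/m) * X_aug.T @ (Y_hat - Y)`, where `X_aug` prepends a bias column of ones. A small self-contained NumPy check (toy numbers of my own, not from the exercise):

```python
import numpy as np

# Vectorized equivalent of the loop-based calculate_d_theta above:
# prepend a bias column of ones, then average err * feature over samples.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
Y_hat = np.array([1.5, 1.5, 3.5])

m = X.shape[0]
X_aug = np.hstack([np.ones((m, 1)), X])  # (m, n+1), column 0 is the bias term
d_theta = X_aug.T @ (Y_hat - Y) / m      # (n+1,)
print(d_theta)  # ≈ [0.1667, 0.5, 0.6667]
```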
Gradient Descent (Linear Regression)
Overview
Implement the missing code (denoted by ellipses).
You may not modify the pre-existing code.
Your task is to implement parts of the Gradient Descent optimization algorithm from scratch (i.e., without importing any libraries or packages).
You will apply this algorithm to linear regression, finding the coefficients for the following equation:
\[y = b + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m\]
Gradient Descent Steps
Step 1: Initialize parameters
Randomly set initial values of:
- bias ( $b$ )
- coefficients ( $\theta_i$ )
Step 2: Predict output
Calculate the predicted value ( $\hat{y}$ ).
\[\hat{y}_i = b + \sum_{k=1}^{m} \theta_k x_{ik}\]
Step 3: Compute derivative with respect to bias
Calculate the partial derivative of the cost function with respect to bias.
Cost function: \(J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
Derivative w.r.t. ( $b$ ): \(d_b = \frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\)
Step 4: Compute derivatives with respect to coefficients
Calculate the partial derivatives of the cost function with respect to each ( $\theta_k$ ).
\[d_{\theta_k} = \frac{\partial}{\partial \theta_k} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right) = -\frac{2}{n} \sum_{i=1}^{n} x_{ik} (y_i - \hat{y}_i)\]
Step 5: Update parameters
Let ( $\alpha$ ) be the learning rate.
\(b = b - \alpha d_b\) \(\theta_k = \theta_k - \alpha d_{\theta_k}\)
Step 6: Iterate
Repeat Steps 2-5 for a fixed number of iterations.
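As a quick numeric check of Steps 2-5 above (toy single-feature data of my own, not from the exercise):

```python
import numpy as np

# One gradient-descent step on y = b + theta*x, checking the Step 3-5 formulas numerically.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])  # true function: y = 1 + x
b, theta, alpha = 0.0, 0.0, 0.1

y_hat = b + theta * x                                # Step 2
d_b = -2.0 / len(y) * np.sum(y - y_hat)              # Step 3: d_b = -6
d_theta = -2.0 / len(y) * np.sum(x * (y - y_hat))    # Step 4: d_theta = -40/3
b, theta = b - alpha * d_b, theta - alpha * d_theta  # Step 5
print(b, theta)  # b = 0.6, theta ≈ 1.3333
```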
Input / Output Description
You will be given:
- x_train: a two-dimensional array of floats. Each subarray x_train[i] represents one training data point.
- y_train: a one-dimensional array of floats. Each element is the true function value corresponding to x_train[i].
- x_test: test data in the same format as x_train, without labels.
Task
- Train the gradient descent model using x_train and y_train.
- Use the trained model to predict function values for x_test.
- Return the predictions.
Notes
- All training and test data are guaranteed to be floating-point values.
- The skeleton code for assigning and returning class labels has already been created.
- Do not edit any code except inside the # implement this sections.
Example
Input
x_train = [
[1.0],
[2.0],
[0.0],
[-1.0],
[3.0]
]
y_train = [2.0, 3.0, 1.0, 0.0, 4.0]
x_test = [
[-2.0],
[10.0]
]
The output should be: solution(x_train, y_train, x_test) = [-1.0, 11.0]
Code
from typing import List
import random

def random_bias_theta(n_features: int) -> tuple[float, list[float]]:
    """
    STEP 1: Randomly set bias and theta
    """
    b = random.random()
    theta = [random.random() for _ in range(n_features)]
    return b, theta

def calculate_y(b: float, theta: list[float], x: list[float]) -> float:
    """
    STEP 2: Calculate predicted value of Y
    """
    # implement this
    y_hat = b + sum(x_ * theta_ for (x_, theta_) in zip(x, theta))
    return y_hat

def calculate_d_b(Y: list[float], Y_hat: list[float]) -> float:
    """
    STEP 3: Calculate bias derivative
    """
    # implement this
    assert len(Y_hat) == len(Y), "len mismatch"
    n = len(Y_hat)
    grad = sum(y - y_hat for y, y_hat in zip(Y, Y_hat))
    return (-2 / n) * grad

def calculate_d_theta(X: list[list[float]], Y: list[float], Y_hat: list[float]) -> list[float]:
    """
    STEP 4: Calculate theta derivative
    """
    # implement this
    assert len(Y_hat) == len(Y) and len(Y) == len(X), "len mismatch"
    m = len(X)
    n = len(X[0])
    grad = [0.0] * n
    for i in range(m):
        err = Y[i] - Y_hat[i]
        for j in range(n):
            grad[j] += err * X[i][j]
    for j in range(n):
        grad[j] *= (-2 / m)
    return grad

def update(X: list[list[float]], Y: list[float], Y_hat: list[float], b_prev: float,
           theta_prev: list[float], learning_rate: float) -> tuple[float, list[float]]:
    """
    STEP 5: Update gradient and weights
    """
    d_theta = calculate_d_theta(X, Y, Y_hat)
    d_b = calculate_d_b(Y, Y_hat)
    # implement this
    b_new = b_prev - learning_rate * d_b
    n = len(theta_prev)
    theta_new = [0.0] * n
    for i in range(n):
        theta_new[i] = theta_prev[i] - learning_rate * d_theta[i]
    return b_new, theta_new

def fit(X: list[list[float]], Y: list[float], num_iterations: int,
        learning_rate: float = 0.2) -> tuple[float, list[float]]:
    """
    STEP 6: Pulling it together
    """
    b, theta = random_bias_theta(len(X[0]))
    for _ in range(num_iterations):
        # implement this
        Y_hat = [calculate_y(b, theta, x=X[i]) for i in range(len(Y))]
        b, theta = update(X, Y, Y_hat, b, theta, learning_rate)
    return b, theta

def solution(x_train: list[list[float]], y_train: list[float], x_test: list[list[float]],
             iterations: int = 1000) -> list[float]:
    random.seed(42)
    b, theta = fit(x_train, y_train, iterations)
    return [round(calculate_y(b, theta, x), 2) for x in x_test]
import numpy as np
import random

class LinearRegression:
    def random_bias_theta(self, n_features: int) -> tuple[float, np.ndarray]:
        """
        Randomly initialize bias (scalar) and theta (shape: [n_features]).
        """
        self.b = random.random()
        self.theta = np.random.rand(n_features)
        return self.b, self.theta

    def calculate_y(self, X: np.ndarray) -> np.ndarray:
        """
        Predict y_hat for a batch.
        X: (m, n)
        theta: (n,)
        returns: (m,)
        """
        return X @ self.theta + self.b

    def gradients(self, X: np.ndarray, Y: np.ndarray, Y_hat: np.ndarray) -> tuple[float, np.ndarray]:
        """
        Gradients for MSE: (1/m) * sum (Y - Y_hat)^2
        Using the exact derivative forms:
            d_b     = (-2/m) * sum_i (Y_i - Yhat_i)
            d_theta = (-2/m) * X^T (Y - Yhat)
        """
        m = Y.shape[0]
        err = Y - Y_hat  # (m,)
        self.d_b = float((-2.0 / m) * err.sum())  # scalar
        self.d_theta = (-2.0 / m) * (X.T @ err)   # (n, m) @ (m,) = (n,)
        return self.d_b, self.d_theta

    def update(self, learning_rate: float) -> tuple[float, np.ndarray]:
        b_new = self.b - learning_rate * self.d_b
        theta_new = self.theta - learning_rate * self.d_theta
        return b_new, theta_new

    def fit(self, X: list[list[float]] | np.ndarray, Y: list[float] | np.ndarray,
            num_iterations: int, learning_rate: float = 0.2,
            seed: int = 42) -> tuple[float, np.ndarray]:
        """
        Vectorized gradient descent for linear regression.
        """
        random.seed(seed)
        np.random.seed(seed)
        X = np.asarray(X, dtype=np.float64)  # (m, n)
        Y = np.asarray(Y, dtype=np.float64)  # (m,)
        if X.ndim != 2:
            raise ValueError(f"X must be 2D (m, n), got shape {X.shape}")
        if Y.ndim != 1:
            raise ValueError(f"Y must be 1D (m,), got shape {Y.shape}")
        if X.shape[0] != Y.shape[0]:
            raise ValueError(f"len mismatch: X has {X.shape[0]} rows, Y has {Y.shape[0]}")
        self.random_bias_theta(X.shape[1])
        for _ in range(num_iterations):
            Y_hat = self.calculate_y(X)  # (m,)
            self.d_b, self.d_theta = self.gradients(X, Y, Y_hat)  # scalar, (n,)
            self.b, self.theta = self.update(learning_rate)
        return self.b, self.theta

    def solution(self, x_train, y_train, x_test, iterations: int = 1000) -> list[float]:
        b, theta = self.fit(x_train, y_train, iterations, learning_rate=0.2, seed=42)
        X_test = np.asarray(x_test, dtype=np.float64)
        preds = b + X_test @ theta
        return [round(float(p), 2) for p in preds]
K-Nearest Neighbors (KNN)
from typing import List
import heapq
import math
from collections import Counter

def cal_dist(value1: List[float], value2: List[float]) -> float:
    # Step 1: Define distance metric (Euclidean)
    # implement this
    val = 0.0
    for a, b in zip(value1, value2):
        val += (a - b) ** 2
    return math.sqrt(val)

def kneighbors(k: int, x_train: List[List[float]],
               test_sample: List[float]) -> List[int]:
    # Step 2: Identify indices of the k nearest neighbors
    # implement this: max-heap of size k (negated distances), so the k smallest survive
    arr = []
    for i, pt in enumerate(x_train):
        dist = cal_dist(pt, test_sample)
        heapq.heappush(arr, (-dist, i))
        if len(arr) > k:
            heapq.heappop(arr)
    ans = []
    while arr:
        ans.append(heapq.heappop(arr)[1])
    return ans

def predict(k: int, x_train: List[List[float]], y_train: List[int],
            test_sample: List[float]) -> int:
    # Step 3: Assign label by majority vote among the k neighbors
    # implement this
    indices = kneighbors(k, x_train, test_sample)
    labels = [y_train[i] for i in indices]
    return Counter(labels).most_common(1)[0][0]

def solution(x_train: List[List[float]], y_train: List[int],
             x_test: List[List[float]], k: int) -> List[int]:
    # Step 4: Pull it together
    return [predict(k, x_train, y_train, sample) for sample in x_test]
x_train = [
[0., 2.],
[10., 10.],
[1., 0.],
[7., 8.],
[9., 1],
[11., 0],
[6., 10.],
[-2., 1.],
[8., 0.],
]
y_train = [0, 1, 0, 1, 2, 2, 1, 0, 2]
x_test = [
[4., 4.],
[5., 5.],
[5., 4.]
]
ans = solution(x_train, y_train, x_test, 3)
print(ans)
KMeans
from typing import List
import numpy as np
import matplotlib.pyplot as plt

PREV_CENTROIDS = None
ITERATION = 0

def my_cdist(data, centroids):
    def compute_distance(point, centroid):
        point = np.array(point)
        centroid = np.array(centroid)
        return np.linalg.norm(point - centroid)
    ans = []
    for i in range(len(data)):
        tmp = []
        for j in range(len(centroids)):
            distance = compute_distance(data[i], centroids[j])
            tmp.append(distance)
        ans.append(tmp)
    return np.array(ans)

def calculate_distance(data: List[List[float]], centroids: List[List[float]]) -> List[List[float]]:
    # Step 1: Calculate distance between each data point and the k centroids
    from scipy.spatial.distance import cdist
    # return cdist(data, centroids, "cosine").tolist()
    return cdist(data, centroids).tolist()  # shape = (num_samples, k_centroids)
    # return my_cdist(data, centroids).tolist()

def get_clusters(data: List[List[float]], centroids: List[List[float]]) -> List[int]:
    # Step 2: Assign each data point to its nearest centroid
    distances = calculate_distance(data, centroids)
    distances = np.array(distances)  # shape = (num_samples, k_centroids)
    # return [dist.index(min(dist)) for dist in distances]
    return np.argmin(distances, axis=1).tolist()  # shape = (num_samples,)

def plot(data, new_centroids):
    clusters = get_clusters(data, new_centroids)
    plt.scatter(
        [x[0] for x in data] + [x[0] for x in new_centroids],
        [x[1] for x in data] + [x[1] for x in new_centroids],
        c=clusters + list(range(len(new_centroids)))
    )
    plt.show()

def update_clusters(clusters: List[int], data: List[List[float]], k: int) -> List[int]:
    """Assume: k is the number of clusters
    Step 3: Average the data points in each cluster to update the centroids' locations
    """
    global PREV_CENTROIDS, ITERATION
    new_centroids = []
    for cluster_id in range(k):
        # get data points by cluster id
        cluster_points = [data[j] for j in range(len(data)) if clusters[j] == cluster_id]
        if cluster_points:
            new_centroid = [sum(dim) / len(cluster_points) for dim in zip(*cluster_points)]
            new_centroids.append(new_centroid)
        else:
            # If the cluster is empty, retain the previous centroid
            # (or initialize to zero if it's the first iteration)
            new_centroids.append(PREV_CENTROIDS[cluster_id] if ITERATION > 0 else [0.0] * len(data[0]))
    PREV_CENTROIDS = new_centroids
    # plot(data, new_centroids)
    return get_clusters(data, new_centroids)

def fit_predict(data: List[List[float]], k: int, centroids: List[List[float]], iterations: int) -> List[int]:
    # Step 4: Pull everything together
    global PREV_CENTROIDS, ITERATION
    PREV_CENTROIDS = centroids
    clusters = get_clusters(data, centroids)
    for i in range(iterations):
        ITERATION = i
        clusters = update_clusters(clusters, data, k)
    return clusters
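The assign/update loop above can also be done fully vectorized in NumPy; one iteration on toy data of my own (two obvious blobs, two initial centroids):

```python
import numpy as np

# One vectorized KMeans iteration (assign + update), illustrating Steps 1-3 above.
data = np.array([[0.0, 0.0], [0.5, 0.5], [10.0, 10.0], [10.5, 9.5]])
centroids = np.array([[0.0, 1.0], [9.0, 9.0]])

# Steps 1-2: pairwise distances, nearest-centroid assignment
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)  # (n, k)
clusters = np.argmin(dists, axis=1)                                        # (n,)

# Step 3: new centroid = mean of its assigned points
new_centroids = np.array([data[clusters == c].mean(axis=0) for c in range(len(centroids))])
print(clusters)        # [0 0 1 1]
print(new_centroids)   # [[0.25 0.25] [10.25 9.75]]
```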
Naive Bayes
# Not yet implemented
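As a placeholder until this section is filled in: a minimal Gaussian Naive Bayes sketch of my own (not from any platform's reference solution) — fit per-class feature means/variances, then predict via log posteriors.

```python
import numpy as np

# Minimal Gaussian Naive Bayes: assume features are conditionally independent
# given the class, each feature Gaussian per class.
class GaussianNB:
    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes = np.unique(y)
        self.means, self.vars, self.priors = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.means[c] = Xc.mean(axis=0)
            self.vars[c] = Xc.var(axis=0) + 1e-9  # smoothing to avoid division by zero
            self.priors[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            best, best_lp = None, -np.inf
            for c in self.classes:
                # log P(c) + sum_j log N(x_j | mu_cj, var_cj)
                lp = np.log(self.priors[c]) + np.sum(
                    -0.5 * np.log(2 * np.pi * self.vars[c])
                    - (x - self.means[c]) ** 2 / (2 * self.vars[c])
                )
                if lp > best_lp:
                    best, best_lp = c, lp
            preds.append(best)
        return preds
```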
Decision Tree
Overview
Implement the missing code (denoted by ellipses).
You may not modify the pre-existing code.
Your task is to implement parts of the Decision Tree classification algorithm from scratch (i.e., without importing any libraries or packages).
As a reminder, Decision Tree building comprises the following four major steps.
Decision Tree Building Steps
Step 1: Build tree nodes
- Inner (non-leaf) nodes contain:
- a feature index
- a feature value based on which the decision is made
- Leaf nodes contain:
- a class label to predict
Step 2: Find the best split
For each tree node, find the best split of samples based on feature values.
Step 3: Calculate entropy
Entropy is used to measure the purity of a split, using the following formula:
\[E = - \sum_{i=1}^{N} p_i \log_2 p_i\]
where:
- $ p_i = \frac{N_i}{N} $
- $ N_i $ is the number of occurrences of class label $i$ in the labeled data
- $ N $ is the total number of samples
Step 4: Calculate Information Gain
The Information Gain (IG) of a split is computed as:
\[IG = E_{\text{parent}} - \left( \frac{l}{n} \cdot E_{\text{left}} + \frac{r}{n} \cdot E_{\text{right}} \right)\]
where:
- $ l $ is the number of samples in the left child
- $ r $ is the number of samples in the right child
- $ n = l + r $ is the total number of samples
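A quick numeric check of the two formulas above (toy labels of my own): a perfect split of a balanced binary parent should gain exactly 1 bit.

```python
import math

# Entropy / information gain, matching Steps 3-4 above.
def entropy(y):
    n = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

parent = [0, 0, 1, 1]          # E = 1.0 bit
left, right = [0, 0], [1, 1]   # both children pure, E = 0
ig = entropy(parent) - (len(left) / 4) * entropy(left) - (len(right) / 4) * entropy(right)
print(ig)  # 1.0
```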
Prediction
To predict class labels of unlabeled samples:
- Traverse the decision tree according to the information stored in the nodes.
- Continue until a leaf node is reached.
- Assign the class label stored in that leaf node.
Input / Output Description
To validate the algorithm implementation, you will need to use it for classification tasks.
You will be given:
- x_train: a two-dimensional array of float values. Each sub-array x_train[i] represents a unique training sample.
- y_train: a one-dimensional array of floats. Each element represents the true class label corresponding to x_train[i].
- x_test: test data in the same format as x_train, without class labels.
Task
- Create a Decision Tree model using the training data.
- Use the model to classify the test data.
- Return the predicted class labels (float values) for the test data.
Notes
- It is guaranteed that all training and test data are float values.
- The skeleton code for assigning and returning class labels has already been created.
- Do not edit any code except inside the # implement this sections.
import math
from typing import NamedTuple

class Node:
    """
    Just a dummy super-class for decision tree nodes
    """
    pass

class InnerNode(Node):
    """
    Represents an inner node of the decision tree
    feature_idx - index of the feature based on whose value the decision is made
    split_value - split value of the feature at feature_idx
    left - left child
    right - right child
    """
    def __init__(self, feature_idx: int, split_value: float, left: Node, right: Node) -> None:
        self.feature_idx = feature_idx
        self.split_value = split_value
        self.left = left
        self.right = right

class LeafNode(Node):
    """
    Represents a leaf node of the decision tree
    label - class label to assign to the processed sample
    """
    def __init__(self, label: int) -> None:
        self.label = label

class BestSplit(NamedTuple):
    """
    Represents the best split of a set of samples
    ig - information gain of the split
    feature_idx - index of feature used for the split
    split_value - value used as the split criterion
    left_indices - indices of samples going to the left child
    right_indices - indices of samples going to the right child
    """
    ig: float = -math.inf
    feature_idx: int = 0
    split_value: float = 0.0
    left_indices: list[int] = []
    right_indices: list[int] = []

class DecisionTreeClassifier:
    def __init__(self):
        self.max_depth = 10

    def _build_node(self, x: list[list[float]], y: list[int], depth: int) -> Node:
        """
        Step 1: Recursively build tree nodes by splitting the remaining samples into two parts
        """
        n_samples = len(x)
        if n_samples > 2 and depth < self.max_depth:
            best_split = self._get_best_split(x, y)
            if best_split.ig > 0:
                left_x = [x[i] for i in best_split.left_indices]
                left_y = [y[i] for i in best_split.left_indices]
                right_x = [x[i] for i in best_split.right_indices]
                right_y = [y[i] for i in best_split.right_indices]
                left_child = self._build_node(left_x, left_y, depth + 1)
                right_child = self._build_node(right_x, right_y, depth + 1)
                return InnerNode(best_split.feature_idx, best_split.split_value, left_child, right_child)
        return LeafNode(label=max(sorted(y), key=y.count))

    def _get_best_split(self, x: list[list[float]], y: list[int]) -> BestSplit:
        """
        Step 2: Find the best split of the given set of samples
        """
        n_samples = len(x)
        n_features = len(x[0])
        best_split = BestSplit()
        # iterate over all features
        for feature in range(n_features):
            transposed_matrix = list(zip(*x))
            feature_values = transposed_matrix[feature]
            # try to split by all values of the feature
            for split_value in feature_values:
                left_indices = list(
                    filter(lambda i: x[i][feature] <= split_value, range(n_samples)))
                right_indices = list(
                    filter(lambda i: x[i][feature] > split_value, range(n_samples)))
                # if both left and right parts are non-empty, calculate IG and keep the best split
                if len(left_indices) > 0 and len(right_indices) > 0:
                    left_y = [y[i] for i in left_indices]
                    right_y = [y[i] for i in right_indices]
                    ig = self._calculate_ig(y, left_y, right_y)
                    if ig > best_split.ig:
                        best_split = BestSplit(ig, feature, split_value, left_indices, right_indices)
        return best_split

    @staticmethod
    def calculate_entropy(y: list[int]) -> float:
        """
        Step 3: Calculate entropy of a labels list
        """
        # implement this
        n = len(y)
        if n == 0:
            return 0.0
        # count label frequencies without any imports
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        entropy = 0.0
        for c in counts.values():
            pi = c / n
            entropy += pi * math.log2(pi)
        return -entropy

    @staticmethod
    def _calculate_ig(y_parent: list[int], y_left: list[int], y_right: list[int]) -> float:
        """
        Step 4: Calculate Information Gain of a split
        """
        # implement this
        n = len(y_parent)
        l = len(y_left)
        r = len(y_right)
        assert n == l + r
        if n == 0 or l == 0 or r == 0:
            return 0.0
        e_parent = DecisionTreeClassifier.calculate_entropy(y_parent)
        e_left = DecisionTreeClassifier.calculate_entropy(y_left)
        e_right = DecisionTreeClassifier.calculate_entropy(y_right)
        return e_parent - ((l / n) * e_left + (r / n) * e_right)

    def fit(self, x: list[list[float]], y: list[int]):
        """
        Builds the tree from the given samples
        """
        self.root = self._build_node(x, y, 0)

    def predict(self, x: list[list[float]]) -> list[int]:
        """
        Step 5: Traverse the tree and return a predicted class label from a leaf node
        """
        preds = []
        for sample in x:
            # implement this
            node = self.root
            while isinstance(node, InnerNode):
                if sample[node.feature_idx] <= node.split_value:
                    node = node.left
                else:
                    node = node.right
            assert isinstance(node, LeafNode)
            preds.append(node.label)
        return preds

def solution(x_train: list[list[float]], y_train: list[int], x_test: list[list[float]]) -> list[int]:
    """
    Pull everything together
    """
    tree = DecisionTreeClassifier()
    tree.fit(x_train, y_train)
    return tree.predict(x_test)
Bootstrap
import numpy as np
from collections import defaultdict
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from typing import List
from random import randint, seed

# 1. Bootstrap sampling function
def bootsample(n: int) -> List[int]:
    """
    Perform sampling with replacement n times from the range [0, n-1].
    Returns a list of sampled indices.
    """
    indices = []
    for _ in range(n):
        indices.append(randint(0, n - 1))
    return indices

# 2. Fit multiple classifiers using bootstrap sampling
def fit(classifiers: List[ClassifierMixin], x: list, y: list) -> None:
    """
    Train each classifier on a different bootstrap sample of the data.
    classifiers: List of sklearn classifier instances.
    x: Features of the training data.
    y: Labels of the training data.
    """
    x, y = np.asarray(x), np.asarray(y)  # so index lists can be used for fancy indexing
    n = len(x)
    for clf in classifiers:
        # Get a bootstrap sample of the training data
        sample_indices = bootsample(n)
        x_sampled = x[sample_indices]
        y_sampled = y[sample_indices]
        # Fit the classifier on the sampled data
        clf.fit(x_sampled, y_sampled)

# Step 3: Predict with majority voting
def predict(classifiers: List[ClassifierMixin], x_test: list) -> List[int]:
    """
    Predict the class label for each instance in the test data using majority voting.
    classifiers: List of sklearn classifier instances.
    x_test: Test data features.
    Returns: List of predicted class labels.
    """
    predictions = []
    for x in x_test:
        votes = defaultdict(int)
        for clf in classifiers:
            label = clf.predict([x])[0]
            votes[label] += 1
        # majority vote, tie -> smallest label
        best_label = min(votes.keys(), key=lambda k: (-votes[k], k))
        predictions.append(best_label)
    return predictions

# Step 4: Full pipeline
def solution(x_train: list, y_train: list, x_test: list, n_estimators: int):
    seed(42)
    classifiers = [DecisionTreeClassifier(random_state=0) for _ in range(n_estimators)]
    fit(classifiers, x_train, y_train)
    return predict(classifiers, x_test)
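A small self-contained sanity check of the bootstrap-sampling idea behind bootsample (toy numbers of my own):

```python
import random

# Sampling with replacement: indices stay in [0, n-1] and duplicates are expected.
random.seed(0)
n = 10
sample = [random.randint(0, n - 1) for _ in range(n)]
print(sample)
# With replacement, roughly 1 - (1 - 1/n)^n ≈ 63% of distinct indices appear for large n.
```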
Sampling Methods
Top-K
import torch
import torch.nn.functional as F

def top_k_sampling(logits: torch.Tensor, k: int) -> tuple[torch.Tensor, torch.Tensor]:
    """
    logits: (B, V)
    Returns (token_ids, sampled_logits), each of shape (B,).
    """
    B, V = logits.shape
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)  # (B, k), (B, k)
    probs = F.softmax(topk_vals, dim=-1)  # (B, k)
    pos = torch.multinomial(probs, num_samples=1)  # (B, 1)
    token_ids = topk_idx.gather(dim=-1, index=pos).squeeze(1)  # (B, 1) -> (B,)
    sampled_vals = topk_vals.gather(dim=-1, index=pos).squeeze(1)  # (B, 1) -> (B,)
    return token_ids, sampled_vals

# logits = torch.randint(0, 10, (2, 5)).to(torch.float32)
logits = torch.tensor([
    [0, 1, 2, 3],
    [4, 3, 1, -1]
]).to(torch.float32)
print(logits)
k = 3
print(top_k_sampling(logits, k))
Top-P
import torch
import torch.nn.functional as F

def top_p_sampling(logits: torch.Tensor, p: float) -> tuple[torch.Tensor, torch.Tensor]:
    """
    logits: (B, V)
    Returns (token_ids, sampled_logits), each of shape (B,).
    """
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)  # (B, V)
    sorted_probs = F.softmax(sorted_logits, dim=-1)  # (B, V)
    cumprobs = sorted_probs.cumsum(dim=-1)  # (B, V)
    # Mask tokens where the cumulative prob exceeds p
    cutoff_mask = cumprobs > p  # (B, V)
    # shift 1 right so the first token above p is still kept
    cutoff_mask[..., 1:] = cutoff_mask[..., :-1].clone()
    cutoff_mask[..., 0] = False
    filtered_sorted_logits = sorted_logits.masked_fill(cutoff_mask, float('-inf'))  # (B, V)
    probs = F.softmax(filtered_sorted_logits, dim=-1)  # (B, V)
    pos = torch.multinomial(probs, num_samples=1)  # (B, 1)
    token_ids = sorted_idx.gather(dim=-1, index=pos).squeeze(1)  # (B, 1) -> (B,)
    sampled_vals = filtered_sorted_logits.gather(dim=-1, index=pos).squeeze(1)  # (B, 1) -> (B,)
    return token_ids, sampled_vals

# logits = torch.randint(0, 10, (2, 5)).to(torch.float32)
logits = torch.tensor([
    [0, 1, 2, 3],
    [4, 3, 2, -1]
]).to(torch.float32)
print(logits)
p = 0.9
print(top_p_sampling(logits, p))
Multi-layer Perceptron
import numpy as np
"""
backprop for multi-layer perceptron
"""
class Optimizer:
def update(self, param: np.ndarray, grad: np.ndarray, name: str) -> np.ndarray:
"""
Return updated parameter tensor.
name is a unique string key for per-parameter state (needed for Adam).
"""
raise NotImplementedError
class SGD(Optimizer):
def __init__(self, lr) -> None:
self.lr = lr
def update(self, param: np.ndarray, grad: np.ndarray, name: str) -> np.ndarray:
return param - self.lr * grad
class Adam(Optimizer):
# https://www.geeksforgeeks.org/deep-learning/adam-optimizer/
def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
self.lr = lr
self.beta1, self.beta2 = betas
self.eps = eps
self.t = 0
self.m = {} # name -> ndarray
self.v = {} # name -> ndarray
def update(self, param: np.ndarray, grad: np.ndarray, name: str) -> np.ndarray:
# Adam uses one global step count for the optimizer
self.t += 1
if name not in self.m:
self.m[name] = np.zeros_like(param)
self.v[name] = np.zeros_like(param)
m = self.m[name]
v = self.v[name]
# Update biased first/second moments
m[:] = self.beta1 * m + (1 - self.beta1) * grad
v[:] = self.beta2 * v + (1 - self.beta2) * (grad**2)
# Bias correction
m_hat = m / (1 - (self.beta1 ** self.t))
v_hat = v / (1 - (self.beta2 ** self.t))
# Parameter update
return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
class Linear:
def __init__(self, in_size, out_size, name="linear"):
self.name = name
self.weights = np.random.rand(in_size, out_size)
self.bias = np.zeros(out_size)
def __call__(self, x):
"""f(x) = x @ W + b
Args:
x: (B, in_size)
Return
z: (B, out_size)
"""
self.x = x
return np.matmul(x, self.weights) + self.bias
def backward(self, grad_out):
"""
Args:
grad_out = dL/dz, shape (B, out_size)
Returns:
grad_in = dL/dx, shape (B, in_size)
Notes:
z = f(x) = x @ W + b
-> dL/dW (in_size, out_size)
= dL/dz * dz/dW (knowing dz/dW = x)
= grad_out (B, out_size) * x (B, in_size)
(to make it match shape -> x.T @ grad_out)
dL/dx (B, in_size)
= dL/dz * dz/dx (knowing dz/dx = W)
= grad_out (B, out_size) * W (in_size, out_size)
(to make it match shape -> grad_out @ W.T)
"""
self.dW = np.transpose(self.x, axes=(1, 0)) @ grad_out # (in_size, B) x (B, out_size) = (in_size, out_size)
self.db = np.sum(grad_out, axis=0) # (out_size,)
grad_in = grad_out @ np.transpose(self.weights, axes=(1, 0)) # (B, out_size) x (out_size, in_size) = (B,in_size)
return grad_in
def step(self, optimizer: Optimizer):
self.weights = optimizer.update(self.weights, self.dW, name=f"{self.name}.W")
self.bias = optimizer.update(self.bias, self.db, name=f"{self.name}.b")
class Sigmoid:
def __call__(self, x):
"""z = f(x) = 1 / (1+e^{-x})
Args:
x: (B, size)
Return
z: (B, size)
"""
self.out = 1 / (1 + np.exp(-x))
return self.out
def backprop(self, grad_out):
"""sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
Since self.out is already = sigmoid(x) --> self.out * (1-self.out)
Args:
grad_out = dL/dz, shape (B, hidden_size)
Returns:
grad_in = dL/dx, shape (B, hidden_size)
"""
dz_dx = self.out * (1-self.out)
grad_in = grad_out * dz_dx
return grad_in
class ReLU:
def __call__(self, x):
self.x = x
return np.maximum(0, x)
def backprop(self, grad_out):
# mask: 1 where x>0 else 0
mask = self.x > 0
return grad_out * mask
class Softmax:
def __call__(self, x):
"""z = e^xi / sum(e^xi)
Args:
x: (B, hidden_size)
Return:
z: (B, hidden_size)
"""
x = x - np.max(x, axis=1, keepdims=True) # stability
exp_x = np.exp(x)
total = np.sum(exp_x, axis=1, keepdims=True) # (B, 1)
self.out = exp_x / total
return self.out
def backprop(self, grad_out):
"""
Args:
grad_out = dL/ds where s = softmax(x), shape (B, C)
Return:
dL/dx: (B, C)
"""
s = self.out # (B, C)
# dot = (grad_out · s) per sample -> shape (B, 1)
dot = np.sum(grad_out * s, axis=1, keepdims=True)
# dL/dx = s * (grad_out - dot)
return s * (grad_out - dot)
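The vectorized `s * (grad_out - dot)` form above avoids building the per-sample softmax Jacobian `J = diag(s) - s s^T` explicitly. A quick standalone check that both routes agree (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
B, C = 2, 5
x = rng.standard_normal((B, C))
g = rng.standard_normal((B, C))  # grad_out = dL/ds

# forward pass with max-subtraction for stability
e = np.exp(x - x.max(axis=1, keepdims=True))
s = e / e.sum(axis=1, keepdims=True)

# vectorized backprop: dL/dx = s * (g - sum(g * s))
dot = np.sum(g * s, axis=1, keepdims=True)
dx_fast = s * (g - dot)

# explicit per-sample Jacobian: J = diag(s) - s s^T, dx = J @ g
dx_slow = np.empty_like(dx_fast)
for i in range(B):
    J = np.diag(s[i]) - np.outer(s[i], s[i])
    dx_slow[i] = J @ g[i]

assert np.allclose(dx_fast, dx_slow)
```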
class LossFunc:
y: np.ndarray
y_hat: np.ndarray
class MSE(LossFunc):
"""
y, y_hat: must have shape (B,) or (B, 1)
"""
y: np.ndarray
y_hat: np.ndarray
def __call__(self, y_hat: np.ndarray, y: np.ndarray):
"""
MSE = 1/N * sum_{i=1}^{N} (y_hat_i - y_i)**2
--> dMSE/dy_hat_i = 2/N * (y_hat_i - y_i)
y: (B,) -> (B, 1)
y_hat: (B, 1)
"""
self.y_hat = y_hat
self.y = y.reshape(y_hat.shape)
loss = np.mean((y_hat - y) ** 2)
return loss
def backprop(self):
"""
dMSE/dy_hat_i = 2/N * (y_hat_i - y_i)
Args:
y: (B, 1)
y_hat: (B, 1)
Return
dL/dL_dy_hat: (B, 1)
"""
batch = self.y_hat.shape[0]
return 2/batch * (self.y_hat - self.y)
class CrossEntropy(LossFunc):
epsilon = 1e-12
def __call__(self, y_hat: np.ndarray, y: np.ndarray):
"""
CE = -1/N * sum_{i->N} sum_{j->output_size}(y_i_j log y_hat_i_j)
y: (B, output_size) # one hot encode
y_hat: (B, output_size)
"""
self.y = y
self.y_hat = y_hat
per_sample = np.sum(y * np.log(y_hat + self.epsilon), axis=1) # (B,)
loss = -1 * np.mean(per_sample)
return loss
def backprop_combine(self):
"""combined softmax + CE
Args:
y: (B, output_size) # one hot encode
y_hat: (B, output_size)
Return
dL_dz2
"""
batch = self.y_hat.shape[0]
return (1/batch) * (self.y_hat - self.y)
def backprop(self):
"""
Return:
dL_dy_hat
"""
batch = self.y_hat.shape[0]
return (-1/batch) * (self.y / (self.y_hat + self.epsilon))
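The `backprop_combine` shortcut `(y_hat - y) / B` should equal chaining the plain CE gradient through the softmax Jacobian. A standalone check of that identity (random logits and one-hot labels are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
B, C = 4, 3
logits = rng.standard_normal((B, C))
y = np.eye(C)[rng.integers(0, C, B)]  # one-hot labels

# softmax forward
e = np.exp(logits - logits.max(axis=1, keepdims=True))
s = e / e.sum(axis=1, keepdims=True)

# combined softmax + CE gradient w.r.t. logits
dz_combined = (s - y) / B

# chained: dL/ds from CE, then through the softmax backprop rule
ds = (-1 / B) * (y / s)
dot = np.sum(ds * s, axis=1, keepdims=True)
dz_chained = s * (ds - dot)

assert np.allclose(dz_combined, dz_chained)
```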
class BinaryCE:
# https://cs230.stanford.edu/fall2018/section_files/section3_soln.pdf
pass
class MLP:
def __init__(self, input_size, hidden_size, output_size, optimizer) -> None:
self.linear1 = Linear(input_size, hidden_size)
self.linear2 = Linear(hidden_size, output_size)
self.sigmoid = Sigmoid()
self.softmax = Softmax()
self.optimizer = optimizer
self.criterion = CrossEntropy()
def forward(self, x):
"""
z1 = x*W1 + b1
a1 = sigmoid(z1)
z2 = a1*W2 + b2
y_hat = softmax(z2)
Loss = CE = -1/N sum_{i=1->N} sum_{j=1->C} y_ij * log(y_hat_ij)
Args:
x: (B, input_size)
Return
y_hat: (B, output_size)
"""
self.z1 = self.linear1(x)
self.a1 = self.sigmoid(self.z1)
self.z2 = self.linear2(self.a1)
self.y_hat = self.softmax(self.z2)
return self.y_hat
def backward(self, y):
"""
Args:
y: (B, output_size)
Notes:
z1 = x*W1 + b1
a1 = sigmoid(z1)
z2 = a1*W2 + b2
y_hat = softmax(z2)
Loss = CE = -1/N sum_{i=1->N} sum_{j=1->C} y_ij * log(y_hat_ij)
"""
loss = self.criterion(self.y_hat, y)
# delta2 = dL/dz2 (combined softmax + CE)
dL_dz2 = self.criterion.backprop_combine() # (B, output_size)
# through Linear2
dL_da1 = self.linear2.backward(dL_dz2) # (B, hidden_size)
# through sigmoid
dL_dz1 = self.sigmoid.backprop(dL_da1) # (B, hidden_size)
# through Linear1
dL_dx = self.linear1.backward(dL_dz1)
# gradient step
self.linear2.step(self.optimizer)
self.linear1.step(self.optimizer)
return loss
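To see the whole pipeline (linear → sigmoid → linear → softmax, combined softmax + CE gradient, SGD step) converge end-to-end, here is a compact standalone loop with the same math inlined; the toy dataset, layer sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# toy 2-class data: class = sign of the first feature
X = rng.standard_normal((64, 2))
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]  # one-hot, (64, 2)

W1 = rng.standard_normal((2, 8)) * 0.5; b1 = np.zeros(8)
W2 = rng.standard_normal((8, 2)) * 0.5; b2 = np.zeros(2)
lr, B = 0.5, X.shape[0]

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

losses = []
for _ in range(200):
    z1 = X @ W1 + b1
    a1 = 1 / (1 + np.exp(-z1))            # sigmoid
    y_hat = softmax(a1 @ W2 + b2)
    losses.append(-np.mean(np.sum(Y * np.log(y_hat + 1e-12), axis=1)))
    dz2 = (y_hat - Y) / B                 # combined softmax + CE
    dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * a1 * (1 - a1)             # sigmoid'
    dW1 = X.T @ dz1; db1 = dz1.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

assert losses[-1] < losses[0]  # loss decreases on this toy task
```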
Attention
Sparse Attn
import torch
import torch.sparse
# Define the number of queries, keys, and dimensions
n = 10
m = 10
d = 5
# Generate random queries, keys, and values
Q = torch.randn(n, d)
K = torch.randn(m, d)
V = torch.randn(m, d)
# Define the local window size
window_size = 3
# Create a sparse matrix to represent the attention mask
indices = []
for i in range(n):
start = max(0, i - window_size // 2)
end = min(m, i + window_size // 2 + 1)
for j in range(start, end):
indices.append([i, j])
indices = torch.tensor(indices).t() # (nnz, 2) -> (2, nnz). nnz = 28 ("Number of Non-Zero values")
values = torch.ones(indices.size(1)) # (nnz,)
mask = torch.sparse_coo_tensor(indices, values, (n, m)) # indices: 2D where the first dimension is the number of tensor dimensions and the second dimension is the "Number of Non-Zero values".
# Compute the attention scores
scores = torch.matmul(Q, K.t())
# Mask with -inf, not 0: a zeroed score would still get exp(0)=1 softmax weight
scores = scores.masked_fill(mask.to_dense() == 0, float("-inf"))
# Apply softmax to the scores
scores = torch.softmax(scores, dim=-1)
# Compute the output
output = torch.matmul(scores, V)
print(output)
print(mask)
dense = mask.to_dense().int() # 0/1
for i in range(n):
print(f"{i:2d}", "".join("█" if dense[i, j] else "·" for j in range(m)))
Flash Attn
# not yet implemented
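Until then, here is a minimal single-head sketch of the core FlashAttention idea: stream over K/V in tiles while maintaining running softmax statistics (row max `m` and normalizer `l`), so the full n x m score matrix is never materialized. Tile size and shapes are illustrative assumptions, not the real fused kernel:

```python
import torch

def flash_attention(Q, K, V, tile=4):
    n, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)
    m = torch.full((n, 1), float("-inf"))   # running row max
    l = torch.zeros(n, 1)                   # running softmax normalizer
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = (Q @ Kt.T) * scale              # (n, tile) scores for this tile
        m_new = torch.maximum(m, S.max(dim=1, keepdim=True).values)
        alpha = torch.exp(m - m_new)        # rescale previously accumulated stats
        P = torch.exp(S - m_new)
        l = alpha * l + P.sum(dim=1, keepdim=True)
        O = alpha * O + P @ Vt
        m = m_new
    return O / l

# agrees with plain full-matrix attention
torch.manual_seed(0)
Q, K, V = torch.randn(6, 8), torch.randn(10, 8), torch.randn(10, 8)
ref = torch.softmax((Q @ K.T) * 8 ** -0.5, dim=-1) @ V
assert torch.allclose(flash_attention(Q, K, V), ref, atol=1e-5)
```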
KV-Caching
# not yet implemented
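As a placeholder, here is a minimal single-head sketch of the idea: during autoregressive decoding, append each new token's key/value to a cache and attend against the cached prefix instead of recomputing everything. Class/function names and shapes are assumptions for illustration:

```python
import torch

class KVCache:
    def __init__(self):
        self.K = None  # (t, d) keys seen so far
        self.V = None  # (t, d) values seen so far
    def append(self, k, v):  # k, v: (1, d) for the newest token
        self.K = k if self.K is None else torch.cat([self.K, k], dim=0)
        self.V = v if self.V is None else torch.cat([self.V, v], dim=0)
        return self.K, self.V

def decode_step(q, k, v, cache):
    # attend the new query against all cached keys/values (causal by construction)
    K, V = cache.append(k, v)
    attn = torch.softmax((q @ K.T) * q.shape[-1] ** -0.5, dim=-1)
    return attn @ V

# step-by-step decoding matches full causal attention
torch.manual_seed(0)
d, T = 8, 5
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
cache = KVCache()
step_out = torch.cat([decode_step(Q[t:t+1], K[t:t+1], V[t:t+1], cache) for t in range(T)])
scores = (Q @ K.T) * d ** -0.5
scores = scores.masked_fill(torch.triu(torch.ones(T, T), diagonal=1).bool(), float("-inf"))
ref = torch.softmax(scores, dim=-1) @ V
assert torch.allclose(step_out, ref, atol=1e-5)
```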