Creating a Deep Learning Framework from Scratch in C++
Prerequisites
You might be interested in my related articles on Matrix Multiplication and Concurrency before reading further!
Introduction
My goal for this project was to train a simple two-layer multi-layer perceptron (MLP) by building a Deep Learning Framework from scratch.
Representing models with a Directed Acyclic Graph (DAG) that is constructed dynamically typically gives a better debugging experience, so data and computation had to flow through the graph at runtime.
So, I gave myself the following restrictions:
- Learn and incorporate as many design patterns and C++20 features as I could
- Use as few dependencies as possible in my code
This is what a sample training loop might look like with my library:
```cpp
constexpr double LEARNING_RATE = 0.001;
constexpr int    TRAINING_EPOCHS = 300;

// Input batch and target tensors, created through the tensor factory.
auto ma = NeuralNetwork::Computation::Graph::TensorConstructor::create(
    Matrix::Rows(1),
    Matrix::Columns(2000));
auto ground_truth = NeuralNetwork::Computation::Graph::TensorConstructor::create(
    Matrix::Rows(1),
    Matrix::Columns(10));

// Two-layer MLP: (2000 -> 1000) -> ReLU -> (1000 -> 10).
NeuralNetwork::Sequential model;
model.add(std::make_unique<NeuralNetwork::Layer>(
    std::make_unique<NeuralNetwork::MatrixMultiplyStep>(Matrix::Rows(2000), Matrix::Columns(1000)),
    std::make_unique<NeuralNetwork::AddStep>(Matrix::Columns(1000))
));
model.add(std::make_unique<NeuralNetwork::ActivationFunctions::ReLU>());
model.add(std::make_unique<NeuralNetwork::Layer>(
    std::make_unique<NeuralNetwork::MatrixMultiplyStep>(Matrix::Rows(1000), Matrix::Columns(10)),
    std::make_unique<NeuralNetwork::AddStep>(Matrix::Columns(10))
));

auto CE = NeuralNetwork::Computation::Graph::TensorOp(Matrix::Operations::Metric::CrossEntropy{});

for (int i = 0; i < TRAINING_EPOCHS; i++) {
    auto out  = model.forward(ma);
    auto loss = CE(ground_truth, out);
    loss->backwards();  // backpropagate through the recorded graph
    for (auto it = loss->parameters().begin(); it != loss->parameters().end(); ++it) {
        *it += -LEARNING_RATE * it.gradient();  // vanilla gradient descent update
    }
}
```
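The update inside the loop is plain gradient descent: each trainable parameter is moved against its gradient, scaled by the learning rate,

$$\theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}, \qquad \eta = 0.001.$$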
Results
I demonstrated my network improving via its loss function, but I wanted more! Matrix multiplication accounted for the majority of the network's runtime, so it was the critical operation to optimize.
The parallel code in this project uses OpenCilk, which extends C++ with task-parallel keywords and therefore requires its own specialized compiler.
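As a minimal sketch of what that looks like (the function and data below are hypothetical, not part of my library), `cilk_for` marks a loop whose iterations may run in parallel; a stock C++ compiler rejects the keyword, which is exactly why the dedicated toolchain is needed:

```cpp
#include <cilk/cilk.h>   // OpenCilk keywords: cilk_for, cilk_spawn, cilk_sync
#include <cstddef>
#include <vector>

// Hypothetical helper: scale every element of a buffer in parallel.
// Each iteration may be executed by a different Cilk worker.
void scale(std::vector<double>& data, double factor) {
    cilk_for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] *= factor;
    }
}
```

This builds with OpenCilk's clang, e.g. something like `clang++ -fopencilk -O3 scale.cpp`.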
Here I profiled the elapsed time of each operation, the largest being a matrix multiply over 25-million-element (5000 × 5000) operands:
| Operation | Data Size | Time (ms) |
|---|---|---|
| Parallel Matrix Multiply Benchmark | [5000, 5000] x [5000, 5000] | 6,831,126 |
| Cross Entropy Metric Benchmark | [5000, 1] x [5000, 1] | 34 |
| Outer Product Benchmark | [5000, 1] x [5000, 1] | 63,514 |
| Softmax Benchmark | [5000, 1] x [5000, 1] | 5 |
| Hadamard Product Benchmark | [5000, 1] x [5000, 1] | 21 |
| Matrix Transpose Benchmark | [10000, 9000] | 659,166 improved to 66,771 |
In the end, I achieved a 142x speedup on matrix multiplication and a 10x speedup on transpose by implementing parallel divide-and-conquer algorithms and storing matrices in row-major order to reduce cache misses.
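The real kernels live in the repository, but the shape of the approach can be sketched as follows (the `Matrix` struct and function names here are illustrative, not the library's API): recurse on blocks of the output, spawn independent halves with `cilk_spawn`, and fall back to a serial triple loop below a cutoff, where the row-major layout keeps the inner loop streaming over contiguous memory.

```cpp
#include <cilk/cilk.h>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: element (i, j) lives at data[i * cols + j],
// so the innermost loops below walk contiguous memory.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double&       at(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const double& at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// C[ri.., rj..] += A-block * B-block, divide-and-conquer on the output.
// n, m are the output block's rows/columns; p is the shared inner dimension.
void matmul_dac(const Matrix& A, const Matrix& B, Matrix& C,
                std::size_t ri, std::size_t rj, std::size_t rk,
                std::size_t n, std::size_t m, std::size_t p) {
    constexpr std::size_t CUTOFF = 64;  // below this, spawning costs more than it saves
    if (n <= CUTOFF && m <= CUTOFF && p <= CUTOFF) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < p; ++k) {
                const double a = A.at(ri + i, rk + k);
                for (std::size_t j = 0; j < m; ++j)
                    C.at(ri + i, rj + j) += a * B.at(rk + k, rj + j);
            }
        return;
    }
    // Split the largest dimension; the two output halves are independent,
    // so they can run in parallel.
    if (n >= m && n >= p) {
        std::size_t h = n / 2;
        cilk_spawn matmul_dac(A, B, C, ri,     rj, rk, h,     m, p);
                   matmul_dac(A, B, C, ri + h, rj, rk, n - h, m, p);
        cilk_sync;
    } else if (m >= p) {
        std::size_t h = m / 2;
        cilk_spawn matmul_dac(A, B, C, ri, rj,     rk, n, h,     p);
                   matmul_dac(A, B, C, ri, rj + h, rk, n, m - h, p);
        cilk_sync;
    } else {
        std::size_t h = p / 2;
        matmul_dac(A, B, C, ri, rj, rk,     n, m, h);      // serial: both halves
        matmul_dac(A, B, C, ri, rj, rk + h, n, m, p - h);  // accumulate into the same C block
    }
}
```

The one split that stays serial is along the shared inner dimension, since both halves would write into the same block of `C`.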
Design
I structured the framework by encapsulating data in tensors, which are instantiated through a factory pattern and used in tandem with matrix operators; those operators are wrapped in a decorator object, and both are coupled with a global object that holds the graph's edges. A rough sketch follows the list below.
- Each node contains a statically built finite state automaton, used in tandem with traversal policies, such as gradient descent, to update state.
- Additionally, iterator design patterns decouple the different traversal algorithms from the graph itself.
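As a rough sketch of that shape (the internals below are guessed for illustration; only the `TensorConstructor`/`TensorOp` names come from the library), the factory is the sole way to obtain tensors, the op wrapper performs the matrix operation and registers the resulting edges with a global graph object, and traversal policies later walk those edges:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Illustrative sketch only; the real classes live in the Wirikuta repository.
struct Tensor { /* matrix payload, gradient buffer, node id, ... */ };
using TensorPtr = std::shared_ptr<Tensor>;

// Global object holding the DAG's edges, recorded as ops execute (define-by-run).
class ComputationalGraph {
public:
    static ComputationalGraph& instance() {
        static ComputationalGraph g;
        return g;
    }
    void add_edge(TensorPtr parent, TensorPtr child) {
        edges_.emplace_back(std::move(parent), std::move(child));
    }
private:
    std::vector<std::pair<TensorPtr, TensorPtr>> edges_;
};

// Factory: the only way user code obtains tensors.
struct TensorConstructor {
    static TensorPtr create(/* shape arguments */) { return std::make_shared<Tensor>(); }
};

// Decorator-style wrapper: runs the wrapped matrix operation, then records the
// new edges so a later traversal policy (e.g. backpropagation) can find them.
template <typename MatrixOp>
struct TensorOp {
    explicit TensorOp(MatrixOp op) : op_(std::move(op)) {}
    TensorPtr operator()(const TensorPtr& lhs, const TensorPtr& rhs) {
        TensorPtr out = TensorConstructor::create();
        // out's values would be computed by op_(lhs, rhs) here.
        ComputationalGraph::instance().add_edge(lhs, out);
        ComputationalGraph::instance().add_edge(rhs, out);
        return out;
    }
    MatrixOp op_;
};
```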
Debugging
Just to add my own two cents to the build-versus-buy argument: planning ahead and choosing a mature library with linear algebra primitives would have been the smarter choice for sure. Debugging large input matrices was genuinely tough.
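One mitigation I would reach for next time, sketched below rather than something this project actually did, is cross-checking a hand-rolled kernel against a mature library such as Eigen on small random inputs before ever feeding it large matrices:

```cpp
#include <Eigen/Dense>
#include <cassert>
#include <iostream>

// Stand-in for whichever hand-rolled kernel is being debugged.
Eigen::MatrixXd my_matmul(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
    Eigen::MatrixXd C = Eigen::MatrixXd::Zero(A.rows(), B.cols());
    for (Eigen::Index i = 0; i < A.rows(); ++i)
        for (Eigen::Index k = 0; k < A.cols(); ++k)
            for (Eigen::Index j = 0; j < B.cols(); ++j)
                C(i, j) += A(i, k) * B(k, j);
    return C;
}

int main() {
    // Small random matrices are easy to inspect and quick to re-run.
    const Eigen::MatrixXd A = Eigen::MatrixXd::Random(8, 16);
    const Eigen::MatrixXd B = Eigen::MatrixXd::Random(16, 4);
    assert((my_matmul(A, B) - A * B).norm() < 1e-9);
    std::cout << "matmul agrees with Eigen on the 8x16 * 16x4 case\n";
}
```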
Conclusion
This project came at a very personal time in my life. There was a lot of uncertainty in my family, but waking up every day to program kept me going. I learned a great deal and carried those lessons into my next internship.
- Here’s a link to the repository if you’re interested: https://github.com/alejandroarmas/Wirikuta