What are Transformers - Understanding the Architecture End-to-End
5828 words ~28 mins

#Machine Learning
Alejandro Armas and Sachin Loechler have been hard at work on a project that involves developing streaming workloads. The project's goal is to process real-time data and support real-time traffic prediction! In order to make sense of the enormous quantity of unstructured video data, we employed foundation models that perform video tracking, bounding-box detection, depth estimation, and segmentation to extract information from video data. Many of these foundation models rely on an artificial neural network architecture called a Transformer.

Authenticating Data for Experimentation Environment
778 words ~4 mins

#Programming #Data Engineering
Identity Access Management (IAM): In this post, we will explore how to leverage Terraform, a popular Infrastructure as Code (IaC) tool, to automate the setup and management of AWS IAM. We will walk through creating IAM roles, policies, and users, and demonstrate how to attach policies to these entities. Teammate Usage. Figure 2: IAM User Interaction. There are two AWS accounts: one is for an admin, and you have been assigned an IAM user account.

Driving a Data Product - Uncovering Insights and Laying out Assumptions with Exploratory Data Analysis
2142 words ~11 mins

#Programming #Data Engineering
Alejandro Armas and Sachin Loechler have been hard at work on a project that involves developing streaming workloads. The project's goal is to process real-time data and support real-time traffic prediction! However, before I could begin with that, I had to demonstrate the viability of this initiative. It was critical to communicate and achieve consensus on my understanding of the data with the team. In addition to learning about the data, I tested the hypotheses I had and laid out my assumptions.

Enabling a Reproducible Data Experimentation Environment
1844 words ~9 mins

#Programming #Data Engineering
This post is going to detail how I enabled reproducible environments: reproducible Docker builds and optimized data transfers. I enabled reproducible data analysis by leveraging DVC to capture data lineage, provisioning both object storage and IAM policies using Terraform for secure access, optimizing network transfers by 35x, and packaging notebooks via Docker. In this figure, the developer utilizes four main command line tools: DVC, Git, Poetry, and Docker. DVC is configured to pull and push dataset artifacts onto the DVC repository.

Getting Started with PyFlink: My Local Development Experience
1823 words ~9 mins

#Programming #Data Engineering
Background: A hobby project I am working on involves developing streaming workloads. We want to process real-time data and support traffic prediction! Often at the start of tool adoption, and especially when working in a multi-tool ecosystem, I found myself at a familiar roadblock: as the engineer responsible for creating the streaming workloads, I was having a hard time weighing the tradeoffs of which language to use for our data pipeline's tooling.

Winning 3rd place at MLOPS LLM Hackathon: Question & Answer for MLOps System
773 words ~4 mins

This post describes the experience of team RedisCovering LLMs, as we developed a Question & Answer system specialized on MLOps community Slack discussions, armed with GPT-3.5 for precise answers and verifiable references to Slack threads, guarding against misinformation. 1. Introduction Last weekend, I had the opportunity to participate in a 12-hour hackathon organized by the San Francisco Bay Area MLOps Community. It was my third hackathon experience, and the first one I attended through the MLOps Community.

Unveiling Dimensionality Reduction - A Beginner's Guide to Principal Component Analysis
2139 words ~11 mins

#Probability #Mathematics
Introduction Imagine for a second you were transplanted into Olvera Street in LA. It's a Tuesday, but today is a little different. There's a spark in the air. You're not quite sure what to make of it, but you know that today, something great is going to happen. You walk around aimlessly for a while, until your mind begins to get distracted by this huge sense of hunger. "Dang – if only I could have some tacos", you think to yourself.

What is the Difference Between Covariance and Correlation?
738 words ~4 mins

#Probability #Mathematics
Working with data will almost always begin with a data exploration phase. We listen to its heartbeat and ask lots of questions. As we begin this phase, one might ask themselves 'what are the tools we can leverage?'. What do we do to define a linear measure of a relationship between two random variables? In other words, how do we measure the amount of 'increasing X increases Y'-ness, or 'increasing X decreases Y'-ness, in a joint probability distribution?
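The two tools the post's question points at are covariance and correlation. As a quick sketch of the distinction (synthetic data, not from the post; NumPy's standard `np.cov` / `np.corrcoef` functions):

```python
import numpy as np

# Synthetic data with a known linear relationship: y rises when x rises
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

cov = np.cov(x, y)[0, 1]       # scale-dependent linear measure of co-movement
corr = np.corrcoef(x, y)[0, 1] # covariance normalized to the range [-1, 1]

print(cov > 0, -1.0 <= corr <= 1.0)
```

Both numbers are positive here, capturing the 'increasing X increases Y'-ness; correlation additionally strips out the units, so it stays comparable across datasets.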

Unlocking the Power of Joint Distributions - How to Analyze Multiple Random Variables
1124 words ~6 mins

#Probability #Mathematics
The concept of joint distribution is useful when studying the outcomes and effects of multiple random variables in statistics. Joint distribution allows generalizing probability theory to the multivariate case. Let me paint a story for you. Joint Distributions Today, the weather is nice. It's a fresh summer morning. You're out at a restaurant having breakfast with your in-laws and you want to impress. You're such a nice person, you think to yourself.

Breaking Down Virtual Memory: The Role of Paging in Modern Operating Systems
882 words ~5 mins

#Programming #Operating Systems
Introduction Have you ever wondered what 32-bit and 64-bit mean when they get thrown around? So did I. Well, the simple answer is that these refer to the amount of memory addressable to a program, or more accurately, the computer architecture's bit width with respect to registers and address buses. Now let's see how much this amounts to: \(2^{32} = 4,294,967,296\) Bytes, or more succinctly 4 GiB. In modern days we are able to address \(2^{64} = 18,446,744,073,709,551,616\) Bytes, or 16 EiB.
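The arithmetic in the excerpt can be checked in a few lines (a minimal sketch; the helper name is mine, not from the post):

```python
def addressable_bytes(bits: int) -> int:
    """Bytes addressable with the given address-bus/register width."""
    return 2 ** bits

# 2^32 bytes = 4 GiB; 2^64 bytes = 16 EiB
print(addressable_bytes(32) == 4 * 1024**3)   # 4 GiB, i.e. 4,294,967,296
print(addressable_bytes(64) == 16 * 1024**6)  # 16 EiB
```

Since 1024 = 2^10, 4 GiB is 2^2 * 2^30 = 2^32 and 16 EiB is 2^4 * 2^60 = 2^64, matching the figures quoted above.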