Samy Cherfaoui

Hey there!

I am currently a software engineer working on infrastructure and developer productivity at Oscar Health. I am simultaneously pursuing a Master's Degree at Stanford University in computer science, concentrating in Information Systems and advised by Fredrik Kjolstad. My primary research interest lies in utilizing recent machine learning developments to better extract and categorize human knowledge for smarter model training. Other interests include generative artificial intelligence and designing systems with ML applications in mind. For more information on personal projects I've built on my own time, please jump here. For more information about projects I've completed at work, please jump here. For other relevant experiences, jump here.

I graduated from the University of California, Berkeley (sorry Stanford but 🐻 go bears 🐻) summa cum laude with a major in computer science and a minor in electrical engineering. I was particularly active within the teaching community and was an undergraduate student instructor for our introductory computer science course known as CS61A and our databases course, CS186. I was also an undergraduate student instructor for the first iteration of Cal's data engineering course. For this, I was recognized with an Outstanding Graduate Student Instructor Award in 2022 and was also cited in this paper about the experience.

Some of my interests include actively contributing to Wikipedia (especially the Vital Articles Level 5 project which seeks to identify the 50,000 most important articles on Wikipedia for the purposes of prioritizing improvements and for me personally, as a way of boiling down all of human knowledge into as much of its essence as possible), education technology, reading and writing (most of my writing is done in private but I am planning on starting up a new public series on the math behind the most popular generative machine learning models - stay tuned!), playing less-than-game-theory-optimal poker, being the world's worst amateur DJ, road trips (did a cross-country road trip last summer traversing 19 states and 2 countries - worst state: Indiana (sorry), most underrated state: South Dakota), and both playing and watching soccer. If you're curious about my name and why there is only one 'm' in Samy and why my last name has all the vowels, it's because both my parents are from Algeria, a beautiful country in North Africa.

If you'd like something more formal, check out my resume. If you have any questions or opportunities for me, I'm reachable at scherfaoui@berkeley.edu.


What I Do


I am a software engineer working at Oscar Health on the Developer Productivity team within the infrastructure department. This was the perfect job for a new grad: our team owns almost every part of the critical infrastructure at Oscar and yet we are a very small team. I was forced to learn quickly, lead projects with massive impact, and wear many hats. My projects include or have included:

  • Co-leading the Doscar project, which involved constructing a machine-agnostic, standardized developer environment through Docker containers, taking the product from 0 to nearly 100% adoption. Our previous solution involved running a script which would often lead to impossible-to-debug bugs and different setups due to operating system upgrades and other oddities. On top of implementing the core of the design with the other leads, I developed a completely novel solution to support PyCharm on Docker-based coding environments which, at the time, even JetBrains, the parent company of PyCharm, did not offer. I also implemented support for sidecars, most popularly Redis and Postgres for testing.
  • Co-writing and productionizing a Kubernetes migration workflow tool that automates migrating existing apps to Kubernetes. It is used directly or indirectly to construct over 60% of the Kubernetes app configs currently in the codebase. Oscar is in the process of migrating from Apache Aurora to Kubernetes, and this tool automatically parses existing configs to determine the appropriate configuration on Kubernetes. The frontend for this tool was developed on Backstage. In order to get Backstage running on Kubernetes, I had to implement full support for HTTPS services on Kubernetes since there weren't any HTTPS applications running on our cluster yet.
  • Becoming a subject matter expert for our homegrown CI/CD solution, known as Kochiku, and rewriting portions of the API, CLI, Kafka collectors, and our Nomad dispatching logic to ensure full SDLC compliance and reasonable performance on our monorepo. Having a homegrown tool instead of GitHub/GitLab is difficult but was a great learning experience for what exactly needs to go into a CI/CD system (how to scalably partition and then build/test/package targets, how to promote targets to production, how to separate dev-only target builds from prod builds, etc.).
  • Assisting in implementing the first support for Bazel targets in Kochiku, speeding up applicable builds by up to 50% over our previous build tool (Pants). We are currently migrating from Pants to Bazel and I am a lead on that effort.
  • Designing and implementing our entire Kubernetes cronjob alerting suite from scratch, which was necessary because the metrics present within kube-state-metrics differ from those available on our previous solution, Apache Aurora. I also spearheaded a move to alerts written in Jsonnet to facilitate speedier updates to alerts when needed.

Some other work I've done includes overseeing the rollout of some of our observability tooling (Honeycomb and Raygun) and writing a series of scripts to automate the verification of various compliance checklists on our code and builds. I work primarily in Python but I do some coding in Go. I also use Ansible and Terraform as the need arises.

What I've Done


Software Engineering Intern, Oscar Health

I worked on the Messaging team on the Mobile Texting platform. My primary project was rolling out a gRPC API to add support for MMS to our mobile texting platform. Some other projects I worked on included building an events logging platform and analytics integration platform to track members through our mobile texting workflow via AWS SQS, BigQuery, and Postgres.


Software Engineering Intern, Coursera

I worked on the infrastructure team at Coursera as an intern, specifically on the data services workstream. I improved upon the Nostos service, an offline-to-online data cache, by crafting a pipeline from Redshift to S3 to AWS Aurora MySQL written using Scala, Java, and SQL and then reconfiguring the Nostos backend (once again using Scala) to specifically access the new MySQL stores. I configured the new SQL solution from scratch by spinning up Aurora clusters, tuning configurations, and setting up parameter and security groups as well as setting up permissions for S3. I was one of the first testers of Amazon Aurora Postgres 2.0 in a production setting and I realized it was not the best (sorry Amazon). Three bug reports later, I switched to Aurora MySQL and was able to complete the project. There were a few improvements from the previous Cassandra solution but the main feature is that it allowed Coursera to deprecate its Cassandra usage since there were no longer any experts in Cassandra left at the company.


Kleiner Perkins Fellow, Kleiner Perkins

1 of 50 engineering fellows selected from over 3000 applicants. Named the #2 summer internship program by Vault (out of over 400 programs), the KP Fellows Program is an intensive, high-impact summer program for students pursuing careers in engineering, design, and product. In addition to completing an internship at a company within the KP portfolio, fellows attend events held by Kleiner Perkins and various portfolio companies.


Instructor and Software Engineering Intern, Juni Learning

I taught high school students advanced web design with HTML, CSS, and Bootstrap, data analysis with Python, and software design principles with Java. I also developed an interactive sandbox web app (Java server with Spring MVC) as a tool to teach students about memory management in C.


What I've Built


Quick note: I am almost always working on a new side project, though I like to keep projects on the down-low until I actually release them. I am currently working on a smarter search indexing platform utilizing some of the research on Wikipedia embeddings I mention below. If you'd like to learn more about this project or get updated when I release it, feel free to drop me a line.

Some fairly recent, notable projects I'm proud of include:


State-of-the-art Wikipedia article embeddings via category graphs

Probably my favorite project in recent memory. I was able to achieve state-of-the-art Wikipedia topic embedding performance just by using simple but powerful training data augmentation via Wikipedia-provided category constructs. Wikipedia articles are assigned to categories by editors and these categories form a graph. Intuitively, the shorter the path between two articles via their categories is, the more related the articles will be. However, categories are a little-known feature outside of the Wikipedia community and the semantic encodings embedded in the graph structure were not included in the training of existing topic embedding models. The results here were incredible: I was also able to achieve near-SOTA performance just by using PCA on this augmented training data with no need for a complicated neural network at all. I think there are so many clever sources of knowledge that will allow LLMs and other advanced models to extract more with fewer parameters.
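To make the path-length intuition concrete, here is a minimal sketch in Python. It is not the actual training pipeline: the toy articles, categories, and the function names `build_graph` and `category_distance` are all made up for illustration, and it assumes article and category titles never collide. It just runs a breadth-first search over an undirected article/category graph and counts hops.

```python
from collections import deque

# Hypothetical toy data (not real Wikipedia dumps):
# article -> categories it belongs to, category -> parent categories.
ARTICLE_CATEGORIES = {
    "Paris": ["Capitals in Europe"],
    "Berlin": ["Capitals in Europe"],
    "Gradient descent": ["Optimization algorithms"],
}
CATEGORY_PARENTS = {
    "Capitals in Europe": ["Geography"],
    "Optimization algorithms": ["Mathematics"],
    "Geography": ["Main topics"],
    "Mathematics": ["Main topics"],
}


def build_graph(article_categories, category_parents):
    """Build an undirected adjacency map over articles and categories."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for article, categories in article_categories.items():
        for category in categories:
            link(article, category)
    for category, parents in category_parents.items():
        for parent in parents:
            link(category, parent)
    return graph


def category_distance(graph, source, target):
    """Return the number of hops between two nodes, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

On this toy graph, "Paris" and "Berlin" share a category and sit 2 hops apart, while "Paris" and "Gradient descent" only meet at the root category and sit 6 hops apart. A distance like this could then be used to pick related and unrelated article pairs when augmenting training data.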


You Text, We Sketch: Text-Conditioned Vectorized Sketch Generation

A first-of-its-kind text-conditioned sketch generation model which takes in a description of something in natural language and outputs a sketch. It builds off the previous work of the unconditioned model SketchKnitter. Built with the brilliant Yash Shah. The GitHub repository is currently private but if you are interested, let me know and I would be happy to grant access.


Identifying Information Leakage Vectors on GitHub

A security paper I wrote with a friend where we were able to exploit certain behaviors on GitHub to extract extremely sensitive personal data. We are currently working with GitHub to ameliorate some of the issues we noted, so we cannot disclose our code or paper. However, we are aiming to have this published at a security conference by late 2024.


Survey of Neural Network Explainer Algorithms

I wrote this with a friend for a course in graph machine learning. The survey sought to provide context on what explainer and explanation algorithms are in the context of graph neural networks, what kinds are available and how they work, how to compare their performances, and how to implement them in your own models. It was one of the few projects given a shoutout by our professor Jure Leskovec on his Twitter account.


Some older projects I'm also proud of include:

AutoMatminer

AutoMatminer is an automatic prediction engine for materials properties that I worked on with Alex Dunn, Alex Ganose, Qi Wang, and Dr. Anubhav Jain. I also contributed significantly to the documentation. My work was recognized in this paper.


Port Authority

PortAuthority is a simple network utility made with Flask that allows you to input an IP address or domain and a range of ports, and get back the ports that are open within that range at that address. Think Nmap except without all the bells and whistles. Currently at 12 stars on GitHub.
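The core idea (try a TCP connection to each port and keep the ones that accept) can be sketched with nothing but Python's standard `socket` module. This is an illustrative sketch, not the PortAuthority source: the function name `scan_ports` and its parameters are assumptions, and the Flask layer is left out entirely.

```python
import socket


def scan_ports(host, start_port, end_port, timeout=0.5):
    """Return the TCP ports in [start_port, end_port] that accept a connection.

    Hypothetical helper for illustration; it scans sequentially, so wide
    ranges against slow hosts will take a while.
    """
    if not (0 < start_port <= end_port <= 65535):
        raise ValueError("ports must satisfy 0 < start_port <= end_port <= 65535")

    open_ports = []
    for port in range(start_port, end_port + 1):
        try:
            # create_connection resolves the host and completes a TCP handshake;
            # refused, unreachable, and timed-out attempts all raise OSError.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            continue
    return open_ports
```

A call such as `scan_ports("127.0.0.1", 8000, 8100)` would return whichever ports in that range have a listener on the local machine.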


Throwaway

This simple Flask web app will assign a user a temporary email address which they can use as they wish to sign up for accounts, get rewards etc. I built a Python API wrapper for the Temp Mail API and a simple and flexible web interface to see the emails.


Flexfill

Have you ever had trouble adding event handlers for antiquated browsers? Confused about what functions to call to establish polyfill functionality? Well, look no further than flexfill! Flexfill is a want-all catch-all event handler mechanism which does the grunt work for you so you don't need to worry about adding specialized functionality for those dang users who are still using IE6.


This website

I know you like what you see...


The source code for this website is on GitHub.