About Me

Hey there! My name is Angad Singh Puri and I'm a Big Data Engineer at Citigroup, based in Toronto, Canada. My work centers on using big data and machine learning techniques to solve hard, high-impact problems. I graduated with distinction from the University of Waterloo in 2020, where I studied Computer Engineering. During my degree, I accumulated two years of professional work experience, most of it concentrated at the intersection of machine learning and data engineering.

Where I've Worked

Thanks to Waterloo's co-op program, I've had the privilege of completing
six awesome four-month internships. During my internships,
I lived in Waterloo, Ottawa, Toronto, and California!

Jun 2020 - Present, Toronto, ON

  • Building real-time e-trading infrastructure on Kafka for the global credit business.
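
I can't share the internals, but to give a flavour of the plumbing: a minimal sketch of a real-time Kafka producer in Python using confluent-kafka. The broker address, topic name, and message shape are placeholders for illustration, not anything from production:

```python
from confluent_kafka import Producer

# Hypothetical broker and topic names, for illustration only.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    """Log whether each message made it onto the topic."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

# Publish a (made-up) credit quote event to the stream.
producer.produce(
    "credit-quotes",
    key="ISIN:US0000000000",
    value='{"bid": 99.82, "ask": 99.91}',
    callback=delivery_report,
)
producer.flush()  # Block until all queued messages are delivered.
```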

Sept - Dec 2019, Toronto, ON

  • Built and unit tested 5 microservices for collecting, cleaning, deduplicating, and matching 7M+ merchants from multiple data sources across 20 cities, with results stored in BigQuery, and packaged all 5 services in Docker containers.
  • Deployed all 5 services to production with Kubernetes HPA hosted on GCP, and built a Jenkins pipeline to update the images.
  • Developed a Google Pub/Sub service in Java to run real-time ML model predictions, handling 1M+ requests/day (a minimal sketch of this pattern follows this list).
  • Built and unit tested a translation pipeline in Java and Python to translate the Ritual app into 7 languages using Elasticsearch, BigQuery, and Airflow.
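
As a flavour of the Pub/Sub prediction pattern above: a minimal streaming-pull subscriber in Python (the production service was written in Java). The project and subscription names are made up for illustration:

```python
from google.cloud import pubsub_v1

# Hypothetical project and subscription names.
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "merchant-predictions")

def callback(message):
    # In the real service, this is where the ML model scores the payload.
    features = message.data.decode("utf-8")
    print(f"Scoring request: {features}")
    message.ack()  # Acknowledge so Pub/Sub doesn't redeliver.

# Streaming pull keeps a long-lived connection open and dispatches
# incoming messages to the callback on a thread pool.
future = subscriber.subscribe(subscription, callback=callback)
try:
    future.result()  # Block the main thread while messages flow.
except KeyboardInterrupt:
    future.cancel()
```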

Jan - May 2019, Menlo Park, CA

  • Groq was founded by creators of Google's TPU; we're building the world's fastest AI chip, backed by Social Capital.
  • Built a custom real-time analytics benchmarking framework integrating the DeepBench and MLPerf benchmarking suites, comparing the performance of the Groq chip against NVIDIA GPUs using Tableau and TensorFlow.
  • Studied research papers and implemented an in-house tool to quantize custom model protobufs post-training, using TensorFlow's Graph Transform tool.
  • Implemented quantized models from the Attention Is All You Need and ResNet-50 papers in TensorFlow, guided by ML@Berkeley professors, reducing model size by 40% with only a 0.1-0.5% dip in inference accuracy.
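
The Graph Transform tool has since been retired, but the same idea survives in TFLite's post-training quantization. A minimal sketch, assuming a SavedModel on disk at a placeholder path:

```python
import tensorflow as tf

# Post-training weight quantization: shrink the model by storing
# weights in lower precision, trading a small hit in accuracy.
# "saved_model_dir" is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```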

May - Sept 2018, Toronto, ON

  • Built an in-house ad revenue reporting tool: automated hourly ad revenue downloads from Google DFP using Python, transformed the data into Parquet files using Spark SQL, staged it on S3, and built a Tableau dashboard for reporting, connected to a Snowflake data warehouse.
  • Built and automated ETL pipelines to migrate 5 petabytes of data from MongoDB, SAP, Salesforce Krux, and AWS Athena to Snowflake using Spark in Scala, deployed on Databricks clusters.
  • Built and deployed a dockerized Apache Airflow application to production on EC2 with a MySQL backend, and migrated 20 cron jobs to scheduled Airflow workflows, reducing data load failures by 80% (a minimal DAG sketch follows this list).
  • Set up and deployed a containerized Kibana application to a Docker Swarm cluster on top of Elasticsearch to monitor log data, using Logstash to ingest 1M+ lines of daily logs from S3.
  • Developed Python scripts to analyze stakeholders' Tableau queries and made querying the tables 10x faster by writing JavaScript UDFs to warm up Snowflake tables.
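
Here's a minimal sketch of the cron-to-Airflow pattern mentioned above. The DAG id, schedule, and task bodies are placeholders, but the retry settings show where most of the reliability gain over plain cron comes from:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; the real jobs loaded data into Snowflake.
def extract():
    print("pulling hourly revenue data")

def load():
    print("loading into the warehouse")

# Unlike cron, each failed run is retried before anyone gets paged.
with DAG(
    dag_id="hourly_revenue_load",
    start_date=datetime(2018, 6, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # extract must succeed before load runs.
```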

Sept - Dec 2017, Toronto, ON

  • Project 1: Hourly Patient Volume Prediction Model:
    • Developed a prediction model for 200 Dynacare locations with 96% accuracy using a recurrent neural network (RNN); a toy version is sketched after this list.
    • Conducted time-series analysis to investigate seasonality and trends in the data using Matplotlib.
    • Used constraint programming to calculate the wait time for each location and optimize staff counts.
    • Cleaned and processed the raw production data to be fed to the model using Pandas and T-SQL.
    • According to the business evaluation, the company will save US$1.6M/yr by hiring the projected number of staff.
  • Project 2: Automating Data Entry Using Image Recognition:
    • Automated data entry of 20K+ medical forms daily into a MySQL database using a convolutional neural network (CNN).
    • Built an ETL pipeline in Python to collect over 1.3M images, preparing a training dataset of medical forms.
    • Utilized transfer learning on an R-CNN with custom data to extract dialog-box data with 92% accuracy.
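
A toy version of the hourly-volume model from Project 1, assuming sliding windows of 24 hourly counts; the data here is random noise standing in for the real production data:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the hourly-volume data: 1,000 windows of 24 hourly
# counts, each labeled with the next hour's patient volume.
X = np.random.rand(1000, 24, 1).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(24, 1)),  # 24 hourly steps, 1 feature
    tf.keras.layers.Dense(1),                       # next hour's volume
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# Predict the next hour from one location's last 24 hours.
print(model.predict(X[:1]))
```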

Jan - May 2017, Toronto, ON

  • Developed Android applications and services in Java using the native Android SDK for the BlackBerry Hub+ suite.
  • Designed a REST API to keep track of all 6 BlackBerry apps that display ads after 30 days, and designed the UI/UX for the ads.
  • Retrofitted the HTTP implementation, improving API request performance by 30-50%.

May - Sept 2016, Toronto, ON

  • Deployed and managed Linux servers to support research on the Electrical and Computer Engineering network.

Consultant Work

Perks of being an international student:
the side hustle!

Feb - May 2019, Bengaluru, IN

  • Project 1: Company’s Summary Generator:
    • Prepared training data features by scraping HTML DOM nodes, implementing BFS/DFS traversal over 3,500+ unique URLs.
    • Developed an NLP-based web scraper, training a linear SVC to extract a company's summary from its URL with 88% precision.
    • Deployed the model to production by building an asynchronous RESTful API using Flask.
    • Developed a training data pipeline to retrain the model on failed examples, further increasing precision.
  • Project 2: Company Name Identifier:
    • Used spaCy to develop a named entity recognition (NER) model that identifies a company's name from its T&C page with 93% accuracy (a minimal entity lookup is sketched after this list).
    • Built a Prodigy annotation pipeline to prepare training data and retrain the spaCy model on all failed examples.
    • Used FuzzyWuzzy string matching on the tokenized names to measure the model's performance.
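
A minimal version of the entity lookup behind Project 2, using spaCy's off-the-shelf English model. The production model was fine-tuned on annotated T&C pages; the example sentence here is invented:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "These Terms of Service govern your use of Acme Widgets Inc.'s website."
doc = nlp(text)

# Keep only organization entities as company-name candidates.
companies = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(companies)  # e.g. ["Acme Widgets Inc."]
```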

May - Sept 2018, Redwood City, CA

  • Automated training data preparation by querying and cleaning nested JSON data from RethinkDB using Python.
  • Developed a time-series model using k-nearest neighbors to predict a driver's destination from their current location (geospatial data) with 83% accuracy.
  • Built a classification model using an artificial neural network (ANN) to recommend drivers who can carry freight, with 92% accuracy.
  • Deployed the ML models to production by implementing a RESTful API using Flask, with a Redis queue feeding the driver's live location to the model (a minimal sketch follows this list).
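
A minimal sketch of that serving setup with Flask and RQ; the endpoint, payload shape, and model stub are placeholders:

```python
from flask import Flask, jsonify, request
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue(connection=Redis())  # An RQ worker picks jobs off this queue.

# Stub for the real model; the production one was a KNN over geospatial data.
# In a real deployment this lives in an importable module so workers can run it.
def predict_destination(lat, lon):
    return {"destination": "depot_7"}

@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json()
    # Enqueue the live location so a worker scores it asynchronously,
    # keeping the API responsive under load.
    job = queue.enqueue(predict_destination, body["lat"], body["lon"])
    return jsonify({"job_id": job.get_id()}), 202

if __name__ == "__main__":
    app.run()
```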

What's Next?

I'm always up for a new challenge, and my inbox is always open.
Whether you have an idea for a project or just want to chat, feel free to shoot me an email!