Hi, I am Mutian He

Machine Learning & Data Engineering

Projects

FinSight AI

Cloud & AI Engineering

Engineered an end-to-end document intelligence pipeline parsing SEC 10-K filings (HTML + PDF) via BeautifulSoup and AWS Textract, extracting and structuring Risk Factor disclosures into standardized JSON. Integrated AWS Bedrock (Nova Lite) to auto-classify risks into 13 categories and generate executive-level summaries, enabling AI-powered year-over-year and cross-company risk comparison.

Multimodal Image Search Engine

Multimodal ML

Built a CLIP-based multimodal retrieval system for text-to-image search by projecting images and text queries into a shared embedding space. Designed the full pipeline from data preprocessing and batch embedding generation to FAISS ANN indexing, and evaluated retrieval quality with Recall@1/5/10 and Median Rank.
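A minimal sketch of how the Recall@K and Median Rank evaluation might be computed from a query-by-image similarity matrix. The function name `retrieval_metrics`, the assumption that query i matches image i, and the toy scores are illustrative only and not taken from the project code.

```python
from statistics import median


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Compute Recall@K and median rank.

    similarity[i][j] is the score of text query i against image j.
    Assumption: the correct image for query i is image i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of images scoring strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)
    recall = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


# Toy 3x3 similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct image ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct image ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct image ranked 3rd
]
recall, med_rank = retrieval_metrics(sim, ks=(1, 2))
print(recall, med_rank)
```

In a real pipeline the similarity rows would come from the top results of an index search rather than a dense matrix, but the rank-based metrics are defined the same way.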

End-to-End Instacart Reorder Prediction System

ML Engineering

Built an end-to-end machine learning pipeline to predict user reorder behavior on the Instacart platform. The work centers on data engineering and ML workflow design: ETL, feature aggregation, temporal data splitting, model training, and inference. Trained a Random Forest model on user–product interaction data, with emphasis on preventing data leakage and keeping the pipeline reproducible and production-oriented.
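One way temporal splitting can guard against leakage is to build features only from each user's earlier orders and take labels from the held-out latest order. The sketch below illustrates that idea; the field names (`user_id`, `order_number`, `product_id`), the helper `temporal_split`, and the toy rows are assumptions for illustration, not the project's actual schema or code.

```python
from collections import Counter


def temporal_split(orders, n_holdout=1):
    """Split order rows per user: the last n_holdout orders form the label
    period, and everything earlier is history available for features."""
    last_order = {}
    for o in orders:
        uid = o["user_id"]
        last_order[uid] = max(last_order.get(uid, 0), o["order_number"])

    history, target = [], []
    for o in orders:
        cutoff = last_order[o["user_id"]] - n_holdout
        (history if o["order_number"] <= cutoff else target).append(o)
    return history, target


# Toy data (hypothetical).
orders = [
    {"user_id": 1, "order_number": 1, "product_id": "milk"},
    {"user_id": 1, "order_number": 2, "product_id": "eggs"},
    {"user_id": 1, "order_number": 3, "product_id": "milk"},
    {"user_id": 2, "order_number": 1, "product_id": "tea"},
    {"user_id": 2, "order_number": 2, "product_id": "tea"},
]
history, target = temporal_split(orders)

# Features come from history only; labels come from the held-out order.
prior_counts = Counter((o["user_id"], o["product_id"]) for o in history)
reordered = {(o["user_id"], o["product_id"]) for o in target}
labels = {pair: int(pair in reordered) for pair in prior_counts}
print(labels)
```

Because the aggregation never reads rows from the label period, a feature such as the prior purchase count cannot encode the outcome it is meant to predict.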

Jenkins as a Service (JaaS)

DevOps Platform Engineering

Designed an enterprise Jenkins-as-a-Service platform to replace fragmented CI/CD tooling across teams. Standardized pipeline templates, centralized RBAC and audit logging, and planned rollback-oriented release workflows on VMware to improve delivery reliability, security compliance, and operating efficiency.

Bank Marketing Subscription Predictor

Machine Learning

Built an end-to-end ML classification pipeline to predict customer subscription likelihood for bank term deposits. Addressed 88% class imbalance using SMOTE, tuned the decision threshold to balance the recall/precision trade-off, and applied SHAP values to deliver interpretable, business-actionable insights. Achieved a ROC-AUC of 0.80 with Random Forest.
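A rough sketch of decision-threshold tuning on predicted probabilities. Selecting the threshold by F1 is an assumption made here for illustration (the project may have used a different criterion), and the helper names and toy labels and scores are hypothetical.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for p >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    fn = sum(1 for y, p in pairs if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 score."""
    def f1(t):
        p, r = precision_recall_at(y_true, y_prob, t)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(candidates, key=f1)


# Toy imbalanced example (made-up values): 3 positives out of 8.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.80]

chosen = best_threshold(y_true, y_prob, candidates=[0.3, 0.5, 0.7])
print(chosen, precision_recall_at(y_true, y_prob, chosen))
```

Lowering the threshold raises recall at the cost of precision, so the selection criterion should reflect how costly a missed subscriber is relative to a wasted contact.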

Cloud-Based Real-Time Stock Data Pipeline

Data Engineering

Built a cloud-based real-time stock data streaming pipeline using Apache Kafka on AWS EC2. Implemented Python producers and consumers to simulate live market data ingestion. Persisted streaming data to Amazon S3 and integrated AWS Glue Data Catalog and Amazon Athena to enable scalable, serverless SQL analytics.

Spotify Podcast Popularity Analysis

Data Analysis

Analyzed 228,000+ Spotify podcast episodes to identify factors driving Top 10 rankings. Performed EDA across 22 countries, engineered predictive features from audio/video and genre attributes, and trained a Random Forest classifier evaluated by accuracy and AUC.

Real-Time Flight Delay Prediction System

ML Engineering

Developed an end-to-end machine learning system to predict flight delays using high-cardinality categorical features such as airline carriers and origin–destination pairs. The project focuses on production-oriented ML engineering, including feature processing, model training with CatBoost, and real-time inference through an interactive web interface.

Skills

Contact

Email me