Atul Kumar Pandey

Available for hire • Boston, USA


Data Engineer

Building production data pipelines that process 500K+ records daily for a platform serving 150M+ users, across AWS, GCP, and Azure.

Specialized in real-time streaming, lakehouse architecture, and GenAI/RAG infrastructure — with proven impact on operational efficiency and data quality.

6+
Years Experience
40
Team Members Led
4.0
GPA (MS Analytics)

Measurable Impact

Quantifiable results from data engineering initiatives across different organizations

75%

Processing Time Cut

Reduced manual data processing effort on 500K+ daily records

500K+

Records / Day

Production ETL pipelines serving a 150M+ user platform

99%

SLA Achievement

Consistent data pipeline reliability and reporting adherence

15%

Analytics-Driven Insights

Increase from unified data model architecture

40

Team Members Led

Cross-functional Agile data engineering team

5min

CDC Latency

End-to-end change data capture from S3 to warehouse

96%

Customer Retention

Achieved through data-driven e-commerce operations

20%

Faster Resolution

Via Salesforce automation and Python-driven workflows

About Me

I'm a data engineer with 6+ years of experience building production pipelines and analytics infrastructure. My journey began in e-commerce operations, where I discovered the transformative power of data-driven decision making.

Currently pursuing an MS in Business Analytics (AI and Data Analytics) with a perfect 4.0 GPA at UMass Boston, I specialize in building enterprise-scale data platforms that process hundreds of thousands of records daily for platforms serving 150M+ users.

My expertise spans the full data stack: from real-time streaming with Kafka and Spark to lakehouse architectures on Databricks, and cutting-edge GenAI infrastructure including RAG pipelines with LangChain and vector databases.

Current Focus

  • Real-time streaming & event-driven architectures
  • Lakehouse & medallion architecture patterns
  • GenAI infrastructure: RAG pipelines & vector databases
  • MLOps pipeline automation & monitoring

Core Principles

Observability
Reliability
Simplicity
Scalability

Technical Expertise

Comprehensive skill set spanning the modern data engineering stack

Cloud Platforms

Amazon Web Services

S3 · Glue · Lambda · Redshift · EMR · Kinesis · Athena

Google Cloud Platform

BigQuery · Dataflow · Pub/Sub · Composer

Microsoft Azure

Databricks · ADLS Gen2 · Synapse · Data Factory

Big Data & Streaming

Processing Engines

Apache Spark · PySpark · Spark Structured Streaming · Apache Flink

Streaming Platforms

Apache Kafka · AWS Kinesis · GCP Pub/Sub · Apache NiFi

GenAI Infrastructure

NEW

RAG & LLMOps

RAG Pipelines · LangChain · LLMOps

Vector Databases

Pinecone · ChromaDB · Embeddings

AI Integrations

Semantic Search · Hybrid Retrieval · BM25 + Vector

Data Warehousing & Orchestration

Data Warehouses

Snowflake · BigQuery · Redshift · Delta Lake

Orchestration & Transform

Apache Airflow · dbt · Mage AI

Data Governance

Great Expectations · CDC / SCD · IAM

Programming & DevOps

Languages

Python · SQL · PySpark · R · Bash

DevOps & IaC

Docker · Terraform · GitHub Actions · CI/CD

Visualization

Tableau · Power BI · Looker · Streamlit

Featured Projects

Real-world data engineering solutions with measurable business impact

FEATURED · REAL-TIME · GenAI + RAG

Real-Time Stock Market Analytics & AI Intelligence Platform

Lambda-style batch + streaming architecture with RAG-powered market intelligence at sub-second latency

Data Flow

Market API → Kafka Topics → Spark Streaming → dbt → Snowflake → RAG / Pinecone

Key Features

  • 15-min & 1-hr window aggregations via Spark Structured Streaming
  • dbt transformation layer — versioned, auditable analytics in Snowflake
  • RAG module: SEC filings → Pinecone → LangChain natural language queries
  • CI/CD via GitHub Actions: linting, unit tests, dbt validation, Docker builds
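The 15-minute window aggregation above can be illustrated with a plain-Python sketch of the tumbling-window grouping that Spark Structured Streaming performs at scale. The tick format and the OHLC output fields here are illustrative assumptions, not taken from the project code:

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60  # 15-minute tumbling windows

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)

def aggregate_ticks(ticks):
    """Group (timestamp, symbol, price) ticks into per-window OHLC aggregates,
    mirroring a groupBy(window(...), symbol) in Spark Structured Streaming."""
    buckets = defaultdict(list)
    for ts, symbol, price in ticks:
        buckets[(window_start(ts), symbol)].append(price)
    return {
        key: {"open": p[0], "high": max(p), "low": min(p), "close": p[-1]}
        for key, p in buckets.items()
    }

# Hypothetical ticks: two fall in the 09:30 window, one in the 09:45 window.
ticks = [
    (datetime(2024, 1, 2, 9, 31, tzinfo=timezone.utc), "AAPL", 185.0),
    (datetime(2024, 1, 2, 9, 44, tzinfo=timezone.utc), "AAPL", 186.2),
    (datetime(2024, 1, 2, 9, 47, tzinfo=timezone.utc), "AAPL", 185.8),
]
agg = aggregate_ticks(ticks)
```

In the real pipeline the engine also handles late data and watermarking; this sketch only shows the bucketing and aggregation semantics.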

Impact

Latency: Sub-second
Architecture: Lambda (batch + stream)
AI Layer: RAG + Vector Search
Kafka · Spark Streaming · dbt · Snowflake · LangChain · Pinecone · Airflow · Docker · GitHub Actions
View on GitHub
GenAI RAG PIPELINE

Enterprise Document Intelligence RAG Pipeline

Airflow-orchestrated RAG system with hybrid semantic search and CDC-driven embedding auto-refresh

Pipeline Flow

PDFs / Reports → Recursive Chunking → LangChain Embeddings → ChromaDB → Hybrid Search API

Key Features

  • Hybrid retrieval: BM25 + vector similarity with reranking for high precision
  • CDC-driven auto-refresh: embeddings update when source documents change
  • Semantic search API for enterprise document Q&A
  • Dockerized deployment with Airflow sensor orchestration
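As a rough illustration of the hybrid-retrieval idea (not the project's LangChain/ChromaDB code), a minimal weighted score-fusion sketch might look like the following; the `alpha` weight and all document data are assumptions for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, lexical_scores, doc_vecs, alpha=0.5):
    """Fuse max-normalized lexical (BM25-style) scores with vector similarity.
    lexical_scores: {doc_id: bm25_score}; doc_vecs: {doc_id: embedding}.
    Returns doc_ids ranked by the fused score, best first."""
    max_lex = max(lexical_scores.values(), default=0.0) or 1.0
    fused = {}
    for doc_id, vec in doc_vecs.items():
        lex = lexical_scores.get(doc_id, 0.0) / max_lex
        sem = cosine(query_vec, vec)
        fused[doc_id] = alpha * lex + (1 - alpha) * sem
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: doc "a" matches semantically, doc "b" matches lexically.
ranking = hybrid_rank(
    query_vec=[1.0, 0.0],
    lexical_scores={"a": 2.0, "b": 8.0},
    doc_vecs={"a": [1.0, 0.0], "b": [0.0, 1.0]},
)
```

A reranking stage (as in the project's pipeline) would then rescore the top-k fused candidates with a heavier model; that step is omitted here.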

Technical Highlights

Retrieval: BM25 + Vector Hybrid
Vector Store: ChromaDB
Freshness: CDC auto-refresh
LangChain · ChromaDB · Airflow · BM25 · Python · Docker · CDC
View on GitHub

Professional Journey

Progressive career growth from operations to data engineering leadership

Data Engineer (Assistant Manager)

Think and Learn Pvt Ltd (Byju's) Jul 2022 – Jul 2024

Data engineering & analytics leadership for India's largest ed-tech platform (150M+ users)

Key Achievements

  • Built production ETL pipelines processing 500K+ daily records across student activity, attendance, assessments, and class interactions
  • 75% reduction in manual data processing effort through Python/SQL automation
  • Led cross-functional team of up to 40 members in an Agile environment — roadmaps, mentoring, stakeholder coordination
  • Unified data model architecture integrating siloed platforms — 15% increase in analytics-driven insights

Technical Impact

Data Quality Framework

Reusable schema validation, null checks, SLA monitoring — standardized across analytics verticals
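A minimal sketch of what such reusable row-level checks can look like; the column names below are hypothetical, and the production framework was standardized with richer tooling than shown here:

```python
def run_quality_checks(rows, schema, non_null):
    """Reusable checks: type validation per `schema` ({column: type})
    and null checks for the `non_null` columns.
    Returns a list of human-readable violations (empty list = pass)."""
    violations = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                violations.append(f"row {i}: missing column '{col}'")
            elif row[col] is not None and not isinstance(row[col], expected):
                violations.append(f"row {i}: '{col}' expected {expected.__name__}")
        for col in non_null:
            if row.get(col) is None:
                violations.append(f"row {i}: '{col}' is null")
    return violations

# Hypothetical student-activity rows: the second row violates two checks.
rows = [
    {"student_id": 1, "score": 88.5},
    {"student_id": None, "score": "bad"},
]
report = run_quality_checks(rows, {"student_id": int, "score": float}, ["student_id"])
```

In practice an SLA monitor would gate the downstream DAG tasks on an empty report and alert otherwise.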

Airflow Orchestration

Built and optimized DAGs across ingestion, transformation, and reporting with automated alerting and retry logic

Python · SQL · AWS · Airflow · Tableau · Data Quality · Agile/Scrum

Data Analyst (Product Expert)

Think and Learn Pvt Ltd (Byju's) Oct 2020 – Jun 2022

Analytics leadership for customer operations team of 20+ members

Key Achievements

  • Automated Salesforce case routing using Python, reducing issue resolution time by 20%
  • Built executive KPI dashboards in Power BI and SQL for strategic decision-making
  • Maintained 99% SLA adherence for reporting and data workflows

Technical Work

Automated Reporting Pipelines

Eliminated manual overhead by automating extraction from Salesforce and internal databases

Operational Monitoring

Real-time tracking of team productivity and business-critical metrics

Salesforce · Python · SQL · Power BI · KPI Dashboards

Operations Manager (E-Commerce Ops)

Binarify Jun 2018 – Sep 2020

E-commerce operations across Amazon, eBay, and Shopify platforms

Key Achievements

  • 96% customer retention through business analytics and data-driven process optimization
  • Built inventory tracking dashboards and streamlined order-to-delivery workflows
  • Developed logistics reporting & vendor tracking systems in SQL — foundation for data automation

Process Automation

Fulfillment Efficiency

Improved operational visibility and fulfillment workflows across multi-platform e-commerce

SQL · E-Commerce Analytics · Inventory Management · Amazon/eBay/Shopify

Education & Certifications

Continuous learning through formal education and industry certifications

Formal Education

MS Business Analytics (AI and Data Analytics)

University of Massachusetts Boston

GPA: 4.0 • Expected May 2026

Advanced Machine Learning, Predictive Analytics, Big Data Processing, Statistical Modeling

BE Mechanical Engineering

Visvesvaraya Technological University, Bangalore

GPA: 3.8 • Jun 2018

Senior project on manufacturing process optimization

Professional Certifications

Data Engineering Professional Certificate

DeepLearning.AI • 2024

Vector Databases & Embeddings, RAG, LangChain for LLM Applications, LLMOps

DeepLearning.AI • 2024

Data Warehousing with Snowflake

Datavidhya • 2024

Apache Spark with Databricks

Datavidhya • 2024

Apache Airflow & Apache Kafka

Datavidhya • 2024

Let's Build Something Amazing

Ready to discuss your data engineering needs? Let's connect and explore how we can transform your data infrastructure.