Atul Kumar Pandey

Available for hire • Boston, USA


Data Engineer

Building production data pipelines that process 500K+ records daily for a platform serving 150M+ users, across AWS, GCP, and Azure.

Specialized in real-time streaming, lakehouse architecture, and GenAI/RAG infrastructure — with proven impact on operational efficiency and data quality.

6+
Years Experience
40
Team Members Led
4.0
GPA (MS Analytics)

Measurable Impact

Quantifiable results from data engineering initiatives across different organizations

75%

Processing Time Cut

Reduced manual data processing effort on 500K+ daily records

500K+

Records / Day

Production ETL pipelines serving a 150M+ user platform

99%

SLA Achievement

Consistent data pipeline reliability and reporting adherence

15%

Analytics-Driven Insights

Increase from unified data model architecture

40

Team Members Led

Cross-functional Agile data engineering team

5min

CDC Latency

End-to-end change data capture from S3 to warehouse

96%

Customer Retention

Achieved through data-driven e-commerce operations

20%

Faster Resolution

Via Salesforce automation and Python-driven workflows

About Me

I'm a data engineer with 6+ years of experience building production pipelines and analytics infrastructure. My journey began in e-commerce operations, where I discovered the transformative power of data-driven decision making.

Currently pursuing an MS in Business Analytics (AI and Data Analytics) with a perfect 4.0 GPA at UMass Boston, I specialize in building enterprise-scale data platforms that process hundreds of thousands of records daily for platforms serving 150M+ users.

My expertise spans the full data stack: from real-time streaming with Kafka and Spark to lakehouse architectures on Databricks, and cutting-edge GenAI infrastructure including RAG pipelines with LangChain and vector databases.

Current Focus

  • Real-time streaming & event-driven architectures
  • Lakehouse & medallion architecture patterns
  • GenAI infrastructure: RAG pipelines & vector databases
  • MLOps pipeline automation & monitoring

Core Principles

Observability
Reliability
Simplicity
Scalability

Technical Expertise

Comprehensive skill set spanning the modern data engineering stack

Cloud Platforms

Amazon Web Services

S3 · Glue · Lambda · Redshift · EMR · Kinesis · Athena

Google Cloud Platform

BigQuery · Dataflow · Pub/Sub · Composer

Microsoft Azure

Databricks · ADLS Gen2 · Synapse · Data Factory

Big Data & Streaming

Processing Engines

Apache Spark · PySpark · Spark Structured Streaming · Apache Flink

Streaming Platforms

Apache Kafka · AWS Kinesis · GCP Pub/Sub · Apache NiFi

GenAI Infrastructure

NEW

RAG & LLMOps

RAG Pipelines · LangChain · LLMOps

Vector Databases

Pinecone · ChromaDB · Embeddings

AI Integrations

Semantic Search · Hybrid Retrieval · BM25 + Vector

Data Warehousing & Orchestration

Data Warehouses

Snowflake · BigQuery · Redshift · Delta Lake

Orchestration & Transform

Apache Airflow · dbt · Mage AI

Data Governance

Great Expectations · CDC / SCD · IAM

Programming & DevOps

Languages

Python · SQL · PySpark · R · Bash

DevOps & IaC

Docker · Terraform · GitHub Actions · CI/CD

Visualization

Tableau · Power BI · Looker · Streamlit

Featured Projects

Real-world data engineering solutions with measurable business impact

FEATURED · REAL-TIME · GenAI + RAG

Real-Time Stock Market Analytics & AI Intelligence Platform

Lambda-style batch + streaming architecture with RAG-powered market intelligence at sub-second latency

Data Flow

Market API → Kafka Topics → Spark Streaming → dbt → Snowflake → RAG / Pinecone

Key Features

  • 15-min & 1-hr window aggregations via Spark Structured Streaming
  • dbt transformation layer — versioned, auditable analytics in Snowflake
  • RAG module: SEC filings → Pinecone → LangChain natural language queries
  • CI/CD via GitHub Actions: linting, unit tests, dbt validation, Docker builds
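The 15-minute window aggregation above can be illustrated with a plain-Python sketch of the tumbling-window grouping that Spark Structured Streaming performs at scale. The tick format and the OHLC output fields here are illustrative assumptions, not taken from the project code:

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60  # 15-minute tumbling windows

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)

def aggregate_ticks(ticks):
    """Group (timestamp, symbol, price) ticks into per-window OHLC aggregates,
    mirroring a groupBy(window(...), symbol) in Spark Structured Streaming."""
    buckets = defaultdict(list)
    for ts, symbol, price in ticks:
        buckets[(window_start(ts), symbol)].append(price)
    return {
        key: {"open": p[0], "high": max(p), "low": min(p), "close": p[-1]}
        for key, p in buckets.items()
    }

# Hypothetical ticks: two fall in the 09:30 window, one in the 09:45 window.
ticks = [
    (datetime(2024, 1, 2, 9, 31, tzinfo=timezone.utc), "AAPL", 185.0),
    (datetime(2024, 1, 2, 9, 44, tzinfo=timezone.utc), "AAPL", 186.2),
    (datetime(2024, 1, 2, 9, 47, tzinfo=timezone.utc), "AAPL", 185.8),
]
agg = aggregate_ticks(ticks)
```

In the real pipeline the engine also handles late data and watermarking; this sketch only shows the bucketing and aggregation semantics.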

Impact

Latency: Sub-second
Architecture: Lambda (batch + stream)
AI Layer: RAG + Vector Search
Kafka · Spark Streaming · dbt · Snowflake · LangChain · Pinecone · Airflow · Docker · GitHub Actions
View on GitHub
GenAI RAG PIPELINE

Enterprise Document Intelligence RAG Pipeline

Airflow-orchestrated RAG system with hybrid semantic search and CDC-driven embedding auto-refresh

Pipeline Flow

PDFs / Reports → Recursive Chunking → LangChain Embeddings → ChromaDB → Hybrid Search API

Key Features

  • Hybrid retrieval: BM25 + vector similarity with reranking for high precision
  • CDC-driven auto-refresh: embeddings update when source documents change
  • Semantic search API for enterprise document Q&A
  • Dockerized deployment with Airflow sensor orchestration
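As a rough illustration of the hybrid-retrieval idea (not the project's LangChain/ChromaDB code), a minimal weighted score-fusion sketch might look like the following; the `alpha` weight and all document data are assumptions for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, lexical_scores, doc_vecs, alpha=0.5):
    """Fuse max-normalized lexical (BM25-style) scores with vector similarity.
    lexical_scores: {doc_id: bm25_score}; doc_vecs: {doc_id: embedding}.
    Returns doc_ids ranked by the fused score, best first."""
    max_lex = max(lexical_scores.values(), default=0.0) or 1.0
    fused = {}
    for doc_id, vec in doc_vecs.items():
        lex = lexical_scores.get(doc_id, 0.0) / max_lex
        sem = cosine(query_vec, vec)
        fused[doc_id] = alpha * lex + (1 - alpha) * sem
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: doc "a" matches semantically, doc "b" matches lexically.
ranking = hybrid_rank(
    query_vec=[1.0, 0.0],
    lexical_scores={"a": 2.0, "b": 8.0},
    doc_vecs={"a": [1.0, 0.0], "b": [0.0, 1.0]},
)
```

A reranking stage (as in the project's pipeline) would then rescore the top-k fused candidates with a heavier model; that step is omitted here.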

Technical Highlights

Retrieval: BM25 + Vector Hybrid
Vector Store: ChromaDB
Freshness: CDC auto-refresh
LangChain · ChromaDB · Airflow · BM25 · Python · Docker · CDC
View on GitHub

Professional Journey

Progressive career growth from operations to data engineering leadership

Data Engineer (Assistant Manager)

Think and Learn Pvt Ltd (Byju's) Jul 2022 – Jul 2024

Data engineering & analytics leadership for India's largest ed-tech platform (150M+ users)

Key Achievements

  • Built production ETL pipelines processing 500K+ daily records across student activity, attendance, assessments, and class interactions
  • 75% reduction in manual data processing effort through Python/SQL automation
  • Led cross-functional team of up to 40 members in an Agile environment — roadmaps, mentoring, stakeholder coordination
  • Unified data model architecture integrating siloed platforms — 15% increase in analytics-driven insights

Technical Impact

Data Quality Framework

Reusable schema validation, null checks, SLA monitoring — standardized across analytics verticals
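A minimal sketch of what such reusable row-level checks can look like; the column names below are hypothetical, and the production framework was standardized with richer tooling than shown here:

```python
def run_quality_checks(rows, schema, non_null):
    """Reusable checks: type validation per `schema` ({column: type})
    and null checks for the `non_null` columns.
    Returns a list of human-readable violations (empty list = pass)."""
    violations = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                violations.append(f"row {i}: missing column '{col}'")
            elif row[col] is not None and not isinstance(row[col], expected):
                violations.append(f"row {i}: '{col}' expected {expected.__name__}")
        for col in non_null:
            if row.get(col) is None:
                violations.append(f"row {i}: '{col}' is null")
    return violations

# Hypothetical student-activity rows: the second row violates two checks.
rows = [
    {"student_id": 1, "score": 88.5},
    {"student_id": None, "score": "bad"},
]
report = run_quality_checks(rows, {"student_id": int, "score": float}, ["student_id"])
```

In practice an SLA monitor would gate the downstream DAG tasks on an empty report and alert otherwise.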

Airflow Orchestration

Built and optimized DAGs across ingestion, transformation, and reporting with automated alerting and retry logic

Python · SQL · AWS · Airflow · Tableau · Data Quality · Agile/Scrum

Data Analyst (Product Expert)

Think and Learn Pvt Ltd (Byju's) Oct 2020 – Jun 2022

Analytics leadership for customer operations team of 20+ members

Key Achievements

  • Automated Salesforce case routing using Python, reducing issue resolution time by 20%
  • Built executive KPI dashboards in Power BI and SQL for strategic decision-making
  • Maintained 99% SLA adherence for reporting and data workflows

Technical Work

Automated Reporting Pipelines

Eliminated manual overhead by automating extraction from Salesforce and internal databases

Operational Monitoring

Real-time tracking of team productivity and business-critical metrics

Salesforce · Python · SQL · Power BI · KPI Dashboards

Operations Manager (E-Commerce Ops)

Binarify Jun 2018 – Sep 2020

E-commerce operations across Amazon, eBay, and Shopify platforms

Key Achievements

  • 96% customer retention through business analytics and data-driven process optimization
  • Built inventory tracking dashboards and streamlined order-to-delivery workflows
  • Developed logistics reporting & vendor tracking systems in SQL — foundation for data automation

Process Automation

Fulfillment Efficiency

Improved operational visibility and fulfillment workflows across multi-platform e-commerce

SQL · E-Commerce Analytics · Inventory Management · Amazon/eBay/Shopify

Education & Certifications

Continuous learning through formal education and industry certifications

Formal Education

MS Business Analytics (AI and Data Analytics)

University of Massachusetts Boston

GPA: 4.0 • Expected May 2026

Advanced Machine Learning, Predictive Analytics, Big Data Processing, Statistical Modeling

BE Mechanical Engineering

Visvesvaraya Technological University, Bangalore

GPA: 3.8 • Jun 2018

Senior project on manufacturing process optimization

Professional Certifications

Data Engineering Professional Certificate

DeepLearning.AI • 2024

Vector Databases & Embeddings, RAG, LangChain for LLM Applications, LLMOps

DeepLearning.AI • 2024

Data Warehousing with Snowflake

Datavidhya • 2024

Apache Spark with Databricks

Datavidhya • 2024

Apache Airflow & Apache Kafka

Datavidhya • 2024

Let's Build Something Amazing

Ready to discuss your data engineering needs? Let's connect and explore how we can transform your data infrastructure.