
Subramanian Venkataraman

ON

Summary

Big Data professional with a strong focus on data architecture, data analytics, and ETL processes. Skilled in Hadoop, Spark, Scala, Kafka, Python, and NoSQL databases, with significant experience in designing and implementing scalable data solutions. Known for effective team collaboration, adaptability to changing project needs, and a results-driven approach that consistently drives project success.

Proactive and goal-oriented professional with excellent time management and problem-solving skills. Known for reliability and adaptability, with swift capacity to learn and apply new skills. Committed to leveraging these qualities to drive team success and contribute to organizational growth.

Overview

30 years of professional experience

Work History

Sr. Big Data Engineer

Citi Bank
10.2019 - 08.2024
  • Developed an ETL (Extract, Transform, Load) framework in Spark, Scala, Java, Kafka, Oracle, and Hive as a complete transformation solution for the Mortgage product, building reusable methods for reading data from and writing data to Oracle, reading and writing AVRO files, parsing input ZIP files, and loading data into the Hive staging and Hive target environments
  • Implemented a comprehensive data reconciliation framework that reads Kafka messages in JSON format, validates and parses them, and responds with COMPLETED or FAILED messages alongside the reconciliation outcomes, executing Spark services via the spark-submit command through a Livy URL (see the first sketch following this list)
  • Enhanced a multi-threading model built with Scala/Akka to trigger the spark-submit command with a set of input parameters and track the status of each execution; once an execution completed or failed, read the corresponding status and returned the results to the users
  • Developed Kafka producer and consumer modules to manage upstream message intake for the reconciliation framework and communicate execution results back to the original sender
  • Created a generic framework utility in Scala to read ZIP files within a regression framework, automating repetitive tasks, reducing manual intervention, and improving efficiency
  • Achieved significant effort savings by implementing the framework in the user acceptance testing (UAT) environment to run GLRS and FLME jobs, which involve extensive data validation of 18 and 50 products, respectively
  • GLRS jobs (about 30 minutes) and FLME jobs (about 3 hours) deal with substantial data volumes of up to 10 million and 40 million records; automating them via Kafka messaging significantly simplified these complex validation processes
  • Designed an ETL flow in Scala leveraging reusable methods to process mortgage, credit card, and personal loan data, enhancing the flexibility and scalability of data handling
  • Enhanced the existing reconciliation framework using Scala to read inputs from a JSON file, including schema names, table names, partition columns, partition values, and comparison requirements, to refine data comparison processes
  • Optimized performance for Scala applications in the UAT environment by tuning memory and resource utilization to manage heavy loads, resolving bottlenecks, reallocating memory parameters, and adjusting executors
  • Improved the reconciliation framework to generate and email comparison results in Excel files using Scala, facilitating easier interpretation and distribution of data insights
  • Developed a project integrating Scala, Oracle, and Hive to create dynamic SQL queries for data reconciliation between balance columns, domain columns, and date columns, storing the queries in Hive and Oracle tables for streamlined data comparison
  • Linked the dynamic queries to Arcadia BI Tool to execute based on filter conditions and display comparison results in the user interface, enhancing the utility of business intelligence tools in data analysis
  • Created a tool in Scala to convert AVRO-format data into Parquet format based on partition keys, leveraging the advantages of Parquet for efficient query execution (see the second sketch following this list)
  • Developed a data transfer tool in Python to move data between production, UAT, and SIT environments by reading input partition values from a CSV file, constructing 'distcp' commands, and generating log files to validate data transfer tasks, improving productivity and simplifying business deliveries
  • Developed a 'FileWatcher' tool in Python to monitor file drops in Unix folders, read, parse, and validate files, and trigger spark-submit programs, automating data workflow processes
  • Developed various Python utilities for running SQL queries in Hive, copying and transferring data, generating Spark logs, and sending files as email attachments
  • Created PySpark code for a logging module to log user-friendly messages, error messages, and validation messages, and enhanced proof-of-concept PySpark code for ETL requirements to process DAT files and load data into Hive tables
  • Coordinated with business groups on requirement analysis, design, planning, and delivery; provided demonstrations to Business Analysts; handled task allocation and JIRA status tracking; and sent daily/weekly status reports to clients
  • Environment: Spark, Scala, Java, Kafka, Hive, Impala, Cloudera, Hadoop, HDFS, Oracle SQL Server, Arcadia BI tool, Unix, shell scripting
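
A rough illustration of the reconciliation flow described above (Kafka JSON request in, Spark batch submitted through a Livy URL, COMPLETED/FAILED response out). This is a minimal sketch, not the production framework: the broker address, topic names, and Livy endpoint are all assumed, and the real framework tracked batch execution status rather than just Livy's acceptance of the job.

```scala
import java.net.{HttpURLConnection, URL}
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.jdk.CollectionConverters._

object ReconResponder {
  // Submit a Spark batch via Livy's REST API; Livy answers 201 Created when
  // it accepts the batch. Endpoint and payload shape are assumptions.
  def submitViaLivy(json: String): Boolean = {
    val conn = new URL("http://livy-host:8998/batches")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(json.getBytes("UTF-8"))
    conn.getResponseCode == 201
  }

  def main(args: Array[String]): Unit = {
    val consProps = new Properties()
    consProps.put("bootstrap.servers", "broker:9092")
    consProps.put("group.id", "recon")
    consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consProps)
    consumer.subscribe(java.util.Collections.singletonList("recon-requests")) // assumed topic

    val prodProps = new Properties()
    prodProps.put("bootstrap.servers", "broker:9092")
    prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](prodProps)

    while (true) {
      // Each consumed record carries the reconciliation request as JSON.
      for (rec <- consumer.poll(Duration.ofSeconds(1)).asScala) {
        val status = if (submitViaLivy(rec.value())) "COMPLETED" else "FAILED"
        producer.send(new ProducerRecord("recon-results", rec.key(), status))
      }
    }
  }
}
```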
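
A minimal sketch of the AVRO-to-Parquet conversion tool mentioned above, assuming Spark with the spark-avro module on the classpath; the input/output paths and the partition column are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object AvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvroToParquet").getOrCreate()
    // Read Avro input and rewrite it as Parquet, partitioned on a key column.
    spark.read.format("avro").load("/data/in/avro")
      .write.partitionBy("business_date") // assumed partition key
      .parquet("/data/out/parquet")
    spark.stop()
  }
}
```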

Sr. Machine Learning Engineer

Capital One
01.2019 - 07.2019
  • Role: Lead
  • Project: Linear Regression Model for Auto Loan Default Prediction
  • Project Description: Developed scripts using Python, AWS Lambda, AWS S3, Scala, the Quantum Framework, ScalaTest, and Jenkins to insert customer records into the customer agreement table as part of the approval workflow; the AWS Lambda function performs the workflow process
  • Worked on a linear regression model that predicts vehicle loan outstanding, severity, and probability of default, using Python, AWS S3, Redshift, and AWS EC2 to optimize execution speed, reduce memory consumption, and improve the existing algorithm
  • Responsibilities:
  • Using Python, validated the presence of customers' car loan records in the source table, verified the approval status in the S3 bucket, and checked the number of records in the data file and in customer PDF files present in the S3 bucket
  • Enhanced Python code to report error messages as part of the loan rejection workflow
  • Developed test scripts in Scala to validate the presence of the Parquet file and the '_SUCCESS' file created by the model as part of the daily execution process (see the sketch following this list)
  • Developed a Jenkins pipeline for onboarding the application to production, with build preparation and validations as part of the deployment process
  • Performed analysis and source-data traceability and lineage for the Front Book Recovery model, which forecasts the schedule of payments, auction payments, and deficiency amounts
  • Developed flow charts to explain the logical flow of the process for front book recovery
  • Developed a lineage report explaining the flow of data through the input, intermediate, and output variables that get created, listing the variables needed for the final scoring equations
  • Executed the application and generated reports for memory usage and execution time for each phase of the process
  • Optimized memory usage by 50% by localizing the data read and processing data within functions in Python
  • Optimized execution speed for scoring module by 50% in Python
  • Introduced multi-threading in Python for reading CSV files, which reduced file-read time to a few seconds
  • Introduced data-table assignment statements to increase execution speed.
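
A minimal sketch, in the spirit of the Scala validations described above: confirm that the model's daily output directory contains a _SUCCESS marker and at least one Parquet part file, using the Hadoop FileSystem API. The output path is an assumption.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object OutputValidator {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val outDir = new Path("/model/output/daily") // hypothetical output path
    // The _SUCCESS marker is written by Spark/MapReduce on successful job completion.
    val hasSuccess = fs.exists(new Path(outDir, "_SUCCESS"))
    val hasParquet = fs.listStatus(outDir).exists(_.getPath.getName.endsWith(".parquet"))
    require(hasSuccess && hasParquet, s"Validation failed for $outDir")
    println(s"Validation passed for $outDir")
  }
}
```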

Scala Developer

Walmart
08.2018 - 01.2019
  • Worked on the Bookkeeping project for maintenance and enhancement of Kafka pipelines, where data originates from Point of Sale (POS) systems and ends in the General Ledger
  • Developed Scala/Akka microservices using Futures to send a success or error message after publishing the data through the producer and writing it into Cassandra, based on the outcome
  • Developed Scala/Akka microservices to retry the DB2 connection in case of failures, with the number of retry attempts configured via Typesafe Config (see the sketch following this list)
  • Participated in Scrum meetings for user and technical story creation and retrospective analysis
  • Environment: Scala, Play, microservices, Kafka, Cassandra, DB2.
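
A minimal sketch of the Future-based retry pattern described above, with the attempt limit read from Typesafe Config; the config key and the shape of the DB call are assumptions, not the production service.

```scala
import com.typesafe.config.ConfigFactory
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object DbRetry {
  // Maximum attempts come from application.conf; the key name is hypothetical.
  private val maxAttempts = ConfigFactory.load().getInt("db2.max-retry-attempts")

  // Re-run a failing asynchronous operation until it succeeds or the
  // configured attempt budget is exhausted.
  def withRetry[T](attempt: Int = 1)(op: () => Future[T]): Future[T] =
    op().recoverWith {
      case _ if attempt < maxAttempts => withRetry(attempt + 1)(op)
    }

  // Usage: DbRetry.withRetry()(() => queryDb2(sql))
}
```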

Data Engineer

Hayward Industries
01.2018 - 06.2018
  • Worked for a swimming pool installation and maintenance company with approximately 10,000 clients in the United States; developed analytical reports to gain insight into the customer base across the US, study the demographics of clients' device installations, and find opportunities for business development
  • Developed dashboards displaying customers' device installations, various device types, and the pH status of swimming pool water using R, R Shiny dashboards, Kafka, Scala, and Cassandra
  • Developed code in Kafka and Scala to read and parse the input data files and load them into multiple Cassandra tables (see the sketch following this list)
  • Developed R libraries for geo-location-based analytical reports displaying customers' device installations and various device types
  • Analyzed and developed R code for time-series graphs and scatter plots of pH-level data for swimming pool water, along with deviations from recommended levels; these graphs support analysis at multiple state and city levels
  • Developed an analytical product for Warranty Management using Power BI, SQL Server, Azure, and Power BI Gateway
  • Generated analytical reports in Power BI for warranty costs incurred across parameters such as product types, product names, part numbers, warranty timelines, supplier locations, regions, states, and customer locations
  • Developed Python modules using NLTK (Natural Language Toolkit) part-of-speech (POS) tagging to capture failure types from problem descriptions, categorizing problem types and device types for root-cause analysis
  • Analyzed, designed, and developed R programs to decode warranty serial numbers and find the manufacturing date, helping clients determine after how many days, months, or years a product failed at the customer location
  • Performed requirement analysis to study the installation of devices such as heaters, pumps, and filters, water pH-level requirements, and log messages
  • Installed the SQL Server database and Power BI, and moved customer data provided in CSV format into the SQL Server database
  • Performed integration testing between the database and Power BI reports, verifying that reported values for the various product types, warranty periods, product names, customer names, and sub-contractor names match the database
  • Published PC- and mobile-compatible versions for production releases
  • Environment: R-Shiny Dashboard, R-Studio, R, Python, NLTK, Kafka, Scala, Cassandra, SQL Server, Microsoft Power BI
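
A minimal sketch of the file-to-Cassandra loading described above, assuming the DataStax Java driver 3.x; the keyspace, table, column names, and input file are all hypothetical, and the real pipeline also published through Kafka.

```scala
import com.datastax.driver.core.Cluster
import scala.io.Source

object DeviceLoader {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("pool_analytics") // assumed keyspace
    val insert = session.prepare(
      "INSERT INTO device_installations (customer_id, device_type, ph) VALUES (?, ?, ?)")

    // Parse each CSV row (customer_id,device_type,ph) and write it to Cassandra.
    for (line <- Source.fromFile("devices.csv").getLines().drop(1)) {
      val Array(id, dtype, ph) = line.split(",").map(_.trim)
      session.execute(insert.bind(id, dtype, java.lang.Double.valueOf(ph)))
    }
    cluster.close()
  }
}
```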

MS Analytics Student

Harrisburg University of Science and Technology
03.2016 - 10.2017
  • Project: Presidential Job Approval Rating Analysis through Social Media
  • Description: Presidential approval ratings help measure chances of re-election, predict the midterm performance of the party in power, and generally take stock of the public's approval of the administration's agenda and performance
  • The objective of the project was to capture public sentiment from Twitter and compare it against the results of major approval-rating organizations such as Gallup, Rasmussen Reports, Fox News, NBC News, and Investor's Business Daily (IBD/TIPP) to see whether there is any correlation between Twitter sentiment and scientific polling
  • Developed code in Spark/Scala to download tweets from Twitter and save them into the Cassandra DB
  • Developed training and test sets for performing sentiment analysis with a Naïve Bayes classifier
  • Developed code in Python implementing a Naïve Bayes classifier for sentiment analysis on tweets (a Spark ML rendering is sketched following this list)
  • Data from polling companies were downloaded manually, and correlation graphs were generated
  • Developed R code using an R Shiny dashboard to display the analytical graphs
  • The Twitter results were normalized between 1% and 100% to match the approval-rating limits
  • Environment: Spark/Scala, Python, Sentiment Analysis, Machine Learning, NLP, Naïve Bayes Classifier, PyCharm, TensorFlow/Keras, Scikit-Learn, PyTorch, DNNs, GANs, GNNs, Time Series
  • Project: Traffic Monitoring Analytix
  • Description: This project studies the impact and root causes of traffic congestion in 5 chosen US cities. It examines the reasons for high and low traffic in the selected cities by reviewing population, population density, average household income, and commute time to work, framing a set of theories and trying to prove or disprove them based on actual results
  • Downloaded tweets for the 5 chosen cities from Twitter to capture public sentiment
  • Developed R code for Twitter sentiment analysis and an R Shiny dashboard, loading the Twitter sentiment results into MongoDB
  • Generated correlation graphs, location-based graphs using R-Leaflet, and Twitter sentiment-analysis reports
  • Performed sentiment analysis on Twitter data about traffic and grouped tweets into various emotional categories
  • MongoDB was installed on mLab, which runs on an Amazon Web Services (AWS) S3 instance; created the DB and its relations
  • Developed an analytical application using R Shiny and connected it to the S3 instance, reading and displaying analytics from the ShinyApps.io server
  • Environment: R-Programming, R-Shiny Dashboard, Mongo DB, Sentiment Analysis, AWS
  • Project: Twitter Analytix
  • Description: This project performs real-time sentiment analysis for the US presidential nominees by reading tweets from Twitter in real time. It reads Twitter using Spark Streaming and Scala and performs analysis in the R language to generate the sentiment-analytics report. It uses Cassandra, MongoDB, and AWS for data maintenance, and the reports are displayed using an R Shiny dashboard
  • Developed a real-time sentiment-analysis tool displaying instant, continuous line graphs of the public's positive and negative sentiment for each second, based on data captured from US locations, using Spark/Scala, Cassandra, R, and MongoDB installed on mLab running under AWS
  • Installed MongoDB on mLab, running under the hood of AWS, and created the DB and relations
  • Developed R code to read the Twitter data from Cassandra, performed sentiment analysis using R libraries, aggregated the data per second, and loaded it into MongoDB
  • Developed an analytical application using R Shiny and connected it to the S3 instance, reading and displaying analytics from the ShinyApps.io server
  • Generated continuous instant line graphs of positive and negative sentiment counts, aggregated per second for the presidential candidates, from data downloaded by Spark Streaming into Cassandra, subsequently processed by R for sentiment analysis and uploaded into MongoDB in the cloud
  • Environment: Spark Streaming, Scala, R Programming, R Shiny Dashboard, Cassandra, MongoDB, Sentiment Analysis, Machine Learning, AWS, Maven, Eclipse
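
The Naïve Bayes sentiment classification above was done in Python; for consistency with the other sketches, here is a minimal Spark ML rendering in Scala. The two labelled examples and the column names are purely illustrative, and local mode is assumed.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TweetSentiment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TweetSentiment").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny hypothetical training set: 1.0 = positive, 0.0 = negative.
    val training = Seq(
      ("great job on the economy", 1.0),
      ("terrible policy decision", 0.0)
    ).toDF("text", "label")

    // Tokenize tweets, hash tokens into term-frequency vectors, then fit Naïve Bayes.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    val model = new Pipeline()
      .setStages(Array(tokenizer, tf, new NaiveBayes()))
      .fit(training)

    model.transform(Seq("approve of the agenda").toDF("text"))
      .select("text", "prediction").show()
    spark.stop()
  }
}
```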

Data Engineer

Barclays
11.2016 - 07.2017
  • Worked in the CCAR group on the 9-quarter (9Q) projection for Capital Funding
  • The client required projections of capital funds for 9 quarters in series, in order to identify and mitigate risks in the capital funds being maintained
  • Responsibilities:
  • Worked on the report generation process in the Hive and Oracle databases based on various account codes
  • Developed Avro schema files to define the data structures
  • Created tables in Hive using Avro files and partition strategy
  • Developed ETL-Java code to load data into Hive tables
  • Developed code in Spark/Scala to load data from CSV files into Hive tables, ensuring the data is loaded as per the source files
  • Developed a Data Compare Tool in Scala using hash maps to compare records across sources such as CSV, Excel, and Hive tables (see the sketch following this list)
  • Developed tools in Scala to query the Oracle database and retrieve the results into CSV/Excel to analyze the impact on Moniker SQL, designed based on the business flow, as part of upcoming changes in data files
  • Developed tools in Scala to verify the reports generated from fixed-form reports against the expected target reports for each cell of the matrix, as it is complex to verify the output manually
  • Developed tools in R to generate SQL at runtime as part of functional reporting requirements, to query data from Hive/Oracle tables
  • Developed UDFs (User-Defined Functions) in Hive to query the Oracle DB from Hive and integrate data from Oracle based on users' requirements
  • Environment: Cloudera, Hadoop, Unix, Spark, Scala, Hive, Oracle, R, Python, Java, Eclipse, IntelliJ IDEA, Agile, Maven, SBT
  • Office Data Administration (ODA) Technical Environment: ODIN RDBMS (SQL, DB design), Sun Solaris Unix, Bashrc, shell scripting, UNIX utility commands
  • Responsibilities:
  • Participated in the analysis of feature specifications, designed, coded, tested and fixed bugs in the application as per the customer's requirements
  • Coordinated with clients in providing designs, suggestions, status reporting, product release and Configuration Management.
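
A minimal sketch of the hash-map record comparison described above, assuming both sources have been exported to CSV with the record key in the first column; file names and layout are illustrative, not the actual tool.

```scala
import scala.io.Source

object DataCompare {
  // Load a CSV (header skipped) into a map of key -> remaining columns.
  def toMap(path: String): Map[String, String] =
    Source.fromFile(path).getLines().drop(1)
      .map { line => val cols = line.split(",", 2); cols(0) -> cols(1) }
      .toMap

  def main(args: Array[String]): Unit = {
    val source = toMap("source_extract.csv") // e.g. CSV/Excel export
    val target = toMap("hive_extract.csv")   // e.g. Hive table export

    val missingInTarget = source.keySet -- target.keySet
    val missingInSource = target.keySet -- source.keySet
    val mismatched = (source.keySet & target.keySet).filter(k => source(k) != target(k))

    println(s"Missing in target: ${missingInTarget.size}")
    println(s"Missing in source: ${missingInSource.size}")
    println(s"Value mismatches: ${mismatched.size}")
  }
}
```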

Sr. Consultant

Navy Mutual
08.2015 - 12.2015
  • Technical Environment: Eclipse, Selenium, Jenkins, Sauce Labs, Java, Web services, TestNG, Maven, SVN

Sr. Consultant

Starwood Hotels
04.2015 - 08.2015
  • Technical Environment: Eclipse, Selenium, Java, BDD, Cucumber, Maven for iPhone Devices, Android Devices and Desktop Web Applications


Expert Systems Analyst, Test Automation Architect

Allscripts Healthcare LLC
12.2012 - 05.2014
  • Technical Environment: HP-QC using Excel Macros and Oracle SQL, HP-QTP 11, VB, VBA - Excel Macro, Hybrid Mode Test Automation Framework Design, Development and Implementation

Sr. Test Analyst

Auto Insurance Domain
10.2011 - 10.2012
  • Environment: ALM 11.0, QTP-11.0, BPT, Hybrid Model Test Automation Framework, Java

Senior Consultant

Capco
09.2010 - 09.2011
  • Technical Environment: Agile - Scrum, Quality Center 10.0, Frontier Applications (Recon, Recollector, and Admin)

Test Automation Engineer

Geeksoft LLC
07.2010 - 08.2010
  • Reviewed and provided solutions for the existing test automation framework used for testing a trading application
  • Prepared requirement studies, test plans, and test cases, executed tests, and reported defects for trading application testing and release.

Test Automation Consultant

Credit Suisse
06.2009 - 07.2010
  • Technical Environment: QTP 9.5, Descriptive Programming, QC 9.0, BO XI (Business Objects, Desktop Intelligence), SQL Query Analyzer, Embarcadero Rapid SQL 7.5.5, Test Automation, Framework Design and Development

Test Automation Engineer

JetBlue Airways
09.2008 - 05.2009
  • Technical Environment: .NET, ASP, QTP 9.5, QC 9.0, BizTalk Server, SQL Server 2000, SQL Query Analyzer, Test Automation Framework maintenance and enhancement

Test Engineer

Wachovia
05.2008 - 09.2008
  • Technical Environment: QTP 9.2, QC 9.0, Agile - Scrum, SQL Server 2000, SQL Query Analyzer, Windows XP, Java, J2EE, .NET, Sun Solaris

Assistant Vice President and Onshore Lead

Merrill Lynch India and US
07.2000 - 02.2008

Worked in India and USA for Merrill Lynch through various organizations.

Environment: .NET, ASP, Seagate Crystal Reports, Oracle Toad, SQL Server 2000, SQL Query Analyzer, QTP 9.1, QC 9.0, Rational Functional Tester, Test Manager, SQA Basic 7.3, Windows XP, SQL Server, Test Partner, JIRA administration, DB testing, Unix, shell scripting, Bourne shell, Unix utility commands


Senior Software Professional

India Comnet International (P) Ltd
06.1999 - 06.2000
  • AT&T, USA: Call Detail Data System. Technical Environment: Informix, SQL, ESQL-C, C, Sun Solaris


Member Technical Staff

India Comnet International (P) Ltd
04.1997 - 12.1998

Worked for Lucent Technologies for product development and release through various in-house development tools.

Systems Analyst

Railway Products (India) Ltd
06.1994 - 07.1996
  • Technical Environment: Unify RDBMS (SQL, RPT, and DB design), Sybase 10 (SQL, DB design), Unix, shell scripting, Bourne shell, C, Unix utility commands, Assembly language, MS-DOS
  • Responsibilities:
  • Managed development and maintenance of daily, weekly, monthly, and audit reports using SQL, Unify C, and report-processor programs for the Material Information System, Payroll Information System, Sales Order Processing, and Financial Information System
  • Developed the database design in Sybase 10 with tables, referential integrity, triggers, and stored procedures for migrating the Material Information System from the Unify database to Sybase.

Education

Master of Science - Analytics

Harrisburg University of Science And Technology
Harrisburg, PA, USA
10.2017

Master of Science - Computer Applications

Annamalai University
India
06.1992

Bachelor of Science - Physics

Barathidhasan University
India
06.1989

Skills

    Data Engineering

    Scala Programming

    Python Programming

    Java Programming

    Spark Development

    Apache Kafka

    Hadoop Ecosystem

    Cloudera Distribution

    ETL Development

    Real-time Processing

    NoSQL Databases

    RESTful APIs

    Data Migration

    Machine Learning

    Big Data Analytics

    RDBMS

Languages

English
Full Professional
Tamil
Native or Bilingual
