
Sumana V

Contact No: 443-953-9991


Email Id: [email protected]
Data Engineer / Big Data Engineer

Summary:
Around 7 years of extensive experience as a Data Engineer and Big Data Developer specializing in the Big Data ecosystem: data ingestion, modeling, analysis, integration, and data processing.
Extensive experience providing Big Data solutions using Hadoop, Spark, HDFS, MapReduce, YARN, Kafka, Pig, Hive, Sqoop, HBase, Oozie, Zookeeper, Cloudera Manager, and Hortonworks.
Strong experience working with Amazon cloud services such as EMR, Redshift, DynamoDB, Lambda, Athena, Glue, S3, API Gateway, RDS, and CloudWatch for efficient processing of Big Data.
Hands-on experience building PySpark, Spark Java, and Scala applications for batch and stream processing, involving transformations, actions, and Spark SQL queries on RDDs, DataFrames, and Datasets (a minimal PySpark sketch follows this summary).
Strong experience writing, troubleshooting, and optimizing Spark scripts using Python and Scala.
Experienced in using Kafka as a distributed publish-subscribe messaging system.
Strong knowledge of performance tuning for Hive queries and troubleshooting issues related to joins and memory exceptions in Hive.
Exceptionally good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive (also illustrated in the sketch after this summary).
Strong knowledge of Azure Data Explorer and Kusto Query Language (KQL).
Experience importing and exporting data between HDFS and relational databases using Sqoop.
Experience in real-time analytics with Spark Streaming and Kafka, and implementation of batch processing using Hadoop, MapReduce, Pig, and Hive.
Experienced in building highly scalable Big Data solutions using NoSQL column-oriented databases like Cassandra, MongoDB, and HBase by integrating them with the Hadoop cluster.
Hands-on experience with the MS Azure cloud suite: Azure SQL Database, Azure Data Lake (ADL), Azure Data Factory (ADF), Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and Azure Data Platform Services.
Experience with data ingestion into Azure services and processing the data in Azure Databricks.
Extensive work on ETL processes consisting of data transformation, data sourcing, mapping, conversion, and loading data from heterogeneous systems such as flat files, Excel, Oracle, Teradata, and MS SQL Server.
Experience building ETL production pipelines using Informatica PowerCenter, SSIS, SSAS, and SSRS.
Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data per business requirements, and at optimizing existing algorithms for best results.
Experience with data warehousing concepts such as star schema, snowflake schema, data marts, and the Kimball methodology used in relational and multidimensional data modeling.
Used AWS IAM, Kerberos, and Ranger for security compliance.
Strong experience leveraging different file formats such as Avro, ORC, Parquet, JSON, and flat files.
Sound knowledge of normalization and de-normalization techniques on OLAP and OLTP systems.
Good experience with version control tools Bitbucket, GitHub, and Git.
Experience with Jira, Confluence, and Rally for project management, and Oozie and Airflow as scheduling tools.
Strong scripting skills in Python, Scala, and UNIX shell.
Involved in writing Python and Java APIs for AWS Lambda functions to manage AWS services.
Good knowledge of building interactive dashboards, performing ad-hoc analysis, and generating reports and visualizations using Tableau and Power BI.
Experience in design, development, and testing of distributed client/server and database applications using Java, Spring, Hibernate, Struts, JSP, JDBC, and REST services on Apache Tomcat servers.
Hands-on working experience with RESTful APIs, API lifecycle management, and consuming RESTful services.
Good working experience in Agile/Scrum methodologies, communicating in scrum calls on project analysis and development aspects.
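A minimal PySpark sketch of the batch processing and Hive table design referenced above (DataFrame transformations, a Spark SQL query, and a partitioned, bucketed managed Hive table); the paths, database, and column names (e.g., sales_db.daily_revenue, order_ts) are hypothetical placeholders, not taken from any specific project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support is needed for managed-table DDL and bucketing.
spark = (SparkSession.builder
         .appName("batch-orders-example")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical source: raw order data landed as Parquet on HDFS/S3.
orders = spark.read.parquet("/data/raw/orders")

# Transformations: filter, derive a date column, aggregate (evaluated lazily until an action runs).
daily_revenue = (orders
                 .filter(F.col("status") == "COMPLETED")
                 .withColumn("order_date", F.to_date("order_ts"))
                 .groupBy("order_date", "region")
                 .agg(F.sum("amount").alias("revenue")))

# Equivalent logic expressed as a Spark SQL query on a temporary view.
orders.createOrReplaceTempView("orders")
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    WHERE status = 'COMPLETED'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
""")
top_regions.show()

# Persist as a managed Hive table, partitioned by date and bucketed by region.
(daily_revenue.write
 .mode("overwrite")
 .partitionBy("order_date")
 .bucketBy(8, "region")
 .sortBy("region")
 .saveAsTable("sales_db.daily_revenue"))
```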

Technical Skills:
Programming Languages: Python, Scala, SQL, Shell Scripting
Web Technologies: HTML, CSS, XML, AJAX, JSP, Servlets, JavaScript
Big Data Stack: Hadoop, Spark, MapReduce, Hive, Pig, YARN, Sqoop, Flume, Oozie, Kafka, Impala, Storm
Cloud Platforms: Amazon AWS (EC2, S3, EMR), MS Azure (Azure SQL Database, Azure Data Factory, Azure Databricks, Azure SQL Server, Azure Data Explorer/KQL), Google Cloud Platform (GCP)
Relational Databases: Oracle, MySQL, SQL Server, PostgreSQL, Teradata, Snowflake
NoSQL Databases: MongoDB, Cassandra, HBase
Version Control Systems: Bitbucket, Git, SVN, GitHub
IDEs: PyCharm, IntelliJ IDEA, Jupyter Notebooks, Google Colab, Eclipse
Operating Systems: UNIX, Linux, Windows

Professional experience:
Impact Research, Columbia, MD February 2022 – present
Role: Data Engineer/ Big Data engineer
Responsibilities:
Participated in requirement grooming meetings, which involved understanding functional requirements from a business perspective and providing estimates to convert those requirements into software solutions (designing, developing, and delivering code to IT/UAT/PROD, and validating and managing data pipelines from multiple applications in a fast-paced Agile environment using sprints with the JIRA management tool).
Responsible for validating data in DynamoDB tables and verifying that EC2 instances are up and running for the DEV, QA, CERT, and PROD environments in AWS.
Analyzed existing data flows and created high-level/low-level technical design documents for business stakeholders, confirming that the technical design aligns with business requirements.
Created and deployed Spark jobs in different environments, loading data into Cassandra, Hive, and HDFS; secured the data by implementing encryption-based authentication and authorization.
Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and optimized volumes, and created monitors, alarms, and notifications for EC2 hosts using CloudWatch.
Utilized Power Query in Power BI to pivot and un-pivot the data model for data cleansing and data massaging.
Contributed to Azure-specific DevOps practices, including CI/CD pipelines and Kubernetes containerization, resulting in a 25% faster deployment and scaling process.
Developed code using Apache Spark and Scala, IntelliJ, NoSQL databases (Cassandra), Jenkins, Docker pipelines, GitHub, Kubernetes, the HDFS file system, Hive, Kafka for real-time streaming data, and Kibana for log monitoring; responsible for deployments to DEV, QA, PRE-PROD (CERT), and PROD on AWS.
Scheduled jobs through the Airflow scheduling tool (a minimal DAG sketch follows this section).
Created quick filters and customized calculations with SOQL for SFDC queries; used Data Loader for ad hoc data loads into Salesforce.
Extensively worked on Informatica PowerCenter mappings, mapping parameters, workflows, variables, and session parameters.
Responsible for facilitating data-load pipelines and benchmarking the developed product against the set performance standards.
Used the Debugger within the Mapping Designer to test the data flow between source and target and to troubleshoot invalid mappings.
Worked on SQL tools like TOAD and SQL Developer to run SQL queries and validate the data.
Studied the existing system and conducted reviews to provide unified feedback on jobs.
Involved in onsite and offshore coordination to ensure deliverables.
Involved in testing the database using complex SQL scripts and handling performance issues effectively.
Environment: Apache Spark 2.4.5, Scala 2.1.1, Cassandra, HDFS, Hive, GitHub, Jenkins, Kafka, SQL Server 2008, Visio, TOAD, PuTTY, Autosys Scheduler, UNIX, AWS, Azure, WinSCP
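A minimal Airflow DAG sketch of the kind of job scheduling mentioned above, assuming Airflow 2.x with the Apache Spark provider installed; the DAG id, application path, and connection id are hypothetical placeholders, not the project's actual configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical daily pipeline: submit a Spark job that loads data into Cassandra/Hive.
with DAG(
    dag_id="daily_spark_load",          # placeholder DAG name
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["spark", "example"],
) as dag:

    load_to_cassandra = SparkSubmitOperator(
        task_id="load_orders_to_cassandra",
        application="/opt/jobs/load_orders.py",       # placeholder PySpark application
        conn_id="spark_default",                      # assumes a configured Spark connection
        conf={"spark.executor.memory": "4g"},
        application_args=["--run-date", "{{ ds }}"],  # Airflow-templated execution date
    )
```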

Client: Index Analytics, Baltimore, MD July 2021 – December 2021


Role: Data Engineer/Big data Engineer
Responsibilities:
Developed and automated risk-scoring models using PySpark for predictive healthcare analysis, contributing to a 20% improvement in predictive accuracy on Azure.
Implemented proactive monitoring strategies, reducing application downtime by 20% through early detection and resolution of potential issues.
Streamlined incident response procedures, resulting in a 15% decrease in mean time to resolution (MTTR) for production incidents.
Created SSIS packages for moving data between databases.
Spearheaded data migration to Azure resources from sources like MS Excel and CSV, enhancing data import efficiency by 25% and reducing processing time by 35%.
Crafted Spark-based pipelines for data loading with EMR and AWS S3, realizing a 25% boost in speed and a 45% decrease in processing time.
Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats such as JSON and Parquet, analyzing and transforming the data to uncover insights into customer usage patterns (a minimal sketch follows this section).
Streamlined ETL tasks for Azure SQL Server, reducing data loading time by 30 minutes and enhancing accuracy by 40%.
Collaborated closely with Cloud Engineering teams in an Agile environment, ensuring accurate translation of Azure-related requirements into efficient data pipelines.
Implemented Kafka-based data pipelines, transforming and streaming data across systems and enhancing the efficiency of data processing by 30%.
Developed MapReduce programs to parse raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Utilized the Power BI gateway to keep dashboards and reports up to date with on-premises data sources.
Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS.
Actively contributed to threat modelling and automated vulnerability management on AWS, resulting in a 40% reduction in identified security risks.
Wrote KQL queries and published the data to Power BI.
Documented Azure-based data engineering processes, leading to a 50% increase in knowledge-sharing efficiency and team collaboration.
Played a pivotal role in the enhancement of ETL applications, leading to a 30% improvement in data processing speed.
Environment: Hadoop 3.3, Azure, Power BI, Spark 3.0, PySpark, Sqoop 1.4.7, ETL, HDFS, Snowflake DW, Oracle SQL, MapReduce, Kafka 2.8, and Agile process.
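A minimal PySpark/Spark SQL sketch of the Databricks-style extraction and aggregation described above; the mount paths and column names (e.g., /mnt/raw/usage, customer_id, segment) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Hypothetical sources: raw usage events as JSON, enriched customer data as Parquet.
events = spark.read.json("/mnt/raw/usage/events/")
customers = spark.read.parquet("/mnt/curated/customers/")

# Join, clean, and aggregate to surface customer usage patterns by segment and day.
usage_by_segment = (events
    .withColumn("event_date", F.to_date("event_ts"))
    .join(customers, on="customer_id", how="inner")
    .groupBy("segment", "event_date")
    .agg(
        F.countDistinct("customer_id").alias("active_customers"),
        F.count("*").alias("events"),
    ))

# Write the curated result back as Parquet for downstream reporting.
(usage_by_segment.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("/mnt/curated/usage_by_segment/"))
```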

Client: Cyberpack Ventures, Baltimore, MD June 2021 – December 2021


Role: Data Engineer
Responsibilities:
Collaborated with cross-functional teams to successfully plan and execute global platform upgrades on Azure, achieving a 30% increase in system efficiency and compatibility with the Azure architecture.
Leveraged Azure AKS expertise to enhance architecture design, firewall, and network security, resulting in a 25% reduction in security vulnerabilities and ensuring a robust deployment environment.
Orchestrated Kubernetes clusters for containerized applications, enhancing capability and resource utilization in data processing workflows.
Engaged Azure services for specific projects, utilizing Azure Data Factory for ETL processes and Azure Cosmos DB for scalable NoSQL data storage.
Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
Worked with Spark to improve performance and optimize existing algorithms in Hadoop.
Used SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model, which gets data from Kafka in real time and persists it to Cassandra.
Developed a Kafka consumer API in Python for consuming data from Kafka topics (a minimal consumer sketch follows this section).
Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML files using Spark Streaming to capture User Interface (UI) updates.
Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for data set processing and storage.
Experienced in maintaining the Hadoop cluster on AWS EMR.
Loaded data into S3 buckets using AWS Glue and PySpark; involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
Worked on implementing Kafka security and boosting its performance.
Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
Developed custom UDFs in Python and used the UDFs for sorting and preparing the data.
Worked on custom loaders and storage classes in Pig to handle several data formats such as JSON, XML, and CSV, and generated bags for processing using Pig.
Designed and developed Power BI graphical and visualization solutions based on business requirement documents and plans for creating interactive dashboards.
Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
Developed Oozie coordinators to schedule Hive scripts and create data pipelines.
Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Azure, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI, SOAP, Cassandra, and Agile methodologies.
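A minimal Python consumer sketch of the Kafka topic consumption mentioned above, assuming the kafka-python client; the topic name, brokers, and consumer group are placeholders.

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Hypothetical topic carrying JSON user-interface update events.
consumer = KafkaConsumer(
    "ui-updates",                                  # placeholder topic
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="ui-updates-processor",               # placeholder consumer group
    auto_offset_reset="earliest",
    enable_auto_commit=True,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Downstream handling would go here (e.g., persist to Cassandra or HDFS).
    print(f"partition={message.partition} offset={message.offset} user={event.get('user_id')}")
```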

Client: Infosys Limited, Bangalore, India June 2017 – January 2020


Role: Big Data Engineer
Responsibilities:
Involved in the end-to-end process of Hadoop jobs that used various technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts.
Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake, and Data Factory.
Extracted and loaded data into a Data Lake environment (MS Azure) using Sqoop, which was accessed by business users.
Managed and supported enterprise Data Warehouse operations and big data advanced predictive application development using Cloudera and Hortonworks HDP.
Developed Pig scripts to transform raw data into intelligent data as specified by business users.
Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive on Azure for data cleansing and pre-processing.
Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
Imported data from different sources such as HDFS and HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data in HDFS.
Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, and to NoSQL databases such as HBase and Cassandra (a PySpark sketch of this flow follows this section).
Documented the requirements, including the available code to be implemented using Spark, Hive, HDFS, HBase, and Elasticsearch.
Performed transformations like event joins, filtering bot traffic, and some pre-aggregations using Pig.
Explored MLlib algorithms in Spark to understand the possible machine learning functionalities that could be used for our use case.
Worked on Azure Data Explorer and used Kusto Query Language (KQL) for querying.
Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
Executed Hive queries on Parquet tables stored in Hive to perform data analysis and meet the business requirements.
Configured Oozie workflows to run multiple Hive and Pig jobs, which run independently based on time and data availability.
Imported and exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Hadoop 3.0, Azure, Sqoop 1.4.6, KQL, Pig 0.17, Hive 2.3, MapReduce, Spark 2.2.1, shell scripts, SQL, Hortonworks, Python, MLlib, HDFS, YARN, Java, Kafka 1.0, Cassandra 3.11, Oozie, Agile
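A minimal sketch of the Kafka-to-HDFS streaming flow described above, shown here with PySpark Structured Streaming (the original work was in Scala); the topic, broker, and HDFS paths are placeholders, and the Spark Kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-stream").getOrCreate()

# Read the raw event stream from a placeholder Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")            # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers binary key/value columns; cast the payload to string for downstream parsing.
events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

# Continuously append micro-batches to HDFS as Parquet, with a checkpoint for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/events/")            # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```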

Client: Sling Technologies, Bangalore, India June 2015 – June 2017


Role: Data Engineer
Responsibilities:
Responsible for gathering requirements from Business Analysts and Operational Analysts and identifying the data sources required for each request.
Worked closely with a data architect to review all conceptual, logical, and physical database design models with respect to functions, definitions, maintenance, and review, and to support data analysis, data quality, and the ETL design that feeds the logical data models.
Maintained and developed complex SQL queries, stored procedures, views, functions, and reports that satisfy customer requirements using SQL Server 2012.
Created automated anomaly detection systems and constantly tracked their performance.
Supported Sales and Engagement management planning and decision-making on sales incentives.
Used statistical analysis, simulations, and predictive modelling to analyze information and develop practical solutions to business problems.
Created a Spark application to filter and transform the data according to the requirements and feed it to the ML team (a minimal sketch follows this section).
Developed an ETL pipeline using AWS for a huge volume of data to measure channel campaign impact.
Worked closely with Ad platform teams to calculate various metrics related to clicks, impressions, ad effectiveness and targeting, cost per click, cost per impression, churn rate, etc., which are in turn used by the business to measure the direct impact of their ads.
Extended the company's data with third-party sources of information when needed.
Developed several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS through mailing server subscriptions and SharePoint Server.
Generated ad-hoc reports using Crystal Reports 9 and SQL Server Reporting Services (SSRS).
Developed reports and visualizations based on the insights, mainly using Tableau, and built dashboards for the company insight teams.
Environment: SQL Server 2012, SSRS, SSIS, SQL Profiler, Tableau, QlikView, Agile, ETL, Anomaly Detection, Spark.
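A minimal sketch of the kind of Spark filter/transform feed handed to the ML team, as mentioned above; the source path, event types, and derived features are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ml-feature-feed").getOrCreate()

# Hypothetical source of campaign interaction events.
interactions = spark.read.parquet("/data/campaigns/interactions/")

# Filter to valid records and derive simple per-customer features for the ML team.
features = (interactions
    .filter(F.col("event_type").isin("click", "impression"))
    .withColumn("is_click", (F.col("event_type") == "click").cast("int"))
    .groupBy("customer_id")
    .agg(
        F.sum("is_click").alias("clicks"),
        F.count("*").alias("impressions"),
    )
    .withColumn("ctr", F.col("clicks") / F.col("impressions")))

# Hand off as Parquet so the ML team can consume it directly.
features.write.mode("overwrite").parquet("/data/ml/features/campaign_ctr/")
```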

Education:
Master of Science in Data Science, Maryland
