Responsibilities:
• Design, implement, and optimize data pipelines for batch and real-time data processing using Cloudera (Hadoop, Hive, Spark, Impala) and Informatica (PowerCenter, Cloud Data Integration).
• Build data extraction, transformation, and loading (ETL) workflows using Informatica PowerCenter for large-scale data integration from source systems (e.g., relational databases, flat files, APIs) into Cloudera Data Lake or data warehouse environments.
• Implement Spark jobs on Cloudera for distributed data processing and optimization of data workflows.
• Leverage Informatica for orchestrating ETL workflows, including data extraction, cleansing, transformation, and loading into data repositories (HDFS, Hive, SQL databases, etc.).
• Optimize Informatica workflows to minimize runtime, ensure smooth data integration, and maintain high data quality.
• Utilize Hadoop and Spark on Cloudera to process large datasets and implement data transformations using MapReduce, Spark SQL, and PySpark.
• Leverage Impala for low-latency SQL queries on Hadoop, ensuring real-time access to processed data.
• Implement partitioning, bucketing, and indexing strategies in Hive and HBase to improve query performance on large datasets.
• Implement and enforce data quality rules within Informatica workflows, ensuring that all transformations meet the required standards for completeness, consistency, and accuracy.
• Ensure compliance with data governance and security protocols (e.g., encryption, masking, access control) in accordance with industry best practices.
• Automate ETL workflows using Informatica Server, integrating with Airflow, NiFi, or other workflow orchestration tools for scheduling and monitoring jobs.
• Utilize Cloudera Navigator for monitoring and auditing data processes within the Hadoop ecosystem.
• Perform regular tuning of the ETL pipelines, data flows, and SQL queries to ensure optimal performance.

Qualifications:
• Bachelor’s degree in Computer Science, Engineering, or a related field.
• 6+ years of experience in big data engineering or a closely related role.
• Proven experience with the Cloudera Distribution of Hadoop (CDH), including expertise in HDFS, Hive, Impala, Spark, and HBase.
• Strong hands-on experience with Informatica PowerCenter (ETL), EDC, IDQ, B2B, and Axon.
• Deep understanding of ETL best practices, data pipelines, and distributed computing technologies such as Spark, MapReduce, PySpark, and Hadoop ecosystem components.
• Advanced SQL skills for data manipulation, aggregation, optimization, and reporting across relational and non-relational data stores (e.g., SQL Server, MySQL, PostgreSQL, Hive, Impala).
• Experience in Python and SQL.
• Strong background in data warehousing principles and data modeling, including dimensional modeling (star schema, snowflake schema) and OLAP/OLTP considerations.

Founded in 2009, with a presence in Egypt, the UAE, Saudi Arabia, Qatar, and Algeria, BBI operates as a trusted partner at the cutting edge of data and AI solutions. Its team of 250+ professionals has completed 500+ successful engagements across multiple sectors, enabling clients across the region to unlock the potential of their data and transform it into actionable insights and strategic assets. BBI’s vision, grounded in innovation and partnerships, is to provide end-to-end data and AI solutions: first defining an actionable, customized data and AI strategy, then implementing it effectively through BBI’s experienced professionals, powered by sustainable and ethical AI.

Senior Big Data Engineer
