We are now ONE! #CelcomDigi

Celcom and Digi have merged with the aim of creating Malaysia’s most inspiring telco-tech company, building on two well-loved brands with over six decades of combined strength in innovation and in connecting Malaysians.

AI Ops Specialist

Date:  29 Apr 2025
Requisition ID:  13656
Employment Type:  Permanent
Work Location:  CD Hub
Job Description: 

Job Summary

We are looking for a dynamic and results-driven AI Ops Engineer to join our AI & Automation team. The ideal candidate will play a pivotal role in enabling and managing production environments for our GenAI products, ensuring seamless deployment, scalability, and performance optimization. You will collaborate with cross-functional teams to integrate AI solutions into production, monitor their performance, and implement automation to streamline operational workflows.

Responsibilities

  • Production Environment Enablement: Design, implement, and manage production environments for GenAI products, ensuring high availability, scalability, and security. Establish CI/CD pipelines to automate deployment and updates of AI models and services.
  • AI & ML Integration: Collaborate with Data Scientists and ML Engineers to transition AI models from development to production. Optimize model performance and ensure efficient resource utilization in a production setting.
  • Monitoring & Performance Optimization: Develop and maintain monitoring tools to track AI product performance, availability, and reliability. Implement automated alerting systems to detect and resolve issues proactively.
  • Automation & Process Improvement: Automate operational workflows to reduce manual intervention and improve system reliability. Leverage AI Ops platforms to optimize system health and performance.
  • Collaboration & Stakeholder Engagement: Work closely with Enterprise Architects and IT Operations teams to ensure AI solutions align with the company’s strategic goals. Engage with external partners and platforms such as Amazon Bedrock to customize AI models and integrate them into the production environment.
  • Security & Compliance: Ensure data security, privacy, and compliance with industry standards and regulatory requirements. Implement robust security protocols to safeguard AI models and data.

Requirements

  • Bachelor’s degree in Computer Science, Data Science, Telecommunications, or a related field.
  • 2-4 years of experience in AI Ops, IT operations, or production environment management.
  • Proficiency in cloud platforms (AWS, Azure) and AI Ops tools (Moogsoft, Splunk, Datadog).
  • Strong knowledge of AI/ML deployment frameworks (TensorFlow Serving, MLflow) and containerization tools (Docker, Kubernetes).
  • Experience with CI/CD pipelines and infrastructure as code (Terraform, Ansible).
  • Familiarity with Amazon Bedrock or similar platforms for AI model customization.
  • Excellent problem-solving skills and the ability to troubleshoot complex production issues.
  • Experience with Generative AI applications in a production environment.
  • Knowledge of DevOps practices and site reliability engineering (SRE) principles.
  • Strong understanding of network infrastructure and security best practices.
  • Excellent communication and collaboration skills.
Division:  TECHNOLOGY

Next Steps

Thank you for taking the first step towards joining our team at CelcomDigi! After you submit your application, our Talent Acquisition team will review your CV and, if you are shortlisted, reach out to guide you through the next steps, including a pre-screening conversation, interviews, and/or assessments.

At CelcomDigi, we aspire to be Malaysia’s leading telco-tech company and the nation’s digital growth engine, powering transformation through 5G, AI, and innovation that impacts over 20 million customers. Here, your role goes beyond work. It’s about enabling businesses to thrive, connecting communities, and advancing society, as we build a brand rooted in trust, reliability, and customer excellence. Aligned with our employer value proposition, “Grow with Purpose. Build with Trust”, you’ll have the opportunity to innovate responsibly and create digital solutions that truly make a difference. If you're driven, future-focused, and ready to be part of something bigger, we want you on our team.

Let’s advance and inspire Malaysia together! #WeAreCelcomDigi

Follow CelcomDigi on LinkedIn and vote for us as Malaysia’s Most Preferred Employer at the GRADUAN Brand Awards

CelcomDigi is an equal opportunity employer and is committed to promoting employment practices that are transparent, objective, and fair.


Job Segment: Compliance, Computer Science, Business Process, Law, Legal, Technology, Management, Operations
