Lead Site Reliability Engineer (SRE) (San Francisco) Job at EPAM Systems, San Francisco, CA

  • EPAM Systems
  • San Francisco, CA

Job Description

At EPAM, we're not just building software; we're engineering excellence.

We're looking for a Lead Site Reliability Engineer (SRE) with a passion for performance, precision, and proactive problem-solving to join a high-impact team supporting a leading sell-side trading environment.

This role is ideal for someone who thrives in fast-paced financial systems, has a passion for working with data and monitoring tools, and wants to shape the reliability and efficiency of next-generation trading platforms.

The Site Reliability Engineer will focus on ensuring stable connectivity to external partners within a SaaS environment. The ideal candidate will have expertise in financial systems, especially within trading ecosystems, and the ability to proactively drive performance enhancements and improve data usage and analysis. By identifying areas of opportunity, they will help deliver improved service and systems for end users.

Additionally, the candidate will help proactively identify system issues, implement changes and resolutions, and ensure the stability of business-critical applications. They will collaborate to build actionable plans, execute strategies, and lead initiatives to enhance system reliability.

Responsibilities

  • Provide a strategic vision for trading portfolio performance, covering network connectivity, traffic throughput, and applications

  • Define, configure, and set up alerting and monitoring frameworks for critical applications

  • Monitor application and platform performance using APM and monitoring tools to diagnose and resolve performance issues

  • Work within Azure Cloud environments and contribute to a 24x7x365 support team to diagnose and address system challenges

  • Assess environmental and incident priorities, investigate issues swiftly, and execute efficient resolutions

  • Troubleshoot mission-critical systems and implement preventative problem management solutions

  • Lead efforts to promote observability, scalability, and resiliency best practices across development and operations teams

  • Analyze, design, and implement solutions to meet application performance and reliability goals

  • Collaborate with cross-functional teams to ensure smooth and unified troubleshooting and resolution processes across departments

  • Craft and maintain SLA/SLO dashboards to monitor system health and performance

  • Define and maintain SLIs, SLOs, and error budgets for applications and infrastructure to drive service improvement

  • Automate operational processes to enhance service offerings and system reliability

Requirements

  • 5+ years of experience in site reliability engineering, production support, or related roles in fast-paced environments

  • Demonstrated leadership or mentoring experience (minimum of 1 year) guiding cross-functional teams on system reliability

  • Knowledge of monitoring and observability tools such as AppDynamics, New Relic, Prometheus, or Grafana

  • Background in Azure Cloud services, CI/CD pipelines, and container orchestration (Kubernetes or Docker)

  • Proficiency in scripting with Python, Bash, or PowerShell for automation and efficiency gains

  • Understanding of network protocols (TCP/IP, DNS) and troubleshooting tools such as Wireshark or tcpdump

  • Capability to analyze complex system issues and performance bottlenecks using APM and log analysis

  • Familiarity with implementing SLA/SLO metrics and monitoring for production systems

  • Combined skills in high-availability systems and database performance optimization

Nice to have

  • Expertise in SaaS solutions and APIs with a focus on handling external trading partners

  • Knowledge of disaster recovery strategies and business continuity planning

  • Background in trading platforms or buy-side/sell-side financial environments

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our clients, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential. Engineer the Future with a Career at EPAM.

This Remote Position Cannot be Performed in New York City.

Applications will be accepted on a rolling basis.

In accordance with the LA County Fair Chance Ordinance, you may find a copy of the Notice containing a summary of the Ordinance's key provisions here: Concept FCO Posting 8 27 24 (lacounty.gov)

H1B visa sponsorship is not available for this position.

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.

EPAM Systems, Inc. is an equal opportunity employer. We recognize the value of diversity and inclusion in creating success for our customers, business partners, shareholders, employees and communities. We are committed to recruiting, hiring, developing and promoting employees without discrimination. As a global employer, this commitment includes complying with all laws in the countries in which we operate. Nevertheless, we believe equal employment practices should not be limited to what the law requires. Equal opportunity and inclusion are essential to motivate, empower and recognize the best in everyone.

At EPAM, employment actions are based on individual qualifications, without regard to race, color, religion, creed, gender, pregnancy status, sexual orientation, gender identity, gender expression, marital or familial status, national origin, ancestry, genetics, age, disability status, veteran status, citizenship status when otherwise legally able to work, or any other characteristic protected by law.


Job Tags

Full time, H1b, Remote work
