Senior Software Engineer

January 9 2025
Industries IT: Software
Categories Computer Engineering, Software Engineering
Vancouver, BC • Full time

Overview

The MAIA System Infrastructure team is pioneering the development of the next-generation developer ecosystem for AI Accelerators. We are at the core of creating the infrastructure that enables deep observability into our proprietary MAIA chips, empowering developers to harness the full potential of these advanced AI accelerators. Our mission is to build a transparent, efficient, and powerful ecosystem that goes beyond traditional GPU observability, providing unmatched insights into the operations and performance of our AI accelerators.

We operate at the intersection of cutting-edge AI hardware, system software, and developer tools, constantly pushing the boundaries of what is possible. We not only focus on the internal execution and performance metrics of the MAIA chips but also play a crucial role in optimizing the broader data flow infrastructure, particularly over PCIe, eBPF and various frontend networks, ensuring seamless and efficient data movement between the host and accelerators. By decomposing and optimizing data flow infrastructure into state-of-the-art designs, we aim to maximize the performance and efficiency of AI workloads, enhancing the overall ecosystem's capabilities. Our collaborative efforts span hardware architects, system engineers, and AI researchers, all aimed at building a holistic observability stack that drives the next wave of AI innovation.

As a Senior Software Engineer on the MAIA System Infrastructure team, you will be instrumental in building and optimizing the observability infrastructure for our MAIA AI accelerators. Your focus will include not only providing deep insights into the execution and performance of the MAIA chips but also optimizing the data flow infrastructure across our hardware and systems as a whole. You will work on decomposing the data flow infrastructure into state-of-the-art designs that enable efficient data transfer and processing, crucial for the performance of large-scale AI workloads.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Qualifications

Required Qualifications:

  • Bachelor's Degree in Computer Science, or related technical discipline AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    • OR equivalent experience.
  • 4+ years experience in system-level programming.
  • 4+ years experience in optimizing data movement and communication with extremely low-latency requirements.

Other Requirements

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check:
    • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Proficient problem-solving skills and a track record of innovating solutions for complex system challenges in AI hardware and data infrastructure.
  • Strong collaboration and communication skills, with the ability to work across multidisciplinary teams and engage with the developer community.
  • Experience with open-source development and contributions is a plus.
  • Experience developing within existing GPU ecosystems is a plus.
  • Experience focused on AI accelerators or advanced embedded systems and low-latency data flow optimization is preferred.
  • Proven expertise in developing observability, profiling, or debugging tools for complex hardware systems, including deep knowledge of PCIe communication.
  • Ability to design and implement software that captures and analyzes low-level operations of AI accelerators and data flow across multiple abstractions and software stacks.
  • In-depth experience with eBPF and related tools (e.g., BCC, bpftrace), with a strong understanding of how to leverage eBPF for advanced monitoring, tracing, and debugging in complex systems.

Software Engineering IC4 - The typical base pay range for this role across Canada is CAD $108,100 - CAD $199,700 per year.

Find additional pay information here:
https://careers.microsoft.com/v2/global/en/canada-pay-information.html

Microsoft will accept applications for the role until January 20, 2025.

Responsibilities

This role requires a deep technical background and a hands-on approach, as you will design and implement software that interfaces with both the MAIA chips and the data flow infrastructure.

  • Lead by example in creating an inclusive culture that embraces diversity. Mentor and empower teammates, fostering an environment where all voices are heard and valued. Cultivate a team dynamic that drives high performance through mutual support and respect.
  • Design, develop, and maintain the observability infrastructure for the MAIA AI accelerators, enabling developers to gather the data necessary to debug, profile, analyze, and optimize AI models with unprecedented depth.
  • Optimize the data flow infrastructure over PCIe, ensuring efficient and high-throughput communication between the host and MAIA chips.
  • Decompose the data flow infrastructure into state-of-the-art designs, enhancing the overall efficiency and performance of AI workloads. An example of this is driving the development of innovative eBPF-based solutions that supercharge data collection, analysis, and system optimization.
  • Collaborate with hardware architects and system engineers to integrate the observability stack with the broader system, capturing detailed metrics and insights into data movement.
  • Develop tools and libraries that provide a holistic view of data flow, execution, and performance, extending beyond traditional GPU observability to meet the unique needs of our accelerators.
  • Engage with the AI research and developer community to understand their needs and incorporate feedback into the observability tools and data flow optimizations.
  • Ensure that the observability and data flow infrastructure meet the highest standards of performance, security, and reliability.
