Together AI's Posts (85)

Marketing Analytics and Ops Manager

We are seeking a highly analytical and process-oriented Marketing Analytics and Operations professional to join our dynamic marketing team. This role reports into the Revenue Strategy and Operations team. The ideal candidate will bridge the gap between marketing strategy and execution, leveraging data to optimize campaigns, streamline operations, and enhance overall marketing performance.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build next-generation AI infrastructure.

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $170,000 - $210,000 OTE + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge. This is a hybrid role based in the Bay Area.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Marketing Analytics:

- Develop, maintain, and automate marketing dashboards and reports that provide actionable insights into campaign performance, pipeline contribution, and ROI.
- Analyze marketing data from various sources (CRM, marketing automation, web analytics, advertising platforms) to identify trends, opportunities, and areas for optimization.
- Conduct deep-dive analyses on specific campaigns, channels, and segments to understand their impact and inform future strategies.
- Define and track key marketing metrics (e.g., MQLs, SQLs, CAC, LTV, conversion rates, funnel velocity) and communicate performance to stakeholders.
- Collaborate with sales operations and finance teams to ensure alignment on reporting, data definitions, and attribution models.
- Forecast marketing performance and identify potential risks and opportunities based on data.

Marketing Operations:

- Manage and optimize our marketing technology stack (e.g., Salesforce, Pardot, Outreach, Google Analytics, BI tools).
- Develop, implement, and optimize marketing processes, workflows, and best practices to improve efficiency and scalability.
- Ensure data integrity and cleanliness within marketing systems, including lead routing, lead scoring, and data enrichment.
- Oversee the implementation and maintenance of lead scoring models, nurturing programs, and segmentation strategies.
- Manage audience segmentation and targeting efforts to ensure effective delivery of marketing messages.
- Support campaign execution by providing technical assistance, list management, and performance tracking setup.
- Document marketing operations processes and provide training to marketing team members on system usage and best practices.
- Stay up to date with industry trends, marketing automation best practices, and new technologies.

Qualifications:

- Bachelor's degree in Marketing, Business, Statistics, Economics, Computer Science, or a related quantitative field.
- 5-7 years of experience in Marketing Analytics, Marketing Operations, Business Intelligence, or a similar role, preferably in a B2B SaaS environment.
- Strong proficiency with marketing automation platforms (e.g., HubSpot, Marketo, Pardot).
- Proven experience with data visualization tools (e.g., Tableau, Power BI, Google Data Studio, Looker) and creating impactful dashboards.
- Excellent analytical skills, with the ability to collect, organize, analyze, and disseminate significant amounts of information with attention to detail and accuracy.
- Advanced Excel skills (e.g., pivot tables, VLOOKUPs, complex formulas).
- Solid understanding of marketing funnel mechanics, lead management processes, and sales and marketing alignment.
- Experience with SQL, R, or Python for data extraction and analysis is a strong plus.
- Familiarity with web analytics platforms (e.g., Google Analytics, Adobe Analytics).
- Strong project management skills and the ability to manage multiple priorities in a fast-paced environment.
- Exceptional communication and interpersonal skills, with the ability to translate complex data into clear, actionable insights for non-technical stakeholders.
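To give a flavor of the funnel analysis this role involves, here is a minimal Python sketch computing an MQL-to-SQL conversion rate and customer acquisition cost. All numbers are hypothetical and purely illustrative:

```python
# Hypothetical funnel snapshot for one quarter; figures are illustrative only.
mqls = 1200                  # marketing-qualified leads
sqls = 300                   # sales-qualified leads
new_customers = 60           # closed-won deals attributed to marketing
marketing_spend = 180_000.0  # total program + media spend (USD)

# Funnel conversion rate: what fraction of MQLs became SQLs.
mql_to_sql_rate = sqls / mqls

# CAC: total spend divided by customers acquired in the period.
cac = marketing_spend / new_customers

print(f"MQL->SQL conversion: {mql_to_sql_rate:.1%}")  # 25.0%
print(f"CAC: ${cac:,.0f}")                            # $3,000
```

In practice these inputs would come from the CRM and marketing automation platform via SQL rather than being hard-coded.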

Location: San Francisco

Salary range: $170,000 - $210,000 OTE

Distributed ML Systems Engineer- Inference

Together AI is seeking a Distributed ML Systems Engineer to design and build scalable machine learning systems that power our accelerated AI initiatives. This role involves developing large-scale, fault-tolerant distributed systems that meet high-load and high-performance requirements. If you are passionate about designing ML systems that operate at scale and eager to create impactful solutions, we want to hear from you. This position offers the chance to work closely with our AI researchers and infrastructure teams to ensure our systems are robust and efficient. Join us in shaping the future at Together AI!

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society. Together, we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI. Our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build next-generation AI infrastructure.

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunities to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

- Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance.
- Develop and optimize distributed processing frameworks and storage systems.
- Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure.
- Conduct architecture and design reviews to ensure best practices in system design.
- Implement robust monitoring and logging systems to ensure the health and performance of our ML systems.

Qualifications:

- 3+ years of experience building large-scale, fault-tolerant, high-performance distributed systems.
- Strong programming skills in one or more of Python, Go, Rust, or C/C++.
- Excellent understanding of low-level operating systems concepts, including multi-threading, memory management, networking, storage, performance, and scale.
- Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure.
- Strong problem-solving skills and the ability to work in a fast-paced environment.
- Preferred: experience with Kubernetes.
- Preferred: experience with PyTorch.
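Fault tolerance in systems like these is often built from simple primitives such as bounded retries with exponential backoff and jitter. A minimal, framework-free Python sketch (the function and the flaky dependency below are hypothetical, not part of Together's stack):

```python
import random
import time

def retry(fn, attempts=4, base_delay=0.05, max_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff plus jitter.

    Jitter spreads retries out across many clients so a recovering
    service is not hit by a synchronized thundering herd.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry(flaky)
```

Production systems layer circuit breakers, deadlines, and idempotency keys on top of this, but the backoff loop is the common core.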

Location: San Francisco

Salary range: $160,000 - $230,000

GPU Cluster Resource Scheduling and Optimization Engineer

Together.ai is driving innovation in AI infrastructure by creating cutting-edge systems that enable scalable and efficient machine learning workloads. Our team tackles the unique challenges of resource scheduling, optimization, and orchestration for large-scale AI training and inference systems. We are looking for a talented AI Workload Resource Scheduling and Optimization Engineer to join our team. This role focuses on designing and implementing advanced scheduling algorithms, resource management strategies, and optimization techniques to maximize performance and minimize costs for large-scale distributed AI workloads.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build next-generation AI infrastructure.

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

- Resource scheduling and allocation:
  - Develop and implement intelligent scheduling algorithms tailored for distributed AI workloads in multi-cluster and multi-tenant environments.
  - Ensure efficient allocation of GPUs, TPUs, and CPUs across diverse workloads, balancing resource utilization and job performance.
- Performance optimization:
  - Design optimization techniques for dynamic resource allocation, addressing real-time variations in workload demand.
  - Implement load balancing, job preemption, and task placement strategies to maximize throughput and minimize latency.
- Scalability and efficiency:
  - Build systems that efficiently scale to thousands of nodes and petabytes of data.
  - Optimize training and inference pipelines to reduce runtime and cost while maintaining accuracy and reliability.
- Monitoring and analytics:
  - Build tools for real-time monitoring and diagnostics of resource utilization, job scheduling efficiency, and bottlenecks.
  - Leverage telemetry data and machine learning models for predictive analytics and proactive optimization.
- Collaboration and innovation:
  - Collaborate with researchers, data scientists, and platform engineers to understand workload requirements and align resource management solutions.
  - Stay up to date with the latest trends in distributed systems, AI model training, and cloud-native technologies.

Must-Have:

- 5+ years of experience in resource scheduling, distributed systems, or large-scale machine learning infrastructure.
- Proficiency in distributed computing frameworks (e.g., Kubernetes, Slurm, Ray).
- Expertise in designing and implementing resource allocation algorithms and scheduling frameworks.
- Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and GPU orchestration.
- Proficiency in Python, C++, or Go for building high-performance systems.
- Strong understanding of operations research techniques such as linear programming, graph algorithms, or evolutionary strategies.
- An analytical mindset with a focus on problem-solving and performance tuning.
- Excellent collaboration and communication skills across teams.

Nice-to-Have:

- Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, JAX).
- Familiarity with AI-specific workloads such as DDP, sharded training, or reinforcement learning.
- Knowledge of auto-scaling and cost-optimization strategies in cloud environments.
- Contributions to open-source scheduling or orchestration projects.
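To illustrate the flavor of the placement problem (not Together's actual scheduler), here is a hypothetical greedy best-fit sketch in Python: jobs are placed largest-first onto the node with the fewest free GPUs that still fits, which helps reduce fragmentation:

```python
def best_fit_schedule(jobs, nodes):
    """Greedy best-fit GPU placement (illustrative sketch only).

    jobs:  list of (job_id, gpus_needed)
    nodes: dict of node_id -> free GPU count (mutated in place)
    Returns a mapping job_id -> node_id; unplaceable jobs map to None.
    """
    placement = {}
    # Place larger jobs first: big requests are the hardest to fit.
    for job_id, need in sorted(jobs, key=lambda j: -j[1]):
        # Among nodes with enough capacity, pick the tightest fit
        # so large contiguous holes are preserved for future big jobs.
        candidates = [n for n, free in nodes.items() if free >= need]
        if not candidates:
            placement[job_id] = None
            continue
        chosen = min(candidates, key=lambda n: nodes[n])
        nodes[chosen] -= need
        placement[job_id] = chosen
    return placement

# Hypothetical cluster: two 8-GPU nodes and one 4-GPU node.
nodes = {"node-a": 8, "node-b": 8, "node-c": 4}
jobs = [("train-1", 8), ("train-2", 4), ("infer-1", 2)]
plan = best_fit_schedule(jobs, nodes)
```

Real schedulers add preemption, priorities, gang scheduling, and topology awareness on top; bin-packing heuristics like this are just the starting point.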

Location: San Francisco

Salary range: $160,000 - $230,000

LLM Training Dataset and Checkpoint Optimization Engineer

Together.ai is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models. We are seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build next-generation AI infrastructure.

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

- Dataset acceleration:
  - Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
  - Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
  - Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).
- Checkpointing systems:
  - Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
  - Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
  - Develop incremental and differential checkpointing solutions to reduce storage costs.
- Performance optimization:
  - Profile and debug bottlenecks in data pipelines and checkpoint systems.
  - Optimize GPU/TPU utilization by ensuring efficient data feeding and fast checkpoint recovery times.
- Scalability and reliability:
  - Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
  - Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
- Collaboration and support:
  - Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
  - Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Must-Have:

- 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
- Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow Data, DALI).
- Proficiency with distributed storage systems and data formats (e.g., Parquet, HDF5).
- Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
- Proficiency in Python, C++, or Go for performance-critical systems.
- Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
- Familiarity with compression and serialization for large datasets and checkpoints.
- An analytical and problem-solving mindset.
- Strong communication and collaboration skills across teams.

Nice-to-Have:

- Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
- Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
- Open-source contributions or projects related to data pipelines or checkpointing.
- Experience with incremental and real-time checkpointing solutions.
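The prefetching idea mentioned above can be sketched with nothing more than a background thread and a bounded queue. This hypothetical, framework-free illustration shows how I/O latency is hidden behind compute, the same principle DataLoader-style prefetching uses to keep accelerators fed:

```python
import queue
import threading

def prefetch(iterable, buffer_size=4):
    """Wrap an iterable so items are produced by a background thread.

    While the consumer processes item N, the producer is already
    fetching items N+1..N+buffer_size; the bounded queue applies
    backpressure so memory use stays capped.
    """
    q = queue.Queue(maxsize=buffer_size)
    _DONE = object()  # sentinel marking end of stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

# Items standing in for slow storage reads flow through unchanged.
batches = list(prefetch(iter(range(6)), buffer_size=2))
```

A production pipeline would add error propagation from the producer thread and multi-worker sharding, but the buffer-and-overlap structure is the same.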

Location: San Francisco

Salary range: $160,000 - $230,000

Research Scientist, Large-Scale Learning

The Model Shaping team at Together AI works on products and research for tailoring open foundation models to downstream applications. We build services that allow machine learning developers to choose the best models for their tasks and further improve these models using domain-specific data. In addition, we develop new methods for more efficient model training and evaluation, drawing inspiration from a broad spectrum of ideas across machine learning, natural language processing, and ML systems.

As a Research Scientist in Large-Scale Learning, you will work on methods for increasing the efficiency of training foundation models, in terms of both speed and resource usage. You will analyze the limitations of state-of-the-art techniques for neural network training, as well as the unique performance challenges of Together’s training setups. Based on this analysis, you will propose and implement new approaches, targeting both algorithmic improvements and systems optimizations. After evaluating your ideas through experimentation, you will present your findings to the global scientific community at leading ML/ML Systems conferences and collaborate with your teammates to integrate those improvements into Together’s platform.

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, RedPajama, SWARM Parallelism, and SpecExec. We invite you to join a passionate group of researchers on our journey to build next-generation AI infrastructure.

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $225,000 - $300,000. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy at https://www.together.ai/privacy

Responsibilities:

- Define and drive the research agenda around efficiency and performance of foundation model training.
- Study recent results from the broader AI research community, analyzing their relevance to the team’s research directions and ongoing projects.
- Conduct experiments to empirically validate your hypotheses and compare the outcomes with relevant related work.
- Share your findings both internally and externally (e.g., at top-tier conferences on ML and ML Systems).
- Partner with machine learning engineers to integrate advanced methods into Together’s Model Shaping platform.

You will be a great fit if you:

- Can autonomously design, implement, and validate your research ideas.
- Are skilled at writing high-quality and efficient code in Python and PyTorch.
- Have first-author publications at leading conferences on ML or ML Systems (ICLR, ICML, NeurIPS, MLSys).
- Are a strong communicator, ready both to discuss your research plans with other scientists and to explain them to a broader audience.
- Follow the latest advances in relevant subfields of AI.
- Are passionate about seeing your research create real-world impact through Together AI's services and willing to work hands-on with production systems to achieve it.

Experience in any of the following areas is a plus:

- Algorithmic modifications of large neural network training (e.g., novel optimization algorithms or model adaptation techniques).
- Distributed optimization (including federated learning, communication-efficient optimization, and decentralized training).
- ML systems optimizations for distributed training, memory efficiency, or compute efficiency.
- Writing optimized NVIDIA GPU kernels or communication collectives using NVIDIA’s networking stack (e.g., NCCL or NVSHMEM).
- Running large-scale experiments on GPU clusters.
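As a toy illustration of communication-efficient optimization, one of the research areas named above, here is a plain-Python sketch of local SGD: simulated workers take several gradient steps on their own 1-D quadratic objectives and only periodically average parameters, trading one communication round per step for one every few steps. Everything here is hypothetical and framework-free:

```python
def local_sgd(targets, steps=20, local_steps=5, lr=0.3):
    """Minimize sum_i (x - t_i)^2 with one simulated worker per target t_i.

    Workers run `local_steps` SGD steps independently, then average
    their parameters: one communication round instead of one per step.
    """
    params = [0.0 for _ in targets]  # each worker starts at x = 0
    for step in range(1, steps + 1):
        # Local step: the gradient of (x - t)^2 is 2 * (x - t).
        params = [x - lr * 2.0 * (x - t) for x, t in zip(params, targets)]
        if step % local_steps == 0:
            # Communication round: average parameters across workers.
            avg = sum(params) / len(params)
            params = [avg] * len(params)
    return params

# Two hypothetical workers with different local optima (1.0 and 3.0);
# periodic averaging pulls the shared model toward the global optimum, 2.0.
final = local_sgd([1.0, 3.0])
```

Real communication-efficient training must also handle gradient staleness and heterogeneous data, but this captures the core compute-versus-communication trade-off.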

Location: San Francisco, Amsterdam, London

Salary range: $225,000 - $300,000
