Publications
MLHC 2023 | DuETT: Dual Event Time Transformer for Electronic Health Records
Nature Communications | Decentralized federated learning through proxy model sharing
ICML 2023 | TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation
ICLR 2023 Spotlight | Disparate Impact in Differential Privacy from Gradient Misalignment
ICLR 2023 | Verifying the Union of Manifolds Hypothesis for Image Data
ICLR 2023 | Temporal Dependencies in Feature Importance for Time Series Prediction
ICML 2022 | Bayesian Nonparametrics for Offline Skill Discovery
RecSys Challenge 2022 2nd Place | Robust User Engagement Modeling with Transformers and Self-Supervision
TMLR 2022 | Diagnosing and Fixing Manifold Overfitting in Deep Generative Models
RecSys Challenge 2023 1st Place | Robust User Engagement Modeling with Transformers and Self-Supervision

Abstract
Online advertising has seen exponential growth, transforming into a vast and dynamic market that encompasses many diverse platforms such as web search, e-commerce, social media and mobile apps. The rapid growth of products and services presents a formidable challenge for advertising platforms, and accurately modeling user intent is increasingly critical for targeted ad placement. The 2023 ACM RecSys Challenge, organized by ShareChat, provides a standardized benchmark for developing and evaluating user intent models using a large dataset of impressions from the ShareChat and Moj apps. In this paper we present our approach to this challenge. We use Transformers to automatically capture interactions between different types of input features, and propose a self-supervised optimization framework based on a contrastive objective. Empirically, we demonstrate that self-supervised learning effectively reduces overfitting, improving model generalization and leading to significant gains in performance. Our team, Layer 6 AI, achieved 1st place on the final leaderboard out of over 100 teams.
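As a rough illustration of the kind of contrastive objective referenced above (not the authors' exact formulation), the following PyTorch sketch computes a generic InfoNCE loss over two views of a batch of embeddings; the encoder, batch construction and temperature are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.1):
    """Generic InfoNCE contrastive loss: each anchor should match its own
    positive and repel the positives of other examples in the batch."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature        # [B, B] similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Toy usage: two augmented "views" of the same batch of user/ad representations.
emb_a = torch.randn(32, 128)
emb_b = emb_a + 0.05 * torch.randn(32, 128)
print(info_nce_loss(emb_a, emb_b).item())
```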
MLHC 2023 | DuETT: Dual Event Time Transformer for Electronic Health Records

Abstract
Electronic health records (EHRs) recorded in hospital settings typically contain a wide range of numeric time series data characterized by high sparsity and irregular observations. Effective modelling of such data must exploit its time series nature, the semantic relationships between different types of observations, and the information in the sparsity structure of the data. Self-supervised Transformers have shown outstanding performance in a variety of structured tasks in NLP and computer vision. However, multivariate time series data contains structured relationships over two dimensions, time and recorded event type, and straightforward applications of Transformers to time series data do not leverage this distinct structure. The quadratic scaling of self-attention layers can also significantly limit the input sequence length without appropriate input engineering. We introduce the DuETT architecture, an extension of Transformers designed to attend over both the time and event type dimensions, yielding robust representations of EHR data. DuETT uses an aggregated input where sparse time series are transformed into a regular sequence with fixed length; this lowers the computational complexity relative to previous EHR Transformer models and, more importantly, enables the use of larger and deeper neural networks. When trained with self-supervised prediction tasks that provide rich and informative signals for model pre-training, our model outperforms state-of-the-art deep learning models on multiple downstream tasks from the MIMIC-IV and PhysioNet-2012 EHR datasets.
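The sketch below illustrates the general idea of attending over both the time and event-type axes of an aggregated [time, event, feature] tensor; the layer sizes, head count and alternation pattern are illustrative assumptions, not the published DuETT architecture.

```python
import torch
import torch.nn as nn

class DualAxisAttention(nn.Module):
    """Sketch of attention applied alternately over the time axis and the
    event-type axis of an aggregated EHR tensor of shape [B, T, E, D]."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.event_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: [B, T, E, D]
        b, t, e, d = x.shape
        # Attend over time: fold event types into the batch dimension.
        xt = x.permute(0, 2, 1, 3).reshape(b * e, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, e, t, d).permute(0, 2, 1, 3)
        # Attend over event types: fold time steps into the batch dimension.
        xe = x.reshape(b * t, e, d)
        xe, _ = self.event_attn(xe, xe, xe)
        return xe.reshape(b, t, e, d)

out = DualAxisAttention(dim=32)(torch.randn(2, 16, 10, 32))
print(out.shape)  # torch.Size([2, 16, 10, 32])
```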
Nature Communications | Decentralized federated learning through proxy model sharing

Abstract
Institutions in highly regulated domains such as finance and healthcare often have restrictive rules around data sharing. Federated learning is a distributed learning framework that enables multi-institutional collaborations on decentralized data with improved protection for each collaborator’s data privacy. In this paper, we propose a communication-efficient scheme for decentralized federated learning called ProxyFL, or proxy-based federated learning. Each participant in ProxyFL maintains two models: a private model, and a publicly shared proxy model designed to protect the participant’s privacy. Proxy models allow efficient information exchange among participants without the need for a centralized server. The proposed method eliminates a significant limitation of canonical federated learning by allowing model heterogeneity; each participant can have a private model with any architecture. Furthermore, our protocol for communication by proxy leads to stronger privacy guarantees using differential privacy analysis. Experiments on popular image datasets, and a cancer diagnostic problem using high-quality gigapixel histology whole slide images, show that ProxyFL can outperform existing alternatives with much less communication overhead and stronger privacy.
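The following numpy sketch illustrates only the communication pattern: each participant keeps a private model and a proxy model, and only proxy parameters are exchanged with neighbours over a decentralized graph. The local update is a stand-in for actual training, and the differential-privacy mechanism is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants, dim = 4, 8

# Each participant keeps a private model and a separate proxy model;
# only proxy parameters ever leave the participant.
private = [rng.normal(size=dim) for _ in range(n_participants)]
proxy = [rng.normal(size=dim) for _ in range(n_participants)]
ring = {i: [(i - 1) % n_participants, (i + 1) % n_participants]
        for i in range(n_participants)}          # decentralized communication graph

for _ in range(10):
    # Local step (stand-in for training): nudge each proxy toward its private model.
    proxy = [p + 0.1 * (q - p) for p, q in zip(proxy, private)]
    # Communication step: average each proxy with its neighbours' proxies.
    proxy = [np.mean([proxy[i]] + [proxy[j] for j in ring[i]], axis=0)
             for i in range(n_participants)]

print(np.round(proxy[0], 2))
```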
ICML 2023 | TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Abstract
We propose TR0N, a highly general framework to turn pre-trained unconditional generative models, such as GANs and VAEs, into conditional models. The conditioning can be highly arbitrary, and requires only a pre-trained auxiliary model. For example, we show how to turn unconditional models into class-conditional ones with the help of a classifier, and also into text-to-image models by leveraging CLIP. TR0N learns a lightweight stochastic mapping which “translates” between the space of conditions and the latent space of the generative model, in such a way that the generated latent corresponds to a data sample satisfying the desired condition. The translated latent samples are then further improved upon through Langevin dynamics, enabling us to obtain higher-quality data samples. TR0N requires neither training data nor fine-tuning, yet can achieve a zero-shot FID of 10.9 on MS-COCO, outperforming competing alternatives not only on this metric, but also in sampling speed, all while retaining a much higher level of generality.
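A minimal sketch of the Langevin refinement step described above, assuming a generic energy function that stands in for the auxiliary model's negative log-likelihood of the condition; the step size, number of steps and the toy energy are illustrative.

```python
import torch

def langevin_refine(z, energy_fn, steps=20, step_size=0.01):
    """Refine latent samples with (unadjusted) Langevin dynamics:
    z <- z - step_size * grad E(z) + sqrt(2 * step_size) * noise."""
    z = z.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(z).sum()
        grad, = torch.autograd.grad(energy, z)
        with torch.no_grad():
            z = z - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(z)
        z.requires_grad_(True)
    return z.detach()

# Toy energy: distance of the latent to a target vector (stands in for the
# negative log-likelihood of the condition under an auxiliary model).
target = torch.randn(64)
refined = langevin_refine(torch.randn(8, 64), lambda z: ((z - target) ** 2).sum(dim=-1))
print(refined.shape)
```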
ICLR 2023 Spotlight | Disparate Impact in Differential Privacy from Gradient Misalignment

Abstract
As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD.
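For context, a minimal numpy sketch of the per-example clipping and noising step in DP-SGD: examples with large gradients are rescaled the most, so the aggregate direction shifts away from them, which is the kind of gradient misalignment the paper analyzes. The clipping norm and noise multiplier here are arbitrary.

```python
import numpy as np

def dpsgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to
    clip_norm, sum, add Gaussian noise, and average."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    total += rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return total / len(per_example_grads)

# The example with the large gradient is scaled down the most, so it
# contributes least to the aggregate direction.
grads = [np.array([0.1, 0.1]), np.array([5.0, -5.0])]
print(dpsgd_step(grads))
```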
ICLR 2023 | Verifying the Union of Manifolds Hypothesis for Image Data

Abstract
Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there were no hidden low-dimensional structure in data of interest; its existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in image data. Assuming that data lies on a single manifold implies that intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we consider the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensions. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, observed data lies on a disconnected set and that intrinsic dimension is not constant. We also provide insights into the implications of the union of manifolds hypothesis in deep learning, both supervised and unsupervised, showing that designing models with an inductive bias for this structure improves performance across classification and generative modelling tasks. Our code is available at https://github.com/layer6ai-labs/UoMH.
ICLR 2023 | Temporal Dependencies in Feature Importance for Time Series Prediction

Abstract
Time series data introduces two key challenges for explainability methods: first, observations of the same feature over subsequent time steps are not independent, and second, the same feature can have varying importance to model predictions over time. In this paper, we propose Windowed Feature Importance in Time (WinIT), a feature-removal-based explainability approach that addresses these issues. Unlike existing feature removal explanation methods, WinIT explicitly accounts for the temporal dependence between different observations of the same feature in the construction of its importance score. Furthermore, WinIT captures the varying importance of a feature over time by summarizing its importance over a window of past time steps. We conduct an extensive empirical study on synthetic and real-world data, compare against a wide range of leading explainability methods, and explore the impact of various evaluation strategies. Our results show that WinIT achieves significant gains over existing methods, with more consistent performance across different evaluation metrics.
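A simplified feature-removal sketch in the spirit of WinIT (not the exact score definition): mask one feature over a window of past steps and measure the change in the model's output. The zero-mask baseline and the toy model are assumptions.

```python
import numpy as np

def window_importance(model, x, feature, t, window):
    """Importance of `feature` for the prediction at time t, estimated by
    masking it over the past `window` steps and measuring the output change.
    x has shape [T, F]; `model` maps a [T, F] history to a scalar prediction."""
    base = model(x[: t + 1])
    masked = x[: t + 1].copy()
    start = max(0, t - window + 1)
    masked[start : t + 1, feature] = 0.0          # simple zero-mask baseline
    return abs(base - model(masked))

# Toy model: prediction is the mean of feature 0 over the last 3 steps.
toy_model = lambda h: h[-3:, 0].mean()
x = np.random.default_rng(0).normal(size=(20, 4))
print(window_importance(toy_model, x, feature=0, t=19, window=3))
```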
ICML 2022 | Bayesian Nonparametrics for Offline Skill Discovery

Abstract
Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments.
RecSys Challenge 2022 2nd Place | Robust User Engagement Modeling with Transformers and Self-Supervision

Abstract
Large item catalogs and constantly changing preference trends make recommendations a critically important component of every fashion e-commerce platform. However, since most users browse anonymously, historical preference data is rarely available and recommendations have to be made using only information from within the session. In the 2022 ACM RecSys Challenge, Dressipi released a dataset with 1.1 million online retail sessions in the fashion domain that span an 18-month period. The goal is to predict the item purchased at the end of each session. To simulate a common production scenario, all sessions are anonymous and no previous user preference information is available. In this paper, we present our approach to this challenge. We leverage the Transformer architecture with two different learning objectives inspired by self-supervised learning techniques to improve generalization. Our team, Layer 6, achieves strong results, placing 2nd on the final leaderboard out of over 300 teams.
TMLR 2022 | Diagnosing and Fixing Manifold Overfitting in Deep Generative Models

Abstract
Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
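A tiny instance of the two-step recipe using off-the-shelf components, PCA for the dimensionality reduction step and a Gaussian mixture for maximum-likelihood density estimation on the latents; the paper's actual instantiations use deep generative models, so this only illustrates the structure of the procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data on a 1-D manifold (a noisy circle) embedded in 10-D ambient space.
theta = rng.uniform(0, 2 * np.pi, size=1000)
low_dim = np.stack([np.cos(theta), np.sin(theta)], axis=1)
data = low_dim @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(1000, 10))

# Step 1: dimensionality reduction onto (an estimate of) the manifold.
pca = PCA(n_components=2).fit(data)
latents = pca.transform(data)

# Step 2: maximum-likelihood density estimation in the low-dimensional space.
density = GaussianMixture(n_components=8, random_state=0).fit(latents)
print("mean log-likelihood on latents:", density.score(latents))
```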
CVPR 2022 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Abstract
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text’s most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text’s attention weights over the frames. We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results of up to 12% relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to the text. Full code and demo can be found at: layer6ailabs.github.io/xpool/.
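A minimal sketch of the core mechanism: scaled dot-product attention with the text embedding as the query and frame embeddings as keys and values, producing a text-conditioned video representation. The dimensions and the absence of learned projections are simplifications.

```python
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb, frame_embs, scale=None):
    """text_emb: [B, D]; frame_embs: [B, T, D].
    Returns a [B, D] video representation aggregated with attention weights
    given by the text's similarity to each frame."""
    d = text_emb.size(-1)
    scale = scale or d ** 0.5
    scores = torch.einsum("bd,btd->bt", text_emb, frame_embs) / scale
    weights = F.softmax(scores, dim=-1)                 # attention over frames
    return torch.einsum("bt,btd->bd", weights, frame_embs)

video = text_conditioned_pool(torch.randn(4, 256), torch.randn(4, 12, 256))
print(video.shape)  # torch.Size([4, 256])
```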
Nature Scientific Reports | Federated learning and differential privacy for medical image analysis

Abstract
The artificial intelligence revolution has been spurred forward by the availability of large-scale datasets. In contrast, the paucity of large-scale medical datasets hinders the application of machine learning in healthcare. The lack of publicly available multi-centric and diverse datasets mainly stems from confidentiality and privacy concerns around sharing medical data. To demonstrate a feasible path forward in medical imaging, we conduct a case study applying a differentially private federated learning framework to the analysis of histopathology images, the largest and perhaps most complex medical images. We study the effects of IID and non-IID distributions along with the number of healthcare providers, i.e., hospitals and clinics, and the individual dataset sizes, using The Cancer Genome Atlas (TCGA) dataset, a public repository, to simulate a distributed environment. We empirically compare the performance of private, distributed training to conventional training and demonstrate that distributed training can achieve similar performance with strong privacy guarantees. We also study the effect of different source domains for histopathology images by evaluating the performance using external validation. Our work indicates that differentially private federated learning is a viable and reliable framework for the collaborative development of machine learning models in medical image analysis.
RecSys 2021 | User Engagement Modeling with Deep Learning and Language Models

Abstract
Twitter is one of the main information sharing platforms in the world, with millions of tweets created daily. To ensure that users get relevant content in their feeds, Twitter extensively leverages machine learning-based recommender systems. However, given the large volume of data, all production systems must be both memory and CPU efficient. In the 2021 ACM RecSys Challenge, Twitter simulates the production environment with a large dataset of almost 1 billion user-tweet engagements that span a 4-week period. The goal is to accurately predict engagement type, and all models are subject to strict run-time constraints during inference. In this paper we present our approach to the 2021 ACM RecSys Challenge. We use a hybrid pipeline and leverage gradient boosting, neural network classifiers and multi-lingual language models to maximize performance. Our approach achieves strong results, placing 3rd on the public leaderboard. We further explore the complexity of language model inference, and show that through distillation it is possible to run such models in highly constrained production environments.
ICCV 2021 | Context-aware Scene Graph Generation with Seq2Seq Transformers

Abstract
Scene graph generation is an important task in computer vision aimed at improving the semantic understanding of the visual world. In this task, the model needs to detect objects and predict visual relationships between them. Most of the existing models predict relationships in parallel assuming their independence. While there are different ways to capture these dependencies, we explore a conditional approach motivated by the sequence-to-sequence (Seq2Seq) formalism. Different from the previous research, our proposed model predicts visual relationships one at a time in an autoregressive manner by explicitly conditioning on the already predicted relationships. Drawing from translation models in NLP, we propose an encoder-decoder model built using Transformers where the encoder captures global context and long range interactions. The decoder then makes sequential predictions by conditioning on the scene graph constructed so far. In addition, we introduce a novel reinforcement learning-based training strategy tailored to Seq2Seq scene graph generation. By using a self-critical policy gradient training approach with Monte Carlo search we directly optimize for the (mean) recall metrics and bridge the gap between training and evaluation. Experimental results on two public benchmark datasets demonstrate that our Seq2Seq learning approach achieves strong empirical performance, outperforming previous state-of-the-art, while remaining efficient in terms of training and inference time.
NeurIPS 2021 | Rectangular Flows for Manifold Learning

Abstract
Normalizing flows are invertible neural networks with tractable change-of-volume terms, which allow optimization of their parameters to be efficiently performed via maximum likelihood. However, data of interest are typically assumed to live in some (often unknown) low-dimensional manifold embedded in a high-dimensional ambient space. The result is a modelling mismatch since — by construction — the invertibility requirement implies high-dimensional support of the learned distribution. Injective flows, mappings from low- to high-dimensional spaces, aim to fix this discrepancy by learning distributions on manifolds, but the resulting volume-change term becomes more challenging to evaluate. Current approaches either avoid computing this term entirely using various heuristics, or assume the manifold is known beforehand and therefore are not widely applicable. Instead, we propose two methods to tractably calculate the gradient of this term with respect to the parameters of the model, relying on careful use of automatic differentiation and techniques from numerical linear algebra. Both approaches perform end-to-end nonlinear manifold learning and density estimation for data projected onto this manifold. We study the trade-offs between our proposed methods, empirically verify that we outperform approaches ignoring the volume-change term by more accurately learning manifolds and the corresponding distributions on them, and show promising results on out-of-distribution detection. Our code is available at https://github.com/layer6ai-labs/rectangular-flows.
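For reference, the change-of-volume term for an injective map g from a d-dimensional latent space to a D-dimensional ambient space is (1/2) log det(J^T J). The brute-force autodiff sketch below computes it for a toy map; the paper's contribution is making the gradient of this term tractable without forming the full Jacobian, which this naive version does not attempt.

```python
import torch
from torch.autograd.functional import jacobian

def injective_log_volume(g, z):
    """0.5 * log det(J^T J) for an injective map g evaluated at a single
    low-dimensional point z (brute-force Jacobian, small dimensions only)."""
    J = jacobian(g, z)                         # shape [D, d]
    return 0.5 * torch.linalg.slogdet(J.T @ J).logabsdet

# Toy injective map from R^2 into R^5.
lift = torch.randn(5, 2)
g = lambda z: torch.tanh(lift @ z)
print(injective_log_volume(g, torch.randn(2)))
```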
NeurIPS 2021 | Tractable Density Estimation on Learned Manifolds with Conformal Embedding Flows

Abstract
Normalizing flows are generative models that provide tractable density estimation via an invertible transformation from a simple base distribution to a complex target distribution. However, this technique cannot directly model data supported on an unknown low-dimensional manifold, a common occurrence in real-world domains such as image data. Recent attempts to remedy this limitation have introduced geometric complications that defeat a central benefit of normalizing flows: exact density estimation. We recover this benefit with Conformal Embedding Flows, a framework for designing flows that learn manifolds with tractable densities. We argue that composing a standard flow with a trainable conformal embedding is the most natural way to model manifold-supported data. To this end, we present a series of conformal building blocks and apply them in experiments with synthetic and real-world data to demonstrate that flows can model manifold-supported distributions without sacrificing tractable likelihoods.
CVPR 2021 | Weakly Supervised Action Selection Learning in Video

Abstract
Localizing actions in video is a core task in computer vision. The weakly supervised temporal localization problem investigates whether this task can be adequately solved with only video-level labels, significantly reducing the amount of expensive and error-prone annotation that is required. A common approach is to train a frame-level classifier where frames with the highest class probability are selected to make a video-level prediction. Frame-level activations are then used for localization. However, the absence of frame-level annotations causes the classifier to impart class bias on every frame. To address this, we propose the Action Selection Learning (ASL) approach to capture the general concept of action, a property we refer to as “actionness”. Under ASL, the model is trained with a novel class-agnostic task to predict which frames will be selected by the classifier. Empirically, we show that ASL outperforms leading baselines on two popular benchmarks, THUMOS-14 and ActivityNet-1.2, with 10.3% and 5.7% relative improvement respectively. We further analyze the properties of ASL and demonstrate the importance of actionness.
ICLR 2021 | C-Learning: Horizon-Aware Cumulative Accessibility Estimation

Abstract
Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show that these functions obey a recurrence relation, which enables learning from offline interactions. We also prove that optimal cumulative accessibility functions are monotonic in the planning horizon. Additionally, our method can trade off speed and reliability in goal-reaching by suggesting multiple paths to a single goal depending on the provided horizon. We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. Our code is available at this https URL, and additional visualizations can be found at this https URL.
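A tabular toy example of a cumulative accessibility function and its recurrence on a small deterministic chain; the states, actions and horizon are made up, and the actual method learns a function approximator from offline interactions rather than filling in a table.

```python
import numpy as np

# Tabular sketch of a cumulative accessibility function C[s, g, h]:
# "can goal g be reached from state s within h steps?" on a small
# deterministic chain where actions move left, stay, or move right.
n_states, horizon = 6, 5
actions = [-1, 0, 1]

C = np.zeros((n_states, n_states, horizon + 1))
C[np.arange(n_states), np.arange(n_states), 0] = 1.0   # reachable in 0 steps iff s == g

for h in range(1, horizon + 1):
    for s in range(n_states):
        for g in range(n_states):
            nxt = [min(max(s + a, 0), n_states - 1) for a in actions]
            # Recurrence: best next state's accessibility at horizon h - 1.
            C[s, g, h] = max(C[sp, g, h - 1] for sp in nxt)

# Monotonic in the horizon: once reachable, still reachable with more steps.
print(C[0, 3, 2], C[0, 3, 3], C[0, 3, 5])   # 0.0 1.0 1.0
```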
The Web Conference 2021 | HGCF: Hyperbolic Graph Convolution Networks for Collaborative Filtering

Abstract
Hyperbolic spaces offer a rich setup for learning embeddings with superior properties that have been leveraged in areas such as computer vision, natural language processing and computational biology. Recently, several hyperbolic approaches have been proposed to learn robust representations for users and items in the recommendation setting. However, these approaches do not capture the higher order relationships that typically exist in the recommendation domain. Graph convolutional neural networks (GCNs), on the other hand, excel at capturing higher order information by applying multiple levels of aggregation to local representations. In this paper we combine these frameworks in a novel way, by proposing a hyperbolic GCN model for collaborative filtering. We demonstrate that our model can be effectively learned with a margin ranking loss, and show that hyperbolic space has desirable properties under the margin ranking setting. At test time, inference in our model is done using the hyperbolic distance, which preserves the structure of the learned space. We conduct extensive empirical analysis on three public benchmarks and compare against a large set of baselines. Our approach achieves highly competitive results and outperforms leading baselines, including the Euclidean GCN counterpart. We further study the properties of the learned hyperbolic embeddings and show that they offer meaningful insights into the data.
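A sketch of two of the ingredients named above, the Poincare-ball distance and a margin ranking loss over user/item embeddings; the hyperbolic graph convolutions of the actual model are omitted and the embeddings here are random placeholders.

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """Distance on the Poincare ball (curvature -1)."""
    sq = ((u - v) ** 2).sum(-1)
    denom = (1 - (u ** 2).sum(-1)).clamp_min(eps) * (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom + eps)

def margin_ranking_loss(user, pos_item, neg_item, margin=0.5):
    """Positive items should be closer to the user than negatives, by a margin."""
    pos = poincare_distance(user, pos_item)
    neg = poincare_distance(user, neg_item)
    return torch.clamp(pos - neg + margin, min=0).mean()

# Random placeholder embeddings, scaled to lie well inside the unit ball.
u, p, n = (0.1 * torch.randn(8, 16) for _ in range(3))
print(margin_ranking_loss(u, p, n))
```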
npj Digital Medicine | Predicting adverse outcomes due to diabetes complications with machine learning using administrative health data

Abstract
Across jurisdictions, government and health insurance providers hold a large amount of data from patient interactions with the healthcare system. We aimed to develop a machine learning-based model for predicting adverse outcomes due to diabetes complications using administrative health data from the single-payer health system in Ontario, Canada. A Gradient Boosting Decision Tree model was trained on data from 1,029,366 patients, validated on 272,864 patients, and tested on 265,406 patients. Discrimination was assessed using the AUC statistic and calibration was assessed visually using calibration plots overall and across population subgroups. Our model predicting three-year risk of adverse outcomes due to diabetes complications (hyper/hypoglycemia, tissue infection, retinopathy, cardiovascular events, amputation) included 700 features from multiple diverse data sources and had strong discrimination (average test AUC = 77.7, range 77.7–77.9). Through the design and validation of a high-performance model to predict diabetes complications adverse outcomes at the population level, we demonstrate the potential of machine learning and administrative health data to inform health planning and healthcare resource allocation for diabetes management.
Characterizing early Canadian federal, provincial, territorial and municipal nonpharmaceutical interventions in response to COVID-19: a descriptive analysis

Abstract
Nonpharmaceutical interventions (NPIs) are the primary tools to mitigate early spread of the coronavirus disease 2019 (COVID-19) pandemic; however, such policies are implemented variably at the federal, provincial or territorial, and municipal levels without centralized documentation. We describe the development of the comprehensive open Canadian Non-Pharmaceutical Intervention (CAN-NPI) data set, which identifies and classifies all NPIs implemented in regions across Canada in response to COVID-19, and provides an accompanying description of geographic and temporal heterogeneity.
Nature Scientific Reports | Evolutionary and structural analyses of SARS-CoV-2 D614G spike protein mutation now documented worldwide

Abstract
The COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was declared on March 11, 2020 by the World Health Organization. As of May 31, 2020, there have been more than 6 million COVID-19 cases diagnosed worldwide and over 370,000 deaths, according to Johns Hopkins. Thousands of SARS-CoV-2 strains have been sequenced to date, providing a valuable opportunity to investigate the evolution of the virus on a global scale. We performed a phylogenetic analysis of over 1,225 SARS-CoV-2 genomes spanning from late December 2019 to mid-March 2020. We identified a missense mutation, D614G, in the spike protein of SARS-CoV-2, which has emerged as a predominant clade in Europe (954 of 1,449 (66%) sequences) and is spreading worldwide (1,237 of 2,795 (44%) sequences). Molecular dating analysis estimated the emergence of this clade around mid-to-late January (10–25 January) 2020. We also applied structural bioinformatics to assess the potential impact of D614G on the virulence and epidemiology of SARS-CoV-2. In silico analyses of the spike protein structure suggest that the mutation is most likely neutral to protein function as it relates to its interaction with the human ACE2 receptor. The lack of available clinical metadata prevented our investigation of the association between viral clade and disease severity phenotype. Future work that can leverage clinical outcome data with both viral and human genomic diversity is needed to monitor the pandemic.
ACM RecSys | TAFA: Two-headed Attention Fused Autoencoder for Context-Aware Recommendations

Abstract
Collaborative filtering with implicit feedback is a ubiquitous class of recommendation problems where only positive interactions such as purchases or clicks are observed. Autoencoder-based recommendation models have shown strong performance on many implicit feedback benchmarks. However, these models tend to suffer from popularity bias, making recommendations less personalized. User-generated reviews contain a rich source of preference information, often with specific details that are important to each user, and can help mitigate the popularity bias. Since not all reviews are equally useful, existing work has explored various forms of attention to distill relevant information. In the majority of proposed approaches, representations from the implicit feedback and review branches are simply concatenated at the end to generate predictions. This can prevent the model from learning deeper correlations between the two modalities and affect prediction accuracy. To address these problems, we propose a novel Two-headed Attention Fused Autoencoder (TAFA) model that jointly learns representations from user reviews and implicit feedback to make recommendations. We apply early and late modality fusion, which allows the model to fully correlate and extract relevant information from both input sources. To further combat popularity bias, we leverage the Noise Contrastive Estimation (NCE) objective to “de-popularize” the fused user representation via a two-headed decoder architecture. Empirically, we show that TAFA outperforms leading baselines on multiple real-world benchmarks. Moreover, by tracing attention weights back to reviews we can provide explanations for the generated recommendations and gain further insights into user preferences.
ACM RecSys Challenge 2020 2nd Place | Predicting Twitter Engagement With Deep Language Models

Abstract
Twitter has become one of the main information sharing platforms for millions of users world-wide. Numerous tweets are created daily, many with highly time-sensitive content such as breaking news, new multimedia content or personal updates. Consequently, accurately recommending relevant tweets to users in a timely manner is a highly important and challenging problem. The 2020 ACM RecSys Challenge is aimed at benchmarking leading recommendation models for this task. The challenge is based on a large and recent dataset of over 200M tweet engagements released by Twitter, with content in over 50 languages. In this work we present our approach, where we leverage recent advances in deep language modeling and attention architectures to combine information from extracted features, user engagement history and target tweet content. We first fine-tune the leading multilingual language models M-BERT and XLM-R for Twitter data. Embeddings from these models are used to extract tweet and user history representations. We then combine all components together and jointly train them to maximize engagement prediction accuracy. Our approach achieves highly competitive performance, placing 2nd on the final private leaderboard.
ICML 2020 | Improving Transformer Optimization Through Better Initialization

Abstract
The Transformer architecture has achieved considerable success recently; the key component of the Transformer is the attention layer that enables the model to focus on important regions within an input sequence. Gradient optimization with attention layers can be notoriously difficult, requiring tricks such as learning rate warmup to prevent divergence. As Transformer models are becoming larger and more expensive to train, recent research has focused on understanding and improving optimization in these architectures. In this work our contributions are two-fold: we first investigate and empirically validate the source of optimization problems in the encoder-decoder Transformer architecture; we then propose a new weight initialization scheme with theoretical justification that enables training without warmup or layer normalization. Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy, making it possible to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty.
NeurIPS 2019 Oral | Guided Similarity Separation for Image Retrieval

Abstract
Despite recent progress in computer vision, image retrieval remains a challenging open problem. Numerous variations such as view angle, lighting and occlusion make it difficult to design models that are both robust and efficient. Many leading methods traverse the nearest neighbor graph to exploit higher order neighbor information and uncover the highly complex underlying manifold. In this work we propose a different approach where we leverage graph convolutional networks to directly encode neighbor information into image descriptors. We further leverage ideas from clustering and manifold learning, and introduce an unsupervised loss based on pairwise separation of image similarities. Empirically, we demonstrate that our model is able to successfully learn a new descriptor space that significantly improves retrieval accuracy, while still allowing efficient inner product inference. Experiments on five public benchmarks show highly competitive performance with up to 24% relative improvement in mAP over leading baselines.
YouTube-8M Video Understanding Challenge 1st Place | Cross-Class Relevance Learning for Temporal Concept Localization

Abstract
We present a novel Cross-Class Relevance Learning approach for the task of temporal concept localization. Most localization architectures rely on feature extraction layers followed by a classification layer which outputs class probabilities for each segment. However, in many real-world applications classes can exhibit complex relationships that are difficult to model with this architecture. In contrast, we propose to incorporate target class and class-related features as input, and learn a pairwise binary model to predict general segment to class relevance. This facilitates learning of shared information between classes, and allows for arbitrary class-specific feature engineering. We apply this approach to the 3rd YouTube-8M Video Understanding Challenge together with other leading models, and achieve first place out of over 280 teams. In this paper we describe our approach and show some empirical results.
Open Images 2019 Visual Relationship Challenge 1st Place | Learning Effective Visual Relationship Detector on 1 GPU

Abstract
We present our winning solution to the Open Images 2019 Visual Relationship challenge. This is the largest challenge of its kind to date, with nearly 9 million training images. The challenge task consists of detecting objects and identifying relationships between them in complex scenes. Our solution has three stages: first, an object detection model is fine-tuned for the challenge classes using a novel weight transfer approach; then, spatio-semantic and visual relationship models are trained on candidate object pairs; finally, features and model predictions are combined to generate the final relationship prediction. Throughout the challenge we focused on minimizing the hardware requirements of our architecture. Specifically, our weight transfer approach enables much faster optimization, allowing the entire architecture to be trained on a single GPU in under two days. In addition to efficient optimization, our approach also achieves superior accuracy, winning first place out of over 200 teams and outperforming the second place team by over 5% on the held-out private leaderboard.
ACM RecSys Challenge 2019 2nd Place | Robust Contextual Models for In-Session Personalization

Abstract
Most online activity happens in the context of a session; to enable better user experience many online platforms aim to dynamically refine their recommendations as sessions progress. A popular approach is to continuously re-rank recommendations based on current session activity and past session logs. This motivates the 2019 ACM RecSys Challenge organised by Trivago. Using the session log dataset released by Trivago, the challenge aims to benchmark models for in-session re-ranking of hotel recommendations. In this paper we present our approach to this challenge where we first contextualize sessions in a global and local manner, and then train gradient boosting and deep learning models for re-ranking. Our team achieved 2nd place out of over 570 teams, with less than 0.3% relative difference in Mean Reciprocal Rank from the 1st place team.
CVPR 2019 | Explore-Exploit Graph Traversal for Image Retrieval

Abstract
We propose a novel graph-based approach for image retrieval. Given a nearest neighbor graph produced by the global descriptor model, we traverse it by alternating between exploit and explore steps. The exploit step maximally utilizes the immediate neighborhood of each vertex, while the explore step traverses vertices that are farther away in the descriptor space. By combining these two steps we can better capture the underlying image manifold, and successfully retrieve relevant images that are visually dissimilar to the query. Our traversal algorithm is conceptually simple, has few tunable parameters and can be implemented with basic data structures. This enables fast real-time inference for previously unseen queries with minimal memory overhead. Despite its relative simplicity, we show highly competitive results on multiple public benchmarks, including the largest image retrieval dataset that is currently publicly available.
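A simplified priority-queue version of the explore-exploit idea on a toy kNN graph; the threshold, scoring and data are placeholders rather than the paper's exact algorithm.

```python
import heapq

def explore_exploit_retrieve(graph, query_neighbors, k=10, exploit_threshold=0.8):
    """graph: {node: [(neighbor, similarity), ...]} built from a kNN graph.
    query_neighbors: [(node, similarity), ...] for the query image.
    Strong edges are exploited immediately (they jump the queue); weaker edges
    are kept for exploration once no strong edge remains."""
    heap = [(-sim, node, sim) for node, sim in query_neighbors]
    heapq.heapify(heap)
    visited, ranked = set(), []
    while heap and len(ranked) < k:
        _, node, sim = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        ranked.append((node, sim))
        for nbr, edge_sim in graph.get(node, []):
            if nbr not in visited:
                boost = 1.0 if edge_sim >= exploit_threshold else 0.0  # exploit vs. explore
                heapq.heappush(heap, (-(edge_sim + boost), nbr, edge_sim))
    return ranked

graph = {"a": [("b", 0.9), ("c", 0.3)], "b": [("d", 0.85)], "c": [], "d": []}
print(explore_exploit_retrieve(graph, [("a", 0.95)], k=4))
```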
Google Landmark Retrieval Challenge 2019 3rd Place | Semi-Supervised Exploration in Image Retrieval

Abstract
We present our solution to the Google Landmark Retrieval Challenge 2019. This challenge was based on the large Google Landmarks Dataset V2. The goal was to retrieve all database images containing the same landmark for every provided query image. Our solution combines global and local models to form an initial kNN graph. We then use a novel extension of the recently proposed graph traversal method EGT, referred to as semi-supervised EGT, to refine the graph and retrieve better candidates.
- DIR concatenated with GeM, trained on the Landmark-v1 dataset.
- QE and spatial verification (RANSAC) using DELF-V2.
- Novel contribution: semi-supervised extension of EGT for the final ranking.
Stanford Question Answering Dataset (SQuAD) 2.0 2nd Place | Top-performing Natural Language Processing Model

Abstract
Released in 2016, the Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Performance on SQuAD surpassed human performance in 2018, and in response the SQuAD 2.0 challenge was released, which combines the 100,000 questions in SQuAD 1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. Solutions to SQuAD 2.0 must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. The Layer 6 NLP Team developed a model that ranked 2nd on the leaderboard (as of March 28, 2019).
Spotify RecSys Challenge 2018 Winner | Two-stage Model for Automatic Playlist Continuation at Scale

Abstract
Automatic playlist continuation is a prominent problem in music recommendation. A significant portion of music consumption is now done online through playlists and playlist-like online radio stations. Manually compiling playlists for consumers is a highly time-consuming task that is difficult to do at scale given the diversity of tastes and the large amount of musical content available. Consequently, automated playlist continuation has received increasing attention recently. The 2018 ACM RecSys Challenge is dedicated to evaluating and advancing the current state-of-the-art in automated playlist continuation using a large-scale dataset released by Spotify. In this paper we present our approach to this challenge. We use a two-stage model where the first stage is optimized for fast retrieval, and the second stage re-ranks retrieved candidates, maximizing the accuracy at the top of the recommended list. Our team vl6 achieved 1st place in both the main and creative tracks out of over 100 teams.
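A schematic of the two-stage structure under stated assumptions: a fast dot-product retrieval stage followed by a heavier re-ranking stage applied only to the retrieved candidates. The embeddings and the cosine re-ranker are stand-ins for the actual models.

```python
import numpy as np

def two_stage_recommend(playlist_emb, track_embs, rerank_fn, n_retrieve=500, k=10):
    """Stage 1: fast dot-product retrieval of candidate tracks.
    Stage 2: a (typically much heavier) re-ranker scores only the candidates."""
    scores = track_embs @ playlist_emb
    candidates = np.argpartition(-scores, n_retrieve)[:n_retrieve]
    reranked = sorted(candidates, key=lambda t: -rerank_fn(playlist_emb, track_embs[t]))
    return reranked[:k]

rng = np.random.default_rng(0)
tracks = rng.normal(size=(10_000, 32))
playlist = rng.normal(size=32)
# Stand-in re-ranker: cosine similarity (a real one would use richer features).
rerank = lambda p, t: (p @ t) / (np.linalg.norm(p) * np.linalg.norm(t))
print(two_stage_recommend(playlist, tracks, rerank))
```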
RSNA Pneumonia Detection Challenge 2018 (Kaggle) 4th Place

Abstract
Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2015, 920,000 children under the age of 5 died from the disease. In this challenge, we were asked to build an algorithm that detects a visual signal for pneumonia in medical images. Layer 6 collaborated with 16Bit and developed an ensemble of 15 state-of-the-art object detection models (10 Mask R-CNN, 3 YOLOv3, and 2 Faster R-CNN models), in combination with a classifier (a DenseNet-121 architecture pre-trained on the NIH Chest X-rays dataset) that served to reduce false positives, to detect pneumonia in chest X-rays. We found that using a relaxed detection threshold for object detection, whilst requiring unanimous agreement among the detectors, effectively balanced the need to minimize both false positives and false negatives. Adaptive histogram equalization was used to improve image contrast as a data preprocessing step. We used age, sex, and view position as inputs into the penultimate layer of the classifier to improve performance.
Google Landmark Retrieval Challenge 2018 2nd Place | Modified Maximum Spanning Tree Clustering for Large-Scale Image Retrieval

Abstract
The task is to retrieve all the images depicting the same landmark, regardless of visual similarity. Our approach iteratively moves images of the same landmark closer to the approximated centers of the landmark clusters.
- Local descriptors and geometric verification are involved.
- Test images can be used as additional bridging.

ACM RecSys Challenge 2017 Winner | Content-based Neighbor Models for Cold Start in Recommender Systems

Abstract
Cold start remains a prominent problem in recommender systems. While rich content information is often available for both users and items, few existing models can fully exploit it for personalization. Slow progress in this area can be partially attributed to the lack of publicly available benchmarks to validate and compare models. This year’s ACM Recommender Systems Challenge ’17 aimed to address this gap by providing a standardized framework to benchmark cold start models. The challenge organizer XING released a large-scale data collection of user-job interactions from their career-oriented social network. Unlike other competitions, here the participating teams were evaluated in two phases, offline and online. Models were first evaluated on the held-out offline test set. Top models were then A/B tested in the online phase, where new target users and items were released daily and recommendations were pushed into XING’s live production system. In this paper we present our approach to this challenge; we used a combination of content and neighbor-based models, winning both the offline and online phases. Our model produced the most consistent online performance, winning four of the five online weeks, and showed excellent generalization in the live A/B setting.
NIPS 2017 | DropoutNet: Addressing Cold Start in Recommender Systems

Abstract
Latent models have become the default choice for recommender systems due to their performance and scalability. However, research in this area has primarily focused on modeling user-item interactions, and few latent models have been developed for cold start. Deep learning has recently achieved remarkable success, showing excellent results for diverse input types. Inspired by these results, we propose a neural network based latent model called DropoutNet to address the cold start problem in recommender systems. Unlike existing approaches that incorporate additional content-based objective terms, we instead focus on the optimization and show that neural network models can be explicitly trained for cold start through dropout. Our model can be applied on top of any existing latent model, effectively providing cold start capabilities and the full power of deep architectures. Empirically, we demonstrate state-of-the-art accuracy on publicly available benchmarks.
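The sketch below shows the core training trick in a generic form: randomly dropping the preference (latent) input so the network learns to fall back on content features, which is what enables cold-start inference. The layer sizes and dropout rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class DropoutNetSketch(nn.Module):
    """Maps (preference latent, content features) to a cold-start-robust
    representation; the preference input is randomly dropped during training."""
    def __init__(self, latent_dim=32, content_dim=64, out_dim=32, p_drop=0.5):
        super().__init__()
        self.p_drop = p_drop
        self.net = nn.Sequential(
            nn.Linear(latent_dim + content_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, latent, content):
        if self.training:
            # Drop the whole preference vector for a random subset of the batch,
            # forcing the model to build representations from content alone.
            keep = (torch.rand(latent.size(0), 1) > self.p_drop).float()
            latent = latent * keep
        return self.net(torch.cat([latent, content], dim=-1))

model = DropoutNetSketch()
out = model(torch.randn(16, 32), torch.randn(16, 64))
print(out.shape)  # torch.Size([16, 32])
```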