TRL 2023 is a wrap! Interested in updates on future events, news, etc.? Sign up for the mailing list here!
Tables are a promising modality for representation learning, with too much application potential to ignore. However, tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSVs. Similarly, the top-3 most-used database management systems are all intended for relational data (RDBMS). Representation learning over tables, possibly combined with other modalities such as text or SQL, has shown impressive performance for tasks like semantic parsing, question answering, table understanding, and data preparation. More recently, the pre-training paradigm was shown to be effective for tabular ML as well, while researchers have also started exploring the impressive capabilities of LLMs for table encoding and data manipulation.
The Table Representation Learning (TRL) workshop is the first in this emerging research area and concentrates on three main goals:
- (1) Motivate tables as a primary modality for representation and generative learning and advance the area further.
- (2) Showcase impactful applications of pretrained table models and discuss future opportunities.
- (3) Foster discussion and collaboration across the ML, NLP and DB communities.
Where: Room 235-236, New Orleans Ernest N. Morial Convention Center.
Program
TRL is again entirely in-person, and will this year feature two longer poster sessions and more (spotlight) talks on contributed work. We also host a few exciting invited talks on established research in this emerging area, and a panel discussion.
Invited Speakers
Wenhu Chen, University of Waterloo, Google DeepMind, Vector Institute
Frank Hutter, University of Freiburg
Simran Arora, Stanford University
Immanuel Trummer, Cornell University
Tao Yu, The University of Hong Kong
Panelists (TBC)
Schedule
06:30 AM | Opening Notes |
06:45 AM | Invited Talk by Simran Arora: Co-Designing LLMs and LLM-Powered Data Management Tools |
Abstract: Large Language Models (LLMs) are now widely used for data management. We recently proposed Evaporate [ICLR Spotlight 2023, VLDB 2024], a system that uses LLMs to help users efficiently query semi-structured documents. We also showed how off-the-shelf LLMs perform data-wrangling tasks with state-of-the-art quality and no specialized training [VLDB 2023]. This talk discusses some of my lessons from working on these early LLM-for-data-management projects and subsequent research to improve the reach of these systems — in particular, there is a ways to go for extending LLMs to datatypes such as private, semi-structured, and long-sequence data. Towards extending our capabilities on these datatypes, I’ll discuss MQAR and Monarch Mixer [NeurIPS Oral 2023], new LM architectures that can match the quality of attention-based LMs, while remaining asymptotically more efficient at training and inference time. We’ll finally discuss how these fundamental breakthroughs can power next-generation data management tools.
Bio: Simran Arora is a PhD student at Stanford University in machine learning systems, advised by Chris Ré. She develops tools that help users apply foundation models to challenging datatypes such as private, semi-structured, and long-sequence data. To unlock these capabilities, she leverages a detailed understanding of the data and inductive biases used to train foundation models. Her work has received Oral (top 5%) and Spotlight (top 25%) awards at NeurIPS and ICLR. Simran also recently created and taught a new [Systems for Machine Learning](https://cs229s.stanford.edu/fall2023/) full-unit course at Stanford. She is grateful for the support of the SGF Sequoia Fellowship.
07:15 AM | Spotlight Talks: Session 1 (Data Wrangling and Table QA) |
07:15 AM – 07:22 AM | High-Performance Transformers for Table Structure Recognition Need Early Convolutions |
Abstract: Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://anonymous.4open.science/r/NeurIPS23-TRL-2 to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
07:23 AM – 07:30 AM | Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples |
Abstract: Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et al., 2022). An important aspect of data wrangling with LMs is to properly construct prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et al., demonstration examples are chosen manually by the authors, which may not be scalable to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84% over randomly selected examples and 49% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and various factors in demonstration such as quantity, quality, and diversity.
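For readers unfamiliar with embedding-based demonstration selection, the toy sketch below illustrates the general idea of choosing in-context examples by nearest-neighbor search in an embedding space. TF-IDF vectors and scikit-learn stand in for the paper's embedding model; the pool, labels, and prompt format are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of picking in-context examples by embedding distance.
# TF-IDF + scikit-learn stand in for the paper's embedding model; all names
# and example data below are illustrative, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Pool of labeled examples that demonstrations are drawn from.
pool_inputs = [
    "Name: Apple iPhone 12; Price: $799",
    "Name: Samsung Galaxy S21; Price: $999",
    "Name: Dell XPS 13; Price: $1,199",
]
pool_labels = ["phone", "phone", "laptop"]

vectorizer = TfidfVectorizer().fit(pool_inputs)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(pool_inputs)
)

def build_prompt(query: str) -> str:
    """Select the demonstrations closest to the query in embedding space."""
    _, nn_idx = index.kneighbors(vectorizer.transform([query]))
    demos = [f"Input: {pool_inputs[i]}\nLabel: {pool_labels[i]}" for i in nn_idx[0]]
    return "\n\n".join(demos) + f"\n\nInput: {query}\nLabel:"

print(build_prompt("Name: Google Pixel 7; Price: $599"))
```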
07:31 AM – 07:38 AM | TabPFGen – Tabular Data Generation with TabPFN |
Abstract: Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation.
07:38 AM – 07:45 AM | Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL |
Abstract: Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks have both query and data ambiguities. Most works on Text-to-SQL focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these ambiguities, we found GPT-4's performance improved by 28.9%.
07:46 AM – 07:53 AM | MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering |
Abstract: Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.
08:00 AM | Coffee Break / Poster Setup |
08:20 AM | Poster: Session 1 |
09:00 AM | Invited Talk by Frank Hutter: Advances in In-Context Learning for Tabular Datasets |
Abstract: A year ago, we introduced TabPFN, the first in-context learning method for tabular data. In this talk, I will discuss what has happened since. I will start by briefly discussing CAAFE, a system that uses LLMs for automated feature engineering on tabular data and makes effective use of TabPFN's speed. Then, I will situate prior-data fitted networks (PFNs) in the in-context learning literature, review various applications of PFNs, explain TabPFN in some more detail, and then discuss our ongoing work on removing TabPFN's remaining limitations.
Bio: Frank Hutter is a Full Professor for Machine Learning at the University of Freiburg (Germany). He holds a PhD from the University of British Columbia (UBC, 2009) and a Diplom (eq. MSc) from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning. He is a Fellow of EurAI and ELLIS, the director of the ELLIS unit Freiburg and the recipient of 3 ERC grants. Frank is best known for his research on automated machine learning (AutoML), including neural architecture search and efficient hyperparameter optimization. He co-authored the first book on AutoML and the prominent AutoML tools Auto-WEKA, Auto-sklearn and Auto-PyTorch, won the first two AutoML challenges with his team, co-organized the ICML workshop series on AutoML every year 2014-2021, has been the general chair of the inaugural AutoML conference 2022 and is general chair again in 2023.
09:30 AM | Spotlight Talks: Session 2 (Tabular ML) |
09:30 AM – 09:37 AM | Self-supervised Representation Learning from Random Data Projectors |
Abstract: Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities such as tabular and time-series data. This paper presents an SSRL approach that can be applied to these data modalities because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on real-world applications with tabular and time-series data. We show that it outperforms multiple state-of-the-art SSRL baselines and is competitive with methods built on domain-specific knowledge. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study.
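As a rough illustration of learning from random projections without augmentations, here is a minimal PyTorch sketch: an encoder is trained so that lightweight heads on its representation can reconstruct fixed random linear projections of the input. The layer sizes, number of projectors, and plain MSE loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of representation learning by reconstructing fixed
# random projections of the input (no augmentations, no masking). Sizes,
# number of projectors, and the plain MSE loss are illustrative assumptions.
import torch
import torch.nn as nn

d_in, d_rep, n_proj, d_proj = 20, 64, 4, 8

encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_rep))
heads = nn.ModuleList(nn.Linear(d_rep, d_proj) for _ in range(n_proj))
# Fixed random projection matrices define the reconstruction targets.
projectors = [torch.randn(d_in, d_proj) for _ in range(n_proj)]

opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)

x = torch.randn(256, d_in)  # a toy batch of tabular rows
for _ in range(100):
    z = encoder(x)
    # Each head reconstructs one random projection of the raw input.
    loss = sum(((head(z) - x @ proj) ** 2).mean()
               for head, proj in zip(heads, projectors))
    opt.zero_grad()
    loss.backward()
    opt.step()

# encoder(x) now serves as the learned representation for downstream tasks.
```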
09:38 AM – 09:45 AM | GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data |
Abstract: Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) for extracting this implicit structure, and for conditioning the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers.
09:46 AM – 09:53 AM | HyperFast: Instant Classification for Tabular Data |
Abstract: Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at www.url-hidden-for-submission.
09:54 AM – 10:01 AM | Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation |
Abstract: Tabular data is prevalent across various machine learning domains. Yet, the inherent heterogeneity of attribute and class spaces across different tabular datasets hinders the effective sharing of knowledge, limiting a tabular model's ability to benefit from other datasets. In this paper, we propose Tabular data Pre-Training via Meta-representation (TabPTM), which allows one tabular model to be pre-trained on a set of heterogeneous datasets. This pre-trained model can then be directly applied to unseen datasets that have diverse attributes and classes without additional training. Specifically, TabPTM represents an instance through its distance to a fixed number of prototypes, thereby standardizing heterogeneous tabular datasets. A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences, endowing TabPTM with the ability of training-free generalization. Experiments validate that TabPTM achieves promising performance on new datasets, even under few-shot scenarios.
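To make the distance-to-prototypes idea concrete, the snippet below shows how rows from tables of different widths can be mapped to fixed-length vectors of sorted distances to a set of prototypes. Using k-means centroids as prototypes is an assumption made here for illustration only; the paper's prototype construction and downstream network are not reproduced.

```python
# Toy sketch of a distance-to-prototype meta-representation: rows from tables
# with different numbers of columns are mapped to vectors of the same length.
# Using k-means centroids as prototypes is an illustrative assumption, not
# necessarily the paper's recipe.
import numpy as np
from sklearn.cluster import KMeans

def meta_representation(X: np.ndarray, n_prototypes: int = 8) -> np.ndarray:
    """Return, per row, its sorted distances to a fixed set of prototypes."""
    prototypes = KMeans(n_clusters=n_prototypes, n_init=10,
                        random_state=0).fit(X).cluster_centers_
    dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=-1)
    return np.sort(dists, axis=1)  # shape (n_rows, n_prototypes), regardless of width

X_narrow = np.random.rand(100, 5)   # a dataset with 5 features
X_wide = np.random.rand(100, 37)    # a dataset with 37 features
print(meta_representation(X_narrow).shape, meta_representation(X_wide).shape)  # (100, 8) (100, 8)
```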
10:00 AM | Lunch Break |
11:30 AM | Invited Talk by Immanuel Trummer: Next-Generation Data Management with Large Language Models |
Abstract: The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I discuss several recent research projects at Cornell, exploiting large language models to enhance relational database management systems. These projects cover applications of language models in the database interface, enabling users to specify high-level analysis goals for fully automated end-to-end analysis, as well as applications in the backend, using language models to extract useful information for data profiling and database tuning from text documents.
Bio: Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His online lecture introducing students to database topics collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.
12:00 PM | Spotlight Talks: Session 3 (Tables + LLMs) |
12:00 PM – 12:07 PM | Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs |
Abstract: Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLM's ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
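As a flavor of what "formats" and "noise operations" mean here, the sketch below serializes one small table in two prompt formats and applies a single perturbation. The eight formats and eight noise operators studied in the paper are not reproduced, and the example data is made up.

```python
# Illustrative sketch of serializing one table in two prompt formats and
# applying a single perturbation (reversing column order). The paper's eight
# formats and eight noise operators are not reproduced; the data is made up.
import pandas as pd

table = pd.DataFrame({"city": ["Oslo", "Lima"], "population": [709000, 10000000]})

csv_prompt = table.to_csv(index=False)          # CSV-style serialization
json_prompt = table.to_json(orient="records")   # JSON-records serialization
noisy_table = table[list(table.columns[::-1])]  # one noise operation: reverse columns

print(csv_prompt)
print(json_prompt)
print(noisy_table.to_csv(index=False))
```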
12:08 PM – 12:15 PM | How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings |
Abstract: Large language models (LLMs) with in-context learning have demonstrated remarkable capability in the text-to-SQL task. Previous research has prompted LLMs with various demonstration-retrieval strategies and intermediate reasoning steps to enhance the performance of LLMs. However, those works often employ varied strategies when constructing the prompt text for text-to-SQL inputs, such as databases and demonstration examples. This leads to a lack of comparability in both the prompt constructions and their primary contributions. Furthermore, selecting an effective prompt construction has emerged as a persistent problem for future research. To address this limitation, we comprehensively investigate the impact of prompt constructions across various settings and provide insights into prompt constructions for future text-to-SQL studies.
12:16 PM – 12:23 PM | IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models |
Abstract: There is a massive amount of tabular data that can be taken advantage of via "foundation models" to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality for categorical values, and scalability to many tables, rows, and features. We propose IngesTables, a novel canonical tabular foundation model building framework, designed to address the aforementioned challenges. IngesTables employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, IngesTables is much cheaper to train and faster to run inference, because of how LLM-generated embeddings are defined and cached. We show that IngesTables demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets, without incurring LLM-level cost and latency.
12:30 PM | Invited Talk by Tao Yu: Advancing Natural Language Interfaces to Data with Language Models as Agents |
Abstract: Traditional Natural Language Interfaces (NLIs) to data often necessitate users to provide detailed, step-by-step instructions, reflecting an assumption of user familiarity with the underlying data and systems, which can limit accessibility. The emergence of Large Language Models (LLMs) has, however, revolutionized NLIs, enabling them to perform sophisticated reasoning, decision-making, and planning multi-step actions in diverse environments autonomously. In this talk, I will discuss how these language models as agents facilitate a paradigm shift towards moving beyond traditional code generation to more autonomous and user-friendly NLIs, capable of understanding high-level objectives without requiring intricate directives. I will also present our latest work in this direction, including instruction-finetuned retrievers for diverse environment adaptation, the enhancement of LLM capabilities with tool integration, and the development of open, state-of-the-art LLMs and platforms for constructing such language agents. The talk will conclude with an exploration of the current and future research prospects in this rapidly evolving domain.
Bio: Tao Yu is an Assistant Professor of Computer Science at The University of Hong Kong and a director of the XLANG Lab (as part of the HKU NLP Group). He spent one year in the UW NLP Group working with Noah Smith, Luke Zettlemoyer, and Mari Ostendorf. He completed his Ph.D. in Computer Science at Yale University, advised by Dragomir Radev, and his master's at Columbia University, advised by Owen Rambow and Kathleen McKeown. Tao has received the Google and Amazon faculty research awards (Google Research Scholar Award 2023, Amazon Research Award 2022). His main research interest is in Natural Language Processing. His research aims to build language model agents that transform ("grounding") language instructions into code or actions executable in real-world environments, including databases, web applications, and the physical world. It lies at the heart of the next generation of natural language interfaces that can interact with and learn from these real-world environments to facilitate human interaction with data analysis, web applications, and robotic instruction through conversation.
01:00 PM | Coffee Break / Poster Setup |
01:20 PM | Poster: Session 2 |
02:00 PM | Invited Talk by Wenhu Chen: Enabling Large Language Models to Reason with Tables |
Abstract: Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. While it is true that tables can be used as inputs to LLMs with serialization, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this talk, I will cover different ways to utilize LLMs to interface with tables. One approach is to feed the whole table as a sequence to LLMs for reasoning. In this direction, we will talk about the recent paper GPT4Table to summarize the lessons learned on different table linearization strategies, including table input format, content order, role prompting, and partition marks. The other approach is to use tools like SQL or other languages to interface with the table for data access without feeding in the entire table. The LLM then works as a reasoner to derive the answer based on the results returned from the table.
Bio: Wenhu Chen has been an assistant professor in the Computer Science Department at the University of Waterloo and at the Vector Institute since 2022. He received a Canada CIFAR AI Chair Award in 2022. He has also worked for Google DeepMind as a part-time research scientist since 2021. Before that, he obtained his PhD from the University of California, Santa Barbara under the supervision of William Wang and Xifeng Yan. His research interests lie in natural language processing, deep learning, and multimodal learning. He aims to design models to handle complex reasoning scenarios like math problem-solving, structured knowledge grounding, etc. He is also interested in building more powerful multimodal models to bridge different modalities. He received the Area Chair Award at AACL-IJCNLP 2023, the Best Paper Honorable Mention at WACV 2021, and the UCSB CS Outstanding Dissertation Award in 2021.
02:30 PM | Panel - TBA |
03:15 PM | Closing Notes |
Call for Papers
Important Dates
Submission Open | August 10, 2023 |
Submission Deadline | |
Notifications | |
Camera-ready | |
Slides for spotlight talks | |
Video pitches for posters | |
Workshop Date | December 15, 2023 |
Scope
We invite submissions on representation and generative learning over tables, related to any of the following topics:
- Representation Learning over tables, which can be structured as well as semi-structured, and extend to full databases. Example contributions are new model architectures, data encoding techniques, pre-training, fine-tuning, and prompting strategies, multi-task learning, etc.
- Generative Learning and LLMs for structured data and interfaces to structured data (e.g. queries, analysis).
- Multimodal learning where tables are jointly embedded with, for example, natural language, code (e.g. SQL), knowledge bases, visualizations/images.
- Downstream Applications of table representations for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. search, fact-checking/QA, KG construction), analysis (e.g. summarization, visualization, and query recommendation), and (end-to-end) machine learning.
- Upstream Applications of table representation models for optimizing table parsers/extraction (from documents, spreadsheets, presentations), storage (e.g. compression, indexing), and query processing (e.g. query optimization).
- Production challenges of table representation models. Work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g. data updating, error correction, and monitoring, and other industry challenges such as privacy, personalization, performance, etc.
- Domain-specific challenges for learned table models, which often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy, security limitations, and other factors that necessitate tailored solutions.
- Benchmarks and analyses of table representation models, including the utility of language models as base models versus alternatives and robustness regarding large, messy, heterogeneous, or complex tables.
- Others: Formalization, surveys, datasets, visions, and reflections to structure and guide future research.
Submission Guidelines
Submission link
Submit your (anonymized) paper through OpenReview at: https://openreview.net/group?id=NeurIPS.cc/2023/Workshop/TRL
Please be aware that accepted papers are expected to be presented at the workshop in-person.
Formatting guidelines
The workshop accepts regular research papers and industrial papers of the following types:
- Short paper: 4 pages + references.
- Regular paper: 8 pages + references.
Submissions should be anonymized and follow the NeurIPS style files, but can exclude the checklist. Non-anonymous preprints are no problem, and artifacts do not have to be anonymized. Just submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. The footer of accepted papers should state “Table Representation Learning Workshop at NeurIPS 2023”. We expect authors to adopt an inclusive and diverse writing style. The “Diversity and Inclusion in Writing” guide by the DE&I in DB Conferences effort is a good resource.
Review process
Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be published on the website and made public on OpenReview, but the workshop is non-archival (i.e. without proceedings).
Novelty and conflicts
The workshop does not accept submissions that have previously been published at NeurIPS or other machine learning venues. However, we do welcome submissions that have been published in, for example, data management or natural language processing venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular, with respect to the organization and program committees.
Camera-ready instructions
Camera-ready papers are expected to include the authors and affiliations on the first page, and state "Table Representation Learning Workshop at NeurIPS 2023" in the footer. The camera-ready version may exceed the page limit for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision), and will be published on OpenReview and this website. Please make sure that all meta-data is correct as well, as it will be imported to the NeurIPS website.
Presentation instructions
All accepted papers will be presented as a poster during one of the poster sessions (TBA). For poster formatting, please refer to the poster instructions on the NeurIPS site; you can print and bring the poster yourself or consider the FedEx offer for NeurIPS. Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to m.hulsebos@uva.nl, which will be hosted on the website and YouTube channel of the workshop. Papers selected for spotlight talks are also asked to prepare a talk of 6 minutes (+1 min Q&A) and upload their slides through the "slides" field in OpenReview. Timeslots for the spotlights will be published soon. The recordings of oral talks will be published as well.
Organization
Workshop Chairs
UC Berkeley
Microsoft
Harvard
Numbers Station AI
Google DeepMind
INRIA
Program Committee
Paul Groth, University of Amsterdam
Wensheng Dou, Institute of Software Chinese Academy of Sciences
Hiroshi Iida, The University of Tokyo
Sharad Chitlangia, Amazon
Jaehyun Nam, KAIST
Jinyang Li, The University of Hong Kong
Gerardo Vitagliano, Hasso Plattner Institute
Rajat Agarwal, Amazon
Micah Goldblum, New York University
Yury Gorishniy, Yandex Research
Roman Levin, Amazon
Bhavesh Neekhra, Ashoka University
Sebastian Schelter, University of Amsterdam
Qingping Yang, University of Chinese Academy of Sciences
Matteo Interlandi, Microsoft
Tianji Cong, University of Michigan
Xiang Deng, Google
Beliz Gunel, Google
Qian Liu, Sea AI Lab
Shuaichen Chang, Ohio State University
Zhoujun Cheng, Shanghai Jiaotong University
Roee Shraga, Worcester Polytechnic Institute
Yi Zhang, AWS AI Labs
Xi Rao, ETH Zurich
Liane Vogel, Technical University of Darmstadt
Aneta Koleva, University of Munich / Siemens
Ivan Rubachev, HSE University / Yandex
Meghana Moorthy Bhat, Salesforce Research
José Cambronero, Microsoft
Till Döhmen, MotherDuck / University of Amsterdam
Noah Hollmann, Charité Berlin / University of Freiburg
Julian Martin Eisenschlos, Google
Paolo Papotti, Eurecom
Zhiruo Wang, Carnegie Mellon University
Mukul Singh, Microsoft
Zezhou Huang, Columbia University
Carsten Binnig, TU Darmstadt
Linyong Nan, Yale
Shuo Zhang, Bloomberg
Alejandro Sierra Múnera, Hasso Plattner Institute
Anirudh Khatry, Microsoft
Haoyu Dong, Microsoft
Accepted Papers
2023
Oral
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
- GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
- High-Performance Transformers for Table Structure Recognition Need Early Convolutions
- Self-supervised Representation Learning from Random Data Projectors
- HyperFast: Instant Classification for Tabular Data
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
- Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
- Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
- IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
- Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
- How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
- TabPFGen – Tabular Data Generation with TabPFN
Poster
- Generating Data Augmentation Queries Using Large Language Models
- ReConTab: Regularized Contrastive Representation Learning for Tabular Data
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data
- Augmentation for Context in Financial Numerical Reasoning over Textual and Tabular Data with Large-Scale Language Model (recording)
- TabContrast: A Local-Global Level Method for Tabular Contrastive Learning
- Explaining Explainers: Necessity and Sufficiency in Tabular Data
- Beyond Individual Input for Deep Anomaly Detection on Tabular Data
- GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
- InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Models
- On Incorporating new Variables during Evaluation
- Unnormalized Density Estimation with Root Sobolev Norm Regularization
- Tree-Regularized Tabular Embeddings
- Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains
- A Deep Learning Blueprint for Relational Databases (recording)
- Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
- Modeling string entries for tabular data prediction: do we need big large language models?
- Hopular: Modern Hopfield Networks for Tabular Data
- NeuroDB: Efficient, Privacy-Preserving and Robust Query Answering with Neural Networks
- A DB-First approach to query factual information in LLMs
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning
- Incorporating LLM Priors into Tabular Learners
- CHORUS: Foundation Models for Unified Data Discovery and Exploration
- Introducing the Observatory Library for End-to-End Table Embedding Inference (recording)
- Scaling Experiments in Self-Supervised Cross-Table Representation Learning
- Benchmarking Tabular Representation Models in Transfer Learning Settings
- Exploring the Retrieval Mechanism for Tabular Deep Learning
- In Defense of Zero Imputation for Tabular Deep Learning
- Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
2022
Oral
- Analysis of the Attention in Tabular Language Models (recording)
- Transfer Learning with Deep Tabular Models (recording)
- STable: Table Generation Framework for Encoder-Decoder Models (recording)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (recording)
- Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning (recording)
- RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild (recording)
Poster
- SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
- Generic Entity Resolution Models
- Towards Foundation Models for Relational Databases (video pitch)
- Diffusion models for missing value imputation in tabular data (video pitch)
- STab: Self-supervised Learning for Tabular Data
- CASPR: Customer Activity Sequence based Prediction and Representation
- Conditional Contrastive Networks
- Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers
- The Need for Tabular Representation Learning: An Industry Perspective
- STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables (Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, Jinwoo Shin)
- Tabular Data Generation: Can We Fool XGBoost?
- SiMa: Federating Data Silos using GNNs (video pitch)