2nd Table Representation Learning Workshop

15 December, NeurIPS 2023. New Orleans, USA.



TRL 2023 is a wrap! Interested in updates on future events, news, etc.? Sign up for the mailing list here!

Tables are a promising modality for representation learning with too much application potential to ignore. However, tables have long been overlooked despite their dominant presence in the data landscape, e.g. data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resembles typical tabular file formats like CSVs. Similarly, the top-3 most-used database management systems are all intended for relational data (RDBMS). Representation learning over tables, possibly combined with other modalities such as text or SQL, has shown impressive performance for tasks like semantic parsing, question answering, table understanding, and data preparation. More recently, the pre-training paradigm was shown to be effective for tabular ML as well, while researchers also started exploring the impressive capabilities of LLMs for table encoding and data manipulation.

The Table Representation Learning (TRL) workshop is the first in this emerging research area and concentrates on three main goals:

  • (1) Motivate tables as a primary modality for representation and generative learning and advance the area further.
  • (2) Showcase impactful applications of pretrained table models and discuss future opportunities.
  • (3) Foster discussion and collaboration across the ML, NLP and DB communities.

When: Friday 15 December 2023, 8:30am - 5:30pm (local time).
Where: Room 235-236, New Orleans Ernest N. Morial Convention Center.
Private questions: madelon@berkeley.edu
Follow on Twitter: @TrlWorkshop


Program

TRL is again entirely in-person, and will this year feature 2 longer poster sessions and more (spotlight) talks on contributed work.
We also host a few exciting invited talks on established research in this emerging area, and a panel discussion.

Invited Speakers


  • Wenhu Chen (University of Waterloo, Google DeepMind, Vector Institute)
  • Frank Hutter (University of Freiburg)
  • Simran Arora (Stanford University)
  • Immanuel Trummer (Cornell University)
  • Tao Yu (University of Hong Kong)


Panelists (TBC)


  • Andreas Mueller (Microsoft)
  • Wenhu Chen (University of Waterloo)

Schedule


06:30 AM Opening Notes
06:45 AM Invited Talk: Simran Arora
Co-Designing LLMs and LLM-Powered Data Management Tools
Abstract: Large Language Models (LLMs) are now widely used for data management. We recently proposed Evaporate [ICLR Spotlight 2023, VLDB 2024], a system that uses LLMs to help users efficiently query semi-structured documents. We also showed how off-the-shelf LLMs perform data-wrangling tasks with state-of-the-art quality and no specialized training [VLDB 2023]. This talk discusses some of my lessons from working on these early LLM-for-data-management projects and subsequent research to improve the reach of these systems — in particular, there is a ways to go for extending LLMs to datatypes such as private, semi-structured, and long-sequence data. Towards extending our capabilities on these datatypes, I’ll discuss MQAR and Monarch Mixer [NeurIPS Oral 2023], new LM architectures that can match the quality of attention-based LMs, while remaining asymptotically more efficient at training and inference time. We’ll finally discuss how these fundamental breakthroughs can power next-generation data management tools.
Bio: Simran Arora is a PhD student at Stanford University in machine learning systems, advised by Chris Ré. She develops tools that help users apply foundation models to challenging datatypes such as private, semi-structured, and long-sequence data. To unlock these capabilities, she leverages a detailed understanding of the data and inductive biases used to train foundation models. Her work has received Oral (top 5%) and Spotlight (top 25%) awards at NeurIPS and ICLR. Simran also recently created and taught a new [Systems for Machine Learning](https://cs229s.stanford.edu/fall2023/) full-unit course at Stanford. She is grateful for the support of the SGF Sequoia Fellowship.
07:15 AM Spotlight Talks: Session 1 (Data Wrangling and Table QA)
07:15 AM ~ 07:22 AM High-Performance Transformers for Table Structure Recognition Need Early Convolutions
Anthony Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari (Raji) Balasubramaniyan, Duen Horng Chau
Abstract: Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://anonymous.4open.science/r/NeurIPS23-TRL-2 to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
07:23 AM ~ 07:30 AM Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
Joon Suk Huh, Changho Shin, Elina Choi
Abstract: Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et al., 2022). An important aspect of data wrangling with LMs is to properly construct prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et al., demonstration examples are chosen manually by the authors, which may not be scalable to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84% over randomly selected examples and 49% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and various factors in demonstration such as quantity, quality, and diversity.
07:31 AM ~ 07:38 AM TabPFGen – Tabular Data Generation with TabPFN
Jeremy (Junwei) Ma, Apoorv Dankar, George Stein, Guangwei Yu, Anthony Caterini
Abstract: Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation.
07:38 AM ~ 07:45 AM Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
Zachary Huang, Pavan Kalyan Damalapati, Eugene Wu
Abstract: Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks have both query and data ambiguities. Most works on Text-to-SQL focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans to disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these, we found GPT-4's performance improved by 28.9%.
07:46 AM ~ 07:53 AM MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
Vaishali Pal, Andrew Yates, Evangelos Kanoulas, Maarten Rijke
Abstract: Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.
08:00 AM Coffee Break / Poster Setup
08:20 AM Poster: Session 1
  • MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering. Vaishali Pal, Andrew Yates, Evangelos Kanoulas, Maarten Rijke.
  • Generating Data Augmentation Queries Using Large Language Models. Christopher Buss, Jasmin Mousavi, Mikhail Tokarev, Arash Termehchy, David Maier, Stefan Lee.
  • ReConTab: Regularized Contrastive Representation Learning for Tabular Data. Suiyao Chen, Jing Wu, Naira Hovakimyan, Handong Yao.
  • Explaining Explainers: Necessity and Sufficiency in Tabular Data. Prithwijit Chowdhury, Mohit Prabhushankar, Ghassan AlRegib.
  • Beyond Individual Input for Deep Anomaly Detection on Tabular Data. Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên DOAN.
  • InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Models. Jacob Yoke Hong Si, Michael Cooper, Wendy Yusi Cheng, Rahul Krishnan.
  • On Incorporating new Variables during Evaluation. Harsimran Bhasin, Soumyadeep Ghosh.
  • GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data. Andrei Margeloiu, Nikola Simidjievski, Pietro Lio, Mateja Jamnik.
  • High-Performance Transformers for Table Structure Recognition Need Early Convolutions. Anthony Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari (Raji) Balasubramaniyan, Duen Horng Chau.
  • Unnormalized Density Estimation with Root Sobolev Norm Regularization. Mark Kozdoba, Binyamin Perets, Shie Mannor.
  • Tree-Regularized Tabular Embeddings. Xuan Li, Yun Wang, Bo Li.
  • A Deep Learning Blueprint for Relational Databases. Lukáš Zahradník, Jan Neumann, Gustav Šír.
  • Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks. Benjamin Feuer, Niv Cohen, Chinmay Hegde.
  • NeuroDB: Efficient, Privacy-Preserving and Robust Query Answering with Neural Networks. Sepanta Zeighami, Cyrus Shahabi.
  • A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning. Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C. Bruss, Andrew Wilson, Tom Goldstein, Micah Goldblum.
  • Benchmarking Tabular Representation Models in Transfer Learning Settings. Qixuan Jin, Talip Ucar.
  • Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL. Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu.
  • Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples. Joon Suk Huh, Changho Shin, Elina Choi.
  • TabPFGen – Tabular Data Generation with TabPFN. Junwei Ma, Apoorv Dankar, George Stein, Guangwei Yu, Anthony Caterini.
  • Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction. You Wu, Omid Bazgir, Yongju Lee, Tommaso Biancalani, James Lu, Ehsan Hajiramezanali.
09:00 AM Invited Talk: Frank Hutter
Advances in In-Context Learning for Tabular Datasets
Abstract: A year ago, we introduced TabPFN, the first in-context learning method for tabular data. In this talk, I will discuss what happened since. I will start by briefly discussing CAAFE, a system that uses LLMs for automated feature engineering on tabular data and makes effective use of TabPFN's speed. Then, I will situate prior-fitted PFNs in the in-context learning literature, review various applications of PFN, explain TabPFN in some more detail and then discuss our ongoing work on removing TabPFN's remaining limitations.
Bio: Frank Hutter is a Full Professor for Machine Learning at the University of Freiburg (Germany). He holds a PhD from the University of British Columbia (UBC, 2009) and a Diplom (eq. MSc) from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning. He is a Fellow of EurAI and ELLIS, the director of the ELLIS unit Freiburg and the recipient of 3 ERC grants. Frank is best known for his research on automated machine learning (AutoML), including neural architecture search and efficient hyperparameter optimization. He co-authored the first book on AutoML and the prominent AutoML tools Auto-WEKA, Auto-sklearn and Auto-PyTorch, won the first two AutoML challenges with his team, co-organized the ICML workshop series on AutoML every year 2014-2021, has been the general chair of the inaugural AutoML conference 2022 and is general chair again in 2023.
09:30 AM Spotlight Talks: Session 2 (Tabular ML)
09:30 AM ~ 09:37 AM Self-supervised Representation Learning from Random Data Projectors
Yi Sui, Tongzi Wu, Jesse Cresswell, Ga Wu, George Stein, Xiao Shi Huang, Xiaochen Zhang, Maksims Volkovs
Abstract: Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities such as tabular and time-series data. This paper presents an SSRL approach that can be applied to these data modalities because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on real-world applications with tabular and time-series data. We show that it outperforms multiple state-of-the-art SSRL baselines and is competitive with methods built on domain-specific knowledge. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study.
09:38 AM ~ 09:45 AM GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
Andrei Margeloiu, Nikola Simidjievski, Pietro Lió, Mateja Jamnik
Abstract: Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) for extracting this implicit structure, and for conditioning the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers.
09:46 AM ~ 09:53 AM HyperFast: Instant Classification for Tabular Data
David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto, Alexander Ioannidis
Abstract: Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at www.url-hidden-for-submission.
09:54 AM ~ 10:01 AM Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
Han-Jia Ye, Qile Zhou, De-Chuan Zhan
Abstract: Tabular data is prevalent across various machine learning domains. Yet, the inherent heterogeneities in attribute and class spaces across different tabular datasets hinder the effective sharing of knowledge, preventing a tabular model from benefiting from other datasets. In this paper, we propose Tabular data Pre-Training via Meta-representation (TabPTM), which allows a single tabular model to be pre-trained on a set of heterogeneous datasets. Then, this pre-trained model can be directly applied to unseen datasets that have diverse attributes and classes without additional training. Specifically, TabPTM represents an instance through its distance to a fixed number of prototypes, thereby standardizing heterogeneous tabular datasets. A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences, endowing TabPTM with the ability of training-free generalization. Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
10:00 AM Lunch Break
11:30 AM Invited Talk: Immanuel Trummer
Next-Generation Data Management with Large Language Models
Abstract: The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I discuss several recent research projects at Cornell, exploiting large language models to enhance relational database management systems. These projects cover applications of language models in the database interface, enabling users to specify high-level analysis goals for fully automated end-to-end analysis, as well as applications in the backend, using language models to extract useful information for data profiling and database tuning from text documents.
Bio: Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His online lecture introducing students to database topics collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.
12:00 PM Spotlight Talks: Session 3 (Tables + LLMs)
12:00 PM ~ 12:07 PM Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, Chris Parnin
Abstract: Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLM's ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
12:08 PM ~ 12:15 PM How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
Shuaichen Chang, Eric Fosler-Lussier
Abstract: Large language models (LLMs) with in-context learning have demonstrated remarkable capability in the text-to-SQL task. Previous research has prompted LLMs with various demonstration-retrieval strategies and intermediate reasoning steps to enhance the performance of LLMs. However, those works often employ varied strategies when constructing the prompt text for text-to-SQL inputs, such as databases and demonstration examples. This leads to a lack of comparability in both the prompt constructions and their primary contributions. Furthermore, selecting an effective prompt construction has emerged as a persistent problem for future research. To address this limitation, we comprehensively investigate the impact of prompt constructions across various settings and provide insights into prompt constructions for future text-to-SQL studies.
12:16 PM ~ 12:23 PM IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
Scott Yak, Yihe Dong, Javier Gonzalvo, Sercan Arik
Abstract: There is a massive amount of tabular data that can be taken advantage of via 'foundation models' to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemes, arbitrarily high cardinality for categorical values, and scalability to many tables, rows and features. We propose IngesTables, a novel canonical tabular foundation model building framework, designed to address the aforementioned challenges. IngesTables employs LLMs to encode representations of table/feature semantics and the relationships, that are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, IngesTables is much cheaper to train and faster to run inference, because of how LLM-generated embeddings are defined and cached. We show that IngesTables demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets without incurring LLM-level cost and latency.
12:30 PM Invited Talk: Tao Yu
Advancing Natural Language Interfaces to Data with Language Models as Agents
Abstract: Traditional Natural Language Interfaces (NLIs) to data often require users to provide detailed, step-by-step instructions, reflecting an assumption of user familiarity with the underlying data and systems, which can limit accessibility. The emergence of Large Language Models (LLMs) has, however, revolutionized NLIs, enabling them to perform sophisticated reasoning, decision-making, and planning multi-step actions in diverse environments autonomously. In this talk, I will discuss how these language models as agents facilitate a paradigm shift towards moving beyond traditional code generation to more autonomous and user-friendly NLIs, capable of understanding high-level objectives without requiring intricate directives. I will also present our latest work in this direction, including instruction-finetuned retrievers for diverse environment adaptation, the enhancement of LLM capabilities with tool integration, and the development of open, state-of-the-art LLMs and platforms for constructing such language agents. The talk will conclude with an exploration of the current and future research prospects in this rapidly evolving domain.
Bio: Tao Yu is an Assistant Professor of Computer Science at The University of Hong Kong and a director of the XLANG Lab (as part of the HKU NLP Group). He spent one year in the UW NLP Group working with Noah Smith, Luke Zettlemoyer, and Mari Ostendorf. He completed his Ph.D. in Computer Science at Yale University, advised by Dragomir Radev, and his master's at Columbia University, advised by Owen Rambow and Kathleen McKeown. Tao has received the Google and Amazon faculty research awards (Google Research Scholar Award 2023, Amazon Research Award 2022). His main research interest is in Natural Language Processing. His research aims to build language model agents that transform (“ground”) language instructions into code or actions executable in real-world environments, including databases, web applications, the physical world, etc. It lies at the heart of the next generation of natural language interfaces that can interact with and learn from these real-world environments to facilitate human interaction with data analysis, web applications, and robotic instruction through conversation.
01:00 PM Coffee Break / Poster Setup
01:20 PM Poster: Session 2
  • Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks. Soumajyoti Sarkar, Leonard Lausen.
  • How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings. Shuaichen Chang, Eric Fosler-Lussier.
  • IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models. Scott Yak, Yihe Dong, Javier Gonzalvo, Sercan Arik.
  • In Defense of Zero Imputation for Tabular Deep Learning. Mike Van Ness, Madeleine Udell.
  • Fine-Tuning the Retrieval Mechanism for Tabular Deep Learning. Felix den Breejen, Sangmin Bae, Stephen Cha, Tae-Young Kim, Seoung Hyun Koh, Se-Young Yun.
  • Scaling Experiments in Self-Supervised Cross-Table Representation Learning. Maximilian Schambach, Dominique Paul, Johannes Otterbach.
  • Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs. Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, Chris Parnin.
  • CHORUS: Foundation Models for Unified Data Discovery and Exploration. Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu.
  • Incorporating LLM Priors into Tabular Learners. Max Zhu, Siniša Stanivuk, Andrija Petrovic, Mladen Nikolic, Pietro Lio.
  • A DB-First approach to query factual information in LLMs. Mohammed Saeed, Nicola De Cao, Paolo Papotti.
  • Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation. Han-Jia Ye, Qile Zhou, De-Chuan Zhan.
  • Hopular: Modern Hopfield Networks for Tabular Data. Bernhard Schäfl, Lukas Gruber, Angela Bitto-Nemling, Sepp Hochreiter.
  • HyperFast: Instant Classification for Tabular Data. David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto, Alexander Ioannidis.
  • Modeling string entries for tabular data prediction: do we need big large language models? Leo Grinsztajn, Myung Jun Kim, Edouard Oyallon, Gael Varoquaux.
  • Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains. Kyungeun Lee, Ye Seul Sim, Hyeseung Cho, Suhee Yoon, Sanghyu Yoon, Woohyung Lim.
  • Self-supervised Representation Learning from Random Data Projectors. Yi Sui, Tongzi Wu, Jesse Cresswell, Ga Wu, George Stein, Xiao Shi Huang, Xiaochen Zhang, Maksims Volkovs.
  • Unlocking the Transferability of Tokens in Deep Models for Tabular Data. Qile Zhou, Han-Jia Ye, Leye Wang, De-Chuan Zhan.
  • Augmentation for Context in Financial Numerical Reasoning over Textual and Tabular Data with Large-Scale Language Model. Yechan Hwang, Jinsu Lim, Young-Jun Lee, Ho-Jin Choi.
  • TabContrast: A Local-Global Level Method for Tabular Contrastive Learning. Hao Liu, Yixin Chen, Bradley Fritz, Christopher King.
  • GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent. Sascha Marton, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt.
  • Elephants Never Forget: Testing Language Models for Memorization of Tabular Data. Sebastian Bordt, Harsha Nori, Rich Caruana.
  • Introducing the Observatory Library for End-to-End Table Embedding Inference. Tianji Cong, Zhenjie Sun, Paul Groth, H.V. Jagadish, Madelon Hulsebos.
  • 02:00 PM Invited Talk: Wenhu Chen
    Enabling Large Language Models to Reason with Tables
    [Abstract]
    Abstract: Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. While it is true that tables can be used as inputs to LLMs with serialization, there lack comprehensive studies examining whether LLMs can truly comprehend such data. In this talk, I will cover different ways to utilize LLMs to interface with tables. One approach is to feed the whole table as a sequence to LLMs for reasoning. In this direction, we will talk about the recent paper GPT4Table to summarize the lessons learned in different table linearization strategies, including table input format, content order, role prompting, and partition marks. The other approach is to use tools like SQL or other language to interface with table for data access without feeding the entire table. LLMs will work as a reasoner to derive the answer based on the interfaced results from the table.
    Bio: Wenhu Chen has been an assistant professor in the Computer Science Department at the University of Waterloo and at the Vector Institute since 2022. He received a Canada CIFAR AI Chair award in 2022. He has also worked for Google DeepMind as a part-time research scientist since 2021. Before that, he obtained his PhD from the University of California, Santa Barbara under the supervision of William Wang and Xifeng Yan. His research interests lie in natural language processing, deep learning and multimodal learning. He aims to design models that handle complex reasoning scenarios like math problem-solving, structured knowledge grounding, etc. He is also interested in building more powerful multimodal models to bridge different modalities. He received the Area Chair Award at AACL-IJCNLP 2023, the Best Paper Honorable Mention at WACV 2021, and the UCSB CS Outstanding Dissertation Award in 2021.
  • 02:30 PM Panel - TBA
  • 03:15 PM Closing Notes

    Call for Papers


    Important Dates


    Submission Open: August 10, 2023
    Submission Deadline: October 4, 2023 (11:59PM AoE)
    Notifications: October 27, 2023 (11:59PM AoE)
    Camera-ready: November 15, 2023 (11:59PM AoE)
    Slides for spotlight talks: November 28, 2023 (11:59PM AoE)
    Video pitches for posters: November 28, 2023 (11:59PM AoE)
    Workshop Date: December 15, 2023

    Scope

    We invite submissions on representation and generative learning over tables, related to any of the following topics:

    • Representation Learning over tables, which can be structured or semi-structured and can extend to full databases. Example contributions are new model architectures, data encoding techniques, pre-training, fine-tuning, and prompting strategies, multi-task learning, etc.
    • Generative Learning and LLMs for structured data and interfaces to structured data (e.g. queries, analysis).
    • Multimodal learning where tables are jointly embedded with, for example, natural language, code (e.g. SQL), knowledge bases, visualizations/images.
    • Downstream Applications of table representations for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. search, fact-checking/QA, KG construction), analysis (e.g. summarization, visualization, and query recommendation), and (end-to-end) machine learning.
    • Upstream Applications of table representation models for optimizing table parsers/extraction (from documents, spreadsheets, presentations), storage (e.g. compression, indexing) and query processing (e.g. query optimization).
    • Production challenges of table representation models. Work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g. data updating, error correction, and monitoring, and other industry challenges such as privacy, personalization performance, etc.
    • Domain-specific challenges for learned table models often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy, security limitations, and other factors that necessitate tailored solutions.
    • Benchmarks and analyses of table representation models, including the utility of language models as base models versus alternatives and robustness regarding large, messy, heterogeneous, or complex tables.
    • Others: Formalization, surveys, datasets, visions, and reflections to structure and guide future research.

    Submission Guidelines

    Submission link

    Submit your (anonymized) paper through OpenReview at: https://openreview.net/group?id=NeurIPS.cc/2023/Workshop/TRL
    Please be aware that accepted papers are expected to be presented at the workshop in person.

    Formatting guidelines

    The workshop accepts regular research papers and industrial papers of the following types:
    • Short paper: 4 pages + references.
    • Regular paper: 8 pages + references.


    Submissions should be anonymized and follow the NeurIPS style files, but can exclude the checklist. Non-anonymous preprints are no problem, and artifacts do not have to be anonymized. Just submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. The footer of accepted papers should state “Table Representation Learning Workshop at NeurIPS 2023”. We expect authors to adopt an inclusive and diverse writing style. The “Diversity and Inclusion in Writing” guide by the DE&I in DB Conferences effort is a good resource.

    Review process

    Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be published on the website and made public on OpenReview, but the workshop is non-archival (i.e. without proceedings).

    Novelty and conflicts

    The workshop does not accept submissions that have previously been published at NeurIPS or other machine learning venues. However, we do welcome submissions that have been published in, for example, data management or natural language processing venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular, with respect to the organization and program committees.

    Camera-ready instructions

    Camera-ready papers are expected to list the authors and affiliations on the first page, and state "Table Representation Learning Workshop at NeurIPS 2023" in the footer. The camera-ready version may exceed the page limit for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision), and will be published on OpenReview and this website. Please make sure that all meta-data is correct as well, as it will be imported to the NeurIPS website.

    Presentation instructions

    All accepted papers will be presented as a poster during one of the poster sessions (TBA). For poster formatting, please refer to the poster instructions on the NeurIPS site; you can print and bring the poster yourself or consider the FedEx offer for NeurIPS. Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to m.hulsebos@uva.nl, which will be hosted on the website and YouTube channel of the workshop.
    Authors of papers selected for spotlight talks are also asked to prepare a talk of 6 minutes (+1 min Q&A), and upload their slides through the "slides" field in OpenReview. Timeslots for the spotlights will be published soon. The recordings of oral talks will be published as well.

    Organization

    Workshop Chairs


    Madelon Hulsebos
    UC Berkeley
    Haoyu Dong
    Microsoft
    Bojan Karlas
    Harvard
    Laurel Orr
    Numbers Station AI
    Pengcheng Yin
    Google DeepMind
    Qian Liu
    Sea AI Lab

    Program Committee

    Vadim Borisov, University of Tuebingen
    Paul Groth, University of Amsterdam
    Wensheng Dou, Institute of Software, Chinese Academy of Sciences
    Hiroshi Iida, The University of Tokyo
    Sharad Chitlangia, Amazon
    Jaehyun Nam, KAIST
    Jinyang Li, The University of Hong Kong
    Gerardo Vitagliano, Hasso Plattner Institute
    Rajat Agarwal, Amazon
    Micah Goldblum, New York University
    Yury Gorishniy, Yandex Research
    Roman Levin, Amazon
    Bhavesh Neekhra, Ashoka University
    Sebastian Schelter, University of Amsterdam
    Qingping Yang, University of Chinese Academy of Sciences
    Matteo Interlandi, Microsoft
    Tianji Cong, University of Michigan
    Xiang Deng, Google
    Beliz Gunel, Google
    Qian Liu, Sea AI Lab
    Shuaichen Chang, Ohio State University
    Zhoujun Cheng, Shanghai Jiaotong University
    Roee Shraga, Worcester Polytechnic Institute
    Yi Zhang, AWS AI Labs
    Xi Rao, ETH Zurich
    Liane Vogel, Technical University of Darmstadt
    Aneta Koleva, University of Munich / Siemens
    Ivan Rubachev, HSE University / Yandex
    Meghana Moorthy Bhat, Salesforce Research
    José Cambronero, Microsoft
    Till Döhmen, MotherDuck / University of Amsterdam
    Noah Hollmann, Charité Berlin / University of Freiburg
    Julian Martin Eisenschlos, Google
    Paolo Papotti, Eurecom
    Zhiruo Wang, Carnegie Mellon University
    Mukul Singh, Microsoft
    Zezhou Huang, Columbia University
    Carsten Binnig, TU Darmstadt
    Linyong Nan, Yale
    Shuo Zhang, Bloomberg
    Alejandro Sierra Múnera, Hasso Plattner Institute
    Anirudh Khatry, Microsoft
    Haoyu Dong, Microsoft


    Accepted Papers


    2023


    Oral

    Poster





    2022


    Oral



    Poster