Tables are a promising modality for representation learning and generative models, with too much application potential to ignore. Yet tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSV. Similarly, the top-3 most-used database management systems are all intended for relational data. Representation learning for tables, possibly combined with other modalities such as code and text, has shown impressive performance on tasks like semantic parsing, question answering, table understanding, data preparation, and data analysis (e.g. text-to-SQL). The pre-training paradigm has also proven effective for tabular ML (classification/regression). More recently, we observe promising potential in applying and enhancing LLMs in the domain of structured data, improving how we process and derive insights from it.
The Table Representation Learning (TRL) workshop is the premier venue in this emerging research area and has three main goals:
- (1) Motivate tables as a primary modality for representation and generative models and advance the area further.
- (2) Showcase impactful applications of pretrained table models and identify open challenges for future research, with a particular focus on industry insights in 2024.
- (3) Foster discussion and collaboration across the ML, NLP, IR and DB communities.
Where: East Meeting Room 11 & 12, Vancouver Convention Centre, Vancouver, Canada.
Call for Papers
Important Dates
| Submission Open | September 1, 2024 |
| Submission Deadline | September 20, 2024 (11:59 PM AoE) |
| Notifications | October 9, 2024 (11:59 PM AoE) |
| Camera-ready | October 30, 2024 (11:59 PM AoE) |
| Slides for contributed talks | November 30, 2024 (11:59 PM AoE) |
| Video pitches for posters (optional) | November 30, 2024 (11:59 PM AoE) |
| Workshop Date | December 14, 2024 |
Scope
We invite submissions on representation and generative learning over tables, related to any of the following topics:
- Representation Learning for (semi-)Structured Data such as spreadsheets, tables, and full relational databases. Example contributions are new model architectures, data encoding techniques, tailored tokenization methods, pre-training and fine-tuning techniques, etc.
- Generative Models and LLMs for Structured Data such as Large Language Models (LLMs) and diffusion models, and specialized techniques for prompt engineering, single-task and multi-task fine-tuning, LLM-driven interfaces and multi-agent systems, retrieval-augmented generation, etc.
- Multimodal Learning where structured data is jointly embedded or combined with other modalities such as text, code (e.g. SQL), knowledge graphs, and visualizations/images.
- Applications of TRL models for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. data search, fact-checking/QA, KG alignment), analysis (e.g. text-to-SQL and visualization), tabular data generation, (end-to-end) tabular machine learning, table extraction (e.g. parsers/extraction for unstructured data), and query optimization (e.g. cardinality estimation).
- Challenges of TRL models in production, including work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g. data updating, error correction, monitoring, handling data privacy, personalization performance, etc.
- Domain-specific challenges for learned table models, which often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy, security limitations, and other factors that necessitate tailored solutions.
- Benchmarks, analyses, and datasets for TRL including assessing LLMs and other generative models as base models versus alternative approaches, analysis of model robustness with respect to large, messy, and heterogeneous tabular data, etc.
- Other contributions such as surveys, demonstrations, visions, and reflections on table representation learning and generative models for structured data.
Organization
Program
TRL is again entirely in-person, and will this year feature two poster sessions and contributed talks. We also host a few invited talks on established research in this emerging area, and a panel discussion focused on industry/startup perspectives.
Invited Speakers
Panelists
Moderated by Laurel Orr.
Schedule (tentative)
08:30 AM | Opening Notes
08:40 AM | Session 1 (TRL for Tabular ML)
08:40 AM - 09:20 AM | Invited talk by Gaël Varoquaux: Tabular foundation models for analytics: challenges and progress
[Abstract]
Deep learning typically does not outperform tree-based models on tabular data. Often this may be explained by the small size of such datasets. For images, sound, and text, the solution has been pretrained models, leading to foundation models adapted and reused for many tasks. I will discuss the challenges of bringing these ideas to tabular learning, and the progress we have made, leading to the CARTE tabular model.
[Speaker Bio]
Gaël Varoquaux is a research director working on data science at Inria (the French national research institute for computer science), where he leads the Soda team. He is also co-founder and scientific advisor of Probabl.
Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, and causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python.
Varoquaux has worked at UC Berkeley, McGill, and the University of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate of Ecole Normale Superieure, Paris.
09:20 AM - 09:30 AM | MotherNet: Fast Training and Inference via Hyper-Network Transformers
09:30 AM - 09:40 AM | PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning
09:40 AM | Coffee Break
10:00 AM | Session 2 (LLMs & Tables)
10:00 AM - 10:35 AM | Invited Talk by Yasemin Altun (Google DeepMind): Advancements in Structure-Aware Reasoning for Tabular Data
[Abstract]
Traditional methods have struggled to capture the rich structural relationships within tables, limiting their reasoning capabilities. I will give an overview of our work over the last few years, highlighting advancements in structure-aware reasoning over tabular data through novel model architectures, task formulations, and training regimes, which have led to significant performance gains on various tasks and opened up new possibilities for applications.
[Speaker Bio]
Yasemin Altun is a Research Scientist at Google working on natural language understanding. She received her PhD from Brown University. Before joining Google, she was a faculty member at the Toyota Technological Institute at Chicago and the Max Planck Institute for Biological Cybernetics, Tuebingen.
10:35 AM - 10:45 AM | Large Language Models Engineer Too Many Simple Features for Tabular Data
10:45 AM - 10:55 AM | TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation
10:55 AM - 11:05 AM | Expertise-Centric Prompting Framework for Financial Tabular Data Generation using Pre-trained Large Language Models
11:05 AM - 11:15 AM | TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
11:15 AM - 12:00 PM | Poster Session 1
12:00 PM | Lunch Break
1:30 PM | Session 3 (NL interfaces to tables)
1:30 PM - 2:10 PM | Invited Talk by Matei Zaharia (UC Berkeley / Databricks): Lessons from building natural language query interfaces in Databricks AI/BI
[Abstract]
TBC
[Speaker Bio]
Matei Zaharia is an Associate Professor of EECS at Berkeley and a Cofounder and CTO of Databricks. He started the Apache Spark open source project during his PhD at UC Berkeley in 2009, and has worked broadly on other widely used data and AI software, including Delta Lake, MLflow, Dolly and ColBERT. He currently works on a variety of research projects in cloud computing, database management, AI and information retrieval. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
02:10 PM - 02:20 PM | MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation
02:20 PM - 02:30 PM | The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models
02:30 PM - 03:15 PM | Poster Session 2
03:15 PM | Coffee Break
03:30 PM | Session 4 🔥
03:30 PM - 04:10 PM | Invited Talk by Josh Gardner (Apple): Toward Robust, Reliable, and Generalizable Tabular Data Models
[Abstract]
Tabular data is widely used across many domains and real-world applications. However, tabular data research has not yet achieved the level of robust and transferable large-scale modeling which has been hugely impactful for data modalities such as text, images, and audio. In this talk, I will apply an empirical perspective to understand the landscape of robustness and transferability in tabular data, and discuss our recent progress toward improved data, models, and evaluations to enable true tabular foundation models.
[Speaker Bio]
Josh is a Research Scientist on the Foundation Modeling team at Apple. Prior to joining Apple, Josh completed his PhD at the University of Washington. Josh's research centers on the empirical foundations of machine learning -- in particular, the impact of data on foundation models, and improving foundation models' understanding of new data modalities beyond images and text.
04:10 PM - 04:50 PM | Panel: TRL and Startups
Panelists: Xiao Ling (Numbers Station), Kenny Daniel (hyperparam), Maithra Raghu (Samaya AI), Shivam Singhai (Structured), Junwei Ma (layer6)
04:50 PM | Closing Notes & Awards
Submission Guidelines
Submission link
Submit your (anonymized) paper through OpenReview at: TBC. Please be aware that accepted papers are expected to be presented at the workshop in person.
Formatting guidelines
The workshop accepts regular research papers and industrial papers of the following types:
- Short paper: 4 pages + references and appendix.
- Regular paper: 8 pages + references and appendix.
Submissions should be anonymized and follow the NeurIPS style files (zip), but can exclude the checklist. Non-anonymous preprints are no problem, and artifacts do not have to be anonymized. Just submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. The footer of accepted papers should state “Table Representation Learning Workshop at NeurIPS 2024”. We expect authors to adopt an inclusive and diverse writing style. The “Diversity and Inclusion in Writing” guide by the DE&I in DB Conferences effort is a good resource.
Review process
Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be published on the website and made public on OpenReview, but the workshop is non-archival (i.e. without proceedings).
Novelty and conflicts
The workshop cannot accept submissions that have been published at NeurIPS or other machine learning venues as-is, but we do invite relevant papers from the main conference (NeurIPS) to be submitted to the workshop as 4-page short papers. We also welcome submissions that have been published in, for example, data management or natural language processing venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular with respect to the organization and program committees.
Camera-ready instructions
Camera-ready papers are expected to include the author names and affiliations on the first page, and state "Table Representation Learning Workshop at NeurIPS 2024" in the footer. The camera-ready version may exceed the page limit for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision), and will be published on OpenReview and this website. Please make sure that all metadata is correct as well, as it will be imported to the NeurIPS website.
Presentation instructions
All accepted papers will be presented as a poster during one of the poster sessions (the schedule per poster session will be released soon). For poster formatting, please refer to the poster instructions on the NeurIPS site (template, upload, etc.); you can print and bring the poster yourself or print it locally through the facilities offered by NeurIPS. Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to madelon@berkeley.edu, which will be hosted on the website and YouTube channel of the workshop.
Papers selected for oral presentation are also asked to prepare a talk of 9 minutes (+1 min Q&A), and upload their slides through the "slides" field in OpenReview (pdf) or share a link to Google Slides with madelon@cwi.nl. The schedule for the oral talks will be published soon. The recordings of oral talks will be published afterwards.
Program Committee
Sercan O Arik, Google
Micah Goldblum, New York University
Andreas Muller, Microsoft
Xi Fang, Yale University
Naihao Deng, University of Michigan
Sebastian Schelter, BIFOLD & TU Berlin
Weijie Xu, Amazon
Rajat Agarwal, Amazon
Lei Cao, University of Arizona
Paul Groth, University of Amsterdam
Alex Zhuang, University of Waterloo
Sepanta Zeighami, University of California, Berkeley
Jayoung Kim, Yonsei University
Jaehyun Nam, KAIST
Sascha Marton, University of Mannheim
Tianji Cong, University of Michigan
Myung Jun Kim, Inria
Aneta Koleva, University of Munich
Peter Baile Chen, MIT
Gerardo Vitagliano, MIT
Reynold Cheng, the University of Hong Kong
Till Döhmen, MotherDuck / University of Amsterdam
Ivan Rubachev, Higher School of Economics
Raul Castro Fernandez, University of Chicago
Paolo Papotti, Eurecom
Carsten Binnig, TU Darmstadt / Google
Tianbao Xie, the University of Hong Kong
Jintai Chen, University of Illinois at Urbana-Champaign
Sebastian Bordt, Eberhard-Karls-Universität Tübingen
Panupong Pasupat, Google
Liangming Pan, University of Arizona
Xinyuan Lu, National University of Singapore
Ziyu Yao, George Mason University
Shuhan Zheng, Hitachi, Ltd.
Shuaichen Chang, Amazon
Julian Martin Eisenschlos, Google DeepMind
Noah Hollmann, Albert-Ludwigs-Universität Freiburg
Tianshu Zhang, Ohio State University
Liane Vogel, Technische Universität Darmstadt
Roman Levin, Amazon
Yury Gorishniy, Moscow Institute of Physics and Technology
Edward Choi, KAIST
Gyubok Lee, KAIST
Mingyu Zheng, University of Chinese Academy of Sciences
Tassilo Klein, SAP
Ge Qu, the University of Hong Kong
Artem Babenko, Yandex
Shreya Shankar, University of California Berkeley
Xiang Deng, Google
Mengyu Zhou, Microsoft Research
Mira Moukheiber, MIT
Amine Mhedhbi, Polytechnique Montréal
Accepted Papers
2024
Oral
- MotherNet: Fast Training and Inference via Hyper-Network Transformers
- PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning
- TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning (video pitch)
- Large Language Models Engineer Too Many Simple Features for Tabular Data
- TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation
- Expertise-Centric Prompting Framework for Financial Tabular Data Generation using Pre-trained Large Language Models
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
- The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models
- MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation
Poster
- AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler
- Distributionally robust self-supervised learning for tabular data
- Unmasking Trees for Tabular Data
- UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining
- Improving LLM Group Fairness on Tabular Data via In-Context Learning
- Recurrent Interpolants for Probabilistic Time Series Prediction
- Lightweight Correlation-Aware Table Compression (video pitch)
- TARGET: Benchmarking Table Retrieval for Generative Tasks
- LLM Embeddings Improve Test-time Adaptation to Tabular Y|X-Shifts
- Enhancing Table Representations with LLM-powered Synthetic Data Generation
- Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection
- Unnormalized Density Estimation with Root Sobolev Norm Regularization
- TabDeco: A Comprehensive Contrastive Framework for Decoupled Representations in Tabular Data
- SALT: Sales Autocompletion Linked Business Tables Dataset
- ICE-T: Interactions-aware Cross-column Contrastive Embedding for Heterogeneous Tabular Datasets
- Multi-Stage QLoRA with Augmented Structured Dialogue Corpora: Efficient and Improved Conversational Healthcare AI (video pitch)
- Data-Centric Text-to-SQL with Large Language Models
- Matchmaker: Self-Improving Compositional LLM Programs for Table Schema Matching
- Relational Deep Learning: Graph Representation Learning on Relational Databases
- Tabby: Tabular Adaptation for Language Models
- Adaptivee: Adaptive Ensemble for Tabular Data
- The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features
- Scaling Generative Tabular Learning for Large Language Models
- Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data
- PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
- RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph
- Augmenting Small-size Tabular Data with Class-Specific Energy-Based Models
- TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Node Features
- Exploration of autoregressive models for in-context learning on tabular data
- Adapting TabPFN for Zero-Inflated Metagenomic Data
- Automating Enterprise Data Engineering with LLMs
- Tabular Data Generation using Binary Diffusion (video pitch)
- On Short Textual Value Column Representation Using Symbol Level Language Models
- Scalable Representation Learning for Multimodal Tabular Transactions
- Unlearning Tabular Data Without a "Forget Set"
- From One to Zero: RAG-IM Adapts Language Models for Interpretable Zero-Shot Predictions on Clinical Tabular Data
- SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing
- TabFlex: Scaling Tabular Learning to Millions with Linear Attention
- Enhancing Biomedical Schema Matching with LLM-based Training Data Generation
- Benchmarking table comprehension in the wild
- GAMformer: Exploring In-Context Learning for Generalized Additive Models
- Towards Agentic Schema Refinement
- Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance
- Towards Localization via Data Embedding for TabPFN
- AGATa: Attention-Guided Augmentation for Tabular Data in Contrastive Learning
- HySem: A context length optimized LLM pipeline for unstructured tabular extraction
- DynoClass: A Dynamic Table-Class Detection System Without the Need for Predefined Ontologies
- TABGEN-RAG: Iterative Retrieval for Tabular Data Generation with Large Language Models
- Relational Data Generation with Graph Neural Networks and Latent Diffusion Models
- Learnable Numerical Input Normalization for Tabular Representation Learning based on B-splines
- Towards Optimizing SQL Generation via LLM Routing
- Sparsely Connected Layers for Financial Tabular Data
2023
Oral
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
- GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
- High-Performance Transformers for Table Structure Recognition Need Early Convolutions
- Self-supervised Representation Learning from Random Data Projectors
- HyperFast: Instant Classification for Tabular Data
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
- Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
- Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
- IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
- Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
- How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
- TabPFGen – Tabular Data Generation with TabPFN
Poster
- Generating Data Augmentation Queries Using Large Language Models
- ReConTab: Regularized Contrastive Representation Learning for Tabular Data
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data
- Augmentation for Context in Financial Numerical Reasoning over Textual and Tabular Data with Large-Scale Language Model (recording)
- TabContrast: A Local-Global Level Method for Tabular Contrastive Learning
- Explaining Explainers: Necessity and Sufficiency in Tabular Data
- Beyond Individual Input for Deep Anomaly Detection on Tabular Data
- GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
- InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Models
- On Incorporating new Variables during Evaluation
- Unnormalized Density Estimation with Root Sobolev Norm Regularization
- Tree-Regularized Tabular Embeddings
- Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains
- A Deep Learning Blueprint for Relational Databases (recording)
- Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
- Modeling string entries for tabular data prediction: do we need big large language models?
- Hopular: Modern Hopfield Networks for Tabular Data
- NeuroDB: Efficient, Privacy-Preserving and Robust Query Answering with Neural Networks
- A DB-First approach to query factual information in LLMs
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning
- Incorporating LLM Priors into Tabular Learners
- CHORUS: Foundation Models for Unified Data Discovery and Exploration
- Introducing the Observatory Library for End-to-End Table Embedding Inference (recording)
- Scaling Experiments in Self-Supervised Cross-Table Representation Learning
- Benchmarking Tabular Representation Models in Transfer Learning Settings
- Exploring the Retrieval Mechanism for Tabular Deep Learning
- In Defense of Zero Imputation for Tabular Deep Learning
- Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
2022
Oral
- Analysis of the Attention in Tabular Language Models (recording)
- Transfer Learning with Deep Tabular Models (recording)
- STable: Table Generation Framework for Encoder-Decoder Models (recording)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (recording)
- Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning (recording)
- RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild (recording)
Poster
- SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
- Generic Entity Resolution Models
- Towards Foundation Models for Relational Databases (video pitch)
- Diffusion models for missing value imputation in tabular data (video pitch)
- STab: Self-supervised Learning for Tabular Data
- CASPR: Customer Activity Sequence based Prediction and Representation
- Conditional Contrastive Networks
- Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers
- The Need for Tabular Representation Learning: An Industry Perspective
- STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables
- Tabular Data Generation: Can We Fool XGBoost?
- SiMa: Federating Data Silos using GNNs (video pitch)