3rd Table Representation Learning Workshop @ NeurIPS 2024

14 December 2024, Vancouver, Canada.



Mailing list: sign up here!
Follow on Bluesky: @trl-research
New: Join the TRL Discord!

Tables are a promising modality for representation learning and generative models, with too much application potential to ignore. However, tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSVs. Similarly, the top-3 most-used database management systems are all intended for relational data. Representation learning for tables, possibly combined with other modalities such as code and text, has shown impressive performance for tasks like semantic parsing, question answering, table understanding, data preparation, and data analysis (e.g. text-to-SQL). The pre-training paradigm has also been shown to be effective for tabular ML (classification/regression). More recently, we also observe promising potential in applying and enhancing LLMs in the domain of structured data, improving how we process and derive insights from it.

The Table Representation Learning (TRL) workshop is the premier venue in this emerging research area and has three main goals:

  • (1) Motivate tables as a primary modality for representation and generative models and advance the area further.
  • (2) Showcase impactful applications of pretrained table models and identify open challenges for future research, with a particular focus on industry insights in 2024.
  • (3) Foster discussion and collaboration across the ML, NLP, IR and DB communities.

When: Saturday 14 December 2024.
Where: East Meeting Room 11 & 12, Vancouver Convention Centre, Vancouver, Canada.
Specific questions: madelon@berkeley.edu

Sponsored by:


Call for Papers


Important Dates


Submission Open: September 1, 2024
Submission Deadline: September 20, 2024 (11:59PM AoE)
Notifications: October 9, 2024 (11:59PM AoE)
Camera-ready: October 30, 2024 (11:59PM AoE)
Slides for contributed talks: November 30, 2024 (11:59PM AoE)
Video pitches for posters (optional): November 30, 2024 (11:59PM AoE)
Workshop Date: December 14, 2024

Scope

We invite submissions on representation and generative learning over tables, related to any of the following topics:

  • Representation Learning for (semi-)Structured Data such as spreadsheets, tables, and full relational databases. Example contributions are new model architectures, data encoding techniques, tailored tokenization methods, pre-training and fine-tuning techniques, etc.
  • Generative Models and LLMs for Structured Data, covering models such as large language models (LLMs) and diffusion models, as well as specialized techniques for prompt engineering, single-task and multi-task fine-tuning, LLM-driven interfaces and multi-agent systems, retrieval-augmented generation, etc.
  • Multimodal Learning where structured data is jointly embedded or combined with other modalities such as text, code (e.g., SQL), knowledge graphs, and visualizations/images.
  • Applications of TRL models for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. data search, fact-checking/QA, KG alignment), analysis (e.g. text-to-SQL and visualization), tabular data generation, (end-to-end) tabular machine learning, table extraction (e.g. parsers/extraction for unstructured data), and query optimization (e.g. cardinality estimation).
  • Challenges of TRL models in production, i.e., work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g., data updating, error correction, monitoring, handling data privacy, personalization performance, etc.
  • Domain-specific challenges for learned table models, which often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy, security limitations, and other factors that necessitate tailored solutions.
  • Benchmarks, analyses, and datasets for TRL including assessing LLMs and other generative models as base models versus alternative approaches, analysis of model robustness with respect to large, messy, and heterogeneous tabular data, etc.
  • Other contributions such as surveys, demonstrations, visions, and reflections on table representation learning and generative models for structured data.

Organization

Workshop Chairs


Haoyu Dong
Microsoft
Laurel Orr
Numbers Station AI
Qian Liu
Sea AI Lab

Vadim Borisov
University of Tübingen




Program

TRL is again entirely in person and this year features two poster sessions and contributed talks. We also host several invited talks on established research in this emerging area, and a panel discussion focused on industry/startup perspectives.

Invited Speakers


Gaël Varoquaux
Inria, Probabl
Yasemin Altun
Google DeepMind
Matei Zaharia
UC Berkeley, Databricks





Panelists


Moderated by Laurel Orr.
Xiao Ling
Numbers Station


Kenny Daniel
hyperparam


Maithra Raghu
Samaya AI


Shivam Singhal
Structured


Junwei Ma
layer6


Schedule (tentative)


08:30 AM Opening Notes
08:40 AM Session 1 (TRL for Tabular ML):
08:40 AM - 09:20 AM Invited talk by Gaël Varoquaux: Tabular foundation models for analytics: challenges and progress
[Abstract]
Deep learning typically does not outperform tree-based models on tabular data. Often this may be explained by the small size of such datasets. For images, sound, and text, the solution has been pretrained models, leading to foundation models adapted and reused for many tasks. I will discuss the challenges of bringing these ideas to tabular learning, and the progress that we have made, leading to the CARTE tabular model.
[Speaker Bio]
Gaël Varoquaux is a research director working on data science at Inria (the French national research institute for computer science), where he leads the Soda team. He is also co-founder and scientific advisor of Probabl. Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has worked at UC Berkeley, McGill, and the University of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate of Ecole Normale Superieure, Paris.
09:20 AM - 09:30 AM MotherNet: Fast Training and Inference via Hyper-Network Transformers
Andreas Mueller, Carlo Curino, Raghu Ramakrishnan
09:30 AM - 09:40 AM PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning
Weihua Hu, Yiwen Yuan, Zecheng Zhang, Akihiro Nitta, Kaidi Cao, Vid Kocijan, Jinu Sunil, Jure Leskovec, Matthias Fey
09:40 AM Coffee Break
10:00 AM Session 2 (LLMs & Tables):
10:00 AM - 10:35 AM Invited Talk by Yasemin Altun (Google DeepMind): Advancements in Structure-Aware Reasoning for Tabular Data
[Abstract]
Traditional methods have struggled to capture the rich structural relationships within tables, limiting their reasoning capabilities. I will give an overview of our work over the last few years, highlighting advancements in structure-aware reasoning over tabular data through novel model architectures, task formulations, and training regimes, leading to significant performance gains on various tasks as well as opening up new possibilities for applications.
[Speaker Bio]
Yasemin Altun is a Research Scientist at Google working on natural language understanding. She received her PhD from Brown University. Before joining Google, she was a faculty member at the Toyota Technological Institute at Chicago and the Max Planck Institute for Biological Cybernetics, Tübingen.
10:35 AM - 10:45 AM Large Language Models Engineer Too Many Simple Features for Tabular Data
Jaris Küken, Lennart Purucker, Frank Hutter
10:45 AM - 10:55 AM TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation
Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec
10:55 AM - 11:05 AM Expertise-Centric Prompting Framework for Financial Tabular Data Generation using Pre-trained Large Language Models
Subin Kim, Jungmin Son, Minyoung Jung, Youngjun Kwak
11:05 AM - 11:15 AM TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
Aamod Khatiwada, Harsha Kokel, Ibrahim Abdelaziz, Subhajit Chaudhury, Julian Dolby, Oktie Hassanzadeh, Zhenhan Huang, Tejaswini Pedapati, Horst Samulowitz, Kavitha Srinivas
11:15 AM - 12:00 PM Poster session 1
  • Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data
  • Towards Localization via Data Embedding for TabPFN
  • From One to Zero: RAG-IM Adapts Language Models for Interpretable Zero-Shot Predictions on Clinical Tabular Data
  • Tabular Data Generation using Binary Diffusion
  • Unmasking Trees for Tabular Data
  • AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler
  • Adapting TabPFN for Zero-Inflated Metagenomic Data
  • Exploration of autoregressive models for in-context learning on tabular data
  • TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Node Features
  • Augmenting Small-size Tabular Data with Class-Specific Energy-Based Models
  • The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features
  • Adaptivee: Adaptive Ensemble for Tabular Data
  • MotherNet: Fast Training and Inference via Hyper-Network Transformers
  • PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning
  • Scaling Generative Tabular Learning for Large Language Models
  • TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation
  • Large Language Models Engineer Too Many Simple Features for Tabular Data
  • TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
  • Expertise-Centric Prompting Framework for Financial Tabular Data Generation using Pre-trained Large Language Models
  • TABGEN-RAG: Iterative Retrieval for Tabular Data Generation with Large Language Models
  • DynoClass: A Dynamic Table-Class Detection System Without the Need for Predefined Ontologies
  • HySem: A context length optimized LLM pipeline for unstructured tabular extraction
  • Enhancing Biomedical Schema Matching with LLM-based Training Data Generation
  • Relational Deep Learning: Graph Representation Learning on Relational Databases
  • SALT: Sales Autocompletion Linked Business Tables Dataset
  • Towards Agentic Schema Refinement
  • RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph
  • Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance
  • SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing
  • Data-Centric Text-to-SQL with Large Language Models
  • Towards Optimizing SQL Generation via LLM Routing
12:00 PM Lunch Break
01:30 PM Session 3 (NL interfaces to tables):
01:30 PM - 02:10 PM Invited Talk by Matei Zaharia (UC Berkeley / Databricks): Lessons from building natural language query interfaces in Databricks AI/BI
[Abstract]
TBC
[Speaker Bio]
Matei Zaharia is an Associate Professor of EECS at UC Berkeley and a co-founder and CTO of Databricks. He started the Apache Spark open-source project during his PhD at UC Berkeley in 2009, and has worked broadly on other widely used data and AI software, including Delta Lake, MLflow, Dolly and ColBERT. He currently works on a variety of research projects in cloud computing, database management, AI and information retrieval. Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
02:10 PM - 02:20 PM MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation
Satya Krishna Gorti, Ilan Gofman, Zhaoyan Liu, Jiapeng Wu, Noël Vouitsis, Guangwei Yu, Jesse Cresswell, Rasa Hosseinzadeh
02:20 PM - 02:30 PM The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models
Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, Amine Mhedhbi
02:30 PM - 03:15 PM Poster session 2
  • On Short Textual Value Column Representation Using Symbol Level Language Models
  • Benchmarking table comprehension in the wild
  • Matchmaker: Self-Improving Compositional LLM Programs for Table Schema Matching
  • TARGET: Benchmarking Table Retrieval for Generative Tasks
  • Multi-Stage QLoRA with Augmented Structured Dialogue Corpora: Efficient and Improved Conversational Healthcare AI
  • Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection
  • Enhancing Table Representations with LLM-powered Synthetic Data Generation
  • UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining
  • Automating Enterprise Data Engineering with LLMs
  • Lightweight Correlation-Aware Table Compression
  • MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation
  • The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models
  • TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
  • PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
  • Tabby: Tabular Adaptation for Language Models
  • ICE-T: Interactions-aware Cross-column Contrastive Embedding for Heterogeneous Tabular Datasets
  • TabDeco: A Comprehensive Contrastive Framework for Decoupled Representations in Tabular Data
  • LLM Embeddings Improve Test-time Adaptation to Tabular Y|X-Shifts
  • Recurrent Interpolants for Probabilistic Time Series Prediction
  • Improving LLM Group Fairness on Tabular Data via In-Context Learning
  • Distributionally robust self-supervised learning for tabular data
  • Sparsely Connected Layers for Financial Tabular Data
  • Learnable Numerical Input Normalization for Tabular Representation Learning based on B-splines
  • Relational Data Generation with Graph Neural Networks and Latent Diffusion Models
  • AGATa: Attention-Guided Augmentation for Tabular Data in Contrastive Learning
  • GAMformer: Exploring In-Context Learning for Generalized Additive Models
  • TabFlex: Scaling Tabular Learning to Millions with Linear Attention
  • Unlearning Tabular Data Without a "Forget Set"
  • Scalable Representation Learning for Multimodal Tabular Transactions
03:15 PM Coffee Break
03:30 PM Session 4: 🔥
03:30 PM - 04:10 PM Invited Talk by Josh Gardner (Apple): Toward Robust, Reliable, and Generalizable Tabular Data Models
[Abstract]
Tabular data is widely used across many domains and real-world applications. However, tabular data research has not yet achieved the level of robust and transferable large-scale modeling which has been hugely impactful for data modalities such as text, images, and audio. In this talk, I will apply an empirical perspective to understand the landscape of robustness and transferability in tabular data, and discuss our recent progress toward improved data, models, and evaluations to enable true tabular foundation models.
[Speaker Bio]
Josh is a Research Scientist on the Foundation Modeling team at Apple. Prior to joining Apple, Josh completed his PhD at the University of Washington. Josh's research centers on the empirical foundations of machine learning -- in particular, the impact of data on foundation models, and improving foundation models' understanding of new data modalities beyond images and text.
04:10 PM - 04:50 PM Panel: TRL and Startups
Xiao Ling (Numbers Station), Kenny Daniel (hyperparam), Maithra Raghu (Samaya AI), Shivam Singhal (Structured), Junwei Ma (layer6)
04:50 PM Closing Notes & Awards


Submission Guidelines

Submission link

Submit your (anonymized) paper through OpenReview at: TBC
Please be aware that accepted papers are expected to be presented at the workshop in person.

Formatting guidelines

The workshop accepts regular research papers and industrial papers of the following types:
  • Short paper: 4 pages + references and appendix.
  • Regular paper: 8 pages + references and appendix.


Submissions should be anonymized and follow the NeurIPS style files (zip), but may exclude the checklist. Non-anonymous preprints are not a problem, and artifacts do not have to be anonymized; simply submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. The footer of accepted papers should state “Table Representation Learning Workshop at NeurIPS 2024”. We expect authors to adopt an inclusive and diverse writing style. The “Diversity and Inclusion in Writing” guide by the DE&I in DB Conferences effort is a good resource.

Review process

Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be published on the website and made public on OpenReview, but the workshop is non-archival (i.e. without proceedings).

Novelty and conflicts

The workshop cannot accept submissions that have been published at NeurIPS or other machine learning venues as-is, but we do invite relevant papers from the main conference (NeurIPS) to be submitted to the workshop as 4-page short papers. We also welcome submissions that have been published in, for example, data management or natural language processing venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular with respect to the organization and program committees.

Camera-ready instructions

Camera-ready papers are expected to include the author names and affiliations on the first page, and to state "Table Representation Learning Workshop at NeurIPS 2024" in the footer. The camera-ready version may exceed the page limit for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of the novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision), and will be published on OpenReview and this website. Please make sure that all meta-data is correct as well, as it will be imported to the NeurIPS website.

Presentation instructions

All accepted papers will be presented as a poster during one of the poster sessions (the schedule per poster session will be released soon). For poster formatting, please refer to the poster instructions on the NeurIPS site (template, upload, etc.); you can print and bring the poster yourself or print it locally through the facilities offered by NeurIPS.
Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to madelon@berkeley.edu, which will be hosted on the website and YouTube channel of the workshop.
Authors of papers selected for oral presentation are also asked to prepare a talk of 9 minutes (+1 min Q&A), and to upload their slides through the "slides" field in OpenReview (pdf) or share a link to Google Slides with madelon@cwi.nl. The schedule for the oral talks will be published soon. The recordings of oral talks will be published afterwards.

Program Committee

We are very grateful to all members of the Program Committee listed below!
    Sercan O Arik, Google
    Micah Goldblum, New York University
    Andreas Mueller, Microsoft
    Xi Fang, Yale University
    Naihao Deng, University of Michigan
    Sebastian Schelter, BIFOLD & TU Berlin
    Weijie Xu, Amazon
    Rajat Agarwal, Amazon
    Lei Cao, University of Arizona
    Paul Groth, University of Amsterdam
    Alex Zhuang, University of Waterloo
    Sepanta Zeighami, University of California, Berkeley
    Jayoung Kim, Yonsei University
    Jaehyun Nam, KAIST
    Sascha Marton, University of Mannheim
    Tianji Cong, University of Michigan
    Myung Jun Kim, Inria
    Aneta Koleva, University of Munich
    Peter Baile Chen, MIT
    Gerardo Vitagliano, MIT
    Reynold Cheng, the University of Hong Kong
    Till Döhmen, MotherDuck / University of Amsterdam
    Ivan Rubachev, Higher School of Economics
    Raul Castro Fernandez, University of Chicago
    Paolo Papotti, Eurecom
    Carsten Binnig, TU Darmstadt / Google
    Tianbao Xie, the University of Hong Kong
    Jintai Chen, University of Illinois at Urbana-Champaign
    Sebastian Bordt, Eberhard-Karls-Universität Tübingen
    Panupong Pasupat, Google
    Liangming Pan, University of Arizona
    Xinyuan Lu, National University of Singapore
    Ziyu Yao, George Mason University
    Shuhan Zheng, Hitachi, Ltd.
    Shuaichen Chang, Amazon
    Julian Martin Eisenschlos, Google DeepMind
    Noah Hollmann, Albert-Ludwigs-Universität Freiburg
    Tianshu Zhang, Ohio State University
    Liane Vogel, Technische Universität Darmstadt
    Roman Levin, Amazon
    Yury Gorishniy, Moscow Institute of Physics and Technology
    Edward Choi, KAIST
    Gyubok Lee, KAIST
    Mingyu Zheng, University of Chinese Academy of Sciences
    Tassilo Klein, SAP
    Ge Qu, the University of Hong Kong
    Artem Babenko, Yandex
    Shreya Shankar, University of California Berkeley
    Xiang Deng, Google
    Mengyu Zhou, Microsoft Research
    Mira Moukheiber, MIT
    Amine Mhedhbi, Polytechnique Montréal



Accepted Papers

2024

Oral

Poster


2023

Oral

Poster


2022

Oral

Poster