TRL 2023 is a wrap! Interested in updates on future events, news, etc.? Sign up for the mailing list here!
Tables are a promising modality for representation learning, with too much application potential to ignore. However, tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSVs. Similarly, the top-3 most-used database management systems are all intended for relational data (RDBMS). Representation learning over tables, possibly combined with other modalities such as text or SQL, has shown impressive performance for tasks like semantic parsing, question answering, table understanding, and data preparation. More recently, the pre-training paradigm was shown to be effective for tabular ML as well, while researchers have also started exploring the impressive capabilities of LLMs for table encoding and data manipulation.
The Table Representation Learning (TRL) workshop is the first in this emerging research area and concentrates on three main goals:
- (1) Motivate tables as a primary modality for representation and generative learning and advance the area further.
- (2) Showcase impactful applications of pretrained table models and discuss future opportunities.
- (3) Foster discussion and collaboration across the ML, NLP and DB communities.
Where: Room 235-236, New Orleans Ernest N. Morial Convention Center.
Program
TRL is again entirely in-person, and will this year feature two longer poster sessions and more (spotlight) talks on contributed work. We also host a few exciting invited talks on established research in this emerging area, and a panel discussion.
Invited Speakers
Wenhu Chen, University of Waterloo, Google DeepMind, Vector Institute
Frank Hutter, University of Freiburg
Simran Arora, Stanford University
Immanuel Trummer, Cornell University
Tao Yu, The University of Hong Kong
Panelists (TBC)
Schedule
06:30 AM | Opening Notes |
06:45 AM | Invited Talk by Simran Arora: Co-Designing LLMs and LLM-Powered Data Management Tools |
Abstract: Large Language Models (LLMs) are now widely used for data management. We recently proposed Evaporate [ICLR Spotlight 2023, VLDB 2024], a system that uses LLMs to help users efficiently query semi-structured documents. We also showed how off-the-shelf LLMs perform data-wrangling tasks with state-of-the-art quality and no specialized training [VLDB 2023]. This talk discusses some of my lessons from working on these early LLM-for-data-management projects and subsequent research to improve the reach of these systems — in particular, there is a ways to go for extending LLMs to datatypes such as private, semi-structured, and long-sequence data. Towards extending our capabilities on these datatypes, I’ll discuss MQAR and Monarch Mixer [NeurIPS Oral 2023], new LM architectures that can match the quality of attention-based LMs, while remaining asymptotically more efficient at training and inference time. We’ll finally discuss how these fundamental breakthroughs can power next-generation data management tools.
Bio: Simran Arora is a PhD student at Stanford University in machine learning systems, advised by Chris Ré. She develops tools that help users apply foundation models to challenging datatypes such as private, semi-structured, and long-sequence data. To unlock these capabilities, she leverages a detailed understanding of the data and inductive biases used to train foundation models. Her work has received Oral (top 5%) and Spotlight (top 25%) awards at NeurIPS and ICLR. Simran also recently created and taught a new [Systems for Machine Learning](https://cs229s.stanford.edu/fall2023/) full-unit course at Stanford. She is grateful for the support of the SGF Sequoia Fellowship.
07:15 AM | Spotlight Talks: Session 1 (Data Wrangling and Table QA) |
07:15 AM – 07:22 AM | High-Performance Transformers for Table Structure Recognition Need Early Convolutions |
Abstract: Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://anonymous.4open.science/r/NeurIPS23-TRL-2 to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
07:23 AM – 07:30 AM | Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples |
Abstract: Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et al., 2022). An important aspect of data wrangling with LMs is to properly construct prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et al., demonstration examples are chosen manually by the authors, which may not be scalable to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84% over randomly selected examples and 49% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and various factors in demonstration such as quantity, quality, and diversity.
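For readers unfamiliar with embedding-based demonstration selection, the toy sketch below illustrates the general idea of choosing in-context examples by nearest-neighbor search in an embedding space. TF-IDF vectors and scikit-learn stand in for the paper's embedding model; the pool, labels, and prompt format are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of picking in-context examples by embedding distance.
# TF-IDF + scikit-learn stand in for the paper's embedding model; all names
# and example data below are illustrative, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Pool of labeled examples that demonstrations are drawn from.
pool_inputs = [
    "Name: Apple iPhone 12; Price: $799",
    "Name: Samsung Galaxy S21; Price: $999",
    "Name: Dell XPS 13; Price: $1,199",
]
pool_labels = ["phone", "phone", "laptop"]

vectorizer = TfidfVectorizer().fit(pool_inputs)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(pool_inputs)
)

def build_prompt(query: str) -> str:
    """Select the demonstrations closest to the query in embedding space."""
    _, nn_idx = index.kneighbors(vectorizer.transform([query]))
    demos = [f"Input: {pool_inputs[i]}\nLabel: {pool_labels[i]}" for i in nn_idx[0]]
    return "\n\n".join(demos) + f"\n\nInput: {query}\nLabel:"

print(build_prompt("Name: Google Pixel 7; Price: $599"))
```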
07:31 AM – 07:38 AM | TabPFGen – Tabular Data Generation with TabPFN |
Abstract: Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation.
07:38 AM – 07:45 AM | Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL |
Abstract: Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks have both query and data ambiguities. Most works on Text-to-SQL focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these ambiguities, we found GPT-4's performance improved by 28.9%.
07:46 AM – 07:53 AM | MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering |
Abstract: Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.
08:00 AM | Coffee Break / Poster Setup |
08:20 AM | Poster: Session 1 |
09:00 AM | Invited Talk by Frank Hutter: Advances in In-Context Learning for Tabular Datasets |
Abstract: A year ago, we introduced TabPFN, the first in-context learning method for tabular data. In this talk, I will discuss what has happened since. I will start by briefly discussing CAAFE, a system that uses LLMs for automated feature engineering on tabular data and makes effective use of TabPFN's speed. Then, I will situate prior-data fitted networks (PFNs) in the in-context learning literature, review various applications of PFNs, explain TabPFN in some more detail, and then discuss our ongoing work on removing TabPFN's remaining limitations.
Bio: Frank Hutter is a Full Professor for Machine Learning at the University of Freiburg (Germany). He holds a PhD from the University of British Columbia (UBC, 2009) and a Diplom (eq. MSc) from TU Darmstadt (2004). He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, and with his coauthors, several best paper awards and prizes in international competitions on machine learning, SAT solving, and AI planning. He is a Fellow of EurAI and ELLIS, the director of the ELLIS unit Freiburg and the recipient of 3 ERC grants. Frank is best known for his research on automated machine learning (AutoML), including neural architecture search and efficient hyperparameter optimization. He co-authored the first book on AutoML and the prominent AutoML tools Auto-WEKA, Auto-sklearn and Auto-PyTorch, won the first two AutoML challenges with his team, co-organized the ICML workshop series on AutoML every year 2014-2021, has been the general chair of the inaugural AutoML conference 2022 and is general chair again in 2023.
09:30 AM | Spotlight Talks: Session 2 (Tabular ML) |
09:30 AM – 09:37 AM | Self-supervised Representation Learning from Random Data Projectors |
Abstract: Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities such as tabular and time-series data. This paper presents an SSRL approach that can be applied to these data modalities because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on real-world applications with tabular and time-series data. We show that it outperforms multiple state-of-the-art SSRL baselines and is competitive with methods built on domain-specific knowledge. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study.
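As a rough illustration of learning from random projections without augmentations, here is a minimal PyTorch sketch: an encoder is trained so that lightweight heads on its representation can reconstruct fixed random linear projections of the input. The layer sizes, number of projectors, and plain MSE loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal PyTorch sketch of representation learning by reconstructing fixed
# random projections of the input (no augmentations, no masking). Sizes,
# number of projectors, and the plain MSE loss are illustrative assumptions.
import torch
import torch.nn as nn

d_in, d_rep, n_proj, d_proj = 20, 64, 4, 8

encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_rep))
heads = nn.ModuleList(nn.Linear(d_rep, d_proj) for _ in range(n_proj))
# Fixed random projection matrices define the reconstruction targets.
projectors = [torch.randn(d_in, d_proj) for _ in range(n_proj)]

opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)

x = torch.randn(256, d_in)  # a toy batch of tabular rows
for _ in range(100):
    z = encoder(x)
    # Each head reconstructs one random projection of the raw input.
    loss = sum(((head(z) - x @ proj) ** 2).mean()
               for head, proj in zip(heads, projectors))
    opt.zero_grad()
    loss.backward()
    opt.step()

# encoder(x) now serves as the learned representation for downstream tasks.
```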
09:38 AM – 09:45 AM | GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data |
Abstract: Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) for extracting this implicit structure, and for conditioning the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers.
09:46 AM – 09:53 AM | HyperFast: Instant Classification for Tabular Data |
Abstract: Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only in toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at www.url-hidden-for-submission.
09:54 AM – 10:01 AM | Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation |
Abstract: Tabular data is prevalent across various machine learning domains. Yet, the inherent heterogeneity of attribute and class spaces across different tabular datasets hinders the effective sharing of knowledge, limiting a tabular model's ability to benefit from other datasets. In this paper, we propose Tabular data Pre-Training via Meta-representation (TabPTM), which allows one tabular model to be pre-trained on a set of heterogeneous datasets. This pre-trained model can then be directly applied to unseen datasets that have diverse attributes and classes without additional training. Specifically, TabPTM represents an instance through its distance to a fixed number of prototypes, thereby standardizing heterogeneous tabular datasets. A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences, endowing TabPTM with the ability of training-free generalization. Experiments validate that TabPTM achieves promising performance on new datasets, even under few-shot scenarios.
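To make the distance-to-prototypes idea concrete, the snippet below shows how rows from tables of different widths can be mapped to fixed-length vectors of sorted distances to a set of prototypes. Using k-means centroids as prototypes is an assumption made here for illustration only; the paper's prototype construction and downstream network are not reproduced.

```python
# Toy sketch of a distance-to-prototype meta-representation: rows from tables
# with different numbers of columns are mapped to vectors of the same length.
# Using k-means centroids as prototypes is an illustrative assumption, not
# necessarily the paper's recipe.
import numpy as np
from sklearn.cluster import KMeans

def meta_representation(X: np.ndarray, n_prototypes: int = 8) -> np.ndarray:
    """Return, per row, its sorted distances to a fixed set of prototypes."""
    prototypes = KMeans(n_clusters=n_prototypes, n_init=10,
                        random_state=0).fit(X).cluster_centers_
    dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=-1)
    return np.sort(dists, axis=1)  # shape (n_rows, n_prototypes), regardless of width

X_narrow = np.random.rand(100, 5)   # a dataset with 5 features
X_wide = np.random.rand(100, 37)    # a dataset with 37 features
print(meta_representation(X_narrow).shape, meta_representation(X_wide).shape)  # (100, 8) (100, 8)
```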
10:00 AM | Lunch Break |
11:30 AM | Invited Talk by Immanuel Trummer: Next-Generation Data Management with Large Language Models |
Abstract: The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I discuss several recent research projects at Cornell, exploiting large language models to enhance relational database management systems. These projects cover applications of language models in the database interface, enabling users to specify high-level analysis goals for fully automated end-to-end analysis, as well as applications in the backend, using language models to extract useful information for data profiling and database tuning from text documents.
Bio: Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His online lecture introducing students to database topics collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.
12:00 PM | Spotlight Talks: Session 3 (Tables + LLMs) |
12:00 PM – 12:07 PM | Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs |
Abstract: Large language models (LLMs) are increasingly applied for tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLM's ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
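As a flavor of what "formats" and "noise operations" mean here, the sketch below serializes one small table in two prompt formats and applies a single perturbation. The eight formats and eight noise operators studied in the paper are not reproduced, and the example data is made up.

```python
# Illustrative sketch of serializing one table in two prompt formats and
# applying a single perturbation (reversing column order). The paper's eight
# formats and eight noise operators are not reproduced; the data is made up.
import pandas as pd

table = pd.DataFrame({"city": ["Oslo", "Lima"], "population": [709000, 10000000]})

csv_prompt = table.to_csv(index=False)          # CSV-style serialization
json_prompt = table.to_json(orient="records")   # JSON-records serialization
noisy_table = table[list(table.columns[::-1])]  # one noise operation: reverse columns

print(csv_prompt)
print(json_prompt)
print(noisy_table.to_csv(index=False))
```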
12:08 PM – 12:15 PM | How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings |
Abstract: Large language models (LLMs) with in-context learning have demonstrated remarkable capability in the text-to-SQL task. Previous research has prompted LLMs with various demonstration-retrieval strategies and intermediate reasoning steps to enhance the performance of LLMs. However, those works often employ varied strategies when constructing the prompt text for text-to-SQL inputs, such as databases and demonstration examples. This leads to a lack of comparability in both the prompt constructions and their primary contributions. Furthermore, selecting an effective prompt construction has emerged as a persistent problem for future research. To address this limitation, we comprehensively investigate the impact of prompt constructions across various settings and provide insights into prompt constructions for future text-to-SQL studies.
12:16 PM – 12:23 PM | IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models |
Abstract: There is a massive amount of tabular data that can be taken advantage of via "foundation models" to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality for categorical values, and scalability to many tables, rows, and features. We propose IngesTables, a novel canonical tabular foundation model building framework, designed to address the aforementioned challenges. IngesTables employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, IngesTables is much cheaper to train and faster to run inference, because of how LLM-generated embeddings are defined and cached. We show that IngesTables demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets, without incurring LLM-level cost and latency.
12:30 PM | Invited Talk by Tao Yu: Advancing Natural Language Interfaces to Data with Language Models as Agents |
Abstract: Traditional Natural Language Interfaces (NLIs) to data often necessitate users to provide detailed, step-by-step instructions, reflecting an assumption of user familiarity with the underlying data and systems, which can limit accessibility. The emergence of Large Language Models (LLMs) has, however, revolutionized NLIs, enabling them to perform sophisticated reasoning, decision-making, and planning multi-step actions in diverse environments autonomously. In this talk, I will discuss how these language models as agents facilitate a paradigm shift towards moving beyond traditional code generation to more autonomous and user-friendly NLIs, capable of understanding high-level objectives without requiring intricate directives. I will also present our latest work in this direction, including instruction-finetuned retrievers for diverse environment adaptation, the enhancement of LLM capabilities with tool integration, and the development of open, state-of-the-art LLMs and platforms for constructing such language agents. The talk will conclude with an exploration of the current and future research prospects in this rapidly evolving domain.
Bio: Tao Yu is an Assistant Professor of Computer Science at The University of Hong Kong and a director of the XLANG Lab (as part of the HKU NLP Group). He spent one year in the UW NLP Group working with Noah Smith, Luke Zettlemoyer, and Mari Ostendorf. He completed his Ph.D. in Computer Science at Yale University, advised by Dragomir Radev, and his master's at Columbia University, advised by Owen Rambow and Kathleen McKeown. Tao has received the Google and Amazon faculty research awards (Google Research Scholar Award 2023, Amazon Research Award 2022). His main research interest is in Natural Language Processing. His research aims to build language model agents that transform ("grounding") language instructions into code or actions executable in real-world environments, including databases, web applications, and the physical world. It lies at the heart of the next generation of natural language interfaces that can interact with and learn from these real-world environments to facilitate human interaction with data analysis, web applications, and robotic instruction through conversation.
01:00 PM | Coffee Break / Poster Setup |
01:20 PM | Poster: Session 2 |
02:00 PM | Invited Talk by Wenhu Chen: Enabling Large Language Models to Reason with Tables |
Abstract: Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. While it is true that tables can be used as inputs to LLMs with serialization, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this talk, I will cover different ways to utilize LLMs to interface with tables. One approach is to feed the whole table as a sequence to LLMs for reasoning. In this direction, we will talk about the recent paper GPT4Table to summarize the lessons learned on different table linearization strategies, including table input format, content order, role prompting, and partition marks. The other approach is to use tools like SQL or other languages to interface with the table for data access without feeding in the entire table. The LLM then works as a reasoner to derive the answer based on the results returned from the table.
Bio: Wenhu Chen has been an assistant professor in the Computer Science Department at the University of Waterloo and at the Vector Institute since 2022. He received a Canada CIFAR AI Chair Award in 2022. He has also worked for Google DeepMind as a part-time research scientist since 2021. Before that, he obtained his PhD from the University of California, Santa Barbara under the supervision of William Wang and Xifeng Yan. His research interests lie in natural language processing, deep learning, and multimodal learning. He aims to design models to handle complex reasoning scenarios like math problem-solving, structured knowledge grounding, etc. He is also interested in building more powerful multimodal models to bridge different modalities. He received the Area Chair Award at AACL-IJCNLP 2023, the Best Paper Honorable Mention at WACV 2021, and the UCSB CS Outstanding Dissertation Award in 2021.
02:30 PM | Panel - TBA |
03:15 PM | Closing Notes |
Call for Papers
Important Dates
Submission Open | August 10, 2023 |
Submission Deadline | |
Notifications | |
Camera-ready | |
Slides for spotlight talks | |
Video pitches for posters | |
Workshop Date | December 15, 2023 |
Scope
We invite submissions on representation and generative learning over tables, related to any of the following topics:
- Representation Learning over tables, which can be structured as well as semi-structured, and extend to full databases. Example contributions are new model architectures, data encoding techniques, pre-training, fine-tuning, and prompting strategies, multi-task learning, etc.
- Generative Learning and LLMs for structured data and interfaces to structured data (e.g. queries, analysis).
- Multimodal learning where tables are jointly embedded with, for example, natural language, code (e.g. SQL), knowledge bases, visualizations/images.
- Downstream Applications of table representations for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. search, fact-checking/QA, KG construction), analysis (e.g. summarization, visualization, and query recommendation), and (end-to-end) machine learning.
- Upstream Applications of table representation models for optimizing table parsers/extraction (from documents, spreadsheets, presentations), storage (e.g. compression, indexing), and query processing (e.g. query optimization).
- Production challenges of table representation models. Work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g. data updating, error correction, and monitoring, and other industry challenges such as privacy, personalization, performance, etc.
- Domain-specific challenges for learned table models, which often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy, security limitations, and other factors that necessitate tailored solutions.
- Benchmarks and analyses of table representation models, including the utility of language models as base models versus alternatives and robustness regarding large, messy, heterogeneous, or complex tables.
- Others: Formalization, surveys, datasets, visions, and reflections to structure and guide future research.
Submission Guidelines
Submission link
Submit your (anonymized) paper through OpenReview at: https://openreview.net/group?id=NeurIPS.cc/2023/Workshop/TRL
Please be aware that accepted papers are expected to be presented at the workshop in-person.
Formatting guidelines
The workshop accepts regular research papers and industrial papers of the following types:
- Short paper: 4 pages + references.
- Regular paper: 8 pages + references.
Submissions should be anonymized and follow the NeurIPS style files, but can exclude the checklist. Non-anonymous preprints are no problem, and artifacts do not have to be anonymized. Just submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. The footer of accepted papers should state “Table Representation Learning Workshop at NeurIPS 2023”. We expect authors to adopt an inclusive and diverse writing style. The “Diversity and Inclusion in Writing” guide by the DE&I in DB Conferences effort is a good resource.
Review process
Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be published on the website and made public on OpenReview, but the workshop is non-archival (i.e. without proceedings).
Novelty and conflicts
The workshop does not accept submissions that have previously been published at NeurIPS or other machine learning venues. However, we do welcome submissions that have been published in, for example, data management or natural language processing venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular, with respect to the organization and program committees.
Camera-ready instructions
Camera-ready papers are expected to include the authors and affiliations on the first page, and state "Table Representation Learning Workshop at NeurIPS 2023" in the footer. The camera-ready version may exceed the page limit for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision), and will be published on OpenReview and this website. Please make sure that all meta-data is correct as well, as it will be imported to the NeurIPS website.
Presentation instructions
All accepted papers will be presented as a poster during one of the poster sessions (TBA). For poster formatting, please refer to the poster instructions on the NeurIPS site; you can print and bring the poster yourself or consider the FedEx offer for NeurIPS. Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to m.hulsebos@uva.nl, which will be hosted on the website and YouTube channel of the workshop. Papers selected for spotlight talks are also asked to prepare a talk of 6 minutes (+1 min Q&A) and upload their slides through the "slides" field in OpenReview. Timeslots for the spotlights will be published soon. The recordings of oral talks will be published as well.
Organization
Workshop Chairs
UC Berkeley
Microsoft
Harvard
Numbers Station AI
Google DeepMind
INRIA
Program Committee
Paul Groth, University of Amsterdam
Wensheng Dou, Institute of Software Chinese Academy of Sciences
Hiroshi Iida, The University of Tokyo
Sharad Chitlangia, Amazon
Jaehyun Nam, KAIST
Jinyang Li, The University of Hong Kong
Gerardo Vitagliano, Hasso Plattner Institute
Rajat Agarwal, Amazon
Micah Goldblum, New York University
Yury Gorishniy, Yandex Research
Roman Levin, Amazon
Bhavesh Neekhra, Ashoka University
Sebastian Schelter, University of Amsterdam
Qingping Yang, University of Chinese Academy of Sciences
Matteo Interlandi, Microsoft
Tianji Cong, University of Michigan
Xiang Deng, Google
Beliz Gunel, Google
Qian Liu, Sea AI Lab
Shuaichen Chang, Ohio State University
Zhoujun Cheng, Shanghai Jiaotong University
Roee Shraga, Worcester Polytechnic Institute
Yi Zhang, AWS AI Labs
Xi Rao, ETH Zurich
Liane Vogel, Technical University of Darmstadt
Aneta Koleva, University of Munich / Siemens
Ivan Rubachev, HSE University / Yandex
Meghana Moorthy Bhat, Salesforce Research
José Cambronero, Microsoft
Till Döhmen, MotherDuck / University of Amsterdam
Noah Hollmann, Charité Berlin / University of Freiburg
Julian Martin Eisenschlos, Google
Paolo Papotti, Eurecom
Zhiruo Wang, Carnegie Mellon University
Mukul Singh, Microsoft
Zezhou Huang, Columbia University
Carsten Binnig, TU Darmstadt
Linyong Nan, Yale
Shuo Zhang, Bloomberg
Alejandro Sierra Múnera, Hasso Plattner Institute
Anirudh Khatry, Microsoft
Haoyu Dong, Microsoft
Accepted Papers
2023
Oral
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
- GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
- High-Performance Transformers for Table Structure Recognition Need Early Convolutions
- Self-supervised Representation Learning from Random Data Projectors
- HyperFast: Instant Classification for Tabular Data
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
- Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
- Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
- IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
- Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
- How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
- TabPFGen – Tabular Data Generation with TabPFN
Poster
- Generating Data Augmentation Queries Using Large Language Models
- ReConTab: Regularized Contrastive Representation Learning for Tabular Data
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data
- Augmentation for Context in Financial Numerical Reasoning over Textual and Tabular Data with Large-Scale Language Model (recording)
- TabContrast: A Local-Global Level Method for Tabular Contrastive Learning
- Explaining Explainers: Necessity and Sufficiency in Tabular Data
- Beyond Individual Input for Deep Anomaly Detection on Tabular Data
- GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent
- Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
- InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Models
- On Incorporating new Variables during Evaluation
- Unnormalized Density Estimation with Root Sobolev Norm Regularization
- Tree-Regularized Tabular Embeddings
- Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains
- A Deep Learning Blueprint for Relational Databases (recording)
- Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
- Modeling string entries for tabular data prediction: do we need big large language models?
- Hopular: Modern Hopfield Networks for Tabular Data
- NeuroDB: Efficient, Privacy-Preserving and Robust Query Answering with Neural Networks
- A DB-First approach to query factual information in LLMs
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning
- Incorporating LLM Priors into Tabular Learners
- CHORUS: Foundation Models for Unified Data Discovery and Exploration
- Introducing the Observatory Library for End-to-End Table Embedding Inference (recording)
- Scaling Experiments in Self-Supervised Cross-Table Representation Learning
- Benchmarking Tabular Representation Models in Transfer Learning Settings
- Exploring the Retrieval Mechanism for Tabular Deep Learning
- In Defense of Zero Imputation for Tabular Deep Learning
- Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
2022
Oral
- Analysis of the Attention in Tabular Language Models (recording)
- Transfer Learning with Deep Tabular Models (recording)
- STable: Table Generation Framework for Encoder-Decoder Models (recording)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (recording)
- Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning (recording)
- RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild (recording)
Poster
- SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training
- Generic Entity Resolution Models
- Towards Foundation Models for Relational Databases (video pitch)
- Diffusion models for missing value imputation in tabular data (video pitch)
- STab: Self-supervised Learning for Tabular Data
- CASPR: Customer Activity Sequence based Prediction and Representation
- Conditional Contrastive Networks
- Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers
- The Need for Tabular Representation Learning: An Industry Perspective
- STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables (Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, Jinwoo Shin)
- Tabular Data Generation: Can We Fool XGBoost?
- SiMa: Federating Data Silos using GNNs (video pitch)