What is the TAPAS Transformer in NLP and how does it work?

Chris Staff asked 2 months ago
1 Answer
Best Answer
Chris Staff answered 2 months ago

TAPAS (Table Parser) is a Transformer-based (more specifically, BERT-based) approach for automatically answering questions about tables. Let’s summarize the main findings from the TAPAS paper here so that you can understand what it is and how it works.

Question answering from tables is usually framed as a semantic parsing task: questions are translated into a logical form that can be executed against the table. When performed with machine learning, semantic parsers have so far relied on supervised training data that pairs each question with its logical form, which is expensive to collect. Recent approaches therefore focus on weak supervision, where training examples consist only of questions and denotations (the expected answers). Difficulties still emerge, for example spurious logical forms that happen to yield the correct answer for the wrong reason.

The paper “TAPAS: Weakly Supervised Table Parsing via Pre-Training” proposes the TAPAS model (Table Parser). It is a weakly supervised question answering model that reasons over tables *without* generating logical forms explicitly. Instead, it jointly predicts a minimal program: a relevant subset of table cells plus the most likely aggregation operator to execute on top of those cells. This allows TAPAS to learn operations from natural language without requiring an explicit formalism.
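
For a quick feel of what “selecting a subset of cells plus an aggregation operator” looks like in practice, here is a small usage sketch based on the TAPAS port in the HuggingFace Transformers library. The library, the checkpoint name and the exact API calls are assumptions on my side (they are not part of the paper itself) and may differ between library versions:

```python
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

# Assumed checkpoint name from the HuggingFace model hub (TAPAS finetuned on WikiTQ).
model_name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

# TAPAS expects all table values as strings.
table = pd.DataFrame({
    "City": ["Amsterdam", "Berlin", "Paris"],
    "Population": ["872757", "3769495", "2148271"],
})
queries = ["How many cities are listed?", "What is the population of Berlin?"]

inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
outputs = model(**inputs)

# Turn the logits into predicted cell coordinates and aggregation operator indices.
coords, agg_indices = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
)
print(coords, agg_indices)  # selected cells plus operators such as COUNT or NONE
```

Ideally, the first question would resolve to a COUNT over selected cells and the second to a plain cell selection – exactly the kind of “minimal program” the paper describes.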

We’ll now take a look at the individual components (architecture, pretraining, finetuning, experiments, results) in more detail, but let’s summarize the main findings and contributions briefly:
 
1. The work introduces a pretraining method for tables, extending BERT’s Masked Language Modeling (MLM) to structured data.
2. It subsequently pretrains the model over millions of tables and related text segments crawled from Wikipedia.
3. The work introduces an end-to-end differentiable training recipe for finetuning, allowing TAPAS to train from weak supervision.
4. Compared to previous approaches, TAPAS achieves better accuracy with several additional advantages:
* Simpler architecture: single encoder without autoregressive decoding.
* Benefits from pretraining, which makes it cost effective.
* Tackles more question types, such as those involving aggregation.
* Capable of directly handling conversational settings.
5. TAPAS performs better than or on par with modern approaches on three different semantic parsing datasets: SQA, WikiTQ and WikiSQL.
 
TAPAS architecture
* TAPAS’ architecture is BERT-based. It extends BERT’s architecture with additional positional embeddings to encode the tabular structure.
* The table is first flattened into a sequence of words, then converted into word pieces. Tokens for the question are put in front of the table tokens.
* Two classification layers are added for selecting table cells and aggregation operators.
* These are the embeddings present within the TAPAS BERT-like architecture (a small sketch of how they are combined follows after this list):
-> Position ID Embedding: index in the flattened sequence (like BERT).
-> Segment ID Embedding: two possible values – 0 for a question, 1 for table header and cells (like BERT’s segment embedding).
-> Column / Row ID Embedding: the (column, row) position of the cell within the table is encoded here, or 0 if the token is part of the question. It is an integer value and allows the model to learn the token’s position in the table.
-> Rank ID Embedding: if a column can be parsed as floats or dates, this embedding encodes the rank of each value within the column when sorted in ascending order. This is expected to help the model with questions involving superlatives, because the word pieces themselves may not convey rank order.
-> Previous Answer Embedding: given a conversational setup where questions may refer to previous questions/answers, this embedding marks whether a cell token was the answer to the previous question (1 or 0).
* Once trained, here’s how inference works:
-> The most likely aggregation operator is predicted together with a subset of the cells; the operator is then executed over the selected cells to produce the final answer.
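 
To make the embedding list above more concrete, here is a minimal sketch (PyTorch; my own illustration, not the released TAPAS code) of how the different embedding types could be combined. All names and sizes (hidden, max_cols, max_rows, ...) are illustrative assumptions; the key point is that every token representation is simply the sum of the token, position, segment, column, row, rank and previous-answer embeddings:

```python
import torch
import torch.nn as nn

class TapasStyleEmbeddings(nn.Module):
    """Illustrative composition of TAPAS-style input embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512,
                 max_cols=32, max_rows=64, max_rank=128):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_pos, hidden)   # index in the flattened sequence
        self.segment = nn.Embedding(2, hidden)          # 0 = question, 1 = table
        self.column = nn.Embedding(max_cols, hidden)    # 0 = question token
        self.row = nn.Embedding(max_rows, hidden)       # 0 = question token
        self.rank = nn.Embedding(max_rank, hidden)      # rank of numeric/date values, 0 = n/a
        self.prev_answer = nn.Embedding(2, hidden)      # 1 = cell answered the previous question

    def forward(self, token_ids, pos_ids, seg_ids, col_ids, row_ids, rank_ids, prev_ids):
        # Every token representation is just the sum of all embedding types.
        return (self.token(token_ids) + self.position(pos_ids) + self.segment(seg_ids)
                + self.column(col_ids) + self.row(row_ids)
                + self.rank(rank_ids) + self.prev_answer(prev_ids))
```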
 
TAPAS pretraining
* Following success in other NLP models, TAPAS is pretrained on a large number of tables from Wikipedia. This is an unsupervised approach.
* 6.2 million tables were extracted: 3.3 million Infoboxes and 2.9 million WikiTables, with at most 500 cells per table. Only horizontal tables with column names in a header row are used (Infoboxes are transposed to fit this format).
* As a proxy for questions in the pretraining setting, the table caption, article title, article description, segment title and the text of the segment in which the table appears are used as relevant text snippets (21.3 million in total).
* A BERT-like Masked Language Modeling objective is used. A second objective (predicting whether the table belongs to the text or is a random table) was added but later dropped because it did not improve results much, reminiscent of the issues found with BERT’s NSP objective in subsequent works.
* The maximum input length is set to 128 word pieces: the combined length of the tokenized text snippet and the table cells must fit within this limit. Inputs are constructed by randomly selecting snippets of 8-16 word pieces from the text, producing 10 different variations per table.
* The masking procedure combines whole word masking with whole cell masking: if any word piece of a cell is masked, all word pieces in that cell are masked (see the sketch below).
* No improvement was found through data augmentation.
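 
To illustrate the whole cell masking described above, here is a toy sketch (my own simplification, with token-level sampling standing in for whole word masking): if any word piece of a table cell is selected for masking, every word piece of that cell gets masked.

```python
import random

def whole_cell_mask(tokens, cell_ids, mask_prob=0.15, mask_token="[MASK]"):
    """Toy whole-cell masking: cell_ids maps each token to its (column, row) cell,
    or None for text/question tokens."""
    # Step 1: sample initial masking decisions (simplified to the token level here).
    masked = [random.random() < mask_prob for _ in tokens]

    # Step 2: if any piece of a cell is masked, mask every piece of that cell.
    masked_cells = {c for c, m in zip(cell_ids, masked) if m and c is not None}
    return [mask_token if (m or c in masked_cells) else t
            for t, c, m in zip(tokens, cell_ids, masked)]

tokens   = ["which", "city", "?", "city", "amsterdam", "ber", "##lin"]
cell_ids = [None, None, None, (1, 0), (1, 1), (1, 2), (1, 2)]
print(whole_cell_mask(tokens, cell_ids, mask_prob=0.3))
```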
 
TAPAS finetuning
* Finetuning the TAPAS model is performed with labeled data, but the labels are denotations rather than logical forms, so supervision remains weak.
* It is formally defined as follows: given N examples {(xi, Ti, yi)}, where xi is an utterance, Ti a table and yi a denotation, the goal is to train a model that maps xi to a program zi which, when executed against Ti, produces yi. Here, zi comprises a subset of table cells and an aggregation operator.
* Preprocessing converts every denotation y into a tuple (C, s) of cell coordinates C and a scalar s. The scalar s is only populated when y is a single scalar value; otherwise the example is a cell selection case and only the coordinates C are needed.
* Training is guided by these (C, s) combinations. Learning cell selection means selecting all cells in C. For scalar answer examples, where s is populated but C is empty, the model is trained to predict an aggregation over the selected table cells that amounts to s.
* Training for cell selection, detailed:
-> In this case, yi maps to a subset of table cell coordinates C.
-> A hierarchical model is used that first selects a column and then cells from the column.
-> This model is trained to select a column “col” that has the highest number of cells in C. If C is empty, a special “empty” column is selected.
-> The model is then trained to select the cells that belong to this column (a sketch of this loss follows after the finetuning list).
-> Loss for constructing this model is computed as follows: Loss = Loss_col + Loss_cells + Alpha * Loss_aggr.
-> Here, Loss_col is the binary crossentropy loss for column selection; Loss_cells is the binary crossentropy loss for cell selection within the chosen column; Loss_aggr supervises the aggregation operator (which should be NONE for pure cell selection); Alpha is a tunable hyperparameter.
* Training for scalar answer production, detailed:
-> In this case, yi maps to a scalar value and C is empty. In other words, it is about predicting aggregation operators like those in SQL (COUNT, AVERAGE and SUM), although the approach is not restricted to these operators.
-> Previous approaches applied search strategies to find programs that produce the correct scalar, but the search space can explode: if s = 5 and the operator is COUNT, for example, a large table contains an enormous number of cell subsets of size 5.
-> TAPAS instead uses a recipe for scalar answer prediction that does not require searching for correct programs. It applies an end-to-end differentiable training strategy: a fully differentiable layer learns the weights of the aggregation prediction layer without strict supervision of the aggregation type.
-> This is implemented by estimating a probability for each operator together with a soft, expected result of applying that operator to the selected cells. Summing the probabilities multiplied by these expected results gives the expected outcome of the aggregation as a whole. A Huber loss compares this scalar output with si, so minimizing the difference implicitly rewards the best matching operation. An aggregation loss is also included that penalizes the probability assigned to the NONE operator, since some aggregation must be performed when C is empty (see the second sketch after this list).
-> The total loss is therefore Loss_SA = Loss_aggr + Beta * Loss_scalar, where Beta is a tunable hyperparameter. To stabilize training, loss values are clipped at a cutoff value, so outliers are largely ignored.
* Training when answers are ambiguous, detailed:
-> It can happen that a scalar answer s also appears in the table, so that C is not empty. The question then is: should this be treated as aggregation prediction or as cell selection?
-> This is resolved dynamically: the model picks the most probable interpretation.
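 
Below is a minimal sketch (PyTorch; my own simplification, not the released TAPAS code) of the hierarchical cell selection loss described above, Loss = Loss_col + Loss_cells + Alpha * Loss_aggr. The function name, tensor shapes and the exact form of each term are assumptions based on the paper’s description:

```python
import torch
import torch.nn.functional as F

def cell_selection_loss(col_logits, cell_logits, cell_to_col, target_cells,
                        aggr_logits, alpha=1.0):
    # col_logits:   (num_cols,) per-column logits; index 0 is the special "empty" column
    # cell_logits:  (num_cells,) per-cell selection logits
    # cell_to_col:  (num_cells,) long tensor with the column index (>= 1) of each cell
    # target_cells: (num_cells,) float 0/1 labels marking the cells in C
    # aggr_logits:  (num_ops,) aggregation operator logits; index 0 is NONE

    # Target column: the column containing most cells of C, or "empty" (0) if C is empty.
    if target_cells.sum() > 0:
        target_col = torch.bincount(cell_to_col[target_cells.bool()],
                                    minlength=col_logits.numel()).argmax()
    else:
        target_col = torch.tensor(0)

    # Loss_col: binary cross-entropy over columns against a one-hot target.
    col_labels = F.one_hot(target_col, num_classes=col_logits.numel()).float()
    loss_col = F.binary_cross_entropy_with_logits(col_logits, col_labels)

    # Loss_cells: binary cross-entropy over the cells inside the target column only.
    in_col = cell_to_col == target_col
    if in_col.any():
        loss_cells = F.binary_cross_entropy_with_logits(cell_logits[in_col],
                                                        target_cells[in_col])
    else:
        loss_cells = torch.tensor(0.0)

    # Loss_aggr: for pure cell selection, the NONE operator (index 0) should be predicted.
    loss_aggr = F.cross_entropy(aggr_logits.unsqueeze(0), torch.tensor([0]))

    return loss_col + loss_cells + alpha * loss_aggr
```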
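 
And here is a matching sketch (again my own simplification, under the same assumptions) of the differentiable scalar answer loss: a probability per operator, a soft expected result per operator over the selected cells, a Huber loss against the denotation s, and a penalty on the NONE operator:

```python
import torch
import torch.nn.functional as F

def scalar_answer_loss(aggr_logits, cell_probs, cell_values, target_s, beta=1.0, delta=1.0):
    # aggr_logits: (4,) logits for [NONE, COUNT, SUM, AVERAGE]
    # cell_probs:  (num_cells,) soft cell selection probabilities
    # cell_values: (num_cells,) numeric value of each cell
    # target_s:    scalar tensor, the denotation s
    p_op = torch.softmax(aggr_logits, dim=-1)

    # Soft ("expected") result of each operator over the selected cells.
    count = cell_probs.sum()
    total = (cell_probs * cell_values).sum()
    avg = total / (count + 1e-8)
    expected = torch.stack([torch.zeros(()), count, total, avg])

    # Expected outcome of the aggregation as a whole: probability-weighted sum over
    # the real operators (NONE, index 0, is excluded and penalized separately).
    s_pred = (p_op[1:] * expected[1:]).sum()

    # Loss_scalar: Huber loss between the predicted scalar and the denotation s.
    loss_scalar = F.huber_loss(s_pred, target_s, delta=delta)

    # Loss_aggr: penalize probability mass on NONE, since some aggregation must happen.
    loss_aggr = -torch.log(1.0 - p_op[0] + 1e-8)

    return loss_aggr + beta * loss_scalar

loss = scalar_answer_loss(
    aggr_logits=torch.randn(4),
    cell_probs=torch.rand(6),
    cell_values=torch.tensor([1.0, 4.0, 2.0, 8.0, 5.0, 7.0]),
    target_s=torch.tensor(5.0),
)
```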
 
TAPAS experiments
* Experiments are undertaken on three datasets:
-> WikiTQ: complex questions on Wikipedia tables, with comparisons, superlatives, aggregations and arithmetic operations. The dataset is crowdsourced and independently crowd-verified.
-> SQA: highly compositional question sequences, crowdsourced by decomposing WikiTQ questions, where each question can be answered by one or more table cells. ~6k question sequences.
-> WikiSQL: text-to-SQL; the natural language questions are paraphrases of templated questions.
* All datasets are formatted as (Question, Cell coordinates, Scalar answer) – which is already natural for some datasets.
* Experimental setup is as follows:
-> Standard BERT tokenizer with same 32k WordPiece vocabulary.
-> Pretraining starts from BERT-Large (initializing from the pretrained text model helps), with the new table-specific embeddings initialized randomly.
-> Pretraining and finetuning ran on 32 Cloud TPU v3 cores with a maximum sequence length of 512. Pretraining takes around 3 days; finetuning around 10 hours for WikiSQL and WikiTQ, and 20 hours for SQA.
-> Resource-wise, the requirements are similar to those of training BERTlarge.
-> Hyperparameter tuning is automated through a black-box Bayesian optimizer for WikiSQL and WikiTQ, and grid search for SQA.
 
TAPAS results
* Results are reported as the median of 5 independent runs, because BERT finetuning can degenerate.
* TAPAS achieves close to state-of-the-art (SOTA) performance on WikiSQL; in the fully supervised setting it even surpasses SOTA.
* SOTA is even surpassed on WikiTQ when the model is first finetuned on WikiSQL as an extension of the original pretraining (transfer learning).
* Substantial improvement on SQA.
* Ablation studies suggest that:
-> Column, row embeddings are most important for performance.
-> Position and rank embeddings ensure slight performance gains as well.
-> Setting the scalar answer loss to zero throughout finetuning drops performance, except for WikiSQL. This is because most examples in WikiSQL need no aggregation, so setting this loss to zero amounts to removing aggregation without hurting accuracy there.
 
Conclusions
 
Once more:
1. The work introduces a pretraining method for tables, extending BERT’s Masked Language Modeling (MLM) to structured data.
2. It subsequently pretrains the model over millions of tables and related text segments crawled from Wikipedia.
3. The work introduces an end-to-end differentiable training recipe for finetuning, allowing TAPAS to train from weak supervision.
4. Compared to previous approaches, TAPAS achieves better accuracy with several additional advantages:
* Simpler architecture: single encoder without autoregressive decoding.
* Benefits from pretraining, which makes it cost effective.
* Tackles more question types, such as those involving aggregation.
* Capable of directly handling conversational settings.
5. TAPAS performs better than or on par with modern approaches on three different semantic parsing datasets: SQA, WikiTQ and WikiSQL.
 
Source:
Herzig, J., Nowak, P. K., Müller, T., Piccinno, F., & Eisenschlos, J. M. (2020). Tapas: Weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349.
 
