With 1000 rows and 100 samples and markdown-kv, I got these scores:
- gpt-4.1-nano: 52%
- gpt-4.1-mini: 72%
- gpt-4.1: 93%
- gpt-5: 100%
I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct, and one wrong.
To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, run `uv sync`, and then run:
`uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100`
Update: Also, number of rows makes a massive difference, unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both markdown-kv and csv. Both model and record count seem to matter a lot more than format.
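As a rough gauge of how much noise sits behind scores like these, here is a minimal sketch that puts a 95% Wilson score interval around two of the reported results (52/100 and 999/1000). It assumes the samples are independent, which the thread does not confirm, and `wilson_interval` is a hypothetical helper, not part of the linked repo.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Counts taken from the scores above; independence of samples is an assumption.
for label, correct, total in [("gpt-4.1-nano", 52, 100), ("gpt-5 rerun", 999, 1000)]:
    low, high = wilson_interval(correct, total)
    print(f"{label}: {correct}/{total} -> {low:.1%} to {high:.1%}")
```

With only 100 samples the interval around a mid-range score is wide, which is worth keeping in mind when comparing formats that differ by a few points.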
> As you increase the size of the input data, the accuracy gradually decreases.
Interesting.
On your section "Limitations and Areas for Further Study", what I'd be curious to see in future work:
- changing the order of the data in each table type
- changing the order of the questions
I'm curious to know whether it fails on the same things each time, whether that changes depending on the location, and whether it's a bias.
Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y on types of questions?
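One way the location question could be checked is to bucket each answer by where its target record sat in the input and compare accuracy per bucket. The sketch below assumes you can log a (row index, correct?) pair per question; `accuracy_by_position` and the sample results are made up for illustration.

```python
from collections import defaultdict

def accuracy_by_position(results, total_rows, buckets=5):
    """results: iterable of (row_index, was_correct), with 0-based row_index."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for row_index, was_correct in results:
        # Map the row index onto one of `buckets` equal-width position bands.
        bucket = min(row_index * buckets // total_rows, buckets - 1)
        counts[bucket] += 1
        hits[bucket] += int(was_correct)
    return {b: hits[b] / counts[b] for b in sorted(counts)}

# Hypothetical log entries, not real benchmark output.
results = [(3, True), (40, True), (450, False), (520, False), (990, True)]
print(accuracy_by_position(results, total_rows=1000))
```

Re-running with the rows shuffled and comparing the per-bucket numbers would separate "this record is hard" from "this position is hard".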
LLMs have documented position biases, with skew towards first and last. This is strongest in messages due to system prompt + current question training data, but it's present in list data in general.
Exactly. But in the papers I’ve seen, the tests are usually based on multiple-choice answers:
Where do you eat?
A) floor
B) table
C) dirt
In this case, each question has one definite answer rather than a set of options, so any bias would come from the order of the input data. It’s different enough that it triggered my curiosity.
The best performing isn't markdown tables, it's markdown key/value pairs:
## Record 1
```
id: 1
name: Charlie A0
age: 56
city: New York
department: Operations
salary: 67896
years_experience: 7
project_count: 1
```
Which makes sense to me because the problem with formats like CSV and regular markdown tables is that it is too easy for the model to mistakenly associate a value in a row with the wrong header.
Explicit key/value formats like this or YAML or JSON objects make that a lot less likely.
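A minimal sketch of how records could be rendered in this markdown-kv layout, assuming each row is a plain dict. `to_markdown_kv` is a hypothetical helper, not the benchmark's own code, and the exact spacing the original test used is an assumption.

```python
def to_markdown_kv(records):
    """Render a list of dicts as one '## Record N' section per row."""
    fence = "`" * 3  # built this way to avoid a literal fence inside this example
    sections = []
    for i, record in enumerate(records, start=1):
        # Every value carries its own key, so no header row has to be tracked.
        lines = [f"{key}: {value}" for key, value in record.items()]
        sections.append("\n".join([f"## Record {i}", fence, *lines, fence]))
    return "\n\n".join(sections)

# First record taken from the example above (trimmed to four fields).
records = [{"id": 1, "name": "Charlie A0", "age": 56, "city": "New York"}]
print(to_markdown_kv(records))
```

Repeating the key on every line costs tokens, but it is what keeps each value next to its label.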
I was surprised that XML (56%), with closing tags, wasn’t as good as YAML/KV (60%), though line breaks perform the same kind of grouping function.
Then I realized from the table that XML used about 50% more tokens (~75K vs ~50K) for similar accuracy, and for the first time felt a kind of sympathy for the LLM…
Yeah that was my intuition as well. I think the KV-Markdown format gains additional advantage over JSON and YAML in the special syntax for headers helping to break up records.
I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.
I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers.)
Yeah I mean for many real world scale datasets you don’t want to blow the whole context window on a massive markdown file. Instead you can provide a tool that presents the data as a SQLite database. In my testing Claude Code seems very capable of answering questions via SQLite queries or even `head` and `grep` on CSV files.
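A minimal sketch of that idea using only the Python standard library: load a CSV into an in-memory SQLite database and expose a single query function as the "tool". The names `load_csv_into_sqlite` and `run_query`, the table name, and the second sample row are all hypothetical; how the tool is actually wired to a model is left out.

```python
import csv
import io
import sqlite3

# First row mirrors the record shown earlier in the thread; the second is made up.
CSV_TEXT = """id,name,department,years_experience
1,Charlie A0,Operations,7
2,Dana B1,Sales,12
"""

def load_csv_into_sqlite(csv_text, table="employees"):
    """Create an in-memory table with one TEXT column per CSV header."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    return conn

def run_query(conn, sql, params=()):
    """The function a model would call instead of reading the whole table."""
    return conn.execute(sql, params).fetchall()

conn = load_csv_into_sqlite(CSV_TEXT)
result = run_query(
    conn,
    "SELECT years_experience FROM employees WHERE name = ?",
    ("Charlie A0",),
)
print(result)
```

Only the query result goes back into context, so the size of the source table stops mattering for recall.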
But the result from the SQL query is going to be... a table. So at some point, tables need to go into context, and we need to know how well LLMs can incorporate those tables.
This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.
The author didn’t see much more than 60% accuracy which is not very useful for many (most?) real world tasks.
Reinventing? No. Using? Yes, for a lot of good reasons.
LLMs are expensive. Spending tokens to do something in bulk that is well suited to existing tools and algorithms is wasteful and slow. And the main reason is that, using LLMs, the original author reported only a 60% success rate for the task. Why spend many times more time, money, and energy just to use an LLM on a well-understood preparatory task that it sucks at, when you can get much better results more cheaply with off-the-shelf tools, and feed their results to the LLM for its unique value?
Well, ironically you then have the issue of how to present your database schema (including important things like the values in some categorical fields) to the LLM and in what format, so you never really escape this issue.
The article has interesting data. But it’s frustrating to read AI generated text like this:
> Performance Optimization: Reducing processing overhead while maintaining accuracy
What on earth does it mean that this “optimized performance”? This is nonsensical content. Performance wasn’t even measured, accuracy was. You can tell this was AI generated because “Reducing processing overhead while maintaining accuracy” would likely be true for a perf optimization, but it has no meaning whatsoever in the context of the article.
This really throws into question whether I can take the rest of the article and data seriously.
This is a bit of a silly way to use LLMs to process tabular data. In reality, you'd ask it to write functions and execute them. First you'd ask it to create a type definition from the table, then ask it to create functions to process the data.
"Write a function to find years of experience by name? Return just the number, e.g. '12'."
It works much better, and it can single-shot many of the processing requirements just from type definitions it can infer from the data.
This way it's easier to stick to tabular formats that have easy reading libraries, like with TypeScript/JavaScript JSON, and with Python, maybe CSV...
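To illustrate the workflow described above, here is a sketch of the kind of code a model might be asked to produce: a type definition inferred from the table, then a lookup function for the quoted prompt. The `Employee` fields, function names, and CSV sample are assumptions for illustration, not output from any particular model.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional

@dataclass
class Employee:
    """Type definition inferred from the table's columns (subset shown)."""
    id: int
    name: str
    years_experience: int

def parse_employees(csv_text: str) -> list[Employee]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Employee(int(r["id"]), r["name"], int(r["years_experience"]))
        for r in reader
    ]

def years_of_experience_by_name(employees: list[Employee], name: str) -> Optional[int]:
    """Return the years of experience for the first matching name, else None."""
    for employee in employees:
        if employee.name == name:
            return employee.years_experience
    return None

# Hypothetical sample; the first row mirrors the record shown in the thread.
SAMPLE_CSV = "id,name,years_experience\n1,Charlie A0,7\n2,Dana B1,12\n"
employees = parse_employees(SAMPLE_CSV)
print(years_of_experience_by_name(employees, "Charlie A0"))
```

The lookup itself is then deterministic, and the generated code can be reviewed once instead of checking every answer.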
Tbh I am more interested in processing data and formatting it to tabular forms than extracting data from tabular forms. One of the main uses I see in LLMs is structuring unstructured/semistructured data. I may occasionally feed a table to an LLM and ask such kinds of questions when I feel lazy, but I see no serious application of this as compared with using whatever language/library to process the data from the table (whether using an llm or not in the whole process). The point of having structured data is exactly this. But much more often I feed data to an llm and ask it to create a table.
Great benchmark! It highlights an important but often downstream problem. In real-world pipelines, the bigger issue comes before this: extracting tables from PDFs or scans without breaking their layout. Once the structure is lost (merged headers, nested cells, footnotes, etc.), no data format can fully recover it.
Check out LLMWhisperer from Unstract —> it preserves table and layout fidelity when converting documents for LLM use. You can try it on complex PDFs or forms here: https://pg.llmwhisperer.unstract.com (no signup needed)
Layout preservation upstream often improves downstream accuracy more than choosing between CSV, JSON, or Markdown. Find more details here: https://unstract.com/llmwhisperer/
Curious how text-aligned tabular formats work for LLMs considering humans probably find them more readable than other formats
                                                 System Sales(a)
                             Number of Units       (in Millions)
────────────────────────────────────────────────────────────────────────
KFC Division                          31,981          $   34,452
Taco Bell Division                     8,757              17,193
Pizza Hut Division                    20,225              13,108
Habit Burger & Grill Division            383                 713
YUM                                   61,346          $   65,466
I'm seeing pretty good success with extracting data out of 10-Qs which are formatted like this by default using the `edgartools` library's default `filing.text()` method.
These sort of experiments and results are really important for language model implementation. This has a tangible implication for my AI startup and how we approach tool design.
Much more important than citation farming a paper on 1% improved performance.
The current OCR approach typically relies on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.
Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
Why would anyone trust the output of an LLM, if it is barely better than guessing and much much worse than humans?
GPT-5 shows more impressive numbers, but for that particular task, the precision should be 100% - always. No matter how large the data set is or in which format.
Why are we doing this?
Inputs were not long enough to properly see either of the true wins in terms of reduced token counts for terser formats or their benefits in terms of avoiding stuffing the context window thereby potentially reducing accuracy. The test really needs to be conducted across multiple dimensions!
I've found that XML is surprisingly good for LLMs when it comes to table extraction in production. I only found out when I sent the raw XML storage format to benchmark against various flavours of everything else. XML turns out to be the best format for tables that have more than three levels of nesting.
We ended up making middleware for LLM 'tools/functions' that take common data/table formats like CSV, Excel and JSON.
The tool uses an LLM to write code to parse the data and conduct the analysis to return back to the LLM. Otherwise, we found pumping raw table data into an LLM is just not reliable, even if you go to the effort to conduct analysis on smaller chunks and merge the results.
Only testing GPT-4.1-nano makes this basically useless. Most people are almost certainly using GPT-5 mini or better. This very poor analysis is like an LLM literacy test for readers.
If you want 100% accuracy from these kinds of tasks with LLMs you can get it today, but you need to provide the LLM with the ability to run Python code and tell it to use something like Pandas.
You can confirm it's doing the right thing by reviewing the code it wrote.
Simon is right about using code execution, but many tables one might look at outside of formal data work are small enough for LLMs to be very reliable at, so this format question is practically relevant. I wish they had tested better models.
I am not an expert on the subject, but I suggest that you can also save context space by using shorter XML element names (like f instead of function, c instead of class, etc.). Just add a legend at the top or bottom to explain what each abbreviation means; LLMs can figure out the mapping without issues. I use this approach when generating project structure maps with Tree-sitter. I did a quick comparison and didn't notice much degradation with Claude, so the context space you save may make it worthwhile. I would be interested to see a proper comparison.
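A small sketch of the abbreviation-plus-legend idea, assuming simple XML where tag names can be rewritten with a regular expression. `abbreviate_tags` and the sample document are hypothetical, and a real project map would probably want a proper XML parser rather than a regex.

```python
import re

def abbreviate_tags(xml_text, mapping):
    """Shorten element names per `mapping` and prepend a legend comment."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        # Names missing from the mapping are left unchanged.
        return f"<{slash}{mapping.get(name, name)}"

    # Matches the start of opening and closing tags, e.g. "<class" or "</class".
    body = re.sub(r"<(/?)([A-Za-z_][\w.-]*)", swap, xml_text)
    legend = ", ".join(f"{short}={long}" for long, short in mapping.items())
    return f"<!-- legend: {legend} -->\n{body}"

sample = '<class name="Parser"><function name="parse"/></class>'
print(abbreviate_tags(sample, {"class": "c", "function": "f"}))
```

Whether this saves tokens in practice depends on the tokenizer, so it is worth measuring before and after on real data.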
Common enough words like `function` and `class` are generally encoded as a single token by the tokenizer and may provide a slightly better context to the LLM. For openai you can test this stuff at https://platform.openai.com/tokenizer
This is an interesting theoretical exercise but please for the love of god don't actually use an LLM to search tabular data. This is a solved problem. Free software does this with 100% accuracy and insane efficiency.
This is a really eye-popping example. Because here we have input text that is fully structured and perfectly unambiguous (it was carefully designed that way!) and yet the LLM can't get all the information out of it. Yet people are using these tools to summarize unstructured text, assuming the summary will capture the most salient points. Well how is the LLM supposed to be good for that task, if it can't even summarize the dang XML document? They keep telling me this thing is more expert than all the experts combined.
They did. The KV-Markdown is essentially a dict with ``` wrapper, and INI which is similar scored very high as well. The worst performers were index-based rows like CSV or Markdown tables. JSON is in the middle with high context and more syntactic noise and less clear record labels.
The odd ones to me are HTML, which uses th and td to make index-based rows but did better than JSON somehow, and XML, which is like JSON with even more syntactic noise, placing better than INI. If I had to guess I'd say it's because vast amounts of the web were in the training set.
There are other studies on this topic with similar results across LLM systems:
Y. Sui, M. Zhou, M. Zhou, S. Han, and D. Zhang, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida Mexico: ACM, Mar. 2024, pp. 645–654. doi: 10.1145/3616855.3635752.
C. Pang, Y. Cao, C. Yang, and P. Luo, “Uncovering Limitations of Large Language Models in Information Seeking from Tables,” June 06, 2024, arXiv: arXiv:2406.04113. doi: 10.48550/arXiv.2406.04113.
The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.
[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]
I was curious enough to have Codex create a similar benchmark: https://github.com/jcheng5/table-formats
gpt-5 also got 100/100 for both CSV and JSON.
Curious: how many iterations did you run of each benchmark and what was the variance?
Cool tool. I tried a few different things to get it to work with google/gemini-2.5-pro, but couldn't figure it out.
Unfortunately I started getting "quota exceeded" almost immediately, but it did give 6/6 correct answers before it crapped out.
Thanks! That worked perfectly.
100 samples:
- gemini-2.5-pro: 100%
- gemini-2.5-flash: 97%
how about PNG?
> where accuracy is paramount
> accuracy: 60%
Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that’s the new normal because we’re trying to shoehorn conversational AI models to every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can’t train finetunes when the model is a proprietary black box on someone else’s computer.) Something about hammers and nails…
I'm the person who ran the test.
To explain the 60% a bit more...
With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.
For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
Thanks for your work on this! It's a very legit domain of problem for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see, recall is near 100% across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here's the leaderboards on formats FWIW:
So, the biggest takeaway really is: use the best model you can reasonably afford, then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1.
And if you have no control over the model, then use CSV or Markdown Table.
Wouldn't it be more useful to measure the number of rows the model can process while still hitting 100% accuracy?
Good idea
Isn't the best performing (markdown tables) and the worst (pipe delimited tables) basically the same format?
they used GPT-4.1 nano, results would be quite different with sonnet or gpt5.
I was looking for the frontier curve where they tested their benchmark across different models since this sort of behavior is highly parameter, architecture, training, and fine tuning sensitive. It’s a practically useful question so I was really disappointed when a) they didn’t publish their code so you could test yourself, b) they didn’t do even a cursory examination of other models and sizes.
trust me bro, the next model bro, it's just way better bro
To be fair nano was an absolute crap model when it came out
Or just regular gpt-4.1, it's a quite capable model.
Title says "LLMs" (plural) but they only tested one
> We only tested OpenAI’s GPT-4.1 nano.
This should be higher. While the research question is interesting, the sample size makes the conclusion highly suspect. I'd like to see more research on this.
And not even a commonly used one. Gemini Flash or o4-mini would have been a much better choice if they wanted a cheap model
This article screams for an accuracy vs. tokens plot. Thanks though, interesting results.
The test really needed to be run on multiple data sizes (50, 100, 500, 1000, 5000). The more token efficient formats would probably eventually overtake the token heavy ones due to context pollution. All this test really says is what performs best for 1 particular model at one particular context length.
Interesting. Curious to reproduce across models, I made a comprehensive eval based on your post and ran it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see, recall is near 100% across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing at basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here's the leaderboards on formats FWIW:
IMO the biggest takeaway really is: use the best model you can reasonably afford, then the format chosen will matter less. The cheapest 100%-coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1 FWIW. However, if you have no control over the model, then use CSV or Markdown Table as these have the highest chance of success.
The MAJOR issue that we might not want to admit is that there are a thousand confounders that prevent any meaningful canonical learning here. Crucially: the data within the tabular structure itself matters HUGELY. The scary probabilistic nature of LLMs means the very subject of your queries can affect how the query is run, which is quite absurd from an IO/computing purity perspective. This is why tooling is so important. Enable the LLM to write and execute code safely, and you don't need to worry about such free-prose frailties.
Can someone explain why one would want to use an LLM to read tabular data? This is something even trivial code could do while using far fewer compute and energy resources.
Understanding the question is the hard part; that's where the LLM comes in as a useful tool.
Really interesting post. I ran into some of the limitations of working with tables and LLMs last year.
I experimented with an approach that uses an LLM to generate a bespoke transformation machine: a series of transform steps for extracting key data from large data sets.
https://tombers.github.io/oblique-angles/ai/education/2025/0...
It appears that this is just testing data retrieval from somewhere in the table? Do the results translate to something where data analysis is performed? From something as simple as summing across rows or averages to generating graphs.
I once tried to get Claude and ChatGPT to build me an Excel financial model, and they failed pretty hard. The models seem to lose track of where they are in a table.
Bizarre conclusions when on average all the formats perform poorly with average accuracy of 50%. Sure 60% is better than 40% but they are both unusable if you actually care about numbers...
I've been stunned by how many smart people talk so casually about LLMs becoming better at math. Do they just forget that a calculator that is wrong 1% of the time is a de facto calculator that doesn't work and should not be used?
> I've been stunned by how many smart people talk so casually about LLMs becoming better at math
Could they be referring to this?
"Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad" https://deepmind.google/discover/blog/advanced-version-of-ge...
Doing math is not the same as calculating. LLMs can be very useful in doing math; for calculating they are the wrong tool (and even there they can be very useful, but you ask them to use calculating tools, not to do the calculations themselves—both Claude and ChatGPT are set up to do this).
If you're curious, check out how mathematicians like Robert Ghrist or Terence Tao are using LLMs for math research, both have written about it online repeatedly (along with an increasing number of other researchers).
Apart from assisting with research, their ability on e.g. math olympiad problems is periodically measured and objectively rapidly improving, so this isn't just a matter of opinion.
The best math lecturers I had at university sucked at mental calculations. Some almost screwed up 2+2 on the blackboard.
Yes LLMs suck at calculating stuff. However they can manipulate equations and such, and sometimes impressively so.
You realize that when typing into a calculator, you probably hit a wrong key more than 1% of the time? Which is why you always type important calculations twice?
I've been stunned by how many smart people talk so casually about how because LLMs aren't perfect, they therefore have no value. Do they just forget that nothing in the world is perfect, and the values of things are measured in degrees?
There’s a big difference between mistyping 1% of the time yourself (human error) and a calculator failing 1% of the time (machine error) and I am willing to bet there isn’t a company out there (maybe a handful of less scrupulous ones) that has knowingly shipped a calculator that got it wrong 1% of the time. Especially in previous decades when countless people were using a dedicated calculator dozens of times a day. Hard to imagine a 1% margin of error was acceptable.
Not to mention now you have the compounded problem of your mistakes plus the calculator’s mistakes.
There isn't a difference in the big picture. Error is error. Even when we have incredibly reliable things, there's error when they interface with humans. Humans have error interfacing with each other.
But you seem to have missed the main point I was making. See? Another error. They're everwhere! ;)
> But you seem to have missed the main point I was making. See? Another error. They're everwhere! ;)
Ah, but whose error? ;)
> But you seem to have missed the main point I was making. See? Another error. They're everwhere! ;)
You really could’ve done without this bit.
I'm the person who ran the test.
To hopefully clarify a bit...
I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.
Can you expand on how you did this?
I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers.)
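For anyone wanting to repeat that calibration step, here is a rough sketch of the idea. It is not the code used for the test: the doubling schedule, the `calibrate_record_count` name, and the `fake_accuracy` stand-in are all assumptions made for illustration, and in practice `measure_accuracy` would run the real eval at a given record count.

```python
from typing import Callable


def calibrate_record_count(
    measure_accuracy: Callable[[int], float],
    start: int = 100,
    target: float = 0.5,
    max_records: int = 10_000,
) -> int:
    """Grow the table until measured accuracy drops to the target or below."""
    n = start
    while n < max_records and measure_accuracy(n) > target:
        n *= 2  # assumed schedule; the original run stepped by hand
    return n


def fake_accuracy(n: int) -> float:
    """Synthetic stand-in for running the eval; these are NOT measured numbers."""
    return min(1.0, 500 / n)


chosen = calibrate_record_count(fake_accuracy)
print(chosen)
```

With a real eval plugged in, each call to `measure_accuracy` costs a full benchmark run, so a coarse schedule like this is probably good enough.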
My sentiments exactly. All the formats were so poorly read that they are all effectively useless.
I wonder how this compares to a more agentic approach where the LLM composes SQL queries to answer the questions, for example.
Yeah, I mean for many real-world-scale datasets you don't want to blow the whole context window on a massive markdown file. Instead you can provide a tool that presents the data as a SQLite database. In my testing Claude Code seems very capable of answering questions via SQLite queries or even `head` and `grep` on CSV files.
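A minimal sketch of what such a tool could look like, using only the Python standard library. The sample data, the column names, and the `load_csv_into_sqlite` and `run_query` helpers are hypothetical; a real agent setup would expose something like `run_query` as the callable tool.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the column names are assumptions for illustration.
CSV_TEXT = """name,department,years_experience
Alice,Engineering,12
Bob,Sales,7
Carol,Engineering,3
"""


def load_csv_into_sqlite(csv_text: str, table: str = "employees") -> sqlite3.Connection:
    """Load CSV text into an in-memory SQLite table, storing every column as TEXT."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return conn


def run_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """The function an LLM tool call would invoke: run SQL and return the rows."""
    return conn.execute(sql, params).fetchall()


conn = load_csv_into_sqlite(CSV_TEXT)
result = run_query(
    conn, "SELECT years_experience FROM employees WHERE name = ?", ("Alice",)
)
print(result)
```

Only the small result set goes back into the model's context, not the whole table.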
But the result from the SQL query is going to be... a table. So at some point, tables need to go into context, and we need to know how well LLMs can incorporate those tables.
This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.
The author didn’t see much more than 60% accuracy which is not very useful for many (most?) real world tasks.
“Agents that extract the data” Are we really reinventing data frame readers to have an LLM in the critical path?
Reinventing? No. Using? Yes, for a lot of good reasons.
LLMs are expensive. Spending tokens to do something in bulk that is well suited to existing tools and algorithms is wasteful and slow. The main reason, though, is that the original author reported only about a 60% success rate when using an LLM for the task. Why spend many times more time, money, and energy to use an LLM on a well-understood preparatory task that it sucks at, when you can get much better results more cheaply with off-the-shelf tools and feed their results to the LLM for its unique value?
Well, ironically you then have the issue of how to present your database schema (including important things like the values in some categorical fields) to the LLM and in what format, so you never really escape this issue.
The article has interesting data. But it’s frustrating to read AI generated text like this:
> Performance Optimization: Reducing processing overhead while maintaining accuracy
What on earth does it mean that this “optimized performance”? This is nonsensical content. Performance wasn’t even measured; accuracy was. You can tell this was AI generated because “Reducing processing overhead while maintaining accuracy” would likely be true for a perf optimization, but it has no meaning whatsoever in the context of the article.
This really throws into question whether I can take the rest of the article and data seriously.
I think they may be referring to token usage, which is mentioned in the article. fewer tokens = higher performance
This is a bit of a silly way to use LLMs to process tabular data. In reality, you'd ask it to write functions and execute them. First you'd ask it to create a type definition from the table, then ask it to create functions to process the data.
"Write a function to find years of experience by name? Return just the number, e.g. '12'."
It works much better, and it can single-shot many of the processing requirements just from type definitions it can infer from the data.
This way it's easier to stick to tabular formats that have easy reading libraries, like with TypeScript/JavaScript JSON, and with Python, maybe CSV...
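To make that concrete, here is a sketch of the kind of output one might hope to get back from such a prompt, in Python with CSV. The `Employee` fields, the sample rows, and both function names are assumptions for illustration, not anything taken from the article's dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sample table; the column names are assumptions for illustration.
CSV_TEXT = """name,department,years_experience
Alice,Engineering,12
Bob,Sales,7
"""


@dataclass(frozen=True)
class Employee:
    """Type definition inferred from the table's header and sample values."""
    name: str
    department: str
    years_experience: int


def parse_employees(csv_text: str) -> List[Employee]:
    """Parse CSV text into typed records, converting the numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Employee(
            name=row["name"],
            department=row["department"],
            years_experience=int(row["years_experience"]),
        )
        for row in reader
    ]


def years_of_experience_by_name(employees: List[Employee], name: str) -> Optional[int]:
    """Return the years of experience for the first matching name, else None."""
    for employee in employees:
        if employee.name == name:
            return employee.years_experience
    return None


employees = parse_employees(CSV_TEXT)
print(years_of_experience_by_name(employees, "Alice"))
```

The lookup itself is then deterministic, so the model only has to get the code right once rather than every answer.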
Tbh I am more interested in processing data and formatting it to tabular forms than extracting data from tabular forms. One of the main uses I see in LLMs is structuring unstructured/semistructured data. I may occasionally feed a table to an LLM and ask such kinds of questions when I feel lazy, but I see no serious application of this as compared with using whatever language/library to process the data from the table (whether using an llm or not in the whole process). The point of having structured data is exactly this. But much more often I feed data to an llm and ask it to create a table.
Great benchmark! It highlights an important but often downstream problem. In real-world pipelines, the bigger issue comes before this: extracting tables from PDFs or scans without breaking their layout. Once the structure is lost (merged headers, nested cells, footnotes, etc.), no data format can fully recover it.
Check out LLMWhisperer from Unstract —> it preserves table and layout fidelity when converting documents for LLM use. You can try it on complex PDFs or forms here: https://pg.llmwhisperer.unstract.com (no signup needed)
Layout preservation upstream often improves downstream accuracy more than choosing between CSV, JSON, or Markdown. Find more details here: https://unstract.com/llmwhisperer/
Curious how text-aligned tabular formats work for LLMs considering humans probably find them more readable than other formats
I'm seeing pretty good success with extracting data out of 10-Qs which are formatted like this by default using the `edgartools` library's default `filing.text()` method.
They don’t understand any table formats, as shown by these results.
They can transform information in tables but information is lost due to that lack of understanding.
These sort of experiments and results are really important for language model implementation. This has a tangible implication for my AI startup and how we approach tool design.
Much more important than citation farming a paper on 1% improved performance.
The current OCR approach typically relies on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.
Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
Yeah, I wonder how PNG would fare in this contest.
> 60.7%
Why would anyone trust the output of an LLM, if it is barely better than guessing and much much worse than humans?
GPT-5 shows more impressive numbers, but for that particular task, the precision should be 100% - always. No matter how large the data set is or in which format. Why are we doing this?
Inputs were not long enough to properly see either of the true wins in terms of reduced token counts for terser formats or their benefits in terms of avoiding stuffing the context window thereby potentially reducing accuracy. The test really needs to be conducted across multiple dimensions!
I've found that XML is surprisingly good for LLMs when it comes to table extraction in production. I only found out when I sent the raw XML storage format to benchmark against various flavours of everything else. XML turns out to be the best format for tables that have more than three levels of nesting.
We ended up making middleware for LLM 'tools/functions' that take common data/table formats like CSV, Excel and JSON.
The tool uses an LLM to write code to parse the data and conduct the analysis to return back to the LLM. Otherwise, we found pumping raw table data into a LLM is just not reliable, even if you go to the effort to conduct analysis on smaller chunks and merge the results.
Only testing GPT-4.1-nano makes this basically useless. Most people are almost certainly using GPT-5 mini or better. This very poor analysis is like an LLM literacy test for readers.
Please go away and do the work for us and let us know what amazing accuracy you got with whatever version you think is better.
Anything below 100% is actually pretty useless when it comes to stats.
If you want 100% accuracy from these kinds of tasks with LLMs you can get it today, but you need to provide the LLM with the ability to run Python code and tell it to use something like Pandas.
You can confirm it's doing the right thing by reviewing the code it wrote.
Or you can just write the code to do it correctly. Which would be quicker. If you can review it properly you already understand how to do it.
That would require me to have memorized the pandas API.
I've been using pandas on-and-off for over a decade and I still haven't come close to doing that.
Simon is right about using code execution, but many tables one might look at outside of formal data work are small enough for LLMs to be very reliable at, so this format question is practically relevant. I wish they had tested better models.
Was a bit surprised about the low CSV performance; in my experience it's very good (I use it a lot with Excel and small tables, well below 100 rows).
As markdown kv performs so well, I am now curious about TOML.
TOML works decently well both directions, useful if you need structured data out of models or APIs that don't support structured outputs.
I am not an expert on the subject, but I suggest that you can also save context space by using shorter XML element names (like f instead of function, c instead of class, etc.). Just add a legend at the top or bottom to explain what each abbreviation means; LLMs can figure out the mapping without issues. I use this approach when generating project structure maps with Tree-sitter. I did a quick comparison and didn't notice much degradation with Claude, so the context space you save may make it worthwhile. I would be interested to see a proper comparison.
Common enough words like `function` and `class` are generally encoded as a single token by the tokenizer and may provide slightly better context to the LLM. For OpenAI you can test this stuff at https://platform.openai.com/tokenizer
If both `f` and `function` use 1 token, are you really saving anything?
This is an interesting theoretical exercise but please for the love of god don't actually use an LLM to search tabular data. This is a solved problem. Free software does this with 100% accuracy and insane efficiency.
This is a really eye-popping example. Because here we have input text that is fully structured perfectly unambiguous (it was carefully designed that way!) and yet the LLM can't get all the information out of it. Yet people are using these tools to summarize unstructured text, assuming the summary will capture the most salient points. Well how is the LLM supposed to be good for that task, if it can't even summarize the dang XML document? They keep telling me this thing is more expert than all the experts combined.
KSON? (I'm a complete ignoramus in this area but recently read about KSON in a piece posted here at HN.)
https://ochagavia.nl/blog/configuration-files-are-user-inter...
https://news.ycombinator.com/item?id=45291858 (135 comments)
That's a cool concept - would be curious about a more common setup for agentic data analysis (ex: for using in Claude Code) like:
* Multiple tasks vs 1
* O3/o3-mini + 4o/4o-mini instead of nano
* Extra credit: Inside a fixed cost/length reasoning loop
Ex: does the md-kv benefit disappear with smarter models that you'd typically use, and thus just become a 2-3x cost?
I find this extremely surprising. I would have expected dict structures to have higher semantic context associated with them.
They did. The KV-Markdown is essentially a dict with ``` wrapper, and INI which is similar scored very high as well. The worst performers were index-based rows like CSV or Markdown tables. JSON is in the middle with high context and more syntactic noise and less clear record labels.
The odd ones to me are HTML, which uses th and td to make index-based rows but somehow did better than JSON, and XML, which is like JSON with even more syntactic noise yet placed better than INI. If I had to guess, I'd say it's because vast amounts of the web were in the training set.
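To illustrate the contrast drawn above between dict-like records and index-based rows, here is a rough sketch that renders the same records both ways. The record fields, the "## Record N" headings, and the function names are assumptions; the benchmark's exact Markdown-KV layout may differ.

```python
import csv
import io

FENCE = "`" * 3  # assembled at runtime so this listing contains no literal fence

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = [
    {"id": 1, "name": "Alice", "years_experience": 12},
    {"id": 2, "name": "Bob", "years_experience": 7},
]


def to_markdown_kv(records):
    """Render each record as a labelled, fenced block of key: value lines."""
    blocks = []
    for index, record in enumerate(records, start=1):
        body = "\n".join(f"{key}: {value}" for key, value in record.items())
        blocks.append(f"## Record {index}\n\n{FENCE}\n{body}\n{FENCE}")
    return "\n\n".join(blocks)


def to_csv(records):
    """Render the same records as index-based rows under a single header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_markdown_kv(RECORDS))
print(to_csv(RECORDS))
```

In the key-value form every value sits next to its field name, whereas in CSV the reader has to count columns back to the header, which may be part of why the row-based formats did worse here. It also repeats every key per record, which is where the extra tokens come from.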
Hmmm. I’ve been using YAML data for tables for a while now, and had pretty good results.
Super surprised; I would have expected CSV to beat all the others. And Markdown KV is something I'm hearing about for the first time.
It's made up, not a standard format
There are other studies on this topic with similar results across LLM systems:
Y. Sui, M. Zhou, M. Zhou, S. Han, and D. Zhang, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Merida Mexico: ACM, Mar. 2024, pp. 645–654. doi: 10.1145/3616855.3635752.
C. Pang, Y. Cao, C. Yang, and P. Luo, “Uncovering Limitations of Large Language Models in Information Seeking from Tables,” June 06, 2024, arXiv: arXiv:2406.04113. doi: 10.48550/arXiv.2406.04113.
I’d be interested in testing different data formats when using the structured outputs api
I'm surprised by the accuracy, in practice, I feel like I generally have a lot better results
I'm the person who ran the test.
The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.
[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]
Do you measure your results in a repeatable way? In a way where your hypotheses about accuracy are falsifiable? Or do they just “feel” right?
Misleading title, just one LLM was tested.
Great idea. Very limited execution. If they release the source data and question set, I'll repeat with more LLMs to flesh out the findings.
interesting. I'm curious how this compares across different model families.
accuracy: 60%
This should have been a python script.
How much of the current peak of the Gartner Hype Cycle should just be python scripts?
In mice.
Or in this case gpt-4.1-nano
Author here.
This has made me chuckle several times - thanks!
Maybe org table?