"Preislamic" is a common term for near-East history. Islam is well dated, it introduced many changes and unified the region, so it's a powerful marker.
I've never encountered the word "Pre-Arabic" about the Arabic peninsula. It would be hard to define precisely. The word "arab" is probably more than 3000 years old. The Arabic languages may be older ; they're semitic languages like the Akkadian of Mesopotamia. And when did an "Arab" people or culture emerge from the semitic people and culture? I guess between 6000 BP and 3000 BP, but it was probably a long process, and nomad tribes didn't leave many vestiges.
Pre-Islamic Arabia is, as far as I know, a fairly widely accepted term. Not that different from pre-Roman Britain, pre-Columbian Americas, pre-colonial Africa, pre-imperial China, or even Pagan Europe. In all these cases a significant change took place which drastically changed the course of the region (usually some sort of unification as a nation or religion, not always peaceful or voluntary of course).
I wonder if you could decypher these scripts by bruteforcing decoding layers until an LLM could predict the next token. That would assume the text has a sort of logic to it that would still work in modern language, but the decyphering would be fully automatic so we could throw a bunch of compute at it.
It's possible to identify a surprisingly large number of matching words by learning a linear transformation mapping word vectors from two different languages into the same space (e.g. https://arxiv.org/abs/1805.06297 ).
But the problem with ancient languages is typically that there's not enough data to usefully constrain a large enough model. Doubly so for undeciphered scripts where scholars might not even agree on how many different letters there are.
Presumably, they'd want to get at embeddings, and compare the dimensional space somehow to say: 'the relation between tokens a,b,c is close to the relation of tokens a1,b1,c1 in a similar model of texts of known language of apparently same family (same up to aN,bN,cN), and out of these N sequences, sequence X makes most sense given existing examples'.
(As you can tell, the argument involves some handwaving, but it may possible?)
In English. The decoder translates from the Dhofari to tokens the LLM understands. So you present the LLM with the decoded Dhofari, and a question in English, like "Please express the following in modern English" and the LLM would answer in English. There's also a chance the decoded Dhofari would be intelligible to humans directly, though I don't know how large that chance is.
the available data from some of those lesser used scripts are miniscule. the most common ancient North Arabian script is safitic and only around 50K texts are processed and widely available each with a few words to a few sentences.
it's a form of Thamudic / Ancient North Arabian script https://en.wikipedia.org/wiki/Ancient_North_Arabian
"Pre-Islamic" is an odd description of a script that predates Islam by a millennium. Did they mean "pre-Arabic?"
"Preislamic" is a common term for near-East history. Islam is well dated, it introduced many changes and unified the region, so it's a powerful marker.
I've never encountered the word "Pre-Arabic" about the Arabic peninsula. It would be hard to define precisely. The word "arab" is probably more than 3000 years old. The Arabic languages may be older ; they're semitic languages like the Akkadian of Mesopotamia. And when did an "Arab" people or culture emerge from the semitic people and culture? I guess between 6000 BP and 3000 BP, but it was probably a long process, and nomad tribes didn't leave many vestiges.
Pre-Islamic Arabia is, as far as I know, a fairly widely accepted term. Not that different from pre-Roman Britain, pre-Columbian Americas, pre-colonial Africa, pre-imperial China, or even Pagan Europe. In all these cases a significant change took place which drastically changed the course of the region (usually some sort of unification as a nation or religion, not always peaceful or voluntary of course).
is it "pre-arabic" though ? it's believed that old arabic existed back then.
[flagged]
Please don't post religious flamebait to HN. It leads to religious flamewars, which we definitely don't want here.
https://news.ycombinator.com/newsguidelines.html
Completely unreadable on iOS mobile...
Works fine here. https://imgur.com/a/px7cZAL
Interesting. I didn’t have any issues. Could you elaborate a bit more?
I wonder if you could decipher these scripts by brute-forcing decoding layers until an LLM could predict the next token. That would assume the text has a sort of logic to it that would still work in modern language, but the deciphering would be fully automatic, so we could throw a bunch of compute at it.
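To make the "try decodings until a language model is happy" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the "decoding layer" is just a Caesar shift, the "LLM" is a table of approximate English letter frequencies, and the names `caesar`, `log_likelihood` and `brute_force_decode` are made up. A real undeciphered script has a vastly larger search space and no known target language.

```python
import math
import string

# Approximate English letter frequencies in percent.
# This table is a toy stand-in for a real language model.
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
    "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8,
    "u": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0,
    "p": 1.9, "b": 1.5, "v": 0.98, "k": 0.77, "j": 0.15, "x": 0.15,
    "q": 0.095, "z": 0.074,
}


def caesar(text: str, shift: int) -> str:
    """Shift lowercase ASCII letters by `shift`; leave everything else alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def log_likelihood(text: str) -> float:
    """Score text under the unigram letter model (higher looks more English)."""
    return sum(math.log(ENGLISH_FREQ[ch] / 100.0) for ch in text if ch in ENGLISH_FREQ)


def brute_force_decode(ciphertext: str) -> str:
    """Try every candidate decoding and keep the one the model likes best."""
    candidates = [caesar(ciphertext, -shift) for shift in range(26)]
    return max(candidates, key=log_likelihood)


plaintext = (
    "this is a short sentence written in plain english "
    "so that the letter statistics look normal"
)
ciphertext = caesar(plaintext, 3)
decoded = brute_force_decode(ciphertext)
```

The search here covers 26 candidates and the scorer already knows the target language. Both of those conveniences are missing for an undeciphered script, which is where the objections in the replies come in.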
Ok, your LLM can perfectly predict the next token. How do you extract the "logic" out of the weights?
It's possible to identify a surprisingly large number of matching words by learning a linear transformation mapping word vectors from two different languages into the same space (e.g. https://arxiv.org/abs/1805.06297 ).
But the problem with ancient languages is typically that there's not enough data to usefully constrain a large enough model. Doubly so for undeciphered scripts where scholars might not even agree on how many different letters there are.
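For anyone curious what "learning a linear transformation between two embedding spaces" looks like, here is a minimal sketch using orthogonal Procrustes on synthetic data. It assumes a small seed dictionary of known word pairs, which is a simpler supervised setup and not necessarily what the linked paper does. The vectors are random stand-ins rather than real embeddings, and the function names are hypothetical.

```python
import numpy as np


def fit_orthogonal_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||src @ W - tgt||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_neighbors(mapped: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """For each mapped source vector, index of the most cosine-similar target."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


rng = np.random.default_rng(0)
n_words, dim, n_seed = 200, 20, 50

# Synthetic "language A" vectors, and "language B" as a hidden rotation of A
# plus a little noise. Real embedding spaces are only roughly related like this.
src = rng.normal(size=(n_words, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt = src @ true_rotation + 0.01 * rng.normal(size=(n_words, dim))

# Fit the map on the seed dictionary only, then match the whole vocabulary.
W = fit_orthogonal_map(src[:n_seed], tgt[:n_seed])
matches = nearest_neighbors(src @ W, tgt)
accuracy = float(np.mean(matches == np.arange(n_words)))
```

On this synthetic data the map recovers essentially every pair, but only because the two spaces really are rotations of each other and there are plenty of vectors. With a few thousand short inscriptions, the embeddings themselves would be too noisy for this to be anywhere near as clean.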
Presumably, they'd want to get at embeddings, and compare the dimensional space somehow to say: 'the relation between tokens a,b,c is close to the relation of tokens a1,b1,c1 in a similar model of texts of known language of apparently same family (same up to aN,bN,cN), and out of these N sequences, sequence X makes most sense given existing examples'.
(As you can tell, the argument involves some handwaving, but it may be possible?)
It's LLMs all the way down.
I don't think OP's idea would work, but if it did you could just ask for a translation.
In what language? The model wouldn't speak English.
In English. The decoder translates from the Dhofari to tokens the LLM understands. So you present the LLM with the decoded Dhofari, and a question in English, like "Please express the following in modern English" and the LLM would answer in English. There's also a chance the decoded Dhofari would be intelligible to humans directly, though I don't know how large that chance is.
The available data for some of those lesser-used scripts is minuscule. The most common Ancient North Arabian script is Safaitic, and only around 50K texts have been processed and made widely available, each running from a few words to a few sentences.