Why do regexes use `$` and `^` as line anchors? (2024)

79 points | by srijan4 a day ago

47 comments

er4hn a day ago
> Thompson and Lampson are still alive and we can just ask them. I just sent emails to both, I'll update this if I get an answer!
This is one of the great things about this age. The barrier of reaching out to some person and asking them a question has never been lower. I've done this for RFCs, I've done this for questions about blog posts I've discovered here on HN. I've gotten lots of responses from people and it's always been illuminating to me.
red_admiral a day ago
> But also, $ is so important in business that every typewriter has one.
Some of the older Baudot code ones actually had a £ but not a $ symbol. After we'd decided that 5-key chorded keyboards were not the way forwards, and made QWERTY ones, we still had this encoding to deal with: https://hackaday.com/2015/09/27/demonstrating-baudot-code/
The US version then put the $ sign on what we'd today call ^D, which originally was ENQ (a code that still exists in ASCII today, and was sort of the pre-TCP version of a SYN).
SoftTalker a day ago
> $ is so important in business that every typewriter has one
I actually had a typewriter without one. I would simulate it with S <backspace> / which was not very satisfactory but generally understandable in context.
- kragen a day ago
  Was it a US typewriter? This peso symbol is very important in business in the US and Latin America but not, for example, England, Ireland, India, Finland, or Sweden.
  - garciansmith a day ago
    It would depend on the age of the typewriter, even in the US. Many early ones did not have dedicated keys for symbols, which were made as the parent describes: by going back and typing another character over them (e.g., S backspace / to get a dollar sign).
    - kragen a day ago
      Sure, I used one back in the 01980s where "!" was "'." with a backspace in between to overstrike them. And of course that's also what _, ^, `, etc., are for.
  - SoftTalker a day ago
    It was, a Smith-Corona, but it was a scientific version where the characters that were normally on the "shifted" top row number keys were instead subscripted numbers. So you could type chemical formulas with proper subscripts (or roll the paper down one notch and have superscripts) but it wasn't great for normal writing.
  - pc86 a day ago
    I know this symbol is used for pesos, but when 24 countries refer to it as "dollar"[0] compared to 7 for "peso" it seems fair to call it a dollar symbol? Not to mention that it's referred to as "dollar" in Unicode.
    It's also very interesting to me that [1] mentions it can have one or two bars, but in the list above the double-barred version is not only absent from Unicode but refers specifically to the cifrão[2].
    I guess the TLDR is currency stuff is confusing and nonstandard more often than not.
    [0] https://en.wikipedia.org/wiki/Currency_symbol#List_of_curren...
    [1] https://en.wikipedia.org/wiki/Dollar_sign
    [2] https://en.wikipedia.org/wiki/Cape_Verdean_escudo
    - kragen a day ago
      It's a peso sign. The English word for the Spanish "peso" at the time was "dollar", and the US dollar was initially defined to be equal to the Spanish "dollar". The US dollar, which was a coin, had the same value as the Spanish peso because it consisted of the same weight (in Spanish, "peso") of silver. So the US used the peso sign for its dollars. Unicode calls it a "dollar sign" because that's what ASCII calls it, and that's because ASCII is from the US. But it's really a peso sign.
      https://en.wikipedia.org/wiki/Peso
      > In most countries of the Americas, the symbol commonly known as dollar sign, "$", was originally used as an abbreviation of "pesos" and later adopted by the dollar. The dollar itself actually originated from the peso or Spanish dollar in the late 18th century.
      https://en.wikipedia.org/wiki/Dollar_sign#History
      > The symbol appears in business correspondence in the 1770s from Spanish America, the early independent U.S., British America and Britain, referring to the Spanish American peso,[1][2] also known as "Spanish dollar" or "piece of eight" in British America. Those coins provided the model for the currency that the United States adopted in 1792, and for the larger coins of the new Spanish American republics, such as the Mexican peso, Argentine peso, Peruvian real, and Bolivian sol coins.
      https://en.wikipedia.org/wiki/Spanish_dollar
      > Most theories trace the origin of the "$" symbol, which originally had two vertical bars, to the pillars of Hercules wrapped in ribbons that appear on the reverse side of the Spanish dollar.[6] ¶ The term peso was used in Spanish to refer to this denomination, (...)
      - pc86 a day ago
        > > also known as ... "piece of eight"
        I wonder if it is a coincidence that "peso" sounds pretty similar to "piece of"
        kragen 12 hours ago
        The phonetics would have been even more similar at the time, but I suspect it's a coincidence. The words stem from different origins†, the sense development of "piece" is straightforward (a coin was a chunk of precious metal cut off the end of an ingot, thus being a piece of that ingot, and struck), and the chronology is probably wrong for a phonetic influence. https://www.etymonline.com/word/piece says "piece" for a coin is "c. 1400", which would be about 200 years older than the earliest attested occurrence of "piece of eight" and 100 years older than the peso itself, which was introduced in 01497.
        However, the earliest attestation given in https://archive.org/details/oed07arch/page/836/mode/1up?view... is from 01575, in Scots: "To be payit all in half mark pecis," and the first attestation of "piece of eight" is from 01610: "Round trunkes, Furnish'd with pistolets, and pieces of eight." Maybe Etymonline knows of a much earlier attestation? Because if "piece" in the sense of "coin" really didn't come into use until the late 16th century, a phonetic influence would be much more plausible.
        ______
        † "Peso", like "poise", is from Latin "pensum", "weight", which may be from "pensare", the frequentative of "pendere", "to hang, to weigh", while "piece" comes from Latin "petia" or "pecia", "fragment". In Spanish today "peso" is still the normal word for "weight" (and "pesar" is the normal verb for "weigh") and "pieza" is still a fairly common word for "fragment".
        marky1991 a day ago
        From the rae, https://dle.rae.es/peso:
        (From) latin, pensum.
        Piece, from https://www.etymonline.com/word/piece#etymonline_v_14956
        c. 1200, pece, "fixed amount, measure, portion;" c. 1300, "fragment of an object, bit of a whole, slice of meat; separate fragment, section, or part," from Old French piece "piece, bit portion; item; coin" (12c.), from Vulgar Latin pettia, probably from Gaulish pettsi (compare Welsh peth "thing," Breton pez "piece, a little"), perhaps from an Old Celtic base kwezd-i-, from PIE root kwezd- "a part, piece" (source also of Russian chast' "part").
        So yes, it looks like coincidence.
        (Peso is a word in Spanish, and up until modern times there was very little transfer from English to Spanish.)
        marky1991 a day ago
        It does make me wonder, what's the oldest word in Spanish that's taken from English?
        Off the top of my head, whiskey is the oldest, but I'm sure there's probably something much older
        kragen 12 hours ago
        Part of the reason for this is that English didn't even exist until fairly recently (though still before modern times). But there were various Germanic languages, some of which English evolved from. There are a fair number of Spanish words of Germanic origin, some fairly old, perhaps dating from the Visigothic kingdoms in Iberia: guerra, guante, rico, orgullo, arpa, blanco, banco, frasco, yelmo, flecha, emboscar, jardín, flotar, bisonte, etc.
        More specifically with respect to English, though, https://en.wikipedia.org/wiki/List_of_Spanish_words_of_Germa... says that bote comes from Old English "bāt" via Middle English "boot" and then Old French "bot". The other words in that category include arlequín, este, norte, oeste, and sur/sud-. But when did those words make the jump into Spanish? Some, like arlequín, were fairly recent!
        "Old French" was supposedly spoken up to the mid-14th century, so words that came into Spanish directly from Old French probably came in before 01375 CE. But there's always the possibility that the word lingered for a century or three in some intermediate dialect like Occitan or Catalan before making it into Spanish. Unfortunately I don't know of any resource in Spanish comparable to the OED or Etymonline to find old Spanish attestations.
    - kccqzy a day ago
      The double-barred version is not in Unicode because it's just a font variation of the same abstract character. Just like the letter g can be written with one or two loops (just look at Helvetica vs Times); Unicode treats it as a font difference. This is called an allograph. Using Baskerville in iOS to render the $ character shows two bars (there are many different variations of Baskerville though).
milliams a day ago
I remember when learning regexes, struggling to remember which of ^ and $ were the beginning or end of the line. I know that my (and many other) shells used $ as the prompt symbol - i.e. at the beginning of what I was to type in the terminal. It being at the beginning of that line made it easy for me to remember that it also marks the beginning of regexes.
Of course I was 100% incorrect as $ is in fact the end-of-line marker so now I just remember that "it's the opposite"!
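For anyone who wants to check the direction rather than memorize it, here is a minimal sketch in Python (the sample text and variable names are made up for illustration). It shows `^` anchoring at the start and `$` at the end, and how `re.MULTILINE` extends both to every line:

```python
import re

# Hypothetical sample text: three lines, two words each.
text = "alpha one\nbeta two\ngamma three"

# Default mode: ^ matches only at the start of the whole string,
# $ only at the end (or just before a final trailing newline).
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# re.MULTILINE: ^ and $ also match at the start and end of each line.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)    # ['alpha']
print(ends_default)      # ['three']
print(starts_multiline)  # ['alpha', 'beta', 'gamma']
print(ends_multiline)    # ['one', 'two', 'three']
```

Other regex engines spell the multiline switch differently, so treat the flag name as specific to Python's `re` module.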
- magicalhippo a day ago
  > so now I just remember that "it's the opposite"
  "No, the other left..."
  For me it was easier to remember what ^ was, as it looks like a pointer of sorts, so felt natural it would be the beginning. Like how a string variable points to the first character.
- aylmao a day ago
  Now that you mention it, I weirdly never struggled with this. I wonder if it's because I first started using the terminal on macOS, which I think by default starts prompts with % instead of $ or >.
  Trying to consciously think about the gut-feeling that makes me remember this, all I can come up with is that $ feels "heavier" than ^, which just leads it to naturally feel like the end of a line. Perhaps its verticalness makes it feel like the vertical blinking cursor, and that's what this gut-feeling is really about? I'm not sure, the mind is a mystery sometimes haha.
- rented_mule a day ago
  In many dialects of BASIC (my first programming language), $ comes at the end of a variable name (e.g., NAME$) to indicate that the variable is a string (that example would often be read out loud as name-string). Remembering that, I never had to do the "it's the opposite" thing (for this case... I've done it for many other things). It's one of the few things I still remember about BASIC because I don't think I've written any in ~40 years.
- jiehong a day ago
  For me, it's '^' pointing up, up being before, or first if time flows from top to bottom.
- new_user_final a day ago
  “Carrots (^) cost dollars ($)” is a mnemonic for the order — the carrot comes first and the dollar sign comes last.
- binaryturtle a day ago
  But isn't the $ usually at the end of the prompt? :)
- wizzwizz4 a day ago
  I never found this hard, because $ is conventionally used at the end of many strings in Windows land (e.g. the names of network drives), which probably dates back to the CP/M dollar-terminated string.
- NotAnOtter a day ago
  Idk why anyone would ever "learn regexes" without the use of a lookup table. It's not like times tables, where having it memorized will really accelerate other aspects of life. Knowing what ^ does by memory will almost never benefit you.
  - tom_ a day ago
    It does come in handy if you search for stuff with regular expressions a lot!
    - NotAnOtter a day ago
      I do search with regex pretty often and as a result have memorized these, but if I was mentoring someone I would never encourage them to memorize it.
      Start with . and .*, then add additional stuff as you get comfortable
      - grues-dinner a day ago
        Quite. I use regex a lot, but mostly in bursts. Sometimes I still have to check how a negative lookbehind goes! The important thing, as with everything, is to try to keep in mind what is and isn't possible, even if you don't keep the specific syntactic details in your mental L1 cache.
  - wormius 20 hours ago
    Depends if you decide to use vim as an editor. By now it's second nature; even though I'm nowhere near a vim master and still struggle, this is one thing I use often.
teddyh a day ago
IIRC, the original ASCII standard had ↑ and ← but later changed them to ^ and _, respectively.
- fanf2 a day ago
  ASCII 1963 was the first version of the standard and it had ← https://dl.acm.org/doi/10.1145/366707.367524
  The 1965 draft had _ instead https://dl.acm.org/doi/10.1145/363831.363839
  The first standard edition with _ was 1968 https://www.rfc-editor.org/info/rfc20
  The 1977 version is also available https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...
  Xerox PARC used ASCII 1963 (for reasons I have yet to unearth) so its programming languages Mesa and Smalltalk used ← for assignment and camelCase to separate words in identifiers. This stylistic quirk was carried over to later object-oriented programming languages.
amelius 4 hours ago
If invented now, with current Unicode definitions, what symbols would they use?
acegopher a day ago
At least for ^ I was always told that the ADM3a terminal keyboard had a HOME key with ~ and ^ on it, which is why ~ means "home directory", and ^ means "home of the line".
- kjellsbells a day ago
  Correct on the ADM3a keyboard. In fact the OP site has a pic:
  https://www.hillelwayne.com/post/always-more-history/
  Whether that explains the use of tilde and caret, I don't know.
- fanf2 a day ago
  ^ for start of line predates the ADM3a
tarkin2 a day ago
Everything starts with someone pointing at something. Everything ends with dollars.
That's how I remembered them anyhow
- tetris11 a day ago
  SHOW me the MONEY
VogonPoetry a day ago
I seem to remember an early assembler style where if you wanted to define a string, a "$" was used as the terminator. Print routines would output chars until the "$". Also an early linker format that had symbols that always ended with a $ (cannot remember which system now, perhaps CP/M)
There is also Pascal's ^pointer syntax for the other end of things.
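As a rough illustration of the dollar-terminated convention described above, here is a minimal Python sketch. The function name `read_dollar_string` and the byte buffer are hypothetical, and this only models the idea of "output characters until the `$`"; it is not code from any particular assembler or operating system:

```python
def read_dollar_string(memory: bytes, start: int = 0) -> str:
    """Return the text from `start` up to, but not including, the next '$'."""
    end = memory.find(b"$", start)
    if end == -1:
        # A real print routine would keep running through memory here;
        # this sketch fails loudly instead.
        raise ValueError("no '$' terminator found")
    return memory[start:end].decode("ascii")


# Hypothetical buffer: the message ends at '$', the rest is unrelated data.
buffer = b"Hello, world$leftover bytes"
print(read_dollar_string(buffer))  # Hello, world
```

One consequence of the scheme is visible in the sketch: a string stored this way cannot itself contain a `$`.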
NotAnOtter a day ago
So basically, they used them because they needed to assign them to something and they knew typewriters would have those characters due to their use in business & math.
I mean. Fair enough. I guess.
franky47 a day ago
My AZERTY keyboard conveniently has them side by side (^ left of $).
Could it be a similar reason?
- jcranmer a day ago
  AZERTY is primarily a French keyboard layout, to my knowledge.
  The standard QWERTY layout for the number key row is `/~, 1/!, 2/@, 3/#, 4/$, 5/%, 6/^, 7/&, 8/*, 9/(, 0/), -/_, =/+. I don't know how far back the mapping of shift keys for the numbers goes, but I'd be shocked if there were any keyboards around the 1960s or 1970s that put them like your AZERTY keyboard.
  - dfox a day ago
    During the relevant time there were two predominant layouts for the number key row: “typewriter-paired”, which is more or less the same as the one used today, and “bit-paired”, where the symbols had the same order as they have in ASCII (i.e., pressing a number key while holding shift produces the character whose ASCII codepoint is that digit's codepoint minus 0x10).
  - franky47 a day ago
    Yes (I am French). Those characters are found on the top letter row: AZERTYUIOP^$
- hwayne a day ago
  I think this is unlikely, as the three people involved (Deutsch, Lampson, and Thompson) are all American.
- amelius a day ago
  Exactly what I thought of, but on QWERTY it is the other way around.
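The "minus 0x10" rule dfox gives for bit-paired keyboards can be worked through in a few lines of Python. This is only a sketch of the arithmetic: the function name `bit_paired_shift` is made up, and the digit 0 is left out on the assumption that its shifted position (0x20, the space character) was not a useful symbol key:

```python
def bit_paired_shift(digit: str) -> str:
    """Shifted character for a digit key under the codepoint-minus-0x10 rule."""
    if len(digit) != 1 or digit not in "123456789":
        raise ValueError("expected a single digit from 1 to 9")
    # '1' is 0x31 in ASCII, so shift gives 0x21, which is '!', and so on.
    return chr(ord(digit) - 0x10)


for d in "123456789":
    print(d, hex(ord(d)), "->", bit_paired_shift(d), hex(ord(d) - 0x10))
```

Under this rule shift-2 gives `"` rather than `@`, and shift-6 gives `&` rather than `^`, which is where it departs from the typewriter-paired row jcranmer lists.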
ramesh31 a day ago
As with so many things in computer science, the answer comes down to "someone working on a UNIX system in the 70's had to make an arbitrary decision on something, which no one else then ever had a good reason to change".