Python's splitlines does more than just newlines

92 points | by Bogdanp 12 hours ago

38 comments

dleeftink 10 hours ago
For more controlled splitting, I really like Unicode named character classes[0], which allow more precise splitting and matching.
[0]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
- Rendello 8 hours ago
  Given that encoded characters must have one and only one General_Category value, it might be too imprecise or arbitrary in some cases. If you ever need more power, it's worth browsing the other character properties Unicode exposes. For example, `Lu` (Uppercase_Letter) only covers some uppercase letters, whereas the `Uppercase` property covers all of them.
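  A minimal sketch of the `Lu` versus `Uppercase` gap, assuming only the standard-library `unicodedata` module (which exposes General_Category but not the derived `Uppercase` property). The choice of U+2160 as the example is mine: the Unicode data files list it as Other_Uppercase, yet its category is not `Lu`.

```python
import unicodedata

# U+2160 ROMAN NUMERAL ONE carries the Uppercase property in the Unicode
# data files, but its General_Category is Nl (Letter_Number), not Lu.
samples = ["A", "\u2160"]

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: General_Category={cat}")
```

  A filter written purely in terms of `Lu` would therefore skip this character.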
  ---
  For anyone who wants to learn more about specific Unicode stuff, the three big data sources are the Core Spec, the Unicode Technical Annexes (UAXs), and the Unicode Character Database itself (the database is a bunch of text files; there's an XML version now as well).
  For further reading on this specifically, it might be worth looking at:
  [Unicode Core Spec - Chapter 4: Character Properties] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
  ├ [General Category] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
  └ [Properties for Text Boundaries] https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha...
  [UAX #44 - Unicode Character Database (Technical Report)] https://www.unicode.org/reports/tr44/
  ├ [General Category Values] https://www.unicode.org/reports/tr44/#General_Category_Value...
  └ [Property Definitions] https://www.unicode.org/reports/tr44/#Property_Definitions
  And, if you're brave and want to see the data itself (skim through UAX #44 first):
  [Unicode Character Database] https://www.unicode.org/Public/17.0.0/ucd/
  - sundarurfriend 2 hours ago
    Unfortunately, even Properties aren't fully reliable. The Tamil pulli, for example, does not have the Alphabetic property despite being a central part of the Tamil alphabet, pretty much due to historical accident: other Indic scripts have similar-looking Virama characters that aren't alphabetic, and the pulli (despite being a separate, unrelated character) was lumped in with those.
    When I tried to raise this in the mailing list and get it rectified, the response I got was pretty much that many properties need language-specific processing anyway, and shouldn't be relied upon fully, and so this wasn't worth fixing.
    [1] https://util.unicode.org/UnicodeJsps/character.jsp?a=0BCD&B1...
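    The "lumped in" part is visible from Python's standard `unicodedata` module, which reports the character's formal name and General_Category. This is only a sketch: the module does not expose the Alphabetic property itself, and the Devanagari comparison character is my own pick.

```python
import unicodedata

pulli = "\u0BCD"              # Tamil pulli
devanagari_virama = "\u094D"  # Devanagari virama, for comparison

# Both characters are named "... SIGN VIRAMA" and share the category Mn
# (Nonspacing_Mark) in the Unicode Character Database.
for ch in (pulli, devanagari_virama):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")
```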
mixmastamyk 9 hours ago
Splitlines is generally not needed. `for line in file:` is more idiomatic.
- tiltowait 9 hours ago
  Splitlines additionally strips the newline character, functionality which is often (maybe even usually?) desired.
  - masklinn 6 hours ago
    This has been controlled via a boolean parameter since at least 2.0, which as far as I can tell is when this method was added to `str`.
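    For reference, a short sketch of that boolean parameter, which is spelled `keepends` in current Python 3:

```python
text = "first\nsecond\r\nthird\n"

# Default: line endings are stripped from each element.
stripped = text.splitlines()

# keepends=True: each element keeps its original line ending.
kept = text.splitlines(keepends=True)

print(stripped)
print(kept)
```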
- fulafel 7 hours ago
  It has similar (but not identical) behaviour though:
```
  >>> from io import StringIO
  >>> for line in StringIO("foo\x85bar\vquux\u2028zoot"): print(line)
  ... 
  foo
  bar
   quux zoot
```
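  Printing each line lets the terminal interpret the control characters, so comparing the resulting lists directly may be clearer. A small check, assuming the default `io.StringIO` newline handling:

```python
import io

text = "foo\x85bar\vquux\u2028zoot"

# Iterating a StringIO only breaks lines at "\n", and this string has none.
via_iteration = list(io.StringIO(text))

# splitlines() also treats \x85, \v and \u2028 as line boundaries.
via_splitlines = text.splitlines()

print(via_iteration)
print(via_splitlines)
```

  Here the iteration yields the whole string as a single line, while `splitlines()` yields four pieces.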
  - amelius 3 hours ago
    I would expect it to have identical behavior.
- rangerelf 9 hours ago
  What if the text is already in a [string] buffer?
  - mixmastamyk 9 hours ago
    StringIO can help, .rstrip() for the sibling comment.
- drdrey 9 hours ago
  not every line is read from a file
  - mixmastamyk 9 hours ago
    That's where the generally fits in.
    - crazygringo 7 hours ago
      No, because that still assumes files are the general usage.
      In my experience, they're not. It's strings.
      - mixmastamyk 5 hours ago
        And where do you get these input strings? Big enough that .split() is not sufficient? Files, and yes sockets support the interface as well with a method call.
        - crazygringo 3 hours ago
          > And where do you get these input strings?
          From database fields, API calls, JSON values, HTML tag content, function inputs generally, you know -- the normal places.
          In my experience, most people aren't dealing directly with files (or streams) most of the time.
        - gnulinux 3 hours ago
          They might be programmatically generated, for example.
          There are countless sources one can get a string from. Surely you don't think filesystems are the only source of strings?
- paulddraper 5 hours ago
  If it's reading from a file, you wouldn't be using splitlines() anyway; you'd use readlines().
  For string you’d need to
```
  import io

  text = "line1\nline2\n"  # any string already in memory (avoid shadowing the built-in str)
  for line in io.StringIO(text):
    pass
```
cuckoos-jicamas 9 hours ago
The str.split() function does the same:
>>> s = "line1\nline2\rline3\r\nline4\vline5\x1dhello"
>>> s.split()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
>>> s.splitlines()
['line1', 'line2', 'line3', 'line4', 'line5', 'hello']
But split() has a sep argument that defines the delimiter on which to split the string. In that case it does what you expected to happen:
>>> s.split('\n')
['line1', 'line2\rline3\r', 'line4\x0bline5\x1dhello']
In general you want this:
>>> import re
>>> linesep_splitter = re.compile(r'\n|\r\n?')
>>> linesep_splitter.split(s)
['line1', 'line2', 'line3', 'line4\x0bline5\x1dhello']
- roelschroeven 6 hours ago
  In that example str.split() has the same result as str.splitlines(), but in general they are not the same, even without a custom delimiter.
  str.split() splits on any type of whitespace, including tabs and spaces, which splitlines() doesn't do.
```
    >>> 'one two'.split()
    ['one', 'two']
    >>> 'one two'.splitlines()
    ['one two']
```
  split() without a custom delimiter also treats a run of consecutive whitespace as a single separator, which splitlines() doesn't do (except for \r\n, because that combination counts as one line ending):
```
    >>> 'one\n\ntwo'.split()
    ['one', 'two']
    >>> 'one\n\ntwo'.splitlines()
    ['one', '', 'two']
```
- gertlex 6 hours ago
  splitlines() is sometimes nice for ad hoc parsing (of well-behaved stuff...) because it throws out whitespace-only lines from the resulting list of strings.
  #1 use-case of that for me is probably just avoiding the cases where there's a trailing newline character in the output of a command I ran by subprocess.
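  A quick check of the trailing-newline case, using a made-up string as a stand-in for captured command output. As far as I can tell, `splitlines()` keeps empty lines in the middle and only avoids the trailing empty string; `split()` with no arguments is the one that drops them entirely.

```python
# Hypothetical stand-in for text captured from a subprocess.
output = "alpha\n\nbeta\n"

by_splitlines = output.splitlines()    # no trailing '' element
by_split_newline = output.split("\n")  # trailing '' from the final newline
by_split = output.split()              # all whitespace runs collapsed

print(by_splitlines)
print(by_split_newline)
print(by_split)
```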
meken 10 hours ago
TIL: Python has a splitlines function
- Frotag 7 hours ago
  There are so many super useful things in the Python docs that you never see in the wild. For example, I recently learned that the sqlite3 module has a set_authorizer function that lets you limit the types of statements that can be run / tables that can be accessed.
  https://www.sqlite.org/c3ref/set_authorizer.html
  https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne...
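  A rough sketch of how `set_authorizer` might be used to make a connection effectively read-only. The table, the data, and the `read_only_authorizer` name are made up for illustration; the callback signature and the `SQLITE_*` constants come from the `sqlite3` module.

```python
import sqlite3

def read_only_authorizer(action, arg1, arg2, db_name, trigger_name):
    # Allow SELECT statements and the column reads they perform;
    # deny every other action (inserts, deletes, schema changes, ...).
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
conn.commit()

# From here on, every statement is checked by the callback.
conn.set_authorizer(read_only_authorizer)

rows = conn.execute("SELECT name FROM users").fetchall()

try:
    conn.execute("INSERT INTO users VALUES ('bob')")
    insert_blocked = False
except sqlite3.Error:
    # Denied statements surface as a sqlite3 exception.
    insert_blocked = True

print(rows, insert_blocked)
```

  Queries that call SQL functions or touch other action types would need a wider allow-list than the two actions permitted here.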
wvbdmp 10 hours ago
What, no <br\s*\/?>?
zb3 4 hours ago
Useful to know for security purposes; surprises like that might cause vulnerabilities.
zzzeek 9 hours ago
in the same theme, NTLAIL strip(), rstrip(), lstrip() can strip other kinds of characters besides whitespace.
- masklinn 6 hours ago
  One thing to note, though, is that they take character sets: as long as they encounter characters in the specified set, they will keep stripping. Lots of people think that if you give them a string they will remove that string.
  That feature was added in 3.9 with the addition of `removeprefix` and `removesuffix`.
  Sadly,
  1. unlike Rust's version they provide no way of knowing whether they stripped things out
  2. unlike startswith/endswith they do not take tuples of prefixes/suffixes
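  To make the difference concrete, here is a small sketch (Python 3.9+). `strip_any_prefix` is a hypothetical helper, not part of the standard library; it shows one way to work around both limitations listed above.

```python
def strip_any_prefix(text, prefixes):
    """Hypothetical helper: remove the first matching prefix and report it."""
    for prefix in prefixes:
        if prefix and text.startswith(prefix):
            return text[len(prefix):], prefix
    return text, None

# rstrip treats its argument as a set of characters to remove,
# so it also eats the final "t" of "report".
char_set_result = "report.txt".rstrip(".txt")

# removesuffix removes the exact string, at most once.
suffix_result = "report.txt".removesuffix(".txt")

# The helper accepts several prefixes and says which one matched (or None).
rest, matched = strip_any_prefix("https://example.com", ("http://", "https://"))

print(char_set_result, suffix_result, rest, matched)
```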
7bit 11 hours ago
This article provides no additional value to the splitlines() docs.
- woodruffw 10 hours ago
  The "article" is my TIL mini-blog. What were you expecting besides a "today I learned"?
  - kstrauser 10 hours ago
    I already knew this information, more or less, but I like reading TIL posts like this. It's fun seeing someone learn new things, and sometimes I pick up something myself, or at least look at it in a new way.
  - cap11235 9 hours ago
    Yeah, don't listen to parent. I like these sorts of articles a lot; it's only useless if you assume that everyone interested has also memorized the Python docs fully (which I imagine is zero people). Fun technical tangents are quite fun indeed.
  - zahlman 9 hours ago
    What is "yossarian", BTW? I'd gotten confused thinking it was someone else's blog, because I naturally parse that as a surname.
    - woodruffw 9 hours ago
      John Yossarian is the protagonist of Joseph Heller’s Catch-22[1], which was my favorite book in high school. Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)
      [1]: https://en.wikipedia.org/wiki/Catch-22
      - di 7 hours ago
        Don't be embarrassed, it's a good book (and was my favorite too).
      - zahlman 9 hours ago
        > Like a lot of people, my handle is a slightly embarrassing memorialization of my younger self :-)
        ... Guilty, actually.
- rsyring 10 hours ago
  Sometimes value is measured by awareness. I benefited from becoming aware of the behavior because of the article. Yes, it's in the docs, but the docs are not something I would have gone looking to read today.
- diath 10 hours ago
  The value of this article, to me, is that I'd never read the splitlines documentation, so this is a little detail that I just learned thanks to it being linked here.
- happytoexplain 10 hours ago
  I've been working with Python for a year or so now, and never knew this. I'm grateful to the author.
- felipelemos 10 hours ago
  For all of us who don't read all the documentation for every single method, tool, function or similar, it is, for the awareness alone, very useful.