Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.
Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being JavaScript didn't suit my project.
In the end I found the Python trifatura library to extract the best quality content with accurate metadata.
You might want to compare your implementation to trifatura to see if there is room for improvement.
reference to the library: https://trafilatura.readthedocs.io/en/latest/
for the curious: Trafilatura means "extrusion" in Italian.
| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)
(btw I think you meant trafilatura not trifatura)
Obsidian Web Clipper is a great tool to turn ChatGPT conversations into Markdown, or to just print them (believe me, it is a use case).
I just ask ChatGPT to provide the summary or whatever I need as a markdown file.
For those not in the know: [Readability](https://github.com/mozilla/readability)
I was just looking at Obsidian Web Clipper's source code because I've been quite impressed with its Markdown conversion results, and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D
I've been super happy with Obsidian Web Clipper! It's worked really well for me with the one exception of importing publish dates (which is more than forgivable!)
This is something that looks like it would benefit from a lot of unit tests, yet I don't see any.
Feel free to help :)
As an open source developer for decades now, I used to have this attitude. Trust me when I say, it doesn't work. Step up to the plate, build the framework for tests and then require anyone who wants to help build the product to write tests with their PRs. You can't just push some code out there and expect people to "feel free to help", it doesn't happen and is quite a turnoff.
You just wanted to complain and not add anything? Not really getting your point at all
Sorry you're not getting my point. It isn't a complaint. I'm responding to a rather flippant "feel free to help" with some advice from someone who's been doing this a long time.
I've got a project that has been going for 6 years now and attracted 500 stars and gets 49k downloads a month. It works because it has comprehensive unit tests and people can rely on it. When I was just starting out, I didn't tell people to feel free to help. I put the effort in. It is important to lay the groundwork beyond just writing the utility.
The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.
Are there any in particular you can recommend?
not parent, but this one looks maintained https://github.com/buriy/python-readability
seems pretty much perfect including obsidian clipper. Thanks!
can confirm that readability seems to be on life support. I used it in slurp, an obsidian plugin which serves the same basic purpose as web clipper, and always had a hard time getting PRs reviewed and merged.
i started working on my own alternative but life (and web clipper) derailed the work.
it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.
Neat. With ~3 more lines of code, you could get a URL and render it in simpler HTML and be a full fledged replacement.
Are you using AI models behind the scenes? I saw Gemini and others in the code. I am asking mainly to understand the cost of using yours vs. Readability. Thanks!
No, it's all rules-based. I think the code you're referring to is "extractors", which are website-specific rules that I'm working on to standardize the output from sites with comment threads (e.g. HN, Reddit) and conversational chats (ChatGPT, Claude, Gemini).
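For anyone wondering what "website-specific rules" can look like in practice, here is a rough Python sketch of the general idea. To be clear, this is not Defuddle's actual code (which is not Python); the registry, the `Message` type, the function names, and the input record shapes are all hypothetical and only there to illustrate dispatching on hostname and normalizing to one output format, with no model calls involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import urlparse


@dataclass
class Message:
    """Uniform output unit: one comment or one chat turn."""
    author: str
    text: str


# Hypothetical registry: hostname -> rule that normalizes that site's records.
Extractor = Callable[[List[dict]], List[Message]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(*hosts: str):
    """Decorator that registers one extractor for one or more hostnames."""
    def decorator(fn: Extractor) -> Extractor:
        for host in hosts:
            EXTRACTORS[host] = fn
        return fn
    return decorator


@register("news.ycombinator.com")
def extract_hn(records: List[dict]) -> List[Message]:
    # Assumed input shape for illustration: {"user": ..., "comment": ...}
    return [Message(r["user"], r["comment"]) for r in records]


@register("chatgpt.com", "claude.ai")
def extract_chat(records: List[dict]) -> List[Message]:
    # Assumed input shape for illustration: {"role": ..., "content": ...}
    return [Message(r["role"], r["content"]) for r in records]


def extract(url: str, records: List[dict]) -> List[Message]:
    """Pick the rule by hostname; purely deterministic, no AI involved."""
    host = urlparse(url).hostname or ""
    extractor = EXTRACTORS.get(host)
    if extractor is None:
        raise LookupError(f"no site-specific rule for {host!r}")
    return extractor(records)


def to_markdown(messages: List[Message]) -> str:
    """Render the standardized messages the same way for every site."""
    return "\n\n".join(f"**{m.author}**: {m.text}" for m in messages)
```

The point of the design is that each site only needs a small rule mapping its own structure onto the shared `Message` shape, and everything downstream (here, `to_markdown`) stays identical across sites.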
In the playground, after I enter a url, I can't seem to figure out how to submit it to fetch the url? I tried pressing the return key on iOS keyboard but it didn't do anything. Am I missing something?
The input is there to test the url option — which I admit is a bit confusing, so I have removed it for now. I haven't found a good and free way to proxy requests from a GitHub page (yet).
A bit off-topic, but I'm very excited to see the launch of Bases! I've obsessively followed the roadmap for like a year awaiting this day and have been frequently disappointed to still see it stuck somewhere under "planned".
Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!
Wasn't it released just a few days ago?