One problem of mine that seems to fall in line with this: there is a set of data that some other daily process requires. This data is only available on a web page that is well structured but difficult to navigate quickly, and any of it can change at any time.
I wrote a small Python program using Pandas and Beautiful Soup to parse the web page and insert all the data into a table. Cron runs it hourly, so the pipeline is cron -> wget -> Python -> database. I considered using this approach on a variety of other things, so that instead of browsing the raw data I could use SQL to query a database and detect changes, trends, etc.
Of course this opens the door to easy programmatic access to that data. Honestly, the need for me was never great enough to go any deeper.
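For anyone curious, the parse-and-store step could look roughly like the sketch below. It is not my actual script: to keep it dependency-free it swaps Pandas and Beautiful Soup for the standard library's `html.parser` and `sqlite3`, and the `snapshots` table, the `name`/`value` columns, and the function names are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Turn a table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def store_snapshot(conn, rows, fetched_at):
    """Append one fetch to a hypothetical `snapshots` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots "
        "(fetched_at TEXT, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO snapshots VALUES (?, ?, ?)",
        [(fetched_at, r["name"], r["value"]) for r in rows],
    )
    conn.commit()


def changed_names(conn):
    """Names whose value differs between any two stored snapshots."""
    cur = conn.execute(
        "SELECT name FROM snapshots GROUP BY name "
        "HAVING COUNT(DISTINCT value) > 1 ORDER BY name"
    )
    return [name for (name,) in cur]
```

Because every hourly run appends rows rather than overwriting them, change detection becomes a plain `GROUP BY` query instead of a manual diff of the page.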
Your approach makes a lot of sense, especially for "old school" websites with simple server-side rendered pages, where tables are a treasure trove of data. Wikipedia tables would be a great use of your tool too.
I was thinking more along the lines of programmatic access that `enforces` structured output. LLMs are really good at this step. Define a schema and get an output that is guaranteed to fit.
You can see some examples of what I mean here: https://query-rho.vercel.app/
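To make "define a schema and get an output that fits" concrete, here is a minimal sketch of the checking side only. It assumes the model returns a JSON string. The `Product` schema and `parse_structured` are hypothetical names for illustration, not how the linked tool works, and this version rejects non-conforming output rather than constraining generation.

```python
import json
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class Product:
    # Hypothetical schema: the fields you want out of a page.
    name: str
    price: float
    in_stock: bool


def _matches(value, typ):
    """Strict type check; bool is never accepted as a number."""
    if typ is float:
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if typ is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, typ)


def parse_structured(raw, schema):
    """Parse model output and raise ValueError unless it fits the schema."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = get_type_hints(schema)
    if set(data) != set(expected):
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    cleaned = {}
    for key, typ in expected.items():
        if not _matches(data[key], typ):
            raise ValueError(f"field {key!r} is not of type {typ.__name__}")
        # Normalise JSON integers to float where the schema asks for float.
        cleaned[key] = float(data[key]) if typ is float else data[key]
    return schema(**cleaned)
```

A caller would typically retry the model call when `ValueError` is raised, so downstream code only ever sees objects that match the schema.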