0 for 10 on some startups (large and small, YC and not) that came to mind.
It's easy to scrape YC startups from https://www.ycombinator.com/companies. Scrape that and a dozen other investors' portfolio pages and you'll have a useful fraction of startups.
Sounds good! It's just that I used up most of my API key limits in development, and I'm keeping some so I can run improvement pipelines or fix errors, so I'll batch the YC companies day by day. There are 5000 companies, so I'll do about 800 each day for 5-6 days.
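For reference, the batching arithmetic can be sketched roughly as follows. The figures 5000 and 800 come from the comment above; the function and variable names are made up for illustration. At exactly 800 per day the list splits into 7 batches (the last one holding 200), so a 5-6 day run would need slightly larger batches.

```python
from math import ceil
from typing import Iterator, List, Sequence


def daily_batches(items: Sequence[str], per_day: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `per_day` items, one slice per day."""
    for start in range(0, len(items), per_day):
        yield items[start:start + per_day]


# Placeholder names standing in for the scraped company list (hypothetical).
companies: List[str] = [f"company-{i}" for i in range(5000)]

batches = list(daily_batches(companies, per_day=800))
days_needed = ceil(len(companies) / 800)
```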
Wait, why would you use (tokens?) on a routine scrape job? Just make a generic parser (vibe-coded or by hand, I don’t think it'll be that hard), then run it across the company list.
Or are you talking about some other API?
No, scraping the YC startups is easy, I've already done that. It's just that writing and researching the actual profiles uses an internal tool (built into the website, I can trigger it from an admin dashboard) which uses tokens.
Create placeholder pages and backfill any extended LLM generated stuff. Either after first request of a page or simply queued.
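A minimal sketch of that placeholder-and-backfill idea, assuming an in-memory store and a `generate` callback standing in for the token-consuming profile tool. Every name here is hypothetical, not the site's actual code.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Optional


@dataclass
class Profile:
    name: str
    body: Optional[str] = None  # None means the page is still a placeholder


class ProfileStore:
    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # stand-in for the expensive LLM pipeline
        self._profiles: Dict[str, Profile] = {}
        self._queue: Deque[str] = deque()

    def request(self, name: str) -> Profile:
        """Return the page at once; queue generation if it is only a placeholder."""
        profile = self._profiles.setdefault(name, Profile(name))
        if profile.body is None and name not in self._queue:
            self._queue.append(name)
        return profile

    def backfill(self, limit: int) -> int:
        """Generate up to `limit` queued profiles, e.g. one daily token budget."""
        done = 0
        while self._queue and done < limit:
            name = self._queue.popleft()
            self._profiles[name].body = self._generate(name)
            done += 1
        return done
```

The first request for a page costs no tokens; the `limit` argument is where a daily cap would plug in.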
Same here. I work with a lot of startups, some of them very prominent and none of them are listed.
I expected the VERIFIED badges to link to some sort of provenance information. That seems like a must, otherwise (given the "assume everything's incorrect" disclaimers) I'm not sure why one would take that badge seriously.
Yeah, the "verified" badges are useless if they don't link to sources or at least provide some indication of how they were verified and when.
I got the agents to cite sources, but there's a bug with fetching the URLs from the DB. The way it should work is that when you hit verified it leads you to the source. Working on fixing it now. Also, I will try to add an agent ledger tab soon that shows exactly what the agents were doing.
> I got the agents to cite sources, but there's a bug with fetching the URLs from the DB. The way it should work is that when you hit verified it leads you to the source.
I expect "verified" to mean that either you or someone else has confirmed it, not just that an LLM was asked for one.
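One way to picture the distinction raised here is a per-fact provenance record where the badge depends on who checked the fact, not only on whether a URL exists. This is a sketch under assumptions: the field names, the `"human"`/`"agent"` values, and the three badge labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    field: str
    value: str
    source_url: Optional[str] = None   # where the claim was found
    checked_on: Optional[date] = None  # when someone last confirmed it
    checked_by: Optional[str] = None   # "human" or "agent" (hypothetical values)


def badge(fact: Fact) -> str:
    """Only a sourced, dated, human-confirmed fact earns VERIFIED."""
    if fact.source_url and fact.checked_on and fact.checked_by == "human":
        return "VERIFIED"
    if fact.source_url:
        return "CITED"  # an agent supplied a source, nobody confirmed it
    return "UNVERIFIED"
```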
What is your content/data license? I don't see anything about this on the site. For something to feel like a community wiki, the community needs to co-own the content and be able to fork. If you think the content is in the public domain because of AI, applying a license like CC BY or CC BY-SA won't hurt, but if the content is copyrighted, not applying a license will. (This isn't legal advice.) See "WP:CRANDO" (https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Contribut...) for how Wikipedia does it.
It's a good idea. Why not ask startups to upload a startup.txt (as opposed to robots.txt) to their web root and collect from that? Pre-filled text forms can be downloaded. Also, as with CB, collect data on individuals through a similar opt-in. Enable users to ping your site when it's ready to collect.
You could have a "traction" stat and ask for a JS snippet to be installed on homepages or a set of pages. Old school and unreliable. Registered users is also a good way to assess traction. Not sure how that information could be readily obtained.
In my previous comment I mentioned attaching a crypto address to domains - you could do that too. That'd be interesting. One feature you could add long-term is crowdfunding. Either for new features, code releases, media, documents - whatever.
Crowdfunding activity on startups and individuals would be a great way to measure traction.
Thanks, this is all really interesting feedback. I'm mostly free right now, so I'll definitely try to roll out these changes soon. I'll probably announce them on the website blog.
Good on you. Also, check out https://news.ycombinator.com/item?id=47778306 "Every CEO and CFO change at US public companies, live from SEC"
I think it's important to attach a bunch of people to any given startup (you've already done it.) Going forward, I'd expect more startups to behave like bands. Any 1 person can be attached to a number of bands, but eventually one of these bands will make it big and the members will get locked-in. Staff also bring users, credibility and hence traction early on.
"If you want something done, ask a busy person."
Umm, what about machine-readable JSON in .well-known instead of yet another txt in root?
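A rough sketch of what consuming such a file might look like. No standard for this exists as far as the thread says: the path `/.well-known/startup.json` and the three required fields are assumptions invented for illustration. Fetching is left out, so the parser just takes the response text.

```python
import json
from typing import Any, Dict

# Hypothetical schema: field name -> expected JSON type.
REQUIRED_FIELDS = {"name": str, "website": str, "founded": int}


def well_known_url(domain: str) -> str:
    """Build the (assumed) location of the file for a company domain."""
    return f"https://{domain}/.well-known/startup.json"


def parse_startup_json(text: str) -> Dict[str, Any]:
    """Parse the document and check the assumed required fields."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key} must be of type {expected.__name__}")
    return data
```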
That's a great idea. I also like the idea of having a changelog in there too, and key screenshots. Really, WP plugin pages have a great range of info as a guide: eg https://wordpress.org/plugins/tablepress/#developers https://tablepress.org/info/#changelog
THIS IS A GREAT IDEA!
It sounds like none of the data will be reliable? With AI and community sourcing, it seems like very little will be true, and I will have no idea which part is true.
Crunchbase is also not very reliable. It's community/self-reported data.
Crunchbase is generally self reported data
Build trust and collect data through crowdsourcing if you want to succeed on this.
Build trust by truly making this a public good, by open sourcing it. Be the maintainer. Data dump every week as a zipball/tarball. These will ensure you can't rugpull.
With this trust, offer an extension (open source of course) to all, which, whenever a user goes through Crunchbase, Tracxn, etc., sends any factual (hence non-copyrightable) data to you. If you gained trust, I would also do this.
You get the right to be a maintainer, and figure out if you also want to make a business with it on top.
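The weekly dump suggested above could be as small as the sketch below, which packs one JSON file per profile into a gzip-compressed tarball using only the standard library. The `dump-<date>/<slug>.json` layout and the shape of `profiles` are assumptions for illustration.

```python
import io
import json
import tarfile
from datetime import date
from typing import Any, Dict


def build_dump(profiles: Dict[str, Dict[str, Any]], day: date) -> bytes:
    """Return a .tar.gz archive holding one JSON file per profile."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for slug, profile in sorted(profiles.items()):
            payload = json.dumps(profile, indent=2, sort_keys=True).encode("utf-8")
            info = tarfile.TarInfo(name=f"dump-{day.isoformat()}/{slug}.json")
            info.size = len(payload)  # tar needs the size before the data
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()
```

Writing the returned bytes to a dated file and publishing it on a schedule is all that remains. Sorting keys and slugs keeps successive dumps easy to diff.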
Hello! Great project. Do you plan to make it open source, as it is already free to use? If it already is, I didn't find the GitHub repository.
How about exposing an API so that users can put in the name of a startup and it goes through your AI agent pipeline to acquire an estimate? That way, you don’t need to know every startup under the sun and can focus on optimizing your pipeline instead.
a random complaint on my part would be the log in with google. hate that. looks great, otherwise. i don't even have a problem creating an account, honestly. but i try to not use the google for anything unless i have to.
Me too, magic link should be the default these days and then oauth providers are fine for people who like them.
Would you consider allowing people to log in with OpenRouter?
https://openrouter.ai/docs/guides/overview/auth/oauth
Would be a good way to have others absorb some of your inference limits and fill in missing data that they need. A call to action on a blank search would be a great flow.
Just added an agent ledger. It shows exactly what the agents were doing during the pipeline; you can find it at the top of the sources tab. (It truncates part of the ledger sometimes though, working on fixing that bug.)
Really cool concept but so much of the data is wrong. Anthropic ARR is an order of magnitude higher, Replicate did a Series B as well which is not mentioned. There is probably a lot more.
Looks super cool and love the idea!
I saw Clerk and noticed that it says that they have verified 250 months runway. Maybe true but sounds crazy high.
Maybe if a verification is attributed to a specific article, you could cite it?
Anyways thanks for making this.
Thanks! I tried writing a program where, if it can't find the runway online, it tries to guess by using other values. Clerk is not really at risk of ever running out of money, so it just ended up outputting a really high value. I just updated it so that if runway is calculated to be above 20 months, it just says not at risk.
I just approved a whole bunch of micro businesses! I do not get paid for this, and these aren't ads. I do it so people can find under-the-radar early companies.
You may be relying too much on AI to do the heavy lifting for you. If you are sending out agents, you should have strict rules around the recency of the data they are aggregating. Otherwise, you will end up with outdated and useless data.
Can you add founder university affiliation? That's the only thing I begrudgingly pay Crunchbase for, and it's wildly inaccurate.
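The capping rule described here might look roughly like this. The 20-month threshold is from the comment; the estimate itself (cash divided by monthly burn) is an assumed formula, since the comment only says the program guesses "by using other values", and all names are hypothetical.

```python
NOT_AT_RISK_THRESHOLD_MONTHS = 20  # threshold taken from the comment above


def runway_label(cash_usd: float, monthly_burn_usd: float,
                 threshold: float = NOT_AT_RISK_THRESHOLD_MONTHS) -> str:
    """Turn an estimated runway into display text, capping implausibly long values."""
    if monthly_burn_usd <= 0:
        # Assumption: no net burn means the company is not drawing down cash.
        return "not at risk"
    months = cash_usd / monthly_burn_usd  # assumed estimate: cash / monthly burn
    if months > threshold:
        return "not at risk"
    return f"{months:.0f} months (estimated)"
```

With these assumed inputs, $25M in cash against $100k of monthly burn works out to 250 months and is shown as "not at risk" instead of a number.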
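A strict recency rule of the kind suggested could be sketched as a filter applied before any figure is accepted. The 180-day window and the `Source` shape are assumptions for illustration; undated sources are rejected outright because their age cannot be checked.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    url: str
    published: Optional[date]  # None when the agent could not find a date


def fresh_sources(sources: List[Source], today: date,
                  max_age_days: int = 180) -> List[Source]:
    """Keep dated sources inside the window, newest first."""
    cutoff = today - timedelta(days=max_age_days)
    fresh = [s for s in sources
             if s.published is not None and s.published >= cutoff]
    return sorted(fresh, key=lambda s: s.published, reverse=True)
```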
It is unclear how I can list my company here. Are small companies coming later?
Just launched the button. Click it, fill out the form, and I'll manually go in, approve it, and write your profile.
I get "JSON.parse: unexpected character at line 1 column 1 of the JSON data"
just fixed it
Works
https://news.ycombinator.com/item?id=48572472
Why do you ask again for feedback after three days?
As far as I can tell from the FAQs on Hacker News, if your previous post failed to gain significant feedback (in this case, only 1 user interacted with my old post) you are allowed to repost in 36 hours.
what's wrong with that? Just ignore the post if you don't want to see it.
Building and sharing is awesome
I see quite outdated data. Anthropic listed with valuation 18B and latest round at 4b? Just to compare, their real latest round was 65b with valuation 965b.
Yeah, just spotted the error. AI agents seem to be searching for news without adding keywords like "latest". I'm updating that and changing some system prompts, also adding a fact-checking agent, and restarting the server to run an improvement pipeline to update these profiles. Might take a while for it to finish running though, so I'll try to update stuff manually till it's done.
Probably a good idea to just use CrunchBase's numbers until you have a plan for doing it yourself.
Accessing Crunchbase with AI agents, even the free version, is illegal since they have a paid API, so I need to get the data manually, which is fine for now, but I eventually need to improve the pipeline.
Meaning Claude won't help you scrape it and you can't do it yourself? Lol.
Hey, just checked the pipeline after running, the Anthropic profile looks better now. (I will roll out more updates in the next few days to keep improving accuracy.)
You should link your data to Wikidata, which will get you a free connection back to Crunchbase and other sources, e.g. https://www.wikidata.org/wiki/Q97041185
You could even back some of the data from there
Yeah, tried my own startup and found nothing. I don't know where your sources come from.
This is pretty early stage, so I didn't want to make a ton of profiles only to have to update them later because of an issue in the generation pipeline. I'm planning on adding about 800 startups every day.
Mobile view is not working on my iPhone. Scroll is messed up and the page is not properly fitting in the view.
Noticed this too. Also the animations might be too much for mobile. Consider just listing the cards without them moving horizontally.
Nice initiative, but I am concerned about the reliability of the data. How are you gonna take care of that?
AI agents have to cite where they get stuff from, also people can flag issues, and I'm gonna run pipelines periodically to fact check pages. But yeah, with this kind of site I do agree accuracy is gonna take a lot of engineering to improve.
He's going to take your comment and give it to Claude as a prompt.
How about adding some stats to your landing page?
good idea, I will roll out tomorrow.
I wonder why there are No Micro Companies Yet on the platform?
Micro companies aren't added by me. People submit their own, then I go in and approve. I haven't gotten any submissions yet; when I do, I'll add them in.
I’ve been trying to submit but hit an error about the string format not being correct. It doesn’t give any indication on which string is wrong.
I fixed it
How are you going to take care of the genuineness of the data?
AI agents have to cite sources for each thing (there's a bug with displaying sources, it should let you click a fact and send you to where the agent got it. I'm working on fixing that right now). Users can also flag errors, and I'm going to periodically run fact-checking agents and manually go in and check info. However, obviously this will likely still not be perfect. Accuracy will probably be the number 1 challenge with this site.
The search for Luma and Saronic didn't work
This is pretty early stage. I'm planning on adding more startups every day, and I'll add Luma and Saronic soon.
just added roughly 20 startups, focusing on biotech
Looks like vibe-coded slop.