Some websites are partially (or entirely) rendered on the client (aka your web browser). If you try to search the initial HTML for elements that haven’t finished rendering, you won’t find them.
One solution is a headless browser: a web browser that runs in the background, fetches the page, renders it, and then lets you search the final document.
Headless browsers aren’t a good fit for Val Town because of the resources they require to run. However, services like Browserless provide APIs for interacting with a hosted headless browser, such as their /scrape API.
Copy your API key from https://cloud.browserless.io/account/ and save it as a Val Town secret named browserlessKey.
Check the documentation for the /scrape API and form your request.
For example, here’s how to scrape the introduction paragraph of OpenAI’s Wikipedia page.
https://www.val.town/embed/vtdocs.browserlessScrapeExample
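As a rough illustration of the request described above, here is a minimal sketch using the standard fetch API. The endpoint URL, the request body shape (`url` plus an `elements` array of selectors), and the response shape are assumptions based on a reading of the Browserless /scrape documentation, and the helper names (`buildScrapeRequest`, `firstText`, `scrapeText`) are hypothetical. Check the current documentation before relying on any of it.

```typescript
// Response shape assumed for this sketch; the real API may return more fields.
interface ScrapeResult {
  text: string;
  html?: string;
}
interface ScrapeResponse {
  data: { selector: string; results: ScrapeResult[] }[];
}

// Assumed endpoint; confirm the URL for your account in the Browserless docs.
const SCRAPE_ENDPOINT = "https://chrome.browserless.io/scrape";

// Build the URL and fetch options for scraping a single CSS selector.
function buildScrapeRequest(token: string, url: string, selector: string) {
  return {
    endpoint: `${SCRAPE_ENDPOINT}?token=${encodeURIComponent(token)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url, elements: [{ selector }] }),
    },
  };
}

// Return the text of the first element matched by the first selector, if any.
function firstText(response: ScrapeResponse): string | undefined {
  return response.data[0]?.results[0]?.text;
}

// Send the request and extract the matched text from the rendered page.
async function scrapeText(
  token: string,
  url: string,
  selector: string,
): Promise<string | undefined> {
  const { endpoint, init } = buildScrapeRequest(token, url, selector);
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`Browserless responded with ${res.status}`);
  return firstText((await res.json()) as ScrapeResponse);
}
```

You would call `scrapeText(token, "https://en.wikipedia.org/wiki/OpenAI", "p:nth-of-type(2)")`, where `token` is the value of your browserlessKey secret. The selector here is only a guess at where the introduction paragraph sits and may need adjusting.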
Browserless also has APIs for taking screenshots and generating PDFs of websites.
You can use the Puppeteer library to connect to a browser instance running on Browserless.
Once you’ve navigated to a page, you can run arbitrary JavaScript in it with page.evaluate, such as getting the text from a paragraph.
https://www.val.town/embed/vtdocs.browserlessPuppeteerExample
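The Puppeteer flow can be sketched as follows. This is a sketch under assumptions, not the code from the embedded val. The WebSocket endpoint and the `puppeteer-core` package name are assumptions, and the helper names (`browserlessWsEndpoint`, `firstParagraphText`) are hypothetical. Your runtime may need a different import specifier.

```typescript
// Assumed WebSocket endpoint for Browserless; confirm it in their documentation.
function browserlessWsEndpoint(token: string): string {
  return `wss://chrome.browserless.io?token=${encodeURIComponent(token)}`;
}

// Connect to the hosted browser, open a page, and read the first paragraph.
async function firstParagraphText(
  token: string,
  url: string,
): Promise<string | null> {
  // Loaded lazily so the helper above works without the dependency installed.
  // "puppeteer-core" is an assumed package name; adjust it for your runtime.
  const moduleName = "puppeteer-core";
  const { default: puppeteer } = await import(moduleName);

  const browser = await puppeteer.connect({
    browserWSEndpoint: browserlessWsEndpoint(token),
  });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // page.evaluate runs in the page itself; it accepts a function or, as
    // here, a string expression.
    return await page.evaluate(
      "document.querySelector('p')?.textContent ?? null",
    );
  } finally {
    // Always release the remote browser session, even if navigation fails.
    await browser.close();
  }
}
```

Closing the browser in a `finally` block matters more than usual here, because the browser session lives on the remote service and is not cleaned up by your own process exiting.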